Fuzzy String Matching in Python

We’ve made it our mission to pull in event tickets from every corner of the internet, showing them all to you on the same screen so you can compare them and get to your game/concert/show as quickly as possible. Of course, a big problem with most corners of the internet is labeling. One of our most consistently frustrating issues is trying to figure out whether two ticket listings are for the same real-life event (that is, without enlisting the help of our army of interns). To pick an example completely at random, Cirque du Soleil has a show running in New York called “Zarkana”. When we scour the web to find tickets for sale, those tickets are mostly identified by a title, date, time, and venue. Here is a selection of some titles we’ve actually seen for this show:
As far as the internet goes, this is not too bad. A normal human intern would have no trouble picking up that all of these listings are for the same show. And that same intern would have no trouble telling that those listings are different from the ones below:
But as you might imagine, we have far too many events (over 60,000) to be able to just throw interns at the problem. So we want to do this programmatically, but we also want our programmatic results to pass the “intern” test and make sense to normal users. To achieve this, we’ve built up a library of “fuzzy” string matching routines to help us along. And good news: we’re open sourcing it. The library is called “Fuzzywuzzy”, the code is pure Python, and it depends only on the (excellent) difflib module from the standard library. It is available on Github right now.

String Similarity

The simplest way to compare two strings is with a measurement of edit distance. For example, the following two strings are quite similar:
Looks like a harmless misspelling. Can we quantify it? Using Python’s difflib, that’s pretty easy:
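A sketch with difflib’s SequenceMatcher, using an illustrative pair of strings that differ by one stray letter:

```python
from difflib import SequenceMatcher

# Two labels that differ only by a small misspelling
# (illustrative example, not actual listing data).
m = SequenceMatcher(None, "NEW YORK METS", "NEW YORK MEATS")
print(m.ratio())  # roughly 0.96
```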
So it looks like these two strings are about 96% the same. Pretty good! We use this pattern so frequently that we wrote a helper method to encapsulate it:
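A minimal sketch of such a helper, built on difflib as the post describes (the real library exposes this as `fuzz.ratio`, returning an integer score out of 100):

```python
from difflib import SequenceMatcher

def ratio(s1, s2):
    # Plain edit-distance similarity, scaled to an integer 0-100.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

print(ratio("NEW YORK METS", "NEW YORK MEATS"))  # 96
```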
Great, so we’re done! Not quite. It turns out that the standard “string closeness” measurement works fine for very short strings (such as a single word) and very long strings (such as a full book), but not so well for the 3-to-10-word labels we deal with. The naive approach is far too sensitive to minor differences in word order, missing or extra words, and other such issues.

Partial String Similarity

Here’s a good illustration:
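A sketch of the failure mode with plain edit-distance scores. The “Yankees” / “New York Yankees” pair is the one discussed below; the Mets string is an illustrative stand-in for the mismatched pair:

```python
from difflib import SequenceMatcher

def ratio(s1, s2):
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

# Same team, but the plain ratio is unimpressed...
right = ratio("YANKEES", "NEW YORK YANKEES")
# ...while two different teams score higher.
bad = ratio("NEW YORK METS", "NEW YORK YANKEES")
print(right, bad)
```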
This doesn’t pass the intern test. The first two strings are clearly referring to the same team, but the second two are clearly referring to different ones. Yet the score of the “bad” match is higher than the “right” one. Inconsistent substrings are a common problem for us. To get around it, we use a heuristic we call “best partial” when two strings are of noticeably different lengths (as in the case above). If the shorter string has length m, and the longer string has length n, we’re basically interested in the score of the best matching length-m substring. In this case, we’d look at the following combinations:
and conclude that the last one is clearly the best. It turns out that “Yankees” and “New York Yankees” are a perfect partial match: the shorter string is a substring of the longer. We have a helper function for this too (and it’s far more efficient than the simplified algorithm I just laid out):
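One way the scan can be made cheaper (a sketch, not necessarily the library’s exact implementation) is to score only the windows at alignments suggested by SequenceMatcher’s matching blocks, instead of every possible window:

```python
from difflib import SequenceMatcher

def partial_ratio(s1, s2):
    # Sketch: use the matching blocks between the two strings to pick
    # candidate window alignments in the longer string.
    shorter, longer = sorted((s1, s2), key=len)
    sm = SequenceMatcher(None, shorter, longer)
    best = 0.0
    for block in sm.get_matching_blocks():
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))

print(partial_ratio("YANKEES", "NEW YORK YANKEES"))  # 100
```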
That’s more like it.

Out Of Order

Substrings aren’t our only problem. We also have to deal with differences in string construction. Here is an extremely common pattern, where one seller constructs strings as “<HOME_TEAM> vs <AWAY_TEAM>” while another prefers “<AWAY_TEAM> vs <HOME_TEAM>”:
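For instance, a plain ratio scores two reorderings of the same matchup poorly (the team strings here are illustrative):

```python
from difflib import SequenceMatcher

def ratio(s1, s2):
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

# The same game, written two different ways (illustrative strings).
score = ratio("New York Mets vs Atlanta Braves",
              "Atlanta Braves vs New York Mets")
print(score)  # well below what a "same event" pair should score
```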
Again, these low scores don’t pass the intern test. If these listings are for the same day, they’re certainly referring to the same baseball game. We need a way to control for string construction. To solve this, we’ve developed two different heuristics: the “token_sort” approach and the “token_set” approach. I’ll explain both.

Token Sort

The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. For example:
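A sketch of the transform (splitting on whitespace and lowercasing are assumptions on my part):

```python
def token_sort(s):
    # Tokenize on whitespace, sort alphabetically, re-join.
    return " ".join(sorted(s.lower().split()))

print(token_sort("New York Mets vs Atlanta Braves"))
# -> "atlanta braves mets new vs york"
```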
We then compare the transformed strings with a simple ratio(). That nicely solves our ordering problem, as a helper function that applies the transform shows:
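A sketch of such a helper, combining the transform with the plain ratio (the real library exposes this as `fuzz.token_sort_ratio`):

```python
from difflib import SequenceMatcher

def token_sort_ratio(s1, s2):
    # Sort each string's tokens before comparing.
    t1 = " ".join(sorted(s1.lower().split()))
    t2 = " ".join(sorted(s2.lower().split()))
    return int(round(100 * SequenceMatcher(None, t1, t2).ratio()))

print(token_sort_ratio("New York Mets vs Atlanta Braves",
                       "Atlanta Braves vs New York Mets"))  # 100
```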
Token Set

The token set approach is similar, but a little more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string. Here is an illustrative example:
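The discussion below names “angels” and “mariners” as the tokens common to both strings, so a pair along these lines fits. These exact strings are a plausible reconstruction, not necessarily the original example:

```python
s1 = "mariners vs angels"
s2 = "los angeles angels of anaheim at seattle mariners"

# The two token groups: intersection and remainders.
tokens1, tokens2 = set(s1.split()), set(s2.split())
print(sorted(tokens1 & tokens2))  # ['angels', 'mariners']
print(sorted(tokens1 - tokens2))  # ['vs']
print(sorted(tokens2 - tokens1))
```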
Using the token sort method isn’t that helpful, because the second (longer) string has too many extra tokens that get interleaved with the sort. We’d end up comparing:
Not very useful. Instead, the set method allows us to detect that “angels” and “mariners” are common to both strings, and separate those out (the set intersection). Now we construct and compare strings of the following form:

t0 = [SORTED_INTERSECTION]
t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]
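A sketch of the whole heuristic under those definitions, returning the best pairwise score (the real library exposes this as `fuzz.token_set_ratio`):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def token_set_ratio(s1, s2):
    # Split tokens into intersection and per-string remainders,
    # build the three comparison strings, return the best pairwise score.
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    inter = " ".join(sorted(tokens1 & tokens2))
    t0 = inter
    t1 = (inter + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    t2 = (inter + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    return max(ratio(t0, t1), ratio(t0, t2), ratio(t1, t2))

print(token_set_ratio("mariners vs angels",
                      "los angeles angels of anaheim at seattle mariners"))
```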
And then compare each pair. The intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) it makes up a larger percentage of the full string, and (b) the string remainders are more similar. In our example:
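Walking through the reconstructed example pair, the three pairwise comparisons come out like this (scores from difflib, as an approximation):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

s1 = "mariners vs angels"
s2 = "los angeles angels of anaheim at seattle mariners"  # reconstructed example
tokens1, tokens2 = set(s1.split()), set(s2.split())
inter = " ".join(sorted(tokens1 & tokens2))
t0 = inter                                          # "angels mariners"
t1 = inter + " " + " ".join(sorted(tokens1 - tokens2))
t2 = inter + " " + " ".join(sorted(tokens2 - tokens1))
r01, r02, r12 = ratio(t0, t1), ratio(t0, t2), ratio(t1, t2)
print(r01, r02, r12)  # the first comparison dominates
```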
There are other ways to combine these values. For example, we could have taken an average, or a min. But in our experience, a “best match possible” approach seems to provide the best real life outcomes. And of course, using a set means that duplicate tokens get lost in the transformation.
Conclusion

So there you have it: one of the secrets of SeatGeek, revealed. There are more tidbits in the library (available on Github), including convenience methods for matching a value against a list of options. Happy hunting.