I've been studying up on fuzzy string matching. I found a few articles discussing various approaches, such as:
[list]
[url=https://en.wikipedia.org/wiki/Levenshtein_distance]Levenshtein distance[/url]
[url=https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance]Damerau–Levenshtein distance[/url]
[url=https://en.wikipedia.org/wiki/N-gram]n-gram[/url]
[url=https://en.wikipedia.org/wiki/Soundex]Soundex[/url]
[url=https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance]Jaro-Winkler distance[/url]
[url=https://en.wikipedia.org/wiki/Jaccard_index]Jaccard index[/url]
[/list]
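As a concrete reference point for the first metric above, here is a minimal dynamic-programming Levenshtein distance in plain Python (my own sketch, not from any of the articles):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```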
I found this video from two developers who took a process for checking whether a name appears on a terrorist watch list and cut it [b][color=#FF0000]from an original 14 days of compute time down to 5 minutes[/color][/b].
[url=https://www.youtube.com/watch?v=s0YSKiFdj8Q]What's in a Name?[/url] Fast Fuzzy String Matching - Seth Verrinder & Kyle Putnam - Midwest.io 2015
Below are my notes from watching the video (it is ~40 minutes long but very interesting)
[hr][/hr]
1) Throw more hardware at it
2) Use another variable/field to narrow the search (zip code, country, etc.)
3) n-grams
4) Metric trees (example: Levenshtein distance)
5) Brute force (Jaro–Winkler is pretty fast already) ([b]5x, down to 70 hrs[/b])
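The talk doesn't walk through code, so here is a minimal plain-Python sketch of Jaro–Winkler similarity, the brute-force baseline; the 0.1 prefix weight and 4-character prefix cap are the standard defaults, not values from the talk:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters within a window
    and the number of transpositions among them."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    k = transpositions = 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    m, t = matches, transpositions / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # → 0.961
```

Even unoptimized, each comparison is just a couple of passes over two short names, which is why brute force alone already got them from 14 days to 70 hours.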
6) Filtering: estimate similarity first, then filter ([b]7x, down to 50 hrs[/b]; 18 minutes into the video)
· Length of strings (name lengths aren't normally distributed, so this doesn't rule out much; you probably still have to look at ~70% of the list)
· 26-character filter: look for a character that appears in one name but not the other. This dropped out quite a bit but was slow ([b]300x, down to 65 minutes[/b])
o Bitmap filter: use bitwise operations to get the unshared-character count. Very fast! ([b]340x, down to 60 minutes[/b]; 20 minutes into the video)
o 64-character filter (uses all 64 bits of a word): also checks for multiple occurrences of a given letter
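A minimal sketch of the bitmap-filter idea in Python (the 26-bit single-occurrence version; the `max_unshared` threshold here is my own illustrative choice, not a value from the talk):

```python
def char_mask(name: str) -> int:
    """26-bit mask: bit i is set if chr(ord('a') + i) appears in the name."""
    m = 0
    for c in name.lower():
        if 'a' <= c <= 'z':
            m |= 1 << (ord(c) - ord('a'))
    return m

def maybe_similar(mask1: int, mask2: int, max_unshared: int = 4) -> bool:
    """Cheap pre-filter: XOR leaves a bit for every letter present in one
    name but not the other, so its popcount is a lower bound on how many
    characters differ.  Only pairs that pass go on to Jaro-Winkler."""
    return bin(mask1 ^ mask2).count("1") <= max_unshared

# Masks are computed once per name, so the per-pair cost is a single
# XOR and popcount -- which is why this filter was so fast.
print(maybe_similar(char_mask("smith"), char_mask("smyth")))  # → True
```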
7) Minimize recalculation ([b]4,000x, down to 5 minutes[/b]; 28 minutes into the video)
· Sort the names and group them into segments
· Segment on common length and first character
· Used WolframAlpha to help work out the formula
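A simplified sketch of the segmenting idea (grouping on the exact first letter and a ±1 length slack are assumptions for illustration; the talk's segmentation is more careful so it doesn't miss fuzzy matches):

```python
from collections import defaultdict

def bucket_names(names):
    """Group names by (length, first letter) so that expensive
    comparisons only run within promising segments."""
    buckets = defaultdict(list)
    for n in names:
        buckets[(len(n), n[:1].lower())].append(n)
    return buckets

def candidates(buckets, query, length_slack=1):
    """Pull only the names whose length is within length_slack of the
    query and whose first letter matches, instead of the whole list."""
    q = query[:1].lower()
    out = []
    for dl in range(-length_slack, length_slack + 1):
        out.extend(buckets.get((len(query) + dl, q), []))
    return out

buckets = bucket_names(["anna", "anne", "bob", "ann"])
print(sorted(candidates(buckets, "anna")))  # → ['ann', 'anna', 'anne']
```

Work shared within a segment (common length, common prefix) also means partial Jaro-Winkler computations can be reused across a whole group rather than redone per pair, which is where the final 4,000x came from.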
Learnings:
· Measure performance and focus on the bottleneck
· Order-of-magnitude analysis doesn't always tell you about actual performance
· Favor simplicity