Various fuzzy string matching algorithms and Excellent video

Talk about anything
User avatar
Joe Glines
Posts: 698
Joined: 30 Sep 2013, 20:49
Facebook: https://www.facebook.com/theAutomatorGuru/
Google: https://plus.google.com/105328929654286634910
GitHub: joetazz
Location: Dallas
Contact:

Various fuzzy string matching algorithms and Excellent video

29 Dec 2015, 10:30

I've been studying up on string matching (fuzzy matching). I found a few articles discussing various approaches like

I found this video from two guys which took a process of checking to see if a name was on a terrorist watch lists which originally took 14 days to compute down to 5 minutes

What's in a Name? Fast Fuzzy String Matching - Seth Verrinder & Kyle Putnam - Midwest.io 2015

Below are my notes from watching the video (it is ~40 minutes long but very interesting)

1) throw more hardware
2) use another variable/field (zip code / country etc.)
3) n-grams
4) metric trees (example: Lowenstein distance)
5) Brute force (Jaro Winkler is pretty fast already) (5X down to 70hrs )
6) Filtering- estimate similarity first then filter (7x down to 50 hrs 18 minutes in video)
· Length of strings (name length often is not normally distributed so doesn't rule out too much) Probably still look at 70%

· 26 Character filter- search for character that isn't shared- This dropped out quite a bit but was slow (300x down to 65 minutes)

o Bitmap filter- use bitwise operations to get unmatched count- very fast! (340X down to 60 minutes 20 minutes in video)

o 64 character filter (used all bits)- checked for multiple occurrences of a given letter

7) Minimize recalculation (4,000x down to 5 minutes - 28 minutes in video)
· sort names and groups into segments
· common length and first character
· used WolframAlpha to help show formula


Learnings-

· Measure performance and focus on bottleneck
· Order of magnitude doesn't always tell you about actual performance
· Favor simplicity

Find & Click AutoHotkey syntax writer Automate my Task :clap:
AHK Tutorials:Web Scraping | AHK Studio | Webservice APIs | Excel | Chrome | RegEx | Functions
Training: AHK Webinars Courses on AutoHotkey :ugeek:
Connect with me on LinkedIn :beer: | YouTube
How-to: Create a shortcut that automatically
logs in to any website

:thumbup: Quick Access Popup, the powerful Windows folders, apps and documents launcher!
Help support the AutoHotkey foundation
robodesign
Posts: 683
Joined: 30 Sep 2017, 03:59
Facebook: marius.sucan
GitHub: mariussucan
Location: Romania
Contact:

Re: Various fuzzy string matching algorithms and Excellent video

23 Mar 2018, 14:02

Where can I find implementation in AHK of such algorithms?
-------------------------
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
robodesign
Posts: 683
Joined: 30 Sep 2017, 03:59
Facebook: marius.sucan
GitHub: mariussucan
Location: Romania
Contact:

Re: Various fuzzy string matching algorithms and Excellent video

23 Mar 2018, 14:31

Helgef... I did so... I came across only on the SIFT3 implementation. Can you help, please? thanks
-------------------------
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
robodesign
Posts: 683
Joined: 30 Sep 2017, 03:59
Facebook: marius.sucan
GitHub: mariussucan
Location: Romania
Contact:

Re: Various fuzzy string matching algorithms and Excellent video

31 Mar 2018, 17:16

guest3456 wrote::lol:
Lol
-------------------------
KeyPress OSD v4: GitHub or forum. (presentation video)
Quick Picto Viewer: GitHub or forum.
AHK GDI+ expanded / compilation library (on GitHub)
My home page.
User avatar
SL5
Posts: 878
Joined: 12 May 2015, 02:10
GitHub: sl5net
Contact:

Re: Various fuzzy string matching algorithms and Excellent video

28 Apr 2018, 11:16

interesting topic. I am very curious what else is coming. the scripts here are not mine:

Code: Select all


FuzzySearch(string1, string2)
{
	lenl := StrLen(string1)
	lens := StrLen(string2)
	if(lenl > lens)
	{
		shorter := string2
		longer := string1
	}
	else if(lens > lenl)
	{
		shorter := string1
		longer := string2
		lens := lenl
		lenl := StrLen(string2)
	}
	else
		return StringDifference(string1, string2)
	min := 1
	Loop % lenl - lens + 1
	{
		distance := StringDifference(shorter, SubStr(longer, A_Index, lens))
		if(distance < min)
			min := distance
	}
	return min
}
;By Toralf:
;basic idea for SIFT3 code by Siderite Zackwehdex 
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html 
;took idea to normalize it to longest string from Brad Wood 
;http://www.bradwood.com/string_compare/ 
;Own work: 
; - when character only differ in case, LSC is a 0.8 match for this character 
; - modified code for speed, might lead to different results compared to original code 
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance) 
;http://www.autohotkey.com/forum/topic59407.html 
StringDifference(string1, string2, maxOffset=1) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common" 
  If (string1 = string2) 
    Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case 
  If (string1 = "" OR string2 = "") 
    Return (string1 = string2 ? 0/1 : 1/1) 
  StringSplit, n, string1 
  StringSplit, m, string2 
  ni := 1, mi := 1, lcs := 0 
  While((ni <= n0) AND (mi <= m0)) { 
    If (n%ni% == m%mi%) 
      EnvAdd, lcs, 1 
    Else If (n%ni% = m%mi%) 
      EnvAdd, lcs, 0.8 
    Else{ 
      Loop, %maxOffset%  { 
        oi := ni + A_Index, pi := mi + A_Index 
        If ((n%oi% = m%mi%) AND (oi <= n0)){ 
            ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8) 
            Break 
        } 
        If ((n%ni% = m%pi%) AND (pi <= m0)){ 
            mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8) 
            Break 
        } 
      } 
    } 
    EnvAdd, ni, 1 
    EnvAdd, mi, 1 
  } 
  Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0) 
}

some links:
Damerau-Levenshtein Distance - Fuzzy searches https://autohotkey.com/boards/viewtopic ... zy#p182567

[Library] Sift - Fuzzy Search by RegEx and Ngram https://autohotkey.com/boards/viewtopic.php?t=7302

Return to “Offtopic”

Who is online

Users browsing this forum: jdpmdphd and 13 guests