Various fuzzy string matching algorithms and Excellent video

Post a reply


In an effort to prevent automatic submissions, we require that you complete the following challenge.
Smilies
:D :) ;) :( :o :shock: :? 8-) :lol: :x :P :oops: :cry: :evil: :twisted: :roll: :!: :?: :idea: :| :mrgreen: :geek: :ugeek: :arrow: :angel: :clap: :crazy: :eh: :lolno: :problem: :shh: :shifty: :sick: :silent: :think: :thumbup: :thumbdown: :salute: :wave: :wtf: :yawn: :facepalm: :bravo: :dance: :beard: :morebeard: :xmas: :HeHe: :trollface: :cookie: :rainbow: :monkeysee: :monkeysay: :happybday: :headwall: :offtopic: :superhappy: :terms: :beer:
View more smilies

BBCode is ON
[img] is OFF
[flash] is OFF
[url] is ON
Smilies are ON

Topic review
   

Expand view Topic review: Various fuzzy string matching algorithms and Excellent video

Re: Various fuzzy string matching algorithms and Excellent video

Post by SL5 » 28 Apr 2018, 11:16

interesting topic. I am very curious what else is coming. the scripts here are not mine:

Code: Select all


FuzzySearch(string1, string2)
{
	lenl := StrLen(string1)
	lens := StrLen(string2)
	if(lenl > lens)
	{
		shorter := string2
		longer := string1
	}
	else if(lens > lenl)
	{
		shorter := string1
		longer := string2
		lens := lenl
		lenl := StrLen(string2)
	}
	else
		return StringDifference(string1, string2)
	min := 1
	Loop % lenl - lens + 1
	{
		distance := StringDifference(shorter, SubStr(longer, A_Index, lens))
		if(distance < min)
			min := distance
	}
	return min
}
;By Toralf:
;basic idea for SIFT3 code by Siderite Zackwehdex 
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html 
;took idea to normalize it to longest string from Brad Wood 
;http://www.bradwood.com/string_compare/ 
;Own work: 
; - when character only differ in case, LSC is a 0.8 match for this character 
; - modified code for speed, might lead to different results compared to original code 
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance) 
;http://www.autohotkey.com/forum/topic59407.html 
StringDifference(string1, string2, maxOffset=1) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common" 
  If (string1 = string2) 
    Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case 
  If (string1 = "" OR string2 = "") 
    Return (string1 = string2 ? 0/1 : 1/1) 
  StringSplit, n, string1 
  StringSplit, m, string2 
  ni := 1, mi := 1, lcs := 0 
  While((ni <= n0) AND (mi <= m0)) { 
    If (n%ni% == m%mi%) 
      EnvAdd, lcs, 1 
    Else If (n%ni% = m%mi%) 
      EnvAdd, lcs, 0.8 
    Else{ 
      Loop, %maxOffset%  { 
        oi := ni + A_Index, pi := mi + A_Index 
        If ((n%oi% = m%mi%) AND (oi <= n0)){ 
            ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8) 
            Break 
        } 
        If ((n%ni% = m%pi%) AND (pi <= m0)){ 
            mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8) 
            Break 
        } 
      } 
    } 
    EnvAdd, ni, 1 
    EnvAdd, mi, 1 
  } 
  Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0) 
}

some links:
Damerau-Levenshtein Distance - Fuzzy searches https://autohotkey.com/boards/viewtopic ... zy#p182567

[Library] Sift - Fuzzy Search by RegEx and Ngram https://autohotkey.com/boards/viewtopic.php?t=7302

Re: Various fuzzy string matching algorithms and Excellent video

Post by robodesign » 31 Mar 2018, 17:16

guest3456 wrote::lol:
Lol

Re: Various fuzzy string matching algorithms and Excellent video

Post by Helgef » 23 Mar 2018, 14:39

I do not think I can help you. Good luck.

Re: Various fuzzy string matching algorithms and Excellent video

Post by robodesign » 23 Mar 2018, 14:31

Helgef... I did so... I came across only on the SIFT3 implementation. Can you help, please? thanks

Re: Various fuzzy string matching algorithms and Excellent video

Post by Helgef » 23 Mar 2018, 14:23

Re: Various fuzzy string matching algorithms and Excellent video

Post by robodesign » 23 Mar 2018, 14:02

Where can I find implementation in AHK of such algorithms?

Various fuzzy string matching algorithms and Excellent video

Post by Joe Glines » 29 Dec 2015, 10:30

I've been studying up on string matching (fuzzy matching). I found a few articles discussing various approaches like

I found this video from two guys which took a process of checking to see if a name was on a terrorist watch lists which originally took 14 days to compute down to 5 minutes

What's in a Name? Fast Fuzzy String Matching - Seth Verrinder & Kyle Putnam - Midwest.io 2015

Below are my notes from watching the video (it is ~40 minutes long but very interesting)

1) throw more hardware
2) use another variable/field (zip code / country etc.)
3) n-grams
4) metric trees (example: Lowenstein distance)
5) Brute force (Jaro Winkler is pretty fast already) (5X down to 70hrs )
6) Filtering- estimate similarity first then filter (7x down to 50 hrs 18 minutes in video)
· Length of strings (name length often is not normally distributed so doesn't rule out too much) Probably still look at 70%

· 26 Character filter- search for character that isn't shared- This dropped out quite a bit but was slow (300x down to 65 minutes)

o Bitmap filter- use bitwise operations to get unmatched count- very fast! (340X down to 60 minutes 20 minutes in video)

o 64 character filter (used all bits)- checked for multiple occurrences of a given letter

7) Minimize recalculation (4,000x down to 5 minutes - 28 minutes in video)
· sort names and groups into segments
· common length and first character
· used WolframAlpha to help show formula


Learnings-

· Measure performance and focus on bottleneck
· Order of magnitude doesn't always tell you about actual performance
· Favor simplicity

Top