Page 1 of 1

Damerau-Levenshtein Distance - Fuzzy searches

Posted: 29 Oct 2017, 02:07
by nnnik

Code: Select all

/*
Levenshtein Distance
Source Wikipedia.com
https://en.wikipedia.org/wiki/Levenshtein_distance
Originally written in C translated into AutoHotkey
*/

LDistance(s, t)
{ 
    ; degenerate cases
    if (StrLen( s ) = 0) 
		return StrLen( t )
	if ( StrLen( t ) = 0) 
		return StrLen( s )
	s := StrSplit( s )
	t := StrSplit( t )
    v0 := []
    Loop % t.Length() + 1
        v0[ A_Index ] := A_Index - 1
	v3 := [v0]
    Loop % s.Length()
    {
        ; calculate v1 (current row distances) from the previous row v0
		i := A_Index
		v1 := [i]
        ; use formula to fill in the rest of the row
        Loop % t.Length()
		{
			cost := !( s[ i ] == t[ A_Index ] )
			v1[ A_Index + 1 ] := _lMin( v1[ A_Index ] + 1
			, v0[ A_Index + 1 ] + 1
			, v0[ A_Index ] + cost )
		}
		v3.Push( v0 )
		v0 := v1
    }

    return v1[ t.Length() +1 ]
}

/*
Damerau-Levenshtein Distance
Source Wikipedia.com
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Originally written in Python translated into AutoHotkey
Thx to Helgef for correcting it.
*/

DLDistance( a, b )
{
    da := {}
    d := []
	a := StrSplit( a )
	b := StrSplit( b )
    maxdist := a.Length() + b.Length()
    d[-1, -1] := maxdist
    Loop % a.Length() + 1
	{
        i := A_Index - 1
		d[i, -1] := maxdist
        d[i, 0] := i
    }
	Loop % b.Length() + 1
	{
		j := A_Index - 1
		d[-1, j] := maxdist
        d[0, j] := j
	}
    Loop % a.Length()
	{
        db := 0
		i := A_Index
        Loop % b.Length()
		{
            j := A_Index
			k :=  da.HasKey(b[j]) ?  da[b[j]] : da[b[j]] := 0
            l := db
            if !( cost := !(a[i] == b[j]) )
				db := j
            d[i, j] := _lmin( d[i-1, j-1] + cost, d[i,   j-1] + 1, d[i-1, j  ] + 1, d[k-1, l-1] + (i-k-1) + 1 + (j-l-1 ) ) 
		}
        da[a[i]] := i
	}
    return d[a.Length(), b.Length()] ;I cant understand why but it always reports 1 more than it should. Since I'm a fan of easy solutions I just subtract 1 here
}

_lMin( p* )
{
	Ret := p.Pop()
	For each,Val in p
		if ( Val < Ret )
		{
			Ret := Val
		}
	return Ret
}

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 29 Oct 2017, 08:28
by Helgef
Great stuff nnnik, thanks for sharing :thumbup:

Cheers.
Edit
iPhilip wrote: used the following script, which could be used for a spell checker, to test how well it compares words:
Neat! :thumbup:

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 29 Oct 2017, 12:27
by joedf
Nice! better than the original ahk v1.0 implementation by poly. I wanted a non global var version :+1:

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 30 Oct 2017, 18:49
by iPhilip
Thank you, nnnik!

I used the following script, which could be used for a spell checker, to test how well it compares words:

Code: Select all

Word := "tset"
WordList := "random,task,test,text,toast"
for each, Word in Sort(Score(Word,StrSplit(WordList,","),"D"))
   Result .= Word.score "-" Word.word "`n"
MsgBox % Substr(Result,1,-1)
ExitApp

Score(Word,WordArray,Func:="") {
   ScoredWords := []
   for Index, WordElement in WordArray
      ScoredWords.Push({word:WordElement, score:%Func%LDistance(Word,WordElement)})
   Return ScoredWords
}

Sort(ScoredWords) {
   for Index, Word in ScoredWords
      SortString .= Word.score " " Index "`n"
   SortString := SubStr(SortString,1,-1)
   Sort, SortString, N
   Indexes := []
   Loop, Parse, SortString, `n
   {  RegExMatch(A_LoopField, "O)(\d+) (\d+)", Match)
      Indexes[Match[1]] .= Match[2] "`n"
   }
   for Index in Indexes
   {  SortString := Substr(Indexes[Index],1,-1)
      Sort, SortString, N
      Indexes[Index] := SortString
   }
   SortedWords := []
   for Score, IndexList in Indexes
      for each, Index in StrSplit(IndexList,"`n")
         SortedWords.Push({word:ScoredWords[Index].word, score:Score})
   Return SortedWords
}
The result using the LDistance function is:

Code: Select all

2-test
2-text
3-task
3-toast
6-random
The result using the DLDistance function is:

Code: Select all

1-test
2-text
3-task
3-toast
6-random
The results show the benefit of the Damerau–Levenshtein distance (DLDistance) over the classical Levenshtein distance (LDistance) because it includes "transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions)." [ref]

Cheers!

Edit: I updated the result for the DLDistance function based on Helgef updated function in this post.

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 31 Oct 2017, 18:07
by burque505
Thanks, nnnik and iPhilip, that's cool. I wrapped iPhilip's code in a function so it would work with a "Run Auto Hot Key Script" [sic] Activity in UiPath Studio, which I posted at https://forum.uipath.com/t/levenshtein-distance/7068/6, linking back to this post. I also modified it so it will give back Levenshtein (i.e. change "D" to "" in the script).

Code: Select all

for each, Word in Sort(Score(Word,StrSplit(WordList,","),""))
instead of

Code: Select all

for each, Word in Sort(Score(Word,StrSplit(WordList,","),"D"))
I like how you made it easy to modify, iPhilip, thanks!

burque505

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 31 Oct 2017, 18:32
by iPhilip
Hi burque505,

You are welcome. :)

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 05 Nov 2017, 11:06
by Nextron
Really cool nnnik! Already got a project in mind which could use this, with something like DLDistance(a, b)-Abs(StrLen(a)-StrLen(b))*(1-(Penalty:=0.15)) to allow sub-string searches/matches still give a nice score.
I cant understand why but it always reports 1 more than it should. Since I'm a fan of easy solutions I just subtract 1 here
I'm still reading through the function, but might it be due to AHK using 1-based indices, resulting in a shift within the matrix being set up?
The subtract on the return value may not be sufficient, based on the following result:

Code: Select all

DLDistance("test string","test string") ;returns 0
DLDistance("test string","Xest string") ;returns 0 too

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 05 Nov 2017, 12:49
by Helgef
I'm still reading through the function, but might it be due to AHK using 1-based indices, resulting in a shift within the matrix being set up?
I was thinking 1 based index vs 0 based index too, but my very quick look at nnnik's code made me think it was handled, but now you mention matrix, which I consider should have 1 based index :)
Originally written in Python translated into AutoHotkey
I'm pretty sure Python uses 0 based indices for lists, but I do not recall if, scipy or numpy matrices uses 1 or 0 based index. :think:

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 06 Nov 2017, 12:07
by nnnik
Yea that's why I hesitated to publish the function initially.
If someone could point out the error I would be really greatful.

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 06 Nov 2017, 13:11
by Helgef
Is it based on this code? It is from the wikipage, it looks most like your code I think.

Code: Select all

/*
algorithm DL-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer
    da := new array of |Σ| integers
    for i := 1 to |Σ| inclusive do
        da[i] := 0
    let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2
    // note that d has indices starting at −1, while a, b and da are one-indexed.
    maxdist := length(a) + length(b)
    d[−1, −1] := maxdist
    for i := 0 to length(a) inclusive do
        d[i, −1] := maxdist
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[−1, j] := maxdist
        d[0, j] := j
    for i := 1 to length(a) inclusive do
        db := 0
        for j := 1 to length(b) inclusive do
            k := da[b[j]]
            ℓ := db
            if a[i] = b[j] then
                cost := 0
                db := j
            else
                cost := 1
            d[i, j] := minimum(d[i−1, j−1] + cost,  //substitution
                               d[i,   j−1] + 1,     //insertion
                               d[i−1, j  ] + 1,     //deletion
                               d[k−1, ℓ−1] + (i−k−1) + 1 + (j-ℓ−1)) //transposition
        da[a[i]] := i
    return d[length(a), length(b)]
*/
I adjusted your code to this,

Code: Select all

msgbox % DLDistance("test string","test string") ;returns 0
msgbox % DLDistance("test string","Xest string") ;returns 1
msgbox % DLDistance("test string","XXst string")


DLDistance( a, b )
{
    da := {}
    d := []
	a := StrSplit( a )
	b := StrSplit( b )
    maxdist := a.Length() + b.Length()
    d[-1, -1] := maxdist
    Loop % a.Length() + 1
	{
        i := A_Index - 1
		d[i, -1] := maxdist
        d[i, 0] := i
    }
	Loop % b.Length() + 1
	{
		j := A_Index - 1
		d[-1, j] := maxdist
        d[0, j] := j
	}
    Loop % a.Length()
	{
        db := 0
		i := A_Index
        Loop % b.Length()
		{
            j := A_Index
			k :=  da.HasKey(b[j]) ?  da[b[j]] : da[b[j]] := 0
            l := db
            if !( cost := !(a[i] == b[j]) )
				db := j
            d[i, j] := _lmin( d[i-1, j-1] + cost, d[i,   j-1] + 1, d[i-1, j  ] + 1, d[k-1, l-1] + (i-k-1) + 1 + (j-l-1 ) ) 
		}
        da[a[i]] := i
	}
    return d[a.Length(), b.Length()] ;I cant understand why but it always reports 1 more than it should. Since I'm a fan of easy solutions I just subtract 1 here
}

_lMin( p* )
{
	Ret := p.Pop()
	For each,Val in p
		if ( Val < Ret )
		{
			Ret := Val
		}
	return Ret
}
Cheers.

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 07 Nov 2017, 09:37
by nnnik
Thanks for that. I have updated the code

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 07 Nov 2017, 19:11
by burque505
nnnik, Helgef, and iPhilip, thank you all for updating your code and your posts. Once again, greatly appreciated.
Regards,
burque505

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 07 Nov 2017, 21:40
by joedf
typo: Originally wriiten in C translated into AutoHotkey -> Originally written in C translated into AutoHotkey

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 14 Nov 2017, 15:28
by nnnik
Thanks

Re: Damerau-Levenshtein Distance - Fuzzy searches

Posted: 14 Nov 2017, 22:00
by joedf
:+1: