 |
AutoHotkey Community Let's help each other out
|
| View previous topic :: View next topic |
| Author |
Message |
esromneb
Joined: 06 Jan 2006 Posts: 53
|
Posted: Sun Jan 08, 2006 11:10 am Post subject: Levenshtein Distance Algorithm |
|
|
I'm relativly new to ahk, but I've translated the Levenshtein Algorithm as found here:
http://www.merriampark.com/ldc.htm
The algorithm is explaned in detial here:
http://www.merriampark.com/ld.htm
(I find it rather ironic how similar those urls are)
Anyways can someone tell me if this is efficient or not? I'm still a little fuzzy on what you can do with references (arrays in this case).
Here it is:
| Code: |
levenshtein_distance( s, t )
{
n := strlen(s)
m := strlen(t)
if( n != 0 and m != 0 )
{
m++
n++
d0 = 0 ; Because A_index starts at 1, we emulate a for loop by hardcoding the first repetition
Loop, %n%
d%A_Index% := A_Index
; I would emulate the first repetition here, but it just sets d0 = 0
Loop, %m%
{
temp1 := A_Index * n
d%temp1% := A_Index
}
B_Index := 0
Loop, %n%
{
B_Index++
Loop, %m%
{
temp1 := B_Index
temp2 := A_Index
StringMid, tempA, s, temp1, 1
StringMid, tempB, t, temp2, 1
;MsgBox, Comparing %tempA% and %tempB%
if( tempA == tempB )
cost := 0
else
cost := 1
temp1 := A_Index * n + B_Index
temp2 := (A_Index - 1) * n + B_Index
temp3 := A_Index * n + B_Index - 1
temp4 := (A_Index - 1) * n + B_Index - 1
d%temp1% := minimum( d%temp2%+1 , d%temp3%+1 , d%temp4%+cost )
} ;Loop, m
} ;Loop, n
temp1 := n*m - 1
distance := d%temp1%
return distance
} ;if
else
return -1
}
minimum( a, b, c )
{
min := a
if(b<min)
min := b
if(c<min)
min := c
return min
}
MsgBox,% levenshtein_distance( "Ben", "Hen" )
|
Thanks!! |
|
| Back to top |
|
 |
toralf
Joined: 31 Jan 2005 Posts: 3842 Location: Bremen, Germany
|
Posted: Sun Jan 08, 2006 11:22 am Post subject: |
|
|
you could replace | Code: | B_Index := 0
Loop, %n%
{
B_Index++
| with this | Code: | Loop, %n%
{
B_Index = %A_Index% |
and you could replace | Code: | temp1 := B_Index
temp2 := A_Index
StringMid, tempA, s, temp1, 1
StringMid, tempB, t, temp2, 1
| with | Code: | StringMid, tempA, s, B_Index , 1
StringMid, tempB, t, A_Index , 1
|
and you could replace | Code: | distance := d%temp1%
return distance
| with
But I goes all will not increase speed very much. _________________ Ciao
toralf  |
|
| Back to top |
|
 |
toralf as guest Guest
|
Posted: Sun Jan 08, 2006 11:48 am Post subject: |
|
|
And you can replace | Code: | n := strlen(s)
m := strlen(t)
if( n != 0 and m != 0 )
{
m++
n++ | with | Code: | n := strlen(s) + 1
m := strlen(t) + 1
if( n != 1 and m != 1 )
{
|
|
|
| Back to top |
|
 |
toralf as guest Guest
|
Posted: Sun Jan 08, 2006 11:55 am Post subject: |
|
|
You could also write | Code: | temp1 := A_Index * n + B_Index
temp2 := (A_Index - 1) * n + B_Index
temp3 := A_Index * n + B_Index - 1
temp4 := (A_Index - 1) * n + B_Index - 1
| as | Code: | temp1 := A_Index * n + B_Index
temp2 := (A_Index - 1) * n + B_Index
temp3 := temp1 - 1
temp4 := temp2 - 1
|
|
|
| Back to top |
|
 |
toralf
Joined: 31 Jan 2005 Posts: 3842 Location: Bremen, Germany
|
Posted: Sun Jan 08, 2006 12:49 pm Post subject: |
|
|
I have taken a look at the original algorithm, it seams that you have made some modifications. Your loops go one iteration to much, since the loops in the c program go to k<n , k<m , i<n and j<m, but the loops in your code go up to n and m.
The original also returns m or n if one of the strings is empty. The c code you used as basis returns -1.
I guess you could also use Loop,Parse instead or normal loops and stringMid. But I do not know if that is faster. _________________ Ciao
toralf  |
|
| Back to top |
|
 |
toralf
Joined: 31 Jan 2005 Posts: 3842 Location: Bremen, Germany
|
Posted: Sun Jan 08, 2006 4:48 pm Post subject: |
|
|
I guess this is the most effective way and closest to the original algorithm: | Code: | MsgBox, % levenshtein_distance( "Ben", "Hen" )
MsgBox, % levenshtein_distance( "Toralf", "esromneb" )
levenshtein_distance( s, t )
{
n := strlen(s)
m := strlen(t)
If n = 0
Return, m
If m = 0
Return, n
d0 = 0
Loop, %n%
d%A_Index% := A_Index
Loop, %m%
{
Index := A_Index * (n + 1)
d%Index% := A_Index
}
Loop, Parse, s
{
i = %A_Index%
si = %A_LoopField%
Loop, Parse, t
{
If (si == A_LoopField)
cost = 0
else
cost = 1
Index := A_Index * (n + 1) + i
iLeft := Index - 1
iAbove := iLeft - n
iDiag := iAbove - 1
d%Index% := minimum( d%iAbove%+1 , d%iLeft%+1 , d%iDiag%+cost )
}
}
Return, d%Index%
}
minimum( min, b, c )
{
if(b<min)
min := b
if(c<min)
min := c
return min
} |
_________________ Ciao
toralf  |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4012 Location: Pittsburgh
|
Posted: Sun Jan 08, 2006 7:10 pm Post subject: |
|
|
Thanks Esromneb, for the great script! It can be useful for many tasks.
In an interpreted language, the efficiency of the code is closely related to the number of lines executed, so there we could save some execution time. In AHK, expression evaluation is slower than direct string assignment, which would allow a little further speedup.
Continuing the trimming of the code, what ToRalf started, we could shave off another couple of lines of code by using two-dimensional arrays directly, instead of the index calculation. The expression (si <> A_LoopField) gives you 0/1 directly for cost, to be used in the function min. | Code: | MsgBox % Levenshtein_distance( "AHK", "" )
MsgBox % Levenshtein_distance( "He", "Ben" )
MsgBox % Levenshtein_distance( "Toralf", "esromneb" )
MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" )
Levenshtein_distance( s, t )
{
n := StrLen(s)
IfEqual n,0, Return m
m := StrLen(t)
IfEqual m,0, Return n
d0_0 = 0
Loop %n%
d0_%A_Index% := A_Index
Loop %m%
d%A_Index%_0 := A_Index
Loop Parse, s
{
i = %A_Index%
i1:= i-1
si = %A_LoopField%
Loop Parse, t
{
j1 := A_Index - 1
d%A_Index%_%i% := min( d%A_Index%_%i1%+1, d%j1%_%i%+1, d%j1%_%i1%+(si <> A_LoopField) )
}
}
Return d%m%_%n%
}
min( a, b, c )
{
IfLess b, %a%, SetEnv a, %b%
IfLess c, %a%, Return c
Return a
} |
|
|
| Back to top |
|
 |
Decarlo110
Joined: 15 Dec 2004 Posts: 303 Location: United States
|
Posted: Mon Jan 09, 2006 7:47 pm Post subject: |
|
|
i don't know why these type of things get named. I suppose it's for a common terminology rather than an intellectual claim. Same thing for the "Boolean" term and "Venn" diagrams.
Here's my version in Ahk, which gives the same result. It drops the traditional matrix approach, which i find clumsy in this case.
| Code: | ^#l:: ; optional hotkey = Ctrl.Win.L
Loop 2
InputBox str%A_Index%,, Enter string %A_Index%
if ( StrLen(str1) < StrLen(str2) )
{
shorterString = %str1%
longerString = %str2%
}
else
{
shorterString = %str2%
longerString = %str1%
}
Lev_edit_distance = 0
Loop Parse, shorterString
{
A_Index1 = %A_Index%
A_LoopField1 = %A_LoopField%
Loop Parse, longerString
{
if A_Index = %A_Index1%
{
if A_LoopField != %A_LoopField1%
Lev_edit_distance ++
BREAK
}
}
}
msgbox % "The Levenshtein/edit distance is "
. Lev_edit_distance + Abs( StrLen(str1) - StrLen(str2) ) |
A simple way to confound/defeat elementary versions of Levenshtein-distance checking is to use an offset. Smarter versions would perhaps attempt to locate the common initial string of given arbitrary length and work from there, and repeating this for another arbitrary length of characters when there is non-identicalness, in order to measure distributed common strings with a correspondence despite offsets. _________________ 1) The Open Source Definition http://www.opensource.org/docs/definition_plain.php
2) Intuitive. Logical. Versatile. Adaptable. <<AutoHotkey>> |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4012 Location: Pittsburgh
|
Posted: Mon Jan 09, 2006 9:50 pm Post subject: |
|
|
| Decarlo110's version computes a completely different (much simpler) distance between the strings. It is not the minimum number of changes to transform one string to the other, but the number of changes in a certain, monotone-scanning changing strategy. Even if a possible common initial string is located first, the result will be often wrong. |
|
| Back to top |
|
 |
Decarlo110
Joined: 15 Dec 2004 Posts: 303 Location: United States
|
Posted: Tue Jan 10, 2006 2:59 am Post subject: |
|
|
My previous post on this thread was a result of hasty reading on my part, and insufficient example on the website without any mention of deletion. Laszlo is correct; my previous post should be disregarded entirely. _________________ 1) The Open Source Definition http://www.opensource.org/docs/definition_plain.php
2) Intuitive. Logical. Versatile. Adaptable. <<AutoHotkey>> |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4012 Location: Pittsburgh
|
Posted: Tue Jan 10, 2006 4:33 am Post subject: |
|
|
| Decarlo110 wrote: | | my previous post should be disregarded entirely. | It is just different. The original runs in quadratic time of the input length, which could be prohibitively slow for long strings. Decarlo110 posted a linear time computable metric of string differences. Although it is not the minimum number of changes, it does relate to how different the strings are. So, keep this script (or some refined version) for huge strings. |
|
| Back to top |
|
 |
esromneb
Joined: 06 Jan 2006 Posts: 53
|
Posted: Thu Jan 12, 2006 3:50 am Post subject: |
|
|
Sorry for my delay...
Before swapping out my code with yours, I decided to check it first. There is a java (eww) applet online at http://www.merriampark.com/ld.htm
which will show you the correct value for the lev distance.
Tolraf: your code works just fine
Laszlo: your code doesn't work perfectly I wouldn't have noticed it, but your code returns 8 for
| Code: | | MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" ) |
while the website, tolraf's, and my original code return 9. I'm not sure where the error is.
Also, why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match.... |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4012 Location: Pittsburgh
|
Posted: Thu Jan 12, 2006 7:22 am Post subject: |
|
|
The right answer is 8, as you can easily verify: delete "To" from "Toralf" (2 letters) and add "LaDuce" to the end, which is 6 letters, making 8 all together. The question is, do you consider "r" and "R" different? Your code, and Toralf's, use "==" for comparison, which differentiates between upper- and lower- case letters. If you want that behavior, add "StringCaseSense On" to the beginning of my script, or replace "+(si <> A_LoopField)" in the table update expression with "+!(si==A_LoopField)". It is probably the best, if you introduce a third function parameter, which chooses between "StringCaseSense On" and "StringCaseSense Off", according to the user's desire: | Code: | Levenshtein_distance( s, t, Case_sense=1 )
{
Old_sense = %A_StringCaseSense%
If Case_sense
StringCaseSense ON
Else
StringCaseSense Off
;...
StringCaseSense %Old_sense%
Return d%m%_%n%
} | It defaults to case sensitivity. If this is not what you want, at the function call add a third parameter: 0 or False. |
|
| Back to top |
|
 |
esromneb
Joined: 06 Jan 2006 Posts: 53
|
Posted: Thu Jan 12, 2006 8:59 am Post subject: |
|
|
| thanks, I lowered all my strings anyways. Case sensative would be faster for me then, right? |
|
| Back to top |
|
 |
toralf
Joined: 31 Jan 2005 Posts: 3842 Location: Bremen, Germany
|
Posted: Thu Jan 12, 2006 10:35 am Post subject: |
|
|
| esromneb wrote: | | why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match.... | Our scripts do not return 0. They return the length of the other string. This is what the original algorithm had in mind. Since when one string is zero you just have to write the other string (add the chars). If both are nothing the differnce is zero.
You got the -1 idea from the C code you translated, but Laszlo and I used the original algorithm, see your own links. _________________ Ciao
toralf  |
|
| Back to top |
|
 |
|
|
You can post new topics in this forum You can reply to topics in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|