AutoHotkey Homepage AutoHotkey Community
Let's help each other out
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Levenshtein Distance Algorithm
Goto page 1, 2  Next
 
Post new topic   Reply to topic    AutoHotkey Community Forum Index -> Scripts & Functions
View previous topic :: View next topic  
Author Message
esromneb



Joined: 06 Jan 2006
Posts: 53

PostPosted: Sun Jan 08, 2006 11:10 am    Post subject: Levenshtein Distance Algorithm Reply with quote

I'm relativly new to ahk, but I've translated the Levenshtein Algorithm as found here:
http://www.merriampark.com/ldc.htm

The algorithm is explaned in detial here:
http://www.merriampark.com/ld.htm

(I find it rather ironic how similar those urls are)

Anyways can someone tell me if this is efficient or not? I'm still a little fuzzy on what you can do with references (arrays in this case).
Here it is:
Code:

levenshtein_distance( s, t )
{
  n := strlen(s)
  m := strlen(t)

  if( n != 0 and m != 0 )
  {

    m++
    n++
    d0 = 0 ; Because A_index starts at 1, we emulate a for loop by hardcoding the first repetition

    Loop, %n%
      d%A_Index% := A_Index

    ; I would emulate the first repetition here, but it just sets d0 = 0
    Loop, %m%
    {
      temp1 := A_Index * n
      d%temp1% := A_Index
    }
    B_Index := 0
    Loop, %n%
    {
      B_Index++
      Loop, %m%
      {
        temp1 := B_Index
        temp2 := A_Index
        StringMid, tempA, s, temp1, 1
        StringMid, tempB, t, temp2, 1
        ;MsgBox, Comparing %tempA% and %tempB%
        if( tempA == tempB )
          cost := 0
        else
          cost := 1
        temp1 := A_Index * n + B_Index
        temp2 := (A_Index - 1) * n + B_Index
        temp3 := A_Index * n + B_Index - 1
        temp4 := (A_Index - 1) * n + B_Index - 1
        d%temp1% := minimum( d%temp2%+1 , d%temp3%+1 , d%temp4%+cost )
      } ;Loop, m
    } ;Loop, n
    temp1 := n*m - 1
    distance := d%temp1%
    return distance
  } ;if
  else
    return -1
}



minimum( a, b, c )
{
  min := a
  if(b<min)
    min := b
  if(c<min)
    min := c
  return min
}


MsgBox,% levenshtein_distance( "Ben", "Hen" )


Thanks!!
Back to top
View user's profile Send private message
toralf



Joined: 31 Jan 2005
Posts: 3842
Location: Bremen, Germany

PostPosted: Sun Jan 08, 2006 11:22 am    Post subject: Reply with quote

you could replace
Code:
    B_Index := 0
    Loop, %n%
    {
      B_Index++
with this
Code:
    Loop, %n%
    {
      B_Index = %A_Index%


and you could replace
Code:
        temp1 := B_Index
        temp2 := A_Index
        StringMid, tempA, s, temp1, 1
        StringMid, tempB, t, temp2, 1
with
Code:
        StringMid, tempA, s, B_Index , 1
        StringMid, tempB, t, A_Index , 1


and you could replace
Code:
    distance := d%temp1%
    return distance
with
Code:
    return d%temp1% 


But I goes all will not increase speed very much.
_________________
Ciao
toralf
Back to top
View user's profile Send private message Send e-mail Visit poster's website
toralf as guest
Guest





PostPosted: Sun Jan 08, 2006 11:48 am    Post subject: Reply with quote

And you can replace
Code:
  n := strlen(s)
  m := strlen(t)

  if( n != 0 and m != 0 )
  {

    m++
    n++
with
Code:
  n := strlen(s) + 1
  m := strlen(t) + 1

  if( n != 1 and m != 1 )
  {
Back to top
toralf as guest
Guest





PostPosted: Sun Jan 08, 2006 11:55 am    Post subject: Reply with quote

You could also write
Code:
        temp1 := A_Index * n + B_Index
        temp2 := (A_Index - 1) * n + B_Index
        temp3 := A_Index * n + B_Index - 1
        temp4 := (A_Index - 1) * n + B_Index - 1
as
Code:
        temp1 := A_Index * n + B_Index
        temp2 := (A_Index - 1) * n + B_Index
        temp3 := temp1 - 1
        temp4 := temp2 - 1
Back to top
toralf



Joined: 31 Jan 2005
Posts: 3842
Location: Bremen, Germany

PostPosted: Sun Jan 08, 2006 12:49 pm    Post subject: Reply with quote

I have taken a look at the original algorithm, it seams that you have made some modifications. Your loops go one iteration to much, since the loops in the c program go to k<n , k<m , i<n and j<m, but the loops in your code go up to n and m.
The original also returns m or n if one of the strings is empty. The c code you used as basis returns -1.

I guess you could also use Loop,Parse instead or normal loops and stringMid. But I do not know if that is faster.
_________________
Ciao
toralf
Back to top
View user's profile Send private message Send e-mail Visit poster's website
toralf



Joined: 31 Jan 2005
Posts: 3842
Location: Bremen, Germany

PostPosted: Sun Jan 08, 2006 4:48 pm    Post subject: Reply with quote

I guess this is the most effective way and closest to the original algorithm:
Code:
MsgBox, % levenshtein_distance( "Ben", "Hen" )
MsgBox, % levenshtein_distance( "Toralf", "esromneb" )

levenshtein_distance( s, t )
  {
    n := strlen(s)
    m := strlen(t)

    If n = 0
        Return, m
    If m = 0
        Return, n
 
    d0 = 0
    Loop, %n%
        d%A_Index% := A_Index
    Loop, %m%
      {
        Index := A_Index * (n + 1)
        d%Index% := A_Index
      }

    Loop, Parse, s
      {
        i = %A_Index%
        si = %A_LoopField%
        Loop, Parse, t
          {
            If (si == A_LoopField)
              cost = 0
            else
              cost = 1
           
            Index  := A_Index * (n + 1) + i
            iLeft  := Index - 1
            iAbove := iLeft - n
            iDiag  := iAbove - 1
            d%Index% := minimum( d%iAbove%+1 , d%iLeft%+1 , d%iDiag%+cost )
          }
      }
    Return, d%Index%
  }

minimum( min, b, c )
  {
    if(b<min)
      min := b
    if(c<min)
      min := c
    return min
  }

_________________
Ciao
toralf
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Laszlo



Joined: 14 Feb 2005
Posts: 4012
Location: Pittsburgh

PostPosted: Sun Jan 08, 2006 7:10 pm    Post subject: Reply with quote

Thanks Esromneb, for the great script! It can be useful for many tasks.

In an interpreted language, the efficiency of the code is closely related to the number of lines executed, so there we could save some execution time. In AHK, expression evaluation is slower than direct string assignment, which would allow a little further speedup.

Continuing the trimming of the code, what ToRalf started, we could shave off another couple of lines of code by using two-dimensional arrays directly, instead of the index calculation. The expression (si <> A_LoopField) gives you 0/1 directly for cost, to be used in the function min.
Code:
MsgBox % Levenshtein_distance( "AHK", "" )
MsgBox % Levenshtein_distance( "He", "Ben" )
MsgBox % Levenshtein_distance( "Toralf", "esromneb" )
MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" )

Levenshtein_distance( s, t )
{
   n := StrLen(s)
   IfEqual n,0, Return m
   m := StrLen(t)
   IfEqual m,0, Return n

   d0_0 = 0
   Loop %n%
      d0_%A_Index% := A_Index
   Loop %m%
      d%A_Index%_0 := A_Index

   Loop Parse, s
   {
      i  = %A_Index%
      i1:= i-1
      si = %A_LoopField%
      Loop Parse, t
      {
         j1 := A_Index - 1
         d%A_Index%_%i% := min( d%A_Index%_%i1%+1, d%j1%_%i%+1, d%j1%_%i1%+(si <> A_LoopField) )
      }
   }
   Return d%m%_%n%
}

min( a, b, c )
{
   IfLess b, %a%, SetEnv a, %b%
   IfLess c, %a%, Return c
   Return a
}
Back to top
View user's profile Send private message
Decarlo110



Joined: 15 Dec 2004
Posts: 303
Location: United States

PostPosted: Mon Jan 09, 2006 7:47 pm    Post subject: Reply with quote

i don't know why these type of things get named. I suppose it's for a common terminology rather than an intellectual claim. Same thing for the "Boolean" term and "Venn" diagrams.

Here's my version in Ahk, which gives the same result. It drops the traditional matrix approach, which i find clumsy in this case.
Code:
^#l::   ; optional hotkey = Ctrl.Win.L
Loop 2
   InputBox str%A_Index%,, Enter string %A_Index%
if ( StrLen(str1) < StrLen(str2) )
{
   shorterString = %str1%
   longerString = %str2%
}
else
{
   shorterString = %str2%
   longerString = %str1%
}
Lev_edit_distance = 0
Loop Parse, shorterString
{
   A_Index1 = %A_Index%
   A_LoopField1 = %A_LoopField%

   Loop Parse, longerString
   {
      if A_Index = %A_Index1%
      {
         if A_LoopField != %A_LoopField1%
            Lev_edit_distance ++
         BREAK
      }
   }
}
msgbox % "The Levenshtein/edit distance is "
. Lev_edit_distance + Abs( StrLen(str1) - StrLen(str2) )


A simple way to confound/defeat elementary versions of Levenshtein-distance checking is to use an offset. Smarter versions would perhaps attempt to locate the common initial string of given arbitrary length and work from there, and repeating this for another arbitrary length of characters when there is non-identicalness, in order to measure distributed common strings with a correspondence despite offsets.
_________________
1) The Open Source Definition http://www.opensource.org/docs/definition_plain.php

2) Intuitive. Logical. Versatile. Adaptable. <<AutoHotkey>>
Back to top
View user's profile Send private message
Laszlo



Joined: 14 Feb 2005
Posts: 4012
Location: Pittsburgh

PostPosted: Mon Jan 09, 2006 9:50 pm    Post subject: Reply with quote

Decarlo110's version computes a completely different (much simpler) distance between the strings. It is not the minimum number of changes to transform one string to the other, but the number of changes in a certain, monotone-scanning changing strategy. Even if a possible common initial string is located first, the result will be often wrong.
Back to top
View user's profile Send private message
Decarlo110



Joined: 15 Dec 2004
Posts: 303
Location: United States

PostPosted: Tue Jan 10, 2006 2:59 am    Post subject: Reply with quote

My previous post on this thread was a result of hasty reading on my part, and insufficient example on the website without any mention of deletion. Laszlo is correct; my previous post should be disregarded entirely.
_________________
1) The Open Source Definition http://www.opensource.org/docs/definition_plain.php

2) Intuitive. Logical. Versatile. Adaptable. <<AutoHotkey>>
Back to top
View user's profile Send private message
Laszlo



Joined: 14 Feb 2005
Posts: 4012
Location: Pittsburgh

PostPosted: Tue Jan 10, 2006 4:33 am    Post subject: Reply with quote

Decarlo110 wrote:
my previous post should be disregarded entirely.
It is just different. The original runs in quadratic time of the input length, which could be prohibitively slow for long strings. Decarlo110 posted a linear time computable metric of string differences. Although it is not the minimum number of changes, it does relate to how different the strings are. So, keep this script (or some refined version) for huge strings.
Back to top
View user's profile Send private message
esromneb



Joined: 06 Jan 2006
Posts: 53

PostPosted: Thu Jan 12, 2006 3:50 am    Post subject: Reply with quote

Sorry for my delay...
Before swapping out my code with yours, I decided to check it first. There is a java (eww) applet online at http://www.merriampark.com/ld.htm
which will show you the correct value for the lev distance.

Tolraf: your code works just fine

Laszlo: your code doesn't work perfectly I wouldn't have noticed it, but your code returns 8 for
Code:
MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" )

while the website, tolraf's, and my original code return 9. I'm not sure where the error is.

Also, why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match....
Back to top
View user's profile Send private message
Laszlo



Joined: 14 Feb 2005
Posts: 4012
Location: Pittsburgh

PostPosted: Thu Jan 12, 2006 7:22 am    Post subject: Reply with quote

The right answer is 8, as you can easily verify: delete "To" from "Toralf" (2 letters) and add "LaDuce" to the end, which is 6 letters, making 8 all together. The question is, do you consider "r" and "R" different? Your code, and Toralf's, use "==" for comparison, which differentiates between upper- and lower- case letters. If you want that behavior, add "StringCaseSense On" to the beginning of my script, or replace "+(si <> A_LoopField)" in the table update expression with "+!(si==A_LoopField)". It is probably the best, if you introduce a third function parameter, which chooses between "StringCaseSense On" and "StringCaseSense Off", according to the user's desire:
Code:
Levenshtein_distance( s, t, Case_sense=1 )
{
   Old_sense = %A_StringCaseSense%
   If Case_sense
      StringCaseSense ON
   Else
      StringCaseSense Off
   ;...
   StringCaseSense %Old_sense%
   Return d%m%_%n%
}
It defaults to case sensitivity. If this is not what you want, at the function call add a third parameter: 0 or False.
Back to top
View user's profile Send private message
esromneb



Joined: 06 Jan 2006
Posts: 53

PostPosted: Thu Jan 12, 2006 8:59 am    Post subject: Reply with quote

thanks, I lowered all my strings anyways. Case sensative would be faster for me then, right?
Back to top
View user's profile Send private message
toralf



Joined: 31 Jan 2005
Posts: 3842
Location: Bremen, Germany

PostPosted: Thu Jan 12, 2006 10:35 am    Post subject: Reply with quote

esromneb wrote:
why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match....
Our scripts do not return 0. They return the length of the other string. This is what the original algorithm had in mind. Since when one string is zero you just have to write the other string (add the chars). If both are nothing the differnce is zero.
You got the -1 idea from the C code you translated, but Laszlo and I used the original algorithm, see your own links.
_________________
Ciao
toralf
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Display posts from previous:   
Post new topic   Reply to topic    AutoHotkey Community Forum Index -> Scripts & Functions All times are GMT
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You can post new topics in this forum
You can reply to topics in this forum


Powered by phpBB © 2001, 2005 phpBB Group