Jump to content

Sky Slate Blueberry Blackcurrant Watermelon Strawberry Orange Banana Apple Emerald Chocolate
Photo

Levenshtein Distance Algorithm


  • Please log in to reply
22 replies to this topic
esromneb
  • Members
  • 50 posts
  • Last active: Oct 28 2006 07:02 PM
  • Joined: 06 Jan 2006
I'm relativly new to ahk, but I've translated the Levenshtein Algorithm as found here:
http://www.merriampark.com/ldc.htm

The algorithm is explaned in detial here:
http://www.merriampark.com/ld.htm

(I find it rather ironic how similar those urls are)

Anyways can someone tell me if this is efficient or not? I'm still a little fuzzy on what you can do with references (arrays in this case).
Here it is:
levenshtein_distance( s, t )
{
  n := strlen(s)
  m := strlen(t)

  if( n != 0 and m != 0 )
  {

    m++
    n++
    d0 = 0 ; Because A_index starts at 1, we emulate a for loop by hardcoding the first repetition

    Loop, %n%
      d%A_Index% := A_Index

    ; I would emulate the first repetition here, but it just sets d0 = 0
    Loop, %m%
    {
      temp1 := A_Index * n
      d%temp1% := A_Index
    }
    B_Index := 0
    Loop, %n%
    {
      B_Index++
      Loop, %m%
      {
        temp1 := B_Index
        temp2 := A_Index
        StringMid, tempA, s, temp1, 1
        StringMid, tempB, t, temp2, 1
        ;MsgBox, Comparing %tempA% and %tempB%
        if( tempA == tempB )
          cost := 0
        else
          cost := 1
        temp1 := A_Index * n + B_Index
        temp2 := (A_Index - 1) * n + B_Index
        temp3 := A_Index * n + B_Index - 1
        temp4 := (A_Index - 1) * n + B_Index - 1
        d%temp1% := minimum( d%temp2%+1 , d%temp3%+1 , d%temp4%+cost )
      } ;Loop, m
    } ;Loop, n
    temp1 := n*m - 1
    distance := d%temp1%
    return distance
  } ;if
  else
    return -1
}



minimum( a, b, c )
{
  min := a
  if(b<min)
    min := b
  if(c<min)
    min := c
  return min
}


MsgBox,% levenshtein_distance( "Ben", "Hen" )

Thanks!!

toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005
you could replace
B_Index := 0 
    Loop, %n% 
    { 
      B_Index++ 
with this
Loop, %n% 
    { 
      B_Index = %A_Index%

and you could replace
temp1 := B_Index 
        temp2 := A_Index 
        StringMid, tempA, s, temp1, 1 
        StringMid, tempB, t, temp2, 1 
with
StringMid, tempA, s, B_Index , 1 
        StringMid, tempB, t, A_Index , 1 

and you could replace
distance := d%temp1% 
    return distance 
with
return d%temp1%  

But I goes all will not increase speed very much.
Ciao
toralf
 
I use the latest AHK version (1.1.15+)
Please ask questions in forum on ahkscript.org. Why?
For online reference please use these Docs.

toralf as guest
  • Guests
  • Last active:
  • Joined: --
And you can replace
n := strlen(s) 

  m := strlen(t) 



  if( n != 0 and m != 0 ) 

  { 



    m++ 

    n++
with
n := strlen(s) + 1

  m := strlen(t) + 1



  if( n != 1 and m != 1 ) 

  { 



toralf as guest
  • Guests
  • Last active:
  • Joined: --
You could also write
temp1 := A_Index * n + B_Index 

        temp2 := (A_Index - 1) * n + B_Index 

        temp3 := A_Index * n + B_Index - 1 

        temp4 := (A_Index - 1) * n + B_Index - 1 

as
temp1 := A_Index * n + B_Index 

        temp2 := (A_Index - 1) * n + B_Index 

        temp3 := temp1 - 1 

        temp4 := temp2 - 1 



toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005
I have taken a look at the original algorithm, it seams that you have made some modifications. Your loops go one iteration to much, since the loops in the c program go to k The original also returns m or n if one of the strings is empty. The c code you used as basis returns -1.

I guess you could also use Loop,Parse instead or normal loops and stringMid. But I do not know if that is faster.
Ciao
toralf
 
I use the latest AHK version (1.1.15+)
Please ask questions in forum on ahkscript.org. Why?
For online reference please use these Docs.

toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005
I guess this is the most effective way and closest to the original algorithm:
MsgBox, % levenshtein_distance( "Ben", "Hen" ) 

MsgBox, % levenshtein_distance( "Toralf", "esromneb" ) 



levenshtein_distance( s, t ) 

  { 

    n := strlen(s) 

    m := strlen(t) 



    If n = 0

        Return, m

    If m = 0

        Return, n

  

    d0 = 0

    Loop, %n% 

        d%A_Index% := A_Index 

    Loop, %m% 

      { 

        Index := A_Index * (n + 1) 

        d%Index% := A_Index 

      } 



    Loop, Parse, s

      {

        i = %A_Index%

        si = %A_LoopField%

        Loop, Parse, t

          {

            If (si == A_LoopField)

              cost = 0

            else

              cost = 1

            

            Index  := A_Index * (n + 1) + i

            iLeft  := Index - 1 

            iAbove := iLeft - n 

            iDiag  := iAbove - 1 

            d%Index% := minimum( d%iAbove%+1 , d%iLeft%+1 , d%iDiag%+cost ) 

          }

      }

    Return, d%Index%

  } 



minimum( min, b, c ) 

  { 

    if(b<min) 

      min := b 

    if(c<min) 

      min := c 

    return min 

  }

Ciao
toralf
 
I use the latest AHK version (1.1.15+)
Please ask questions in forum on ahkscript.org. Why?
For online reference please use these Docs.

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
Thanks Esromneb, for the great script! It can be useful for many tasks.

In an interpreted language, the efficiency of the code is closely related to the number of lines executed, so there we could save some execution time. In AHK, expression evaluation is slower than direct string assignment, which would allow a little further speedup.

Continuing the trimming of the code, what ToRalf started, we could shave off another couple of lines of code by using two-dimensional arrays directly, instead of the index calculation. The expression (si <> A_LoopField) gives you 0/1 directly for cost, to be used in the function min.
MsgBox % Levenshtein_distance( "AHK", "" )
MsgBox % Levenshtein_distance( "He", "Ben" )
MsgBox % Levenshtein_distance( "Toralf", "esromneb" )
MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" )

Levenshtein_distance( s, t )
{
   n := StrLen(s)
   IfEqual n,0, Return m
   m := StrLen(t)
   IfEqual m,0, Return n

   d0_0 = 0
   Loop %n%
      d0_%A_Index% := A_Index
   Loop %m%
      d%A_Index%_0 := A_Index

   Loop Parse, s
   {
      i  = %A_Index%
      i1:= i-1
      si = %A_LoopField%
      Loop Parse, t
      {
         j1 := A_Index - 1
         d%A_Index%_%i% := min( d%A_Index%_%i1%+1, d%j1%_%i%+1, d%j1%_%i1%+(si <> A_LoopField) )
      }
   }
   Return d%m%_%n%
}

min( a, b, c )
{
   IfLess b, %a%, SetEnv a, %b%
   IfLess c, %a%, Return c
   Return a
}


Decarlo110
  • Members
  • 303 posts
  • Last active: Feb 12 2006 02:15 AM
  • Joined: 15 Dec 2004
i don't know why these type of things get named. I suppose it's for a common terminology rather than an intellectual claim. Same thing for the "Boolean" term and "Venn" diagrams.

Here's my version in Ahk, which gives the same result. It drops the traditional matrix approach, which i find clumsy in this case.
^#l::   ; optional hotkey = Ctrl.Win.L
Loop 2
	InputBox str%A_Index%,, Enter string %A_Index%
if ( StrLen(str1) < StrLen(str2) )
{
	shorterString = %str1%
	longerString = %str2%
}
else
{
	shorterString = %str2%
	longerString = %str1%
}
Lev_edit_distance = 0
Loop Parse, shorterString
{
	A_Index1 = %A_Index%
	A_LoopField1 = %A_LoopField%

	Loop Parse, longerString
	{
		if A_Index = %A_Index1%
		{
			if A_LoopField != %A_LoopField1%
				Lev_edit_distance ++
			BREAK
		}
	}
}
msgbox % "The Levenshtein/edit distance is "
. Lev_edit_distance + Abs( StrLen(str1) - StrLen(str2) )

A simple way to confound/defeat elementary versions of Levenshtein-distance checking is to use an offset. Smarter versions would perhaps attempt to locate the common initial string of given arbitrary length and work from there, and repeating this for another arbitrary length of characters when there is non-identicalness, in order to measure distributed common strings with a correspondence despite offsets.
1) The Open Source Definition http://www.opensourc...ition_plain.php

2) Intuitive. Logical. Versatile. Adaptable. <>

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
Decarlo110's version computes a completely different (much simpler) distance between the strings. It is not the minimum number of changes to transform one string to the other, but the number of changes in a certain, monotone-scanning changing strategy. Even if a possible common initial string is located first, the result will be often wrong.

Decarlo110
  • Members
  • 303 posts
  • Last active: Feb 12 2006 02:15 AM
  • Joined: 15 Dec 2004
My previous post on this thread was a result of hasty reading on my part, and insufficient example on the website without any mention of deletion. Laszlo is correct; my previous post should be disregarded entirely.
1) The Open Source Definition http://www.opensourc...ition_plain.php

2) Intuitive. Logical. Versatile. Adaptable. <>

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

my previous post should be disregarded entirely.

It is just different. The original runs in quadratic time of the input length, which could be prohibitively slow for long strings. Decarlo110 posted a linear time computable metric of string differences. Although it is not the minimum number of changes, it does relate to how different the strings are. So, keep this script (or some refined version) for huge strings.

esromneb
  • Members
  • 50 posts
  • Last active: Oct 28 2006 07:02 PM
  • Joined: 06 Jan 2006
Sorry for my delay...
Before swapping out my code with yours, I decided to check it first. There is a java (eww) applet online at http://www.merriampark.com/ld.htm
which will show you the correct value for the lev distance.

Tolraf: your code works just fine

Laszlo: your code doesn't work perfectly I wouldn't have noticed it, but your code returns 8 for
MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" )
while the website, tolraf's, and my original code return 9. I'm not sure where the error is.

Also, why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match....

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
The right answer is 8, as you can easily verify: delete "To" from "Toralf" (2 letters) and add "LaDuce" to the end, which is 6 letters, making 8 all together. The question is, do you consider "r" and "R" different? Your code, and Toralf's, use "==" for comparison, which differentiates between upper- and lower- case letters. If you want that behavior, add "StringCaseSense On" to the beginning of my script, or replace "+(si <> A_LoopField)" in the table update expression with "+!(si==A_LoopField)". It is probably the best, if you introduce a third function parameter, which chooses between "StringCaseSense On" and "StringCaseSense Off", according to the user's desire:
Levenshtein_distance( s, t, Case_sense=1 )

{

   Old_sense = %A_StringCaseSense%

   If Case_sense

      StringCaseSense ON

   Else

      StringCaseSense Off

   ;...

   StringCaseSense %Old_sense%

   Return d%m%_%n%

}
It defaults to case sensitivity. If this is not what you want, at the function call add a third parameter: 0 or False.

esromneb
  • Members
  • 50 posts
  • Last active: Oct 28 2006 07:02 PM
  • Joined: 06 Jan 2006
thanks, I lowered all my strings anyways. Case sensative would be faster for me then, right?

toralf
  • Moderators
  • 4035 posts
  • Last active: Aug 20 2014 04:23 PM
  • Joined: 31 Jan 2005

why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match....

Our scripts do not return 0. They return the length of the other string. This is what the original algorithm had in mind. Since when one string is zero you just have to write the other string (add the chars). If both are nothing the differnce is zero.
You got the -1 idea from the C code you translated, but Laszlo and I used the original algorithm, see your own links.
Ciao
toralf
 
I use the latest AHK version (1.1.15+)
Please ask questions in forum on ahkscript.org. Why?
For online reference please use these Docs.