AutoHotkey Community

It is currently May 27th, 2012, 6:32 am

All times are UTC [ DST ]




Post new topic Reply to topic  [ 23 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: January 8th, 2006, 11:10 am 
Offline

Joined: January 6th, 2006, 9:56 am
Posts: 50
I'm relativly new to ahk, but I've translated the Levenshtein Algorithm as found here:
http://www.merriampark.com/ldc.htm

The algorithm is explaned in detial here:
http://www.merriampark.com/ld.htm

(I find it rather ironic how similar those urls are)

Anyways can someone tell me if this is efficient or not? I'm still a little fuzzy on what you can do with references (arrays in this case).
Here it is:
Code:
levenshtein_distance( s, t )
{
  n := strlen(s)
  m := strlen(t)

  if( n != 0 and m != 0 )
  {

    m++
    n++
    d0 = 0 ; Because A_index starts at 1, we emulate a for loop by hardcoding the first repetition

    Loop, %n%
      d%A_Index% := A_Index

    ; I would emulate the first repetition here, but it just sets d0 = 0
    Loop, %m%
    {
      temp1 := A_Index * n
      d%temp1% := A_Index
    }
    B_Index := 0
    Loop, %n%
    {
      B_Index++
      Loop, %m%
      {
        temp1 := B_Index
        temp2 := A_Index
        StringMid, tempA, s, temp1, 1
        StringMid, tempB, t, temp2, 1
        ;MsgBox, Comparing %tempA% and %tempB%
        if( tempA == tempB )
          cost := 0
        else
          cost := 1
        temp1 := A_Index * n + B_Index
        temp2 := (A_Index - 1) * n + B_Index
        temp3 := A_Index * n + B_Index - 1
        temp4 := (A_Index - 1) * n + B_Index - 1
        d%temp1% := minimum( d%temp2%+1 , d%temp3%+1 , d%temp4%+cost )
      } ;Loop, m
    } ;Loop, n
    temp1 := n*m - 1
    distance := d%temp1%
    return distance
  } ;if
  else
    return -1
}



minimum( a, b, c )
{
  min := a
  if(b<min)
    min := b
  if(c<min)
    min := c
  return min
}


MsgBox,% levenshtein_distance( "Ben", "Hen" )


Thanks!!


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 8th, 2006, 11:22 am 
Offline

Joined: January 31st, 2005, 9:50 am
Posts: 3910
Location: Bremen, Germany
you could replace
Code:
    B_Index := 0
    Loop, %n%
    {
      B_Index++
with this
Code:
    Loop, %n%
    {
      B_Index = %A_Index%


and you could replace
Code:
        temp1 := B_Index
        temp2 := A_Index
        StringMid, tempA, s, temp1, 1
        StringMid, tempB, t, temp2, 1
with
Code:
        StringMid, tempA, s, B_Index , 1
        StringMid, tempB, t, A_Index , 1


and you could replace
Code:
    distance := d%temp1%
    return distance
with
Code:
    return d%temp1% 


But I goes all will not increase speed very much.

_________________
Ciao
toralf
Image


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 8th, 2006, 11:48 am 
And you can replace
Code:
  n := strlen(s)
  m := strlen(t)

  if( n != 0 and m != 0 )
  {

    m++
    n++
with
Code:
  n := strlen(s) + 1
  m := strlen(t) + 1

  if( n != 1 and m != 1 )
  {


Report this post
Top
  
Reply with quote  
 Post subject:
PostPosted: January 8th, 2006, 11:55 am 
You could also write
Code:
        temp1 := A_Index * n + B_Index
        temp2 := (A_Index - 1) * n + B_Index
        temp3 := A_Index * n + B_Index - 1
        temp4 := (A_Index - 1) * n + B_Index - 1
as
Code:
        temp1 := A_Index * n + B_Index
        temp2 := (A_Index - 1) * n + B_Index
        temp3 := temp1 - 1
        temp4 := temp2 - 1


Report this post
Top
  
Reply with quote  
 Post subject:
PostPosted: January 8th, 2006, 12:49 pm 
Offline

Joined: January 31st, 2005, 9:50 am
Posts: 3910
Location: Bremen, Germany
I have taken a look at the original algorithm, it seams that you have made some modifications. Your loops go one iteration to much, since the loops in the c program go to k<n , k<m , i<n and j<m, but the loops in your code go up to n and m.
The original also returns m or n if one of the strings is empty. The c code you used as basis returns -1.

I guess you could also use Loop,Parse instead or normal loops and stringMid. But I do not know if that is faster.

_________________
Ciao
toralf
Image


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 8th, 2006, 4:48 pm 
Offline

Joined: January 31st, 2005, 9:50 am
Posts: 3910
Location: Bremen, Germany
I guess this is the most effective way and closest to the original algorithm:
Code:
MsgBox, % levenshtein_distance( "Ben", "Hen" )
MsgBox, % levenshtein_distance( "Toralf", "esromneb" )

levenshtein_distance( s, t )
  {
    n := strlen(s)
    m := strlen(t)

    If n = 0
        Return, m
    If m = 0
        Return, n
 
    d0 = 0
    Loop, %n%
        d%A_Index% := A_Index
    Loop, %m%
      {
        Index := A_Index * (n + 1)
        d%Index% := A_Index
      }

    Loop, Parse, s
      {
        i = %A_Index%
        si = %A_LoopField%
        Loop, Parse, t
          {
            If (si == A_LoopField)
              cost = 0
            else
              cost = 1
           
            Index  := A_Index * (n + 1) + i
            iLeft  := Index - 1
            iAbove := iLeft - n
            iDiag  := iAbove - 1
            d%Index% := minimum( d%iAbove%+1 , d%iLeft%+1 , d%iDiag%+cost )
          }
      }
    Return, d%Index%
  }

minimum( min, b, c )
  {
    if(b<min)
      min := b
    if(c<min)
      min := c
    return min
  }

_________________
Ciao
toralf
Image


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 8th, 2006, 7:10 pm 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
Thanks Esromneb, for the great script! It can be useful for many tasks.

In an interpreted language, the efficiency of the code is closely related to the number of lines executed, so there we could save some execution time. In AHK, expression evaluation is slower than direct string assignment, which would allow a little further speedup.

Continuing the trimming of the code, what ToRalf started, we could shave off another couple of lines of code by using two-dimensional arrays directly, instead of the index calculation. The expression (si <> A_LoopField) gives you 0/1 directly for cost, to be used in the function min.
Code:
MsgBox % Levenshtein_distance( "AHK", "" )
MsgBox % Levenshtein_distance( "He", "Ben" )
MsgBox % Levenshtein_distance( "Toralf", "esromneb" )
MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" )

Levenshtein_distance( s, t )
{
   n := StrLen(s)
   IfEqual n,0, Return m
   m := StrLen(t)
   IfEqual m,0, Return n

   d0_0 = 0
   Loop %n%
      d0_%A_Index% := A_Index
   Loop %m%
      d%A_Index%_0 := A_Index

   Loop Parse, s
   {
      i  = %A_Index%
      i1:= i-1
      si = %A_LoopField%
      Loop Parse, t
      {
         j1 := A_Index - 1
         d%A_Index%_%i% := min( d%A_Index%_%i1%+1, d%j1%_%i%+1, d%j1%_%i1%+(si <> A_LoopField) )
      }
   }
   Return d%m%_%n%
}

min( a, b, c )
{
   IfLess b, %a%, SetEnv a, %b%
   IfLess c, %a%, Return c
   Return a
}


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 9th, 2006, 7:47 pm 
Offline

Joined: December 15th, 2004, 10:24 pm
Posts: 303
Location: United States
i don't know why these type of things get named. I suppose it's for a common terminology rather than an intellectual claim. Same thing for the "Boolean" term and "Venn" diagrams.

Here's my version in Ahk, which gives the same result. It drops the traditional matrix approach, which i find clumsy in this case.
Code:
^#l::   ; optional hotkey = Ctrl.Win.L
Loop 2
   InputBox str%A_Index%,, Enter string %A_Index%
if ( StrLen(str1) < StrLen(str2) )
{
   shorterString = %str1%
   longerString = %str2%
}
else
{
   shorterString = %str2%
   longerString = %str1%
}
Lev_edit_distance = 0
Loop Parse, shorterString
{
   A_Index1 = %A_Index%
   A_LoopField1 = %A_LoopField%

   Loop Parse, longerString
   {
      if A_Index = %A_Index1%
      {
         if A_LoopField != %A_LoopField1%
            Lev_edit_distance ++
         BREAK
      }
   }
}
msgbox % "The Levenshtein/edit distance is "
. Lev_edit_distance + Abs( StrLen(str1) - StrLen(str2) )


A simple way to confound/defeat elementary versions of Levenshtein-distance checking is to use an offset. Smarter versions would perhaps attempt to locate the common initial string of given arbitrary length and work from there, and repeating this for another arbitrary length of characters when there is non-identicalness, in order to measure distributed common strings with a correspondence despite offsets.

_________________
1) The Open Source Definition http://www.opensource.org/docs/definition_plain.php

2) Intuitive. Logical. Versatile. Adaptable. <<AutoHotkey>>


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 9th, 2006, 9:50 pm 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
Decarlo110's version computes a completely different (much simpler) distance between the strings. It is not the minimum number of changes to transform one string to the other, but the number of changes in a certain, monotone-scanning changing strategy. Even if a possible common initial string is located first, the result will be often wrong.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 10th, 2006, 2:59 am 
Offline

Joined: December 15th, 2004, 10:24 pm
Posts: 303
Location: United States
My previous post on this thread was a result of hasty reading on my part, and insufficient example on the website without any mention of deletion. Laszlo is correct; my previous post should be disregarded entirely.

_________________
1) The Open Source Definition http://www.opensource.org/docs/definition_plain.php

2) Intuitive. Logical. Versatile. Adaptable. <<AutoHotkey>>


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 10th, 2006, 4:33 am 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
Decarlo110 wrote:
my previous post should be disregarded entirely.
It is just different. The original runs in quadratic time of the input length, which could be prohibitively slow for long strings. Decarlo110 posted a linear time computable metric of string differences. Although it is not the minimum number of changes, it does relate to how different the strings are. So, keep this script (or some refined version) for huge strings.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 12th, 2006, 3:50 am 
Offline

Joined: January 6th, 2006, 9:56 am
Posts: 50
Sorry for my delay...
Before swapping out my code with yours, I decided to check it first. There is a java (eww) applet online at http://www.merriampark.com/ld.htm
which will show you the correct value for the lev distance.

Tolraf: your code works just fine

Laszlo: your code doesn't work perfectly I wouldn't have noticed it, but your code returns 8 for
Code:
MsgBox % Levenshtein_distance( "Toralf", "RalfLaDuce" )

while the website, tolraf's, and my original code return 9. I'm not sure where the error is.

Also, why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match....


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 12th, 2006, 7:22 am 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
The right answer is 8, as you can easily verify: delete "To" from "Toralf" (2 letters) and add "LaDuce" to the end, which is 6 letters, making 8 all together. The question is, do you consider "r" and "R" different? Your code, and Toralf's, use "==" for comparison, which differentiates between upper- and lower- case letters. If you want that behavior, add "StringCaseSense On" to the beginning of my script, or replace "+(si <> A_LoopField)" in the table update expression with "+!(si==A_LoopField)". It is probably the best, if you introduce a third function parameter, which chooses between "StringCaseSense On" and "StringCaseSense Off", according to the user's desire:
Code:
Levenshtein_distance( s, t, Case_sense=1 )
{
   Old_sense = %A_StringCaseSense%
   If Case_sense
      StringCaseSense ON
   Else
      StringCaseSense Off
   ;...
   StringCaseSense %Old_sense%
   Return d%m%_%n%
}
It defaults to case sensitivity. If this is not what you want, at the function call add a third parameter: 0 or False.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 12th, 2006, 8:59 am 
Offline

Joined: January 6th, 2006, 9:56 am
Posts: 50
thanks, I lowered all my strings anyways. Case sensative would be faster for me then, right?


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 12th, 2006, 10:35 am 
Offline

Joined: January 31st, 2005, 9:50 am
Posts: 3910
Location: Bremen, Germany
esromneb wrote:
why don't you guys return -1? That seems like a better solution that returning 0. Returning 0 is the same as a perfect match....
Our scripts do not return 0. They return the length of the other string. This is what the original algorithm had in mind. Since when one string is zero you just have to write the other string (add the chars). If both are nothing the differnce is zero.
You got the -1 idea from the C code you translated, but Laszlo and I used the original algorithm, see your own links.

_________________
Ciao
toralf
Image


Report this post
Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 23 posts ]  Go to page 1, 2  Next

All times are UTC [ DST ]


Who is online

Users browsing this forum: Apollo, JSLover and 22 guests


You can post new topics in this forum
You can reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Powered by phpBB® Forum Software © phpBB Group