AutoHotkey Community

It is currently May 25th, 2012, 11:33 pm

All times are UTC [ DST ]




Post new topic Reply to topic  [ 16 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: February 3rd, 2008, 2:00 pm 
Offline
User avatar

Joined: August 11th, 2004, 1:47 am
Posts: 5347
Location: UK
Someone might find this useful, like I did...

Test cases:
Code:
tests =
( LTrim
   AHK   ahk
   He   ben
   this   tihs
   Toralf   Titan
   google   goggle
)
Loop, Parse, tests, `n
{
   StringSplit, w, A_LoopField, %A_Tab%
   l .= """" . w1 . """   =>   """ . w2 . """   " . DamerauLevenshteinDistance(w1, w2) . "`n"
}
MsgBox, %l%

Gives:
Quote:
"AHK" => "ahk" 0
"He" => "ben" 2
"this" => "tihs" 1
"Toralf" => "Titan" 6
"google" => "goggle" 1


Download

_________________
GitHubScriptsIronAHK Contact by email not private message.


Last edited by polyethene on March 3rd, 2010, 9:15 pm, edited 1 time in total.

Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 2:08 pm 
Offline

Joined: January 19th, 2007, 9:09 pm
Posts: 147
What is DamerauLevenshteinDistance? :roll:


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 2:13 pm 
Offline
User avatar

Joined: August 11th, 2004, 1:47 am
Posts: 5347
Location: UK
It's an algorithm for calculating distance between strings by two guys called Damerau and Levenshtein. It lets you find matches based on proximity, which allows for human errors such as typos and spelling mistakes. In the example above "tihs" is 1 distance away from "this" so it's safe to assume the latter is implied, conversely "Toralf" and "Titan" are very far so it can be ignored. You can find more information on teh intarweb.

_________________
GitHubScriptsIronAHK Contact by email not private message.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 5:50 pm 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
This function can be useful, although it is a simple (2-line) extension of the already posted Levenshtein_distance function, and does not compute the Damerau–Levenshtein distance, but the cost of the so-called optimal string alignment. You can verify it with
Code:
MsgBox % DamerauLevenshteinDistance("TO","OST")

It returns 3 instead of the correct 2 (see in the cited Wikipedia article). The Damerau–Levenshtein distance is the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Another problem with this function is that the computed distance does not satisfy the triangle inequality, and so it cannot be easily employed in text search applications. So, some more work is needed...


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 5:54 pm 
Offline

Joined: January 19th, 2007, 9:09 pm
Posts: 147
Than this is really usefull for an search function (what I want to implement in my application i'm writing at the moment), you can check it and when it has <2, I show something like: Did you meant ...


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 6:31 pm 
Offline
User avatar

Joined: August 11th, 2004, 1:47 am
Posts: 5347
Location: UK
Laszlo wrote:
It returns 3 instead of the correct 2 (see in the cited Wikipedia article).
The wiki says: "the distance between TO and OST is 3" so it's correct. Have I missed something?

I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings." Triangle inequality can be good at times as it gives weight to the control, producing more relevant search results and accurate comparisons.

Jero3n wrote:
you can check it and when it has <2, I show something like: Did you meant ...
Cool. If your dataset is extremely large it may be worth computing dispersion and using matches within the first quartile.

_________________
GitHubScriptsIronAHK Contact by email not private message.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 6:49 pm 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
Titan wrote:
The wiki says: "the distance between TO and OST is 3"
Wikipedia says: “...the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want”… “The above-described pseudo-code calculates only restricted edit distance”. You need not quote Wikipedia: it is easy to verify that the right answer is 2, not 3.
Titan wrote:
Have I missed something?
Yes. This Wikipedia article is misleading, as many more.
Titan wrote:
I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings."
Not exactly. What the Wikipedia article means is: “insertions, deletions, substitutions and transpositions … correspond to more than 80% of all human misspellings.” Their pseudo-code does penalize transpositions by counting them as two operations instead of one.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 7:07 pm 
Offline
User avatar

Joined: August 11th, 2004, 1:47 am
Posts: 5347
Location: UK
Calculating transportation in this sense is a lot quicker and accurate enough for most cases. In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost" which the current Damerau extension ensures. As I said to Jero3n if you have a large index such as a dictionary it would be useful to rank about the deviation.

_________________
GitHubScriptsIronAHK Contact by email not private message.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 7:14 pm 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
Titan wrote:
Calculating transportation in this sense is a lot quicker and accurate enough for most cases.
(It is transposition.) If you like this edit-distance, use it. Just don’t call it “DamerauLevenshteinDistance”. It is different.

... and a copyright notice for some trivial editing and two extra lines added to already posted code?


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 7:26 pm 
Offline
User avatar

Joined: August 11th, 2004, 1:47 am
Posts: 5347
Location: UK
Tomato, tomato it's all the same Laszlo :P
I haven't read Damerau's paper, but if he wrote an implementation that has this flaw I see no reason to rename it.

Quote:
... and a copyright notice for some trivial editing and two extra lines added to already posted code?
I used information from the wikipedia and gave them due attribution. From what I can tell you have done the same thing. Please do not get into an argument with me about this when you know I never based my code around yours. To anybody else reading this: my script is zlibbed so feel absolutely free to copy/remix without restrictions... freedom of speech ftw!

_________________
GitHubScriptsIronAHK Contact by email not private message.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 8:11 pm 
Offline

Joined: February 14th, 2005, 4:05 pm
Posts: 4710
Location: Boulder, CO
Titan wrote:
if he wrote an implementation that has this flaw I see no reason to rename it
He did not.
Titan wrote:
In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost"
In human languages the notion of closeness is much more complex. The most common typewriter/keyboard misspelling of “the” is “teh”, which should be the closest. Hitting a key next to the right one is a common error, hitting one far away is rare. Other types of common errors are missing one of double letters (“tomorow”, “comon”), bur not certain others (“omorrow”, “commo”). Having a letter repeated is frequent (“appartment”), but inserting a random one is not (“apuartment”). Similar sounding words are also often swapped (“write” – “right”), but a blind edit distance metrics will not find them close. This is why good spell checkers, indexers use (also) misspellings tables.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: February 3rd, 2008, 9:00 pm 
Offline
User avatar

Joined: August 11th, 2004, 1:47 am
Posts: 5347
Location: UK
Laszlo wrote:
In human languages the notion of closeness is much more complex.
Wouldn't you agree that Damerau's variant is superior for this?

If I get the time I'll try to read his paper and maybe write a compliant function. I'm leaning towards the current version because it's very fast and memory conservative. As you said a comprehensive lexer requires several algorithms for optimal fuzzy searching; but for the moment DamerauLevenshteinDistance remains the best AutoHotkey has.

_________________
GitHubScriptsIronAHK Contact by email not private message.


Report this post
Top
 Profile  
Reply with quote  
 Post subject: Changed?
PostPosted: April 10th, 2009, 4:06 pm 
Offline

Joined: June 18th, 2006, 8:47 am
Posts: 346
Location: Phoenix, AZ
Titan,
Did you change the code?

I have this in my message box;
Code:
"AHK   ahk"   =>   ""   9
"He   ben"   =>   ""   8
"this   tihs"   =>   ""   11
"Toralf   Titan"   =>   ""   14
"google   goggle"   =>   ""   15


Edit:
This was just because the web page copies spaces in place of tabs. If anyone else tests the example that is the cause.

Thanks for the function I think this will be very useful to me.


Report this post
Top
 Profile  
Reply with quote  
 Post subject: Fuzzy me
PostPosted: June 12th, 2009, 12:35 pm 
oh god.. this is so cool... exactly what i wanted...

adaptive searches ... also check the "hamming code" googleit!!

one problem with this though

distance 1 : script sc*ipt or whatever other errors in single chars are easy detectable, but i need something that ca also account for caracter substraction, or adition

ex : damerau dammerau = 4 and it should be 1 ( -eek!!)
ex: dammerau damerau this returns 1 - correct

So i take it the author was accounting that first is the haystack , and second is the needle ??

What about thinking outside the box? - the search should be unshure about the needle and the haystack
both permutation should return 1 becouse between the 2 only one operation was performed on either the needle or haystack

:o


Report this post
Top
  
Reply with quote  
 Post subject:
PostPosted: June 12th, 2009, 12:57 pm 
Offline

Joined: May 27th, 2007, 9:41 am
Posts: 4999
would making sure the needle is always the shortest string solve it, try this modify function with the new lines, it always returns 1 for me with your example
Code:
....
        StringLen, m, s
   StringLen, n, t
   If m = 0
      Return, n
   If n = 0
      Return, m
; insert these lines       
   If (m > n)
      {
      xtmp:=s
      s:=t
      t:=xtmp
      xtmp:=n
      n:=m
      m:=xtmp
      xtmp=
      }
; until here      
   d0_0 = 0 ....

_________________
AHK FAQ
TF : Text files & strings lib, TF Forum


Report this post
Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 16 posts ]  Go to page 1, 2  Next

All times are UTC [ DST ]


Who is online

Users browsing this forum: infogulch, Opal Monkey and 12 guests


You can post new topics in this forum
You can reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Powered by phpBB® Forum Software © phpBB Group