AutoHotkey Homepage AutoHotkey Community
Let's help each other out
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Fuzzy string searching with Damerau–Levenshtein distance
Goto page 1, 2  Next
 
Reply to topic    AutoHotkey Community Forum Index -> Scripts & Functions
View previous topic :: View next topic  
Author Message
polyethene



Joined: 11 Aug 2004
Posts: 5172
Location: /b/

PostPosted: Sun Feb 03, 2008 2:00 pm    Post subject: Fuzzy string searching with Damerau–Levenshtein distance Reply with quote

Someone might find this useful, like I did...

Test cases:
Code:
tests =
( LTrim
   AHK   ahk
   He   ben
   this   tihs
   Toralf   Titan
   google   goggle
)
Loop, Parse, tests, `n
{
   StringSplit, w, A_LoopField, %A_Tab%
   l .= """" . w1 . """   =>   """ . w2 . """   " . DamerauLevenshteinDistance(w1, w2) . "`n"
}
MsgBox, %l%

Gives:
Quote:
"AHK" => "ahk" 0
"He" => "ben" 2
"this" => "tihs" 1
"Toralf" => "Titan" 6
"google" => "goggle" 1


Download
_________________
Chat (IRC)PlusNetScriptsIronAHK Contact by email not private message.


Last edited by polyethene on Wed Mar 03, 2010 9:15 pm; edited 1 time in total
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Jero3n



Joined: 19 Jan 2007
Posts: 147

PostPosted: Sun Feb 03, 2008 2:08 pm    Post subject: Reply with quote

What is DamerauLevenshteinDistance? Rolling Eyes
Back to top
View user's profile Send private message
polyethene



Joined: 11 Aug 2004
Posts: 5172
Location: /b/

PostPosted: Sun Feb 03, 2008 2:13 pm    Post subject: Reply with quote

It's an algorithm for calculating distance between strings by two guys called Damerau and Levenshtein. It lets you find matches based on proximity, which allows for human errors such as typos and spelling mistakes. In the example above "tihs" is 1 distance away from "this" so it's safe to assume the latter is implied, conversely "Toralf" and "Titan" are very far so it can be ignored. You can find more information on teh intarweb.
_________________
Chat (IRC)PlusNetScriptsIronAHK Contact by email not private message.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Laszlo



Joined: 14 Feb 2005
Posts: 4682
Location: Boulder, CO

PostPosted: Sun Feb 03, 2008 5:50 pm    Post subject: Reply with quote

This function can be useful, although it is a simple (2-line) extension of the already posted Levenshtein_distance function, and does not compute the Damerau–Levenshtein distance, but the cost of the so-called optimal string alignment. You can verify it with
Code:
MsgBox % DamerauLevenshteinDistance("TO","OST")

It returns 3 instead of the correct 2 (see in the cited Wikipedia article). The Damerau–Levenshtein distance is the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Another problem with this function is that the computed distance does not satisfy the triangle inequality, and so it cannot be easily employed in text search applications. So, some more work is needed...
Back to top
View user's profile Send private message
Jero3n



Joined: 19 Jan 2007
Posts: 147

PostPosted: Sun Feb 03, 2008 5:54 pm    Post subject: Reply with quote

Than this is really usefull for an search function (what I want to implement in my application i'm writing at the moment), you can check it and when it has <2, I show something like: Did you meant ...
Back to top
View user's profile Send private message
polyethene



Joined: 11 Aug 2004
Posts: 5172
Location: /b/

PostPosted: Sun Feb 03, 2008 6:31 pm    Post subject: Reply with quote

Laszlo wrote:
It returns 3 instead of the correct 2 (see in the cited Wikipedia article).
The wiki says: "the distance between TO and OST is 3" so it's correct. Have I missed something?

I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings." Triangle inequality can be good at times as it gives weight to the control, producing more relevant search results and accurate comparisons.

Jero3n wrote:
you can check it and when it has <2, I show something like: Did you meant ...
Cool. If your dataset is extremely large it may be worth computing dispersion and using matches within the first quartile.
_________________
Chat (IRC)PlusNetScriptsIronAHK Contact by email not private message.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Laszlo



Joined: 14 Feb 2005
Posts: 4682
Location: Boulder, CO

PostPosted: Sun Feb 03, 2008 6:49 pm    Post subject: Reply with quote

Titan wrote:
The wiki says: "the distance between TO and OST is 3"
Wikipedia says: “...the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want”… “The above-described pseudo-code calculates only restricted edit distance”. You need not quote Wikipedia: it is easy to verify that the right answer is 2, not 3.
Titan wrote:
Have I missed something?
Yes. This Wikipedia article is misleading, as many more.
Titan wrote:
I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings."
Not exactly. What the Wikipedia article means is: “insertions, deletions, substitutions and transpositions … correspond to more than 80% of all human misspellings.” Their pseudo-code does penalize transpositions by counting them as two operations instead of one.
Back to top
View user's profile Send private message
polyethene



Joined: 11 Aug 2004
Posts: 5172
Location: /b/

PostPosted: Sun Feb 03, 2008 7:07 pm    Post subject: Reply with quote

Calculating transportation in this sense is a lot quicker and accurate enough for most cases. In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost" which the current Damerau extension ensures. As I said to Jero3n if you have a large index such as a dictionary it would be useful to rank about the deviation.
_________________
Chat (IRC)PlusNetScriptsIronAHK Contact by email not private message.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Laszlo



Joined: 14 Feb 2005
Posts: 4682
Location: Boulder, CO

PostPosted: Sun Feb 03, 2008 7:14 pm    Post subject: Reply with quote

Titan wrote:
Calculating transportation in this sense is a lot quicker and accurate enough for most cases.
(It is transposition.) If you like this edit-distance, use it. Just don’t call it “DamerauLevenshteinDistance”. It is different.

... and a copyright notice for some trivial editing and two extra lines added to already posted code?
Back to top
View user's profile Send private message
polyethene



Joined: 11 Aug 2004
Posts: 5172
Location: /b/

PostPosted: Sun Feb 03, 2008 7:26 pm    Post subject: Reply with quote

Tomato, tomato it's all the same Laszlo Razz
I haven't read Damerau's paper, but if he wrote an implementation that has this flaw I see no reason to rename it.

Quote:
... and a copyright notice for some trivial editing and two extra lines added to already posted code?
I used information from the wikipedia and gave them due attribution. From what I can tell you have done the same thing. Please do not get into an argument with me about this when you know I never based my code around yours. To anybody else reading this: my script is zlibbed so feel absolutely free to copy/remix without restrictions... freedom of speech ftw!
_________________
Chat (IRC)PlusNetScriptsIronAHK Contact by email not private message.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Laszlo



Joined: 14 Feb 2005
Posts: 4682
Location: Boulder, CO

PostPosted: Sun Feb 03, 2008 8:11 pm    Post subject: Reply with quote

Titan wrote:
if he wrote an implementation that has this flaw I see no reason to rename it
He did not.
Titan wrote:
In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost"
In human languages the notion of closeness is much more complex. The most common typewriter/keyboard misspelling of “the” is “teh”, which should be the closest. Hitting a key next to the right one is a common error, hitting one far away is rare. Other types of common errors are missing one of double letters (“tomorow”, “comon”), bur not certain others (“omorrow”, “commo”). Having a letter repeated is frequent (“appartment”), but inserting a random one is not (“apuartment”). Similar sounding words are also often swapped (“write” – “right”), but a blind edit distance metrics will not find them close. This is why good spell checkers, indexers use (also) misspellings tables.
Back to top
View user's profile Send private message
polyethene



Joined: 11 Aug 2004
Posts: 5172
Location: /b/

PostPosted: Sun Feb 03, 2008 9:00 pm    Post subject: Reply with quote

Laszlo wrote:
In human languages the notion of closeness is much more complex.
Wouldn't you agree that Damerau's variant is superior for this?

If I get the time I'll try to read his paper and maybe write a compliant function. I'm leaning towards the current version because it's very fast and memory conservative. As you said a comprehensive lexer requires several algorithms for optimal fuzzy searching; but for the moment DamerauLevenshteinDistance remains the best AutoHotkey has.
_________________
Chat (IRC)PlusNetScriptsIronAHK Contact by email not private message.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
keybored



Joined: 18 Jun 2006
Posts: 160
Location: Phoenix, AZ

PostPosted: Fri Apr 10, 2009 4:06 pm    Post subject: Changed? Reply with quote

Titan,
Did you change the code?

I have this in my message box;
Code:
"AHK   ahk"   =>   ""   9
"He   ben"   =>   ""   8
"this   tihs"   =>   ""   11
"Toralf   Titan"   =>   ""   14
"google   goggle"   =>   ""   15


Edit:
This was just because the web page copies spaces in place of tabs. If anyone else tests the example that is the cause.

Thanks for the function I think this will be very useful to me.
Back to top
View user's profile Send private message
dormeu
Guest





PostPosted: Fri Jun 12, 2009 12:35 pm    Post subject: Fuzzy me Reply with quote

oh god.. this is so cool... exactly what i wanted...

adaptive searches ... also check the "hamming code" googleit!!

one problem with this though

distance 1 : script sc*ipt or whatever other errors in single chars are easy detectable, but i need something that ca also account for caracter substraction, or adition

ex : damerau dammerau = 4 and it should be 1 ( -eek!!)
ex: dammerau damerau this returns 1 - correct

So i take it the author was accounting that first is the haystack , and second is the needle ??

What about thinking outside the box? - the search should be unshure about the needle and the haystack
both permutation should return 1 becouse between the 2 only one operation was performed on either the needle or haystack

Surprised
Back to top
hugov



Joined: 27 May 2007
Posts: 3506

PostPosted: Fri Jun 12, 2009 12:57 pm    Post subject: Reply with quote

would making sure the needle is always the shortest string solve it, try this modify function with the new lines, it always returns 1 for me with your example
Code:
....
        StringLen, m, s
   StringLen, n, t
   If m = 0
      Return, n
   If n = 0
      Return, m
; insert these lines       
   If (m > n)
      {
      xtmp:=s
      s:=t
      t:=xtmp
      xtmp:=n
      n:=m
      m:=xtmp
      xtmp=
      }
; until here      
   d0_0 = 0 ....

_________________
AHK Wiki FAQ
TF : Text files & strings lib, TF Forum
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    AutoHotkey Community Forum Index -> Scripts & Functions All times are GMT
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You can post new topics in this forum
You can reply to topics in this forum


Powered by phpBB © 2001, 2005 phpBB Group