# Fuzzy string searching with Damerau–Levenshtein distance

15 replies to this topic

### #1 polyethene

polyethene

• 5473 posts

Posted 03 February 2008 - 01:00 PM

Someone might find this useful, like I did...

Test cases:
```tests =
( LTrim
AHK	ahk
He	ben
this	tihs
Toralf	Titan
)
Loop, Parse, tests, `n
{
StringSplit, w, A_LoopField, %A_Tab%
l .= """" . w1 . """	=>	""" . w2 . """	" . DamerauLevenshteinDistance(w1, w2) . "`n"
}
MsgBox, %l%```
Gives:

"AHK" => "ahk" 0
"He" => "ben" 2
"this" => "tihs" 1
"Toralf" => "Titan" 6

### #2 Jero3n

Jero3n
• Members
• 147 posts

Posted 03 February 2008 - 01:08 PM

What is DamerauLevenshteinDistance? :roll:

### #3 polyethene

polyethene

• 5473 posts

Posted 03 February 2008 - 01:13 PM

It's an algorithm for calculating distance between strings by two guys called Damerau and Levenshtein. It lets you find matches based on proximity, which allows for human errors such as typos and spelling mistakes. In the example above "tihs" is 1 distance away from "this" so it's safe to assume the latter is implied, conversely "Toralf" and "Titan" are very far so it can be ignored. You can find more information on teh intarweb.

### #4 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 03 February 2008 - 04:50 PM

This function can be useful, although it is a simple (2-line) extension of the already posted Levenshtein_distance function, and does not compute the Damerau–Levenshtein distance, but the cost of the so-called optimal string alignment. You can verify it with
`MsgBox % DamerauLevenshteinDistance("TO","OST")`
It returns 3 instead of the correct 2 (see in the cited Wikipedia article). The Damerau–Levenshtein distance is the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Another problem with this function is that the computed distance does not satisfy the triangle inequality, and so it cannot be easily employed in text search applications. So, some more work is needed...

### #5 Jero3n

Jero3n
• Members
• 147 posts

Posted 03 February 2008 - 04:54 PM

Than this is really usefull for an search function (what I want to implement in my application i'm writing at the moment), you can check it and when it has <2, I show something like: Did you meant ...

### #6 polyethene

polyethene

• 5473 posts

Posted 03 February 2008 - 05:31 PM

It returns 3 instead of the correct 2 (see in the cited Wikipedia article).

The wiki says: "the distance between TO and OST is 3" so it's correct. Have I missed something?

I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings." Triangle inequality can be good at times as it gives weight to the control, producing more relevant search results and accurate comparisons.

you can check it and when it has <2, I show something like: Did you meant ...

Cool. If your dataset is extremely large it may be worth computing dispersion and using matches within the first quartile.

### #7 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 03 February 2008 - 05:49 PM

The wiki says: "the distance between TO and OST is 3"

Wikipedia says: “...the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want”… “The above-described pseudo-code calculates only restricted edit distance”. You need not quote Wikipedia: it is easy to verify that the right answer is 2, not 3.

Have I missed something?

Yes. This Wikipedia article is misleading, as many more.

I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings."

Not exactly. What the Wikipedia article means is: “insertions, deletions, substitutions and transpositions … correspond to more than 80% of all human misspellings.” Their pseudo-code does penalize transpositions by counting them as two operations instead of one.

### #8 polyethene

polyethene

• 5473 posts

Posted 03 February 2008 - 06:07 PM

Calculating transportation in this sense is a lot quicker and accurate enough for most cases. In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost" which the current Damerau extension ensures. As I said to Jero3n if you have a large index such as a dictionary it would be useful to rank about the deviation.

### #9 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 03 February 2008 - 06:14 PM

Calculating transportation in this sense is a lot quicker and accurate enough for most cases.

(It is transposition.) If you like this edit-distance, use it. Just don’t call it “DamerauLevenshteinDistance”. It is different.

... and a copyright notice for some trivial editing and two extra lines added to already posted code?

### #10 polyethene

polyethene

• 5473 posts

Posted 03 February 2008 - 06:26 PM

Tomato, tomato it's all the same Laszlo
I haven't read Damerau's paper, but if he wrote an implementation that has this flaw I see no reason to rename it.

... and a copyright notice for some trivial editing and two extra lines added to already posted code?

I used information from the wikipedia and gave them due attribution. From what I can tell you have done the same thing. Please do not get into an argument with me about this when you know I never based my code around yours. To anybody else reading this: my script is zlibbed so feel absolutely free to copy/remix without restrictions... freedom of speech ftw!

### #11 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 03 February 2008 - 07:11 PM

if he wrote an implementation that has this flaw I see no reason to rename it

He did not.

In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost"

In human languages the notion of closeness is much more complex. The most common typewriter/keyboard misspelling of “the” is “teh”, which should be the closest. Hitting a key next to the right one is a common error, hitting one far away is rare. Other types of common errors are missing one of double letters (“tomorow”, “comon”), bur not certain others (“omorrow”, “commo”). Having a letter repeated is frequent (“appartment”), but inserting a random one is not (“apuartment”). Similar sounding words are also often swapped (“write” – “right”), but a blind edit distance metrics will not find them close. This is why good spell checkers, indexers use (also) misspellings tables.

### #12 polyethene

polyethene

• 5473 posts

Posted 03 February 2008 - 08:00 PM

In human languages the notion of closeness is much more complex.

Wouldn't you agree that Damerau's variant is superior for this?

If I get the time I'll try to read his paper and maybe write a compliant function. I'm leaning towards the current version because it's very fast and memory conservative. As you said a comprehensive lexer requires several algorithms for optimal fuzzy searching; but for the moment DamerauLevenshteinDistance remains the best AutoHotkey has.

### #13 keybored

keybored
• Members
• 351 posts

Posted 10 April 2009 - 03:06 PM

Titan,
Did you change the code?

I have this in my message box;
```"AHK   ahk"   =>   ""   9
"He   ben"   =>   ""   8
"this   tihs"   =>   ""   11
"Toralf   Titan"   =>   ""   14

Edit:
This was just because the web page copies spaces in place of tabs. If anyone else tests the example that is the cause.

Thanks for the function I think this will be very useful to me.

### #14 dormeu

dormeu
• Guests

Posted 12 June 2009 - 11:35 AM

oh god.. this is so cool... exactly what i wanted...

one problem with this though

distance 1 : script sc*ipt or whatever other errors in single chars are easy detectable, but i need something that ca also account for caracter substraction, or adition

ex : damerau dammerau = 4 and it should be 1 ( -eek!!)
ex: dammerau damerau this returns 1 - correct

So i take it the author was accounting that first is the haystack , and second is the needle ??

What about thinking outside the box? - the search should be unshure about the needle and the haystack
both permutation should return 1 becouse between the 2 only one operation was performed on either the needle or haystack

### #15 SoLong&Thx4AllTheFish

SoLong&Thx4AllTheFish
• Members
• 4999 posts

Posted 12 June 2009 - 11:57 AM

would making sure the needle is always the shortest string solve it, try this modify function with the new lines, it always returns 1 for me with your example
```....

StringLen, m, s

StringLen, n, t

If m = 0

Return, n

If n = 0

Return, m

[color=red]; insert these lines

If (m > n)

{

xtmp:=s

s:=t

t:=xtmp

xtmp:=n

n:=m

m:=xtmp

xtmp=

}

; until here

[/color]	d0_0 = 0 ....```