| View previous topic :: View next topic |
| Author |
Message |
polyethene
Joined: 11 Aug 2004 Posts: 5172 Location: /b/
|
Posted: Sun Feb 03, 2008 2:00 pm Post subject: Fuzzy string searching with Damerau–Levenshtein distance |
|
|
Someone might find this useful, like I did...
Test cases:
| Code: | tests =
( LTrim
AHK ahk
He ben
this tihs
Toralf Titan
google goggle
)
Loop, Parse, tests, `n
{
StringSplit, w, A_LoopField, %A_Tab%
l .= """" . w1 . """ => """ . w2 . """ " . DamerauLevenshteinDistance(w1, w2) . "`n"
}
MsgBox, %l% |
Gives:
| Quote: | "AHK" => "ahk" 0
"He" => "ben" 2
"this" => "tihs" 1
"Toralf" => "Titan" 6
"google" => "goggle" 1 |
Download _________________ Chat (IRC) • PlusNet • Scripts • IronAHK • Contact by email not private message.
Last edited by polyethene on Wed Mar 03, 2010 9:15 pm; edited 1 time in total |
|
| Back to top |
|
 |
Jero3n
Joined: 19 Jan 2007 Posts: 147
|
Posted: Sun Feb 03, 2008 2:08 pm Post subject: |
|
|
What is DamerauLevenshteinDistance?  |
|
| Back to top |
|
 |
polyethene
Joined: 11 Aug 2004 Posts: 5172 Location: /b/
|
Posted: Sun Feb 03, 2008 2:13 pm Post subject: |
|
|
It's an algorithm for calculating distance between strings by two guys called Damerau and Levenshtein. It lets you find matches based on proximity, which allows for human errors such as typos and spelling mistakes. In the example above "tihs" is 1 distance away from "this" so it's safe to assume the latter is implied, conversely "Toralf" and "Titan" are very far so it can be ignored. You can find more information on teh intarweb. _________________ Chat (IRC) • PlusNet • Scripts • IronAHK • Contact by email not private message. |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4682 Location: Boulder, CO
|
Posted: Sun Feb 03, 2008 5:50 pm Post subject: |
|
|
This function can be useful, although it is a simple (2-line) extension of the already posted Levenshtein_distance function, and does not compute the Damerau–Levenshtein distance, but the cost of the so-called optimal string alignment. You can verify it with
| Code: | | MsgBox % DamerauLevenshteinDistance("TO","OST") |
It returns 3 instead of the correct 2 (see in the cited Wikipedia article). The Damerau–Levenshtein distance is the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Another problem with this function is that the computed distance does not satisfy the triangle inequality, and so it cannot be easily employed in text search applications. So, some more work is needed... |
|
| Back to top |
|
 |
Jero3n
Joined: 19 Jan 2007 Posts: 147
|
Posted: Sun Feb 03, 2008 5:54 pm Post subject: |
|
|
| Than this is really usefull for an search function (what I want to implement in my application i'm writing at the moment), you can check it and when it has <2, I show something like: Did you meant ... |
|
| Back to top |
|
 |
polyethene
Joined: 11 Aug 2004 Posts: 5172 Location: /b/
|
Posted: Sun Feb 03, 2008 6:31 pm Post subject: |
|
|
| Laszlo wrote: | | It returns 3 instead of the correct 2 (see in the cited Wikipedia article). | The wiki says: "the distance between TO and OST is 3" so it's correct. Have I missed something?
I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings." Triangle inequality can be good at times as it gives weight to the control, producing more relevant search results and accurate comparisons.
| Jero3n wrote: | | you can check it and when it has <2, I show something like: Did you meant ... | Cool. If your dataset is extremely large it may be worth computing dispersion and using matches within the first quartile. _________________ Chat (IRC) • PlusNet • Scripts • IronAHK • Contact by email not private message. |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4682 Location: Boulder, CO
|
Posted: Sun Feb 03, 2008 6:49 pm Post subject: |
|
|
| Titan wrote: | | The wiki says: "the distance between TO and OST is 3" | Wikipedia says: “...the strings can be made equal using one deletion and one transposition. Clearly, the algorithm does not compute precisely the value we want”… “The above-described pseudo-code calculates only restricted edit distance”. You need not quote Wikipedia: it is easy to verify that the right answer is 2, not 3.
| Titan wrote: | | Have I missed something? | Yes. This Wikipedia article is misleading, as many more.
| Titan wrote: | | I prefer this variant as it doesn't penalize transportation, which according to the same wiki article "correspond to more than 80% of all human misspellings." | Not exactly. What the Wikipedia article means is: “insertions, deletions, substitutions and transpositions … correspond to more than 80% of all human misspellings.” Their pseudo-code does penalize transpositions by counting them as two operations instead of one. |
|
| Back to top |
|
 |
polyethene
Joined: 11 Aug 2004 Posts: 5172 Location: /b/
|
Posted: Sun Feb 03, 2008 7:07 pm Post subject: |
|
|
Calculating transportation in this sense is a lot quicker and accurate enough for most cases. In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost" which the current Damerau extension ensures. As I said to Jero3n if you have a large index such as a dictionary it would be useful to rank about the deviation. _________________ Chat (IRC) • PlusNet • Scripts • IronAHK • Contact by email not private message. |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4682 Location: Boulder, CO
|
Posted: Sun Feb 03, 2008 7:14 pm Post subject: |
|
|
| Titan wrote: | | Calculating transportation in this sense is a lot quicker and accurate enough for most cases. | (It is transposition.) If you like this edit-distance, use it. Just don’t call it “DamerauLevenshteinDistance”. It is different.
... and a copyright notice for some trivial editing and two extra lines added to already posted code? |
|
| Back to top |
|
 |
polyethene
Joined: 11 Aug 2004 Posts: 5172 Location: /b/
|
Posted: Sun Feb 03, 2008 7:26 pm Post subject: |
|
|
Tomato, tomato it's all the same Laszlo
I haven't read Damerau's paper, but if he wrote an implementation that has this flaw I see no reason to rename it.
| Quote: | | ... and a copyright notice for some trivial editing and two extra lines added to already posted code? | I used information from the wikipedia and gave them due attribution. From what I can tell you have done the same thing. Please do not get into an argument with me about this when you know I never based my code around yours. To anybody else reading this: my script is zlibbed so feel absolutely free to copy/remix without restrictions... freedom of speech ftw! _________________ Chat (IRC) • PlusNet • Scripts • IronAHK • Contact by email not private message. |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4682 Location: Boulder, CO
|
Posted: Sun Feb 03, 2008 8:11 pm Post subject: |
|
|
| Titan wrote: | | if he wrote an implementation that has this flaw I see no reason to rename it | He did not.
| Titan wrote: | | In language semantics "ost" is a lot different from "to," I would expect it to have a closer relation to "lost" | In human languages the notion of closeness is much more complex. The most common typewriter/keyboard misspelling of “the” is “teh”, which should be the closest. Hitting a key next to the right one is a common error, hitting one far away is rare. Other types of common errors are missing one of double letters (“tomorow”, “comon”), bur not certain others (“omorrow”, “commo”). Having a letter repeated is frequent (“appartment”), but inserting a random one is not (“apuartment”). Similar sounding words are also often swapped (“write” – “right”), but a blind edit distance metrics will not find them close. This is why good spell checkers, indexers use (also) misspellings tables. |
|
| Back to top |
|
 |
polyethene
Joined: 11 Aug 2004 Posts: 5172 Location: /b/
|
Posted: Sun Feb 03, 2008 9:00 pm Post subject: |
|
|
| Laszlo wrote: | | In human languages the notion of closeness is much more complex. | Wouldn't you agree that Damerau's variant is superior for this?
If I get the time I'll try to read his paper and maybe write a compliant function. I'm leaning towards the current version because it's very fast and memory conservative. As you said a comprehensive lexer requires several algorithms for optimal fuzzy searching; but for the moment DamerauLevenshteinDistance remains the best AutoHotkey has. _________________ Chat (IRC) • PlusNet • Scripts • IronAHK • Contact by email not private message. |
|
| Back to top |
|
 |
keybored
Joined: 18 Jun 2006 Posts: 160 Location: Phoenix, AZ
|
Posted: Fri Apr 10, 2009 4:06 pm Post subject: Changed? |
|
|
Titan,
Did you change the code?
I have this in my message box;
| Code: | "AHK ahk" => "" 9
"He ben" => "" 8
"this tihs" => "" 11
"Toralf Titan" => "" 14
"google goggle" => "" 15 |
Edit:
This was just because the web page copies spaces in place of tabs. If anyone else tests the example that is the cause.
Thanks for the function I think this will be very useful to me. |
|
| Back to top |
|
 |
dormeu Guest
|
Posted: Fri Jun 12, 2009 12:35 pm Post subject: Fuzzy me |
|
|
oh god.. this is so cool... exactly what i wanted...
adaptive searches ... also check the "hamming code" googleit!!
one problem with this though
distance 1 : script sc*ipt or whatever other errors in single chars are easy detectable, but i need something that ca also account for caracter substraction, or adition
ex : damerau dammerau = 4 and it should be 1 ( -eek!!)
ex: dammerau damerau this returns 1 - correct
So i take it the author was accounting that first is the haystack , and second is the needle ??
What about thinking outside the box? - the search should be unshure about the needle and the haystack
both permutation should return 1 becouse between the 2 only one operation was performed on either the needle or haystack
 |
|
| Back to top |
|
 |
hugov
Joined: 27 May 2007 Posts: 3506
|
Posted: Fri Jun 12, 2009 12:57 pm Post subject: |
|
|
would making sure the needle is always the shortest string solve it, try this modify function with the new lines, it always returns 1 for me with your example | Code: | ....
StringLen, m, s
StringLen, n, t
If m = 0
Return, n
If n = 0
Return, m
; insert these lines
If (m > n)
{
xtmp:=s
s:=t
t:=xtmp
xtmp:=n
n:=m
m:=xtmp
xtmp=
}
; until here
d0_0 = 0 .... |
_________________ AHK Wiki FAQ
TF : Text files & strings lib, TF Forum |
|
| Back to top |
|
 |
|