 |
AutoHotkey Community Let's help each other out
|
| View previous topic :: View next topic |
| Author |
Message |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Sat Dec 27, 2008 8:52 pm Post subject: String-matching using trigrams |
|
|
I made a nice little function that basically takes a string, and matches it to another string from a list of strings. The trick here is that it must pick the "most matching" string. To show you an example of what I mean, consider movie titles. Let's say you have a list of four movie titles:
- The Shawshank Redemption
- The Dark Knight
- Schindler's List
- The Godfather: Part II
And then you find yourself in a situation where, you need to match a movie title to one of those movies. But because it is not formatted the same way, you can't simply check for equality to get a match. For example, the string might be "Shawshank Redemption, The (1994)" or even "[2008] - The Dark Knight". So the question is, how would you go about matching them? What would be your algorithm?
The way I went about it is to compare the string to each item of the list, and counting how many words match. So, for example, "[2008] - The Dark Knight" would match 3 words with "The Dark Knight", and 1 word with both "The Shawshank Redemption" and "The Godfather: Part II". My function simply returns the index of the item of the list that matches the most words with the string. In our particular case, that would be 2 (since The Dark Knight is in the second position in the list).
I'm not only presenting it, but asking also, how would you have done this? Is there a more efficient (or accurate) way of doing this? I was thinking of some kind of re-synchronizing comparison, but it felt too complicated to implement.
Jan 28, 2008 - Edit: The algorithm now calculates the similarity based on trigrams. The script now also gives a "matching fitness" between 0% and 100%.
| Code: |
MyItem = 2008 - The Dark Knight
s1 := "The Dark Knight's Dog"
s2 := "The Dark Knight [2008]"
s3 := "Dark Nights"
s4 := "Dark Nightmares on Elm Street"
VarSetCapacity(ptr, 16)
i := &ptr
Loop 4
i := NumPut(&s%A_Index%, i+0)
;Call function
i := MatchItemFromList(&ptr, 4, MyItem), l := i & 0xFFFF, h := i >> 16
MsgBox % "Winning index: " l "`nWinning title: " s%l% "`nMatch fitness: " h "%"
Return
MatchItemFromList(iPtr, iCount, sItem) {
/*
;iPtr is a pointer to the list of string pointers
;Here is an example on how to use it
;Prepare the list of pointers
VarSetCapacity(ptr, iCount * 4, 0) ;iCount = the number of items in the list
;Add the string pointers to the list
i := &ptr
Loop %iCount%
i := NumPut(&sList%A_Index%, i+0)
;Call function
MatchItemFromList(&ptr, iCount, sItem)
*/
;Get length
iLength := StrLen(sItem)
iTrigrams := iLength - 2
;Retrieve all the strings
Loop %iCount%
sList%A_Index% := DllCall("MulDiv", int, NumGet(iPtr+0, (A_Index - 1) * 4), int, 1, int, 1, str)
;Check for a clean match first
Loop %iCount%
If (sList%A_Index% = sItem)
Return (100 << 16) + A_Index
;CREATE ARRAY OF TRIGRAMS FOR sItem
If (iLength < 3)
Return 0 ;Invalid item
Else { ;Get trigram count
Loop %iTrigrams% {
;Check if the trigram we're about to extract is unique
i := InStr(sItem, SubStr(sItem, A_Index, 3), False, 1)
If i And (i < A_Index) {
sItem_%i% += 1 ;Not unique. Add count to original
sItem_%A_Index% := 0 ;Discard current index
} Else sItem_%A_Index% := InStrCount(sItem, SubStr(sItem, A_Index, 3))
}
}
;COMPARE TRIGRAMS
Loop %iCount% {
i := A_Index
If (StrLen(sList%i%) < 3)
Return 0 ;Invalid item
Else {
Loop %iTrigrams% ;Get trigram count
If sItem_%A_Index%
sList%i%_Diff += Abs(InStrCount(sList%i%, SubStr(sItem, A_Index, 3)) - sItem_%A_Index%)
}
}
iBestI := 0
iBestD := 0x10000
Loop %iCount% {
If (sList%A_Index%_Diff < iBestD) {
iBestD := sList%A_Index%_Diff
iBestI := A_Index
}
}
;Put the winning index in the low-word and a number between 0 and 100 representing the "fitness" of the match in the high-word
Return (Round((iTrigrams - iBestD) * 100 / iTrigrams) << 16) + iBestI
}
;Returns the number of times a trigrams occurs in a string
InStrCount(ByRef Haystack, Trigram) {
j := 0, i := 1
Loop {
i := InStr(Haystack, Trigram, False, i)
If Not i
Return j
j += 1, i += 3
}
} |
Last edited by TheGood on Fri May 14, 2010 3:16 pm; edited 4 times in total |
|
| Back to top |
|
 |
PurloinedHeart
Joined: 04 Apr 2008 Posts: 387 Location: Canada
|
Posted: Sun Dec 28, 2008 2:04 am Post subject: |
|
|
How about your method along with counting letters themselves?
Example:
Movie 1: racecar roar
Movie 2: racecar raar
You're looking for racecar raar
The new method would return 12 for both, however if you checked words as well, one of them would return 13 (12 Matched letters + 1 matched word) |
|
| Back to top |
|
 |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Sun Dec 28, 2008 2:35 am Post subject: |
|
|
| Yeah, but the problem is that the length of the string you're looking for and the string it matches with don't necessarily have the same length. Like for example "[2008] - The Dark Knight" and "The Dark Knight". So there's no use in taking it into consideration. |
|
| Back to top |
|
 |
PurloinedHeart
Joined: 04 Apr 2008 Posts: 387 Location: Canada
|
Posted: Sun Dec 28, 2008 7:30 am Post subject: |
|
|
Yeah, I wasn't really thinking when I made that last post
Just noticed that your search wouldn't work in certain examples
| Code: |
The Dark Knight of the Round Table
The Dark Knight [2008]
Dark Nights
Dark Nightmares on Elm Street
|
And you search for "The Dark Knight", you'll be left with
| Code: |
The Dark Knight of the Round Table
The Dark Knight [2008]
|
I think you should have a 'secondary search', incase there isn't an exact match. You could exclude numbers and signs such as []()-, and then remove the words that match, and choose which title has the least left over. Using the example from above:
| Code: |
The Dark Knight of the Round Table
The Dark Knight [2008]
|
The text in black is the match from the user's input, the gray part is excluded, and the red is what's left over. The top search would have a remainder of 19, while the bottom would have a remainder of 1
However, if a number is included in the search, I think you should search without numebrs first, then see the left over, and compare numbers at that point
Final Example, using your example of '[2008] - The Dark Knight'
| Code: |
The Dark Teaparty
The Dark Knight
The Dark knight : Revenge of the Sith
|
You'll be left with:
| Code: |
(Blank)
Revenge of the Sith
|
And a search of [2008] won't yield any results, so the top match would be better.
In order of this method to work, you would probably have to make () and [] interchangable.
I think this script would be very buggy in real-life use, most of the movies on peoples HDDs usually have a format like this:
| Quote: |
Movie Title (Year) (DVD/HD QUALITY) (www.istolethismovie.com)
|
|
|
| Back to top |
|
 |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Sun Dec 28, 2008 8:04 am Post subject: |
|
|
Well, the thing is, I don't want to make the algorithm specifically for movie titles. I want the function to be good in any situation where the user wants to match any kind of strings. Movie titles was just an example.
However, I do agree that, even out of the movie title context, the string "The Dark Knight" is most similar to "The Dark Knight [2008]" (from the list you gave). I'll work on incorporating some kind of leftover bias. |
|
| Back to top |
|
 |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Sun Dec 28, 2008 8:32 am Post subject: |
|
|
OK, I made the following changes:
- Made the sList parameter ByRef to boost performance for big lists
- Numbers and hyphens are also considered word characters now
- In case there's two list items with the same word count matches, the shortest one wins
I am still hesitant to make changes based on how digits are handled in the leftover count because although assuming they don't matter in the matching process of a movie title is true, we can't be sure that it is always the case for other kinds of strings. Or maybe it is . I need to think about this a little more. |
|
| Back to top |
|
 |
totalmig
Joined: 22 Jul 2008 Posts: 151
|
Posted: Sun Dec 28, 2008 10:54 pm Post subject: |
|
|
Hi..
I would seperate all of the titles into seperate words, then compare the words, letter by letter. |
|
| Back to top |
|
 |
hugov
Joined: 27 May 2007 Posts: 3488
|
|
| Back to top |
|
 |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Tue Dec 30, 2008 1:20 am Post subject: |
|
|
Thank you very much for that link! That is indeed very interesting. But I don't know if it's what I need. Using the Levenshtein distance completely disregards words altogether, which isn't what I want. I don't want "Deviled Egg" to match so well with "He lived, Greg"
It's not the kind of "similarity" I have in mind. But it's definitely a powerful tool. I'll see if I can use it in some other way.
| totalmig wrote: | Hi..
I would seperate all of the titles into seperate words, then compare the words, letter by letter. |
Can you explain a little more what you mean? The function I have right now is already comparing words. When you say letter by letter, do you mean, to account for things such as matching "devil" with "lived"? |
|
| Back to top |
|
 |
Rabiator
Joined: 17 Apr 2005 Posts: 285 Location: Sauerland
|
Posted: Tue Dec 30, 2008 2:30 am Post subject: |
|
|
A very satisfactory method is comparing all n-grams (a sequence of n consecutive letters) of both strings.
I used it to find similar sentences for translating parts of the AutoHotkey help into german. Trigrams (3 characters) worked good for me.
Example:
The following string with 15 letters has 13 n-grams:
| Code: | string1 = "String matching":
"Str","tri","rin","ing","ng ","g m"," ma","mat","atc","tch","chi","hin","ing" |
Now we count how often the different n-grams occur: | Code: | "Str" 1
"tri" 1
"rin" 1
"ing" 2
"ng " 1
"g m" 1
" ma" 1
"mat" 1
"atc" 1
"tch" 1
"chi" 1
"hin" 1 |
We want to compare it with the following strings: | Code: | string2 = "matching String":
"mat","atc","tch","chi","hin","ing","ng ","g S"," St","Str","tri","rin","ing"
string3 = "tSgnmatihcrin g":
"tSg","Sgn","gnm","nma","mat","ati","tih","ihc","hcr","cri","rin","in ","n g" |
We count how many n-grams of string1 exist in string2 and string3 and determine the differences: | Code: | n-gram string1 string2 diff string3 diff
"Str" 1 1 0 0 1
"tri" 1 1 0 0 1
"rin" 1 1 0 1 0
"ing" 2 2 0 0 2
"ng " 1 1 0 0 1
"g m" 1 0 1 0 1
" ma" 1 0 1 0 1
"mat" 1 1 0 1 0
"atc" 1 1 0 0 1
"tch" 1 1 0 0 1
"chi" 1 1 0 0 1
"hin" 1 1 0 0 1
--- ---
2 11 | That means a match of 11/13 = 85 % for string2 and 2/13 = 15 % for string3.
That's it!
There are several settings you might adjust to your needs.
- the threshold for your decision matching/not matching
- the length of the sequence. 3 was good for me with sentences of 120 characters on average, might be 2 is better for shoter phrases.
- You could remove all white spaces from the strings to avoid unnecessary differences by reason of merging words together or not.
- same with capitalizing
- To minimize the effect of typical misspellings, you could remove interpunction and ligatures.
- And to minimize the effect of big differences of the strings' length, you might additionally compare the strings in the opposite direction.
- This list is not complete

I'm sorry I don't have AutoHotkey code for you. Initially I tried it with AutoHotkey (the code was not that difficult), but it was too slow for me, therefore I discarded it and wrote a program in C.
__________________________________________
Created with BBCodeWriter 7.0 - the one and only  |
|
| Back to top |
|
 |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Tue Dec 30, 2008 3:47 am Post subject: |
|
|
That is very cool! Thanks for the detailed explaination.
| Rabiator wrote: |
I'm sorry I don't have AutoHotkey code for you. Initially I tried it with AutoHotkey (the code was not that difficult) |
I'll get on it soon and see what it gives. I'm sure it will perform better than what I currently have right now.
Thanks again! |
|
| Back to top |
|
 |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Wed Jan 28, 2009 8:15 pm Post subject: |
|
|
| I updated the algorithm using Rabiator's idea. Since it now gives back a fitness level, you can apply a threshold to decide whether the match is close enough or not. |
|
| Back to top |
|
 |
Morpheus
Joined: 31 Jul 2008 Posts: 121
|
Posted: Fri May 14, 2010 1:24 am Post subject: |
|
|
I think I found a bug. If I use these as my items:
| Code: | MyItem = Aero
s1 := "afighter"
s2 := "Aerozzzzzzzz"
s3 := "Aeroa"
s4 := "Aero:" |
'Aerozzzzzzzz' is returned as the best match with a 100% fitness level. It seems that it matches the first string that begins with 'Aero', even though 'Aeroa' is probably a better match.
I will get around it, by checking the Strlen(MatchedItem) of all matches that have a 100% fitness level against the Strlen(SearchItem).
I will calculate my own fitness level: Strlen(SearchItem) / Strlen(MatchedItem).
I just thought people would want to know. If someone has a better solution, I would love to hear it. |
|
| Back to top |
|
 |
TheGood
Joined: 30 Jul 2007 Posts: 541
|
Posted: Fri May 14, 2010 6:18 am Post subject: |
|
|
Both s2 and s3 have a fitness level of 100%. It just returns the first of the two. In fact, even s4 has a fitness level of 100%.
I did this a long time ago, so I'm not too sure what is the best way of giving weight to the difference in length of the string.
What I would do is probably first check if multiple items have the same (and highest) fitness level, and then only return the one with the length closest to the length of sItem. |
|
| Back to top |
|
 |
Morpheus
Joined: 31 Jul 2008 Posts: 121
|
Posted: Fri May 14, 2010 12:01 pm Post subject: |
|
|
First I would like to say that this is the best matching algorithm that I have found for my needs. Second, I don't understand how it works, it is too advanced for my skills.
I guess I wasn't going to try to find a better match, I just wanted to make sure that the user knew that it wasn't really a 100% match. That's why I was going to set a new fitness level created by dividing the two string lengths.
I have also changed the way 100% matches are handled. The script I use looks like this:
| Code: | FuzzyInit:
S0 = 0
TwoCount = 0
Loop, %RomFolder%\*.%RomExtension%,0,%Recurse%
{
SplitPath, A_LoopFileName,,, Ext, FileName
If (Strlen(Filename) > 2)
{
s0++
F%s0% := Filename
E%s0% := Ext
}
Else
{
TwoCount++
Rom%TwoCount% := FileName
Ext%TwoCount% := Ext
}
}
VarSetCapacity(ptr, s0 * 4)
i := &ptr
Loop % s0
i := NumPut(&F%A_Index%, i+0)
Return |
I separated the code to find 100% matches from your script so that I can find 1 and 2 letter matches, without redundant code.
This is the code I use to find a match:
| Code: | Fitness = 0
FileName := "ERROR: CAN NOT MATCH NAME"
If StrLen(ArtName) < 3
{
Loop, %TwoCount%
If Rom%A_Index% = %ArtName%
{
Fitness = 100
FileName := Rom%A_Index% . Ext%A_Index%
Break
}
}
Else
{
I1 := MatchItemFromList(&ptr, s0, ArtName), L1 := GetWordFromDblWord(I1, True), Fitness := GetWordFromDblWord(I1, False)
;MsgBox, % "Original" ArtName "`nWinning index: " L1 "`nWinning title: " F%L1% "`nMatch fitness: " Fitness "%"
; Fitness .= `%
If Fitness > 0
FileName := F%L1% . "." . E%L1%
} |
And finally, your algorithm that I modified eliminating the strlen check, and returning 100% matches, since both have been done. If this routine returns a 100% match, I know that it is not really 100%
| Code: | MatchItemFromList(iPtr, iCount, sItem)
{
iTrigrams := StrLen(sItem) - 2
Loop %iCount%
sList%A_Index% := DllCall("MulDiv", int, NumGet(iPtr+0, (A_Index - 1) * 4), int, 1, int, 1, str)
Loop %iCount%
If (sList%A_Index% = sItem)
Return (100 << 16) + A_Index
Loop %iTrigrams%
{
i := InStr(sItem, SubStr(sItem, A_Index, 3), False, 1)
If i And (i < A_Index)
{
sItem_%i% += 1
sItem_%A_Index% := 0
} Else sItem_%A_Index% := InStrCount(sItem, SubStr(sItem, A_Index, 3))
}
Loop %iCount%
{
i := A_Index
Loop %iTrigrams%
If sItem_%A_Index%
sList%i%_Diff += Abs(InStrCount(sList%i%, SubStr(sItem, A_Index, 3)) - sItem_%A_Index%)
}
iBestI := 0
iBestD := 0x10000
Loop %iCount%
{
If (sList%A_Index%_Diff < iBestD)
{
iBestD := sList%A_Index%_Diff
iBestI := A_Index
}
}
Return (Round((iTrigrams - iBestD) * 100 / iTrigrams) << 16) + iBestI
}
InStrCount(ByRef Haystack, Trigram)
{
j := 0
i := 1
Loop
{
i := InStr(Haystack, Trigram, False, i)
If Not i
Return j
j += 1
i += 3
}
}
GetWordFromDblWord(iDblWord, bLowWord = True)
{
iTemp := Mod(iDblWord, 0x10000)
Return bLowWord ? iTemp : Round((iDblWord - iTemp) / 0x10000)
} |
I know my code isn't the best, but I hope it gives you some ideas. |
|
| Back to top |
|
 |
|
|
You can post new topics in this forum You can reply to topics in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|