AutoHotkey Community

It is currently May 26th, 2012, 12:55 pm

All times are UTC [ DST ]




Post new topic Reply to topic  [ 17 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: December 27th, 2008, 8:52 pm 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
I made a nice little function that basically takes a string, and matches it to another string from a list of strings. The trick here is that it must pick the "most matching" string. To show you an example of what I mean, consider movie titles. Let's say you have a list of four movie titles:

- The Shawshank Redemption
- The Dark Knight
- Schindler's List
- The Godfather: Part II

And then you find yourself in a situation where, you need to match a movie title to one of those movies. But because it is not formatted the same way, you can't simply check for equality to get a match. For example, the string might be "Shawshank Redemption, The (1994)" or even "[2008] - The Dark Knight". So the question is, how would you go about matching them? What would be your algorithm?

The way I went about it is to compare the string to each item of the list, and counting how many words match. So, for example, "[2008] - The Dark Knight" would match 3 words with "The Dark Knight", and 1 word with both "The Shawshank Redemption" and "The Godfather: Part II". My function simply returns the index of the item of the list that matches the most words with the string. In our particular case, that would be 2 (since The Dark Knight is in the second position in the list).

I'm not only presenting it, but asking also, how would you have done this? Is there a more efficient (or accurate) way of doing this? I was thinking of some kind of re-synchronizing comparison, but it felt too complicated to implement.

Jan 28, 2008 - Edit: The algorithm now calculates the similarity based on trigrams. The script now also gives a "matching fitness" between 0% and 100%.

Code:
   
    MyItem = 2008 - The Dark Knight
   
    s1 := "The Dark Knight's Dog"
    s2 := "The Dark Knight [2008]"
    s3 := "Dark Nights"
    s4 := "Dark Nightmares on Elm Street"
   
    VarSetCapacity(ptr, 16)
    i := &ptr
    Loop 4
        i := NumPut(&s%A_Index%, i+0)
   
    ;Call function
    i := MatchItemFromList(&ptr, 4, MyItem), l := i & 0xFFFF, h := i >> 16
    MsgBox % "Winning index: " l "`nWinning title: " s%l% "`nMatch fitness: " h    "%"
   
Return

MatchItemFromList(iPtr, iCount, sItem) {
   
    /*
    ;iPtr is a pointer to the list of string pointers
    ;Here is an example on how to use it
    ;Prepare the list of pointers
    VarSetCapacity(ptr, iCount * 4, 0) ;iCount = the number of items in the list
    ;Add the string pointers to the list
    i := &ptr
    Loop %iCount%
        i := NumPut(&sList%A_Index%, i+0)
    ;Call function
    MatchItemFromList(&ptr, iCount, sItem)
    */
   
    ;Get length
    iLength := StrLen(sItem)
    iTrigrams := iLength - 2
   
    ;Retrieve all the strings
    Loop %iCount%
        sList%A_Index% := DllCall("MulDiv", int, NumGet(iPtr+0, (A_Index - 1) * 4), int, 1, int, 1, str)
   
    ;Check for a clean match first
    Loop %iCount%
        If (sList%A_Index% = sItem)
            Return (100 << 16) + A_Index
   
    ;CREATE ARRAY OF TRIGRAMS FOR sItem
    If (iLength < 3)
        Return 0    ;Invalid item
    Else {    ;Get trigram count
        Loop %iTrigrams% {
           
            ;Check if the trigram we're about to extract is unique
            i := InStr(sItem, SubStr(sItem, A_Index, 3), False, 1)
            If i And (i < A_Index) {
                sItem_%i% += 1 ;Not unique. Add count to original
                sItem_%A_Index% := 0 ;Discard current index
            } Else sItem_%A_Index% := InStrCount(sItem, SubStr(sItem, A_Index, 3))
        }
    }
   
    ;COMPARE TRIGRAMS
    Loop %iCount% {
        i := A_Index
        If (StrLen(sList%i%) < 3)
            Return 0    ;Invalid item
        Else {
            Loop %iTrigrams% ;Get trigram count
                If sItem_%A_Index%
                    sList%i%_Diff += Abs(InStrCount(sList%i%, SubStr(sItem, A_Index, 3)) - sItem_%A_Index%)
        }
    }
   
    iBestI := 0
    iBestD := 0x10000
    Loop %iCount% {
        If (sList%A_Index%_Diff < iBestD) {
            iBestD := sList%A_Index%_Diff
            iBestI := A_Index
        }
    }
   
    ;Put the winning index in the low-word and a number between 0 and 100 representing the "fitness" of the match in the high-word
    Return (Round((iTrigrams - iBestD) * 100 / iTrigrams) << 16) + iBestI
}

;Returns the number of times a trigrams occurs in a string
InStrCount(ByRef Haystack, Trigram) {
    j := 0, i := 1
    Loop {
        i := InStr(Haystack, Trigram, False, i)
        If Not i
            Return j
        j += 1, i += 3
    }
}


Last edited by TheGood on May 14th, 2010, 3:16 pm, edited 4 times in total.

Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 28th, 2008, 2:04 am 
Offline

Joined: April 4th, 2008, 8:15 pm
Posts: 538
Location: Canada
How about your method along with counting letters themselves?

Example:
Movie 1: racecar roar
Movie 2: racecar raar

You're looking for racecar raar

The new method would return 12 for both, however if you checked words as well, one of them would return 13 (12 Matched letters + 1 matched word)


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 28th, 2008, 2:35 am 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
Yeah, but the problem is that the length of the string you're looking for and the string it matches with don't necessarily have the same length. Like for example "[2008] - The Dark Knight" and "The Dark Knight". So there's no use in taking it into consideration.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 28th, 2008, 7:30 am 
Offline

Joined: April 4th, 2008, 8:15 pm
Posts: 538
Location: Canada
Yeah, I wasn't really thinking when I made that last post

Just noticed that your search wouldn't work in certain examples
Code:
The Dark Knight of the Round Table
The Dark Knight [2008]
Dark Nights
Dark Nightmares on Elm Street

And you search for "The Dark Knight", you'll be left with
Code:
The Dark Knight of the Round Table
The Dark Knight [2008]


I think you should have a 'secondary search', incase there isn't an exact match. You could exclude numbers and signs such as []()-, and then remove the words that match, and choose which title has the least left over. Using the example from above:

Code:
The Dark Knight of the Round Table
The Dark Knight [2008]

The text in black is the match from the user's input, the gray part is excluded, and the red is what's left over. The top search would have a remainder of 19, while the bottom would have a remainder of 1

However, if a number is included in the search, I think you should search without numebrs first, then see the left over, and compare numbers at that point

Final Example, using your example of '[2008] - The Dark Knight'

Code:
The Dark Teaparty
The Dark Knight
The Dark knight : Revenge of the Sith


You'll be left with:
Code:
(Blank)
Revenge of the Sith

And a search of [2008] won't yield any results, so the top match would be better.

In order of this method to work, you would probably have to make () and [] interchangable.

I think this script would be very buggy in real-life use, most of the movies on peoples HDDs usually have a format like this:
Quote:
Movie Title (Year) (DVD/HD QUALITY) (www.istolethismovie.com)


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 28th, 2008, 8:04 am 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
Well, the thing is, I don't want to make the algorithm specifically for movie titles. I want the function to be good in any situation where the user wants to match any kind of strings. Movie titles was just an example.

However, I do agree that, even out of the movie title context, the string "The Dark Knight" is most similar to "The Dark Knight [2008]" (from the list you gave). I'll work on incorporating some kind of leftover bias.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 28th, 2008, 8:32 am 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
OK, I made the following changes:
- Made the sList parameter ByRef to boost performance for big lists
- Numbers and hyphens are also considered word characters now
- In case there's two list items with the same word count matches, the shortest one wins

I am still hesitant to make changes based on how digits are handled in the leftover count because although assuming they don't matter in the matching process of a movie title is true, we can't be sure that it is always the case for other kinds of strings. Or maybe it is :?. I need to think about this a little more.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 28th, 2008, 10:54 pm 
Offline

Joined: July 22nd, 2008, 1:49 pm
Posts: 151
Hi..

I would seperate all of the titles into seperate words, then compare the words, letter by letter.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 29th, 2008, 1:05 am 
Offline

Joined: May 27th, 2007, 9:41 am
Posts: 4999
You might find this of interest:

Fuzzy string searching with Damerau–Levenshtein distance
http://www.autohotkey.com/forum/topic28243.html

_________________
AHK FAQ
TF : Text files & strings lib, TF Forum


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 30th, 2008, 1:20 am 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
HugoV wrote:
You might find this of interest:

Fuzzy string searching with Damerau–Levenshtein distance
http://www.autohotkey.com/forum/topic28243.html

Thank you very much for that link! That is indeed very interesting. But I don't know if it's what I need. Using the Levenshtein distance completely disregards words altogether, which isn't what I want. I don't want "Deviled Egg" to match so well with "He lived, Greg" :D
It's not the kind of "similarity" I have in mind. But it's definitely a powerful tool. I'll see if I can use it in some other way.

totalmig wrote:
Hi..

I would seperate all of the titles into seperate words, then compare the words, letter by letter.

Can you explain a little more what you mean? The function I have right now is already comparing words. When you say letter by letter, do you mean, to account for things such as matching "devil" with "lived"?


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 30th, 2008, 2:30 am 
Offline

Joined: April 17th, 2005, 7:47 pm
Posts: 289
Location: Sauerland
A very satisfactory method is comparing alln-grams (a sequence of n consecutive letters) of both strings.
I used it to find similar sentences for translating parts of the AutoHotkey help into german. Trigrams (3 characters) worked good for me.

Example:
The following string with 15 letters has 13 n-grams:
Code:
string1 = "String matching":
"Str","tri","rin","ing","ng ","g m"," ma","mat","atc","tch","chi","hin","ing"

Now we count how often the different n-grams occur:
Code:
"Str" 1
"tri" 1
"rin" 1
"ing" 2
"ng " 1
"g m" 1
" ma" 1
"mat" 1
"atc" 1
"tch" 1
"chi" 1
"hin" 1

We want to compare it with the following strings:
Code:
string2 = "matching String":
"mat","atc","tch","chi","hin","ing","ng ","g S"," St","Str","tri","rin","ing"

string3 = "tSgnmatihcrin g":
"tSg","Sgn","gnm","nma","mat","ati","tih","ihc","hcr","cri","rin","in ","n g"

We count how many n-grams of string1 exist in string2 and string3 and determine the differences:
Code:
n-gram  string1      string2  diff       string3  diff
"Str"      1            1      0            0      1
"tri"      1            1      0            0      1
"rin"      1            1      0            1      0
"ing"      2            2      0            0      2
"ng "      1            1      0            0      1
"g m"      1            0      1            0      1
" ma"      1            0      1            0      1
"mat"      1            1      0            1      0
"atc"      1            1      0            0      1
"tch"      1            1      0            0      1
"chi"      1            1      0            0      1
"hin"      1            1      0            0      1
                             ---                 ---
                               2                  11
That means a match of 11/13 = 85 % for string2 and 2/13 = 15 % for string3.
That's it! :)

There are several settings you might adjust to your needs.
  • the threshold for your decision matching/not matching
  • the length of the sequence. 3 was good for me with sentences of 120 characters on average, might be 2 is better for shoter phrases.
  • You could remove all white spaces from the strings to avoid unnecessary differences by reason of merging words together or not.
  • same with capitalizing
  • To minimize the effect of typical misspellings, you could remove interpunction and ligatures.
  • And to minimize the effect of big differences of the strings' length, you might additionally compare the strings in the opposite direction.
  • This list is not complete ;)

I'm sorry I don't have AutoHotkey code for you. Initially I tried it with AutoHotkey (the code was not that difficult), but it was too slow for me, therefore I discarded it and wrote a program in C.
__________________________________________
Created with BBCodeWriter 7.0 - the one and only :D


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: December 30th, 2008, 3:47 am 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
That is very cool! Thanks for the detailed explaination. :D

Rabiator wrote:
I'm sorry I don't have AutoHotkey code for you. Initially I tried it with AutoHotkey (the code was not that difficult)

I'll get on it soon and see what it gives. I'm sure it will perform better than what I currently have right now.
Thanks again!


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: January 28th, 2009, 8:15 pm 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
I updated the algorithm using Rabiator's idea. Since it now gives back a fitness level, you can apply a threshold to decide whether the match is close enough or not.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: May 14th, 2010, 1:24 am 
Offline

Joined: July 31st, 2008, 10:27 pm
Posts: 336
I think I found a bug. If I use these as my items:

Code:
MyItem = Aero

s1 := "afighter"
s2 := "Aerozzzzzzzz"
s3 := "Aeroa"
s4 := "Aero:"


'Aerozzzzzzzz' is returned as the best match with a 100% fitness level. It seems that it matches the first string that begins with 'Aero', even though 'Aeroa' is probably a better match.

I will get around it, by checking the Strlen(MatchedItem) of all matches that have a 100% fitness level against the Strlen(SearchItem).
I will calculate my own fitness level: Strlen(SearchItem) / Strlen(MatchedItem).

I just thought people would want to know. If someone has a better solution, I would love to hear it.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: May 14th, 2010, 6:18 am 
Offline

Joined: July 30th, 2007, 11:32 pm
Posts: 581
Both s2 and s3 have a fitness level of 100%. It just returns the first of the two. In fact, even s4 has a fitness level of 100%.

I did this a long time ago, so I'm not too sure what is the best way of giving weight to the difference in length of the string.

What I would do is probably first check if multiple items have the same (and highest) fitness level, and then only return the one with the length closest to the length of sItem.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: May 14th, 2010, 12:01 pm 
Offline

Joined: July 31st, 2008, 10:27 pm
Posts: 336
First I would like to say that this is the best matching algorithm that I have found for my needs. Second, I don't understand how it works, it is too advanced for my skills. :)

I guess I wasn't going to try to find a better match, I just wanted to make sure that the user knew that it wasn't really a 100% match. That's why I was going to set a new fitness level created by dividing the two string lengths.

I have also changed the way 100% matches are handled. The script I use looks like this:

Code:
FuzzyInit:
 S0 = 0
 TwoCount = 0
 Loop, %RomFolder%\*.%RomExtension%,0,%Recurse%
 {
  SplitPath, A_LoopFileName,,, Ext, FileName
  If (Strlen(Filename) > 2)
  {
   s0++
   F%s0% := Filename
   E%s0% := Ext
  }
  Else
  {
   TwoCount++
   Rom%TwoCount% := FileName
   Ext%TwoCount% := Ext
  }
 }
 VarSetCapacity(ptr, s0 * 4)
 i := &ptr
 Loop % s0
   i := NumPut(&F%A_Index%, i+0)
Return


I separated the code to find 100% matches from your script so that I can find 1 and 2 letter matches, without redundant code.

This is the code I use to find a match:
Code:
  Fitness = 0
  FileName := "ERROR: CAN NOT MATCH NAME"
  If StrLen(ArtName) < 3
  {
   Loop, %TwoCount%
   If Rom%A_Index% = %ArtName%
   {
    Fitness = 100
    FileName := Rom%A_Index% . Ext%A_Index%
    Break
   }
  }
  Else
  {
   I1 := MatchItemFromList(&ptr, s0, ArtName), L1 := GetWordFromDblWord(I1, True), Fitness := GetWordFromDblWord(I1, False)
;MsgBox, % "Original" ArtName "`nWinning index: " L1 "`nWinning title: " F%L1% "`nMatch fitness: " Fitness   "%"
;   Fitness .= `%
   If Fitness > 0
      FileName := F%L1% . "." . E%L1%
  }


And finally, your algorithm that I modified eliminating the strlen check, and returning 100% matches, since both have been done. If this routine returns a 100% match, I know that it is not really 100%

Code:
MatchItemFromList(iPtr, iCount, sItem)
{
 iTrigrams := StrLen(sItem) - 2
 Loop %iCount%
   sList%A_Index% := DllCall("MulDiv", int, NumGet(iPtr+0, (A_Index - 1) * 4), int, 1, int, 1, str)
 Loop %iCount%
   If (sList%A_Index% = sItem)
     Return (100 << 16) + A_Index

 Loop %iTrigrams%
 {
  i := InStr(sItem, SubStr(sItem, A_Index, 3), False, 1)
  If i And (i < A_Index)
  {
   sItem_%i% += 1
   sItem_%A_Index% := 0
  } Else sItem_%A_Index% := InStrCount(sItem, SubStr(sItem, A_Index, 3))
 }

 Loop %iCount%
 {
  i := A_Index
  Loop %iTrigrams%
    If sItem_%A_Index%
      sList%i%_Diff += Abs(InStrCount(sList%i%, SubStr(sItem, A_Index, 3)) - sItem_%A_Index%)
 }

 iBestI := 0
 iBestD := 0x10000
 Loop %iCount%
 {
  If (sList%A_Index%_Diff < iBestD)
  {
   iBestD := sList%A_Index%_Diff
   iBestI := A_Index
  }
 }
 Return (Round((iTrigrams - iBestD) * 100 / iTrigrams) << 16) + iBestI
}


InStrCount(ByRef Haystack, Trigram)
{
 j := 0
 i := 1
 Loop
 {
  i := InStr(Haystack, Trigram, False, i)
  If Not i
    Return j
  j += 1
  i += 3
 }
}


GetWordFromDblWord(iDblWord, bLowWord = True)
{
 iTemp := Mod(iDblWord, 0x10000)
 Return bLowWord ? iTemp : Round((iDblWord - iTemp) / 0x10000)
}


I know my code isn't the best, but I hope it gives you some ideas.


Report this post
Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 17 posts ]  Go to page 1, 2  Next

All times are UTC [ DST ]


Who is online

Users browsing this forum: DataLife and 13 guests


You can post new topics in this forum
You can reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Powered by phpBB® Forum Software © phpBB Group