AutoHotkey Homepage AutoHotkey Community
Let's help each other out
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Compare strings & return likelyhood of match

 
Post new topic   Reply to topic    AutoHotkey Community Forum Index -> Ask for Help
View previous topic :: View next topic  
Author Message
deefault



Joined: 03 Aug 2006
Posts: 18

PostPosted: Sat Dec 09, 2006 5:10 am    Post subject: Compare strings & return likelyhood of match Reply with quote

Ok so I'm in a bit of a pickle...I've got two very long lists of book titles and I'm trying to identify books that exist in both lists...fortunately by using the author's last name I should be able to narrow the results down to 5-6 books to compare to...unfortunately the books in one list might not be an exact match to the other.

For Example:
A book in list A titled "Social Psychology" might yield the following possible matches in list B:

"Social Psychology (11th Edition)"
"Social Psychology- Text Only"
"Social Psychology with Research Navigator, 10th Edition"
"Human Aggression, Second Edition (Perspectives in Social Psychology)"


As you can see its not as simple as comparing variables for an exact match because one simply doesn't exist...but there are selections that are a closer match than others. What I'm hoping to figure out is a way to determine a percent likelyhood of a match...similar to what the search function this forum does...allowing me to rank titles based on the likelyhood of a match.

Does anyone have any ideas, suggestions or faults with this?

Thanks for any help you can offer!
Back to top
View user's profile Send private message
.AHK



Joined: 26 Apr 2006
Posts: 662
Location: USA

PostPosted: Sat Dec 09, 2006 5:42 am    Post subject: Reply with quote

Well, since you know that they all contain the exact phrase "Social Psychology". Do a loop read line and have it search the line for the phrase. If it finds the phrase it will copy the entire line. Should this work?
Back to top
View user's profile Send private message Visit poster's website AIM Address
deefault



Joined: 03 Aug 2006
Posts: 18

PostPosted: Sat Dec 09, 2006 6:11 am    Post subject: Reply with quote

Thanks for the fast reply, I had thought of doing this but it doesn't hold true in every case...in many cases, common words like "the" "a" "of" etc. aren't included in the list. At the moment (using other non-AHK means) I can get the list down to those 5-6 close matches...I just need a system to evaluate how close a match they are to eachother.

i.e. "Means to End: History of Universe"

Should return a pretty high match to a book entitled "Means to an End: Jack Jordan's History of the Universe" but a low match rating to "Means to an End: A Universal Guide to Basket Weaving".

It gets tricky huh? I don't even know if its possible with AHK but I thought I'd put it out and see if I got any good suggestions.
Back to top
View user's profile Send private message
.AHK



Joined: 26 Apr 2006
Posts: 662
Location: USA

PostPosted: Sat Dec 09, 2006 6:33 am    Post subject: Reply with quote

Yeah it sounds possible. You would have to make some kind of rating system though. Say if a line has 6 words and only two match the phrase, the line gets rated at 2/6. This is just an idea though, and might not be the best way to do so.
Back to top
View user's profile Send private message Visit poster's website AIM Address
Laszlo



Joined: 14 Feb 2005
Posts: 4015
Location: Pittsburgh

PostPosted: Sat Dec 09, 2006 3:51 pm    Post subject: Reply with quote

It can be done with AHK or with any other programming language. However, the task is very hard, because the script has to parse natural language phrases to find important and negligible parts, attributes, modifiers, plurals/singulars, etc.
Back to top
View user's profile Send private message
hd0202



Joined: 13 Aug 2006
Posts: 58
Location: Germany

PostPosted: Sat Dec 09, 2006 4:11 pm    Post subject: Reply with quote

Hi

what here is needed is called the "Levenshtein distance". You can find it via google or wikipedia.

Here is my solution:

Code:

lista =
(
"Social Psychology"
)

listb =
( comment
"Social Psychology (11th Edition)"
"Social Psychology- Text Only"
"Social Psychology with Research Navigator, 10th Edition"
"Human Aggression, Second Edition (Perspectives in Social Psychology)"
"Levenshtein distance"
)

loop, parse, lista, `n
{
   str1 := a_loopfield
   total .= "`n" str1 "`n"
   out =
   loop, parse, listb, `n
   {
      str2 := a_loopfield
      stringlen, l1, str1
      stringlen, l2, str2
      gosub levenshtein
   }
   sort, out
   total .= out
}
total .= "`n`nThe scond value is the ""Levenshtein Distance"""
total .= "`nWith the first value I tried to point out the search value"
msgbox, % total
exitapp

levenshtein:
;   pseudo code
;   for i from 0 to lenStr1
;       d[i, 0] := i
;   for j from 1 to lenStr2
;       d[0, j] := j
;   for i from 1 to lenStr1
;       for j from 1 to lenStr2
;           if str1[i] = str2[j] then cost := 0
;                                else cost := 1
;           d[i, j] := minimum(
;                                d[i-1, j] + 1,     // deletion
;                                d[i, j-1] + 1,     // insertion
;                                d[i-1, j-1] + cost   // substitution
;                            )
;   return d[lenStr1, lenstr2]


d_0_0 := 0
loop, %l1%
{
   i := a_index
   d_%i%_0 := --i
}

loop, %l2%
{
   j := a_index
   d_0_%j% := j
}

loop, %l1%
{
   i := a_index
   loop, %l2%
   {
      j := a_index
      cost := (substr(str1, i, 1) = substr(str2, j, 1)) ? 0 : 1
      i1 := i - 1
      j1 := j - 1
      d1 := d_%i1%_%j%
      d2 := d_%i%_%j1%
      d3 := d_%i1%_%j1%
      d1++
      d2++
      d3 += cost
      d_%i%_%j% := min(d1, min(d2, d3))
   }
}
;out .= d_%l1%_%l2% " - " l2 " - " str2 "`n"
out .= 1000 - abs(l2 - d_%l1%_%l2%) " - " d_%l1%_%l2% " - " str2 "`n"
return

min(a, b)
{
   return a < b ? a : b
}


The reading of the data from Gui or File is to your work, the same for the processing of the result . I know that the latter can be much more difficult than the algorithm himself.

Hubert
Back to top
View user's profile Send private message
Laszlo



Joined: 14 Feb 2005
Posts: 4015
Location: Pittsburgh

PostPosted: Sat Dec 09, 2006 5:00 pm    Post subject: Reply with quote

hd0202 wrote:
what here is needed is called the "Levenshtein distance"
I don't think it is the right approach here. The Levenshtein distance gives the minimum number of changes from one string to the other, which can be large if the author's name is inserted, but it is small if the strings are short, although completely different. You cannot get around parsing the phrases, which involves a lot of language dependent rules.

A few scripts have been already posted for computing the Levenshtein distance, and a variant. hd0202: you could also post a link to the original thread, so people could have a choice. With the new AHK features introduced since the older posts, some simplifications of the code are possible.
Back to top
View user's profile Send private message
deefault



Joined: 03 Aug 2006
Posts: 18

PostPosted: Mon Dec 11, 2006 11:38 pm    Post subject: Reply with quote

Thanks for the great suggestion Hubert! I agree with you Laszlo, the algorithm isn't a perfect solution to my situation...but I think it's a good place to start, maybe in in combination with a few other tests run on the titles will produce an accurate enough result...
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    AutoHotkey Community Forum Index -> Ask for Help All times are GMT
Page 1 of 1

 
Jump to:  
You can post new topics in this forum
You can reply to topics in this forum


Powered by phpBB © 2001, 2005 phpBB Group