 |
AutoHotkey Community Let's help each other out
|
| View previous topic :: View next topic |
| Author |
Message |
deefault
Joined: 03 Aug 2006 Posts: 18
|
Posted: Sat Dec 09, 2006 5:10 am Post subject: Compare strings & return likelyhood of match |
|
|
Ok so I'm in a bit of a pickle...I've got two very long lists of book titles and I'm trying to identify books that exist in both lists...fortunately by using the author's last name I should be able to narrow the results down to 5-6 books to compare to...unfortunately the books in one list might not be an exact match to the other.
For Example:
A book in list A titled "Social Psychology" might yield the following possible matches in list B:
"Social Psychology (11th Edition)"
"Social Psychology- Text Only"
"Social Psychology with Research Navigator, 10th Edition"
"Human Aggression, Second Edition (Perspectives in Social Psychology)"
As you can see its not as simple as comparing variables for an exact match because one simply doesn't exist...but there are selections that are a closer match than others. What I'm hoping to figure out is a way to determine a percent likelyhood of a match...similar to what the search function this forum does...allowing me to rank titles based on the likelyhood of a match.
Does anyone have any ideas, suggestions or faults with this?
Thanks for any help you can offer! |
|
| Back to top |
|
 |
.AHK
Joined: 26 Apr 2006 Posts: 662 Location: USA
|
Posted: Sat Dec 09, 2006 5:42 am Post subject: |
|
|
| Well, since you know that they all contain the exact phrase "Social Psychology". Do a loop read line and have it search the line for the phrase. If it finds the phrase it will copy the entire line. Should this work? |
|
| Back to top |
|
 |
deefault
Joined: 03 Aug 2006 Posts: 18
|
Posted: Sat Dec 09, 2006 6:11 am Post subject: |
|
|
Thanks for the fast reply, I had thought of doing this but it doesn't hold true in every case...in many cases, common words like "the" "a" "of" etc. aren't included in the list. At the moment (using other non-AHK means) I can get the list down to those 5-6 close matches...I just need a system to evaluate how close a match they are to eachother.
i.e. "Means to End: History of Universe"
Should return a pretty high match to a book entitled "Means to an End: Jack Jordan's History of the Universe" but a low match rating to "Means to an End: A Universal Guide to Basket Weaving".
It gets tricky huh? I don't even know if its possible with AHK but I thought I'd put it out and see if I got any good suggestions. |
|
| Back to top |
|
 |
.AHK
Joined: 26 Apr 2006 Posts: 662 Location: USA
|
Posted: Sat Dec 09, 2006 6:33 am Post subject: |
|
|
| Yeah it sounds possible. You would have to make some kind of rating system though. Say if a line has 6 words and only two match the phrase, the line gets rated at 2/6. This is just an idea though, and might not be the best way to do so. |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4015 Location: Pittsburgh
|
Posted: Sat Dec 09, 2006 3:51 pm Post subject: |
|
|
| It can be done with AHK or with any other programming language. However, the task is very hard, because the script has to parse natural language phrases to find important and negligible parts, attributes, modifiers, plurals/singulars, etc. |
|
| Back to top |
|
 |
hd0202
Joined: 13 Aug 2006 Posts: 58 Location: Germany
|
Posted: Sat Dec 09, 2006 4:11 pm Post subject: |
|
|
Hi
what here is needed is called the "Levenshtein distance". You can find it via google or wikipedia.
Here is my solution:
| Code: |
lista =
(
"Social Psychology"
)
listb =
( comment
"Social Psychology (11th Edition)"
"Social Psychology- Text Only"
"Social Psychology with Research Navigator, 10th Edition"
"Human Aggression, Second Edition (Perspectives in Social Psychology)"
"Levenshtein distance"
)
loop, parse, lista, `n
{
str1 := a_loopfield
total .= "`n" str1 "`n"
out =
loop, parse, listb, `n
{
str2 := a_loopfield
stringlen, l1, str1
stringlen, l2, str2
gosub levenshtein
}
sort, out
total .= out
}
total .= "`n`nThe scond value is the ""Levenshtein Distance"""
total .= "`nWith the first value I tried to point out the search value"
msgbox, % total
exitapp
levenshtein:
; pseudo code
; for i from 0 to lenStr1
; d[i, 0] := i
; for j from 1 to lenStr2
; d[0, j] := j
; for i from 1 to lenStr1
; for j from 1 to lenStr2
; if str1[i] = str2[j] then cost := 0
; else cost := 1
; d[i, j] := minimum(
; d[i-1, j] + 1, // deletion
; d[i, j-1] + 1, // insertion
; d[i-1, j-1] + cost // substitution
; )
; return d[lenStr1, lenstr2]
d_0_0 := 0
loop, %l1%
{
i := a_index
d_%i%_0 := --i
}
loop, %l2%
{
j := a_index
d_0_%j% := j
}
loop, %l1%
{
i := a_index
loop, %l2%
{
j := a_index
cost := (substr(str1, i, 1) = substr(str2, j, 1)) ? 0 : 1
i1 := i - 1
j1 := j - 1
d1 := d_%i1%_%j%
d2 := d_%i%_%j1%
d3 := d_%i1%_%j1%
d1++
d2++
d3 += cost
d_%i%_%j% := min(d1, min(d2, d3))
}
}
;out .= d_%l1%_%l2% " - " l2 " - " str2 "`n"
out .= 1000 - abs(l2 - d_%l1%_%l2%) " - " d_%l1%_%l2% " - " str2 "`n"
return
min(a, b)
{
return a < b ? a : b
}
|
The reading of the data from Gui or File is to your work, the same for the processing of the result . I know that the latter can be much more difficult than the algorithm himself.
Hubert |
|
| Back to top |
|
 |
Laszlo
Joined: 14 Feb 2005 Posts: 4015 Location: Pittsburgh
|
Posted: Sat Dec 09, 2006 5:00 pm Post subject: |
|
|
| hd0202 wrote: | | what here is needed is called the "Levenshtein distance" | I don't think it is the right approach here. The Levenshtein distance gives the minimum number of changes from one string to the other, which can be large if the author's name is inserted, but it is small if the strings are short, although completely different. You cannot get around parsing the phrases, which involves a lot of language dependent rules.
A few scripts have been already posted for computing the Levenshtein distance, and a variant. hd0202: you could also post a link to the original thread, so people could have a choice. With the new AHK features introduced since the older posts, some simplifications of the code are possible. |
|
| Back to top |
|
 |
deefault
Joined: 03 Aug 2006 Posts: 18
|
Posted: Mon Dec 11, 2006 11:38 pm Post subject: |
|
|
| Thanks for the great suggestion Hubert! I agree with you Laszlo, the algorithm isn't a perfect solution to my situation...but I think it's a good place to start, maybe in in combination with a few other tests run on the titles will produce an accurate enough result... |
|
| Back to top |
|
 |
|
|
You can post new topics in this forum You can reply to topics in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|