[Class] string-similarity.ahk

Chunjee
Posts: 516
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

[Class] string-similarity.ahk

20 Aug 2019, 10:01

Latest version: https://github.com/Chunjee/string-similarity.ahk
npm: https://www.npmjs.com/package/string-similarity.ahk



string-similarity.ahk
=================


Finds degree of similarity between two strings, based on Dice's Coefficient, which is mostly better than Levenshtein distance.
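
As a worked illustration of the coefficient (a sketch assuming the strings are split into overlapping two-character pairs, which is how the class posted later in this thread counts them), take "healed" and "sealed". The symbols X, Y and s are labels introduced here for the two pair sets and the score; they are not names used by the library.

```latex
s = \frac{2\,\lvert X \cap Y \rvert}{\lvert X \rvert + \lvert Y \rvert}
\qquad
X = \{\text{he},\ \text{ea},\ \text{al},\ \text{le},\ \text{ed}\},\quad
Y = \{\text{se},\ \text{ea},\ \text{al},\ \text{le},\ \text{ed}\}

s = \frac{2 \cdot 4}{5 + 5} = 0.80
```

Four of the five pairs are shared, which gives the 0.80 shown in the compareTwoStrings example further down.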


In a terminal or command line navigated to your project folder:

Code: Select all

npm install string-similarity.ahk
In your code:

Code: Select all

#Include %A_ScriptDir%\node_modules\string-similarity.ahk\export.ahk
stringSimilarity := new stringsimilarity()

similarityrating := stringSimilarity.compareTwoStrings("healed", "sealed")

matches := stringSimilarity.findBestMatch("healed", ["edward", "sealed", "theatre"])

bestmatchstring := stringSimilarity.simpleBestMatch("Hard to", [" hard to    ", "hard to", "Hard 2"])


API


Including the module gives an object with three methods: .compareTwoStrings, .findBestMatch, and .simpleBestMatch


compareTwoStrings(string1, string2)
Returns a fraction between 0 and 1, which indicates the degree of similarity between the two strings. 0 indicates completely different strings, 1 indicates identical strings. The comparison is case-insensitive.

Arguments

1. string1 (string): The first string
2. string2 (string): The second string

Order does not make a difference.

Returns

(Number): A fraction from 0 to 1, both inclusive. Higher number indicates more similarity.

Examples

Code: Select all

stringSimilarity.compareTwoStrings("healed", "sealed")
;; → 0.80

stringSimilarity.compareTwoStrings("Olive-green table for sale, in extremely good condition."
  , "For sale: table in very good  condition, olive green in colour.")
;; → 0.71

stringSimilarity.compareTwoStrings("Olive-green table for sale, in extremely good condition."
  , "For sale: green Subaru Impreza, 210,000 miles")
;; → 0.30

stringSimilarity.compareTwoStrings("Olive-green table for sale, in extremely good condition."
  , "Wanted: mountain bike with at least 21 gears.")
;; → 0.11

findBestMatch(mainString, targetStrings)
Compares mainString against each string in targetStrings.

Arguments
1. mainString (string): The string to match each target string against.
2. targetStrings (Array): Each string in this array will be matched against the main string.

Returns
(Object): An object with a ratings property, which gives a similarity rating for each target string, and a bestMatch property, which specifies which target string was most similar to the main string.

Examples

Code: Select all

stringSimilarity.findBestMatch("Olive-green table for sale, in extremely good condition."
  , ["For sale: green Subaru Impreza, 210,000 miles"
  , "For sale: table in very good condition, olive green in colour."
  , "Wanted: mountain bike with at least 21 gears."])
;; → 
{ ratings:
   [ { target: "For sale: green Subaru Impreza, 210,000 miles",
       rating: 0.30 },
     { target: "For sale: table in very good condition, olive green in colour.",
       rating: 0.71 },
     { target: "Wanted: mountain bike with at least 21 gears.",
       rating: 0.11 } ],
  bestMatch:
   { target: "For sale: table in very good condition, olive green in colour.",
     rating: 0.71 } }

simpleBestMatch(mainString, targetStrings)
Compares mainString against each string in targetStrings.

Arguments
1. mainString (string): The string to match each target string against.
2. targetStrings (Array): Each string in this array will be matched against the main string.

Returns
(String): The string that was most similar to the first argument string.

Examples

Code: Select all

stringSimilarity.simpleBestMatch("Hard to"
  , [" hard to    "
  , "hard to"
  , "Hard 2"])
;; → "hard to"
Last edited by Chunjee on 17 Nov 2019, 06:11, edited 7 times in total.
Chunjee
Posts: 516
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: string-similarity.ahk

20 Aug 2019, 10:08

Library v1.1.0

Code: Select all

Class stringsimilarity {

    static bestMatch := ""
    static bestMatchRating := 0

    compareTwoStrings(param_string1,param_string2) {
        ;Sørensen-Dice coefficient
        vCount := 0
        oArray := {}
        oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
        Loop, % vCount1 := StrLen(param_string1) - 1
            oArray["z" SubStr(param_string1, A_Index, 2)]++
        Loop, % vCount2 := StrLen(param_string2) - 1
            if (oArray["z" SubStr(param_string2, A_Index, 2)] > 0) {
                oArray["z" SubStr(param_string2, A_Index, 2)]--
                vCount++
            }
        vDSC := Round((2 * vCount) / (vCount1 + vCount2),2)
        if (!vDSC || vDSC < 0.005) { ;round to 0 if less than 0.005
            return 0
        }
        if (vDSC = 1) { 
            return 1
        }
        return vDSC
    }

    findBestMatch(param_string,param_array) {
        if (!IsObject(param_array)) {
            throw Exception("Type Error", -1)
        }
        string_Array := []

        ; Score each option and save into a new array
        loop, % param_array.MaxIndex() {
            string_Array[A_Index, "rating"] := this.compareTwoStrings(param_string, param_array[A_Index])
            string_Array[A_Index, "target"] := param_array[A_Index]
        }

        ;sort the scored array and return the bestmatch
        l_sortedArray := this.internal_Sort2DArrayFast(string_Array,"rating", false) ;false reverses the order so the highest scoring is at the top
        l_Object := {bestMatch:l_sortedArray[1], ratings:l_sortedArray}
        this.bestMatch := l_Object.bestMatch.target
        this.bestMatchRating := l_Object.bestMatch.rating
        return l_Object
    }

    simpleBestMatch(param_string,param_array) {
        if (!IsObject(param_array)) {
            throw Exception("Type Error", -1)
        }

        l_array := this.findBestMatch(param_string,param_array)
        this.bestMatch := l_array.bestMatch.target
        this.bestMatchRating := l_array.bestMatch.rating
        return l_array.bestMatch.target
    }



    internal_Sort2DArrayFast(param_Array,param_Key,param_Ascending:=True) {
        for index, obj in param_Array
            out .= obj[param_Key] "+" index "|" ; "+" allows for sort to work with just the value
        ; out will look like:   value+index|value+index|

        v := param_Array[param_Array.minIndex(), param_Key]
        if v is number 
            type := " N "
        StringTrimRight, out, out, 1 ; remove trailing | 
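        Sort, out, % "D| " type  (!param_Ascending ? " R" : " ") ; use the actual parameter name; "Ascending" was never defined, so the sort was always reversed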
        l_storage := []
        loop, parse, out, |
            l_storage.insert(param_Array[SubStr(A_LoopField, InStr(A_LoopField, "+") + 1)])
        return l_storage
    }
}
Performance improvements welcomed via pull requests.

Heavy lifting done by jeeswg at this thread: https://www.autohotkey.com/boards/viewtopic.php?p=243009#p243009


Why?
I was building a movie metadata tool and needed to match the user's input with the closest IMDb entry. Most libraries seem to be designed with short, single strings in mind. Longer strings like "Harry Potter and the Chamber of Secrets (2002)" proved difficult, and I really like the concept of scoring with a number between 0 and 1. That project is here if you need a real-world example of the lib: https://github.com/Chunjee/SA-omdbcloner

With longer strings, a slight difference can produce a huge Levenshtein distance. Conversely, short movie names never produce a large distance no matter how bad the match is. That is why a fraction between 0 and 1 works best for just about everything: the score is always to scale regardless of the inputs.
Last edited by Chunjee on 18 Mar 2020, 07:23, edited 6 times in total.
burque505
Posts: 1305
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

22 Aug 2019, 13:39

@Chunjee , thanks a lot. A user (MarkusDS) on the UiPath forum was asking about an old post of mine using Levenshtein and Damerau-Levenshtein in UiPath with AHK, and I've pointed him to this post.
Regards,
burque505
Delta Pythagorean
Posts: 562
Joined: 13 Feb 2017, 13:44
GitHub: DelPyth
Location: Somewhere in the US

Re: [Library] string-similarity.ahk

22 Aug 2019, 18:09

Pardon my French but, I'll be damned. This might be more useful to me in the future but for right now, I'd more or less use this for some sort of an AI.

Coderooney
Posts: 46
Joined: 23 Mar 2017, 22:41

Re: [Library] string-similarity.ahk

22 Aug 2019, 20:36

This is really impressive! I know it'll come in handy for something for sure.
fenchai
Posts: 236
Joined: 28 Mar 2016, 07:57

Re: [Library] string-similarity.ahk

23 Aug 2019, 12:41

pardon my ignorance, but why not post the library? I prefer just copy pasting it rather than installing with another program.
Relayer
Posts: 130
Joined: 30 Sep 2013, 13:09
Location: Delaware, USA

Re: [Library] string-similarity.ahk

23 Aug 2019, 12:56

Pardon MY ignorance...

Can anyone speak to whether there are different algorithms to assess similarity depending on what you are trying to accomplish? I can envision that depending on the context of the problem, you may want to weight the various attributes the calculation tries to embody. For example, you may want to stipulate that the first letter must always match to get a high score. Just curious.

Relayer
burque505
Posts: 1305
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

24 Aug 2019, 19:45

@Chunjee, could I ask for some help with this?

stringSimilarity.compareTwoStrings and stringSimilarity.simpleBestMatch behave as expected, but stringSimilarity.findBestMatch gives no output.

EDIT: I guess I figured it out, sorry for the bother. I just saw that simpleBestMatch calls findBestMatch, so it's obviously working, and I see findBestMatch returns an object.
So I used errorseven's obj2str.ahk, and I think Maestrith has a similar object-to-string function that I can't find at the moment. I think Joe Glines had a video using it. Bound to be here somewhere :D

Here's my little test script:

Code: Select all

#Include %A_ScriptDir%\lib\string-similarity.ahk\export.ahk
#Include %A_ScriptDir%\obj2str.ahk
stringSimilarity := new stringsimilarity()
 
similarityrating := stringSimilarity.compareTwoStrings("The eturn of the king", "The Return of the King")
 
;matches := stringSimilarity.findBestMatch("healed", ["edward", "sealed", "theatre"])

newtest := stringSimilarity.findBestMatch("Olive-green table for sale, in extremely good condition."
  , ["For sale: green Subaru Impreza, 210,000 miles"
  , "For sale: table in very good condition, olive green in colour."
  , "Wanted: mountain bike with at least 21 gears."])
 
bestmatchstring := stringSimilarity.simpleBestMatch("Blue table for sale, in extremely good condition."
    , ["For sale: green Subaru Impreza, 210,000 miles"
    , "For sale: table in very good condition, olive green in colour."
    , "Wanted: mountain bike with at least 21 gears."])
	
hoser := obj2str(newtest)

msgbox %similarityrating%`r`n`r`n%hoser%`r`n`r`n%bestmatchstring%
[Attachment: msgbox.PNG, screenshot of the resulting message box]
This is a really cool lib, thank you very much!
Regards, burque505
Chunjee
Posts: 516
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Library] string-similarity.ahk

25 Aug 2019, 05:08

fenchai wrote:
23 Aug 2019, 12:41
pardon my ignorance, but why not post the library? I prefer just copy pasting it rather than installing with another program.
Yeah probably a good idea as is the norm for ahk. Will be added to Post #2
Chunjee
Posts: 516
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Library] string-similarity.ahk

25 Aug 2019, 05:41

burque505 wrote:
24 Aug 2019, 19:45
stringSimilarity.compareTwoStrings and stringSimilarity.simpleBestMatch behave as expected, but stringSimilarity.findBestMatch gives no output.
Sounds like you mostly figured it out. But for completeness for anyone else in the future let me leave an answer:
Since it returns an object, you can't just msgbox the whole object if you are looking for a string. You would do something like msgbox, % newtest.bestMatch.target or msgbox, % newtest.ratings[1].target

I tested the following as a more in depth example:

Code: Select all

newtest := stringSimilarity.findBestMatch("foobar",["foo","bar"])

msgbox, % newtest.bestMatch.target " and " newtest.ratings[1].target " both match"
;; => "bar" for the curious
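
To list every target with its rating rather than only the best one, the ratings array can be walked with a for-loop. The following is a minimal sketch for AutoHotkey v1; the `result` object is a hypothetical stand-in with placeholder values, built by hand so the snippet runs without the library, and `summary` and `entry` are names chosen here for illustration.

```autohotkey
; Hypothetical stand-in for the object findBestMatch returns.
; The targets and ratings are placeholder values, not library output.
result := {bestMatch: {target: "bar", rating: 0.57}
    , ratings: [{target: "bar", rating: 0.57}
    , {target: "foo", rating: 0.57}]}

; Walk the ratings array and build one line per target.
summary := ""
for index, entry in result.ratings
    summary .= index ". " entry.target " scored " entry.rating "`n"

MsgBox, % "Best: " result.bestMatch.target "`n`n" summary
```

In a real script, `result` would instead be assigned from stringSimilarity.findBestMatch(...), and the loop stays the same.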

burque505
Posts: 1305
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

25 Aug 2019, 08:12

Thanks for the excellent example. I know I'm going to get asked how to discard values below a certain confidence rating (today, I'll bet ;) ), so I tried the following:

Code: Select all

newtest2 := stringSimilarity.findBestMatch("foobar",["foo","bar"])
msgbox, % "The match for 'newtest2.bestMatch.target', i.e. '" newtest2.bestMatch.target "', has a confidence rating of " newtest2.ratings[1].rating
That should get me where I need to be. :D
[Attachment: dices-2.PNG, screenshot of the resulting message box]
burque505
Posts: 1305
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

26 Aug 2019, 18:23

@Chunjee, saw your post on the UiPath forum, great work! I downloaded your UiPath workflow, made a very minor mod to "Dice.ahk", and tweaked the properties a little. It appears to be working, and compared to Markus's original solution it is REALLY fast.
Regards,
burque505
Chunjee
Posts: 516
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Library] string-similarity.ahk

27 Aug 2019, 13:33

3 Ways to filter lower ratings for anyone who needs a hand:

Code: Select all

ssObj := new stringsimilarity() ; instance used by the examples in this post
scoreditems := ssObj.findBestMatch("For Sale:",["Wanted - ","[WTB]","FOR SALE!","SALE:"])
;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"], 3:["rating":0, "target":"[WTB]"], 4:["rating":0, "target":"Wanted - "]
newarray := []
loop, % scoreditems.ratings.MaxIndex() {
    if (scoreditems.ratings[A_Index].rating > .60) {
        newarray.push(scoreditems.ratings[A_Index])
    }
}
scoreditems.ratings := newarray
;;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"]

Or do it all in a dedicated function:

Code: Select all

scoreditems := ssObj.findBestMatch("For Sale:",["Wanted - ","[WTB]","FOR SALE!","SALE:"])
scoreditems.ratings := filterlowRatings(scoreditems.ratings,.60)
;;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"]
filterlowRatings(para_ratingsarray,para_scorethreshold) {
    newarray := []
    loop, % para_ratingsarray.MaxIndex() {
        if (para_ratingsarray[A_Index].rating > para_scorethreshold) {
            newarray.push(para_ratingsarray[A_Index])
        }
    }
    return newarray
}


/!\ SHAMELESS PLUG INCOMING!!! /!\
I'm working on a larger utility library that will allow one to sort out low scoring matches like so:

Code: Select all

ssObj := New stringsimilarity()
A := New biga()

scoreditems := ssObj.findBestMatch("For Sale:",["Wanted - ","[WTB]","FOR SALE!","SALE:"])
;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"], 3:["rating":0, "target":"[WTB]"], 4:["rating":0, "target":"Wanted - "]
scoreditems.ratings := A.filter(scoreditems.ratings,Func("filterLowRatings"))
;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"]
filterLowRatings(para_interatee) {
    if (para_interatee.rating >= .60) { ; 60% match or better
        return true
    }
}
That .filter method is done, but the library, on the whole, is far from done. You can download or learn more about it in this thread: https://www.autohotkey.com/boards/viewtopic.php?f=76&t=67466
I could certainly use the help expanding that out.
Chunjee
Posts: 516
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Class] string-similarity.ahk

16 Nov 2019, 15:32

I noticed some typos on the documentation and fixed them today.

I also added two properties to the class, "bestMatch" and "bestMatchRating".
Whenever .findBestMatch or .simpleBestMatch is called, the best matching string is saved to bestMatch and the rating it scored is saved to bestMatchRating, for even easier access later.

Example:

Code: Select all

stringSimilarity := new stringsimilarity()
stringSimilarity.findBestMatch("The Mask"
    , ["The Last Jedi (2017)"
    , "The Mask (1994)"])

msgbox, % "the best match was " stringSimilarity.bestMatch " with a rating of " stringSimilarity.bestMatchRating
; => the best match was The Mask (1994) with a rating of 0.67
Users that value statelessness should ignore this new feature.
Chunjee
Posts: 516
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Class] string-similarity.ahk

14 Mar 2020, 10:17

I made two logos a few weeks ago. They are now included in one .svg file on the git repo.

Official:
Image

Artistic Concept:
Image


If you have an idea of your own feel free to post it.
