[Class] string-similarity.ahk

Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

20 Aug 2019, 10:01

Github: https://github.com/Chunjee/string-similarity.ahk
npm: https://www.npmjs.com/package/string-similarity.ahk



string-similarity.ahk
Finds the degree of similarity between two strings, based on Dice's coefficient, which is often a better fit than Levenshtein distance.


Installation
In a terminal or command line navigated to your project folder:

Code: Select all

npm install string-similarity.ahk
In your code:

Code: Select all

#Include %A_ScriptDir%\node_modules
#Include string-similarity.ahk\export.ahk
oStringSimilarity := new stringsimilarity()

oStringSimilarity.compareTwoStrings("test", "testing")
; => 0.67
oStringSimilarity.compareTwoStrings("Hello", "hello")
; => 1.0

API
Including the module provides a class named stringsimilarity with three methods: .compareTwoStrings, .findBestMatch, and .simpleBestMatch


compareTwoStrings(string1, string2)
Returns a fraction between 0 and 1, which indicates the degree of similarity between the two strings. 0 indicates completely different strings, 1 indicates identical strings. The comparison is case-insensitive.
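
For reference, the score can be written out as below. This is a sketch based on the class source posted further down in this thread, which counts adjacent character pairs (bigrams); the symbols X and Y are introduced here only for illustration and do not appear in the library.

```latex
% X, Y: multisets of adjacent character pairs (bigrams) of string1 and string2
s = \frac{2\,|X \cap Y|}{|X| + |Y|}

% Worked example: "healed" vs "sealed"
% X = \{he, ea, al, le, ed\}, \quad Y = \{se, ea, al, le, ed\}, \quad |X \cap Y| = 4
s = \frac{2 \cdot 4}{5 + 5} = 0.80
```

The result is then rounded to two decimal places, which is why the examples show values such as 0.80 and 0.67.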

## Arguments
  1. string1 (string): The first string
  2. string2 (string): The second string
Order does not make a difference.

## Returns
(Number): A fraction from 0 to 1, both inclusive. A higher number indicates more similarity.

## Example

Code: Select all

stringSimilarity.compareTwoStrings("healed", "sealed")
; => 0.80

stringSimilarity.compareTwoStrings("Olive-green table for sale, in extremely good condition."
  , "For sale: table in very good  condition, olive green in colour.")
; => 0.71

stringSimilarity.compareTwoStrings("Olive-green table for sale, in extremely good condition."
  , "For sale: green Subaru Impreza, 210,000 miles")
; => 0.30

stringSimilarity.compareTwoStrings("Olive-green table for sale, in extremely good condition."
  , "Wanted: mountain bike with at least 21 gears.")
; => 0.11
findBestMatch(mainString, targetStrings)
Compares mainString against each string in targetStrings.

## Arguments
  1. mainString (string): The string to match each target string against.
  2. targetStrings (Array): Each string in this array will be matched against the main string.

## Returns
(Object): An object with a ratings property, which gives a similarity rating for each target string, and a bestMatch property, which specifies which target string was most similar to the main string.

## Example

Code: Select all

stringSimilarity.findBestMatch("Olive-green table for sale, in extremely good condition."
  , ["For sale: green Subaru Impreza, 210,000 miles"
  , "For sale: table in very good condition, olive green in colour."
  , "Wanted: mountain bike with at least 21 gears."])
; =>
{ ratings:
   [ { target: "For sale: table in very good condition, olive green in colour.",
       rating: 0.71 },
     { target: "For sale: green Subaru Impreza, 210,000 miles",
       rating: 0.30 },
     { target: "Wanted: mountain bike with at least 21 gears.",
       rating: 0.11 } ],
  bestMatch:
   { target: "For sale: table in very good condition, olive green in colour.",
     rating: 0.71 } }
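
Because findBestMatch returns an object rather than a string, the result has to be read property by property. The following is a minimal sketch of doing that, assuming the npm install layout from the Installation section; the variable names ss, result, entry and summary are only illustrative. In the class source posted in this thread, the ratings array comes back sorted with the highest rating first.

```autohotkey
#Include %A_ScriptDir%\node_modules
#Include string-similarity.ahk\export.ahk

; "ss" is just an illustrative variable name for the class instance
ss := new stringsimilarity()

result := ss.findBestMatch("healed", ["edward", "sealed", "theatre"])

; bestMatch is a single entry with "target" and "rating" keys
MsgBox, % "Best match: " result.bestMatch.target " (" result.bestMatch.rating ")"

; ratings is an array of such entries; build one line per entry
summary := ""
for index, entry in result.ratings
	summary .= entry.target ": " entry.rating "`n"
MsgBox, % summary
```

Passing the whole object to MsgBox shows nothing, so individual properties such as .bestMatch.target are what you normally display or compare.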
simpleBestMatch(mainString, targetStrings)
Compares mainString against each string in targetStrings.

## Arguments
  1. mainString (string): The string to match each target string against.
  2. targetStrings (Array): Each string in this array will be matched against the main string.

## Returns
(String): The string that was most similar to the first argument string.

## Example

Code: Select all

stringSimilarity.simpleBestMatch("Hard to"
  , [" hard to    "
  , "hard to"
  , "Hard 2"])
; => "hard to"
Last edited by Chunjee on 01 Sep 2020, 06:18, edited 10 times in total.
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: string-similarity.ahk

20 Aug 2019, 10:08

v1.0.4

Code: Select all

Class stringsimilarity {

	__New() {
		this.info_Array := [] ;holds the scored targets of the most recent findBestMatch call
	}


	compareTwoStrings(para_string1,para_string2) {
		;Sørensen-Dice coefficient: 2 * shared bigrams / total bigrams
		savedBatchLines := A_BatchLines
		SetBatchLines, -1

		vCount := 0
		oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
		;count every bigram (2-character substring) of the first string
		Loop, % vCount1 := StrLen(para_string1) - 1
			oArray["z" SubStr(para_string1, A_Index, 2)]++
		;use up one stored bigram for each matching bigram of the second string
		Loop, % vCount2 := StrLen(para_string2) - 1
			if (oArray["z" SubStr(para_string2, A_Index, 2)] > 0) {
				oArray["z" SubStr(para_string2, A_Index, 2)]--
				vCount++
			}
		vDSC := Round((2 * vCount) / (vCount1 + vCount2),2)
		SetBatchLines, % savedBatchLines ;restore the caller's setting before any return path
		if (!vDSC || vDSC < 0.005) { ;round to 0 if less than 0.005
			return 0
		}
		if (vDSC = 1) {
			return 1
		}
		return vDSC
	}


	findBestMatch(para_string,para_array) {
		savedBatchLines := A_BatchLines
		SetBatchLines, -1
		if (!IsObject(para_array)) {
			SetBatchLines, % savedBatchLines
			return false
		}

		this.info_Array := []

		; Score each option and save into a new array
		loop, % para_array.MaxIndex() {
			this.info_Array[A_Index, "rating"] := this.compareTwoStrings(para_string, para_array[A_Index])
			this.info_Array[A_Index, "target"] := para_array[A_Index]
		}

		;sort the scored array and return the bestmatch
		l_sortedArray := this.internal_Sort2DArrayFast(this.info_Array,"rating", false) ;false reverses the order so the highest scoring is at the top
		l_object := {bestMatch:l_sortedArray[1], ratings:l_sortedArray}
		SetBatchLines, % savedBatchLines
		return l_object
	}


	simpleBestMatch(para_string,para_array) {
		if (!IsObject(para_array)) {
			return false
		}

		l_array := this.findBestMatch(para_string,para_array)
		return l_array.bestMatch.target
	}



	internal_Sort2DArrayFast(byRef a, key, Ascending := True)
	{
		for index, obj in a
			out .= obj[key] "+" index "|" ; "+" allows for sort to work with just the value
		; out will look like:   value+index|value+index|

		v := a[a.minIndex(), key]
		if v is number
			type := " N "
		StringTrimRight, out, out, 1 ; remove trailing |
		Sort, out, % "D| " type  (!Ascending ? " R" : " ")
		l_storage := []
		loop, parse, out, |
			l_storage.insert(a[SubStr(A_LoopField, InStr(A_LoopField, "+") + 1)])
		return l_storage
	}
}
Performance improvements welcomed via pull requests.

Heavy lifting done by jeeswg at this thread: https://www.autohotkey.com/boards/viewtopic.php?p=243009#p243009


Why?
I was making a movie metadata tool and needed to match the user's input with the closest IMDb match. Most libraries seem to be designed with short single strings in mind. Longer strings like "Harry Potter and the Chamber of Secrets (2002)" proved difficult. I really liked the concept of scoring via a number between 0 and 1. That project is here if you need a real-world example of the package in use: https://github.com/Chunjee/SA-omdbcloner

With longer strings, a slight difference like a missing word can produce a huge Levenshtein distance. Conversely, short movie names never produce a large distance, even for a bad match. That is why a fraction between 0 and 1 works best for just about everything: the score is always on the same scale regardless of string length.
Last edited by Chunjee on 22 Sep 2020, 08:33, edited 8 times in total.
burque505
Posts: 1374
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

22 Aug 2019, 13:39

@Chunjee , thanks a lot. A user (MarkusDS) on the UiPath forum was asking about an old post of mine using Levenshtein and Damerau-Levenshtein in UiPath with AHK, and I've pointed him to this post.
Regards,
burque505
Delta Pythagorean
Posts: 567
Joined: 13 Feb 2017, 13:44
GitHub: DelPyth
Location: Somewhere in the US

Re: [Library] string-similarity.ahk

22 Aug 2019, 18:09

Pardon my French but, I'll be damned. This might be more useful to me in the future but for right now, I'd more or less use this for some sort of an AI.

Coderooney
Posts: 46
Joined: 23 Mar 2017, 22:41

Re: [Library] string-similarity.ahk

22 Aug 2019, 20:36

This is really impressive! I know it'll come in handy for something for sure.
fenchai
Posts: 237
Joined: 28 Mar 2016, 07:57

Re: [Library] string-similarity.ahk

23 Aug 2019, 12:41

pardon my ignorance, but why not post the library? I prefer just copy pasting it rather than installing with another program.
Relayer
Posts: 135
Joined: 30 Sep 2013, 13:09
Location: Delaware, USA

Re: [Library] string-similarity.ahk

23 Aug 2019, 12:56

Pardon MY ignorance...

Can anyone speak to whether there are different algorithms to assess similarity depending on what you are trying to accomplish? I can envision that depending on the context of the problem, you may want to weight the various attributes the calculation tries to embody. For example, you may want to stipulate that the first letter must always match to get a high score. Just curious.

Relayer
burque505
Posts: 1374
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

24 Aug 2019, 19:45

@Chunjee, could I ask for some help with this?

stringSimilarity.compareTwoStrings and stringSimilarity.simpleBestMatch behave as expected, but stringSimilarity.findBestMatch gives no output.

EDIT: I guess I figured it out, sorry for the bother. I just saw that simpleBestMatch calls findBestMatch, so it's obviously working, and I see findBestMatch returns an object.
So I used errorseven's obj2str.ahk, and I think Maestrith has a similar object-to-string function that I can't find at the moment. I think Joe Glines had a video using it. Bound to be here somewhere :D

Here's my little test script:

Code: Select all

#Include %A_ScriptDir%\lib\string-similarity.ahk\export.ahk
#Include %A_ScriptDir%\obj2str.ahk
stringSimilarity := new stringsimilarity()
 
similarityrating := stringSimilarity.compareTwoStrings("The eturn of the king", "The Return of the King")
 
;matches := stringSimilarity.findBestMatch("healed", ["edward", "sealed", "theatre"])

newtest := stringSimilarity.findBestMatch("Olive-green table for sale, in extremely good condition."
  , ["For sale: green Subaru Impreza, 210,000 miles"
  , "For sale: table in very good condition, olive green in colour."
  , "Wanted: mountain bike with at least 21 gears."])
 
bestmatchstring := stringSimilarity.simpleBestMatch("Blue table for sale, in extremely good condition."
    , ["For sale: green Subaru Impreza, 210,000 miles"
    , "For sale: table in very good condition, olive green in colour."
    , "Wanted: mountain bike with at least 21 gears."])
	
hoser := obj2str(newtest)

msgbox %similarityrating%`r`n`r`n%hoser%`r`n`r`n%bestmatchstring%
[Attachment: msgbox.PNG]
This is a really cool lib, thank you very much!
Regards, burque505
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Library] string-similarity.ahk

25 Aug 2019, 05:08

fenchai wrote:
23 Aug 2019, 12:41
pardon my ignorance, but why not post the library? I prefer just copy pasting it rather than installing with another program.
Yeah, probably a good idea, as that is the norm for AHK. It will be added to post #2.
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Library] string-similarity.ahk

25 Aug 2019, 05:41

burque505 wrote:
24 Aug 2019, 19:45
stringSimilarity.compareTwoStrings and stringSimilarity.simpleBestMatch behave as expected, but stringSimilarity.findBestMatch gives no output.
Sounds like you mostly figured it out. But for completeness for anyone else in the future let me leave an answer:
Since it returns an object, you can't just msgbox the whole object if you are looking for a string. You would do something like msgbox, % newtest.bestMatch.target or msgbox, % newtest.ratings[1].target

I tested the following as a more in-depth example:

Code: Select all

newtest := stringSimilarity.findBestMatch("foobar",["foo","bar"])

msgbox, % newtest.bestMatch.target " and " newtest.ratings[1].target " both match"
;; => "bar" for the curious

burque505
Posts: 1374
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

25 Aug 2019, 08:12

Thanks for the excellent example. I know I'm going to get asked how to discard values below a certain confidence rating (today, I'll bet ;) ), so I tried the following:

Code: Select all

newtest2 := stringSimilarity.findBestMatch("foobar",["foo","bar"])
msgbox, % "The match for 'newtest2.bestMatch.target', i.e. '" newtest2.bestMatch.target "', has a confidence rating of " newtest2.ratings[1].rating
That should get me where I need to be. :D
[Attachment: dices-2.PNG]
burque505
Posts: 1374
Joined: 22 Jan 2017, 19:37

Re: [Library] string-similarity.ahk

26 Aug 2019, 18:23

@Chunjee, saw your post on the Uipath forum, great work! I downloaded your UiPath workflow, made a very minor mod to "Dice.ahk", tweaked the properties a little and it appears to be working, and compared to Markus's original solution it is REALLY fast.
Regards,
burque505
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Library] string-similarity.ahk

27 Aug 2019, 13:33

Three ways to filter out lower ratings, for anyone who needs a hand:

Code: Select all

scoreditems := ssObj.findBestMatch("For Sale:",["Wanted - ","[WTB]","FOR SALE!","SALE:"])
;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"], 3:["rating":0, "target":"[WTB]"], 4:["rating":0, "target":"Wanted - "]
newarray := []
loop, % scoreditems.ratings.MaxIndex() {
    if (scoreditems.ratings[A_Index].rating > .60) {
        newarray.push(scoreditems.ratings[A_Index])
    }
}
scoreditems.ratings := newarray
;;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"]

Or do it all in a dedicated function:

Code: Select all

scoreditems := ssObj.findBestMatch("For Sale:",["Wanted - ","[WTB]","FOR SALE!","SALE:"])
scoreditems.ratings := filterlowRatings(scoreditems.ratings,.60)
;;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"]
filterlowRatings(para_ratingsarray,para_scorethreshold) {
    newarray := []
    loop, % para_ratingsarray.MaxIndex() {
        if (para_ratingsarray[A_Index].rating > para_scorethreshold) {
            newarray.push(para_ratingsarray[A_Index])
        }
    }
    return newarray
}


/!\ SHAMELESS PLUG INCOMING!!! /!\ :siren:
I'm working on a larger utility library that will allow one to sort out low scoring matches like so:

Code: Select all

ssObj := New stringsimilarity()
A := New biga()

scoreditems := ssObj.findBestMatch("For Sale:",["Wanted - ","[WTB]","FOR SALE!","SALE:"])
;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"], 3:["rating":0, "target":"[WTB]"], 4:["rating":0, "target":"Wanted - "]
scoreditems.ratings := A.filter(scoreditems.ratings,Func("filterLowRatings"))
;=> scoreditems.ratings = 1:["rating":0.88, "target":"FOR SALE!"], 2:["rating":0.67, "target":"SALE:"]
filterLowRatings(para_iteratee) {
    if (para_iteratee.rating >= .60) { ; 60% match or better
        return true
    }
    return false
}
That .filter method is done, but the library, on the whole, is far from done. You can download or learn more about it in this thread: https://www.autohotkey.com/boards/viewtopic.php?f=76&t=67466
I could certainly use the help expanding that out.
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Class] string-similarity.ahk

16 Nov 2019, 15:32

I noticed some typos in the documentation and fixed them today.
Last edited by Chunjee on 31 Aug 2020, 22:59, edited 1 time in total.
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Class] string-similarity.ahk

14 Mar 2020, 10:17

I made two logos a few weeks ago. They are now included in one .svg file on the git repo.

Official:
Image

Artistic Concept:
Image


If you have an idea of your own feel free to post it.
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Class] string-similarity.ahk

31 Aug 2020, 22:39

v1.0.4 has been published.

Functionally there is no change.
In the unlikely event someone isn't using SetBatchLines, -1, most methods now set it, do their work, and then restore the previous SetBatchLines value before returning the output.
Also cleaned up some of the tests.
stiuna
Posts: 8
Joined: 13 Jun 2020, 23:11

Re: [Class] string-similarity.ahk

20 Sep 2020, 17:30

Is it possible for you to offer a 100% offline version? Like a CLI API or something like that.
Chunjee
Posts: 691
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: [Class] string-similarity.ahk

20 Sep 2020, 19:49

The use of "API" in the documentation is not meant as an HTTP API; I may update that to "Usage" or something.

This is usable offline.
burque505
Posts: 1374
Joined: 22 Jan 2017, 19:37

Re: [Class] string-similarity.ahk

21 Sep 2020, 07:50

@Chunjee, thanks again for all your work on string-similarity.ahk. This is going to get a lot of traction.
Regards,
burque505
stiuna
Posts: 8
Joined: 13 Jun 2020, 23:11

Re: [Class] string-similarity.ahk

21 Sep 2020, 10:36

Chunjee wrote:
20 Sep 2020, 19:49
The use of "API" in the documentation is not meant as http API; may update that to "Usage" or something.

This is usable offline.
Ah no, I get that it works offline. What I mean is a 100% offline installation; for example, the node modules are not available for download on GitHub, only through npm.
