Sørensen–Dice coefficient Topic is solved

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
User avatar
Chunjee
Posts: 1397
Joined: 18 Apr 2014, 19:05
Contact:

Sørensen–Dice coefficient

09 Oct 2018, 14:05

10.16.18 update:
With the help of people in this thread, I put together a library that mirrors and expands on the output of string-similarity on npm.
https://github.com/Chunjee/string-similarity.ahk



I'm not sure how to turn this into a function:
Image


Does anyone have something similar or have a starting point?


Other notes:
https://en.wikipedia.org/wiki/S%C3%B8re ... oefficient
Last edited by Chunjee on 16 Oct 2018, 09:21, edited 1 time in total.
User avatar
nnnik
Posts: 4500
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Sørensen–Dice coefficient

09 Oct 2018, 14:16

Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.

Code: Select all

sharedBigrams := 0
for bigram, _ in bigramsOfFirstWord {
	if bigramsofSecondWord.hasKey(bigram) {
		sharedBigrams++
	}
}
Recommends AHK Studio
MannyKSoSo
Posts: 440
Joined: 28 Apr 2018, 21:59

Re: Sørensen–Dice coefficient

09 Oct 2018, 14:32

For the start we can look at SubStr() here https://autohotkey.com/docs/commands/SubStr.htm
This gives us a good starting point for how we want to find each set

Code: Select all

Woord1 := "pizazz"
Woord2 := "pizzaz"

LoopCount = 0
Length := StrLen(Woord1)
Loop, % Length-1 {
	Woord1_%A_Index% := SubStr(Woord1, A_Index, 2)
	Woord2_%A_Index% := SubStr(Woord2, A_Index, 2)
	LoopCount := A_Index
}
Loop, %LoopCount%
	MsgBox % Woord1_%A_Index% ":" Woord2_%A_Index%
Note with my example I used six letter words, but this works with the example you gave. Then the rest should be fairly easy as you loop through the total count and just compare each string
User avatar
Chunjee
Posts: 1397
Joined: 18 Apr 2014, 19:05
Contact:

Re: Sørensen–Dice coefficient

09 Oct 2018, 14:40

nnnik wrote:Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.
This is what I was working with but I'm having trouble returning a number between 0 and 1 for similarity.
User avatar
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Sørensen–Dice coefficient  Topic is solved

09 Oct 2018, 14:40

- Something like this perhaps.
- Note: the Wikipedia article didn't mention whether duplicate bigrams should be discarded. A Wikibooks link (below) mentions that if you did discard duplicates, then 'AA' and 'AAAA' would be considered identical. Thus duplicates should be maintained. Cheers.

Code: Select all

;Sørensen–Dice coefficient - Wikipedia
;https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

;note:
;duplicate bigrams in a word should be counted distinctly
;(per discussion), otherwise 'AA' and 'AAAA' would have a
;dice coefficient of 1
;Algorithm Implementation/Strings/Dice's coefficient - Wikibooks, open books for an open world
;https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient

q:: ;Sørensen-Dice coefficient
vText1 := "night", vText2 := "nacht"
;vText1 := "AA", vText2 := "AAAA"
vCount := 0
oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
Loop, % vCount1 := StrLen(vText1) - 1
	oArray["z" SubStr(vText1, A_Index, 2)]++
Loop, % vCount2 := StrLen(vText2) - 1
	if (oArray["z" SubStr(vText2, A_Index, 2)] > 0)
	{
		oArray["z" SubStr(vText2, A_Index, 2)]--
		vCount++
	}
vDSC := (2 * vCount) / (vCount1 + vCount2)
MsgBox, % vCount " " vCount1 " " vCount2 "`r`n" vDSC
return
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
User avatar
nnnik
Posts: 4500
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Sørensen–Dice coefficient

09 Oct 2018, 14:55

Yeah but tbh this method is not worth it.
The words ste and set share no similarities according to this method.

@jeeswg:
while keeping the second word in a string format rather than turning it into an object might be faster when doing a single comparison.
However it is more likely that said comparison will be used to find an entry on a list of strings.
Once you have multiple entries arrays will be faster than strings.
Recommends AHK Studio
User avatar
nnnik
Posts: 4500
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Sørensen–Dice coefficient

09 Oct 2018, 15:01

For the levenshtein distance I could dig out the following:
https://stackoverflow.com/questions/387 ... make-sense
Recommends AHK Studio
User avatar
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Sørensen–Dice coefficient

09 Oct 2018, 15:11

@nnnik: Didn't you already write a function?
Damerau-Levenshtein Distance - Fuzzy searches - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=39112
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
User avatar
nnnik
Posts: 4500
Joined: 30 Sep 2013, 01:01
Location: Germany

Re: Sørensen–Dice coefficient

09 Oct 2018, 15:18

Yeah I did - but I didn't like the results.
The topic mentions one of the basic problems of the levenshtein distance.
The jaro winkler distance seems to be better for these tasks.
Recommends AHK Studio
User avatar
Chunjee
Posts: 1397
Joined: 18 Apr 2014, 19:05
Contact:

Re: Sørensen–Dice coefficient

10 Oct 2018, 14:30

nnnik wrote:The jaro winkler distance seems to be better for these tasks.
I'm interested in this as well but not smart enough to write it myself.



Here's what I came up with trying to duplicate the output of the javascript library string-similarity:

Code: Select all

Class stringsimilarity {
	; #Include ../../lib/sort_arrays/export.ahk

	__New() {
        this.info_Array
	}


    compareTwoStrings(para_string1,para_string2) {
        ;Sørensen-Dice coefficient
        vCount := 0
        oArray := {}
        oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
        Loop, % vCount1 := StrLen(para_string1) - 1
            oArray["z" SubStr(para_string1, A_Index, 2)]++
        Loop, % vCount2 := StrLen(para_string2) - 1
            if (oArray["z" SubStr(para_string2, A_Index, 2)] > 0) {
                oArray["z" SubStr(para_string2, A_Index, 2)]--
                vCount++
            }
        vDSC := (2 * vCount) / (vCount1 + vCount2)
        ; MsgBox, % vCount " " vCount1 " " vCount2 "`r`n" vDSC
        return Round(vDSC,2)
    }


    findBestMatch(para_string,para_array) {
        if (!IsObject(para_array)) {
            return false
        }
        this.info_Array := []

        ; Score each option and save into a new array
        loop, % para_array.MaxIndex() {
            this.info_Array[A_Index, "rating"] := this.compareTwoStrings(para_string, para_array[A_Index])
            this.info_Array[A_Index, "target"] := para_array[A_Index]
        }

        ;sort the scored array and return the bestmatch
        sortedArray := Sort2DArrayFast(this.info_Array,"rating", false) ;false reverses the order so the highest scoring is at the top
        oObject := {bestMatch:sortedArray[1], ratings:sortedArray}
        return oObject
    }


    simpleBestMatch(para_string,para_array) {
        if (!IsObject(para_array)) {
            return false
        }

        l_array := this.findBestMatch(para_string,para_array)
        return l_array.bestMatch.target
    }
}
it has a dependency so I still need to figure out how to package all that up.
User avatar
FanaticGuru
Posts: 1905
Joined: 30 Sep 2013, 22:25

Re: Sørensen–Dice coefficient

10 Oct 2018, 15:02

nnnik wrote:Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.

Code: Select all

sharedBigrams := 0
for bigram, _ in bigramsOfFirstWord {
	if bigramsofSecondWord.hasKey(bigram) {
		sharedBigrams++
	}
}
If people want to play around with Ngram, here is a function for it:
https://autohotkey.com/boards/viewtopic.php?t=7302

I have had great success using it as a fuzzy find to search for text in a large text haystack.

FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts
AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon
Hotstring Manager - Create and Manage Hotstrings
[Class] WinHook - Create Window Shell Hooks and Window Event Hooks

Return to “Ask for Help (v1)”

Who is online

Users browsing this forum: Google [Bot], Greg_Roberts, inseption86, Lamron750, Rohwedder, william_ahk and 191 guests