10.16.18 update:
With the help of people in this thread, I put together a library that mirrors and expands on the output of string-similarity on npm.
https://github.com/Chunjee/string-similarity.ahk
I'm not sure how to turn this into a function:
Does anyone have something similar or have a starting point?
Other notes:
https://en.wikipedia.org/wiki/S%C3%B8re ... oefficient
Sørensen–Dice coefficient Topic is solved
Sørensen–Dice coefficient
Last edited by Chunjee on 16 Oct 2018, 09:21, edited 1 time in total.
Re: Sørensen–Dice coefficient
Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.
Code: Select all
sharedBigrams := 0
for bigram, _ in bigramsOfFirstWord {
if bigramsofSecondWord.hasKey(bigram) {
sharedBigrams++
}
}
Recommends AHK Studio
-
- Posts: 440
- Joined: 28 Apr 2018, 21:59
Re: Sørensen–Dice coefficient
For the start we can look at SubStr() here https://autohotkey.com/docs/commands/SubStr.htm
This gives us a good starting point for how we want to find each set
Note with my example I used six letter words, but this works with the example you gave. Then the rest should be fairly easy as you loop through the total count and just compare each string
This gives us a good starting point for how we want to find each set
Code: Select all
Woord1 := "pizazz"
Woord2 := "pizzaz"
LoopCount = 0
Length := StrLen(Woord1)
Loop, % Length-1 {
Woord1_%A_Index% := SubStr(Woord1, A_Index, 2)
Woord2_%A_Index% := SubStr(Woord2, A_Index, 2)
LoopCount := A_Index
}
Loop, %LoopCount%
MsgBox % Woord1_%A_Index% ":" Woord2_%A_Index%
Re: Sørensen–Dice coefficient
This is what I was working with but I'm having trouble returning a number between 0 and 1 for similarity.nnnik wrote:Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.
Re: Sørensen–Dice coefficient Topic is solved
- Something like this perhaps.
- Note: the Wikipedia article didn't mention whether duplicate bigrams should be discarded. A Wikibooks link (below) mentions that if you did discard duplicates, then 'AA' and 'AAAA' would be considered identical. Thus duplicates should be maintained. Cheers.
- Note: the Wikipedia article didn't mention whether duplicate bigrams should be discarded. A Wikibooks link (below) mentions that if you did discard duplicates, then 'AA' and 'AAAA' would be considered identical. Thus duplicates should be maintained. Cheers.
Code: Select all
;Sørensen–Dice coefficient - Wikipedia
;https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
;note:
;duplicate bigrams in a word should be counted distinctly
;(per discussion), otherwise 'AA' and 'AAAA' would have a
;dice coefficient of 1
;Algorithm Implementation/Strings/Dice's coefficient - Wikibooks, open books for an open world
;https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
q:: ;Sørensen-Dice coefficient
vText1 := "night", vText2 := "nacht"
;vText1 := "AA", vText2 := "AAAA"
vCount := 0
oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
Loop, % vCount1 := StrLen(vText1) - 1
oArray["z" SubStr(vText1, A_Index, 2)]++
Loop, % vCount2 := StrLen(vText2) - 1
if (oArray["z" SubStr(vText2, A_Index, 2)] > 0)
{
oArray["z" SubStr(vText2, A_Index, 2)]--
vCount++
}
vDSC := (2 * vCount) / (vCount1 + vCount2)
MsgBox, % vCount " " vCount1 " " vCount2 "`r`n" vDSC
return
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
Re: Sørensen–Dice coefficient
Yeah but tbh this method is not worth it.
The words ste and set share no similarities according to this method.
@jeeswg:
while keeping the second word in a string format rather than turning it into an object might be faster when doing a single comparison.
However it is more likely that said comparison will be used to find an entry on a list of strings.
Once you have multiple entries arrays will be faster than strings.
The words ste and set share no similarities according to this method.
@jeeswg:
while keeping the second word in a string format rather than turning it into an object might be faster when doing a single comparison.
However it is more likely that said comparison will be used to find an entry on a list of strings.
Once you have multiple entries arrays will be faster than strings.
Recommends AHK Studio
Re: Sørensen–Dice coefficient
For the levenshtein distance I could dig out the following:
https://stackoverflow.com/questions/387 ... make-sense
https://stackoverflow.com/questions/387 ... make-sense
Recommends AHK Studio
Re: Sørensen–Dice coefficient
@nnnik: Didn't you already write a function?
Damerau-Levenshtein Distance - Fuzzy searches - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=39112
Damerau-Levenshtein Distance - Fuzzy searches - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=39112
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
Re: Sørensen–Dice coefficient
Yeah I did - but I didn't like the results.
The topic mentions one of the basic problems of the levenshtein distance.
The jaro winkler distance seems to be better for these tasks.
The topic mentions one of the basic problems of the levenshtein distance.
The jaro winkler distance seems to be better for these tasks.
Recommends AHK Studio
Re: Sørensen–Dice coefficient
I'm interested in this as well but not smart enough to write it myself.nnnik wrote:The jaro winkler distance seems to be better for these tasks.
Here's what I came up with trying to duplicate the output of the javascript library string-similarity:
Code: Select all
Class stringsimilarity {
; #Include ../../lib/sort_arrays/export.ahk
__New() {
this.info_Array
}
compareTwoStrings(para_string1,para_string2) {
;Sørensen-Dice coefficient
vCount := 0
oArray := {}
oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
Loop, % vCount1 := StrLen(para_string1) - 1
oArray["z" SubStr(para_string1, A_Index, 2)]++
Loop, % vCount2 := StrLen(para_string2) - 1
if (oArray["z" SubStr(para_string2, A_Index, 2)] > 0) {
oArray["z" SubStr(para_string2, A_Index, 2)]--
vCount++
}
vDSC := (2 * vCount) / (vCount1 + vCount2)
; MsgBox, % vCount " " vCount1 " " vCount2 "`r`n" vDSC
return Round(vDSC,2)
}
findBestMatch(para_string,para_array) {
if (!IsObject(para_array)) {
return false
}
this.info_Array := []
; Score each option and save into a new array
loop, % para_array.MaxIndex() {
this.info_Array[A_Index, "rating"] := this.compareTwoStrings(para_string, para_array[A_Index])
this.info_Array[A_Index, "target"] := para_array[A_Index]
}
;sort the scored array and return the bestmatch
sortedArray := Sort2DArrayFast(this.info_Array,"rating", false) ;false reverses the order so the highest scoring is at the top
oObject := {bestMatch:sortedArray[1], ratings:sortedArray}
return oObject
}
simpleBestMatch(para_string,para_array) {
if (!IsObject(para_array)) {
return false
}
l_array := this.findBestMatch(para_string,para_array)
return l_array.bestMatch.target
}
}
- FanaticGuru
- Posts: 1905
- Joined: 30 Sep 2013, 22:25
Re: Sørensen–Dice coefficient
If people want to play around with Ngram, here is a function for it:nnnik wrote:Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.Code: Select all
sharedBigrams := 0 for bigram, _ in bigramsOfFirstWord { if bigramsofSecondWord.hasKey(bigram) { sharedBigrams++ } }
https://autohotkey.com/boards/viewtopic.php?t=7302
I have had great success using it as a fuzzy find to search for text in a large text haystack.
FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts
AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon
Hotstring Manager - Create and Manage Hotstrings
[Class] WinHook - Create Window Shell Hooks and Window Event Hooks
AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon
Hotstring Manager - Create and Manage Hotstrings
[Class] WinHook - Create Window Shell Hooks and Window Event Hooks
Who is online
Users browsing this forum: gongnl, mikeyww, morkovka18 and 213 guests