# SIFT3 : Super Fast and Accurate string distance algorithm

11 replies to this topic

### #1 toralf

toralf
• Fellows
• 3948 posts

Posted 20 June 2010 - 11:05 AM

The original algorithm can be found here:
<!-- m -->http://siderite.blog... ... tance.html<!-- m -->

It is about 10 times faster then the Levenshtein distance.

It translates to this
```MsgBox % Distance( "A H K", "A H Kn" )
MsgBox % Distance( "A H K", "A H K" )
MsgBox % Distance( "A H K", "A h K" )
MsgBox % Distance( "AHK", "" )
MsgBox % Distance( "He", "Ben" )
MsgBox % Distance( "Toralf", "ToRalf" )
MsgBox % Distance( "Toralf", "esromneb" )
MsgBox % Distance( "Toralf", "RalfLaDuce" )

Distance(string1, string2, maxOffset=5) {
StringSplit, n, string1
StringSplit, m, string2
IfEqual n0,0, Return m0
IfEqual m0,0, Return n0
c = 1
offset1 = 0
offset2 = 0
lcs = 0
While((c + offset1 <= n0) && (c + offset2 <= m0)) {
ni := c + offset1
mi := c + offset2
If (n%ni% == m%mi%)
lcs++
else{
offset1 = 0
offset2 = 0
Loop, % maxOffset  {
oi := c + A_Index
if ((oi < n0) && (n%oi% == m%c%)) {
offset1 := A_Index
break
}
if ((oi < m0) && (n%c% == m%oi%)) {
offset2 := A_Index
break
}
}
}
c++
}
Return (n0 + m0)/2 - lcs
}
```
I haven't checked it completly yet, might have bugs.

EDIT: Found one bug: fixed, last char wasn't considered

Is there a way to make it faster?

### #2 toralf

toralf
• Fellows
• 3948 posts

Posted 20 June 2010 - 03:25 PM

I changed the code a little bit:

It now returns the difference between the strings as a float between 0 and 1. 0 means strings are identical. 1 means they have nothing in common.
```MsgBox % Difference( "A H K", "A H Kn" )       ;0.083333
MsgBox % Difference( "A H K", "A H K" )        ;0.000000
MsgBox % Difference( "A H K", "A h K" )        ;0.040000
MsgBox % Difference( "AHK", "" )               ;1.000000
MsgBox % Difference( "He", "Ben" )             ;0.500000
MsgBox % Difference( "Toralf", "ToRalf" )      ;0.033333
MsgBox % Difference( "Toralf", "esromneb" )    ;0.750000
MsgBox % Difference( "Toralf", "RalfLaDuce" )  ;0.420000

;basic SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;Own idea: when character only differ in case, LSC is a 0.8 match for this character
Difference(string1, string2, maxOffset=5) {    ;returns a float: from 0 = identical to 1 = nothing in common
StringSplit, n, string1
StringSplit, m, string2
IfEqual n0,0, Return 1/1
IfEqual m0,0, Return 1/1
c = 1
offset1 = 0
offset2 = 0
lcs = 0
While((c + offset1 <= n0) && (c + offset2 <= m0)) {
ni := c + offset1
mi := c + offset2
If (n%ni% == m%mi%)
lcs++
Else If (n%ni% = m%mi%)
lcs += 0.8
Else{
offset1 = 0
offset2 = 0
Loop, %maxOffset%  {
oi := c + A_Index
if ((oi < n0) && (n%oi% = m%c%)) {
offset1 := A_Index
lcs += (n%oi% == m%c% ? 1 : 0.8)
break
}
if ((oi < m0) && (n%c% = m%oi%)) {
offset2 := A_Index
lcs += (n%c% == m%oi% ? 1 : 0.8)
break
}
}
}
c++
}
Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}
```

But e.g. how could I find "é" compared with "e" to be a nearly match? Do these char have a fixed offset from their base character?

Edit: 2010-06-21: Improved code to catch different case

### #3 fred

fred
• Members
• 257 posts

Posted 20 June 2010 - 03:56 PM

Nice, string matching the human way (more or less).

But e.g. how could I find "é" compared with "e" to be a nearly match?
Do these char have a fixed offset from their base character?

I don't think so, for e dec. 101 are there are 4 variations dec. 232-235 but for a dec. 96 you have 6 variations dec. 224-229.
Perhaps replacing the special chars to normal chars before calculating?

### #4 toralf

toralf
• Fellows
• 3948 posts

Posted 21 June 2010 - 02:13 PM

I changed and optimized the code quite a bit.
It is now 30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance.
Even though that it might lead to different results then the SIFT3 code, it is accurate enough I hope to find similar strings.

```;basic idea for SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;Own work:
; - when character only differ in case, LSC is a 0.8 match for this character
; - modified code for speed, might lead to different results compared to original code
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance)
;http://www.autohotkey.com/forum/topic59407.html
Difference(string1, string2, maxOffset=5) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common"
If (string1 = string2)
Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case
If (string1 = "" OR string2 = "")
Return (string1 = string2 ? 0/1 : 1/1)
StringSplit, n, string1
StringSplit, m, string2
ni := 1, mi := 1, lcs := 0
While((ni <= n0) AND (mi <= m0)) {
If (n%ni% == m%mi%)
Else If (n%ni% = m%mi%)
Else{
Loop, %maxOffset%  {
oi := ni + A_Index, pi := mi + A_Index
If ((n%oi% = m%mi%) AND (oi <= n0)){
ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8)
Break
}
If ((n%ni% = m%pi%) AND (pi <= m0)){
mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8)
Break
}
}
}
}
Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}```

### #5 SFdude

SFdude
• Members
• 4 posts

Posted 02 October 2010 - 11:07 PM

Hi Toralf!

(you posted on: Mon Jun 21, 2010, see above my post).

But when executing the AHK script,
I get this error message from AHK:

" Error at Line 27.
Line Text: While((ni <= n0) AND (mi <= m0))
Error: Functions cannot contain Functions.
Program will exit "

...and it does indeed exit immediately!.

I must be missing something super-simple in AHK....
Included your AHK script, below my signature.

Help! Hilfe! (is there a complete working example?)
SFdude

```MsgBox % Difference( "A H K", "A H K" )        ;0.000000
MsgBox % Difference( "A H K", "A h K" )        ;0.040000
MsgBox % Difference( "AHK", "" )               ;1.000000
MsgBox % Difference( "He", "Ben" )             ;0.500000
MsgBox % Difference( "Toralf", "ToRalf" )      ;0.033333
MsgBox % Difference( "Toralf", "esromneb" )    ;0.750000
MsgBox % Difference( "Toralf", "RalfLaDuce" )  ;0.420000

;basic idea for SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;Own work:
; - when character only differ in case, LSC is a 0.8 match for this character
; - modified code for speed, might lead to different results compared to original code
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance)
;http://www.autohotkey.com/forum/topic59407.html
Difference(string1, string2, maxOffset=5) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common"
If (string1 = string2)
Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case
If (string1 = "" OR string2 = "")
Return (string1 = string2 ? 0/1 : 1/1)
StringSplit, n, string1
StringSplit, m, string2
ni := 1, mi := 1, lcs := 0
While((ni <= n0) AND (mi <= m0)) {
If (n%ni% == m%mi%)
Else If (n%ni% = m%mi%)
Else{
Loop, %maxOffset%  {
oi := ni + A_Index, pi := mi + A_Index
If ((n%oi% = m%mi%) AND (oi <= n0)){
ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8)
Break
}
If ((n%ni% = m%pi%) AND (pi <= m0)){
mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8)
Break
}
}
}
}
Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}
```

### #6 Tuncay

Tuncay
• Members
• 1943 posts

Posted 02 October 2010 - 11:58 PM

Ok simple questions first... Do you have the newest AutoHotkey version?

What does
```#NoEnv
SendMode Input
SetWorkingDir %A_ScriptDir%
MsgBox % A_AhkVersion```
report?

### #7 SFdude

SFdude
• Members
• 4 posts

Posted 03 October 2010 - 12:14 AM

Thanks Tuncay for that quick reply.

You are right!!

btw:
Do you know of any JS implementation
of SIFT3?

thanks!
SFdude

### #8 Tuncay

Tuncay
• Members
• 1943 posts

Posted 03 October 2010 - 12:26 AM

Thanks Tuncay for that quick reply.

No problem. In this community you get in general fast answers (if its not too specific or late).

Do you know of any JS implementation of SIFT3?

No sorry, I did not search for it. Never needed.

### #9 SFdude

SFdude
• Members
• 4 posts

Posted 03 October 2010 - 12:36 AM

Thanks again Tuncay!
best -

SFdude

### #10 fragman

fragman
• Members
• 1591 posts

Posted 04 October 2010 - 07:21 PM

Thanks for this nice algorithm, I now use it for a system that features program launcher like launchy, window switcher, uninstaller,... etc.

I had to write a wrapper around it to make it do fuzzy search:
```FuzzySearch(string1, string2)
{
lenl := StrLen(string1)
lens := StrLen(string2)
if(lenl > lens)
{
shorter := string2
longer := string1
}
else if(lens > lenl)
{
shorter := string1
longer := string2
lens := lenl
lenl := StrLen(string2)
}
else
return StringDifference(string1, string2)
min := 1
Loop % lenl - lens + 1
{
distance := StringDifference(shorter, SubStr(longer, A_Index, lens))
if(distance < min)
min := distance
}
return min
}
;By Toralf:
;basic idea for SIFT3 code by Siderite Zackwehdex
;http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
;took idea to normalize it to longest string from Brad Wood
;Own work:
; - when character only differ in case, LSC is a 0.8 match for this character
; - modified code for speed, might lead to different results compared to original code
; - optimized for speed (30% faster then original SIFT3 and 13.3 times faster than basic Levenshtein distance)
;http://www.autohotkey.com/forum/topic59407.html
StringDifference(string1, string2, maxOffset=1) {    ;returns a float: between "0.0 = identical" and "1.0 = nothing in common"
If (string1 = string2)
Return (string1 == string2 ? 0/1 : 0.2/StrLen(string1))    ;either identical or (assumption:) "only one" char with different case
If (string1 = "" OR string2 = "")
Return (string1 = string2 ? 0/1 : 1/1)
StringSplit, n, string1
StringSplit, m, string2
ni := 1, mi := 1, lcs := 0
While((ni <= n0) AND (mi <= m0)) {
If (n%ni% == m%mi%)
Else If (n%ni% = m%mi%)
Else{
Loop, %maxOffset%  {
oi := ni + A_Index, pi := mi + A_Index
If ((n%oi% = m%mi%) AND (oi <= n0)){
ni := oi, lcs += (n%oi% == m%mi% ? 1 : 0.8)
Break
}
If ((n%ni% = m%pi%) AND (pi <= m0)){
mi := pi, lcs += (n%ni% == m%pi% ? 1 : 0.8)
Break
}
}
}
}
Return ((n0 + m0)/2 - lcs) / (n0 > m0 ? n0 : m0)
}```

As you can see, I renamed your function because I think it's more fitting and unique.

### #11 Guests

• Guests

Posted 24 May 2012 - 06:16 PM

after reading the speed comment in the node thread I was wondering if this could be done with mCode?

### #12 fragman

fragman
• Members
• 1591 posts

Posted 24 May 2012 - 10:16 PM

Good idea! It seems likely, but I haven't looked into generating MCode yet. I'd need Unicode x86/x64 versions of it.