Identifying unique strings in a block of text

ineuw · 12 Sep 2019, 19:30

How can strings be uniquely identified if both share the same word? "Beaumont, Crón. Mich" and "Beaumont" both contain the word "Beaumont", but they appear in different parts of the block of text.

boiler · 12 Sep 2019, 20:47

You search throughout for the longer string and keep a list of the positions where it is found. Then you search the same text for the shorter string, and you only consider it a match if it doesn't fall on a position in the first list.
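
A minimal sketch of that logic, written in Python rather than AutoHotkey purely for illustration. The function names `find_all` and `classify_matches` are hypothetical and the sample text is made up; treating "position in the first list" as "anywhere inside a longer match" is an assumption.

```python
def find_all(haystack, needle):
    """Return the start index of every non-overlapping occurrence of needle."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + len(needle))
    return positions


def classify_matches(text, long_needle, short_needle):
    """Find both needles, dropping short matches that lie inside a long match."""
    long_hits = find_all(text, long_needle)
    short_hits = []
    for pos in find_all(text, short_needle):
        inside_long = any(
            start <= pos < start + len(long_needle) for start in long_hits
        )
        if not inside_long:
            short_hits.append(pos)
    return long_hits, short_hits


# Made-up sample text containing both forms.
sample = "See Beaumont, Crón. Mich, vol. 2; also Beaumont on this."
long_hits, short_hits = classify_matches(sample, "Beaumont, Crón. Mich", "Beaumont")
print(long_hits, short_hits)
```

The same idea should carry over to InStr() with its starting-position parameter, looping until it returns 0.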

ineuw · 12 Sep 2019, 21:15

boiler wrote: ↑
12 Sep 2019, 20:47
You search throughout for the longer string and keep a list of the positions where it is found. Then you search the same text for the shorter string, and you only consider it a match if it doesn't fall on a position in the first list.

Thanks for the advice. Are you aware of a code snippet to look at?

boiler · 12 Sep 2019, 22:36

No, just came up with the logic. I'm not aware if anyone has done it. It's pretty straightforward code, though.

TheArkive · 13 Sep 2019, 04:43

ineuw wrote: ↑
12 Sep 2019, 21:15
Thanks for the advice. Are you aware of a code snippet to look at?

What you are looking for is InStr(). Specifically:

Code:

MyVar := InStr("some block of text","block")
MsgBox %MyVar%

The above MsgBox message would display "6" because the word "block" starts on character position 6 in the string "some block of text".

How you put this together in a script, and what you do with it depends on what you are trying to do.

Chunjee · 13 Sep 2019, 10:31

I'm not sure what kind of output you are looking for, but if you're just trying to figure out "is {{x}} anywhere inside this other string", then InStr() would be perfect for you.

Alternatively, you may also be interested in https://www.npmjs.com/package/string-similarity.ahk, which returns a "percentage" of similarity between two strings. That could be useful if there are slight misspellings in your dataset. The bigger the haystack, the smaller that similarity will be, so in the example below, feeding in the entire string returns .72. Further down, we split the string into smaller pieces and check each word, which yields a more valid check.

Code:

#Include %A_ScriptDir%\lib\stringsimilarity.ahk\export.ahk
stringSimilarity := new stringsimilarity()

msgbox, % stringSimilarity.compareTwoStrings("Beaumont, Crón. Mich", "Beaumont")
; => 0.72

strings_array := StrSplit("Beaumont, Crón. Mich", " ")
loop, % strings_array.MaxIndex() {
    if (stringSimilarity.compareTwoStrings(strings_array[A_Index], "Beaumont") = 1) {
        msgbox, % "Beaumont found! in the given string"
        continue
    }
    if (stringSimilarity.compareTwoStrings(strings_array[A_Index], "Beaumont") > .90) {
        msgbox, % "Beaumont very likely in the given string! But not an exact match."
    }
}

boiler · 13 Sep 2019, 12:36

Chunjee wrote: ↑
13 Sep 2019, 10:31
I'm not sure what kind of output you are looking for, but if you're just trying to figure out "is {{x}} anywhere inside this other string", then InStr() would be perfect for you.

I believe he doesn’t want to know if it is anywhere in the other string, because he already knows that. He wants to find both in a larger text but differentiate between the two. That’s why I suggested the approach I did, which would find both but separate the two.

ineuw · 16 Sep 2019, 21:51

Thanks to all for the comments and the guidance. I was tied up with life for the past few days, that is why I couldn't reply.

InStr(), StrReplace(), and if statements are sufficient to resolve my issue.

The text is generated by OCR from a page of ~600 words of an old book. The project contains 6 volumes of 700 pages each. The errors are corrected and the page is proofread. The search strings are identical to the text being searched. My work is very similar to that of the Distributed Proofreaders of Gutenberg.

I have already automated the generation of a simple script in which the search and replace strings are embedded in the code. New strings are added to the list constantly, and a new list is generated once or twice daily.

So far, I know the word lengths and how many times they occur in the page of text. Now I am working on calculating the starting positions of the search strings. The following is a snippet of the automatically generated AutoHotkey script. But it only replaces exact occurrences of the strings, which is wrong. I must insert code to tell the cases apart when searching and replacing, so that a search for a word like "Id" does not replace occurrences inside words like forb''Id''den and outs''Id''e, or where "Id" occurs in longer search strings like . . .

Id.
Id,
Id.,
Id., 4a Rel
Id., Chron
Id., Cortés
Id., Gob. Mex
Id., Hist. Mex
Id., Menologia
Id., Palestra
Id., Puebla
Id., Teatro Mex
Id., Trat. Mex

Code:

in_put :="1a Rel. Anón"
out_put :="''1a Rel. Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)

clipwait, 5

if (out_len = in_len + 4)
{
clipboard :=strreplace(clipboard, in_put, out_put)
}

in_put :="2a Rel, Anón"
out_put :="''2a Rel, Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)

clipwait, 5

if (out_len = in_len + 4)
{
clipboard :=strreplace(clipboard, in_put, out_put)
}

in_put :="3a Bel. Anón"
out_put :="''3a Bel. Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)

clipwait, 5

if (out_len = in_len + 4)
{
clipboard :=strreplace(clipboard, in_put, out_put)
}

in_put :="4a Rel. Anón"
out_put :="''4a Rel. Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)

clipwait, 5

if (out_len = in_len + 4)
{
clipboard :=strreplace(clipboard, in_put, out_put)
}

in_put :="Aa, Nanukeurige Versameling"
out_put :="''Aa, Nanukeurige Versameling''"
in_len := strlen(in_put)
out_len := strlen(out_put)

clipwait, 5

if (out_len = in_len + 4)
{
clipboard :=strreplace(clipboard, in_put, out_put)
}

in_put :="Alaman, Disert"
out_put :="''Alaman, Disert''"
in_len := strlen(in_put)
out_len := strlen(out_put)

clipwait, 5

if (out_len = in_len + 4)
{
clipboard :=strreplace(clipboard, in_put, out_put)
}

Kobaltauge · 16 Sep 2019, 23:37

Just a quick tip, I'm not fully awake.

Why not search for "Id "? It wouldn't find the letters in the middle of a word. Regex can do some more magic for the "needle".

Additionally, you could create a "dictionary" and search and replace element by element. That way you would only curate the dictionary and would not be copying the whole code block for each substitution.

Kobaltauge · 17 Sep 2019, 00:34

Hi.
This is my shot. But I didn't see the whole problem with "Id". My tip with the "space" doesn't work when "Id" is the only thing in a line. =(

Code:

dict := { "1a Rel. Anón" : "''1a Rel. Anón''"
		,"2a Rel, Anón" : "''2a Rel, Anón''"
		,"3a Bel. Anón" : "''3a Bel. Anón''"
		, "4a Rel. Anón" : "''4a Rel. Anón''"
		, "Aa, Nanukeurige Versameling" : "''Aa, Nanukeurige Versameling''"
		, "Alaman, Disert" : "''Alaman, Disert''"}

for in_put, out_put in dict
{
	clipboard :=strreplace(clipboard, in_put, out_put)
}

ineuw · 17 Sep 2019, 01:30

Kobaltauge wrote: ↑
16 Sep 2019, 23:37
Just a quick tip, I'm not fully awake.

Why not search for "Id "? It wouldn't find the letters in the middle of a word. Regex can do some more magic for the "needle".

Additionally, you could create a "dictionary" and search and replace element by element. That way you would only curate the dictionary and would not be copying the whole code block for each substitution.

I hope that by the time you read this you are well rested. I know that Regex would be the best tool, but I know nothing about how to construct what I need, even though I am loaded with Regex info links.

My search string list is a "dictionary". I stripped the punctuation from the end of the search strings because it is irrelevant there, but I cannot remove it when it appears in the middle of a string. The "Id " you referred to is really "Id., " or "Id. " or "Id, " or "Id.,". Adding a space beyond this makes no difference because they are all terminated with a space. I will try to post the dictionary on pastebin, but at the moment they are overloaded. Please look at the "Id" samples again; I corrected them.
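
One possible way to keep the short entries from matching inside the longer ones, sketched in Python as an assumption rather than taken from the actual script: sort the search strings longest first and join them into a single alternation, so that at any position the longest entry is tried first. The name `italicize` and the sample sentence are hypothetical; the entries are a few of the "Id" samples. Python matching is case-sensitive by default, so a lowercase "id" inside a word is not touched here either.

```python
import re

# A few entries from the "Id" samples; the real dictionary is much larger.
entries = ["Id", "Id., Chron", "Id., Hist. Mex", "Id., Cortés"]


def italicize(text, entries):
    """Wrap each entry in '' markers, preferring the longest entry at each position."""
    # Longest first, so "Id., Chron" is tried before plain "Id".
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    # (?<!\w) and (?!\w) keep "Id" from matching inside words such as "Idem".
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern.sub(lambda m: "''" + m.group(0) + "''", text)


sample = "Idem is left alone. Id., Chron. 12; Id., 45; Id. 7."
print(italicize(sample, entries))
```

Because everything happens in one pass, text that has already been wrapped is not scanned again, so a short entry cannot be wrapped a second time inside a longer one.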

Kobaltauge · 17 Sep 2019, 13:00

OK. This was a hard one for me, but I learned a lot.

Code:

dict := { "1a Rel. Anón" : "''1a Rel. Anón''"
		,"2a Rel, Anón" : "''2a Rel, Anón''"
		,"3a Bel. Anón" : "''3a Bel. Anón''"
		, "4a Rel. Anón" : "''4a Rel. Anón''"
		, "Aa, Nanukeurige Versameling" : "''Aa, Nanukeurige Versameling''"
		, "Alaman, Disert" : "''Alaman, Disert''"
		, "Id" : "-Id-"}

for in_put, out_put in dict
{
	clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)
	
}

I replaced StrReplace with RegExReplace.
m) = sets the regex to multiline mode.
\b = a word boundary; here it marks the beginning of the match.
. in_put . = the syntax for concatenating the variable into the regex needle.
(\s|$) = after the "in_put" there must follow a whitespace character "\s" or "|" the end of a line or of the clipboard "$".
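
Two details may be worth checking before relying on this pattern. First, the search strings contain ".", which is a regex metacharacter that matches any character, so the keys probably need escaping. Second, the (\s|$) group is part of the match, so the whitespace it matches gets replaced together with the search string. A hedged Python sketch of one way around both, using `re.escape` and a lookahead; `replace_whole` is a hypothetical name and the sample text is made up:

```python
import re


def replace_whole(text, needle, replacement):
    """Replace needle only when it starts at a word boundary and is followed
    by whitespace or end of line, leaving that whitespace in place."""
    # re.escape makes "." and other metacharacters match literally.
    # (?=\s|$) is a lookahead: it checks what follows without consuming it.
    pattern = re.compile(r"\b" + re.escape(needle) + r"(?=\s|$)", re.MULTILINE)
    # A function replacement avoids backslash processing in the replacement text.
    return pattern.sub(lambda m: replacement, text)


sample = "1a Rel. Anón says so.\n1a Relx Anón is a different string."
print(replace_whole(sample, "1a Rel. Anón", "''1a Rel. Anón''"))
```

In AutoHotkey, whose regex engine is PCRE, wrapping the key in \Q...\E and using the same (?=\s|$) lookahead should have a similar effect, though that is untested here.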

hth

Chunjee · 17 Sep 2019, 13:17

Kobaltauge wrote: ↑
17 Sep 2019, 13:00
OK. This was a hard one for me, but I learned a lot.

Well done

ineuw · 17 Sep 2019, 16:45

Kobaltauge wrote: ↑
17 Sep 2019, 13:00
OK. This was a hard one for me, but I learned a lot.

Kobaltauge, many thanks for your help, especially for the Regex explanations. I only have one question about RegExReplace(). The brackets () are not balanced (an opening bracket is missing), and I don't know if this was intentional. If not, where would the missing bracket be placed?

clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)

Kobaltauge · 17 Sep 2019, 16:59

ineuw wrote: ↑
17 Sep 2019, 16:45
clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)

Yeah, that's intentional. All letters before the unbalanced ")" are options for RegExReplace in AHK syntax. See: https://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm#Options
"m" is the multiline option: it makes "^" and "$" match at the start and end of every line. "i", for example, makes the match case-insensitive.
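
For comparison, Python expresses the same options as flags rather than as a prefix before ")". A small sketch of what the multiline option changes for "$" (the sample text is made up):

```python
import re

text = "Id\nforbidden\nsee Id"

# Without MULTILINE, "$" only matches at the very end of the text.
single = re.findall(r"\bId$", text)
# With MULTILINE (the counterpart of the "m)" prefix), "$" also matches
# at the end of every line.
multi = re.findall(r"\bId$", text, flags=re.MULTILINE)

print(len(single), len(multi))
```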

ineuw · 17 Sep 2019, 17:05

Kobaltauge wrote: ↑
17 Sep 2019, 16:59

ineuw wrote: ↑
17 Sep 2019, 16:45
clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)
Yeah, that's intentional. All letters before the unbalanced ")" are options for RegExReplace in AHK syntax. See: https://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm#Options
"m" is the multiline option: it makes "^" and "$" match at the start and end of every line. "i", for example, makes the match case-insensitive.

I understand. I was also reading this in the offline help. Thanks

ineuw · 17 Sep 2019, 18:41

I know that I am not doing it correctly, so I pasted three files into pastebin.

https://pastebin.com/u/ineuw - main page of account

https://pastebin.com/YmJhM8fa - italics.ahk
https://pastebin.com/5uyMrhtC - library file
https://pastebin.com/4XCaLGzc - text page cleaned and proofread to test the script.

Kobaltauge · 17 Sep 2019, 23:47

I took a quick look at the files. This is a massive task, and my solution works with the clipboard and a dictionary in memory. I think that is probably too large.
I would go on in one of these ways:

1st: Instead of loading the file and the dictionary into memory, I would parse them as files. We could read 10 dictionary entries per iteration to speed it up. This way would surely take time.

2nd: Change language. The regex works in every language. I would go with Python. The code steps would be the same, but I think it would be handled better.
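
To give an idea of how the two options could combine, here is a rough Python sketch that reads the dictionary ten entries at a time and runs one combined regex per batch. It assumes a library file with one `search : replacement` pair per line, which is a guess at a convenient layout rather than the format of the pastebin files. `read_batches` and `apply_batches` are hypothetical names.

```python
import io
import itertools
import re


def read_batches(lines, batch_size=10):
    """Yield dicts of up to batch_size 'search : replacement' pairs."""
    lines = iter(lines)
    while True:
        chunk = list(itertools.islice(lines, batch_size))
        if not chunk:
            return
        batch = {}
        for line in chunk:
            search, _, replacement = line.partition(":")
            if search.strip() and replacement.strip():  # skip malformed lines
                batch[search.strip()] = replacement.strip()
        yield batch


def apply_batches(text, batches):
    """Run one combined regex per batch instead of one regex per entry."""
    for batch in batches:
        if not batch:
            continue
        ordered = sorted(batch, key=len, reverse=True)
        alternation = "|".join(re.escape(s) for s in ordered)
        pattern = re.compile(r"\b(?:" + alternation + r")(?=\s|$)", re.MULTILINE)
        text = pattern.sub(lambda m: batch[m.group(0)], text)
    return text


# io.StringIO stands in for an open library file in this sketch.
library = io.StringIO(
    "1a Rel. Anón : ''1a Rel. Anón''\n"
    "Alaman, Disert : ''Alaman, Disert''\n"
)
page = "See 1a Rel. Anón and Alaman, Disert here."
print(apply_batches(page, read_batches(library)))
```

One caveat of batching: an entry in a later batch can still match inside text that an earlier batch already replaced, so entries that contain each other may need to be kept in the same batch or ordered longest first across the whole file.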

ineuw · 18 Sep 2019, 02:41

Kobaltauge wrote: ↑
17 Sep 2019, 23:47
I took a quick look at the files. This is a massive task, and my solution works with the clipboard and a dictionary in memory. I think that is probably too large.
I would go on in one of these ways:

1st: Instead of loading the file and the dictionary into memory, I would parse them as files. We could read 10 dictionary entries per iteration to speed it up. This way would surely take time.

2nd: Change language. The regex works in every language. I would go with Python. The code steps would be the same, but I think it would be handled better.

There is 16GB installed and there is no memory problem. With both files loaded simultaneously, the library, the page of text, and all the programs, the load is only 22% of the available RAM.

I was considering Python as well, but at this point I know even less about AHK. But, I am just beginning to have fun. You have given me a lot of info and please don't waste more of your time. I just thought it was an interesting challenge. Thanks again.

Kobaltauge · 18 Sep 2019, 14:54

No problem; as you wrote, it is an interesting challenge. I downloaded the files and had a few problems running them.
You have to pay attention to the encoding of the files. Probably it's a problem with the download from pastebin.
I removed all the " from the lib.ahk.
I modified the italics.ahk. Now it reads the lib.ahk into a dictionary and does the loop.

The new italics.ahk:

Code:

;2019-09-18 12:29 AM
;===================

;this file is #included in the autoexec.ahk

;italics.ahk
; Alt F12
!f12::
critical, on
autotrim, on
send, ^a^c			; select all text on the page,

dict := {}
Loop, Read, lib.ahk
{
	row := StrSplit(A_LoopReadLine, ":")
	key := Trim(row[1])
	dict[key] := Trim(row[2])
}

clipwait, 5

;code ------------------------------------------------------------
for in_put, out_put in dict
{
	clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)
}

;cleanup ---------------------------------------------------------

clipwait, 5
 ;if there are quadruple single quotes
in_put :="''''" 
out_put :="''"
clipboard :=strreplace(clipboard, in_put, out_put)

clipwait, 5
 ;underscore does not exist in the original
in_put :="_"
out_put :=""
clipboard :=strreplace(clipboard, in_put, out_put)

send, ^v
critical, off
return
