Identifying unique strings in a block of text
How can strings be uniquely identified if both share the same word? "Beaumont, Crón. Mich" and "Beaumont" both contain the word "Beaumont", but they appear in different parts of the block of text.
Last edited by ineuw on 13 Sep 2019, 04:52, edited 1 time in total.
Win 10 Professional 64bit 21H2 16Gb Ram AHK current as of 2021-12-26 .
Re: Identifying unique strings in a block of text
You search the whole text for the longer string and keep a list of the positions where it is found. Then you search the same text for the shorter string, and you only count a hit as a match if it does not fall at a position already in the first list.
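A minimal sketch of that logic in AutoHotkey v1, using only InStr() and StrLen(). The function names FindAll and FindShortOnly and the sample text are made up for illustration, and the comparison is assumed to be case-sensitive.

```autohotkey
; Hypothetical helper: return an array with every 1-based position of needle in haystack.
FindAll(haystack, needle) {
    positions := []
    pos := 1
    while (found := InStr(haystack, needle, true, pos)) {   ; true = case-sensitive
        positions.Push(found)
        pos := found + StrLen(needle)   ; continue searching after this hit
    }
    return positions
}

; Hypothetical helper: positions of shortStr that do not lie inside a hit of longStr.
FindShortOnly(haystack, longStr, shortStr) {
    result := []
    longHits := FindAll(haystack, longStr)
    for i, p in FindAll(haystack, shortStr) {
        inside := false
        for j, q in longHits {
            ; the short hit is covered if it starts and ends within the long hit
            if (p >= q && p + StrLen(shortStr) <= q + StrLen(longStr)) {
                inside := true
                break
            }
        }
        if (!inside)
            result.Push(p)
    }
    return result
}

text := "See Beaumont, Crón. Mich, i. 12; also Beaumont, ii. 3."
for i, p in FindShortOnly(text, "Beaumont, Crón. Mich", "Beaumont")
    MsgBox % "Standalone match at position " p
```

Only the second "Beaumont" should be reported here, because the first one sits inside a hit of the longer string.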
Re: Identifying unique strings in a block of text
Thanks for the advice. Are you aware of a code snippet I could look at?
Re: Identifying unique strings in a block of text
No, just came up with the logic. I'm not aware if anyone has done it. It's pretty straightforward code, though.
Re: Identifying unique strings in a block of text
What you are looking for is InStr(). Specifically:
Code: Select all
MyVar := InStr("some block of text","block")
MsgBox %MyVar%
How you put this together in a script, and what you do with it depends on what you are trying to do.
Re: Identifying unique strings in a block of text
I'm not sure what kind of output you are looking for, but if you're just trying to figure out "is {{x}} anywhere inside this other string", then InStr() would be perfect for you.
Alternatively, you may also be interested in https://www.npmjs.com/package/string-similarity.ahk, which returns a "percentage" of similarity between two strings. That could be useful if there are slight misspellings in your dataset. The bigger the haystack, the smaller that similarity will be, so in the example below, feeding in the entire string returns .72. Further down, we split the string into smaller pieces and check each word, which yields a more valid check.
Code: Select all
#Include %A_ScriptDir%\lib\stringsimilarity.ahk\export.ahk
stringSimilarity := new stringsimilarity()
msgbox, % stringSimilarity.compareTwoStrings("Beaumont, Crón. Mich", "Beaumont")
; => 0.72
strings_array := StrSplit("Beaumont, Crón. Mich", " ")
loop, % strings_array.MaxIndex() {
if (stringSimilarity.compareTwoStrings(strings_array[A_Index], "Beaumont") = 1) {
msgbox, % "Beaumont found! in the given string"
continue
}
if (stringSimilarity.compareTwoStrings(strings_array[A_Index], "Beaumont") > .90) {
msgbox, % "Beaumont very likely in the given string! But not an exact match."
}
}
Re: Identifying unique strings in a block of text
I believe he doesn’t want to know if it is anywhere in the other string, because he already knows that. He wants to find both in a larger text but differentiate between the two. That’s why I suggested the approach I did, which would find both but separate the two.
Re: Identifying unique strings in a block of text
Thanks to all for the comments and the guidance. I was tied up with life for the past few days, that is why I couldn't reply.
The InStr() and StrReplace() functions and the If statement are sufficient to resolve my issue.
The text is generated by OCR from a page of about 600 words of an old book. The project contains 6 volumes of 700 pages each. The errors are corrected and the page is proofread, so the search strings are identical to the text being searched. My work is very similar to that of the Distributed Proofreaders of Gutenberg.
I have already automated the generation of a simple script in which the search and replace strings are embedded in the code. New strings are added to the list constantly, and a new list is generated once or twice daily.
So far, I know the word lengths and how many times they occur in the page of text. Now I am working on calculating the starting position of the search strings. The following is a snippet of the automatically generated AutoHotkey script. But it replaces every exact occurrence of the strings, which is wrong. I must insert code to tell the occurrences apart when searching and replacing, so that a search for a word like "Id" does not replace occurrences inside forb''Id''den and outs''Id''e, or where "Id" occurs in longer search strings like . . .
Id.
Id,
Id.,
Id., 4a Rel
Id., Chron
Id., Cortés
Id., Gob. Mex
Id., Hist. Mex
Id., Menologia
Id., Palestra
Id., Puebla
Id., Teatro Mex
Id., Trat. Mex
Code: Select all
in_put :="1a Rel. Anón"
out_put :="''1a Rel. Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)
clipwait, 5
if (out_len = in_len + 4) ; compare; ":=" would assign and always be true
{
clipboard :=strreplace(clipboard, in_put, out_put)
}
in_put :="2a Rel, Anón"
out_put :="''2a Rel, Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)
clipwait, 5
if (out_len = in_len + 4) ; compare; ":=" would assign and always be true
{
clipboard :=strreplace(clipboard, in_put, out_put)
}
in_put :="3a Bel. Anón"
out_put :="''3a Bel. Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)
clipwait, 5
if (out_len = in_len + 4) ; compare; ":=" would assign and always be true
{
clipboard :=strreplace(clipboard, in_put, out_put)
}
in_put :="4a Rel. Anón"
out_put :="''4a Rel. Anón''"
in_len := strlen(in_put)
out_len := strlen(out_put)
clipwait, 5
if (out_len = in_len + 4) ; compare; ":=" would assign and always be true
{
clipboard :=strreplace(clipboard, in_put, out_put)
}
in_put :="Aa, Nanukeurige Versameling"
out_put :="''Aa, Nanukeurige Versameling''"
in_len := strlen(in_put)
out_len := strlen(out_put)
clipwait, 5
if (out_len = in_len + 4) ; compare; ":=" would assign and always be true
{
clipboard :=strreplace(clipboard, in_put, out_put)
}
in_put :="Alaman, Disert"
out_put :="''Alaman, Disert''"
in_len := strlen(in_put)
out_len := strlen(out_put)
clipwait, 5
if (out_len = in_len + 4) ; compare; ":=" would assign and always be true
{
clipboard :=strreplace(clipboard, in_put, out_put)
}
Last edited by ineuw on 17 Sep 2019, 01:31, edited 2 times in total.
-
- Posts: 264
- Joined: 09 Mar 2019, 01:52
- Location: Germany
- Contact:
Re: Identifying unique strings in a block of text
Just a quick tip, I'm not fully awake.
Why not search for "Id " (with a trailing space)? It wouldn't find the letters in the middle of a word. Regex can do some more magic for the "needle".
Additionally, you could create a "dictionary" and search and replace element by element. Then you would only curate the dictionary, and you would not be copying the whole code block for each substitution.
Re: Identifying unique strings in a block of text
Hi.
This is my shot. But I didn't see the whole problem with "Id". My tip with the "space" doesn't work when "Id" is the only thing in a line. =(
Code: Select all
dict := { "1a Rel. Anón" : "''1a Rel. Anón''"
,"2a Rel, Anón" : "''2a Rel, Anón''"
,"3a Bel. Anón" : "''3a Bel. Anón''"
, "4a Rel. Anón" : "''4a Rel. Anón''"
, "Aa, Nanukeurige Versameling" : "''Aa, Nanukeurige Versameling''"
, "Alaman, Disert" : "''Alaman, Disert''"}
for in_put, out_put in dict
{
clipboard :=strreplace(clipboard, in_put, out_put)
}
Re: Identifying unique strings in a block of text
Kobaltauge wrote: ↑16 Sep 2019, 23:37
Just a quick tip, I'm not fully awake.
Why not search for "Id " (with a trailing space)? It wouldn't find the letters in the middle of a word. Regex can do some more magic for the "needle".
Additionally, you could create a "dictionary" and search and replace element by element. Then you would only curate the dictionary, and you would not be copying the whole code block for each substitution.
I hope that by the time you read this you are well rested. I know that Regex would be the best tool, but I know nothing about how to construct what I need, even though I am loaded with Regex info links.
My search string list is a "dictionary". I stripped the punctuation from the end of the search strings because it is irrelevant, but I cannot remove it when it appears in the middle of a string. The "Id " you referred to is really "Id., " or "Id. " or "Id, " or "Id.,". Adding a space beyond this makes no difference because they are all terminated with a space. I will try to post the dictionary on pastebin, but at the moment they are overloaded. Please look at the "Id" samples again; I corrected them.
Re: Identifying unique strings in a block of text
OK. This was a hard one for me, but I learned a lot.
I replaced StrReplace with RegExReplace.
m) = sets the regex to multiline mode, so that $ also matches at the end of each line.
\b = a word boundary; here it marks the beginning of the search string.
. in_put . = the syntax for concatenating the variable into the regex needle.
(\s|$) = after "in_put" there must follow either a whitespace character "\s" or ("|") the end of a line or of the clipboard "$".
hth
Code: Select all
dict := { "1a Rel. Anón" : "''1a Rel. Anón''"
,"2a Rel, Anón" : "''2a Rel, Anón''"
,"3a Bel. Anón" : "''3a Bel. Anón''"
, "4a Rel. Anón" : "''4a Rel. Anón''"
, "Aa, Nanukeurige Versameling" : "''Aa, Nanukeurige Versameling''"
, "Alaman, Disert" : "''Alaman, Disert''"
, "Id" : "-Id-"}
for in_put, out_put in dict
{
clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)
}
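Two things may be worth watching in a loop like this one, assuming the keys are meant as literal text. First, characters such as "." inside a key are regex metacharacters, so "Id." would also match "Ida". Second, the whitespace matched by (\s|$) is part of the match and is therefore replaced too. A hedged sketch of one way around both, with a made-up helper name WrapLiteral and a made-up sample string:

```autohotkey
; Hypothetical helper: replace every standalone occurrence of a literal string.
; \Q...\E makes PCRE treat the needle literally, so "." only matches a period.
; (?<!\w) and (?!\w) are lookarounds: they require that no word character
; touches the match, but they consume nothing, so the spacing is preserved.
; Assumption: the needle itself never contains the two characters "\E".
WrapLiteral(haystack, needle, replacement) {
    pattern := "(?<!\w)\Q" . needle . "\E(?!\w)"
    ; "$" is special in the replacement string, so escape it as "$$".
    return RegExReplace(haystack, pattern, StrReplace(replacement, "$", "$$"))
}

sample := "Id., Chron. is forbIdden outsIde. Id"
MsgBox % WrapLiteral(sample, "Id", "-Id-")
```

This leaves "forbIdden" and "outsIde" alone, but it does not by itself tell a standalone "Id" apart from the "Id" at the start of a longer key such as "Id., Chron".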
Re: Identifying unique strings in a block of text
Kobaltauge wrote:
OK. This was a hard one for me, but I learned a lot.
clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)
Kobaltauge, many thanks for your help, especially for the Regex explanations. I have only one question about RegExReplace(). The pairs of brackets () are not balanced (an opening bracket is missing), and I don't know whether this was intentional. If not, where would the missing bracket be placed?
Re: Identifying unique strings in a block of text
Yeah, that's intentional. All letters before the unbalanced ")" are options for RegExReplace in AHK syntax. See: https://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm#Options
"m" is the multiline option: it lets ^ and $ match at the start and end of every line, not only of the whole text. "i", for example, ignores case in the haystack.
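A small sketch of what the letters before ")" do, assuming AHK v1 and text with Windows line endings (`r`n is the newline sequence the regex engine expects by default). The sample text is made up.

```autohotkey
text := "first line`r`nId`r`nlast line"

; Without m), ^ and $ only match at the very start and end of the whole text.
MsgBox % RegExMatch(text, "^Id$")      ; 0, because the text does not consist of "Id" alone

; With m), ^ and $ also match at the start and end of each line.
MsgBox % RegExMatch(text, "m)^Id$")    ; nonzero: the position of the "Id" line

; i) makes the match case-insensitive.
MsgBox % RegExMatch(text, "i)^FIRST")  ; 1
```

Options can be combined, for example "im)". If the text might use bare `n line endings, the `a option (any newline) may be needed as well.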
Re: Identifying unique strings in a block of text
Kobaltauge wrote: ↑17 Sep 2019, 16:59
Yeah, that's intentional. All letters before the unbalanced ")" are options for RegExReplace in AHK syntax. See: https://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm#Options
"m" is the multiline option: it lets ^ and $ match at the start and end of every line, not only of the whole text. "i", for example, ignores case in the haystack.
I understand. I was also reading this in the offline help. Thanks.
Re: Identifying unique strings in a block of text
I know that I am not doing it correctly, so I pasted three files into pastebin.
- https://pastebin.com/u/ineuw - main page of account
https://pastebin.com/YmJhM8fa - italics.ahk
https://pastebin.com/5uyMrhtC - library file
https://pastebin.com/4XCaLGzc - text page cleaned and proofread to test the script.
- Attachment: expression is too long.jpg (38.58 KiB)
Re: Identifying unique strings in a block of text
I took a quick look at the files. This is a massive task, and my solution works with the clipboard and a dictionary in memory. I think that is probably too large.
I would continue in one of these ways:
1st Instead of loading the file and the dictionary into memory, I would parse them as files. We could read 10 dictionary entries per iteration to speed it up. This way would surely take time.
2nd Change the language. The regex works in every language. I would go with Python. The code steps would be the same, but I think it would handle the task better.
Re: Identifying unique strings in a block of text
Kobaltauge wrote: ↑17 Sep 2019, 23:47
I took a quick look at the files. This is a massive task, and my solution works with the clipboard and a dictionary in memory. I think that is probably too large.
I would continue in one of these ways:
1st Instead of loading the file and the dictionary into memory, I would parse them as files. We could read 10 dictionary entries per iteration to speed it up. This way would surely take time.
2nd Change the language. The regex works in every language. I would go with Python. The code steps would be the same, but I think it would handle the task better.
There is 16GB installed and there is no memory problem. With both files loaded simultaneously (the library and the page of text), plus all the programs, the load is only 22% of the available RAM.
I was considering Python as well, but at this point I know even less about AHK. But I am just beginning to have fun. You have given me a lot of info, and please don't waste more of your time. I just thought it was an interesting challenge. Thanks again.
Re: Identifying unique strings in a block of text
No problem; as you wrote, it is an interesting challenge. I downloaded the files and had a few problems running them.
You have to pay attention to the encoding of the files. It is probably a problem with the download from pastebin.
I removed all the " characters from lib.ahk.
I modified italics.ahk. It now reads lib.ahk into a dictionary and runs the loop.
The new italics.ahk:
Code: Select all
;2019-09-18 12:29 AM
;===================
;this file is #included in the autoexec.ahk
;italics.ahk
; Alt F12
!f12::
critical, on
autotrim, on
send, ^a^c ; select all text on the page,
dict := {}
Loop, Read, lib.ahk
{
row := StrSplit(A_LoopReadLine, ":")
key := Trim(row[1])
dict[key] := Trim(row[2])
}
clipwait, 5
;code ------------------------------------------------------------
for in_put, out_put in dict
{
clipboard := RegExReplace(clipboard, "m)\b" . in_put . "(\s|$)", out_put)
}
;cleanup ---------------------------------------------------------
clipwait, 5
;if there are quadruple single quotes
in_put :="''''"
out_put :="''"
clipboard :=strreplace(clipboard, in_put, out_put)
clipwait, 5
;underscore does not exist in the original
in_put :="_"
out_put :=""
clipboard :=strreplace(clipboard, in_put, out_put)
send, ^v
critical, off
return
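The loop above handles each key on its own and does not visit the keys longest-first, so a short key such as "Id" can be wrapped before a longer key that contains it ("Id., Hist. Mex") is tried. One possible way to address the original question is sketched below. It assumes AHK v1 and that every replacement simply wraps the match in '' as in the examples above. The names LongestFirst and ItalicizeAll are hypothetical, and the sketch has not been tried against a dictionary of the size discussed in this thread.

```autohotkey
; Hypothetical comparison function for Sort: longer strings come first.
LongestFirst(a, b) {
    return StrLen(b) - StrLen(a)
}

; Hypothetical helper: wrap every listed string in '' in a single pass.
; keys is a newline-separated list of literal search strings.
ItalicizeAll(haystack, keys) {
    Sort, keys, F LongestFirst
    pattern := ""
    Loop, Parse, keys, `n, `r
    {
        if (A_LoopField = "")
            continue
        ; \Q...\E keeps punctuation in the key literal
        pattern .= (pattern = "" ? "" : "|") . "\Q" . A_LoopField . "\E"
    }
    if (pattern = "")
        return haystack
    ; At each position the alternatives are tried in order, so the longest
    ; listed string wins, and text that was already wrapped is not rescanned.
    return RegExReplace(haystack, "(?<!\w)(?:" . pattern . ")(?!\w)", "''$0''")
}

keys := "Id`nId., Hist. Mex`nId., Chron"
MsgBox % ItalicizeAll("Id., Hist. Mex, i. 5; Id., ii. 7; forbIdden", keys)
```

With this sample, "Id., Hist. Mex" is wrapped as a whole, the later standalone "Id" is wrapped on its own, and "forbIdden" is left untouched.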