Identifying unique strings in a block of text

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

18 Sep 2019, 19:41

Hi. I replaced the files with files encoded as follows:

italics.ahk - UTF-8 WITH BOM PC line termination
lib.ahk - UTF-8 WITH BOM PC line termination and removed the double quotes
page_text_after_proofreading - UTF-8 WITHOUT BOM, LINUX line termination

Your version of "italics.ahk" above runs through the text of "page_text_after_proofreading.txt" but does nothing.
Win 10 Professional 64bit 21H2 16Gb Ram AHK current as of 2021-12-26 .
Kobaltauge
Posts: 264
Joined: 09 Mar 2019, 01:52
Location: Germany

Re: Identifying unique strings in a block of text

19 Sep 2019, 03:02

I thought that, too. But there are not so many "hits" in your example.

Launch italics.ahk. Open "page_text_after_proofreading.txt" in an editor (I prefer Notepad++). Click on the text. Press Alt+F12.
After that, search for a ' and you will find them at every changed position.

Did you change the files? Yesterday I had other ones. No matter.

First, you don't need the "dict:=" in lib.ahk anymore. This file is only the dictionary. You could name it lib.txt, so anybody can add more substitutions. So please remove everything (don't forget the } at the end) except the pairs without quotes.
dict.png


Follow the steps above and Tada!!!


result.png
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

19 Sep 2019, 23:56

Kobaltauge wrote:
19 Sep 2019, 03:02

Launch italics.ahk. Open "page_text_after_proofreading.txt" in an editor (I prefer Notepad++). Click on the text. Press Alt+F12.
After that, search for a ' and you will find them at every changed position.

Did you change the files? Yesterday I had other ones. No matter.

First, you don't need the "dict:=" in lib.ahk anymore. This file is only the dictionary. You could name it lib.txt, so anybody can add more substitutions. So please remove everything (don't forget the } at the end) except the pairs without quotes.
I use TextPad, which predates Notepad++. I have both but prefer TP because I have been using it for about 14 years. They are very close in features, but TP is better at some things.

I added some more words and noted the encoding and line termination on the top line. The script works exactly as you described. The results are identical to yours. Unfortunately, there are over 100 matching references to the dictionary, and the few that are marked are incorrect.
Kobaltauge
Posts: 264
Joined: 09 Mar 2019, 01:52
Location: Germany

Re: Identifying unique strings in a block of text

20 Sep 2019, 07:30

Sorry, I don't understand the problem.
I took a closer look and found this:
2019-09-20 14_24_06-lib.ahk - Editor.png
But unfortunately the script is doing the right thing there. In the lib, the needle "Kingsborough's Mex. Antiq" has no "." at the end. Therefore the needle differs from the text and will not be found. I think that was your original problem: you don't want the needle to be found inside other words.
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

20 Sep 2019, 08:01

I know that there are duplicates because of typos and missing punctuation. Please ignore these. I only want to enclose those that are identical to the library. There are numerous variations of the same reference source, and I try to include all of them. The book was published in 1883, and these are the result of poor typesetting and proofreading. Also, please don't waste your time on it, and take a break.
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

21 Sep 2019, 00:27

Kobaltauge, thanks for all your help. For the time being I reworked my original StrReplace() approach, with the references sorted in descending order by length. Everything gets enclosed, but duplicate words are enclosed both ways. I am also considering using InStr(). The key problem is being able to differentiate.

This is a part of the reworked code:

Code:

it := "''"

in_put := "Jalisco or Nueva Galicia. Cartog. Pac. Coast"
out_put := it . in_put . it

clipwait, 5

clipboard := strreplace(clipboard, in_put, out_put)

in_put := "Provisiones, Cedulas, Instrumentos, etc"
out_put := it . in_put . it

clipwait, 5

clipboard := strreplace(clipboard, in_put, out_put)

in_put := "Alfonso el Sabio, Laz Siete, Partidas"
out_put := it . in_put . it

clipwait, 5

clipboard := strreplace(clipboard, in_put, out_put)

in_put := "Brasseur de Bourbourg, Hist. Nat. Civ"
out_put := it . in_put . it

clipwait, 5
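
The repeated blocks above could be collapsed into a loop. This is only a minimal sketch, assuming AutoHotkey v1.1 and a linefeed-separated list of references; WrapNeedles and ByLengthDesc are hypothetical names, not part of the posted script:

```autohotkey
needles := "Brasseur de Bourbourg, Hist. Nat. Civ`nJalisco or Nueva Galicia. Cartog. Pac. Coast"
clipboard := WrapNeedles(clipboard, needles)
return

; Hypothetical helper: wrap every reference in '' markers,
; trying the longest references first.
WrapNeedles(text, needles)
{
	it := "''"
	Sort, needles, F ByLengthDesc	; needles is a linefeed-separated list
	Loop, Parse, needles, `n, `r
	{
		if (A_LoopField = "")	; skip blank lines
			continue
		text := StrReplace(text, A_LoopField, it . A_LoopField . it)
	}
	return text
}

; Comparison callback for Sort: a positive result places a after b,
; so longer strings end up first.
ByLengthDesc(a, b)
{
	return StrLen(b) - StrLen(a)
}
```

Sorting by length does not remove the double enclosure mentioned above: a short reference that also occurs inside a longer one is still wrapped a second time.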
Kobaltauge
Posts: 264
Joined: 09 Mar 2019, 01:52
Location: Germany

Re: Identifying unique strings in a block of text

21 Sep 2019, 04:06

I had another idea. You could search with InStr(). Then compare the result with RegEx. If it matches, go on. If it doesn't match, a pop-up appears with the option to add the new "needle" to the dictionary.
We could probably build in a way to correct it, too.

As I'm writing this, I think I am trying to reprogram Word's auto-correction. :roll:
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

21 Sep 2019, 17:55

Kobaltauge wrote:
21 Sep 2019, 04:06
I had another idea. You could search with InStr(). Then compare the result with RegEx. If it matches, go on. If it doesn't match, a pop-up appears with the option to add the new "needle" to the dictionary.
We could probably build in a way to correct it, too.

As I'm writing this, I think I am trying to reprogram Word's auto-correction. :roll:
:salute: It was 5am EST when your post notification arrived, and by that time I was pretty incoherent, so only now did I get the chance to reply. I had an idea about InStr(). It finds everything without a problem, but two subsearches are needed (using the position found by the main InStr()) to decide whether the quotes should be added. The subsearches check the previous 5-10(?) and the following 5-10(?) characters to see whether they contain the quotes, and then act on it. I've never done a nested If() in AHK.

P.S.: The more I think about it, only the positions just before the string (-1) and just after it (length +1) need to be checked to see whether it is enclosed. This is the first thing to test.
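
The before/after check could look roughly like this. It is a sketch for AutoHotkey v1.1, with the assumption that two characters are checked on each side, because the marker is two single quotes; IsEnclosed is a hypothetical name:

```autohotkey
text := "See ''Hist. Nat. Civ'' for details."
needle := "Hist. Nat. Civ"
pos := InStr(text, needle)	; position of the first occurrence, 0 if none
if (pos and !IsEnclosed(text, pos, StrLen(needle)))
	MsgBox, The first occurrence still needs quotes.
return

; Hypothetical helper: returns true when the needle that starts at pos
; already has '' directly before and directly after it.
IsEnclosed(haystack, pos, len)
{
	if (pos > 2 and SubStr(haystack, pos - 2, 2) = "''")
	{
		; nested if: only look behind the needle when the front matched
		if (SubStr(haystack, pos + len, 2) = "''")
			return true
	}
	return false
}
```

The pos > 2 test is there because SubStr() in v1 counts from the end of the string when the starting position is 0 or negative.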
Kobaltauge
Posts: 264
Joined: 09 Mar 2019, 01:52
Location: Germany

Re: Identifying unique strings in a block of text

23 Sep 2019, 05:55

One problem was that InStr() only catches the first occurrence. I tested a few RegEx patterns but didn't get the right combination.
With a little help from Google I found this https://autohotkey.com/board/topic/115744-instr-for-multiple-occurrences and could modify "our" script.

findstr() finds all starting positions of the needle. For every position, the script gets the string plus the 2 leading and the 2 trailing characters. After that it checks whether these four characters are '' on each side. If not, they are added.
Now lib.ahk should contain only the "needles".

Code:

;2019-09-18 12:29 AM
;===================

;this file is #included in the autoexec.ahk

;italics.ahk

;Stolen from https://autohotkey.com/board/topic/115744-instr-for-multiple-occurrences/#entry669623
findstr(h,n,ic=1) ; h=haystack, n=needle ,ic=ignore case
{
	while pos := regexmatch(h,(ic?"i)":"")"\Q" n "\E",m,a_index=1?1:pos+strlen(m))
		fp .= pos " "
	return trim(fp)
}

; Alt F12
!f12::
critical, on
autotrim, on
send, ^a^c			; select all text on the page,

dict := {}
Loop, Read, lib.ahk
{
	row := StrSplit(A_LoopReadLine, ":")
	key := Trim(row[1])
	dict[key] := Trim(row[2])
}

clipwait, 5

;code ------------------------------------------------------------
for in_put, out_put in dict
{
	strposis := StrSplit(findstr(clipboard, in_put), " ")
	for index, strpos in strposis
	{
		match := SubStr(clipboard, strpos, StrLen(in_put))
		match2 := SubStr(clipboard, (strpos-2), (StrLen(in_put)+4))
		RegExMatch(match2,"^''" . match . "''?", found)
		if !found
			clipboard := RegExReplace(clipboard, match, "''" . match . "''",,,strpos)
		
		
	}
}

;cleanup ---------------------------------------------------------

clipwait, 5
 ;if there are quadruple single quotes
in_put :="''''" 
out_put :="''"
clipboard :=strreplace(clipboard, in_put, out_put)

clipwait, 5
 ;underscore does not exist in the original
in_put :="_"
out_put :=""
clipboard :=strreplace(clipboard, in_put, out_put)

send, ^v
critical, off
return

ExitApp
*Esc::
ExitApp
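
Two things to keep in mind about the loop above: the positions are collected before any quotes are inserted, so each insertion shifts the later matches by four characters, and the needle is passed to RegExReplace() as a pattern, where "." matches any character. A possible variant that avoids both is sketched below for AutoHotkey v1.1. It uses plain InStr() and SubStr() and rebuilds the text from left to right; WrapOccurrences is a hypothetical name, not part of the script:

```autohotkey
sample := "Kingsborough's Mex. Antiq. and ''Kingsborough's Mex. Antiq.''"
MsgBox % WrapOccurrences(sample, "Kingsborough's Mex. Antiq.")
return

; Hypothetical helper: wrap each occurrence of needle in '' unless it is
; already wrapped. The result is built from the unmodified text, so no
; stored position goes stale.
WrapOccurrences(text, needle)
{
	if (needle = "")	; nothing to search for
		return text
	out := ""
	start := 1
	len := StrLen(needle)
	while (pos := InStr(text, needle, false, start))
	{
		found := SubStr(text, pos, len)	; keeps the casing used in the text
		before := (pos > 2) ? SubStr(text, pos - 2, 2) : ""
		after := SubStr(text, pos + len, 2)
		out .= SubStr(text, start, pos - start)	; text between two matches
		if (before = "''" and after = "''")
			out .= found	; already enclosed, copy as-is
		else
			out .= "''" . found . "''"
		start := pos + len
	}
	return out . SubStr(text, start)	; append the rest after the last match
}
```

In the hotkey, this could be called once per needle on a copy of the clipboard text, with the result written back to the clipboard a single time at the end.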

ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

23 Sep 2019, 06:20

Wow, thank you. I was also wondering if we could return to your earlier method. It's much easier to manage. I will let you know my results soon.
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

23 Sep 2019, 06:47

I can't believe it. Thank you. :bravo:

I got an error message because #Warn was enabled in Autoexec.ahk. I disabled it, and the results are instantaneous.
Attachments
fp_undeclared_local_var.jpg
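
Disabling #Warn is one way out. Another option, sketched here on the assumption that the warning is about fp (as the attachment name suggests), is to initialize the locals in findstr() so the script can run with #Warn left on. The empty-needle guard is an addition that was not in the posted function:

```autohotkey
#Warn	; assumed to be on, as in the Autoexec.ahk mentioned above

MsgBox % findstr("abc abc", "abc")	; shows the space-separated start positions
return

; findstr() from the earlier post, with its local variables initialized.
findstr(h, n, ic := 1)	; h = haystack, n = needle, ic = ignore case
{
	fp := ""
	m := ""
	pos := 0
	if (n = "")	; an empty needle would match at the same position forever
		return fp
	while (pos := RegExMatch(h, (ic ? "i)" : "") . "\Q" . n . "\E", m, A_Index = 1 ? 1 : pos + StrLen(m)))
		fp .= pos . " "
	return Trim(fp)
}
```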
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

Re: Identifying unique strings in a block of text

25 Sep 2019, 22:39

Kobaltauge, I removed two standalone references, "Id" and "Mex", because they could not be uniquely identified. I also inserted "sleep, 10" in the "for" loop, because the script either ran through instantly and did nothing, or inserted the '' quotes everywhere. Since then it's been a pleasure to work with. Many thanks for your help.

Code:

for in_put, out_put in dict
{
	strposis := StrSplit(findstr(clipboard, in_put), " ")
	for index, strpos in strposis
	{
		match := SubStr(clipboard, strpos, StrLen(in_put))
		match2 := SubStr(clipboard, (strpos-2), (StrLen(in_put)+4))
		RegExMatch(match2,"^''" . match . "''?", found)
		if !found
			clipboard := RegExReplace(clipboard, match, "''" . match . "''",,,strpos)
	}
	sleep, 10
}
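
One possible explanation for the timing problem (a guess, the thread does not confirm it): ClipWait returns as soon as the clipboard contains any data, so if the clipboard is not emptied before ^c is sent, the script may carry on with the old contents. The usual pattern is sketched below for AutoHotkey v1.1; CopyAll is a hypothetical name:

```autohotkey
!F12::
	text := CopyAll()
	if (text = "")	; nothing was copied in time
		return
	; ... run the dictionary loop on text here ...
	Clipboard := text
	Send, ^v
return

; Hypothetical helper: select all, copy, and return the copied text,
; or "" if nothing arrived within the timeout (in seconds).
CopyAll(timeout := 5)
{
	Clipboard := ""	; empty first, so ClipWait really waits for the new copy
	Send, ^a^c
	ClipWait, %timeout%
	if ErrorLevel	; ClipWait sets ErrorLevel to 1 on timeout
		return ""
	return Clipboard
}
```

Working on a local copy and assigning to Clipboard once at the end also avoids writing to the system clipboard on every replacement.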
Kobaltauge
Posts: 264
Joined: 09 Mar 2019, 01:52
Location: Germany

Re: Identifying unique strings in a block of text

26 Sep 2019, 13:22

ineuw wrote:
25 Sep 2019, 22:39
Many thanks for your help
You're welcome. And thank you too. I learned a few things working on it too.
