[SOLVED] How can a unicode ligature character be found and replaced with a script?

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
ineuw
Posts: 172
Joined: 11 Sep 2014, 14:12

14 Apr 2019, 22:51

I have many pages of text in which the OCR scan replaced every occurrence of "fi" with the ligature "ﬁ", and now I must reverse the error. I have dozens of working scripts that use either StrReplace() in context or RegExReplace() to search for Unicode Chr() hex strings, but that does not work in this case.
Perhaps it's because the ligature is constructed with three hex segments, or I am not defining it properly?

Code: Select all

;alt-u													;ligature "ﬁ" = chr(0xEF) chr(0xAC) chr(0x81)
!u::
autotrim, on
clipboard:=
	sendinput, ^a										;select all
	clipwait, 0
	sendinput, ^c										;copy selected
	clipwait, 0
	in_put:= chr(0xEF)chr(0xAC)chr(0x81)				;hex definition is from Unicode consortium.
	out_put:= "fi"										;replace it with regular ASCII characters.
	clipboard:=regexreplace(clipboard, in_put, out_put)	
	sendinput, ^v										;paste
	sendinput, {ctrl}{home}								;on completion return to the top of page
return
Last edited by ineuw on 19 Apr 2019, 21:12, edited 1 time in total.
Win 10 Professional 64bit 21H2 16Gb Ram AHK current as of 2021-12-26 .
TAC109
Posts: 1111
Joined: 02 Oct 2013, 19:41
Location: New Zealand

Re: How can a unicode ligature character be found and replaced with a script?

14 Apr 2019, 23:26

Clipboard := "" at the beginning. The first ClipWait is superfluous as the ctrl+a won’t alter the clipboard.
My scripts:-
XRef - Produces Cross Reference lists for scripts
ReClip - A Text Reformatting and Clip Management utility
ScriptGuard - Protects Compiled Scripts from Decompilation
I also maintain Ahk2Exe
ineuw

15 Apr 2019, 00:19

TAC109 wrote:
14 Apr 2019, 23:26
Clipboard := "" at the beginning. The first ClipWait is superfluous as the ctrl+a won’t alter the clipboard.
TAC109, thanks for the valuable pointers. About the := assignment with nothing after it: I looked, but found no examples where the empty string := "" was assigned to the clipboard. Should I change it to an empty string everywhere? Again, many thanks.
TAC109

15 Apr 2019, 00:24

To empty the clipboard, use either the traditional syntax clipboard = or the expression syntax clipboard := "".
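
As a minimal sketch of the advice above (AutoHotkey v1.1 assumed; the 2-second timeout and the message text are illustrative choices, not taken from the thread), emptying the clipboard first lets ClipWait detect when the copied text has actually arrived:

```autohotkey
; Sketch only, for AutoHotkey v1.1.
Clipboard := ""        ; expression syntax; "Clipboard =" is the traditional equivalent
Send, ^c               ; copy the current selection
ClipWait, 2            ; wait up to 2 seconds for the clipboard to contain data
if ErrorLevel          ; ClipWait sets ErrorLevel to 1 when it times out
    MsgBox, The copy did not place any text on the clipboard.
```

Because Ctrl+A only changes the selection, a single ClipWait after the copy should be enough.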

Cheers
ineuw

15 Apr 2019, 00:31

A big thanks for the explanation. This means that the source of my knowledge was wrong as well. I copied it from bad examples.:-)
Klarion
Posts: 176
Joined: 26 Mar 2019, 10:02

15 Apr 2019, 00:37

.
Last edited by Klarion on 20 Apr 2019, 00:50, edited 1 time in total.
Klarion

15 Apr 2019, 01:15

.
Last edited by Klarion on 20 Apr 2019, 00:50, edited 1 time in total.
Klarion

15 Apr 2019, 01:20

.
Last edited by Klarion on 20 Apr 2019, 00:51, edited 1 time in total.
ineuw

16 Apr 2019, 03:13

First, thanks to all. Unfortunately, none of the suggestions worked: Msgbox, in_put was always empty, and the script did nothing.

These strings, chr(0xEF)chr(0xAC)chr(0x81) or chr(0xEF) chr(0xAC) chr(0x81), do not work.
However, in_put:=chr(0x20)chr(0x2014), which I used in other scripts, works fine with replace().
The ligature is printable as: sendinput, "ﬁ" or sendinput, {U+FB01}.

I used this web page https://www.compart.com/en/unicode/U+FB01 to check the various formats. In essence, {U+FB01} is the UTF-16 form, which is native to Windows as Alt + FB01 (which also doesn't work when typed on the keyboard).
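
One possible reading of the observations above, offered as a sketch rather than a confirmed fix: 0xEF 0xAC 0x81 are the three UTF-8 bytes that encode U+FB01, whereas a Unicode build of AutoHotkey v1.1 stores strings as UTF-16, so Chr() would take the code point 0xFB01 directly. The sample string below is hypothetical.

```autohotkey
; Sketch only: assumes a Unicode build of AutoHotkey v1.1.
; Chr(0xEF)Chr(0xAC)Chr(0x81) builds three separate characters (U+00EF, U+00AC, U+0081),
; not the ligature, so it never matches the copied text.
in_put  := Chr(0xFB01)              ; LATIN SMALL LIGATURE FI as a single character
out_put := "fi"                     ; plain ASCII replacement
sample  := "of" . in_put . "ce"     ; hypothetical test string containing the ligature
MsgBox % StrReplace(sample, in_put, out_put)   ; shows: office
```

This would also be consistent with chr(0x2014) working in the other scripts, since that value is likewise a code point rather than a byte sequence.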
Klarion

16 Apr 2019, 03:14

.
Last edited by Klarion on 20 Apr 2019, 00:51, edited 1 time in total.
Klarion

16 Apr 2019, 03:19

.
Last edited by Klarion on 20 Apr 2019, 00:51, edited 1 time in total.
ineuw

16 Apr 2019, 04:12

Thanks again for pointing out another oversight. I corrected it as follows:

Code: Select all

in_put:= Chr(0x61)Chr(0x62)Chr(0x63)
Msgbox % in_put
The result displayed "abc".

I am using AutoHotkey 64-bit Unicode version 1.1.30.01, and this is a copy of the script as it is now.

Code: Select all

;alt-u                        ligature "ﬁ" or - chr(0xEF)chr(0xAC)chr(0x81) {U+FB01}
!u::
	clipboard:= ""
	autotrim, on
	sendinput, ^a^c
	clipwait, 4
	in_put:= Chr(0x61)Chr(0x62)Chr(0x63)
	out_put:= "fi"
	Msgbox % in_put
	clipwait, 4
	clipboard:= strreplace(clipboard, in_put, out_put)
	clipwait, 4
	sendinput, ^v
	sendinput, {ctrl}{home}
return
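
For comparison, a hedged sketch of how the whole hotkey might look once the search string is the ligature itself. It assumes a Unicode build of AutoHotkey v1.1; the 2-second timeout is an arbitrary choice, and ^{Home} is swapped in for {ctrl}{home}, which presses the two keys one after the other rather than together.

```autohotkey
; Sketch only: assumes a Unicode build of AutoHotkey v1.1.
!u::                                   ; Alt+U
    Clipboard := ""                    ; empty the clipboard so ClipWait can detect the copy
    Send, ^a^c                         ; select all, then copy
    ClipWait, 2                        ; wait up to 2 seconds for text to arrive
    if ErrorLevel                      ; nothing was copied, so leave the page untouched
        return
    Clipboard := StrReplace(Clipboard, Chr(0xFB01), "fi")   ; ligature -> two ASCII letters
    Send, ^v                           ; paste over the still-selected text
    Send, ^{Home}                      ; Ctrl+Home: return to the top of the page
return
```

ClipWait only matters after the copy; assigning to Clipboard from the script does not need a wait before the paste in this sketch.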

Klarion

16 Apr 2019, 07:49

.
Last edited by Klarion on 20 Apr 2019, 00:51, edited 1 time in total.
ineuw

16 Apr 2019, 08:34

I could live with the ligatures if they were in the original text, but this time they were generated by the OCR process, so I must replace them to match the original.

Here is a page I am working on. One will find the ligature "ﬁ" in words like . . . field, pacific, office, beneficial, fine, gratified . . . etc.
set out “at forty-eight hours notice"; yet it was not until the
eighth of March that his cavalry, led by the impetuous Twiggs
and accompanied by Ringgold’s handsome battery, actually
moved off. The infantry brigades followed at intervals of a
day with Duncan's and Bragg's field artillery; and transports
prepared to remove the convalescents, extra baggage and Major
Monroe's artillery company to Point Isabel, near the mouth
of the Rio Grande.<ref>17</ref>

Soon after receiving the instructions to advance, Taylor had
given notice of his orders to influential citizens of Matamoros
then at Corpus Christi, explaining that his march would be
entirely pacific, and that he expected the pending questions to
be settled by negotiation; and similar assurances were con-
veyed to the Mexican customhouse office at “ Brazos Santiago,”
near Point Isabel‘ March 8 a more formal announcement
appeared in General Orders No. 30. Taylor here expressed
the hope that his movement would be “beneficial to all con-
cerned,” insisted upon a scrupulous regard for the civil and
religious rights of the people, and commanded that everything
required for the use of the army should be paid for “at the
highest market price." These orders, which merely antici-
pated instructions then on their way from Washington, were
translated into Spanish, and placed in circulation along the
border.<ref>18</ref>

To the troops the march proved a refreshing and beneficial
change. The weather was now fine, the road almost free from
mud, and the breeze balmy, Frequently the blue lupine, the
gay verbena, the saucy marigold and countless other bright
flowers carpeted the grounds The cactus and the cochineal
excited and gratified curiosity. Ducks and geese often flew
up from the line of advance. Many rabbits and many deer
scampered across the plain; and occasionally wolves, cata-
mounts and panthers were frightened from cover. Wild
horses would gaze for an instant at their cousins in bondage,
and then gallop ofi, tossing their manes disdainfully ; and once
a herd of them, spaced as if to allow room for cannon, were
taken for Mexican cavalry. Innumerable centipedes, taran—
tulas and rattlesnakes furnished a good deal of interest, if
not of charm. The boundless prairie had somewhat the
fascination of the sea; and occasionally, when a mirage con-
Klarion

16 Apr 2019, 08:44

.
Last edited by Klarion on 20 Apr 2019, 00:51, edited 1 time in total.
ineuw

16 Apr 2019, 10:30

Hi, I get an error message saying that NormalizationFormKC() is an unknown function.
Klarion

16 Apr 2019, 11:40

.
Last edited by Klarion on 20 Apr 2019, 00:52, edited 1 time in total.
ineuw

17 Apr 2019, 13:20

I understand what the function is doing and what NormalizationFormKC is (after searching the web), but could you kindly clarify where I can find the function?
Klarion

17 Apr 2019, 18:12

.
Last edited by Klarion on 20 Apr 2019, 00:52, edited 1 time in total.
ineuw

18 Apr 2019, 03:51

Klarion, thanks again for all your efforts to help, but there is no result, unless of course I didn't implement the subroutine properly. This is my current code:

Code: Select all

;alt-u                        ligature "ﬁ" or - chr(0xEF)chr(0xAC)chr(0x81) or {U+FB01}
!u::
	autotrim, on

	clipboard =
	dirtyText =
	cleanResult =

	send, ^a^c
	clipwait, 0
	dirtyText = %clipboard%

	Msgbox, %dirtyText%			;INPUT IS OK

	clipwait, 0

	Loop, Parse, % dirtyText
		cleanResult . = A_LoopField ~ = "[\x{0000}-\x{007F}]" ? A_LoopField : NormalizationFormKC(A_LoopField)

	Msgbox, %cleanResult%		;NO OUTPUT

	send, %cleanResult%

return

NormalizationFormKC(x)
	{
	a := StrLen(x) * 6
	VarSetCapacity(b, a)
	DllCall("Normaliz.dll\NormalizeString", "int", 5, "wstr", x, "int", StrLen(x), "ptr", &b, "int", a)
	Return StrGet(&b, a, "UTF-16")
	}

return
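
A hedged sketch of the normalization idea in the script above, normalizing the whole string in one call instead of character by character. It assumes a Unicode build of AutoHotkey v1.1 on Windows Vista or later; the helper name NormalizeKC and the factor of 6 for the output buffer are illustrative assumptions. The points that differ from the script above are that the buffer is sized in bytes (two per UTF-16 code unit) and that only the number of characters reported by NormalizeString is read back.

```autohotkey
; Sketch only: assumes a Unicode build of AutoHotkey v1.1 on Windows Vista or later.
MsgBox % NormalizeKC("of" . Chr(0xFB01) . "ce")   ; hypothetical test string with the ligature

NormalizeKC(text) {                   ; hypothetical helper name
    srcLen := StrLen(text)
    if (srcLen = 0)
        return ""
    cap := srcLen * 6                 ; assumed generous output size, in characters
    VarSetCapacity(buf, cap * 2, 0)   ; capacity is in bytes: 2 per UTF-16 code unit
    ; 5 = NormalizationKC (NFKC); the return value is the number of characters written
    written := DllCall("Normaliz.dll\NormalizeString", "int", 5, "wstr", text, "int", srcLen, "ptr", &buf, "int", cap, "int")
    if (written <= 0)                 ; the call failed; fall back to the unchanged input
        return text
    return StrGet(&buf, written, "UTF-16")
}
```

NFKC also rewrites other compatibility characters (superscripts and full-width forms, for example), so a targeted StrReplace may be the safer choice when only the ligature should change.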