Need help seeing my error

degarb · 01 Feb 2024, 12:21

I am trying to screenshot, process, and improve an image with a script and ImageMagick, then run Tesseract OCR on it. But it changes the first letter or leaves off T, f, I and l.

So I need to write a find and replace, which I thought a ps1 could do. However, ps1 files don't do what you ask, randomly break, try to be non-human-readable, replace things I never asked them to replace, and don't replace things I asked them to.
I need to port it to an AHK file, which was working splendidly before it broke for no apparent reason. Here is my draft. It was working fine; I added several lines, everything executed, and now it writes a 3-byte txt. Nothing works as written. It's got to be a typo, but I'm not seeing it.

Code:

FileRead, content, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book.txt

; Replace lines that contain only numbers with an empty string
content := RegExReplace(content, "m)^\d+$", "")

; Replace different types of line breaks with spaces
content := StrReplace(content, "`n", " ")

content := RegExReplace(content, "_", " ")
content := RegExReplace(content, "°", "")
content := RegExReplace(content, "â", "")
content := RegExReplace(content, "€", "")
content := RegExReplace(content, "Â", "")
content := RegExReplace(content, "§", "")
content := RegExReplace(content, "¶", "")


; Add your replacement rules here
content := RegExReplace(content, "i) t ", " I ")
content := RegExReplace(content, "/bi)t/b", " I ")
content := RegExReplace(content, "i) td ", " I'd ")
content := RegExReplace(content, "i) id ", " I'd ")
content := RegExReplace(content, "i) t'll ", " I'll ")
content := RegExReplace(content, "i) tt's ", " It's ")
content := RegExReplace(content, "i) tts ", " Its ")
content := RegExReplace(content, "i) ts ", " Is ")
content := RegExReplace(content, "i) tm ", " i'm ")
content := RegExReplace(content, "i) t'm ", " i'm ")
content := RegExReplace(content, "i) 0 ", " to ")
content := RegExReplace(content, "i) o ", " to ")
content := RegExReplace(content, "i) ollow", " follow")
content := RegExReplace(content, "i) ailed", " failed")
content := RegExReplace(content, "i) ry ", " try ")
content := RegExReplace(content, "i) ell ", " tell ")
content := RegExReplace(content, "i) ime ", " time ")
content := RegExReplace(content, "i) ove ", " love ")
content := RegExReplace(content, "i) oved ", " loved ")
content := RegExReplace(content, "i) ost ", " lost ")
content := RegExReplace(content, "i) ose ", " lose ")
content := RegExReplace(content, "i) oss ", " loss ")
content := RegExReplace(content, "i) oser ", " loser ")
content := RegExReplace(content, "i) tself ", " itself ")
content := RegExReplace(content, "i) oom", " loom")
content := RegExReplace(content, "i) ight", " light")
content := RegExReplace(content, "i) ike ", " like ")
content := RegExReplace(content, "i) iked ", " liked ")
content := RegExReplace(content, "i) ive ", " live ")
content := RegExReplace(content, "i) ived ", " lived ")
content := RegExReplace(content, "i) hing ", " thing ")
content := RegExReplace(content, "i) hings ", " things ")
content := RegExReplace(content, "i) ook", " look")
content := RegExReplace(content, "i) elevision", " Television")
content := RegExReplace(content, "i) she old ", " she told")
content := RegExReplace(content, "i) he old ", " he told")
content := RegExReplace(content, "i) amily", " family")
content := RegExReplace(content, "i) ound", " found")
content := RegExReplace(content, "i)\\bind\\b", " find")
content := RegExReplace(content, "i)\\binds\\b", " finds")
content := RegExReplace(content, "i)\\binding\\b", " finding")
content := RegExReplace(content, "i)\\bindings\\b", " findings ")
content := RegExReplace(content, "i)\\briend\\b", " friend")
content := RegExReplace(content, "i)\\bherefore\\b", " therefore")
content := RegExReplace(content, "i) actory", " factory")
content := RegExReplace(content, "i) rom", " from")
content := RegExReplace(content, "i) ong", " long")
content := RegExReplace(content, "i) ive", " live")
content := RegExReplace(content, "i) -ive", " lived")
content := RegExReplace(content, "i) ind", " find")
content := RegExReplace(content, "i) augh", " laugh")
content := RegExReplace(content, "i) isten", " listen")
content := RegExReplace(content, "i) t'll", " I'll")
content := RegExReplace(content, "i) ix", " fix")
content := RegExReplace(content, "i) irst", " first")

; Fix contractions
content := RegExReplace(content, "i)\byoud\b", "you'd")
content := RegExReplace(content, "i)\bwhod\b", "who'd")
content := RegExReplace(content, "i)\bhed\b", "he'd")
content := RegExReplace(content, "i)\bshes\b", "she's")
content := RegExReplace(content, "i)\bwerent\b", "weren't")
content := RegExReplace(content, "i)\bcant\b", "can't")
content := RegExReplace(content, "i)\bdont\b", "don't")
content := RegExReplace(content, "i)oulnt\b", "ouldn't")
content := RegExReplace(content, "i)oulnt\b", "ouldn't")
content := RegExReplace(content, "i)\bcant\b", " can't")
content := RegExReplace(content, "i)\bid\b", " I'd")
content := RegExReplace(content, "i)\bill\b", " I'll")
content := RegExReplace(content, "i)\bhell\b", " he'll")
content := RegExReplace(content, "i)\bshell\b", " she'll")
content := RegExReplace(content, "i)\bhed\b", " he'd")
content := RegExReplace(content, "i)\bshed\b", " she'd")
content := RegExReplace(content, "\bIll\b", " I'll ")

; conditional
content := RegExReplace(content, "i)(?<!black|fat|skinny|white|red|your|our|his|big|small|yello|ugly|my|the|a|her|their) hem ", " them ")
content := RegExReplace(content, "i)(?<!\b(black|fat|skinny|white|red|your|our|his|big|small|yellow|ugly|the|a|her|my|their|mother)\b) \bhen\b ", " then ")
content := RegExReplace(content, "i)(?<!black|white|red|your|our|his|my|blue|big|small|top|yello|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|my|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Polka-dot|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer) hat ", " that ")


content := RegExReplace(content, " st\.", " Saint")
content := RegExReplace(content, " ph\.d\.", " PHD")
content := RegExReplace(content, " dr\.", " Doctor")
content := RegExReplace(content, " mr\.", " Mister")
content := RegExReplace(content, " mrs\.", " Missus")
content := RegExReplace(content, " ms\.", " Miss")
content := RegExReplace(content, " jr\.", " Junior")
content := RegExReplace(content, " sr\.", " Senior")
content := RegExReplace(content, " co\.", " Company")
content := RegExReplace(content, " inc\.", " Incorporated")
content := RegExReplace(content, " ltd\.", " Limited")
content := RegExReplace(content, " intl\.", " International")
content := RegExReplace(content, " prof\.", " Professor")
content := RegExReplace(content, " gov\.", " Governor")
content := RegExReplace(content, " capt\.", " Captain")
content := RegExReplace(content, " sgt\.", " Sergeant")
content := RegExReplace(content, " corp\.", " Corporation")
content := RegExReplace(content, " ave\.", " Avenue")
content := RegExReplace(content, " blvd\.", " Boulevard")
content := RegExReplace(content, " ft\.", " Fort")
content := RegExReplace(content, " mt\.", " Mount")
content := RegExReplace(content, " ln\.", " Lane")
content := RegExReplace(content, " rd\.", " Road")
content := RegExReplace(content, " etc\.", " Etcetera")
content := RegExReplace(content, " esp\.", " Especially")
content := RegExReplace(content, " e\.g\.", " For example")
content := RegExReplace(content, " i\.e\.", " That is")
content := RegExReplace(content, "i) p\.(?=\s?\d)", " Page")
content := RegExReplace(content, " pp\.", " Pages")
content := RegExReplace(content, " par\.", " Paragraph")
content := RegExReplace(content, " vol\.", " Volume")
content := RegExReplace(content, " lb\.", " Pound")
content := RegExReplace(content, " oz\.", " Ounce")
content := RegExReplace(content, " gal\.", " Gallon")
content := RegExReplace(content, " qt\.", " Quart")
content := RegExReplace(content, " pt\.", " Pint")
content := RegExReplace(content, " yd\.", " Yard")
content := RegExReplace(content, "i)(?<=\d\s?)in\.", " Inch")
content := RegExReplace(content, " ft\.", " Foot")
content := RegExReplace(content, " mi\.", " Mile")
content := RegExReplace(content, " mm\.", " Millimeter")
content := RegExReplace(content, " cm\.", " Centimeter")
content := RegExReplace(content, "i)(?<=\d\s?)m\.", " Meter")
content := RegExReplace(content, " km\.", " Kilometer")
content := RegExReplace(content, " mg\.", " Milligram")
content := RegExReplace(content, "i)(?<=\d\s?)g\.", " Gram")
content := RegExReplace(content, " kg\.", " Kilogram")
content := RegExReplace(content, "i)(?<=\d\s?)l\.", " Liter")
content := RegExReplace(content, " ml\.", " Milliliter")
content := RegExReplace(content, " tbsp\.", " Tablespoon")
content := RegExReplace(content, " tsp\.", " Teaspoon")
content := RegExReplace(content, " sq\.", " Square")

content := RegExReplace(content, "i)\\byoud\\b", "you'd")
content := RegExReplace(content, "i)\\bwhod\\b", "who'd")
content := RegExReplace(content, "i)\\bhed\\b", "he'd")
content := RegExReplace(content, "i)\\bshes\\b", "she's")
content := RegExReplace(content, "i)\\bwerent\\b", "weren't")
content := RegExReplace(content, "i)\\bcant\\b", "can't")
content := RegExReplace(content, "i)\\bdont\\b", "don't")
content := RegExReplace(content, "i)\\barent\\b", "aren't")
content := RegExReplace(content, "i)\\bcant\\b", "can't")
content := RegExReplace(content, "i)\\bcouldve\\b", "could've")
content := RegExReplace(content, "i)\\bcouldnt\\b", "couldn't")
content := RegExReplace(content, "i)\\bdidnt\\b", "didn't")
content := RegExReplace(content, "i)\\bdoesnt\\b", "doesn't")
content := RegExReplace(content, "i)\\bdont\\b", "don't")
content := RegExReplace(content, "i)\\bhadnt\\b", "hadn't")
content := RegExReplace(content, "i)\\bhasnt\\b", "hasn't")
content := RegExReplace(content, "i)\\bhavent\\b", "haven't")
content := RegExReplace(content, "i)\\bhed\\b", "he'd")
content := RegExReplace(content, "i)\\bhes\\b", "he's")
content := RegExReplace(content, "i)\\bhowd\\b", "how'd")
content := RegExReplace(content, "i)\\bhowll\\b", "how'll")
content := RegExReplace(content, "i)\\bhowre\\b", "how're")
content := RegExReplace(content, "i)\\bhowve\\b", "how've")
content := RegExReplace(content, "i)\\bId\\b", "I'd")
content := RegExReplace(content, "i)\\bIll\\b", "I'll")
content := RegExReplace(content, "i)\\bIm\\b", "I'm")
content := RegExReplace(content, "i)\\bim\\b", "I'm")
content := RegExReplace(content, "i)\\bIve\\b", "I've")
content := RegExReplace(content, "i)\\bisnt\\b", "isn't")
content := RegExReplace(content, "i)\\bitd\\b", "it'd")
content := RegExReplace(content, "i)\\bitll\\b", "it'll")
content := RegExReplace(content, "i)\\bits\\b", "it's")
content := RegExReplace(content, "i)\\bIve\\b", "I've")
content := RegExReplace(content, "i)\\bmightve\\b", "might've")
content := RegExReplace(content, "i)\\bmustve\\b", "must've")
content := RegExReplace(content, "i)\\bmustnt\\b", "mustn't")
content := RegExReplace(content, "i)\\bshant\\b", "shan't")
content := RegExReplace(content, "i)\\bshed\\b", "she'd")
content := RegExReplace(content, "i)\\bshes\\b", "she's")
content := RegExReplace(content, "i)\\bshouldve\\b", "should've")
content := RegExReplace(content, "i)\\bshouldnt\\b", "shouldn't")
content := RegExReplace(content, "i)\\bthatd\\b", "that'd")
content := RegExReplace(content, "i)\\bthats\\b", "that's")
content := RegExReplace(content, "i)\\bthered\\b", "there'd")
content := RegExReplace(content, "i)\\btheres\\b", "there's")
content := RegExReplace(content, "i)\\btheyd\\b", "they'd")
content := RegExReplace(content, "i)\\btheyll\\b", "they'll")
content := RegExReplace(content, "i)\\btheyre\\b", "they're")
content := RegExReplace(content, "i)\\btheyve\\b", "they've")
content := RegExReplace(content, "i)\\bwasnt\\b", "wasn't")
content := RegExReplace(content, "i)\\bwerent\\b", "weren't")
content := RegExReplace(content, "i)\\bwhatd\\b", "what'd")
content := RegExReplace(content, "i)\\bwhatll\\b", "what'll")
content := RegExReplace(content, "i)\\bwhats\\b", "what's")
content := RegExReplace(content, "i)\\bwhatve\\b", "what've")
content := RegExReplace(content, "i)\\bwhend\\b", "when'd")
content := RegExReplace(content, "i)\\bwhens\\b", "when's")
content := RegExReplace(content, "i)\\bwhered\\b", "where'd")
content := RegExReplace(content, "i)\\bwheres\\b", "where's")
content := RegExReplace(content, "i)\\bwhereve\\b", "where've")
content := RegExReplace(content, "i)\\bwhod\\b", "who'd")
content := RegExReplace(content, "i)\\bwholl\\b", "who'll")
content := RegExReplace(content, "i) ll ", " I'll ")


; Add more replacements as needed

; Replace two or more spaces with a single space
content := RegExReplace(content, "\s{2,}", " ")

; Remove non-ASCII characters
content := RegExReplace(content, "[^\x00-\x7F]", "")

; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine2.txt
FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine2 test.txt, UTF-8



MsgBox, Script completed.

[Mod edit: Added [code][/code] tags.]

01 Feb 2024, 12:28

@degarb, your topic was moved from the AHK v2 help forum to the v1 help forum since you posted v1 code. Next time, please choose the correct forum yourself.
Also, please remember to use code tags, especially if you post a longer script like above. Thank you!

ahk7 · 01 Feb 2024, 12:38

Simply comment out a few lines and run the script. Still broken? Comment out a few more, etc. At some point you'll find the "what broke it" and then you can fix and uncomment the lines.
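A way to narrow this down without commenting lines by hand, offered as a minimal sketch: in AHK v1, RegExReplace sets ErrorLevel to 0 on success and to an error description (or a negative number) when the pattern fails to compile or run. The helper name SafeReplace and the sample text below are made up for illustration and are not part of the original script.

```autohotkey
; Hypothetical helper: runs one replacement and reports a bad pattern
; instead of silently carrying on with the next rule.
SafeReplace(haystack, needle, replacement)
{
    result := RegExReplace(haystack, needle, replacement)
    if (ErrorLevel != 0)  ; non-zero means the pattern did not compile or run
    {
        MsgBox, % "Pattern failed: " needle "`n" ErrorLevel
        return haystack   ; keep the text as it was before this rule
    }
    return result
}

content := "he had 5 in. of rope"                                ; sample text, assumed
content := SafeReplace(content, "i)(?<=\d\s?)in\.", " Inch")     ; variable-length lookbehind, expected to be reported
content := SafeReplace(content, "i)(?<=\d) ?in\.", " Inch")      ; fixed-length lookbehind
MsgBox, %content%
```

Routing every rule through one helper like this means the first rejected pattern names itself, which is the same bisection idea as commenting lines out.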

As you had a working script before, consider making backup copies before modifying, or use something like git to keep track of changes so you can easily revert.

At first glance, this seems odd: \\bind\\b. Why the \\b? \b is the actual "word boundary" token; by writing \\ you're escaping it, since \\ is a literal \. So it looks like you are replacing the literal text \bind\b with find, and it's unlikely the OCR text actually contains \bind\b.
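A short side-by-side of the two spellings, as a sketch with an invented sample sentence. AHK strings use the backtick as their escape character, so a backslash inside the quotes reaches the regex engine unchanged.

```autohotkey
text := "I cannot ind it"                             ; sample OCR-style text, assumed
wrong := RegExReplace(text, "i)\\bind\\b", "find")    ; \\ is a literal backslash: looks for the text \bind\b
right := RegExReplace(text, "i)\bind\b", "find")      ; \b is a word boundary: matches the standalone word "ind"
MsgBox, % "wrong: " wrong "`nright: " right
```

The first call leaves the text alone because the sentence contains no backslashes; the second replaces the standalone "ind".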

degarb · 01 Feb 2024, 13:21

Code:

content := RegExReplace(content, "(?i)(?<!\b(black|fat|skinny|white|red|your|our|his|big|small|yellow|ugly|my|the|a|her|their)\b)\bhem\b", " them ")
content := RegExReplace(content, "(?i)(?<!\b(black|fat|skinny|white|red|your|our|his|big|small|yellow|ugly|the|a|her|my|their|mother)\b)\bhen\b", " then ")
content := RegExReplace(content, "(?i)(?<!\b(black|white|red|your|our|his|my|blue|big|small|top|yellow|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Polka-dot|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer)\b)\bhat\b", " that ")

doesn't work

Neither does

Code:

content := RegExReplace(content, "(?i)(?<!\\b (black|white|red|your|our|his|my|blue|big|small|top|yellow|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Polka-dot|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer) \\W \\b)\\bhat\\b", " that ")

[Mod edit: Added [code][/code] tags. Please use them yourself when posting code!]

degarb · 01 Feb 2024, 13:37

Also, this doesn't work:

Code:

content := RegExReplace(content, "(?i)(?<=\\d\\s?)in\\.", " Inch")

[Mod edit: Added [code][/code] tags. Please use them yourself when posting code!]

01 Feb 2024, 13:39

@degarb, please remember to use code tags. Thank you!


degarb · 01 Feb 2024, 15:32

degarb · 01 Feb 2024, 15:45

degarb · 01 Feb 2024, 16:18

After many hours, I think I have an AHK OCR fixup: a find and replace for the errors Tesseract commonly makes, as it drops the letters L, T, and F when a word begins with them. I couldn't figure out how to save it as an ASCII text file. I am scanning Russian and Ukraine books, and I get lots of garbage that chokes and crashes the screen reader if I use UTF-8. It allows too many characters no English speaker knows about. ASCII is better.
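On saving as plain ASCII, a minimal sketch for AHK v1.1: once everything outside the 7-bit range has been stripped, writing with the UTF-8-RAW encoding (UTF-8 without a byte order mark) should leave only ASCII bytes in the file. The plain UTF-8 option writes a 3-byte byte order mark at the start of a new file. The sample text and output file name are assumptions for illustration.

```autohotkey
content := "Caf" Chr(233) " text"                      ; sample text with one non-ASCII character
content := RegExReplace(content, "[^\x00-\x7F]", "")   ; drop everything outside 7-bit ASCII
outFile := A_Temp "\ascii-demo.txt"                    ; hypothetical output path
FileDelete, %outFile%                                  ; FileAppend appends, so clear any old copy first
FileAppend, %content%, %outFile%, UTF-8-RAW            ; UTF-8 without a byte order mark
```

Deleting the same file that FileAppend writes to keeps repeated runs from stacking output on top of the previous result.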

I tried for weeks with a .NET ps1 script. It would change "a", "the", and "he" to "they", even after I took that rule out, and I never told it to change "he" to "they" or "a" to "they". It would also leave in a standalone T, which is an I that Tesseract mistranscribed. Tesseract on Windows does poorly, considering that the input is a screenshot I enhance, the fonts are standard, and the program is old; I used it in 2001. On the other hand, we are getting robbed blind by OCR and natural AI voices.

Code:

FileRead, content, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book.txt

; Check if content is read correctly
if (content = "")
{
    MsgBox, Error reading file.
    return
}

; Write the modified content to the output file
FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine1.txt
	;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine1.TXT, UTF-8

; Replace lines that contain only numbers with an empty string
content := RegExReplace(content, "m)^\d+$", "")


; Replace different types of line breaks with spaces
content := StrReplace(content, "`n", " ")

; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine2.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine2.TXT, UTF-8

; FIX WANKO CHARS
content := RegExReplace(content, "_", " ")
content := RegExReplace(content, "°", "")
content := RegExReplace(content, "â", "")
content := RegExReplace(content, "€", "")
content := RegExReplace(content, "Â", "")
content := RegExReplace(content, "§", "")
content := RegExReplace(content, "¶", "")


; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine3.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine3.TXT, UTF-8

; Add your replacement rules here
; FIX COMMON TESSERACT ERRORS
content := RegExReplace(content, "i) t ", " I ")
content := RegExReplace(content, "i)(?<![\w'])t(?![\w'])", "I") ; standalone t/T, not part of a word or a contraction such as don't
content := RegExReplace(content, "i) td ", " I'd ")
content := RegExReplace(content, "i) id ", " I'd ")
content := RegExReplace(content, "i) t'll ", " I'll ")
content := RegExReplace(content, "i) tt's ", " It's ")
content := RegExReplace(content, "i) tts ", " Its ")
content := RegExReplace(content, "i) ts ", " Is ")
content := RegExReplace(content, "i) tm ", " i'm ")
content := RegExReplace(content, "i) t'm ", " i'm ")
content := RegExReplace(content, "i) 0 ", " to ")
content := RegExReplace(content, "i) o ", " to ")
content := RegExReplace(content, "i) ollow", " follow")
content := RegExReplace(content, "i) ailed", " failed")
content := RegExReplace(content, "i) ry ", " try ")
content := RegExReplace(content, "i) ell ", " tell ")
content := RegExReplace(content, "i) ime ", " time ")
content := RegExReplace(content, "i) ove ", " love ")
content := RegExReplace(content, "i) oved ", " loved ")
content := RegExReplace(content, "i) ost ", " lost ")
content := RegExReplace(content, "i) ose ", " lose ")
content := RegExReplace(content, "i) oss ", " loss ")
content := RegExReplace(content, "i) oser ", " loser ")
content := RegExReplace(content, "i) tself ", " itself ")
content := RegExReplace(content, "i) oom", " loom")
content := RegExReplace(content, "i) ight", " light")
content := RegExReplace(content, "i) ike ", " like ")
content := RegExReplace(content, "i) iked ", " liked ")
content := RegExReplace(content, "i) ive ", " live ")
content := RegExReplace(content, "i) ived ", " lived ")
content := RegExReplace(content, "i) hing ", " thing ")
content := RegExReplace(content, "i) hings ", " things ")
content := RegExReplace(content, "i) ook", " look")
content := RegExReplace(content, "i) elevision", " Television")
content := RegExReplace(content, "i) she old ", " she told")
content := RegExReplace(content, "i) he old ", " he told")
content := RegExReplace(content, "i) amily", " family")
content := RegExReplace(content, "i) ound", " found")
content := RegExReplace(content, "i)\bind\b", " find")
content := RegExReplace(content, "i)\binds\b", " finds")
content := RegExReplace(content, "i)\binding\b", " finding")
content := RegExReplace(content, "i)\bindings\b", " findings ")
content := RegExReplace(content, "i)\briend\b", " friend")
content := RegExReplace(content, "i)\bherefore\b", " therefore")
content := RegExReplace(content, "i) actory", " factory")
content := RegExReplace(content, "i) rom", " from")
content := RegExReplace(content, "i) ong", " long")
content := RegExReplace(content, "i) ive", " live")
content := RegExReplace(content, "i) -ive", " lived")
content := RegExReplace(content, "i) ind", " find")
content := RegExReplace(content, "i) augh", " laugh")
content := RegExReplace(content, "i) isten", " listen")
content := RegExReplace(content, "i) t'll", " I'll")
content := RegExReplace(content, "i) ix", " fix")
content := RegExReplace(content, "i) irst", " first")
	

; Write the modified content to the output file
FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine4.txt
FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine4.TXT, UTF-8


; conditional
; goto SkipConditional

content := RegExReplace(content, "i)(?<!black|fat|skinny|white|red|your|our|his|big|small|yello|ugly|my|the|a|her|their) hem ", " them ")
content := RegExReplace(content, "i)(?<!black|fat|skinny|white|red|your|our|his|big|small|yellow|ugly|the|a|her|my|their|mother) hen ", " then ")
content := RegExReplace(content, "i)(?<!Black|white|red|your|our|his|my|blue|big|small|top|yello|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|my|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Polka-dot|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer) hat ", " that ")


book := content

; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5.TXT, UTF-8


;MsgBox, book is %book%

content := RegExReplace(content, "(?i)(?<!black|white|red|your|our|his|my|blue|big|small|top|yellow|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer) hat ", " that ")
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5a.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5a.TXT, UTF-8
;msgbox, hat to that %content%

SkipConditional:  ;conditionals don't work
; fix abbreviations


; Write the modified content to the output file


content := RegExReplace(content, " st\.", " Saint")
content := RegExReplace(content, " ph\.d\.", " PHD")
content := RegExReplace(content, " dr\.", " Doctor")
content := RegExReplace(content, " mr\.", " Mister")
content := RegExReplace(content, " mrs\.", " Missus")
content := RegExReplace(content, " ms\.", " Miss")
content := RegExReplace(content, " jr\.", " Junior")
content := RegExReplace(content, " sr\.", " Senior")
content := RegExReplace(content, " co\.", " Company")
content := RegExReplace(content, " inc\.", " Incorporated")
content := RegExReplace(content, " ltd\.", " Limited")
content := RegExReplace(content, " intl\.", " International")
content := RegExReplace(content, " prof\.", " Professor")
content := RegExReplace(content, " gov\.", " Governor")
content := RegExReplace(content, " capt\.", " Captain")
content := RegExReplace(content, " sgt\.", " Sergeant")
content := RegExReplace(content, " corp\.", " Corporation")
content := RegExReplace(content, " ave\.", " Avenue")
content := RegExReplace(content, " blvd\.", " Boulevard")
content := RegExReplace(content, " ft\.", " Fort")
content := RegExReplace(content, " mt\.", " Mount")
content := RegExReplace(content, " ln\.", " Lane")
content := RegExReplace(content, " rd\.", " Road")
content := RegExReplace(content, " etc\.", " Etcetera")
content := RegExReplace(content, " esp\.", " Especially")
content := RegExReplace(content, " e\.g\.", " For example")
content := RegExReplace(content, " i\.e\.", " That is")
content := RegExReplace(content, "i) p\.(?=\s?\d)", " Page")
content := RegExReplace(content, " pp\.", " Pages")
content := RegExReplace(content, " par\.", " Paragraph")
content := RegExReplace(content, " vol\.", " Volume")
content := RegExReplace(content, " lb\.", " Pound")
content := RegExReplace(content, " oz\.", " Ounce")
content := RegExReplace(content, " gal\.", " Gallon")
content := RegExReplace(content, " qt\.", " Quart")
content := RegExReplace(content, " pt\.", " Pint")
content := RegExReplace(content, " yd\.", " Yard")

;msgbox, %content%
content := RegExReplace(content, "i)(?<=\d) ?in\.", " Inch") ; lookbehind kept to one digit (fixed length); the optional space is part of the match
FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5b.txt
;msgbox, %content%
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5b.TXT, UTF-8

content := RegExReplace(content, " ft\.", " Foot")
content := RegExReplace(content, " mi\.", " Mile")
content := RegExReplace(content, " mm\.", " Millimeter")
content := RegExReplace(content, " cm\.", " Centimeter")
content := RegExReplace(content, "i)(?<=\d) ?m\.", " Meter") ; only after a digit, so ordinary words ending in "m." are left alone
content := RegExReplace(content, " km\.", " Kilometer")
content := RegExReplace(content, " mg\.", " Milligram")
content := RegExReplace(content, "i)(?<=\d) ?g\.", " Gram") ; same fixed-length lookbehind as above
content := RegExReplace(content, " kg\.", " Kilogram")

content := RegExReplace(content, "i)(?<=\d) ?l\.", " Liter") ; same fixed-length lookbehind as above
content := RegExReplace(content, " ml\.", " Milliliter")
content := RegExReplace(content, " tbsp\.", " Tablespoon")
content := RegExReplace(content, " tsp\.", " Teaspoon")
content := RegExReplace(content, " sq\.", " Square")

;
; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine6.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine6.TXT, UTF-8

; fix contractions
content := RegExReplace(content, "i)\byoud\b", "you'd")
content := RegExReplace(content, "i)\bwhod\b", "who'd")
content := RegExReplace(content, "i)\bhed\b", "he'd")
content := RegExReplace(content, "i)\bshes\b", "she's")
content := RegExReplace(content, "i)\bwerent\b", "weren't")
content := RegExReplace(content, "i)\bcant\b", "can't")
content := RegExReplace(content, "i)\bdont\b", "don't")
content := RegExReplace(content, "i)\barent\b", "aren't")
content := RegExReplace(content, "i)\bcant\b", "can't")
content := RegExReplace(content, "i)\bcouldve\b", "could've")
content := RegExReplace(content, "i)\bcouldnt\b", "couldn't")
content := RegExReplace(content, "i)\bdidnt\b", "didn't")
content := RegExReplace(content, "i)\bdoesnt\b", "doesn't")
content := RegExReplace(content, "i)\bdont\b", "don't")
content := RegExReplace(content, "i)\bhadnt\b", "hadn't")
content := RegExReplace(content, "i)\bhasnt\b", "hasn't")
content := RegExReplace(content, "i)\bhavent\b", "haven't")
content := RegExReplace(content, "i)\bhed\b", "he'd")
content := RegExReplace(content, "i)\bhes\b", "he's")
content := RegExReplace(content, "i)\bhowd\b", "how'd")
content := RegExReplace(content, "i)\bhowll\b", "how'll")
content := RegExReplace(content, "i)\bhowre\b", "how're")
content := RegExReplace(content, "i)\bhowve\b", "how've")
content := RegExReplace(content, "i)\bId\b", "I'd")
content := RegExReplace(content, "i)\bIll\b", "I'll")
content := RegExReplace(content, "i)\bIm\b", "I'm")
content := RegExReplace(content, "i)\bim\b", "I'm")
content := RegExReplace(content, "i)\bIve\b", "I've")
content := RegExReplace(content, "i)\bisnt\b", "isn't")
content := RegExReplace(content, "i)\bitd\b", "it'd")
content := RegExReplace(content, "i)\bitll\b", "it'll")
content := RegExReplace(content, "i)\bits\b", "it's")
content := RegExReplace(content, "i)\bIve\b", "I've")
content := RegExReplace(content, "i)\bmightve\b", "might've")
content := RegExReplace(content, "i)\bmustve\b", "must've")
content := RegExReplace(content, "i)\bmustnt\b", "mustn't")
content := RegExReplace(content, "i)\bshant\b", "shan't")
content := RegExReplace(content, "i)\bshed\b", "she'd")
content := RegExReplace(content, "i)\bshes\b", "she's")
content := RegExReplace(content, "i)\bshouldve\b", "should've")
content := RegExReplace(content, "i)\bshouldnt\b", "shouldn't")
content := RegExReplace(content, "i)\bthatd\b", "that'd")
content := RegExReplace(content, "i)\bthats\b", "that's")
content := RegExReplace(content, "i)\bthered\b", "there'd")
content := RegExReplace(content, "i)\btheres\b", "there's")
content := RegExReplace(content, "i)\btheyd\b", "they'd")
content := RegExReplace(content, "i)\btheyll\b", "they'll")
content := RegExReplace(content, "i)\btheyre\b", "they're")
content := RegExReplace(content, "i)\btheyve\b", "they've")
content := RegExReplace(content, "i)\bwasnt\b", "wasn't")
content := RegExReplace(content, "i)\bwerent\b", "weren't")
content := RegExReplace(content, "i)\bwhatd\b", "what'd")
content := RegExReplace(content, "i)\bwhatll\b", "what'll")
content := RegExReplace(content, "i)\bwhats\b", "what's")
content := RegExReplace(content, "i)\bwhatve\b", "what've")
content := RegExReplace(content, "i)\bwhend\b", "when'd")
content := RegExReplace(content, "i)\bwhens\b", "when's")
content := RegExReplace(content, "i)\bwhered\b", "where'd")
content := RegExReplace(content, "i)\bwheres\b", "where's")
content := RegExReplace(content, "i)\bwhereve\b", "where've")
content := RegExReplace(content, "i)\bwhod\b", "who'd")
content := RegExReplace(content, "i)\bwholl\b", "who'll")
content := RegExReplace(content, "i) ll ", " I'll ")
; Fix persisting contractions despite above
content := RegExReplace(content, "i)\byoud\b", "you'd")
content := RegExReplace(content, "i)\bwhod\b", "who'd")
content := RegExReplace(content, "i)\bhed\b", "he'd")
content := RegExReplace(content, "i)\bshes\b", "she's")
content := RegExReplace(content, "i)\bwerent\b", "weren't")
content := RegExReplace(content, "i)\bcant\b", "can't")
content := RegExReplace(content, "i)\bdont\b", "don't")
content := RegExReplace(content, "i)oulnt\b", "ouldn't")
content := RegExReplace(content, "i)oulnt\b", "ouldn't")
content := RegExReplace(content, "i)\bcant\b", " can't")
content := RegExReplace(content, "i)\bid\b", " I'd")
content := RegExReplace(content, "i)\bill\b", " I'll")
content := RegExReplace(content, "i)\bhell\b", " he'll")
content := RegExReplace(content, "i)\bshell\b", " she'll")
content := RegExReplace(content, "i)\bhed\b", " he'd")
content := RegExReplace(content, "i)\bshed\b", " she'd")
content := RegExReplace(content, "\bIll\b", " I'll ")


; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine7.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine7.TXT, UTF-8


; Replace two or more spaces with a single space
;content := RegExReplace(content, "\s{2,}", " ")


; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine8.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine8.TXT, UTF-8


; Remove non-ASCII characters
content := RegExReplace(content, "[^\x00-\x7F]", "")

; Write the modified content to the output file
FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLineAHK.txt
FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLineAHK.TXT, UTF-8



MsgBox, Script completed.

[Mod edit: Added [code][/code] tags. Please use them yourself when posting code!]

ahk7 · 01 Feb 2024, 16:28

Now you're just posting nonsense and you won't get much help that way; your regexes don't make any sense either.
Go to https://www.regextester.com/ or a similar site, choose PCRE, and you can SEE what mistake you are making.
There are other regex tools as well, incl. nice highlighting and regex builders. Use those, problem solved.

andymbody · 01 Feb 2024, 19:33

It sounds like tesseract has many issues. Are you not able to abandon tesseract for a different OCR that might produce better results?

I see a few issues in some of the regex needles, but I'm not sure what you mean when you say "where I went wrong". Are you asking others to point out the regex needles that require edits?

What is ps1? Powershell?

andymbody · 01 Feb 2024, 19:50

degarb wrote: ↑

01 Feb 2024, 13:21

Code: Select all

content := RegExReplace(content, "(?i)(?<!\b(black|fat|skinny|white|red|your|our|his|big|small|yellow|ugly|my|the|a|her|their)\b)\bhem\b", " them ")
content := RegExReplace(content, "(?i)(?<!\b(black|fat|skinny|white|red|your|our|his|big|small|yellow|ugly|the|a|her|my|their|mother)\b)\bhen\b", " then ")
content := RegExReplace(content, "(?i)(?<!\b(black|white|red|your|our|his|my|blue|big|small|top|yellow|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Polka-dot|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer)\b)\bhat\b", " that ")

doesn't work

Neither does

Code: Select all

content := RegExReplace(content, "(?i)(?<!\\b (black|white|red|your|our|his|my|blue|big|small|top|yellow|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Polka-dot|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer) \\W \\b)\\bhat\\b", " that ")

it hates groupings
.

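AFAIK, a multiple choice (OR) group nested inside a look-behind is not supported when its alternatives have different lengths. Look-behinds must match a fixed length, so they cannot include +, *, or ? quantifiers (look-aheads do not have this restriction). PCRE does accept different-length alternatives at the top level of a look-behind, as in (?<!black|fat), but not wrapped in an inner group as in (?<!\b(black|fat)\b). You also have too many \ characters in the last one.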

Try this instead - with each negative look-behind separated. Or maybe place all the adjectives in an array and build the needle on the fly with a loop. (You will need to add the other adjective choices to this shortened example needle)

Code: Select all

content := RegExReplace(content, "(?i)(?<!black )(?<!fat )(?<!skinny )(?<!mother )\b(he)([mn])\b", "t$1$2") ; either them or then
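
The "build the needle on the fly" idea could look roughly like the sketch below (AutoHotkey v1). The adjective list and the sample sentence are made up for illustration, and the loop assumes none of the words contain regex metacharacters.

```autohotkey
; Hypothetical, shortened adjective list; extend it as needed.
adjectives := ["black", "fat", "skinny", "mother"]

; Build one fixed-length negative look-behind per word:
; (?<!black )(?<!fat )(?<!skinny )(?<!mother )
guards := ""
for index, word in adjectives
    guards .= "(?<!" . word . " )"

needle := "i)" . guards . "\b(he)([mn])\b"

; Made-up test sentence.
sample := "I told hem hen. The mother hen sat."
result := RegExReplace(sample, needle, "t$1$2")
MsgBox, %result%   ; "mother hen" is left alone; "hem" and "hen" gain a leading t
```

Keeping the words in an array means a new adjective is a one-word change instead of another edit to a very long needle.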

Code: Select all

content := RegExReplace(content, "(?i)(?<=\\d\\s?)in\\.", " Inch")

The ? quantifier attached to the white-space character \s is not allowed inside a look-behind, because a look-behind must match a fixed length. You also have too many \ characters, and you do not need the space in front of the replacement Inch.
Use this instead.

Code: Select all

content := RegExReplace(content, "(?i)(\d+\s?)in\.", "$1Inch")   ; use this

auto hotkey doesn't seem to like /d /b and others

What are these? Do you mean \d and \b? Some of them in your needles have been written \\d and \\b, which regex will interpret as a literal \ followed by a literal d or b. Remove one of the \ in these cases.
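
A quick way to see the difference is to test the three spellings against the same string. This is a minimal AutoHotkey v1 sketch; the sample string and variable names are made up.

```autohotkey
sample := "room 12"

posBackslash := RegExMatch(sample, "\d")    ; \d is the digit class; finds "1" at position 6
posSlash     := RegExMatch(sample, "/d")    ; literal "/" followed by "d"; not found, so 0
posDouble    := RegExMatch(sample, "\\d")   ; literal "\" followed by "d"; not found, so 0

MsgBox, \d: %posBackslash%`n/d: %posSlash%`n\\d: %posDouble%
```

AutoHotkey v1 uses the backtick, not the backslash, as its own escape character, so a single backslash inside the quoted needle reaches the regex engine unchanged.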

contentb := RegExReplace(content, "(?i)[0-9]) m.", " meter ")
contentb := RegExReplace(content, "(?i)[0-9]) g.", " Grams ")
contentb := RegExReplace(content, "(?i)[0-9]) l.", " Liter ")

These are incorrect. The extra ) must be one of:
1. Removed
2. Escaped as \)
3. Matched by a ( before it.
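
As a rough illustration of option 3, the digit can be captured and put back with $1. This is an AutoHotkey v1 sketch with a made-up sample string; the replacement wording is an assumption. Note also that the quoted lines assign to contentb rather than content, so their result is never used by the rest of the script.

```autohotkey
; Made-up sample text.
sample := "Add 5 g. of salt to 2 l. of water, 3 m. away."

; (\d) captures the digit so $1 can restore it, \s? allows an optional space,
; and \. is a literal period (a bare . would match any character).
sample := RegExReplace(sample, "i)(\d)\s?g\.", "$1 Grams")
sample := RegExReplace(sample, "i)(\d)\s?l\.", "$1 Liter")
sample := RegExReplace(sample, "i)(\d)\s?m\.", "$1 Meter")

MsgBox, %sample%
```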

degarb · 01 Feb 2024, 22:03

/d is supposed to be regex for digit, /s space, /b word boundary

I have AutoHotkey 1 and 2 installed. I think that could be the problem, if v1 doesn't understand regex. And my script might not be formatted for v2 and may need to be overhauled. I really want the /b, because a space doesn't always cut it.

My simplified script works with what I think is AHK v1. I think I tried to call it from a DOS batch file using AHK v2 64-bit, and I got yelled at and some GUI called me stupid. I wouldn't know where to begin researching each line so v2 could handle it. I need a break today; I've been doing this since before dawn, and I am sick.

PowerShell uses .NET and is very stupid. I got an error and it doesn't tell you which line. I spent 2 hours trying to get a debugger on it and failed; the directions were useless. Not to mention, the ps1 file wouldn't replace things it should and replaced all kinds of things it shouldn't. Copilot was clueless, and I could read the code; it just didn't always work.

ahk v1 never failed me.

Code: Select all

FileRead, content, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book.txt

; Check if content is read correctly
if (content = "")
{
    MsgBox, Error reading file.
    return
}

; Write the modified content to the output file
FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine1.txt
	;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine1.TXT, UTF-8

; Replace lines that contain only numbers with an empty string
content := RegExReplace(content, "m)^\d+$", "")


; Replace different types of line breaks with spaces
content := StrReplace(content, "`n", " ")

; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine2.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine2.TXT, UTF-8

; FIX WANKO CHARS
content := RegExReplace(content, "_", " ")
content := RegExReplace(content, "°", "")
content := RegExReplace(content, "â", "")
content := RegExReplace(content, "€", "")
content := RegExReplace(content, "Â", "")
content := RegExReplace(content, "§", "")
content := RegExReplace(content, "¶", "")


; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine3.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine3.TXT, UTF-8

; Add your replacement rules here
; FIX COMMON TESSERACT ERRORS
content := RegExReplace(content, "i) t ", " I ")
content := RegExReplace(content, "i) t ", " I ")
content := RegExReplace(content, "i) td ", " I'd ")
content := RegExReplace(content, "i) id ", " I'd ")
content := RegExReplace(content, "i) t'll ", " I'll ")
content := RegExReplace(content, "i) tt's ", " It's ")
content := RegExReplace(content, "i) tts ", " Its ")
content := RegExReplace(content, "i) ts ", " Is ")
content := RegExReplace(content, "i) tm ", " i'm ")
content := RegExReplace(content, "i) t'm ", " i'm ")
content := RegExReplace(content, "i) 0 ", " to ")
content := RegExReplace(content, "i) o ", " to ")
content := RegExReplace(content, "i) ollow", " follow")
content := RegExReplace(content, "i) ailed", " failed")
content := RegExReplace(content, "i) ry ", " try ")
content := RegExReplace(content, "i) ell ", " tell ")
content := RegExReplace(content, "i) ime ", " time ")
content := RegExReplace(content, "i) ove ", " love ")
content := RegExReplace(content, "i) oved ", " loved ")
content := RegExReplace(content, "i) ost ", " lost ")
content := RegExReplace(content, "i) ose ", " lose ")
content := RegExReplace(content, "i) oss ", " loss ")
content := RegExReplace(content, "i) oser ", " loser ")
content := RegExReplace(content, "i) tself ", " itself ")
content := RegExReplace(content, "i) oom", " loom")
content := RegExReplace(content, "i) ight", " light")
content := RegExReplace(content, "i) ike ", " like ")
content := RegExReplace(content, "i) iked ", " liked ")
content := RegExReplace(content, "i) ive ", " live ")
content := RegExReplace(content, "i) ived ", " lived ")
content := RegExReplace(content, "i) hing ", " thing ")
content := RegExReplace(content, "i) hings ", " things ")
content := RegExReplace(content, "i) ook", " look")
content := RegExReplace(content, "i) elevision", " Television")
content := RegExReplace(content, "i) she old ", " she told")
content := RegExReplace(content, "i) he old ", " he told")
content := RegExReplace(content, "i) amily", " family")
content := RegExReplace(content, "i) ound", " found")
content := RegExReplace(content, "i) ind ", " find")
content := RegExReplace(content, "i) inds ", " finds")
content := RegExReplace(content, "i) inding ", " finding")
content := RegExReplace(content, "i) indings ", " findings ")
content := RegExReplace(content, "i) riend ", " friend")
content := RegExReplace(content, "i) herefore ", " therefore")
content := RegExReplace(content, "i) actory", " factory")
content := RegExReplace(content, "i) rom", " from")
content := RegExReplace(content, "i) ong", " long")
content := RegExReplace(content, "i) ive", " live")
content := RegExReplace(content, "i) -ive", " lived")
content := RegExReplace(content, "i) ind", " find")
content := RegExReplace(content, "i) augh", " laugh")
content := RegExReplace(content, "i) isten", " listen")
content := RegExReplace(content, "i) t'll", " I'll")
content := RegExReplace(content, "i) ix", " fix")
content := RegExReplace(content, "i) irst", " first")
content := RegExReplace(content, " Screenshot taken", "")
content := RegExReplace(content, "\|", " I ")


; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine4.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine4.TXT, UTF-8


; conditional
; goto SkipConditional

content := RegExReplace(content, "i)(?<!black|fat|skinny|white|red|your|our|his|big|small|yello|ugly|my|the|a|her|their) hem ", " them ")
content := RegExReplace(content, "i)(?<!black|fat|skinny|white|red|your|our|his|big|small|yellow|ugly|the|a|her|my|their|mother) hen ", " then ")
content := RegExReplace(content, "i)(?<!Black|white|red|your|our|his|my|blue|big|small|top|yello|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|my|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Polka-dot|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer) hat ", " that ")


book := content

; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5.TXT, UTF-8


;MsgBox, book is %book%

content := RegExReplace(content, "(?i)(?<!black|white|red|your|our|his|my|blue|big|small|top|yellow|ugly|the|a|her|their|brim|brimmed|new|fancy|straw|green|woolen|felt|Cowboy|Fedora|Panama|Baseball|Bowler|Sun|Beanie|Derby|Trilby|Checked|Plaid|Floral|Wool|Silk|Leather|Knit|Felt|floppy|Safety|Hard|Winter|Summer) hat ", " that ")
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5a.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5a.TXT, UTF-8
;msgbox, hat to that %content%

SkipConditional:  ;conditionals don't work
; fix abbreviations


; Write the modified content to the output file


content := RegExReplace(content, " st\.", " Saint")
content := RegExReplace(content, " ph\.d\.", " PHD")
content := RegExReplace(content, " dr\.", " Doctor")
content := RegExReplace(content, " mr\.", " Mister")
content := RegExReplace(content, " mrs\.", " Missus")
content := RegExReplace(content, " ms\.", " Miss")   
content := RegExReplace(content, " jr\.", " Junior")
content := RegExReplace(content, " sr\.", " Senior")
content := RegExReplace(content, " co\.", " Company")
content := RegExReplace(content, " inc\.", " Incorporated")
content := RegExReplace(content, " ltd\.", " Limited")
content := RegExReplace(content, " intl\.", " International")
content := RegExReplace(content, " prof\.", " Professor")
content := RegExReplace(content, " gov\.", " Governor")
content := RegExReplace(content, " capt\.", " Captain")
content := RegExReplace(content, " sgt\.", " Sergeant")
content := RegExReplace(content, " corp\.", " Corporation")
content := RegExReplace(content, " ave\.", " Avenue")
content := RegExReplace(content, " blvd\.", " Boulevard")
content := RegExReplace(content, " ft\.", " Fort")
content := RegExReplace(content, " mt\.", " Mount")
content := RegExReplace(content, " ln\.", " Lane")
content := RegExReplace(content, " rd\.", " Road")
content := RegExReplace(content, " etc\.", " Etcetera")
content := RegExReplace(content, " esp\.", " Especially")
content := RegExReplace(content, " e\.g\.", " For example")
content := RegExReplace(content, " i\.e\.", " That is")
content := RegExReplace(content, "i) p\.(?=\s?\d)", " Page")
content := RegExReplace(content, " pp\.", " Pages")
content := RegExReplace(content, " par\.", " Paragraph")
content := RegExReplace(content, " vol\.", " Volume")
content := RegExReplace(content, " lb\.", " Pound")
content := RegExReplace(content, " oz\.", " Ounce")
content := RegExReplace(content, " gal\.", " Gallon")
content := RegExReplace(content, " qt\.", " Quart")
content := RegExReplace(content, " pt\.", " Pint")
content := RegExReplace(content, " yd\.", " Yard")

;msgbox, %content%
contentb := RegExReplace(content, "(?i)[0-9]) in.", " Inch ") ; Assign back to content
FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5b.txt
;msgbox, %content%
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine5b.TXT, UTF-8

content := RegExReplace(content, " ft\.", " Foot")
content := RegExReplace(content, " mi\.", " Mile")
content := RegExReplace(content, " mm\.", " Millimeter")
content := RegExReplace(content, " cm\.", " Centimeter")
contentb := RegExReplace(content, "(?i)[0-9]) m.", " meter ")
content := RegExReplace(content, " km\.", " Kilometer")
content := RegExReplace(content, " mg\.", " Milligram")
contentb := RegExReplace(content, "(?i)[0-9]) g.", " Grams ")
content := RegExReplace(content, " kg\.", " Kilogram")

contentb := RegExReplace(content, "(?i)[0-9]) l.", " Liter ")
content := RegExReplace(content, " ml\.", " Milliliter")
content := RegExReplace(content, " tbsp\.", " Tablespoon")
content := RegExReplace(content, " tsp\.", " Teaspoon")
content := RegExReplace(content, " sq\.", " Square")

;
; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine6.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine6.TXT, UTF-8

; fix contractions
content := RegExReplace(content, "i) youd ", " you'd ")
content := RegExReplace(content, "i) whod ", " who'd ")
content := RegExReplace(content, "i) hed ", " he'd ")
content := RegExReplace(content, "i) shes ", " she's ")
content := RegExReplace(content, "i) werent ", " weren't ")
content := RegExReplace(content, "i) cant ", " can't ")
content := RegExReplace(content, "i) dont ", " don't ")
content := RegExReplace(content, "i) arent ", " aren't ")
content := RegExReplace(content, "i) cant ", " can't ")
content := RegExReplace(content, "i) couldve ", " could've ")
content := RegExReplace(content, "i) couldnt ", " couldn't ")
content := RegExReplace(content, "i) didnt ", " didn't ")
content := RegExReplace(content, "i) doesnt ", " doesn't ")
content := RegExReplace(content, "i) dont ", " don't ")
content := RegExReplace(content, "i) hadnt ", " hadn't ")
content := RegExReplace(content, "i) hasnt ", " hasn't ")
content := RegExReplace(content, "i) havent ", " haven't ")
content := RegExReplace(content, "i) hed ", " he'd ")
content := RegExReplace(content, "i) hes ", " he's ")
content := RegExReplace(content, "i) howd ", " how'd ")
content := RegExReplace(content, "i) howll ", " how'll ")
content := RegExReplace(content, "i) howre ", " how're ")
content := RegExReplace(content, "i) howve ", " how've ")
content := RegExReplace(content, "i) Id ", " I'd ")
content := RegExReplace(content, "i) Ill ", " I'll ")
content := RegExReplace(content, "i) Im ", " I'm ")
content := RegExReplace(content, "i) im ", " I'm ")     ; DON'T THINK LITERALS WORK
content := RegExReplace(content, "i) im ", "  'm ")
content := RegExReplace(content, "i) Ive ", " I've ")
content := RegExReplace(content, "i) isnt ", " isn't ")
content := RegExReplace(content, "i) itd ", " it'd ")
content := RegExReplace(content, "i) itll ", " it'll ")
content := RegExReplace(content, "i) its ", " it's ")
content := RegExReplace(content, "i) Ive ", " I've ")
content := RegExReplace(content, "i) mightve ", " might've ")
content := RegExReplace(content, "i) mustve ", " must've ")
content := RegExReplace(content, "i) mustnt ", " mustn't ")
content := RegExReplace(content, "i) shant ", " shan't ")
content := RegExReplace(content, "i) shed ", " she'd ")
content := RegExReplace(content, "i) shes ", " she's ")
content := RegExReplace(content, "i) shouldve ", " should've ")
content := RegExReplace(content, "i) shouldnt ", " shouldn't ")
content := RegExReplace(content, "i) thatd ", " that'd ")
content := RegExReplace(content, "i) thats ", " that's ")
content := RegExReplace(content, "i) thered ", " there'd ")
content := RegExReplace(content, "i) theres ", " there's ")
content := RegExReplace(content, "i) theyd ", " they'd ")
content := RegExReplace(content, "i) theyll ", " they'll ")
content := RegExReplace(content, "i) theyre ", " they're ")
content := RegExReplace(content, "i) theyve ", " they've ")
content := RegExReplace(content, "i) wasnt ", " wasn't ")
content := RegExReplace(content, "i) werent ", " weren't ")
content := RegExReplace(content, "i) whatd ", " what'd ")
content := RegExReplace(content, "i) whatll ", " what'll ")
content := RegExReplace(content, "i) whats ", " what's ")
content := RegExReplace(content, "i) whatve ", " what've ")
content := RegExReplace(content, "i) whend ", " when'd ")
content := RegExReplace(content, "i) whens ", " when's ")
content := RegExReplace(content, "i) whered ", " where'd ")
content := RegExReplace(content, "i) wheres ", " where's ")
content := RegExReplace(content, "i) whereve ", " where've ")
content := RegExReplace(content, "i) whod ", " who'd ")
content := RegExReplace(content, "i) wholl ", " who'll ")
content := RegExReplace(content, "i) ll ", " I'll ")
; Fix persisting contractions despite above
content := RegExReplace(content, "i) youd ", " you'd ")
content := RegExReplace(content, "i) whod ", " who'd ")
content := RegExReplace(content, "i) hed ", " he'd ")
content := RegExReplace(content, "i) shes ", " she's ")
content := RegExReplace(content, "i) werent ", " weren't ")
content := RegExReplace(content, "i) cant ", " can't ")
content := RegExReplace(content, "i) dont ", " don't ")
content := RegExReplace(content, "i)oulnt ", " ouldn't ")
content := RegExReplace(content, "i)oulnt ", " ouldn't ")
content := RegExReplace(content, "i) cant ", " can't ")
content := RegExReplace(content, "i) id ", " I'd ")
content := RegExReplace(content, "i) ill ", " I'll ")
content := RegExReplace(content, "i) hell ", " he'll ")
content := RegExReplace(content, "i) shell ", " she'll ")
content := RegExReplace(content, "i) hed ", " he'd ")
content := RegExReplace(content, "i) shed ", " she'd ")
content := RegExReplace(content, " Ill ", " I'll ")


; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine7.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine7.TXT, UTF-8


; Replace two or more spaces with a single space
;content := RegExReplace(content, "\s{2,}", " ")


; Write the modified content to the output file
;FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine8.txt
;FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine8.TXT, UTF-8


; Remove non-ASCII characters
;content := RegExReplace(content, "[^\x00-\x7F]", "")
content := RegExReplace(content, "[^\x20-\x7E]", "")



; Write the modified content to the output file
FileDelete, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine.txt
FileAppend, %content%, D:\Program Files\Tesseract-OCR\whiteimages\doc\Book-singleLine.txt, UTF-8



MsgBox, Script completed.

[Mod edit: Added [code][/code] tags. Please use them yourself when posting code!]

01 Feb 2024, 22:36

@degarb, you again posted (twice) a rather large script without using code tags, obviously ignoring the latest staff messages.
Do you have any questions concerning the usage of code tags? If so, please don't hesitate to ask!

Otherwise, we will have to conclude that you are deliberately ignoring our messages. If that is true, we might have to resort to deleting your posts until you decide to react to our messages. Other, more severe sanctions might also be used to get your attention.

See my private message for more details.

andymbody · 01 Feb 2024, 22:56

degarb wrote: ↑
01 Feb 2024, 22:03
/d is supposed to be regex for digit, /s space, /b word boundary

Do you feel that / should precede d, s, b? Rather than \ ?

We have taken the time to provide feedback (as requested); it's up to you whether to apply that feedback or ignore it. So far it appears to have been ignored, or the communication barrier is too wide to overcome. I will dismiss myself from the conversation. Best of luck with your project.
