OCR Challenge

tpitera
Posts: 31
Joined: 27 Oct 2020, 15:56

08 Jun 2021, 14:28

I am having a lot of trouble extracting the data I need from PDF documents.
Problems:
1. OCR does not always work: it either can't read the data or returns the wrong number.
2. Each PDF I need to extract data from has a different layout, or puts the dimensions in a different area.
2.A. Sometimes one of the dimensions is missing; I need an error to pop up when that happens.

I need to obtain the dimensions of the panel, the bolt spacing, the module thickness, and the weight.
[Two attached images: image.png, image.png]
For example, in this PDF the values would be: long edge = 2024, short edge = 1024, long boltA = 1424, long boltB = "", short bolt = 984, thickness = 40, weight = 20.3 kg.
For the next PDF:
[Attached image: image.png]
Long edge = 2094, short edge = 1038, long boltA = 1400, long BoltB = 1276, short bolt = 989

Here is a GitHub link to the PDFs. Test your skills and help me with my problem. Thank you for your time.
https://github.com/piterat/solarPanelPdfs
kunkel321
Posts: 1108
Joined: 30 Nov 2015, 21:19

Re: OCR Challenge

08 Jun 2021, 15:20

This is well beyond my abilities... But it does occur to me that "RegExMatch()" might be useful. For example, if your OCR reliably returned
(some number) / (some number) Size of short side
then you could capture the numbers.
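
A minimal sketch of that capture idea, written in Python for illustration (the same pattern should carry over to AutoHotkey's RegExMatch()). The label text "Size of short side" comes from the example above; the function name `capture_pair` and the sample line are made up, and real OCR output would likely need a more forgiving pattern.

```python
import re

# Two numbers separated by a slash, followed by the label from the example.
# Assumption: the OCR returns the label text exactly like this.
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s+Size of short side",
    re.IGNORECASE,
)

def capture_pair(ocr_text):
    """Return the two numbers in front of the label as floats, or None."""
    match = PATTERN.search(ocr_text)
    if match is None:
        return None  # label not found: the caller can raise an error here
    return float(match.group(1)), float(match.group(2))

# Hypothetical OCR line, invented for this example.
print(capture_pair("1024 / 984 Size of short side"))
```

Returning None when the label is absent gives the script a place to show an error for a missing dimension.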

Someone who is good at GDIP could maybe rotate the image by 90 degrees, for capturing the sideways parts.

Unfortunately, you said that the OCR part is not reliably recognizing your numbers. Unless you get that solved, none of the rest will matter.

...also your github thing is empty.
ste(phen|ve) kunkel
tpitera
Posts: 31
Joined: 27 Oct 2020, 15:56

Re: OCR Challenge

08 Jun 2021, 16:04

Fixed the GitHub link. That's why this is an OCR challenge :)
kunkel321
Posts: 1108
Joined: 30 Nov 2015, 21:19

Re: OCR Challenge

08 Jun 2021, 17:35

It looks like brochures from three different brands. Is each brand at least consistent about where it places the schematic (for example, will JA Solar always use the top left of the second page), or is that also inconsistent?

I'm curious what the numbers are(?) Dimensions in millimeters?

Regarding the accuracy of the OCR, I seem to recall that Tesseract lets you specify a whitelist of characters, which may or may not be helpful.
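
A rough sketch of how the whitelist could be passed on the Tesseract command line, here built from Python. The `tessedit_char_whitelist` variable and the `-c` and `--psm` options are Tesseract's own; the function names, the file name `crop.png`, and the choice of page segmentation mode 7 (single text line) are assumptions for illustration. Whitelist support with the newer LSTM engine appears to depend on the Tesseract version, so results may vary.

```python
import subprocess

def build_tesseract_cmd(image_path, whitelist="0123456789."):
    """Build a Tesseract command that only allows the whitelisted characters."""
    return [
        "tesseract", image_path, "stdout",  # write recognized text to stdout
        "--psm", "7",                       # assumption: crop holds one text line
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]

def ocr_digits(image_path):
    """Run Tesseract (must be installed and on PATH) and return its text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# "crop.png" is a hypothetical cropped image of a single dimension.
print(" ".join(build_tesseract_cmd("crop.png")))
```

Restricting output to digits and a decimal point will not fix a badly cropped or rotated image, but it can stop a 0 from being read as the letter O.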
ste(phen|ve) kunkel
FanaticGuru
Posts: 1907
Joined: 30 Sep 2013, 22:25

Re: OCR Challenge

08 Jun 2021, 20:02

If all the PDFs are like the one I downloaded, I would not use OCR. The one I looked at was not a scanned picture. It was a data stream PDF, meaning you can copy and paste text out of the document.

You can literally Control-A and Control-C to copy all the text out of the document.

Now, I am not sure how I would go about finding the needed data when the layout varies, but I don't see why OCR would be needed. If you can figure out the specific areas to OCR, then you can copy and paste from those areas instead.

You might be able to just grab all the data and then sort through it from there if there is a pattern in the data.
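
As a sketch of that "grab everything, then sort through it" approach, the snippet below runs one pattern per field over the copied text and reports which fields were not found, which also covers the "error when a dimension is missing" requirement. It is written in Python for illustration. The labels ("Length", "Width", "Thickness", "Weight"), the field names, and the sample text are all hypothetical; each brand's datasheet would need its own patterns.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"

# Hypothetical labels: real datasheets will use different wording per brand.
FIELD_PATTERNS = {
    "long_edge": re.compile(r"Length\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
    "short_edge": re.compile(r"Width\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
    "thickness": re.compile(r"Thickness\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
    "weight_kg": re.compile(r"Weight\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
}

def extract_fields(text):
    """Return (values, missing): the numbers found and the fields not found."""
    values, missing = {}, []
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            values[name] = float(match.group(1))
        else:
            missing.append(name)
    return values, missing

# Invented sample of text copied out of a PDF.
sample = "Length: 2024 mm\nWidth: 1024 mm\nWeight: 20.3 kg"
values, missing = extract_fields(sample)
if missing:
    print("Missing fields:", ", ".join(missing))
```

Keeping the patterns in one table means a new datasheet layout only needs new entries, not new logic.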

After looking at a second PDF, this looks more like an artificial intelligence problem than an OCR problem. Even with my incredible visual processing brain, it takes a bit of studying of the PDFs to find the information you are looking for.

I assume you want to do this on at least a hundred PDFs, or it would not be worth the trouble of developing the code. I think it is unlikely that an AutoHotkey script could be developed to extract this information automatically from a bunch of differently formatted PDFs, even with perfect OCR, which copy and paste already provides.

FG
buliasz
Posts: 26
Joined: 10 Oct 2016, 14:31

Re: OCR Challenge

19 Dec 2021, 20:16

To improve Tesseract OCR results for your task, you can use my Tesstrain GUI to train Tesseract languages so that it will return proper values. Here's the code (AHK v2) and a compiled executable: https://github.com/buliasz/tesstrain-windows-gui
