OCR Challenge

Posted: 08 Jun 2021, 14:28
by tpitera
I am having a lot of trouble extracting the data I need from PDF documents.
Problems:
1. OCR does not always work: it either can't read the data or returns the wrong number.
2. Each PDF that I need to obtain data from has a different layout, or puts the dimensions in a different area.
2.A. Sometimes one of the dimensions is missing; I need an error to pop up when that happens.

I need to obtain the dimensions of the panel, the bolt spacing, the module thickness and the weight.
[Attachment: image.png (156.14 KiB)]
[Attachment: image.png (37.35 KiB)]
For example, in this PDF the values would be: long edge = 2024, short edge = 1024, long boltA = 1424, long boltB = "", short bolt = 984, thickness = 40, weight = 20.3 kg.
For the next PDF:
[Attachment: image.png (101.95 KiB)]
Long edge = 2094, short edge = 1038, long boltA = 1400, long boltB = 1276, short bolt = 989.

Here is a GitHub link to the PDFs. Test your skills and help me with my problem. Thank you for your time.
https://github.com/piterat/solarPanelPdfs
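
The "need an error to pop up" requirement in 2.A can be sketched independently of how the values are read. The following is a minimal Python sketch, not an AutoHotkey solution. The field names, the `MissingDimensionError` class, and the choice of which fields are required are all assumptions for illustration; long boltB is treated as optional only because it is blank in the first example.

```python
# Field names are hypothetical; which ones are mandatory is an assumption.
REQUIRED = ("long_edge", "short_edge", "long_bolt_a", "short_bolt", "thickness", "weight")
OPTIONAL = ("long_bolt_b",)  # blank in the first example PDF


class MissingDimensionError(ValueError):
    """Raised when a required value could not be read from a datasheet."""


def validate_panel(values):
    """Return the values unchanged, or raise if a required field is missing or blank."""
    missing = [name for name in REQUIRED if values.get(name) in (None, "")]
    if missing:
        raise MissingDimensionError("missing: " + ", ".join(missing))
    return values


# Values copied from the first example in the post.
first_pdf = {
    "long_edge": 2024, "short_edge": 1024, "long_bolt_a": 1424,
    "long_bolt_b": "", "short_bolt": 984, "thickness": 40, "weight": 20.3,
}
validate_panel(first_pdf)  # passes: only the optional field is blank
```

Applied to the second example as posted, which lists no thickness or weight, this check would raise and name both fields.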

Re: OCR Challenge

Posted: 08 Jun 2021, 15:20
by kunkel321
This is well beyond my abilities... But it does occur to me that "RegExMatch()" might be useful. For example, if your OCR reliably returned
(some number) / (some number) Size of short side
then you could capture the numbers.
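
As a rough illustration of that idea, here is a Python sketch of the same kind of pattern one might pass to RegExMatch(). The label text "Size of short side" comes from the example above; the sample input line and the function name are made up, and real OCR output may well look different.

```python
import re

# Matches text shaped like "<number> / <number> Size of short side".
PAIR = re.compile(
    r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s+Size of short side",
    re.IGNORECASE,
)


def short_side_numbers(ocr_text):
    """Return the two captured numbers as floats, or None if the line is absent."""
    match = PAIR.search(ocr_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical OCR line, for illustration only.
print(short_side_numbers("1024 / 984 Size of short side"))
```

Returning None when nothing matches gives the caller a clear signal that the value is missing rather than silently wrong.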

Someone who is good at GDIP could maybe rotate the image by 90 degrees, for capturing the sideways parts.

Unfortunately, you said that the OCR part is not reliably recognizing your numbers. Unless you get that solved, none of the rest will matter.

...also your github thing is empty.

Re: OCR Challenge

Posted: 08 Jun 2021, 16:04
by tpitera
Fixed the GitHub link. That's why this is an OCR challenge :)

Re: OCR Challenge

Posted: 08 Jun 2021, 17:35
by kunkel321
It looks like brochures from three different brands. Is each brand at least consistent about where it puts the schematic (for example, will JA Solar always use the top left of the second page), or is that also inconsistent?

I'm curious what the numbers are(?) Dimensions in millimeters?

Regarding the accuracy of the OCR, I seem to recall that Tesseract allows you to set a whitelist of characters, which may or may not be helpful.
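
For reference, the whitelist is set through Tesseract's `tessedit_char_whitelist` config variable, passed on the command line with `-c`. Below is a small Python sketch that only builds the command; the function name, the default whitelist, and the choice of page segmentation mode are assumptions. As far as I know, the whitelist was not honored by the LSTM engine in Tesseract 4.0, so results may depend on the installed version.

```python
def tesseract_digits_command(image_path, whitelist="0123456789."):
    """Build a Tesseract CLI call that restricts recognition to the given characters."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", "6",  # page segmentation mode 6: assume a single uniform block of text
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


cmd = tesseract_digits_command("crop.png")  # "crop.png" is a placeholder file name
print(" ".join(cmd))
# To actually run it (requires Tesseract on PATH):
#   import subprocess
#   text = subprocess.run(cmd, capture_output=True, text=True).stdout
```

Restricting the character set to digits and a decimal point can stop letters such as "O" or "l" from being returned in place of "0" or "1", but it cannot fix a digit that is misread as another digit.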

Re: OCR Challenge

Posted: 08 Jun 2021, 20:02
by FanaticGuru
If all the PDFs are like the one I downloaded, I would not use OCR. The one I looked at was not a scanned picture. It was a data stream PDF, meaning you can copy and paste out of the document.

You can literally Control-A and Control-C to copy all the text out of the document.

Now, I am not sure how I would go about finding the data that I need when the layout varies, but I don't see why OCR would be needed. I mean, if you can figure out the specific areas to OCR, then you can copy and paste from those areas.

You might be able to just grab all the data and then sort through it from there if there is a pattern in the data.
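
A rough Python sketch of that "grab everything, then sort through it" approach follows. It assumes the text has already been copied out of the PDF into a string; the sample string is invented (reusing the numbers from the first example), and the function and key names are hypothetical. It only collects candidates and does not decide which number is which dimension, which is the hard part.

```python
import re

# A number directly followed by "kg" is taken to be the weight.
WEIGHT = re.compile(r"(\d+(?:\.\d+)?)\s*kg", re.IGNORECASE)
# Whole integers of 2 to 4 digits that are not part of a decimal or a longer number.
INTEGER = re.compile(r"(?<![\d.])\d{2,4}(?![\d.])")


def scan_text(pdf_text):
    """Pull a weight and a descending list of candidate integers from copied text."""
    weight = WEIGHT.search(pdf_text)
    candidates = sorted({int(n) for n in INTEGER.findall(pdf_text)}, reverse=True)
    return {
        "weight_kg": float(weight.group(1)) if weight else None,
        "candidates": candidates,
    }


# Invented sample text, for illustration only.
sample = "Weight 20.3 kg\nDimensions 2024 x 1024 x 40 mm"
print(scan_text(sample))
```

On a real datasheet the candidate list would also pick up wattages, temperatures, and part numbers, so some per-brand rule would still be needed to pick the right values.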

After looking at a second PDF, this looks more like an artificial intelligence problem than an OCR problem. Even with my incredible visual processing brain, it takes a bit of studying of the PDFs to find the information you are looking for.

I assume you want to do this on at least a hundred PDFs, or it would not be worth the trouble of developing the code. I think it is unlikely that you could develop an AutoHotkey script that extracts this information automatically from a bunch of differently formatted PDFs, even with perfect OCR, which copy and paste already provides.

FG

Re: OCR Challenge

Posted: 19 Dec 2021, 20:16
by buliasz
To improve Tesseract OCR results for your task, you can use my Tesstrain GUI to train Tesseract languages so that it will return proper values. Here's the code (AHK v2) and a compiled executable: https://github.com/buliasz/tesstrain-windows-gui