I am having a lot of trouble extracting the data I need from PDF documents.
Problems:
1. OCR does not always work: it either can't read the data or returns the wrong number.
2. Each PDF that I need to obtain data from has a different layout, or puts the dimensions in a different area.
2.A. Sometimes one of the dimensions is missing; I need an error to pop up when that happens.
I need to obtain the dimensions of the panel, the bolt spacing, the module thickness and the weight.
For example, in this PDF the values would be: Long edge = 2024, short edge = 1024, long boltA = 1424, long BoltB = "", short bolt = 984, thickness = 40, weight = 20.3 kg
For the next PDF:
Long edge = 2094, short edge = 1038, long boltA = 1400, long BoltB = 1276, short bolt = 989
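The requirement above (a fixed set of fields, plus an error when a dimension is missing) could be sketched roughly as follows. This is only an illustration in Python: the `PanelSpec` class, the `MissingDimensionError` name, the field names, and the choice of which fields count as required are all assumptions, not something stated in the post. `long_bolt_b` is treated as optional here because the first example leaves it empty.

```python
from dataclasses import dataclass, fields
from typing import Optional


class MissingDimensionError(ValueError):
    """Raised when a required value could not be read from a datasheet."""


@dataclass
class PanelSpec:
    # Hypothetical field names; the units of the lengths are not stated in the post.
    long_edge: Optional[float] = None
    short_edge: Optional[float] = None
    long_bolt_a: Optional[float] = None
    long_bolt_b: Optional[float] = None  # blank in the first example
    short_bolt: Optional[float] = None
    thickness: Optional[float] = None
    weight_kg: Optional[float] = None

    # Assumption: every field except long_bolt_b is required.
    # (No type annotation, so the dataclass does not treat this as a field.)
    OPTIONAL = ("long_bolt_b",)

    def validate(self) -> None:
        """Raise MissingDimensionError naming every required field that is None."""
        missing = [
            f.name
            for f in fields(self)
            if f.name not in self.OPTIONAL and getattr(self, f.name) is None
        ]
        if missing:
            raise MissingDimensionError("missing: " + ", ".join(missing))


# Values from the first example: passes validation.
first = PanelSpec(2024, 1024, 1424, None, 984, 40, 20.3)
first.validate()

# Values from the second example, which lists no thickness or weight.
second = PanelSpec(2094, 1038, 1400, 1276, 989)
try:
    second.validate()
except MissingDimensionError as err:
    print(err)
```

Under these assumptions the second example would be reported as incomplete, which is the "error pops up" behaviour asked for; a real script would need to decide which fields are truly mandatory.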
Here is a GitHub link to the PDFs. Test your skills and help me with my problem. Thank you for your time.
https://github.com/piterat/solarPanelPdfs
OCR Challenge
Re: OCR Challenge
This is well beyond my abilities... But it does occur to me that "RegExMatch()" might be useful. For example, if your OCR reliably returned
(some number) / (some number) Size of short side
then you could capture the numbers.
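As a rough illustration of that idea, here is a sketch in Python, where `re.search` plays the role of AutoHotkey's `RegExMatch()`. The function name `capture_short_side` and the sample line are made up for the example; it assumes the OCR output really does contain two numbers separated by a slash in front of the label, which real output may not.

```python
import re
from typing import Optional, Tuple

# Two numbers (integer or decimal) separated by "/" and followed by the label.
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s+Size of short side",
    re.IGNORECASE,
)


def capture_short_side(ocr_text: str) -> Optional[Tuple[float, float]]:
    """Return the two numbers in front of the label, or None if not found."""
    match = PATTERN.search(ocr_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical OCR output line.
print(capture_short_side("1024 / 984 Size of short side"))
```

Returning `None` when the label is not found gives the caller a place to raise the "missing dimension" error.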
Someone who is good at GDIP could maybe rotate the image by 90 degrees, for capturing the sideways parts.
Unfortunately, you said that the OCR part is not reliably recognizing your numbers. Unless you get that solved, none of the rest will matter.
...also your github thing is empty.
ste(phen|ve) kunkel
Re: OCR Challenge
Fixed the GitHub link. That's why this is an OCR challenge.
Re: OCR Challenge
These look like brochures from three different brands. Is each brand at least consistent about where it places the schematic (for example, will JA Solar always use the top left of the second page), or is that also inconsistent?
I'm curious what the numbers are. Dimensions in millimeters?
Regarding the accuracy of the OCR, I seem to recall that Tesseract allows you to specify a whitelist of characters, which may or may not be helpful.
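For reference, a minimal sketch of the whitelist idea using the Tesseract command line from Python. The `-c` option and the `tessedit_char_whitelist` variable are part of Tesseract, but the helper names, the file name `crop.png`, and the page segmentation mode are assumptions for illustration. As far as I know, some Tesseract versions ignore the whitelist with the LSTM engine, so this would need testing on the actual install.

```python
import subprocess
from typing import List


def tesseract_args(image_path: str, whitelist: str = "0123456789.") -> List[str]:
    """Build a Tesseract command line that limits output to `whitelist`."""
    return [
        "tesseract", image_path, "stdout",  # "stdout" prints the text instead of writing a file
        "--psm", "6",                        # assumption: the crop is one uniform block of text
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def ocr_digits(image_path: str) -> str:
    """Run Tesseract (must be installed and on PATH) and return the recognized text."""
    result = subprocess.run(
        tesseract_args(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Only shows the command; call ocr_digits() on a real image crop to run it.
print(" ".join(tesseract_args("crop.png")))
```

Restricting the output to digits and a decimal point can stop a `0` being read as `O`, but it cannot fix a digit that is misread as another digit.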
ste(phen|ve) kunkel
- FanaticGuru
Re: OCR Challenge
tpitera wrote: ↑08 Jun 2021, 14:28
I am having a lot of trouble extracting the data I need from PDF documents.
Problems:
1. OCR does not always work: it either can't read the data or returns the wrong number.
2. Each PDF that I need to obtain data from has a different layout, or puts the dimensions in a different area.
2.A. Sometimes one of the dimensions is missing; I need an error to pop up when that happens.
I need to obtain the dimensions of the panel, the bolt spacing, the module thickness and the weight.
For example, in this PDF the values would be: Long edge = 2024, short edge = 1024, long boltA = 1424, long BoltB = "", short bolt = 984, thickness = 40, weight = 20.3 kg
For the next PDF:
Long edge = 2094, short edge = 1038, long boltA = 1400, long BoltB = 1276, short bolt = 989
Here is a GitHub link to the PDFs. Test your skills and help me with my problem. Thank you for your time.
https://github.com/piterat/solarPanelPdfs
If all the PDFs are like the one I downloaded, I would not use OCR. The one I looked at was not a scanned picture. It was a data stream PDF. Meaning you can copy and paste out of the document.
You can literally Control-A and Control-C to copy all the text out of the document.
Now I am not sure how I would go about finding the data that I need when the layout varies but I don't see why OCR would be needed. I mean if you can figure out the specific areas to OCR then you can copy and paste that area.
You might be able to just grab all the data and then sort through it from there if there is a pattern in the data.
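One way to "grab all the data and sort through it" is sketched below in Python. It assumes the text has already been copied out of the PDF, and that it contains a size line such as `2024 x 1024 x 40 mm` and a weight followed by `kg`. Both patterns, the sample text, and the function name `parse_copied_text` are hypothetical; each brand's wording would probably need its own pattern, and the bolt spacings on the drawing are not covered at all.

```python
import re
from typing import Dict, Optional

NUM = r"(\d+(?:\.\d+)?)"
# Three numbers separated by x, X, × or *.
SIZE_RE = re.compile(NUM + r"\s*[x×*]\s*" + NUM + r"\s*[x×*]\s*" + NUM, re.IGNORECASE)
# A number followed by "kg".
WEIGHT_RE = re.compile(NUM + r"\s*kg\b", re.IGNORECASE)


def parse_copied_text(text: str) -> Dict[str, Optional[float]]:
    """Pull size and weight out of copied text; fields not found stay None."""
    result: Dict[str, Optional[float]] = {
        "long_edge": None, "short_edge": None, "thickness": None, "weight_kg": None,
    }
    size = SIZE_RE.search(text)
    if size:
        # Assumption: thickness < short edge < long edge, so sorting makes the
        # result independent of the order a given brand prints them in.
        thickness, short_edge, long_edge = sorted(float(g) for g in size.groups())
        result["long_edge"] = long_edge
        result["short_edge"] = short_edge
        result["thickness"] = thickness
    weight = WEIGHT_RE.search(text)
    if weight:
        result["weight_kg"] = float(weight.group(1))
    return result


# Hypothetical text, using the values from the first example in the question.
sample = "Dimensions: 2024 x 1024 x 40 mm\nWeight: 20.3 kg"
print(parse_copied_text(sample))
```

Any field that comes back as `None` is a candidate for the "missing dimension" error the question asks for.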
After looking at a second PDF, this looks more like an artificial intelligence problem than an OCR problem. Even with my incredible visual processing brain, it takes a bit of studying of the PDFs to find the information you are looking for.
I assume you want to do this on at least a hundred PDFs, or it would not be worth the trouble of developing the code. I think it is unlikely that anyone could develop an AutoHotkey script that extracts this information automatically from a bunch of differently formatted PDFs, even with perfect OCR, which copy and paste already provides.
FG
Hotkey Help - Help Dialog for Currently Running AHK Scripts
AHK Startup - Consolidate Multiply AHK Scripts with one Tray Icon
Hotstring Manager - Create and Manage Hotstrings
[Class] WinHook - Create Window Shell Hooks and Window Event Hooks
Re: OCR Challenge
To improve Tesseract OCR results for your task, you can use my Tesstrain GUI to train Tesseract languages so that it will return proper values. Here's the code (AHK v2) and a compiled executable: https://github.com/buliasz/tesstrain-windows-gui