Pulling Data from PDF - Regex Loop Issues

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
Monoxide3009
Posts: 65
Joined: 09 Apr 2018, 15:53

Pulling Data from PDF - Regex Loop Issues

26 Sep 2019, 10:42

Hello All,

First, forgive me, but I don't know the best way to ask my question, so we're going to go on a roller coaster ride together.

I am working on a project that is mostly done: pulling text data from a PDF. No problems with that in general. My current process is to copy the entire PDF, then pull the pertinent data with RegEx. Due to an NDA, I can't show official examples, but the concept is like a group of invoices in a single PDF. With this in mind, my process is to use a regex to find the invoice number for invoice 1 and store the match position in a variable; on the next loop iteration, the same regex searches from that position to find the next invoice number in order. Code example (this is within a loop):


	; First pass: start at the top. Later passes: start just past the previous match.
	START := (POSITION1 = "") ? 1 : POSITION1 + StrLen(BOOKING_NUMBER)
	; BOOKING_NUMBER receives the whole match, BOOKING_NUMBER1 the captured number.
	POSITION1 := RegExMatch(PDF_DATA,".*Booking Number: (.*)\RDesc:",BOOKING_NUMBER, START)
	if (POSITION1 = 0)
		break	; no more booking numbers
This is working all well and good. I am 100% open to ideas on better options than regex, if any exist (I'm self-taught, so if you have ideas, I need ALL the details you can manage =p).

My issue stems from a specific invoice format: it has headers with the usual data, but between them are tables with data I need to pull. I can make a loop to pull from the table, but how do I break that loop when it finishes that particular table? (The data on the next page will be in the same format, so the regex would still match it, but it's for a different order.)

VERY basic examples would be something like this-

Normal invoice - current process works fine with this:
[page 1]
Invoice number: variable1
Booking number: variable2
Material used: variable 3

[page 2]
Invoice number: variable1
Booking number: variable2
Material used: variable 3

New invoice - a table in between each data point. The subvariables would have the same invoice/booking number as the rest of that page:
[page 1]
Invoice number: variable1
Booking number: variable2
Material for item 1: subvariable1
Material for item 2: subvariable2
Material for item 3: subvariable3
Material for item 4: subvariable4

[page 2]
Invoice number: variable1
Booking number: variable2
Material for item 1: subvariable1
Material for item 2: subvariable2
Material for item 3: subvariable3
Material for item 4: subvariable4


I know how to pair the invoice/booking number to the subvariable, but what I don't know is this: if I have a loop to pull from the material tables, HOW do I break that loop? If I had some simple regex that pulls Material for item \d+: (.*), it would catch everything from page 1 and page 2. Some of the invoices do have a line stating "end of booking number xxxx" (but not all invoices have this, so I don't know if it would be the best option), though I don't know how to test whether THAT line comes before the next match in the loop.
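
A minimal sketch of one way to run that test in AutoHotkey v1.1, assuming every invoice starts with an "Invoice number:" line: find where the next header begins and stop collecting materials once a match falls past it. The sample text and all variable names (HeaderPos, NextHeaderPos, MatPos, and so on) are made up for illustration. The same position comparison should also work with the "end of booking number" line as the boundary on invoices that have one.

```autohotkey
; AutoHotkey v1.1 sketch. Sample text and variable names are hypothetical.
PDF_DATA =
(
Invoice number: A100
Booking number: B200
Material for item 1: steel
Material for item 2: copper
Invoice number: A101
Booking number: B201
Material for item 1: wood
)

Result := ""
; [^\r\n]* captures up to the end of the line, whatever the line-ending style.
HeaderPos := RegExMatch(PDF_DATA, "i)Invoice number: ([^\r\n]*)", Invoice)
while (HeaderPos > 0)
{
	; Where the next invoice begins; 0 means this is the last invoice.
	NextHeaderPos := RegExMatch(PDF_DATA, "i)Invoice number: ([^\r\n]*)", NextInvoice, HeaderPos + StrLen(Invoice))
	SearchPos := HeaderPos
	Loop
	{
		MatPos := RegExMatch(PDF_DATA, "i)Material for item \d+: ([^\r\n]*)", Material, SearchPos)
		; Stop when nothing is left, or when the match belongs to the next invoice.
		if (MatPos = 0 or (NextHeaderPos > 0 and MatPos > NextHeaderPos))
			break
		Result .= Invoice1 " -> " Material1 "`n"
		SearchPos := MatPos + StrLen(Material)
	}
	; Carry the next header forward as the current one.
	HeaderPos := NextHeaderPos
	Invoice := NextInvoice
	Invoice1 := NextInvoice1
}
MsgBox % Result
```

With this sample, the message box should pair steel and copper with A100 and wood with A101. The pattern uses [^\r\n]* rather than .* because, with AutoHotkey's default newline setting, a dot can match a bare linefeed and the capture would run across lines.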
Kobaltauge
Posts: 264
Joined: 09 Mar 2019, 01:52
Location: Germany

Re: Pulling Data from PDF - Regex Loop Issues

26 Sep 2019, 23:35

You are executing your regex on the complete PDF. You could split the PDF into separate pages first and then loop through them.
Monoxide3009
Posts: 65
Joined: 09 Apr 2018, 15:53

Re: Pulling Data from PDF - Regex Loop Issues

27 Sep 2019, 11:10

Kobaltauge wrote:
26 Sep 2019, 23:35
You are executing your regex on the complete PDF. You could split the PDF into separate pages first and then loop through them.
Unfortunately, I am building this to accommodate another department's process. I have no choice but to use what is provided, and splitting them up is not an option (I had already suggested it a week or two ago and was instantly shot down). Thank you for the suggestion though.

AHK_fan, I will look into your option and see if it makes sense or would work for my needs. Thanks.
Kobaltauge
Posts: 264
Joined: 09 Mar 2019, 01:52
Location: Germany

Re: Pulling Data from PDF - Regex Loop Issues

27 Sep 2019, 12:20

@Monoxide3009 sorry for not explaining myself well. I am suggesting that you split up the PDF only in order to scan it. You don't have to save it back; the original PDF can stay untouched.
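
A rough sketch of that in-memory split in AutoHotkey v1.1, assuming each invoice begins with an "Invoice number:" line and that the character Chr(1) never occurs in the copied text. The sample text and the variable names (Marked, Chunk, Booking, Material) are made up for illustration. Only the copied text is split; the PDF file itself is not touched.

```autohotkey
; AutoHotkey v1.1 sketch. Sample text and variable names are hypothetical.
PDF_DATA =
(
Invoice number: A100
Booking number: B200
Material for item 1: steel
Material for item 2: copper
Invoice number: A101
Booking number: B201
Material for item 1: wood
)

; Put a marker character in front of every invoice header, then split on it.
Marked := RegExReplace(PDF_DATA, "i)(?=Invoice number: )", Chr(1))
Result := ""
for Index, Chunk in StrSplit(Marked, Chr(1))
{
	; Skip any text before the first header, or a chunk without a booking number.
	if (!RegExMatch(Chunk, "i)Booking number: ([^\r\n]*)", Booking))
		continue
	; The material loop ends by itself because the chunk holds only one invoice.
	Pos := 1
	while (Pos := RegExMatch(Chunk, "i)Material for item \d+: ([^\r\n]*)", Material, Pos))
	{
		Result .= Booking1 " -> " Material1 "`n"
		Pos += StrLen(Material)
	}
}
MsgBox % Result
```

Because each chunk contains a single invoice, the material regex cannot run on into the next page, so no separate break test is needed.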
