extract text from PDF and save as text file

Post by **JoeWinograd** » 30 Dec 2022, 13:21

No worries...I wouldn't have posted if I noticed that you deleted it.

Post by **JoeWinograd** » 30 Dec 2022, 13:34

boiler wrote: This is a bit misleading in how it's stated, since it is apparently a desire, not a current state as the tense indicates

That's an excellent point! Maybe he means:

When the font changes in the source PDF file from the "standard" text font, I want PDFtoText NOT to copy the text to the output text file until the font changes back to the "standard" text font.

Btw, below is the output (in a code block) from Xpdf's PDFfonts utility run on the PDF file that crypter posted:

Code:

name                                           type              emb sub uni prob object ID
---------------------------------------------- ----------------- --- --- --- ---- ---------
KKBZFP+URWPalladioL-Roma                       Type 1            yes yes no          972  0
[none]                                         Type 3            yes no  no          988  0
LENTYM+URWPalladioL-Bold                       Type 1            yes yes no          994  0
GUPLTI+URWPalladioL-Ital                       Type 1            yes yes no          996  0
[none]                                         Type 3            yes no  no         1003  0
[none]                                         Type 3            yes no  no         1336  0
GIGFZE+CMR10                                   Type 1            yes yes no   X     1347  0
WETGPF+PazoMath-Italic                         Type 1            yes yes no   X     1376  0
SXAQZF+Helvetica                               Type 1C           yes yes no         1381  0
JNSQUE+Helvetica                               Type 1C           yes yes no         1455  0
[none]                                         Type 3            yes no  no         1484  0
OIWQLE+CMSY10                                  Type 1            yes yes no   X     1609  0
OSHEEZ+Helvetica                               Type 1C           yes yes no         1612  0
[none]                                         Type 3            yes no  no         1619  0
DEYKJT+CMEX10                                  Type 1            yes yes no   X     1683  0
LJYRQN+Helvetica                               Type 1C           yes yes no         1717  0
QIELJB+CMMI10                                  Type 1            yes yes no   X     1744  0
[none]                                         Type 3            yes no  no         1771  0
GECIDR+Helvetica                               Type 1C           yes yes no         1774  0
IKXQUG+PazoMath                                Type 1            yes yes no   X     1813  0
TZSTAC+Helvetica                               Type 1C           yes yes no         1831  0
KARYYS+Helvetica                               Type 1C           yes yes no         1874  0
JSCVSU+Helvetica                               Type 1C           yes yes no         1955  0
BUHHQA+Helvetica                               Type 1C           yes yes no         1986  0
YGLIRC+Helvetica                               Type 1C           yes yes no         2002  0
YGLIRC+Helvetica                               Type 1C           yes yes no         2007  0
CAJMJK+Helvetica                               Type 1C           yes yes no         2019  0
NBKYPT+Helvetica                               Type 1C           yes yes no         2111  0
ERLTCB+Helvetica                               Type 1C           yes yes no         2125  0
RQXSID+Helvetica                               Type 1C           yes yes no         2218  0
VTQKWR+Helvetica                               Type 1C           yes yes no         2224  0
[none]                                         Type 3            yes no  no         2386  0
MMTMND+Helvetica                               Type 1C           yes yes no         2436  0
KAMVGT+Helvetica                               Type 1C           yes yes no         2452  0
AHPELA+Helvetica                               Type 1C           yes yes no         2468  0
YWEKBM+Helvetica                               Type 1C           yes yes no         2504  0
IMSIFF+Helvetica                               Type 1C           yes yes no         2625  0
MUKVGX+Helvetica                               Type 1C           yes yes no         2654  0
[none]                                         Type 3            yes no  no         2799  0
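
For anyone who wants to tally a listing like this by font, here is a minimal AutoHotkey v1.1 sketch. It assumes the PDFfonts output has already been captured as text (the `Sample` variable holds just a few rows copied from the listing above), that the name column contains no spaces, and that subset tags are six uppercase letters followed by `+`. The variable names are illustrative only, not part of any tool.

```autohotkey
#Requires AutoHotkey v1.1
; Assumption: a few rows copied from the PDFfonts listing, used as sample input.
Sample =
(
name                                           type              emb sub uni prob object ID
---------------------------------------------- ----------------- --- --- --- ---- ---------
KKBZFP+URWPalladioL-Roma                       Type 1            yes yes no          972  0
[none]                                         Type 3            yes no  no          988  0
SXAQZF+Helvetica                               Type 1C           yes yes no         1381  0
JNSQUE+Helvetica                               Type 1C           yes yes no         1455  0
)

Counts := {}
Loop, Parse, Sample, `n, `r
{
	if (A_Index <= 2)	; skip the header row and the dashed separator
		continue
	; The name column ends at the first run of two or more spaces.
	if !RegExMatch(A_LoopField, "^(\S+)\s{2,}", m)
		continue
	; Drop the six-letter subset tag (e.g. "KKBZFP+") if present.
	Name := RegExReplace(m1, "^[A-Z]{6}\+")
	Counts[Name] := Counts.HasKey(Name) ? Counts[Name] + 1 : 1
}

; Build a "name: count" report, one font per line.
Report := ""
for Name, Count in Counts
	Report .= Name ": " Count "`n"
MsgBox, % Report
```

A tally like this may help spot which font name marks the body text and which marks the text to discard, although the Type 3 entries report no name at all, so it cannot distinguish between them.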

Post by **boiler** » 30 Dec 2022, 13:42

JoeWinograd wrote: Maybe he means:

When the font changes in the source PDF file from the "standard" text font, I want PDFtoText NOT to copy the text to the output text file until the font changes back to the "standard" text font.

I agree with that based on reading the thread more carefully, but I was wrong before!

Post by **crypter** » 30 Dec 2022, 16:19

JoeWinograd wrote (30 Dec 2022, 13:12):
When I run PDFtoText on the file you posted (with the -layout option), I get the attached text file. As you can see, it has this text in it:

The Python interpreter is a program that reads and executes Python code. Depending
on your environment, you might start the interpreter by clicking on an icon, or by typing
python on a command line. When it starts, you should see output like this:

Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

So, I still don't understand the font issue. Is that NOT what you want in the output text file?

The text in this font should not be added to the output text file:

Code:

Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Post by **JoeWinograd** » 30 Dec 2022, 16:46

crypter wrote: The text in this font should not be added to the output text file

PDFtoText does have various options that discard text from the output file, namely:

-nodiag
-marginl
-marginr
-margint
-marginb

But there is no option to discard text based on its font, and I'm not aware of any way to do this with PDFtoText. The only approach I can think of is to delete it from the PDF file before feeding it to PDFtoText, but I don't know how to do that in an automated way. Regards, Joe
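
A minimal sketch of calling PDFtoText from AutoHotkey v1.1 with some of these options. The file names and the 50-point top margin are hypothetical, and it assumes `pdftotext.exe` from Xpdf 4.x is on the PATH; as far as I understand the Xpdf documentation, the margin options take a size in points. None of this filters by font; it only discards text by position or orientation.

```autohotkey
#Requires AutoHotkey v1.1
q := Chr(34)                              ; a double-quote character
PdfToText := "pdftotext.exe"              ; assumption: Xpdf's pdftotext is on the PATH
InFile    := A_ScriptDir "\input.pdf"     ; hypothetical input file
OutFile   := A_ScriptDir "\output.txt"    ; hypothetical output file
TopMargin := 50                           ; hypothetical margin, in points

; -layout keeps the physical layout, -nodiag drops diagonal text,
; -margint drops text within TopMargin points of the top edge.
Cmd := q PdfToText q " -layout -nodiag -margint " TopMargin " " q InFile q " " q OutFile q

; UseErrorLevel reports a failed launch as "ERROR"; otherwise RunWait
; stores the program's exit code in ErrorLevel.
RunWait, %Cmd%,, Hide UseErrorLevel
if (ErrorLevel = "ERROR")
	MsgBox, Could not launch pdftotext.
else if (ErrorLevel != 0)
	MsgBox, % "pdftotext exited with code " ErrorLevel
else
	MsgBox, % "Text saved to " OutFile
```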

Post by **crypter** » 30 Dec 2022, 17:24

What happens if I convert it to a Word document? Will it work better? I'm not sure if I can.

Post by **wetware05** » 30 Dec 2022, 19:34

Hello, crypter.

Adobe Acrobat and other programs can convert PDF files to Word documents. I like ABBYY FineReader PDF. There are also online converters.

(I think with how long this whole process is taking—and the problems that arise along the way—I could have done it manually by now.)

Post by **JoeWinograd** » 30 Dec 2022, 21:32

crypter wrote: What happens if I convert it to a Word document? Will it work better?

Well, Word can search for and replace text using font as a criterion (in the Advanced Find dialog), so it may be possible, but I'm wondering what you're really trying to achieve here. Depending on your end game, there may be an entirely different...and better...approach.

crypter wrote: I'm not sure if I can.

Lots of software can do this. Attached is one page converted by Adobe Acrobat Pro DC (posted here under "Fair Use"). Interesting to note that it made the "standard" text with the Book Antiqua font, and the text you want to discard with the SimSun font...that's good for possible mass elimination.

Next attachment is the same page converted by ABBYY FineReader 15. It made the "standard" text with the Times New Roman font, and the text you want to discard with the Courier New font...also good for possible mass elimination.

Last attachment is the same page converted by Word 365. It made ALL the text with the Calibri font...that's not good for possible mass elimination.

Regards, Joe

Post by **boiler** » 30 Dec 2022, 21:40

JoeWinograd wrote: Well, Word can search for and replace text using font as a criterion (in the Advanced Find dialog), so it may be possible, but I'm wondering what you're really trying to achieve here. Depending on your end game, there may be an entirely different...and better...approach.

I have been working on a script using that approach. I'm currently stuck on the last part where it tries to determine whether the current selection is at the end of the document or not. It correctly skips any selections of a different font than the initial font, but I haven't yet figured out the expression to put after the until. I'm sure there is a property or something that will do it. Still looking. Maybe someone can identify that.

Code:

#Requires AutoHotkey v1.1
oWord := ComObjCreate("Word.Application")
oWord.Visible := True
oDoc := oWord.Documents.Open(A_ScriptDir "\my test.docx") ; keep a reference so ExitScript can close the document
oWord.Selection.SelectCurrentFont
CopyText := oWord.Selection.Text
OriginalFont := oWord.Selection.Font.Name

loop {
	oWord.Selection.Move(1, 1)
	oWord.Selection.SelectCurrentFont
	if (oWord.Selection.Font.Name = OriginalFont)
		CopyText .= oWord.Selection.Text
	MsgBox, % CopyText
} until StrLen(oWord.Selection.Text) = 0
gosub ExitScript
return

Esc::
ExitScript:
	oDoc.Close(0)
	oWord.Quit
	ExitApp
return

Post by **boiler** » 30 Dec 2022, 21:57

Got it. This works for me:

Code:

#Requires AutoHotkey v1.1
oWord := ComObjCreate("Word.Application")
oWord.Visible := True ; can change to False to not see the document open
oDoc := oWord.Documents.Open(A_ScriptDir "\my test.docx") ; keep a reference so ExitScript can close the document
oWord.Selection.SelectCurrentFont
CopyText := oWord.Selection.Text
OriginalFont := oWord.Selection.Font.Name

loop {
	oWord.Selection.Move(1, 1)
	oWord.Selection.SelectCurrentFont
	if (oWord.Selection.Font.Name = OriginalFont)
		CopyText .= oWord.Selection.Text
} until oWord.Selection.Bookmarks.Exists("\EndOfDoc")
FileAppend, % CopyText, copied text.txt
gosub ExitScript
return

Esc::
ExitScript:
	oDoc.Close(0)
	oWord.Quit
	ExitApp
return

Post by **JoeWinograd** » 30 Dec 2022, 21:59

Nice approach! The last part sounds like a job for @flyingDman.

Post by **JoeWinograd** » 30 Dec 2022, 22:05

Our messages crossed...glad you got it!

Post by **boiler** » 30 Dec 2022, 22:18

JoeWinograd wrote: Nice approach! The last part sounds like a job for @flyingDman.

Actually, I also was thinking he would drop in and show the way. He may still show a better approach.

Post by **wetware05** » 31 Dec 2022, 05:33

Hi, JoeWinograd.

Ah! Now I think I understand what you want to do. You want to translate that book, excluding whatever is Python code?

Word can select all text with similar formatting; see this page: https://keystrokelearning.com.au/selecting-all-text-with-similar-formatting/. It works on the file "python convert pdf to docx by AcrobatDC.docx". Select the text in the unwanted font and delete it, and you are left with only the text you asked for.

Post by **JoeWinograd** » 31 Dec 2022, 12:19

wetware05 wrote: Hi, JoeWinograd.
Ah! Now I think I understand what you want to do.

Not I...the OP is @crypter...I'm just trying to help. I don't know what crypter's goal is...I asked in an earlier post but haven't heard back. Regards, Joe

Post by **wetware05** » 31 Dec 2022, 13:40

JoeWinograd wrote (31 Dec 2022, 12:19):

wetware05 wrote: Hi, JoeWinograd.
Ah! Now I think I understand what you want to do.

Not I...the OP is @crypter...I'm just trying to help. I don't know what crypter's goal is...I asked in an earlier post but haven't heard back. Regards, Joe

Sorry, JoeWinograd. Happy new year!
