extract text from PDF and save as text file
- JoeWinograd
- Posts: 2214
- Joined: 10 Feb 2014, 20:00
- Location: U.S. Central Time Zone
Re: extract text from PDF and save as text file
No worries...I wouldn't have posted if I had noticed that you deleted it. :)
- JoeWinograd
Re: extract text from PDF and save as text file
boiler wrote:
This is a bit misleading in how it's stated since it is apparently a desire, not a current state as the tense indicates
That's an excellent point! Maybe he means:
When the font changes in the source PDF file from the "standard" text font, I want PDFtoText NOT to copy the text to the output text file until the font changes back to the "standard" text font.
Btw, below is the output (in a code block) from Xpdf's PDFfonts utility run on the PDF file that crypter posted:
Code:
name                                           type              emb sub uni prob object ID
---------------------------------------------- ----------------- --- --- --- ---- ---------
KKBZFP+URWPalladioL-Roma                       Type 1            yes yes no          972  0
[none]                                         Type 3            yes no  no          988  0
LENTYM+URWPalladioL-Bold                       Type 1            yes yes no          994  0
GUPLTI+URWPalladioL-Ital                       Type 1            yes yes no          996  0
[none]                                         Type 3            yes no  no         1003  0
[none]                                         Type 3            yes no  no         1336  0
GIGFZE+CMR10                                   Type 1            yes yes no  X      1347  0
WETGPF+PazoMath-Italic                         Type 1            yes yes no  X      1376  0
SXAQZF+Helvetica                               Type 1C           yes yes no         1381  0
JNSQUE+Helvetica                               Type 1C           yes yes no         1455  0
[none]                                         Type 3            yes no  no         1484  0
OIWQLE+CMSY10                                  Type 1            yes yes no  X      1609  0
OSHEEZ+Helvetica                               Type 1C           yes yes no         1612  0
[none]                                         Type 3            yes no  no         1619  0
DEYKJT+CMEX10                                  Type 1            yes yes no  X      1683  0
LJYRQN+Helvetica                               Type 1C           yes yes no         1717  0
QIELJB+CMMI10                                  Type 1            yes yes no  X      1744  0
[none]                                         Type 3            yes no  no         1771  0
GECIDR+Helvetica                               Type 1C           yes yes no         1774  0
IKXQUG+PazoMath                                Type 1            yes yes no  X      1813  0
TZSTAC+Helvetica                               Type 1C           yes yes no         1831  0
KARYYS+Helvetica                               Type 1C           yes yes no         1874  0
JSCVSU+Helvetica                               Type 1C           yes yes no         1955  0
BUHHQA+Helvetica                               Type 1C           yes yes no         1986  0
YGLIRC+Helvetica                               Type 1C           yes yes no         2002  0
YGLIRC+Helvetica                               Type 1C           yes yes no         2007  0
CAJMJK+Helvetica                               Type 1C           yes yes no         2019  0
NBKYPT+Helvetica                               Type 1C           yes yes no         2111  0
ERLTCB+Helvetica                               Type 1C           yes yes no         2125  0
RQXSID+Helvetica                               Type 1C           yes yes no         2218  0
VTQKWR+Helvetica                               Type 1C           yes yes no         2224  0
[none]                                         Type 3            yes no  no         2386  0
MMTMND+Helvetica                               Type 1C           yes yes no         2436  0
KAMVGT+Helvetica                               Type 1C           yes yes no         2452  0
AHPELA+Helvetica                               Type 1C           yes yes no         2468  0
YWEKBM+Helvetica                               Type 1C           yes yes no         2504  0
IMSIFF+Helvetica                               Type 1C           yes yes no         2625  0
MUKVGX+Helvetica                               Type 1C           yes yes no         2654  0
[none]                                         Type 3            yes no  no         2799  0
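For anyone who wants to summarize a font listing like the one above, here is a minimal Python sketch (standard library only) that tallies the base font names. It assumes each data row starts with the font name followed by a type of the form "Type ..."; other font types that PDFfonts can report are not handled. It also relies on the PDF convention that a subset font name carries a six-letter tag and a "+" in front of the base name. The function name `base_font_counts` and the `SAMPLE` constant are hypothetical, made up for this illustration.

```python
import re
from collections import Counter

# A few rows copied from the PDFfonts listing above
# (columns: name, type, emb, sub, uni, prob, object ID).
SAMPLE = """\
name type emb sub uni prob object ID
KKBZFP+URWPalladioL-Roma Type 1 yes yes no 972 0
[none] Type 3 yes no no 988 0
GIGFZE+CMR10 Type 1 yes yes no X 1347 0
SXAQZF+Helvetica Type 1C yes yes no 1381 0
JNSQUE+Helvetica Type 1C yes yes no 1455 0
"""

# Font name, then a type such as "Type 1", "Type 1C", or "Type 3".
ROW = re.compile(r"^(?P<name>\S+)\s+(?P<type>Type \w+)\s")


def base_font_counts(pdffonts_output):
    """Count how many listed fonts share each base name."""
    counts = Counter()
    for line in pdffonts_output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, or a row this simple pattern does not cover
        # Drop the six-letter subset tag ("SXAQZF+") to get the base font name.
        base = re.sub(r"^[A-Z]{6}\+", "", match.group("name"))
        counts[base] += 1
    return counts


if __name__ == "__main__":
    for base, n in base_font_counts(SAMPLE).most_common():
        print(f"{base}: {n}")
```

Run on the full listing, this would collapse the many subset entries (for example, all the `+Helvetica` rows) into one line per base font, which makes it easier to see how few distinct typefaces the PDF actually uses.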
Re: extract text from PDF and save as text file
JoeWinograd wrote:
Maybe he means:
When the font changes in the source PDF file from the "standard" text font, I want PDFtoText NOT to copy the text to the output text file until the font changes back to the "standard" text font.
I agree with that based on reading the thread more carefully, but I was wrong before! :D
Re: extract text from PDF and save as text file
JoeWinograd wrote (30 Dec 2022, 13:12):
When I run PDFtoText on the file you posted (with the -layout option), I get the attached text file. As you can see, it has this text in it:
The Python interpreter is a program that reads and executes Python code. Depending
on your environment, you might start the interpreter by clicking on an icon, or by typing
python on a command line. When it starts, you should see output like this:
Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
So, I still don't understand the font issue. Is that NOT what you want in the output text file?
The text with this font should not be added to the resulting output text file:
Code:
Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
- JoeWinograd
Re: extract text from PDF and save as text file
crypter wrote:
the text with this font should not be added in the result output text file
PDFtoText does have various options that discard text from the output file, namely:
-nodiag
-marginl
-marginr
-margint
-marginb
But there is no option to discard text based on its font, and I'm not aware of any way to do this with PDFtoText. The only approach I can think of is to delete it from the PDF file before feeding it to PDFtoText, but I don't know how to do that in an automated way.
Regards, Joe
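To illustrate the discard options listed above, here is a small Python sketch that assembles a PDFtoText command line. It assumes the Xpdf executable is named `pdftotext` and is on the PATH, and that each margin option takes a size in points, as described in the Xpdf documentation. The helper names `build_pdftotext_command` and `extract_text`, and the file names, are hypothetical. Note that this only trims page margins and diagonal text; it does not filter by font.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, margins=None, no_diag=False, layout=True):
    """Build an argument list for Xpdf's pdftotext using the options named in the thread."""
    cmd = ["pdftotext"]  # assumption: the Xpdf executable is on PATH under this name
    if layout:
        cmd.append("-layout")
    if no_diag:
        cmd.append("-nodiag")
    # margins maps a side ("l", "r", "t", "b") to a size in points (assumed unit).
    for side, points in (margins or {}).items():
        if side not in ("l", "r", "t", "b"):
            raise ValueError(f"unknown margin side: {side!r}")
        cmd += [f"-margin{side}", str(points)]
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, **options):
    """Run pdftotext and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so no PDF file is needed here.
    print(build_pdftotext_command("book.pdf", "book.txt", margins={"t": 50, "b": 50}, no_diag=True))
```

Margins help only when the unwanted text sits in a fixed band of the page, such as running headers or footers. Text interleaved with the body, like the interpreter output discussed here, would still come through.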
Re: extract text from PDF and save as text file
what happens if i make it a word document? will it work better? not sure if i can
Re: extract text from PDF and save as text file
Hello, crypter.
Adobe Acrobat and other programs convert PDF to Word files. I like ABBYY FineReader PDF. There are also online converters.
(I think with how long this whole process is taking, and the problems that arise along the way, I could have done it manually by now.) :lol:
- JoeWinograd
Re: extract text from PDF and save as text file
crypter wrote:
what happens if i make it a word document? will it work better?
Well, Word can search for and replace text using font as a criterion (in the Advanced Find dialog), so it may be possible, but I'm wondering what you're really trying to achieve here. Depending on your end game, there may be an entirely different...and better...approach.
crypter wrote:
not sure if i can
Lots of software can do this. Attached is one page converted by Adobe Acrobat Pro DC (posted here under "Fair Use"). Interesting to note that it made the "standard" text with the Book Antiqua font, and the text you want to discard with the SimSun font...that's good for possible mass elimination.
Next attachment is the same page converted by ABBYY FineReader 15. It made the "standard" text with the Times New Roman font, and the text you want to discard with the Courier New font...also good for possible mass elimination.
Last attachment is the same page converted by Word 365. It made ALL the text with the Calibri font...that's not good for possible mass elimination.
Regards, Joe
Re: extract text from PDF and save as text file
JoeWinograd wrote:
Well, Word can search for and replace text using font as a criterion (in the Advanced Find dialog), so it may be possible, but I'm wondering what you're really trying to achieve here. Depending on your end game, there may be an entirely different...and better...approach.
I have been working on a script using that approach. I'm currently stuck on the last part, where it tries to determine whether the current selection is at the end of the document or not. It correctly skips any selections of a different font than the initial font, but I haven't yet figured out the expression to put after the until. I'm sure there is a property or something that will do it. Still looking. Maybe someone can identify that.
Code:
#Requires AutoHotkey v1.1
oWord := ComObjCreate("Word.Application")
oWord.Visible := True
oDoc := oWord.Documents.Open(A_ScriptDir "\my test.docx") ; keep a reference so the document can be closed later
oWord.Selection.SelectCurrentFont ; extend the selection forward until the font or font size changes
CopyText := oWord.Selection.Text
OriginalFont := oWord.Selection.Font.Name
loop {
    oWord.Selection.Move(1, 1) ; 1 = wdCharacter: collapse the selection and step one character forward
    oWord.Selection.SelectCurrentFont
    if (oWord.Selection.Font.Name = OriginalFont)
        CopyText .= oWord.Selection.Text
    MsgBox, % CopyText ; debugging: show the text collected so far
} until StrLen(oWord.Selection.Text) = 0 ; placeholder end-of-document test (the part still to be worked out)
gosub ExitScript
return

Esc::
ExitScript:
oDoc.Close(0) ; 0 = wdDoNotSaveChanges
oWord.Quit
ExitApp
return
Re: extract text from PDF and save as text file
Got it. This works for me:
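Code:
#Requires AutoHotkey v1.1
oWord := ComObjCreate("Word.Application")
oWord.Visible := True ; can change to False to not see the document open
oDoc := oWord.Documents.Open(A_ScriptDir "\my test.docx") ; keep a reference so the document can be closed later
oWord.Selection.SelectCurrentFont ; extend the selection forward until the font or font size changes
CopyText := oWord.Selection.Text
OriginalFont := oWord.Selection.Font.Name
loop {
    oWord.Selection.Move(1, 1) ; 1 = wdCharacter: collapse the selection and step one character forward
    oWord.Selection.SelectCurrentFont
    if (oWord.Selection.Font.Name = OriginalFont) ; keep only text in the original font
        CopyText .= oWord.Selection.Text
} until oWord.Selection.Bookmarks.Exists("\EndOfDoc") ; stop once the selection reaches the end of the document
FileAppend, % CopyText, copied text.txt
gosub ExitScript
return

; Esc aborts early and falls through to the same cleanup
Esc::
ExitScript:
oDoc.Close(0) ; 0 = wdDoNotSaveChanges
oWord.Quit
ExitApp
return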
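For readers without Word installed, the same idea (keep only text in one font) can be sketched in Python with the standard library, since a .docx file is a ZIP archive containing `word/document.xml`. This is only a rough sketch under a strong assumption: it looks at the font set directly on each run (`w:rFonts`, attribute `w:ascii`) and ignores fonts inherited from styles, which a real converted document may well use. The function names and the tiny in-memory sample are hypothetical; the font names are the ones Joe reported for the Acrobat conversion.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def run_font(run):
    """Return the font named directly on a run (w:rFonts/@w:ascii), or None."""
    fonts = run.find("w:rPr/w:rFonts", NS)
    return None if fonts is None else fonts.get(f"{{{W}}}ascii")


def text_in_font(docx_file, keep_font):
    """Collect text, one line per paragraph, from runs whose direct font is keep_font."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter(f"{{{W}}}p"):
        parts = []
        for run in para.iter(f"{{{W}}}r"):
            if run_font(run) == keep_font:
                parts.extend(t.text or "" for t in run.findall("w:t", NS))
        if parts:
            lines.append("".join(parts))
    return "\n".join(lines)


def make_sample_docx():
    """Build a stripped-down archive holding only word/document.xml (not a complete .docx)."""
    def para(font, text):
        return ('<w:p><w:r><w:rPr><w:rFonts w:ascii="' + font + '"/></w:rPr>'
                '<w:t>' + text + '</w:t></w:r></w:p>')

    xml = ('<w:document xmlns:w="' + W + '"><w:body>'
           + para("Book Antiqua", "Prose text.")
           + para("SimSun", "&gt;&gt;&gt; print(1)")
           + para("Book Antiqua", "More prose.")
           + '</w:body></w:document>')
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(text_in_font(make_sample_docx(), "Book Antiqua"))
```

To try it on a real file, pass the path of the converted .docx instead of the sample buffer. If the result comes back empty, the converter probably applied the font through a style rather than on the runs themselves, and the Word automation approach above is the safer route.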
- JoeWinograd
Re: extract text from PDF and save as text file
Nice approach! The last part sounds like a job for @flyingDman. :)
- JoeWinograd
Re: extract text from PDF and save as text file
Our messages crossed...glad you got it!
Re: extract text from PDF and save as text file
Actually, I also was thinking he would drop in and show the way. He may still show a better approach.
Re: extract text from PDF and save as text file
Hi, JoeWinograd.
Ah! Now I think I understand what you want to do. You want to translate that book, excluding whatever is Python code?
In Word, you can select all text that has a given style; see this page: https://keystrokelearning.com.au/selecting-all-text-with-similar-formatting/. It works on the "python convert pdf to docx by AcrobatDC.docx" version. Select that text and delete it, and you are left with only the text you asked for.
- JoeWinograd
Re: extract text from PDF and save as text file
wetware05 wrote:
hi, JoeWinograd.
Ah! Now I think I understand what you want to do.
Not I...the OP is @crypter...I'm just trying to help. I don't know what crypter's goal is...I asked in an earlier post but haven't heard back.
Regards, Joe
Re: extract text from PDF and save as text file
JoeWinograd wrote (31 Dec 2022, 12:19):
wetware05 wrote:
hi, JoeWinograd.
Ah! Now I think I understand what you want to do.
Not I...the OP is @crypter...I'm just trying to help. I don't know what crypter's goal is...I asked in an earlier post but haven't heard back. Regards, Joe
Sorry, JoeWinograd. Happy new year!