extract text from PDF and save as text file


Post by crypter » 17 Dec 2022, 18:24

I need to create a program in AHK that takes one text file as input.

Code:

Introduction
Chapter 1: Learning About the computer brain
The Origins of the computer brain
Why Use the computer brain?
Chapter 2: The Benefits and Negatives of the computer brain
The Benefits of the computer brain
The Negatives of the computer brain
Chapter 3: Common Terms You Should Know with the computer brain
Chapter 4: Getting Started with the computer brain
Text Editor
Getting IDLE
Chapter 5: Learning the Basics of the computer brain Programming
Chapter 6: A Bit More on Comments
Chapter 7: Variables and What They Do in the computer brain
Conclusion

The output should be 15 text files, each containing the text that lies between consecutive lines of the input file.

From "Introduction" to "Chapter 1: Learning About the computer brain" is one text file containing the text in between.
From "Chapter 1: Learning About the computer brain" to "The Origins of the computer brain" is another text file, the second one.

It must remove all line breaks.

Post by mikeyww » 17 Dec 2022, 18:32

Hello,

I don't know that I can provide what you need, but I have suggestions for your post, to aid responders with additional details that may be helpful.

1. A text file is not a PDF file. Please clarify.
2. Attach your input file in a new reply below.
3. Attach or post one example of exactly what the output should be.

Post by crypter » 17 Dec 2022, 18:45

Thanks for the advice. I hope you can create it.

Attachments: 00001.txt (826 Bytes), source.txt (574 Bytes). The attached PDF can no longer be displayed.

Post by mikeyww » 17 Dec 2022, 20:19

Code:

dir          = %A_ScriptDir%
source       = %dir%\ahk prog source file.pdf
chapterFile  = %dir%\source.txt                    ; List of headings, one per line
app          = d:\utils\Xpdf\bin64\pdftotext.exe   ; http://www.xpdfreader.com/
FileRead, chapters, %chapterFile%
RunWait, %app% "%source%",, Hide                   ; Extract text from PDF file
Loop, Read, % RegExReplace(source, "\.pdf$", ".txt") ; Read the text file created next to the PDF
{ If !StrLen(Trim(A_LoopReadLine))                 ; Skip blank lines
   Continue
  If Instr("`n" chapters "`r", "`n" A_LoopReadLine "`r") ; The whole line matches a heading
   out := dir "\" RegExReplace(A_LoopReadLine, "[/\\?%*:|""<>.,;=]") ".txt" ; Start a new output file named after the heading
  Else If out
   FileAppend, %A_LoopReadLine%%A_Space%, %out%    ; Append the line plus a space instead of a line break
}
MsgBox, 64, Done, Done!

Post by crypter » 18 Dec 2022, 14:00

What is this line?

Code:

app          = d:\utils\Xpdf\bin64\pdftotext.exe ; http://www.xpdfreader.com/

[Screenshot: Screenshot 2022-12-18 135949.png]

Post by gregster » 18 Dec 2022, 14:02

AutoIt? :shifty:

Post by wetware05 » 18 Dec 2022, 14:17

crypter, I imagine that mikeyww is suggesting that you download the pdftotext.exe utility, which is free and does the work of extracting the text. He gives you the address to download it from: http://www.xpdfreader.com/.

On the other hand, as gregster says, why do you get a message from AutoIt? This program, similar to AutoHotkey, has its own language.

Post by mikeyww » 18 Dec 2022, 14:28

Correct!

And: it's AutoHotkey that we are using. Try AHK as the file extension.

Post by JoeWinograd » 18 Dec 2022, 15:57

If you'd like to learn more about the PDFtoText tool, my five-minute video Micro Tutorials should be helpful. The first one is an introduction about all nine of the Xpdf utilities:
Xpdf - Command Line Utilities for PDF Files

Note that the link in my video (done eight years ago) to the Xpdf website (http://www.foolabs.com/xpdf/) now redirects to its new location (http://www.xpdfreader.com/).

Here's my video that is specific to PDFtoText:
Xpdf - PDFtoText - Command Line Utility to Convert PDF Files to Plain Text Files

And here's my video that discusses the Xpdf configuration file, which is used by all nine of the Xpdf tools:
xpdfrc - Configuration File for All Xpdf Utilities

In case anyone is interested in the other Xpdf tools, here are links to my five-minute video Micro Tutorials on them:
Xpdf - PDFimages - Command Line Utility to Extract Images from PDF Files
Xpdf - PDFinfo - Command Line Utility to Retrieve Page Count and Other Information from PDF Files
Xpdf - PDFdetach - Command Line Utility to Detach Attachments from PDF Files
Xpdf - PDFtoPNG - Command Line Utility to Convert a Multi-page PDF File into Separate PNG Files
Xpdf - PDFfonts - Command Line Utility to List Fonts Used in a PDF File
Xpdf - PDFtoHTML - Command Line Utility to Convert a PDF File to HTML
Xpdf - PDFtoPPM - Command Line Utility to Convert a PDF File to PPM, PGM, PBM
Xpdf - PDFtoPS - Command Line Utility to Convert a PDF File to PS (PostScript)

I've used all of these over the years in many AHK scripts, by far the most frequent being PDFtoText. It works very well!

A tip for you: experiment on your particular PDF files with the output format option. Choices are:

-layout
-simple
-table
-lineprinter
-raw
<null>

I find that -layout usually works best (sometimes -raw or <null>), but it depends on the PDFs and what my script is trying to achieve. For example, suppose you're looking for the text Home Address and want your script to get the text after that. The exact location of that text depends on the internal structure of the PDF and which PDFtoText output option you use. You'll likely have to play with it to determine the best way for your script to gather the desired info from the PDFs.

Regards, Joe
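
One way to run that experiment is to let a script produce one text file per output option so the results can be compared side by side. This is only a sketch in AHK v1: the pdftotext path is the one from the script earlier in this thread, the PDF name is a hypothetical placeholder, and the `<null>` choice is represented by an empty entry, meaning no format option is passed at all.

```autohotkey
; Sketch only: both paths are assumptions, adjust them to your system.
app    := "d:\utils\Xpdf\bin64\pdftotext.exe"     ; http://www.xpdfreader.com/
source := A_ScriptDir "\python.pdf"               ; hypothetical input PDF

; An empty entry stands for <null>, i.e. no format option at all.
options := ["-layout", "-simple", "-table", "-lineprinter", "-raw", ""]

For index, opt in options
{
    ; Name each output after its option, e.g. python-layout.txt or python-default.txt
    suffix := (opt = "") ? "-default" : opt
    target := RegExReplace(source, "\.pdf$", suffix ".txt")
    ; pdftotext accepts the output text file as a second file argument
    RunWait, "%app%" %opt% "%source%" "%target%",, Hide
}
MsgBox, 64, Done, Compare the generated text files.
```

Opening the generated files next to each other should make it easier to see which option keeps the text you are after in a predictable position.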

Post by crypter » 29 Dec 2022, 18:01

I have it working; however, I need to filter some results.

For example, I want to ignore a particular font type so that it is not copied. It's a prompt or source code, and I don't need text in this font to be copied. The script should ignore this font.
[Screenshot: Screenshot_1.png]

Post by mikeyww » 29 Dec 2022, 19:03

Some ways to identify or work with text could be InStr, RegExMatch, RegExReplace, and SubStr. You can have a look to see whatever best fits your need.
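
As a rough illustration of the RegExMatch route, the sketch below skips lines that match a text pattern while joining the rest with spaces. The pattern (lines starting with `>>>` or `...`, as in a Python interpreter session) and the function name `IsPromptLine` are hypothetical choices for this example. It tests the text of each line only; it does not look at fonts.

```autohotkey
; Sketch only: the pattern is a hypothetical example, not a font test.
; Returns true for lines that look like Python interpreter prompts.
IsPromptLine(line)
{
    Return RegExMatch(line, "^\s*(>>>|\.\.\.)(\s|$)") > 0
}

kept   := ""
sample := "The interpreter prints a prompt.`n>>> print('Hello')`nHello`n... more"
Loop, Parse, sample, `n, `r
{
    If IsPromptLine(A_LoopField)   ; Drop lines that match the pattern
        Continue
    kept .= A_LoopField " "        ; Join the kept lines with spaces, as the script above does
}
MsgBox, % kept
```

In the script above, the same check could be placed next to the blank-line test inside the file-reading loop, using A_LoopReadLine instead of A_LoopField.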

Post by crypter » 30 Dec 2022, 06:27

I choose RegExMatch().

Can you write an example?

Code:

dir          = %A_ScriptDir%
source       = %dir%\python.pdf
chapterFile  = %dir%\source.txt
app          = C:\Users\user\Downloads\xpdf-tools-win-4.04\xpdf-tools-win-4.04\bin64\pdftotext.exe ; http://www.xpdfreader.com/
FileRead, chapters, %chapterFile%
RunWait, %app% "%source%",, Hide                 ; Extract text from PDF file
Loop, Read, % RegExReplace(source, "\.pdf$", ".txt")
{ If !StrLen(Trim(A_LoopReadLine))
   Continue
  If Instr("`n" chapters "`r", "`n" A_LoopReadLine "`r")
   out := dir "\" RegExReplace(A_LoopReadLine, "[/\\?%*:|""<>.,;=]") ".txt"
  Else If out
   FileAppend, %A_LoopReadLine%%A_Space%, %out%
}
MsgBox, 64, Done, Done!

Post by mikeyww » 30 Dec 2022, 09:10

Can you provide a plain-language description of how you, the user, would identify the text that you would like to delete from your string? How do you describe the pattern there?

Post by crypter » 30 Dec 2022, 10:03

if FONT type different than most font type: DO NOT COPY to file

Post by mikeyww » 30 Dec 2022, 10:04

I see. I never would have guessed that! I am not able to code that, but someone else here probably knows how to do it.

Post by JoeWinograd » 30 Dec 2022, 12:26

crypter wrote: if FONT type different than most font type: DO NOT COPY to file

I don't understand what you're saying. When you run PDFtoText, it creates a plain text file (there's an option, -bom, to insert a Unicode BOM at the start of the output text file). I think you're saying that PDFtoText displays a dialog/prompt when it encounters an "unusual" font in the source PDF file. Is that right? Of course, there's no notion of a "font" in the plain text output file. I've been using PDFtoText in AHK scripts for more than 10 years and have never seen this.

Please post a sample PDF that exhibits the problem when you run PDFtoText on it. If I can reproduce the problem here, I'll try to help. Regards, Joe

Post by crypter » 30 Dec 2022, 12:47

When the font changes, it's not copied from the PDF to the text file. So the text in that font is not added to the text file; if it were added, it would be indistinguishable from the rest of the text file, where all text has the same format.

When the font type changes, it's not copied to the text file.
[Screenshot: Screenshot_1.png]

[The attached PDF can no longer be displayed.]


Post by JoeWinograd » 30 Dec 2022, 13:12

When I run PDFtoText on the file you posted (with the -layout option), I get the attached text file. As you can see, it has this text in it:

The Python interpreter is a program that reads and executes Python code. Depending
on your environment, you might start the interpreter by clicking on an icon, or by typing
python on a command line. When it starts, you should see output like this:

Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

So, I still don't understand the font issue. Is that NOT what you want in the output text file?
Attachment: python-layout.txt (485.92 KiB)

Post by JoeWinograd » 30 Dec 2022, 13:16

boiler wrote: I'm thinking that the reason it looks different is not simply a font change but an embedded image, so there is no actual text that would be output.
Nope. It is text, not an image...see my previous post (crypter posted an image for whatever reason). Regards, Joe

Post by boiler » 30 Dec 2022, 13:17

Yes, sorry. I read further up to understand the context. I tried to remove my post before it was read.

This is a bit misleading in how it's stated since it is apparently a desire, not a current state as the tense indicates:
crypter wrote: when the font changes it's not copied from the pdf to the text file.