How to search for a string in a .pdf or .docx's contents?

JoeWinograd
Posts: 1203
Joined: 10 Feb 2014, 20:00

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 19:35

Exaskryz,

Just curious — have you tried Xpdf's PDFtoText that I mentioned here:
https://autohotkey.com/boards/viewtopic ... 5be#p80740

I tried to create an account (several times) at the JAMA site to get the PDF article, but it kept coming up with an "Unexpected Error". I tried to do a password reset — also didn't work. I tried to call the support number but they're not in. I'll send an email to the support folks, but for now, I have no way to try it myself. Regards, Joe
JoeWinograd
Posts: 1203
Joined: 10 Feb 2014, 20:00

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 19:39

Hi SifJar,
Our messages crossed. I just spent the last half-hour trying to create an account there (which is the only way to get the PDF). Tried numerous usernames and numerous email addresses. Nothing worked. Were you able to register there?
kon
Posts: 1756
Joined: 29 Sep 2013, 17:11

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 20:47

Exaskryz, did you try qwerty12's suggestion? It looks promising.

I had this lying around; even if you don't use it, future AHK users may find it useful. This is an example of using xpdf's pdftotext. It doesn't need to write xpdf's output to a file because it reads stdout (the results of the conversion are passed to the AHK script in memory).

Code: Select all

FileSelectFile, MyFile ; Select a pdf file
MsgBox, % PdfToText(MyFile)
return

PdfToText(PdfPath) {
    static XpdfPath := """" A_ScriptDir "\xpdfbin-win-3.04\bin32\pdftotext.exe"""
    objShell := ComObjCreate("WScript.Shell")
 
    ;--------- Building CmdString (look in the .txt docs included with xpdf):
    ; From the xpdf docs in [ScriptDir]\xpdfbin-win-3.04\doc\pdftotext.txt:
    ;   SYNOPSIS
    ;       pdftotext [options] [PDF-file [text-file]]
    ;   ...
    ;       If text-file is '-', the text is sent to stdout.
    ; Options (Example option. Look in the xpdf docs for more):
    ;   -nopgbrk    Don't insert page breaks (form feed characters)  between  pages.
    ;---------
    CmdString := XpdfPath " -nopgbrk """ PdfPath """ -"
    objExec := objShell.Exec(CmdString)
    while, !objExec.StdOut.AtEndOfStream ; Wait for the program to finish
        strStdOut := objExec.StdOut.ReadAll()
    return strStdOut
}
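
Since the thread is about searching rather than displaying, kon's function can be combined with InStr. This is a minimal sketch, assuming the PdfToText() function above is present in the same script. PdfContains is a hypothetical helper name, not part of xpdf or AutoHotkey.

```autohotkey
; Hypothetical helper (not from the original posts).
; Returns true if Needle occurs in the text that pdftotext extracts from PdfPath.
PdfContains(PdfPath, Needle) {
    Text := PdfToText(PdfPath)                 ; kon's function above
    Text := RegExReplace(Text, "\s+", " ")     ; collapse line breaks and runs of whitespace
    return InStr(Text, Needle) > 0             ; InStr is case-insensitive by default
}
```

A call such as PdfContains(MyFile, "something here") would then replace the MsgBox line in the example. Collapsing whitespace is a design choice so that a phrase wrapped across two lines can still match; the needle should use single spaces for the same reason.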
Masonjar13
Posts: 1459
Joined: 20 Jul 2014, 10:16
GitHub: Masonjar13
Location: Не Россия

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 22:04

I got the file and everything, but I'm not getting any crashes or any returned values. Using A32 of AHK, I get system error 2 (ERROR_FILE_NOT_FOUND). Using U32, there is no system error 2, but it throws a few other errors (it requires "Cdecl" in the DllCall and disallows "\ExtractText" in the LoadLibrary call). After fixing those errors, there's just no return value.

Didn't test qwerty12's code; not installing any software.

kon's code works, but at first it doesn't look like it. In the MsgBox most of the file gets cut off, which I assumed was a too-small buffer (of stdout, I assume). I don't remember MsgBox ever being self-constraining of text size (normally it will go quite far off the screen). However, sending the result to the clipboard and then pasting shows that the entire file is being read in.
OS: Windows 7 Ultimate | Editor: Notepad++
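
The check described above (the text is complete even though the MsgBox shows only part of it) can be sketched as follows. It assumes kon's PdfToText() and the MyFile variable from his example; nothing here is specific to xpdf.

```autohotkey
Text := PdfToText(MyFile)   ; kon's function; MyFile as selected in his example
Clipboard := Text           ; paste into an editor to inspect the complete output
MsgBox, % "Extracted " StrLen(Text) " characters. The full text is on the clipboard."
```

Reporting the length instead of the text itself avoids relying on how much of a long string MsgBox displays.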
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 22:35

(Edit: Issue may not be resolved. See post after qwerty's and Mason's posts below.)

Thanks SifJar. That was a site I found to see if it was freely available as a link I could share, but being on my school's network it auto-associated my university with the site. I wasn't sure if I could get the PDF because of my school's library or not; seems like it is freely available to the public.

So, what I thought would be important is to see if the file from that download link matches the one that is giving me issues. Turns out, it's different. I didn't get a crash with that file through the link SifJar provided. So I dug around just a bit to compare why things were different. Looking at the file properties, this is what I saw (minus the red censorship, underlining, and circling):

[Image: screenshot of the file properties dialog]

So, the file sizes were different. So they weren't the same file. But then I saw the little security tag. I hit the Unblock button, and tried my code again. This time it didn't crash.

It appears that having a "locked" PDF crashes it. I didn't get any alerts in Adobe Reader that there was any security problem with it - but maybe it remembers my choice from when I first opened the file to OK it and it wouldn't nag me again with a message.

So I may have a way to fix any other PDFs that give me problems. The issue though is still to identify what PDFs might give me that problem. I'd like to distribute my original file that gave issues, but I'm still not sure on the legality. Even if something is technically financially free, the account requirement to get the journal article is probably there for a reason. However, someone who knows enough about PDFs might be able to manually flag the security caution and then do some testing. (Checking the file properties again after I applied changes and exited the properties dialogue, the security section was gone.)

I haven't tried the other suggested methods yet, just because tmplinshi's method seemed the most promising as a one-size-fits-all. It's just lacking error handling that I can hook into, which ultimately may not be necessary, but would make diagnosing easier if I need it, because it still takes ~3 minutes to run through my 4500 files to parse them. All in all, I have fixed the one file; I just don't know when I might add other files to my Drive that would crash the script.

I very much appreciate everyone's contributions, and am sorry I hadn't taken the time to test each of them out. Hopefully though with all these options out there, future readers can take advantage of them.

And for the sake of completeness as Masonjar mentioned the ANSI vs Unicode, I'm on Unicode 32-bit 1.1.23.01. (I think it's 32 bit at least.) I didn't get any errors that he encountered.
Last edited by Exaskryz on 17 Apr 2016, 00:15, edited 2 times in total.
qwerty12
Posts: 468
Joined: 04 Mar 2016, 04:33
GitHub: qwerty12

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 23:46

I tried it with the PDF SifJar linked and I had to include bad workarounds, which my post now contains, but I was able to find the last sentence in the PDF with the PDF IFilter I had installed (SumatraPDF). That said, I don't know if it's because of (Sumatra)PDF, but spaces around words in some paragraphs of that PDF are missing when MsgBox shows the IFilter's output. So I can't say that it's super reliable, but it does work OK for simple searches and for searches like "sub-set of patients most likely to benefit from a specific drug".
Masonjar13 wrote:Didn't test qwerty12's code; not installing any software.
Unless you use your browser to read PDF files, you probably already have a PDF reader installed that comes with an IFilter plugin. Saying that, I don't think there is a PDF reader lighter than SumatraPDF these days - IFilter aside, I always recommend it to people after a PDF reader (unless they need to fill forms).
Exaskryz wrote: It appears that having a "locked" PDF crashes it. I didn't get any alerts in Adobe Reader that there was any security problem with it - but maybe it remembers my choice from when I first opened the file to OK it and it wouldn't nag me again with a message.

So I may have a way to fix any other PDFs that give me problems. The issue though is still to identify what PDFs might give me that problem.
Checking for blocked files can be done like this (the proper way probably lies in the IAttachmentExecute COM interface):

Code: Select all

FileRead, isBlocked, %pathToFile%:Zone.Identifier:$DATA
if (isBlocked)
	isBlocked := !InStr(isBlocked, "AppZoneId=4")
MsgBox % "File is " (isBlocked ? "blocked" : "not blocked")

; or, hell, just unblock the file from AHK right away:
; FileDelete, %pathToFile%:Zone.Identifier:$DATA
Masonjar13
Posts: 1459
Joined: 20 Jul 2014, 10:16
GitHub: Masonjar13
Location: Не Россия

Re: How to search for a string in a .pdf or .docx's contents?

17 Apr 2016, 00:04

Exaskryz wrote:(I think it's 32 bit at least.)
The dll is 32-bit and won't run through 64-bit programs, so you're definitely on 32-bit. I'm on 64-bit by default, so I was drag-and-dropping it over both A32 and U32 for testing.

@qwerty12, I do not, in fact, have a PDF viewer. I never work with PDF's, thus no software is installed pertaining to such. But if I ever need one, I'll keep that one in mind.
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

17 Apr 2016, 00:15

I have managed to break the file again. Well, I restored the broken file for the sake of testing qwerty's code with the Zone.Identifier data stuff. Turns out that if you use his FileDelete suggestion, it doesn't do enough to fix the file: after running that, it will still crash the script using the dll. The problem is that after I restored the broken file again (by downloading through my original source, different from SifJar's link) and repeated my same method for clearing the security concern in the file properties dialog, it was still crashing the script.

I may well have misattributed using the Unblock button as fixing the file, and something else I had done fixed it. But unfortunately I don't know what.
qwerty12
Posts: 468
Joined: 04 Mar 2016, 04:33
GitHub: qwerty12

Re: How to search for a string in a .pdf or .docx's contents?

17 Apr 2016, 01:18

@Masonjar13: Fair enough :-)

I'm pretty sure that the problem is that xd2txlib simply doesn't support extracting text from that file. I built the C++ example that came with it and even that shows nothing when pointed at ioi140052.pdf. It does work fine, however, with some other PDFs. As I see it, there are three other options (ordered from best to worst):
  • Use kon's PdfToText. This actually worked on that file and kept the formatting intact.
  • Use my IFilter code, which I can attest to also working (but not as well as xpdf). The problem with that is that its effectiveness can vary from system to system as it's dependent on the IFilter in use. The SumatraPDF IFilter doesn't break up some words properly and it can't join lines - with xpdf, a search for "subset" worked, but with the IFilter, I had to search for "sub-set". Although the hyphenation thing may not be a bug - the word appears as "sub-`r`nset" in the actual PDF when I open it in SumatraPDF.
  • Check for errors from xd2tpdf. Here's an ExtractText function that'll return false when xd2tpdf fails:

    Code: Select all

    ExtractText(ByRef result, fileName) {
    	static hModule := DllCall("LoadLibrary", "Str", "xd2txlib.dll", "Ptr")
    	if (!FileExist(fileName))
    		return false
    	if (fileLength := DllCall("xd2txlib\ExtractText", "Str", fileName, "Int", False, "Ptr*", fileText)) {
    		result := StrGet(fileText, fileLength / 2) ;, "UTF-16")
    		return true
    	}
    	return false
    }
Last edited by qwerty12 on 17 Apr 2016, 02:44, edited 1 time in total.
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

17 Apr 2016, 01:26

Unfortunately that third option doesn't seem to be viable, as the script still crashes instead of just returning false. When I find some free time, I'll run some tests to make sure files that should return as results are being returned to verify that this dll would be the simple method, where if I find it breaks in the future, I'll have to do manual diagnoses by editing the script. (Tooltip of the file path during the search to see where it crashes is how I identified this file before). But if it's not returning the results I'd expect, I'll explore kon's option. (And if that isn't quite working, I'll look into the IFilter, and finally into anything else that may have been suggested (I can't recall if Joe's was a different method or not).)
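
The "tooltip of the file path" diagnosis mentioned above can be made to survive a crash by also writing the current path to disk before each risky call. This is a rough sketch, not the script actually used in this thread: SearchDir, LogFile and Needle are placeholder values, and ExtractText() is assumed to be qwerty12's function from the previous post.

```autohotkey
; Placeholder values (assumptions for illustration only)
SearchDir := A_ScriptDir "\docs"
LogFile   := A_ScriptDir "\last_file.log"
Needle    := "something here"
Matches   := ""

Loop, Files, %SearchDir%\*.pdf, R   ; R = recurse into subfolders
{
    ; Record the path *before* the call that may crash the script,
    ; so the log still names the offending file afterwards.
    FileDelete, %LogFile%
    FileAppend, %A_LoopFileFullPath%`n, %LogFile%
    ToolTip, %A_LoopFileFullPath%
    if (ExtractText(result, A_LoopFileFullPath) && InStr(result, Needle))
        Matches .= A_LoopFileFullPath "`n"
}
ToolTip                      ; hide the tooltip
FileDelete, %LogFile%        ; a leftover log after a run means the script died mid-file
MsgBox, % (Matches = "") ? "No matches." : Matches
```

If the script dies, the log file still exists on the next start and contains the path being processed at the time.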
qwerty12
Posts: 468
Joined: 04 Mar 2016, 04:33
GitHub: qwerty12

Re: How to search for a string in a .pdf or .docx's contents?

17 Apr 2016, 02:37

Oh, not good. For me, ExtractText was causing crashes when the file didn't exist and when StrGet was attempted when the fileLength returned was 0. The only other thing I can think of would be to use a sacrificial instance of AutoHotkey dedicated to calling ExtractText and exiting with an error code indicating whether a match was found. It's slower, but if it crashes, it won't take down your main script. This isn't very sophisticated:

Code: Select all

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.
#SingleInstance Off

file := "\path\to\.pdf"
phraseToFind := "something here"

if (A_PtrSize != 4) {
	MsgBox Try this script with a 32-bit build of AutoHotkey.
	ExitApp
}
 
if (InStr(DllCall("GetCommandLine", "Str"), " /ExtractText ")) {
	file = %2%
	searchPattern = %3%
	if (file && searchPattern && ExtractText(result, file)) {
		if (InStr(result, searchPattern))
			ExitApp 0
	}
	ExitApp 1
}
 
RunWait, "%A_AhkPath%" "%A_ScriptFullPath%" /ExtractText "%file%" "%phraseToFind%",, UseErrorLevel
if (ErrorLevel == 0)
	MsgBox result found
	
ExtractText(ByRef result, fileName) {
	if (!FileExist(fileName))
		return false
	fileLength := DllCall(A_ScriptDir . "\xd2txlib.dll\ExtractText", "Str", fileName, "Int", False, "Int*", fileText)
	if (fileLength > 0) {
		result := StrGet(fileText, fileLength / 2)
		return result != ""
	}
	return false
}
Joe's method is good, I just tested kon's because it also uses xpdf, was nicely wrapped up and it wasn't writing a temporary file.
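
With the sacrificial-instance approach, the exit code can also be used to tell "no match" apart from "the helper died". The sketch below would replace the RunWait block in the script above and reuses its file and phraseToFind variables. It assumes that a crashed helper exits with some code other than 0 or 1, which is typical on Windows but not guaranteed.

```autohotkey
RunWait, "%A_AhkPath%" "%A_ScriptFullPath%" /ExtractText "%file%" "%phraseToFind%",, UseErrorLevel
if (ErrorLevel = "ERROR")        ; with UseErrorLevel, a failed launch sets ErrorLevel to ERROR
    MsgBox, Could not launch the helper instance.
else if (ErrorLevel = 0)         ; helper called ExitApp 0
    MsgBox, Result found.
else if (ErrorLevel = 1)         ; helper called ExitApp 1
    MsgBox, No match or extraction failed.
else                             ; anything else: assume the helper exited abnormally
    MsgBox, % "Helper exited with code " ErrorLevel ". It probably crashed on " file
```

In a loop over many files, the last branch is where a problem file could be logged and skipped instead of taking down the main script.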
blue83
Posts: 32
Joined: 11 Apr 2018, 06:38

Re: How to search for a string in a .pdf or .docx's contents?

21 Jan 2019, 15:24

Hi,

I have one Excel file that constantly crashes the script when it reaches the result := StrGet line in this function.

Other Excel files are read properly.

Does anyone know how to just bypass that file and get the error back in a variable?
Guest078
Posts: 20
Joined: 28 Aug 2016, 13:44

Re: How to search for a string in a .pdf or .docx's contents?

30 Mar 2019, 13:21

Hi!

Regarding the code from here: https://www.autohotkey.com/boards/viewtopic.php?p=80744#p80744

In its current state it doesn't work with the iFilters that are provided by offfilt.dll (which comes with later versions of Windows by default),
so files like .doc, .xls, .ppt, etc. aren't processed as they should.

If this line
if (DllCall(NumGet(NumGet(IFilter+0)+3*A_PtrSize), "Ptr", IFilter, "UInt", IFILTER_INIT_DISABLE_EMBEDDED | IFILTER_INIT_INDEXING_ONLY, "Int64", 0, "Ptr", 0, "Int64*", status) != 0 )
is replaced by:
if (DllCall(NumGet(NumGet(IFilter+0)+3*A_PtrSize), "Ptr", IFilter, "UInt", 0, "Int64", 0, "Ptr", 0, "Int64*", status) != 0 )

then these iFilters work, but the while loops go through only one iteration and then stop, so only the first chunk of the document is processed instead of the whole document.

Does anybody know how to modify the code to process the full document again?
