How to search for a string in a .pdf or .docx's contents?

Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

How to search for a string in a .pdf or .docx's contents?

13 Apr 2016, 19:33

I was hoping I might be able to do FileRead and then RegExMatch() some stuff to make my own Windows Explorer search for my Google Drive documents, as Google Drive is not returning any search results.

I'm able to get the stray .docx to appear, and I'm able to get plenty of .txt and .ahk files with the code I've set up. I know, it's very messy and shoddy :shh: . But it seems that you can't use the above method with .pdf and .docx files, and so I'm not sure how best to return them. Or is there a use for the *c or *P options that I haven't figured out?

Code: Select all

^0:: ; check syntax thoroughly
Gui, New
Gui, Add, Edit, vSearch w500
Gui, Add, Button, vButton x+0 gSearch, Search
Gui, Show, % "x0 y0 h" A_ScreenHeight - 64, Search
return

Search:
Gui, Submit, NoHide
StringSplit, Sea, Search, %A_Space%
resultsDesc:="",results:=[] ; [] for results?
Loop, Files, C:\Users\%A_UserName%\Google Drive\*, R
{
FileRead, var, % A_LoopFileFullPath
count:=0
Loop % Sea0 ; Sea?
{
RegExReplace(var,"i)" sea%A_Index%,,c)
count:=c * (count?count:1) ; gives weight to the count
}
If count
{
	match++
	results["count" match]:=count
	results["file" match]:=A_LoopFileFullPath
	Gui, Add, Text, xs Section, % results[lowest]
	sink:=results["file" match]
	Gui, Add, Text, gOpenEarly x50 ys vlink%match%, % sink
}
}
peel:=1, duplicate:=[]
For key, value in results
    duplicate[key]:=value
Gui, Destroy
Gui, New
Gui, Default
Gui, Add, Edit, vSearch w500, %Search%
Gui, Add, Button, vButton x+0 gSearch, Search
While peel
{
old_value:=0
For key, value in results
	If InStr(key,"count")
		If (value>old_value)
			lowest:=key, old_value:=value
Gui, Add, Text, xs Section, % results[lowest]
link:=results[f:="file"  n:=SubStr(lowest,6)]
Gui, Add, Text, gOpen x50 ys vlink%n%, % link
results.Delete(lowest)
results.Delete(f)
peel:=0
For key, value in results
    peel++
}
var:="", results:=""
Gui, Show, % "x0 y0 h" A_ScreenHeight - 64, Search
return

Open:
Run % duplicate["file" SubStr(A_GuiControl,5)]
return

OpenEarly:
Run % results["file" SubStr(A_GuiControl,5)]
return
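The weighting idea in the Search routine (multiplying the per-term hit counts together) can be isolated into a small function. The following is only a sketch: the name ScoreText is hypothetical, and it makes two assumptions that differ from the posted code. It treats each search word as literal text rather than as a regular expression, and it returns 0 as soon as any word is missing, whereas the posted loop lets a later word restart the count after a zero.

```autohotkey
; Hypothetical helper: product of case-insensitive hit counts for each
; space-separated word in query. Returns 0 if any word is missing.
ScoreText(text, query) {
    score := 1, terms := 0
    Loop, Parse, query, %A_Space%
    {
        if (A_LoopField = "")
            continue                      ; skip empty pieces from double spaces
        terms++
        ; \Q...\E makes the word literal; an embedded \E is re-quoted.
        literal := "\Q" . StrReplace(A_LoopField, "\E", "\E\\E\Q") . "\E"
        RegExReplace(text, "i)" . literal, "", hits)  ; hits receives the match count
        if (!hits)
            return 0
        score *= hits
    }
    return terms ? score : 0
}
```

For example, ScoreText("foo bar foo", "foo bar") multiplies 2 hits by 1 hit, and a query word such as a.b only matches a literal dot.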
kon
Posts: 1756
Joined: 29 Sep 2013, 17:11

How to search for a string in a .pdf or .docx's contents?

13 Apr 2016, 23:18

You can't just read .docx and .pdf files like you would a text file.

.docx is a collection of (mostly) XML files that have been zipped. You can rename a .docx file to .zip and extract the files to see what is inside. The XML can get quite complicated, so you're probably better off using Word via COM to search the document.
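A minimal sketch of the Word-via-COM route might look like the following. It assumes Microsoft Word is installed and that docPath is a full path; the function name DocxContains is hypothetical, and opening Word for every file will be slow across thousands of documents.

```autohotkey
; Sketch only: assumes Microsoft Word is installed and docPath is a full path.
DocxContains(docPath, needle) {
    text := ""
    word := ComObjCreate("Word.Application")
    try {
        ; Arguments: FileName, ConfirmConversions = false, ReadOnly = true
        doc := word.Documents.Open(docPath, false, true)
        text := doc.Content.Text
        doc.Close(0)      ; 0 = wdDoNotSaveChanges
    } finally {
        word.Quit()       ; always shut down the hidden Word instance
    }
    return InStr(text, needle) > 0   ; InStr is case-insensitive by default
}
```

If many files are searched, creating the Word object once outside the loop and reusing it would likely be faster; this sketch keeps each call independent for simplicity.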

.pdf files - One way is to use xpdf via command line to extract the text. This will only work if the pdf contains actual text and not just images of text. The output of xpdf can be written to a file or to its stdout. If you search the forums you may find some examples.
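As a rough illustration of the stdout route: pdftotext from xpdf writes to stdout when the output file name is given as a single hyphen, and WScript.Shell's Exec method can capture that output. The function names and the default exePath below are assumptions, Exec briefly shows a console window, and text outside plain ASCII may not survive the pipe intact.

```autohotkey
; Build the command line: "exe" "pdf" -   (the trailing hyphen means stdout).
PdfStdoutCmd(exePath, pdfPath) {
    q := Chr(34)   ; a double-quote character
    return q . exePath . q . " " . q . pdfPath . q . " -"
}

; Sketch: run pdftotext and return whatever it printed.
; exePath is an assumed location; point it at your copy of pdftotext.exe.
PdfTextViaStdout(pdfPath, exePath := "pdftotext.exe") {
    exec := ComObjCreate("WScript.Shell").Exec(PdfStdoutCmd(exePath, pdfPath))
    return exec.StdOut.ReadAll()
}
```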
Blackholyman
Posts: 1273
Joined: 29 Sep 2013, 22:57

Re: How to search for a string in a .pdf or .docx's contents?

14 Apr 2016, 02:33

As for the PDFs: do you have Adobe Professional? If so, there is code that uses COM, so no extra external library or program is required. However, that code works ONLY with Adobe Professional.
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

14 Apr 2016, 07:38

Thanks for the resources kon. I figured COM for word would be the way to go, but wanted to double check there isn't one way that works for both - apparently COM is an option for .pdf's - thanks Blackholyman - but I don't have Adobe Professional.
PearlWins

Re: How to search for a string in a .pdf or .docx's contents?

14 Apr 2016, 07:53

Do you use PERL? I've been using this script for a very long time to extract text from DOCX https://github.com/ahlstromcj/xpc-suite ... pen_xml.pl

For DOC I used AntiWord (Google it to find it), and for PDFs the Xpdf toolset.
tmplinshi
Posts: 1419
Joined: 01 Oct 2013, 14:57

Re: How to search for a string in a .pdf or .docx's contents?

14 Apr 2016, 11:02

You can use xd2txlib.dll to extract text; it supports many formats, including .pdf, .doc and .docx.
xd2txlib.dll is the dll version of xdoc2txt.

Code: Select all

ExtractText(result, "test.pdf")
MsgBox, % result

ExtractText(ByRef result, fileName) {
	fileLength := DllCall("xd2txlib.dll\ExtractText", "str", fileName, "int", false, "ptr*", fileText)
	result := StrGet(fileText, fileLength)
}
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

14 Apr 2016, 12:33

@PearlWins, nope, I don't use PERL

@tmplinshi that seems easy and convenient. Problem is it is crashing for some reason I can't figure out. Renaming a file from .doc to .doc.bak allows it to not crash, but then it doesn't seem like it's getting searched. Testing with multiple subfolders, it crashes pretty early on. I suppose I could use try ExtractText(var, A_LoopFileFullPath) so it prevents AHK crashing, but then I don't know how many files were skipped (and I'm not sure how to figure that out).

Code: Select all

Loop, Files, C:\Users\%A_UserName%\Google Drive\*, R
{
If (A_LoopFileExt="docx" || A_LoopFileExt="pdf" || A_LoopFileExt="doc")
{
counter++
}
MsgBox % counter "`n" A_LoopFileFullPath
var:="" ; thought I'd remedy this some how with clearing the variable prior to the function
ExtractText(var, A_LoopFileFullPath)
}
return

ExtractText(ByRef result, fileName) {
	fileLength := DllCall("C:\Users\" A_UserName "\Downloads\xd2tx215\dll\xd2txlib.dll\ExtractText", "str", fileName, "int", false, "ptr*", fileText)
	result := StrGet(fileText, fileLength)
}
Sometime later I'll explore kon's suggestions.
JoeWinograd
Posts: 1244
Joined: 10 Feb 2014, 20:00

Re: How to search for a string in a .pdf or .docx's contents?

14 Apr 2016, 13:45

Hi Exaskryz,

Following up on the prior recommendations of the Xpdf utilities, I have used them in many AHK scripts:
http://www.foolabs.com/xpdf/

It is a set of nine command line executables. The one you'll need is pdftotext.exe, which converts PDF files to plain text. There are five different options for creating the text:

-layout
-lineprinter
-raw
-table
<null>, which is the default

I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your script. In some scripts, I try multiple options (usually, -layout, -raw, and the default of <null>) and then check to see if one of them found the result I'm looking for, such as an "Identifier" string (Customer Name, Account Number, Date of Birth, etc.).

The call in most of my AHK scripts is along these lines:

Code: Select all

RunWait,%PDFtoTextEXE% -f %FirstPage% -l %LastPage% %OutputFormatOption% "%SourceFileName%" "%DestFileName%",,Hide
No installation is needed for the Xpdf tools — all of the executables are stand-alone and self-contained. You'll see in the downloaded package that there are both 32-bit and 64-bit versions. I wrote the developer this question:
The Xpdf Windows binaries come in 32-bit and 64-bit versions. I just tested the 32-bit versions of [pdfinfo.exe] and [pdftotext.exe] on 64-bit W7 and they worked fine. I assume there's a reason for having a 64-bit version of your binaries, but since the 32-bit version works fine on a 64-bit system, why would I need the 64-bit version?
Here is his answer:
There's not really any reason to use the 64-bit binaries. For the rasterizer (pdftoppm, and also used in pdftops), there may be some cases where it needs to allocate large chunks of memory. But for pdfinfo and pdftotext, I don't think you'll run into that.
In terms of licensing and cost, Xpdf is open source, licensed under the GNU General Public License (GPL) V2, with no cost stated at the website for non-commercial use. For commercial licensing, the Xpdf site says to see their parent company's site, Glyph & Cog.

As kon stated, PDFtoText works only if the PDF contains text, such as in a PDF Normal file or a PDF Searchable Image file (a PDF file from scanning that has both the scanned image and text from the OCR process). In other words, it won't work on image-only PDFs. In that case, you'll need to perform OCR on the files to create the text.

I've never tried to extract the text from DOC or DOCX files, so if you get that to work, please post back here how you did it. If you can't get it to work, one way to do it is to "print" the Word file to a command line PDF print driver; another way is to use a command line tool like OfficeToPDF. These approaches will create a PDF Normal file (i.e., with text), which can then be fed to pdftotext.exe. Regards, Joe
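The "try several output options and see which one finds the string" approach described above could be sketched roughly as follows. The function names, the temp-file name and the default exePath are assumptions rather than anything from Joe's scripts. The sketch also adds pdftotext's -enc UTF-8 option and reads the result with FileRead's *P65001 (UTF-8) option, which is a design choice for non-ASCII text and not part of the original call.

```autohotkey
; Build one pdftotext command line; option "" means the default mode.
BuildPdfToTextCmd(exePath, option, pdfPath, outPath) {
    q := Chr(34)   ; a double-quote character
    optPart := (option = "") ? "" : " " . option
    return q . exePath . q . optPart . " -enc UTF-8 " . q . pdfPath . q . " " . q . outPath . q
}

; Sketch: try -layout, -raw and the default mode in turn.
; Returns the option that produced text containing needle, or "" if none did.
FindInPdf(pdfPath, needle, exePath := "pdftotext.exe") {
    outPath := A_Temp . "\pdftotext_" . A_TickCount . ".txt"   ; hypothetical temp name
    for index, option in ["-layout", "-raw", ""]
    {
        RunWait, % BuildPdfToTextCmd(exePath, option, pdfPath, outPath),, Hide UseErrorLevel
        if (ErrorLevel != 0)        ; launch failure ("ERROR") or non-zero exit code
            continue
        FileRead, text, *P65001 %outPath%
        FileDelete, %outPath%
        if (InStr(text, needle))
            return (option = "") ? "default" : option
    }
    return ""
}
```

A call such as FindInPdf("C:\docs\statement.pdf", "Account Number") would then report which mode, if any, exposed the identifier string; the path here is only an example.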
qwerty12
Posts: 468
Joined: 04 Mar 2016, 04:33
GitHub: qwerty12

Re: How to search for a string in a .pdf or .docx's contents?

14 Apr 2016, 13:49

EDIT: Fixed this so that it can search through Word documents, provided the architecture of the installed Office matches the architecture of the AutoHotkey executable. IOW: you need 64-bit Office installed if you're running this script with 64-bit AHK.

For PDFs, you can search in them using IFilters, which is what Windows Search uses to look inside files and index the contents. Since Windows 8 (because of Reader, I presume), Microsoft has included an IFilter support plugin for PDF files, allowing indexing of their contents out of the box. If you're on 7, just install SumatraPDF. I'd install Adobe Reader as a last resort - if you don't like Sumatra, PDF XChange Viewer supposedly has an IFilter plugin, too. I tell no word of a lie when I say that this is the first time in about eight years that I've installed Adobe Reader on a PC that I own.

I've successfully used the following PDF IFilter plugins to do a simple search inside PDF files with this script:
  • Windows 10's PDF IFilter plugin that comes with the OS
  • The IFilter plugin you get when installing SumatraPDF, tested on 64-bit Windows 10
  • Adobe Acrobat Reader DC's on 64-bit Windows 10 (note: I only tried once, so this may be the exception rather than the rule, but installing Acrobat Reader 11/XI on 64-bit Windows 10 broke the ability for the Indexer to look through PDF files!)
The script does a simple case-insensitive substring search. If you want to try RegExMatch, comment out the InStr check and see if you can use RegExMatch on the output of StrGet once it's assigned to a variable. You'll need to set the file variable to the full path of a PDF file.

Code: Select all

; IFilter AutoHotkey example by qwerty12
; Credits:
; https://tlzprgmr.wordpress.com/2008/02/02/using-the-ifilter-interface-to-extract-text-from-documents/
; https://stackoverflow.com/questions/7177953/loadifilter-fails-on-all-pdfs-but-mss-filtdump-exe-doesnt
; https://forums.adobe.com/thread/1086426?start=0&tstart=0

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
AutoTrim, Off
ListLines, Off
SetBatchLines, -1
#KeyHistory 0
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.

; ---
#MaxMem 4 ; Might need to be multiplied by 2 and have 1 added to it, because sizeof(WCHAR) == 2. Not sure. It works for me as it is currently.
cchBufferSize := 4 * 1024
; ---
resultstriplinebreaks := true
file := ""
searchstring := ""
; ---

CHUNK_TEXT := 1

STGM_READ := 0

IFILTER_INIT_CANON_PARAGRAPHS	:= 1,
IFILTER_INIT_HARD_LINE_BREAKS	:= 2,
IFILTER_INIT_CANON_HYPHENS	:= 4,
IFILTER_INIT_CANON_SPACES	:= 8,
IFILTER_INIT_APPLY_INDEX_ATTRIBUTES	:= 16,
IFILTER_INIT_APPLY_OTHER_ATTRIBUTES	:= 32,
IFILTER_INIT_APPLY_CRAWL_ATTRIBUTES	:= 256,
IFILTER_INIT_INDEXING_ONLY	:= 64,
IFILTER_INIT_SEARCH_LINKS	:= 128,
IFILTER_INIT_FILTER_OWNED_VALUE_OK	:= 512,
IFILTER_INIT_FILTER_AGGRESSIVE_BREAK	:= 1024,
IFILTER_INIT_DISABLE_EMBEDDED	:= 2048,
IFILTER_INIT_EMIT_FORMATTING	:= 4096

S_OK := 0
FILTER_S_LAST_TEXT := 268041
FILTER_E_NO_MORE_TEXT := -2147215615

if (!A_IsUnicode) {
	MsgBox The IFilter APIs appear to be Unicode only. Please try again with a Unicode build of AHK.
	ExitApp
}

if (!file || !searchstring) {
	MsgBox Please make sure the file to search in and the string to search for is specified in %A_ScriptFullPath%
	ExitApp
}

SplitPath, file,,, ext
VarSetCapacity(FILTERED_DATA_SOURCES, 4*A_PtrSize, 0), NumPut(&ext, FILTERED_DATA_SOURCES,, "Ptr")
VarSetCapacity(FilterClsid, 16, 0)

; Adobe workaround
if (job := DllCall("CreateJobObject", "Ptr", 0, "Str", "filterProc", "Ptr"))
	DllCall("AssignProcessToJobObject", "Ptr", job, "Ptr", DllCall("GetCurrentProcess", "Ptr"))

FilterRegistration := ComObjCreate("{9E175B8D-F52A-11D8-B9A5-505054503030}", "{c7310722-ac80-11d1-8df3-00c04fb6ef4f}")
if (DllCall(NumGet(NumGet(FilterRegistration+0)+3*A_PtrSize), "Ptr", FilterRegistration, "Ptr", 0, "Ptr", &FILTERED_DATA_SOURCES, "Ptr", 0, "Int", false, "Ptr", &FilterClsid, "Ptr", 0, "Ptr*", 0, "Ptr*", IFilter) != 0 ) ; ILoadFilter::LoadIFilter
	ExitApp
if (IsFunc("Guid_ToStr"))
	MsgBox % Guid_ToStr(FilterClsid)
ObjRelease(FilterRegistration)

if (DllCall("shlwapi\SHCreateStreamOnFile", "Str", file, "UInt", STGM_READ, "Ptr*", iStream) != 0 )
	ExitApp
PersistStream := ComObjQuery(IFilter, "{00000109-0000-0000-C000-000000000046}")
if (DllCall(NumGet(NumGet(PersistStream+0)+5*A_PtrSize), "Ptr", PersistStream, "Ptr", iStream) != 0 ) ; ::Load
	ExitApp
ObjRelease(iStream)

status := 0
if (DllCall(NumGet(NumGet(IFilter+0)+3*A_PtrSize), "Ptr", IFilter, "UInt", IFILTER_INIT_DISABLE_EMBEDDED | IFILTER_INIT_INDEXING_ONLY, "Int64", 0, "Ptr", 0, "Int64*", status) != 0 ) ; IFilter::Init
	ExitApp

VarSetCapacity(STAT_CHUNK, A_PtrSize == 8 ? 64 : 52)
VarSetCapacity(buf, (cchBufferSize * 2) + 2)
while (DllCall(NumGet(NumGet(IFilter+0)+4*A_PtrSize), "Ptr", IFilter, "Ptr", &STAT_CHUNK) == 0) { ; ::GetChunk
	if (NumGet(STAT_CHUNK, 8, "UInt") & CHUNK_TEXT) {
		while (DllCall(NumGet(NumGet(IFilter+0)+5*A_PtrSize), "Ptr", IFilter, "Int64*", (siz := cchBufferSize), "Ptr", &buf) != FILTER_E_NO_MORE_TEXT) ; ::GetText
		{
			text := StrGet(&buf,, "UTF-16")
			if (resultstriplinebreaks)
				text := StrReplace(text, "`r`n")
			if (InStr(text, searchstring)) {
				MsgBox "%searchstring%" found
				break
			}
		}
	}
}
ObjRelease(PersistStream)
ObjRelease(iFilter)
if (job)
	DllCall("CloseHandle", "Ptr", job)

ExitApp
Last edited by qwerty12 on 16 Apr 2016, 23:56, edited 3 times in total.
tmplinshi
Posts: 1419
Joined: 01 Oct 2013, 14:57

Re: How to search for a string in a .pdf or .docx's contents?

15 Apr 2016, 07:31

Exaskryz wrote:@PearlWins, nope, I don't use PERL

@tmplinshi that seems easy and convenient. Problem is it is crashing for some reason I can't figure out. Renaming a file from .doc to .doc.bak allows it to not crash, but then it doesn't seem like it's getting searched. Testing with multiple subfolders, it crashes pretty early on. I suppose I could use try ExtractText(var, A_LoopFileFullPath) so it prevents AHK crashing, but then I don't know how many files were skipped (and I'm not sure how to figure that out).

Oh, try this one; it was written by HotKeyIt years ago.

Code: Select all

ExtractText(ByRef result, fileName) {
	static hModule := DllCall("LoadLibrary", "Str", "xd2txlib.dll", "Ptr")
	fileLength := DllCall("xd2txlib\ExtractText", "Str", fileName, "Int", False, "Int*", fileText)
	result := StrGet( fileText, fileLength / 2 )
}
SifJar
Posts: 398
Joined: 11 Jan 2016, 17:52

Re: How to search for a string in a .pdf or .docx's contents?

15 Apr 2016, 08:17

Exaskryz wrote:@PearlWins, nope, I don't use PERL

@tmplinshi that seems easy and convenient. Problem is it is crashing for some reason I can't figure out. Renaming a file from .doc to .doc.bak allows it to not crash, but then it doesn't seem like it's getting searched. Testing with multiple subfolders, it crashes pretty early on. I suppose I could use try ExtractText(var, A_LoopFileFullPath) so it prevents AHK crashing, but then I don't know how many files were skipped (and I'm not sure how to figure that out).

If you're not using AHK_H, there is no built in variable false, so unless you've defined that variable, that could be the issue. Replace false in the dllcall with 0 (alternatively you could define the variable false e.g. false := 0 before the dllcall)
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

15 Apr 2016, 10:51

@SifJar are you sure there's no built in variable for false? I mean, just doing MsgBox % false as the first line returns 0, and MsgBox % true gives me a 1. And no, you can't do false:=0. I don't think I'm on AHK_H, I'm on the normal AHK downloaded on the site, which to my understanding used to be AHK_L but then became the main fork so it's just known as AHK now.

@tmplinshi, that seems like it may be working. It isn't crashing early on. I'll do some testing later to see whether it successfully finds strings that I know are in the files.

Edit: It seems only one file in all the folders I'm interested in searching crashes the script when I use tmplinshi's suggestion of HotKeyIt's code. I simply added a hardcoded exception for this file in the loop. What I'd like are some tips on how I might log errors for whatever file crashes the .dll, just in case other files that would do that are ever added to my folders. Should I be putting the try within the function itself, around the DllCall? At that point, might I be able to have the function return an error so I can log the name of the file that causes the problem? I had tried before doing try on the function call itself, try ExtractText(), but it seemed nothing could be caught by catch myErrorVar; that's why I'm thinking of doing the try/catch in the function itself.

Edit: Yeah, I'm not sure how to implement putting the trys into the function itself, as that seems to break the function as no results get returned.
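One way to find out which file brought the script down, even when try cannot catch the crash, is to write the file name to a log before each extraction and mark it as done afterwards; any file left with a START line but no DONE line is a suspect. The following is only a sketch: the function names and log format are hypothetical, and the extraction function is called by name so the block stands on its own.

```autohotkey
; Scan root recursively, logging each file before and after extraction.
; extractFn is the name of an extraction function such as ExtractText.
ScanWithLog(root, logFile, extractFn := "ExtractText") {
    Loop, Files, %root%\*, R
    {
        FileAppend, % "START" . A_Tab . A_LoopFileFullPath . "`n", %logFile%, UTF-8
        %extractFn%(text, A_LoopFileFullPath)   ; a hard crash here leaves START without DONE
        FileAppend, % "DONE" . A_Tab . A_LoopFileFullPath . "`n", %logFile%, UTF-8
    }
}

; Return an array of paths that have a START line but no later DONE line.
FindUnfinished(logText) {
    pending := {}
    Loop, Parse, logText, `n, `r
    {
        tabPos := InStr(A_LoopField, A_Tab)
        if (!tabPos)
            continue
        tag := SubStr(A_LoopField, 1, tabPos - 1)
        path := SubStr(A_LoopField, tabPos + 1)
        if (tag = "START")
            pending[path] := true
        else if (tag = "DONE")
            pending.Delete(path)
    }
    unfinished := []
    for path in pending
        unfinished.Push(path)
    return unfinished
}
```

After a crash, reading the log with FileRead and passing its contents to FindUnfinished on the next start would list the file or files to skip or inspect.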
SifJar
Posts: 398
Joined: 11 Jan 2016, 17:52

Re: How to search for a string in a .pdf or .docx's contents?

15 Apr 2016, 18:15

Ah, you're right, I was thinking of null which is in AHK_H but not AHK.

And yes, you are using AHK (which did indeed use to be AHK_L) and not AHK_H.
Masonjar13
Posts: 1472
Joined: 20 Jul 2014, 10:16
GitHub: Masonjar13
Location: Не Россия

Re: How to search for a string in a .pdf or .docx's contents?

15 Apr 2016, 18:31

Regarding the crashes:
  • Have you checked A_LastError?
  • Is the problem file large enough to exceed the #MaxMem default?
OS: Windows 7 Ultimate | Editor: Notepad++
My Personal Function Library | Check Out My Computer Rig
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 11:43

I haven't yet tried A_LastError, but I'll give it a go.

The file is only 410 KB, and considering that I have loaded at least 512 MB files into the script (both using FileRead and the xd2txlib.dll; I watched in task manager AHK take up 700+ MB), I don't think the memory was the problem. Though that does pose an interesting thing that maybe in files larger than 64 MB the latter part of the file wasn't being searched by the RegEx, though I don't think that's much of a problem as the files I wanted to search would be smaller than that. (Only one file in the same category of files I want to search exceeds 64 MB that I know of, and I'm fine with omitting that one out of my search results.)
Masonjar13
Posts: 1472
Joined: 20 Jul 2014, 10:16
GitHub: Masonjar13
Location: Не Россия

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 11:53

For better performance, I'd suggest reading them in, using them as needed, then overwriting the variable with the next file, if that's an option. But I still can't say why it's crashing. I'll wait for you to try A_LastError first.
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 13:10

Welp, I can't quite figure out how to make use of A_LastError when the script is being crashed by the DllCall. I tried an OnExit, just in case that can still be executed. I've tried putting try in there, both try ExtractText() and within the function itself when it sets the variables, but that doesn't prevent the crashing. (Not shown in the code below, because I guess that prevents any returns in the GUI of the script..)

As far as I know, I do keep overwriting the variable with the next file in my loop. Here's my current code, and still it's just that one file amongst all my Google Drive folders giving me a problem. There may be some code unnecessary to share here; being dragged out of the house shortly so don't quite have time to filter it out, but will do so later. The scrolling stuff I suppose is unnecessary, but I don't want to break the script with any function calls missing. The !5 is meant to diagnose in my attempt to capture A_LastError, but I couldn't figure it out like I said.



Code: Select all

^0::
Gui, New
Gui, Add, Edit, vSearch w500
Gui, Add, Button, vButton x+0 gSearch, Search
;Gui, Add, Text, x0 y1920, %A_Space%
Gui, Show, % "x0 y0 h" A_ScreenHeight - 64, Search
WinGetPos,,,,Height, Search
WinMove, Search,,,,,% Height+1
WinMove, Search,,,,,% Height
return

Search:
Gui, Submit, NoHide
StringSplit, Sea, Search, %A_Space%
resultsDesc:="",results:=[] ; [] for results?
Loop, Files, C:\Users\%A_UserName%\Google Drive\*, R
{
; Progress up to 4454 files. May as well make it 4500 files to count through
fileThatBreaksThingsFullPath:="Censored.pdf"
If (A_LoopFileFullPath=fileThatBreaksThingsFullPath) ; this file breaks the dll
continue 
If (A_LoopFileExt="nds") ; A_LoopFileExt holds the extension without the leading dot
continue
ExtractText(var,A_LoopFileFullPath)
count:=0
Loop % Sea0 ; Sea?
{
RegExReplace(var,"i)" sea%A_Index%,,c)
count:=c * (count?count:1) ; gives weight to the count
}
;MsgBox % count
If count
{
	match++
	results["count" match]:=count
	results["file" match]:=A_LoopFileFullPath
	Gui, Add, Text, xs Section, % results[lowest]
	sink:=results["file" match]
	Gui, Add, Text, gOpenEarly x50 ys vlink%match%, % sink
	UpdateScrollBars(A_Gui, A_GuiWidth, A_GuiHeight)
	WinGetPos,,,,Height, Search
	WinMove, Search,,,,,% Height+1
	WinMove, Search,,,,,% Height
;	Gui, Show, % "x0 y0 h" A_ScreenHeight - 20
}
}
peel:=1, duplicate:=[]
For key, value in results
    duplicate[key]:=value
Gui, Destroy
Gui, New
Gui, Default
Gui, Add, Edit, vSearch w500, %Search%
Gui, Add, Button, vButton x+0 gSearch, Search
While peel
{
old_value:=0
For key, value in results
	If InStr(key,"count")
		If (value>old_value)
			lowest:=key, old_value:=value
Gui, Add, Text, xs Section, % results[lowest]
link:=results[f:="file"  n:=SubStr(lowest,6)]
Gui, Add, Text, gOpen x50 ys vlink%n%, % link
results.Delete(lowest)
results.Delete(f)
peel:=0
For key, value in results
    peel++
}
var:="", results:=""
;MsgBox % resultsDesc
Gui, Show, % "x0 y0 h" A_ScreenHeight - 64, Search
WinGetPos,,,,Height, Search
WinMove, Search,,,,,% Height+1
WinMove, Search,,,,,% Height
return

Open:
Run % duplicate["file" SubStr(A_GuiControl,5)]
return

OpenEarly:
Run % results["file" SubStr(A_GuiControl,5)]
return

WordSearch:
ExcelSearch:
PDFSearch:
PDFtoTextEXE:="C:\Users\" A_UserName "\Downloads\PDF..."
;RunWait, 
return

GuiSize: ; Built-in GUI Event that is called when resizing the window
    UpdateScrollBars(A_Gui, A_GuiWidth, A_GuiHeight)
return

#IfWinActive ahk_exe autohotkey.exe
~LButton::
MouseGetPos, Mx, My, A
WinGetPos,,,Wx, Wy, A
;Left = 8,351 to 24,367 (out of 733,376)
;Right = 692,351 to 707,367
;Down = 708,335 to 725, 350
;Up = 708,52 to 725, 67
If (Mx>8 && Mx<24 && My>Wy-26 && My<Wy-10) ; Left 	; Need to recalibrate for any GUI
   OnScroll(0,0,0x114,WinExist(),1)
If (Mx>Wx-41 && Mx<Wx-25 && My>Wy-26 && My<Wy-10) ; Right
   OnScroll(1,0,0x114,WinExist(),1)
If (Mx>Wx-27 && Mx<Wx-4 && My>Wy-19 && My<Wy-4) ; Down 
   OnScroll(1,0,0x115,WinExist(),1)
If (Mx>Wx-27 && Mx<Wx-4 && My>27 && My<42) ; Up
   OnScroll(0,0,0x115,WinExist(),1)
;Tooltip % (Mx>Wx-27) (Mx<Wx-4) (My>Wy-4) (My<Wy-19) "`n" My "`t" Wy "`t" Wy-4 "`t" Wy-19
return

#If 
UpdateScrollBars(GuiNum, GuiWidth, GuiHeight)
{
    static SIF_RANGE=0x1, SIF_PAGE=0x2, SIF_DISABLENOSCROLL=0x8, SB_HORZ=0, SB_VERT=1
    
    Gui, %GuiNum%:Default
    Gui, +LastFound
    
    ; Calculate scrolling area.
    Left := Top := 9999
    Right := Bottom := 0
    WinGet, ControlList, ControlList
    Loop, Parse, ControlList, `n
    {
        GuiControlGet, c, Pos, %A_LoopField%
        if (cX < Left)
            Left := cX
        if (cY < Top)
            Top := cY
        if (cX + cW > Right)
            Right := cX + cW
        if (cY + cH > Bottom)
            Bottom := cY + cH
    }
    Left -= 8
    Top -= 8
    Right += 8
    Bottom += 8
    ScrollWidth := Right-Left
    ScrollHeight := Bottom-Top
    
    ; Initialize SCROLLINFO.
    VarSetCapacity(si, 28, 0)
    NumPut(28, si) ; cbSize
    NumPut(SIF_RANGE | SIF_PAGE, si, 4) ; fMask
    
    ; Update horizontal scroll bar.
	; NumPut(SIF_RANGE | SIF_PAGE | SIF_DISABLENOSCROLL, si, 4) ; fMask
    ;NumPut(ScrollWidth, si, 12) ; nMax
    ;NumPut(GuiWidth, si, 16) ; nPage
    ;DllCall("SetScrollInfo", "uint", WinExist(), "uint", SB_HORZ, "uint", &si, "int", 1)
    
    ; Update vertical scroll bar.
     NumPut(SIF_RANGE | SIF_PAGE | SIF_DISABLENOSCROLL, si, 4) ; fMask
    NumPut(ScrollHeight, si, 12) ; nMax
    NumPut(GuiHeight, si, 16) ; nPage
    DllCall("SetScrollInfo", "uint", WinExist(), "uint", SB_VERT, "uint", &si, "int", 1)
    
    if (Left < 0 && Right < GuiWidth)
        x := Abs(Left) > GuiWidth-Right ? GuiWidth-Right : Abs(Left)
    if (Top < 0 && Bottom < GuiHeight)
        y := Abs(Top) > GuiHeight-Bottom ? GuiHeight-Bottom : Abs(Top)
    if (x || y)
        DllCall("ScrollWindow", "uint", WinExist(), "int", x, "int", y, "uint", 0, "uint", 0)
}

OnScroll(wParam, lParam, msg, hwnd, trigger) ; I have added the trigger parameter
{
    static SIF_ALL=0x17
	SCROLL_STEP:=(trigger?100:10)
    
    bar := msg=0x115 ; SB_HORZ=0, SB_VERT=1
    
    VarSetCapacity(si, 28, 0)
    NumPut(28, si) ; cbSize
    NumPut(SIF_ALL, si, 4) ; fMask
    if !DllCall("GetScrollInfo", "uint", hwnd, "int", bar, "uint", &si)
        return
    
    VarSetCapacity(rect, 16)
    DllCall("GetClientRect", "uint", hwnd, "uint", &rect)
    
    new_pos := NumGet(si, 20, "int") ; nPos
    action := wParam & 0xFFFF
    if action = 0 ; SB_LINEUP
        new_pos -= SCROLL_STEP
    else if action = 1 ; SB_LINEDOWN
        new_pos += SCROLL_STEP
    else if action = 2 ; SB_PAGEUP
        new_pos -= NumGet(rect, 12, "int") - SCROLL_STEP
    else if action = 3 ; SB_PAGEDOWN
        new_pos += NumGet(rect, 12, "int") - SCROLL_STEP
    else if (action = 5 || action = 4) ; SB_THUMBTRACK || SB_THUMBPOSITION
        new_pos := wParam>>16
    else if action = 6 ; SB_TOP
        new_pos := NumGet(si, 8, "int") ; nMin
    else if action = 7 ; SB_BOTTOM
        new_pos := NumGet(si, 12, "int") ; nMax
    else
        return
    
    min := NumGet(si, 8, "int") ; nMin
    max := NumGet(si, 12, "int") - NumGet(si, 16, "int") ; nMax-nPage
    new_pos := new_pos > max ? max : new_pos
    new_pos := new_pos < min ? min : new_pos
    
    old_pos := NumGet(si, 20, "int") ; nPos
    x := y := 0
    if bar = 0 ; SB_HORZ
        x := old_pos-new_pos
    else
        y := old_pos-new_pos
    ; Scroll contents of window and invalidate uncovered area.
    DllCall("ScrollWindow", "uint", hwnd, "int", x, "int", y, "uint", 0, "uint", 0)
    ; Update scroll bar.
    NumPut(new_pos, si, 20, "int") ; nPos
    DllCall("SetScrollInfo", "uint", hwnd, "int", bar, "uint", &si, "int", 1)
}

OnScrollBottom(wParam, lParam, msg, hwnd, trigger) ; I have added the trigger parameter
{
    static SIF_ALL=0x17
	SCROLL_STEP:=(trigger?100:10)
    
    bar := msg=0x115 ; SB_HORZ=0, SB_VERT=1
    
    VarSetCapacity(si, 28, 0)
    NumPut(28, si) ; cbSize
    NumPut(SIF_ALL, si, 4) ; fMask
    if !DllCall("GetScrollInfo", "uint", hwnd, "int", bar, "uint", &si)
        return
    
    min := NumGet(si, 8, "int") ; nMin
    max := NumGet(si, 12, "int") - NumGet(si, 16, "int") ; nMax-nPage
    
    old_pos := NumGet(si, 20, "int") ; nPos
    x := y := 0
    if bar = 0 ; SB_HORZ
        x := old_pos-max
    else
        y := old_pos-max
    ; Scroll contents of window and invalidate uncovered area.
    DllCall("ScrollWindow", "uint", hwnd, "int", x, "int", y, "uint", 0, "uint", 0)
    ; Update scroll bar.
    NumPut(max, si, 20, "int") ; nPos
    DllCall("SetScrollInfo", "uint", hwnd, "int", bar, "uint", &si, "int", 1)
}

!5::
OnExit, please
MsgBox % FileExist(x:="Censored.pdf") ; used to make sure I had the path right
MsgBox % "|" A_LastError ; this message box appears
try ExtractText(result,x)
MsgBox % A_LastError ; this doesn't appear
return

please:
MsgBox % A_LastError
ExitApp
return

ExtractText(ByRef result, fileName) {
	static hModule := DllCall("LoadLibrary", "Str", "C:\Users\" A_UserName "\Downloads\xd2tx215\dll\xd2txlib.dll", "Ptr") ; LoadLibrary takes the DLL path only, without the \ExtractText function suffix
	fileLength := DllCall("C:\Users\" A_UserName "\Downloads\xd2tx215\dll\xd2txlib.dll\ExtractText", "Str", fileName, "Int", False, "Int*", fileText) ; this is what crashes things with that one file
	result := StrGet( fileText, fileLength / 2 )
}
Masonjar13
Posts: 1472
Joined: 20 Jul 2014, 10:16
GitHub: Masonjar13
Location: Не Россия

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 16:42

Can you post or send me the problem file and the next one closest to it (same file type and similar size)? If you wouldn't mind including the dll as well, I'd appreciate it.

Edit: Try changing "Int*", FileText to "UInt*", FileText. You may also try replacing that with "Ptr" or "UPtr".
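To illustrate the type suggestion: on 64-bit AutoHotkey a pointer does not fit in an Int, so "Ptr*" is the safer type for the output parameter, and checking ErrorLevel and the returned pointer before calling StrGet avoids reading from a bad address. This is only a sketch under the same assumptions about xd2txlib's ExtractText export as the code already posted in this thread; the function name ExtractTextSafe and the dllPath default are hypothetical, and it reads up to the terminating null instead of relying on the returned length.

```autohotkey
; Sketch: returns true and fills result on success, false otherwise.
; dllPath is an assumed location; pass the full path to your xd2txlib.dll.
ExtractTextSafe(ByRef result, fileName, dllPath := "xd2txlib.dll") {
    result := "", fileText := 0
    fileLength := DllCall(dllPath . "\ExtractText"
        , "Str", fileName
        , "Int", 0                ; same flag the thread passes as false
        , "Ptr*", fileText        ; pointer-sized output, valid on 32-bit and 64-bit
        , "Int")
    ; ErrorLevel is non-zero if the DLL or function was not found,
    ; or if the call raised an exception that DllCall was able to trap.
    if (ErrorLevel || !fileText || fileLength <= 0)
        return false
    result := StrGet(fileText, "UTF-16")   ; read up to the null terminator
    return true
}
```

A caller can then count or log the files for which this returns false. Whether it prevents the crash with the problem PDF is not something the thread establishes.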
Exaskryz
Posts: 2876
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 18:55

I'm not legally sure if I'm allowed to distribute the file, so I won't do so. It's a journal article. A citation for anyone who might have legal access to get it: Genotype-Guided vs Clinical Dosing of Warfarin and Its Analogues. Kathleen Stergiopoulos, MD, PhD; David L. Brown, MD. JAMA Intern Med. 2014;174(8):1330-1338. doi:10.1001/jamainternmed.2014.2368. The exact file is a 9-page PDF which on the final page includes a section called Warfarin, Genes, and the (Health Care) Environment by authors Dhruv S. Kazi, MD, MSc, MS; Mark A. Hlatky, MD.

It's a 410 KB .pdf

I understand that downloading through another source, or even the same source as I did, may not produce the exact same file. But if other people can replicate the crashing with it, that'd be fantastic.

I tried both suggestions in your Edit Masonjar, with no success. Still seems to be crashing.

Here is a revised and shortened code. ^0 works out normally to parse my Google Drive, still crashing when it reaches that one file (Because the "Censored.pdf" isn't actually doing anything). The !5 is a direct shortcut to it and has been consistently crashing the script.

Code: Select all

^0::
Gui, New
Gui, Add, Edit, vSearch w500
Gui, Add, Button, vButton x+0 gSearch, Search
Gui, Show, % "x0 y0 h" A_ScreenHeight - 64, Search
return

Search:
Gui, Submit, NoHide
StringSplit, Sea, Search, %A_Space%
resultsDesc:="",results:=[] ; [] for results?
Loop, Files, C:\Users\%A_UserName%\Google Drive\*, R
{
; Progress up to 4454 files. May as well make it 4500 files to count through
fileThatBreaksThingsFullPath:="Censored.pdf"
If (A_LoopFileFullPath=fileThatBreaksThingsFullPath) ; this file breaks the dll
continue 
If (A_LoopFileExt="nds") ; A_LoopFileExt holds the extension without the leading dot
continue
ExtractText(var,A_LoopFileFullPath)
count:=0
Loop % Sea0 ; Sea?
{
RegExReplace(var,"i)" sea%A_Index%,,c)
count:=c * (count?count:1) ; gives weight to the count
}
If count
{
	match++
	results["count" match]:=count
	results["file" match]:=A_LoopFileFullPath
	Gui, Add, Text, xs Section, % results[lowest]
	sink:=results["file" match]
	Gui, Add, Text, gOpenEarly x50 ys vlink%match%, % sink
}
}
peel:=1, duplicate:=[]
For key, value in results
    duplicate[key]:=value
Gui, New
Gui, Add, Edit, vSearch w500, %Search%
Gui, Add, Button, vButton x+0 gSearch, Search
While peel
{
old_value:=0
For key, value in results
	If InStr(key,"count")
		If (value>old_value)
			lowest:=key, old_value:=value
Gui, Add, Text, xs Section, % results[lowest]
link:=results[f:="file"  n:=SubStr(lowest,6)]
Gui, Add, Text, gOpen x50 ys vlink%n%, % link
results.Delete(lowest)
results.Delete(f)
peel:=0
For key, value in results
    {
	peel:=1
	break
	}
}
var:="", results:=""
Gui, Show, % "x0 y0 h" A_ScreenHeight - 64, Search
return

Open:
Run % duplicate["file" SubStr(A_GuiControl,5)]
return

OpenEarly:
Run % results["file" SubStr(A_GuiControl,5)]
return

!5::
OnExit, please ; works fine when I actually reload the script.
MsgBox % FileExist(x:="Censored.pdf") ; used to make sure I had the path right
MsgBox % "|" A_LastError ; this message box appears
try ExtractText(result,x)
MsgBox % A_LastError ; this doesn't appear
return

please:
MsgBox % A_LastError
ExitApp
return

ExtractText(ByRef result, fileName) {
	static hModule := DllCall("LoadLibrary", "Str", "C:\Users\" A_UserName "\Downloads\xd2tx215\dll\xd2txlib.dll", "UPtr") ; LoadLibrary takes the DLL path only, without the \ExtractText function suffix
	fileLength := DllCall("C:\Users\" A_UserName "\Downloads\xd2tx215\dll\xd2txlib.dll\ExtractText", "Str", fileName, "Int", False, "UInt*", fileText) ; this is what crashes things with that one file
	result := StrGet( fileText, fileLength / 2 )
}
SifJar
Posts: 398
Joined: 11 Jan 2016, 17:52

Re: How to search for a string in a .pdf or .docx's contents?

16 Apr 2016, 19:30

Seems like it's available with a free account from here: http://archinte.jamanetwork.com/article ... id=1881013 (for anyone wanting a copy for assisting here)
