How to search for a string in a .pdf or .docx's contents?

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
User avatar
JoeWinograd
Posts: 2198
Joined: 10 Feb 2014, 20:00
Location: U.S. Central Time Zone

Re: How to search for a string in a .pdf or .docx's contents?

Post by JoeWinograd » 16 Apr 2016, 19:35

Exaskryz,

Just curious — have you tried Xpdf's PDFtoText that I mentioned here:
https://autohotkey.com/boards/viewtopic ... 5be#p80740

I tried to create an account (several times) at the JAMA site to get the PDF article, but it kept coming up with an "Unexpected Error". I tried to do a password reset — also didn't work. I tried to call the support number but they're not in. I'll send an email to the support folks, but for now, I have no way to try it myself. Regards, Joe

User avatar
JoeWinograd
Posts: 2198
Joined: 10 Feb 2014, 20:00
Location: U.S. Central Time Zone

Re: How to search for a string in a .pdf or .docx's contents?

Post by JoeWinograd » 16 Apr 2016, 19:39

Hi SifJar,
Our messages crossed. I just spent the last half-hour trying to create an account there (which is the only way to get the PDF). Tried numerous usernames and numerous email addresses. Nothing worked. Were you able to register there?

kon
Posts: 1756
Joined: 29 Sep 2013, 17:11

Re: How to search for a string in a .pdf or .docx's contents?

Post by kon » 16 Apr 2016, 20:47

Exaskrys, did you try qwerty12's suggestion? It looks promising.

I had this lying around, even if you don't use it, future AHK users may find it useful... This is an example of using xpdf's pdftotext. It doesn't need to write xpdf's output to a file because it uses the stdout (it passes the results of the conversion to the AHK script in memory).

Code: Select all

FileSelectFile, MyFile ; Select a pdf file
MsgBox, % PdfToText(MyFile)
return

PdfToText(PdfPath) {
    static XpdfPath := """" A_ScriptDir "\xpdfbin-win-3.04\bin32\pdftotext.exe"""
    objShell := ComObjCreate("WScript.Shell")
 
    ;--------- Building CmdString (look in the .txt docs incuded with xpdf):
    ; From the xpdf docs in [ScriptDir]\xpdfbin-win-3.04\doc\pdftotext.txt:
    ;   SYNOPSIS
    ;       pdftotext [options] [PDF-file [text-file]]
    ;   ...
    ;       If text-file is '-', the text is sent to stdout.
    ; Options (Example option. Look in the xpdf docs for more):
    ;   -nopgbrk    Don't insert page breaks (form feed characters)  between  pages.
    ;---------
    CmdString := XpdfPath " -nopgbrk """ PdfPath """ -"
    objExec := objShell.Exec(CmdString)
    while, !objExec.StdOut.AtEndOfStream ; Wait for the program to finish
        strStdOut := objExec.StdOut.ReadAll()
    return strStdOut
}

User avatar
Masonjar13
Posts: 1555
Joined: 20 Jul 2014, 10:16
Location: Не Россия
Contact:

Re: How to search for a string in a .pdf or .docx's contents?

Post by Masonjar13 » 16 Apr 2016, 22:04

I got the file and everything, but I'm not getting any crashes, or returned values. Using A32 of AHK, I get system error 2 (ERROR_FILE_NOT_FOUND). Using U32, there is no system 2 error, but throws a few other errors (requires "Cdecl" in the dllcall and disallows "\ExtractText" to be in the LoadLibrary call). Fixing those errors, there's just not any return value.

Didn't test qwerty12's code; not installing any software.

kon's code works, but also doesn't. The buffer is too small (I assume of stdout) and most of the file gets cut off. I don't remember msgbox ever being self-constraining of text size (normally will go quite far off the screen). Sending it to the clipboard then pasting shows that the entire file is being read in.
OS: Windows 10 Pro | Editor: Notepad++
My Personal Function Library | Old Build - New Build

User avatar
Exaskryz
Posts: 2882
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

Post by Exaskryz » 16 Apr 2016, 22:35

(Edit: Issue may not be resolved. See post after qwerty's and Mason's posts below.)

Thanks SifJar. That was a site I found to see if it was freely available as a link I could share, but being on my school's network it auto-associated my university with the site. I wasn't sure if I could get the PDF because of my school's library or not; seems like it is freely available to the public.

So, what I thought would be important is to see if the file from that download link matches the one that is giving me issues. Turns out, it's different. I didn't get a crash with that file through the link SifJar provided. So I dug around just a bit to compare why things were different. Looking at the file properties, this is what I saw (minus the red censorship, underlining, and circling):

Image

So, the file sizes were different. So they weren't the same file. But then I saw the little security tag. I hit the Unblock button, and tried my code again. This time it didn't crash.

It appears that having a "locked" PDF crashes it. I didn't get any alerts in Adobe Reader that there was any security problem with it - but maybe it remembers my choice from when I first opened the file to OK it and it wouldn't nag me again with a message.

So I may have a way to fix any other PDFs that give me problems. The issue though is still to identify what PDFs might give me that problem. I'd like to distribute my original file that gave issues, but I'm still not sure on the legality. Even if something is technically financially free, the account requirement to get the journal article is probably there for a reason. However, someone who knows enough about PDFs might be able to manually flag the security caution and then do some testing. (Checking the file properties again after I applied changes and exited the properties dialogue, the security section was gone.)

I haven't tried the other suggested methods yet, just because tmplinshi's method seemed the most promising as a one-size-fits-all. It's just lacking an error handling that I can grapple onto - which ultimately may not be necessary, but will make diagnosing easier if I need it because it still takes ~3 minutes to run through my 4500 files to parse them. All in all, I have fixed the one file - I just don't know when I might add other files to my Drive that would crash the script.

I very much appreciate everyone's contributions, and am sorry I hadn't taken the time to test each of them out. Hopefully though with all these options out there, future readers can take advantage of them.

And for the sake of completeness as Masonjar mentioned the ANSI vs Unicode, I'm on Unicode 32-bit 1.1.23.01. (I think it's 32 bit at least.) I didn't get any errors that he encountered.
Last edited by Exaskryz on 17 Apr 2016, 00:15, edited 2 times in total.

qwerty12
Posts: 468
Joined: 04 Mar 2016, 04:33
Contact:

Re: How to search for a string in a .pdf or .docx's contents?

Post by qwerty12 » 16 Apr 2016, 23:46

I tried it with the PDF SifJar linked and I had to include bad workarounds, which my post now contains, but I was able to find the last sentence in the PDF with the PDF IFilter I had installed (SumatraPDF). Saying that, I don't know if it's because (Sumatra)PDF, but spaces around words in some paragraphs of that PDF are missing when MsgBox shows the IFilter's output, so I can't say that it's super reliable, but it does work OK for simple searches and for searches like "sub-set of patients most likely to benefit from a specific drug".
Masonjar13 wrote:Didn't test qwerty12's code; not installing any software.
Unless you use your browser to read PDF files, you probably already have a PDF reader installed that comes with an IFilter plugin. Saying that, I don't think there is a PDF reader lighter than SumatraPDF these days - IFilter aside, I always recommend it to people after a PDF reader (unless they need to fill forms).
Exaskryz wrote: It appears that having a "locked" PDF crashes it. I didn't get any alerts in Adobe Reader that there was any security problem with it - but maybe it remembers my choice from when I first opened the file to OK it and it wouldn't nag me again with a message.

So I may have a way to fix any other PDFs that give me problems. The issue though is still to identify what PDFs might give me that problem.
Checking for blocked files can be done like this (the proper way probably lies in the IAttachmentExecute COM interface):

Code: Select all

FileRead, isBlocked, %pathToFile%:Zone.Identifier:$DATA
if (isBlocked)
	isBlocked := !InStr(isBlocked, "AppZoneId=4")
MsgBox % "File is " (isBlocked ? "blocked" : "not blocked")

; or, hell, just unblock the file from AHK right away:
; FileDelete, %pathToFile%:Zone.Identifier:$DATA

User avatar
Masonjar13
Posts: 1555
Joined: 20 Jul 2014, 10:16
Location: Не Россия
Contact:

Re: How to search for a string in a .pdf or .docx's contents?

Post by Masonjar13 » 17 Apr 2016, 00:04

Exaskryz wrote:(I think it's 32 bit at least.)
The dll is 32-bit and won't run through 64-bit programs, so you're definitely on 32-bit. I'm on 64-bit by default, so I was drag-and-dropping it over both A32 and U32 for testing.

@qwerty12, I do not, in fact, have a PDF viewer. I never work with PDF's, thus no software is installed pertaining to such. But if I ever need one, I'll keep that one in mind.
OS: Windows 10 Pro | Editor: Notepad++
My Personal Function Library | Old Build - New Build

User avatar
Exaskryz
Posts: 2882
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

Post by Exaskryz » 17 Apr 2016, 00:15

I have managed to break the file again. Well, I restored the broken file for the sake of testing qwerty's code with the Zone.Identifer data stuff. Turns out that if you use his FileDelete suggestion, it doesn't do enough to fix the file - after running that, it will still crash the script using the dll. Problem is that after I restored the broken file again (by downloading through my original source, different from SifJar's link), and repeated my same method for clearing the security concern in the file properties dialog, it was still crashing the script.

I may well have misattributed using the Unblock button as fixing the file, and something else I had done fixed it. But unfortunately I don't know what.

qwerty12
Posts: 468
Joined: 04 Mar 2016, 04:33
Contact:

Re: How to search for a string in a .pdf or .docx's contents?

Post by qwerty12 » 17 Apr 2016, 01:18

@Masonjar13: Fair enough :-)

I'm pretty sure that the problem is that xd2txlib simply doesn't support extracting text from that file. I built the C++ example that came with it and even that shows nothing when pointing it ioi140052.pdf. It does work fine, however, with some other PDFs. As I see it, there's three other options (ordered from best to worst):
  • Use kon's PdfToText. This actually worked on that file and kept the formatting intact.
  • Use my IFilter code, which I can attest to also working (but not as well as xpdf). The problem with that is that its effectiveness can vary from system to system as it's dependent on the IFilter in use. The SumatraPDF IFilter doesn't break up some words properly and it can't join lines - with xpdf, a search for "subset" worked, but with the IFilter, I had to search for "sub-set". Although the hyphenation thing may not be a bug - the word appears as "sub-`r`nset" in the actual PDF when I open it in SumatraPDF.
  • Check for errors from xd2tpdf. Here's an ExtractText function that'll return false when xd2tpdf fails:

    Code: Select all

    ExtractText(ByRef result, fileName) {
    	static hModule := DllCall("LoadLibrary", "Str", "xd2txlib.dll", "Ptr")
    	if (!FileExist(fileName))
    		return false
    	if (fileLength := DllCall("xd2txlib\ExtractText", "Str", fileName, "Int", False, "Ptr*", fileText)) {
    		result := StrGet(fileText, fileLength / 2) ;, "UTF-16")
    		return true
    	}
    	return false
    }
Last edited by qwerty12 on 17 Apr 2016, 02:44, edited 1 time in total.

User avatar
Exaskryz
Posts: 2882
Joined: 17 Oct 2015, 20:28

Re: How to search for a string in a .pdf or .docx's contents?

Post by Exaskryz » 17 Apr 2016, 01:26

Unfortunately that third option doesn't seem to be viable, as the script still crashes instead of just returning false. When I find some free time, I'll run some tests to make sure files that should return as results are being returned to verify that this dll would be the simple method, where if I find it breaks in the future, I'll have to do manual diagnoses by editing the script. (Tooltip of the file path during the search to see where it crashes is how I identified this file before). But if it's not returning the results I'd expect, I'll explore kon's option. (And if that isn't quite working, I'll look into the IFilter, and finally into anything else that may have been suggested (I can't recall if Joe's was a different method or not).)

qwerty12
Posts: 468
Joined: 04 Mar 2016, 04:33
Contact:

Re: How to search for a string in a .pdf or .docx's contents?

Post by qwerty12 » 17 Apr 2016, 02:37

Oh, not good. For me, ExtractText was causing crashes when the file didn't exist and when StrGet was attempted when the fileLength returned was 0. The only other thing I can think of would be to use a sacrificial instance of AutoHotkey dedicated to calling ExtractText and exiting with an error code indicating whether a match was found. It's slower, but if it crashes, it won't take down your main script. This isn't very sophisticated:

Code: Select all

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.
#SingleInstance Off

file := "\path\to\.pdf"
phraseToFind := "something here"

if (A_PtrSize != 4) {
	MsgBox Try this script with a 32-bit build of AutoHotkey.
	ExitApp
}
 
if (InStr(DllCall("GetCommandLine", "Str"), " /ExtractText ")) {
	file = %2%
	searchPattern = %3%
	if (file && searchPattern && ExtractText(result, file)) {
		if (InStr(result, searchPattern))
			ExitApp 0
	}
	ExitApp 1
}
 
RunWait, "%A_AhkPath%" "%A_ScriptFullPath%" /ExtractText "%file%" "%phraseToFind%",, UseErrorLevel
if (ErrorLevel == 0)
	MsgBox result found
	
ExtractText(ByRef result, fileName) {
	if (!FileExist(fileName))
		return false
	fileLength := DllCall(A_ScriptDir . "\xd2txlib.dll\ExtractText", "Str", fileName, "Int", False, "Int*", fileText)
	if (fileLength > 0) {
		result := StrGet(fileText, fileLength / 2)
		return result != ""
	}
	return false
}
Joe's method is good, I just tested kon's because it also uses xpdf, was nicely wrapped up and it wasn't writing a temporary file.

blue83
Posts: 157
Joined: 11 Apr 2018, 06:38

Re: How to search for a string in a .pdf or .docx's contents?

Post by blue83 » 21 Jan 2019, 15:24

Hi,

I have one excel file that constantly crashes script when it reach result := Strget section in this function.

Other excel files are read properly.

Does anyone knows how to just bypass that file and give me an error variable of that error?

Guest078
Posts: 23
Joined: 28 Aug 2016, 13:44

Re: How to search for a string in a .pdf or .docx's contents?

Post by Guest078 » 30 Mar 2019, 13:21

Hi!

Regarding the code from here: https://www.autohotkey.com/boards/viewtopic.php?p=80744#p80744

In its current state it doesn't work with the iFilters that are provided by offfilt.dll (which comes with later versions of Windows by default),
so files like .doc, .xls, .ppt, etc. aren't processed as they should.

If this line
if (DllCall(NumGet(NumGet(IFilter+0)+3*A_PtrSize), "Ptr", IFilter, "UInt", IFILTER_INIT_DISABLE_EMBEDDED | IFILTER_INIT_INDEXING_ONLY, "Int64", 0, "Ptr", 0, "Int64*", status) != 0 )
is replaced by:
if (DllCall(NumGet(NumGet(IFilter+0)+3*A_PtrSize), "Ptr", IFilter, "UInt", 0, "Int64", 0, "Ptr", 0, "Int64*", status) != 0 )

then these iFilters work, but the while loops will only go through one iteration and stop afterwards so that not the whole document is processed but only the first chunk.

Does anybody know how to modify the code to process the full document again?

hasantr
Posts: 933
Joined: 05 Apr 2016, 14:18
Location: İstanbul

Re: How to search for a string in a .pdf or .docx's contents?

Post by hasantr » 21 Sep 2020, 18:04

Guest078 wrote:
30 Mar 2019, 13:21
Hi!

Regarding the code from here: https://www.autohotkey.com/boards/viewtopic.php?p=80744#p80744

In its current state it doesn't work with the iFilters that are provided by offfilt.dll (which comes with later versions of Windows by default),
so files like .doc, .xls, .ppt, etc. aren't processed as they should.

If this line
if (DllCall(NumGet(NumGet(IFilter+0)+3*A_PtrSize), "Ptr", IFilter, "UInt", IFILTER_INIT_DISABLE_EMBEDDED | IFILTER_INIT_INDEXING_ONLY, "Int64", 0, "Ptr", 0, "Int64*", status) != 0 )
is replaced by:
if (DllCall(NumGet(NumGet(IFilter+0)+3*A_PtrSize), "Ptr", IFilter, "UInt", 0, "Int64", 0, "Ptr", 0, "Int64*", status) != 0 )

then these iFilters work, but the while loops will only go through one iteration and stop afterwards so that not the whole document is processed but only the first chunk.

Does anybody know how to modify the code to process the full document again?
Microsoft shelved the filter. This method works.
https://www.autohotkey.com/boards/viewtopic.php?p=80705#p80705
You need to have this dll file.
http://ebstudio.info/download/KWICFinder/xd2tx220_x64.zip

hasantr
Posts: 933
Joined: 05 Apr 2016, 14:18
Location: İstanbul

Re: Bir .pdf ve .docx bir dize ne?

Post by hasantr » 06 Oct 2020, 13:08

Deleted...

hasantr
Posts: 933
Joined: 05 Apr 2016, 14:18
Location: İstanbul

Re: How to search for a string in a .pdf or .docx's contents?

Post by hasantr » 23 Nov 2020, 05:00

qwerty12 wrote:
16 Apr 2016, 23:46
I tried it with the PDF SifJar linked and I had to include bad workarounds, which my post now contains, but I was able to find the last sentence in the PDF with the PDF IFilter I had installed (SumatraPDF). Saying that, I don't know if it's because (Sumatra)PDF, but spaces around words in some paragraphs of that PDF are missing when MsgBox shows the IFilter's output, so I can't say that it's super reliable, but it does work OK for simple searches and for searches like "sub-set of patients most likely to benefit from a specific drug".
Masonjar13 wrote:Didn't test qwerty12's code; not installing any software.
Unless you use your browser to read PDF files, you probably already have a PDF reader installed that comes with an IFilter plugin. Saying that, I don't think there is a PDF reader lighter than SumatraPDF these days - IFilter aside, I always recommend it to people after a PDF reader (unless they need to fill forms).
Exaskryz wrote: It appears that having a "locked" PDF crashes it. I didn't get any alerts in Adobe Reader that there was any security problem with it - but maybe it remembers my choice from when I first opened the file to OK it and it wouldn't nag me again with a message.

So I may have a way to fix any other PDFs that give me problems. The issue though is still to identify what PDFs might give me that problem.
Checking for blocked files can be done like this (the proper way probably lies in the IAttachmentExecute COM interface):

Code: Select all

FileRead, isBlocked, %pathToFile%:Zone.Identifier:$DATA
if (isBlocked)
	isBlocked := !InStr(isBlocked, "AppZoneId=4")
MsgBox % "File is " (isBlocked ? "blocked" : "not blocked")

; or, hell, just unblock the file from AHK right away:
; FileDelete, %pathToFile%:Zone.Identifier:$DATA
how do i run it on such a path?

Code: Select all

\\LocalNet\LocalFolder\LocalBlockingFile.pdf

lucianobarg
Posts: 1
Joined: 03 Jun 2022, 12:36

Re: How to search for a string in a .pdf or .docx's contents?

Post by lucianobarg » 01 Jul 2022, 11:17

Many thanks, Joe! Your suggestion worked like a charm for my needs.
Love AHK (and its community) every day more.
Regards!


Luciano

JoeWinograd wrote:
14 Apr 2016, 13:45
Hi Exaskryz,

Following up on the prior recommendations of the Xpdf utilities, I have used them in many AHK scripts:
http://www.foolabs.com/xpdf/

It is a set of nine command line executables. The one you'll need is pdftotext.exe, which converts PDF files to plain text. There are five different options for creating the text:

-layout
-lineprinter
-raw
-table
<null>, which is the default

I find that -layout usually works best for me, but you should experiment with your particular PDFs to see which output option creates files that work best in your script. In some scripts, I try multiple options (usually, -layout, -raw, and the default of <null>) and then check to see if one of them found the result I'm looking for, such as an "Identifier" string (Customer Name, Account Number, Date of Birth, etc.).

The call in most of my AHK scripts is along these lines:

Code: Select all

RunWait,%PDFtoTextEXE% -f %FirstPage% -l %LastPage% %OutputFormatOption% "%SourceFileName%" "%DestFileName%",,Hide
No installation is needed for the Xpdf tools — all of the executables are stand-alone and self-contained. You'll see in the downloaded package that there are both 32-bit and 64-bit versions. I wrote the developer this question:
The Xpdf Windows binaries come in 32-bit and 64-bit versions. I just tested the 32-bit versions of [pdfinfo.exe] and [pdftotext.exe] on 64-bit W7 and they worked fine. I assume there's a reason for having a 64-bit version of your binaries, but since the 32-bit version works fine on a 64-bit system, why would I need the 64-bit version?
Here is his answer:
There's not really any reason to use the 64-bit binaries. For the rasterizer (pdftoppm, and also used in pdftops), there may be some cases where it needs to allocate large chunks of memory. But for pdfinfo and pdftotext, I don't think you'll run into that.
In terms of licensing and cost, Xpdf is open source, licensed under the GNU General Public License (GPL) V2, with no cost stated at the website for non-commercial use. For commercial licensing, the Xpdf site says to see their parent company's site, Glyph & Cog.

As kon stated, PDFtoText works only if the PDF contains text, such as in a PDF Normal file or a PDF Searchable Image file (a PDF file from scanning that has both the scanned image and text from the OCR process). In other words, it won't work on image-only PDFs. In that case, you'll need to perform OCR on the files to create the text.

I've never tried to extract the text from DOC or DOCX files, so if you get that to work, please post back here how you did it. If you can't get it work, one way to do it is to "print" the Word file to a command line PDF print driver; another way is to use a command line tool like OfficeToPDF. These approaches will create a PDF Normal file (i.e., with text), which can then be fed to pdftotext.exe. Regards, Joe

User avatar
JoeWinograd
Posts: 2198
Joined: 10 Feb 2014, 20:00
Location: U.S. Central Time Zone

Re: How to search for a string in a .pdf or .docx's contents?

Post by JoeWinograd » 01 Jul 2022, 12:54

lucianobarg wrote:Many thanks, Joe! Your suggestion worked like a charm for my needs.
Hi Luciano,

I see that this is your first post here, so let me start with...Welcome Aboard!

You're very welcome...and thanks back at you for letting us know that it worked like a charm for you. It's always helpful to know what does and doesn't work for folks.

Regards, Joe

Argimko
Posts: 3
Joined: 07 Mar 2016, 06:22

Re: How to search for a string in a .pdf or .docx's contents?

Post by Argimko » 28 Nov 2022, 10:25

Hi, I have update qwerty12 code to extract text via IFilter using only AHK code. Also, added ability to extract text from OneNote files: for some reason iFilter OneNote 2016/2021 is not associated with the .one extension.

Source: https://github.com/Argimko/ArTools/blob/master/ExtractIFilterText.ahk

Code: Select all

; IFilter AutoHotkey example by qwerty12
; 
; Credits:
;   https://tlzprgmr.wordpress.com/2008/02/02/using-the-ifilter-interface-to-extract-text-from-documents/
;   https://stackoverflow.com/questions/7177953/loadifilter-fails-on-all-pdfs-but-mss-filtdump-exe-doesnt
;   https://forums.adobe.com/thread/1086426?start=0&tstart=0
;   https://www.autohotkey.com/boards/viewtopic.php?p=80744#p80744
; 
; History:
;   [2018-11-06] https://github.com/qwerty12/AutoHotkeyScripts/blob/master/IFilterPDF/ifilter.ahk
;   [2022-10-30] https://github.com/Ixiko/Addendum-fuer-Albis-on-Windows/blob/master/include/Addendum_PdfHelper.ahk
;   [2022-11-28] https://github.com/Argimko/ArTools/blob/master/ExtractIFilterText.ahk
; 
; Usage examples:
;   ExtractIFilterText(srcPath, dstPath)
;   MsgBox % ExtractIFilterText(srcPath)
;   ExtractIFilterText.ahk "{In}" "{Out}"
;   ExtractIFilterText.ahk ".one"           - Write to the registry OneNote workaround values then exit
;   
; Using with Total Commander MultiArc MVV plugin:
;       wincmd.ini -> [PackerPlugins]:
;           one=320,%COMMANDER_PATH%\PLUGINS\PACKER\MultiArc\MultiArc.wcx
;       MultiArc.ini -> [OneNote]:
;           ID=E4 52 5C 7B 8C D8 A7 4D AE B1 53 78 D0 29 96 D3
;           Extension=one
;           Description=IFilter for OneNote
;           Archiver=C:\AutoHotkey\AutoHotkeyU64.exe \"%COMMANDER_PATH%\Tools\ExtractIFilterText.ahk\" --ext=one
;           Format0=yyyy-tt-dd hh:mm:ss z+ p+ n++
;           BatchUnpack=1
;           List=%PA --list %AQA %O
;           ExtractWithPath=%PA %AQA %FQA

#Warn
#NoEnv
#NoTrayIcon
#SingleInstance Off
SetBatchLines -1

listFile := False
srcPath := dstPath := extForce := ""

For n, param In A_Args {
    ; case-insensetive
    If (param = "--list") {
        listFile := True
    }
    Else If (SubStr(param, 1, 6) = "--ext=") {
        extForce := SubStr(param, 7)
    }
    Else If (srcPath == "") {
        srcPath := param
    }
    Else If (dstPath == "") {
        dstPath := param
    }
}

If (srcPath == "")
    FileSelectFile srcPath, 1,, Source file selecting

If (dstPath == "") {
    SplitPath srcPath,, dir,, filename
    dstPath := dir "\" filename " (text).txt"
}

ExtractIFilterText(srcPath, dstPath, extForce, listFile)

; MsgBox % ExtractIFilterText(srcPath)


ExtractIFilterText(srcPath, dstPath := "", extForce := "", listFile := False, showFilterClsid := False) {
    local

    static STGM_READ                            := 0
    static CHUNK_TEXT                           := 1

    static CHUNK_NO_BREAK                       := 0
    static CHUNK_EOW                            := 1    ; word break
    static CHUNK_EOS                            := 2    ; sentence break
    static CHUNK_EOP                            := 3    ; paragraph break
    static CHUNK_EOC                            := 4    ; chapter break

    static IFILTER_INIT_CANON_PARAGRAPHS        := 1
    static IFILTER_INIT_HARD_LINE_BREAKS        := 2
    static IFILTER_INIT_CANON_HYPHENS           := 4
    static IFILTER_INIT_CANON_SPACES            := 8
    static IFILTER_INIT_APPLY_INDEX_ATTRIBUTES  := 16
    static IFILTER_INIT_APPLY_OTHER_ATTRIBUTES  := 32
    static IFILTER_INIT_INDEXING_ONLY           := 64
    static IFILTER_INIT_SEARCH_LINKS            := 128
    static IFILTER_INIT_APPLY_CRAWL_ATTRIBUTES  := 256
    static IFILTER_INIT_FILTER_OWNED_VALUE_OK   := 512
    static IFILTER_INIT_FILTER_AGGRESSIVE_BREAK := 1024
    static IFILTER_INIT_DISABLE_EMBEDDED        := 2048
    static IFILTER_INIT_EMIT_FORMATTING         := 4096

    static FILTER_S_LAST_TEXT                   := 0x41709
    static FILTER_E_END_OF_CHUNKS               := 0x80041700
    static FILTER_E_NO_MORE_TEXT                := 0x80041701

    static WS_DISABLED                          := 0x8000000

    If (!A_IsUnicode)
        Throw A_ThisFunc ": The IFilter APIs appear to be Unicode only, please, try again with a Unicode build of AHK"

    If (!srcPath)
        Return -9
    
    SplitPath srcPath,,, ext, filename
    If (extForce != "")
        ext := extForce

    runIFilter := True
    If (listFile) {
        dstPath := dstText := ""

        ; Save time and do not make unnecessary IFilter calls when unpacking via Total Commander MultiArc plugin:
        ; Total Commander has progress dialog popup when unpacking, so we do not count filesize while list files due unpacking
        WinGet tcStyle, Style, ahk_class TTOTAL_CMD
        runIFilter := !(tcStyle & WS_DISABLED)
    }

    If (runIFilter) {
        ; Adobe workaround
        filterJob := 0
        If (ext = "pdf" && filterJob := DllCall("CreateJobObject", Ptr,0, Str, "filterProc", Ptr))   ; case-insensetive
            DllCall("AssignProcessToJobObject", Ptr,filterJob, Ptr,DllCall("GetCurrentProcess", Ptr))

        ; OneNote workaround - for some reason iFilter OneNote 2016/2021 is not associated with the .one extension, so we need to edit the registry
        ;   - The value {E772CEB3-E203-4828-ADF1-765713D981B8} can be changed to any other value, for example, 1 more than the original OneNote filter
        ;   - The value {B8D12492-CE0F-40AD-83EA-099A03D493F1} must be strictly so, it cannot be changed - taken from [MS Office Filter Packs](https://www.microsoft.com/en-us/download/details.aspx?id=17062)
        ;   - Useful Nir Sofer apps to debug IFilters : SearchFilterView and DocumentTextExtractor
        If (ext = "one") {   ; case-insensetive
            RegRead persistentHandler, HKCR\.one\PersistentHandler
            If (ErrorLevel) {
                RegRead dllPath, HKCR\CLSID\{6EE84065-8BA3-4A8A-9542-6EC8B56A3378}\InprocServer32
                If (!ErrorLevel) {
                    RegRead threadingModel, HKCR\CLSID\{6EE84065-8BA3-4A8A-9542-6EC8B56A3378}\InprocServer32, ThreadingModel

                    RegWrite REG_SZ, HKCU\Software\Classes\.one\PersistentHandler,, {E772CEB3-E203-4828-ADF1-765713D981B8}
                    RegWrite REG_SZ, HKCU\Software\Classes\CLSID\{E772CEB3-E203-4828-ADF1-765713D981B8}\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF},, {B8D12492-CE0F-40AD-83EA-099A03D493F1}
                    RegWrite REG_SZ, HKCU\Software\Classes\CLSID\{B8D12492-CE0F-40AD-83EA-099A03D493F1},, Microsoft OneNote Indexing Filter
                    RegWrite REG_SZ, HKCU\Software\Classes\CLSID\{B8D12492-CE0F-40AD-83EA-099A03D493F1}\InprocServer32,, % dllPath
                    RegWrite REG_SZ, HKCU\Software\Classes\CLSID\{B8D12492-CE0F-40AD-83EA-099A03D493F1}\InprocServer32, ThreadingModel, % threadingModel
                }
            }

            If (srcPath = ".one")   ; case-insensetive
                Return 1
        }

        VarSetCapacity(FILTERED_DATA_SOURCES, 4*A_PtrSize, 0)
        NumPut(&ext,  FILTERED_DATA_SOURCES, 0,         "Ptr")
        VarSetCapacity(filterClsid, 16, 0)

        filterRegistration := ComObjCreate("{9E175B8D-F52A-11D8-B9A5-505054503030}", "{C7310722-AC80-11D1-8DF3-00C04FB6EF4F}")

        ; ILoadFilter::LoadIFilter
        If (DllCall(NumGet(NumGet(filterRegistration+0)+3*A_PtrSize), Ptr,filterRegistration, Ptr,0, Ptr,&FILTERED_DATA_SOURCES, Ptr,0, Int,False, Ptr,&filterClsid, Ptr,0, PtrP,0, PtrP,iFilter:=0))
            Throw A_ThisFunc ": cannot load IFilter for:`n`n""" srcPath """"

        If (showFilterClsid)
            MsgBox,, IFilter CLSID, % GuidToString(filterClsid)

        ObjRelease(filterRegistration)

        If (DllCall("shlwapi\SHCreateStreamOnFile", Str, srcPath, UInt,STGM_READ, PtrP,iStream:=0))
            Throw A_ThisFunc ": cannot open input file:`n`n""" srcPath """"

        ; IPersistStream::Load
        persistStream := ComObjQuery(iFilter, "{00000109-0000-0000-C000-000000000046}")
        If (DllCall(NumGet(NumGet(persistStream+0)+5*A_PtrSize), Ptr,persistStream, Ptr,iStream))
            Throw A_ThisFunc ": cannot load file stream"

        ObjRelease(iStream)

        flags := IFILTER_INIT_HARD_LINE_BREAKS | IFILTER_INIT_CANON_HYPHENS | IFILTER_INIT_CANON_SPACES
               | IFILTER_INIT_INDEXING_ONLY             ; performance optimization
               | IFILTER_INIT_APPLY_INDEX_ATTRIBUTES    ; allow to process Office 2003 file formats like .doc, .xls with offFilt.dll v2008

        ; IFilter::Init
        If (DllCall(NumGet(NumGet(iFilter+0)+3*A_PtrSize), Ptr,iFilter, UInt,flags, Int64, 0, Ptr,0, Int64P,0))
            Throw A_ThisFunc ": cannot init IFilter for:`n`n""" srcPath """"
        
        prevBreakType := -1
        bufferSize := 32*1024
        VarSetCapacity(STAT_CHUNK, A_PtrSize == 8 ? 64 : 52)
        VarSetCapacity(buf, bufferSize * 2 + 2)

        If (dstPath != "") {
            dstFile := FileOpen(dstPath, "w", "UTF-8")
            If (!IsObject(dstFile))
                Throw A_ThisFunc ": cannot write text to destination file (error: " A_LastError  ")"
        }
        Else
            VarSetCapacity(dstText, bufferSize * 8)

        ; IFilter::GetChunk
        While (DllCall(NumGet(NumGet(iFilter+0)+4*A_PtrSize), Ptr,iFilter, Ptr,&STAT_CHUNK) & 0xFFFFFFFF != FILTER_E_END_OF_CHUNKS) {
            If (NumGet(STAT_CHUNK, 8, "UInt") & CHUNK_TEXT) {

                breakType := NumGet(STAT_CHUNK, 4, "UInt")
                Switch breakType {
                    Case CHUNK_NO_BREAK : breaks := prevBreakType == CHUNK_EOS ? " " : ""   ; case of Word document hyperlinks
                    Case CHUNK_EOW      : breaks := "`r`n"
                    Case CHUNK_EOP      : breaks := "`r`n`r`n"
                    Case CHUNK_EOC      : breaks := "`r`n`r`n`r`n`r`n`r`n"
                    Default             : breaks := ""
                }

                Loop {
                    ; IFilter::GetText
                    result := DllCall(NumGet(NumGet(iFilter+0)+5*A_PtrSize), Ptr,iFilter, Int64P,length:=bufferSize, Ptr,&buf) & 0xFFFFFFFF
                    If (result == FILTER_E_NO_MORE_TEXT)
                        Break

                    If (prevBreakType == -1 || A_Index == 2)
                        breaks := ""

                    txt := breaks . StrGet(&buf, length, "UTF-16")

                    ; Workarounds for Office 2003 Word documents processed by offFilt.dll v2008:
                    ;   - replace soft-returns vertical-tabs with hard return
                    ;   - replace table new row double-tabs with hard return
                    If (ext = "doc" || ext = "dot")   ; case-insensetive
                        txt := RegExReplace(txt, "`v|`t`t", "`r`n")
                    
                    dstPath != "" ? dstFile.Write(txt) : dstText .= txt

                    If (result == FILTER_S_LAST_TEXT)
                        Break
                }

                prevBreakType := breakType
            }
        }

        ObjRelease(persistStream)
        ObjRelease(iFilter)

        If (filterJob)
            DllCall("CloseHandle", Ptr,filterJob)
    }

    If (listFile) {
        FileGetTime timestamp, % srcPath
        FormatTime timestamp, % timestamp, yyyy-MM-dd HH:mm:ss
        size := dstText != "" ? StrPut(dstText, "UTF-8") + 2 : 0

        FileAppend %timestamp% %size% %size% %filename% (text).txt, *, cp0
    }
    Else If (dstPath) {
        ; Remove byte order mark if the file has no content
        If (dstFile.Length == 3)
            dstFile.Length := 0

        dstFile.Close()
        
        FileGetTime timestamp, % srcPath
        FileSetTime timestamp, % dstPath
    }
    Else
        Return dstText
}

GuidToString(ByRef pGUID) {
    VarSetCapacity(string, 78)
    DllCall("ole32\StringFromGUID2", Ptr,&pGUID, Str,string, Int,39)
    return string
}

GuidFromString(ByRef GUID, sGUID) {
    VarSetCapacity(GUID, 16, 0)
    DllCall("ole32\CLSIDFromString", Str,sGUID, Ptr,&GUID)
}

Post Reply

Return to “Ask for Help (v1)”