[request] indexing all words in a word file

smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

28 Nov 2013, 09:31

Kindly move this to the appropriate forum if this is not it.

Request: as the heading suggests, I have some documents in MS Word 2010. Is it possible to get an index-style list of all the words used in a Word file?

I asked the same on IRC and got awesome help from fluffums, but that solution searches for one word at a time, and you have to change the code manually for each new word, which would take a hell of a lot of time, maybe some days :-p
Anyway,

the result should look like this:
a -> 1-10, 2-20, 3-100 ... and so on
are -> 1-7, 3-10 ... and so on

That is, the word "a" is used 10 times on page 1, 20 times on page 2, 100 times on page 3, and so on.

Both words, "a" and "are", are taken from the document itself.

The resulting index should be in alphabetical order.

Many, many thanks in advance for your time and insight, as this would be difficult for most people.
John ... you working ?
kon
Posts: 1756
Joined: 29 Sep 2013, 17:11

Re: [request] indexing all words in a word file

28 Nov 2013, 12:48

I'm not sure how to search each page individually; perhaps that is something that could be done with COM.
To count the occurrences of each word, I would parse the text into an associative array, using the words as keys and the number of occurrences as values:

Code: Select all

Test := "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam lectus. "
	  . "Sed sit amet ipsum mauris. Maecenas congue ligula ac quam viverra nec "
	  . "consectetur ante hendrerit. Donec et mollis dolor. Praesent et diam eget "
	  . "libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut "
	  . "porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a "
	  . "non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ut "
	  . "gravida lorem. Ut turpis felis, pulvinar a semper sed, adipiscing id dolor. "
	  . "Pellentesque auctor nisi id magna consequat sagittis. Curabitur dapibus enim "
	  . "sit amet elit pharetra tincidunt feugiat nisl imperdiet. Ut convallis libero "
	  . "in urna ultrices accumsan. Donec sed odio eros. Donec viverra mi quis quam "
	  . "pulvinar at malesuada arcu rhoncus. Cum sociis natoque penatibus et magnis "
	  . "dis parturient montes, nascetur ridiculus mus. In rutrum accumsan ultricies. "
	  . "Mauris vitae nisi at sem facilisis semper ac in est."
	  
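Words := {}

; Split on whitespace and strip punctuation from both ends of each field.
Loop, Parse, Test, %A_Space%`t`r`n, `,.;:'`"!?/<>[]{}\|()*&^`%$#@!
{
	if (A_LoopField = "")	; consecutive delimiters produce empty fields
		continue
	Words[A_LoopField] := Words[A_LoopField] ? Words[A_LoopField] + 1 : 1
}

; Keys are enumerated alphabetically; object keys are case-insensitive.
Result := ""
for key, val in Words
	Result .= key " -> " val "`n"

MsgBox, % Result
return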
I made a word count function a while ago but didn't release it, because it can still be fooled depending on how the text is formatted. Maybe you will find it useful anyway:

Code: Select all

/*
	WordCount example:
		Highlight text and hit Alt+C to display the WordCount values
*/


!c::
ClipSave := ClipboardAll
Clipboard := ""
Send, ^c
ClipWait, 0.2
Text := Clipboard
Clipboard := ClipSave
if (!ErrorLevel)
{
	r := WordCount(Text)
	MsgBox, % 	"Word count = " r.WordCount
			.	"`nCharacter count not including spaces = " r.CharsNSP
			.	"`nCharacter count including spaces = "  r.CharCount 
			.	"`nSentence count = " r.Sentences 
			.	"`nParagraph count = " r.Paragraphs
			.	"`nNon-blank line count = " r.NonBlankLines
			.	"`nTotal line count = " r.TotalLines
			.	"`nAverage word length = " r.AvgWordLength
			.	"`nAverage words per sentence = " r.AvgSentWords
			.	"`nAverage characters per sentence not including spaces = " r.AvgSentCharsNSP
			.	"`nAverage characters per sentence including spaces = " r.AvgSentChars
			.	"`nAverage words per paragraph = "	r.AvgWordsPerPar
			.	"`nAverage characters per paragraph not including spaces = " r.AvgCharsPerParNSP
			.	"`nAverage characters per paragraph including spaces = " r.AvgCharsPerPar
}
return


/*
	Function: WordCount 
		Returns an object with the following properties:
	
		Result.WordCount				Word count
		Result.CharsNSP					Character count not including spaces
		Result.CharCount				Character count including spaces
		Result.Sentences				Sentence count
		Result.Paragraphs				Paragraph count
		Result.NonBlankLines			Non blank lines count
		Result.TotalLines				Total lines count
		Result.AvgWordLength			Average word length
		Result.AvgSentWords				Average words per sentence
		Result.AvgSentCharsNSP			Average characters per sentence not including spaces
		Result.AvgSentChars				Average characters per sentence including spaces
		Result.AvgWordsPerPar			Average words per paragraph
		Result.AvgCharsPerParNSP		Average characters per paragraph not including spaces
		Result.AvgCharsPerPar			Average characters per paragraph including spaces
*/

WordCount(Text)
{
	Result := {}
	
	RegExReplace(Text, "\b\w+\b", "", x)
	Result.WordCount := x
	
	RegExReplace(Text, "[^\s]", "", y)
	Result.CharsNSP := y
	
	Result.CharCount := z := StrLen(Text)
	
	RegExReplace(Text, "U).+(?=[!\.\?]\s|.$)", "", s)
	Result.Sentences := s

	RegExReplace(Text, "U).+\R|.$", "", p)
	Result.Paragraphs := p
	
	RegExReplace(Text, "Um)^.+$", "", n)
	Result.NonBlankLines := n
	
	RegExReplace(Text, "Um)^.*$", "", t)
	Result.TotalLines := t
	
	Result.AvgWordLength := y // x
	Result.AvgSentWords := x // s
	Result.AvgSentCharsNSP := y // s
	Result.AvgSentChars := z // s
	Result.AvgWordsPerPar := x // p
	Result.AvgCharsPerParNSP := y // p
	Result.AvgCharsPerPar := z // p
	
	return, Result
}
smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

Re: [request] indexing all words in a word file

28 Nov 2013, 22:22

@k0n
Thanks, but I need the script to work on hundreds of pages.

Hopefully someone could help me out.
smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

Re: [request] indexing all words in a word file

29 Nov 2013, 11:24

not possible?
Blackholyman
Posts: 1281
Joined: 29 Sep 2013, 22:57
Facebook: socialjsz
Google: +Jszapp
Location: Denmark

Re: [request] indexing all words in a word file

29 Nov 2013, 13:22

It is possible, but if an example like the one by kon isn't what you are after, then it seems you need a fully functioning script, and that can take more time than just anyone is willing to spend...
Blackholyman
Posts: 1281
Joined: 29 Sep 2013, 22:57

Re: [request] indexing all words in a word file

29 Nov 2013, 15:41

okay here is a stab at it anyway from me ;)

Code: Select all

MyDocument := {}
Words := {}
FileSelectFile, path
if (ErrorLevel)	; the dialog was cancelled, so there is nothing to open
	ExitApp


oWord := ComObjCreate("Word.Application")
;~ oWord.Visible := true
oWord.Documents.open(path)


Source := oWord.ActiveDocument 

Pages := Source.ActiveWindow.panes(1).pages.count
Counter = 0 
Clipboard := ""
While (Counter < Pages)
{
   Counter := Counter + 1 
   DocName := "Page" . Counter
   Source.Bookmarks("\Page").Range.Cut 
   ClipWait, 1
   MyDocument[DocName] := clipboard
   Clipboard := ""
}
Source.saved := true
Source.close()
oWord.quit()

; Count how often each word occurs on each page: words[PageName, Word] := Count
for key, val in MyDocument
{
    words[key] := {}
    pos := 1
    match := ""
    while pos := RegExMatch(val, "\b\w+\b", match, pos + StrLen(match))
    {
        if words[key].HasKey(match)
            words[key, match] := words[key, match] + 1
        else
            words[key, match] := 1
    }
}

for page, val in words
    for c, times in val
        list .= "the word '" c "' was found " times " times on " page "`n"
    
msgbox % list
Hope that's more in line with what you need :D
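
The list above is grouped by page, while the original request asked for one line per word in the form "word -> page-count, page-count". A minimal sketch of that regrouping follows, assuming the same `words[PageName, Word] := Count` layout with page names like "Page1". `BuildIndex` and the sample data are hypothetical and only illustrate the idea; they are not part of the script above.

```autohotkey
; Hypothetical sample data in the same shape as the `words` object:
; words[PageName, Word] := Count
words := {}
words["Page1"] := {a: 10, are: 7}
words["Page2"] := {a: 20}
words["Page3"] := {a: 100, are: 10}

MsgBox % BuildIndex(words)

; BuildIndex (hypothetical helper) regroups page -> word -> count
; into one "word -> page-count, page-count" line per word.
BuildIndex(words)
{
    index := {}
    for page, counts in words
    {
        ; "Page12" -> 12. Integer keys are enumerated in numeric order,
        ; so page 10 no longer sorts before page 2.
        pageNo := SubStr(page, 5) + 0
        for word, times in counts
            index[word, pageNo] := times
    }

    list := ""
    for word, pages in index    ; string keys come out alphabetically
    {
        line := ""
        for pageNo, times in pages
            line .= (line = "" ? "" : ", ") . pageNo . "-" . times
        list .= word . " -> " . line . "`n"
    }
    return list
}
```

With the sample data this should display "a -> 1-10, 2-20, 3-100" followed by "are -> 1-7, 3-10". Note that object keys in AutoHotkey v1 are case-insensitive, so "The" and "the" end up on the same line.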
smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

Re: [request] indexing all words in a word file

29 Nov 2013, 21:37

@Blackholyman
WOW! Sheer bliss it is to see the result.
Verified the result for two or three words; seems RIGHTIO :)
Thanks again, man, for your TIME+INTELLIGENCE.


If I find any loopholes, I shall tell you.

That thing cannot be done so easily, man!!

RESPECT!
Last edited by smorgasbord on 29 Nov 2013, 23:50, edited 1 time in total.
smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

Re: [request] indexing all words in a word file

29 Nov 2013, 22:00

@Blackholyman

wait wait wait!!
it is working for all the pages. i need to check again

:)
smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

Re: [request] indexing all words in a word file

29 Nov 2013, 23:55

One page count gets shifted somewhere in the middle
:(

Is it because my Word file has tables in it?
I forgot to mention that part, my mistake.
Sorry.
Blackholyman
Posts: 1281
Joined: 29 Sep 2013, 22:57

Re: [request] indexing all words in a word file

30 Nov 2013, 17:58

Word is not page layout software. It's a word processor. It sees text as a scroll. Each document is one long scroll of text.

Word barely knows what a page is.

Word paginates a document by constantly talking to the current printer driver. It uses information from the printer driver to know where to chop up its precious scroll if it were required to force it on to individual bits of paper.

If you change the printer driver, so that the new one can fit just a tiny bit more or less text on the page than the previous driver, then all the pagination will change.

Where a page starts and ends is constantly changing as the user adds or deletes content and as the user changes how the document is viewed.

As one demonstration of how fluid is the concept of a 'page', try doing Alt-F9. It toggles between displaying fields and displaying field results. Try it in a document with a substantial table of contents, or several linked spreadsheet tables from Excel, or a couple of large linked images, or some other fields that generate content that takes up a lot of space. The number of pages in the document, and where each starts and stops, can change dramatically.

'But Word can count the number of pages. It must be able to identify an individual page!'
Yes, Word can count the number of pages in a document. Use something like:

Code: Select all

ActiveDocument.Range.ComputeStatistics(wdStatisticPages)
There's no way to get from the ComputeStatistics method to an individual page.
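
From AutoHotkey, the same page count could be requested over COM roughly as sketched below. This assumes Word is already running with a document open. Word's named constants are not defined in AutoHotkey, so the numeric value of `wdStatisticPages` (2) is assigned by hand.

```autohotkey
; Word's VBA constants are not available in AutoHotkey,
; so the value of wdStatisticPages is set manually.
wdStatisticPages := 2

; Assumption: Word is already running and has a document open.
oWord := ComObjActive("Word.Application")

; Returns the number of pages under the current pagination only.
pageCount := oWord.ActiveDocument.Range.ComputeStatistics(wdStatisticPages)
MsgBox % "Pages: " pageCount
```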

'But Word has a Pages collection. It must be able to identify an individual page!'
You can do something like the following:

Code: Select all

ActiveWindow.Panes(1).Pages(1).Rectangles(1).Range.Select
"That works!", you say.

Yes, it appears to work the first time you try it. But it works in trivial circumstances only. It gets flummoxed by a table or a field that crosses a page boundary.

Here are some examples of problems.

If you have a table that starts on page 16 but a row in the table flows over onto page 17, then

Code: Select all

ActiveWindow.Panes(1).Pages(16).Rectangles(1).Range.Select
will select page 16 and that part of the row that appears on page 17.

If a table is very big, starts on page 20 and ends on page 44, and there are rows that break across pages, then

Code: Select all

ActiveWindow.Panes(1).Pages(36).Rectangles(1).Range.Select
will select all the way from page 20 to page 44!

If you have, say, a three-page table of contents starting on page 1, then

ActiveWindow.Panes(1).Pages(2).Rectangles(1).Range.Select
will select pages 1, 2 and 3.

If your aim was to cycle through each page and perform some kind of processing on each page, then your code would have processed each page many times (in this example, up to 24 times).

Cycling through the Pages collection to process each page is not a usable pattern in most professional work.

So I'm not sure how to help you out any more if you really need to use the page count.
ahk7
Posts: 206
Joined: 06 Nov 2013, 16:35

Re: [request] indexing all words in a word file

30 Nov 2013, 18:10

If you convert it to a PDF, you can then extract the text with a tool such as pdftotext (part of Xpdf), which includes a form feed character between pages. That lets you actually split, find and use each page: you have the text per page, and you can count the words from there.
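
A minimal sketch of that idea, assuming the extracted text is already in a variable and the pages are separated by a form feed character, `Chr(12)`. `CountWordsPerPage` and the sample text are hypothetical; reading the real text file (for example with FileRead) is left out.

```autohotkey
; Hypothetical sample text: two "pages" separated by a form feed, Chr(12).
Text := "a cat and a dog" . Chr(12) . "a bird"

list := ""
for pageNo, counts in CountWordsPerPage(Text)
    for word, times in counts
        list .= "page " . pageNo . ": " . word . " -> " . times . "`n"
MsgBox % list

; CountWordsPerPage (hypothetical helper) returns an object of the form
; result[PageNumber, Word] := Count.
CountWordsPerPage(Text)
{
    result := {}
    ; StrSplit numbers the pieces 1, 2, 3, ... in document order.
    for pageNo, pageText in StrSplit(Text, Chr(12))
    {
        result[pageNo] := {}
        pos := 1
        while pos := RegExMatch(pageText, "\b\w+\b", match, pos)
        {
            result[pageNo, match] := result[pageNo].HasKey(match) ? result[pageNo, match] + 1 : 1
            pos += StrLen(match)    ; continue after the word just found
        }
    }
    return result
}
```

If the text ends with a form feed, the last piece is empty and simply yields a page with no words. Because the page keys are integers, they are enumerated in numeric order.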
smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

Re: [request] indexing all words in a word file

30 Nov 2013, 22:58

Thanks a ton, Blackholyman,
and you too, ahk7.

:)

shall try to do it blackholyman :)
might i PM the file to you sir?

seems possible.
smorgasbord
Posts: 490
Joined: 30 Sep 2013, 09:34

Re: [request] indexing all words in a word file

30 Dec 2013, 07:17

@Blackholyman
Thanks again, you took so much pain for me. :)

@ahk7
thanks for the trick.

Here is what i did:

I added headers/footers and inserted page numbers in the Word file, then created a PDF of it using CutePDF, then copied the whole text of the PDF into a text file. Then I wrote a script that separates the pages based on the page numbers and the header/footer.

It more or less worked every time, though the tables did cause some problems (as Blackholyman already suggested). I let the table issue go, as there was other stuff to do.

Anyways Thanks everyone @blackholyman, @k0n and @ahk7 et al.

All hail AHKSCRIPT!!!
