How to retrieve text above and below String?

LiangShuii
Posts: 18
Joined: 29 Sep 2019, 14:34

07 Oct 2019, 04:12

How to retrieve text above and below String?

Question is mainly in the title. I have converted pdfs and doc/docx into .txt files and now I would like to start summarising the .txt files. I would like to search for a word and then extract the word and 10 words above and 10 words below.

For example:
After all, the Earth must wait for spring.
No angel ever changed the pace of time.
Goodness is still tucked away below,
Empty as a field asleep in snow,
Like iron in the harshness of that clime
As God is born in frozen Bethlehem.
If I InStr(.txt, iron) how do I get the following:
away below,
Empty as a field asleep in snow,
Like iron in the harshness of that clime
As God is born
into a variable I can then use to write in a Word document with ComObj?

I can't get it to work but I was thinking maybe something along the lines of:

Code: Select all

FoundPos := InStr(.txt, iron)
StringLeft, ResultLeft, FoundPos, 80
Stringright, ResultRight, FoundPos, 80
Thank you for your help,
Rohwedder
Posts: 3782
Joined: 04 Jun 2014, 08:33
Location: Germany

Re: How to retrieve text above and below String?

07 Oct 2019, 11:16

Hello,
try:

Code: Select all

txt =
( join
After all, the Earth must wait for spring.
 No angel ever changed the pace of time.
 Goodness is still tucked away below,
 Empty as a field asleep in snow,
 Like iron in the harshness of that clime
 As God is born in frozen Bethlehem.
)
Find = iron
RegexMatch(txt, "(\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h" Find "\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+)",Result)
MsgBox,%  Result1
Sir Teddy the First
Posts: 94
Joined: 05 Aug 2019, 12:31

Re: How to retrieve text above and below String?

07 Oct 2019, 11:24

Hi,
in addition to the answer above, you can also try this one, which grants you a little bit more control over the search function:

Code: Select all

String :=
(
"After all, the Earth must wait for spring.
No angel ever changed the pace of time.
Goodness is still tucked away below,
Empty as a field asleep in snow,
Like iron in the harshness of that clime
As God is born in frozen Bethlehem."
)

MsgBox % WordPeriphery(String, "iron")

MsgBox % WordPeriphery(String, "iron", 2)

MsgBox % WordPeriphery(String, "iron",, 5)

WordPeriphery(Text, SearchWord, Num1 := 10, Num2 := 10)
{
    SearchPattern := "x S) (\s* \S+){0," Num1 "} \s*" . SearchWord . "[.,?!]* \s* (\S+ \s*){0," Num2 "}"
    RegExMatch(Text, SearchPattern, Result)
    Result := Trim(Result)
    
    return Result
}

The three "MsgBox"-examples demonstrate the usage of the function.
This function would also work with a word like "frozen" in your example text (any word that is fewer than 10 words from the start or end of your sample text), which Rohwedder's probably does not.
LiangShuii
Posts: 18
Joined: 29 Sep 2019, 14:34

Re: How to retrieve text above and below String?

07 Oct 2019, 12:08

Thank you very much for your time Rohwedder and Sir Teddy the First, it is very much appreciated.

Both of your codes worked, thank you again.

Is there a way of assigning txt or String to result.txt? I note that both of your scripts include the text to search directly in the .ahk file. Unfortunately, the text to search will change each time, depending on which pdf or doc/docx is converted into a .txt. Is there a way of assigning the contents of the text file result.txt to your txt and String variables? I tried the following with no success, as the returned MsgBox is blank (no doubt because "iron" is not present in result.txt):

Code: Select all

txt := "result.txt"
Find = iron
RegexMatch(txt, "(\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h" Find "\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+)",Result)
MsgBox,%  Result1
MsgBox % WordPeriphery(result.txt, "iron")

MsgBox % WordPeriphery(result.txt, "iron", 2)

MsgBox % WordPeriphery(result.txt, "iron",, 5)

WordPeriphery(Text, SearchWord, Num1 := 10, Num2 := 10)
{
SearchPattern := "x S) (\s* \S+){0," Num1 "} \s*" . SearchWord . "[.,?!]* \s* (\S+ \s*){0," Num2 "}"
RegExMatch(Text, SearchPattern, Result)
Result := Trim(Result)

return Result
}
My apologies if I didn't make this clear in the original post,

Kind regards
Sir Teddy the First
Posts: 94
Joined: 05 Aug 2019, 12:31

Re: How to retrieve text above and below String?

07 Oct 2019, 12:30

Hi,
have a look at these pages in the docs:

FileRead
File Object

You'll probably want to go with FileRead, so you have to include

Code: Select all


FileRead, String, FolderPathOfYourFile\result.txt

at the beginning of either of those scripts.

AutoHotkey has to read the file before it can operate on the file's contents. There is a distinction between the file (as a file) and the file's contents.

For this to work, you have to replace "FolderPathOfYourFile" with your file's location on your hard drive (something like C:\Folder1\Subfolder1\...).
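A minimal sketch of how this could look in practice. It assumes result.txt sits in the same folder as the script (A_ScriptDir), and it checks ErrorLevel after FileRead so that an unreadable file is not mistaken for a word that was not found:

```autohotkey
; Assumption: result.txt is in the same folder as this script.
FilePath := A_ScriptDir "\result.txt"

; FileRead loads the file's contents into the variable String.
FileRead, String, %FilePath%
if ErrorLevel
{
    MsgBox, Could not read %FilePath%
    ExitApp
}

; String now holds the text itself, not the file name,
; so it can be passed to RegExMatch or WordPeriphery.
MsgBox, % "Characters read: " StrLen(String)
```

If the file lives elsewhere, only the FilePath line needs to change.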
teadrinker
Posts: 2057
Joined: 29 Mar 2015, 09:41

Re: How to retrieve text above and below String?

07 Oct 2019, 13:42

Rohwedder wrote:

Code: Select all

RegexMatch(txt, "(\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h" Find "\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+\h\S+)",Result)
But why? :)

Code: Select all

RegexMatch(txt, "(\S+\h){10}" Find "(\h\S+){10}", Result)
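For reference, a complete version of the shortened pattern, as a sketch. It assumes the text is on a single line with words separated by spaces or tabs (\h does not match line breaks), as in Rohwedder's joined example. Because the words are now matched by repeated groups rather than one outer group, the full match is in Result, while Result1 and Result2 only hold the last repetition of each group:

```autohotkey
; Sample text on one line, as produced by Rohwedder's "( join" section.
txt := "After all, the Earth must wait for spring. No angel ever changed the pace of time. Goodness is still tucked away below, Empty as a field asleep in snow, Like iron in the harshness of that clime As God is born in frozen Bethlehem."
Find := "iron"

; Exactly 10 words before and 10 words after Find; no match if fewer exist.
if (RegExMatch(txt, "(\S+\h){10}" Find "(\h\S+){10}", Result))
    MsgBox, % Result    ; whole match: the word plus its surroundings
else
    MsgBox, No match (fewer than 10 words on one side?)
```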
LiangShuii
Posts: 18
Joined: 29 Sep 2019, 14:34

Re: How to retrieve text above and below String?

07 Oct 2019, 13:46

Thanks Sir Teddy, for future viewers of this post, I put the following code above Sir Teddy's code and it worked:

Code: Select all

FileRead, String, result.txt 		;result.txt is the name of my txt file that I want to search text in. result.txt is on its own because the file is in the same folder as the script. String is the same variable Teddy uses.
However, for some reason, when I use it with other .txt files it doesn't always work (the MsgBox is blank). Is this because the RegEx (Perl?) expression is tailored to the poem example I used? Also, I changed this number:

Code: Select all

WordPeriphery(String, "Iron",, 2)
To show more words after the searched word, however:

Code: Select all

WordPeriphery(Text, SearchWord, Num1 := 10, Num2 := 20)
no matter how much I change the Num1 value "10" I cannot modulate the number of words shown before the searched word. Furthermore, if I put it to anything above 11 the msgbox is blank.

Finally, if I want to find more than the first searched word (i.e "Iron") in the .txt file should I run it in a Loop, Read loop?

Thank you very much for your pointers, it has saved me so much time!!

Cheers,
Sir Teddy the First
Posts: 94
Joined: 05 Aug 2019, 12:31

Re: How to retrieve text above and below String?

07 Oct 2019, 15:41

Hi,
could you please post another example you tried where it did not work so I can have a look at that?

For the numbers:
I structured the function like this:

Code: Select all

WordPeriphery(Text, SearchWord, Num1 := 10, Num2 := 10)
"Num1 := 10" and "Num2 := 10" mean that Num1 and Num2 are optional arguments for that function.
That means if you call that function you can omit them and they each will default to 10. These are not really meant for you to change.

Instead, you have to specify your own numbers when calling the function, like this:

Code: Select all

MyText := WordPeriphery(String, "iron", 12, 16)
If you leave either of these blank, they will default to 10, but if you specify them, they will take precedence over the ones I specified when I defined the function.
And if you use the function that way, it should work.

The example above produces the following result:

Code: Select all

still tucked away below,
Empty as a field asleep in snow,
Like iron in the harshness of that clime
As God is born in frozen Bethlehem.
Notice that there are fewer than 16 words after "iron", although 16 were specified, because the function can still output a result even if the end of the text is reached before the specified word count is complete.

This is what the other functions posted in this thread are not capable of.

Does this solve your second problem or are there still any errors left?


For your last question:
If you want to search for more words, just call this function repeatedly.
You could even create an array with the words you want the function to search and iterate over this array in a loop, like this:

Code: Select all

SearchArray := Array("iron", "frozen", "empty", "Earth")

Loop, % SearchArray.Length()
{
    SearchWord := SearchArray[A_Index]
    ResultText := WordPeriphery(String, SearchWord)

    MsgBox %ResultText%
}
If you take a closer look at the examples I put into the array, you will find that the MessageBox will be empty for the word "empty", because there is no such word in the example text.
In the text it says "Empty as a field asleep" and because RegEx is case-sensitive, it won't find "empty" in lowercase.
If you want to change that, you have to include an "i" in the RegExNeedle's options, like that:
SearchPattern := "i x S) (\s* \S+){0," Num1 "} \s*" . SearchWord . "[.,?!]* \s* (\S+ \s*){0," Num2 "}"
For more information, have a look at the RegEx-Reference in the docs.
LiangShuii
Posts: 18
Joined: 29 Sep 2019, 14:34

Re: How to retrieve text above and below String?

10 Oct 2019, 05:53

Thanks Teddy,

My apologies for the late reply, and for being obscure by not disclosing the text I am working with. I believe RegEx displayed a blank message box when the string came after text in brackets "()", for example:
the dog (an animal) is happy
If I remember correctly, the string happy wouldn't work. I will try to replicate this with copyright-free text.

In the meantime do you know how to continue searching for the same string several times in the same text? For example:
Fifty years, and you remain a child,
Infinitely valued, loved, and treasured.
Fierce winds may rip away at autumn leaves,
The kind of turn by which one's life is measured.
Yet Eden lingers, innocent and wild.
Years matter not, nor chance, nor choice, nor change.
Ever you must be a child still.
Ambition matters not, nor joy, nor grief,
Reason, passion, temper, fortune, will,
Since you know love that nothing can estrange.
Your script is excellent and retrieves the words before and after the first string "child". However, how can I run the script so that it shows me the words before and after every instance of the word "child" and not just the first one? I tried playing around with looping the function, but this didn't work.

Thank you again,
LiangShuii
Posts: 18
Joined: 29 Sep 2019, 14:34

Re: How to retrieve text above and below String?

10 Oct 2019, 10:05

Since my last post I have tried the following scripts:

This one works however I can't figure out how to make it display 10 words above and below the string

Code: Select all

FileRead, String, result.txt

Pos := 1
While Pos :=   RegExMatch(String, "i x S) (\s* \S+){0,10} \s*" . child . "[.,?!]* \s* (\S+ \s*){0,30}", M, Pos+StrLen(M1) )
	str .= ((A_Index=1) ? "" : "`n") "Match #" A_Index " = " m
	MsgBox, %str%
	FileAppend, %str%, Dump.doc
	return


This one worked before I put the "Pos" lines 9-12 however it only showed the first string found in the document but not any subsequent strings so I tried changing it but with no success:

Code: Select all

FileRead, Text_example, result.txt

Pos := 1
While Pos {
Pos :=
(
InputBox, UserInput,.      search , , , 200, 85, 300, 380, , , sleep 
StringGetPos, Search_pos, Text_example, %UserInput% 

MsgBox, The string was found at position %Search_pos%. 

StringGetPos, Line_10_u, Text_example, `n, L10, %Search_pos%   ;Position of ten Lines +

Search_pos_d := StrLen(Text_example) - Search_pos
StringGetPos, Line_10_d, Text_example, `n, R10, %Search_pos_d%  ; ;Position of ten Lines -

Result_count := Line_10_u - Line_10_d
StringMid, Pos+StrLen(M1), Text_example, Line_10_d, Result_count

MsgBox, 2 = %Result_String%.
)

Match%A_Index% := M1
}

Msgbox % Match1 "`n" Match2 "`n" Match3



For some reason I can't manage to make a script that shows 10 words above and below the string AND shows all strings in the document and not just the first one encountered.

Cheers,
LiangShuii
Posts: 18
Joined: 29 Sep 2019, 14:34

Re: How to retrieve text above and below String?

10 Oct 2019, 11:20

Final code I have tested since:

Code: Select all

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.

/*
	Title: Global Regular Expression Match
		Find all matches of a regex in a string.
	
	------------------------------------------------------------
	
	Function: grep
		Sets the output variable to all the entire or specified subpattern matches and returns their offsets within the haystack.
	
	Parameters:
		h - haystack
		n - regex
		v - output variable (ByRef)
		s - (optional) starting position (default: 1)
		e - (optional) subpattern to save in the output variable, where 0 is the entire match (default: 0)
		d - (optional) delimiter - the character that seperates multiple values (default: EOT (0x04))
	
	Returns:
		The position (or offset) of each entire match.
	
	Remarks:
		Since multiple values are seperated with the delimiter any found within the haystack will be removed.
	
	------------------------------------------------------------
	
	Function: grepcsv
		Similar to the grep function but returned offsets and all matches including their subpatterns are given in CSV format.
	
	Parameters:
		h - haystack
		n - regex
		v - output variable (ByRef)
		s - (optional) starting position (default: 1)
	
	Returns:
		The position (or offset) of each entire match.
	
	Remarks:
		All fields in the output variable and returned value are delimited by double-quote characters.
	
	------------------------------------------------------------
	
	About: License
		- Version 2.0 by Titan <http www.autohotkey.net /~Titan/#regexmatchall>.  Broken Link for safety
		- GNU General Public License 3.0 or higher <http www.gnu.org /licenses/gpl-3.0.txt>  Broken Link for safety
	
*/

FileRead, h, result.txt
n := "i x S) (\s* \S+){0,10} \s* . child . [.,?!]* \s* (\S+ \s*){0,10}"
v := Result


grep("h", "n", ByRef "v", s = 1, e = 0, d = "") {
	v =
	StringReplace, h, h, %d%, , All
	Loop
		If s := RegExMatch(h, n, c, s)
			p .= d . s, s += StrLen(c), v .= d . (e ? c%e% : c)
		Else Return, SubStr(p, 2), v := SubStr(v, 2)
}

msgbox, 1st result %Result%
msgbox, 2nd result %v%
msgbox, wild card %ByRef%

grepcsv(h, n, ByRef v, s = 1) {
	v =
	x = 0
	xp = 1
	Loop {
		If xp := InStr(n, "(", "", xp)
			x += SubStr(n, xp + 1, 1) != "?", xp++
		Else {
			Loop
				If s := RegExMatch(h, n, c, s) {
					p = %p%`n"%s%"
					s += StrLen(c)
					StringReplace, c, c, ", "", All
					v = %v%`n"%c%"
					Loop, %x% {
						StringReplace, cx, c%A_Index%, ", "", All
						v = %v%,"%cx%"
					}
				} Else Return, SubStr(p, 2), v := SubStr(v, 2)
		}
	}
}
Unfortunately a blank message box is still coming up. Sad times.
Sir Teddy the First
Posts: 94
Joined: 05 Aug 2019, 12:31

Re: How to retrieve text above and below String?

10 Oct 2019, 11:26

Hi, try this one.
I modified my function to output an array containing all the occurrences of the searched word with their surroundings.

Code: Select all

String :=
(
"Fifty years, and you remain a child,
Infinitely valued, loved, and treasured.
Fierce winds may rip away at autumn leaves,
The kind of turn by which one's life is measured.
Yet Eden lingers, innocent and wild.
Years matter not, nor chance, nor choice, nor change.
Ever you must be a child still.
Ambition matters not, nor joy, nor grief,
Reason, passion, temper, fortune, will,
Since you know love that nothing can estrange."
)

TextPieces := WordPeriphery(String, "child", 10, 10)

Loop, % TextPieces.Length()
{
    MsgBox % TextPieces[A_Index]
}


WordPeriphery(Text, SearchWord, Num1 := 10, Num2 := 10)
{
    if !InStr(Text, SearchWord)
    {
        MsgBox 4096,, Error! Word not found in given String!
        return
    }
   
    Strings := Array()
    StartingPos := 1
    StrReplace(Text, SearchWord,, WordCount)    ; get number of occurrences of the searched word
    
    Loop, %WordCount%
    {
        SearchPattern := "i x S) (\s* \S+){0," Num1 "} \s*" . SearchWord . "[.,?!]* \s* (\S+ \s*){0," Num2 "}"      
        RegExMatch(Text, SearchPattern, Result, StartingPos) ; search for the word at the given starting position (1 for the first word)
        Result := Trim(Result)
        Strings.Push(Result)
        
        StartingPos := InStr(Text, SearchWord,, StartingPos) + 1 ; jump beyond the position of the last word to get the next
    }
    
    return Strings
}
I also tested your example-string with the parentheses, but it worked for me.
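An alternative sketch that avoids counting the occurrences up front: locate each occurrence first, then cut the surrounding words out of the text before and after it. The function name ContextAll and its variable names are made up for illustration and are not part of the code above. \Q...\E is used so that a search word containing RegEx characters such as parentheses is treated literally (this assumes the word itself does not contain \E).

```autohotkey
; Hypothetical helper (not from the thread): returns an array with every
; occurrence of SearchWord plus up to Before/After words around it.
ContextAll(Text, SearchWord, Before := 10, After := 10)
{
    Found := []
    if (SearchWord = "")
        return Found    ; an empty needle would never advance

    ; \Q...\E treats the search word literally; i) ignores case.
    Needle := "i)\Q" SearchWord "\E"
    ; Up to Before words, anchored to the very end (\z) of the text before the hit.
    HeadPattern := "(?:\S+\s+){0," Before "}\S*\z"
    ; Rest of the current word, then up to After words.
    TailPattern := "^\S*(?:\s+\S+){0," After "}"

    Pos := 1
    while (HitPos := RegExMatch(Text, Needle, Hit, Pos))
    {
        RegExMatch(SubStr(Text, 1, HitPos - 1), HeadPattern, Head)
        RegExMatch(SubStr(Text, HitPos + StrLen(Hit)), TailPattern, Tail)
        Found.Push(Head . Hit . Tail)
        Pos := HitPos + StrLen(Hit)    ; continue after this occurrence
    }
    return Found
}

Sample := "Fifty years, and you remain a child,`nInfinitely valued, loved, and treasured.`nEver you must be a child still."
for Index, Piece in ContextAll(Sample, "child", 3, 2)
    MsgBox, % "Match #" Index ":`n" Piece
```

Because each occurrence is located on its own, two occurrences that sit close together both get their full context. The cost is that the text before each hit is searched again, which may be slow on very large files.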
LiangShuii
Posts: 18
Joined: 29 Sep 2019, 14:34

Re: How to retrieve text above and below String?

28 Nov 2019, 07:19

Hi all,

I just wanted to come back to this and post my final response as I found a solution to this which I wanted to share.

Sir Teddy and the other commenters on this post were very helpful and their code did work. However, I realised that there is a dedicated library for what I wanted to do. The library is grep, which is discussed on the AHK forums here: https://autohotkey.com/board/topic/14817-grep-global-regular-expression-match/

I found that the link to the grep library code in that forum post is dead, so I will post the code here for future users.

Grep.ahk:

Code: Select all

/*
	Function: grep
		Sets the output variable to all the entire or specified subpattern matches and returns their offsets within the haystack.
	Parameters:
		h - haystack
		n - regex
		v - output variable (ByRef)
		s - (optional) starting position (default: 1)
		e - (optional) subpattern to save in the output variable, where 0 is the entire match (default: 0)
		d - (optional) delimiter - the character that seperates multiple values (default: EOT (0x04))
	Returns:
		The position (or offset) of each entire match.
	Remarks:
		Since multiple values are seperated with the delimiter any found within the haystack will be removed.
	License:
		- Version 2.0 <http www.autohotkey.net /~polyethene/#grep>  Broken Link for safety
		- Dedicated to the public domain (CC0 1.0) <http creativecommons.org /publicdomain/zero/1.0/>  Broken Link for safety
*/
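; Note: the default delimiter d is documented above as the EOT control character (0x04).
; If nothing appears between the quotes below, that character was probably lost when the
; code was copied; in that case pass a delimiter explicitly, e.g. grep(h, n, v, 1, 0, "|").
grep(h, n, ByRef v, s = 1, e = 0, d = "") {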
	v =
	StringReplace, h, h, %d%, , All
	Loop
		If s := RegExMatch(h, n, c, s)
			p .= d . s, s += StrLen(c), v .= d . (e ? c%e% : c)
		Else Return, SubStr(p, 2), v := SubStr(v, 2)
}
The code above is a library (see https://www.autohotkey.com/docs/commands/_Include.htm), meaning you should create a new ahk file called grep.ahk (preferably in the same directory as the ahk file you are working on) and paste the grep code above into it. At the top of the ahk file you are working on, include the following line:

Code: Select all

#Include grep.ahk
You can now use the function grep(put parameters here) in your ahk script. The comment at the top of the code above explains how to use grep.

Why use grep over RegExMatch? grep is essentially RegExMatch, but it returns all the matches in the string rather than just the first one. In other words, it works like Sir Teddy's code above.

To answer my own initial question, I need to use grep and RegEx to retrieve the text above and below the string.

First I need to decide what my string is. In the following example text:
"Fifty years, and you remain a child,
Infinitely valued, loved, and treasured.
Fierce winds may rip away at autumn leaves,
The kind of turn by which one's life is measured.
Yet Eden lingers, innocent and wild.
Years matter not, nor chance, nor choice, nor change.
Ever you must be a child still.
Ambition matters not, nor joy, nor grief,
Reason, passion, temper, fortune, will,
Since you know love that nothing can estrange."
Let's say I chose love to be the desired string/word I am looking for.
The RegEx would be something like this (of course, it will need to be adapted to each individual use case):

Code: Select all

Love
Now, this would match nothing in the example text above. That is because Love starts with a capital letter, RegEx is case-sensitive by default, and the example text only contains love in lower case.
I would therefore have to use this regex:

Code: Select all

"i)Love"
(Remember that when using RegExMatch or grep the actual regex needs to be enclosed in quotes ". The option "i)" makes the RegEx case-insensitive. The pattern starts directly after the closing parenthesis, so a space placed there would be matched literally.)
Now that will match "love" in the last line of the example text, and also the first four letters of "loved" in the second line.

We now note that it returns only "love" and not the whole word "loved". In RegEx, a single dot (i.e. .) matches any character, and a single dot followed by a question mark (i.e. .?) means either nothing or any one character. So the regex "i)love.?" will return "loved". However, it would not return all of "lovers", because there are two characters after "love". To cover "love", "loved" and "lovers" we can use the regex "i)love.?.?". Note the question marks: without them, the regex "i)love.." would require exactly two more characters after "love", whatever they are.

So, to come back to our example text, if we used grep with the regex "i)love.?" it would return "loved" and "love " (the dot also matches the space that follows). Brilliant. But what I wanted was to retrieve text above and below the string. Since a single dot matches any character (whether a letter or a space), grep with the regex "i)...love.?..." returns each match with three extra characters on either side: "d, loved, a" and "ow love tha". You can put as many dots before and after your searched string as you like.

What I eventually realised I wanted was all the text before the string back to the start of its line and all the text after the string up to the end of its line. To do this I used grep with the RegEx "im`a)^.*?love.?.*?$", which got me "Infinitely valued, loved, and treasured." and "Since you know love that nothing can estrange."".

This works as follows. With the m (multiline) option, ^ means the start of a line (the position after a `n, which is a new line, a.k.a. when you press Enter in Microsoft Word) and $ means the end of a line (the position before a `n). Without m they only match the start and end of the whole text. The `a option tells RegEx to accept any kind of line break, since AutoHotkey otherwise expects `r`n.

.* means match any characters up to the end of the line, because a dot does not match a line break. We only want everything from the start of the line up to love.?, so we put love.? after it, and we use the lazy form .*? so that it stops at the first love.? on the line rather than the last. After love.? we use .*? again, followed by $, to match everything up to the end of that same line. Because no part of the pattern can cross a line break, each line containing love.? gives its own match, and we get "Infinitely valued, loved, and treasured." and "Since you know love that nothing can estrange.""!!
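As a sketch of the line-based approach, the loop below collects every line that contains the search text using plain RegExMatch rather than grep. It relies on two RegEx options that are assumptions about the input: m makes ^ and $ match at line boundaries instead of only at the start and end of the whole text, and `a treats any kind of line break (`n or `r`n) as a newline.

```autohotkey
Text := "Fifty years, and you remain a child,`nInfinitely valued, loved, and treasured.`nFierce winds may rip away at autumn leaves,`nSince you know love that nothing can estrange."

Lines := ""
Pos := 1
; i = ignore case, m = multiline, `a = accept any newline type.
while (Pos := RegExMatch(Text, "im`a)^.*love.*$", Line, Pos))
{
    Lines .= Line "`n"
    Pos += StrLen(Line)    ; resume searching after the matched line
}
MsgBox, % Lines
```

The greedy .* is enough here because a dot cannot cross a line break, so each match is limited to one line.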

I appreciate this is a very crude explanation of RegEx and grep, and that a basic understanding of RegEx is probably required for any of this to make sense. I just found the answer and wanted to briefly give some feedback during my lunch break.

RegEx quick reference guide can be found here https://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
And a very useful tutorial on RegEx can be found here https://www.autohotkey.com/boards/viewtopic.php?t=28031

Thanks again to everyone who helped me with this!
Chunjee
Posts: 742
Joined: 18 Apr 2014, 19:05
GitHub: Chunjee

Re: How to retrieve text above and below String?

01 Dec 2019, 11:40

With https://www.npmjs.com/package/biga.ahk, not perfect but the concept works.

Code: Select all

A := new biga()

txt = 
(
After all, the Earth must wait for spring.
No angel ever changed the pace of time.
Goodness is still tucked away below,
Empty as a field asleep in snow,
Like iron in the harshness of that clime
As God is born in frozen Bethlehem.
)
needle := "iron"

if (A.includes(txt, needle)) {
    txt := A.replace(txt, "/\n/", "zzz ")
    txt := A.replace(txt, ",", "yyy")

    strArray := A.split(txt, needle)
    startStr := A.split(A.join(A.reverse(A.words(strArray[1]))),, 10)
    startStr := A.join(A.reverse(startStr), " ")
    enderStr := A.split(A.join(A.words(strArray[2])),, 10)
    enderStr := A.join(enderStr, " ")

    fullString := startStr " " needle " " enderStr
    fullString := A.replace(fullString, "zzz ", "`n")
    fullString := A.replace(fullString, "yyy", ",")
    msgbox, % fullString
    ; away below,
    ; Empty as a field asleep in snow,
    ; Like iron in the harshness of that clime
    ; As God is born
}
To save a little thinking power I replaced the newlines and "," with letters because they would be lost otherwise.
