How to retrieve text above and below String?

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to retrieve text above and below String?

Post by teadrinker » 15 Mar 2021, 08:57

Code: Select all

String :=
(
"Fifty years, and you remain a child,

Infinitely valued, loved, and treasured.
Fierce winds may rip away at autumn leaves,
The kind of turn by which one's life is measured.
Yet Eden lingers, innocent and wild.
Years matter not, nor chance, nor choice, nor change.

Ever you must be a child still.
Ambition matters not, nor joy, nor grief,

Reason, passion, temper, fortune, will,
Since you know love that nothing can estrange."
)

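searchWord := "child"
before := 3   ; max number of non-empty lines to show above the matching line
after := 2    ; max number of non-empty lines to show below the matching line

; Group 4 is the search word; the next search starts right after it.
while RegExMatch(String, "`amO)((^\V*\S\V*$)\R(^$\R)*){0," . before . "}^\V*\b(" . searchWord . ")\b\V*$((\R^$)*\R(?2)){0," . after . "}", m, m ? m.Pos(4) + m.Len(4) : 1)
    MsgBox, % RegExReplace(m[0], "`am)^$\R")   ; strip empty lines from the result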

garry
Posts: 3740
Joined: 22 Dec 2013, 12:50

Re: How to retrieve text above and below String?

Post by garry » 15 Mar 2021, 10:05

@teadrinker thank you very much, works fine ... sorry, I still haven't tried to learn regex ...

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to retrieve text above and below String?

Post by teadrinker » 15 Mar 2021, 10:59

It's time to start. Keep in mind that this can take years! :lol:

garry
Posts: 3740
Joined: 22 Dec 2013, 12:50

Re: How to retrieve text above and below String?

Post by garry » 15 Mar 2021, 11:48

I know there is a lot of knowledge behind it ... for many things I'll first try the good basic commands from AutoHotkey
thank you again for your big help ... I use many of your scripts and you always have a solution

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 15 Mar 2021, 13:04

Thank you @teadrinker

While I have gotten this working, there is no space after a full stop and before the new sentence. *Update: fixed.

Why not alter this so that the regex returns the entire sentence?

The average sentence is around 25 words or so, so why not just start at a sentence that follows a full stop and begins with a capital letter, and end at the next full stop and the space after it? It would certainly retain the meaning better.
Last edited by Rikk03 on 15 Mar 2021, 13:51, edited 1 time in total.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 15 Mar 2021, 13:20

Hi Again,

Come to think of it, the average paragraph is around 250-300 words, which is 4 or 5 lines. Why not just return the entire paragraph?

Then again, I could just specify the number of words to be 150 before and 150 after to get the majority of the paragraph. However, I suspect this will be really slow, counting all those words before and after. Surely grabbing the entire sentence or paragraph would be much quicker.

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to retrieve text above and below String?

Post by teadrinker » 15 Mar 2021, 14:08

@Rikk03
You have so many ideas at the same time! :) Can you formulate your question more specifically?

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 15 Mar 2021, 14:28

Sorry about that.

I'd like to return the paragraph where the search term resides because a single sentence is not enough to understand the context.

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to retrieve text above and below String?

Post by teadrinker » 15 Mar 2021, 15:36

Please clarify what you mean by a paragraph. Is it the text from one empty line to the next empty line?

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 16 Mar 2021, 07:07

Well, normally there is an empty line between paragraphs, no? To show the separation. Sometimes the newline can be \n or \r or both. The other question is how to handle the first or the last paragraph in an article.

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to retrieve text above and below String?

Post by teadrinker » 16 Mar 2021, 09:43

Code: Select all

String :=
(
"Fifty years, and you remain a child,

Infinitely valued, loved, and treasured.
Fierce winds may rip away at autumn leaves,
The kind of turn by which one's life is measured.
Yet Eden lingers, innocent and wild.
Years matter not, nor chance, nor choice, nor change.

Ever you must be a child still.
Ambition matters not, nor joy, nor grief,

Reason, passion, temper, fortune, will,
Since you know love that nothing can estrange."
)

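searchWord := "child"

; Group 2 is the paragraph containing the word: the text between blank lines
; (or the start/end of the string). The next search starts right after it.
while RegExMatch(String, "sO)(^\R?|\R\R)((.(?!\R\R))*\b" . searchWord . "\b(?-1)*?\V?)(\R\R|\R?$)", m, m ? m.Pos(2) + m.Len(2) : 1)
    MsgBox, % "|" . m[2] . "|"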
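
For comparison, here is a minimal sketch of a different approach to the same task, not the regex above: normalize the line endings, split the text on blank lines, and keep the paragraphs that contain the word. The function name ParagraphsContaining is made up for illustration, and the sketch assumes paragraphs are separated by at least one blank (or whitespace-only) line.

```autohotkey
ParagraphsContaining(text, word) {
    ; \R matches CRLF, CR, or LF, so every line ending becomes a single LF.
    text := RegExReplace(text, "\R", "`n")
    ; Collapse any run of blank or whitespace-only lines into one separator.
    text := RegExReplace(text, "\n(?:\h*\n)+", "`n`n")
    result := []
    for index, para in StrSplit(text, "`n`n") {
        ; \Q...\E makes the word literal (assumes the word contains no "\E").
        if (RegExMatch(para, "\b\Q" . word . "\E\b"))
            result.Push(para)
    }
    return result
}

sample := "a child,`r`n`r`nsecond paragraph`r`nmore text`r`n`r`nstill a child"
for index, para in ParagraphsContaining(sample, "child")
    MsgBox, % "|" . para . "|"
```

Because the text is split first, the first and last paragraphs need no special handling; empty pieces produced by leading or trailing blank lines simply fail the word test.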

garry
Posts: 3740
Joined: 22 Dec 2013, 12:50

Re: How to retrieve text above and below String?

Post by garry » 17 Mar 2021, 06:59

@teadrinker thanx again for the good different examples

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 08 Apr 2021, 10:42

@teadrinker Thanks,

It really helped.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 13 Nov 2022, 14:06

@teadrinker

I'm using your regex above; however, it is only returning 2 results on a page where there are closer to 12 matches (according to regex101).

It is being called within a loop, so that might have something to do with it. Perhaps it just needs more time to complete?

I'd appreciate your thoughts on how to fix it.

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to retrieve text above and below String?

Post by teadrinker » 13 Nov 2022, 18:37

If you give me an example, I'll try to answer.

adrianh
Posts: 135
Joined: 28 Jul 2014, 15:34

Re: How to retrieve text above and below String?

Post by adrianh » 14 Nov 2022, 08:30

OMG this language is infuriating to debug! It really resists being tamed.

So, I wrote the regex to be more readable by using subpatterns. It doesn't remove the blank lines, but it does ignore them. I'm sure you will be able to figure out a regex that deletes them appropriately.

Code: Select all

search_for_word(text, word, max_lines)
{
regex_re =
(
Oxm`ni)
  (?(DEFINE)                 # This block defines named subpatterns
    (?<eol>\r?+\n|$)         # End of line
    (?<bline>^\h*+(?&eol))   # Blank line

    # A line may be preceded by any number of blank lines
    # and it can be followed by any number of blank lines.
    (?<line>(?&bline)*+^.*+(?&eol)(?&bline)*+)
  `)
  
  # Here is the actual regex
  (?<before>
    (?&line){0,%max_lines%}  # lines before the word
    .*?                      # characters before the word on the same line
  `)
  (?<word>\b%word%\b)        # the actual word
  (?<after>
    .*+(?&eol)               # characters after the word on the same line
    (?&line){0,%max_lines%}  # lines after the word
  `)
)
  array := []
  foundPos := 1 ; Where to start searching from.
  offset := 0   ; Where to start the search relative to the last foundPos position.
  ; The while loop is structured so that it doesn't search the same text again,
  ; unless overlaps are wanted, then something else would have to be done.
  while (foundPos := RegexMatch(text, regex_re, param, foundPos + offset)) {
    array.push({before: param["before"], word: param["word"], after: param["after"]})
    ; This would skip all the lines matched after word.
    ; offset := param.Len[0]
    
    ; This allows matching the next word found after the previously found word,
    ; but will only have the text between the last matched word and this one if
    ; the number of lines between them is less than max_lines.
    offset := param.Len["before"] + param.Len["word"]
  }

  return array
}


#z::test()
test()
{
text =
(
’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.

“Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
      The frumious Bandersnatch!”

He took his vorpal sword in hand;
      Long time the manxome foe he sought—
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
      He chortled in his joy.

’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.
)

  found := search_for_word(text, "through", 4)
  text := ""
  loop % found.length() {
    text .= "BEFORE ==========`n" found[A_Index]["before"]
      . "`nWORD ====] " found[A_Index]["word"] " [===="
      . "`nAFTER ==========`n" found[A_Index]["after"] "`n`n"
  }
  
  MsgBox, % "'" text "' found.length() = " found.length() 
}
Output:

Code: Select all

BEFORE ==========
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling 
WORD ====] through [====
AFTER ==========
 the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head


BEFORE ==========
      And burbled as it came!

One, two! One, two! And 
WORD ====] through [====
AFTER ==========
 and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?


BEFORE ==========
 and 
WORD ====] through [====
AFTER ==========

      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?



Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 15 Nov 2022, 02:36

@teadrinker

I'm using Google search results (copied to clipboard), using it to refine the result and return specific information.

I found that on Regex101, there is no O (Object) modifier to test with so it's not really the same.

I tried it with the s and i and g modifiers (on regex101), and the regex correctly finds all instances of the search text.

However, there is no global "g" modifier in AHK, from what I can see.

So how can I resolve the "O" and "g" issue?

I find I get more accurate results (on regex101) with the slight modifications below

(^\R?|\R\R)((.(?!\R\R))*\b" . insearch . "\b(?-1)*?\V?)(\.\R\R|\R?$) (using isg modifiers)

but it returns nothing at all when using AHK. So confusing.

UPDATE, on Regex101 I get even better results with the following

(^\R?|\R)((.(?!\R))+\b" . insearch . "\b(?-1)+?\V?)(\.\R\R|\R?$) (also using the isg modifiers)

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41
Contact:

Re: How to retrieve text above and below String?

Post by teadrinker » 15 Nov 2022, 06:39

Rikk03 wrote: so how can I resolve the "O" and "g" issue?
There is no issue with them. O does not affect RegEx behavior; it only changes how the results are returned. The g is needed to show all matches on Regex101. In AHK, RegExMatch can't return all matches at once, but you can retrieve them one after another by calling it in a loop.
Rikk03 wrote: I find I get more accurate results (on regex101) with the slight modifications below
Can't say anything, since you didn't give me a specific text example.
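
The loop idea can be sketched as follows. This is only an illustration: FindAll is a made-up helper name, and it assumes the needle uses the O) option so that the output variable is a match object.

```autohotkey
; Collect every match by calling RegExMatch repeatedly, each time
; starting just after the previous match (the equivalent of "g").
FindAll(haystack, needle) {
    matches := []
    pos := 1
    while (pos <= StrLen(haystack) + 1 && (pos := RegExMatch(haystack, needle, m, pos))) {
        matches.Push(m.Value(0))
        ; Advance past this match; step at least one character so that
        ; an empty match does not keep the loop on the same position.
        pos += (m.Len(0) > 0) ? m.Len(0) : 1
    }
    return matches
}

found := FindAll("one child, two children, a child still", "O)\bchild\b")
MsgBox, % found.Length() . " matches found"
```

The fourth parameter of RegExMatch is the starting position, so each call resumes where the previous match ended instead of finding the first match again.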

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to retrieve text above and below String?

Post by Rikk03 » 15 Nov 2022, 09:14

Got it working. Thanks

adrianh
Posts: 135
Joined: 28 Jul 2014, 15:34

Re: How to retrieve text above and below String?

Post by adrianh » 15 Nov 2022, 19:57

Rikk03 wrote:
15 Nov 2022, 02:36
I find I get more accurate results (on regex101) with the slight modifications below

(^\R?|\R\R)((.(?!\R\R))*\b" . insearch . "\b(?-1)*?\V?)(\.\R\R|\R?$) (using isg modifiers)

but it returns nothing at all when using AHK. So confusing.

UPDATE, on Regex101 I get even better results with the following

(^\R?|\R)((.(?!\R))+\b" . insearch . "\b(?-1)+?\V?)(\.\R\R|\R?$) (also using the isg modifiers)
When using my regex on regex101, you need to replace each `) with ). The backtick is an AHK artifact that is required when a ) starts a line inside a continuation section.

Code: Select all

  (?(DEFINE)                 # This block defines named subpatterns
    (?<eol>\r?+\n|$)         # End of line
    (?<bline>^\h*+(?&eol))   # Blank line

    # A line may be preceded by any number of blank lines
    # and it can be followed by any number of blank lines.
    (?<line>(?&bline)*+^.*+(?&eol)(?&bline)*+)
  )
  
  # Here is the actual regex
  (?<before>
    (?&line){0,%max_lines%}  # lines before the word
    .*?                      # characters before the word on the same line
  )
  (?<word>\b%word%\b)        # the actual word
  (?<after>
    .*+(?&eol)               # characters after the word on the same line
    (?&line){0,%max_lines%}  # lines after the word
  )
On regex101, the options to use are gxmi. That gives you this: https://regex101.com/r/SL7Le9/1

FYI, the `n option is there so that the regex recognizes the end-of-line character. By default, the continuation section's EOL marker is `n, but the regex default is `r`n. Since the regex would never see a `r`n in the continuation section, it would never know that the end of a line had been reached. This isn't a problem except for line comments (which I use liberally): a line comment ends at the EOL, so if the EOL is never found, the comment swallows the rest of the regex. Alternatively, I could have set the continuation section to use `r`n, used (?#...) comments, or had no comments at all.
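
A minimal sketch of that combination, separate from the regex above: a commented pattern in a continuation section, with x enabling whitespace and # comments and `n telling the regex that a lone linefeed ends a line. The pattern and the sample string are made up for illustration.

```autohotkey
pattern =
(
x`n)
(\d+)   # first number
-       # a literal hyphen
(\d+)   # second number
)

; Without the O option, the subpatterns are stored in m1 and m2.
if (RegExMatch("10-20", pattern, m))
    MsgBox, % m1 . " and " . m2
```

No line of this pattern starts with a closing parenthesis, so nothing needs the `) escape here.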

I realized that both of our methods were a bit off. I forgot to anchor to the beginning of the line, which caused unnecessary repeat searches (an anchor matches before or after something but doesn't actually consume anything). Yours had some other issues, though it kind of worked. One problem I observed was when the word was in the second paragraph. See: https://regex101.com/r/s4Fodr/1

I was actually curious about (?-1)*?. At first I couldn't discern what it does. It is a relative subroutine call, so it re-applies the pattern (.(?!\R\R)) rather than repeating the text that group matched. Was that what you were intending? Because the quantifier is lazy, it tries zero repetitions first and only takes more when the rest of the pattern cannot match otherwise.
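
A small check of how (?-1) behaves in PCRE, which AHK uses. This is a sketch with made-up test strings, not part of either regex in this thread: it compares a relative subroutine call with a backreference, then shows a lazy quantifier expanding when the rest of the pattern requires it.

```autohotkey
; (?-1) re-runs the pattern of the previous group; \1 repeats its captured text.
RegExMatch("abc", "(.)(?-1)(?-1)", viaCall)   ; subroutine call: any three characters
RegExMatch("abc", "(.)\1\1", viaBackref)      ; backreference: would need "aaa"

; The lazy *? starts at zero repetitions and grows until X can match.
RegExMatch("abcX", "(.)(?-1)*?X", lazy)

MsgBox, % "call: [" . viaCall . "]`nbackref: [" . viaBackref . "]`nlazy: [" . lazy . "]"
```

The first match covers all of "abc", the second finds nothing, and the third covers all of "abcX", which means the lazy quantifier ran twice.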

Another FYI, when you have a test that you want to share, click on the Save and Share link at the top left corner of the regex101 page and then click on the copy button and paste it here. That way we can see what you are referring to more easily.

So, you are saying that you want to stay within the paragraph? If so, then try this regex: https://regex101.com/r/QSyBn7/1

Code: Select all

  (?(DEFINE)                 # This block defines named subpatterns
    # Anchors to the beginning of the string or is preceded
    # by a CR or LF character.
    (?<BOL>(?<=^|[\r\n]))
    # Either has an \R or is at the end of the string.
    (?<EOL>\R|$)
    # A blank line must anchor to a BOL and can contain 0 or
    # more horizontal whitespace characters and ends with an EOL.
    (?<bline>(?&BOL)\h*+(?&EOL))

    # A line must anchor to a BOL, is not a blank line, contains
    # 1 or more non CR or LF characters and ends with an EOL.
    #
    # If your paragraph markers will never contain 1 or more
    # horizontal whitespace, then you don't actually need the
    # (?!(?&bline)).  It would then read:
    #
    # A line must anchor to a BOL and must contain one or more
    # non CR or LF characters and ends with an EOL.
    (?<line>(?&BOL)(?!(?&bline))[^\r\n]++(?&EOL))
  )
  
  # Here is the actual regex
  (?<before>
    (?&line){0,4}    # lines before the word
    (?&BOL)[^\r\n]*? # characters before the word on the same line
  )
  (?<word>\bthrough\b) # the actual word
  (?<after>
    [^\r\n]*+(?&EOL) # characters after the word on the same line
    (?&line){0,4}    # lines after the word
  )
For complex regexes, I would recommend trying to think of what you are trying to do semantically and use subexpressions to mirror those semantics as it makes the regex easier to read and reason about.
