Regex Stop if encountered

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Post by Rikk03 » 05 Dec 2022, 03:34

Hi,

For this regex code I want to add a stop point.

Code: Select all

<(h1|h2|p)[^>]*?>(<\w+>)?\K[^<]*
If specific text or a specific tag is encountered, matching should stop there: the pattern should only match before this point, and nothing after it should be matched.

If I can do this, then it would be ok for me to use.

Another question: how could I define a start point, such as the very first h1?

Code: Select all

<(h1|p)[^>]*?>(<\w+>)?\K[^<]*

Rohwedder
Posts: 7551
Joined: 04 Jun 2014, 08:33
Location: Germany

Re: Regex Stop if encountered

Post by Rohwedder » 05 Dec 2022, 05:50

Code: Select all

RegExMatch(Text, "NeedleRegEx" , M)
MsgBox,% M1
Just give a list of inputs (Text) and the desired outputs (M1).

adrianh
Posts: 135
Joined: 28 Jul 2014, 15:34

Re: Regex Stop if encountered

Post by adrianh » 05 Dec 2022, 12:20

Not sure if I fully understand. Maybe if you state what you are asking for rather than posting a regex, it might help. However, from what I'm getting from your current question, it sounds like you want to go through a bunch of XML-style tags, which have no whitespace between them, and keep matching until you get to a particular other tag. Is that right? If so, the regex you want is something like this:

Code: Select all

"<(h1|h2|p)[^>]*+>(?:[^<]++|<(?!stop_tag)\w++>)*+"
  1. I replaced your [^>]*? with [^>]*+ because you don't need to check whether the next char is > at every character. The [^>] character class already states that it will only take characters that are not >.
  2. The extra + after the * or + makes the quantifier possessive: don't backtrack if a failure occurs. This increases the match speed if you know for a fact that backtracking won't help in finding a match.
  3. [^<]++ matches as many non-< characters as possible.
  4. If it matches a <, that < must not be followed by stop_tag.
  5. *+ matches 0 or more repetitions of the group without backtracking.
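As a quick check of the points above, here is a minimal AutoHotkey v1.1 sketch. The haystack is made-up sample data, and stop_tag is the same placeholder name used in the pattern, so adjust both to the real input.

```autohotkey
; AutoHotkey v1.1 sketch. Haystack is hypothetical sample data.
Haystack := "<p>one<b>two<i>three<stop_tag>four"
Needle := "<(h1|h2|p)[^>]*+>(?:[^<]++|<(?!stop_tag)\w++>)*+"

; M receives the whole match, M1 the first capture group (the tag name).
if (RegExMatch(Haystack, Needle, M))
    MsgBox, % "Whole match: " M "`nTag name: " M1
else
    MsgBox, No match.
```

With this sample the whole match should be `<p>one<b>two<i>three` and the tag name `p`. Note that `<(?!stop_tag)\w++>` only accepts bare opening tags such as `<b>`, so the match also ends at the first closing tag (`</b>`) or at a tag with attributes.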
To have it stop at a particular string would be slower, esp since you didn't give any context as to where this string can occur. However, slower is relative. This'll prolly be plenty fast. So, something like:

Code: Select all

"<(h1|h2|p)[^>]*+>(?:(?!stop string|<stop_tag>).)*+"
Which basically reads as, if the text from this point on is not stop string or <stop_tag> then match the next character and do that 0 or more times without backtracking. The "slowness" is due to the string comparisons on each and every character, but as I said, prolly still pretty fast. If strings being compared are small enough to fit into a CPU cache line, still really fast.
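A small AutoHotkey v1.1 sketch of that second pattern, again with a made-up haystack and the placeholder stop markers from the pattern itself:

```autohotkey
; AutoHotkey v1.1 sketch. Haystack is hypothetical sample data.
Haystack := "<h1 id=top>Title</h1><p>intro stop string rest</p>"
Needle := "<(h1|h2|p)[^>]*+>(?:(?!stop string|<stop_tag>).)*+"

; The lookahead is tested before every character, so the match
; ends right before the first "stop string" or "<stop_tag>".
if (RegExMatch(Haystack, Needle, M))
    MsgBox, % "Whole match: [" M "]"
else
    MsgBox, No match.
```

Here the whole match should be `<h1 id=top>Title</h1><p>intro ` (including the trailing space). By default the dot does not match newline characters, so for multi-line input the `s)` option would have to be added in front of the pattern.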

Not exactly sure what you were doing with the \K[^<]*. You were resetting the match, but I'm not exactly sure why. What was your reason for this?
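For reference, the effect of \K in the original pattern can be seen with a small AutoHotkey v1.1 test. The haystack below is only an assumed example of the kind of input involved.

```autohotkey
; AutoHotkey v1.1 sketch. Haystack is hypothetical sample data.
Haystack := "<p><b>text</b></p>"
Needle := "<(h1|h2|p)[^>]*?>(<\w+>)?\K[^<]*"

; \K discards everything matched so far from the reported match,
; but the capture groups before it are still filled.
if (RegExMatch(Haystack, Needle, M))
    MsgBox, % "Match: " M "`nGroup 1: " M1 "`nGroup 2: " M2
```

With this input the reported match should be just `text`, while group 1 holds `p` and group 2 holds `<b>`.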

Let me know if that's what you want and if it helps.

Cheers.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Regex Stop if encountered

Post by Rikk03 » 06 Dec 2022, 02:01

I fully understand your confusion. I am after the text between the tags h1 h2 and p. As an example, any HTML page uses those tags. I've since improved on it.

Code: Select all

<(h1|h2|p)[^>]*?>([<\w+>]+)?\K[^<]*
Thanks for your assistance, AdrianH. While your pattern does not match the stop string/tag itself, it still returns matches after the stop string/tag. The idea was to define a clear stop point: NO matches after it, if encountered. I was thinking a negative lookbehind might work.

(?<!whatever) but I can't get it to work.
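One possible way to get a hard stop point without a lookbehind is sketched below for AutoHotkey v1.1. A lookbehind only inspects the text directly in front of the current position and must have a fixed length in PCRE, so it cannot express "the stop marker occurred anywhere earlier". The sketch instead cuts the haystack at the stop marker before matching, and starts at the first h1. Html, StopMarker and the literal "<h1" search are assumptions made for illustration, not something confirmed in the thread.

```autohotkey
; AutoHotkey v1.1 sketch. Html and StopMarker are hypothetical sample values.
Html := "<p>skipped</p><h1>Title</h1><p>first</p><!--stop--><p>second</p>"
StopMarker := "<!--stop-->"

; Hard stop: keep only the text before the stop marker, if it is present.
StopPos := InStr(Html, StopMarker)
if (StopPos)
    Html := SubStr(Html, 1, StopPos - 1)

; Start point: begin searching at the very first h1 (0 if there is none).
Pos := InStr(Html, "<h1")

Needle := "<(h1|h2|p)[^>]*?>(<\w+>)?\K[^<]*"
Result := ""
; Collect every match by restarting the search after the previous one.
while (Pos && (Pos := RegExMatch(Html, Needle, M, Pos)))
{
    Result .= M "`n"
    ; Advance past the match; step at least 1 to avoid looping on empty matches.
    Pos += (StrLen(M) ? StrLen(M) : 1)
}
MsgBox, % Result
```

With the sample data this should list `Title` and `first` only: `skipped` lies before the first h1 and `second` lies after the stop marker.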

adrianh
Posts: 135
Joined: 28 Jul 2014, 15:34

Re: Regex Stop if encountered

Post by adrianh » 12 Jan 2023, 14:46

Sorry, I've not been on the board for a while.

If you are still having difficulty with this, could you post a sample string that you want to parse and where you would like to stop?
