I'm not sure I fully understand. It might help if you stated what you're trying to match rather than just posting a regex. From what I gather from your question, though, it sounds like you want to walk through a run of XML-style tags with no whitespace between them and keep matching until you reach a particular tag. Is that right? If so, the regex you want is something like this:
Code:
"<(h1|h2|p)[^>]*+>(?:[^<]++|<(?!stop_tag)\w++>)*+"
- I replaced your [^>]*? with [^>]*+ because you don't need to check to see if the next char is > at every character. You already have the [^>] character class stating it'll take only characters that are not >.
- The extra + after the * or + makes the quantifier possessive: once it has matched, it won't backtrack if a later part of the pattern fails. That speeds up matching when you know for a fact that backtracking won't help in finding a match.
- [^<]++ matches as many non-< as possible.
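- If it matches a <, the text right after it must not be stop_tag. Note that \w++> only accepts a bare tag like <b>; a closing tag or a tag with attributes ends the match.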
- *+ matches 0 or more without backtracking.
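If you want to poke at it, here's a rough sketch in Python. A few assumptions on my part: stop_tag is just the placeholder name from the regex above, match_until_stop_tag and the sample string are made up for illustration, and Python's re only accepts the possessive forms from 3.11 on, so older versions get a plain greedy fallback.

```python
import re
import sys

# Python's re accepts possessive quantifiers (*+, ++) from 3.11 on. On older
# versions fall back to the greedy forms, which match the same text here
# because nothing follows the outer loop that could force backtracking.
if sys.version_info >= (3, 11):
    SOURCE = r"<(h1|h2|p)[^>]*+>(?:[^<]++|<(?!stop_tag)\w++>)*+"
else:
    SOURCE = r"<(h1|h2|p)[^>]*>(?:[^<]+|<(?!stop_tag)\w+>)*"
PATTERN = re.compile(SOURCE)


def match_until_stop_tag(text):
    """Return the text matched from the first h1/h2/p tag, or None."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


# Hypothetical sample input; "stop_tag" stands in for the real tag name.
sample = '<p class="x">one<b>two<i>three<stop_tag>four'
print(match_until_stop_tag(sample))  # <p class="x">one<b>two<i>three

# A closing tag is not a bare <name>, so the match ends right before it.
print(match_until_stop_tag("<p>one</p>two"))  # <p>one
```

The second call shows the limitation of \w++>: if closing tags or attributes should be skipped over too, that branch would need something like /? and [^>]*+ added.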
Having it stop at a particular string would be slower, especially since you didn't give any context as to where this string can occur. Slower is relative, though, and this will probably be plenty fast. So, something like:
Code:
"<(h1|h2|p)[^>]*+>(?:(?!stop string|<stop_tag>).)*+"
Which basically reads as: if the text from this point on is not stop string or <stop_tag>, then match the next character, and do that 0 or more times without backtracking. The "slowness" comes from the string comparisons at each and every character, but as I said, it's probably still pretty fast. If the strings being compared are small enough to fit into a CPU cache line, it's still really fast.
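Here's a small Python sketch of that second pattern, in case it helps to see it run. The same caveats apply: "stop string" and <stop_tag> are placeholders, text_before_stop and the inputs are made up, and re.DOTALL is my assumption that your text can span lines. The possessive *+ needs Python 3.11 or newer, so the sketch drops to a plain * on older versions.

```python
import re
import sys

# "*+" needs Python 3.11+; older versions fall back to plain "*", which
# matches the same text here because nothing follows the loop.
Q = "*+" if sys.version_info >= (3, 11) else "*"
PATTERN = re.compile(
    r"<(h1|h2|p)[^>]" + Q + r">(?:(?!stop string|<stop_tag>).)" + Q,
    re.DOTALL,  # assumption: let "." cross newlines
)


def text_before_stop(text):
    """Return the match from the opening tag up to the stop marker, or None."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


# At every position the lookahead checks for either stop marker first,
# and only then is one more character consumed.
print(repr(text_before_stop("<h1>Title</h1> body text stop string trailing")))
# '<h1>Title</h1> body text '
print(repr(text_before_stop("<p>a<b>c</b><stop_tag>d")))
# '<p>a<b>c</b>'
```

Unlike the first pattern, this one walks straight over closing tags and attributes, since it only cares about the two stop markers.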
Not exactly sure what you were doing with the \K[^<]*. You were resetting the match, but I'm not exactly sure why. What was your reason for this?
Let me know if that's what you want and if it helps.
Cheers.