RegExExtract()

iseahound · 09 May 2020, 23:12

The current RegEx functions have poor control flow.

A better and modernized function would allow tokenization.

match := RegExExtract(ByRef haystack, needle, replacement text)

As an example:

If my string is "abcd"
Token := RegExExtract(string, "^(.)", "$1")
Would return the tokens a, b, c, d (one per loop)
And string would shrink from abcd to bcd to cd to d to the empty string (one step per loop)

Essentially the matching regex should be removed from the original string. Currently this cannot be done efficiently.

guest3456 · 10 May 2020, 05:21

looks like just a combo of both RegExMatch and RegExReplace... although i dont understand why you'd suggest to use $1 for replacement text, because according to current syntax or normal regex behavior that would just replace back with the same text that was extracted. you should use "".

seems like this does it:

Code: Select all

string := "abcd"
regex := "^(.)"
; while RegExMatch(string, regex, token)
; {
;    string := RegExReplace(string, regex, "")
;    MsgBox, token1=%token1%`nnewstr=%string%
; }

while (string)
{
   token := RegExExtract(string, regex, "")
   msgbox token=%token%`nnewstr=%string%
}

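RegExExtract(ByRef haystack, needle, replacement)
{
   ; match receives the whole match, match1 the first capture group
   if RegExMatch(haystack, needle, match)
   {
      ; Limit = 1 so only the first occurrence is replaced
      haystack := RegExReplace(haystack, needle, replacement, , 1)
      return match1
   }
   return ""
}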

swagfag · 10 May 2020, 10:51

what about patterns containing more than 1 capturing groups?(eg ^(.).(.))
what about patterns containing nested capturing groups?(eg ^((.).))

given $1's presence in ur example, presumably, u want to be able to also replace the captured group with something other than an empty string(ie erasing it)
so a function accepting an array/map(to account for named patterns) of replacements and returning a collection of matched tokens(ignoring the replacing action, much like how RegExMatch with the O) option already does) would make more sense
still not sure how ud wanna handle nested groups though
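
For reference, a minimal sketch of what RegExMatch with the O) option reports for the nested pattern mentioned above, assuming AutoHotkey v1.1. The variable names m and out are only illustrative. Groups are numbered by their opening parenthesis, so the outer group comes first.

```autohotkey
; Sketch (AutoHotkey v1.1): "O)" stores a match object in the output variable.
string := "abcd"
if (RegExMatch(string, "O)^((.).)", m))
{
    out := "whole match: " m.Value(0) "`n"
    ; Count() is the number of capture groups in the pattern.
    Loop % m.Count()
        out .= "group " A_Index ": " m.Value(A_Index) " (pos " m.Pos(A_Index) ", len " m.Len(A_Index) ")`n"
    MsgBox % out
}
```

Here group 1 is "ab" and group 2 is "a", so a nested group overlaps its parent. Any removal scheme would have to decide which of the two wins.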

iseahound · 10 May 2020, 13:51

Right. Maybe Extraction Text might be a better term. Would allow named capture groups like ${value}. Or ordered groups like $3$2$1.

Running your code (I haven't benchmarked your code specifically) will result in a 5x - 10x slowdown compared to Loop Parse with no possibility of optimization within AutoHotkey. Granted RegEx itself has overhead, but it's reasonably possible to bring that down to 3x at least.

^(.).(.) - I imagine there would be a flag that either (1) deletes everything matched, abcd becomes d or (2) deletes only what is captured. So one run of RegExExtract("abcd", "^(.).(.)", "$1$2") becomes bd.
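
The two flag behaviours described above can be sketched with a match object, assuming AutoHotkey v1.1. RegExExtractSketch and its capturedOnly parameter are hypothetical names, not an existing or proposed built-in, and the sketch assumes the needle has no leading options (such as "i)") and no nested groups.

```autohotkey
; Hypothetical helper: returns the captured text and removes either the
; whole match (capturedOnly = false) or only the captured groups (true).
RegExExtractSketch(ByRef haystack, needle, capturedOnly := false)
{
    ; Assumes needle has no options prefix, so "O)" can simply be prepended.
    if (!RegExMatch(haystack, "O)" . needle, m))
        return ""
    tokens := ""
    Loop % m.Count()
        tokens .= m.Value(A_Index)
    if (capturedOnly)
    {
        ; Cut groups from last to first so earlier positions stay valid.
        ; Assumes the groups are not nested and do not overlap.
        n := m.Count()
        Loop % n
        {
            i := n - A_Index + 1
            if (m.Len(i) > 0)
                haystack := SubStr(haystack, 1, m.Pos(i) - 1) . SubStr(haystack, m.Pos(i) + m.Len(i))
        }
    }
    else
        haystack := SubStr(haystack, 1, m.Pos(0) - 1) . SubStr(haystack, m.Pos(0) + m.Len(0))
    return tokens
}

s1 := "abcd"
t1 := RegExExtractSketch(s1, "^(.).(.)")        ; whole match removed: s1 = "d"
s2 := "abcd"
t2 := RegExExtractSketch(s2, "^(.).(.)", true)  ; only groups removed: s2 = "bd"
MsgBox % t1 " / " s1 "`n" t2 " / " s2
```

Both calls return "ac"; only the remaining haystack differs.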

An alternative is to extend Loop Parse syntax to accept regex.

11 May 2020, 08:18

Your suggestion for a RegExExtract would run orders of magnitude slower than defining a starting position inside the RegExMatch (or RegExReplace).
I do not see a case where your course of action is the preferred one.
If you could provide any example that would make your suggestion seem more useful than others it would perhaps be easier to see the need for it.
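
A minimal sketch of the starting-position approach, assuming AutoHotkey v1.1. The pattern and variable names are illustrative only, and the loop assumes the pattern never matches the empty string (otherwise it would not advance).

```autohotkey
; Sketch: tokenize by moving StartingPos forward instead of deleting
; the matched text from the haystack on every iteration.
string := "abcd"
regex  := "(.)"     ; illustrative pattern: one character per token, no ^ anchor
pos    := 1
tokens := ""

; RegExMatch returns the position of the match, or 0 when nothing is found.
while (pos := RegExMatch(string, regex, token, pos))
{
    tokens .= token1 ","
    pos += StrLen(token)    ; continue right after the current match
}
MsgBox % RTrim(tokens, ",")   ; shows a,b,c,d
```

The haystack is never modified, so no string is rebuilt per token.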

iseahound · 11 May 2020, 12:37

Uhhh... Would be useful for:

Filtering strings using successive filters.
Tokenization without the need of preprocessing to add delimiters into Loop, Parse
Allow for tokenization in cases where delimiters don't naturally occur. Such as in Japanese where words are not separated by spaces.
Remove the need for 2 RegEx statements (one to match and return the matching word, one to match and delete)
Would sidestep an issue with string length in AutoHotkey: StrLen doesn't actually return the number of characters in the string but the number of UTF-16 code units. RegEx is wayyy more accurate. MsgBox % StrLen("𤭢")
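
To make the last point concrete, a small sketch assuming a Unicode build of AutoHotkey v1.1. It builds the character with Chr() so the result does not depend on the script file's encoding. The variable names are illustrative.

```autohotkey
; Sketch: compare StrLen with the number of regex matches of "."
str := Chr(0x24B62)     ; the character 𤭢, stored as a surrogate pair in UTF-16
units := StrLen(str)    ; counts UTF-16 code units, so 2 here

; The 4th parameter (OutputVarCount) receives the number of replacements made.
RegExReplace(str, "s).", "", chars)
MsgBox % "StrLen: " units "`nRegEx matches of '.': " chars
```

If the regex engine treats the surrogate pair as a single character, which is the behaviour the post relies on, the second number is 1.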

RegExMatch isn't even a useful function. It needs to return the match instead of the position of the first match.

11 May 2020, 15:36

In any case you would sacrifice a lot of performance by using this instead of a starting position.
Could you provide a case where shortening like this is absolutely necessary?

iseahound · 11 May 2020, 21:00

Here's an example since you insisted.

Code: Select all

message_in_a_bottle := "Help I me trapped am this on island secret no with or food x water x x"  

decode_message:
while (StrLen(message_in_a_bottle) > 5) {
   fragment := RegExExtract(message_in_a_bottle, "^\s*(\w+\b\s*)\b\w+\b\s*(\w+\s*)", "$1$2")
   MsgBox % message .= fragment
}

RegExExtract(ByRef haystack, needle, extract, flag := true) {
   value := RegExReplace(haystack, "^.*?" needle ".*$", extract)

   ; flag has the ability to replace all matching text or the captured group only
   if (flag) {
      extract := RegExReplace(extract, "(\$\d)", "🌹💩$1🌹") ; delimit capture groups only.
      matches := RegExReplace(haystack, "^.*?" needle ".*$", extract) ; resolve capture groups into matching text.
      Loop Parse, matches, % "🌹"
      {
         if (A_LoopField ~= "^💩") {
            haystack := StrReplace(haystack, RegExReplace(A_LoopField, "^💩"),,, 1)
         }
      }
   }
   else
      haystack := RegExReplace(haystack, needle)
   return value
}

Notice that each backreference $1, $2 is a token.

I don't feel like coding a second example, but it's basically a sieve. Given a set of words, collect all words that contain the letter a. Then collect all the words that contain the letter b, and so on. Not possible at this time, especially if the criterion suddenly becomes "match a regex pattern". To solve this example in the current language, use a regex match object, parse it, use string replace to remove each match, and proceed with the next sieve. The fact that RegExMatch returns an object is a big minus, as that is very slow, and adds complexity. RegExMatch needs to return a delimited string lmao.
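
One possible reading of the sieve, sketched with what AutoHotkey v1.1 already offers. This is not the poster's code: the word list and the sieves array are made up, and the sketch assumes the words are already space-separated, which sidesteps the no-delimiter case raised earlier.

```autohotkey
; Sketch: run successive regex sieves over a space-separated word list.
words  := "apple banana cherry bob kiwi"
sieves := ["a", "b"]    ; illustrative criteria; any regex pattern would do

for index, pattern in sieves
{
    picked := "", rest := ""
    Loop, Parse, words, %A_Space%
    {
        if (A_LoopField ~= pattern)      ; ~= is shorthand for RegExMatch
            picked .= A_LoopField " "
        else
            rest .= A_LoopField " "
    }
    MsgBox % "Sieve '" pattern "' collected: " RTrim(picked)
    words := RTrim(rest)    ; only unmatched words reach the next sieve
}
MsgBox % "Left over: " words
```

Each pass rebuilds the remaining list, which is the extra work a built-in extract-and-remove function would be meant to avoid.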

Also string functions are super broken in AHK, I tried SubStr(A_LoopField, 2) instead of RegExReplace(A_LoopField, "^💩") in my above code and got half a UTF-16 character lmao.

13 May 2020, 04:11

I don't understand why you'd have to use StrReplace or modify the parsing string in this context when you could just use the starting position parameter.

Creating a new object is a negligible operation in comparison to the editing that is done upon the original string (or the search within the original string).
The RegExMatch return value is fine too, since it doesn't return a RegEx object but a found position, an integer.
In any case it should be obvious to anybody reading this why a string consisting of possibly any character shouldn't be delimited by a fixed selection from the same character set.
