prototype 'RegEx match all' function

jeeswg
Posts: 6904
Joined: 19 Dec 2016, 01:58
Location: UK

prototype 'RegEx match all' function

02 Jun 2018, 22:09

- Users are commonly surprised that RegExMatch only returns one match, so I considered writing a prototype 'RegEx match all' function that works the way they might have expected.
- My hope is that something like this would appear as an example in the documentation, or possibly as a built-in function.
- Perhaps expectations are based on functionality such as that demonstrated here:
C++ Tutorial 19 : C++ Regular Expressions - YouTube
https://www.youtube.com/watch?v=9K4N6MO_R1Y
- I've tried to anticipate all likely problems, including trying to make it AHK v1/v2 two-way compatible. Note: the function uses the AHK v2 negative offsets behaviour in both AHK v1 and v2.
- Another problem is whether a particular needle would cause an infinite loop. In that scenario the function returns -1, and any matches collected up to that point are still available in the object.
- As a bonus, the latest version of my StrJoin prototype function is available underneath.
- I've modelled the function to be very similar to RegExMatch. The matches are placed in the object, and the function's return value is the number of matches, or -1 if there is an error.
- I'd very much welcome any ideas or suggestions. Something like this would be an important workhorse function, and so would need to be perfect and have widespread appeal.

Code: Select all

q:: ;prototype 'RegEx match all' function
vText := "abcdefghijklmnopqrstuvwxyz"
vNeedle := "..."
vCount := JEE_RegExMatchAll(vText, vNeedle, oMatch)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*)

vText := "abcdefghijklmnopqrstuvwxyz"
vNeedle := "..."
vCount := JEE_RegExMatchAll(vText, vNeedle, oMatch, 3)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*)

vText := "abcdefghijklmnopqrstuvwxyz"
vNeedle := "..."
vCount := JEE_RegExMatchAll(vText, vNeedle, oMatch, -9)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*)

vText := "abcdefghijklmnopqrstuvwxyz"
vNeedle := ".{1,3}"
vCount := JEE_RegExMatchAll(vText, vNeedle, oMatch)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*)

vText := "ab"
vNeedle := "..."
vCount := JEE_RegExMatchAll(vText, vNeedle, oMatch)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*)

vText := "ab"
vNeedle := ""
vCount := JEE_RegExMatchAll(vText, vNeedle, oMatch)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*)
return

;==================================================

JEE_RegExMatchAll(vText, vNeedle, ByRef oMatch:="", vPos:=1)
{
	;we're not interested in the return value, but we need it to get RegExMatch to take place
	static vRet := RegExMatch("", "", o), vIsV1 := !IsObject(o)
	if vIsV1
	{
		if (vPos = 0)
			return
		else if (vPos <= -1)
			vPos++
		if RegExMatch(vNeedle, "^[A-Za-z`a`n`r `t]*\)")
			vNeedle := "O" vNeedle
		else
			vNeedle := "O)" vNeedle
	}

	vCount := 0, oMatch := []
	while vPos := RegExMatch(vText, vNeedle, oTemp, vPos)
	{
		vCount += 1, vPos += StrLen(oTemp.0)
		if (vPos = vPosLast) ;prevent an infinite loop
			return -1
		oMatch.Push(oTemp.0), vPosLast := vPos
	}
	return vCount
}

;==================================================

; ;e.g.
; oArray := StrSplit("abcdefghijklmnopqrstuvwxyz")
; MsgBox, % JEE_StrJoin(" - ", oArray*)
; MsgBox, % JEE_StrJoin(["=","`r`n"], oArray*)
; MsgBox, % JEE_StrJoin(["`t","`r`n"], oArray*)
; MsgBox, % JEE_StrJoin(["`t","`t","`r`n"], oArray*)
; MsgBox, % JEE_StrJoin(["`t","`t","`t","`r`n"], oArray*)
; MsgBox, % JEE_StrJoin(["`t","`t","`t","`t","`r`n"], oArray*)
; MsgBox, % JEE_StrJoin(["","","","","`r`n"], oArray*)

JEE_StrJoin(vSep, oArray*)
{
	VarSetCapacity(vOutput, oArray.Length()*200*2)
	if (vSep.Length() = 1) ;convert 1-item array to string
		vSep := vSep.1
	if !IsObject(vSep)
	{
		Loop, % oArray.MaxIndex()-1
			vOutput .= oArray[A_Index] vSep
		vOutput .= oArray[oArray.MaxIndex()]
	}
	else
	{
		oSep := vSep, vCount := oSep.Length(), vIndex := 0
		Loop, % oArray.MaxIndex()-1
		{
			;vIndex := Mod(A_Index-1, vCount)+1
			vIndex := (vIndex = vCount) ? 1 : vIndex+1
			, vOutput .= oArray[A_Index] oSep[vIndex]
		}
		vOutput .= oArray[oArray.MaxIndex()]
	}
	return vOutput
}

;==================================================
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
jeeswg
Posts: 6904
Joined: 19 Dec 2016, 01:58
Location: UK

Re: prototype 'RegEx match all' function

03 Jun 2018, 03:10

- Thanks for the link, just me.
- Curiously, I hadn't seen any such functions referenced in the many threads where people have recreated a function similar to the one that I've presented.
- However, on inspecting the functions, they all have various problems, including the use of Transform-Deref, checking global variables, and not accounting for the infinite loop possibility.
- Anyway, I would hope that the main devs might like to put something like what I've suggested into the documentation. I really don't mind if it's quite different to my function, just so long as we have something standard to work with, and to refer to the next time someone mentions this subject in the forums.
just me
Posts: 6652
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: prototype 'RegEx match all' function

03 Jun 2018, 03:48

Hi jeeswg,

can you provide an example for an endless loop?
jeeswg
Posts: 6904
Joined: 19 Dec 2016, 01:58
Location: UK

Re: prototype 'RegEx match all' function

03 Jun 2018, 04:12

- There's an example above where the needle is blank. RegExMatch finds the blank string and returns it in the match object. If you then increment the starting position by the length of the match (0 characters), the search starts again at the same position, hence an infinite loop.
- Another needle that could cause an infinite loop: vNeedle := "(?=.|)"
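
The blank-needle case can be reproduced in isolation. This is a minimal sketch, assuming AHK v1 and the "O)" option that the function above prepends; the Loop 5 cap is not part of the function and is there only so that the demo ends:

```autohotkey
; AHK v1 sketch: a naive 'match all' loop does not advance on a zero-length match
vText := "ab"
vNeedle := "O)"   ; blank pattern, with "O)" selecting object output mode
vPos := 1, vOutput := ""
Loop 5            ; hypothetical cap, only so that the demo terminates
{
	vPos := RegExMatch(vText, vNeedle, oTemp, vPos)
	if !vPos
		break
	vOutput .= "pos " vPos ", len " oTemp.Len(0) "`r`n"
	vPos += oTemp.Len(0)   ; adds 0, so the next search starts at the same position
}
MsgBox, % vOutput
```

Every pass finds the same empty match at the same position, which is the situation the vPosLast check in the function is meant to catch.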
jeeswg
Posts: 6904
Joined: 19 Dec 2016, 01:58
Location: UK

Re: prototype 'RegEx match all' function

03 Jun 2018, 04:31

- One reason I posted was in case anyone had any suggestions for more advanced functionality. Or if there were other common bits of advanced functionality that the function doesn't currently have.
- I just thought of one thing which may be uncommon, but worth mentioning, if an object was specified for the needle, you could have an alternating needle. E.g. a very simple example: grab 2 characters, then 3 characters, then 2, then 3 etc.
just me
Posts: 6652
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: prototype 'RegEx match all' function

03 Jun 2018, 04:39

The needles of both examples are searching for an empty string. So I'd call it user error, but you cannot prevent users from making such errors.

Why do you want to force v1 users to use v2 conventions for vPos, although they cannot be used with the built-in RegExMatch() function?
jeeswg wrote:- I just thought of one thing which may be uncommon, but worth mentioning, if an object was specified for the needle, you could have an alternating needle. E.g. a very simple example: grab 2 characters, then 3 characters, then 2, then 3 etc.
That wouldn't be a 'global match'.
jeeswg
Posts: 6904
Joined: 19 Dec 2016, 01:58
Location: UK

Re: prototype 'RegEx match all' function

03 Jun 2018, 04:57

- Yes, I thought also about that point re. using the AHK v1 convention, on balance I decided against using the AHK v1 convention.
- Well I don't know where you got the term 'global match' from (not that I disagree). The goal was to get all matches, and I've added the idea to get all alternate matches.
- Here's a mod of the original function. It could be quite useful for alternating pairs or triples for example.

Code: Select all

q:: ;prototype 'RegEx match all' function
vText := "abcdefghijklmnopqrstuvwxyz"
oNeedle := [".", "..", "..."]
vCount := JEE_RegExMatchAll(vText, oNeedle, oMatch)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*) ;13 a,bc,def,g,hi,jkl,m,no,pqr,s,tu,vwx,y
return

JEE_RegExMatchAll(vText, vNeedle, ByRef oMatch:="", vPos:=1)
{
	;we're not interested in the return value, but we need it to get RegExMatch to take place
	static vRet := RegExMatch("", "", o), vIsV1 := !IsObject(o)
	if IsObject(vNeedle) && (vNeedle.Length() = 1) ;convert 1-item array to string
		vNeedle := vNeedle.1
	if vIsV1
	{
		if (vPos = 0)
			return
		else if (vPos <= -1)
			vPos++
	}

	vCount := 0, oMatch := []
	if !IsObject(vNeedle)
	{
		if vIsV1
			if RegExMatch(vNeedle, "^[A-Za-z`a`n`r `t]*\)")
				vNeedle := "O" vNeedle
			else
				vNeedle := "O)" vNeedle
		while vPos := RegExMatch(vText, vNeedle, oTemp, vPos)
		{
			vCount += 1, vPos += StrLen(oTemp.0)
			if (vPos = vPosLast) ;prevent an infinite loop
				return -1
			oMatch.Push(oTemp.0), vPosLast := vPos
		}
	}
	else
	{
		oNeedle := vNeedle
		if !oNeedle.Length()
			return -1
		if vIsV1
			for vKey, vNeedle in oNeedle
				if RegExMatch(vNeedle, "^[A-Za-z`a`n`r `t]*\)")
					oNeedle[vKey] := "O" vNeedle
				else
					oNeedle[vKey] := "O)" vNeedle
		vCountN := oNeedle.Length(), vIndex := 1, vNeedle := oNeedle.1
		while vPos := RegExMatch(vText, vNeedle, oTemp, vPos)
		{
			vCount += 1, vPos += StrLen(oTemp.0)
			if (vPos = vPosLast) ;prevent an infinite loop
				return -1
			oMatch.Push(oTemp.0), vPosLast := vPos
			, vIndex := (vIndex = vCountN) ? 1 : vIndex+1
			, vNeedle := oNeedle[vIndex]
		}
	}
	return vCount
}
Helgef
Posts: 4067
Joined: 17 Jul 2016, 01:02
Contact:

Re: prototype 'RegEx match all' function

03 Jun 2018, 06:58

Hello jeeswg and just me :wave:.

Code: Select all

oMatch.Push(oTemp.0)
You discard all matched subpatterns; it should be oMatch.Push(oTemp).
just me wrote:The needles of both examples are searching for an empty string. So I'd call it user error
Returning -1 is the error. "" matches at any position and is a valid regular expression. You should do vPos++ if the overall match has length 0.
jeeswg wrote:AHK v1/v2 two-way compatible
v1 and v2 also match line breaks differently.
v2 changes wrote:RegEx newline matching defaults to (*ANYCRLF) and (*BSR_ANYCRLF); `r and `n are recognized in addition to `r`n. The `a option implicitly enables (*BSR_UNICODE).
I do not think this function is suitable for two-way compatibility.

Code: Select all

StrLen(oTemp.0)
You should use otemp.len[0] instead.
I'd use byref for the first parameter.
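
To illustrate the first and third points, here is a rough sketch, assuming AHK v1 only (no two-way compatibility). The sample text and needle are made up for the example, and the needle cannot match an empty string, so zero-length matches are not handled here:

```autohotkey
; AHK v1 sketch: keep the whole match object so that subpatterns stay available
vText := "a1,b2,c3"            ; hypothetical sample text
vNeedle := "O)([a-z])(\d)"     ; "O)" selects object output mode in AHK v1
vPos := 1, oMatch := [], vOutput := ""
while vPos := RegExMatch(vText, vNeedle, oTemp, vPos)
{
	oMatch.Push(oTemp)         ; push the object, not just oTemp.0
	vPos += oTemp.Len(0)       ; length as stored in the match object, no StrLen call
}
for vKey, oItem in oMatch
	vOutput .= oItem.Value(0) " = " oItem.Value(1) " + " oItem.Value(2) "`r`n"
MsgBox, % vOutput
```

The caller then decides whether it wants the overall match, Value(0), or a particular subpattern, at the cost of receiving objects instead of plain strings.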

Thanks for sharing, cheers.
jeeswg
Posts: 6904
Joined: 19 Dec 2016, 01:58
Location: UK

Re: prototype 'RegEx match all' function

03 Jun 2018, 14:08

- @Helgef: Re. subpatterns. I thought of pushing the RegExMatch object each time, however, is that what users would expect? Is that the right thing to do? Is there an existing function that does something similar? Maybe it would surprise people to get objects, not strings. Furthermore, it is possible to run RegExMatch again on any results if more details are wanted. I may even agree with you to an extent, it's tricky. This was one of the main reasons I asked for suggestions.
- I thought of doing vPos++, but it seems quite odd. So if I search for a blank string on a 26-char string, I get an output with 26/27 blank strings? How would you fix this? It uses vPos++ but still results in an infinite loop.

Code: Select all

q:: ;prototype 'RegEx match all' function - handle blank needles
MsgBox, % RegExMatch("abc", "",, 10) ;4

vText := "abc"
;vNeedle := "."
vNeedle := ""
vPos := 1
oMatch := []
while vPos := RegExMatch(vText, vNeedle, oTemp, vPos)
{
	vCount += 1, vPos += Max(1,StrLen(oTemp.0))
	, oMatch.Push(oTemp.0)
	MsgBox, % vPos
}
MsgBox, % oMatch.Length()
return
- Re. v1 and v2 matches line breaks differently. Yes, I meant to start a thread re. this. I'd need some complicated RegEx to fix an AHK v1 needle to make it work like in AHK v2. (I had been hoping that at some point there'd be a built-in option to make RegExMatch/RegExReplace work in AHK v1 as they do in AHK v2.)
- (A theoretical built-in function might use the relevant AHK v1/v2 offset style, or, maybe it might go with the AHK v2 style in both AHK v1/v2, whichever. A documentation example would probably skip the compatibility support to keep it simpler.)
- What's wrong with StrLen(oTemp.0)?
- Haha the old ByRef question. It's a fair point. (Philosophically ByVal, but ByRef to achieve a performance gain.) Cheers.

- [EDIT:] Returning 0 instead of -1, thus 0 for both no match and error, might be more consistent with AutoHotkey functions. Possibly returning a blank string for an error might be worth considering, but I'm not sure if AutoHotkey has ever done this, except perhaps for DllCall.
Helgef
Posts: 4067
Joined: 17 Jul 2016, 01:02
Contact:

Re: prototype 'RegEx match all' function

04 Jun 2018, 04:05

jeeswg wrote:is that what users would expect? Is that the right thing to do?
Who knows :crazy: ? That's how I do it.
jeeswg wrote:How would you fix this?
I'm certain you will manage ;) .
jeeswg wrote:it uses vPos++ but still results in an infinite loop.
It is because, as I said, "" matches any position, and
StartingPosition wrote:If StartingPosition is beyond the length of Haystack, the search starts at the empty string that lies at the end of Haystack (which typically results in no match).
And regexmatch("", "") results in a match.
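
One possible way to make the vPos++ approach terminate, given the behaviour quoted above, is to stop once the starting position has gone past the empty string at the end of the haystack. This is only a sketch, assuming AHK v1; the vLen guard is an addition for illustration and is not taken from the function in this thread:

```autohotkey
; AHK v1 sketch: step past zero-length matches and stop after the end of the haystack
vText := "abc"
vNeedle := "O)"   ; blank pattern, with "O)" selecting object output mode
vPos := 1, vLen := StrLen(vText), oMatch := []
while (vPos <= vLen + 1) && (vPos := RegExMatch(vText, vNeedle, oTemp, vPos))
{
	oMatch.Push(oTemp.Value(0))
	vPos += Max(1, oTemp.Len(0))   ; a zero-length match still moves on by 1
}
MsgBox, % oMatch.Length()
```

With a blank needle this should collect one empty match per character position plus one at the end of the haystack, so 4 for "abc", in line with the 26/27 question above.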
jeeswg wrote:What's wrong with StrLen(oTemp.0)
Most importantly, it gives an incorrect result in v1 if the match contains a binary zero. That is probably because v1 always measures the string by calling _tcslen, so it is extra work too, even when the result is correct. The correct length is stored in the match object, so you might as well use it regardless.
jeeswg wrote:Returning 0 instead of -1, thus 0 for both no match and error, might be more consistent with AutoHotkey functions
If there is an error, you should throw; that is consistent with AHK v2 functions. You should use try-catch-throw on the RegExMatch call for the v1 behaviour to be consistent with the v2 behaviour. As I said, it is not a good candidate for two-way compatibility.
jeeswg wrote:Possibly returning a blank string for an error might be worth considering, but I'm not sure if AutoHotkey has ever done this, except perhaps for DllCall.
This is a v1 convention; all math functions do it, to mention the first thing that comes to mind.

Cheers.
sankt_Ola
Posts: 10
Joined: 25 Sep 2017, 05:49

Re: prototype 'RegEx match all' function

04 Jan 2020, 11:48

I used this function and like it, but capturing groups do not work as I expected.

Test Code:

Code: Select all

VText= 
(
Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
officia deserunt mollit anim id est laborum
Doctor (klartext)
Stephen Falken, Professor 
Lorem ipsum dolor
)

vNeedle := "Doctor \(klartext\).*\R\K(.+)," 
vCount := JEE_RegExMatchAll(vText, vNeedle, oMatch)
MsgBox, % vCount "`r`n" JEE_StrJoin(",", oMatch*)

With the test code above I get a MsgBox with the text:

1
Stephen Falken,

That is, the comma gets included in the result, though I expected only the pattern within the parentheses to be returned.

If I change the following line in your function, it seems to work as expected.

Code: Select all

;oMatch.Push(oTemp.0), vPosLast := vPos
 oMatch.Push(oTemp.Value(1)), vPosLast := vPos

Do I break something else? Any comments?
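
For reference, the function pushes oTemp.0, which is the overall match. \K only resets where the overall match starts, so the trailing comma is still part of it. Switching to oTemp.Value(1) returns the first subpattern instead, but that is blank for any needle without a capturing group (such as the "..." examples in the first post). A small sketch, assuming AHK v1 and a shortened version of the test text, shows the difference and one alternative that leaves the function unchanged, namely moving the comma into a lookahead:

```autohotkey
; AHK v1 sketch: overall match versus first subpattern, and a lookahead alternative
vText := "Doctor (klartext)`nStephen Falken, Professor"   ; shortened sample text

; original needle: the comma is part of the overall match
RegExMatch(vText, "O)Doctor \(klartext\).*\R\K(.+),", oTemp)
MsgBox, % "overall: " oTemp.Value(0) "`r`n" "subpattern 1: " oTemp.Value(1)

; alternative needle: the comma is only asserted, so it is not part of the match
RegExMatch(vText, "O)Doctor \(klartext\).*\R\K.+(?=,)", oTemp2)
MsgBox, % "overall: " oTemp2.Value(0)
```

With the lookahead version the overall match itself ends before the comma, so the unmodified function would return the name without it.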