RegExMatchArray()


12 replies to this topic
Slanter
  • Members
  • 739 posts
  • Last active: Jul 08 2011 05:26 AM
  • Joined: 28 May 2008
This is a function that works very similarly to RegExMatch(), but instead of finding only the first match, it finds all matches and outputs them to an array. For example, with the string "(abcdef)(ghi)" and the RegEx "\(\w+\)", RegExMatch would find only "(abcdef)", while RegExMatchArray will find both "(abcdef)" and "(ghi)".

Syntax: FoundCount := RegExMatchArray(Haystack, NeedleRegEx, OutputArray[, StartingPosition = 1, SubPats = 1])
Information for Haystack, NeedleRegEx and StartingPosition can be found in the documentation for RegExMatch. Most of the rest of this can be too, but I felt like editing it a little to fit the function (also, I was bored). :)

FoundCount:
RegExMatchArray returns the number of times NeedleRegEx could be matched in Haystack. If an error occurs, a blank string is returned and ErrorLevel is set to one of the values listed in the RegExMatch documentation. The count will also be stored in OutputArray0.

OutputArray:
The base name of the pseudo-array that will contain all of the pieces of Haystack matched by RegExMatchArray. If the pattern is not found, the array will not be created. Note that the array name MUST be enclosed in quotes.

Mode 1 (default): If any capturing subpatterns are found, they will be stored in an array whose base name is OutputArray followed by the match number (written Match# below). For example, if the array's name is Match, and we're currently working on the first complete match of NeedleRegEx in Haystack, the substring that matches the first subpattern would be stored in Match1_1, the second in Match1_2, and so on. The exception to this is named subpatterns: they are stored by name instead of number. For example, the substring that matches the named subpattern Year would be stored in Match#_Year. If a particular subpattern does not match anything, its variable is made blank. The variable OutputArray0_0 will contain the number of subpatterns that could be matched.

Mode 2 (position-and-length): See the RegExMatch documentation for more information. The position and length of subpatterns are stored in Match#_Len# and Match#_Pos#.

SubPats:
This flag tells the function whether or not to create the secondary arrays for subpatterns (i.e. Match#_#). Default is 1.
And finally, here is the function. Please tell me if anything doesn't work as expected :D
RegExMatchArray(Haystack, NeedleRegEx, OutputArray, StartingPosition=1, SubPats=1) {
   Global
   Local Pos := StartingPosition - 1, Sub := 0, Idx := 0, Mode := 1, SPats, Str
   If (SubPats) {
      Loop {
         If (!Sub := RegExMatch(NeedleRegEx, "(?<!\\)\((?!\?:)", Str, Sub + 1))
            Break
         RegExMatch(NeedleRegEx, "\(\?\<[^\>]+\>", Str, Sub)
         SPats .= (SPats ? "|" : "") . (Str ? SubStr(Str,4,-1) : A_Index)
      }
   }
   If (RegExMatch(NeedleRegEx, "^[^\)]+\)", Str))
      Mode := InStr(Str, "P", 1) ? 2 : 1
   Loop {
      If (!Pos := RegExMatch(Haystack, NeedleRegEx, Str, Pos + 1))
         Break
      If (ErrorLevel) ; stop on a RegEx error
         Break
      Idx := A_Index
      %OutputArray%%Idx% := Str
      %OutputArray%0 := Idx
      Loop, Parse, SPats, |
      {
         If (Mode = 1)
            %OutputArray%%Idx%_%A_LoopField% := Str%A_LoopField%
         Else {
            %OutputArray%%Idx%_Len%A_LoopField% := StrLen%A_LoopField%
            %OutputArray%%Idx%_Pos%A_LoopField% := StrPos%A_LoopField%
         }
         %OutputArray%0_0 := A_Index
      }
   }
   If (ErrorLevel) ; on a RegEx error, return a blank string
      Return
   Return Idx
}

Unless otherwise stated, all code is untested


Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

it finds all matches and outputs them to an array.

Good idea, but it needs more explanation and generalization. In the case of RegExMatchArray("123","\d+","o"), all non-empty substrings should be returned, not only 123, 23 and 3. That is, 1, 12 and 2 are missing. With a longer Haystack there would be a huge number of outputs. Also, if the empty string is a match (e.g. Needle = \d*), the function enters an infinite loop and eventually crashes.
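
To make the empty-match problem concrete, here is a minimal sketch (not from the original post) of a loop that collects non-overlapping matches and always moves the search position forward. The name CollectMatches and its parameters are made up for illustration, it assumes AutoHotkey v1 RegExMatch() behaviour, and it is untested. It does not attempt the "all substrings" enumeration described above.

```autohotkey
; Hypothetical helper, not part of RegExMatchArray: stores the matches in
; List (one per line) and returns how many were found.
CollectMatches(Haystack, NeedleRegEx, ByRef List) {
	List := ""
	Count := 0
	Pos := 1
	Limit := StrLen(Haystack) + 1   ; last position worth searching from
	While (Pos <= Limit) {
		Found := RegExMatch(Haystack, NeedleRegEx, Match, Pos)
		If (!Found)                 ; 0 = no more matches, blank = RegEx error
			Break
		Count++
		List .= (Count > 1 ? "`n" : "") . Match
		; Continue after this match, but step at least one character so an
		; empty match cannot be found at the same position forever.
		Pos := Found + (StrLen(Match) ? StrLen(Match) : 1)
	}
	Return Count
}
```

With this loop, CollectMatches("123", "\d+", List) should stop after the single match 123, and a needle such as \d* terminates because the start position eventually passes the end of Haystack.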

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts
  • Last active:
  • Joined: 27 May 2007
You may find (no pun intended) grep to be useful:
http://www.autohotke...topic16164.html

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

You may find (no pun intended) grep to be useful

I did not. Your link points to a function that provides even less useful results (just one match with grep("123","\d+",o)); it also gets into an infinite loop, etc.

Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006
Based on my limited understanding of pcre.txt, if AutoHotkey were to support (?C) callouts, we could find all possible matches by appending (?C)(*FAIL) to the expression. (?C) would cause the callout (function) to be called with the current match, and (*FAIL) would cause the match to fail, forcing PCRE to backtrack and search for another match.
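
For illustration only: later AutoHotkey v1.1 (AutoHotkey_L) releases document RegEx callouts with the syntax (?CNumber:Function), and a callout function that returns 1 makes the match fail at that point, which has much the same effect as (*FAIL). The sketch below assumes that documented behaviour. The names CollectMatch and AllMatches are hypothetical, the code is untested, and it will not work on builds without callout support.

```autohotkey
; Assumes AutoHotkey v1.1+ (AutoHotkey_L) with RegEx callout support.
AllMatches := ""
RegExMatch("123", "\d+(?C:CollectMatch)")   ; returns 0, since every attempt is rejected
MsgBox % AllMatches
Return

; Called by the RegEx engine each time the pattern reaches the callout.
CollectMatch(Match) {
	global AllMatches
	AllMatches .= Match . "`n"   ; record the text matched so far
	Return 1                     ; fail here so the engine backtracks and tries again
}
```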

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

if AutoHotkey were to support (?C) callouts....

We only have a hope if it gets on your todo list. Does it?

Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006
Yes. I looked into it before I saw this thread. We can continue this discussion elsewhere, so as not to hijack the thread...

Joy2DWorld
  • Members
  • 562 posts
  • Last active: Jun 30 2014 07:48 PM
  • Joined: 04 Dec 2006

You may find (no pun intended) grep to be useful

I did not. Your link points to a function that provides even less useful results (just one match with grep("123","\d+",o)); it also gets into an infinite loop, etc.


check out RegExMatchG buried in that thread... it's nice!


Also, as for callouts, I use:

RegExLoop("1a is 2b sometihing cool ?3x","\d\K(\w+)","testt") ; returns match as VAR prefix, position in VAR@
   testt:
   if (a_thislabel = "testt") {
      debug( "$ is " $ " at " $@)
;      ; etc;
;      ; if change text of haystack, (ie. replace $ <full match> with alternate contents) set $ to new contents so position in loop is properly set based on len of new insert! ...
   }
;   code continues...
;   



; ex:
;
;	RegExLoop("1a is 2b sometihing cool ?3x","\d\K(\w+)","testt") ; returns match as VAR prefix, position in VAR@
;	testt:
;	if (a_thislabel = "testt") {
;		debug( "$ is " $ " at " $@)
;		; etc;
;		; if change text of haystack, (ie. replace $ <full match> with alternate contents) set $ to new contents so position in loop is properly set based on len of new insert! ...
;	}
;	code continues...
;	
;	older-->
;	goto, testtend
;	testt:
;		debug( "$ is " $ " at " $@)
;	return
;	testtend:
;
RegExLoop(haystack, needle, theprocess, var = "$", position = 1) {
	; theprocess is the label of the subroutine that processes each match.
	; -->> returns match as VAR prefix, position in VAR@
	global
	; local tmps, count
	; local save$ := %var%
	if (!IsLabel(theprocess))
		return
	loop {
		position := RegExMatch(haystack, needle, %var%, position)
		if (!position)
			break
		%var%@ := position
		gosub, %theprocess%
		position += StrLen(%var%)
	}
}

Joyce Jamce

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
Interesting, this reminds me of RegExMatchAll which I worked on a while ago. Due to the variable scope issue one cannot use either inside another function. grep was supposed to be a compromise but the result by its nature is as Lexicos said "less useful." What we really need is true arrays so something like PHP's preg_grep or GroupCollection from .NET can be supported.

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006

as Lexicos said "less useful."

I find grep's results generally more useful. No such user posted in this thread... I think you mean Laszlo.

What we really need is true arrays

Too true.

Due to the variable scope issue one cannot use either inside another function.

Well, not by nice, conventional means. See my post beginning with "That aside".

majkinetor
  • Moderators
  • 4512 posts
  • Last active: May 20 2019 07:41 AM
  • Joined: 24 May 2006
Arrays are definitely good, but an AHK callback function called for each match, with a boolean result to continue or stop searching, would also be nice.

majkinetor
  • Moderators
  • 4512 posts
  • Last active: May 20 2019 07:41 AM
  • Joined: 24 May 2006
I made a RegExCallback function to demonstrate what I am talking about.

text := "abc123abc456aaa bla 123bla456bbbb 987a"
RegExCallback(text, "123[a-z]+456", "MyFun")

text := "abc123abc456aaa bla 123bla456bbbb 987a"
RegExCallback(text, "(123)([a-z]+)(456)", "MyFun2", 3)
return

MyFun(All){
	msgbox %A_ThisFunc%: %all%
}

MyFun2(All, s1, s2, s3){
	msgbox %A_ThisFunc%: %s1% | %s2% | %s3%
	if s2 = abc
		return 0
}


RegExCallback(HayStack, NeedleRegEx, Fun, SubPaterns=0, StartingPosition=1){	
	pos := StartingPosition, n:=0
	loop, 
	{
		i := RegExMatch(HayStack, NeedleRegEx, o, pos), n++
		IfEqual, i, 0, return n
		pos := i+Strlen(o)
		goto RegExCallback%SubPaterns%

		RegExCallback0: 
			r := %Fun%(o)
			goto RegExCallback
		RegExCallback1: 
			r := %Fun%(o, o1)
			goto RegExCallback
		RegExCallback2: 
			r := %Fun%(o, o1, o2)
			goto RegExCallback
		RegExCallback3: 
			r := %Fun%(o, o1, o2, o3)
			goto RegExCallback
		RegExCallback4: 
			r := %Fun%(o, o1, o2, o3, o4)
			goto RegExCallback
		RegExCallback5: 
			r := %Fun%(o, o1, o2, o3, o4, o5)
		
		RegExCallback:
			ifEqual, r, 0, return n			
	}
}
The callback function accepts n+1 parameters, where n is the number of subpatterns: the first parameter contains the entire match, and the rest contain the subpatterns. Return true to continue iterating or false (0) to stop.

You can use the function to do any kind of processing on data patterns. In Lexikos's version of AHK it can be much simpler due to updates to dynamic function calls (i.e. the SubPaterns parameter can be calculated, so the user doesn't have to think about it).

The true power of this concept comes in the form of a RegExReplaceCallback, where the function's result is used as the replacement string.
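
A rough sketch of what such a RegExReplaceCallback could look like, in the same AutoHotkey v1 style. This is only an illustration of the idea, not an existing function: the names RegExReplaceCallback and Bracket are hypothetical, only the whole match is passed to the callback (no subpatterns), and the code is untested.

```autohotkey
; Hypothetical sketch: each match is replaced by the callback's return value.
MsgBox % RegExReplaceCallback("abc123def456", "\d+", "Bracket")  ; shows abc[123]def[456]
Return

; Example callback: wraps the match in square brackets.
Bracket(Match) {
	Return "[" . Match . "]"
}

RegExReplaceCallback(Haystack, NeedleRegEx, Fun) {
	Out := "", Pos := 1
	Limit := StrLen(Haystack) + 1
	While (Pos <= Limit) {
		Found := RegExMatch(Haystack, NeedleRegEx, Match, Pos)
		If (!Found)                              ; no more matches, or a RegEx error
			Break
		; Copy the text before the match, then the callback's replacement.
		Out .= SubStr(Haystack, Pos, Found - Pos) . %Fun%(Match)
		If (StrLen(Match))
			Pos := Found + StrLen(Match)
		Else {
			; Empty match: copy one character and step over it to avoid looping forever.
			Out .= SubStr(Haystack, Found, 1)
			Pos := Found + 1
		}
	}
	Return Out . SubStr(Haystack, Pos)           ; append whatever follows the last match
}
```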

Arion
  • Members
  • 7 posts
  • Last active: Mar 17 2012 01:58 AM
  • Joined: 21 Oct 2008
Reviving this thread to say your solution worked flawlessly. Thanks a lot!