RegExMatchArray()


12 replies to this topic

#1 Slanter

Slanter
  • Members
  • 739 posts

Posted 02 January 2009 - 12:17 PM

This function works much like RegExMatch(), but instead of finding only the first match, it finds all matches and outputs them to an array. For example, with the string "(abcdef)(ghi)" and the RegEx "\(\w+\)", RegExMatch would find only "(abcdef)", while RegExMatchArray finds both "(abcdef)" and "(ghi)".

Syntax: FoundCount := RegExMatchArray(Haystack, NeedleRegEx, OutputArray[, StartingPosition = 1, SubPats = 1])

Information for Haystack, NeedleRegEx and StartingPosition can be found in the documentation for RegExMatch. Most of the rest of this can be too, but I felt like editing it a little to fit the function (also, I was bored). :)

FoundCount:
RegExMatchArray returns the number of times NeedleRegEx could be matched in Haystack. If an error occurs, a blank string is returned and ErrorLevel is set to one of the values documented for RegExMatch. The match count is also stored in OutputArray0.

OutputArray:
The base name of the pseudo-array that will contain all of the pieces of Haystack matched by RegExMatchArray. If the pattern is not found, the array will not be created. Note that the array name MUST be enclosed in quotes.

Mode 1 (default): If any capturing subpatterns are found, they are stored in variables named OutputArray#_N, where # is the number of the match and N is the number of the subpattern. For example, if the array's name is Match, and we're currently working on the first complete match of NeedleRegEx in Haystack, the substring that matches the first subpattern would be stored in Match1_1, the second in Match1_2, and so on. The exception to this is named subpatterns: they are stored by name instead of number. For example, the substring that matches the named subpattern Year would be stored in Match#_Year. If a particular subpattern does not match anything, its variable is made blank. The variable OutputArray0_0 will contain the number of subpatterns that could be matched.

Mode 2 (position-and-length): See RegExMatch documentation for more information. Position and length of subpatterns are stored in Match#_Len# and Match#_Pos#.

SubPats:
This flag tells the function whether or not to create the secondary arrays for subpatterns (i.e. Match#_#). Default is 1.
And finally, here is the function. Please tell me if anything doesn't work as expected :D
RegExMatchArray(Haystack, NeedleRegEx, OutputArray, StartingPosition=1, SubPats=1) {
   Global
   Local Pos := StartingPosition - 1, Sub := 0, Idx := 0, Mode := 1, SPats, Str
   If (SubPats) {
      Loop {
         If (!Sub := RegExMatch(NeedleRegEx, "(?<!\\)\((?!\?:)", Str, Sub + 1))
            Break
         RegExMatch(NeedleRegEx, "\(\?\<[^\>]+\>", Str, Sub)
         SPats .= (SPats ? "|" : "") . (Str ? SubStr(Str,4,-1) : A_Index)
      }
   }
   If (RegExMatch(NeedleRegEx, "^[^\)]+\)", Str))
      Mode := InStr(Str, "P", 1) ? 2 : 1
   Loop {
      If (!Pos := RegExMatch(Haystack, NeedleRegEx, Str, Pos + 1))
         Break
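      If (ErrorLevel)   ; test ErrorLevel directly; "IfEqual, 0" compared against the built-in variable 0 (command-line parameter count)
         Break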
      Idx := A_Index
      %OutputArray%%Idx% := Str
      %OutputArray%0 := Idx
      Loop, Parse, SPats, |
      {
         If (Mode = 1)
            %OutputArray%%Idx%_%A_LoopField% := Str%A_LoopField%
         Else {
            %OutputArray%%Idx%_Len%A_LoopField% := StrLen%A_LoopField%
            %OutputArray%%Idx%_Pos%A_LoopField% := StrPos%A_LoopField%
         }
         %OutputArray%0_0 := A_Index
      }
   }
   If (ErrorLevel)   ; on a RegEx error, return a blank string
      Return
   Return Idx
}
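
A minimal usage sketch for the example string above, assuming the RegExMatchArray() function from this post is pasted into the same script. The variable names Count and List are only for illustration, and no subpatterns are involved.

```autohotkey
; Assumes RegExMatchArray() from this post is defined elsewhere in the script.
; The array name is passed as a quoted string, as the syntax notes require.
Count := RegExMatchArray("(abcdef)(ghi)", "\(\w+\)", "Match")

; Match0 holds the number of matches; Match1, Match2, ... hold the matches.
List := ""
Loop, %Count%
    List .= "Match" . A_Index . " = " . Match%A_Index% . "`n"
MsgBox, Found %Count% match(es):`n%List%
```

Because the function runs in assume-global mode, the Match variables are created as globals, so a sketch like this only works as written outside of other functions.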


#2 Laszlo

Laszlo
  • Fellows
  • 4713 posts

Posted 02 January 2009 - 02:52 PM

it finds all matches and outputs them to an array.

Good idea, but more explanation/generalization is needed: in the case of RegExMatchArray("123","\d+","o"), all non-empty substrings should be returned, not only 123, 23 and 3. That is, 1, 12 and 2 are missing, too. If Haystack is longer, there would be a huge number of outputs. Also, if the empty string is a match (e.g. Needle = \d*), the function gets into an infinite loop and eventually crashes.
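
To make the two issues concrete, here is a minimal sketch of one possible generalization. It is not the original function and not the "all substrings" behavior described above: it steps past each match instead of advancing by one character, so only non-overlapping matches are returned, and it skips empty matches so the loop always terminates. MatchList is a hypothetical name; it returns a newline-delimited list rather than a pseudo-array and assumes the default (non-"P") output mode.

```autohotkey
; Hypothetical helper for illustration; not the original RegExMatchArray().
; Returns all non-overlapping, non-empty matches as a newline-delimited list.
; Assumes NeedleRegEx uses the default output mode (no "P" option).
MatchList(Haystack, NeedleRegEx, StartingPosition = 1) {
    Pos := StartingPosition, List := ""
    Loop {
        Pos := RegExMatch(Haystack, NeedleRegEx, Str, Pos)
        If (!Pos)                      ; 0 = no more matches, blank = error
            Break
        Len := StrLen(Str)
        If (Len > 0)
            List .= (List = "" ? "" : "`n") . Str
        ; Continue after the match; advance at least one character so an
        ; empty match cannot be found at the same position again.
        Pos += (Len > 0) ? Len : 1
        If (Pos > StrLen(Haystack))    ; nothing left to search
            Break
    }
    Return List
}

MsgBox % MatchList("123", "\d+")         ; shows 123
MsgBox % MatchList("a1b22c", "\d*")      ; shows 1 and 22 on separate lines
```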

#3 SoLong&Thx4AllTheFish

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts

Posted 02 January 2009 - 03:06 PM

You may find (no pun intended) grep to be useful:
http://www.autohotke...topic16164.html

#4 Laszlo

Laszlo
  • Fellows
  • 4713 posts

Posted 02 January 2009 - 03:46 PM

You may find (no pun intended) grep to be useful

I did not. Your link points to a function that provides even less useful results (just one match with grep("123","\d+",o)); it also gets into an infinite loop, etc.

#5 Lexikos

Lexikos
  • Administrators
  • 8842 posts

Posted 03 January 2009 - 04:15 AM

Based on my limited understanding of pcre.txt, if AutoHotkey were to support (?C) callouts, we could find all possible matches by appending (?C)(*FAIL) to the expression. (?C) would cause the callout (function) to be called with the current match, and (*FAIL) would cause the match to fail, forcing PCRE to backtrack and search for another match.

#6 Laszlo

Laszlo
  • Fellows
  • 4713 posts

Posted 03 January 2009 - 03:18 PM

if AutoHotkey were to support (?C) callouts....

We only have a hope if it gets on your todo list. Does it?

#7 Lexikos

Lexikos
  • Administrators
  • 8842 posts

Posted 03 January 2009 - 04:30 PM

Yes. I looked into it before I saw this thread. We can continue this discussion elsewhere, so as not to hijack the thread...

#8 Joy2DWorld

Joy2DWorld
  • Members
  • 562 posts

Posted 04 January 2009 - 02:33 PM

You may find (no pun intended) grep to be useful

I did not. Your link points to a function that provides even less useful results (just one match with grep("123","\d+",o)); it also gets into an infinite loop, etc.


check out RegExMatchG buried in that thread... it's nice!


also

as for callouts,

I use:

RegExLoop("1a is 2b sometihing cool ?3x","\d\K(\w+)","testt") ; returns match as VAR prefix, position in VAR@
   testt:
   if (a_thislabel = "testt") {
      debug( "$ is " $ " at " $@)
;      ; etc;
;      ; if change text of haystack, (ie. replace $ <full match> with alternate contents) set $ to new contents so position in loop is properly set based on len of new insert! ...
   }
;   code continues...
;   



; ex:
;
;	RegExLoop("1a is 2b sometihing cool ?3x","\d\K(\w+)","testt") ; returns match as VAR prefix, position in VAR@
;	testt:
;	if (a_thislabel = "testt") {
;		debug( "$ is " $ " at " $@)
;		; etc;
;		; if change text of haystack, (ie. replace $ <full match> with alternate contents) set $ to new contents so position in loop is properly set based on len of new insert! ...
;	}
;	code continues...
;	
;	older-->
;	goto, testtend
;	testt:
;		debug( "$ is " $ " at " $@)
;	return
;	testtend:
;
RegExLoop(haystack,needle,theprocess ,var = "$", position = 1) {  	; process is subroutine to process..
													;  -->> returns match as VAR prefix, position in VAR@ 
	global
	if islabel(theprocess)
	; local tmps, count
	; local save$ := %var%
		loop 
			if !(position := regexmatch(haystack, needle
						, %var%
						, position)  )
						
				break
			else {
				%var%@  := position
				gosub, %theprocess%
				position +=  strlen(%var%) 
			}
	
 }  


#9 polyethene

polyethene

    Administrator

  • Administrators
  • 5473 posts

Posted 04 January 2009 - 02:51 PM

Interesting, this reminds me of RegExMatchAll which I worked on a while ago. Due to the variable scope issue one cannot use either inside another function. grep was supposed to be a compromise but the result by its nature is as Lexicos said "less useful." What we really need is true arrays so something like PHP's preg_grep or GroupCollection from .NET can be supported.

#10 Lexikos

Lexikos
  • Administrators
  • 8842 posts

Posted 05 January 2009 - 10:13 AM

as Lexicos said "less useful."

I find grep's results generally more useful. No such user posted in this thread... I think you mean Laszlo.

What we really need is true arrays

Too true.

Due to the variable scope issue one cannot use either inside another function.

Well, not by nice, conventional means. See my post beginning with "That aside".

#11 majkinetor

majkinetor
  • Fellows
  • 4511 posts

Posted 06 January 2009 - 12:26 PM

Arrays are definitely good, but an AHK callback function called for each match, with a boolean result to continue or stop searching, would also be nice.

#12 majkinetor

majkinetor
  • Fellows
  • 4511 posts

Posted 06 January 2009 - 01:12 PM

I made a RegExCallback function to demonstrate what I am talking about.

text := "abc123abc456aaa bla 123bla456bbbb 987a"
	RegExCallback(text, "123[a-z]+456", "MyFun")
	
	text := "abc123abc456aaa bla 123bla456bbbb 987a"
	RegExCallback(text, "(123)([a-z]+)(456)", "MyFun2", 3)
return

MyFun(All){
	msgbox %A_ThisFunc%: %all%
}

MyFun2(All, s1, s2, s3){
	msgbox %A_ThisFunc%: %s1% | %s2% | %s3%
	if s2 = abc
		return 0
}


RegExCallback(HayStack, NeedleRegEx, Fun, SubPaterns=0, StartingPosition=1){	
	pos := StartingPosition, n:=0
	loop, 
	{
		i := RegExMatch(HayStack, NeedleRegEx, o, pos), n++
		IfEqual, i, 0, return n
		pos := i+Strlen(o)
		goto RegExCallback%SubPaterns%

		RegExCallback0: 
			r := %Fun%(o)
			goto RegExCallback
		RegExCallback1: 
			r := %Fun%(o, o1)
			goto RegExCallback
		RegExCallback2: 
			r := %Fun%(o, o1, o2)
			goto RegExCallback
		RegExCallback3: 
			r := %Fun%(o, o1, o2, o3)
			goto RegExCallback
		RegExCallback4: 
			r := %Fun%(o, o1, o2, o3, o4)
			goto RegExCallback
		RegExCallback5: 
			r := %Fun%(o, o1, o2, o3, o4, o5)
		
		RegExCallback:
			ifEqual, r, 0, return n			
	}
}
The callback function accepts n+1 parameters, where n is the number of subpatterns; the additional one contains the entire match. You can return true to continue iterating or false to stop.

You can use the function to do any kind of processing on data patterns. In Lexikos' version of AHK it can be much simpler due to updates to dynamic function calls (i.e. the SubPaterns parameter can be calculated and the user doesn't have to think about it).

The true power of this concept comes in the form of a RegExReplaceCallback, where the function's result is used as the replacement string.
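
A minimal sketch of that idea, assuming only the whole match is passed to the callback (no subpatterns). The name RegExReplaceCallback comes from the paragraph above, but this signature and the example callback Twice are assumptions for illustration. It stops at the first empty match to avoid looping forever.

```autohotkey
; Hypothetical sketch of the idea; the signature is an assumption.
; Replaces every non-overlapping match with the return value of Fun(match).
RegExReplaceCallback(Haystack, NeedleRegEx, Fun) {
    Out := "", Pos := 1
    Loop {
        i := RegExMatch(Haystack, NeedleRegEx, o, Pos)
        If (!i || o = "")              ; no match, error, or empty match: stop
            Break
        ; Copy the text between the previous match and this one, then the
        ; callback's result in place of the match itself.
        Out .= SubStr(Haystack, Pos, i - Pos) . %Fun%(o)
        Pos := i + StrLen(o)
    }
    Return Out . SubStr(Haystack, Pos)  ; append the unmatched remainder
}

; Example callback: doubles each number that was matched.
Twice(All) {
    Return All * 2
}

MsgBox % RegExReplaceCallback("a1 b22 c333", "\d+", "Twice")   ; shows a2 b44 c666
```

Unlike RegExCallback above, the callback's return value here is the replacement text, so it cannot also serve as a continue/stop flag.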

#13 Arion

Arion
  • Members
  • 7 posts

Posted 15 May 2012 - 07:21 PM

Reviving this thread to say your solution worked flawlessly. Thank you a lot!