grep() - global regular expression match


23 replies to this topic
polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
Details in the script.

Download

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004
I've added a link to this great resource from the "related" section of RegExMatch().

Your code is so short I was tempted to put it into the examples section. But I don't understand it well enough to be comfortable with that -- in part due to not fully understanding Grep. So I just linked to this topic, which provides your full examples and makes it easier for you to update it.

Thanks.

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
I designed them to be as fast as possible, which is why some expressions look a bit cryptic. grep(), which I just updated, returns matches and positions as comma-separated values. That is ideal for use in functions, since StringSplit makes it easy to create global or local arrays from the result. RegExMatchAll() is better for general use as it supports subpatterns.
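
A minimal sketch of the StringSplit step described above, assuming a comma-separated result string of the kind grep() is said to return. The list is hard-coded and the names `matches` and `m` are made up for illustration; the real function is not called here.

```autohotkey
; AutoHotkey v1 sketch. This literal stands in for a grep() result;
; it is an assumption for illustration, not output of the real function.
matches := "foo,bar,baz"

; StringSplit creates the pseudo-array m1, m2, ... and stores the item count in m0.
; The delimiter comma has to be escaped with a backtick.
StringSplit, m, matches, `,

Loop, %m0%
    MsgBox % "Match " . A_Index . ": " . m%A_Index%
```

Run inside a function, the same StringSplit call creates the pseudo-array as local variables unless the function is in assume-global mode.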

polyethene
In version 1.2 I fixed the infinite recursion bug that occurred when UnquotedOutputVar had the same address as Haystack. Performance increased by 20% somehow. Version 1.3 brings a few new options to grep().

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
RegExReplace often provides a simpler alternative (putting all matches in a variable):
t = C:\Windows\system32\systeminfo.exe`n
s = `n                ; separator character between matches, like "|" or ","
p =  (?<=([\\\.]))\w+ ; pattern to search for

; Drop the text before each match and append the separator after it.
t := RegExReplace(t, "(.*?)((" . p . ")|$)", "$2" . s)
; Trim the surplus separators left at the end of the result.
StringTrimRight t, t, SubStr(t,-2,1) = s ? 3 : 2

MsgBox [%t%]
If you want multi-character separators, only the StringTrimRight command needs adaptation.

polyethene
I use that method for a few things, but it has its limitations: it can't get subpatterns or positions, fails with options and complex expressions, can't recurse or backtrack, breaks backreferences and possibly some anchors and atomic groups, matches surrounding whitespace/delimiter characters, etc. The lazy wildcard is said to be very slow and is generally discouraged.

Joy2DWorld
  • Members
  • 562 posts
  • Last active: Jun 30 2014 07:48 PM
  • Joined: 04 Dec 2006
To do a MATCHALL, there is an (?) easier / FASTER way...

example:

start := (matched := 0) + 1
loop
	if regexmatch("12345","(?<whole>.*?(?<ANSWER" . a_index . ">\d))",The_,start) and ++matched
		start += strlen(The_whole) 
	else
		break

msgbox % matched

loop % matched
	msgbox % The_Answer%A_index%

exitapp


if this is helpful(?)

i.e. regexmatch(HAYSTACK, "(?<whole>" ... rest of match block ... ")", Array_Var_Name, PlaceHolder)

the match block is normal except... in place of "(?<named>" use "(?<named" . a_index . ">" for each named match.

or, split over several lines with comments:

if regexmatch(Haystack, "sx)"
	. "(?<whole>.*?"               ; .*? i.e. if there can be optional text between matches
	. "(?<ANSWER" . a_index . ">"  ; name of section + index counter
	. "\d)"                        ; repeat for each section you want to capture
	. ")"                          ; closing paren for the entire match section
	, Array_, start)

Joyce Jamce

polyethene
That's how grep() works, but instead of using a Loop you get the convenience of a single function and a few extra options.

Joy2DWorld
if we de-obfuscate your function:

grep(Haystack, Needle, ByRef outputVar, whichmatchB = 0, positionS = 1, charhopefullyNOTinanymatcheD = ",", matchfromlastZ = true) {
	Loop
		If (positionS := RegExMatch(Haystack, Needle, X, positionS)) {
			; record where this match was found
			outPut .= positionS . charhopefullyNOTinanymatcheD
			; resume after the match, or one character on to allow overlapping matches
			positionS += matchfromlastZ ? StrLen(X) : 1
			; keep either the whole match or the chosen subpattern
			Y .= (whichmatchB ? X%whichmatchB% : X) . charhopefullyNOTinanymatcheD
		} Else {
			; strip the trailing delimiter from both lists
			outputVar := SubStr(Y, 1, -1)
			Return SubStr(outPut, 1, -1)
		}
}

we see your very cool use of the return value (found position), but some huge LIMITATIONS with the function:

1) it only returns one (1) match, not multiple arrayed matches... [with the simple loop suggested above, you can make as many named matches/submatches as desired], e.g.:
start := (matched := 0) + 1 
loop 
   if regexmatch("12345","x)(?<whole>.*?(?<ANSWER" . a_index . ">\d)(?<AnswerB" . a_index . ">\d))",The_,start) and ++matched 
      start += strlen(The_whole) 
   else 
      break

2) it needs a 'pray it isn't in the match result' character to separate the matches!!!

3) it doesn't return an array....

4) it is much slower...

and notably,

as pointed out above in the thread, to create a conjoined string of matches, simply use RegExReplace to kill everything between matches and insert whatever 'between' string you desire......

polyethene

> only return one (1) match, not multiple arrayed matches

Which could be any one. I'll try to see if this limitation can be lifted with a new paradigm.

> need a 'pray it isn't in the match result' character to separate the matches

That's true, I usually escape my commas prior to calling grep. Until real arrays/objects are supported AutoHotkey will always have a problem here.
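
A minimal sketch of that escaping step, under stated assumptions: Chr(1) is an arbitrary placeholder assumed never to occur in the text, and a plain RegExMatch loop stands in for grep(), whose exact behaviour is not reproduced here.

```autohotkey
; AutoHotkey v1 sketch: hide commas before matching, restore them afterwards.
; Chr(1) is an arbitrary placeholder (an assumption, not part of grep()).
haystack := "a,b and c,d"
safe := RegExReplace(haystack, ",", Chr(1))

; Collect every run of non-space characters into a comma-separated list.
list := ""
pos := 1
While (pos := RegExMatch(safe, "\S+", m, pos))
{
    list .= (list = "" ? "" : ",") . m
    pos += StrLen(m)   ; continue after the current match
}

; Each field now splits cleanly; put the real commas back item by item.
Loop, Parse, list, `,
    MsgBox % RegExReplace(A_LoopField, "\x01", ",")
```

The placeholder only moves the problem: it works as long as the chosen character really is absent from the haystack.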

> don't return an array

In most cases this is better. You can transform the string with Sort, parsing loops and not worry about variable scope within functions.
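
A minimal sketch of that kind of post-processing, assuming a comma-delimited match list; the list is hard-coded for illustration rather than produced by grep().

```autohotkey
; AutoHotkey v1 sketch: work on a delimited match list without creating arrays.
matches := "pear,apple,fig,apple"

Sort, matches, U D,  ; U removes duplicates, D, makes the comma the delimiter

; A parsing loop visits each item; no variables besides the list are created.
Loop, Parse, matches, `,
    MsgBox % A_Index . ": " . A_LoopField
```

Because everything stays in one string, a function can simply return it, and the caller decides whether to split it into variables.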

> much slower

I did a few tests and found that it was only ever slightly slower (1-2%). Like I said, this function is for convenience; looping is not a new concept. If performance is critical you can write a DLL in ASM and call your exported functions.

> use regexreplace to kill everything between matches

In my follow-up I listed a few reasons why it's not a practical solution. Breaking backreferences is a major worry because I use them a lot.

Joy2DWorld
ok...

so easy, even a pre-beginner can use it !!!

STARTER:

use it like this:

RegexMall(HAYSTACK, NEEDLE)

and get back a result for *each* (?<named_match>xxxxxx) in the needle,

i.e. "(?<first>\d+)\D+(?<second>\d*?)\s*(?<third>\w+)" is legal.

the matched results are contained in the named matches, numbered per match, e.g.:

$named_match1 ... $named_match9999

the function returns the # of matches

also:

$ has all the full matches

$RegEx1 ... $RegEx999 has the full match for that match #

$0 also has the # of matches...

options:

RegexMall(haystack, Needle, "VARIABLE TO USE", "Spacer to use", Position to start in regexmatch)

the spacer applies to the TOTAL MATCH return string, e.g. "$" (if no alternative variable is designated)

RegexMall(hay,needle,"THIS_") ->

returns

THIS_namedmatch1
THIS_namedmatch2

etc..

i.e. with "THIS_" in place of the default "$"

/*
; example
msgbox % RegexMall("test and more ! and test and more and more!! ","(?<happy>test).*?(?<more>more)","Yes_") "- " Yes_happy1 " " Yes_more2 "`n" Yes_ "-" Yes_0
*/


RegExMall(haystack,needle,var = "$",spacer = "", position = 1) {
	global
	local tmps := "", count := 0	; start at 0 so a call with no matches returns 0, not blank
	; local save$ := %var%
	loop 
		if !(position := regexmatch(haystack
						, regexreplace(needle,"(?<!\\)\(\?\<(\w+)>"
							, "(?<$1" . a_index . ">")
						, %var%
						, position)  + strlen(%var%) )
			break
		else {
			tmps .= %var%Regex%a_index% := %var% . spacer
			++count
		}
	;%var% := save$
	%var% := tmps
	%var%0 := count
	return count
 }


hope this helps!


oh... is about 2-3 times slower than directly doing loop:

pos = 0
loop
	if !( pos := regexmatch(test, "(?<=is)\s*a\s*(?<match" . a_index .  ">[a-z]++)\s*(?<number" . a_index . ">\d+)",$,pos + 1))
		break

(as it is doing 2 regexes per pass!!)

but..

regex is FAST...

and the larger the haystack, the lower the time differential..

the "S" option degrades performance...

------------------------------------------

ok, and a second, FASTER (3x) version,

which also works with unnamed match sections "(xxx)"

but has a less friendly output style:

$<match #>_<subpattern # or id>

eg:

$1_1
$1_2

$2_1
$2_2

etc..


RegExMatchG(haystack,needle,var = "$",spacer = "", position = 1) {
	global
	local tmps := "", count := 0	; start at 0 so a call with no matches returns 0, not blank
	; local save$ := %var%
	loop 
		if !(position := regexmatch(haystack, needle
					, %var%%a_index%_
					, position)  
					+ strlen(%var%%a_index%_) )
			break
		else {
			tmps .= %var%%a_index%_ . spacer
			++count
		}
	;%var% := save$
	%var% := tmps
	%var%0 := count
	return count
 }  


for example:

loop 100
	test .= "this is a match 2345,"

RegExMatchG(test,"(?<=is)\s*(a)\s*(?<match>[a-z]++)\s*(?<number>\d+)")
; benchmark() is not defined anywhere in this thread; presumably the poster's own timing helper
msgbox % benchmark()  "-" $1_1 "-" $1_match "`n" $0 "-" $

polyethene
Your proposals look good. I will try to update my scripts soon; in the meantime users can copy your version.

px
  • Guests
  • Last active:
  • Joined: --
RegExMall(haystack,needle,var = "$",spacer = "", position = 1) {
   global
   local tmps, count   ; <-- highlighted line
   ; local save$ := %var%
   loop

count should also be in the local vars, else the 2nd time you accumulate it, it will always keep incrementing since you didn't reset the count.

polyethene
In version 2.0 RegExMatchAll() has been replaced with grepcsv(). Details are in the script.

Joy2DWorld

> count should also be in the local vars, else the 2nd time you accumulate it, it will always keep incrementing since you didn't reset the count.

thanks.