RegExMatch() get all matches Topic is solved

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
brasil

21 Mar 2017, 20:18

Don't really understand Regex, help is much appreciated.
I have a large file with thousands of entries that I need to get matches from as fast as possible.
I need to use it in different places in my script, so I need to modify what to look for.

thanks

Here is a test script...

Code: Select all

Data=
(
	<item ID="item 1">
		<Label>item 1 Label</Label>
		<Price>item 1 price</Price>
	</item>
	<item ID="item 2">
		<Label>item 2 Label</Label>
		<Price>item 2 price</Price>
	</item>
	<item ID="item 3">
		<Label>item 3 Label</Label>
		<Price>item 3 price</Price>
	</item>
)


; This is just an example:

str_start := "<item ID="
str_end   := "</item>"

RegExMatch( Data, ...

MsgBox, %allmatches%

ExitApp
4GForce

21 Mar 2017, 22:04

brasil wrote: Don't really understand Regex, help is much appreciated.
You might want to look at general regex resources (not AHK-specific); many sites offer explanations and ways to test regexes, like http://regexr.com/
AHK might have some specific syntax, mostly regarding escape characters.

I can't test code at the moment, but you might consider looping while a position is returned:

Code: Select all

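pos := 1
while (pos := RegExMatch(Haystack, NeedleRegEx, OutputVar, pos)) {
	; do something with OutputVar
	; advance past this match, otherwise the same match is found again forever
	; (assumes the pattern never matches an empty string)
	pos += StrLen(OutputVar)
}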
boiler

22 Mar 2017, 06:04

In your example, do you want it to return all the instances of all the text it finds between str_start and str_end? It would seem that's what you want, but the text it would return using your example seems kind of strange. At least I would think that you wouldn't want it to include the ">" at the end of the first tag.
Guest

22 Mar 2017, 06:17

Search (using Google) for grep autohotkey and you'll find various scripts you can use.
garry

22 Mar 2017, 06:45

A slightly more complicated approach, using a parsing loop:

Code: Select all

;-------- saved at Wednesday, 22 March 2017 12:29:43 --------------
;-------- https://autohotkey.com/boards/viewtopic.php?f=5&t=29483&sid=31fa1915c67a3de82db4470d0bd31f7a ---
#warn
e:=""
Data=
(
	<item ID="item 1">
		<Label>item 1 Label</Label>
		<Price>item 1 price</Price>
	</item>
	<item ID="item 2">
		<Label>item 2 Label</Label>
		<Price>item 2 price</Price>
	</item>
	<item ID="item 3">
		<Label>item 3 Label</Label>
		<Price>item 3 price</Price>
	</item>
)

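begin=<item ID=
end  =</item>
; c is 1 while the loop is between a begin line and an end line
c=0
Loop,parse,data,`n,`r
 {
 T:= A_LoopField
 ; closing tag: stop collecting and add a separator
 if T contains %end%
   {
   c=0
   e .= "------------------`r`n"
   continue
   }
 ; opening tag: start collecting from the next line
 if T contains %begin%
   {
   c=1
   continue
   }
 if C=1
   {
   ; legacy assignment trims leading/trailing whitespace (AutoTrim)
   t=%t%
   e .= T . "`r`n"
   }
 }
; strip the remaining tags, leaving only the text between them
e:=RegExReplace(e, "<.*?>" )
msgbox,%e%
e=
return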
jeeswg

22 Mar 2017, 08:46

A question that requires getting all matches suggests the use of RegEx callouts. However, my script is not working as expected.

Regular Expression Callouts
https://autohotkey.com/docs/misc/RegExCallout.htm

E.g. instead of '1,2,3', I'm getting '1,12,123,2,23,3'.

Do callouts not handle 'greedy'/'ungreedy', for example? Do they treat 'dot all' differently? Any assistance would be most appreciated.

Code: Select all

q:: ;regex callouts not working as expected
Data=
(
	<item ID="item 1">
		<Label>item 1 Label</Label>
		<Price>item 1 price</Price>
	</item>
	<item ID="item 2">
		<Label>item 2 Label</Label>
		<Price>item 2 price</Price>
	</item>
	<item ID="item 3">
		<Label>item 3 Label</Label>
		<Price>item 3 price</Price>
	</item>
)

allmatches := ""
RegExMatch(Data, "<item ID=.*?</item>(?CCallout)")
;Clipboard := allmatches
Clipboard := StrReplace(allmatches, "`n", "`r`n")
MsgBox % allmatches
Return

Callout(m)
{
global allmatches
allmatches .= "[[[" m "]]]`r`n"
;allmatches .= m "`r`n"
Return 1
}
[EDIT:] A revised version, but please use RegEx callouts with caution, and double-check the results.

Code: Select all

q:: ;regex callouts
Data=
(
	<item ID="item 1">
		<Label>item 1 Label</Label>
		<Price>item 1 price</Price>
	</item>
	<item ID="item 2">
		<Label>item 2 Label</Label>
		<Price>item 2 price</Price>
	</item>
	<item ID="item 3">
		<Label>item 3 Label</Label>
		<Price>item 3 price</Price>
	</item>
)

allmatches := ""
poslast := ""
RegExMatch(Data, "<item ID=.*?</item>(?CCallout)")
;Clipboard := allmatches
Clipboard := StrReplace(allmatches, "`n", "`r`n")
MsgBox % allmatches
Return

Callout(match, num, pos)
{
global allmatches
global poslast
if !(poslast = pos)
	allmatches .= match "`r`n"
poslast := pos
Return 1
}
Last edited by jeeswg on 22 Mar 2017, 15:22, edited 1 time in total.
4GForce

22 Mar 2017, 14:12

jeeswg wrote:However, my script is not working as expected.

Regular Expression Callouts
https://autohotkey.com/docs/misc/RegExCallout.htm

E.g. instead of '1,2,3', I'm getting '1,12,123,2,23,3'.

Do callouts not handle 'greedy'/'ungreedy', for example? Do they treat 'dot all' differently? Any assistance would be most appreciated.
Nice, I wasn't aware of RegEx callouts, thank you.
As for your script, I think it works as intended, since 12, 123 and 23 are also valid matches for the RegEx and are different from 1, 2 and 3.

What I don't understand is why a simple loop was too advanced for him and he settled for the callouts.

Code: Select all

Data=
(
	<item ID="item 1">
		<Label>item 1 Label</Label>
		<Price>item 1 price</Price>
	</item>
	<item ID="item 2">
		<Label>item 2 Label</Label>
		<Price>item 2 price</Price>
	</item>
	<item ID="item 3">
		<Label>item 3 Label</Label>
		<Price>item 3 price</Price>
	</item>
)

F1::
	allMatches := ""
	p := 1
	while(p := RegExMatch(Data, "<item ID=(.*?)</item>", match, p)) {
		allMatches .= match1 . "`n"
		; skip the whole match instead of rescanning it one character at a time
		p += StrLen(match)
	}
	msgbox % allMatches
Return
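
Since the original question asks for something reusable in several places with different patterns, the loop above could be wrapped in a function. This is only a sketch for AutoHotkey v1.1: the name RegExMatchAll is made up for illustration (it is not a built-in), and the 's)' option in the usage line is an assumption so that the dot also matches across CRLF line endings.

```autohotkey
; Usage: collect every <item>...</item> block, then show them.
Data := "<item ID=""a"">x</item>`r`n<item ID=""b"">y</item>"
out := ""
for index, found in RegExMatchAll(Data, "s)<item ID=.*?</item>")
	out .= found "`n"
MsgBox % out
ExitApp

; Hypothetical helper (not a built-in): returns an array of all matches.
RegExMatchAll(haystack, needle)
{
	matches := []
	pos := 1
	len := StrLen(haystack)
	; Stop at the end of the text, or when RegExMatch returns 0 (no more matches).
	while (pos <= len && (pos := RegExMatch(haystack, needle, found, pos)))
	{
		matches.Push(found)
		; Step past the match; step by 1 on an empty match to avoid an endless loop.
		pos += (StrLen(found) ? StrLen(found) : 1)
	}
	return matches
}
```

Returning an array keeps the pattern and the processing separate, so each caller can pass its own needle and decide what to do with the results.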
jeeswg

22 Mar 2017, 14:30

@brasil. If you are using the RegEx callouts, PLEASE double-check that you are getting the correct results (and that you are not getting unwanted text duplication)! Some of the special RegEx characters and modes that I tested did not seem to work as I expected them to.

@4GForce. Yes, callouts are a little bit hidden in the documentation! The behaviour I demonstrated, that I *didn't* want this time, could come in handy in future. When callouts do work, they are so convenient versus loops, although I've probably only been using RegEx a lot since I wrote my tutorial about a month ago, and became a big fan, using it in many more places than I used to. It's nice since I became able to decipher all the examples on this forum.

RegEx callouts helped me greatly speed up a method I used that parsed about a MB of text, which I had used a parsing loop for. Basically it counted datestamps, and checked for invalid/blank 'headers' that I put underneath datestamps, in my 1 MB temporary text file, every time I saved the file. The procedure got slower as the txt got bigger, and I had to move out the text, but after the rewrite it's still fast enough at the current txt's size.

The tutorial I wrote:
RegEx handy examples (RegExMatch, RegExReplace) - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?t=28031

==================================================

Some tests:

Code: Select all

q::
vOutput := ""
RegExMatch("abcdef", "..(?CCallout)") ;ab,bc,cd,de,ef
;RegExReplace("abcdef", "..(?CCallout)") ;same result as above
MsgBox % Clipboard := vOutput

vOutput := ""
RegExMatch("abcdef", "...(?CCallout)") ;abc,bcd,cde,def
MsgBox % Clipboard := vOutput

vOutput := ""
RegExMatch("abc", ".*(?CCallout)") ;abc,ab,a,,
MsgBox % Clipboard := vOutput

vOutput := ""
RegExMatch("abc", ".{1,3}(?CCallout)") ;abc,ab,a,bc,b,c
MsgBox % Clipboard := vOutput
Return

Callout(m)
{
global vOutput
if (vOutput = "")
	vOutput .= m
else
	vOutput .= "," m
Return 1
}
[EDIT:] I posted a revised version of the earlier script, above.
brasil

22 Mar 2017, 22:15

Thanks for the help guys.
@jeeswg, your script doesn't work if I read data from file, why?

https://www.upload.ee/files/6817804/reg ... s.zip.html
garry

23 Mar 2017, 02:34

You can comment out unwanted parts of a script with a block comment, like this:
/*
vvv
nnn
*/

(The old script again, this time reading the data from a file.)

Code: Select all

#warn
e:=""

f1=%a_scriptdir%\test.xml
Fileread,data,%f1%

begin=<item ID=
end  =</item>
c=0
Loop,parse,data,`n,`r
 {
 T:= A_LoopField
 if T contains %end%
   {
   c=0
   e .= "------------------`r`n"
   continue
   }
 if T contains %begin%
   {
   c=1
   continue
   }
 if C=1
   {
   t=%t%
   e .= T . "`r`n"
   }
 }
e:=RegExReplace(e, "<.*?>" )
msgbox,%e%
e=
data=
return
jeeswg

23 Mar 2017, 03:32

@brasil. Try changing:
RegExMatch(Data, "<item ID=.*?</item>(?CCallout)")
to:
RegExMatch(Data, "s)<item ID=.*?</item>(?CCallout)")
(I simply added in 's)'.)
The problem is probably to do with CRLF v. LF (carriage return + linefeed v. linefeed).
In your example, the continuation section used LFs only, your file probably uses CRLFs.

It would appear that the '.' in the first RegEx line was matching LFs in the sample text, but not matching the CRLFs in the file text. The s is the 'dot all' mode, see:
Regular Expressions (RegEx) - Quick Reference
https://autohotkey.com/docs/misc/RegEx-QuickRef.htm
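
A small sketch to check the CRLF v. LF explanation on your own machine. The two test strings are made up for illustration. If the explanation above holds, the first and third calls should find a match at position 1, and the second should return 0.

```autohotkey
lfText   := "<a>`n</a>"     ; LF only, like the continuation section
crlfText := "<a>`r`n</a>"   ; CRLF, like a typical Windows text file

; Without options, the dot is expected to match a lone LF...
MsgBox % "LF, no options: " RegExMatch(lfText, "<a>.*?</a>")
; ...but not the CR of a CRLF pair, so this one is expected to fail.
MsgBox % "CRLF, no options: " RegExMatch(crlfText, "<a>.*?</a>")
; With 's)' (dot all) the dot matches any character, including CR and LF.
MsgBox % "CRLF, with s): " RegExMatch(crlfText, "s)<a>.*?</a>")
```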

==================================================

Btw to parse through your file, I might recommend this method instead. It could end up being faster and simpler than using the RegEx callout method. You can then apply RegExMatch or RegExReplace individually to each vTemp variable, in order to extract data.

Code: Select all

q:: ;find an unused character, parse the text, apply RegExMatch
vText = ;continuation section
(Join`r`n
%A_Tab%<item ID="item 1">
		<Label>item 1 Label</Label>
		<Price>item 1 price</Price>
	</item>
	<item ID="item 2">
		<Label>item 2 Label</Label>
		<Price>item 2 price</Price>
	</item>
	<item ID="item 3">
		<Label>item 3 Label</Label>
		<Price>item 3 price</Price>
	</item>
)

;vPath = %A_Desktop%\MyFile.txt
;FileRead, vText, % vPath

vUnused := ""
Loop, 255
if !InStr(vText, Chr(A_Index))
{
	vUnused := Chr(A_Index)
	break
}
if (vUnused = "")
{
	MsgBox % "error: no delimiter available"
	Return
}

vText := StrReplace(vText, "`t<item ID=""item ", vUnused "`t<item ID=""item ")
if (SubStr(vText, 1, 1) = vUnused)
	vText := SubStr(vText, 2)
Loop, Parse, vText, % vUnused
{
	vTemp := A_LoopField
	RegExMatch(vTemp, "<item ID=""\K(.*?)(?="">)", vItemID)
	RegExMatch(vTemp, "<Label>\K(.*?)(?=</Label>)", vLabel)
	RegExMatch(vTemp, "<Price>\K(.*?)(?=</Price>)", vPrice)
	MsgBox % "[" vItemID "][" vLabel "][" vPrice "]`r`n" vTemp
}
Return
brasil

23 Mar 2017, 08:14

Thanks :)

Although it works, it's not solid enough.
What if the XML file is not prettified/indented?
I think StrSplit() is amazingly fast and does the job.
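
For reference, a rough sketch of how a StrSplit() approach might look. It is an illustration under assumptions, not the script actually used: the sample data is made up and deliberately on a single line, str_start and str_end are the names from the first post, and it relies on StrSplit() accepting a multi-character string as its delimiter.

```autohotkey
; Sample data with no line breaks or indentation.
Data := "<item ID=""item 1""><Label>L1</Label><Price>P1</Price></item>"
Data .= "<item ID=""item 2""><Label>L2</Label><Price>P2</Price></item>"

str_start := "<item ID="
str_end   := "</item>"

allmatches := ""
for index, chunk in StrSplit(Data, str_start)
{
	; The first element is whatever came before the first start marker.
	if (index = 1)
		continue
	; Keep only the text up to the end marker; skip chunks that lack one.
	endPos := InStr(chunk, str_end)
	if (!endPos)
		continue
	allmatches .= SubStr(chunk, 1, endPos - 1) "`n"
}
MsgBox % allmatches
```

Because it searches for the markers themselves, this does not depend on tabs or line endings in the file.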
