Tutorial: An AHK Introduction to RegEx


40 replies to this topic

Poll: Did you find this tutorial helpful? (45 member(s) have cast votes)

  1. Yes, I found it helpful. (48 votes [90.57%])
  2. No, it wasn't helpful. (2 votes [3.77%])
  3. Who taught you how to write, e.e. cummings? (3 votes [5.66%])

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
"When we last left our hero, he had just discovered the correct window matching statement in the Case of the Evil Twin Programs, but this mystery would have yet another twist in store..."

Right! Stop that! It's silly...and a bit suspect I think... </Monty Python>

So I've got the unique IDs for the two windows into an array, but now I need to tell which is which. As I said before, information in the URLs for each window holds the key:


https://intranet.corporate.com/InquiryWeb/accountInquiryAction.do
https://intranet.corporate.com/BillingWeb/requestForOutput.do
No problem, just SubStr out the pertinent part, right? If only it were that easy...one fun surprise the application provides every now and again is that it loses its secure designation (yikes!):

http://
The prospect of putting in something like a gazillion if/else statements to accommodate this one nifty little quirk is not on my agenda of fun. But RegEx will have a solution for us now, won't it?

Dealing with the http with an optional s is easy:

https?://
Now I just need to match every character up to the next slash:

https?://.*/
And capturing only the "word" characters in a subpattern after that slash should get me the information I need:

https?://.*/(\w+)
And if you run that pattern against either of our URLs you will be delighted to find that you are WRONG.

Yet again, I led you down the wrong path. Instead of capturing this text from our URLs:

InquiryWeb
BillingWeb
Our statement will instead capture this text:

accountInquiryAction
requestForOutput
How come? A behavior that was previously touched upon but not explored until now: greed.

If you've kept the RegEx Quick Reference handy, you'll see that some of our match characters (asterisk, question mark, plus, and min/max) are by default greedy or, in other words, they will continue to match characters up to the last possible match that will still satisfy the pattern. So where we wanted this portion of our statement:

https?://.*/
To stop matching here:

https://intranet.corporate.com/
The greed of the asterisk dictated that matching to the next slash would still satisfy the pattern of our statement, so it stopped matching here:

https://intranet.corporate.com/InquiryWeb/
So how do we stop greed? A global economic depression followed by a mass implementation of socialist government policies at the national level would be helpful but in our case (focus!) the cure is the question mark, which causes any of the aforementioned greedy characters that directly precede it to stop matching at the first match encountered. So this:

https?://.*?/
Ungreedifies our statement so it matches only this:

https://intranet.corporate.com/
Gotta love that question mark. I made that word up by the way, ungreedify™ (it's trademarked now, watch yourselves!). ANYWAY, now let's review our correctly working RegEx statement in its entirety:

https?://.*?/(\w+)
And now we turn the controls over to RegExMatch. The manual shows RegExMatch being used to yield the position of our match, but it doesn't have to be used that way. We can also use it strictly for the purpose of saving a subpattern to a variable, and in our case that's exactly what we need:
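
As a rough sketch of what such a call can look like (AutoHotkey v1 syntax; the variable name url and the sample URL assigned to it are assumptions for illustration), the third parameter names the output variable:

```autohotkey
; Sample URL from earlier in the post, stored in a hypothetical variable.
url := "https://intranet.corporate.com/InquiryWeb/accountInquiryAction.do"

; The return value (the match position) is ignored here; we only care
; about what RegExMatch stores under the output variable name SubPat.
RegExMatch(url, "https?://.*?/(\w+)", SubPat)
```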

So now all we have to do is assign each unique ID in the window array to a group based on the contents of the variable:

Hopefully some of you who are glued to the manual have caught what I'm up to because if you didn't, trying to make this code work is gonna suck. I'll let the manual take over:

If any capturing subpatterns are present inside NeedleRegEx, their matches are stored in an array whose base name is OutputVar. For example, if the variable's name is Match, the substring that matches the first subpattern would be stored in Match1, the second would be stored in Match2, and so on.


So SubPat isn't a variable, it's an array. Let's make it go:
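
A minimal sketch of a working version, assuming AutoHotkey v1 syntax and a hypothetical url variable: SubPat itself receives the text matched by the whole pattern, while the first subpattern lands in SubPat1. The greedy variant is included only for comparison.

```autohotkey
url := "https://intranet.corporate.com/InquiryWeb/accountInquiryAction.do"

; Lazy version: .*? stops at the first slash that lets the rest match.
RegExMatch(url, "https?://.*?/(\w+)", SubPat)

; Greedy version (the earlier mistake): .* runs on to the last slash.
RegExMatch(url, "https?://.*/(\w+)", Greedy)

MsgBox % "SubPat:  " SubPat "`nSubPat1: " SubPat1 "`nGreedy1: " Greedy1
; SubPat  -> https://intranet.corporate.com/InquiryWeb
; SubPat1 -> InquiryWeb
; Greedy1 -> accountInquiryAction
```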

For our next lesson, we'll try something slightly more complex and extract several pieces of data from one statement using RegExMatch.
					
					

tonne
  • Members
  • 1654 posts
  • Last active: May 06 2014 06:22 PM
  • Joined: 06 Jun 2006
1+ for e.e.cummings in lack of "why didn't monty python make a regex sketch too".

I prefer sipping red wine whenever I'm into regexes.

  • Guests
  • Last active:
  • Joined: --

circumflex

s/circumflex/caret/g

...up to the next dash:

after that dash...

s/dash/slash/g

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
Before I begin this lesson I must give all appropriate credit to jaco0646 for starting me on the path.

Once I learned how to use COM with web pages in Internet Explorer, one of my first orders of business was to use it to extract an address from our web application, sort it into its appropriate parts (Address Line 1/Address Line 2/City/State/Zip) and run it through the U.S. Postal Service website to verify the standardized address when we've had a problem mailing to a particular client.

That's all well and good but there are also problems, one being that some addresses will have a sub-address, which can be alphanumeric (sometimes with a dash), and people can use any number of name types for their sub-address:

Apt 1406
Fl 2
Ste 604-C
Spc F
Trlr 131
# 29


Another problem is that in some cases there's no spacing between the address, sub-address type and sub-address number:

205 W Lilac Ln # 14
205 W Lilac Ln# 14
205 W Lilac Ln #14
205 W Lilac Ln#14


As for line 2, when sorting the City/State/Zip the name of the city isn't always one word:

Houston TX
New York NY
East Palo Alto CA


And yet another problem is that if I need to retrieve the zip code, in some cases it is alone, in others it has a +4 attached to it and in still some others it has an incomplete +4:

Houston TX 78912
New York NY 18142-8654
East Palo Alto CA 97116-524


And to top it all off, the character case throughout the address is not consistent. Quite the myriad of problems!

Let's take them one at a time though, starting with the first address line. I think first sorting out the things that give us the most information on how to sort them will be the most helpful, and in this case that is the optional sub-address. First we can sort all of the potential sub-address types into a subpattern with alternatives:

(Apt|Fl|Ste|Spc|Trlr|#)

But we'll need to add to this to make it work. Since there may optionally be a space between the type and the number, we'll now introduce and use a new search character: \s, which will match any single whitespace character, most typically space, tab, and newline:

Apt\s?

The next thing is the number, which can be alphanumeric with an undetermined number of characters and possibly a dash. We can't use \w by itself because it cannot account for the optional dash, and if we add the optional dash any word characters after the dash must not only be accounted for but also be optional, something like this:

Apt\s?\w+(-?\w+)?

I don't know about you, but to me that's just messy looking and I don't feel like looking for a subpattern inside a subpattern in that way. We could also use character class ranges ([a-zA-Z0-9]) and we can even add the dash into the class to accommodate it like this:

Apt\s?[a-zA-Z0-9-]+

Definitely less messy looking with better matching tolerance but that will make for one awfully long statement if I have to repeat that four or five times...but this is another area in which the power of the character class brackets becomes apparent. Although the manual states that we can specify [a-zA-Z0-9_] to match any "word" character, the character class will also accept its shorthand equivalent (\w) to do the same thing. Now we can combine the convenience of \w with the matching tolerance of the character class brackets including the dash:

Apt\s?[\w-]+

Perfect. Now in previous examples I avoided creating a subpattern within a subpattern, but in this case it will result in a short, concise and still readable working statement:

((Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)
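
To check the combined subpattern against the sample sub-addresses listed earlier, something like the following could be used. It is only a sketch: it assumes AutoHotkey v1.1+ (for the array literal and for-loop), and the ^ and $ anchors are added here purely for the test so that each sample has to match in full.

```autohotkey
samples := ["Apt 1406", "Fl 2", "Ste 604-C", "Spc F", "Trlr 131", "# 29"]
report := ""
for index, sample in samples
{
    ; RegExMatch returns the match position, or 0 when nothing matches.
    found := RegExMatch(sample, "^((Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)$")
    report .= sample " -> " (found ? "match" : "no match") "`n"
}
MsgBox % report
```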

Notice in all of the above examples that I place the dash in the last position inside the character class brackets. That is to avoid any conflict with any other specified characters in the brackets, as you can imagine what will happen if you're trying to find only the characters a, t and dash and you use [a-t]. The manual states that you should use the backslash to signify that it is a literal dash, which is also a good practice but not necessary so long as you are careful.
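
A small sketch of that pitfall (AutoHotkey v1 syntax; the sample text is made up for illustration), replacing every matched character with an underscore so the difference is visible:

```autohotkey
text := "cat-nap"

; [a-t] is a range (every letter from a through t); the dash is untouched.
MsgBox % RegExReplace(text, "[a-t]", "_")   ; ___-___
; [at-] lists three literal characters: a, t and the dash.
MsgBox % RegExReplace(text, "[at-]", "_")   ; c___n_p
```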

So that's done, but now what do we do about the rest of the address? Well since the address is only being parsed into two sections and we already know how to find one section (if it exists), let's use RegExMatch to set up an if/else. How? By creating a named subpattern:

address := "205 W Lilac Ln #14"

RegExMatch(address,"(?P<Line2>(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)",_)

RegExMatch - UnquotedOutputVar:

If any capturing subpatterns are present inside NeedleRegEx, their matches are stored in an array whose base name is OutputVar...The exception to this is named subpatterns: they are stored by name instead of number. For example, if the variable's name is Match the substring that matches the named subpattern (?P<Year>\d{4}) would be stored in MatchYear.


In this case OutputVar is an underscore, so our match is stored in _Line2 if it exists. Now what do we do if _Line2 exists? We can use RegExReplace to remove _Line2 from the original variable and save the rest to a new variable, _Line1, but with a small twist:

address := "205 W Lilac Ln #14"

RegExMatch(address,"(?P<Line2>(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)",_)
if (_Line2)
 _Line1 := RegExReplace(address,"(\s?(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)")

Remember that I mentioned earlier that some of the addresses will have a space between the address and the sub-address, while others will not. That space doesn't matter when we attempt to get the sub-address, but when removing the sub-address to save the address to _Line1 we may as well remove it if it exists. And if _Line2 doesn't exist then the address itself can be saved to _Line1:

address := "205 W Lilac Ln #14"

RegExMatch(address,"(?P<Line2>(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)",_)
if (_Line2)
 _Line1 := RegExReplace(address,"(\s?(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)")
else
 _Line1 := address

And line 1 of the address is solved...except for one other small detail I mentioned earlier:

...the character case throughout the address is not consistent...


All the sub-address name types are still case sensitive (ugh!). Fortunately, correcting this problem in the RegExMatch and RegExReplace statements is a minor detail:

address := "205 W Lilac Ln #14"

RegExMatch(address,"i)(?P<Line2>(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)",_)
if (_Line2)
 _Line1 := RegExReplace(address,"i)(\s?(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)")
else
 _Line1 := address

RegExMatch Options - i):

Case-insensitive matching, which treats the letters A through Z as identical to their lowercase counterparts.
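
To try the finished line 1 logic on more than one input, a loop along these lines could be used. This is a sketch under a few assumptions: AutoHotkey v1.1+ (array literal and for-loop), and test addresses that are partly invented to cover the spacing and case variations mentioned above.

```autohotkey
addresses := ["205 W Lilac Ln # 14", "205 W Lilac Ln#14", "1200 Main St apt 1406", "77 Oak Ave"]
results := ""
for index, address in addresses
{
    ; On a failed match RegExMatch blanks its output variables, so
    ; _Line2 is empty for addresses that have no sub-address.
    RegExMatch(address, "i)(?P<Line2>(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)", _)
    if (_Line2 != "")
        _Line1 := RegExReplace(address, "i)\s?(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+")
    else
        _Line1 := address
    results .= "[" _Line1 "] [" _Line2 "]`n"
}
MsgBox % results
; [205 W Lilac Ln] [# 14]
; [205 W Lilac Ln] [#14]
; [1200 Main St] [apt 1406]
; [77 Oak Ave] []
```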


NOW we can move on to address line 2.

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
So before we tackle address line 2 let's take a look back at some possible scenarios we'll find in address line 2:

Houston TX 78912
New York NY 18142-8654
East Palo Alto CA 97116-524


It's not as bad as you first thought it would be, is it? Right away there are three distinct patterns we can work with: the state (two capital letters):

([A-Z]{2})

The raw zip code (5 digits):

(\d{5})

And the spaces before and after the state, so I guess we'll just build our statement from the inside out:

\s([A-Z]{2})\s(\d{5})

And while we're at it let's make sure to name our subpatterns:

\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})

The next easiest part to knock out will probably be the optional +4 after the zip code, which would be a dash and up to four digits. Can you guess what character matching technique I'm going to use?

\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?

While I could use min/max there as well, there will be no other characters encountered at or after that point anyway so there's no harm in using plus.

We're breezin' right through this, eh?

The only thing left to deal with is the city itself, and since the rest of the patterns in our statement constrain what can be matched where so tightly all we need to specify is what characters we might find in the city name, which should only be letters, spaces and possibly dashes. You kids remember my rule about re-inventing the wheel:

(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?

Seriously, when did this get so easy? We spent 40 days and 40 nights riding out the great flood working on address line 1 and we cut through address line 2 like a knife through warm butter:

address2 = East Palo Alto CA 97116-524

RegExMatch(address2,"(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?",_)
MsgBox % "" _City " " _State " " _Zip ""
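
The same statement can be run over all three sample lines. In this sketch (AutoHotkey v1.1+ assumed for the array and for-loop) the optional +4 group is given the name Plus4, which is not in the original statement and is added here only so its contents can be displayed:

```autohotkey
lines := ["Houston TX 78912", "New York NY 18142-8654", "East Palo Alto CA 97116-524"]
out := ""
for index, line in lines
{
    ; Plus4 is a hypothetical name for the optional trailing group.
    if (RegExMatch(line, "(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})(?P<Plus4>[\d-]+)?", _))
        out .= "[" _City "] [" _State "] [" _Zip "] [" _Plus4 "]`n"
}
MsgBox % out
; [Houston] [TX] [78912] []
; [New York] [NY] [18142] [-8654]
; [East Palo Alto] [CA] [97116] [-524]
```

Note that the optional group captures the dash along with the digits, which is fine here since that piece is going to be discarded anyway.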

Sick, just sick what a mockery we made of that example. But don't worry, you're not getting off that easy! Who knows what surprises we might find in examples to come?

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
Since I've mentioned using COM to work with Internet Explorer via AHK in the last lesson (and I'll be mentioning it again), it is also worth mentioning that if you would like to learn more about how to do this, AHK member tank has his own tutorial on how to use COM with Internet Explorer, and when in doubt you can always check into the original COM Standard Library thread with the originator himself, Sean.


Another piece of information I can extract from my web application using COM is a ticket number, which I then log into a spreadsheet for my own reference. Unfortunately the page element that I extract the number from also extracts some HTML tags along with it, like this:

<td><b>0321945</b></td>

And depending on what company I'm "impersonating" the number can be as short as five digits and as long as nine digits.

The beauty of AHK is that you can get accustomed to laziness quite easily and in this case, I want lazy. I want to get only the number out of the page element I extracted and convert it directly over to a statement like this:

Cnf# 0321945

And paste it into my spreadsheet. The matching we can obviously do with a RegEx statement but we don't need RegExMatch to tell us it exists because we already know that. We also don't need to use RegExMatch just to get a named subpattern to use in another statement, too much work (yawn). I want to get everything into that one statement and I want it NOW.

So if it's not RegExMatch, what is it?! Why RegExReplace of course!

"But...what are you going to do once you rip the number out of the statement??? It's gone, you babbling fool!"


The babbling fool part, well, that's another subject for another day, but RegExReplace doesn't have to be used just to take something out of a statement; we can also use it to keep some stuff and replace the rest with new stuff we want there (hence the word Replace, Einstein, and for that you get to stay after school and clean the erasers).

But, as always, we can't take out part without identifying the whole. I already know what I want, a bunch of digits:

(\d+)

Again, I could've used a min/max range {5,9}, but since there are no other digits in that data it won't conflict with anything else. Also notice that I grouped it in a subpattern, a not-so-subtle foreshadowing clue but we'll come back to that. Now for what I DON'T want: the tags.

.* is good for eating up useless garbage, but alas, our good friend greed rears its ugly head. The RegEx Quick Reference gives us a nifty little trick for dealing with the tags before the digits:

<.*?>(\d+)

But I do have to take a moment to explain why that works, because this will correctly match what we want:

<td><b>0321945


But if we try the same thing on the other side of the digits:

(\d+)<.*?>

It will not:

0321945</b>


This is because the first method limits greed, but probably not the way you were expecting it to. Although it starts with the open tag character (<), it still consumes one close tag character (>) before it stops at the other. The .*? still has to keep consuming until the rest of the pattern can be satisfied, and the overall pattern says "match up to the close tag character directly preceding digits":

<.*?>(\d+)
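
A short sketch that shows both behaviours side by side (AutoHotkey v1 syntax; the output variable names Before and After are just illustrative):

```autohotkey
line := "<td><b>0321945</b></td>"

; Tags before the digits: the match starts at the first "<" and has to
; stretch across both opening tags to reach the digits.
RegExMatch(line, "<.*?>(\d+)", Before)

; Tags after the digits: .*? stops at the very first ">" it can.
RegExMatch(line, "(\d+)<.*?>", After)

MsgBox % "Before: " Before "`nAfter:  " After
; Before -> <td><b>0321945
; After  -> 0321945</b>
```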

So now that we've cleared up that minor greedy ungreedification™ detail, how do we match to the right side of the digits? Well it doesn't matter how we match it since we're getting rid of it, right? That sounds like a job for Mr. Clean (.*):

<.*?>(\d+).*

So let's look at what we have for a working statement so far:

line = <td><b>0321945</b></td>

result := RegExReplace(line,"<.*?>(\d+).*")

"Alright Boy Wonder, NOW what?"


I told you we designated that subpattern for a reason, now it's time to cash in:

result := RegExReplace(line,"<.*?>(\d+).*","Cnf# $1")

And no, I'm not buying my subpattern back from AHK for a dollar, smarty. Read the manual:

RegExReplace - Replacement:

...[replacements] may include backreferences like $1, which brings in the substring from Haystack that matched the first subpattern. The simplest backreferences are $0 through $9, where $0 is the substring that matched the entire pattern, $1 is the substring that matched the first subpattern, $2 is the second, and so on.


Now THAT'S convenience. And now for the finished product, the pièce de résistance otherwise known as "Man (Not) At Work":

line = <td><b>0321945</b></td>

result := RegExReplace(line,"<.*?>(\d+).*","Cnf# $1")

And we'll continue on with examples, but I would again like to say that any comments, suggestions or even examples of your own which you think would be of benefit to RegEx beginners within the AHK community are more than welcome.

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
In our last lesson I briefly expanded upon subtle differences in how greed may work within a RegEx statement. This short lesson will do the same, hopefully to show how you can manipulate greed in ways that will work for you when writing statements.

In a previous lesson I extracted a portion of a URL so I could identify which window of my web application is which. Now I want to do one better. Since certain data can only be extracted from certain pages within each application, I want to restrict which pages certain hotkeys will work on. Again, the window titles are no help but the URL will again hold the key:

https://intranet.corporate.com/InquiryWeb/accountInquiryAction.do
https://intranet.corporate.com/BillingWeb/requestForOutput.do
https://intranet.corporate.com/InquiryWeb/specialNotesReviewInquiryAction.do

So how do I write a RegEx statement that will catch what I want (the page name just before .do)? I could go the literal route since specificity makes for the best matching:

https://intranet\.corporate\.com/(Inquiry|Billing)Web/(\w+)\.do

But that's REEEAAAAALLY LOOOOONNNGG. I can back off the specificity just a little bit and still match:

https://intranet\.corporate\.com/.*?/(\w+)\.do

But that's still entirely too long for me (remember, LAZY AND SIMPLE). So how else can we cut the fat? Like this:

https?://.*?/(\w+)\.do

Now if you happened to look over the previous lesson I linked to you're probably saying to yourself, "Didn't he use that statement before?" Pretty much, yes:

https?://.*?/(\w+)

And your next question is probably, "Then how could that possibly work?!" Let's look at it, and while we're looking think less about greed and more about overall pattern in our statement.

So what is this portion of our previous example trying to tell us?

.*?/(\w+)

It's telling us to match anything after the first slash encountered which has word characters after it, so in the previous example using limited greed matched our pattern the way we typically expect it to:

https?://.*?/(\w+)  <~~ RegEx
https://intranet.corporate.com/InquiryWeb  <~~ match (subpattern: InquiryWeb)

Our new example does something slightly different:

.*?/(\w+)\.do

This example tells us to match anything after the first slash encountered which has any number of word characters after it followed by ".do". In this case, after the first slash greed does not encounter any word characters followed by ".do" before it encounters another slash:

https?://.*?/(\w+)\.do  <~~ RegEx
https://intranet.corporate.com/InquiryWeb/  <~~ no match (InquiryWeb is not followed by .do)

So .*? consumes the first slash along with InquiryWeb and looks beyond the second slash, where it finds what it wants:

https?://.*?/(\w+)\.do  <~~ RegEx
https://intranet.corporate.com/InquiryWeb/accountInquiryAction.do  <~~ match (subpattern: accountInquiryAction)

This is a situation in which greedy ungreedification™ works to our advantage in that it allows us to create a much shorter but still working RegEx statement:

url = https://intranet.corporate.com/InquiryWeb/specialNotesReviewInquiryAction.do

RegExMatch(url,"https?://.*?/(?P<PageID>\w+)\.do",_)
MsgBox % "" _PageID ""

Now using greed in this way will not work in all situations so you will want to use such a technique with caution, but in situations where it will work it can significantly cut down how much writing you need to do to make RegEx work for you.

And now you can print up a T-Shirt:

"I read sinkfaze's 500th post and all I got was this lousy RegEx lesson!"

tank
  • Administrators
  • 4345 posts
  • AutoHotkey Foundation
  • Last active: May 02 2019 09:16 PM
  • Joined: 21 Dec 2007
ha ha, I know COM with IE like a champ and you know regex well; would you believe I don't? I like the tutorial, learning slowly
Never lose.
WIN or LEARN.

Krogdor
  • Members
  • 1391 posts
  • Last active: Jun 08 2011 05:31 AM
  • Joined: 18 Apr 2008
This looks like a great tutorial.

Great work, sinkfaze! Keep it up :D

Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006

This is a situation in which greedy ungreedification™ works to our advantage in that it allows us to create a much shorter but still working RegEx statement:

I guess you haven't noticed it works without ?. The reason is that .* matches to the end of the string (assuming there is only one line), but then it back-tracks one character at a time until / matches. When ? is specified, .* continues to consume characters only until the next part of the pattern matches (we could call this the "anchor").

If both patterns will always match the same sub-string, efficiency depends on the length of the sub-strings before and after the "anchor". For instance, consider the following two strings:
abcdef/xyz
abcdefghi/xyz
.*?/ is (very slightly) faster than .*/ matching the first string, but (very slightly) slower matching the second string. With the rest of the alphabet in there, it is roughly 50% faster to omit ?.
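
The first point, that the statement works with or without the ?, can be checked with a short sketch (AutoHotkey v1 syntax; the variable names are illustrative). It says nothing about speed, only that both versions capture the same page name once \.do pins down the end of the match:

```autohotkey
url := "https://intranet.corporate.com/InquiryWeb/specialNotesReviewInquiryAction.do"

RegExMatch(url, "https?://.*?/(\w+)\.do", Lazy)    ; with ?
RegExMatch(url, "https?://.*/(\w+)\.do", Greedy)   ; without ?

MsgBox % "Lazy1:   " Lazy1 "`nGreedy1: " Greedy1
; Both show specialNotesReviewInquiryAction
```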


I'd like to point out that AutoHotkey_L supports RegEx Callouts, which make it possible to "debug" a regular expression by stepping through each item in the pattern. There is an example ready to use, though it has a very basic interface. At each step you may see the captured sub-strings, the current position in the input string and the pattern item being evaluated. This allows you to see, for instance, how differently .* and .*? work.

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008

I guess you haven't noticed it works without ?.


Well I did, but I decided to stand pat on the manual's interpretation of this particular issue since beginners will likely be using that as their reference...not that I've been particularly loyal to the manual's interpretation of things throughout the course of the tutorial anyway, so it's not unfair to call me on it.

It's safe to say I haven't worked with RegEx on any scale large enough to where the speed of using some options over others has a discernible value. It is nonetheless very interesting to see how quickly the search speed will turn over for one option or another even in a relatively small string.

I'll be honest, it seemed to me like AutoHotkey_L was geared towards a user whose level of "programming" sophistication far exceeds my own, so I never gave it too much thought. But I have to admit, the RegEx callout feature has me interested and I will have a go.

sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
And as we continue onward, I'll again point back at past work (and in the dictionary under redundant it says "see: redundant").

If you'll recall in this lesson and this lesson the object was to sort out portions of an address I retrieved from my web application using RegExMatch. Now there is another instance in which I retrieve this address from my web application: to insert it into a payment application. And the payment application itself has a delightful quirk in that it will accept uppercase letters only, which normally wouldn't be a problem except for the fact that ControlSetText can bypass the application's case detection. And when you try to run the application after incorrectly inputting all of your data, a crash occurs, you start all over again and many four-letter words your mother would slap you for if she heard you say them out loud wash over your mind.

So if you haven't guessed, making sure that we send this information as uppercase is pretty important. I had become accustomed to using RegExMatch, which has no option to update the case, then running four or five StringUpper commands to pass the array elements to variables prefixed with the application's initials (_Addr1 to WUAddr1, _City to WUCity, etc.).

Yes yes, a distinct violation of the Laziness Code but I hadn't gotten around to updating it. And we already know it's not RegExMatch, so RegExReplace, here we come!

How in blazes do you plan to get RegExReplace to do this?!


You sure are tough to please but bear with me, and we'll use the second half of the address lesson as a point of reference. Now we've already figured out the method by which we can sort all of the information we need on such an address line:

(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?

So let's set up the RegExReplace statement, shall we?

address2 = East Palo Alto CA 97116-524

RegExReplace(address2,"(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?")

Notice that I'm keeping my subpatterns in the statement because they will be handy to us in a short while. But I can't use this statement as is because I'm still going to have to rely on RegExMatch to get the individual pieces into array elements, so I'll need to save my results to a variable. The same one I started with is always a good idea:

address2 = East Palo Alto CA 97116-524

address2 := RegExReplace(address2,"(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?")

Now where to? Now we return to the magic of backreferences. Not only can we access backreferences via the ($1,$2,$3) method, we can also access them through named subpatterns:

${City}
${State}
${Zip}

Not only that, we can convert the case of backreferences by specifying a U (uppercase) or a T (title case) after the $ sign:

$U{City}
$U{State}
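
A minimal sketch of both conversions at once (AutoHotkey v1 syntax; the lowercase input string and the simplified pattern are invented for the demonstration):

```autohotkey
text := "east palo alto ca"

; $T title-cases the City subpattern, $U uppercases the State subpattern.
MsgBox % RegExReplace(text, "(?P<City>[\w\s]+)\s(?P<State>\w{2})$", "$T{City}, $U{State}")
; -> East Palo Alto, CA
```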

Now we're on our way to saving some time:

address2 = East Palo Alto CA 97116-524

address2 := RegExReplace(address2,"(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?","$U{City}\s$U{State}\s${Zip}")

Beautiful, isn't it? Survey says...

XXX

????????

What horrible form of RegEx waterboard torture hath we been subjected to now? Perhaps you caught it, perhaps you didn't, but the manual will set it straight:

RegExReplace - Replacement:

The string to be substituted for each match, which is plain text (not a regular expression).


$U{City}\s$U{State}\s${Zip}
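
A quick way to see the problem, assuming AutoHotkey v1 syntax (the needle variable is introduced here only to keep the line short): because the replacement is plain text, each \s should be copied into the result as a literal backslash and s rather than becoming a space.

```autohotkey
address2 := "East Palo Alto CA 97116-524"
needle := "(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?"

; \s has no special meaning in the replacement text, so it is not a space here.
MsgBox % RegExReplace(address2, needle, "$U{City}\s$U{State}\s${Zip}")
```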

Ain't I a stinker? Let's clean it up:

address2 = East Palo Alto CA 97116-524

address2 := RegExReplace(address2,"(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?","$U{City} $U{State} ${Zip}")

So part 1 is done, I have my text in all uppercase just how I want it. Not only do I not have to change my original statement (if I don't want to) when I send it back through RegExMatch, I now also have the option of updating my old bad habit and quit passing to different variables as a reference to the application that will be using them. But since I don't have to hunt for the +4 portion of the statement as RegExReplace took care of that, I'm taking it out:

address2 = East Palo Alto CA 97116-524

address2 := RegExReplace(address2,"(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})([\d-]+)?","$U{City} $U{State} ${Zip}")
RegExMatch(address2,"(?P<City>[\w\s-]+)\s(?P<State>[A-Z]{2})\s(?P<Zip>\d{5})",WU)
MsgBox % "" WUCity " " WUState " " WUZip ""

And now I've cut down a few lines of work to even fewer lines of work by pairing the functions together and using their strengths.

Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Last active:
  • Joined: 17 Oct 2006

Well I did, but I decided to stand pat on the manual's interpretation of this particular issue since beginners will likely be using that as their reference...

I mention it because for a while, I thought .*abc would never find a match as .* consumes abc - I was unaware of back-tracking. (Actually, there are some advanced cases where back-tracking is disabled, but I needn't go into that here.) The manual hints at the truth:

By default, *, ?, +, and {min,max} are greedy because they consume all characters up through the last possible one that still satisfies the entire pattern.



sinkfaze
  • Moderators
  • 6367 posts
  • Last active: Nov 30 2018 08:50 PM
  • Joined: 18 Mar 2008
So how do you like this, 27 posts in and I'm still not done misleading you in the ways of RegEx goodness! It's a miracle I have a day job.

Throughout the tutorial I have shown you several different ways of creating subpatterns. But there is one important distinction that is worth pointing out:

someprogram(\.log|\.ini)
Tournament \d+ Table \d+ - (Holdem |Omaha |7 Card Stud Hi/Lo )(6 seat |9 seat )?- Stakes \d+/\d+( Ante \d+)?
(?P<Line2>(Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)

15 points and a trip to the Showcase Showdown goes to the person that can tell me what all the unnamed subpatterns (the plain parenthesized groups) have in common.

Didn't figure it out? Shame, shame. What all of the above subpatterns have in common is that we don't require the subpatterns to be captured for a possible use later by RegExMatch or RegExReplace; we're using the subpatterns strictly as a way to group potential matches to a statement. In some respects this is a pretty trivial distinction but in a larger scale use of RegExMatch or RegExReplace, the repeated capturing of useless subpatterns may affect performance.

So how do we jettison the welfare bums suckling at the RegEx teat? Fortunately that's an easy one:

someprogram(?:\.log|\.ini)
Tournament \d+ Table \d+ - (?:Holdem |Omaha |7 Card Stud Hi/Lo )(?:6 seat |9 seat )?- Stakes \d+/\d+(?: Ante \d+)?
(?P<Line2>(?:Apt|Fl|Ste|Spc|Trlr|#)\s?[\w-]+)

Regular Expressions - Subpattern:

To use the parentheses without the side-effect of capturing a subpattern, specify ?: as the first two characters inside the parentheses; for example: (?:.*)
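
The effect on the output array can be seen with a small sketch (AutoHotkey v1 syntax; Cap and NoCap are illustrative variable names). Both statements match the same text, but only the capturing version fills in element 1:

```autohotkey
file := "someprogram.ini"

RegExMatch(file, "someprogram(\.log|\.ini)", Cap)       ; capturing group
RegExMatch(file, "someprogram(?:\.log|\.ini)", NoCap)   ; grouping only

MsgBox % "Cap: " Cap "  Cap1: [" Cap1 "]`nNoCap: " NoCap "  NoCap1: [" NoCap1 "]"
; Cap and NoCap both hold someprogram.ini
; Cap1 holds .ini, while NoCap1 is empty
```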


And before we close this lesson, I leave you with an inspirational quote to keep you motivated on the way to the Tao of RegEx:

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." - Jamie Zawinski



Frankie
  • Members
  • 2930 posts
  • Last active: Feb 05 2015 02:49 PM
  • Joined: 02 Nov 2008
Amazing tutorial. I learned a lot from this. The part about backreferences will be useful in my future scripts.
Any code ⇈ above ⇈ requires AutoHotkey_L to run