Here is a quick link list to each individual post after this one, which is Lesson 1: A Brief Explanation of RegEx and Examples:
Lesson 2: Breaking down Example 2
Lesson 3: Breaking down Example 3
Lesson 4: Using RegEx with window titles
Lesson 5: RegEx and window titles, Ex. 2
Lesson 6: RegEx and window titles, Ex. 3
Lesson 7, Part 1: A window title and a URL - RegEx with window titles review
Lesson 7, Part 2: A window title and a URL - Introduction to using RegExMatch
Lesson 8, Part 1: Address parsing - More RegExMatch techniques and an Introduction to using RegExReplace
Lesson 8, Part 2: Address parsing - RegExMatch review
Lesson 9: HTML and RegEx - RegExReplace
I’m sure many users within the AHK community have heard about or have seen someone who has solved a problem using Regular Expressions, or RegEx, and wondered to themselves, “What the heck is that?!” A bunch of periods, pluses, backslashes, asterisks, question marks…wtf mate? This tutorial is aimed at making RegEx a little more understandable to the user who has no experience with it by looking at a few basic examples of RegEx statements and breaking them down.
What is RegEx?
Contrary to what you may think, the concepts behind RegEx probably aren’t that unfamiliar to you. Have you ever done a search and used a wild card asterisk (*) before? Let’s take a look at a phrase we might use to search our PC for all of the text files on our hard drive:
*.txt
So what is that phrase telling us? It’s telling us to search for any file that ends in .txt; the wild card is, in effect, a short-circuit term meaning “any”. But let’s say we want to find various files that all have the same file name (“someprogram”) but with any extension. Now we use the wild card in a different way:
someprogram.*
Or, if we want to search for certain text files whose names begin with help but end with different alphanumeric sequences followed by .txt, we could use the wild card to search for them like this:
help*.txt
Now all of these searches are, of course, looking for certain text in file names, but by using the wild card we can now better see that we are also searching for certain patterns in file names. Each of the above search terms, patterns and all, is an expression.
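To see the three wild card searches in action, here is a minimal sketch. It is written in Python rather than AHK (an assumption made only so the example can be run and checked), it uses Python's fnmatch module to do the wild card matching, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, invented for this illustration.
files = [
    "notes.txt",
    "someprogram.exe",
    "someprogram.ini",
    "help01.txt",
    "helpABC.txt",
    "readme.md",
]


def wildcard_search(pattern, names):
    """Return the names that match a wild card pattern, in their original order."""
    return [name for name in names if fnmatchcase(name, pattern)]


txt_files = wildcard_search("*.txt", files)              # any name ending in .txt
program_files = wildcard_search("someprogram.*", files)  # same name, any extension
help_files = wildcard_search("help*.txt", files)         # starts with help, ends in .txt

print(txt_files)
print(program_files)
print(help_files)
```

Each call is the same search with a different pattern, which is the point: the pattern, not the exact text, decides what gets found.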
RegEx Explained
Regular expressions are similar in that they use “short-circuit terms” to create searchable terms, but the syntax is a little different. So let’s look back over our previous three search examples to see how a similar RegEx search would look. For our first example, the equivalent RegEx search term would be this:
.*\.txt
Now you might be saying to yourself “…uhhh-huh-uh, what?” but let’s refrain from reverting to Beavis and Butt-Head and take a moment to examine this statement (and keep the Regular Expressions Quick Reference from the manual handy!). In RegEx the period represents any single character that can be matched, so no matter what that one character is, the period will match it. That’s great if we’re matching only one character but clearly in this example we will likely need to match several characters; that’s where the asterisk comes into play.
(EDIT: Please note the addendum from Lexikos below regarding the period and "newline" characters.)
The asterisk will match zero or more of the preceding character, class or subpattern. We will discuss classes and subpatterns later, but for our example the action of the asterisk is dictated by the preceding character, which in this case is the period. In other words, .* will match zero or more occurrences of any character. If you haven’t guessed it already (or if you just read it in the manual) .* is one of the most permissive RegEx patterns since it will match, well, anything!
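A quick way to convince yourself of what .* does is to try it on a few strings. The sketch below uses Python's re module as a stand-in for AHK (an assumption; the sample strings are made up). It also shows the newline caveat mentioned in the EDIT note: in Python, at least, the period does not match a newline character by default, so .* stops at the end of the first line.

```python
import re

# Hypothetical sample strings, chosen only to show the behaviour of .*
empty_match = re.match(r".*", "")                    # zero characters is still a match
word_match = re.match(r".*", "report")               # every character gets matched
two_line_match = re.match(r".*", "first\nsecond")    # stops at the newline

print(repr(empty_match.group()))
print(repr(word_match.group()))
print(repr(two_line_match.group()))
```

Notice that the first search succeeds even though there is nothing to match; that is the "zero or more" part doing its job.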
(Now there are two other match characters that act very similarly to the asterisk: the plus (+) and the question mark (?). Unlike the asterisk, the plus matches one or more of the preceding character, class or subpattern, so it has to match something in order to be valid whereas the asterisk can match nothing and still be valid. The question mark, on the other hand, matches zero or one of the preceding character, class or subpattern, which makes that item optional: the RegEx statement colou?r will match the word colour, and it will also match the word color. For our purposes we will continue using the asterisk in our examples but it is good to mention these additional match characters for future reference.)
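Here is a small sketch comparing the asterisk, the plus and the question mark side by side. Again this is Python standing in for AHK (an assumption), and the helper name matches is made up for this example; it checks whether a pattern matches an entire string.

```python
import re


def matches(pattern, text):
    """Hypothetical helper: True when the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


star_on_empty = matches(r"a*", "")     # asterisk: zero occurrences is fine
plus_on_empty = matches(r"a+", "")     # plus: needs at least one "a"
plus_on_aaa = matches(r"a+", "aaa")    # plus: one or more "a"

colour_ok = matches(r"colou?r", "colour")    # the optional "u" is present
color_ok = matches(r"colou?r", "color")      # the optional "u" is absent
colouur_ok = matches(r"colou?r", "colouur")  # two "u"s is one too many

print(star_on_empty, plus_on_empty, plus_on_aaa)
print(colour_ok, color_ok, colouur_ok)
```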
So now that we’ve covered the .* let’s look at the rest of the search term. The backslash (\) is the escape character in RegEx, which means that a special character placed right after it will be treated as a literal. So \. means that we are looking for a literal period as opposed to "any single character". Since that is followed by the letters txt, which are already literal, we can see that \.txt represents the literal characters .txt. Put it all together and you’re doing a search for a file with any name (.*) that ends in .txt (\.txt), .*\.txt!
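Putting the whole statement to work, here is a minimal sketch that runs .*\.txt against a handful of made-up names. It uses Python's re.search as a stand-in for an AHK RegEx search (an assumption); like a plain RegEx search, it looks for the pattern anywhere in the string.

```python
import re

pattern = r".*\.txt"

# Hypothetical names, invented for this illustration.
names = ["notes.txt", ".txt", "someprogram.exe", "txt"]

# Keep only the names in which the pattern is found.
found = [name for name in names if re.search(pattern, name)]

print(found)
```

The name .txt is found because .* is allowed to match nothing at all, while txt is not found because there is no literal period in front of it.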
That wasn’t so bad, was it?
Now I do want to clarify one thing before we move on to the second example. If we had not used a backslash to signify the literal period (\.) RegEx still would've matched our file names since the period matches any character and the literal period is a character (duh). The difference is that an un-escaped period would also accept any other character in that spot. There will be times when using the backslash to signify literals in your statements will be critical to matching terms, so we will use it here just to reinforce good writing habits.
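To show where the backslash starts to matter, this sketch runs the escaped and un-escaped versions of the statement against the same made-up strings. It is Python standing in for AHK (an assumption), and the sample strings are hypothetical.

```python
import re

escaped = r".*\.txt"    # literal period required before txt
unescaped = r".*.txt"   # any single character allowed before txt

# Hypothetical sample strings, invented for this illustration.
samples = ["notes.txt", "notes_txt", "mytxt"]

escaped_hits = [s for s in samples if re.search(escaped, s)]
unescaped_hits = [s for s in samples if re.search(unescaped, s)]

print(escaped_hits)
print(unescaped_hits)
```

Both versions find notes.txt, but only the un-escaped one also lets notes_txt and mytxt through, which is exactly the kind of false match the backslash prevents.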