AutoHotkey Community Let's help each other out
PhiLho
Joined: 27 Dec 2005 Posts: 6723 Location: France (near Paris)
Posted: Thu Jun 22, 2006 5:29 pm Post subject: Regular expressions: a wrapper around the PCRE DLL
This code is obsolete as of AutoHotkey version 1.0.45, which embeds PCRE!
It may still be of interest for educational purposes...
OK, I see half of the readers (uh? two-thirds? ninety percent?) asking "What are regular expressions?".
Well, I won't provide a full explanation here. I started to hand-write a tutorial, which I still have to type up and finish...
But in a few words, regular expressions (also called regexp, regex or RE) are a powerful (but a bit geeky) way to manipulate text.
With them, you can see if a generic string (eg. "5 letters followed by 2 digits") is inside a text, you can extract this string (eg. getting the current version number from the AutoHotkey download page), check if a string meets some criteria (has the user typed a date in the right format?), transform a text (morph a list of C's #defines into a list of AHK's variable assignments), split a string with complex requirements (eg. get all words of a natural text, separated by spaces or punctuation signs), etc.
The drawback is their syntax, a bit cryptic for the uninitiated (and sometimes for the initiated...), but with practice, it appears that most tasks use rather simple expressions.
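To make these tasks a bit more concrete, here is a minimal sketch. It is written in Python rather than AutoHotkey, only because Python's re module understands a similar Perl-style syntax and can be run without any DLL. The sample text and the helper name looks_like_date are made up for the illustration.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Order ABCDE42 was shipped on 2006-06-22, see invoice."

# "5 letters followed by 2 digits": is such a string inside the text?
found = re.search(r"[A-Za-z]{5}\d{2}", text)
code = found.group(0) if found else None

# Check that a whole string meets some criteria (here, a YYYY-MM-DD shape).
def looks_like_date(s):
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) is not None

# Get all words of a natural text, separated by spaces or punctuation.
words = re.findall(r"[A-Za-z]+", "Hello, world! How are you?")
```

Note that the date check only validates the shape of the string, not whether the month or day numbers make sense.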
Currently, AutoHotkey doesn't support regular expressions, so we have to rely on some external DLL. One of the most used is PCRE (Perl Compatible Regular Expressions), which is powerful and can be compiled to a rather small DLL.
Thomas Lauer already provided a wrapper DLL for PCRE 5.0.
It has the advantage of being small and of implementing a replace algorithm, since PCRE itself only does searches.
It has the drawbacks of relying on an old version of the library (but the latest ones are big!), of using only the Posix version of the library, of needing a supplementary wrapper (in AHK) around this DLL, of being rather inefficient because it compiles an RE at each use, of being difficult to change (you need a C compiler), etc.
So, I tried to make my own implementation in pure AutoHotkey using only the official DLL.
Thus, if a new version comes out, you can use it. Or, possibly with some changes, you can use an older, smaller version. You can customize the wrapper to your tastes, since it is pure script.
The replace algorithm might be a bit slower, because I had to write it all in AHK, but you can compensate by adding extra power by hacking these routines.
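For the curious, the general idea of building a replace on top of a search-only engine can be sketched as below. This is not the code of PCRE_DLL.ahk, just an illustration in Python under some assumptions: the names expand_template and replace_all are hypothetical, only single-digit $n references are handled, and a limit of 0 stands for "replace all".

```python
import re

def expand_template(template, match):
    # Replace each $n in the template by the nth capture of this match.
    # A capture that did not take part in the match becomes an empty string.
    return re.sub(r"\$(\d)",
                  lambda m: match.group(int(m.group(1))) or "",
                  template)

def replace_all(pattern, template, text, limit=0):
    # limit=0 means "replace all occurrences".
    regex = re.compile(pattern)
    pieces, pos, count = [], 0, 0
    while (limit == 0 or count < limit) and pos <= len(text):
        match = regex.search(text, pos)  # the only engine feature used: search
        if match is None:
            break
        pieces.append(text[pos:match.start()])           # text before the match
        pieces.append(expand_template(template, match))  # the substitution
        pos = match.end()
        if match.end() == match.start():
            # Empty match: copy one character and step over it,
            # otherwise the loop would never advance.
            pieces.append(text[pos:pos + 1])
            pos += 1
        count += 1
    pieces.append(text[pos:])  # whatever remains after the last match
    return "".join(pieces)
```

A reference to a capture number that does not exist in the expression raises an error here; a real wrapper would have to decide how to report that.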
I provide no split function, because it is inconvenient to write in AutoHotkey, as we cannot return arrays. So either the result would be global, or hard to fetch. But implementing a split with the provided functions should be quite trivial.
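As a rough idea of how a split can be built from successive matches, here is a sketch in Python (where returning a list is not a problem, unlike in AutoHotkey at the time). The function name split_with_matches is hypothetical, and empty separators are simply ignored.

```python
import re

def split_with_matches(pattern, text):
    # Each match of the pattern is treated as a separator;
    # the pieces between two separators are collected.
    regex = re.compile(pattern)
    parts, pos = [], 0
    for match in regex.finditer(text):
        if match.end() == match.start():
            continue  # ignore empty separators
        parts.append(text[pos:match.start()])
        pos = match.end()
    parts.append(text[pos:])  # piece after the last separator (may be empty)
    return parts
```

In an AutoHotkey script of that era, the pieces would probably have to go to a global pseudo-array or a delimited string instead of a returned list.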
The version I release today is a bit geeky, in the sense you don't use the RE strings directly, but you have to compile them before using them.
The advantage is performance: you compile a regular expression once, then reuse it as many times as you like, and the library won't need to recompile it.
The disadvantage is that it is not very intuitive, and not in the spirit of AutoHotkey.
So I am planning to do another version more in the spirit of my signature... or of AHK.
The trade-off will be lower performance, but it probably won't be noticeable except when parsing a very big file line by line... And it will be perfect, for example, for a quick validation of a formatted edit field.
Plus this version might serve as a prototype for a future integration of regular expressions in AHK... Note that such an implementation could perform better, perhaps by caching the expressions. If caching (hashing) is much faster than compiling, there might be an advantage. That's the way Perl manages REs too: it only avoids caching dynamic expressions (ie. those resulting from concatenation, variable expansion, etc.).
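The caching idea could look like the following sketch: the caller passes the expression as a plain string, and a small cache keyed on that string avoids recompiling it. It is written in Python with hypothetical names (compiled, first_match) and an arbitrary cache size; it is not how any of the libraries mentioned here actually do it.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=64)  # arbitrary size, chosen for the illustration
def compiled(pattern):
    # Compile only on the first request; later calls with the same
    # string get the already compiled object back from the cache.
    return re.compile(pattern)

def first_match(pattern, text):
    # "Friendly" call: takes the expression as a string each time.
    m = compiled(pattern).search(text)
    return m.group(0) if m else None
```

Whether this pays off depends on the lookup being much cheaper than the compilation, as said above. Python's re module already keeps a similar cache of recently used patterns internally.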
It should also implement friendlier options (letters instead of big constant names).
Now it is time to take a look:
PCRE_DLL.ahk
TestPCRE_DLL.ahk
PCRE-6.4.zip, (only) the DLL. You can get other compiled DLLs at GnuWin32 or at Psyon site (untested yet, may be smaller).
As you can see, the test script is becoming big, but it only touches the surface of the library, with simple expressions, no options, no offset.
So there can be bugs there. If you find any, please report them here.
An overview of the usage of the library:
| Code: |
stringToSearch = You can do /Regular Expressions/ in AutoHotkey too!
; Compile regular expression and get a reference to the result
hRE := PCRE_RegisterRegExp("R(A|H)(u|o)**")
; There is an error, the handle is null, we can use the provided mini-GUI
; that points out where the error is in the expression (if single line).
if (hRE = 0)
    PCRE_ShowLastError()
; Compile a correct RE
hRE := PCRE_RegisterRegExp("([A-Z])([a-z])")
; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
; Get both position and length of the match, in a string, separated by a pipe (|)
pos@len := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
; Get the matched string
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)
; Get the first match of this RE on the given string, as a reference for use in further calls
hMatch := PCRE_Match(hRE, stringToSearch)
If (ErrorLevel = #PCRE_ERROR_NOMATCH)
{
    MsgBox No match!
    ExitApp
}
; Get how many captured strings there were in this match:
; number of matched captures, plus the implicit capture of the whole match.
n := PCRE_GetMatchedCaptureNumber(hMatch)
; Get position and length of the captures
PCRE_GetMatchVals(hMatch, 0, pos0, len0) ; Whole match
PCRE_GetMatchVals(hMatch, 1, pos1, len1) ; First capture
PCRE_GetMatchVals(hMatch, 2, pos2, len2) ; Second capture
; Get strings of captures
s0 := PCRE_GetMatchStr(hMatch, 0) ; Whole match
s1 := PCRE_GetMatchStr(hMatch, 1) ; First capture
s2 := PCRE_GetMatchStr(hMatch, 2) ; Second capture
; Find next match and update the reference
PCRE_MatchNext(hRE, hMatch)
; Similar to:
hMatch := PCRE_Match(hRE, stringToSearch, pos0 + len0)
; but the latter is less efficient, creating another reference instead of reusing it.
; Replace the whole match(es) by the given string,
; with $n replaced by the nth capture.
hRS1 := PCRE_RegisterReplaceString("$2-$1!")
; Idem with user-defined symbol.
hRS2 := PCRE_RegisterReplaceString("\_2-\_1!", "\_")
; Idem, two-parts symbol, to avoid ambiguity
hRS3 := PCRE_RegisterReplaceStringEx("${2}1-${1}0!")
; Idem, with user-defined symbols
hRS4 := PCRE_RegisterReplaceStringEx("\_2_/-\_1_/!", "\_", "_/")
; Note that unlike Perl, you cannot mix both notations. See the test file for more explanations.
; "A" is to replace all occurrences (default), can be a maximum number of replacements.
resultString := PCRE_Replace(hRE, hRS1, stringToReplace, "A")
resultString := PCRE_Replace(hRE, hRS3, stringToReplace, 1)
; This call is optional, it will unload the library (automatically loaded on first use)
; and free the data allocated by the DLL.
; If not called, Windows will free all this on script exit.
PCRE_End()
|
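The ambiguity that the two-part notation (${2}1) avoids exists in other regex flavours too. As an illustration only, here is the same situation in Python, which writes captures as \1 or \g<1> instead of $1 or ${1}. The goal is "second capture followed by a literal 1, first capture followed by a literal 0".

```python
import re

pattern = re.compile(r"([A-Z])([a-z])")

# Delimited form: \g<2> is the second capture, the 1 after it is a literal.
result = pattern.sub(r"\g<2>1-\g<1>0!", "Yo")

# Short form: \21 is read as "capture number 21", which does not exist
# in this expression, so Python refuses the replacement string.
try:
    pattern.sub(r"\21-\10!", "Yo")
    short_form_accepted = True
except re.error:
    short_form_accepted = False
```

Here result is "o1-Y0!" and short_form_accepted ends up False.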
_________________
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
Last edited by PhiLho on Thu Nov 09, 2006 11:12 am; edited 3 times in total

evl
Joined: 24 Aug 2005 Posts: 1234

foom
Joined: 19 Apr 2006 Posts: 386

Chris Site Admin
Joined: 02 Mar 2004 Posts: 10667
Posted: Fri Jun 23, 2006 2:01 am
Great presentation. I used to be a RegEx novice but have learned a lot more about it in conjunction with phpBB and .htaccess modifications. Armed with this understanding and your work in this and other topics, RegEx is getting closer to being integrated with AHK.
Thanks.

BoBo Guest
Posted: Fri Jun 23, 2006 8:22 am Post subject: Regular Expressions DLL for Win32 Programs
| Quote: |
Regular Expressions DLL for Win32 Programs
121010 Kb
1999-03-16 00:00:00
gnuregex.dll: Regular Expressions for Win32 Programs
----------------------------------------------------
If you've ever wanted to add regular expressions to
a Win32 program, here's your chance.
This DLL is under the GNU General Public License
(almost all the source for it comes from the regex
library 0.12), so if you distribute a program that
uses it, you must follow the terms detailed in
COPYING.
[Download]
|
No idea if this is worth a look, I stumbled over it while checking for other things...

corrupt
Joined: 29 Dec 2004 Posts: 2446
Posted: Fri Jun 23, 2006 9:26 am Post subject: Re: Regular Expressions DLL for Win32 Programs
Thanks for the link BoBo
typo? unpacked size 375,860 bytes, .dll size 41.5 KB
Last edited by corrupt on Fri Jun 23, 2006 9:29 am; edited 1 time in total

PhiLho
Joined: 27 Dec 2005 Posts: 6723 Location: France (near Paris)
Posted: Fri Jun 23, 2006 9:27 am
Thank you.
The RegEx Coach is an excellent tool, I use it to verify complex expressions.
There is also the PCRE Workbench which, as the name implies, uses the same library as I do. Its weakness is that it limits test strings to one line.
The Regular Expression Laboratory looks OK too, as does JRegExpTester, a Java application, ie. targeted at Java syntax. It has the advantage of using the RE library of regular-expressions.info.
Of course, there are more similar tools, I might even write one in AutoHotkey... With the weakness that currently, it is hard to colorize the various parts of a string.
Note also, for those wanting to do quick tests without downloading a program, two online programs to test REs (there are others...):
- Java syntax: RegEx
- JavaScript and both PHP syntaxes: REGex TESTER. This one is impressive because it uses Ajax to transmit the strings to test to the PHP program and retrieve the results, so you never leave or refresh the page.
Well, there is also regular-expressions.info's JavaScript tester, which isn't bad either.
I hope this will whet your appetite for REs!

SKAN
Joined: 26 Dec 2005 Posts: 7184
Posted: Fri Jun 23, 2006 10:05 am
Dear PhiLho,
I do not know anything about RegEx, but just heard of it when I gave a try to sed sometime ago. Since it is you, I think this is something that I should try & learn. I thank you for contributing such a nice thing to the community.
Regards,
_________________
Suresh Kumar A N

olfen
Joined: 04 Jun 2005 Posts: 99 Location: Stuttgart, Germany
Posted: Sat Jun 24, 2006 8:17 am
Thanks, PhiLho, for your superbly commented work. Very useful!

Rajat
Joined: 28 Mar 2004 Posts: 1687
Posted: Sat Jun 24, 2006 7:35 pm
Thanks, PhiLho, firstly for the in-depth knowledge of regex that I didn't have earlier, and then for making regex easily available to us folks.

olfen
Joined: 04 Jun 2005 Posts: 99 Location: Stuttgart, Germany
Posted: Sat Jul 01, 2006 10:34 pm
Hello PhiLho,
I just did a couple of tests. Test1 and Test3 don't work as expected. All 3 RegExes give the expected match in Regex Coach.
Am I doing something wrong?
| Code: |
#SingleInstance Force
#NoEnv
#Include PCRE_DLL.ahk
stringToSearch = Test123
;Test1 - No match found. Expected: "Test1"
hRE := PCRE_RegisterRegExp("^.{5}")
if (hRE = 0)
    PCRE_ShowLastError()
; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
res := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)
MsgBox,
(
hRE: %hRE% (%ErrorLevel%)
Pos: %pos%
Pos&Len: %res%
Match: %match%
)
;Test2 - Working as expected, match: "123"
hRE := PCRE_RegisterRegExp(".{3}$")
if (hRE = 0)
    PCRE_ShowLastError()
; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
res := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)
MsgBox,
(
hRE: %hRE% (%ErrorLevel%)
Pos: %pos%
Pos&Len: %res%
Match: %match%
)
;Test3 - No match found. Expected: "Test123"
hRE := PCRE_RegisterRegExp("^\D*\d*$")
if (hRE = 0)
    PCRE_ShowLastError()
; Get the position of a match on the given string.
pos := PCRE_GetMatch(hRE, stringToSearch)
res := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 0, #PCRE_GETSTRING)
MsgBox,
(
hRE: %hRE% (%ErrorLevel%)
Pos: %pos%
Pos&Len: %res%
Match: %match%
)
|

PhiLho
Joined: 27 Dec 2005 Posts: 6723 Location: France (near Paris)
Posted: Mon Jul 03, 2006 10:00 am
It is my fault, I changed the API of PCRE_GetMatch so _startOffset defaulted to 1, ie. start of string, first char, in AutoHotkey tradition, instead of 0, in C tradition... But my test code, which you took as base, still used 0, so I gave -1 to the DLL... Garbage in, garbage out...
So, please, change the calls to:
| Code: |
res := PCRE_GetMatch(hRE, stringToSearch, 1, #PCRE_GETLENGTH)
match := PCRE_GetMatch(hRE, stringToSearch, 1, #PCRE_GETSTRING)
| (file is updated)
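The off-by-one described above is easy to reproduce in any wrapper that exposes 1-based offsets on top of a 0-based engine. Here is a small sketch in Python of the conversion and of a guard against the bad value. The function get_match is hypothetical and only mimics the idea, not the real PCRE_GetMatch.

```python
import re

def get_match(pattern, text, start_offset=1):
    # start_offset is 1-based (first char = 1), as in AutoHotkey tradition.
    if start_offset < 1:
        # 0 would be converted to -1 for the 0-based engine: reject it
        # instead of passing garbage along.
        raise ValueError("start_offset is 1-based, got %r" % start_offset)
    m = re.compile(pattern).search(text, start_offset - 1)
    return m.group(0) if m else None

# The three expressions from the tests above, on the same string.
test1 = get_match(r"^.{5}", "Test123")
test2 = get_match(r".{3}$", "Test123")
test3 = get_match(r"^\D*\d*$", "Test123", 1)
```

With the 1-based offset, the three calls give "Test1", "123" and "Test123", which are the results expected in the tests.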
Thank you for reporting the problem, and sorry for the confusion.

olfen
Joined: 04 Jun 2005 Posts: 99 Location: Stuttgart, Germany
Posted: Mon Jul 03, 2006 5:52 pm
Thanks for the explanation and solution, seems to work fine now.

PhiLho
Joined: 27 Dec 2005 Posts: 6723 Location: France (near Paris)
Posted: Mon Jul 24, 2006 4:11 pm
OK, after some days of "rest" (doing something else), I finally finished my regular expression tutorial!
You can find it on my site.
As I explain at the start, I tried to make a tutorial with concrete, real examples, yet avoiding forward references ("This expression uses some features we will see later"...), and trying to add some levity to this otherwise rather arid subject...
If you are curious and adventurous enough to read it, don't hesitate to give me feedback.
Feedback on the content (is it clear enough, should I insist on some point, etc.) is welcome in the specific topic I created in the Utilities & Resources section.
Feedback on the form (syntax, phrasing, sentences sounding too "Frenchy", etc.) should be private (PhiLho(a)GMX.net) to avoid adding noise to these topics.
Note that the page is printer friendly: I use a specific stylesheet for the printer (if your browser is smart enough) and you should be able to print in two columns, for example (if your driver is smart enough...).
Now, I should go back to work on my EasyRegEx wrapper...

PhiLho
Joined: 27 Dec 2005 Posts: 6723 Location: France (near Paris)
Posted: Thu Aug 24, 2006 9:38 am
About my (current) signature:
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
I explained the image in another topic, but I re-explain it here:
It is a drawing I made (much larger!) inspired by Celtic knots. I named it KnotMan. The previous link points to a larger version of the picture (500x500 pixels, 28KB) for the curious...
The text explains the origin of the nickname with a regular expression, in a pseudo-AutoHotkey expression (pseudo because REs are not built in, but someday I will create the RegExReplace function using the above library), with syntax coloring.
This signature shows three of my points of interest: drawing (and Celtic knots!), regular expressions, and my work on Scintilla, the syntax highlighting editor component, and SciTE, my editor of choice using this component.
If you apply the ^(\w{3})\w*\s+\b(\w{3})\w*$ expression to my real name (Philippe Lhoste) to replace it with the given substitution string ($1$2), you will get "PhiLho", which I show as the chosen variable name.
The expression, a bit more convoluted than necessary to make it more cryptic, means "match, at the start of the string, three word chars (captured) followed by any number of word chars, then at least one space (or tab, blank char), then at a start of a word (the redundant part), match and capture three word chars, then match the remainder of the word up to the end of the string".
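For those who want to check this without the library, the same expression can be tried in Python, whose re module understands this syntax. The only difference is that the substitution string is written \1\2 instead of $1$2.

```python
import re

pattern = r"^(\w{3})\w*\s+\b(\w{3})\w*$"
nickname = re.sub(pattern, r"\1\2", "Philippe Lhoste")
# nickname is now "PhiLho"

# The \b is indeed redundant: after \s+, the next \w always starts a word.
simpler = re.sub(r"^(\w{3})\w*\s+(\w{3})\w*$", r"\1\2", "Philippe Lhoste")
```

Both calls give the same result on this name.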
Several people asked me, privately or not, what it means, so I thought this was an appropriate place to explain it. I hope it is clearer now.