Joy2DWorld
Joined: 04 Dec 2006 Posts: 400 Location: Galil, Israel
|
Posted: Fri Jun 29, 2007 11:13 am Post subject: Regex doc re: "un-Greedy" * ("*?") doc |
|
|
doc today:
| Quote: | | Greed: By default, *, ?, +, and {min,max} are greedy because they consume all characters up through the last possible one that still satisfies the entire pattern. To instead have them stop at the first possible character, follow them with a question mark. For example, the pattern <.+> (which lacks a question mark) means: "search for a <, followed by one or more of any character, followed by a >". To stop this pattern from matching the entire string <em>text</em>, append a question mark to the plus sign: <.+?>. This causes the match to stop at the first '>' and thus it matches only the first tag <em>. |
What it actually causes is this:
Instead of letting the search gobble up every possible match for ".+" and then give characters back one by one until the next element of the search 'needle' can match (and thus match the *last* >),
it allows the search to match the next element of the search 'needle' as soon as possible.
| Code: | | msgbox % regexmatch("111122221111","(?<x>1+?2)",X) " - " Xx | and with null subsequent criteria:
| Code: | | msgbox % regexmatch("111122221111","(?<x>1+?)",X) " - " Xx |
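To put the greedy and lazy forms side by side, here is a minimal sketch using the doc's own <em>text</em> example. It assumes AutoHotkey v1 syntax, as in the snippets above; the variable names Haystack, Greedy and Lazy are made up for illustration.

```autohotkey
; AutoHotkey v1 syntax, as in the snippets above.
Haystack := "<em>text</em>"

; Greedy: .+ grabs as much as it can, then gives characters back
; until the final ">" of the pattern can match (the last ">").
RegExMatch(Haystack, "<.+>", Greedy)

; Lazy: .+? grabs as little as it can and grows only until ">"
; can match (the first ">").
RegExMatch(Haystack, "<.+?>", Lazy)

MsgBox % "Greedy: " Greedy "`nLazy: " Lazy
```

Run as a script, the message box should show <em>text</em> for the greedy pattern and <em> for the lazy one.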
(hope this is helpful)
i.e.
| Quote: | | This causes the search to match the next element in the search 'needle' as soon as possible and thus it matches only the first tag <em>, as the > after em is the first possible match for the rest of the search criteria |
or in AHK style maybe more:
| Quote: | | This causes the search to match the next element in the search 'needle' as soon as possible and thus it matches only the first tag <em>. |
_________________ Joyce Jamce |
 |
Joy2DWorld
Joined: 04 Dec 2006 Posts: 400 Location: Galil, Israel
|
Posted: Sun Jul 01, 2007 10:42 pm Post subject: |
|
|
also, in case it has not already been updated:
current doc:
| Quote: | | Within a regular expression, special characters such as tab and newline can be escaped with either an accent (`) or a backslash (\). For example, `t is the same as \t. |
is incorrect. The items cannot be "escaped" by the accent at the Regex level.
so... if helpful,
maybe something like:
| Quote: | | Within the regular expression engine, special escaped characters such as tab and newline are recognized by a backslash sequence (\n,\t, etc.). Special characters can also be included in expressions as literal tabs, newlines, etc., or by the standard AHK accent escapes. For example, `t , `n, etc. |
_________________ Joyce Jamce |
 |
Chris Site Admin
Joined: 02 Mar 2004 Posts: 10463
|
Posted: Thu Jul 05, 2007 1:55 am Post subject: |
|
|
I don't agree that the current wordings are wrong. I wrote the docs colloquially and concisely to help users understand and use RegEx -- not to explain how regular expressions work internally.
On the other hand, I welcome second opinions: if PhiLho or anyone else thinks the current wordings are wrong or misleading, please let me know. |
 |
corrupt
Joined: 29 Dec 2004 Posts: 2381
|
Posted: Thu Jul 05, 2007 6:05 am Post subject: |
|
|
I think it's misleading. If they are not the same then they are not the same. |
 |
Titan
Joined: 11 Aug 2004 Posts: 5009 Location: imaginationland
|
Posted: Thu Jul 05, 2007 8:28 am Post subject: |
|
|
I think Chris' definition is correct. For a more precise explanation see http://www.pcre.org/pcre.txt section titled 'PCRE MATCHING ALGORITHMS'. _________________
RegExReplace("irc.freenode.net/autohotkey", "^(?=(.(?=[\0-r\[]*((?<=\.).))))(?:[c-\x73]{2,8}(\S))+((2)|\b[^\2-]){2}\D++$", "$u3$1$3$4$2") |
 |
PhiLho
Joined: 27 Dec 2005 Posts: 6721 Location: France (near Paris)
|
Posted: Thu Jul 05, 2007 12:03 pm Post subject: |
|
|
You are arguing about the wording of an example, which is correct.
Chris is right, his doc is just a quick reference, and cannot be exhaustive, and isn't a tutorial. _________________
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2") |
 |
corrupt
Joined: 29 Dec 2004 Posts: 2381
|
Posted: Thu Jul 05, 2007 4:09 pm Post subject: |
|
|
The doc states that \ and ` escaped characters are the same. Do they always have the same effect in all cases? If they don't then the docs are not correct. If the docs are not meant to be specific to AutoHotkey's implementation of RegEx then they should be deleted and a link to an alternate site should be added instead. This is where people will look for specific differences between RegEx usage in other languages vs usage in AutoHotkey. For this reason, as much detail as possible should be provided for syntax/behaviour that is specific to AutoHotkey. I'm not sure I understand why the RegEx sections should be an exception. If anything, these are the areas where the documentation should go into more detail, not less detail. I wouldn't worry about the current documentation on RegEx being unintentionally interpreted as being any form of tutorial. |
 |
PhiLho
Joined: 27 Dec 2005 Posts: 6721 Location: France (near Paris)
|
Posted: Fri Jul 06, 2007 2:27 pm Post subject: |
|
|
I was commenting on the first part. Indeed, Joy2DWorld pointed out `n and \n can be different in some cases.
Now, even if the RegEx quick reference might be handy, somehow I agree with corrupt, perhaps no reference, or one concentrating on specific gotchas, might be better. The Net is full of tutorials / references anyway.
On the other hand, it can be frustrating for newcomers to regexes not to have at least the basics of the syntax of a function...
As always, a delicate balance. _________________
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2") |
 |
corrupt
Joined: 29 Dec 2004 Posts: 2381
|
Posted: Fri Jul 06, 2007 3:54 pm Post subject: |
|
|
| PhiLho wrote: | | As always, a delicate balance. |
I agree to a certain extent, but I think that help sections related to advanced topics should contain more information. Unfortunately, troubleshooting a complex RE can be extremely time-consuming, especially when the problem turns out to be an exception and/or gotcha that is specific to AutoHotkey. Most people will likely not be too pleased to discover that what they might have spent the last few hours trying to troubleshoot was a known issue that the author felt would clutter the documentation too much if it was mentioned in an advanced section of the documentation. |
 |
PhiLho
Joined: 27 Dec 2005 Posts: 6721 Location: France (near Paris)
|
Posted: Fri Jul 06, 2007 4:23 pm Post subject: |
|
|
Well, your idea of "advanced documentation", listing all the obscure gotchas that Chris doesn't want to expand on but explains in the Bug Reports section (or elsewhere), is good.
The Wiki looks like a good place to do this.
That would be an iFAQ (infrequently asked questions)... _________________
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2") |
 |
Joy2DWorld
Joined: 04 Dec 2006 Posts: 400 Location: Galil, Israel
|
Posted: Sat Jul 07, 2007 10:39 pm Post subject: |
|
|
Maybe this is helpful, and can suggest the proper doc wording:
`n is NOT a REGEX escape sequence.
`n is the equivalent of inserting a chr(13) into the relevant string.
\n is NOT equivalent to inserting a chr(13) into the relevant string.
the x) option affects the regex response to chr(13) in the needle string. Thus the distinction is important.
`n is an ESCAPE SEQUENCE FOR AHK, *NOT* for REGEX !
there is no shame in that!!! it's even kind of cool.... and the more info given.. allows users more power... more understanding and options.....
hope this helps.
oh and...
| Quote: | | Most people will likely not be too pleased to discover that what they might have spent the last few hours trying to troubleshoot was a known issue that the author felt would clutter the documentation too much if it was mentioned... |
exactly. _________________ Joyce Jamce |
 |
Grumpy Guest
|
Posted: Mon Jul 09, 2007 2:16 pm Post subject: |
|
|
| Joy2DWorld wrote: | | `n is the equivalent of inserting a chr(13) into the relevant string. | Not really, that's chr(10), but it doesn't invalidate your argument... |
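A quick way to check the character codes for yourself, as a minimal sketch in AutoHotkey v1 syntax: Asc() returns the code of the first character of the string it is given.

```autohotkey
; AutoHotkey v1: Asc() returns the character code of the first character.
MsgBox % Asc("`n")   ; linefeed
MsgBox % Asc("`r")   ; carriage return
```

The first box should show 10 and the second 13, which agrees with the correction above.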
 |
Lexikos
Joined: 17 Oct 2006 Posts: 2472 Location: Australia, Qld
|
Posted: Sat Jul 14, 2007 4:28 am Post subject: |
|
|
| Joy2DWorld wrote: | | the x) option affects the regex response to chr(13) in the needle string. Thus the distinction is important. |
| RegExMatch, Options, x wrote: | | Ignores whitespace characters in the pattern except when escaped or inside a character class. The characters `n and `t are among those ignored because by the time they get to PCRE, they are already raw/literal whitespace characters (by contrast, \n and \t are not ignored because they are PCRE escape sequences). The x option also ignores characters between a non-escaped # outside a character class and the next newline character, inclusive. This makes it possible to include comments inside complicated patterns. However, this applies only to data characters; whitespace may never appear within special character sequences such as (?(, which begins a conditional subpattern. | A distinction is very clearly made.
Are there any other gotchas? Perhaps ones that aren't documented? |
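The distinction quoted above can be tried out with a small sketch. It assumes AutoHotkey v1 syntax; the variable names Haystack, Literal and Escaped are hypothetical and only there for illustration.

```autohotkey
; AutoHotkey v1 syntax.
Haystack := "a`nb"   ; "a", linefeed, "b"

; `n is converted by AutoHotkey to a raw linefeed before PCRE sees the pattern.
; Under the x) option that raw linefeed counts as ignorable whitespace,
; so the pattern is effectively "ab".
Literal := RegExMatch(Haystack, "x)a`nb")

; \n is passed through to PCRE as an escape sequence, so x) does not ignore it.
Escaped := RegExMatch(Haystack, "x)a\nb")

MsgBox % "Literal: " Literal ", Escaped: " Escaped
```

If the x option behaves as the quoted documentation describes, Literal should come back as 0 (no match) and Escaped as 1 (match found at position 1).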