AutoHotkey Homepage AutoHotkey Community
Let's help each other out
 

Regex doc re: "un-Greedy" * ("*?") doc

 
Joy2DWorld



Joined: 04 Dec 2006
Posts: 400
Location: Galil, Israel

Posted: Fri Jun 29, 2007 11:13 am    Post subject: Regex doc re: "un-Greedy" * ("*?") doc

doc today:

Quote:
Greed: By default, *, ?, +, and {min,max} are greedy because they consume all characters up through the last possible one that still satisfies the entire pattern. To instead have them stop at the first possible character, follow them with a question mark. For example, the pattern <.+> (which lacks a question mark) means: "search for a <, followed by one or more of any character, followed by a >". To stop this pattern from matching the entire string <em>text</em>, append a question mark to the plus sign: <.+?>. This causes the match to stop at the first '>' and thus it matches only the first tag <em>.
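
The doc's own example can be checked directly. This is a minimal sketch assuming AutoHotkey v1 syntax; the variable names are illustrative and not from the docs.

```autohotkey
; AutoHotkey v1 syntax assumed; Haystack, Greedy and Lazy are illustrative names.
Haystack := "<em>text</em>"
RegExMatch(Haystack, "<.+>", Greedy)   ; .+ runs to the last > that still lets the pattern match
RegExMatch(Haystack, "<.+?>", Lazy)    ; .+? stops at the first > that lets the pattern match
MsgBox % "Greedy: " Greedy "`nLazy: " Lazy
; Greedy: <em>text</em>
; Lazy: <em>
```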




What it actually causes is:

Instead of letting ".+" gobble up every possible character and then 'give up' characters one by one until the next element in the search 'needle' can match (and thus match the *last* >),

it allows the search to match the next element in the search 'needle' as soon as possible.


Code:
msgbox % regexmatch("111122221111","(?<x>1+?2)",X) " - " Xx
and with no subsequent criteria:

Code:
msgbox % regexmatch("111122221111","(?<x>1+?)",X) " - " Xx
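
For reference, here is roughly what the two calls above come down to, with a greedy variant added for comparison. This is a sketch assuming AutoHotkey v1; the named subpattern is dropped and the variable names are made up for illustration.

```autohotkey
; AutoHotkey v1 syntax assumed; variable names are illustrative.
Haystack := "111122221111"
PosLazy   := RegExMatch(Haystack, "1+?2", Lazy)    ; lazy, followed by a 2
PosGreedy := RegExMatch(Haystack, "1+2", Greedy)   ; greedy, followed by a 2
PosBare   := RegExMatch(Haystack, "1+?", Bare)     ; lazy, nothing after it
MsgBox % PosLazy " - " Lazy "`n" PosGreedy " - " Greedy "`n" PosBare " - " Bare
; 1 - 11112
; 1 - 11112
; 1 - 1
```

The lazy and greedy forms give the same result when a 2 must follow: the match still starts at the leftmost position, and the lazy quantifier only takes as many 1s as it needs to reach the 2. With nothing after it, the lazy form is satisfied by a single character.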


(hope this is helpful)

ie.

Quote:
This causes the search to match the next element in the search 'needle' as soon as possible, and thus it matches only the first tag <em>, as the > after em is the first possible match for the rest of the search criteria.


or in AHK style maybe more:

Quote:
This causes the search to match the next element in the search 'needle' as soon as possible, and thus it matches only the first tag <em>.

_________________
Joyce Jamce
Joy2DWorld



Joined: 04 Dec 2006
Posts: 400
Location: Galil, Israel

Posted: Sun Jul 01, 2007 10:42 pm




also, in case it has not already been updated:

current doc:

Quote:
Within a regular expression, special characters such as tab and newline can be escaped with either an accent (`) or a backslash (\). For example, `t is the same as \t.


is incorrect. The items cannot be "escaped" by the accent at the Regex level.


so... if helpful,
maybe something like:

Quote:
Within the regular expression engine, special characters such as tab and newline are recognized by a backslash sequence (\n, \t, etc.). They can also be included in a pattern as literal tabs, newlines, etc., or via the standard AHK accent escapes (`t, `n, etc.).
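
A small sketch of the distinction, assuming AutoHotkey v1 (variable names are illustrative): the accent form is turned into a real character by AHK before the pattern is handed to the regex engine, while the backslash form reaches the engine as two characters. In an ordinary pattern both still find the same newline.

```autohotkey
; AutoHotkey v1 syntax assumed; variable names are illustrative.
MsgBox % StrLen("`n") " vs " StrLen("\n")   ; 1 vs 2

Haystack := "line1`nline2"
PosAccent    := RegExMatch(Haystack, "1`nl")   ; pattern contains a literal linefeed
PosBackslash := RegExMatch(Haystack, "1\nl")   ; pattern contains the regex escape \n
MsgBox % PosAccent " and " PosBackslash        ; 5 and 5
```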

_________________
Joyce Jamce
Chris
Site Admin


Joined: 02 Mar 2004
Posts: 10463

Posted: Thu Jul 05, 2007 1:55 am

I don't agree that the current wordings are wrong. I wrote the docs colloquially and concisely to help users understand and use RegEx -- not to explain how regular expressions work internally.

On the other hand, I welcome second opinions: if PhiLho or anyone else thinks the current wordings are wrong or misleading, please let me know.
corrupt



Joined: 29 Dec 2004
Posts: 2381

Posted: Thu Jul 05, 2007 6:05 am

I think it's misleading. If they are not the same then they are not the same.
Titan



Joined: 11 Aug 2004
Posts: 5009
Location: imaginationland

Posted: Thu Jul 05, 2007 8:28 am

I think Chris' definition is correct. For a more precise explanation see http://www.pcre.org/pcre.txt section titled 'PCRE MATCHING ALGORITHMS'.
_________________

RegExReplace("irc.freenode.net/autohotkey", "^(?=(.(?=[\0-r\[]*((?<=\.).))))(?:[c-\x73]{2,8}(\S))+((2)|\b[^\2-]){2}\D++$", "$u3$1$3$4$2")
PhiLho



Joined: 27 Dec 2005
Posts: 6721
Location: France (near Paris)

Posted: Thu Jul 05, 2007 12:03 pm

You are arguing about the wording of an example, which is correct.
Chris is right, his doc is just a quick reference, and cannot be exhaustive, and isn't a tutorial.
_________________
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
corrupt



Joined: 29 Dec 2004
Posts: 2381

Posted: Thu Jul 05, 2007 4:09 pm

The doc states that \ and ` escaped characters are the same. Do they always have the same effect in all cases? If they don't, then the docs are not correct. If the docs are not meant to be specific to AutoHotkey's implementation of RegEx, then they should be deleted and a link to an alternate site should be added instead. This is where people will look for specific differences between RegEx usage in other languages vs. usage in AutoHotkey. For this reason, as much detail as possible should be provided for syntax/behaviour that is specific to AutoHotkey. I'm not sure I understand why the RegEx sections should be an exception. If anything, these are the areas where the documentation should go into more detail, not less. I wouldn't worry about the current documentation on RegEx being unintentionally interpreted as being any form of tutorial.
PhiLho



Joined: 27 Dec 2005
Posts: 6721
Location: France (near Paris)

Posted: Fri Jul 06, 2007 2:27 pm

I was commenting on the first part. Indeed, Joy2DWorld pointed out that `n and \n can be different in some cases.
Now, even if the RegEx quick reference might be handy, somehow I agree with corrupt: perhaps no reference, or one concentrating on specific gotchas, might be better. The Net is full of tutorials / references anyway.
On the other hand, it can be frustrating for newcomers to regexes not to have at least the basics of the syntax of a function...
As always, a delicate balance.
_________________
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
corrupt



Joined: 29 Dec 2004
Posts: 2381

Posted: Fri Jul 06, 2007 3:54 pm

PhiLho wrote:
As always, a delicate balance.
I agree to a certain extent, but I think that help sections related to advanced topics should contain more information. Unfortunately, troubleshooting complex RE can be extremely time consuming. Especially when the problem turns out to be an exception and/or gotcha that is specific to AutoHotkey. Most people will likely not be too pleased to discover that what they might have spent the last few hours trying to troubleshoot was a known issue that the author felt would clutter the documentation too much if it was mentioned in an advanced section of the documentation.
PhiLho



Joined: 27 Dec 2005
Posts: 6721
Location: France (near Paris)

Posted: Fri Jul 06, 2007 4:23 pm

Well, your idea of "advanced documentation", listing all the obscure gotchas that Chris doesn't want to expand on in the docs but explains in the Bug Reports section (or elsewhere), is good.
The Wiki looks like a good place to do this.
That would be an iFAQ (infrequently asked questions)...
_________________
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")
Joy2DWorld



Joined: 04 Dec 2006
Posts: 400
Location: Galil, Israel

Posted: Sat Jul 07, 2007 10:39 pm

Maybe this is helpful, and can suggest the proper doc wording:

`n is NOT a REGEX escape sequence.

`n is the equivalent of inserting a chr(13) into the relevant string.


\n is NOT equivalent to inserting a chr(13) into the relevant string.



the x) option affects the regex response to chr(13) in the needle string. Thus the distinction is important.


`n is an ESCAPE SEQUENCE FOR AHK, *NOT* for REGEX !



there is no shame in that!!! it's even kind of cool.... and the more info given.. allows users more power... more understanding and options.....


hope this helps.

oh and...
Quote:
Most people will likely not be too pleased to discover that what they might have spent the last few hours trying to troubleshoot was a known issue that the author felt would clutter the documentation too much if it was mentioned...


exactly.
_________________
Joyce Jamce
Grumpy
Guest





Posted: Mon Jul 09, 2007 2:16 pm

Joy2DWorld wrote:
`n is the equivalent of inserting a chr(13) into the relevant string.
Not really, that's chr(10), but it doesn't invalidate your argument...
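
The character codes are easy to confirm. A minimal check, assuming AutoHotkey v1:

```autohotkey
; AutoHotkey v1 syntax assumed.
MsgBox % Asc("`n") " " Asc("`r")   ; 10 13
MsgBox % ("`n" = Chr(10))          ; 1 (true)
```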
Lexikos



Joined: 17 Oct 2006
Posts: 2472
Location: Australia, Qld

Posted: Sat Jul 14, 2007 4:28 am

Joy2DWorld wrote:
the x) option affects the regex response to chr(13) in the needle string. Thus the distinction is important.
RegExMatch, Options, x wrote:
Ignores whitespace characters in the pattern except when escaped or inside a character class. The characters `n and `t are among those ignored because by the time they get to PCRE, they are already raw/literal whitespace characters (by contrast, \n and \t are not ignored because they are PCRE escape sequences). The x option also ignores characters between a non-escaped # outside a character class and the next newline character, inclusive. This makes it possible to include comments inside complicated patterns. However, this applies only to data characters; whitespace may never appear within special character sequences such as (?(, which begins a conditional subpattern.
A distinction is very clearly made.

Are there any other gotchas? Perhaps ones that aren't documented?
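
The x-option behaviour quoted above can be sketched as follows, assuming AutoHotkey v1 (variable names are illustrative). The haystack contains no newline at all, so only a pattern whose newline is ignored can match it.

```autohotkey
; AutoHotkey v1 syntax assumed; variable names are illustrative.
Haystack := "ab"
PosAccentX    := RegExMatch(Haystack, "x)a`nb")   ; literal linefeed is ignored under x, so this is effectively "ab"
PosBackslashX := RegExMatch(Haystack, "x)a\nb")   ; \n is a regex escape and must match a real linefeed
PosAccent     := RegExMatch(Haystack, "a`nb")     ; without x the literal linefeed must match too
MsgBox % PosAccentX " / " PosBackslashX " / " PosAccent
; 1 / 0 / 0
```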
All times are GMT