Descolada wrote: ↑25 Jul 2022, 00:58
@rj8810, could you post a concise description of what you are trying to match and what the rules of matching need to be? Preferably post the real haystack you are using, not the one with reds and yellows
Thanks.
1.- I want to find a pattern made up of 3 keywords:
keyword = href="
keyword2 = %variable%
keyword3 = /div(id:123)"> ; note: 123 can be any number
2.- The keywords are separated by an indeterminate number of unknown characters. The pattern I want to match looks something like this:
href="xxxanythingvariablekeyword2valuexxx/div(id:123)">
3.- The document contains multiple exact matches of this pattern, plus other similar ones that should be discarded, such as:
href="222xxxhref="111xxxanythingvariablekeyword2valuexxx/div(id:123)">yyyy/div(id:222)"> ; note that href=" appears twice here, and so does /div(id:...)">. The match should start at the href=" closest to the value of keyword2 (searching from right to left) and end at the /div(id:123)"> closest to the value of keyword2 (searching from left to right).
I know this could be done in many ways with several lines of code, but I would really like a single line of code using RegExMatch, possibly with look-ahead and look-behind assertions.
This is what I have so far:
Code: Select all
q::
document = href="222xxxhref="111xxxanything-xxx/div(id:123)">yyyy/div(id:222)">
keyword2 = anything
search = i)href="(.*?%keyword2%-.*?\/div\(id:\d.*?)"> ; the i) option makes the match case-insensitive
p := 1
array := []
array2 := []
while p := RegExMatch(document, search, pattern, p+StrLen(pattern))
{
    Array[A_Index] := pattern ; the full match
    Array2[A_Index] := pattern1 ; the first captured subpattern
    msgbox % "pattern number " . A_Index . " is " . Array[A_Index]
    msgbox % "subpattern number " . A_Index . " is " . Array2[A_Index]
    Count := Array.Count()
}
MsgBox, % Count
return
But the problem I have never been able to solve is that the match does not start at the href=" closest to the value of %keyword2% (searching from right to left), so href=" appears twice in the result.
So I investigated and found a partial solution using a negative look-ahead, (?!.*href="), which adds a condition to the match: no other href=" may appear anywhere to the right of the matched href=".
Code: Select all
document = href="222xxxhref="111xxxanything-xxx/div(id:123)">yyyy/div(id:222)">
keyword2 = anything
search = i)href="(?!.*href=")(.*?%keyword2%-.*?\/div\(id:\d.*?)"> ; the i) option makes the match case-insensitive
p := 1
array := []
array2 := []
while p := RegExMatch(document, search, pattern, p+StrLen(pattern))
{
    Array[A_Index] := pattern ; the full match
    Array2[A_Index] := pattern1 ; the first captured subpattern
    msgbox % "pattern number " . A_Index . " is " . Array[A_Index]
    msgbox % "subpattern number " . A_Index . " is " . Array2[A_Index]
    Count := Array.Count()
}
MsgBox, % Count
return
The problem with this code is that it does not work when the document contains multiple matching patterns.
Apparently (?!.*href=") does not stop at the end of the pattern to be matched (/div(id:123)">): it looks for an href=" from left to right all the way to the end of the document, and if it finds one, the match fails.
Code: Select all
ñ::
document = href="222xxxhref="111xxxanything-xxx/div(id:123)">yyyy/div(id:222)">thepatternrepeatshref="222xxxhref="111xxxanything-xxx/div(id:333)">yyyy/div(id:444)">href="...longtext
keyword2 = anything
search = i)href="(?!.*href=")(.*?%keyword2%-.*?\/div\(id:\d.*?)"> ; the i) option makes the match case-insensitive
p := 1
array := []
array2 := []
; The real target looks like this: href="/item/domiciliario-iid-1115341856">
; To use a literal quote inside a quoted string, prefix it with another ". Alternatively, store the
; literal text in a variable, since a legacy assignment treats everything as literal except the
; regex special characters, which is a good solution.
while p := RegExMatch(document, search, pattern, p+StrLen(pattern))
{
    Array[A_Index] := pattern ; the full match
    Array2[A_Index] := pattern1 ; the first captured subpattern
    msgbox % "pattern number " . A_Index . " is " . Array[A_Index]
    msgbox % "subpattern number " . A_Index . " is " . Array2[A_Index]
    Count := Array.Count()
}
MsgBox, % Count
return
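One possible way around this, offered as a sketch rather than something tested against the real page: instead of a single (?!.*href=") right after href=", the negative look-ahead can be checked before every character that gets consumed, as in (?:(?!href=").)*?. That way the look-ahead only guards the text inside the match and never scans to the end of the document. The \Q...\E around keyword2 and the stricter \d+\) ending are assumptions added for illustration; they are not part of the patterns above.

```autohotkey
; Sketch only: a "tempered" dot that refuses to step onto another href="
document = href="222xxxhref="111xxxanything-xxx/div(id:123)">yyyy/div(id:222)">thepatternrepeatshref="222xxxhref="111xxxanything-xxx/div(id:333)">yyyy/div(id:444)">href="...longtext
keyword2 = anything
; (?:(?!href=").)*? consumes one character at a time, but only where href=" does not begin
; \Q...\E treats the keyword as literal text (assumption: it may contain regex characters)
search = i)href="((?:(?!href=").)*?\Q%keyword2%\E-(?:(?!href=").)*?/div\(id:\d+\))">
p := 1
matches := []
while p := RegExMatch(document, search, m, p)
{
    matches.Push(m1)      ; keep only the captured subpattern
    p += StrLen(m)        ; continue searching right after this match
}
for i, v in matches
    MsgBox % "match " i ": " v
; Traced by hand on the sample document, this should give two results:
;   111xxxanything-xxx/div(id:123)
;   111xxxanything-xxx/div(id:333)
```

The lazy quantifier after the keyword takes care of stopping at the nearest /div(id:...)">, while the tempered dot before the keyword forces the match to begin at the nearest href=".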
The strange thing is that .*? and a positive look-ahead such as abc(?=.*xyz) work perfectly for finding the closest word to the right of another word, but the look-behind assertions (?<=...) and (?<!...) do not do the equivalent job in the other direction.
Code: Select all
q::
MsgBox % RegExMatch("xxxxyellow222xxxyellow111yyyyblueñññred111xxxred222", "(yellow(?<!.*yellow).*?blue.*?red)", SsubPat)
MsgBox, %SsubPat% ; show yellowyyyyblueñññred
return
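A quick way to see whether a pattern like this is even accepted is to check ErrorLevel after the call. As far as I understand, RegExMatch leaves ErrorLevel at 0 on success and puts an error description there when the pattern cannot be compiled. A minimal sketch, using the same haystack and look-behind as above (the variable names are only for illustration):

```autohotkey
; Sketch: report why the pattern fails instead of silently showing nothing
haystack := "xxxxyellow222xxxyellow111yyyyblueñññred111xxxred222"
needle := "yellow(?<!.*yellow).*?blue.*?red"   ; look-behind containing .* (variable length)
foundPos := RegExMatch(haystack, needle, m)
if (ErrorLevel != 0)
    MsgBox % "Pattern rejected: " ErrorLevel   ; shows the engine's own error text
else
    MsgBox % "Position " foundPos ", match: " m
```

If the message box reports a compile error, the problem is in the pattern itself and not in the haystack.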
The AutoHotkey documentation says this about it:
Greed: By default, *, ?, +, and {min,max} are greedy because they consume all characters up through the last possible one that still satisfies the entire pattern. To instead have them stop at the first possible character, follow them with a question mark. For example, the pattern <.+> (which lacks a question mark) means: "search for a <, followed by one or more of any character, followed by a >". To stop this pattern from matching the entire string <em>text</em>, append a question mark to the plus sign: <.+?>. This causes the match to stop at the first '>' and thus it matches only the first tag <em>.
Look-ahead and look-behind assertions: The groups (?=...), (?!...), (?<=...), and (?<!...) are called assertions because they demand a condition to be met but don't consume any characters. For example, abc(?=.*xyz) is a look-ahead assertion that requires the string xyz to exist somewhere to the right of the string abc (if it doesn't, the entire pattern is not considered a match). (?=...) is called a positive look-ahead because it requires that the specified pattern exist. Conversely, (?!...) is a negative look-ahead because it requires that the specified pattern not exist. Similarly, (?<=...) and (?<!...) are positive and negative look-behinds (respectively) because they look to the left of the current position rather than the right. Look-behinds are more limited than look-aheads because they do not support quantifiers of varying size such as *, ?, and +. The escape sequence \K is similar to a look-behind assertion because it causes any previously-matched characters to be omitted from the final matched string. For example, foo\Kbar matches "foobar" but reports that it has matched "bar".
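The examples from the two quoted paragraphs can be tried directly. This is only a small sketch of those documented cases; the variable names greedy, lazy and kept are made up for the illustration.

```autohotkey
; Greedy vs. lazy quantifier, using the documentation's <em>text</em> example
html := "<em>text</em>"
RegExMatch(html, "<.+>", greedy)    ; greedy: runs up to the last ">"
RegExMatch(html, "<.+?>", lazy)     ; lazy: stops at the first ">"
; \K drops everything matched so far from the reported match
RegExMatch("foobar", "foo\Kbar", kept)
MsgBox % "greedy: " greedy "`nlazy: " lazy "`n\K: " kept
; greedy -> <em>text</em>
; lazy   -> <em>
; kept   -> bar
```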
This is where Alphabravo's code comes in. It also lets me capture subpatterns and it works perfectly, but unfortunately only for the following specific case:
Code: Select all
q::
H = yellow333yellow222yellow111blue111red111red222red333yellowkkkyellow777yellow555yellow444blue444red444red555red777yellow
K = i)yellow((([^ybr]+|y+(?!ellow)|b+(?!lue)|r+(?!ed))?)blue(?2))red
p := 1
array := []
array2 := []
while p := RegExMatch(H, K, m, p+StrLen(m))
{
    Array[A_Index] := m ; the full match
    Array2[A_Index] := m1 ; the first captured subpattern
    msgbox % "pattern number " . A_Index . " is " . Array[A_Index]
    msgbox % "subpattern number " . A_Index . " is " . Array2[A_Index]
    Count := Array.Count()
}
msgbox, % "total match =" Count
return