RegEx - Error placement. Multi-line and single-line queries consisting only of groups ()

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
ConTrast77
Posts: 12
Joined: 10 Jan 2021, 02:11

Post by ConTrast77 » 30 Jan 2023, 07:46

[Moderator's note: Topic moved from Bug Reports.]

The RegExReplace() and RegExMatch() functions return unexpected results.
The error occurs only when the regular expression consists entirely of groups, with no conditions outside those groups.


A few rules:
1. Only regular expressions consisting entirely of groups () are considered. I believe the error occurs only under this condition.
2. No quantifiers are applied outside the main groups. That is, each group matches only once per line.
3. The code is meant to convert each line correctly according to the regular expression and the sample text. The sample provides a different test case on each line.
4. The search for bugs focuses on possible errors in how the functions compile or interpret the pattern, not on a programmer's mistake in the regular expression itself.
5. The expected result in all examples is output assembled from the input:
"third group", then "some symbolic expression", and finally "first group".

Based on rule #5, I want the result "$3 --=-- $1". Please do not use the short form "$3$1" (groups 3 and 1 only) in your tests, because the error cannot be detected that way.


Introduction to the program code:
1. The code contains a multi-line text.
2. Pay attention to the results of RegExReplace() and RegExMatch(). The functions are used either on their own or inside a loop.
3. The test code consists of three examples separated by comments such as "; ##3##". Each example has its own number.

Code: Select all

TestText1 := "
(Join`n
a expression at the end Teach #346
a expression at the end, a expression is part of textTeach #2344
a expression at the end, a period before the expression . without a space  ....  . .Teach #8542
expression Teach #642 in the middle of the text
expression in the middle of the text, at the end of a period and a space, this line should be considered unchanged in the output Teach #113. .. .
)"

resultX := ""
matchX := ""

; ##1##
Loop, Parse, TestText1, "`n"
{
	if (A_Index != 1)
		resultX .= "`n"
	RegExMatch(A_LoopField ,"(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$", matchX)
	resultX .= matchX3 . " --=-- " . matchX1
}
MsgBox, % resultX

; ##2##
resultX := ""
Loop, Parse, TestText1, "`n"
{
	if (A_Index != 1)
		resultX .= "`n"
	resultX .= RegExReplace(A_LoopField ,"(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$", "$3 --=-- $1")
	; Alternatively, append the result of RegExMatch(): it gives 0 or 1 per field (no match / match), and in some cases it reports extra fields!
	;  resultX .= RegExMatch(A_LoopField ,"(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$", "$3 --=-- $1")
}
MsgBox, % resultX


; ##3##
TestText11 := RegExReplace(TestText1 ,"m)(*ANYCRLF)(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$", "$3 --=-- $1")
MsgBox, % TestText11

Statement:
In this example, only the first example, ##1##, works correctly.

Examples ##2## and ##3## do not work correctly.

They output an unexpected duplicate of "some symbolic expression" on each line.

An empty third group is normal for some lines, as allowed by the pattern `(Teach[\s]#[\d]*|)`.
Last edited by ConTrast77 on 30 Jan 2023, 16:47, edited 2 times in total.

Post by ConTrast77 » 30 Jan 2023, 09:02

Statement.
If the separator "`n" is replaced with "`r`n" throughout the code:

Code: Select all

(Join`r`n

Code: Select all

   Loop, Parse, TestText1, "`r`n"
then the result of the first example, ; ##1##, also becomes "incorrect", just as in examples 2 and 3. All three versions of the code now produce the same "incorrect" result.

The rest of this post is based on this modified code.


Let's continue:
With these settings, parsing the strings with the "`r`n" delimiter, all examples give "wrong" results.

In example 2, ; ##2##, let's comment out the line

Code: Select all

resultX .= RegExReplace(… … …
and uncomment the line

Code: Select all

resultX .= RegExMatch(… … …
and we find that the 5 lines processed by the parsing loop are reported as 10 matches (more precisely, as 10 independent results), each followed by an extra newline character. The value "1" means true, but it appears 10 times!

In this case, the result of the third example, ; ##3##, now contains the inserted text " --=-- " four times (one expected plus three duplicates).
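
The extra fields are consistent with how Loop, Parse treats its Delimiters parameter: every character in it is a separate delimiter, so a `r`n pair splits the text twice and leaves an empty field in between. A minimal sketch of this, assuming a hand-built three-line sample (the names sample and fields are made up for illustration). The quotes used in the thread's loops are dropped here because command parameters take them literally, which would make the quote character a delimiter as well.

```autohotkey
; Three lines joined by CR+LF, built by hand so the separators are explicit.
sample := "one`r`ntwo`r`nthree"

; CR and LF act as two independent delimiters.
fields := 0
Loop, Parse, sample, `r`n
{
    fields++
}
MsgBox, % "Delimiters ``r``n: " fields " fields"

; Split on LF only and strip CR from the ends of each field.
fields := 0
Loop, Parse, sample, `n, `r
{
    fields++
}
MsgBox, % "Delimiter ``n, OmitChars ``r: " fields " fields"
```

For this sample the first loop should report 5 fields (three lines plus two empty fields between each CR and LF) and the second loop 3. An empty field still matches a pattern that permits empty matches, which would explain why RegExMatch() returns 1 for it.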

Post by ConTrast77 » 30 Jan 2023, 09:36

Statement.
If we take the source code from the first post and, in example 2 (; ##2##), change the line to:
resultX .= RegExReplace(A_LoopField ,"(.*?)([\.\s]*?)(Teach[\s]#[\d]*)$", "$3 --=-- $1")
(Note: before this change, as in the first post, the third group ended with the alternation sign | (OR), i.e. (Teach[\s]#[\d]*|).)

then the extra duplicate " --=-- " disappears from the lines, but lines 4 and 5 are processed incorrectly. There is only indirect evidence that the third group can now only hold a non-empty value, (Teach[\s]#[\d]*), yet an empty value appears to be passed to the replacement "$3 --=-- $1" in RegExReplace(). The function then seems to drop the literal text " --=-- " as well: instead of a string of the form " --=-- $1" we get a string of the form "$1", which contradicts the function's parameters.
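
One way to tell whether the replacement text is being dropped or the pattern simply did not match is the optional fourth parameter of RegExReplace(), OutputVarCount, which receives the number of replacements made. A small sketch using line 4 of the test text (the variable names line, out and count are illustrative):

```autohotkey
line := "expression Teach #642 in the middle of the text"

; Third group without the empty alternative: "Teach #..." must sit right before the end of the line.
out := RegExReplace(line, "(.*?)([\.\s]*?)(Teach[\s]#[\d]*)$", "$3 --=-- $1", count)

; count is 0 here, so the haystack comes back untouched.
MsgBox, % "Replacements: " count "`nResult: " out
```

When nothing matches, RegExReplace() returns the haystack unchanged, so the line only looks as if "$1" alone had been emitted.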

swagfag
Posts: 6222
Joined: 11 Jan 2017, 17:59

Post by swagfag » 30 Jan 2023, 09:46

i gave up after the second sentence

RegExMatch() only ever matches once, then its finished.
in contrast, RegExReplace() matches as many times as it can, then its finished.

you've written a regex pattern that permits empty matches, that's why you get duplicates of "$3 --=-- $1" in which both $3 and $1 are empty matches. there is no bug (except maybe in your regexp)
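
The effect described above can be reproduced on a single line. After the first match has consumed the whole line, the pattern can still match the empty string at the very end, and RegExReplace() replaces that match too. The sketch below assumes the pattern from the first post. The bracketed replacement text is only there to make empty groups visible, and the variable names are invented for the illustration.

```autohotkey
line := "a expression at the end Teach #346"
pattern := "(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$"

; Unlimited replace: the full-line match, then whatever else the pattern can still match.
all := RegExReplace(line, pattern, "[$3|$1]", countAll)

; Limit = 1 (fifth parameter) stops after the first replacement.
first := RegExReplace(line, pattern, "[$3|$1]", countFirst, 1)

MsgBox, % countAll ": " all "`n" countFirst ": " first
```

If the explanation holds, the first call should report two replacements, the second of them an empty "[|]", while the call with Limit set to 1 reports only one. RegExMatch() never shows this because it stops at the first match.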

Post by ConTrast77 » 30 Jan 2023, 10:46

swagfag wrote:
30 Jan 2023, 09:46
....
in contrast, RegExReplace() matches as many times as it can, then its finished.
...
As I understand it, you mean a case where there are many CONSECUTIVE matches in one line, or multiple matches across the entire text.

Since I anchor the whole match to the right edge with "$", combined with a more specific condition in the third group, duplicates cannot exist, even if empty values for both $3 and $1 could be returned at a position not covered by the previous match. Anchoring to the line edges ^, $ takes priority. There can be no repetitions, even with empty values.

My observations are also supported by the inconsistent behavior of the first example, ; ##1##:

- When the text is parsed on "`r`n" or "`n", matching the way the multi-line text is declared, the resulting lines contain no extra special characters. Yet the behavior still differs.
- If the line separator is "`n", the result is correct. If the line separator is "`r`n", the result is different. Somehow this affects what RegExMatch() returns, although the parsing logic in the code is the same in both cases, and no special characters should end up inside the strings that RegExMatch() processes.

Post by ConTrast77 » 30 Jan 2023, 16:21

PARTIAL SOLUTION:
Well, it turns out that the loop works differently.

Code: Select all

   Loop, Parse, TestText1, "`r`n"   ; bad loop design for this case
However, for text whose lines are joined by "`n", the loop

Code: Select all

   Loop, Parse, TestText1, "`n"  ; << proper looping for multi-line transition "`n"
and

Code: Select all

   Loop, Parse, TestText1, "`n", "`r"   ; << proper looping for multi-line transition "`r`n"
give the same results in example three as in example one.

====
swagfag,
So your statement is completely true. But it does not rule out the following statement:

====
Statement.
- Nothing should prevent the search in this regular expression from stopping after a single match per line when one or both sides are anchored to the line edges with ^, $.

- In example three, ; ##3##, removing the | (OR) sign gives correct output for the first three lines, while lines 4 and 5 are left unchanged, without the replacement pattern applied, since no matches were found:

Code: Select all

; ##3##
TestText11 := RegExReplace(TestText1 ,"m)(*ANYCRLF)(.*?)([\.\s]*?)(Teach[\s]#[\d]*)$", "$3 --=-- $1")   ; without |
but with the original form of example three,
we get as many as four results for the pattern "$3 --=-- $1": the first result is correct, and then the pattern is repeated three times with empty $3 and $1.

Code: Select all

; ##3##
TestText11 := RegExReplace(TestText1 ,"m)(*ANYCRLF)(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$", "$3 --=-- $1")   ; with |

This means that RegExReplace does not process the regular expression the way I expected: it treats each CR and LF character as part of the searched text, yet it also does not ignore them as line separators.
To put it as a quote:
"We must understand that our lines contain something extra that is not consumed by the first match on each line.
And it is matched three more times! It is not skipped after the first successful match and after passing the $ sign, the end of the line.
There is no other explanation than CR and LF."
But now the problem is the same for all 5 lines. On every line, the first application of the pattern works correctly.
Lines 4 and 5 have an empty $3 in the pattern "$3 --=-- $1".
And then, three times at the end of each of the five lines, the same pattern "$3 --=-- $1" is repeated with empty $3 and $1.
We get " --=-- --=-- --=-- "

I have already tried switching the multi-line options between "m)(*ANYCRLF)" and "`am)".
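
To see how many matches RegExReplace() actually makes over a whole text, the OutputVarCount parameter can be compared for both line-ending styles. This is only a diagnostic sketch: the two-line sample and the variable names are invented here, and the counts are left to be observed rather than asserted.

```autohotkey
pattern := "m)(*ANYCRLF)(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$"

textLF   := "first Teach #1`nsecond line"
textCRLF := "first Teach #1`r`nsecond line"

outLF   := RegExReplace(textLF,   pattern, "[$3|$1]", countLF)
outCRLF := RegExReplace(textCRLF, pattern, "[$3|$1]", countCRLF)

; Every "[...]" in the output marks one match, including empty ones.
; Note that \s inside [\.\s] can also match CR and LF, so a match may swallow a line-ending character.
MsgBox, % "LF: " countLF " replacements`n" outLF "`n`nCRLF: " countCRLF " replacements`n" outCRLF
```

Using a visible marker such as "[$3|$1]" instead of "$3 --=-- $1" makes it easier to see where each match starts and which groups were empty.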

Post by ConTrast77 » 30 Jan 2023, 16:28

FULL SOLUTION:

Now let's change the regular expression in example three, ; ##3##,
to the following:

Code: Select all

; ##3##
TestText11 := RegExReplace(TestText1 ,"m)(*ANYCRLF)(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)\W$", "$3 --=-- $1")   ; with | ...and \W
and now we have a fully working expression that the function handles properly. The additional \W is placed after the third group, outside of it.
To put it as a quote:
There is a suspicion that adding \W did not fix the regular expression but changed it.
By adding \W, we created a pattern that no longer consists only of groups in parentheses ( )
covering the entire line up to the $ anchor; it now has an additional condition outside those groups.


A regular expression consisting only of groups was described at the beginning of the topic as a "special case", the condition and cause of the bug.

But this fix only works when the multi-line text is declared as follows:

Code: Select all

; function   RegExReplace(… , "m)(*ANYCRLF) … … (Teach[\s]#[\d]*|)\W$", … )
; works with    (Join`r`n
; but doesn't work with    (Join`n
;

TestText1 := "
(Join`r`n
; ...

TOTAL:
1. For loops, a solution for multi-line text has been found (see the partial solution above).
2. For multi-line processing of groups via RegExReplace(), \W is added after the last group but before $.
\W, \D or \S may be used, but not \z, \Z or \>.
3. I use AHK 1.1.36.02, and the test file is encoded as UTF-16 LE with BOM.

swagfag,
thanks for your participation,

Best Regards,
Bondar Roman /Tracker/

safetycar
Posts: 435
Joined: 12 Aug 2017, 04:27

Post by safetycar » 31 Jan 2023, 11:14

Notice that what swagfag said about the empty string still applies. Part of your fix consists of avoiding a match on an empty string at the end.
A more usual approach is to anchor the pattern on both sides.
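
Anchoring on both sides could look like the sketch below for the per-line case of example 2. The pattern is the one from the first post with a leading ^ added, and the variable names are illustrative. With ^ in place, the empty match at the end of a non-empty line is no longer possible, because that position is not the start of the string.

```autohotkey
line := "a expression at the end Teach #346"

; Same groups as in the thread, but anchored at both ends with ^ and $.
out := RegExReplace(line, "^(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$", "$3 --=-- $1", count)

MsgBox, % count " replacement(s): " out
```

For the multi-line call in example 3 the same idea would presumably be "m)(*ANYCRLF)^(.*?)([\.\s]*?)(Teach[\s]#[\d]*|)$", but that variant is an assumption here and is worth checking against both `n and `r`n text before relying on it.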
