Regex help - how to escape \E ?

andymbody · Post by **andymbody** » 28 Nov 2023, 23:10

I'm working on a regex project and ran into this while testing. I'm hoping someone can tell me how to overcome this issue when using \Q...\E

Copy this entire script to the clipboard, then run the script. What do you expect to get?

Code: Select all

needle := "\E"	; simplified needle for demo purposes
newStr := RegExReplace(Clipboard, "\Q" . needle . "\E", "###")
MsgBox % "[" . newStr . "]"
ExitApp

Also, if the needle is """\E""", then of course no replacement occurs at all, since it short-circuits the overall needle.

Is there a way to escape \E to avoid these issues when the source string contains it? I tried \\E but that doesn't seem to be valid according to online testers, and all it did in AHK was target the backslashes. I searched online but was unable to find a way. I could create a workaround but was curious whether there is a more efficient solution.
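To make the failure concrete, here is a minimal sketch (AutoHotkey v1, with a made-up haystack) that shows the pattern text the script above ends up passing to RegExReplace. The \E inside the needle closes the quoted section right after \Q, so nothing is left between them to be matched literally.

```autohotkey
needle   := "\E"
pattern  := "\Q" . needle . "\E"   ; becomes \Q\E\E: the quoted part is empty
haystack := "abc\Edef"             ; hypothetical sample text, not from the thread
MsgBox % "pattern:`t" . pattern . "`nresult:`t" . RegExReplace(haystack, pattern, "###")
```

Showing the pattern next to the result makes it easier to see that the needle itself never reaches the regex engine as literal text.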

EDIT:
One key ingredient for my particular project...

Both the haystack and the needle are dynamic. Both change with each RegexReplace. I do not know what the haystack or needle will be from one check to the other. This is the reason I need the \Q...\E, but "\E" can (and will) definitely be within the needle sometimes, so a workaround must be implemented.

Andy

Rohwedder · Post by **Rohwedder** » 29 Nov 2023, 03:41

Hallo,
instead of trying to convert RegExReplace to StrReplace simply use StrReplace:

Code: Select all

FileRead, ClipBoard,% A_ScriptFullPath
needle := "\E"
StringCaseSense, On
newStr := StrReplace(Clipboard, needle, "###")
MsgBox % "[" . newStr . "]"
ExitApp

andymbody · Post by **andymbody** » 29 Nov 2023, 06:21

Rohwedder wrote: ↑
29 Nov 2023, 03:41
Simply use StrReplace

Thanks for your response... that would work of course if the actual needles were this simple. The needle used in my example is just for demonstration purposes. The actual needle and haystack are much more complex.

I can definitely work around this unique situation, just wanted to know if there is a more direct and elegant "built in" solution in the AHK language. One that I may not be aware of. I didn't see anything in the docs about it, but that doesn't mean it's not there... I have somehow overlooked the exact thing I was seeking in the past while reading the docs.

mikeyww · Post by **mikeyww** » 29 Nov 2023, 07:23

The issue is not AHK but is that \E is a special "tag" in regular expressions. As such, it is recognized by the parser.

Code: Select all

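#Requires AutoHotkey v1.1.33
StringCaseSense, On  ; v1 StrReplace ignores case by default, but only an upper-case \E ends \Q...\E
haystack := "abc\E12\Ezzz3"
literal  := "c\E12\Ezzz"
needle1  := "b" literal
needle2  := "." literalRegex(literal)
MsgBox %   StrReplace(haystack, needle1, "###") "`n"
       . RegExReplace(haystack, needle2, "###")

; Wrap str in \Q...\E, stepping outside the quoted section for each literal \E.
literalRegex(str) {
 Return "\Q" StrReplace(str, "\E", "\E\\E\Q") "\E"
}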
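A quick way to gain confidence in literalRegex() is to check that each literal matches itself when run through it. The sketch below does that for a few made-up sample needles (the samples and the "ok"/"FAIL" labels are assumptions for illustration, not part of the thread). It turns on StringCaseSense because StrReplace in v1 is otherwise case-insensitive and would also rewrite a lower-case \e.

```autohotkey
StringCaseSense, On   ; assumption: only an upper-case \E should be rewritten

literalRegex(str) {
    Return "\Q" StrReplace(str, "\E", "\E\\E\Q") "\E"
}

samples := ["\E", "a.b", "x\Ey\E", "\e\Q"]   ; hypothetical test needles
report := ""
for index, s in samples
{
    ; Anchor the pattern so the whole sample has to match, not just a part of it.
    matched := RegExMatch(s, "^" . literalRegex(s) . "$")
    report .= s . "`t" . (matched ? "ok" : "FAIL") . "`n"
}
MsgBox % report
```

Any line reported as FAIL would point at a needle that the wrapper does not treat as plain text.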

andymbody · Post by **andymbody** » 29 Nov 2023, 07:33

mikeyww wrote: ↑
29 Nov 2023, 07:23
The issue is not AHK but is that \E is a special "tag" in regular expressions.

Yes, it is not allowed in regex in general, I am aware of that.

I think I have come up with a simple solution. I will have to test on multiple haystacks to be sure.

Code: Select all

needle := "\E"														; simplified needle for demo purposes only
needle := Format("{:L}", needle)									; convert needle to lower case
newStr := RegExReplace(Clipboard, "i)\Q" . needle . "\E", "###")	; use case-insensitive
MsgBox % "[" . newStr . "]"
ExitApp

mikeyww · Post by **mikeyww** » 29 Nov 2023, 07:37

OK. This would not always work, but it might work in your situation. You are demonstrating that regular expressions are always tailored to the need. In other words, they depend on the specific set of possible input strings.
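A small sketch of the case where the lower-case trick gives a different answer than intended. The haystack and needle here are made up for the demo: only the mixed-case "Foo\E" is supposed to be replaced.

```autohotkey
haystack := "Foo\E and foo\e"   ; hypothetical text where case matters
needle   := "Foo\E"

; Lower-case the needle and match case-insensitively: both spellings match.
lowered := Format("{:L}", needle)
loose   := RegExReplace(haystack, "i)\Q" . lowered . "\E", "###")

; Keep the needle as is and step outside \Q...\E for the \E: only the exact spelling matches.
strict  := RegExReplace(haystack, "\Q" . StrReplace(needle, "\E", "\E\\E\Q") . "\E", "###")

MsgBox % "case-insensitive:`t" . loose . "`ncase-sensitive:`t" . strict
```

So the lower-case version is fine when every spelling of the needle should be replaced, and not when the haystack can contain the same text in a different case that must be left alone.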

andymbody · Post by **andymbody** » 29 Nov 2023, 07:45

mikeyww wrote: ↑
29 Nov 2023, 07:37
OK. This would not always work, but it might work in your situation. You are demonstrating that regular expressions are always tailored to the need. In other words, they depend on the specific set of possible input strings.

Yes, testing is still needed of course. Thanks for your code, I have not tested it yet (on my way to work). Will test it as soon as I can. Thank you!

andymbody · Post by **andymbody** » 29 Nov 2023, 08:06

I just realized that I forgot to mention a key ingredient for my particular project...

Both the haystack and the needle are partly dynamic. Both change with each RegexReplace. I do not know what the haystack or needle will be (exactly) from one check to the other. This is the reason I need the \Q...\E, but "\E" can (and will) definitely be within the needle sometimes, so a workaround must be implemented. I will edit my original post to reflect this point.

Rohwedder · Post by **Rohwedder** » 29 Nov 2023, 08:42

Try:

Code: Select all

FileRead, ClipBoard,% A_ScriptFullPath
needle := "\E"
newStr := R(RegExReplace(R(Clipboard), "\Q" R(needle) "\E", R("###")))
MsgBox % "[" . newStr . "]"
ExitApp
R(In)
{
	Loop, Parse, In
		Out .= Chr(Asc(A_LoopField)^0xFFFF)
	Return Out
}

andymbody · Post by **andymbody** » 29 Nov 2023, 10:38

Rohwedder wrote: ↑
29 Nov 2023, 08:42
Try:

Ok, will test after work. Thanks.

TAC109 · Post by **TAC109** » 29 Nov 2023, 17:59

You could just forget about the \Q....\E format and just escape the relevant characters in the needle:

Code: Select all

RegExEsc(Needle)
{ Loop Parse, % "\.*?+[{|()^$"
    Needle := StrReplace(Needle, A_LoopField, "\" A_LoopField)
  return Needle
}
MsgBox % RegExEsc("\.*?+[{|()^$abcde\.*?+[{|()^$")  ; Proof of concept

Cheers

Datapoint · Post by **Datapoint** » 29 Nov 2023, 18:48

Here's another one.

Code: Select all

haystack := Clipboard
needle := "\E"
newStr := RegExReplace(haystack, RegExEscape(needle), "###")
MsgBox % "[" . newStr . "]"

RegExEscape(str) { ; https://www.regular-expressions.info/characters.html
	return RegExReplace(str, "([\\^$.|?*+()[{])", "\$1")
}

andymbody · Post by **andymbody** » 29 Nov 2023, 20:52

Thanks everyone for the suggestions. You have all given me good ideas to think about. I really appreciate it!

I also came up with this slightly altered version of my earlier idea (changing needle case), which I think is much cleaner than it was originally. Of course this is only valid when case is not important, as Mikey pointed out.

Code: Select all

dynamicVal := "\E" 	; bare minimum example just for demo
MsgBox % RegExReplace(Clipboard, "i)\Q" . RegExReplace(dynamicVal, "\\E", "\e") . "\E", "###")
ExitApp

Both the @mikeyww and @Rohwedder versions work in initial (minimal) tests. Although I would have thought the @TAC109 and @Datapoint versions would also work, the extra escape char actually caused failure for my particular implementation. My updated version works, but only when case-sensitivity is not critical. Between the two that work, I think Mikey's function is a little easier for me to wrap my pea brain around (although I would never have been able to come up with that configuration of characters on my own, nor the other version). I will do further testing to see if this initial finding is conclusive.

I do like the clean parsing in the escape versions, and will be able to use that concept in other projects, so thanks for those too.
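The thread does not say why the escaped needle failed in that implementation. One possible cause, offered only as an assumption, is that the escaped text was still wrapped in \Q...\E. The sketch below (made-up haystack) contrasts using the escaped needle on its own with wrapping it again.

```autohotkey
RegExEscape(str) {
    return RegExReplace(str, "([\\^$.|?*+()[{])", "\$1")
}

haystack := "path\End and a.b"       ; hypothetical sample text
needle   := "\E"
escaped  := RegExEscape(needle)      ; the pattern text is now \\E

; Escaped needle used directly as the pattern: matches the literal text \E.
direct  := RegExReplace(haystack, escaped, "###")

; Escaped needle wrapped in \Q...\E as well: the pattern text is \Q\\E\E,
; where the quoted section again ends at the first \E it contains.
wrapped := RegExReplace(haystack, "\Q" . escaped . "\E", "###")

MsgBox % "direct:`t" . direct . "`nwrapped:`t" . wrapped
```

If this is what happened, the two approaches are alternatives: either escape every special character and drop \Q...\E, or keep \Q...\E and only deal with \E.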

Thank you all for your time, knowledge and generosity with this query!

Andy

andymbody · Post by **andymbody** » 02 Dec 2023, 10:30

I understand what @mikeyww's version is doing... splitting the string into 2 literal sections and converting the \E found in the source string to a literal \E that Regex no longer treats as special. This is basically escaping the \E, which is what I was after. It's a brilliant approach for doing this, which I had not considered. Thank you for this.

For @Rohwedder's version, I think I understand what is happening technically (asking for clarification). It looks like a reversible (XOR) Caesar-Cipher substitution... to shift the char values into the UTF-16 range maybe? Do I interpret that correctly? Is this simply to eliminate any possibility of special characters during Regex processing? Or am I interpreting the function incorrectly?

Rohwedder · Post by **Rohwedder** » 03 Dec 2023, 03:26

Correct, not a Caesar cipher, but an XOR cipher. It prevents special regex characters from being recognized.
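A minimal sketch of the two properties this relies on, assuming a Unicode build of AutoHotkey v1 (so that Asc and Chr work on 16-bit code units): applying R() twice gives back the original text, and the transformed text contains no backslash, so no \E can appear in it. The sample string is made up.

```autohotkey
R(In) {
    Out := ""
    Loop, Parse, In
        Out .= Chr(Asc(A_LoopField) ^ 0xFFFF)   ; flip all 16 bits of each code unit
    Return Out
}

sample  := "a\E.*[x]"        ; hypothetical text full of regex metacharacters
flipped := R(sample)

roundTrip := (R(flipped) == sample)       ; XOR with the same mask undoes itself
hasSlash  := InStr(flipped, "\") > 0      ; ASCII input lands at 0xFF80 or above

MsgBox % "round trip: " . (roundTrip ? "yes" : "no") . "`nbackslash left: " . (hasSlash ? "yes" : "no")
```

The same flip has to be applied to the haystack, the needle and the replacement, and then undone on the result, which is what the four R() calls in the script above do.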

just me · Post by **just me** » 03 Dec 2023, 04:13

Within (a part of) a needle enclosed with \Q and \E only the sequence \E (case-sensitive) needs to be 'escaped':

Code: Select all

#NoEnv
StringCaseSense, On	; make StrReplace case-sensitive so that only \E (not \e) is rewritten
haystack := ".*\E[^a]*\e"
needle := "\E"	; simplified needle for demo purposes
newStr := RegExReplace(haystack, "\Q" . StrReplace(needle, "\E", "\E\\E\Q") . "\E", "###")
MsgBox % "[" . newStr . "]"
ExitApp

Edit: Sorry, I missed that @mikeyww already showed this solution.

mikeyww wrote: ↑
29 Nov 2023, 07:23
...

mikeyww · Post by **mikeyww** » 03 Dec 2023, 04:43

For each \E in a "literal string", my script handles the following three consecutive parts of the string.

1. Before \E, it adds \Q and \E, so that this leading part of the string will be considered literal text.

2. Converts the existing \E in the literal string into \\E, so that this will be interpreted as regex meaning "\E". This string will not be between \Q and \E.

3. After \E, it adds \Q and \E, so that this trailing part of the string will be considered literal text.

This is the goal but not actually what the script does. Instead, achieving this requires only putting \E before, and \Q after, each instance, and then adding the flanking \Q and \E once (not part of the string replacement).

(\Q [some literal text] \E) (🢂 [\\E as regex] 🢀) (\Q [some literal text] \E)
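The steps above can be traced on the literal from the earlier example. This is only a sketch of the string building; the variable names inner and pattern are made up for the demo, and StringCaseSense is switched on as an assumption so that a lower-case \e would be left alone.

```autohotkey
StringCaseSense, On                       ; assumption: only upper-case \E is rewritten
literal := "c\E12\Ezzz"                   ; literal text from the earlier example

inner   := StrReplace(literal, "\E", "\E\\E\Q")   ; \E before and \Q after each instance
pattern := "\Q" . inner . "\E"                    ; flanking \Q and \E added once

; inner   = c\E\\E\Q12\E\\E\Qzzz
; pattern = \Qc\E\\E\Q12\E\\E\Qzzz\E
;           quoted "c", regex \\E, quoted "12", regex \\E, quoted "zzz"
MsgBox % inner . "`n" . pattern
```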

TAC109 has an elegant solution. This leaves the string as regex but escapes every special character that would otherwise be interpreted differently by the regex parser. Although the choice of a parameter name does not matter, the first parameter of a string replacement is traditionally called a "haystack" rather than a "needle". The concept is that one wants to find a needle in a haystack.

andymbody · Post by **andymbody** » 03 Dec 2023, 10:28

mikeyww wrote: ↑
03 Dec 2023, 04:43
Although the choice of a parameter name does not matter, the first parameter of a string replacement is traditionally called a "haystack" rather than a "needle". The concept is that one wants to find a needle in a haystack.

Yes, and my example may appear to be confusing the terms, but the clipboard was my haystack (in this case) and the needle is intended to be just simple enough to focus on chars that are causing trouble. I guess it would be helpful to see what my project is actually focused on, but I didn't want to show any more of the needle than was necessary to focus on the exact issue I needed resolved. This narrows the focus of the thread without inviting solutions to irrelevant characters. As I stated before, the haystack and needle are mostly dynamic. I realize that it is very unusual for a needle to be dynamic (written on the fly, by the script itself, rather than the traditional static needle written in advance), but that is what is required for my current project.

andymbody · Post by **andymbody** » 03 Dec 2023, 10:31

To give context about this project, in case anyone is interested...

The project is actually a wrapper class for Regex itself. The plan is to enhance the current capabilities of the native AHK RegexMatch and RegexReplace. For instance... one of the enhancements will be to allow "exclusions" to be applied to the haystack prior to the normal regex search. The (unlimited number of) exclusions will be secondary needles which will allow certain parts of the primary haystack to be ignored while conducting the primary needle search.

For instance, if I want to remove all line comments within a .ahk file, a simplified regex might look something like this

Code: Select all

myStr := "this is my string"	; this is my comment
lineInAhkFile :=	; haystack
(
"myStr := ""this is my string""	; this is my comment"
)
msgbox % RegexReplace(lineInAhkFile, ";.*$")	; acts as expected

BUT... what if the haystack includes the needle itself (which was the situation that led to this thread)

Code: Select all

needle := ";.*$"
lineInAhkFile :=	; haystack
(
"needle := "";.*$"""
)
msgbox % RegexReplace(lineInAhkFile, needle)	; removes part of the regex needle

In this general example, the contents of a .ahk file are unknown, so general static-needles may not accommodate every possible scenario.

Traditionally we would need to know that a semi-colon may be found within a critical code-string (like a regex needle), and we would need to design our primary needle for this possibility. Which may involve look-arounds, and trying to think of every possible immediate character that may come before or after the semicolon, checking to be sure the semicolon doesn't fall somewhere within the boundaries of a string, etc. It would involve a ridiculous amount of work for minimal (if any) payback. And the needle would be much more complex than is necessary in normal circumstances.

Instead, what we really need (built into regex itself) is a way to ignore certain situations in the haystack, like semicolons being part of a string. We need to remove the string from the haystack prior to conducting our search/replace, then put the string back where we found it prior to returning the primary regex result. This simplifies the process in my opinion, since we don't need to accommodate every possible scenario that could break our needle. True, the primary (simplified) needle used within the enhanced Regex may not perform the same outside of AHK, but who cares? In my opinion, the general Regex design standard is due for an upgrade that includes enhancements like this anyway.
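The remove, search, put-back idea could be sketched roughly as follows. This is not the planned class, only an illustration under stated assumptions: the function name RegExReplaceExcluding is made up, the exclusion needle is passed without regex options and never matches an empty string, and the control characters Chr(1) and Chr(2) are assumed not to occur in the haystack.

```autohotkey
RegExReplaceExcluding(haystack, needle, replacement, exclusion) {
    saved  := []
    masked := ""
    pos    := 1
    ; 1) Swap each exclusion match for a numbered placeholder and remember the text.
    while (found := RegExMatch(haystack, "O)" . exclusion, m, pos)) {
        if (m.Len(0) = 0)    ; guard: an empty match would never advance pos
            break
        saved.Push(m.Value(0))
        masked .= SubStr(haystack, pos, found - pos) . Chr(1) . saved.Length() . Chr(2)
        pos := found + m.Len(0)
    }
    masked .= SubStr(haystack, pos)
    ; 2) Run the primary needle on the masked text only.
    masked := RegExReplace(masked, needle, replacement)
    ; 3) Put the remembered text back wherever its placeholder survived.
    for index, text in saved
        masked := StrReplace(masked, Chr(1) . index . Chr(2), text)
    return masked
}

; Demo: strip a trailing line comment without touching the semicolon inside the string.
line   := "needle := "";.*$""" . A_Tab . "; trailing comment"
result := RegExReplaceExcluding(line, "\s*;.*$", "", """[^""]*""")
MsgBox % "[" . result . "]"
```

Here the quoted string is the exclusion, so the simple comment needle never sees the semicolon inside it. A placeholder that the primary needle deletes is simply dropped, which may or may not be the behavior a real implementation wants.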

So, my plan is to create the enhanced version (with many other enhancements as well). Hopefully I can post it for others to try once it is done. If it is rejected by the masses, that's ok too.

BTW... I anticipate some readers may want to offer their solution to the 'comment' example above that does not involve a Regex enhancement. This is not necessary... I know how to solve this particular situation with a more appropriately designed needle. The plan is to design a class that can support more than just this one example... and should do so as well as (or better than) the native Regex. And no, I do not plan to accommodate every possible scenario within the class itself. This will be accomplished by the caller passing secondary needles to identify what should be ignored. This thread was not intended to dig into the details of the Regex enhancement project itself, which is why I left out the details in the beginning. But I now feel it necessary to at least provide the context for asking my initial question.

Anyway, thanks for all the suggestions and help with resolving the \E obstacle. I really appreciate you all!

Andy

andymbody · Post by **andymbody** » 03 Dec 2023, 16:26

just me wrote: ↑
03 Dec 2023, 04:13
I missed that @mikeyww already showed this solution.

Thanks! Yes, I think his/your solution works best for my situation. It's modular and seems to fit the project well. Thanks @mikeyww !

And thank you all for your contributions! I'm sure they will come in handy for other needs in the future...

AutoHotkey Community

Regex help - how to escape \E ? Topic is solved
