Detecting whether AutoHotkey.exe or a compiled script is Unicode
Posted: 23 Jun 2020, 06:23
Within a script, A_IsUnicode tells you whether the running AutoHotkey executable is Unicode. (Unless you're using a recent v2 alpha, in which case A_IsUnicode is undefined, but there's no ANSI version anyway.)
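For completeness, here is a minimal v1.1 sketch of checking the running interpreter. It assumes nothing beyond the built-in variable mentioned above plus A_AhkVersion, and it won't work in the v2 alphas where A_IsUnicode is undefined.

```autohotkey
; In v1.1, A_IsUnicode is 1 on Unicode builds and blank on ANSI builds.
MsgBox % "Running a " (A_IsUnicode ? "Unicode" : "ANSI") " build of AutoHotkey " A_AhkVersion
```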
But what if you want to know whether some other AutoHotkey.exe, AutoHotkeySC.bin or compiled script is Unicode?
You can scan the file for a string.
Certain strings are only present in the native encoding, so if you search for a UTF-16 string in an ANSI executable, you won't find it. However, you have to be careful about which string you use.
In order to operate, the interpreter binary must contain the name of every built-in function/command, so those are good candidates (if they exist in all versions of AutoHotkey). "AutoHotkey" won't work because it's always present in both UTF-16 and UTF-8 (in the manifest resource). Built-in variables can be used, but you must omit the "A_" prefix, and remember that A_IsUnicode isn't defined in recent v2 alphas. Older versions of AutoHotkey included them in lower-case, with some names broken up, like "loop" "file" "fullpath".

    IsUnicodeAutoHotkey(path) {
        FileRead fd, *c %path% ; Get the raw file data.
        fb := StrLen(fd) * (A_IsUnicode ? 2 : 1) ; Get the size in bytes (may be safer to use StrLen than FileGetSize).
        needle := "MsgBox" ; Pick a string to search for, and convert it to UTF-16:
        nb := StrPut(needle, "utf-16") * 2 ; StrPut returns a size in characters (including the null terminator), so double it to get bytes.
        VarSetCapacity(ns, nb), StrPut(needle, &ns, "utf-16")
        ; Search! (The needle's null terminator is left out of the search.)
        return InBuf(&fd, fb, &ns, nb - 2) != -1 ; -1 means "not found"
    }
    Loop Files, %A_AhkPath%\..\*.exe
        MsgBox % A_LoopFileName "`n" IsUnicodeAutoHotkey(A_LoopFilePath)
    Loop Files, %A_AhkPath%\..\Compiler\*.bin
        MsgBox % A_LoopFileName "`n" IsUnicodeAutoHotkey(A_LoopFilePath)
    #NoEnv

To run this, you would need to copy InBuf() from Ahk2Exe/BinMod.ahk (for some reason its parameters are ordered differently from wOxxOm's original version).
What about without InBuf?
RegExMatch and RegExReplace can search past null characters, but there are caveats:
- You must pass a variable, and the variable's internal string length (StrLen) must match the data size. FileRead fd, *c %path% does set StrLen appropriately, but...
- Binary clipboard variables (from var := ClipboardAll or FileRead fd, *c %path%) are... special. By design, the internal string length of that type of variable is ignored in many cases, including RegExMatch.
- If you're searching for a string, you need to take care that it's in the right encoding - it will depend on which version of AutoHotkey you run, unless you perform conversion.
Laszlo wrote:
When a binary file is to be read into RAM, we have to use the *c option, which sets StrLen the file size, but the data is stored in a special variable, not usable for RegEx. We have to copy it into another variable (or use dllcalls to open/read/close the file).

    FileRead a, *c %A_AhkPath%
    VarSetCapacity(b,StrLen(a),1)
    DllCall("RtlMoveMemory", UInt,&b, UInt,&a, Uint,StrLen(a))
    MsgBox % "It is found: " RegExMatch(b, "\0\03")

Source: Machine code binary buffer searching regardless of NULL - Scripts and Functions - AutoHotkey Community
Above, Laszlo shows how to get a normal variable from a binary clipboard variable. This was before Unicode, so it needs to be adjusted for that. We can use it like this:

    IsUnicodeAutoHotkey(path) {
        FileRead fd, *c %path%
        VarSetCapacity(b, cb := StrLen(fd)*(A_IsUnicode?2:1), 1)
        ; "Ptr" rather than UInt, so that addresses aren't truncated on 64-bit builds.
        DllCall("RtlMoveMemory", "Ptr",&b, "Ptr",&fd, "UPtr",cb)
        return !!RegExMatch(b, "MsgBox")
    }
    Loop Files, %A_AhkPath%\..\*.exe
        MsgBox % A_LoopFileName "`n" IsUnicodeAutoHotkey(A_LoopFilePath)
    Loop Files, %A_AhkPath%\..\Compiler\*.bin
        MsgBox % A_LoopFileName "`n" IsUnicodeAutoHotkey(A_LoopFilePath)
    #NoEnv

Most of the function is just adjusting for the weirdness of ClipboardAll/FileRead *c. There's actually an easier way: AutoHotkey removes the "binary clipboard" status from a variable if you tamper with it by passing it directly to NumPut:

    IsUnicodeAutoHotkey(path) {
        FileRead fd, *c %path%
        NumPut(NumGet(fd, "char"), fd, "char") ; Clear the "binary clip" status.
        return !!RegExMatch(fd, "MsgBox")
    }

Actually, we don't have to use FileRead *c. Originally it was necessary because normal FileRead specifically recalculated the variable's length based on the location of the first null character. FileRead hasn't done this since Unicode support was added, but it does convert from the "source" encoding to the native encoding, which will corrupt any data that isn't actually in the "source" encoding. To avoid that (and get the original data), you just need to match the source encoding to the native encoding:

    IsUnicodeAutoHotkey(path) {
        FileRead fd, % "*p" (A_IsUnicode ? 1200 : 0) " " path
        return !!RegExMatch(fd, "MsgBox")
    }

Now, we also have the File object and its RawRead method:

    IsUnicodeAutoHotkey(path) {
        f := FileOpen(path, "r"), f.RawRead(fd, f.Length)
        return !!RegExMatch(fd, "MsgBox")
    }

This only works because when RawRead resizes the variable, it sets the variable's length regardless of any null characters. It only works in v1.1, because in v2 RawRead doesn't accept a variable "ByRef".
Backing up a bit, I mentioned that you need to be careful about the encoding of the needle - "MsgBox" in this case. If we run the above code with an ANSI version of AutoHotkey, the result is wrong because RegExMatch is searching for an ANSI string. You might think to simply invert the result with something like this:

    IsUnicodeAutoHotkey(path) {
        f := FileOpen(path, "r"), f.RawRead(fd, f.Length)
        return !!RegExMatch(fd, "MsgBox") = !!A_IsUnicode
    }

However, this will get the wrong result in compiled Unicode scripts if the script itself contains "MsgBox". The way around that is to include a null terminator in the search string. Fortunately this is easy with RegEx:

    IsUnicodeAutoHotkey(path) {
        f := FileOpen(path, "r"), f.RawRead(fd, f.Length)
        return !!RegExMatch(fd, "MsgBox\0") = !!A_IsUnicode
    }

Since we don't actually need the raw binary data, there's another trick we can use: let FileRead return "corrupt" data. If UTF-16 is specified for the source encoding, the entire file will be treated as a UTF-16 string. In a Unicode version of AutoHotkey, the file data will be returned as is and we will search for a UTF-16 string within it. In an ANSI version of AutoHotkey, any real UTF-16 strings will convert correctly to ANSI while any ANSI strings will be "corrupted" and won't be found by the search (because each pair of bytes will be misinterpreted as a UTF-16 code unit and undergo "conversion" to ANSI).

    IsUnicodeAutoHotkey(path) {
        FileRead fd, *p1200 %path%
        return !!RegExMatch(fd, "MsgBox\0")
    }

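To see why ANSI strings stop matching once the file is read as UTF-16, here is a small sketch that reinterprets the ANSI bytes of "MsgBox" as UTF-16 code units, which is roughly what *p1200 does to the whole file. The variable names are only for illustration, and what the message box displays depends on the build and the system code page.

```autohotkey
VarSetCapacity(buf, 8, 0)              ; Room for 6 ANSI characters, a terminator and padding.
StrPut("MsgBox", &buf, "CP0")          ; Store the ANSI bytes 4D 73 67 42 6F 78 00.
asUtf16 := StrGet(&buf, 3, "UTF-16")   ; Read the same 6 bytes as 3 UTF-16 code units.
; Each byte pair becomes one unrelated code unit (0x734D, 0x4267, 0x786F),
; so the resulting string no longer contains "MsgBox".
MsgBox % (asUtf16 = "MsgBox") ? "still matches" : "no longer matches: " asUtf16
```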
For reference, it can be done with AutoHotkey v2.0-a112 like this:

    IsUnicodeAutoHotkey(path) {
        fd := FileRead(path, "UTF-16")
        return !!RegExMatch(fd, "MsgBox\0")
    }

Or, reading the raw data into a Buffer and converting it to a string:

    IsUnicodeAutoHotkey(path) {
        buf := FileRead(path, "RAW") ; Returns a Buffer.
        fd := StrGet(buf, -buf.size//2) ; Get the data as a string.
        return !!RegExMatch(fd, "MsgBox\0")
    }

One possible flaw in this technique is that compiled scripts can FileInstall other versions of AutoHotkey. In such cases, it might be necessary to scan for strings in multiple encodings, and compare the found positions. I'm not really sure where UpdateResource puts resources (i.e. the FileInstall data) in relation to code or strings.
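As a rough illustration of scanning for both encodings and comparing the found positions, here is a minimal v1.1 sketch. FindBytes and NeedleOffsets are hypothetical helper names (not part of AutoHotkey or Ahk2Exe), and the plain NumGet loop is slow compared to a machine-code InBuf. How the two offsets should be interpreted is still an open question.

```autohotkey
; Hypothetical helper: 0-based offset of the first occurrence of needle in hay, or -1.
FindBytes(ByRef hay, hayLen, ByRef needle, needleLen) {
    first := NumGet(needle, 0, "UChar")
    Loop % hayLen - needleLen + 1
    {
        pos := A_Index - 1
        if (NumGet(hay, pos, "UChar") != first)
            continue
        found := true
        Loop % needleLen - 1
        {
            if (NumGet(hay, pos + A_Index, "UChar") != NumGet(needle, A_Index, "UChar")) {
                found := false
                break
            }
        }
        if (found)
            return pos
    }
    return -1
}

; Hypothetical helper: offsets of the null-terminated needle in ANSI and UTF-16 form.
NeedleOffsets(path, needle := "MsgBox") {
    f := FileOpen(path, "r")
    if !IsObject(f)
        return ""
    size := f.RawRead(data, f.Length)      ; RawRead returns the number of bytes read.
    f.Close()
    aLen := StrPut(needle, "CP0")          ; Size in bytes, including the terminator.
    VarSetCapacity(a, aLen), StrPut(needle, &a, "CP0")
    wLen := StrPut(needle, "UTF-16") * 2   ; Size in characters, doubled to get bytes.
    VarSetCapacity(w, wLen), StrPut(needle, &w, "UTF-16")
    return {ansi: FindBytes(data, size, a, aLen), utf16: FindBytes(data, size, w, wLen)}
}

r := NeedleOffsets(A_AhkPath)
MsgBox % "ANSI offset: " r.ansi "`nUTF-16 offset: " r.utf16
```

An offset of -1 means that form of the needle was not found at all. If both are found, the positions can at least be compared by hand against the file's resource section.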