AutoHotkey Community

Posted: **17 Apr 2024, 13:31**

It seems like lots of folks have run into the problem of not having their ahk files encoded as UTF-8. Should AutoHotkey (during validation) detect whether there are Unicode characters and return a warning if the ahk file is not UTF-8?

Posted: **17 Apr 2024, 14:41**

I would second that, Steve. I have seen this issue pop up numerous times here in the past. Given the worldwide use of AHK by people whose languages use many characters that we in the USA do not, it would make sense for the launcher and compiler to analyze the source file for proper encoding.

Russ

Posted: **18 Apr 2024, 10:31**

I third that. AHK v2 is Unicode-only, so it makes sense to coerce the user into using the proper encoding for that. Perhaps it should be a #Warn directive such as #Warn FileEncoding (I chose "FileEncoding" to be consistent with A_FileEncoding), which would display something like "The script file(s) you are attempting to run might not support special (Unicode) characters. To remove this warning either save the script file(s) in UTF-8 or UTF-16 file encoding, or apply the directive "#Warn FileEncoding, Off"."

Posted: **19 Apr 2024, 19:57**

I agree with the above.

This is one of those problems that are really tricky to debug sometimes.
Having a warning at runtime might curb the head-scratching a bit in certain cases.

Posted: **02 May 2024, 20:34**

A warning in the case of UTF-8 decoding errors might be possible, but in some cases a byte sequence is valid in both UTF-8 and the system ANSI code page. I think the result of the user using the wrong encoding is often a valid sequence of characters which doesn't match what they intend. In such cases, how would the program detect that they are not using UTF-8?
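To make the ambiguity concrete, here is a minimal sketch in Python (used only for illustration). It assumes Windows-1252 as the "system ANSI code page"; any other single-byte code page would show the same effect. The two bytes decode without error either way, so the bytes alone do not say which reading was intended.

```python
# Two bytes that are valid in both UTF-8 and Windows-1252
# (cp1252 is an assumed stand-in for the system ANSI code page).
data = bytes([0xC3, 0xA9])

as_utf8 = data.decode("utf-8")    # one character: é
as_ansi = data.decode("cp1252")   # two characters: Ã©

print(repr(as_utf8), repr(as_ansi))
```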

Posted: **03 May 2024, 09:49**

lexikos wrote: ↑
02 May 2024, 20:34
A warning in the case of UTF-8 decoding errors might be possible, but in some cases a byte sequence is valid in both UTF-8 and the system ANSI code page. I think the result of the user using the wrong encoding is often a valid sequence of characters which doesn't match what they intend. In such cases, how would the program detect that they are not using UTF-8?

This is a good point... If I'm understanding what you mean, when a person (erroneously) saves the file in non-Unicode format, the Unicode characters get replaced with substitute characters most (all?) of the time. At the point that AutoHotkey does its validity scan, all of those replacements have already been made, so there aren't actually any Unicode characters in the file to detect.

I don't know what the solution is. I guess it's up to the editor (SciTE, VSCode, Notepad, etc) to alert the user to this.

Posted: **03 May 2024, 14:12**

Yeah, turns out file encodings are quite a mess, and detecting regular UTF-8 with 100% accuracy is probably impossible. IsTextUnicode exists, but according to multiple Stack Overflow posts it is rather unreliable.

A perhaps more reliable option could be to add a "strict mode" option which would enforce BOM encodings for the script and its include files. Although that still wouldn't guarantee a 100% detection rate (eg text starting with "ï»¿" being confused with UTF-8-BOM), I figure 99%+ is good enough.
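The "ï»¿" edge case can be seen directly in a short Python sketch (Python is used only for illustration, and cp1252 is an assumed ANSI code page): the UTF-8 signature is three bytes, and those same bytes spell "ï»¿" when viewed as Windows-1252 text.

```python
import codecs

# The UTF-8 signature (BOM) is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Read as Windows-1252, the same three bytes are the text "ï»¿".
bom_as_ansi = bom.decode("cp1252")

# An ANSI file that really begins with that text is byte-identical
# to the start of a UTF-8-BOM file.
ansi_file_start = "\u00ef\u00bb\u00bf".encode("cp1252")

print(bom.hex(" "), repr(bom_as_ansi), ansi_file_start == bom)
```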

Posted: **05 May 2024, 17:00**

@kunkel321, a user would likely notice characters being substituted on save. That generally doesn't happen, because the user uses a codepage that includes the characters needed for their language. The real issue is that a sequence of bytes can often be interpreted multiple ways without decoding errors.

But maybe I underestimate how often an ANSI sequence containing non-ASCII characters would be invalid UTF-8.

Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8, so existence of such invalid sequences indicates the file is not UTF-8, while lack of invalid sequences is a very strong indication the text is UTF-8.
Source: Byte order mark - Wikipedia
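As a rough illustration of that claim, a strict decode can serve as the test. This is a minimal Python sketch; `is_valid_utf8` is a hypothetical helper name, and cp1252 is only an assumed example of an ANSI code page.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

text = "caf\u00e9"  # "café"
print(is_valid_utf8(text.encode("utf-8")))   # UTF-8 bytes
print(is_valid_utf8(text.encode("cp1252")))  # ANSI bytes: a lone 0xE9
```

Pure ASCII passes as well, which is harmless here because ASCII decodes identically under UTF-8 and the ANSI code pages.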

@Descolada, how would such an option be enabled, and who would enable it? Also, UTF-8 without the signature is preferred by some crowds.

Non-UTF-8 text starting with "ï»¿" is not a possibility that we need to consider. Anyone writing that as ANSI at the start of a script is likely trying to prove a point, knowing full well that it will be interpreted as UTF-8. Not only that, but it must be interpreted as UTF-8, because that is the documented behaviour.

Posted: **05 May 2024, 17:40**

lexikos wrote: ↑
05 May 2024, 17:00
UTF-8 without the signature is preferred by some crowds.

Non-UTF-8 text starting with "ï»¿" is not a possibility that we need to consider. Anyone writing that as ANSI at the start of a script is likely trying to prove a point, knowing full well that it will be interpreted as UTF-8. Not only that, but it must be interpreted as UTF-8, because that is the documented behaviour.

This is true, a large number of tools prefer UTF-8 without BOM, and even the UTF-8 standard doesn't recommend it. I was really surprised AutoHotkey kind of enforces the use of the BOM, so to speak. When saving a UTF-8-RAW script and trying to display a MsgBox with certain characters, or in other similar situations, you get encoding errors, which is surprising sometimes hehe.

Posted: **05 May 2024, 20:20**

@RaptorX, AutoHotkey v2 does not by any definition "kind of enforce the use of the BOM". Script files must be UTF-8 with or without BOM unless they have a UTF-16 BOM or you use the /CP switch. If a Unicode character is written directly inside a script file which is saved as UTF-8-RAW, it will be decoded correctly.

Files read by the script default to ANSI, but there are several ways to specify that a file (or all files) should be read as UTF-8.

Let us not even speak of AutoHotkey v1.

Posted: **06 May 2024, 05:31**

lexikos wrote: @kunkel321, a user would likely notice characters being substituted on save. That generally doesn't happen, because the user uses a codepage that includes the characters needed for their language. The real issue is that a sequence of bytes can often be interpreted multiple ways without decoding errors.

They might notice it depending on the situation. If the substituted characters are on line 500 of 1000, then they might not notice. This might happen for example when copying from some other source into the editor (which perhaps has an ANSI-encoded file open?), and then running into unexpected problems.

@Descolada, how would such an option be enabled, and who would enable it? Also, UTF-8 without the signature is preferred by some crowds.

I would think a directive would suit this purpose, something like #Requires ScriptEncoding UTF-8-BOM. If saving in a wrong encoding totally garbles the content then the script won't run at all, which would probably prompt the user to investigate encoding issues. Otherwise the directive would take effect and alert the user of encoding problems (e.g. a script accidentally saved in ANSI). How often that would be useful is another question, as I've noticed these kinds of posts have lessened in frequency in the v2 forums section, most likely because UTF-8 has become common enough overall?

UTF-8 is likely more common than UTF-8-BOM, and I'm aware that the Unicode standard says that regular UTF-8 is preferred because, quoting from Wikipedia, "Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII" and "A BOM is unnecessary for detecting UTF-8 encoding". The latter doesn't seem to be true though, as byte sequences can be valid strings in UTF-8 as well as valid strings in other encodings. So the surest way seems to be to use/require the BOM.

RaptorX wrote: ↑
05 May 2024, 17:40
This is true, a large number of tools prefer UTF-8 without BOM, and even the UTF-8 standard doesn't recommend it. I was really surprised AutoHotkey kind of enforces the use of the BOM, so to speak. When saving a UTF-8-RAW script and trying to display a MsgBox with certain characters, or in other similar situations, you get encoding errors, which is surprising sometimes hehe.

Are you able to provide an example of this? I remember similar situations as well, but can no longer replicate them. I think that perhaps UTF-8 has become standard enough that it isn't such a big problem as before. Notepad defaults to UTF-8, most (all?) editors default to UTF-8 with or without BOM, AHK v2 interprets scripts as UTF-8, etc. Are there any editors out there that default to ANSI?

Posted: **06 May 2024, 05:49**

Descolada wrote:I would think a directive would suit this purpose,

A directive which is encoded as...? The program has already decided on an encoding and begun reading the file.

Again, who would enable it? The author who doesn't know that the file needs to be saved as UTF-8?

Is this directive supposed to be used by the author of a file, presuming that some other user might save the file with the wrong encoding? The author can write code into the file that checks its encoding, or write a single unicode character and compare against its ordinal value. Then the author can display whatever warning or (hopefully) helpful notice that he wishes.

Posted: **06 May 2024, 06:20**

kunkel321 wrote: ↑
17 Apr 2024, 13:31
It seems like lots of folks have run into the problem of not having their ahk files encoded as UTF-8.

Recently? In this forum or elsewhere? Asking since I haven't noticed such folks here recently.

Here's a related 2023 discussion on whether UTF-8 BOM should be advised in the v2 doc FAQ:
viewtopic.php?f=86&t=121928&p=541713

Posted: **06 May 2024, 07:54**

neogna2 wrote: ↑
06 May 2024, 06:20
....Recently? In this forum or elsewhere? Asking since I haven't noticed such folks here recently...

Here's a v1 help thread where I was stumped for a while..
viewtopic.php?f=76&t=119308&p=529417&hilit#p529417
I spent a fair bit of time thinking it was caused by using different editors. Finally someone set me straight.

I've seen three or four other help posts (mostly I'm on v2 these days) where I suggested changing the encoding. One was the same day I posted this thread on the Wish List. Sorry, I don't recall the other v2 help topics.

EDIT. Actually.... Do an 'advanced' forum search for "UTF-8 with BOM" in the v2 help forum. There are 69 results. That's not a lot, considering there are nearly 30k posts.. Still though, the preview of those 69 results suggests that many are due to folks needing to change their file encoding and they don't know it.

Posted: **06 May 2024, 08:03**

lexikos wrote: ↑
06 May 2024, 05:49
A directive which is encoded as...? The program has already decided on an encoding and begun reading the file.

The directive is encoded as whatever the user decided or accidentally used... Practically all encodings share the ASCII character set, which is what most of AHK syntax uses, so AHK may load the script without syntax errors. The problem lies in how everything else is interpreted. Say the user is using Chinese simplified (GB2312) as their default encoding, and copies a script from the forums:

Code:

#Requires AutoHotkey v2
MsgBox "汉字"

AHK reads the saved script just fine and displays the following:

Code:

���

If the code snippet included #Requires ScriptEncoding UTF-8-BOM (or if AHK by default required a specific encoding such as UTF-8-BOM), then instead the user would see a helpful message about saving the file in UTF-8-BOM, which could save time for the user and maybe prevent another forum post.
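The GB2312 scenario can be reproduced outside AutoHotkey with a few lines of Python (for illustration only). The sketch assumes the editor writes the file in GB2312 and the loader decodes it as UTF-8, substituting U+FFFD for undecodable bytes; the exact number of replacement characters depends on the decoder.

```python
source_text = 'MsgBox "\u6c49\u5b57"'  # MsgBox "汉字"

# What an editor configured for GB2312 would write to disk (assumption).
saved_bytes = source_text.encode("gb2312")

# What a loader that assumes UTF-8 would see: the ASCII part survives,
# the two Chinese characters turn into replacement characters.
loaded_text = saved_bytes.decode("utf-8", errors="replace")

print(saved_bytes)
print(loaded_text)
```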

Sure, the alternative is something like

Code:

#Requires AutoHotkey v2
if Ord("汉") != 27721
	throw Error("Save the script in UTF-8 encoding")
MsgBox "汉字"

but would it work with all incorrect encodings? Also what if the snippet includes another file that should also be in the correct encoding?

Again, who would enable it? The author who doesn't know that the file needs to be saved as UTF-8?

The same person who includes a #Requires AutoHotkey v2 line in their code. Or maybe the new script template file included it.

The author can write code into the file that checks its encoding, or write a single unicode character and compare against its ordinal value. Then the author can display whatever warning or (hopefully) helpful notice that he wishes.

Which features should or shouldn't be included in the language itself (rather than implemented by users) is largely a matter of opinion. But I emphasize again that this proposed feature might not be useful any longer as UTF-8 seems to be nearly universal. Personally I've encountered such problems only in v1, and with file operations in v2 (IMO default A_FileEncoding should've been UTF-8 or UTF-8-BOM!).

Posted: **07 May 2024, 06:17**

@Descolada
I wrote the code that reads script files, and one of its optimizations is that it bypasses translation for ASCII characters. I obviously know that all of the encodings supported by AutoHotkey contain ASCII as a subset. My point is that upon finding the directive, the program has to check for a UTF-8 BOM... again. Or it needs code to record the fact that there was a BOM just for the purpose of this warning that would only be of use to those who are unlikely to have enabled it.

Which features are or aren't included in the program is completely a matter of opinion. Specifically, mine. That's a fact.

Use a Unicode supplementary character. In the unlikely event that decoding the file with an ANSI codepage is somehow able to return the correct character, do we care that the user is not warned that the file is not UTF-8? Have you lost sight of the purpose?

Code:

if Ord("👍") != 128077
	throw Error("Save the script in UTF-8 encoding")

In fact, this will also permit UTF-16 and UTF-8-RAW.
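Why the sentinel check behaves this way can be modelled in Python (for illustration only). Everything here is an assumption made for the sketch: `load_script` is a simplified stand-in, not AutoHotkey's loader, and the editor is assumed to substitute "?" for characters its code page cannot represent.

```python
import codecs

SENTINEL = "\U0001F44D"  # 👍, code point 128077 (outside the BMP)

def load_script(raw: bytes) -> str:
    """Simplified stand-in for a script loader (assumption): UTF-16 when a
    UTF-16 BOM is present, otherwise UTF-8 with an optional BOM."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16", errors="replace")
    return raw.decode("utf-8-sig", errors="replace")

def sentinel_survives(editor_encoding: str) -> bool:
    """Save the sentinel as an editor would, load it, and repeat the Ord() test."""
    raw = SENTINEL.encode(editor_encoding, errors="replace")
    text = load_script(raw)
    return len(text) > 0 and ord(text[0]) == 128077

for enc in ("utf-8", "utf-8-sig", "utf-16", "cp1252", "gb2312"):
    print(enc, sentinel_survives(enc))
```

In this model the three Unicode encodings pass, while the two legacy code pages cannot store the character at all, so the comparison fails and the script's own error message would be shown.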

If you wanted to enforce the use of UTF-8 with BOM as implied by the name of your directive, you could actually check for it with 100% reliability, by reading the file. If it's not compiled...
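Checking for the signature by reading the file amounts to comparing its first three bytes. A minimal Python sketch (for illustration only; `file_has_utf8_bom` and `demo` are hypothetical names):

```python
import codecs
import os
import tempfile

def file_has_utf8_bom(path: str) -> bool:
    """Hypothetical helper: True if the file starts with EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

def demo() -> tuple:
    """Write one file with a BOM and one without, then check both."""
    results = []
    for prefix in (codecs.BOM_UTF8, b""):
        fd, path = tempfile.mkstemp(suffix=".ahk")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(prefix + b'MsgBox "hi"\n')
            results.append(file_has_utf8_bom(path))
        finally:
            os.remove(path)
    return results[0], results[1]

print(demo())
```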

The same person who includes a #Requires AutoHotkey v2 line in their code.

No I won't. Why would I? I don't believe that many other users would either, even if they learned of the need for UTF-8, even if it was actually still a common problem.

Why should anyone ever be forced to save the script with a byte order mark?

Anyway, don't encourage me to continue this. As I said, maybe I underestimate how often an ANSI sequence containing non-ASCII characters would be invalid UTF-8. I will look into detecting decoding errors at some point. If it works, it would be better for everyone.

Posted: **26 May 2024, 06:15**

UTF-8 lead bytes have the bit pattern 110xxxxx, 1110xxxx or 11110xxx, followed by 1, 2 or 3 suffix bytes with the bit pattern 10xxxxxx. Therefore some ANSI sequences which are obviously not valid UTF-8:

- 0x80..0xBF not as part of one of the following sequences.
- 0xC0..0xDF not followed by 0x80..0xBF.
- 0xE0..0xEF not followed by 0x80..0xBF, 0x80..0xBF.
- 0xF0..0xF7 not followed by 0x80..0xBF, 0x80..0xBF, 0x80..0xBF.
- 0xF8..0xFF.

So any single upper ANSI character surrounded by ASCII is never valid UTF-8. I suppose that covers many common cases in some languages, like "resumè" or è::.
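The byte patterns listed above translate almost directly into a scanner. The following Python sketch (for illustration only; `first_invalid_offset` is a hypothetical name) checks just that lead/suffix structure. It is not a full UTF-8 validator: overlong forms, surrogates and values above U+10FFFF are not rejected. cp1252 is an assumed ANSI code page.

```python
def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first byte that breaks the lead/suffix
    pattern, or -1 if the whole buffer fits it. Structural check only."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            need = 0                      # ASCII
        elif 0xC0 <= b <= 0xDF:
            need = 1                      # 110xxxxx
        elif 0xE0 <= b <= 0xEF:
            need = 2                      # 1110xxxx
        elif 0xF0 <= b <= 0xF7:
            need = 3                      # 11110xxx
        else:
            return i                      # stray 0x80..0xBF, or 0xF8..0xFF
        tail = data[i + 1 : i + 1 + need]
        if len(tail) < need or any(not 0x80 <= t <= 0xBF for t in tail):
            return i                      # lead byte without its suffix bytes
        i += 1 + need
    return -1

print(first_invalid_offset("resum\u00e8".encode("cp1252")))  # ANSI "resumè"
print(first_invalid_offset("resum\u00e8".encode("utf-8")))   # UTF-8 "resumè"
```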

I thought: if non-UTF-8 data can be detected so easily, why not support ANSI? So I carried out two experiments:

1. Show a warning if there were any decoding errors while reading a script file.
2. Implement an "automatic code page" mode: when the first non-ASCII character is encountered, switch to UTF-8 if it is the beginning of a valid UTF-8 sequence, otherwise cp0. It passed some simple testing.
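The "automatic code page" idea can be sketched in Python (for illustration only; this is not the experimental AutoHotkey code). `pick_encoding` is a hypothetical name, and the `ansi` parameter stands in for cp0, the system default ANSI code page. As in the description, only the first non-ASCII sequence is examined.

```python
def pick_encoding(data: bytes, ansi: str = "cp1252") -> str:
    """Choose UTF-8 or the ANSI code page from the first non-ASCII sequence."""
    for i, b in enumerate(data):
        if b < 0x80:
            continue                      # ASCII: keep scanning
        if 0xC0 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF7:
            need = 3
        else:
            return ansi                   # cannot start a UTF-8 sequence
        tail = data[i + 1 : i + 1 + need]
        well_formed = len(tail) == need and all(0x80 <= t <= 0xBF for t in tail)
        return "utf-8" if well_formed else ansi
    return "utf-8"                        # pure ASCII decodes the same either way

print(pick_encoding("\u00e8::".encode("cp1252")))  # ANSI hotkey "è::"
print(pick_encoding("\u00e8::".encode("utf-8")))   # UTF-8 hotkey "è::"
```

Because the decision is made at the first non-ASCII sequence, a buffer such as `b"caf\xc3\xa9 \xe9"` is classified as UTF-8 even though a later byte is invalid.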

They had similar cost (for code size), but after reading a lot more about various code pages, I am leaning toward implementing #1 and not #2. It turns out that there are a range of code pages which AutoHotkey doesn't decode correctly, although I don't know if any are commonly used as a system default code page.

One example is that cp50220 (ISO-2022 Japanese) encodes 点 as 1B 24 42 45 40 1B 28 42; i.e. ␛$BE@␛(B. The actual character is E@ and the rest are codepage-switching escape sequences. AutoHotkey currently fails to decode this in stream mode, although FileRead works. Neither #1 nor #2 will work with this example, because the entire data can be interpreted as valid ASCII.
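The ISO-2022 example can be checked with Python's `iso2022_jp` codec (for illustration only; the codec is assumed to be close enough to cp50220 for this particular byte sequence). Every byte is below 0x80, so a UTF-8 decode succeeds and simply yields the escape sequences as text.

```python
# The eight bytes quoted above: ESC $ B  E @  ESC ( B
raw = bytes.fromhex("1B 24 42 45 40 1B 28 42")

all_ascii = all(b < 0x80 for b in raw)  # no byte has the high bit set
as_utf8 = raw.decode("utf-8")           # succeeds, but is not the intended text
as_jis = raw.decode("iso2022_jp")       # the intended single character

print(all_ascii, repr(as_utf8), as_jis)
```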

If a file containing UTF-8 happens to have some invalid bytes (and they're the first non-ASCII bytes), the file would be decoded incorrectly as cp0 and the user will get no help; but maybe such files are very rare. Another problem is that #2 only looks at the first non-ASCII byte sequence (possibly the only simple way to add it into the current routine), which might not be reliable enough in practice. Improving on it would come at a cost in various ways.

I think that the warning would be raised in most cases.

Posted: **05 Jun 2024, 06:27**

v2.1-alpha.13 implements a warning in the case of UTF-8 decoding errors while loading a script file.

Posted: **08 Jun 2024, 13:16**

Re: Character Encoding Detection

In the past, people who were looking for a C/C++ library that could guess** a text file's character encoding would often be referred to Mozilla's Universal Charset Detector. AFAIK, Mozilla stopped maintaining this library many years ago; however, it seems to have been continued on GitHub: https://github.com/Joungkyun/libchardet -- this is the upstream for several Linux distros, and I believe it still compiles on Windows (though perhaps now requiring MinGW).

** I say "guess" because it is impossible to determine an arbitrary text file's character encoding with 100% accuracy. It is entirely possible for a given byte sequence to decode to valid character sequences under multiple encodings. Mozilla UCD uses heuristics / statistics to make an educated guess, but it is still just a guess.

Folks who are interested in charset detection might be curious to poke inside the source code of Mozilla UCD, to see how others approached the problem.

Re: Byte-order-marks (BOMs)

BOMs are a way to unambiguously indicate the encoding of a text file. At first glance, that sounds great. But in practice they can often be more trouble than they're worth:

- UTF-8 files are not required to have a BOM, and many (if not most) do not. UTF-16 and UTF-32 files are "required" to have a BOM, but the BOM can be either big-endian or little-endian, and some software will write UTF-16 without a BOM.

- BOMs can be a nuisance, because they are not always handled correctly by software. For example, if you concatenate two files:

Linux:   cat file1_with_bom.txt file2_with_bom.txt > combined.txt
Windows: copy file1_with_bom.txt + file2_with_bom.txt combined.txt

...then the BOM from file2_with_bom.txt will be included in the middle of combined.txt, which can cause problems for software that is not expecting it -- and plenty of programs are not expecting it.

- Similar issues can arise when combining files programmatically. For example Python is able to automatically detect/remove a BOM when reading a Unicode text file if you open the file with a "-sig" encoding (e.g. encoding="utf-8-sig"). Otherwise the BOM will be included as characters in the string that is read from the file, and it is the programmer's responsibility to remove it if necessary, which they often do not know they need to do. The situation is similar, though often slightly different, for other programming languages.
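Both BOM points above (concatenation and the "-sig" codec) can be seen in one short Python sketch. The two byte strings are made-up file contents, and the concatenation is assumed to be byte-wise.

```python
import codecs

# Two small UTF-8 files, each saved with a BOM (illustrative contents).
file1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-wise concatenation, as a plain file join would do.
combined = file1 + file2

plain = combined.decode("utf-8")      # keeps both BOMs as U+FEFF characters
sig = combined.decode("utf-8-sig")    # strips only the leading BOM

print(repr(plain))
print(repr(sig))
```

Even with the "-sig" codec, the second BOM stays in the middle of the text as U+FEFF, so the program still has to remove it itself.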

Some text editors mishandle BOMs, even if the only BOM in the file is correctly placed at the beginning. All this considered, I prefer to not use BOMs except maybe in controlled circumstances where all software that produces and consumes the files are known to be compliant and compatible with them. This would not be the case with AutoHotkey scripts (IMO), where users typically expect to be able to use their preferred text editor to edit scripts -- which might be any one of dozens of different programs -- and to be able to send those scripts as attachments to other people, who might be using some other text editor that might have differing support for BOMs.

Notify if not UTF-8? -- [Update: Not possible, me thinks]
