Notify if not UTF-8? -- [Update: Not possible, me thinks]
It seems like lots of folks have run into the problem of not having their ahk files encoded as UTF-8. Should AutoHotkey (during validation) detect whether there are Unicode characters and return a warning if the ahk file is not UTF-8?
Last edited by kunkel321 on 03 May 2024, 09:52, edited 1 time in total.
name := "ste(phen|ve) kunkel"
Re: Notify if not UTF-8?
I would second that, Steve. I have seen this issue pop up numerous times here in the past. Given the worldwide use of AHK by people whose languages use many characters that we in the USA do not, it would make sense for the launcher and compiler to analyze the source file for proper encoding.
Russ
Re: Notify if not UTF-8?
I third that. AHK v2 is Unicode-only, so it makes sense to coerce the user into using the proper encoding for that. Perhaps it should be a #Warn directive such as #Warn FileEncoding (I chose "FileEncoding" to be consistent with A_FileEncoding), which would display something like "The script file(s) you are attempting to run might not support special (Unicode) characters. To remove this warning either save the script file(s) in UTF-8 or UTF-16 file encoding, or apply the directive "#Warn FileEncoding, Off"."
Re: Notify if not UTF-8?
I agree with the above.
This is one of those problems that are really tricky to debug sometimes.
Having a warning at runtime might curb the head-scratching a bit in certain cases.
Projects:
AHK-ToolKit
Re: Notify if not UTF-8?
A warning in the case of UTF-8 decoding errors might be possible, but in some cases a byte sequence is valid in both UTF-8 and the system ANSI code page. I think the result of the user using the wrong encoding is often a valid sequence of characters which doesn't match what they intend. In such cases, how would the program detect that they are not using UTF-8?
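To make the ambiguity concrete, here is a minimal Python sketch (not AutoHotkey code). The helper name `interpretations` is hypothetical, and cp1252 is only assumed as a stand-in for "the system ANSI code page". It shows one byte pair that decodes without error either way, so a decoder has nothing to complain about:

```python
def interpretations(raw: bytes, ansi: str = "cp1252"):
    """Decode the same bytes as UTF-8 and as an ANSI code page.

    `ansi` is an assumed stand-in for the system ANSI code page.
    """
    return raw.decode("utf-8"), raw.decode(ansi)


# The bytes C3 A9 are valid in both encodings, but mean different text.
as_utf8, as_ansi = interpretations(bytes([0xC3, 0xA9]))
print(as_utf8)  # é
print(as_ansi)  # Ã©
```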
Re: Notify if not UTF-8?
lexikos wrote: ↑02 May 2024, 20:34 A warning in the case of UTF-8 decoding errors might be possible, but in some cases a byte sequence is valid in both UTF-8 and the system ANSI code page. I think the result of the user using the wrong encoding is often a valid sequence of characters which doesn't match what they intend. In such cases, how would the program detect that they are not using UTF-8?
This is a good point... If I'm understanding what you mean, when a person (erroneously) saves the file in non-Unicode format, the Unicode characters get replaced with substitute characters most (all?) of the time. At the point that AutoHotkey does its validity scan, all of those replacements have already been made, so there aren't actually any Unicode characters in the file to detect.
I don't know what the solution is. I guess it's up to the editor (SciTE, VSCode, Notepad, etc) to alert the user to this.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
Yeah, turns out file encodings are quite a mess and detecting regular UTF-8 is probably impossible with 100% accuracy. IsTextUnicode exists, but according to multiple StackOverflow posts it is rather unreliable.
A perhaps more reliable option could be to add a "strict mode" option which would enforce BOM encodings for the script and its include files. Although that still wouldn't guarantee a 100% detection rate (eg text starting with "ï»¿" being confused with UTF-8-BOM), I figure 99%+ is good enough.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
@kunkel321, a user would likely notice characters being substituted on save. That generally doesn't happen, because the user uses a codepage that includes the characters needed for their language. The real issue is that a sequence of bytes can often be interpreted multiple ways without decoding errors.
But maybe I underestimate how often an ANSI sequence containing non-ASCII characters would be invalid UTF-8.
Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8, so existence of such invalid sequences indicates the file is not UTF-8, while lack of invalid sequences is a very strong indication the text is UTF-8.
Source: Byte order mark - Wikipedia
@Descolada, how would such an option be enabled, and who would enable it? Also, UTF-8 without the signature is preferred by some crowds.
Non-UTF-8 text starting with "ï»¿" is not a possibility that we need to consider. Anyone writing that as ANSI at the start of a script is likely trying to prove a point, knowing full well that it will be interpreted as UTF-8. Not only that, but it must be interpreted as UTF-8, because that is the documented behaviour.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
lexikos wrote: ↑05 May 2024, 17:00 UTF-8 without the signature is preferred by some crowds.
This is true, a large number of tools prefer UTF-8 without BOM, and even the UTF-8 standard doesn't recommend it. I was really surprised AutoHotkey kind of enforces the use of the BOM, so to speak. When saving a UTF-8-RAW script and trying to display a MsgBox with certain characters, or in other similar situations, you get encoding errors, which is surprising sometimes hehe.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
@RaptorX, AutoHotkey v2 does not by any definition "kind of enforce the use of the BOM". Script files must be UTF-8 with or without BOM unless they have a UTF-16 BOM or you use the /CP switch. If a Unicode character is written directly inside a script file which is saved as UTF-8-RAW, it will be decoded correctly.
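The rule stated above can be sketched in Python as follows. This is only an illustration of the rule as described in this post, not AutoHotkey's actual loader: the function name `decode_script` and the `cp_override` parameter (standing in for the /CP switch) are hypothetical, and letting a BOM take priority over the override is an assumption.

```python
import codecs
from typing import Optional


def decode_script(raw: bytes, cp_override: Optional[str] = None) -> str:
    """Decode script bytes: UTF-16 LE BOM, else UTF-8 with or without BOM.

    cp_override is a hypothetical stand-in for the /CP switch; giving a
    BOM priority over it is an assumption made for this sketch.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: UTF-8 is the default unless an override was given.
    return raw.decode(cp_override or "utf-8")


source = 'MsgBox "汉字"'
# UTF-8 without a signature ("UTF-8-RAW") decodes just like UTF-8 with BOM.
print(decode_script(source.encode("utf-8")) == source)
print(decode_script(source.encode("utf-8-sig")) == source)
```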
Files read by the script default to ANSI, but there are several ways to specify that a file (or all files) should be read as UTF-8.
Let us not even speak of AutoHotkey v1.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
lexikos wrote: @kunkel321, a user would likely notice characters being substituted on save. That generally doesn't happen, because the user uses a codepage that includes the characters needed for their language. The real issue is that a sequence of bytes can often be interpreted multiple ways without decoding errors.
They might notice it depending on the situation. If the substituted characters are on line 500 of 1000, then they might not notice. This might happen for example when copying from some other source into the editor (which perhaps has an ANSI-encoded file open?), and then running into unexpected problems.
Quote: @Descolada, how would such an option be enabled, and who would enable it? Also, UTF-8 without the signature is preferred by some crowds.
I would think a directive would suit this purpose, something like #Requires ScriptEncoding UTF-8-BOM. If saving in a wrong encoding totally garbles the content then the script won't run at all, causing the user to probably investigate encoding issues. Otherwise the directive would take effect and alert the user of encoding problems (eg a script accidentally saved in ANSI). How often that would be useful is another question, as I've noticed these kinds of posts have lessened in frequency in the v2 forums section, most likely because UTF-8 has become common enough overall?
UTF-8 is likely more common than UTF-8-BOM, and I'm aware that the Unicode standard says that regular UTF-8 is preferred because, quoting from Wikipedia, "Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII" and "A BOM is unnecessary for detecting UTF-8 encoding". The latter doesn't seem to be true though, as byte sequences can be valid strings in UTF-8 as well as valid strings in other encodings. So the surest way seems to be to use/require the BOM.
RaptorX wrote: ↑05 May 2024, 17:40 This is true, a large number of tools prefer UTF-8 without BOM, and even the UTF-8 standard doesn't recommend it. I was really surprised AutoHotkey kind of enforces the use of the BOM, so to speak. When saving a UTF-8-RAW script and trying to display a MsgBox with certain characters, or in other similar situations, you get encoding errors, which is surprising sometimes hehe.
Are you able to provide an example of this? I remember similar situations as well, but couldn't replicate it any longer. I think that perhaps UTF-8 has become standard enough that it isn't such a big problem as before. Notepad defaults to UTF-8, most (all?) editors default to UTF-8 with or without BOM, AHK v2 interprets scripts in UTF-8 etc. Are there any editors out there that default to ANSI?
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
Descolada wrote: I would think a directive would suit this purpose,
A directive which is encoded as...? The program has already decided on an encoding and begun reading the file.
Again, who would enable it? The author who doesn't know that the file needs to be saved as UTF-8?
Is this directive supposed to be used by the author of a file, presuming that some other user might save the file with the wrong encoding? The author can write code into the file that checks its encoding, or write a single Unicode character and compare against its ordinal value. Then the author can display whatever warning or (hopefully) helpful notice that he wishes.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
Recently? In this forum or elsewhere? Asking since I haven't noticed such folks here recently.
Here's a related 2023 discussion on whether UTF-8 BOM should be advised in the v2 doc FAQ
viewtopic.php?f=86&t=121928&p=541713
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
Here's a v1 help thread where I was stumped for a while..
viewtopic.php?f=76&t=119308&p=529417&hilit#p529417
I spent a fair bit of time thinking it was caused by using different editors. Finally someone set me straight.
I've seen three or four other help posts (mostly I'm on v2 these days), where I suggested changing the encoding. One was the same day I posted this thread on the Wish List. Sorry, I don't recall the other v2 help topics.
EDIT. Actually.... Do an 'advanced' forum search for "UTF-8 with BOM" in the v2 help forum. There are 69 results. That's not a lot, considering there are nearly 30k posts.. Still though, the preview of those 69 results suggests that many are due to folks needing to change their file encoding and they don't know it.
Last edited by kunkel321 on 07 May 2024, 07:25, edited 3 times in total.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
The directive is encoded as whatever the user decided or accidentally used... Practically all encodings share the ASCII character set, which is what most of AHK syntax uses, so AHK may load the script without syntax errors. The problem lies in how everything else is interpreted. Say the user is using Chinese simplified (GB2312) as their default encoding, and copies a script from the forums:
Code: Select all
#Requires AutoHotkey v2
MsgBox "汉字"
Code: Select all
���
Sure, the alternative is something like
Code: Select all
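#Requires AutoHotkey v2
; 27721 (0x6C49) is the code point of 汉; a mismatch means the file was decoded with the wrong encoding.
if Ord("汉") != 27721
    throw Error("Save the script in UTF-8 encoding")
MsgBox "汉字"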
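Why the ordinal check catches the GB2312 scenario can be simulated outside AutoHotkey. The Python sketch below is only an illustration; the helper name `survives_as_utf8` is hypothetical. It encodes the text the way the editor saved it, then decodes it as UTF-8 the way a UTF-8-only reader would:

```python
def survives_as_utf8(text: str, saved_as: str) -> bool:
    """True if text saved in `saved_as` reads back intact when decoded as UTF-8."""
    raw = text.encode(saved_as)
    try:
        return raw.decode("utf-8") == text
    except UnicodeDecodeError:
        return False


print(ord("汉"))                            # 27721, the value the script compares against
print(survives_as_utf8("汉字", "utf-8"))    # True
print(survives_as_utf8("汉字", "gb2312"))   # False: the bytes are not the UTF-8 form of 汉字
print(survives_as_utf8("MsgBox", "gb2312")) # True: pure ASCII is unaffected
```

The last line is the reason the script still loads without syntax errors: only the non-ASCII string content is damaged.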
Quote: Again, who would enable it? The author who doesn't know that the file needs to be saved as UTF-8?
The same person who includes a #Requires AutoHotkey v2 line in their code. Or maybe the new script template file included it.
Quote: The author can write code into the file that checks its encoding, or write a single unicode character and compare against its ordinal value. Then the author can display whatever warning or (hopefully) helpful notice that he wishes.
Which features should or shouldn't be included in the language itself (rather than implemented by users) is largely a matter of opinion. But I emphasize again that this proposed feature might not be useful any longer as UTF-8 seems to be nearly universal. Personally I've encountered such problems only in v1, and with file operations in v2 (IMO default A_FileEncoding should've been UTF-8 or UTF-8-BOM!).
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
@Descolada
I wrote the code that reads script files, and one of its optimizations is that it bypasses translation for ASCII characters. I obviously know that all of the encodings supported by AutoHotkey contain ASCII as a subset. My point is that upon finding the directive, the program has to check for a UTF-8 BOM... again. Or it needs code to record the fact that there was a BOM just for the purpose of this warning that would only be of use to those who are unlikely to have enabled it.
Which features are or aren't included in the program is completely a matter of opinion. Specifically, mine. That's a fact.
Use a Unicode supplementary character. In the unlikely event that decoding the file with an ANSI codepage is somehow able to return the correct character, do we care that the user is not warned that the file is not UTF-8? Have you lost sight of the purpose?
Code: Select all
if Ord("👍") != 128077
    throw Error("Save the script in UTF-8 encoding")
In fact, this will also permit UTF-16 and UTF-8-RAW.
If you wanted to enforce the use of UTF-8 with BOM as implied by the name of your directive, you could actually check for it with 100% reliability, by reading the file. If it's not compiled...
Quote: The same person who includes a #Requires AutoHotkey v2 line in their code.
No I won't. Why would I? I don't believe that many other users would either, even if they learned of the need for UTF-8, even if it was actually still a common problem.
Why should anyone ever be forced to save the script with a byte order mark?
Anyway, don't encourage me to continue this. As I said, maybe I underestimate how often an ANSI sequence containing non-ASCII characters would be invalid UTF-8. I will look into detecting decoding errors at some point. If it works, it would be better for everyone.
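The "check for it by reading the file" idea amounts to comparing the first three bytes with EF BB BF. A minimal Python sketch, for illustration only (the names `has_utf8_bom` and `demo` and the file names are hypothetical; a script would do the equivalent on its own source file):

```python
import codecs
import os
import tempfile


def has_utf8_bom(path: str) -> bool:
    """True if the file starts with the UTF-8 signature EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def demo() -> dict:
    """Write the same script with and without a BOM and test both files."""
    results = {}
    with tempfile.TemporaryDirectory() as folder:
        for name, encoding in (("with_bom.ahk", "utf-8-sig"), ("raw.ahk", "utf-8")):
            path = os.path.join(folder, name)
            with open(path, "w", encoding=encoding) as f:
                f.write('MsgBox "👍"\n')
            results[name] = has_utf8_bom(path)
    return results


print(demo())
```

Unlike the Ord() comparison, this rejects UTF-8 without a signature, which is exactly the restriction being questioned in this post.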
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
UTF-8 lead bytes have the bit pattern 110xxxxx, 1110xxxx or 11110xxx, followed by 1, 2 or 3 suffix bytes with the bit pattern 10xxxxxx. Therefore some ANSI sequences which are obviously not valid UTF-8:
- 0x80..0xBF not as part of one of the following sequences.
- 0xC0..0xDF not followed by 0x80..0xBF.
- 0xE0..0xEF not followed by 0x80..0xBF, 0x80..0xBF.
- 0xF0..0xF7 not followed by 0x80..0xBF, 0x80..0xBF, 0x80..0xBF.
- 0xF8..0xFF.
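The byte rules listed above translate almost directly into a scanner. The Python sketch below is an illustration, not AutoHotkey's decoder; the function name `first_invalid_utf8` is hypothetical. It applies only these structural rules, so it does not reject overlong forms or surrogates the way a full UTF-8 decoder would:

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the offset of the first byte breaking the listed rules, or None."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            need = 0            # ASCII, no suffix bytes
        elif 0xC0 <= b <= 0xDF:
            need = 1            # 110xxxxx
        elif 0xE0 <= b <= 0xEF:
            need = 2            # 1110xxxx
        elif 0xF0 <= b <= 0xF7:
            need = 3            # 11110xxx
        else:
            return i            # stray 0x80..0xBF, or 0xF8..0xFF
        tail = data[i + 1:i + 1 + need]
        if len(tail) < need or any(not 0x80 <= t <= 0xBF for t in tail):
            return i            # lead byte without its 10xxxxxx suffix bytes
        i += 1 + need
    return None


print(first_invalid_utf8("naïve 👍".encode("utf-8")))  # None
print(first_invalid_utf8("naïve".encode("cp1252")))    # 2
```

In the second call the cp1252 byte for ï (0xEF) looks like a three-byte lead, but the ASCII letters after it are not suffix bytes, so the sequence is rejected.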
I thought: if non-UTF-8 data can be detected so easily, why not support ANSI? So I carried out two experiments:
1. Show a warning if there were any decoding errors while reading a script file.
2. Implement an "automatic code page" mode: when the first non-ASCII character is encountered, switch to UTF-8 if it is the beginning of a valid UTF-8 sequence, otherwise cp0. It passed some simple testing.
One example is that cp50220 (ISO-2022 Japanese) encodes 点 as 1B 24 42 45 40 1B 28 42; i.e. ␛$BE@␛(B. The actual character is E@ and the rest are codepage-switching escape sequences. AutoHotkey currently fails to decode this in stream mode, although FileRead works. Neither #1 nor #2 will work with this example, because the entire data can be interpreted as valid ASCII.
If a file containing UTF-8 happens to have some invalid bytes (and they're the first non-ASCII bytes), the file would be decoded incorrectly as cp0 and the user will get no help; but maybe such files are very rare. Another problem is that #2 only looks at the first non-ASCII byte sequence (possibly the only simple way to add it into the current routine), which might not be reliable enough in practice. Improving on it would come at a cost in various ways.
I think that the warning would be raised in most cases.
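A rough Python sketch of the "automatic code page" idea and its blind spots, for illustration only. It is not the code from the experiment: the function name `sniff_code_page` is hypothetical, and cp1252 is assumed as a stand-in for cp0.

```python
from typing import Optional


def sniff_code_page(data: bytes, ansi: str = "cp1252") -> Optional[str]:
    """Pick an encoding from the first non-ASCII byte sequence only.

    Returns None when the data is pure ASCII and no decision is needed.
    `ansi` is an assumed stand-in for cp0 (the system ANSI code page).
    """
    for i, b in enumerate(data):
        if b < 0x80:
            continue
        if 0xC0 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF7:
            need = 3
        else:
            return ansi  # cannot start a UTF-8 sequence
        tail = data[i + 1:i + 1 + need]
        valid = len(tail) == need and all(0x80 <= t <= 0xBF for t in tail)
        return "utf-8" if valid else ansi
    return None


print(sniff_code_page("é".encode("utf-8")))    # utf-8
print(sniff_code_page("é".encode("cp1252")))   # cp1252
print(sniff_code_page("Ã©".encode("cp1252")))  # utf-8 (ANSI text that happens to be valid UTF-8)
print(sniff_code_page(b"\x1b$BE@\x1b(B"))      # None (the ISO-2022 bytes are all ASCII)
```

The last two calls show the limits mentioned above: an ANSI sequence that is also valid UTF-8 is misclassified, and escape-sequence encodings never trigger a decision at all.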
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
v2.1-alpha.13 implements a warning in the case of UTF-8 decoding errors while loading a script file.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
Re: Character Encoding Detection
In the past, people who were looking for a C/C++ library that could guess** a text file's character encoding would often be referred to Mozilla's Universal Charset Detector. AFAIK, Mozilla stopped maintaining this library many years ago; however, it seems to have been continued on GitHub: https://github.com/Joungkyun/libchardet -- this is the upstream for several Linux distros, and I believe it still compiles on Windows (though perhaps now requiring MinGW).
** I say "guess" because it is impossible to determine an arbitrary text file's character encoding with 100% accuracy. It is entirely possible for a given byte sequence to decode to valid character sequences under multiple encodings. Mozilla UCD uses heuristics / statistics to make an educated guess, but it is still just a guess.
Folks who are interested in charset detection might be curious to poke inside the source code of Mozilla UCD, to see how others approached the problem.
Re: Byte-order-marks (BOMs)
BOMs are a way to unambiguously indicate the encoding of a text file. At first glance, that sounds great. But in practice they can often be more trouble than they're worth:
- UTF-8 files are not required to have a BOM, and many (if not most) do not. UTF-16 and UTF-32 files are "required" to have a BOM, but the BOM can be either big-endian or little-endian, and some software will write UTF-16 without a BOM.
- BOMs can be a nuisance, because they are not always handled correctly by software. For example, if you concatenate two files:
Linux:   cat file1_with_bom.txt file2_with_bom.txt > combined.txt
Windows: copy file1_with_bom.txt + file2_with_bom.txt combined.txt
...then the BOM from file2_with_bom.txt will be included in the middle of combined.txt, which can cause problems for software that is not expecting it -- and plenty of programs are not expecting it.
- Similar issues can arise when combining files programmatically. For example Python is able to automatically detect/remove a BOM when reading a Unicode text file if you open the file with a "-sig" encoding (e.g. encoding="utf-8-sig"). Otherwise the BOM will be included as characters in the string that is read from the file, and it is the programmer's responsibility to remove it if necessary, which they often do not know they need to do. The situation is similar, though often slightly different, for other programming languages.
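A small Python sketch of the concatenation problem described in the list above, using in-memory bytes instead of real files (the variable names are illustrative). It shows that "utf-8-sig" strips only a leading BOM, so the second file's BOM survives as a U+FEFF character in the middle of the text:

```python
import codecs

# Two "files", each saved as UTF-8 with BOM.
file1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-wise concatenation, as cat or copy would do.
combined = file1 + file2

with_sig = combined.decode("utf-8-sig")  # strips only the leading BOM
plain = combined.decode("utf-8")         # strips nothing

print(repr(with_sig))  # 'first\n\ufeffsecond\n'
print(repr(plain))     # '\ufefffirst\n\ufeffsecond\n'
```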
Some text editors mishandle BOMs, even if the only BOM in the file is correctly placed at the beginning. All this considered, I prefer to not use BOMs except maybe in controlled circumstances where all software that produces and consumes the files is known to be compliant and compatible with them. This would not be the case with AutoHotkey scripts (IMO), where users typically expect to be able to use their preferred text editor to edit scripts -- which might be any one of dozens of different programs -- and to be able to send those scripts as attachments to other people, who might be using some other text editor that might have differing support for BOMs.
Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]
@MC256 This topic is "Notify if not UTF-8?". The purpose is to assist users who make a common mistake when saving their AutoHotkey script files. The scope of issues is very limited because script files are supposed to be UTF-8, and editors generally default to either UTF-8 or the system default ANSI codepage. The vast majority are likely to be reading the script file on the same system which created it (or a system with the same default codepage), so we don't even need to care about different ANSI codepages; just UTF-8 and CP0. For the actual request in the title and top post, the scope is even more limited: we only care about whether the file is all valid UTF-8.
Actually, UTF-8 with BOM is tangential to this topic. If there is a UTF-8 or UTF-16-LE BOM, the notification is not needed. The FAQ recommends using a BOM because editors which default to ANSI are common, and much more commonly used with AutoHotkey scripts than editors which don't support a UTF-8 BOM.
Combining AutoHotkey script files is much more complex than just pasting files together, but if you just want to do the latter, FileRead and FileAppend won't have any problem with a UTF-8 BOM. A more likely problem is that FileAppend won't default to UTF-8. AutoHotkey script files aren't written to be read by Python or command-line tools.