Notify if not UTF-8? -- [Update: Not possible, me thinks]

Propose new features and changes
kunkel321
Posts: 1118
Joined: 30 Nov 2015, 21:19

Notify if not UTF-8? -- [Update: Not possible, me thinks]

17 Apr 2024, 13:31

It seems like lots of folks have run into the problem of not having their ahk files encoded as UTF-8. Should AutoHotkey (during validation) detect whether the script contains Unicode characters, and return a warning if the ahk file is not UTF-8?
Last edited by kunkel321 on 03 May 2024, 09:52, edited 1 time in total.
ste(phen|ve) kunkel
RussF
Posts: 1285
Joined: 05 Aug 2021, 06:36

Re: Notify if not UTF-8?

17 Apr 2024, 14:41

I would second that, Steve. I have seen this issue pop up numerous times here in the past. Given the worldwide use of AHK by people whose languages use many characters that we in the USA do not, it would make sense for the launcher and compiler to analyze the source file for proper encoding.

Russ
Descolada
Posts: 1164
Joined: 23 Dec 2021, 02:30

Re: Notify if not UTF-8?

18 Apr 2024, 10:31

I third that. AHK v2 is Unicode-only, so it makes sense to coerce the user into using the proper encoding for that. Perhaps it should be a #Warn directive such as #Warn FileEncoding (I chose "FileEncoding" to be consistent with A_FileEncoding), which would display something like "The script file(s) you are attempting to run might not support special (Unicode) characters. To remove this warning either save the script file(s) in UTF-8 or UTF-16 file encoding, or apply the directive "#Warn FileEncoding, Off"."
RaptorX
Posts: 388
Joined: 06 Dec 2014, 14:27

Re: Notify if not UTF-8?

19 Apr 2024, 19:57

I agree with the above.

This is one of those problems that are really tricky to debug sometimes.
Having a warning at runtime might curb the head-scratching a bit in certain cases.
Projects:
AHK-ToolKit
lexikos
Posts: 9631
Joined: 30 Sep 2013, 04:07

Re: Notify if not UTF-8?

02 May 2024, 20:34

A warning in the case of UTF-8 decoding errors might be possible, but in some cases a byte sequence is valid in both UTF-8 and the system ANSI code page. I think the result of the user using the wrong encoding is often a valid sequence of characters which doesn't match what they intend. In such cases, how would the program detect that they are not using UTF-8?
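To make the ambiguity concrete, here is a minimal sketch in Python (used only to show the bytes; it says nothing about how AutoHotkey itself reads scripts). It assumes Windows-1252 as the system ANSI code page, which is just one possible value.

```python
# The same two bytes decode without error under both encodings,
# but they produce different text.
data = bytes([0xC3, 0xA9])

as_utf8 = data.decode("utf-8")    # one character: "é"
as_ansi = data.decode("cp1252")   # two characters: "Ã©"

print(repr(as_utf8), repr(as_ansi))
```

Neither decode raises an error, so a loader cannot tell from these bytes alone which text the author meant.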
kunkel321
Posts: 1118
Joined: 30 Nov 2015, 21:19

Re: Notify if not UTF-8?

03 May 2024, 09:49

lexikos wrote:
02 May 2024, 20:34
A warning in the case of UTF-8 decoding errors might be possible, but in some cases a byte sequence is valid in both UTF-8 and the system ANSI code page. I think the result of the user using the wrong encoding is often a valid sequence of characters which doesn't match what they intend. In such cases, how would the program detect that they are not using UTF-8?
This is a good point... If I'm understanding what you mean, when a person (erroneously) saves the file in a non-Unicode format, the Unicode characters get replaced with substitute characters most (all?) of the time. By the time AutoHotkey does its validity scan, all of those replacements have already been made, so there aren't actually any Unicode characters in the file to detect.

I don't know what the solution is. I guess it's up to the editor (SciTE, VSCode, Notepad, etc) to alert the user to this.
ste(phen|ve) kunkel
Descolada
Posts: 1164
Joined: 23 Dec 2021, 02:30

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

03 May 2024, 14:12

Yeah, turns out file encodings are quite a mess, and detecting regular UTF-8 with 100% accuracy is probably impossible. IsTextUnicode exists, but according to multiple StackOverflow posts it is rather unreliable.

A perhaps more reliable option could be to add a "strict mode" option which would enforce BOM encodings for the script and its include files. Although that still wouldn't guarantee a 100% detection rate (e.g. text starting with "ï»¿" being confused with UTF-8-BOM), I figure 99%+ is good enough.
lexikos
Posts: 9631
Joined: 30 Sep 2013, 04:07

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

05 May 2024, 17:00

@kunkel321, a user would likely notice characters being substituted on save. That generally doesn't happen, because the user uses a codepage that includes the characters needed for their language. The real issue is that a sequence of bytes can often be interpreted multiple ways without decoding errors.

But maybe I underestimate how often an ANSI sequence containing non-ASCII characters would be invalid UTF-8.
"Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8, so existence of such invalid sequences indicates the file is not UTF-8, while lack of invalid sequences is a very strong indication the text is UTF-8."
Source: Byte order mark - Wikipedia
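The heuristic in that quote can be sketched in a few lines of Python. This is only an illustration of the idea, not AutoHotkey's loader; `is_valid_utf8` is a hypothetical helper name, and the code pages used (Windows-1252, GB2312) are example choices.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Non-ASCII text saved in a legacy code page is usually not valid UTF-8...
print(is_valid_utf8("café".encode("cp1252")))   # False: a lone 0xE9 byte
print(is_valid_utf8("汉字".encode("gb2312")))    # False: 0xBA cannot start a sequence
# ...but a clean decode does not prove that UTF-8 was intended.
print(is_valid_utf8("Ã©".encode("cp1252")))     # True: bytes C3 A9 are also "é"
```

A decoding error is therefore a reliable sign that the file is not UTF-8, while a clean decode is only strong evidence that it is.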

@Descolada, how would such an option be enabled, and who would enable it? Also, UTF-8 without the signature is preferred by some crowds.

Non-UTF-8 text starting with "ï»¿" is not a possibility that we need to consider. Anyone writing that as ANSI at the start of a script is likely trying to prove a point, knowing full well that it will be interpreted as UTF-8. Not only that, but it must be interpreted as UTF-8, because that is the documented behaviour.
RaptorX
Posts: 388
Joined: 06 Dec 2014, 14:27

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

05 May 2024, 17:40

lexikos wrote:
05 May 2024, 17:00
UTF-8 without the signature is preferred by some crowds.

Non-UTF-8 text starting with "ï»¿" is not a possibility that we need to consider. Anyone writing that as ANSI at the start of a script is likely trying to prove a point, knowing full well that it will be interpreted as UTF-8. Not only that, but it must be interpreted as UTF-8, because that is the documented behaviour.
This is true, a large number of tools prefer UTF-8 without BOM, and even the UTF-8 standard doesn't recommend it. I was really surprised that AutoHotkey kind of enforces the use of the BOM, so to speak. When you save a UTF-8-RAW script and try to display a MsgBox with certain characters, or in other similar situations, you get encoding errors, which is surprising sometimes hehe.
Projects:
AHK-ToolKit
lexikos
Posts: 9631
Joined: 30 Sep 2013, 04:07

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

05 May 2024, 20:20

@RaptorX, AutoHotkey v2 does not by any definition "kind of enforce the use of the BOM". Script files must be UTF-8 with or without BOM unless they have a UTF-16 BOM or you use the /CP switch. If a Unicode character is written directly inside a script file which is saved as UTF-8-RAW, it will be decoded correctly.

Files read by the script default to ANSI, but there are several ways to specify that a file (or all files) should be read as UTF-8.

Let us not even speak of AutoHotkey v1.
Descolada
Posts: 1164
Joined: 23 Dec 2021, 02:30

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

06 May 2024, 05:31

lexikos wrote: @kunkel321, a user would likely notice characters being substituted on save. That generally doesn't happen, because the user uses a codepage that includes the characters needed for their language. The real issue is that a sequence of bytes can often be interpreted multiple ways without decoding errors.
They might notice it depending on the situation. If the substituted characters are on line 500 of 1000, then they might not notice. This might happen for example when copying from some other source into the editor (which perhaps has an ANSI-encoded file open?), and then running into unexpected problems.
lexikos wrote:
@Descolada, how would such an option be enabled, and who would enable it? Also, UTF-8 without the signature is preferred by some crowds.
I would think a directive would suit this purpose, something like #Requires ScriptEncoding UTF-8-BOM. If saving in a wrong encoding totally garbles the content then the script won't run at all, probably causing the user to investigate encoding issues. Otherwise the directive would take effect and alert the user to encoding problems (e.g. a script accidentally saved in ANSI). How often that would be useful is another question, as I've noticed these kinds of posts have lessened in frequency in the v2 forums section, most likely because UTF-8 has become common enough overall?

UTF-8 is likely more common than UTF-8-BOM, and I'm aware that the Unicode standard says that regular UTF-8 is preferred because, quoting from Wikipedia, "Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII" and "A BOM is unnecessary for detecting UTF-8 encoding". The latter doesn't seem to be true though, as byte sequences can be valid strings in UTF-8 as well as valid strings in other encodings. So the surest way seems to be to use/require the BOM.
RaptorX wrote:
05 May 2024, 17:40
This is true, a large number of tools prefer UTF-8 without BOM, and even the UTF-8 standard doesn't recommend it. I was really surprised that AutoHotkey kind of enforces the use of the BOM, so to speak. When you save a UTF-8-RAW script and try to display a MsgBox with certain characters, or in other similar situations, you get encoding errors, which is surprising sometimes hehe.
Are you able to provide an example of this? I remember similar situations as well, but can no longer replicate them. I think that perhaps UTF-8 has become standard enough that it isn't as big a problem as before. Notepad defaults to UTF-8, most (all?) editors default to UTF-8 with or without BOM, AHK v2 interprets scripts as UTF-8, etc. Are there any editors out there that default to ANSI?
lexikos
Posts: 9631
Joined: 30 Sep 2013, 04:07

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

06 May 2024, 05:49

Descolada wrote:I would think a directive would suit this purpose,
A directive which is encoded as...? The program has already decided on an encoding and begun reading the file.

Again, who would enable it? The author who doesn't know that the file needs to be saved as UTF-8?

Is this directive supposed to be used by the author of a file, presuming that some other user might save the file with the wrong encoding? The author can write code into the file that checks its encoding, or write a single Unicode character and compare against its ordinal value. Then the author can display whatever warning or (hopefully) helpful notice that he wishes.
neogna2
Posts: 596
Joined: 15 Sep 2016, 15:44

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

06 May 2024, 06:20

kunkel321 wrote:
17 Apr 2024, 13:31
It seems like lots of folks have run into the problem of not having their ahk files encoded as UTF-8.
Recently? In this forum or elsewhere? Asking since I haven't noticed such folks here recently.

Here's a related 2023 discussion on whether UTF-8 BOM should be advised in the v2 doc FAQ
viewtopic.php?f=86&t=121928&p=541713
kunkel321
Posts: 1118
Joined: 30 Nov 2015, 21:19

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

06 May 2024, 07:54

neogna2 wrote:
06 May 2024, 06:20
....Recently? In this forum or elsewhere? Asking since I haven't noticed such folks here recently...
Here's a v1 help thread where I was stumped for a while..
viewtopic.php?f=76&t=119308&p=529417&hilit#p529417
I spent a fair bit of time thinking it was caused by using different editors. Finally someone set me straight.

I've seen three or four other help posts (mostly I'm on v2 these days), where I suggested changing the encoding. One was the same day I posted this thread on the Wish List. Sorry, I don't recall the other v2 help topics.

EDIT. Actually.... Do an 'advanced' forum search for "UTF-8 with BOM" in the v2 help forum. There are 69 results. That's not a lot, considering there are nearly 30k posts.. Still though, the preview of those 69 results suggests that many are due to folks needing to change their file encoding and they don't know it.
Last edited by kunkel321 on 07 May 2024, 07:25, edited 3 times in total.
ste(phen|ve) kunkel
Descolada
Posts: 1164
Joined: 23 Dec 2021, 02:30

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

06 May 2024, 08:03

lexikos wrote:
06 May 2024, 05:49
A directive which is encoded as...? The program has already decided on an encoding and begun reading the file.
The directive is encoded as whatever the user decided or accidentally used... Practically all encodings share the ASCII character set, which is what most of AHK syntax uses, so AHK may load the script without syntax errors. The problem lies in how everything else is interpreted. Say the user is using Chinese simplified (GB2312) as their default encoding, and copies a script from the forums:

Code: Select all

#Requires AutoHotkey v2
MsgBox "汉字"
AHK reads the saved script just fine and displays the following:

Code: Select all

���
If the code snippet included #Requires ScriptEncoding UTF-8-BOM (or if AHK by default required a specific encoding such as UTF-8-BOM), then instead the user would see a helpful message about saving the file in UTF-8-BOM, which could save time for the user and maybe prevent another forums post.

Sure, the alternative is something like

Code: Select all

#Requires AutoHotkey v2
if Ord("汉") != 27721
	throw Error("Save the script in UTF-8 encoding")
MsgBox "汉字"
but would it work with all incorrect encodings? Also what if the snippet includes another file that should also be in the correct encoding?
lexikos wrote:
Again, who would enable it? The author who doesn't know that the file needs to be saved as UTF-8?
The same person who includes a #Requires AutoHotkey v2 line in their code. Or maybe the new script template file included it.
lexikos wrote:
The author can write code into the file that checks its encoding, or write a single unicode character and compare against its ordinal value. Then the author can display whatever warning or (hopefully) helpful notice that he wishes.
Which features should or shouldn't be included in the language itself (rather than implemented by users) is largely a matter of opinion. But I emphasize again that this proposed feature might not be useful any longer as UTF-8 seems to be nearly universal. Personally I've encountered such problems only in v1, and with file operations in v2 (IMO default A_FileEncoding should've been UTF-8 or UTF-8-BOM!).
lexikos
Posts: 9631
Joined: 30 Sep 2013, 04:07

Re: Notify if not UTF-8? -- [Update: Not possible, me thinks]

07 May 2024, 06:17

@Descolada
I wrote the code that reads script files, and one of its optimizations is that it bypasses translation for ASCII characters. I obviously know that all of the encodings supported by AutoHotkey contain ASCII as a subset. My point is that upon finding the directive, the program has to check for a UTF-8 BOM... again. Or it needs code to record the fact that there was a BOM just for the purpose of this warning that would only be of use to those who are unlikely to have enabled it.

Which features are or aren't included in the program is completely a matter of opinion. Specifically, mine. That's a fact. ;)

Use a Unicode supplementary character. In the unlikely event that decoding the file with an ANSI codepage is somehow able to return the correct character, do we care that the user is not warned that the file is not UTF-8? Have you lost sight of the purpose?

Code: Select all

if Ord("👍") != 128077
	throw Error("Save the script in UTF-8 encoding")
In fact, this will also permit UTF-16 and UTF-8-RAW.
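Why a supplementary character makes a good sentinel can be sketched in Python. This is a rough simulation, not AutoHotkey code: `sentinel_survives` is a hypothetical helper that pretends an editor saved the character in some encoding and a loader then read the bytes as UTF-8. It models only BOM-less files, so the UTF-16 case is not covered, and the legacy encodings listed are arbitrary examples.

```python
SENTINEL = "\U0001F44D"  # U+1F44D, code point 128077, outside the BMP

def sentinel_survives(save_encoding: str) -> bool:
    """Simulate saving the sentinel in save_encoding, then loading it as UTF-8."""
    # Characters the encoding cannot represent are written as "?".
    saved = SENTINEL.encode(save_encoding, errors="replace")
    loaded = saved.decode("utf-8", errors="replace")
    return loaded == SENTINEL

for enc in ("utf-8", "cp1252", "gb2312", "shift_jis"):
    print(enc, sentinel_survives(enc))
```

In this sketch only the UTF-8 save round-trips; none of the legacy code pages can store the character at all, so the comparison fails for each of them.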

If you wanted to enforce the use of UTF-8 with BOM as implied by the name of your directive, you could actually check for it with 100% reliability, by reading the file. If it's not compiled...
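Checking for the signature really is just a comparison of the first three bytes. Here is a minimal sketch in Python rather than AutoHotkey; `has_utf8_bom` is a hypothetical helper, and the temporary file names exist only for the demo. An AutoHotkey version would read the script's own file instead, which is not shown here.

```python
import codecs
import os
import tempfile

def has_utf8_bom(path: str) -> bool:
    """Return True if the file starts with the UTF-8 signature EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

# Demo: write the same script text with and without the signature.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.ahk")
    without_bom = os.path.join(tmp, "without_bom.ahk")
    with open(with_bom, "w", encoding="utf-8-sig") as f:
        f.write('MsgBox "汉字"\n')
    with open(without_bom, "w", encoding="utf-8") as f:
        f.write('MsgBox "汉字"\n')
    results = (has_utf8_bom(with_bom), has_utf8_bom(without_bom))

print(results)
```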
Descolada wrote:
The same person who includes a #Requires AutoHotkey v2 line in their code.
No I won't. Why would I? I don't believe that many other users would either, even if they learned of the need for UTF-8, even if it was actually still a common problem.

Why should anyone ever be forced to save the script with a byte order mark?


Anyway, don't encourage me to continue this. As I said, maybe I underestimate how often an ANSI sequence containing non-ASCII characters would be invalid UTF-8. I will look into detecting decoding errors at some point. If it works, it would be better for everyone.
