UTF-8 as active code page

Post by lexikos » 25 Nov 2022, 01:58

The active code page is what Win32 functions with the "A" suffix accept or return. "ANSI" builds of AutoHotkey rely on these functions, which is why they can't handle the full range of Unicode characters. Sometimes even Unicode builds of AutoHotkey are affected by such limitations because some APIs don't support UTF-16.

However, Windows 10 version 1903 added a way for programs to opt in to UTF-8 as the active code page. If the code that was previously dealing with ANSI strings is able to handle UTF-8 correctly, this makes the full range of Unicode characters work with very little effort; otherwise, it can break the program. I won't be enabling it by default because of the implications for compatibility, but users can choose to enable it either by modifying AutoHotkey or by compiling a script with a custom manifest.

The article "Use UTF-8 code pages in Windows apps" describes the necessary manifest element, under "Fusion manifest for an unpackaged Win32 app". For AutoHotkey's embedded manifest, it is just a case of inserting the following inside the <v3:windowsSettings> ... </v3:windowsSettings> element:


<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>

Note: On Windows versions prior to 1903, the active code page will be an ANSI code page as before.
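
Since the manifest setting is silently ignored on older systems, a script may want to check which code page is actually active. A minimal sketch, assuming AutoHotkey v2 syntax; it calls the Win32 function GetACP, which returns 65001 when UTF-8 is the active code page:

```autohotkey
; AutoHotkey v2 sketch: report whether UTF-8 is the active code page.
CP_UTF8 := 65001                   ; code page identifier for UTF-8
acp := DllCall("GetACP", "UInt")   ; Win32 GetACP returns the active code page ID

if (acp = CP_UTF8)
    MsgBox "The active code page is UTF-8."
else
    MsgBox "The active code page is " acp " (not UTF-8)."
```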

Note: Reading ANSI-encoded files with "cp0" or omitting the code page will not work, because "cp0" will equate to UTF-8, not whatever code page the file was encoded with. Such files can be read by explicitly specifying the actual code page (such as cp1252) when calling FileOpen or FileRead, or setting it as the default with FileEncoding.
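
As a rough illustration of the three options mentioned above, assuming AutoHotkey v2 syntax and a hypothetical file "legacy.txt" that was saved as Windows-1252:

```autohotkey
; AutoHotkey v2 sketch. "legacy.txt" is a hypothetical file encoded as cp1252.
path := "legacy.txt"

; Option 1: name the code page when reading the whole file.
text := FileRead(path, "CP1252")

; Option 2: name the code page when opening a file object.
f := FileOpen(path, "r", "CP1252")
text := f.Read()
f.Close()

; Option 3: make cp1252 the default for subsequent file reads.
FileEncoding "CP1252"
text := FileRead(path)
```

Naming the code page explicitly keeps the script's behaviour the same whether or not UTF-8 is the active code page.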


UTF-8 StdOut

There is one particularly useful API that I only recently realized is affected by this modification: WScript.Shell. The documentation for Run contains examples that use WScript.Shell to execute a process and catch its output, but it normally has one big limitation: it only supports ANSI output. All characters that aren't in the active code page are replaced with question marks. However, if the active code page is UTF-8, WScript.Shell is then able to handle UTF-8.

To make use of this with WScript.Shell, one must also ensure that the process writing to stdout is using UTF-8. I wasn't able to get the dir command internal to cmd.exe to do that, but if a process executed via cmd.exe outputs UTF-8, it does appear to work. For instance, executing AutoHotkey.exe /ErrorStdOut=UTF-8 produced error messages with Unicode characters (in the supplementary plane) that I was able to catch with WScript.Shell and display correctly with MsgBox.
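
The scenario above might look something like the following. This is a sketch only, assuming AutoHotkey v2 and an executable whose manifest enables UTF-8 as the active code page; RunGetOutput is a hypothetical helper modelled on the WScript.Shell examples in the Run documentation, and "test.ahk" is a hypothetical script that contains an error:

```autohotkey
; AutoHotkey v2 sketch. Only gives full Unicode output if the active code
; page of this process is UTF-8 (see the manifest element above).

; Hypothetical helper: run a command via cmd.exe and return its output.
RunGetOutput(command) {
    shell := ComObject("WScript.Shell")
    exec := shell.Exec(A_ComSpec " /C " command)
    return exec.StdOut.ReadAll()
}

; "test.ahk" is a hypothetical script which fails to load.
; 2>&1 merges stderr into stdout, in case the messages are written there.
; The outer pair of quotes is stripped by cmd.exe.
command := '""' A_AhkPath '" /ErrorStdOut=UTF-8 "test.ahk" 2>&1"'
MsgBox RunGetOutput(command)
```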

Note that even without UTF-8 support, it's possible to get UTF-8 with the "mojibake" method; i.e. have the program output UTF-8 but let WScript.Shell apply ANSI->UTF-16 conversion, then perform UTF-16->ANSI conversion and finally reinterpret it with UTF-8->UTF-16 conversion. I suppose that this might fail if some UTF-8 bytes aren't valid ANSI bytes.
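
The round trip described above could be sketched as follows, assuming AutoHotkey v2 and an ANSI (not UTF-8) active code page. FixMojibake is a hypothetical name; it takes a string that WScript.Shell produced by decoding UTF-8 bytes as ANSI:

```autohotkey
; AutoHotkey v2 sketch of the "mojibake" method.
; str: text that was really UTF-8 but was decoded as ANSI by WScript.Shell.
FixMojibake(str) {
    ; UTF-16 -> ANSI: recover the original bytes (CP0 = active code page).
    buf := Buffer(StrPut(str, "CP0"))   ; size includes the null terminator
    StrPut(str, buf, "CP0")
    ; UTF-8 -> UTF-16: reinterpret those bytes as UTF-8.
    return StrGet(buf, "UTF-8")
}
```

If a byte has no mapping in the ANSI code page, the first conversion cannot reproduce it and the result will be damaged, which is the failure case suggested above. The function should not be applied when the active code page is already UTF-8, since the text is then decoded correctly in the first place.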


UTF-8 INI Files

Another implication of the active code page being UTF-8 is that INI files which do not have a UTF-16 byte order mark will be read as UTF-8. IniWrite creates UTF-16 files, but any existing files that aren't UTF-16 will only be handled correctly if any non-ASCII characters are encoded as UTF-8. If the active code page is UTF-8, a script can force IniWrite to use UTF-8 by ensuring the file is created without a byte order mark:


FileOpen("filename.ini", "w", "UTF-8-RAW")

Note that this just creates an empty file, which is assumed to use the active code page. Unless UTF-8 has been enabled as described above, the active code page is not normally UTF-8.
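
Put together, the approach might look like this. A minimal sketch, assuming AutoHotkey v2, UTF-8 as the active code page, and a hypothetical file name, section and key:

```autohotkey
; AutoHotkey v2 sketch. Assumes the active code page is UTF-8.
iniPath := A_ScriptDir "\settings.ini"   ; hypothetical path

; Create an empty file without a byte order mark, but only if it doesn't
; exist yet, so that an existing file isn't truncated.
if !FileExist(iniPath)
    FileOpen(iniPath, "w", "UTF-8-RAW").Close()

; The file has no UTF-16 BOM, so it is written in the active code page.
IniWrite "café", iniPath, "General", "Name"
MsgBox IniRead(iniPath, "General", "Name")
```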


UTF-8 System-wide

Windows version 1903 also added the capability to set UTF-8 as the system default code page, but this breaks some applications.

Post by jibap » 13 Dec 2024, 12:29

Hi,

Really good post. I searched for a long time about this and finally found it.

My script uses WScript.Shell to run a PowerShell script which returns output (in UTF-8). In the console, terminal or PowerShell, the output is OK (with accents, because I'm French), but not from my AHK script.

When I activate UTF-8 system-wide, it works, but as you said, "this breaks some applications."
There is a solution to not activate it ?

Thanks

JB

Post by lexikos » 15 Dec 2024, 19:16

jibap wrote (13 Dec 2024, 12:29): "There is a solution to not activate it ?"

What?

Everything in my post above "UTF-8 System-wide" is about enabling UTF-8 as the active code page only for AutoHotkey. The part under "UTF-8 System-wide" is a vague reference to (iirc) a registry setting which affects all programs. If you enable UTF-8 for programs which do not support it, you will create problems. The solution is to not do that.

Post by jibap » 21 Dec 2024, 15:22

Hi,

Sorry for answering late (I forgot to subscribe to the topic...).

It seems that the French translation wasn't very good. After reading it again, I understand that your method also works with WScript.Shell.

But I don't understand what I have to do...

Post by thqby » 23 Dec 2024, 03:09

I just recently ran into this issue. When UTF-8 is activated system-wide, cmd does not properly decode Unicode characters passed in from stdin.

For details see https://github.com/thqby/ahk2_lib/issues/78
