UTF-8 as active code page

Post by lexikos » 25 Nov 2022, 01:58

The active code page is what Win32 functions with the "A" suffix accept or return. "ANSI" builds of AutoHotkey rely on these functions, which is why they can't handle the full range of Unicode characters. Sometimes even Unicode builds of AutoHotkey are affected by such limitations because some APIs don't support UTF-16.
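
For example, a minimal sketch (assuming AutoHotkey v2): GetACP reports the active code page, and DllCall's AStr type converts a string to that code page before the call, so the byte length reported by lstrlenA depends on it.

Code:

MsgBox DllCall("GetACP")  ; e.g. 1252 for the Western European ANSI code page, 65001 when UTF-8 is active
MsgBox DllCall("lstrlenA", "AStr", "héllo")  ; 5 bytes under cp1252, 6 under UTF-8 (the "é" becomes two bytes)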

However, Windows 10 version 1903 added a way for programs to opt in to UTF-8 as the active code page. If the code that was previously dealing with ANSI strings is able to handle UTF-8 correctly, this makes the full range of Unicode characters work with very little effort; otherwise, it can break the program. I won't be enabling it by default because of the implications for compatibility, but users can choose to enable it either by modifying AutoHotkey or by compiling a script with a custom manifest.

The article Use UTF-8 code pages in Windows apps describes the necessary manifest element, under "Fusion manifest for an unpackaged Win32 app". For AutoHotkey's embedded manifest, it is just a matter of inserting the following inside the <v3:windowsSettings> ... </v3:windowsSettings> element:

Code:

<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
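
For a compiled script with a custom manifest, the complete fusion manifest described in that article looks roughly like the following sketch (the assemblyIdentity name and version here are placeholders):

Code:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <assemblyIdentity type="win32" name="MyCompiledScript" version="1.0.0.0"/>
  <application xmlns="urn:schemas-microsoft-com:asm.v3">
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>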

Note: On Windows versions prior to 1903, the active code page will be an ANSI code page as before.

Note: Reading ANSI-encoded files with "cp0" or omitting the code page will not work, because "cp0" will equate to UTF-8, not whatever code page the file was encoded with. Such files can be read by explicitly specifying the actual code page (such as cp1252) when calling FileOpen or FileRead, or setting it as the default with FileEncoding.
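
For example, a legacy cp1252 file can be read like this (the file name is just a placeholder):

Code:

text := FileRead("legacy.txt", "cp1252")  ; explicit code page, since "cp0" now means UTF-8
FileEncoding "cp1252"  ; or make it the default for file operations that omit a code page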


UTF-8 StdOut

There is one particularly useful API that I only recently realized is affected by this modification: WScript.Shell. The documentation for Run contains examples that use WScript.Shell to execute a process and catch its output, but it normally has one big limitation: it only supports ANSI output. All characters that aren't in the active code page are replaced with question marks. However, if the active code page is UTF-8, WScript.Shell is then able to handle UTF-8.
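
The basic capture pattern, along the lines of the examples in the Run documentation, is shown below; with a UTF-8 active code page, ReadAll no longer turns characters outside the old ANSI code page into question marks (the command here is a placeholder):

Code:

shell := ComObject("WScript.Shell")
exec := shell.Exec(A_ComSpec " /C some-command")  ; any command line that writes to stdout
MsgBox exec.StdOut.ReadAll()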

To make use of this with WScript.Shell, one must also ensure that the process writing to stdout is using UTF-8. I wasn't able to get cmd.exe's internal dir command to do that, but if a process executed via cmd.exe outputs UTF-8, it does appear to work. For instance, executing AutoHotkey.exe /ErrorStdOut=UTF-8 produced error messages with Unicode characters (in a supplementary plane) that I was able to catch with WScript.Shell and display correctly with MsgBox.
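
A rough sketch of that experiment (AutoHotkey v2; the temporary script name and its deliberately broken contents are just stand-ins):

Code:

; Write a broken one-line script containing U+10348, a supplementary-plane character.
FileAppend 'MsgBox("𐍈oops', "errdemo.ahk", "UTF-8"
shell := ComObject("WScript.Shell")
; Run it via cmd.exe with errors redirected to stdout as UTF-8, then read the output.
exec := shell.Exec(A_ComSpec ' /C ""' A_AhkPath '" /ErrorStdOut=UTF-8 "errdemo.ahk""')
MsgBox exec.StdOut.ReadAll()  ; the load-time error text, with 𐍈 intact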

Note that even without a UTF-8 active code page, it's possible to recover UTF-8 output with the "mojibake" method: have the program output UTF-8 and let WScript.Shell apply its usual ANSI->UTF-16 conversion, then perform the reverse UTF-16->ANSI conversion to recover the original bytes, and finally reinterpret them with a UTF-8->UTF-16 conversion. I suppose that this might fail if some UTF-8 bytes aren't valid in the ANSI code page.
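
A sketch of that round trip (assuming AutoHotkey v2 and an active code page that is not UTF-8), using StrPut/StrGet to undo the ANSI->UTF-16 conversion and reinterpret the bytes as UTF-8:

Code:

FixMojibake(captured) {
    ; Convert the captured UTF-16 string back to the active code page ("CP0"),
    ; recovering the UTF-8 bytes that WScript.Shell misread as ANSI...
    buf := Buffer(StrPut(captured, "CP0"))
    StrPut(captured, buf, "CP0")
    ; ...then reinterpret those bytes as UTF-8. This fails if the
    ; ANSI->UTF-16->ANSI round trip isn't lossless for some byte values.
    return StrGet(buf, "UTF-8")
}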


UTF-8 INI Files

Another implication of the active code page being UTF-8 is that INI files which do not have a UTF-16 byte order mark will be read as UTF-8. IniWrite creates new files as UTF-16, but any existing file that isn't UTF-16 will only be handled correctly if its non-ASCII characters are encoded as UTF-8. If the active code page is UTF-8, a script can force IniWrite to use UTF-8 by ensuring the file is created without a byte order mark:

Code:

FileOpen("filename.ini", "w", "UTF-8-RAW")

Note that this just creates an empty file; the INI functions will then assume it uses the active code page, which is not normally UTF-8.
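
Putting it together, a sketch assuming the active code page is UTF-8 (the file name is a placeholder):

Code:

; Pre-create the file empty and BOM-less so IniWrite doesn't create it as UTF-16;
; the INI functions then treat it as the active code page, i.e. UTF-8.
FileOpen("settings.ini", "w", "UTF-8-RAW").Close()
IniWrite "héllo wörld", "settings.ini", "Section", "Key"
MsgBox IniRead("settings.ini", "Section", "Key")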


UTF-8 System-wide

Windows 10 version 1903 also added the ability to set UTF-8 as the system-wide default code page, but doing so breaks some applications.
