UTF-8 ini files

Post your working scripts, libraries and tools for AHK v1.1 and older
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

UTF-8 ini files

18 Oct 2017, 08:33

A workaround to create 'UTF-8 ini files'.

Note: ANSI/Unicode ini files cannot handle certain characters; this limitation also affects 'UTF-8 ini files'.

IniRead/IniWrite/IniDelete are based on these Windows functions:
GetPrivateProfileString function (Windows)
https://msdn.microsoft.com/en-us/librar ... s.85).aspx
WritePrivateProfileString function (Windows)
https://msdn.microsoft.com/en-us/librar ... s.85).aspx

Related functions:
GetPrivateProfileSection function (Windows)
https://msdn.microsoft.com/en-us/librar ... s.85).aspx
GetPrivateProfileSectionNames function (Windows)
https://msdn.microsoft.com/en-us/librar ... s.85).aspx
WritePrivateProfileSection function (Windows)
https://msdn.microsoft.com/en-us/librar ... s.85).aspx

All of these functions can only handle ANSI and UTF-16.

The following code provides a way to use UTF-8 instead, essentially by using an ANSI ini, and converting between UTF-8/UTF-16 (or UTF-8/ANSI) every time you do a read/write.

;==================================================

Code:

q:: ;test UTF-8 conversion
vText := Chr(8730) Chr(33) Chr(333) Chr(3333) Chr(33333) Chr(8730)
MsgBox, % vText
vUtf8 := JEE_StrTextToUtf8Bytes(vText)
MsgBox, % vUtf8
vText2 := JEE_StrUtf8BytesToText(vUtf8)
MsgBox, % (vText = vText2)
return

w:: ;test 'UTF-8 ini files'
vText := ";this line is required for a 'UTF-8 ini file'"
vPath := A_Desktop "\MyUtf8Ini.ini"
FileAppend, % vText, % "*" vPath, UTF-8
vSection := Chr(8730) "Section" Chr(8730)
vKey := Chr(8730) "Key" Chr(8730)
vValue := Chr(8730) Chr(33) Chr(333) Chr(3333) Chr(33333) Chr(8730)
JEE_IniWriteUtf8(vValue, vPath, vSection, vKey)
MsgBox, % JEE_IniReadUtf8(vPath, vSection, vKey)
return

;==================================================

;note: a 'UTF-8 ini file' will need a comment as the first line
;e.g. ';my comment'
JEE_IniReadUtf8(vPath, vSection:="", vKey:="", vDefault:="")
{
	local vOutput
	vSection := JEE_StrTextToUtf8Bytes(vSection)
	vKey := JEE_StrTextToUtf8Bytes(vKey)
	IniRead, vOutput, % vPath, % vSection, % vKey, % vDefault
	;note: IniRead does not set ErrorLevel, so it is not checked here
	return JEE_StrUtf8BytesToText(vOutput)
}

;==================================================

JEE_IniWriteUtf8(vValue, vPath, vSection, vKey:="")
{
	vSection := JEE_StrTextToUtf8Bytes(vSection)
	vKey := JEE_StrTextToUtf8Bytes(vKey)
	vValue := JEE_StrTextToUtf8Bytes(vValue)
	IniWrite, % vValue, % vPath, % vSection, % vKey
	return !ErrorLevel
}

;==================================================

JEE_IniDeleteUtf8(vPath, vSection, vKey:="")
{
	vSection := JEE_StrTextToUtf8Bytes(vSection)
	vKey := JEE_StrTextToUtf8Bytes(vKey)
	IniDelete, % vPath, % vSection, % vKey
	return !ErrorLevel
}

;==================================================

JEE_StrUtf8BytesToText(vUtf8)
{
	if A_IsUnicode
	{
		VarSetCapacity(vUtf8X, StrPut(vUtf8, "CP0"))
		StrPut(vUtf8, &vUtf8X, "CP0")
		return StrGet(&vUtf8X, "UTF-8")
	}
	else
		return StrGet(&vUtf8, "UTF-8")
}

;==================================================

JEE_StrTextToUtf8Bytes(vText)
{
	VarSetCapacity(vUtf8, StrPut(vText, "UTF-8"))
	StrPut(vText, &vUtf8, "UTF-8")
	return StrGet(&vUtf8, "CP0")
}

;==================================================
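
For readers who want to see the byte-level idea outside AutoHotkey, here is a minimal Python sketch of what the two conversion helpers above appear to do. It is not the author's code. It assumes the system ANSI code page (CP0) is Windows-1252, and the function names are made up for this illustration. Note that Python's cp1252 codec rejects the five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so this sketch only round-trips characters whose UTF-8 bytes avoid those values.

```python
# Sketch of the UTF-8 <-> ANSI round trip, assuming ANSI = Windows-1252.
ANSI = "cp1252"  # assumption: stands in for the system code page (CP0)


def text_to_utf8_bytes(text: str) -> str:
    """Encode text as UTF-8, then reinterpret those bytes as ANSI characters."""
    return text.encode("utf-8").decode(ANSI)


def utf8_bytes_to_text(ansi_text: str) -> str:
    """Turn the ANSI characters back into bytes, then decode them as UTF-8."""
    return ansi_text.encode(ANSI).decode("utf-8")


original = "\u221a"  # square root sign, Chr(8730) in the AutoHotkey test
stored = text_to_utf8_bytes(original)

print(original.encode("utf-8").hex(" "))       # e2 88 9a
print(len(stored))                             # 3
print(utf8_bytes_to_text(stored) == original)  # True
```

The string held in `stored` is what would end up in the ANSI ini; it is only meaningful again after the reverse conversion.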
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
dd900
Posts: 121
Joined: 27 Oct 2013, 16:03

Re: UTF-8 ini files

29 Oct 2017, 15:12

Nice code. But what's wrong with the normal Ini commands?

Works fine:

Code:

IniWrite, % Chr(8730) Chr(33) Chr(333) Chr(3333) Chr(33333) Chr(8730), ini.ini, sec, key
IniRead, nonASCII, ini.ini, sec, key
MsgBox, % nonASCII

Notepad++ is telling me the above ini is encoded with UCS-2 LE BOM.

Funny, because the documented workaround does not work for me:

Code:

FileAppend,, ini.ini, UTF-8-RAW
IniWrite, % Chr(8730) Chr(33) Chr(333) Chr(3333) Chr(33333) Chr(8730), ini.ini, sec, key
IniRead, nonASCII, ini.ini, sec, key
MsgBox, % nonASCII

Notepad++ says UTF-8, but the characters in the ini are not UTF-8.

Maybe you can shed some light on this for me?
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: UTF-8 ini files

29 Oct 2017, 15:24

UCS-2 LE BOM = UTF-16 LE BOM

There's nothing 'wrong' with the IniRead/IniWrite functions per se, just that they can only handle UTF-16 (UTF-16 files are often bigger in size than UTF-8 files if your language uses the Latin alphabet) and ANSI (Unicode to ANSI is lossy).

FileAppend,, ini.ini, UTF-8-RAW
- This creates a blank (0 byte) file. Because the file then already exists, IniWrite does not create a new one, and the existing blank file is regarded as ANSI (see below), whether AHK is ANSI or Unicode.
- AHK Unicode creates new ini files as UTF-16.
- AHK ANSI creates new ini files as ANSI.
- AHK Unicode and ANSI see any existing file as ANSI, unless it has a UTF-16 BOM. This means that both see any blank (0 byte) file as ANSI.
- FileAppend with 'UTF-8' appends the text, and if the file is empty/doesn't exist, prepends a BOM. Creating a file by appending an empty string results in a 3-byte file (the 3 bytes are the BOM).
- FileAppend with 'UTF-8-RAW' appends the text, but never prepends a BOM. Creating a file by appending an empty string results in a 0-byte file.
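
The BOM rules in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python's `utf-8-sig` codec stands in for FileAppend with 'UTF-8', plain `utf-8` stands in for 'UTF-8-RAW', and the two function names are hypothetical helpers, not AutoHotkey or Winapi calls.

```python
import codecs


def bytes_after_empty_append(encoding_name: str) -> bytes:
    """Bytes a new file would contain after appending an empty string."""
    return "".encode(encoding_name)


def ini_encoding_guess(first_bytes: bytes) -> str:
    """Rule from the list: UTF-16 only if a UTF-16 LE BOM is present, else ANSI."""
    return "UTF-16" if first_bytes.startswith(codecs.BOM_UTF16_LE) else "ANSI"


utf8_with_bom = bytes_after_empty_append("utf-8-sig")  # like 'UTF-8'
utf8_raw = bytes_after_empty_append("utf-8")           # like 'UTF-8-RAW'

print(len(utf8_with_bom), utf8_with_bom.hex(" "))  # 3 ef bb bf
print(len(utf8_raw))                               # 0
print(ini_encoding_guess(utf8_raw))                # ANSI
print(ini_encoding_guess(utf8_with_bom))           # ANSI
print(ini_encoding_guess(codecs.BOM_UTF16_LE))     # UTF-16
```

Under this rule a 0-byte file and a file starting with a UTF-8 BOM both count as ANSI, which matches the behaviour described for an existing blank ini.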

FileEncoding
https://autohotkey.com/docs/commands/FileEncoding.htm
•UTF-8: Unicode UTF-8, equivalent to CP65001.
•UTF-16: Unicode UTF-16 with little endian byte order, equivalent to CP1200.
•UTF-8-RAW or UTF-16-RAW: As above, but no byte order mark is written when a new file is created.
IniWrite
https://autohotkey.com/docs/commands/IniWrite.htm
New files are created in either the system's default ANSI code page or UTF-16, depending on the version of AutoHotkey.

...

In Unicode scripts, IniWrite uses UTF-16 for each new file. If this is undesired, ensure the file exists before calling IniWrite.
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: UTF-8 ini files

27 Oct 2018, 11:54

There is more info about the comment line (;this line is required for a 'UTF-8 ini file') here:
IniRead requires blank line - AutoHotkey Community
https://autohotkey.com/boards/viewtopic ... 91#p246291
DRocks
Posts: 565
Joined: 08 May 2018, 10:20

Re: UTF-8 ini files

02 Nov 2018, 05:41

Thank you
pneumatic
Posts: 338
Joined: 05 Dec 2016, 01:51

Re: UTF-8 ini files

29 May 2019, 08:06

jeeswg wrote:
29 Oct 2017, 15:24
UCS-2 LE BOM = UTF-16 LE BOM
[...]

I'm so confused.
I am using ahk Unicode x64.
In my script I create ini files with FileAppend, passing no FileEncoding parameter.
After they are created I use IniRead and IniWrite to them.
I do not explicitly specify any file encoding anywhere in my scripts.
I don't have the first line blank in any of my ini files (my first line contains [Section], second line is a Key)
Notepad++ says all my files created with FileAppend are UTF-8 (not UTF-8-BOM).
Everything is working fine, including reading the first line section with IniRead.

Will it fail on other systems, or can I be confident if it's working on my test system it should work on other systems?

edit: it seems UTF-8 is compatible with IniRead/IniWrite, but UTF-8-BOM is not, as I am able to reproduce the problem reading the first line by manually changing it to UTF-8-BOM in Notepad++.

So by default FileAppend ,, Test.ini creates a 0-byte UTF-8 with no BOM (aka UTF-8-RAW)

This is the default FileEncoding ahk uses, but A_FileEncoding is blank, so the only way to know is with Notepad++.

Can anyone shed some light on why UTF-8 is compatible with IniRead/IniWrite? Is it just coincidence? Should I make sure I specify UTF-16 just to be safe?
pneumatic
Posts: 338
Joined: 05 Dec 2016, 01:51

Re: UTF-8 ini files

29 May 2019, 10:17

I guess UTF-8 is using 1 byte per character, which makes it compatible with ANSI, which is how Get/WritePrivateProfileString interpret the file.

But I still don't understand how it knows how many bytes per character UTF-8 is using, as it could be up to 4.
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: UTF-8 ini files

29 May 2019, 10:46

- A 'UTF-8 ini' is really an ANSI ini.
- If I want to IniWrite a square root sign, √ (U+221A), I convert it to these 3 bytes: E2 88 9A (which look like âˆš when viewed as Windows-1252), and store them in an ANSI ini.
- To IniRead those bytes, I read those 3 bytes from the ANSI ini, and convert them to the square root sign.

- ASCII characters remain the same. Characters 128-1114111 are stored as sequences of 2 to 4 bytes, each byte in the range 128-255. The Winapi functions just see these as characters 128-255. But once we have read those characters into AutoHotkey, we can convert them to Unicode strings internally.

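- If a 3-byte BOM is added to the start of the ANSI ini, the Winapi functions still treat the file as ANSI; the BOM bytes simply become part of the first line. The benefit of the BOM is that text editors recognise the file as UTF-8, so you can easily edit the ini file (sections, keys, values) in a text editor.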
- The benefit of the 'UTF-8 ini', is that you can store the same data you would in a UTF-16 ini, but the file is about half the size (if you mainly use ASCII characters).
- And also, of course, you can store Unicode characters, something you can't do with a standard ANSI ini.

- I always use a BOM with UTF-8 and UTF-16 text files, to avoid any issues with any software.

- Here is what happens when you add a UTF-8 BOM:

Code:

;before:
[Section1]
Key1=Value1

;after (the BOM bytes EF BB BF show up as ï»¿ when the file is read as ANSI):
ï»¿[Section1]
Key1=Value1

;after (workaround):
ï»¿;my comment
[Section1]
Key1=Value1
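
To make the effect of the BOM on the first line concrete, here is a small Python sketch. It is a deliberately simplified stand-in for an ANSI ini parser, not the actual Winapi logic: it assumes Windows-1252 as the ANSI code page and treats any line that starts with "[" as a section header. The name `section_names` is hypothetical.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark


def section_names(raw: bytes) -> list:
    """List section names, treating the file as ANSI (assumed Windows-1252)."""
    names = []
    for line in raw.decode("cp1252").splitlines():
        # Simplification: a header is a line that starts with '[' and has a ']'.
        if line.startswith("[") and "]" in line:
            names.append(line[1:line.index("]")])
    return names


plain = b"[Section1]\r\nKey1=Value1\r\n"
with_bom = BOM + plain
workaround = BOM + b";my comment\r\n" + plain

print(section_names(plain))       # ['Section1']
print(section_names(with_bom))    # []
print(section_names(workaround))  # ['Section1']
```

With the BOM glued to the front of `[Section1]`, the first line no longer starts with a bracket, so the section is not found. Putting a throwaway comment on the first line lets the BOM spoil only the comment.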
pneumatic
Posts: 338
Joined: 05 Dec 2016, 01:51

Re: UTF-8 ini files

30 May 2019, 13:14

Thanks.

To be safe I changed all my ini files to UTF-16, aka "UCS-2 LE BOM", by using the parameter UTF-16 with FileAppend.

In future I think I'll just avoid using IniRead/IniWrite and do it manually.
