Move to Utf-8

Discuss the future of the AutoHotkey language
nnnik

Move to Utf-8

27 Sep 2017, 04:38

jeeswg

Re: Move to Utf-8

27 Sep 2017, 06:36

Interesting article, thanks for sharing.

I did write some UTF-8 functions for AutoHotkey Basic, thinking that I could take advantage of some existing functions that handle ANSI/UTF-8, e.g. FileAppend and Transform (Clipboard).

Unicode functions for AutoHotkey Basic / AutoHotkey x32 ANSI - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=32487

Since then I was thinking that I should start the project from scratch and use UTF-16 instead. I find UTF-16 much easier for handling strings.

What could be useful is to have some custom functions that handle UTF-8, e.g. equivalents for StrLen/InStr/StrReplace that operate on binary data (UTF-8 text fed in via the File object's RawRead). I might work on some functions myself, but I'm unsure at present about the best way to do a case-insensitive UTF-8 search.
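
For instance, here is a minimal sketch (the function name is made up, not an existing built-in; it assumes an AHK v1.1 Unicode build and data already read into a variable, e.g. via the File object's RawRead) of a StrLen equivalent that counts UTF-8 code points in binary data by skipping continuation bytes (10xxxxxx):

Code: Select all

Utf8StrLen(ByRef vData, vSizeBytes)
{
    vCount := 0
    Loop, % vSizeBytes
    {
        vByte := NumGet(vData, A_Index-1, "UChar")
        if ((vByte & 0xC0) != 0x80) ; not a continuation byte, so it starts a new code point
            vCount++
    }
    return vCount
}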

I mention an 'ANSI' mode for RegEx functions here, which could be used for case-sensitive searching:

Wish List 2.0 - AutoHotkey Community
https://autohotkey.com/boards/viewtopic ... 13&t=36789

Btw they suggest using LF instead of CRLF, and omitting the UTF-8 BOM. Hmm. Also, I find it very suspect when people refuse to distinguish between the simple 'length' in UTF-16, i.e. (size in bytes)/2, and the more advanced length, which is smaller than or equal to that, where 4-byte characters (built from surrogate pairs) are counted as 1 character (cf. the usual 2-byte characters). Both measures are useful, and possibly the first is actually the more useful. In theory an additional parameter could be added to StrLen to specify the advanced length.
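
To illustrate, a rough sketch (the name is made up) of that 'advanced' length in current AHK, counting a surrogate pair as one character by not counting trailing (low) surrogates:

Code: Select all

JEE_StrCharLen(vText)
{
    vCount := 0, vAddr := &vText
    Loop, % StrLen(vText)
    {
        vChar := NumGet(vAddr+0, (A_Index-1)*2, "UShort")
        if !(vChar >= 0xDC00 && vChar <= 0xDFFF) ; skip low (trailing) surrogates
            vCount++
    }
    return vCount
}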

One thing I haven't found is WinAPI functions where you can search for a string within a substring; functions only seem to stop searching when a null character is reached. A workaround could be to add in a null character temporarily, and then restore the original character.
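
For example, a sketch of that workaround (it assumes an AHK v1 Unicode build and uses shlwapi's StrStrIW for a case-insensitive search; the window bounds are 1-based character positions): temporarily null-terminate the haystack just after the search window, search, then restore the original character.

Code: Select all

JEE_InStrWindow(ByRef vHaystack, vNeedle, vPos1, vPos2)
{
    vAddr := &vHaystack
    vOrig := NumGet(vAddr+0, vPos2*2, "UShort") ; character just after the window
    NumPut(0, vAddr+0, vPos2*2, "UShort") ; temporarily insert a null terminator
    vRet := DllCall("shlwapi\StrStrIW", "Ptr", vAddr + (vPos1-1)*2, "WStr", vNeedle, "Ptr")
    NumPut(vOrig, vAddr+0, vPos2*2, "UShort") ; restore the original character
    return vRet ? ((vRet - vAddr) // 2) + 1 : 0 ; 1-based position, or 0 if not found
}
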
nnnik

Re: Move to Utf-8

27 Sep 2017, 07:05

Also, AHK doesn't distinguish between the 2 sizes either. Imo an additional "mode" parameter would be a horrible design choice; I would prefer a new function instead (e.g. StrLen and ChrLen).
This is also the reason why I am suggesting this. Currently AHK will split surrogate pairs into 2 parts in StrSplit, Loop Parse and StringSplit. While I do like the consistency, it should be obvious why that is problematic at best.
There is also a third way of counting, which only counts the resulting visible characters (grapheme clusters).

It's going to be a lot easier if you forget what ANSI actually means - no matter what it could be useful for.

Why rely on WinAPI? We can write our own function for this.

just me

Re: Move to Utf-8

27 Sep 2017, 08:30

Unfortunately, AHK depends on Windows API calls, and most of these functions don't support UTF-8. This means that every time you pass text to, or retrieve text from, an API call, the text needs to be converted. It might decrease performance and increase memory usage in some cases.
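
To illustrate what such a conversion looks like at the script level (just a sketch using AHK's own StrPut/StrGet; the variable names are only examples), here is a round trip from AHK's native UTF-16 to a UTF-8 buffer and back:

Code: Select all

vText := "Übergrößenträger"
; UTF-16 (AHK's native format) -> UTF-8 buffer
VarSetCapacity(vUtf8, StrPut(vText, "UTF-8")) ; required size in bytes, incl. null
StrPut(vText, &vUtf8, "UTF-8")
; UTF-8 buffer -> UTF-16 again, e.g. after receiving UTF-8 from an API or file
vText2 := StrGet(&vUtf8, "UTF-8")
MsgBox, % vText2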

As soon as Microsoft changes to UTF-8, AHK should change too. ;)

nnnik

Re: Move to Utf-8

27 Sep 2017, 10:15

Umm, no.
If you have read the link I provided, you will see why that argument doesn't matter.

jeeswg

Re: Move to Utf-8

27 Sep 2017, 10:48

So your thread proposes to internally use UTF-8 instead of UTF-16 LE, or to add new functions to handle UTF-8 data, or both?

I mention the WinAPI because that usually means good performance; otherwise, machine code functions are a possibility. I would be glad to know of any fast way to do a case-insensitive search on UTF-8 data (or possibly a way to convert UTF-8 data, or binary data interspersed with occasional UTF-8 text, to lowercase). I'd like to have fast techniques for searching for text in text/binary/all files (on the hard drive), whether ANSI/UTF-8/UTF-16.

Out of interest, are there any decent UTF-8 WinAPI functions, other than MultiByteToWideChar and WideCharToMultiByte?

Simple is good, hence continuing to treat strings as simple 2-byte units. In fact, I would like to have a version of Ord that works exactly as Asc did, i.e. one that only checks the first 2 bytes. Perhaps Ord(vText, 1) cf. Ord(SubStr(vText, 1, 1)).
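
For the record, a sketch of such a function (the name is made up; it just reads the first UTF-16 code unit and ignores any surrogate pair):

Code: Select all

JEE_Asc(vText)
{
    return (vText = "") ? 0 : NumGet(&vText, 0, "UShort") ; first 2 bytes only
}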

Have you written any UTF-8 data functions, or character-identifying functions for UTF-16? E.g. a version of StrSplit that inserts a separator between every true 'character', handling surrogate pairs; you could then use that with Loop Parse. Btw, are there other complications to do with 'combining characters'? Cheers.
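
Here is a sketch along those lines (the name is made up; it returns an array of 'true' characters, keeping surrogate pairs together, though it makes no attempt to handle combining characters). You could then iterate the array with a for loop in place of Loop Parse.

Code: Select all

JEE_StrCharSplit(vText)
{
    oArray := [], vLen := StrLen(vText), vIndex := 1
    while (vIndex <= vLen)
    {
        vChar := NumGet(&vText, (vIndex-1)*2, "UShort")
        vSize := (vChar >= 0xD800 && vChar <= 0xDBFF) ? 2 : 1 ; high surrogate => 2 code units
        oArray.Push(SubStr(vText, vIndex, vSize))
        vIndex += vSize
    }
    return oArray
}
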
just me

Re: Move to Utf-8

27 Sep 2017, 11:24

Hmmm, I cannot find a proven statement in the link saying that it wouldn't affect performance.

But:
Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR.

SirRFI

Re: Move to Utf-8

27 Sep 2017, 15:24

Also this:
Q: What do you think about Byte Order Marks?

A: According to the Unicode Standard (v6.2, p.30): "Use of a BOM is neither required nor recommended for UTF-8".

Byte order issues are yet another reason to avoid UTF-16. UTF-8 has no endianness issues, and the UTF-8 BOM exists only to manifest that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, most UTF-8 text files omit BOMs today.

Using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation. This is unacceptable.
AHK scripts still require a BOM to actually display Unicode symbols properly.

jeeswg

Re: Move to Utf-8

27 Sep 2017, 17:06

Yes, that's one thing that concerned me in the link. Every decent file format has an identifier at the front, I don't see why text files should be any different. We've had enough problems re. the uncertainty around 'extended ASCII', the last 128 of 256 characters in ANSI files.

The BOM is very useful because it allows you to *know* that the file is UTF-8, especially when other text formats like UTF-16 are going to be with us for the long-term, and which are more efficient than UTF-8 for various languages.

Btw what is the best solution re. case-insensitive UTF-8 search?
- Convert UTF-8 to UTF-16, then to lowercase, then search for an exact needle.
- Convert UTF-8 to UTF-16, then search for a case-insensitive needle (see the sketch after this list).
- Search for the first character uppercase/lowercase, then check the bytes after that. E.g. write in C++, convert to machine code.
- Something else.
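
For what it's worth, a sketch of the second option (it assumes the UTF-8 haystack has been read into a binary buffer of vSizeBytes bytes, e.g. via RawRead; StrGet converts it to UTF-16 and InStr then searches case-insensitively):

Code: Select all

Utf8InStr(ByRef vData, vSizeBytes, vNeedle)
{
    vHaystack := StrGet(&vData, vSizeBytes, "UTF-8") ; UTF-8 bytes -> native UTF-16
    return InStr(vHaystack, vNeedle, false) ; false = case-insensitive
    ; note: the returned position is in UTF-16 characters, not UTF-8 bytes
}
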
SirRFI

Re: Move to Utf-8

27 Sep 2017, 18:02

jeeswg wrote:Yes, that's one thing that concerned me in the link. Every decent file format has an identifier at the front, I don't see why text files should be any different. We've had enough problems re. the uncertainty around 'extended ASCII', the last 128 of 256 characters in ANSI files.

The BOM is very useful because it allows you to *know* that the file is UTF-8, especially when other text formats like UTF-16 are going to be with us for the long-term, and which are more efficient than UTF-8 for various languages.
That's not the point of the "article". They want UTF-8 to become the standard/default encoding, so a BOM wouldn't be needed. What I meant instead is that Unicode AHK 1.1/2.0a doesn't display Unicode characters properly without a BOM, despite knowing it is Unicode. Well, unless it defaults to some other encoding, but I recall seeing only ASCII and UTF-8 mentioned in the docs.

nnnik

Re: Move to Utf-8

28 Sep 2017, 01:09

jeeswg wrote:Yes, that's one thing that concerned me in the link. Every decent file format has an identifier at the front, I don't see why text files should be any different. We've had enough problems re. the uncertainty around 'extended ASCII', the last 128 of 256 characters in ANSI files.
Utf-8 is not a file format but a string format. It won't tell you which format the file is in even if you know that the file is Utf-8.
The BOM is very useful because it allows you to *know* that the file is UTF-8, especially when other text formats like UTF-16 are going to be with us for the long-term, and which are more efficient than UTF-8 for various languages.
You seem to be very confused. The goal is to remove the 16-bit-era relic Utf-16 from the face of this planet, not to maintain or support it in any form.
Btw what is the best solution re. case-insensitive UTF-8 search?
- Convert UTF-8 to UTF-16, then to lowercase, then search for an exact needle.
- Convert UTF-8 to UTF-16, then search for a case-insensitive needle.
Both are horrible ideas

jeeswg

Re: Move to Utf-8

29 Sep 2017, 17:35

Worth a mention: Scintilla controls use UTF-8, while ini files don't support UTF-8 (although you can add a BOM, edit the file in Notepad as UTF-8, read/write it using AHK's IniRead/IniWrite as ANSI, and convert to/from Unicode).

One disadvantage of UTF-8 is that uppercase/lowercase versions of a string don't necessarily have the same byte length.
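
A small example of that (the Unicode uppercase mapping of 'ſ', U+017F, is 'S', U+0053; StrPut here reports the UTF-8 size including the null terminator):

Code: Select all

vSmall := Chr(0x017F), vBig := Chr(0x0053) ; 'ſ' uppercases to 'S'
MsgBox, % (StrPut(vSmall, "UTF-8") - 1) " bytes vs " (StrPut(vBig, "UTF-8") - 1) " byte" ; 2 vs 1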

If UTF-16 creates smaller text files for some languages, it's a good encoding.

I actually thought there should be something like a 'txtx' format. If you want to save a big file that rarely or never needs editing, the PC could run through 20, even 100, well-known compression formats and pick the smallest result. Such files would need a txtx identifier, a compression identifier, a checksum/hash for the original data, and the compressed data. E.g. if I want to list file paths/details or registry information, but would rather not store it in an archive file.

There is no reason why text files shouldn't have advanced lossless formats like png or ape.

That would be the real innovation, not scrapping UTF-16.

People should find it horrifying when anyone suggests that removing a file format or string format identifier is a good idea, especially if that identifier is only 3 bytes long. There are enough problems as it is re. guessing a text file's format.

nnnik

Re: Move to Utf-8

30 Sep 2017, 03:41

In practice, Utf-8 is almost always shorter than Utf-16. The cases where Utf-16 is shorter than Utf-8 are so rare that its existence causes more trouble than it does good.
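
A quick way to see this for typical markup-heavy text (StrPut returns the required buffer size in code units, including the terminator; the sample string is just an example):

Code: Select all

vText := "<p lang=""de"">Die Größe ist wichtig.</p>"
vUtf8Bytes := StrPut(vText, "UTF-8") - 1
vUtf16Bytes := 2*(StrPut(vText, "UTF-16") - 1)
MsgBox, % "UTF-8: " vUtf8Bytes " bytes`nUTF-16: " vUtf16Bytes " bytes"
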
Utf-16 should be considered harmful, since due to the way it works almost any implementation will contain bugs. And why would you need an identifier for the most common and therefore expected encoding?
(To make this more accessible, think of a phone call where one person says "Hey, Utf-8" and the other person responds with "I'm Utf-8".)
Also, the UTF-8 BOM is causing many problems of its own; this suggestion is to remove those problematic practices.
And the suggestion is not to make things more complex, but to remove the unnecessary complexity we currently have.