FileEncoding behavior - ANSI (CP1252), Unicode (UTF-16)

JoeWinograd · 04 Mar 2018, 12:35

According to the FileEncoding doc, the default is ANSI, and it applies to FileRead (as well as FileReadLine, Loop Read, FileAppend, and FileOpen). So, when doing a FileRead of a UTF-16 file without having changed the FileEncoding setting, I would expect it not to read the file correctly, but it does. Here's my test script:

Code:

FileInUTF16:="c:\temp\FileInUTF16.txt"
FileOutANSI:="c:\temp\FileOutANSI.txt"

oFile:=FileOpen(FileInUTF16,"r")
encoding:=oFile.Encoding
MsgBox,4096,Debugging,Input file encoding=%encoding% ; shows that encoding is UTF-16
oFile.Close()

FileRead,vFile,%FileInUTF16%
FileDelete,%FileOutANSI%
FileAppend,%vFile%,%FileOutANSI%
If (ErrorLevel<>0)
{
  MsgBox,4112,FileAppend Error,ErrorLevel=%ErrorLevel%
  ExitApp
}

oFile:=FileOpen(FileOutANSI,"r")
encoding:=oFile.Encoding
MsgBox,4096,Debugging,Output file encoding=%encoding% ; shows that encoding is CP1252 (ANSI Latin 1)
oFile.Close()
ExitApp

As you can see, the script makes no changes to the FileEncoding setting, yet it reads the UTF-16 file correctly (and creates a correct ANSI-CP1252 version of it). Can anyone explain why? I'm using AHK U32, if that matters (v1.1.28.00). Thanks, Joe

teadrinker · 04 Mar 2018, 17:47

JoeWinograd wrote: So, when doing a FileRead of a UTF-16 file without having changed the FileEncoding setting, I would expect it not to read the file correctly, but it does.

Hi,

It depends on the BOM presence in your FileInUTF16.txt.
Try:

Code:

FileAppend, test, FileInUTF16.txt, UTF-16-RAW   ; UTF-16 without BOM

oFile:=FileOpen("FileInUTF16.txt","r")
encoding := oFile.Encoding
text := oFile.Read()
oFile.Close()
FileDelete, FileInUTF16.txt
MsgBox, % "text: " . text . "`n`nencoding: " . encoding

FileAppend, test, FileInUTF16.txt, UTF-16       ; UTF-16 with BOM

oFile:=FileOpen("FileInUTF16.txt","r")
encoding := oFile.Encoding
text := oFile.Read()
oFile.Close()
FileDelete, FileInUTF16.txt
MsgBox, % "text: " . text . "`n`nencoding: " . encoding

JoeWinograd · 04 Mar 2018, 21:36

Well, now, that's very interesting! I've been writing AHK programs for many years that read and write plain text files, seemingly correctly, but I guess it's fair to say that I've been blissfully ignorant of this issue.

A few things I'm not understanding:

(1) Your code shows encoding: CP1252, even though your FileAppend sets the encoding to UTF-16-RAW. I would expect it to show encoding: UTF-16-RAW.

(2) The file is read correctly (i.e., text: test) if there is FileEncoding,UTF-16-RAW (or FileEncoding,UTF-16) before the FileOpen, which makes perfect sense. But that presents a Catch-22, i.e., how can you determine the encoding in order to read the file correctly if the encoding that is returned is wrong when you haven't set the encoding correctly via FileEncoding?

(3) As noted above, FileEncoding,UTF-16-RAW and FileEncoding,UTF-16 before the FileOpen both result in a correct read. I would expect only FileEncoding,UTF-16-RAW to work, since that's what you used to encode it.

(4) As also noted above, FileEncoding,UTF-16-RAW and FileEncoding,UTF-16 before the FileOpen both result in encoding of UTF-16 being returned. I would expect FileEncoding,UTF-16-RAW to return encoding of UTF-16-RAW.

Thanks for your insight on this. Regards, Joe

teadrinker · 05 Mar 2018, 09:23

JoeWinograd wrote: (1) Your code shows encoding: CP1252, even though your FileAppend sets the encoding to UTF-16-RAW. I would expect it to show encoding: UTF-16-RAW.

FileAppend overrides the default encoding for current command only, not for the rest of the script.

JoeWinograd wrote: (2) The file is read correctly (i.e., text: test) if there is FileEncoding,UTF-16-RAW (or FileEncoding,UTF-16) before the FileOpen, which makes perfect sense. But that presents a Catch-22, i.e., how can you determine the encoding in order to read the file correctly if the encoding that is returned is wrong when you haven't set the encoding correctly via FileEncoding?

Now I am slightly discouraged. For me the code

Code:

FileAppend, test, FileInUTF16.txt, UTF-16-RAW   ; UTF-16 without BOM

oFile:=FileOpen("FileInUTF16.txt","r")
encoding := oFile.Encoding
text := oFile.Read()
oFile.Close()
FileDelete, FileInUTF16.txt
MsgBox, % "text: " . text . "`n`nencoding: " . encoding

shows

text: t
encoding: CP1251

CP1251 is my default encoding, so that part is right. But my code can't read the text correctly in this case without specifying oFile.Encoding := "UTF-16", which is what I expected.
Can you check it once again?

Code:

FileDelete, FileInUTF16.txt
FileAppend, test, FileInUTF16.txt, UTF-16-RAW   ; UTF-16 without BOM

oFile:=FileOpen("FileInUTF16.txt","r")
encoding := oFile.Encoding
text := oFile.Read()
oFile.Close()
FileDelete, FileInUTF16.txt
MsgBox, % "text: " . text . "`n`nencoding: " . encoding

JoeWinograd · 07 Mar 2018, 15:55

Hi teadrinker,
Thanks very much for sticking with me on this one.

Your first set of code above gives this here:

text: t
encoding: CP1252

So, the text and encoding output are both wrong. The FileAppend is making it a UTF-16-RAW file, not a CP1252 file (note that Msgbox % A_FileEncoding shows null/empty on my system).

Your second set of code above gives the same output, viz.:

text: t
encoding: CP1252

Btw, your first and second sets of code above are identical, except for the FileDelete in the second one — is that what you intended?

The file is read correctly (i.e., text: test) if there is FileEncoding,UTF-16-RAW (or FileEncoding,UTF-16) before the FileOpen.

Problem is, since the code returns encoding as CP1252, there's no way of knowing that it's really a UTF-16-RAW (or UTF-16) file, and, thus, no way of reading it correctly — unless I'm missing something here. Regards, Joe

jeeswg · 07 Mar 2018, 17:30

- If a file starts with a UTF-8 or UTF-16 LE BOM (LE: little endian), then I would assume that those files are UTF-8 or UTF-16 LE text. (It is theoretically possible that it's an ANSI text file with some curious initial characters.)
- Otherwise, if there is no BOM, I assume it's ANSI. There is no perfect assumption. You could try to guess, e.g. by looking for null characters, which implies a UTF-16 LE file. I always make sure to save UTF-8 and UTF-16 LE files with a BOM.
- If you know that a file is UTF-8 or UTF-16 LE, but without a BOM, you could read in the binary data and then use StrGet.
- When reading files, if AutoHotkey sees a UTF-8 or UTF-16 LE BOM, it usually does something *smart*: it may override any encoding parameters that you specify.
- If you want to force read a UTF-8 or UTF-16 LE file, that has a BOM, as another encoding, AutoHotkey makes this difficult, because of its *smart* behaviour. I will double-check what the best approach is, possibly to read binary data and use StrGet.
- I'm not sure when writing to files, if AutoHotkey uses *smart* behaviour, if the file already has a BOM.
- In general, be careful when reading from/writing to text files, in case AutoHotkey does something *smart*, even when you've specified a parameter in an attempt to override any automatic behaviour.
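The BOM check from the list above can be sketched as a small function. This is only an illustration under stated assumptions: `DetectBom` and its return values are made-up names, not part of AutoHotkey, and it recognises only the UTF-8 and UTF-16 LE byte order marks. A file without a BOM still has to be guessed at by other means.

```autohotkey
; Hypothetical helper (not a built-in): guess a file's encoding from its BOM.
; Returns "UTF-8", "UTF-16" or "" (no BOM found, or the file could not be opened).
DetectBom(path)
{
    oFile := FileOpen(path, "r")
    if !IsObject(oFile)
        return ""
    oFile.Pos := 0                      ; FileOpen positions past a BOM, so rewind to byte 0
    VarSetCapacity(buf, 3, 0)           ; zero-filled, so short files compare safely
    count := oFile.RawRead(buf, 3)      ; number of bytes actually read
    oFile.Close()
    b0 := NumGet(buf, 0, "UChar"), b1 := NumGet(buf, 1, "UChar"), b2 := NumGet(buf, 2, "UChar")
    if (count >= 3 && b0 = 0xEF && b1 = 0xBB && b2 = 0xBF)
        return "UTF-8"
    if (count >= 2 && b0 = 0xFF && b1 = 0xFE)
        return "UTF-16"
    return ""                           ; ANSI, or UTF-8/UTF-16 written without a BOM
}

MsgBox, % "BOM: " DetectBom(A_ScriptDir "\FileInUTF16.txt")
```

An empty result does not prove the file is ANSI; it only means no BOM was found.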

jeeswg · 07 Mar 2018, 18:04

Here's an example of reading/writing files with different encodings.

Code:

w:: ;create text files with different encodings
vList := "CP1252,UTF-8,UTF-8-RAW,UTF-16,UTF-16-RAW"
vDir := A_ScriptDir "\! eg files"
if !FileExist(vDir)
	FileCreateDir, % vDir
Loop, Parse, vList, % ","
{
	vPath := vDir "\square root " A_LoopField ".txt"
	if !FileExist(vPath)
		FileAppend, % Chr(8730), % "*" vPath, % A_LoopField
}
return

q:: ;force read UTF-8/UTF-16 files as ANSI
vDir := A_ScriptDir "\! eg files"
vPath8 := vDir "\square root UTF-8.txt"
vPath8R := vDir "\square root UTF-8-RAW.txt"
vPath16 := vDir "\square root UTF-16.txt"
vPath16R := vDir "\square root UTF-16-RAW.txt"
FileRead, vText8, % "*P1252 " vPath8 ;reads as UTF-8 not as CP1252
FileRead, vText8R, % "*P1252 " vPath8R ;reads as CP1252
FileRead, vText16, % "*P1252 " vPath16 ;reads as UTF-16 LE not as CP1252
FileRead, vText16R, % "*P1252 " vPath16R ;reads as CP1252
MsgBox, % vText8
MsgBox, % vText8R
MsgBox, % vText16
MsgBox, % vText16R

oFile := FileOpen(vPath8, "r")
oFile.Pos := 0
oFile.RawRead(vData, vSize := oFile.Length)
oFile.Close()
vText8 := StrGet(&vData, vSize, "CP1252")
MsgBox, % vText8

oFile := FileOpen(vPath16, "r")
oFile.Pos := 0
oFile.RawRead(vData, vSize := oFile.Length)
oFile.Close()
vText16 := StrGet(&vData, vSize, "CP1252")
MsgBox, % vText16
return

JoeWinograd · 08 Mar 2018, 03:02

Hi jeeswg,
Thanks for all of these thoughts. Overall, great stuff! A few comments:

jeeswg wrote: I always make sure to save UTF-8 and UTF-16 LE files with a BOM.

I haven't thought about this issue before, so I've been using FileAppend without an Encoding option to create all the text files in my AHK programs. According to the doc, this means that it uses the system default ANSI code page.

jeeswg wrote: When reading files, if AutoHotkey sees a UTF-8 or UTF-16 LE BOM, it usually does something *smart*, it may override any encoding parameters that you specify.

Yes, it's clearly "smart" and figures out what to do when reading the file, even when the file encoding doesn't match the current encoding setting. This whole thing came to light for me when I wrote a script to read the exported XML of Task Scheduler tasks and write them out via FileAppend as ANSI files. It works, which means that FileRead is handling the XML fine, even though I haven't set the encoding and the XML is UTF-16 (LE BOM).

jeeswg wrote: I'm not sure when writing to files, if AutoHotkey uses *smart* behaviour, if the file already has a BOM.

It seems to. I just ran a test where I did a FileAppend without an Encoding option to one of the UTF-16 XML files — worked fine.

jeeswg wrote: In general, be careful when reading from/writing to text files, in case AutoHotkey does something *smart*, even when you've specified a parameter in an attempt to override any automatic behaviour.

I'm sure that's good advice, although I expect that my code will continue to take advantage of the "smart" reading and writing (without attempting to override the default).

Thanks for the sample script to read/write files with different encodings — works perfectly and is illuminating! Regards, Joe

08 Mar 2018, 04:07

JoeWinograd wrote: (3) As noted above, FileEncoding,UTF-16-RAW and FileEncoding,UTF-16 before the FileOpen both result in a correct read. I would expect only FileEncoding,UTF-16-RAW to work, since that's what you used to encode it.

UTF-16, UTF-16-RAW and CP1200 all specify the same encoding. The difference between them is just as documented (see FileEncoding):

UTF-16: Unicode UTF-16 with little endian byte order, equivalent to CP1200.
[...] UTF-16-RAW: As above, but no byte order mark is written when a new file is created.

Really, UTF-16-RAW isn't an encoding; it's a string representing a combination of options ("code page 1200" and "don't write a BOM").
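A quick way to see that the two names differ only in the BOM is to write the same text both ways and compare file sizes. This is a minimal sketch; the file names under A_Temp are arbitrary choices for the demonstration.

```autohotkey
; Write "test" twice: once as UTF-16 (with BOM), once as UTF-16-RAW (no BOM).
pathBom := A_Temp "\bom-demo-bom.txt"   ; arbitrary scratch file names
pathRaw := A_Temp "\bom-demo-raw.txt"
FileDelete, %pathBom%
FileDelete, %pathRaw%
FileAppend, test, %pathBom%, UTF-16
FileAppend, test, %pathRaw%, UTF-16-RAW
FileGetSize, sizeBom, %pathBom%
FileGetSize, sizeRaw, %pathRaw%
FileDelete, %pathBom%
FileDelete, %pathRaw%
MsgBox, % "UTF-16: " sizeBom " bytes`nUTF-16-RAW: " sizeRaw " bytes"
```

The file written as UTF-16 should come out two bytes larger: the FF FE mark at the start. The text bytes themselves are the same in both files.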

JoeWinograd wrote: (4) As also noted above, FileEncoding,UTF-16-RAW and FileEncoding,UTF-16 before the FileOpen both result in encoding of UTF-16 being returned. I would expect FileEncoding,UTF-16-RAW to return encoding of UTF-16-RAW.

Again, this is explicitly covered in the documentation (for File.Encoding, this time):

RetrievedEncoding is never a value with the -RAW suffix, regardless of how the file was opened or whether it contains a byte order mark (BOM). Setting NewEncoding never causes a BOM to be added or removed, as the BOM is normally written to the file when it is first created.

The documentation is a little unclear about when the default encoding is (or is not) used by most of the commands, but there's a hint in the FileOpen documentation:

Encoding: The code page to use for text I/O if the file does not contain a UTF-8 or UTF-16 byte order mark, or if the h (handle) flag is used. If omitted, the current value of A_FileEncoding is used.

jeeswg wrote: If you want to force read a UTF-8 or UTF-16 LE file, that has a BOM, as another encoding, AutoHotkey makes this difficult,

It's not difficult. Just set File.Encoding after opening the file, and reset File.Pos as well if you want to read the BOM itself.
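As a minimal sketch of that approach (the scratch file name and the choice of CP1252 are assumptions made for the demonstration):

```autohotkey
path := A_Temp "\force-read-demo.txt"   ; arbitrary scratch file for the demo
FileDelete, %path%
FileAppend, test, %path%, UTF-16        ; written with a BOM

oFile := FileOpen(path, "r")            ; the BOM selects UTF-16 at this point
detected := oFile.Encoding
oFile.Encoding := "CP1252"              ; override the detected encoding
oFile.Pos := 0                          ; optional: rewind so the BOM bytes are read as text too
text := oFile.Read()
oFile.Close()
FileDelete, %path%
MsgBox, % "detected: " detected "`ntext as CP1252: " text
```

Leaving out the `oFile.Pos := 0` line starts the read after the BOM instead of at the first byte of the file.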

jeeswg wrote: I'm not sure when writing to files, if AutoHotkey uses *smart* behaviour, if the file already has a BOM.

If you overwrite a file (FileOpen with "w" only), the file is not checked for a BOM, or read at all prior to being overwritten.

JoeWinograd · 08 Mar 2018, 04:17

Thanks, lexicos, very helpful. I'll take another spin through the doc. Regards, Joe

jeeswg · 10 Apr 2018, 23:34

FILE OBJECT AND STRGET: TOO HARD FOR NEWBIES
- I believe that the File object and StrGet are quite fiddly to use, and I had to do some very subtle checks to make sure that what I was doing was both correct and safe. Hence I would suggest adding a version of at least one of the simple functions presented here:
file set text/empty/get encoding/force read with specific encoding - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=47084
JEE_FileEmpty
JEE_FileGetEncoding
JEE_FileSetText (aka JEE_FileWrite, this can also achieve what JEE_FileEmpty does)
(I also include JEE_FileReadForce there)
- I would steer newbies clear of FileOpen. FileOpen with 'w' ERASES/EMPTIES the file's contents, just by its presence. Even though it warns that it 'overwrites' the data in the documentation, people might assume that it can only overwrite the data, once you *do* something, not simply by opening the file. Also, people can be unclear what 'overwrites' means, in this context it means 'empties' rather than write over. I.e. it reduces the file size to 0, if you simply open it in write mode and close it.
- Another confusing thing about File objects, is that you can set the encoding using File.Encoding, but AFAIK you have to explicitly add the BOM as a string, Chr(0xFEFF), using oFile.Write, to give the file a BOM.
- I would say that the File object should be thought of as something for experts only, for dealing with binary data, or for dealing with a large file without loading it entirely into memory. IMO the simpler functionality should be available via functions, i.e. 2 or 3 of the 3 functions I outlined above.
- StrGet is fiddly. For example, if you want to apply it to binary data, you can't tell it to read only the first n bytes of data, just the maximum number of characters to return. Thus if you try to read binary data as though it were UTF-8, using StrGet, you can read beyond the data unless it reaches a null byte. To apply StrGet to the binary data of a file, I'm not sure if there is an easy way to ensure that it will reach a null character, unless you create a buffer with a null character at the end and copy the data there, e.g. a buffer that is equal to the size of the file plus 3 null bytes to handle ANSI/UTF-8/UTF-16.
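The padded-buffer idea from the last point could look roughly like this. It is a sketch, not a vetted implementation: `ReadFileAs` is a hypothetical name, the encoding must be passed as a name such as "CP1252" or "UTF-8" (a bare number would be taken by StrGet as a length), and any BOM in the file is decoded as ordinary text.

```autohotkey
; Hypothetical helper: read a whole file as raw bytes and decode it with a given encoding.
ReadFileAs(path, encoding)
{
    oFile := FileOpen(path, "r")
    if !IsObject(oFile)
        return ""
    oFile.Pos := 0                      ; start at byte 0, including any BOM
    size := oFile.Length
    VarSetCapacity(buf, size + 3, 0)    ; zero-filled: the data is followed by at least 3 null bytes
    oFile.RawRead(buf, size)
    oFile.Close()
    return StrGet(&buf, encoding)       ; no length given: StrGet stops at the null terminator
}

MsgBox, % ReadFileAs(A_ScriptDir "\! eg files\square root UTF-8-RAW.txt", "UTF-8")
```

Because StrGet stops at the first null character, this only suits text files; data with embedded nulls would be cut short.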

SET THE TEXT OF AN EXISTING FILE
- This example shows how fiddly it can be to set the text of an existing file:
Help with %A_BACKSPACE% - AutoHotkey Community
https://autohotkey.com/boards/viewtopic ... 60#p210560
The whole thing could be done with one function. Cf. ControlGetText/ControlSetText, or setting the text of a variable, or FileRead, which are easy.
- Note: it's fiddly to make sure that you change the text but don't change the encoding of the file.
- Also, I don't like the use of FileDelete, because users should want to preserve the creation date of the file. (I know that the system sometimes tries to make assumptions to assist programs that delete a file, when they intend to empty it, prior to appending content.)
- Note: users need to know that writing to UTF-8 or UTF-16 without the BOM, means that AutoHotkey can't then automatically identify that file as UTF-8 or UTF-16, it assumes that they are ANSI. I would always encourage people to use the BOM.
