Re: Character Encoding Detection
In the past, people looking for a C/C++ library that could guess** a text file's character encoding would often be referred to Mozilla's Universal Charset Detector. AFAIK, Mozilla stopped maintaining this library many years ago; however, it seems to have been continued on GitHub:
https://github.com/Joungkyun/libchardet -- this is the upstream for several Linux distros, and I believe it still compiles on Windows (though perhaps now requiring MinGW).
** I say "guess" because it is impossible to determine an arbitrary text file's character encoding with 100% accuracy. It is entirely possible for a given byte sequence to decode to valid character sequences under multiple encodings. Mozilla UCD uses heuristics / statistics to make an educated guess, but it is still just a guess.
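To make the ambiguity concrete, here is a minimal Python sketch. The byte values are just an example I picked for illustration (they have nothing to do with how Mozilla UCD works internally); the point is only that the same bytes decode without error under more than one encoding:

```python
# The same two bytes are valid in several encodings, but mean different things.
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")      # one character: U+00E9 (e with acute accent)
as_latin1 = data.decode("latin-1")  # two characters: U+00C3 and U+00A9
as_cp1252 = data.decode("cp1252")   # the same two characters as latin-1 for these bytes

for name, text in [("utf-8", as_utf8), ("latin-1", as_latin1), ("cp1252", as_cp1252)]:
    # Print the code points so the difference is visible regardless of terminal encoding.
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

None of the three calls raises an error, so nothing in the bytes themselves tells you which reading was intended. A detector can only weigh which result looks more plausible.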
Folks who are interested in charset detection might be curious to poke inside the source code of Mozilla UCD, to see how others approached the problem.
Re: Byte-order-marks (BOMs)
BOMs are a way to unambiguously indicate the encoding of a text file. At first glance, that sounds great. But in practice they can often be more trouble than they're worth:
- UTF-8 files are not required to have a BOM, and many (if not most) do not. UTF-16 and UTF-32 files are conventionally expected to have a BOM, since it is what tells the reader whether the data is big-endian or little-endian, but some software will write UTF-16 without one.
- BOMs can be a nuisance, because they are not always handled correctly by software. For example, if you concatenate two files:
Linux:
cat file1_with_bom.txt file2_with_bom.txt > combined.txt
Windows:
copy file1_with_bom.txt + file2_with_bom.txt combined.txt
...then the BOM from file2_with_bom.txt will be included in the middle of combined.txt, which can cause problems for software that is not expecting it -- and plenty of programs are not expecting it.
- Similar issues can arise when combining files programmatically. For example, Python can automatically detect and remove a BOM when reading a UTF-8 text file if you open the file with the "utf-8-sig" encoding (i.e. encoding="utf-8-sig"). With plain "utf-8", the BOM is included as a character (U+FEFF) at the start of the string that is read from the file, and it is the programmer's responsibility to remove it if necessary, which they often do not know they need to do. The situation is similar, though often slightly different, for other programming languages.
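The concatenation and reading issues above can be sketched in a few lines of Python. The "file" contents here are made up for illustration and held in memory rather than read from disk; only the standard library codecs are used:

```python
import codecs

# Two hypothetical UTF-8 files, each starting with a BOM (bytes EF BB BF).
file1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-level concatenation, which is what `cat` does.
combined = file1 + file2

# Plain "utf-8" keeps every BOM as the character U+FEFF.
plain = combined.decode("utf-8")

# "utf-8-sig" strips a BOM only at the very start of the data.
sig = combined.decode("utf-8-sig")

# One way to merge cleanly: decode each piece with "utf-8-sig" before joining.
clean = file1.decode("utf-8-sig") + file2.decode("utf-8-sig")

print(repr(plain))
print(repr(sig))
print(repr(clean))
```

Note that even "utf-8-sig" does not help with the second BOM once the files have been glued together at the byte level; it has to be dealt with per file, before combining.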
Some text editors mishandle BOMs, even if the only BOM in the file is correctly placed at the beginning. All this considered, I prefer not to use BOMs, except maybe in controlled circumstances where all software that produces and consumes the files is known to be compliant and compatible with them. This would not be the case with AutoHotkey scripts (IMO), where users typically expect to be able to use their preferred text editor to edit scripts -- which might be any one of dozens of different programs -- and to be able to send those scripts as attachments to other people, who might be using some other text editor with differing support for BOMs.