In the topic about Base64 coder/decoder, Laszlo suggested that we could use the full Ansi charset to encode binary data.
A nice side effect of his proposal was that plain text would still be readable.
Finding the idea interesting and challenging, I had fun designing such a format. The difficult part was making it unambiguous, i.e. lossless: decoding must restore the data exactly as it was before encoding! While I was at it, I made some design choices to compress the data slightly.
Here are the design comments I have put in my code:
Plain Encoding of Binary data to Windows Ansi (Pebwa)
The purpose is to take binary data, which cannot be put as is in a script or on a Web page, for example, and to encode it to displayable chars.
What displayable chars do we have?
In the 7-bit Ascii character map, codes below 32 and code 127 are control chars, so they are excluded. I will also exclude 32 (space), as it is too often trimmed by display (HTML) or processing.
I can also use the High-Ascii range. ISO-8859-x considers the 128-159 range as reserved, with the risk that these codes become control chars if the high bit is cleared (e.g. in some old e-mail client or server).
Since I target only Windows computers, I will use the displayable characters that Microsoft Windows Latin-1 adds in this range as the control chars of the encoding.
160 (non-breaking space) is excluded too.
The encoded data starts with a signature (magic number) allowing automatic identification of the format: it starts with 159 (Ÿ) 156 (œ) <a reserved char>, where the reserved char will be used in the future to indicate which level / version of encoding is used. Currently 1.
To ease the decoding, the signature is followed by the size of the unencoded data, encoded as described below, ending with char 151 (—).
A quantity (number of chars) is encoded using the 33-126 and 161-255 ranges:
0-93 is encoded by a char of code 33-126 (plus 33) and 94-188 is encoded by a char of code 161-255 (plus 67).
Greater quantities are encoded using base189:
2140187847 = 11323745 * 189 + 42
11323745 = 59913 * 189 + 188
59913 = 317 * 189 + 0
317 = 1 * 189 + 128
so 2140187847 = (((1 * 189 + 128) * 189 + 0) * 189 + 188) * 189 + 42
or 2140187847 = 1, 128, 0, 188, 42
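As an illustration only, here is a minimal Python sketch of the quantity encoding (it is not the AutoHotkey code from Pebwa.ahk). The function names are made up for this example, and writing the most significant digit first is an assumption based on the order used in the worked example above.

```python
BASE = 189  # 94 chars in 33-126 plus 95 chars in 161-255


def digit_to_code(digit):
    """Map a base-189 digit (0-188) to a char code in 33-126 or 161-255."""
    if not 0 <= digit < BASE:
        raise ValueError("digit out of range")
    return digit + 33 if digit <= 93 else digit + 67


def code_to_digit(code):
    """Inverse of digit_to_code."""
    if 33 <= code <= 126:
        return code - 33
    if 161 <= code <= 255:
        return code - 67
    raise ValueError("not a quantity char")


def to_digits(n):
    """Split a non-negative integer into base-189 digits, most significant first."""
    digits = []
    while True:
        n, remainder = divmod(n, BASE)
        digits.append(remainder)
        if n == 0:
            break
    return digits[::-1]


def encode_quantity(n):
    """Encode a quantity as a sequence of char codes (returned as bytes)."""
    return bytes(digit_to_code(d) for d in to_digits(n))


def decode_quantity(encoded):
    """Rebuild the quantity from its char codes."""
    n = 0
    for code in encoded:
        n = n * BASE + code_to_digit(code)
    return n


print(to_digits(2140187847))
print(list(encode_quantity(2140187847)))
```

Since none of the quantity chars falls in the 127-160 range, the terminating char 151 can never be confused with a digit.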
After that, the basic rule is: all characters in the 33-126 and 161-255 ranges represent themselves. One advantage is that plain text is still readable.
Single 32 (space) and 160 (non breaking space) chars are encoded respectively with 145 (‘) and 146 (’).
The zero byte is quite frequent in binary data, so I chose to encode it with a single char: 149 (•).
Single chars in the 1-31 and the 127-159 ranges are encoded with the char 134 (†) followed by the code of char plus 33 (34-64 and 161-192).
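The single-char rules above (without run-length encoding, signature or size header) can be sketched in Python as follows. This is only an illustration, not the code of Pebwa.ahk, and the names are invented for the example. One assumption to note: the sketch applies "plus 33" literally, so byte 127 becomes † followed by code 160, which lies just outside the 161-192 range quoted above; the real script may treat that value differently.

```python
ESCAPE = 134                           # † introduces an escaped control char
SPECIAL = {32: 145, 160: 146, 0: 149}  # space, non-breaking space, zero byte
SPECIAL_BACK = {code: byte for byte, code in SPECIAL.items()}


def is_displayable(b):
    """True for byte values that represent themselves."""
    return 33 <= b <= 126 or 161 <= b <= 255


def encode_plain(data):
    """Encode each byte on its own, following the single-char rules."""
    out = bytearray()
    for b in data:
        if is_displayable(b):
            out.append(b)
        elif b in SPECIAL:
            out.append(SPECIAL[b])
        else:
            # 1-31 and 127-159: escape char, then the code plus 33 (assumption
            # for 127, see the note above the code).
            out += bytes([ESCAPE, b + 33])
    return bytes(out)


def decode_plain(encoded):
    """Reverse encode_plain."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        if code == ESCAPE:
            out.append(encoded[i + 1] - 33)
            i += 2
        else:
            out.append(SPECIAL_BACK.get(code, code))
            i += 1
    return bytes(out)


sample = b"Hi \x00\x01"
print(list(encode_plain(sample)))
print(decode_plain(encode_plain(sample)) == sample)
```

The marker chars themselves (134, 145, 146, 149...) sit in the 127-159 range, so when they occur in the input they are escaped like any other control char, which is what keeps the decoding unambiguous.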
Runs of identical consecutive chars are encoded with the char 137 (‰) for displayable chars or the char 135 (‡) for control chars, followed by a quantity (as above), the char 151 (—), and either the char itself (in the displayable range) or its encoding (for 0-32, 127-160).
That's run-length encoding (RLE).
Obviously, since this takes at least 4 chars, it is worthwhile only for runs of at least 5 identical displayable chars or 3 control chars.
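A possible Python sketch of the run-length step is given below, for illustration only (it is not the code of Pebwa.ahk). Two assumptions are made: "its encoding" is taken to be the complete single-char encoding, including the † prefix for escaped chars, and a run is used only when it is strictly shorter than repeating the single-char encoding. That second rule gives the thresholds of 5 displayable chars and 3 escaped control chars mentioned above.

```python
from itertools import groupby

SPECIAL = {32: 145, 160: 146, 0: 149}  # space, non-breaking space, zero byte


def is_displayable(b):
    return 33 <= b <= 126 or 161 <= b <= 255


def encode_byte(b):
    """Single-char encoding of one byte value."""
    if is_displayable(b):
        return bytes([b])
    if b in SPECIAL:
        return bytes([SPECIAL[b]])
    return bytes([134, b + 33])  # † followed by the code plus 33


def encode_quantity(n):
    """Base-189 digits as char codes, most significant first (assumed order)."""
    codes = []
    while True:
        n, digit = divmod(n, 189)
        codes.append(digit + 33 if digit <= 93 else digit + 67)
        if n == 0:
            break
    return bytes(reversed(codes))


def encode_rle(data):
    """Encode data, replacing runs by marker + quantity + 151 + char when shorter."""
    out = bytearray()
    for value, group in groupby(data):
        count = len(list(group))
        single = encode_byte(value)
        marker = 137 if is_displayable(value) else 135  # ‰ or ‡
        run = bytes([marker]) + encode_quantity(count) + bytes([151]) + single
        # Keep the run form only if it really saves space.
        out += run if len(run) < count * len(single) else single * count
    return bytes(out)


print(list(encode_rle(b"AAAAA")))
print(list(encode_rle(b"\x01\x01\x01")))
```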
Note that in the worst case (random chars with codes between 1 and 31, for example), we double the size of the file, which is comparable to hex encoding. Not too bad. With true random data in the full range, roughly 64 byte values out of 256 have their length doubled, which makes around a 25% size increase; that isn't bad compared to Base64 (around 33% size increase).
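The size estimate can be checked with a few lines of Python. This is a back-of-the-envelope sketch that ignores the signature, the size header and any gain from run-length encoding, and it assumes every byte value is equally likely.

```python
def encoded_length(b):
    """Number of chars needed for one byte value under the single-char rules."""
    if 1 <= b <= 31 or 127 <= b <= 159:
        return 2  # † plus one char
    return 1      # displayable chars, space, non-breaking space, zero byte


doubled = sum(1 for b in range(256) if encoded_length(b) == 2)
average = sum(encoded_length(b) for b in range(256)) / 256
base64_ratio = 4 / 3  # Base64 emits 4 chars for every 3 input bytes

print(doubled, "byte values take two chars")
print("average chars per byte:", average)
print("Base64 chars per byte:", round(base64_ratio, 3))
```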
The file with the encoding / decoding routines is Pebwa.ahk
I have made two test scripts:
EncodeBinaryFile.ahk to get a file and put its encoding in the clipboard; and
CreateBinaryFile.ahk has a number of small graphic files encoded and embedded: it creates the files, then displays them in a small GUI. You will need the BinReadWrite.ahk library to make it work.
Note that EncodeBinaryFile doesn't check whether a closing parenthesis is at the start of a line. The case is rare, but I have seen it. It breaks the continuation section. You have to search for them manually and move the parenthesis to the end of the previous line. I didn't feel the need to automate this...
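For anyone who would like to automate that manual step, here is one possible sketch in Python. It is not part of EncodeBinaryFile.ahk; the function name is made up, and it assumes the encoded text is available as a list of lines that the continuation section joins without any separator, so that moving a char from one line to the previous one does not change the data.

```python
def fix_leading_parens(lines):
    """Move any ')' found at the start of a line to the end of the previous line."""
    if not lines:
        return []
    fixed = [lines[0]]  # the first line has no previous line, so it is kept as is
    for line in lines[1:]:
        # Several ')' in a row are possible, so repeat until the line is safe.
        while line.startswith(")"):
            fixed[-1] += ")"
            line = line[1:]
        fixed.append(line)
    return fixed


print(fix_leading_parens(["abc", ")def", "ghi"]))
```

A line made only of closing parentheses ends up empty, and a first line starting with a parenthesis still has to be handled by hand.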
The speed is acceptable on a modern computer, but for a 100KB Zip file, we have to wait a little... This encoding is designed more for small files, say below 25KB.
Note that continuation sections have a limited size; for large files, you may need to split the result into several parts.
I spent quite some time on this script, but it was fun! I prefer that to solving sudoku grids... But it is harder to do on public transport... If somebody wants to donate an old Pocket PC...
