A new encoding of binary files: pebwa!


10 replies to this topic
PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
In the topic about the Base64 coder/decoder, Laszlo suggested that we could use the full Ansi charset to encode binary data.
A nice side effect of his proposal was that plain text would still be readable.

Finding the idea interesting and challenging, I had fun designing such a format. The difficult part was to make something unambiguous, i.e. lossless: decoding must restore the data exactly as it was before encoding! While I was at it, I made some design choices to compress the data slightly.

Here are the design comments I have put in my code:

Plain Encoding of Binary data to Windows Ansi (Pebwa)

The purpose is to take binary data, which cannot be put as is in a script or on a Web page, for example, and to encode it to displayable chars.
What displayable chars do we have?
In the 7-bit Ascii character map, codes < 32 and 127 are control chars, so they are excluded. I also exclude 32 (space), as it is too often trimmed by display (HTML) or processing.
I can also use the High-Ascii range. ISO-8859-x considers the 128-159 range as reserved, with the risk that these codes become control chars if the high bit is cleared (e.g. in some old e-mail client or server).
Since I target only Windows computers, I will use the displayable characters that Microsoft Windows Latin-1 adds in this range as control chars of the encoding.
160 (non breaking space) is removed too.
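
As a rough illustration of the ranges just described (written in Python for convenience, not taken from Pebwa.ahk; the names DISPLAYABLE, SPECIAL and is_displayable are made up for this sketch), the split between chars that can stay as they are and chars that need special treatment looks like this:

```python
# Byte values the post treats as directly displayable:
# 33-126 (printable Ascii without space) and 161-255 (high range without NBSP).
DISPLAYABLE = [b for b in range(256) if 33 <= b <= 126 or 161 <= b <= 255]

# Everything else (0-32 and 127-160) needs a special encoding.
SPECIAL = [b for b in range(256) if b not in DISPLAYABLE]


def is_displayable(byte_value: int) -> bool:
    """Return True if the byte can appear unchanged in the encoded text."""
    return 33 <= byte_value <= 126 or 161 <= byte_value <= 255


print(len(DISPLAYABLE), len(SPECIAL))
```

The first count is the number of symbols available for representing quantities, which is where the base used further down comes from.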

The encoded data starts with a signature (magic number) allowing automatic identification of the format: 159 (Ÿ), 156 (œ), then a reserved char that will be used in the future to indicate which level / version of the encoding is used (currently 1).

To ease the decoding, the signature is followed by the size of the unencoded data, encoded as described below, ending with char 151 (—).

A quantity (number of chars) is encoded using the 33-126 and 161-255 ranges:
0-93 is encoded by a char of code 33-126 (plus 33) and 94-188 is encoded by a char of code 161-255 (plus 67).
Greater quantities are encoded using base189:
2140187847 = 11323745 * 189 + 42
11323745 = 59913 * 189 + 188
59913 = 317 * 189 + 0
317 = 1 * 189 + 128
so 2140187847 = (((1 * 189 + 128) * 189 + 0) * 189 + 188) * 189 + 42
or 2140187847 = 1, 128, 0, 188, 42
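
The base-189 scheme above can be sketched as follows. This is a Python illustration, not the AutoHotkey code from Pebwa.ahk; the function names are invented here, and writing the most significant digit first is an assumption based on the order in which the digits are listed above.

```python
BASE = 189  # number of displayable byte values (33-126 and 161-255)


def digit_to_byte(digit: int) -> int:
    """Map a base-189 digit to a char code: 0-93 -> 33-126, 94-188 -> 161-255."""
    return digit + 33 if digit <= 93 else digit + 67


def byte_to_digit(value: int) -> int:
    """Inverse of digit_to_byte."""
    return value - 33 if value <= 126 else value - 67


def encode_quantity(n: int) -> bytes:
    """Encode a non-negative integer as base-189 digits, most significant first."""
    if n < 0:
        raise ValueError("quantity must be non-negative")
    digits = []
    while True:
        n, remainder = divmod(n, BASE)
        digits.append(remainder)
        if n == 0:
            break
    return bytes(digit_to_byte(d) for d in reversed(digits))


def decode_quantity(data: bytes) -> int:
    """Decode a sequence of base-189 digit chars back to an integer."""
    n = 0
    for value in data:
        n = n * BASE + byte_to_digit(value)
    return n


encoded = encode_quantity(2140187847)
print(list(encoded))             # [34, 195, 33, 255, 75]
print(decode_quantity(encoded))  # 2140187847
```

The five codes printed are the digits 1, 128, 0, 188, 42 from the worked example, shifted by 33 or 67 as described.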

After that, the basic rule is: all characters in the 33-126 and 161-255 ranges represent themselves. One advantage is that plain text is still readable.

Single 32 (space) and 160 (non breaking space) chars are encoded respectively with 145 (‘) and 146 (’).
The zero byte is quite frequent in binary data, so I chose to encode it with a single char: 149 (•).

Single chars in the 1-31 and the 127-159 ranges are encoded with the char 134 (†) followed by the code of char plus 33 (34-64 and 161-192).
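
To make the single-char rules concrete, here is a small Python sketch (again not the real Pebwa.ahk code; encode_single and encode_plain are hypothetical names, and run-length encoding, the signature and the size header are left out on purpose):

```python
ESCAPE = 134      # † introduces an escaped control char
SPACE_CODE = 145  # ‘ stands for 32 (space)
NBSP_CODE = 146   # ’ stands for 160 (non-breaking space)
ZERO_CODE = 149   # • stands for the zero byte


def encode_single(byte_value: int) -> bytes:
    """Encode one byte according to the single-char rules of the post."""
    if byte_value == 0:
        return bytes([ZERO_CODE])
    if byte_value == 32:
        return bytes([SPACE_CODE])
    if byte_value == 160:
        return bytes([NBSP_CODE])
    if 33 <= byte_value <= 126 or 161 <= byte_value <= 255:
        return bytes([byte_value])  # displayable chars represent themselves
    # Remaining values are 1-31 and 127-159: escape char, then code plus 33.
    # Taken literally, "plus 33" maps 127 to 160, just outside the 161-192
    # range listed in the post; how the real script treats 127 is not stated.
    return bytes([ESCAPE, byte_value + 33])


def encode_plain(data: bytes) -> bytes:
    """Encode a byte string char by char, without any run-length encoding."""
    return b"".join(encode_single(b) for b in data)


sample = encode_plain(b"Hi there\x00\n")
print(sample.decode("cp1252"))
```

The letters of the sample come through untouched, which is the readability property mentioned above; only the space, the zero byte and the line feed are replaced.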

Runs of identical consecutive chars are encoded with the char 137 (‰) for the displayable chars and the char 135 (‡) for the control chars, followed by a quantity (as above), the char 151 (—) and either the char itself (in the displayable range) or its encoding (for 0-32, 127-160).
That's run-length encoding (RLE).
Obviously, since this takes at least 4 chars, it is interesting only for encoding at least 5 identical displayable chars or 3 control chars.
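
The break-even figures can be checked with a few lines of Python. This is only a sketch under two assumptions that the post does not spell out: the quantity fits in one base-189 char (runs shorter than 189), and exactly one symbol follows the 151 separator, for control chars too. The names run_cost and break_even are made up for the illustration.

```python
def run_cost(quantity_chars: int = 1) -> int:
    """Length of a run: marker + quantity + separator (151) + one symbol."""
    return 1 + quantity_chars + 1 + 1


def break_even(cost_per_single: int, quantity_chars: int = 1) -> int:
    """Smallest run length for which the run form is strictly shorter."""
    n = 1
    while n * cost_per_single <= run_cost(quantity_chars):
        n += 1
    return n


print(break_even(1))  # displayable chars cost 1 char each when written singly
print(break_even(2))  # escaped control chars cost 2 chars each
```

With these assumptions the two results are 5 and 3, matching the thresholds given above.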

Note that in the worst case (random chars with codes between 1 and 31, for example), the file size doubles, which is comparable to hex encoding. Not too bad. With truly random data over the full range, roughly 64 byte values out of 256 are doubled in length, which gives around a 25% size increase; that isn't bad compared to Base64 (around 33% size increase).
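
The 25% estimate can be reproduced by counting, for each of the 256 byte values, how many chars its single-char encoding takes. The Python below is a sketch of that arithmetic only (no run-length encoding, uniformly random bytes assumed); encoded_length is a name invented for it.

```python
from fractions import Fraction


def encoded_length(byte_value: int) -> int:
    """Chars needed to encode one byte without run-length encoding."""
    if byte_value in (0, 32, 160):
        return 1  # dedicated single-char codes 149, 145 and 146
    if 33 <= byte_value <= 126 or 161 <= byte_value <= 255:
        return 1  # displayable chars represent themselves
    return 2      # 1-31 and 127-159: escape char plus one more char


doubled = sum(1 for b in range(256) if encoded_length(b) == 2)
total = sum(encoded_length(b) for b in range(256))

pebwa_expansion = Fraction(total, 256)  # average chars per input byte
base64_expansion = Fraction(4, 3)       # 4 output chars per 3 input bytes

print(doubled, pebwa_expansion, base64_expansion)
```

Here 64 values cost two chars and the other 192 cost one, so the average is 320/256 = 5/4 chars per byte, against 4/3 for Base64 before any line breaks.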

The file with the encoding / decoding routines is
Pebwa.ahk
I have made two test scripts: EncodeBinaryFile.ahk reads a file and puts its encoding in the clipboard; CreateBinaryFile.ahk has a number of small graphic files encoded and embedded: it creates the files, then displays them in a small GUI. You will need the BinReadWrite.ahk library to make it work.
Note that EncodeBinaryFile doesn't check whether a closing parenthesis is at the start of a line. The case is rare, but I have seen it. It breaks the continuation section. You have to search for such lines manually and move the parenthesis to the end of the previous line. I didn't feel the need to automate this...

The speed is acceptable on a modern computer, but for a 100KB Zip file, we have to wait a little... This encoding is designed more for small files anyway, say below 25KB.
Note that continuation sections have a limited size; for large files, you may need to split the result into several parts.

I spent quite some time on this script, but it was fun! I prefer that to solving sudoku grids... But it is harder to do on public transport... If somebody wants to donate an old Pocket PC... :-P
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

polyethene
  • Members
  • 5519 posts
  • Last active: May 17 2015 06:39 AM
  • Joined: 26 Oct 2012
BinaryEncodingDecoding.ahk seems to do nothing and EncodeBinaryFile.ahk seems to require DllCallStruct.ahk which I don't have (note: I have BinReadWrite.ahk and CreateBinaryFile.ahk).

According to a White Paper on Binary Encoding, the Base85 algorithm (dubbed Ascii85 by Adobe) is the most effective encoder. There is an Ascii85() function for AutoHotkey already. Remember that these encoding methods can process binary to plain text and vice versa.

The concept of Pebwa looks interesting and I'm looking forward to trying it out :)

autohotkey.com/net Site Manager

 

Contact me by email (polyethene at autohotkey.net) or message tidbit


PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005

BinaryEncodingDecoding.ahk seems to do nothing

That's only a library of functions.

EncodeBinaryFile.ahk seems to require DllCallStruct.ahk which I don't have.

Agh! I removed it from CreateBinaryFile, not from there. It is no longer needed, I moved the encoders to BinaryEncodingDecoding (which is included in DllCallStruct...). Corrected now. Thanks for testing and reporting.

According to a White Paper on Binary Encoding, the Base85 algorithm (dubbed Ascii85 by Adobe) is the most effective encoder. There is an Ascii85() function for AutoHotkey already. Remember that these encoding methods can process binary to plain text and vice versa.

Yes, I know that. I plan to compare the performance someday. I made this encoding because:
1) It was fun and challenging; actually, I was expecting much worse results in size.
2) It has the side effect, and perhaps advantage, of not scrambling plain text: you can still read the comments/resource strings in binary files.
And I guess that on a file with a lot of plain text, like a Word file, it may do better than Ascii85...

The concept of Pebwa looks interesting and I'm looking forward to trying it out :)

Thank you. Enjoy!

robiandi
  • Guests
  • Last active:
  • Joined: --

EncodeBinaryFile.ahk seems to require DllCallStruct.ahk which I don't have.

Agh! I removed it from CreateBinaryFile, not from there. It is no longer needed, I moved the encoders to BinaryEncodingDecoding (which is included in DllCallStruct...). Corrected now.

But in CreateBinaryFile.ahk you forgot to replace
#Include DllCallStruct.ahk
by
#Include BinaryEncodingDecoding.ahk


PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
Strange quote, a bit confusing...
OK, I don't know where my mind was. I think I will accept an FTP account; this Web interface is annoying when many files are involved.
Thanks for reporting, I corrected it.

Lemming
  • Members
  • 184 posts
  • Last active: Feb 03 2014 11:03 AM
  • Joined: 20 Dec 2005
Is this a method to write binary files to disk?

If so, I wonder if this could be used to create a native AHK screen capture function? Right now, all AHK screen capture solutions rely on a 3rd party app, usually Irfanview or XNView.

I figure it might be possible to convert the clipboard data from Print Screen or Alt-Print Screen into a binary file.

Of course, we'd need to figure out the graphics file formats and how to convert raw clipboard data to graphics. I think a lossless format like .BMP might be easier.

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
You can write binary files to disk without encoding them. For screen capture, the easiest way is:

ClipboardAll may also be saved to a file (in this mode, FileAppend always overwrites any existing file):
FileAppend, %ClipboardAll%, C:\Company Logo.clip ; The file extension does not matter.

To later load the file back onto the clipboard (or into a variable), follow this example:
FileRead, Clipboard, *c C:\Company Logo.clip ; Note the use of *c, which must precede the filename.



PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
The issue has been raised already (and is a bit unrelated to this topic, but I don't mind).

This is a method to write a given binary file to disk. To write variable binary data (like altered clipboard content), you can just use the BinReadWrite routines, used here and explained in another thread.

Now, to do what you want isn't a small task, as you must respect the file format. It is probably easier to rely on system functions, perhaps on GDI+.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
I compared the sizes of files treated by Pebwa vs. those encoded with Base64.
The text file is just for completeness; it is outside the scope of these encodings (i.e. encoding it is unnecessary). Otherwise, I tried to check some useful formats.
Filename              Orig. size   Pebwa size   Base64 size
LexAHK.cxx                 26615        29489         36967
AHK32.bmp                   3126         3800          4342
TestA.gif (animated)        4821         6111          6698
TestS.gif (fixed)          19770        26417         27461
HOT!Key.ico                 8230         6753         11435
KM200.png                   6691         8534          9296
HMIM1595.jpg              100726       113203        139902
PCREAHK.dll                24064        30248         33426
Pebwa consistently produces a smaller encoding than Base64... Perhaps Ascii85 is better here; I have yet to complete these measurements.
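
For those who prefer percentages, the figures of the table can be turned into relative size changes with a short Python snippet. It only reuses the sizes listed above; the overhead helper is a name chosen for this sketch.

```python
# (filename, original size, Pebwa size, Base64 size), copied from the table
SIZES = [
    ("LexAHK.cxx", 26615, 29489, 36967),
    ("AHK32.bmp", 3126, 3800, 4342),
    ("TestA.gif", 4821, 6111, 6698),
    ("TestS.gif", 19770, 26417, 27461),
    ("HOT!Key.ico", 8230, 6753, 11435),
    ("KM200.png", 6691, 8534, 9296),
    ("HMIM1595.jpg", 100726, 113203, 139902),
    ("PCREAHK.dll", 24064, 30248, 33426),
]


def overhead(original: int, encoded: int) -> float:
    """Size change of the encoded form relative to the original, in percent."""
    return (encoded - original) / original * 100


for name, orig, pebwa, base64_size in SIZES:
    print(f"{name:14} Pebwa {overhead(orig, pebwa):+6.1f}%   "
          f"Base64 {overhead(orig, base64_size):+6.1f}%")
```

The .ico file is the only one that comes out smaller than the original, presumably thanks to the run-length encoding of repeated bytes.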

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

Perhaps Ascii85 is better here, I have to complete these measures.

I posted simple functions to Ascii85 en/decode binary buffers. Pebwa could still provide smaller encoded data, because it uses 8-bit ANSI, while Base64 and Ascii85 use 7-bit ASCII. ASCII might be safer in some environments.
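
A back-of-the-envelope way to see the effect of the alphabet size: an ideal encoder that packs bits perfectly into an alphabet of N symbols needs 8 / log2(N) chars per input byte. The Python sketch below computes that lower bound; it is not what any of the three encoders actually achieves (Ascii85 emits 5 chars per 4 bytes, and Pebwa maps most bytes one to one instead of packing bits), and min_expansion is a name made up here.

```python
import math


def min_expansion(alphabet_size: int) -> float:
    """Minimum output chars per input byte for an ideal encoder."""
    return 8 / math.log2(alphabet_size)


for name, size in (("Base64", 64), ("Ascii85", 85), ("189 displayable chars", 189)):
    print(f"{name:22} {min_expansion(size):.4f} chars per byte")
```

The bound shrinks as the alphabet grows, which is the room an 8-bit charset leaves compared to 7-bit ASCII.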

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
Thanks. I have yet to try it, but I will, to update my stats.
Since I didn't find a ready-made binary for Windows, I started to write my own encoder in AutoHotkey along the lines of Pebwa. Same for Base64, which lacked a binary to Base64 codec. Now that you have provided both, I will drop my code, which didn't go far anyway.