jeeswg's base64 mini-tutorial: size calculations

Post by jeeswg » 13 Jun 2018, 16:29

- I couldn't find an article anywhere on the Internet that clearly lists all 4 of the base64 size formulas, together with a simple explanation of their origin.
- Do notify me of any issues or of any good links.

[updated: 2019-08-25]
jeeswg's base64 tutorial: size calculations

QUICK REFERENCE: THE FORMULAS

[binary data -> base64 string (without padding)]
n bytes -> ceil(4*(n/3)) chars (excluding padding)

[binary data -> base64 string (with padding)]
n bytes -> 4*ceil(n/3) chars (including padding)

[base64 string (without padding) -> binary data]
n chars (excluding padding) -> floor(n*(3/4)) bytes

[base64 string (with padding) -> binary data]
[where n is a multiple of 4]
for n = 0: n chars (including padding) -> 0 bytes
for n > 0: n chars (including padding) -> min: n*(3/4)-2 bytes, max: n*(3/4) bytes

INTRODUCTION: 4 CALCULATIONS

This 'tutorialette' is about converting between binary data and base64 strings.

There are 4 important size calculations to consider:
binary data -> base64 string (without padding)
binary data -> base64 string (with padding)
base64 string (without padding) -> binary data
base64 string (with padding) -> binary data

INTRODUCTION: BASE64 V. HEX

Hex and base64 are useful because they allow you to store binary data as strings.
Normally, in ANSI strings, you cannot store null bytes, because a null byte is used to indicate the end of a string. Hence other systems for storing binary data as strings are needed.

Let's start off with hex.
Hex uses 16 characters to store data:
0123456789ABCDEF
This can be abbreviated as: 0-9A-F.
Let's say we have 6 bytes: 1 2 3 4 5 6.
We can write this in hex as 010203040506.
If we had: 255 254 253 252 251 250.
We can write this in hex as FFFEFDFCFBFA.
In both cases we have: 6 bytes -> 12 characters.
And in general: n bytes -> 2n characters.
And for the reverse direction: n characters -> n/2 bytes.
It's reasonably straightforward to convert binary data to hex by hand.

Now let's look at base64.
Base64 uses 64 characters to store data:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
This can be abbreviated as: A-Za-z0-9+/.
(Note: YouTube video IDs use the same character set but with -_ instead of +/.)
Let's say we have 6 bytes: 1 2 3 4 5 6.
We can write this in base64 as AQIDBAUG.
If we had: 255 254 253 252 251 250.
We can write this in base64 as //79/Pv6.
In both cases we have: 6 bytes -> 8 characters.
And in general: n bytes -> approx. (4/3)*n characters.
And for the reverse direction: n characters -> approx. (3/4)*n bytes.
Where 4/3 ~= 1.333333 and 3/4 = 0.75.
It's difficult to convert binary data to base64 by hand.

Compared with hex, base64 has the advantage that it stores the same data in fewer characters: roughly 4/3 characters per byte instead of 2.
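The two 6-byte examples above can be reproduced with a short script. This is a minimal sketch in Python (not AutoHotkey, so that it can be run anywhere), using the standard base64 module; the helper name encode_both is made up for illustration.

```python
import base64

def encode_both(data: bytes):
    """Return (hex string, base64 string) for the same binary data."""
    hex_text = data.hex().upper()                       # 2 characters per byte
    b64_text = base64.b64encode(data).decode("ascii")   # about 4/3 characters per byte
    return hex_text, b64_text

# The two examples used in the text.
for data in (bytes([1, 2, 3, 4, 5, 6]), bytes([255, 254, 253, 252, 251, 250])):
    hex_text, b64_text = encode_both(data)
    print(len(data), "bytes ->", hex_text, len(hex_text), "chars /",
          b64_text, len(b64_text), "chars")
```

For both inputs the hex string is 12 characters long and the base64 string is 8 characters long.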

INTRODUCTION: BINARY DATA TO BASE64 STRINGS

Below is a demonstration of splitting binary data from 8-bit units (bytes) into 6-bit units, as part of the binary data to base64 conversion process.

[note: where each capital letter represents 2 bits = 1/4 byte]
[note: where b means bytes, and c means characters]
[0b -> 0c]
[1b -> 2c] AAAA -> AAA A
[2b -> 3c] AAAA BBBB -> AAA ABB BB
[3b -> 4c] AAAA BBBB CCCC -> AAA ABB BBC CCC
[4b -> 6c] AAAA BBBB CCCC DDDD -> AAA ABB BBC CCC DDD D

Earlier we used hex '010203040506' as an example. Here we demonstrate converting that data to base64.

hex: 010203040506
bin (8-bit groups): 00000001 00000010 00000011 00000100 00000101 00000110
bin (6-bit groups): 000000 010000 001000 000011 000001 000000 010100 000110
dec (6-bit groups as dec): 0 16 8 3 1 0 20 6
(in the next step we replace each 6-bit group with one of 64 characters: e.g. the 0th char is A, the 16th char is Q)
chars (6-bit groups as chars): AQIDBAUG

We'll do a very similar example, this time hex '0102030405'.

hex: 0102030405
bin (8-bit groups): 00000001 00000010 00000011 00000100 00000101
bin (6-bit groups): 000000 010000 001000 000011 000001 000000 0101
(the last group didn't have enough bits, so we pad it with zeros:)
bin (6-bit groups): 000000 010000 001000 000011 000001 000000 010100
dec (6-bit groups as dec): 0 16 8 3 1 0 20
(in the next step we replace each 6-bit group with one of 64 characters: e.g. the 0th char is A, the 16th char is Q)
chars (6-bit groups as chars): AQIDBAU
(base64 is commonly padded so that the string has a length which is a multiple of 4 characters; '=' is commonly used to do the padding)
chars (6-bit groups as chars): AQIDBAU=

We'll again do a very similar example, this time hex '01020304'.

hex: 01020304
bin (8-bit groups): 00000001 00000010 00000011 00000100
bin (6-bit groups): 000000 010000 001000 000011 000001 00
(the last group didn't have enough bits, so we pad it with zeros:)
bin (6-bit groups): 000000 010000 001000 000011 000001 000000
dec (6-bit groups as dec): 0 16 8 3 1 0
(in the next step we replace each 6-bit group with one of 64 characters: e.g. the 0th char is A, the 16th char is Q)
chars (6-bit groups as chars): AQIDBA
(base64 is commonly padded so that the string has a length which is a multiple of 4 characters; '=' is commonly used to do the padding)
chars (6-bit groups as chars): AQIDBA==

Note: the system uses 0-based indexes, so 'A' is not the 1st char, but the 0th char.
Note: the system starts with 'ABC', not '123'.
Note: with or without the padding, you can convert the base64 string to binary data.
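The by-hand steps in the three worked examples (8-bit groups, zero-padded 6-bit groups, 0-based lookup, optional '=' padding) can be sketched as code. This is an illustration in Python rather than AutoHotkey, and not an optimised encoder; the names ALPHABET and encode_by_hand are made up for this sketch.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_by_hand(data: bytes, pad: bool = True) -> str:
    # Step 1: write every byte as 8 bits, giving one long bit string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Step 2: pad the last group with zero bits so the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # Step 3: replace each 6-bit group with the character at that 0-based index.
    chars = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Step 4 (optional): pad with '=' so the length is a multiple of 4 characters.
    if pad:
        chars += "=" * (-len(chars) % 4)
    return chars

for hex_text in ("010203040506", "0102030405", "01020304"):
    data = bytes.fromhex(hex_text)
    print(hex_text, "->", encode_by_hand(data, pad=False), "->", encode_by_hand(data))
```

The three lines printed correspond to the three worked examples: AQIDBAUG (no padding needed), AQIDBAU and AQIDBAU=, AQIDBA and AQIDBA==.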

FORMULAS: INTRODUCTION

In the next 4 sections, we try to come up with formulas for calculating the number of characters in a base64 string based on the number of bytes in binary data, and vice versa.
binary data -> base64 string (without padding)
binary data -> base64 string (with padding)
base64 string (without padding) -> binary data
base64 string (with padding) -> binary data

There are essentially 3 scenarios to consider: the size of the data is 3k bytes (0, 3, 6, ...), 3k+1 bytes (1, 4, 7, ...) or 3k+2 bytes (2, 5, 8, ...), for some whole number k.
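The 3 scenarios differ only in what happens after the last complete group of 3 bytes. Here is a small sketch in Python (the function name tail_shape is made up for illustration) that gives, for each remainder, how many characters the leftover bytes need and how many pad characters would be added:

```python
def tail_shape(n_bytes: int):
    """Return (chars needed for the leftover bytes, '=' pad chars) for n_bytes of data."""
    leftover_bytes = n_bytes % 3                  # 0, 1 or 2 bytes after the last full group
    tail_chars = (8 * leftover_bytes + 5) // 6    # ceil(leftover bits / 6), integers only
    pad_chars = -tail_chars % 4                   # pads needed to reach a multiple of 4
    return tail_chars, pad_chars

for n in range(8):
    tail, pads = tail_shape(n)
    print(f"{n} bytes: leftover bytes need {tail} chars, plus {pads} pad chars")
```

A remainder of 0 needs nothing extra, a remainder of 1 needs 2 characters (and 2 pads), and a remainder of 2 needs 3 characters (and 1 pad).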

FORMULAS: BINARY DATA -> BASE64 STRING (WITHOUT PADDING)

[note: where each capital letter represents 2 bits = 1/4 byte]
[note: where b means bytes, and c means characters]
[0b -> 0c]
===============
[1b -> 2c] AAAA -> AAA A
[2b -> 3c] AAAA BBBB -> AAA ABB BB
[3b -> 4c] AAAA BBBB CCCC -> AAA ABB BBC CCC
===============
[4b -> 6c] AAAA BBBB CCCC DDDD -> AAA ABB BBC CCC DDD D
[5b -> 7c] AAAA BBBB CCCC DDDD EEEE -> AAA ABB BBC CCC DDD DEE EE
[6b -> 8c] AAAA BBBB CCCC DDDD EEEE FFFF -> AAA ABB BBC CCC DDD DEE EEF FFF
===============
[7b -> 10c] AAAA BBBB CCCC DDDD EEEE FFFF GGGG -> AAA ABB BBC CCC DDD DEE EEF FFF GGG G

[some calculations to give us a feel for the situation]
[we multiply by 4/3 because we expect the base64 to be roughly 4/3 the size of the original data]
1*(4/3) = 1.333333 -> 2
2*(4/3) = 2.666667 -> 3
3*(4/3) = 4 -> 4
[which can be rewritten as:]
4*(1/3) = 1.333333 -> 2
4*(2/3) = 2.666667 -> 3
4*(3/3) = 4 -> 4

[formula]
n bytes -> ceil(4*(n/3)) chars (excluding padding)

[some test calculations]
ceil(4*(0/3)) = 0
ceil(4*(1/3)) = 2
ceil(4*(2/3)) = 3
ceil(4*(3/3)) = 4
ceil(4*(4/3)) = 6
ceil(4*(5/3)) = 7
ceil(4*(6/3)) = 8
ceil(4*(7/3)) = 10
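The formula can also be checked against a real encoder. This is a hedged sketch in Python using the standard base64 module, with the padding stripped afterwards; the function name unpadded_length is made up, and the integer-only form (4n + 2) // 3 is an equivalent way of writing ceil(4*(n/3)) that avoids floating point.

```python
import base64
import math

def unpadded_length(n_bytes: int) -> int:
    # ceil(4*(n/3)) written with integers only: ceil(4n/3) == (4n + 2) // 3
    return (4 * n_bytes + 2) // 3

for n in range(200):
    # Encode n zero bytes, then strip the '=' padding to get the unpadded string.
    actual = len(base64.b64encode(bytes(n)).rstrip(b"="))
    assert unpadded_length(n) == actual
    assert unpadded_length(n) == math.ceil(4 * (n / 3))

print([unpadded_length(n) for n in range(8)])
```

The printed list matches the test calculations above: 0, 2, 3, 4, 6, 7, 8, 10.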

FORMULAS: BINARY DATA -> BASE64 STRING (WITH PADDING)

[note: where each capital letter represents 2 bits = 1/4 byte]
[note: where b means bytes, and c means characters]
[note: where # represents a pad character]
[0b -> 0c]
===============
[1b -> 4c] AAAA -> AAA A # #
[2b -> 4c] AAAA BBBB -> AAA ABB BB #
[3b -> 4c] AAAA BBBB CCCC -> AAA ABB BBC CCC
===============
[4b -> 8c] AAAA BBBB CCCC DDDD -> AAA ABB BBC CCC DDD D # #
[5b -> 8c] AAAA BBBB CCCC DDDD EEEE -> AAA ABB BBC CCC DDD DEE EE #
[6b -> 8c] AAAA BBBB CCCC DDDD EEEE FFFF -> AAA ABB BBC CCC DDD DEE EEF FFF
===============
[7b -> 12c] AAAA BBBB CCCC DDDD EEEE FFFF GGGG -> AAA ABB BBC CCC DDD DEE EEF FFF GGG G # #

[some calculations to give us a feel for the situation]
[the number of chars will always be a multiple of 4]
[so instead of bytes -> chars: 1 or 2 or 3 -> 4, and 4 or 5 or 6 -> 8]
[we can consider bytes -> groups of 4 chars: 1 or 2 or 3 -> 1, and 4 or 5 or 6 -> 2]
(1/3) = 0.333333 -> 1 = ceil(1/3)
(2/3) = 0.666667 -> 1 = ceil(2/3)
(3/3) = 1 -> 1 = ceil(3/3)

[formula]
n bytes -> 4*ceil(n/3) chars (including padding)

[some test calculations]
4*ceil(0/3) = 0
4*ceil(1/3) = 4
4*ceil(2/3) = 4
4*ceil(3/3) = 4
4*ceil(4/3) = 8
4*ceil(5/3) = 8
4*ceil(6/3) = 8
4*ceil(7/3) = 12
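A similar check for the padded length, again as a Python sketch using the standard base64 module (which pads by default). The function names padded_length and pad_count are made up for illustration; pad_count counts the '=' characters needed to bring the unpadded length up to a multiple of 4.

```python
import base64

def padded_length(n_bytes: int) -> int:
    # 4*ceil(n/3) written with integers only: ceil(n/3) == (n + 2) // 3
    return 4 * ((n_bytes + 2) // 3)

def pad_count(n_bytes: int) -> int:
    # Number of '=' characters: 0 for 3k bytes, 2 for 3k+1 bytes, 1 for 3k+2 bytes.
    return -n_bytes % 3

for n in range(200):
    encoded = base64.b64encode(bytes(n))
    assert padded_length(n) == len(encoded)
    assert pad_count(n) == encoded.count(b"=")

print([padded_length(n) for n in range(8)])
```

The printed list matches the test calculations above: 0, 4, 4, 4, 8, 8, 8, 12.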

FORMULAS: BASE64 STRING (WITHOUT PADDING) -> BINARY DATA

base64 string without padding, to binary data
0c -> 0b
2c -> 1b
3c -> 2b
4c -> 3b
6c -> 4b
7c -> 5b
8c -> 6b
10c -> 7b

[some calculations to give us a feel for the situation]
[we multiply by 3/4 because we expect the binary data to be roughly 3/4 the size of the base64 string]
0*(3/4) = 0 -> 0
2*(3/4) = 1.5 -> 1
3*(3/4) = 2.25 -> 2
4*(3/4) = 3 -> 3
6*(3/4) = 4.5 -> 4
7*(3/4) = 5.25 -> 5
8*(3/4) = 6 -> 6
10*(3/4) = 7.5 -> 7

[formula]
n chars (excluding padding) -> floor(n*(3/4)) bytes

[some test calculations]
floor(0*(3/4)) = 0
floor(2*(3/4)) = 1
floor(3*(3/4)) = 2
floor(4*(3/4)) = 3
floor(6*(3/4)) = 4
floor(7*(3/4)) = 5
floor(8*(3/4)) = 6
floor(10*(3/4)) = 7
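Note that the lengths 1, 5, 9, ... do not appear in the lists above: a lone final character carries only 6 bits, which is less than a byte, so an unpadded string can never have a length of the form 4k+1. A sketch in Python that applies the formula and rejects those lengths (the function names are made up; Python's standard decoder expects padding, so the sketch restores it before decoding):

```python
import base64

def decoded_size_unpadded(n_chars: int) -> int:
    if n_chars % 4 == 1:
        raise ValueError("an unpadded base64 string cannot have length 4k+1")
    return n_chars * 3 // 4            # floor(n*(3/4)), integers only

def decode_unpadded(text: str) -> bytes:
    # Add back the '=' characters that were left off, then decode as usual.
    return base64.b64decode(text + "=" * (-len(text) % 4))

for text in ("AQIDBAUG", "AQIDBAU", "AQIDBA"):
    data = decode_unpadded(text)
    print(text, "->", len(data), "bytes, formula says", decoded_size_unpadded(len(text)))
```

For the three example strings this gives 6, 5 and 4 bytes, from 8, 7 and 6 characters.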

FORMULAS: BASE64 STRING (WITH PADDING) -> BINARY DATA

for a given number of base64 characters (including padding),
the maximum possible size of the data:

0c -> 0b
4c -> 3b
8c -> 6b
12c -> 9b

0*(3/4) = 0
4*(3/4) = 3
8*(3/4) = 6
12*(3/4) = 9

for a given number of base64 characters (including padding),
the minimum possible size of the data:

0c -> 0b
4c -> 1b
8c -> 4b
12c -> 7b

0*(3/4)-2 = -2 [the size can't be negative, so we assume 0]
4*(3/4)-2 = 1
8*(3/4)-2 = 4
12*(3/4)-2 = 7

[formula]
[where n is a multiple of 4]
for n = 0: n chars (including padding) -> 0 bytes
for n > 0: n chars (including padding) -> min: n*(3/4)-2 bytes, max: n*(3/4) bytes
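The length alone only gives a range, but once the string itself is available the exact size follows from the number of trailing pad characters: n*(3/4) minus 2, 1 or 0. A sketch in Python, with made-up function names, checked against the standard base64 module:

```python
import base64

def size_range_padded(n_chars: int):
    """Return (min, max) decoded size for a padded string of n_chars characters."""
    if n_chars % 4:
        raise ValueError("a padded base64 string has a length that is a multiple of 4")
    if n_chars == 0:
        return 0, 0
    return n_chars * 3 // 4 - 2, n_chars * 3 // 4

def exact_size_padded(text: str) -> int:
    """Exact decoded size: the maximum, minus one byte per trailing '='."""
    pads = len(text) - len(text.rstrip("="))
    return len(text) * 3 // 4 - pads

for text in ("AQIDBAUG", "AQIDBAU=", "AQIDBA=="):
    low, high = size_range_padded(len(text))
    exact = exact_size_padded(text)
    assert low <= exact <= high
    assert exact == len(base64.b64decode(text))
    print(text, "-> between", low, "and", high, "bytes, exactly", exact)
```

All three strings are 8 characters long, so the range is 4 to 6 bytes in each case; the exact sizes are 6, 5 and 4 bytes.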

- Some code to test the 'including padding' formulas, using CryptBinaryToString (encode binary data to base64 string), and CryptStringToBinary (decode base64 string to binary data):

;encode: bytes to char count
VarSetCapacity(vData, 100, 0)
Loop 100
{
	vSize := A_Index
	vChars := 0
	;CRYPT_STRING_BASE64 := 0x1
	DllCall("crypt32\CryptBinaryToString", "Ptr",&vData, "UInt",vSize, "UInt",0x1, "Ptr",0, "UInt*",vChars)
	VarSetCapacity(vTemp, vChars*2)
	DllCall("crypt32\CryptBinaryToString", "Ptr",&vData, "UInt",vSize, "UInt",0x1, "Ptr",&vTemp, "UInt*",vChars)
	vBase64 := StrGet(&vTemp)
	vBase64 := StrReplace(vBase64, "`r`n")
	vChars := StrLen(vBase64)

	;n bytes -> 4*ceil(n/3) chars (including padding)
	vChars2 := 4*Ceil(vSize/3)
	vOutput .= Format("{}`t{}`t{}`r`n", vSize, vChars, vChars2)

	if !(vChars = vChars2)
		MsgBox, % Format("mismatch: {} {} {}`r`n", vSize, vChars, vChars2)
}
Clipboard := vOutput
MsgBox, % "done: part 1 of 2"

vOutput .= "`r`n"

;decode: char count to bytes
vText := ""
Loop 400
	vText .= "A"
Loop 100
{
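	;round A_Index up to a multiple of 4 to get the padded length
	vChars := 4*Ceil(A_Index/4)
	;vRem is the number of non-pad chars in the final group of 4 (0 means a full group)
	vRem := Mod(A_Index, 4)
	;a final group with only 1 non-pad char is not valid base64, so skip it
	if (vRem = 1)
		continue
	else if (vRem = 2)
		vBase64 := SubStr(vText, 1, vChars-2) "==", vOffset := 2
	else if (vRem = 3)
		vBase64 := SubStr(vText, 1, vChars-1) "=", vOffset := 1
	else if (vRem = 0)
		vBase64 := SubStr(vText, 1, vChars), vOffset := 0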
	vSize := 0
	;CRYPT_STRING_BASE64 := 0x1
	DllCall("crypt32\CryptStringToBinary", "Ptr",&vBase64, "UInt",vChars, "UInt",0x1, "Ptr",0, "UInt*",vSize, "Ptr",0, "Ptr",0)

	;where n is a multiple of 4
	;n chars (including padding) -> n*(3/4) - (2 or 1 or 0) bytes
	vSize2 := Round(vChars*(3/4)) - vOffset ;note: using Round to do float to integer
	vOutput .= Format("{}`t{}`t{}`r`n", vChars, vSize, vSize2)

	if !(vSize = vSize2)
		MsgBox, % Format("mismatch: {} {} {}", vChars, vSize, vSize2)
}
Clipboard := vOutput
MsgBox, % "done: part 2 of 2"
return

LINKS

StrPut/StrGet + hex/base64 - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=50528
Tomasz Ostrowski - Base64 decoder
http://tomeko.net/online_tools/base64.php?lang=en