jeeswg's floats and doubles mini-tutorial

jeeswg · 14 Apr 2019, 15:56

INTRODUCTION

- Float (4 bytes = 32 bits) (at least 6 significant decimal digits of precision) (single-precision floating-point format).
- Double (8 bytes = 64 bits) (at least 15 significant decimal digits of precision) (double-precision floating-point format).

- Float and Double can store numbers with fractional parts, unlike (U)Char/(U)Short/(U)Int/(U)Int64, which can only handle integers.
- They can also store integers beyond the range of UInt64/Int64. However, they cannot store all integers exactly: some integers are approximated.
- The smallest positive integer that can't be represented exactly:
- For Floats: 16,777,217 = 2**24 + 1
- For Doubles: 9,007,199,254,740,993 = 2**53 + 1
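
- A minimal sketch to check these two limits, assuming AutoHotkey v1.1 (the variable names are illustrative). It stores integers around 2**24 in a Float and around 2**53 in a Double, then reads them back. The middle value of each group (2**24 + 1 and 2**53 + 1) should come back as the neighbouring even value rather than itself:

```autohotkey
;store integers around 2**24 in a Float, then read them back
VarSetCapacity(vData, 8, 0)
vOutput := ""
for _, vNum in [16777216, 16777217, 16777218]
{
	NumPut(vNum, &vData, 0, "Float")
	vOutput .= vNum "`t" Format("{:.0f}", NumGet(&vData, 0, "Float")) "`r`n"
}
;the same test around 2**53, using a Double
for _, vNum in [9007199254740992, 9007199254740993, 9007199254740994]
{
	NumPut(vNum, &vData, 0, "Double")
	vOutput .= vNum "`t" Format("{:.0f}", NumGet(&vData, 0, "Double")) "`r`n"
}
;column 1: the integer requested, column 2: the value actually stored
MsgBox, % vOutput
```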

- Char/Short/Int/Int64, and their unsigned equivalents, store integers.
- Float and Double store floating-point numbers, i.e. numbers with an integer part and/or a fractional part.
- E.g. -1.23456e2 = -1.23456 * 10**2 = -123.456
- Three components are stored in a Float/Double: a sign bit (0 for positive, 1 for negative), a significand (e.g. 1.23456), and an exponent (a positive, zero, or negative integer, e.g. 2).
- The significand and exponent are stored as binary numbers, and the exponent is a power of 2 rather than a power of 10.

- Bits per component:
- Float: 1+8+23=32, sign/exponent/significand.
- Double: 1+11+52=64, sign/exponent/significand.
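
- A rough sketch of how the three fields could be pulled apart with bit shifts and masks, assuming AutoHotkey v1.1 (the variable names are illustrative). It reads the same bytes back as an integer, splits off sign/exponent/significand, and rebuilds the value. The rebuild formula used here only applies to normal numbers (exponent bits neither all 0 nor all 1):

```autohotkey
;split a Float into its sign/exponent/fraction fields and rebuild the value
vNum := -123.456
VarSetCapacity(vData, 8, 0)
NumPut(vNum, &vData, 0, "Float")
vBits := NumGet(&vData, 0, "UInt") ;the same 4 bytes read as an unsigned integer
vSign := vBits >> 31 ;1 bit
vPow := (vBits >> 23) & 0xFF ;8 bits (biased by 127)
vFrac := vBits & 0x7FFFFF ;23 bits
vValue := (vSign ? -1 : 1) * 2**(vPow-127) * (1 + vFrac/2**23)
MsgBox, % "Float: " vSign " " vPow " " vFrac "`r`n" Format("{:.6f}", vValue)

;the same for a Double
NumPut(vNum, &vData, 0, "Double")
vBits := NumGet(&vData, 0, "Int64") ;read as Int64 (AHK integers are signed 64-bit)
vSign := (vBits >> 63) & 1 ;mask needed: >> on a negative Int64 is an arithmetic shift
vPow := (vBits >> 52) & 0x7FF ;11 bits (biased by 1023)
vFrac := vBits & 0xFFFFFFFFFFFFF ;52 bits
vValue := (vSign ? -1 : 1) * 2**(vPow-1023) * (1 + vFrac/2**52)
MsgBox, % "Double: " vSign " " vPow " " vFrac "`r`n" Format("{:.6f}", vValue)
```

- The rebuilt Float should differ slightly from -123.456 (it only has 23 fraction bits), while the rebuilt Double should match to 6 decimal places.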

FLOATING-POINT NUMBERS IN AUTOHOTKEY

- Examples of creating a floating-point number (a double, with/without an exponent) for ordinary use in AutoHotkey:

vNum := -123.456
MsgBox, % vNum

vNum := -1.23456e2
MsgBox, % vNum
MsgBox, % Format("{:.6f}", vNum)

vNum := -123.456e0 ;the 'e0' here is redundant
MsgBox, % vNum
MsgBox, % Format("{:.6f}", vNum)

BINARY REPRESENTATION OF FLOATS/DOUBLES

- To manually inspect a Float/Double, reverse the bytes (they are stored in little-endian order) and read the bits from left to right.
- We demonstrate 7, -7, and the max/min values for Float and Double:

;we use these 3 formulas to calculate the values:

;general formula:
;float: (-1)**sign * 2**(pow-127) * 1.fraction
;double: (-1)**sign * 2**(pow-1023) * 1.fraction

;formula where pow bits all 0:
;float: 0, -0 or denormal numbers: (-1)**sign * 2**(-126) * 0.fraction
;double: 0, -0 or denormal numbers: (-1)**sign * 2**(-1022) * 0.fraction

;formula where pow bits all 1:
;float: +infinity, -infinity or NaN
;double: +infinity, -infinity or NaN

;min/max values when the sign is ignored (excluding positive infinity and zero):
;max (n): the maximum normal value
;min (n): the minimum normal value
;min (d): the minimum denormal value
;note: for each of the three values, specify a negative sign bit, for the negative equivalent

;Float:
7	0 10000001 11000000000000000000000 ;7 = (-1)**0 * 2**(129-127) * 0b1.11 = 1 * 4 * (1+0.5+0.25) = 1 * 4 * 1.75 = 7
-7	1 10000001 11000000000000000000000 ;-7 = (-1)**1 * 2**(129-127) * 0b1.11 = -1 * 4 * (1+0.5+0.25) = -1 * 4 * 1.75 = -7
max (n)	0 11111110 11111111111111111111111 ;max (normal) = (-1)**0 * 2**(254-127) * 0b1.11111111111111111111111 = 1 * 2**127 * (2-(2**-23)) = approx. 3.402823 * 10**38
min (n)	0 00000001 00000000000000000000000 ;min (normal) = (-1)**0 * 2**(1-127) * 0b1 = 1 * (2**-126) * 1 = 2**-126 = approx. 1.18 * 10**-38
min (d)	0 00000000 00000000000000000000001 ;min (denormal) = (-1)**0 * 2**(-126) * 0b0.00000000000000000000001 = 1 * 2**(-126) * 2**(-23) = 2**(-149) = approx. 1.40 * 10**-45

;Double:
7	0 10000000001 1100000000000000000000000000000000000000000000000000 ;7 = (-1)**0 * 2**(1025-1023) * 0b1.11 = 1 * 4 * (1+0.5+0.25) = 1 * 4 * 1.75 = 7
-7	1 10000000001 1100000000000000000000000000000000000000000000000000 ;-7 = (-1)**1 * 2**(1025-1023) * 0b1.11 = -1 * 4 * (1+0.5+0.25) = -1 * 4 * 1.75 = -7
max (n)	0 11111111110 1111111111111111111111111111111111111111111111111111 ;max (normal) = (-1)**0 * 2**(2046-1023) * 0b1.1111111111111111111111111111111111111111111111111111 = 1 * 2**1023 * (2-(2**-52)) = approx. 1.797693 * 10**308
min (n)	0 00000000001 0000000000000000000000000000000000000000000000000000 ;min (normal) = (-1)**0 * 2**(1-1023) * 0b1 = 1 * (2**-1022) * 1 = 2**-1022 = approx. 2.23 * 10**-308
min (d)	0 00000000000 0000000000000000000000000000000000000000000000000001 ;min (denormal) = (-1)**0 * 2**(-1022) * 0b0.0000000000000000000000000000000000000000000000000001 = 1 * 2**(-1022) * 2**(-52) = 2**(-1074) = approx. 4.94 * 10**-324
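
- A small sketch for checking the Float rows above, assuming AutoHotkey v1.1. The hex constants are my own transcriptions of the five bit patterns (7, -7, max normal, min normal, min denormal), and the variable names are illustrative. Each pattern is written as a UInt, the bytes are listed in memory order (to show the little-endian storage), and the same bytes are read back as a Float:

```autohotkey
;write a 32-bit pattern as a UInt, list the bytes as stored in memory, read it back as a Float
VarSetCapacity(vData, 4, 0)
vOutput := ""
for _, vBits in [0x40E00000, 0xC0E00000, 0x7F7FFFFF, 0x00800000, 0x00000001]
{
	NumPut(vBits, &vData, 0, "UInt")
	vBytes := ""
	Loop, 4
		vBytes .= Format("{:02X} ", NumGet(&vData, A_Index-1, "UChar"))
	;columns: pattern, bytes in memory order, value in scientific notation
	vOutput .= Format("0x{:08X}", vBits) "`t" vBytes "`t" Format("{:e}", NumGet(&vData, 0, "Float")) "`r`n"
}
MsgBox, % vOutput
```

- For the first row (7), the bytes in memory order should read 00 00 E0 40, i.e. the reverse of the pattern 0x40E00000.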

- This script fills a Float/Double with each repeated byte value (0 to 255) and retrieves the resulting value:

q:: ;fill floats/doubles with bytes and retrieve the values
vOutput := ""
VarSetCapacity(vOutput, 1000000*2)
Loop, 256
{
	vNum := A_Index-1
	VarSetCapacity(vData, 8, vNum)
	vNum1 := NumGet(&vData, 0, "Float") ;read first 4 bytes only
	vNum2 := NumGet(&vData, 0, "Double") ;read all 8 bytes
	vNum1 := Format("{:.60f}", vNum1)
	vNum2 := Format("{:.60f}", vNum2)
	;vNum1 := Format("{:.400f}", vNum1)
	;vNum2 := Format("{:.400f}", vNum2)
	vNum1 := RTrim(vNum1, "0")
	vNum2 := RTrim(vNum2, "0")
	if (vNum1 ~= "\.$")
		vNum1 .= "0"
	if (vNum2 ~= "\.$")
		vNum2 .= "0"
	vOutput .= vNum "`t" vNum1 "`t" vNum2 "`r`n"
}
Clipboard := vOutput
MsgBox, % "done"
return

- Here are some examples of how numbers are stored as Floats:

note: NaN means 'not a number'

-1.#QNAN0	1 11111111 11111111111111111111111 [NaN (quiet, since the first fraction bit is 1; a signalling NaN has a first fraction bit of 0 and a non-zero fraction)]
-0.000000	1 00000000 11111111111111111111111 [denormal number]
-1.#INF00	1 11111111 00000000000000000000000 [-infinity]
-0.000000	1 00000000 00000000000000000000000 [-0]

1.#QNAN0	0 11111111 11111111111111111111111 [NaN (quiet, since the first fraction bit is 1)]
0.000000	0 00000000 11111111111111111111111 [denormal number]
1.#INF00	0 11111111 00000000000000000000000 [+infinity]
0.000000	0 00000000 00000000000000000000000 [0]

-1	1 01111111 00000000000000000000000 [1->-1, 127->0, 0] -1 * 2**0 * (1+0)
0	0 00000000 00000000000000000000000 special case: 0
0.1 (approx.)	0 01111011 10011001100110011001101 [0->1, 123->-4, 0.6] 1 * 2**(-4) * (1+0.6)
0.25	0 01111101 00000000000000000000000 [0->1, 125->-2, 0] 1 * 2**(-2) * (1+0)
0.33333 (approx.)	0 01111101 01010101010101010101011 [0->1, 125->-2, 1/3] 1 * 2**(-2) * (1+1/3)
0.5	0 01111110 00000000000000000000000 [0->1, 126->-1, 0] 1 * 2**(-1) * (1+0)
1	0 01111111 00000000000000000000000 [0->1, 127->0, 0] 1 * 2**0 * (1+0)
1.1 (approx.)	0 01111111 00011001100110011001101 [0->1, 127->0, 0.1] 1 * 2**0 * (1+0.1)
1.25	0 01111111 01000000000000000000000 [0->1, 127->0, 0.25] 1 * 2**0 * (1+0.25)
1.33333 (approx.)	0 01111111 01010101010101010101011 [0->1, 127->0, 1/3] 1 * 2**0 * (1+1/3)
1.5	0 01111111 10000000000000000000000 [0->1, 127->0, 0.5] 1 * 2**0 * (1+0.5)
2	0 10000000 00000000000000000000000 [0->1, 128->1, 0] 1 * 2**1 * (1+0)
3	0 10000000 10000000000000000000000 [0->1, 128->1, 0.5] 1 * 2**1 * (1+0.5)
4	0 10000001 00000000000000000000000 [0->1, 129->2, 0] 1 * 2**2 * (1+0)
5	0 10000001 01000000000000000000000 [0->1, 129->2, 0.25] 1 * 2**2 * (1+0.25)
6	0 10000001 10000000000000000000000 [0->1, 129->2, 0.5] 1 * 2**2 * (1+0.5)
7	0 10000001 11000000000000000000000 [0->1, 129->2, 0.75] 1 * 2**2 * (1+0.75)
8	0 10000010 00000000000000000000000 [0->1, 130->3, 0] 1 * 2**3 * (1+0)
9	0 10000010 00100000000000000000000 [0->1, 130->3, 0.125] 1 * 2**3 * (1+0.125)
10	0 10000010 01000000000000000000000 [0->1, 130->3, 0.25] 1 * 2**3 * (1+0.25)
11	0 10000010 01100000000000000000000 [0->1, 130->3, 0.375] 1 * 2**3 * (1+0.375)
12	0 10000010 10000000000000000000000 [0->1, 130->3, 0.5] 1 * 2**3 * (1+0.5)
100	0 10000101 10010000000000000000000 [0->1, 133->6, 0.5625] 1 * 2**6 * (1+0.5625)
1000	0 10001000 11110100000000000000000 [0->1, 136->9, 0.953125] 1 * 2**9 * (1+0.953125)
10000	0 10001100 00111000100000000000000 [0->1, 140->13, 0.220703125] 1 * 2**13 * (1+0.220703125)
100000	0 10001111 10000110101000000000000 [0->1, 143->16, 0.52587890625] 1 * 2**16 * (1+0.52587890625)
1000000	0 10010010 11101000010010000000000 [0->1, 146->19, 0.9073486328125] 1 * 2**19 * (1+0.9073486328125)
10000000	0 10010110 00110001001011010000000 [0->1, 150->23, 0.1920928955078125] 1 * 2**23 * (1+0.1920928955078125)
16777216	0 10010111 00000000000000000000000 [0->1, 151->24, 0] 1 * 2**24 * (1+0)
16777218	0 10010111 00000000000000000000001 [0->1, 151->24, 0.00000011920928955078125] 1 * 2**24 * (1+0.00000011920928955078125)
100000000	0 10011001 01111101011110000100000 [0->1, 153->26, 0.490116119384765625] 1 * 2**26 * (1+0.490116119384765625)
1000000000	0 10011100 11011100110101100101000 [0->1, 156->29, 0.86264514923095703125] 1 * 2**29 * (1+0.86264514923095703125)
10000000000	0 10100000 00101010000001011111001 [0->1, 160->33, 0.16415321826934814453125] 1 * 2**33 * (1+0.16415321826934814453125)

note: 16777217 is the smallest positive integer that can't be represented exactly
note: if you try to store 100000000000 in AHK as a Float, then retrieve it, you get 99999997952.000000
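
- A short sketch of that round trip, assuming AutoHotkey v1.1 (the test values and variable names are my own choice). Each value is stored as a Float, read back, and compared with the original. 10000000000 and 0.25 should survive unchanged, while 100000000000 and 0.1 should come back slightly altered:

```autohotkey
;round-trip some values through a Float and compare with the original
VarSetCapacity(vData, 4, 0)
vOutput := ""
for _, vNum in [10000000000, 100000000000, 0.25, 0.1]
{
	NumPut(vNum, &vData, 0, "Float")
	vNum2 := NumGet(&vData, 0, "Float")
	;columns: original value, value read back, whether the two compare equal
	vOutput .= vNum "`t" Format("{:.10f}", vNum2) "`t" (vNum = vNum2 ? "exact" : "approx.") "`r`n"
}
MsgBox, % vOutput
```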

- Here are some functions for experimenting with binary representations:

;==================================================

;JEE_Bin2Dec
JEE_BinToDec(vBin)
{
	local
	vIndex := StrLen(vBin)
	vOutput := 0
	Loop, Parse, vBin
	{
		vIndex--
		if (A_LoopField = "0")
			continue
		else if (A_LoopField = "1")
			vOutput += 2 ** vIndex
		else
			return
	}
	return vOutput
}

;==================================================

;where vLen is the minimum length of the number to return (i.e. pad it with zeros if necessary)
;JEE_Dec2Bin
JEE_DecToBin(vNum, vLen:=1)
{
	local
	;convert '0x' form to dec
	if !RegExMatch(vNum, "^\d+$")
		vNum += 0
	if !RegExMatch(vNum, "^\d+$")
		return
	vBin := ""
	while vNum
		vBin := (vNum & 1) vBin, vNum >>= 1
	return Format("{:0" vLen "}", vBin)

	;if (StrLen(vBin) < vLen)
	;	Loop, % vLen - StrLen(vBin)
	;		vBin := "0" vBin
	;return vBin
}

;==================================================

;e.g. MsgBox, % JEE_FloatToBin(-123.456) ;1 10000101 11101101110100101111001

JEE_FloatToBin(vNum)
{
	local
	VarSetCapacity(vData, 4)
	NumPut(vNum, &vData, 0, "Float")
	vBin := ""
	Loop, 4
		vBin .= JEE_DecToBin(NumGet(&vData, 4-A_Index, "UChar"), 8)
	return RegExReplace(vBin, "^(.)(.{8})", "$1 $2 ")
}

;==================================================

;e.g. MsgBox, % JEE_DoubleToBin(-123.456) ;1 10000000101 1110110111010010111100011010100111111011111001110111

JEE_DoubleToBin(vNum)
{
	local
	VarSetCapacity(vData, 8)
	NumPut(vNum, &vData, 0, "Double")
	vBin := ""
	Loop, 8
		vBin .= JEE_DecToBin(NumGet(&vData, 8-A_Index, "UChar"), 8)
	return RegExReplace(vBin, "^(.)(.{11})", "$1 $2 ")
}

;==================================================

;e.g. MsgBox, % JEE_BinToFloat("1 10000101 11101101110100101111001") ;-123.456001

JEE_BinToFloat(vBin)
{
	local
	vBin := StrReplace(vBin, " ")
	VarSetCapacity(vData, 4)
	Loop, 4
	{
		vNum := JEE_BinToDec(SubStr(vBin, A_Index*8-7, 8))
		, NumPut(vNum, &vData, 4-A_Index, "UChar")
	}
	return NumGet(&vData, 0, "Float")
}

;==================================================

;e.g. MsgBox, % JEE_BinToDouble("1 10000000101 1110110111010010111100011010100111111011111001110111") ;-123.456000

JEE_BinToDouble(vBin)
{
	local
	vBin := StrReplace(vBin, " ")
	VarSetCapacity(vData, 8)
	Loop, 8
	{
		vNum := JEE_BinToDec(SubStr(vBin, A_Index*8-7, 8))
		, NumPut(vNum, &vData, 8-A_Index, "UChar")
	}
	return NumGet(&vData, 0, "Double")
}

;==================================================

LINKS

IEEE 754 - Wikipedia
https://en.wikipedia.org/wiki/IEEE_754
Single-precision floating-point format - Wikipedia
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
Double-precision floating-point format - Wikipedia
https://en.wikipedia.org/wiki/Double-precision_floating-point_format

[smallest positive integer that Floats/Doubles can't store exactly]
types - Which is the first integer that an IEEE 754 float is incapable of representing exactly? - Stack Overflow
https://stackoverflow.com/questions/3793838/which-is-the-first-integer-that-an-ieee-754-float-is-incapable-of-representing-e

[useful for converting binary fractions]
0.1001 base 2 to base 10 - Wolfram|Alpha
https://www.wolframalpha.com/input/?i=0.1001+base+2+to+base+10
