In the deep bits of UTF-16
see next
Last edited by FredOoo on 12 Jun 2022, 16:19, edited 1 time in total.
(Alan Turing) « What would be the point of saying that A = B if it was really the same thing? »
(Albert Camus) « Misnaming things is to add to the misfortunes of the world. »
Re: In the deep bits of UTF-16
If I simplify the code as much as I can, I still get wrong bytes in the buffer.
Can anyone explain what is wrong?
Code:
#Requires AutoHotkey v2.0-beta.3
strTest := "A📝" ; U+41 U+1F4DD
strArray := [0x41, 0x00, 0xDD, 0xF4, 0x01, 0x00, 0x00, 0x00] ; <<< expected
; 「⢂」 「⠀」 「⣛」 「⡗」 「⢀」 「⠀」 Null-termination
Buff := Buffer(strPut( strTest, "UTF-16" ))
strPut(strTest, Buff, "UTF-16")
msg := ''
loop Buff.size {
num := numGet(Buff, a_Index-1, 'UChar')
msg .= a_Index-1 ": " num " – " strArray[a_Index] '`n'
}
msgBox msg
Code:
; index: buffer – expected
0: 65 – 65
1: 0 – 0
2: 61 – 221 ;|
3: 216 – 244 ;|> Why bytes read in Buffer are not
4: 221 – 1 ;| expected values?
5: 220 – 0 ;|
6: 0 – 0
7: 0 – 0
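As a cross-check outside AHK (an editorial Python sketch, using only the standard UTF-16-LE codec): the bytes the codec produces match the buffer dump, not the "expected" array above.

```python
# Raw little-endian UTF-16 bytes of the test string (without null terminator).
# Note the surrogate pair 0x3D 0xD8, 0xDD 0xDC rather than the code point bytes.
data = "A📝".encode("utf-16-le")
print(list(data))  # [65, 0, 61, 216, 221, 220]
```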
Last edited by FredOoo on 12 Jun 2022, 19:32, edited 3 times in total.
Re: In the deep bits of UTF-16
Maybe Buffer doesn’t work?
Re: In the deep bits of UTF-16
If I check the expected values, they look correct:
221 1101 1101 DD ⣛
244 1111 0100 F4 ⡗
1 0000 0001 01 ⢀
0 0000 0000 00 ⠀
v := b2d('000000011111010011011101') ; bin2dec
console.log chr( v ) ;
So, does `strPut()` write the string to the buffer correctly?
I can’t believe this.
So what?
Re: In the deep bits of UTF-16
When I write the string "A📝" in UTF-16 to a buffer, I believe I am writing this:
"⢂⠀⣛⡗⢀⠀"
And then I can read this buffer byte by byte and get:
1: ⢂
2: ⠀
3: ⣛
4: ⡗
5: ⢀
6: ⠀
Nope ?
Re: In the deep bits of UTF-16
« So, does `strPut()` write the string to the buffer correctly? » It seems so.
Code:
msgBox strget(Buff, 'utf-16') == strTest ; 1
Unicode Character 📝 (U+1F4DD) wrote: UTF-16 Encoding: 0xD83D 0xDCDD
Also, see Wikipedia: Code points from U+010000 to U+10FFFF
Code:
;U' = yyyyyyyyyyxxxxxxxxxx // U - 0x10000
;W1 = 110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
;W2 = 110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
U := 0x1F4DD
U_ := U-0x10000
yyyyyyyyyy := (U_>>10)
W1 := 0xD800 + yyyyyyyyyy
xxxxxxxxxx := U_ & 0x3FF
W2 := 0xDC00 + xxxxxxxxxx
msgbox f(W1) "`n" f(W2)
;0xD83D 0xDCDD
f(n) => format('{:#x}', n)
Code:
strArray := [0x41, 0x00, 0x3D, 0xD8, 0xDD, 0xDC, 0x00, 0x00] ; <<< expected
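The same surrogate arithmetic, sketched in Python for cross-checking (editorial, not part of the original posts):

```python
# U+1F4DD split into its UTF-16 surrogate pair, per the W1/W2 formulas above.
U = 0x1F4DD
U_ = U - 0x10000
W1 = 0xD800 + (U_ >> 10)    # high surrogate: top 10 bits of U'
W2 = 0xDC00 + (U_ & 0x3FF)  # low surrogate: bottom 10 bits of U'
print(hex(W1), hex(W2))  # 0xd83d 0xdcdd
```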
Re: In the deep bits of UTF-16
First of all, thank you Helgef for your answer. You again.
I said that `strPut()` might not write correctly to the Buffer, but that was a small provocation.
In reality I spent a lot more time checking that my expected values were the right ones.
Code:
#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
^!R::reload
a_TrayMenu.default := "&Reload Script"
STOP := false ; loop security on dev time
Esc::global STOP := true
;⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
strTest := "A📝" ; U+41 U+1F4DD
ord_enum( strData,* ){
_enum( &code, &char ){
global STOP
if strLen(strData)=0 or STOP
return false
else {
code := ord(strData)
char := chr(code)
strData := subStr( strData, 1 + strLen(char) )
return true
}
}
return _enum
}
msg := ''
for code, char in ord_enum(strTest)
msg .= a_Index ": " code " – " format("0x{:04X}",code) " – " char " — " getBytes( code ) " — " d2b( code ) (isSet(BB)?" — |" BB.bytes(code) "|":'') "`n"
msgBox msg,,0x40000
getBytes( code ){
r := ''
loop 2 + (2*(code>65535))
r := "|" format('{:03}', code & 0x00FF ) r , code >>= 8
return r "|"
}
d2b( dec ){
; To convert a decimal number to binary number
if dec=0
return '0'
bin𝕤 := []
𝓲 := 0
while dec {
𝓲++
bin𝕤.insertAt( 𝓲, mod(dec, 2) )
dec := dec // 2
}
bin := ''
while 𝓲 {
bin .= bin𝕤[𝓲]
𝓲--
}
return bin
}
Code:
1: 65 – 0x0041 – A — |000|065| — 1000001
2: 128221 – 0x1F4DD – 📝 — |000|001|244|221| — 11111010011011101
Then I started checking the bytes in memory. There I saw that the bytes are the same as in the buffer…
Code:
#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
^!R::reload
a_TrayMenu.default := "&Reload Script"
STOP := false ; loop security on dev time
Esc::global STOP := true
;⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
strTest := "A📝"
msgBox memory( &strTest ), "memory", 0x40000
memory( &strVar ){
A := strPtr(strVar) ; string address
pB := -1 ; previous byte
B := -1
msg := "@" A "`n"
loop {
offset := a_Index-1
pB := B
B := numGet( A , offset, 'UChar')
msg .= offset ": " B "`n"
} until B==0 and pB==0 or STOP
return msg
}
Code:
@5573944
0: 65
1: 0
2: 61
3: 216
4: 221
5: 220
6: 0
7: 0
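Reassembling those decimal bytes little-endian (a quick editorial Python check) gives exactly the surrogate pair:

```python
# bytes 2..5 of the memory dump above: 61, 216, 221, 220
hi = int.from_bytes(bytes([61, 216]), "little")
lo = int.from_bytes(bytes([221, 220]), "little")
print(hex(hi), hex(lo))  # 0xd83d 0xdcdd
```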
I do not thank those who might have been able to put me on the right track in a few words.
I also started reading the Wikipedia page on UTF-16 in English, because it is ultimately clearer than the French one.
Now I will study your answer and come back here soon. But I'm beginning to remember, from similar work done a few years ago on UTF-8 (successfully, at the time), that the bytes of a Unicode code point are not the bytes of UTF-16.
The mistake I made was to believe that the ord() function returns UTF-16 bytes. At least, if I understood correctly; I still have to clarify all this…
The AHK docs say that ord() returns the corresponding Unicode character code (a number between 0x10000 and 0x10FFFF) if the string begins with a Unicode "supplementary character" (in other words, one encoded with a surrogate pair).
But the next sentence says...
Well, I'll come back later.
Re: In the deep bits of UTF-16
Now I can convert a Unicode code point to UTF-16 and check that I have the same UTF-16 bytes in memory.
But the UTF-16 bytes are not the same as the code point’s bytes. That’s normal.
Code:
#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
^!R::reload
a_TrayMenu.default := "&Reload Script"
global ¶ := '`r`n'
global ∫ := '∫'
#include C:\sys\AutoHotkey\SheBang\BBytes\BBytes3.ahk
; BBytes3 @ www.autohotkey.com/boards/viewtopic.php?style=7&f=83&t=80500&p=467193#p467193
STOP := false ; loop security on dev time
Esc::global STOP := true
;⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
;; Unicode → UTF-16
msg := ''
cha := "𐐷"
ucha := Cha_Unicode( cha )
msg .= cha " → " ucha " → " fhex( ucha ) (isSet(BB)?" → |" BB.bytes(ucha) "|":'') " ◄ Unicode" ¶
msg .= fhex( Unicode_UTF16( ucha ) ) " ◄ UTF-16 calculated" ¶
uim := UTF16_inMemory( cha )
msg .= fhex( uim ) " ◄ UTF-16 in memory" ¶
msg .= (isSet(BB)?" → |" BB.bytes(uim) "|":'') ¶
cha := "A"
ucha := Cha_Unicode( cha )
msg .= cha " → " ucha " → " fhex( ucha ) (isSet(BB)?" → |" BB.bytes(ucha) "|":'') " ◄ Unicode" ¶
msg .= fhex( Unicode_UTF16( ucha ) ) " ◄ UTF-16 calculated" ¶
uim := UTF16_inMemory( cha )
msg .= fhex( uim ) " ◄ UTF-16 in memory" ¶
msg .= (isSet(BB)?" → |" BB.bytes(uim) "|":'') ¶
CPRINT msg, "Unicode → UTF-16"
Cha_Unicode( strVar ){
return ord( strVar )
}
Unicode_UTF16( U ){
U_ := U-0x10000
S0 := (U_>>10) + 0xD800
S1 := (U_&0x3FF) + 0xDC00
;Console.log "— " chr(U) " — " fhex(S0) " " fhex(S1) ; <<<<< S0 and S1 calculated
if (S1 & 0xD8)==0xD8
return (S0<<16)+S1
else
return U
}
UTF16_inMemory( strVar ){
varA := strPtr(strVar)
varS0 := numGet( varA , 0, 'UShort')
varS1 := numGet( varA , 2, 'UShort')
;Console.log fhex(varS0) " — " fhex(varS1) ; <<<<< S0 and S1 in memory
if (varS1 & 0xD8)==0xD8 ; surrogate pair (11011…) or varS0>65535 ?
return (varS0<<16)+varS1
else
return varS0
}
fhex( value, w:=4, prefix:=1 ){
if value>0xFFFF
w := 6
if value>0xFFFFFF
w := 8
return format( (prefix?'0x':'') "{:0" w "X}", value )
}
CPRINT( msg, title:=unset ){
if isSet( Console ){
Console.log "============ " title
Console.log msg
}
else
msgBox msg, (isSet(title)?title:a_ScriptName), 0x40000
}
Code:
𐐷 → 66615 → 0x010437 → ∫⢀⠐⣴∫ ◄ Unicode
0xD801DC37 ◄ UTF-16 calculated
0xD801DC37 ◄ UTF-16 in memory
→ ∫⡋⢀⡛⣴∫
But converting UTF-16 back to Unicode is harder.
…I almost forgot what I wanted to do at first.
Re: In the deep bits of UTF-16
UTF-16:
Codepoints 0x0000 to 0xFFFF are stored as 2 bytes.
Codepoints 0x010000 to 0x10FFFF are stored as 4 bytes. As surrogate pairs.
Codepoints 0xD800 to 0xDFFF aren't proper codepoints. They're used for surrogate pairs.
Total number of codepoints: 0x10000-1024-1024+1024*1024 = 1112064.
0x0000-0xD7FF regular chars [chars 0-55295]
0xD800-0xDBFF high surrogate [for chars 65536-1114111]
0xDC00-0xDFFF low surrogate [for chars 65536-1114111]
0xE000-0xFFFF regular chars [57344-65535]
e.g. 'A': Chr(0x41)
e.g. treble clef: MsgBox(Chr(0x1D11E) == Chr(0xD834) Chr(0xDD1E)) ; True
Your example:
MsgBox(Chr(0x1F4DD) == Chr(0xD83D) Chr(0xDCDD)) ; True
To calculate the bytes manually:
Just check the AHK source code. BIF_Ord and BIF_Chr.
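The totals above can be checked mechanically (an editorial Python sketch):

```python
# 0x10000 BMP values minus 2048 surrogate slots, plus 16 supplementary planes.
bmp = 0x10000 - 0x800            # 63488 usable BMP code points
supp = 0x10FFFF - 0x10000 + 1    # 1048576 supplementary code points
print(bmp + supp)  # 1112064

# the two surrogate ranges, 1024 slots each
assert 0xDBFF - 0xD800 + 1 == 1024  # high surrogates
assert 0xDFFF - 0xDC00 + 1 == 1024  # low surrogates
```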
Re: In the deep bits of UTF-16
Thank you, Emile, for your help.
And it’s a good idea to look at the AHK source code. I already got an old version, but I’m not used to doing that.
In my next post, I have something that works. Maybe not the best code, but I think it works. (Almost…)
Last edited by FredOoo on 16 Jun 2022, 01:50, edited 2 times in total.
Re: In the deep bits of UTF-16
Code:
#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
¶ := '`n'
msg := ''
test_UU16( "ሴ" ) ; ሴ U+1234 ETHIOPIC SYLLABLE SEE
test_UU16( "𝄞" ) ; 𝄞 U+1D11E MUSICAL SYMBOL G CLEF
test_UU16( "📝" ) ; 📝 U+1F4DD MEMO
msgBox msg,, 0x40000
test_UU16( C ){
U16 := Unicode_UTF16( ord(C) )
M := UTF16_inMemory( C )
U := UTF16_Unicode( U16[1], U16[2] )
global msg
msg .= C " – " fhex(ord(C)) ¶
msg .= "U16: " fhex(U16[1]) " – " fhex(U16[2]) ¶
msg .= "M: " fhex(M[1]) " – " fhex(M[2]) ¶
msg .= "U: " U " → " fhex(U) (U=ord(C)?" ✓":'') ¶
msg .= "---" ¶ ¶
}
Unicode_UTF16( U ){
;; Unicode to UTF16 encode
; subtract the value 0x10000 : … 0001 / 0000 0000 0000 0000
U_ := U-0x10000
; split off the 10 low-order bits and the 10 high-order bits
BE := (U_>>10) + 0xD800 ; from the 10 high-order bits
LE := (U_&0x3FF) + 0xDC00 ; from the 10 low-order bits
; xxxx xxxx xx]xx xxxx / xxxx xx[xx xxxx xxxx
; 0xD800 ; 1101 1000 0000 0000 /
; 0xDC00 ; / 1101 1100 0000 0000
if (LE & 0xD8)==0xD8 ; 1101 1xxx xxxx xxxx /
return [ BE, LE ]
else
return [ 0, U ]
}
UTF16_Unicode( BE, LE ){
;; UTF16 decode to Unicode
if BE=0
return LE
BE2 := BE - 0xD800
LE2 := LE - 0xDC00
U_ := (BE2 << 10) + LE2
return U_ + 0x10000
}
UTF16_inMemory( C ){
;; Read UTF16 in memory
A := strPtr(C)
BE := numGet( A , 0, 'UShort')
LE := numGet( A , 2, 'UShort')
if (LE & 0xD8)==0xD8 ; surrogate pair (11011…) or BE>65535
return [ BE, LE ]
else
return [ 0, BE ]
}
fhex( value, width:=4, prefix:=1 ){
;; Format hex
if value>0xFFFF
width := 6
if value>0xFFFFFF
width := 8
return format( (prefix?'0x':'') "{:0" width "X}", value )
}
Code:
ሴ – 0x1234
U16: 0x0000 – 0x1234
M: 0x0000 – 0x1234
U: 4660 → 0x1234 ✓
---
𝄞 – 0x01D11E
U16: 0x0000 – 0x01D11E
M: 0x0000 – 0xD834
U: 119070 → 0x01D11E ✓
---
📝 – 0x01F4DD
U16: 0xD83D – 0xDCDD
M: 0xD83D – 0xDCDD
U: 128221 → 0x01F4DD ✓
---
Re: In the deep bits of UTF-16
But everything is still not perfectly clear.
Why can’t I just read memory like this?
Code:
UTF16_inMemory( C ){
;; Read UTF16 in memory
A := strPtr(C)
BE := numGet( A , 0, 'UShort')
LE := numGet( A , 2, 'UShort')
return [ BE, LE ]
;if (LE & 0xD8)==0xD8 ; surrogate pair (11011…) or BE>65535
; return [ BE, LE ]
;else
; return [ 0, BE ]
}
Re: In the deep bits of UTF-16
On the second test (SYMBOL G CLEF), lines U16 and M are not identical.
Code:
𝄞 – 0x01D11E
U16: 0x0000 – 0x01D11E ; ←
M: 0x0000 – 0xD834 ; ←
U: 119070 → 0x01D11E ✓
---
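One likely cause (my reading of the code above, not stated in the thread): the low-surrogate test `(LE & 0xD8)==0xD8` masks only the low byte of the code unit, so it happens to pass for 0xDCDD (📝) but fails for 0xDD1E (𝄞), which makes `UTF16_inMemory` return only the first unit. Testing the top six bits is reliable; a Python sketch of the difference:

```python
def is_low_surrogate(unit):
    # a low surrogate is 0xDC00..0xDFFF, i.e. top six bits are 110111
    return (unit & 0xFC00) == 0xDC00

def buggy(unit):
    # the thread's check: masks only the low byte of the 16-bit unit
    return (unit & 0xD8) == 0xD8

print(buggy(0xDCDD), buggy(0xDD1E))                        # True False
print(is_low_surrogate(0xDCDD), is_low_surrogate(0xDD1E))  # True True
```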
Re: In the deep bits of UTF-16
This one is correct, I think.
It checks for non-characters and out-of-range values.
Code:
#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
^!R::reload
a_TrayMenu.default := "&Reload Script"
¶ := '`n'
msg := ''
test( "A" ) ; A U+41 LETTER LATIN A
test( "𐐷" ) ; 𐐷 U+10437
test( "ሴ" ) ; ሴ U+1234 ETHIOPIC SYLLABLE SEE
test( "𝄞" ) ; 𝄞 U+1D11E MUSICAL SYMBOL G CLEF
test( "📝" ) ; 📝 U+1F4DD MEMO
test( "😎" ) ; 😎 U+1F60E → D8 3D DE 0E in UTF-16BE
U16 := Unicode_UTF16( 0xD9 ) ; non-character
msg .= "U16:`t" fhex(U16[1]) " – " fhex(U16[2]) ¶
msg .= octet(U16[1]) " · " octet(U16[2]) ¶ ¶
U16 := Unicode_UTF16( 0x11FFFF ) ¶ ¶ ; over max
msgBox msg,, 0x40000
/*
C : Character → "📝" / "A"
U : Unicode → 0x 0001 F4DD / 0x 0000 0041 ‘BE’
L : Low address
H : High address
M : Memory [L, H] → [ 0x D83D, 0x DCDD ] / [ 0x 0041, 0x 0000 ] ‘LE’
U16 : UTF-16 [L, H] → [ 0x D83D, 0x DCDD ] / [ 0x 0041, 0x 0000 ] ‘LE’ calculated, same as memory
‘BE’ : Big-Endian [B, L] most significant byte at the smallest address
‘LE’ : Little-Endian [B, L] least significant byte at the smallest address
*/
test( C ){
U16 := Unicode_UTF16( ord(C) )
M := UTF16_inMemory( C )
U := UTF16_Unicode( U16[1], U16[2] )
global msg
msg .= C " – " fhex( ord(C) ) ¶
msg .= "U16:`t" ( U16[1]>0 ?fhex(U16[1]) :"none") " – " fhex(U16[2]) ¶
msg .= "M:`t" fhex(M[1]) " – " ( M[2]>0 ?fhex(M[2]) :"none") ¶
msg .= "U:`t" fhex(U) ( U=ord(C) ?" ✓" :'' ) ¶
msg .= "---" ¶ ¶
}
Unicode_UTF16( U ){
;; Unicode to UTF16 encode
; A = 65 = 0x 0000 0041 = UL,UH ‘LE’ → [ 0041, 0000 ] [ L, H ] ‘BE’
UL:= U >> 16
UH:= U & 0xFFFF
if U < 0x10000 {
global msg
if (UH & 0xD8)==0xD8 ; 1101 1xxx xxxx xxxx
msg .= fhex( U ) " : non-character" ¶ ; throw ValueError("non-character")
return [ 0, UH ]
} else if U > 0x10FFFF {
msg .= fhex( U ) " : over max" ¶ ; throw ValueError("over max")
} else {
U_ := U-0x10000
UH := 0xd800 + ((U_ >> 10) & 0x3ff)
UL := 0xdc00 + ( U_ & 0x3ff)
return [ UH, UL ]
}
}
UTF16_Unicode( L, H ){
;; UTF16 decode to Unicode
if L=0
return H
return 0x10000 + (L - 0xD800) * 0x400 + (H - 0xDC00)
}
UTF16_inMemory( C ){
;; Read UTF16 in memory
; 0x 41 00 00 00 [ 0041, 0000 ] [ L, H ] ‘BE’ (for Unicode 0x 0000 0041 = 65 = A ‘LE’)
A := strPtr(C)
L := numGet( A , 0, 'UShort') ; Low address
H := numGet( A , 2, 'UShort') ; High address (Null-termination if not surrogate)
UTF16_inMemory2( C ) ; Just to see
return [ L, H ]
}
; Format hex:
fhex( value, width:=4, prefix:=true ){
;; Format hex
if value>0xFFFF
width := 6
if value>0xFFFFFF
width := 8
return format( (prefix?'0x':'') "{:0" width "X}", value )
}
; Just to see memory byte by byte:
UTF16_inMemory2( C ){
;; Read bytes in memory
A := strPtr(C)
B0 := numGet( A , 0, 'UChar')
B1 := numGet( A , 1, 'UChar')
B2 := numGet( A , 2, 'UChar')
B3 := numGet( A , 3, 'UChar')
global msg
msg .= octet(B0) " · " octet(B1) " · " octet(B2) " · " octet(B3) ¶
return [ B0, B1, B2, B3 ]
}
octet( byte ){
; 0000 0000
return format('{:04} {:04}',d2b((byte & 0xF0) >> 4),d2b(byte & 0x0F))
}
d2b( dec ){
; decimal to binary
if dec=0
return '0'
i := 0
B := ''
while dec {
i++
B := mod(dec, 2) B
dec := dec // 2
}
return B
}
Code:
/*
0100 0001 · 0000 0000 · 0000 0000 · 0000 0000
A – 0x0041
U16: none – 0x0041
M: 0x0041 – none
U: 0x0041 ✓
---
0000 0001 · 1101 1000 · 0011 0111 · 1101 1100
𐐷 – 0x010437
U16: 0xD801 – 0xDC37
M: 0xD801 – 0xDC37
U: 0x010437 ✓
---
0011 0100 · 0001 0010 · 0000 0000 · 0000 0000
ሴ – 0x1234
U16: none – 0x1234
M: 0x1234 – none
U: 0x1234 ✓
---
0011 0100 · 1101 1000 · 0001 1110 · 1101 1101
𝄞 – 0x01D11E
U16: 0xD834 – 0xDD1E
M: 0xD834 – 0xDD1E
U: 0x01D11E ✓
---
0011 1101 · 1101 1000 · 1101 1101 · 1101 1100
📝 – 0x01F4DD
U16: 0xD83D – 0xDCDD
M: 0xD83D – 0xDCDD
U: 0x01F4DD ✓
---
0011 1101 · 1101 1000 · 0000 1110 · 1101 1110
😎 – 0x01F60E
U16: 0xD83D – 0xDE0E
M: 0xD83D – 0xDE0E
U: 0x01F60E ✓
---
0x00D9 : non-character
U16: 0x0000 – 0x00D9
0000 0000 · 1101 1001
0x11FFFF : over max
*/
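The same round trip, sketched in Python (editorial cross-check) for all the test characters:

```python
def encode_utf16(cp):
    # returns [high, low] surrogate pair, or [0, cp] for BMP code points
    if cp < 0x10000:
        return [0, cp]
    u = cp - 0x10000
    return [0xD800 + (u >> 10), 0xDC00 + (u & 0x3FF)]

def decode_utf16(hi, lo):
    # inverse of encode_utf16: rebuild the code point from its units
    if hi == 0:
        return lo
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

for cp in (0x41, 0x10437, 0x1234, 0x1D11E, 0x1F4DD, 0x1F60E):
    assert decode_utf16(*encode_utf16(cp)) == cp
```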
Re: In the deep bits of UTF-16
This example might clarify things.
Code:
; codepoint to surrogate pair:
codepoint := 119070
char := Chr(codepoint)
pair1 := Ord(SubStr(char, 1, 1))
pair2 := Ord(SubStr(char, 2, 1))
MsgBox(A_Clipboard := pair1 " " pair2) ; 55348 56606
codepoint := 119074
char := Chr(codepoint)
pair1 := Ord(SubStr(char, 1, 1))
pair2 := Ord(SubStr(char, 2, 1))
MsgBox(A_Clipboard := pair1 " " pair2) ; 55348 56610
MsgBox(Chr(119070) == Chr(55348) Chr(56606)) ; True (treble clef)
MsgBox(Chr(119074) == Chr(55348) Chr(56610)) ; True (bass clef)
return
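The same demonstration can be reproduced in Python (editorial sketch; Python strings store code points directly, so the surrogate units have to be obtained via the UTF-16-BE codec):

```python
def surrogate_pair(cp):
    # split a supplementary code point into its two UTF-16 code units
    raw = chr(cp).encode("utf-16-be")
    return int.from_bytes(raw[:2], "big"), int.from_bytes(raw[2:], "big")

print(surrogate_pair(119070))  # (55348, 56606)  treble clef
print(surrogate_pair(119074))  # (55348, 56610)  bass clef
```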