In the deep bits of UTF-16

Get help with using AutoHotkey (v2 or newer) and its commands and hotkeys
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris


Post by FredOoo » 12 Jun 2022, 04:30

see next
Last edited by FredOoo on 12 Jun 2022, 16:19, edited 1 time in total.
(Alan Turing) « What would be the point of saying that A = B if it was really the same thing? »
(Albert Camus) « Misnaming things is to add to the misfortunes of the world. »


Post by FredOoo » 12 Jun 2022, 12:10

If I simplify the code as much as I can, I still get wrong bytes in the buffer.

Can anyone explain what is wrong?
 

Code: Select all

#Requires AutoHotkey v2.0-beta.3

    strTest  := "A📝" ; U+41 U+1F4DD
    strArray := [0x41, 0x00,   0xDD, 0xF4, 0x01, 0x00,      0x00, 0x00] ; <<< expected
    ;            「⢂」    「⠀」      「⣛」   「⡗」   「⢀」   「⠀」        Null-termination
    Buff := Buffer(strPut( strTest, "UTF-16" ))
    strPut(strTest, Buff, "UTF-16")
    msg := ''
    loop Buff.size {
        num := numGet(Buff, a_Index-1, 'UChar')
        msg .=  a_Index-1 ": " num " – " strArray[a_Index] '`n'
    }
    msgBox msg
 

Code: Select all

; index: buffer – expected
 0:   65 – 65
 1:    0 – 0
 2:   61 – 221	;|
 3:  216 – 244	;|> Why are the bytes read from the Buffer
 4:  221 – 1	;|   not the expected values?
 5:  220 – 0	;|
 6:    0 – 0
 7:    0 – 0
Last edited by FredOoo on 12 Jun 2022, 19:32, edited 3 times in total.

Post by FredOoo » 12 Jun 2022, 16:23

Maybe Buffer doesn’t work?

Post by FredOoo » 12 Jun 2022, 20:21

If I check the expected values, they look correct:

221 1101 1101 DD ⣛
244 1111 0100 F4 ⡗
  1 0000 0001 01 ⢀
  0 0000 0000 00 ⠀

v := b2d('000000011111010011011101') ; bin2dec
console.log chr( v ) ; 📝

So, does `strPut()` write the string to the buffer correctly?
I can’t believe this.
So what is going on?

Post by FredOoo » 12 Jun 2022, 21:28

When I write the string "A📝" in UTF-16 to a buffer, I believe I am writing this to the buffer:
"⢂⠀⣛⡗⢀⠀"
And that I can then read this buffer byte by byte and get:
1: ⢂
2: ⠀
3: ⣛
4: ⡗
5: ⢀
6: ⠀
No?

Helgef
Posts: 4709
Joined: 17 Jul 2016, 01:02

Post by Helgef » 13 Jun 2022, 11:40

So, does `strPut()` write the string to the buffer correctly?
It seems so:

Code: Select all

msgBox strget(Buff, 'utf-16') == strTest ; 1

Unicode Character 📝 (U+1F4DD) wrote: UTF-16 Encoding: 0xD83D 0xDCDD
Also, see Wikipedia: Code points from U+010000 to U+10FFFF

Code: Select all

;U' = yyyyyyyyyyxxxxxxxxxx  // U - 0x10000
	;W1 = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
	;W2 = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx
	
	U := 0x1F4DD
	U_ := U-0x10000
	
	yyyyyyyyyy := (U_>>10)
	W1 := 0xD800 + yyyyyyyyyy
	
	xxxxxxxxxx := U_ & 0x3FF
	W2 := 0xDC00 + xxxxxxxxxx
	
	msgbox f(W1) "`n" f(W2) 
	;0xD83D 0xDCDD
	
	f(n) => format('{:#x}', n)
So,

Code: Select all

strArray := [0x41, 0x00,   0x3D, 0xD8, 0xDD, 0xDC,    0x00, 0x00] ; <<< expected
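To tie the two code units back to the byte dump in the first post, here is a minimal sketch. It assumes the usual little-endian layout of UTF-16 in memory on Windows, and the helper name `CodeUnitBytes` is made up for illustration. It splits each 16-bit code unit into its two bytes, low byte first:

```autohotkey
#Requires AutoHotkey v2.0

; Hypothetical helper: split 16-bit code units into bytes, low byte first
; (little-endian), as they would appear in a UTF-16 buffer on Windows.
CodeUnitBytes(units*) {
    bytes := []
    for unit in units {
        bytes.Push(unit & 0xFF)         ; low byte is stored first
        bytes.Push((unit >> 8) & 0xFF)  ; high byte is stored second
    }
    return bytes
}

; 'A', the surrogate pair for U+1F4DD, and the null terminator
out := ''
for b in CodeUnitBytes(0x0041, 0xD83D, 0xDCDD, 0x0000)
    out .= (A_Index - 1) ': ' b '`n'
MsgBox out
```

The values listed should be 65, 0, 61, 216, 221, 220, 0, 0, which is the "buffer" column reported in the question.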
Cheers


Post by FredOoo » 13 Jun 2022, 16:07

First of all, thank you Helgef for your answer. You again.

I said that `strPut()` might not write correctly to the Buffer, but that was a small provocation.
In reality I spent a lot more time checking that my expected values were the right ones.
 

Code: Select all

#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
^!R::reload
a_TrayMenu.default := "&Reload Script"
STOP := false ; loop security on dev time
Esc::global STOP := true
;⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

strTest  := "A📝" ; U+41 U+1F4DD

ord_enum( strData,* ){
    _enum( &code, &char ){
        global STOP
        if strLen(strData)=0 or STOP
            return false
        else {
            code := ord(strData)
            char := chr(code)
            strData := subStr( strData, 1 + strLen(char) )
            return true
        }
    }
    return _enum
}

msg := ''
for code, char in ord_enum(strTest)
    msg .= a_Index ": " code " – " format("0x{:04X}",code) " – " char " — " getBytes( code ) " — " d2b( code ) (isSet(BB)?" — |" BB.bytes(code) "|":'') "`n"
msgBox msg,,0x40000

getBytes( code ){
    r := ''
    loop 2 + (2*(code>65535))
        r := "|" format('{:03}', code & 0x00FF ) r , code >>= 8
    return r "|"
}

d2b( dec ){
    ; To convert a decimal number to binary number
    if dec=0
        return '0'
    bin𝕤 := []
    𝓲    := 0
    while dec {
        𝓲++
        bin𝕤.insertAt( 𝓲, mod(dec, 2) )
        dec := dec // 2
    }
    bin := ''
    while 𝓲 {
        bin .= bin𝕤[𝓲]
        𝓲--
    }
    return bin
}
 

Code: Select all

1: 65 – 0x0041 – A — |000|065| — 1000001
2: 128221 – 0x1F4DD – 📝 — |000|001|244|221| — 11111010011011101

Then I started checking the bytes in memory. There I saw that the bytes are the same as in the buffer…
 

Code: Select all

#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
^!R::reload
a_TrayMenu.default := "&Reload Script"
STOP := false ; loop security on dev time
Esc::global STOP := true
;⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

strTest := "A📝"
msgBox memory( &strTest ), "memory", 0x40000

memory( &strVar ){
    A  := strPtr(strVar) ; string address
    pB := -1             ; previous byte
    B  := -1
    msg := "@" A "`n"
    loop {
        offset := a_Index-1
        pB := B
        B  := numGet( A , offset, 'UChar')
        msg .= offset ": " B "`n"
    } until B==0 and pB==0 or STOP
    return msg
}
 

Code: Select all

@5573944
0: 65
1: 0
2: 61
3: 216
4: 221
5: 220
6: 0
7: 0

I do not thank those who might have been able to put me on the right track in a few words.

I also started reading the Wikipedia page on UTF-16 in English because it is ultimately clearer than the French one.

Now I will study your answer and come back here soon. But I'm beginning to remember from similar work done a few years ago on UTF-8 (successfully, that time) that the bytes of a Unicode code point are not the bytes of its UTF-16 encoding.

The mistake I made was to believe that the ord() function returns UTF-16 bytes, if I understood correctly; I still have to clarify all this…

The AHK doc says that ord() returns the corresponding Unicode character code (a number between 0x10000 and 0x10FFFF) if the string begins with a Unicode "supplementary character" (that is, a character stored as a surrogate pair).
But the next sentence says...
Well, I'll come back later.
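A small sketch of that distinction, assuming a Unicode build of AutoHotkey v2 where native strings are UTF-16: `Ord()` gives the code point, while reading the string's memory with `NumGet` gives the 16-bit code units, and `StrLen` counts code units rather than characters.

```autohotkey
#Requires AutoHotkey v2.0

s := Chr(0x1F4DD)                           ; 📝, built from its code point
codePoint := Ord(s)                         ; 0x1F4DD: the code point
unit0 := NumGet(StrPtr(s), 0, "UShort")     ; 0xD83D: first code unit in memory
unit1 := NumGet(StrPtr(s), 2, "UShort")     ; 0xDCDD: second code unit in memory

MsgBox Format("Ord: 0x{:X}`nunits: 0x{:X} 0x{:X}`nStrLen: {}"
    , codePoint, unit0, unit1, StrLen(s))   ; StrLen is 2 (two code units)
```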

Post by FredOoo » 14 Jun 2022, 04:01

 
Now I can convert a Unicode code point to UTF-16 and check that memory holds the same UTF-16 bytes.

But the UTF-16 bytes are not the same as the bytes of the code point. That’s expected.
 

Code: Select all

#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
^!R::reload
a_TrayMenu.default := "&Reload Script"
global ¶ := '`r`n'
global ∫ := '∫'
#include C:\sys\AutoHotkey\SheBang\BBytes\BBytes3.ahk
; BBytes3 @ www.autohotkey.com/boards/viewtopic.php?style=7&f=83&t=80500&p=467193#p467193
STOP := false ; loop security on dev time
Esc::global STOP := true
;⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

    ;; Unicode → UTF-16
    msg := ''

    cha  := "𐐷"
    ucha := Cha_Unicode( cha )
    msg .= cha " → " ucha " → " fhex( ucha ) (isSet(BB)?" → |" BB.bytes(ucha) "|":'') " ◄ Unicode" ¶
    msg .= fhex( Unicode_UTF16( ucha ) ) " ◄ UTF-16 calculated" ¶
    uim := UTF16_inMemory( cha )
    msg .= fhex( uim ) " ◄ UTF-16 in memory" ¶
    msg .= (isSet(BB)?" → |" BB.bytes(uim) "|":'') ¶

    cha  := "A"
    ucha := Cha_Unicode( cha )
    msg .= cha " → " ucha " → " fhex( ucha ) (isSet(BB)?" → |" BB.bytes(ucha) "|":'') " ◄ Unicode" ¶
    msg .= fhex( Unicode_UTF16( ucha ) ) " ◄ UTF-16 calculated" ¶
    uim := UTF16_inMemory( cha )
    msg .= fhex( uim ) " ◄ UTF-16 in memory" ¶
    msg .= (isSet(BB)?" → |" BB.bytes(uim) "|":'') ¶


    CPRINT msg, "Unicode → UTF-16"
    
    Cha_Unicode( strVar ){
        return ord( strVar )
    }
    
    Unicode_UTF16( U ){
        U_ := U-0x10000
        S0 := (U_>>10)   + 0xD800
        S1 := (U_&0x3FF) + 0xDC00
        ;Console.log "— " chr(U) " — " fhex(S0) " " fhex(S1) ; <<<<< computed S0 and S1
        if U >= 0x10000  ; only code points above 0xFFFF need a surrogate pair
            return (S0<<16)+S1
        else
            return U
    }
    UTF16_inMemory( strVar ){
        varA := strPtr(strVar)
        varS0 := numGet( varA , 0, 'UShort')
        varS1 := numGet( varA , 2, 'UShort')
        ;Console.log fhex(varS0) " — " fhex(varS1) ; <<<<< S0 and S1 in memory
        if (varS0 & 0xFC00)==0xD800  ; varS0 is a high surrogate (1101 10xx xxxx xxxx)
            return (varS0<<16)+varS1
        else
            return varS0
    }

    
    fhex( value, w:=4, prefix:=1 ){
        if value>0xFFFF
            w := 6
        if value>0xFFFFFF
            w := 8
        return format( (prefix?'0x':'') "{:0" w "X}", value )
    }
    CPRINT( msg, title:=unset ){
        if isSet( Console ){
            Console.log "============ " title
            Console.log msg
        }
        else
            msgBox msg, (isSet(title)?title:a_ScriptName), 0x40000
    }
 

Code: Select all

𐐷 → 66615 → 0x010437 → ∫⢀⠐⣴∫ ◄ Unicode
0xD801DC37 ◄ UTF-16 calculated
0xD801DC37 ◄ UTF-16 in memory
 → ∫⡋⢀⡛⣴∫
 
But converting UTF-16 back to Unicode is harder.
…I almost forgot what I wanted to do at first.
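For the reverse direction, here is a minimal sketch that inverts the formula quoted earlier in the thread (W1 = 0xD800 + high 10 bits, W2 = 0xDC00 + low 10 bits). The function name `DecodeUTF16` and its error messages are made up for illustration; it is not how AutoHotkey itself implements `Ord()`.

```autohotkey
#Requires AutoHotkey v2.0

; Hypothetical helper: decode one code point from one or two UTF-16 code units.
; Throws a ValueError if a surrogate is not properly paired.
DecodeUTF16(unit0, unit1 := 0) {
    if (unit0 & 0xFC00) = 0xD800 {              ; high surrogate: 1101 10xx xxxx xxxx
        if (unit1 & 0xFC00) != 0xDC00           ; it must be followed by a low surrogate
            throw ValueError("unpaired high surrogate")
        ; put the two 10-bit halves back together and add the 0x10000 offset back
        return 0x10000 + ((unit0 & 0x3FF) << 10) + (unit1 & 0x3FF)
    }
    if (unit0 & 0xFC00) = 0xDC00                ; low surrogate with no high one before it
        throw ValueError("unpaired low surrogate")
    return unit0                                ; otherwise the unit is the code point
}

MsgBox Format("0x{:X}`n0x{:X}", DecodeUTF16(0xD83D, 0xDCDD), DecodeUTF16(0x0041))
; shows 0x1F4DD and 0x41
```

Testing the top six bits of the first unit (mask 0xFC00) is what separates "one unit" from "two units"; the second unit is only examined once the first one is known to be a high surrogate.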

Emile HasKey
Posts: 27
Joined: 06 Mar 2022, 17:45


Post by Emile HasKey » 15 Jun 2022, 14:23

UTF-16:

Codepoints 0x0000 to 0xFFFF are stored as 2 bytes.
Codepoints 0x010000 to 0x10FFFF are stored as 4 bytes, as surrogate pairs.
Codepoints 0xD800 to 0xDFFF aren't proper codepoints. They're used for surrogate pairs.
Total number of codepoints: 0x10000-1024-1024+1024*1024 = 1112064.

0x0000-0xD7FF regular chars [chars 0-55295]
0xD800-0xDBFF high surrogate [for chars 65536-1114111]
0xDC00-0xDFFF low surrogate [for chars 65536-1114111]
0xE000-0xFFFF regular chars [57344-65535]

e.g. 'A': Chr(0x41)
e.g. treble clef: MsgBox(Chr(0x1D11E) == Chr(0xD834) Chr(0xDD1E)) ; True
Your example:

MsgBox(Chr(0x1F4DD) == Chr(0xD83D) Chr(0xDCDD)) ; True
To calculate the bytes manually:

Just check the AHK source code. BIF_Ord and BIF_Chr.


Post by FredOoo » 16 Jun 2022, 00:16

Thank you, Emile, for your help.
And it’s a good idea to look at the AHK source code. I already have an old version, but I’m not used to doing that.

In the next post, I have something that works. Maybe not the best code, but I think it works (almost…).
 
Last edited by FredOoo on 16 Jun 2022, 01:50, edited 2 times in total.

Post by FredOoo » 16 Jun 2022, 00:16

Code: Select all

#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
¶ := '`n'
msg := ''

test_UU16( "ሴ" ) ; ሴ U+1234 ETHIOPIC SYLLABLE SEE
test_UU16( "𝄞" ) ; 𝄞 U+1D11E MUSICAL SYMBOL G CLEF
test_UU16( "📝" ) ; 📝 U+1F4DD MEMO

msgBox msg,, 0x40000

test_UU16( C ){
    U16 := Unicode_UTF16( ord(C) )
    M   := UTF16_inMemory( C )
    U   := UTF16_Unicode( U16[1], U16[2] )
    global msg
    msg .= C " – " fhex(ord(C)) ¶
    msg .= "U16: " fhex(U16[1]) " – " fhex(U16[2]) ¶
    msg .= "M:   " fhex(M[1]) " – " fhex(M[2]) ¶
    msg .= "U:   " U " → " fhex(U) (U=ord(C)?" ✓":'') ¶
    msg .= "---" ¶ ¶
}
Unicode_UTF16( U ){
    ;; Unicode to UTF16 encode
    ; subtract 0x10000 : … 0001 / 0000 0000 0000 0000
    U_ := U-0x10000
    ; split into the 10 high-order bits and the 10 low-order bits
    BE := (U_>>10)   + 0xD800 ; from the 10 high-order bits
    LE := (U_&0x3FF) + 0xDC00 ; from the 10 low-order bits
                         ; xxxx xxxx  xx]xx xxxx / xxxx xx[xx  xxxx xxxx
                ; 0xD800 ; 1101 1000  0000  0000 /
                ; 0xDC00 ;                       / 1101 1100  0000 0000
    if (LE & 0xD8)==0xD8 ; 1101 1xxx  xxxx xxxx  /
        return [ BE, LE ]
    else
        return [ 0, U ]
}
UTF16_Unicode( BE, LE ){
    ;; UTF16 decode to Unicode
    if BE=0
        return LE
    BE2 := BE - 0xD800
    LE2 := LE - 0xDC00
    U_  := (BE2 << 10) + LE2
    return U_ + 0x10000
}
UTF16_inMemory( C ){
    ;; Read UTF16 in memory
    A := strPtr(C)
    BE := numGet( A , 0, 'UShort')
    LE := numGet( A , 2, 'UShort')
    if (LE & 0xD8)==0xD8  ; surrogate pair (11011…)  or BE>65535
        return [ BE, LE ]
    else
        return [ 0, BE ]
}
fhex( value, width:=4, prefix:=1 ){
    ;; Format hex
    if value>0xFFFF
        width := 6
    if value>0xFFFFFF
        width := 8
    return format( (prefix?'0x':'') "{:0" width "X}", value )
}
 

Code: Select all

ሴ – 0x1234
U16: 0x0000 – 0x1234
M:   0x0000 – 0x1234
U:   4660 → 0x1234 ✓
---

𝄞 – 0x01D11E
U16: 0x0000 – 0x01D11E
M:   0x0000 – 0xD834
U:   119070 → 0x01D11E ✓
---

📝 – 0x01F4DD
U16: 0xD83D – 0xDCDD
M:   0xD83D – 0xDCDD
U:   128221 → 0x01F4DD ✓
---

Post by FredOoo » 16 Jun 2022, 01:00

But everything is still not perfectly clear.
 
Why can’t I just read memory like this?
 

Code: Select all

UTF16_inMemory( C ){
    ;; Read UTF16 in memory
    A := strPtr(C)
    BE := numGet( A , 0, 'UShort')
    LE := numGet( A , 2, 'UShort')
    return [ BE, LE ]
    
    ;if (LE & 0xD8)==0xD8  ; surrogate pair (11011…) or BE>65535
    ;    return [ BE, LE ]
    ;else
    ;    return [ 0, BE ]
}

Post by FredOoo » 16 Jun 2022, 01:10

In the second test (MUSICAL SYMBOL G CLEF), the U16 and M lines are not identical.
 

Code: Select all

𝄞 – 0x01D11E
U16: 0x0000 – 0x01D11E    ; ←
M:   0x0000 – 0xD834      ; ←
U:   119070 → 0x01D11E ✓
---
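One likely explanation, offered as a reading of the code posted above rather than a confirmed diagnosis: the test `(LE & 0xD8)==0xD8` only looks at bits in the low byte of the 16-bit value, so its result depends on the low byte of the low surrogate. For 📝 the low surrogate is 0xDCDD, and 0xDD & 0xD8 happens to equal 0xD8; for 𝄞 it is 0xDD1E, and 0x1E & 0xD8 is 0x18, so both functions fall into their `else` branch. The sketch below compares that test with a check on the top six bits of the first unit; the names `LowByteTest` and `IsHighSurrogate` are made up for illustration.

```autohotkey
#Requires AutoHotkey v2.0

; The test used in the code above: it only inspects bits of the low byte.
LowByteTest(unit) => (unit & 0xD8) = 0xD8

; Hypothetical replacement: is this unit a high surrogate (1101 10xx xxxx xxxx)?
IsHighSurrogate(unit) => (unit & 0xFC00) = 0xD800

out := LowByteTest(0xDCDD) " " LowByteTest(0xDD1E) "`n"      ; 1 0
out .= IsHighSurrogate(0xD83D) " " IsHighSurrogate(0xD834)   ; 1 1
MsgBox out
```

With a check like `IsHighSurrogate` applied to the first unit read from memory, both 📝 and 𝄞 would be treated as surrogate pairs.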

Post by FredOoo » 18 Jun 2022, 22:12

 
This one is correct, I think.
It checks for non-characters and out-of-range values.
 

Code: Select all

#Requires AutoHotkey v2.0-beta.3
#SingleInstance force
Persistent
 ^!R::reload
 a_TrayMenu.default := "&Reload Script"
¶   := '`n'
msg := ''

test( "A" ) ; A U+41 LETTER LATIN A
test( "𐐷" ) ; 𐐷 U+10437
test( "ሴ" ) ; ሴ U+1234 ETHIOPIC SYLLABLE SEE
test( "𝄞" ) ; 𝄞 U+1D11E MUSICAL SYMBOL G CLEF
test( "📝" ) ; 📝 U+1F4DD MEMO
test( "😎" ) ; 😎 U+1F60E → D8 3D DE 0E in UTF-16BE

U16 := Unicode_UTF16( 0xD9 ) ; non-character
msg .= "U16:`t" fhex(U16[1]) " – " fhex(U16[2]) ¶
msg .= octet(U16[1]) " · " octet(U16[2]) ¶ ¶

U16 := Unicode_UTF16( 0x11FFFF ) ¶ ¶ ; over max

msgBox msg,, 0x40000

/*
    C : Character         → "📝" / "A"
    U : Unicode           → 0x 0001 F4DD / 0x 0000 0041 ‘BE’
    L : Low address
    H : High address
    M   : Memory [L, H]   → [ 0x D83D, 0x DCDD ] / [ 0x 0041, 0x 00000 ] ‘LE’
    U16 : UTF-16 [L, H]   → [ 0x D83D, 0x DCDD ] / [ 0x 0041, 0x 00000 ] ‘LE’ calculated, same as memory
    ‘BE’  : Big-Endian    [B, L] most significant byte in the smallest address
    ‘LE’  : Little-Endian [B, L]
*/
test( C ){
    U16 := Unicode_UTF16( ord(C) )
    M   := UTF16_inMemory( C )
    U   := UTF16_Unicode( U16[1], U16[2] )
    global msg
    msg .= C " – " fhex( ord(C) ) ¶
    msg .= "U16:`t" ( U16[1]>0 ?fhex(U16[1]) :"none") " – " fhex(U16[2]) ¶
    msg .= "M:`t"   fhex(M[1]) " – " ( M[2]>0 ?fhex(M[2]) :"none") ¶
    msg .= "U:`t"   fhex(U) ( U=ord(C) ?" ✓" :'' ) ¶
    msg .= "---" ¶ ¶
}
Unicode_UTF16( U ){
    ;; Unicode to UTF16 encode
     ; A = 65 = 0x 0000 0041 = UL,UH ‘LE’  →  [ 0041, 0000 ]  [ L, H ] ‘BE’
    UL:= U >> 16
    UH:= U & 0xFFFF
    if U < 0x10000 {
        global msg
        if (UH & 0xD8)==0xD8  ; 1101 1xxx  xxxx xxxx
            msg .= fhex( U ) " : non-character" ¶ ; throw ValueError("non-character")
        return [ 0, UH ]
    } else if U > 0x10FFFF {
            msg .= fhex( U ) " : over max" ¶ ; throw ValueError("over max")
    } else {
        U_ := U-0x10000
        UH := 0xd800 + ((U_ >> 10) & 0x3ff)
        UL := 0xdc00 + ( U_        & 0x3ff)
        return [ UH, UL ]
    }
}
UTF16_Unicode( L, H ){
    ;; UTF16 decode to Unicode
    if L=0
        return H
    return 0x10000 + (L - 0xD800) * 0x400 + (H - 0xDC00)
}
UTF16_inMemory( C ){
    ;; Read UTF16 in memory
     ; 0x 41 00 00 00  [ 0041, 0000 ]  [ L, H ] ‘BE’  (for Unicode 0x 0000 0041 = 65 = A ‘LE’)
    A := strPtr(C)
    L := numGet( A , 0, 'UShort') ; Low address
    H := numGet( A , 2, 'UShort') ; High address (Null-termination if not surrogate)
    UTF16_inMemory2( C ) ; Just to see
    return [ L, H ]
}

; Format hex:

fhex( value, width:=4, prefix:=true ){
    ;; Format hex
    if value>0xFFFF
        width := 6
    if value>0xFFFFFF
        width := 8
    return format( (prefix?'0x':'') "{:0" width "X}", value )
}

; Just to see memory byte by byte:

UTF16_inMemory2( C ){
    ;; Read bytes in memory
    A := strPtr(C)
    B0 := numGet( A , 0, 'UChar')
    B1 := numGet( A , 1, 'UChar')
    B2 := numGet( A , 2, 'UChar')
    B3 := numGet( A , 3, 'UChar')
    global msg
    msg .= octet(B0) " · " octet(B1) " · " octet(B2) " · " octet(B3) ¶
    return [ B0, B1, B2, B3 ]
}
octet( byte ){
    ; 0000 0000
    return format('{:04} {:04}',d2b((byte & 0xF0) >> 4),d2b(byte & 0x0F))
}
d2b( dec ){
    ; decimal to binary
    if dec=0
        return '0'
    i    := 0
    B    := ''
    while dec {
        i++
        B := mod(dec, 2) B
        dec := dec // 2
    }
    return B
}
 

Code: Select all

/*
0100 0001 · 0000 0000 · 0000 0000 · 0000 0000
A – 0x0041
U16:	none – 0x0041
M:	0x0041 – none
U:	0x0041 ✓
---

0000 0001 · 1101 1000 · 0011 0111 · 1101 1100
𐐷 – 0x010437
U16:	0xD801 – 0xDC37
M:	0xD801 – 0xDC37
U:	0x010437 ✓
---

0011 0100 · 0001 0010 · 0000 0000 · 0000 0000
ሴ – 0x1234
U16:	none – 0x1234
M:	0x1234 – none
U:	0x1234 ✓
---

0011 0100 · 1101 1000 · 0001 1110 · 1101 1101
𝄞 – 0x01D11E
U16:	0xD834 – 0xDD1E
M:	0xD834 – 0xDD1E
U:	0x01D11E ✓
---

0011 1101 · 1101 1000 · 1101 1101 · 1101 1100
📝 – 0x01F4DD
U16:	0xD83D – 0xDCDD
M:	0xD83D – 0xDCDD
U:	0x01F4DD ✓
---

0011 1101 · 1101 1000 · 0000 1110 · 1101 1110
😎 – 0x01F60E
U16:	0xD83D – 0xDE0E
M:	0xD83D – 0xDE0E
U:	0x01F60E ✓
---

0x00D9 : non-character
U16:	0x0000 – 0x00D9
0000 0000 · 1101 1001

0x11FFFF : over max
*/

Post by Emile HasKey » 02 Jul 2022, 19:01

This example might clarify things.

Code: Select all

; codepoint to surrogate pair:

codepoint := 119070
char := Chr(codepoint)
pair1 := Ord(SubStr(char, 1, 1))
pair2 := Ord(SubStr(char, 2, 1))
MsgBox(A_Clipboard := pair1 " " pair2) ; 55348 56606

codepoint := 119074
char := Chr(codepoint)
pair1 := Ord(SubStr(char, 1, 1))
pair2 := Ord(SubStr(char, 2, 1))
MsgBox(A_Clipboard := pair1 " " pair2) ; 55348 56610

MsgBox(Chr(119070) == Chr(55348) Chr(56606)) ; True (treble clef)
MsgBox(Chr(119074) == Chr(55348) Chr(56610)) ; True (bass clef)
return
