StrPut()

Post by neogna2 » 08 Mar 2021, 05:27

In general, I think the documentation for StrPut() can be difficult to read because the function has two different uses: copying a string to a memory address, or returning the buffer size a string requires. Here are some suggestions for improvement within that constraint.

StrPut() section Parameters
Encoding [...] Specify an empty string or "CP0" to use the system default ANSI code page.
In Unicode versions of AutoHotkey, all of the following calls seem to return a size in UTF-16 code units.

Code:

MsgBox % StrPut("𐍈")            ; 3
MsgBox % StrPut("𐍈", "")        ; 3
MsgBox % StrPut("𐍈", "UTF-16")  ; 3

I think in all these cases the return value 3 represents two 16-bit (UTF-16) code units for "𐍈" plus one 16-bit code unit for the null-terminator. Is that true? Or am I making a mistake about the return value of StrPut("𐍈", "")? (Edit: it was false, I was mistaken. See subsequent posts.) If true, the documentation could change to:
In Unicode versions of AutoHotkey, specify an empty string "" to use UTF-16. In ANSI versions of AutoHotkey, specify an empty string "" or "CP0" to use the system default ANSI code page.

StrPut() section Return Value
This function returns the number of characters written. If no Target was given, it returns the required buffer size in characters.
The second sentence leaves implicit which code units the returned buffer size is measured in. This expansion would make that clear:
If no Target was given, it returns the required buffer size in the target encoding's code units (e.g. 1-byte (8-bit) units for "UTF-8", 2-byte (16-bit) units for "UTF-16"), and the size includes the null-terminator.
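
As a rough sketch of why the unit matters: VarSetCapacity always counts bytes, so a size reported in 16-bit units has to be doubled before allocating. This assumes AutoHotkey v1 syntax on a Unicode build and a script file saved as UTF-8 with BOM so that the literal is read correctly; the variable names are only illustrative.

```autohotkey
str := "𐍈"

; No Target given: StrPut returns the required size in the target
; encoding's code units, null-terminator included.
units16 := StrPut(str, "UTF-16")   ; counted in 16-bit units
units8  := StrPut(str, "UTF-8")    ; counted in 8-bit units

; VarSetCapacity takes a byte count, so the UTF-16 size is doubled.
VarSetCapacity(buf16, units16 * 2, 0)
VarSetCapacity(buf8, units8, 0)

; Length is omitted here only because each buffer was sized by the
; preceding StrPut call with the same string and encoding.
StrPut(str, &buf16, "UTF-16")
StrPut(str, &buf8, "UTF-8")

MsgBox % "UTF-16: " units16 " units = " (units16 * 2) " bytes`n"
       . "UTF-8: " units8 " units = " units8 " bytes"
```

Spelling out the unit in the Return Value section would make the `* 2` step less of a surprise.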

StrPut() section Examples
Between examples #1 and #2, add some simple examples that only showcase returning the buffer size with different target encodings.

Code:

StrPut("𐍈")             ; 3
StrPut("𐍈", "UTF-16")   ; 3
; Target encoding is UTF-16, so the return value is in 16-bit code units:
; two 16-bit code units for "𐍈" + 1 for the null-terminator = 3

StrPut("𐍈", "")         ; 3
; Target encoding is the system default ANSI code page, so the return value
; is in 8-bit code units; "𐍈" has no ANSI representation (see subsequent posts)

StrPut("𐍈", "UTF-8")    ; 5
; Target encoding is UTF-8, so the return value is in 8-bit code units:
; four 8-bit code units for "𐍈" + 1 for the null-terminator = 5

Previous discussion by jeeswg and others in 2019 continuing into next page of that thread and also here.

Suggestion added 2021-03-09:
StrPut() section Parameters
Length must not be omitted unless the buffer size is known to be sufficient, such as if the buffer was allocated based on a previous call to StrPut with the same Source and Encoding.
"Source" should be changed to "String".
Last edited by neogna2 on 09 Mar 2021, 06:29, edited 2 times in total.

Re: StrPut()

Post by just me » 09 Mar 2021, 04:54

Code:

; In Unicode versions of AHK
MsgBox % StrPut("𐍈")            ; 3 Unicode characters including the null-terminator
MsgBox % StrPut("𐍈", "")        ; 3 ANSI    characters including the null-terminator
MsgBox % StrPut("𐍈", "UTF-16")  ; 3 Unicode characters including the null-terminator

Code:

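U := "𐍈"
VarSetCapacity(ANSI, 512, 0)   ; zero-filled buffer, larger than needed
StrPut(U, &ANSI, "")           ; convert to the system default ANSI code page
A := StrGet(&ANSI, "")         ; read the ANSI string back
MsgBox, %U% - %A%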

Re: StrPut()

Post by neogna2 » 09 Mar 2021, 06:21

@just me thanks for proving me wrong :)
Your last line MsgBox, %U% - %A% outputs 𐍈 - ?? in Unicode AutoHotkey. Is the following interpretation correct?
  • The input string "𐍈" is represented in Unicode AutoHotkey's native encoding, UTF-16, as two 16-bit code units (a "surrogate pair").
  • Specifying Encoding as an empty string "" does use ANSI as the target encoding, but ANSI cannot represent the Unicode character 𐍈.
  • StrPut converts each input 16-bit code unit into an ANSI 8-bit code unit(*), and when we MsgBox the ANSI string these show up as two unknown/error symbols ??.
  • Those two symbols (one 8-bit code unit each) and a null-terminator (one 8-bit code unit) make the return size 3.
(* This PowerShell one-liner identifies my PC's ANSI code page as 1252, "a single-byte character encoding of the Latin alphabet")
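
One way to check this interpretation could be to dump the raw bytes that StrPut writes. This is a minimal sketch, assuming AutoHotkey v1.1.17+ (for Format) on a Unicode build, with the script saved as UTF-8 with BOM; the variable names are only illustrative.

```autohotkey
U := "𐍈"

; Required size in ANSI code units (bytes), null-terminator included.
size := StrPut(U, "CP0")
VarSetCapacity(ANSI, size, 0)
StrPut(U, &ANSI, size, "CP0")

; Read the buffer back one unsigned byte at a time and show it as hex.
bytes := ""
Loop % size
    bytes .= Format("{:02X} ", NumGet(&ANSI, A_Index - 1, "UChar"))
MsgBox % bytes
```

If the interpretation holds on a code page 1252 system, each substituted ? would be expected to appear as 3F and the terminator as 00.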