How to read the BOM ? (not the DOM) Topic is solved

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

How to read the BOM ? (not the DOM)

01 Jun 2019, 16:47

Why can't I see the DOM when I RawRead a file?
(And why can't I read the star char?)

Code: Select all

;★ AHK v2 (UTF-16 LE with DOM)
FileTest := FileOpen( A_ScriptFullPath, "r" )
nbrBytes := 3
BUFFER   := BufferAlloc( nbrBytes )
FileTest.RawRead( BUFFER, nbrBytes )
byte1 := NumGet( BUFFER, 0, "UChar" )
str   := StrGet( &BUFFER+1, 2 )
MsgBox byte1 " => " Chr(byte1) " => " Format("0x{:08x}", byte1 ), A_FileEncoding
MsgBox str
FileTest.Close
Last edited by FredOoo on 05 Jun 2019, 02:44, edited 1 time in total.
swagfag
Posts: 6222
Joined: 11 Jan 2017, 17:59

Re: How to read the DOM ?  Topic is solved

01 Jun 2019, 17:26

DOM?? do u mean BOM? https://en.wikipedia.org/wiki/Byte_order_mark
FileOpen starts u off with the pointer already placed after the bom, so u have to reset it first: fileObj.Pos := 0 if u want to do something with it

ur other question i dont understand
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

Re: How to read the DOM ?

01 Jun 2019, 18:13


Oh yes, I mean about the BOM.
And thanks for « FileTest.Pos := 0 »

Code: Select all

;★ AHK v2 (UTF-16 LE with BOM)
FileTest := FileOpen( A_ScriptFullPath, "r" )
FileTest.Pos := 0
nbrBytes := 4
BUFFER   := BufferAlloc( nbrBytes )
FileTest.RawRead( BUFFER, nbrBytes )
byte0 := NumGet( BUFFER, 0, "UChar" )
char1   := StrGet( &BUFFER+1, 2, "UTF-16" )
char2   := StrGet( &BUFFER+3, 2, "UTF-16" )

;MsgBox A_FileEncoding
MsgBox byte0 " => " Chr(byte0) " => " Format("0x{:08x}", byte0 )
MsgBox char1
MsgBox char2
FileTest.Close
/*
byte0: 255 => ÿ => 0x000000ff   I think that's ok
------

If I save as UTF-8 with BOM I get this:
byte0: 239 => ï => 0x000000ef   I think that's ok
------

If I save as a Windows 1252:
byte0: 59 => ; => 0x0000003b   That's ok
------
*/

The next two chars (2 bytes each) were an attempt to read the « ; » and the « ★ », just as a check. It doesn't matter much (I would prefer to understand, but I have no need for it right now).

Now that I have byte0, do you think the check below is a reliable way to choose the executable?

Code: Select all

if byte0 = 0xff || byte0 = 0xef {
	; run this file with AutoHotkeyU
} else {
	; run this file with AutoHotkeyA
}

I'm trying to make a quick SheBang for AHK.
Last edited by FredOoo on 01 Jun 2019, 19:01, edited 1 time in total.
(Alan Turing) « What would be the point of saying that A = B if it was really the same thing? »
(Albert Camus) « Misnaming things is to add to the misfortunes of the world. »
swagfag
Posts: 6222
Joined: 11 Jan 2017, 17:59

Re: How to read the DOM ?

01 Jun 2019, 19:00

u seem to want to recognize the encoding of a file without having been told what it is. this is tricky because theres practically no way to reliably do that

Code: Select all

Encoding  | BOM | Bytes                               | ANSI View
----------|-----|-------------------------------------|-------------
ANSI      |     | 68 65 6C 6C 6F                      | hello
UTF-8     |     | 68 65 6C 6C 6F                      | hello
UTF-8     |  X  | EF BB BF 68 65 6C 6C 6F             | hello
UTF-16 LE |     | 68 00 65 00 6C 00 6C 00 6F 00       | h.e.l.l.o.
UTF-16 LE |  X  | FF FE 68 00 65 00 6C 00 6C 00 6F 00 | ÿþh.e.l.l.o.
UTF-16 BE |     | 00 68 00 65 00 6C 00 6C 00 6F       | .h.e.l.l.o
UTF-16 BE |  X  | FE FF 00 68 00 65 00 6C 00 6C 00 6F | þÿ.h.e.l.l.o
. => NUL
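For anyone who wants to check the table byte by byte, here is a rough sketch in Python (not AHK, and not the code that produced the table above). It encodes the same "hello" string with and without a BOM and prints the hex bytes plus a CP1252 view. The names `ROWS`, `encode_row`, `hex_view`, `ansi_view` and `build_table` are made up for this illustration, and "ANSI" is assumed to mean CP1252.

```python
import codecs

TEXT = "hello"

# (label, Python codec name, BOM bytes) for each row family in the table.
# Assumption: "ANSI" is taken to be CP1252, which has no BOM.
ROWS = [
    ("ANSI", "cp1252", None),
    ("UTF-8", "utf-8", codecs.BOM_UTF8),
    ("UTF-16 LE", "utf-16-le", codecs.BOM_UTF16_LE),
    ("UTF-16 BE", "utf-16-be", codecs.BOM_UTF16_BE),
]

def encode_row(codec, bom, with_bom):
    """Return the raw bytes of TEXT in `codec`, optionally prefixed by `bom`."""
    data = TEXT.encode(codec)
    return (bom + data) if with_bom else data

def hex_view(data):
    """Space-separated uppercase hex, one pair per byte."""
    return " ".join(f"{b:02X}" for b in data)

def ansi_view(data):
    """What a CP1252 viewer would show, with NUL bytes drawn as '.'."""
    return data.decode("cp1252").replace("\x00", ".")

def build_table():
    """Build one text line per (encoding, BOM yes/no) combination."""
    lines = []
    for label, codec, bom in ROWS:
        for with_bom in (False, True):
            if with_bom and bom is None:
                continue  # ANSI has no BOM variant
            data = encode_row(codec, bom, with_bom)
            mark = "X" if with_bom else " "
            lines.append(
                f"{label:<9} | {mark} | {hex_view(data):<35} | {ansi_view(data)}"
            )
    return lines

if __name__ == "__main__":
    print("\n".join(build_table()))
```

Unlike the table, this sketch also shows the UTF-8 BOM bytes in the ANSI view (as ï»¿) instead of hiding them.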
Last edited by swagfag on 06 Jun 2019, 05:07, edited 1 time in total.
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

Re: How to read the DOM ?

01 Jun 2019, 19:31

I don't want to deal with the variable number of bytes used by UTF-8.
AHK uses ANSI (never with a BOM) with AutoHotkeyA32
AHK can also use UTF-16 or UTF-8 (BOM starting with 0xff or 0xef) with AutoHotkeyU (32 or 64 bits)

;#! AHK v2 Unicode 64
;#! Ahk v1.1 Ansi 32

Let's try… But what do other people do when they often switch between versions? Do they rename *.ahk to *.ah1 and *.ah2?
A few years ago I worked with Python and had a SheBang that handled this perfectly. I'm trying to do the same for AHK.
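A launcher could read a first-line directive like the two `;#!` examples above and pick an executable from it. The following is only a sketch in Python under that assumption: the `;#!` syntax is the proposal from this post, not an existing AHK feature, and `DIRECTIVE` and `parse_directive` are hypothetical names. The executable names follow the A32 / U32 / U64 naming used in this thread.

```python
import re

# Matches the directive format proposed above, e.g. ";#! AHK v2 Unicode 64".
# This syntax is an assumption taken from the post, not something AHK defines.
DIRECTIVE = re.compile(
    r"^;#!\s*ahk\s+v(?P<version>\d+(?:\.\d+)*)"
    r"\s+(?P<charset>ansi|unicode)\s+(?P<bits>32|64)\b",
    re.IGNORECASE,
)

def parse_directive(first_line):
    """Return (version, exe_name) for a ';#!' line, or None if it does not match."""
    # Drop a leading BOM character in case the line was decoded without removing it.
    match = DIRECTIVE.match(first_line.lstrip("\ufeff").strip())
    if match is None:
        return None
    charset = match["charset"].lower()
    bits = match["bits"]
    if charset == "ansi" and bits == "64":
        return None  # the thread only mentions a 32-bit ANSI build
    prefix = "A" if charset == "ansi" else "U"
    return match["version"], f"AutoHotkey{prefix}{bits}.exe"

if __name__ == "__main__":
    for line in (";#! AHK v2 Unicode 64", ";#! Ahk v1.1 Ansi 32", "; plain comment"):
        print(repr(line), "->", parse_directive(line))
```

The launcher would still need a rule for scripts that carry no directive at all, for example falling back to the BOM or to a default executable.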
swagfag
Posts: 6222
Joined: 11 Jan 2017, 17:59

Re: How to read the DOM ?

01 Jun 2019, 20:10

i dont know. how often do u imagine people switch their ahk versions? probably not that often, most just pick one and stick to that

u can make urself a script launcher
u can have ur editor launch scripts with different ahk.exes.
Scite has this button
u can regedit and define different .ah?? extensions
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: How to read the DOM ?

01 Jun 2019, 21:09

- @FredOoo: Please rename the thread from DOM (web browsers) to BOM (text encodings), it's incredibly confusing. Plus it will hurt searching for this thread, if it remains as is.
- Good thread topic, I make all my code Unicode/ANSI compatible. But otherwise I'd use an AHK script to handle opening ahk files, it would search for a comment like ';AHK v2 script', inside the script, to determine which exe to use.
- Btw in your second post, UTF-16 characters appear at even offsets (your StrGet calls use +1 and +3).

- @swagfag: Nice encodings table. I thought I'd try and reverse-engineer it, as a speed-coding challenge. It has a lot of fiddly bits to it though. I wonder if your code was anything like mine.
- Btw for your UTF-16 BE example, the BOM should be þÿ.

Code: Select all

q:: ;test write string in different encodings
vList := "CP0,UTF-8,UTF-16,UTF-16"
oEnc := StrSplit("ANSI,UTF  8   ,UTF 16,UTF 16", ",")
oEnc2 := StrSplit(",,LE,BE", ",")
vText := "hello"
vOutput := ""
Loop, Parse, vList, % ","
{
	vEnc := A_LoopField
	vIndex := A_Index
	Loop, 2
	{
		if (A_Index = 2) && (vEnc = "CP0")
			break
		vPfx := (A_Index = 1) ? "" : Chr(0xFEFF) ;BOM
		vEnc2 := oEnc[vIndex]
		if oEnc2[vIndex]
			vEnc2 .= " " oEnc2[vIndex]
		if (SubStr(vEnc2, 1, 1) = "U")
			vEnc2 .= (A_Index = 1) ? " NO BOM" : " BOM"
		vSize := StrPut(vPfx vText, vEnc)
		vSize--
		if (vEnc = "UTF-16")
			vSize *= 2
		VarSetCapacity(vData, vSize)
		if (oEnc2[vIndex] = "BE")
		{
			VarSetCapacity(vData2, vSize)
			StrPut(vPfx vText, &vData2, vEnc)
			;LCMAP_BYTEREV := 0x800
			DllCall("kernel32\LCMapStringW", "UInt",0, "UInt",0x800, "WStr",vData2, "Int",vSize/2, "WStr",vData, "Int",vSize/2)
		}
		else
			StrPut(vPfx vText, &vData, vEnc)
		vHex := vPreview := ""
		Loop, % vSize
		{
			vOrd := NumGet(&vData, A_Index-1, "UChar")
			vHex .= (A_Index=1?"":" ") Format("{:02X}", vOrd)
			vPreview .= vOrd ? Chr(vOrd) : "."
		}
		;vOutput .= vSize " " vEnc "`r`n"
		vOutput .= Format("{: -20}{: -39}{}`r`n", vEnc2, vHex, vPreview)
	}
}
Clipboard := vOutput
MsgBox, % vOutput
return
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

Re: How to read the BOM ?

05 Jun 2019, 02:44

This one reads the chars correctly :
Use BUFFER.ptr instead of &BUFFER

Code: Select all

;#! AHK v2  ; (UTF-16 LE with BOM)
#SingleInstance force
#Persistent

FileTest := FileOpen( A_ScriptFullPath, "r" )
FileTest.Pos := 0
nbrBytes := 6
BUFFER   := BufferAlloc( nbrBytes )
FileTest.RawRead( BUFFER, nbrBytes )
byte0 := NumGet( BUFFER, 0, "UChar" )
byte1 := NumGet( BUFFER, 1, "UChar" )
char1 := StrGet( BUFFER.ptr+2, 1 )
char2 := StrGet( BUFFER.ptr+4, 1 )

Console := new CConsole
Console.log "a_AhkPath: " a_AhkPath
Console.log "a_FileEncoding: " a_FileEncoding
Console.log "︽"
Console.log "byte0: " byte0 " => " Chr(byte0) " => " Format("0x{:04x}", byte0 )
Console.log "byte1: " byte1 " => " Chr(byte1) " => " Format("0x{:04x}", byte1 )
Console.log "char1: " char1
Console.log "char2: " char2
Console.log "︾"
FileTest.Close

/*
	a_AhkPath: C:\sys\AutoHotkey_v2\AutoHotkeyU64.exe
	a_FileEncoding: CP0
	byte0: 255 => ÿ => 0x00ff   ; Ok
	byte1: 254 => þ => 0x00fe   ; Ok
	char1: ;                    ; Ok
	char2: #                    ; Ok
*/

class CConsole {
	__new(){
		dllCall("AllocConsole")
		this.stdOut := fileOpen("*","w `n")
		hwnd :=  dllCall("GetConsoleWindow")
		winWait( "ahk_id " hwnd )
		winMove( 0, 0,,, "ahk_id " hwnd )
	}
	log(var:=""){
		if isObject(var)
			for index, item in var
				this.stdOut.writeLine( index ": " item )
		else
			this.stdOut.writeLine( var )
		this.stdOut.read(0) ;flush
	}
}
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

Re: How to read the BOM ? (not the DOM)

05 Jun 2019, 03:23

@jeeswg: Did you test your code ?
just me
Posts: 9425
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: How to read the BOM ? (not the DOM)

05 Jun 2019, 04:02

Why do you want to 'RawRead' the BOM?

If AHK detects a UTF-8 (EFBBBF) or UTF-16 (LE) (FFFE) BOM when you open an existing file for reading:
  • File.Pos is 3 and File.Encoding is UTF-8 for UTF-8
  • File.Pos is 2 and File.Encoding is UTF-16 for UTF-16
In all other cases File.Pos is 0.
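The rule above can be sketched as a small lookup on the first bytes of the file. This is an illustration in Python, not AHK's actual implementation; `KNOWN_BOMS` and `sniff_bom` are hypothetical names, and the list contains only the two BOMs named in this post.

```python
import codecs

# BOMs this post says AHK recognises, paired with the resulting File.Encoding.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),       # EF BB BF, so Pos starts at 3
    (codecs.BOM_UTF16_LE, "UTF-16"),  # FF FE, so Pos starts at 2
]

def sniff_bom(head):
    """Return (encoding, pos) for the leading bytes of a file.

    encoding is None and pos is 0 when neither listed BOM is present.
    """
    for bom, name in KNOWN_BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0

if __name__ == "__main__":
    samples = [b"\xef\xbb\xbfhello", b"\xff\xfeh\x00", b";hello", b"\xff;"]
    for head in samples:
        print(head, "->", sniff_bom(head))
```

Comparing the whole BOM rather than only the first byte matters for the earlier `byte0 = 0xff || byte0 = 0xef` test: an ANSI file that happens to start with ÿ or ï would pass that test, while it falls into the "all other cases" branch here.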
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

Re: How to read the BOM ? (not the DOM)

05 Jun 2019, 05:19

Yes, now I know. But I didn't before I worked on it…
Btw, what a surprise that in « all other cases » File.Pos is 0, when AHK starts all indexes at 1.

I've made a Zip installation, so I was wondering which version to execute: A32, U32 or U64?
Now I know that a script with a BOM always goes to a Unicode version, but some scripts with no BOM (even if one is officially required for ahk files) can be Unicode too. And when I get a new script, I don't know whether it was built for 32 bits, 64 bits, or both (DllCalls, Ptr size…).

And what should I do if I get an ANSI script but don't know whether the developer is, for example, Russian and writing his comments in English? It will probably not work perfectly on my CP1252 system, and I can't tell the original encoding at first glance.

I'm a beginner at AHK and I started learning it with v2. But as most of the examples I find are v1.1, I have to run them too…
What about a SheBang?
A « #! » directive…
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: How to read the BOM ? (not the DOM)

05 Jun 2019, 08:46

@FredOoo: Yes I tested my code. I generally always do, even for basic scripts. It's an AHK v1 script that recreates swagfag's table.

Thanks for renaming the thread: byte order mark v. Document Object Model.
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

Re: How to read the BOM ? (not the DOM)

05 Jun 2019, 10:28

@jeeswg: I thought it was an AHK v3.alpha. The MsgBox says "". Why?
FredOoo
Posts: 186
Joined: 07 May 2019, 21:58
Location: Paris

Re: How to read the BOM ? (not the DOM)

05 Jun 2019, 12:15

More than an hour and a half later, reading my newspaper, I suddenly got this:
[Attached image: d-day.PNG]