[Dll] UdeExport.dll (x64) Detect File/String Encoding

Post by dd900 » 28 Dec 2021, 14:44

No longer supported.

UdeExport.dll (x64) is based on the Mozilla Universal Charset Detector (Ude, the C# port).
An x86 DLL has been added as well.

Original code for Ude can be found here.

"Mostly" Accurate Encoding Detector
Not much different from the original. I updated some of the code a little, but for the most part I just added a couple of exported functions and packaged it so it's easily callable from AutoHotkey.

There are two functions available:
GetFileEncoding
GetStringEncoding <-- Not Very Useful, but it's there

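GetFileEncoding will often return "ASCII" even if the file was saved as UTF-8. Unless the file contains a UTF-8 BOM or at least one non-ASCII (multi-byte) UTF-8 character, this will always be the case, because such a file is byte-for-byte identical to an ASCII file.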
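
A minimal sketch of why this happens, written in Python so it runs without the DLL. It does not use Ude; `is_pure_ascii` is a hypothetical helper that only illustrates the point that ASCII-only text encodes to the same bytes in both encodings.

```python
# Text that uses only ASCII characters encodes to identical bytes
# in ASCII and UTF-8, so no detector can tell the two apart.
text = "plain text, no accents"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes


def is_pure_ascii(data: bytes) -> bool:
    """Hypothetical helper: True when every byte is below 0x80."""
    return all(b < 0x80 for b in data)


# As soon as one non-ASCII character appears, UTF-8 needs bytes >= 0x80,
# which gives a detector something to work with.
accented = "ž š".encode("utf-8")

print(same_bytes)
print(is_pure_ascii(utf8_bytes))
print(is_pure_ascii(accented))
```

In practice, an "ASCII" result can be treated as compatible with UTF-8, since every ASCII file is also a valid UTF-8 file.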

Here is the Dictionary<Encoding string, Codepage int> that contains all available return values. I could not find some codepages.

Code:

internal static Dictionary<string, int> UdeCharsetCodePages = new()
{
	{ "ASCII", 20127 },
	{ "UTF-8", 65001 },
	{ "UTF-16LE", 1200 },
	{ "UTF-16BE", 1201 },
	{ "UTF-32BE", 12001 },
	{ "UTF-32LE", 12000 },
	{ "X-ISO-10646-UCS-4-3412", 0 },
	{ "X-ISO-10646-UCS-4-2413", 0 },
	{ "windows-1251", 1251 },
	{ "windows-1252", 1252 },
	{ "windows-1253", 1253 },
	{ "windows-1255", 1255 },
	{ "Big-5", 950 },
	{ "EUC-KR", 51949 },
	{ "EUC-JP", 51932 },
	{ "EUC-TW", 0 },
	{ "gb18030", 54936 },
	{ "ISO-2022-JP", 50222 },
	{ "ISO-2022-CN", 0 },
	{ "ISO-2022-KR", 50225 },
	{ "HZ-GB-2312", 52936 },
	{ "Shift-JIS", 932 },
	{ "x-mac-cyrillic", 10007 },
	{ "KOI8-R", 20866 },
	{ "IBM855", 855 },
	{ "IBM866", 866 },
	{ "ISO-8859-2", 28592 },
	{ "ISO-8859-5", 28595 },
	{ "ISO-8859-7", 28597 },
	{ "ISO-8859-8", 28598 },
	{ "TIS620", 874 }
};
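
A caller will usually want to treat the 0 entries as "detected, but no usable code page". The following is a rough Python sketch of that lookup, not part of the DLL. The table is only a subset of the dictionary above, and `code_page_for` and `ahk_encoding_name` are hypothetical helper names. The `CPnnn` form is the code-page notation AutoHotkey accepts for encoding names.

```python
from typing import Optional

# Subset of the table above; 0 means no Windows code page was found.
UDE_CODE_PAGES = {
    "ASCII": 20127,
    "UTF-8": 65001,
    "windows-1252": 1252,
    "Shift-JIS": 932,
    "EUC-TW": 0,
    "ISO-2022-CN": 0,
}


def code_page_for(charset: str) -> Optional[int]:
    """Return the code page for a detected charset name,
    or None when the name is unknown or mapped to 0."""
    cp = UDE_CODE_PAGES.get(charset, 0)
    return cp if cp != 0 else None


def ahk_encoding_name(charset: str) -> Optional[str]:
    """Build an AutoHotkey-style 'CPnnn' encoding name, if one exists."""
    cp = code_page_for(charset)
    return f"CP{cp}" if cp is not None else None


print(ahk_encoding_name("windows-1252"))
print(ahk_encoding_name("EUC-TW"))
```
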
Attachments
UdeExport_x86.zip (x86, 73.35 KiB)
UdeExport.zip (x64, 73.38 KiB)
Last edited by dd900 on 27 Feb 2023, 02:04, edited 5 times in total.

Post by Sam_ » 11 Jan 2022, 16:46

I'm excited to try this out. If it's not too much trouble, I could really use a 32-bit version also. I have three questions:

1) If you point it to a file that does not contain valid UTF-8 strings (was created with some local codepage) but which was erroneously given a UTF-8 BOM, what does this tool do: stop detecting at the BOM or actually scan the contents?

2) In your opinion, would it be feasible/reasonable to implement a return value that indicates something to the effect of "this is binary data that does not represent human readable text in any recognizable encoding"? Or is this result generally obvious from the Confidence level?

3) What do you mean by:
dd900 wrote (28 Dec 2021, 14:44): I could not find some codepages.

Post by dd900 » 11 Jan 2022, 21:21

I added the 32-bit DLL to the first post.

1) It will read the BOM and stop. A file that is not UTF-8 should not have a UTF-8 BOM.

2) No. Charset detection is mostly guesswork. The confidence level is the best you will get. To my knowledge, no charset detector exists that could give a "reliable" response like the one you ask for.

3) As you can see in the Dictionary in the OP, some of the encodings have 0 for a value. This is because I could not find the Windows codepage integer that represents the encoding.
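
The behaviour in answer 1 can be illustrated with a short Python sketch. This is not Ude's code. `detect_by_bom` is a hypothetical function that returns as soon as a known BOM is matched and never inspects the bytes that follow, which is why a mislabelled file is still reported as UTF-8.

```python
from typing import Optional

# Longer BOMs come first so UTF-32LE is not mistaken for UTF-16LE.
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]


def detect_by_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None.
    The content after the BOM is never examined."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


# windows-1251 text that was wrongly given a UTF-8 BOM.
mislabelled = b"\xef\xbb\xbf" + "привет".encode("cp1251")
bom_result = detect_by_bom(mislabelled)

# The body itself is not valid UTF-8, but a BOM-first check never finds out.
try:
    mislabelled[3:].decode("utf-8")
    body_is_valid_utf8 = True
except UnicodeDecodeError:
    body_is_valid_utf8 = False

print(bom_result, body_is_valid_utf8)
```

A caller that cannot trust its BOMs would have to strip the BOM and validate or re-detect the remaining bytes itself.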

Post by tuzi » 28 Apr 2022, 06:09

Thank you, dd900, it's very useful. I created a library for it.

Code:

; Each function wraps the SAFEARRAY of VARIANTs (0x200C) returned by the DLL.
; As used in this library: [0] = encoding name, [1] = Windows code page, [2] = confidence.
; Nothing is returned when the confidence is below minConfidence.

; Returns the code page number (nothing if the code page is 0).
FileGetEncoding(path, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    encArr  := ComObject(0x200C, DllCall(dllPath "\GetFileEncoding", "Ptr", &path, "Ptr"), 1)
    if (encArr[1] and encArr[2] >= minConfidence)
        return encArr[1]
}

; Returns the encoding name, e.g. "UTF-8".
FileGetFormat(path, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    encArr := ComObject(0x200C, DllCall(dllPath "\GetFileEncoding", "Ptr", &path, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

; str must use ByRef, otherwise the string is easily truncated by 0x00.
; Returns the code page number (nothing if the code page is 0).
StringGetEncoding(ByRef str, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    encArr := ComObject(0x200C, DllCall(dllPath "\GetStringEncoding", "Ptr", &str, "Ptr"), 1)
    if (encArr[1] and encArr[2] >= minConfidence)
        return encArr[1]
}

; Returns the encoding name, e.g. "UTF-8".
StringGetFormat(ByRef str, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    encArr := ComObject(0x200C, DllCall(dllPath "\GetStringEncoding", "Ptr", &str, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

/*
    https://www.autohotkey.com/boards/viewtopic.php?t=98241
    Note: an encoding number of 0 means the encoding can be detected, but its Windows code page number is unknown.
    It also means AHK can NOT convert text in that encoding.
    
    X-ISO-10646-UCS-4-3412 = 0
    X-ISO-10646-UCS-4-2413 = 0
    EUC-TW                 = 0
    ISO-2022-CN            = 0
    ASCII                  = 20127
    UTF-8                  = 65001
    UTF-16LE               = 1200
    UTF-16BE               = 1201
    UTF-32LE               = 12000
    UTF-32BE               = 12001
    windows-1251           = 1251
    windows-1252           = 1252
    windows-1253           = 1253
    windows-1255           = 1255
    Big-5                  = 950
    EUC-KR                 = 51949
    EUC-JP                 = 51932
    gb18030                = 54936
    ISO-2022-JP            = 50222
    ISO-2022-KR            = 50225
    HZ-GB-2312             = 52936
    Shift-JIS              = 932
    x-mac-cyrillic         = 10007
    KOI8-R                 = 20866
    IBM855                 = 855
    IBM866                 = 866
    ISO-8859-2             = 28592
    ISO-8859-5             = 28595
    ISO-8859-7             = 28597
    ISO-8859-8             = 28598
    TIS620                 = 874
*/
Example usage:

Code:

MsgBox % FileGetEncoding("x:\test.txt")
MsgBox % FileGetFormat("x:\test.txt")

MsgBox % StringGetEncoding(var)
MsgBox % StringGetFormat(var)
Last edited by tuzi on 28 Apr 2022, 06:21, edited 2 times in total.

Post by tuzi » 28 Apr 2022, 06:15

@dd900

I think there is a bug in GetStringEncoding.

Here is the test code.

Code:

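; var holds the raw UTF-8 bytes of the string
StrPutVar("apple,哈hahaha,abc", var, "utf-8")
; Save the raw bytes of var to a file for comparison
BinSave(var, "bug.txt")

; Shows UTF-8, which is correct
MsgBox % FileGetFormat("bug.txt")

; Shows IBM866, which is wrong
MsgBox % StringGetFormat(var)

; Writes the whole buffer. VarSetCapacity(var) returns the capacity,
; so the terminating NUL written by StrPut is saved as well.
BinSave(ByRef var, filepath)
{
  f := FileOpen(filepath, "w")
  f.RawWrite(var, VarSetCapacity(var))
  f.Close()
}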

StrPutVar(string, ByRef var, encoding)
{
    VarSetCapacity( var, StrPut(string, encoding)
        * ((encoding="utf-16"||encoding="cp1200") ? 2 : 1) )
    return StrPut(string, &var, encoding)
}

FileGetFormat(path, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    encArr := ComObject(0x200C, DllCall(dllPath "\GetFileEncoding", "Ptr", &path, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

StringGetFormat(ByRef str, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    encArr := ComObject(0x200C, DllCall(dllPath "\GetStringEncoding", "Ptr", &str, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

Post by dd900 » 24 Jan 2023, 02:03

tuzi wrote: i think there is a bug in GetStringEncoding
No bug. It just doesn't work well with strings. I pointed it out in the OP.
dd900 wrote: GetStringEncoding <-- Not Very Useful, but it's there
GetStringEncoding was never meant to be used in production. I put it there for the sake of trying it. If I have time in the near future I will take a look and see if I can improve the detection for strings.

Post by Cubex » 31 Jan 2023, 08:30

I'm sorry. It still doesn't support UTF-8 filenames (for example with ž š)

Post by dd900 » 27 Feb 2023, 02:03

I'm no longer supporting this code. It works as intended. Further modifications will only bring more bugs. My best advice for those wanting to use this beyond its original intent is to get the source code from the OP, write some C# classes to your liking, and use them with CLR.ahk.

Post by tuzi » 09 Dec 2023, 01:56

Based on uchardet.dll and swagfag's code, I created a library, ahk-chardet.

I think this library can replace UdeExport.dll.

Hope it helps someone in need. :D
