
[Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 28 Dec 2021, 14:44
by dd900
No longer supported.

UdeExport.dll (x64), based on the Mozilla Universal Charset Detector (Ude C# port).
If there is a need for an x86 build, let me know. (x86 DLL added.)

Original code for Ude can be found here.

"Mostly" Accurate Encoding Detector
Not much different from the original. I updated some of the code a little, but for the most part I just added a couple of exported functions and packed it up so it's easily callable from AutoHotkey.

There are two functions available:
GetFileEncoding

GetStringEncoding <-- Not Very Useful, but it's there

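GetFileEncoding will often return "ASCII" even if the file was saved as UTF-8. Unless the file contains non-ASCII (multi-byte UTF-8) characters or a UTF-8 BOM, this will always be the case, because pure ASCII text is byte-for-byte identical in both encodings.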
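
To make the ASCII/UTF-8 overlap concrete, here is a minimal AutoHotkey v1 sketch. IsPlainAscii is a hypothetical helper written for illustration, not part of UdeExport.dll. It encodes a string as UTF-8 and checks whether any byte is 0x80 or above. If none is, no detector can tell the data apart from ASCII.

```autohotkey
; Hypothetical helper (not part of UdeExport.dll).
; Returns true if the UTF-8 encoding of text contains only bytes below 0x80,
; i.e. the data is indistinguishable from plain ASCII.
IsPlainAscii(text)
{
    size := StrPut(text, "UTF-8")          ; required bytes, including the null terminator
    VarSetCapacity(buf, size, 0)
    StrPut(text, &buf, "UTF-8")
    Loop % size - 1                        ; skip the null terminator
        if (NumGet(buf, A_Index - 1, "UChar") >= 0x80)
            return false
    return true
}

MsgBox % IsPlainAscii("apple, hahaha, abc")                   ; shows 1
MsgBox % IsPlainAscii("apple," Chr(0x54C8) "hahaha,abc")      ; shows 0 (contains a CJK character)
```

Chr() is used for the non-ASCII character so that the result does not depend on how the script file itself is saved.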

Here is the Dictionary<Encoding string, Codepage int> that contains all available return values. I could not find some codepages.

Code: Select all

internal static Dictionary<string, int> UdeCharsetCodePages = new()
{
	{ "ASCII", 20127 },
	{ "UTF-8", 65001 },
	{ "UTF-16LE", 1200 },
	{ "UTF-16BE", 1201 },
	{ "UTF-32BE", 12001 },
	{ "UTF-32LE", 12000 },
	{ "X-ISO-10646-UCS-4-3412", 0 },
	{ "X-ISO-10646-UCS-4-2413", 0 },
	{ "windows-1251", 1251 },
	{ "windows-1252", 1252 },
	{ "windows-1253", 1253 },
	{ "windows-1255", 1255 },
	{ "Big-5", 950 },
	{ "EUC-KR", 51949 },
	{ "EUC-JP", 51932 },
	{ "EUC-TW", 0 },
	{ "gb18030", 54936 },
	{ "ISO-2022-JP", 50222 },
	{ "ISO-2022-CN", 0 },
	{ "ISO-2022-KR", 50225 },
	{ "HZ-GB-2312", 52936 },
	{ "Shift-JIS", 932 },
	{ "x-mac-cyrillic", 10007 },
	{ "KOI8-R", 20866 },
	{ "IBM855", 855 },
	{ "IBM866", 866 },
	{ "ISO-8859-2", 28592 },
	{ "ISO-8859-5", 28595 },
	{ "ISO-8859-7", 28597 },
	{ "ISO-8859-8", 28598 },
	{ "TIS620", 874 }
};

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 11 Jan 2022, 16:46
by Sam_
I'm excited to try this out. If it's not too much trouble, I could really use a 32-bit version also. I have three questions:

1) If you point it to a file that does not contain valid UTF-8 strings (was created with some local codepage) but which was erroneously given a UTF-8 BOM, what does this tool do: stop detecting at the BOM or actually scan the contents?

2) In your opinion, would it be feasible/reasonable to implement a return value that indicates something to the effect of "this is binary data that does not represent human readable text in any recognizable encoding"? Or is this result generally obvious from the Confidence level?

3) What do you mean by:
dd900 wrote:
28 Dec 2021, 14:44
I could not find some codepages.

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 11 Jan 2022, 21:21
by dd900
I added the 32-bit DLL to the first post.

1) It will read the BOM and stop. A file that is not UTF-8 should not have a UTF-8 BOM.

2) No. Charset detection is mostly guesswork. The confidence level is the best you will get. To my knowledge, no charset detector exists that could give a "reliable" response like the one you ask for.

3) As you can see in the Dictionary in the OP, some of the encodings have 0 for a value. This is because I could not find the Windows code page number that represents the encoding.
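
Regarding answer 1: since detection stops at the BOM, a mislabeled file will be reported as UTF-8. A caller who wants to know whether a UTF-8 result may have come from the BOM alone can check for it separately. The following is a minimal AutoHotkey v1 sketch. HasUtf8Bom is a hypothetical helper and not something the DLL exports.

```autohotkey
; Hypothetical helper (not part of UdeExport.dll).
; Returns true if the file starts with the UTF-8 BOM bytes EF BB BF.
HasUtf8Bom(path)
{
    f := FileOpen(path, "r")
    if !IsObject(f)
        return false
    f.Pos := 0                      ; FileOpen skips a BOM it recognizes, so rewind first
    VarSetCapacity(buf, 3, 0)
    bytesRead := f.RawRead(buf, 3)
    f.Close()
    return bytesRead = 3
        && NumGet(buf, 0, "UChar") = 0xEF
        && NumGet(buf, 1, "UChar") = 0xBB
        && NumGet(buf, 2, "UChar") = 0xBF
}

MsgBox % HasUtf8Bom("x:\test.txt") ? "UTF-8 BOM present" : "No UTF-8 BOM"
```

If the BOM is present, the detector's answer says nothing about whether the rest of the file is valid UTF-8.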

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 28 Apr 2022, 06:09
by tuzi
Thank you dd900, it's very useful. I created a library for it.

Code: Select all

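; Returns the Windows code page number of the detected encoding.
; Returns an empty string if detection fails, the code page is 0 (unknown),
; or the confidence is below minConfidence.
FileGetEncoding(path, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    ; The DLL returns a SAFEARRAY of VARIANTs, used here as:
    ; [0] = encoding name, [1] = code page number, [2] = confidence.
    encArr  := ComObject(0x200C, DllCall(dllPath "\GetFileEncoding", "Ptr", &path, "Ptr"), 1)
    if (encArr[1] and encArr[2] >= minConfidence)
        return encArr[1]
}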

FileGetFormat(path, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    encArr := ComObject(0x200C, DllCall(dllPath "\GetFileEncoding", "Ptr", &path, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

; str must use ByRef, otherwise the string is easily truncated by 0x00.
StringGetEncoding(ByRef str, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    encArr := ComObject(0x200C, DllCall(dllPath "\GetStringEncoding", "Ptr", &str, "Ptr"), 1)
    if (encArr[1] and encArr[2] >= minConfidence)
        return encArr[1]
}

StringGetFormat(ByRef str, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    
    encArr := ComObject(0x200C, DllCall(dllPath "\GetStringEncoding", "Ptr", &str, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

/*
    https://www.autohotkey.com/boards/viewtopic.php?t=98241
    Note: an encoding number of 0 means the encoding can be detected, but its Windows code page number is unknown.
    That also means you can NOT convert such text using AHK.
    
    X-ISO-10646-UCS-4-3412 = 0
    X-ISO-10646-UCS-4-2413 = 0
    EUC-TW                 = 0
    ISO-2022-CN            = 0
    ASCII                  = 20127
    UTF-8                  = 65001
    UTF-16LE               = 1200
    UTF-16BE               = 1201
    UTF-32LE               = 12000
    UTF-32BE               = 12001
    windows-1251           = 1251
    windows-1252           = 1252
    windows-1253           = 1253
    windows-1255           = 1255
    Big-5                  = 950
    EUC-KR                 = 51949
    EUC-JP                 = 51932
    gb18030                = 54936
    ISO-2022-JP            = 50222
    ISO-2022-KR            = 50225
    HZ-GB-2312             = 52936
    Shift-JIS              = 932
    x-mac-cyrillic         = 10007
    KOI8-R                 = 20866
    IBM855                 = 855
    IBM866                 = 866
    ISO-8859-2             = 28592
    ISO-8859-5             = 28595
    ISO-8859-7             = 28597
    ISO-8859-8             = 28598
    TIS620                 = 874
*/
Example:

Code: Select all

MsgBox % FileGetEncoding("x:\test.txt")
MsgBox % FileGetFormat("x:\test.txt")

MsgBox % StringGetEncoding(var)
MsgBox % StringGetFormat(var)
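
One way the detected code page might be used is to pass it to FileRead's `*P` option. This is a sketch, assuming the FileGetEncoding() function from the library above is included and the DLL sits next to the script. The path is only an example. Note that FileGetEncoding() returns an empty string both when detection fails and when the code page is 0, so both cases land in the first branch.

```autohotkey
; Sketch: read a file using the code page reported by FileGetEncoding().
; Assumes FileGetEncoding() from the library above is available.
path := "x:\test.txt"
cp := FileGetEncoding(path)

if (cp = "")
    MsgBox Detection failed, confidence was too low, or the code page is unknown (0).
else
{
    FileRead, text, *P%cp% %path%      ; *Pnnn overrides the default file encoding
    if ErrorLevel
        MsgBox Could not read %path%.
    else
        MsgBox % text
}
```

Whether every code page in the list is accepted by FileRead has not been checked here, so treat a successful detection as a hint rather than a guarantee that the conversion will work.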

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 28 Apr 2022, 06:15
by tuzi
@dd900

i think there is a bug in GetStringEncoding

Here is the test code.

Code: Select all

; var holds the string encoded as UTF-8
StrPutVar("apple,哈hahaha,abc", var, "utf-8")
; save the raw bytes of var to a file for testing
binsave(var,"bug.txt")

; shows utf-8, which is correct
MsgBox % FileGetFormat("bug.txt")

; shows IBM866, which is wrong
MsgBox % StringGetFormat(var)

BinSave(ByRef var, filepath)
{
  f:=FileOpen(filepath, "w")
  f.RawWrite(var, VarSetCapacity(var))
  f.Close()
}

StrPutVar(string, ByRef var, encoding)
{
    VarSetCapacity( var, StrPut(string, encoding)
        * ((encoding="utf-16"||encoding="cp1200") ? 2 : 1) )
    return StrPut(string, &var, encoding)
}

FileGetFormat(path, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    encArr := ComObject(0x200C, DllCall(dllPath "\GetFileEncoding", "Ptr", &path, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

StringGetFormat(ByRef str, minConfidence := 0.5)
{
    static dllPath := A_PtrSize=8 ? "UdeExport.dll" : "UdeExport_x86.dll"
    encArr := ComObject(0x200C, DllCall(dllPath "\GetStringEncoding", "Ptr", &str, "Ptr"), 1)
    if (encArr[0] and encArr[2] >= minConfidence)
        return encArr[0]
}

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 24 Jan 2023, 02:03
by dd900
tuzi wrote: i think there is a bug in GetStringEncoding
No bug. It just doesn't work well with strings. I pointed it out in the OP.
dd900 wrote: GetStringEncoding <-- Not Very Useful, but it's there
GetStringEncoding was never meant to be used in production. I put it there for the sake of trying it. If I have time in the near future I will take a look and see if I can improve the detection for strings.
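
Since tuzi's test shows file detection giving the expected result for the same bytes, one possible workaround is to write the buffer to a temporary file and run the file detection on it. The sketch below is AutoHotkey v1 and assumes tuzi's FileGetFormat() from this thread is included. BufferGetFormat and the temp file name are hypothetical, made up for illustration.

```autohotkey
; Hypothetical helper: writes size bytes from buf to a temp file
; and runs the file-based detection on it.
; Assumes tuzi's FileGetFormat() from this thread is included.
BufferGetFormat(ByRef buf, size, minConfidence := 0.5)
{
    tmpPath := A_Temp "\ude_" A_TickCount ".tmp"
    f := FileOpen(tmpPath, "w", "UTF-8-RAW")   ; -RAW so that no BOM is written
    if !IsObject(f)
        return
    f.RawWrite(buf, size)
    f.Close()
    fmt := FileGetFormat(tmpPath, minConfidence)
    FileDelete, %tmpPath%
    return fmt
}

; Usage: encode a string as UTF-8 and detect it via the temp file.
text := "apple," Chr(0x54C8) "hahaha,abc"
size := StrPut(text, "UTF-8")              ; required bytes, including the null terminator
VarSetCapacity(var, size, 0)
StrPut(text, &var, "UTF-8")
MsgBox % BufferGetFormat(var, size - 1)    ; exclude the null terminator
```

The explicit size parameter avoids writing the terminating null or any spare capacity into the file. The extra disk round trip makes this slower than a direct call, so it is only a stopgap.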

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 31 Jan 2023, 08:30
by Cubex
I'm sorry. It still doesn't support UTF-8 filenames (for example with ž š)

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 27 Feb 2023, 02:03
by dd900
I'm no longer supporting this code. It works as intended. Further modifications will only bring more bugs. My best advice for those wanting to use this beyond its original intent is to get the source code from the OP, write some C# classes to your liking, and use them with CLR.ahk.

Re: [Dll] UdeExport.dll (x64) Detect File/String Encoding

Posted: 09 Dec 2023, 01:56
by tuzi
Based on uchardet.dll and swagfag's code, I created a lib: ahk-chardet.

I think this lib can replace UdeExport.dll.

hope it helps someone in need. :D