[Solved] Encoding Problem (UCS-2 Big Endian)


6 replies to this topic
Hamlet
  • Members
  • 302 posts
  • Last active: Mar 23 2014 03:37 PM
  • Joined: 22 Jan 2009
I have a TXT file.
Notepad++ reports its encoding as "UCS-2 Big Endian".

This is where my problem starts.

I tried these 10 encodings, and none of them worked.

FileEncoding, CP65000						;  UTF-7 
FileEncoding, UTF-8	
FileEncoding, UTF-8-RAW
FileEncoding, UTF-16						
FileEncoding, UTF-16-RAW		
FileEncoding, CP12000						;  UTF-32, Little Endian 
FileEncoding, CP12001						;  UTF-32, Big Endian 
FileEncoding, CP51949
FileEncoding, CP28591
FileEncoding, CP949

Help me please.

I want to read this file as Unicode or ANSI; either is fine as long as AHK can read it.
If I convert one of these files to UTF-8 in Notepad++, it works fine.
But I have almost 2,000 of them, so I need an automated way to do this in AHK.

  • Guests
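problemFilePath := A_Desktop . "\Problems.txt"
; Try every code page listed in the CP class below and show how the file reads in each one.
For identifier, description in CP {
    ; Only numeric keys are code page identifiers; skip anything else.
    If identifier is Digit
    {
        f := FileOpen(problemFilePath, "r", "CP" . identifier)
        MsgBox % "encoding : " . description . "`n`n" . f.Read()
        f.Close()
    }
}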

Class CP
{
    static 037 := "IBM037  IBM EBCDIC US-Canada"
    static 437 := "IBM437  OEM United States"
    static 500 := "IBM500  IBM EBCDIC International"
    static 708 := "ASMO-708    Arabic (ASMO 708)"
    static 709 := "    Arabic (ASMO-449+, BCON V4)"
    static 710 := "    Arabic - Transparent Arabic"
    static 720 := "DOS-720 Arabic (Transparent ASMO); Arabic (DOS)"
    static 737 := "ibm737  OEM Greek (formerly 437G); Greek (DOS)"
    static 775 := "ibm775  OEM Baltic; Baltic (DOS)"
    static 850 := "ibm850  OEM Multilingual Latin 1; Western European (DOS)"
    static 852 := "ibm852  OEM Latin 2; Central European (DOS)"
    static 855 := "IBM855  OEM Cyrillic (primarily Russian)"
    static 857 := "ibm857  OEM Turkish; Turkish (DOS)"
    static 858 := "IBM00858    OEM Multilingual Latin 1 + Euro symbol"
    static 860 := "IBM860  OEM Portuguese; Portuguese (DOS)"
    static 861 := "ibm861  OEM Icelandic; Icelandic (DOS)"
    static 862 := "DOS-862 OEM Hebrew; Hebrew (DOS)"
    static 863 := "IBM863  OEM French Canadian; French Canadian (DOS)"
    static 864 := "IBM864  OEM Arabic; Arabic (864)"
    static 865 := "IBM865  OEM Nordic; Nordic (DOS)"
    static 866 := "cp866   OEM Russian; Cyrillic (DOS)"
    static 869 := "ibm869  OEM Modern Greek; Greek, Modern (DOS)"
    static 870 := "IBM870  IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2"
    static 874 := "windows-874 ANSI/OEM Thai (ISO 8859-11); Thai (Windows)"
    static 875 := "cp875   IBM EBCDIC Greek Modern"
    static 932 := "shift_jis   ANSI/OEM Japanese; Japanese (Shift-JIS)"
    static 936 := "gb2312  ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)"
    static 949 := "ks_c_5601-1987  ANSI/OEM Korean (Unified Hangul Code)"
    static 950 := "big5    ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)"
    static 1026 :=    "IBM1026 IBM EBCDIC Turkish (Latin 5)"
    static 1047 :=    "IBM01047    IBM EBCDIC Latin 1/Open System"
    static 1140 :=    "IBM01140    IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)"
    static 1141 :=    "IBM01141    IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)"
    static 1142 :=    "IBM01142    IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)"
    static 1143 :=    "IBM01143    IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)"
    static 1144 :=    "IBM01144    IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)"
    static 1145 :=    "IBM01145    IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)"
    static 1146 :=    "IBM01146    IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)"
    static 1147 :=    "IBM01147    IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)"
    static 1148 :=    "IBM01148    IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)"
    static 1149 :=    "IBM01149    IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)"
    static 1200 :=    "utf-16  Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications"
    static 1201 :=    "unicodeFFFE Unicode UTF-16, big endian byte order; available only to managed applications"
    static 1250 :=    "windows-1250    ANSI Central European; Central European (Windows)"
    static 1251 :=    "windows-1251    ANSI Cyrillic; Cyrillic (Windows)"
    static 1252 :=    "windows-1252    ANSI Latin 1; Western European (Windows)"
    static 1253 :=    "windows-1253    ANSI Greek; Greek (Windows)"
    static 1254 :=    "windows-1254    ANSI Turkish; Turkish (Windows)"
    static 1255 :=    "windows-1255    ANSI Hebrew; Hebrew (Windows)"
    static 1256 :=    "windows-1256    ANSI Arabic; Arabic (Windows)"
    static 1257 :=    "windows-1257    ANSI Baltic; Baltic (Windows)"
    static 1258 :=    "windows-1258    ANSI/OEM Vietnamese; Vietnamese (Windows)"
    static 1361 :=    "Johab   Korean (Johab)"
    static 10000 :=   "macintosh   MAC Roman; Western European (Mac)"
    static 10001 :=   "x-mac-japanese  Japanese (Mac)"
    static 10002 :=   "x-mac-chinesetrad   MAC Traditional Chinese (Big5); Chinese Traditional (Mac)"
    static 10003 :=   "x-mac-korean    Korean (Mac)"
    static 10004 :=   "x-mac-arabic    Arabic (Mac)"
    static 10005 :=   "x-mac-hebrew    Hebrew (Mac)"
    static 10006 :=   "x-mac-greek Greek (Mac)"
    static 10007 :=   "x-mac-cyrillic  Cyrillic (Mac)"
    static 10008 :=   "x-mac-chinesesimp   MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)"
    static 10010 :=   "x-mac-romanian  Romanian (Mac)"
    static 10017 :=   "x-mac-ukrainian Ukrainian (Mac)"
    static 10021 :=   "x-mac-thai  Thai (Mac)"
    static 10029 :=   "x-mac-ce    MAC Latin 2; Central European (Mac)"
    static 10079 :=   "x-mac-icelandic Icelandic (Mac)"
    static 10081 :=   "x-mac-turkish   Turkish (Mac)"
    static 10082 :=   "x-mac-croatian  Croatian (Mac)"
    static 12000 :=   "utf-32  Unicode UTF-32, little endian byte order; available only to managed applications"
    static 12001 :=   "utf-32BE    Unicode UTF-32, big endian byte order; available only to managed applications"
    static 20000 :=   "x-Chinese_CNS   CNS Taiwan; Chinese Traditional (CNS)"
    static 20001 :=   "x-cp20001   TCA Taiwan"
    static 20002 :=   "x_Chinese-Eten  Eten Taiwan; Chinese Traditional (Eten)"
    static 20003 :=   "x-cp20003   IBM5550 Taiwan"
    static 20004 :=   "x-cp20004   TeleText Taiwan"
    static 20005 :=   "x-cp20005   Wang Taiwan"
    static 20105 :=   "x-IA5   IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)"
    static 20106 :=   "x-IA5-German    IA5 German (7-bit)"
    static 20107 :=   "x-IA5-Swedish   IA5 Swedish (7-bit)"
    static 20108 :=   "x-IA5-Norwegian IA5 Norwegian (7-bit)"
    static 20127 :=   "us-ascii    US-ASCII (7-bit)"
    static 20261 :=   "x-cp20261   T.61"
    static 20269 :=   "x-cp20269   ISO 6937 Non-Spacing Accent"
    static 20273 :=   "IBM273  IBM EBCDIC Germany"
    static 20277 :=   "IBM277  IBM EBCDIC Denmark-Norway"
    static 20278 :=   "IBM278  IBM EBCDIC Finland-Sweden"
    static 20280 :=   "IBM280  IBM EBCDIC Italy"
    static 20284 :=   "IBM284  IBM EBCDIC Latin America-Spain"
    static 20285 :=   "IBM285  IBM EBCDIC United Kingdom"
    static 20290 :=   "IBM290  IBM EBCDIC Japanese Katakana Extended"
    static 20297 :=   "IBM297  IBM EBCDIC France"
    static 20420 :=   "IBM420  IBM EBCDIC Arabic"
    static 20423 :=   "IBM423  IBM EBCDIC Greek"
    static 20424 :=   "IBM424  IBM EBCDIC Hebrew"
    static 20833 :=   "x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended"
    static 20838 :=   "IBM-Thai    IBM EBCDIC Thai"
    static 20866 :=   "koi8-r  Russian (KOI8-R); Cyrillic (KOI8-R)"
    static 20871 :=   "IBM871  IBM EBCDIC Icelandic"
    static 20880 :=   "IBM880  IBM EBCDIC Cyrillic Russian"
    static 20905 :=   "IBM905  IBM EBCDIC Turkish"
    static 20924 :=   "IBM00924    IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)"
    static 20932 :=   "EUC-JP  Japanese (JIS 0208-1990 and 0212-1990)"
    static 20936 :=   "x-cp20936   Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)"
    static 20949 :=   "x-cp20949   Korean Wansung"
    static 21025 :=   "cp1025  IBM EBCDIC Cyrillic Serbian-Bulgarian"
    static 21027 :=   "    (deprecated)"
    static 21866 :=   "koi8-u  Ukrainian (KOI8-U); Cyrillic (KOI8-U)"
    static 28591 :=   "iso-8859-1  ISO 8859-1 Latin 1; Western European (ISO)"
    static 28592 :=   "iso-8859-2  ISO 8859-2 Central European; Central European (ISO)"
    static 28593 :=   "iso-8859-3  ISO 8859-3 Latin 3"
    static 28594 :=   "iso-8859-4  ISO 8859-4 Baltic"
    static 28595 :=   "iso-8859-5  ISO 8859-5 Cyrillic"
    static 28596 :=   "iso-8859-6  ISO 8859-6 Arabic"
    static 28597 :=   "iso-8859-7  ISO 8859-7 Greek"
    static 28598 :=   "iso-8859-8  ISO 8859-8 Hebrew; Hebrew (ISO-Visual)"
    static 28599 :=   "iso-8859-9  ISO 8859-9 Turkish"
    static 28603 :=   "iso-8859-13 ISO 8859-13 Estonian"
    static 28605 :=   "iso-8859-15 ISO 8859-15 Latin 9"
    static 29001 :=   "x-Europa    Europa 3"
    static 38598 :=   "iso-8859-8-i    ISO 8859-8 Hebrew; Hebrew (ISO-Logical)"
    static 50220 :=   "iso-2022-jp ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)"
    static 50221 :=   "csISO2022JP ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)"
    static 50222 :=   "iso-2022-jp ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)"
    static 50225 :=   "iso-2022-kr ISO 2022 Korean"
    static 50227 :=   "x-cp50227   ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)"
    static 50229 :=   "    ISO 2022 Traditional Chinese"
    static 50930 :=   "    EBCDIC Japanese (Katakana) Extended"
    static 50931 :=   "    EBCDIC US-Canada and Japanese"
    static 50933 :=   "    EBCDIC Korean Extended and Korean"
    static 50935 :=   "    EBCDIC Simplified Chinese Extended and Simplified Chinese"
    static 50936 :=   "    EBCDIC Simplified Chinese"
    static 50937 :=   "    EBCDIC US-Canada and Traditional Chinese"
    static 50939 :=   "    EBCDIC Japanese (Latin) Extended and Japanese"
    static 51932 :=   "euc-jp  EUC Japanese"
    static 51936 :=   "EUC-CN  EUC Simplified Chinese; Chinese Simplified (EUC)"
    static 51949 :=   "euc-kr  EUC Korean"
    static 51950 :=   "    EUC Traditional Chinese"
    static 52936 :=   "hz-gb-2312  HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)"
    static 54936 :=   "GB18030 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)"
    static 57002 :=   "x-iscii-de  ISCII Devanagari"
    static 57003 :=   "x-iscii-be  ISCII Bengali"
    static 57004 :=   "x-iscii-ta  ISCII Tamil"
    static 57005 :=   "x-iscii-te  ISCII Telugu"
    static 57006 :=   "x-iscii-as  ISCII Assamese"
    static 57007 :=   "x-iscii-or  ISCII Oriya"
    static 57008 :=   "x-iscii-ka  ISCII Kannada"
    static 57009 :=   "x-iscii-ma  ISCII Malayalam"
    static 57010 :=   "x-iscii-gu  ISCII Gujarati"
    static 57011 :=   "x-iscii-pa  ISCII Punjabi"
    static 65000 :=   "utf-7   Unicode (UTF-7)"
    static 65001 :=   "utf-8  Unicode (UTF-8)"
}

:)

Hamlet
funny ....

Hamlet
This website has changed for the worse.
It is much harder to copy code from it.
Where is the "copy" button?
Do I have to drag-select all of it? Very inconvenient; it feels stupid.

Anyway, I am going to try your code.
(You know what? If you open the file in some editors, it is reported as UTF-16 something... ^^)

Hamlet
No. No. No.
It looks like UTF-16 LE, but I only see alien letters.
It should not be this hard.
I do not know...


Thanks for your code, Guest.
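
One way to check what such a file really is would be to look at its first bytes. The sketch below is only an illustration, assuming AutoHotkey_L v1.1; `DetectBom` is a hypothetical helper name, not something from this thread. A UTF-16 big-endian file normally starts with the bytes FE FF, and reading it as little-endian pairs every byte with the wrong neighbour, which would explain the "alien letters".

```autohotkey
; Hypothetical helper: report which Unicode byte order mark (BOM), if any,
; a file starts with. Files saved without a BOM return "none".
DetectBom(path) {
    f := FileOpen(path, "r")
    if !IsObject(f)
        return "unreadable"
    f.Seek(0)                       ; FileOpen may have skipped a BOM it recognised
    VarSetCapacity(head, 4, 0)
    got := f.RawRead(head, 3)       ; number of bytes actually read
    f.Close()
    b0 := NumGet(head, 0, "UChar"), b1 := NumGet(head, 1, "UChar"), b2 := NumGet(head, 2, "UChar")
    if (got >= 3 && b0 = 0xEF && b1 = 0xBB && b2 = 0xBF)
        return "UTF-8"
    if (got >= 2 && b0 = 0xFE && b1 = 0xFF)
        return "UTF-16 BE"
    if (got >= 2 && b0 = 0xFF && b1 = 0xFE)
        return "UTF-16 LE"          ; a UTF-32 LE BOM also begins with FF FE
    return "none"
}

; Usage, with the path from the Guest's post above:
MsgBox % DetectBom(A_Desktop . "\Problems.txt")
```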

Lexikos
  • Administrators
  • 9844 posts
  • AutoHotkey Foundation
  • Joined: 17 Oct 2006
AutoHotkey_L doesn't support any 16-bit big-endian encoding types.

Workaround based on a post by SKAN:
FileRead BE, *c Problemitic.txt  ; Read as binary.

VarSetCapacity(LE, 2*cch:=VarSetCapacity(BE)//2), LCMAP_BYTEREV := 0x800
DllCall( "LCMapStringW", UInt,0, UInt,LCMAP_BYTEREV, Str,BE, UInt,cch, Str,LE, UInt,cch )

MsgBox % LE  ; AutoHotkey_L Unicode only.
ANSI versions of AutoHotkey require an additional conversion, as SKAN demonstrated.
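
For the roughly 2,000 files mentioned in the first post, the same idea could be wrapped in a loop. The following is a minimal sketch rather than a tested tool, assuming a Unicode build of AutoHotkey_L v1.1.21 or later (for `Loop, Files`). The folder names and the `ReadUtf16BE` helper are hypothetical, and it swaps bytes with a plain NumGet/NumPut loop instead of the LCMapStringW call above, which is slower but easier to follow.

```autohotkey
srcDir := A_Desktop . "\ProblemFiles"    ; assumed folder holding the UTF-16 BE files
outDir := srcDir . "\utf8"               ; assumed output folder for the converted copies
FileCreateDir, %outDir%

Loop, Files, %srcDir%\*.txt
{
    text := ReadUtf16BE(A_LoopFileFullPath)
    out := FileOpen(outDir . "\" . A_LoopFileName, "w", "UTF-8")
    if !IsObject(out)
        continue                         ; could not create the output file; skip it
    out.Write(text)
    out.Close()
}

; Hypothetical helper: read a UTF-16 big-endian file and return its text.
ReadUtf16BE(path) {
    f := FileOpen(path, "r")
    if !IsObject(f)
        return ""
    f.Seek(0)                            ; make sure reading starts at the first byte
    size := f.Length
    VarSetCapacity(buf, size + 2, 0)
    f.RawRead(buf, size)                 ; raw bytes, no decoding
    f.Close()

    start := 0
    if (size >= 2 && NumGet(buf, 0, "UChar") = 0xFE && NumGet(buf, 1, "UChar") = 0xFF)
        start := 2                       ; skip the big-endian BOM (FE FF)
    if (size - start < 2)
        return ""                        ; nothing left to decode

    ; Swap each pair of bytes in place so the buffer becomes little-endian.
    offset := start
    while (offset + 1 < size) {
        hi := NumGet(buf, offset, "UChar")
        NumPut(NumGet(buf, offset + 1, "UChar"), buf, offset, "UChar")
        NumPut(hi, buf, offset + 1, "UChar")
        offset += 2
    }
    return StrGet(&buf + start, (size - start) // 2, "UTF-16")
}
```

The output is written as UTF-8 because the first post reports that files converted to UTF-8 in Notepad++ already work. The originals are left untouched, so the result can be checked on a few files before processing the rest.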

Hamlet
Thanks, most of all, for your reply.


Thanks again for confirming that AHK does not support UCS-2 BE.


And finally, thanks for the workaround!!!!