AutoHotkey Community Let's help each other out
Icarus
Joined: 24 Nov 2005 Posts: 824
Posted: Tue Aug 18, 2009 11:33 am
Post subject: How to work with UTF-8
Hi everybody,
I am using the nice xpath library to read some RSS feeds, and it all works well.
I hit an obstacle, though, when reading an RSS feed that is UTF-8 encoded.
I am guessing the solution has nothing to do with xpath, but with how to work with UTF-8 in general.
So, what do I need to do in order to show a UTF-8 encoded string properly on the GUI?
Specifically, I have this string that does not show correctly:
| Code: |
You Won’t Find Putin On Russia’s Gogul
|
(I hope this preserves its qualities when pasted here)
Also, since my script reads several XML files, some UTF-8 encoded and some not, I hope there is a solution that handles both.
_________________
Sector-Seven - Freeware tools built with AutoHotkey
hugov
Joined: 27 May 2007 Posts: 2474
Icarus
Joined: 24 Nov 2005 Posts: 824
Posted: Tue Aug 18, 2009 12:06 pm
Hugo, you are the main man!
There is a lot of information to read there, but I was partially successful with YMP's code.
I made it into something like this:
| Code: |
#SingleInstance Force
FileRead bad, test.txt ; Reading a UTF-8 encoded file
good := Decode( bad )
msgbox %bad%`n%good%
Return
Decode( str ) {
    RawLen := StrLen(str) ; Length in bytes (ANSI build of AutoHotkey)
    BufSize := (RawLen + 1) * 2 ; Room for the UTF-16 copy
    VarSetCapacity(Buf, BufSize, 0)
    ; UTF-8 (codepage 65001) -> UTF-16
    DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
        , "int", -1, "uint", &Buf, "uint", RawLen + 1)
    ; UTF-16 -> ANSI (codepage 1252), written back into str
    DllCall("WideCharToMultiByte", "uint", 1252, "int", 0, "uint", &Buf, "int", -1
        , "str", str, "uint", RawLen + 1
        , "int", 0, "int", 0)
    Return str
}
|
This works fine for some characters, but not others.
I need to investigate further.
Thanks man (and YMP and VxE)
(And if anyone else has a better/other solution, please post)
YMP
Joined: 23 Dec 2006 Posts: 329 Location: Russia
Posted: Tue Aug 18, 2009 2:02 pm
Icarus,
If the UTF-8 text contains letters that belong to different ANSI charsets, it can't be translated to ANSI correctly. Another possibility is that your system's default codepage differs from 1252, which I used in my code. Try replacing 1252 with 0; that tells WideCharToMultiByte to use your default codepage. Maybe that's all you need.
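To see which codepage the value 0 stands for on a given machine, one could ask Windows directly. This is only a minimal sketch, assuming the ANSI build of AutoHotkey used in this thread; `GetACP` is the Win32 function that returns the system default ANSI codepage, and the variable name `acp` is just for illustration:

```autohotkey
; Ask Windows for the system default ANSI codepage
; (the one WideCharToMultiByte uses when its CodePage parameter is 0).
acp := DllCall("GetACP")
MsgBox % "System default ANSI codepage: " . acp
```

If the number shown is not 1252, a hard-coded 1252 would probably explain why only some characters came out right.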
Icarus
Joined: 24 Nov 2005 Posts: 824
Posted: Tue Aug 18, 2009 2:09 pm
YMP, you are great.
Changing it to 0 indeed fixed it all, that's terrific.
If you ask me, many things done with DllCalls like this in AHK should already be part of the official build as built-in functions.
String handling in AHK is lacking a little.
Icarus
Joined: 24 Nov 2005 Posts: 824
Posted: Sun Aug 23, 2009 9:51 am
@YMP (or someone who understands this encoding stuff... )
After I have used this function to make the UTF-8 text displayable in the AHK GUI windows, let's say I want to save it as part of an XML file.
What is the encoding I need to use?
Is it
| Code: |
<?xml version="1.0" encoding="Windows-1252"?>
|
I have followed YMP's advice and am using 0 instead of 1252 in the UtfDecode function. I am guessing that needs to be considered when dumping back to XML?
Of course, the best would probably be for me to have the opposite function, so that I can generate UTF-8 XMLs.
Any help is appreciated.
YMP
Joined: 23 Dec 2006 Posts: 329 Location: Russia
Posted: Sun Aug 23, 2009 10:38 am
There is probably not so much to be changed:
| Code: |
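Encode( str ) {
    RawLen := StrLen(str) ; Length in bytes (ANSI build of AutoHotkey)
    WideSize := (RawLen + 1) * 2 ; UTF-16: 2 bytes per character
    Utf8Size := RawLen * 3 + 1 ; UTF-8 can need up to 3 bytes per character (e.g. € from codepage 1252)
    VarSetCapacity(Buf1, WideSize, 0) ; For UTF-16.
    VarSetCapacity(Buf2, Utf8Size, 0) ; For UTF-8.
    ; ANSI (default codepage) -> UTF-16
    DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
        , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
    ; UTF-16 -> UTF-8 (codepage 65001)
    DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
        , "int", -1, "str", Buf2, "uint", Utf8Size
        , "int", 0, "int", 0)
    Return Buf2
}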
|
I assume the byte order mark (BOM) is not needed at the beginning of the UTF-8 file, since the encoding is specified in the XML header, but I am not quite sure. If you need it, you can write it first, like this:
| Code: |
BOM := Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
FileAppend, %BOM%, %A_Desktop%\test.xml
utf8 := Encode(Text)
FileAppend, %utf8%, %A_Desktop%\test.xml
|
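Going the other way, since some of the input files are UTF-8 and some are not, checking for the same three bytes at the start of a file is one cheap hint. This is only a sketch, assuming the ANSI build of AutoHotkey (where FileRead hands back the raw bytes); the helper name `HasUtf8Bom` and the file name are made up for illustration:

```autohotkey
FileRead, raw, test.xml ; hypothetical file name
if (HasUtf8Bom(raw))
    raw := SubStr(raw, 4) ; drop the 3-byte BOM before decoding

; Hypothetical helper: true if data starts with the UTF-8 BOM (EF BB BF).
HasUtf8Bom( ByRef data ) {
    BOM := Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
    Return SubStr(data, 1, 3) == BOM ; == compares case-sensitively
}
```

A file without a BOM can still be UTF-8, so this only catches part of the cases; for XML, the encoding named in the header is the other thing to look at.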
Icarus
Joined: 24 Nov 2005 Posts: 824
Posted: Sun Aug 23, 2009 10:55 am
YMP,
This is excellent. I appreciate the fast and complete response.
Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences...
In any case, thanks a lot, you are very helpful.
EDIT:
If anyone else reaches this thread, here is what I am now using to encode / decode UTF-8, based on YMP's collection of solutions.
| Code: |
/*
; Tester
#SingleInstance Force
FileRead utfText, test.txt ; Reading a UTF-8 encoded file
ansiText := UtfDecode( utfText ) ; Convert to ANSI
msgbox %utfText%`n`nConverted to`n`n%ansiText%
backToUtf := UtfEncode( ansiText ) ; Convert back to UTF-8
msgbox Back to UTF:`n%backToUtf%
FileDelete output.txt
FileAppend %backToUtf%, output.txt
Run output.txt
Return
*/
UtfDecode( str ) {
    RawLen := StrLen(str) ; Length in bytes (ANSI build of AutoHotkey)
    Charset := 0 ; 0 = system default ANSI codepage, or a fixed one such as 1252
    BufSize := (RawLen + 1) * 2 ; Room for the UTF-16 copy
    VarSetCapacity(Buf, BufSize, 0)
    ; UTF-8 (codepage 65001) -> UTF-16
    DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
        , "int", -1, "uint", &Buf, "uint", RawLen + 1)
    ; UTF-16 -> ANSI, written back into str
    DllCall("WideCharToMultiByte", "uint", Charset, "int", 0, "uint", &Buf, "int", -1
        , "str", str, "uint", RawLen + 1
        , "int", 0, "int", 0)
    Return str
}
UtfEncode( str ) {
    RawLen := StrLen(str)
    WideSize := (RawLen + 1) * 2 ; UTF-16: 2 bytes per character
    Utf8Size := RawLen * 3 + 1 ; UTF-8 can need up to 3 bytes per character (e.g. € from codepage 1252)
    VarSetCapacity(Buf1, WideSize, 0) ; For UTF-16.
    VarSetCapacity(Buf2, Utf8Size, 0) ; For UTF-8.
    ; ANSI (default codepage) -> UTF-16
    DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
        , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
    ; UTF-16 -> UTF-8 (codepage 65001)
    DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
        , "int", -1, "str", Buf2, "uint", Utf8Size
        , "int", 0, "int", 0)
    Return Buf2
}
UtfBom() {
    ; Put this string at the beginning of the file, if Byte Order Mark is needed
    ; Example: FileAppend % UtfBom() . VarContainingUTF, %Filename%
    Return Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
}
|
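For the XML question earlier in the thread, the functions above might be combined roughly as follows. This is a sketch only, assuming the ANSI build of AutoHotkey (where FileAppend writes the bytes as they are); the variable names, the sample element names, and the file name `feed.xml` are made up for illustration. Since the bytes written are UTF-8, the declaration names UTF-8 rather than Windows-1252:

```autohotkey
; Assumes UtfEncode() from the code above is part of the script.
title := "Some ANSI text shown in the GUI" ; hypothetical content
xml := "<?xml version=""1.0"" encoding=""UTF-8""?>`n"
xml .= "<item><title>" . title . "</title></item>`n"
FileDelete, feed.xml ; hypothetical file name
FileAppend, % UtfEncode(xml), feed.xml
```

Note that this does not escape characters such as & or < inside the title; that would still need to be handled separately.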
YMP
Joined: 23 Dec 2006 Posts: 329 Location: Russia
Posted: Sun Aug 23, 2009 11:13 am
Icarus wrote:
Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences...

Well, the only alternative to MSDN I have used is downloading a Platform SDK, but its documentation contains the same sentences. Yes, it's sometimes a bit hard to get through. But practice makes it easier.
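As a small illustration of reading an MSDN prototype, here is how the parameters of MultiByteToWideChar might line up with a DllCall. It is a sketch, assuming the ANSI build of AutoHotkey; the variable names are only for illustration. Passing 0 as the last parameter makes the function report the required buffer size instead of converting anything:

```autohotkey
; MSDN prototype:
;   int MultiByteToWideChar(UINT CodePage, DWORD dwFlags, LPCSTR lpMultiByteStr,
;                           int cbMultiByte, LPWSTR lpWideCharStr, int cchWideChar);
; Each parameter becomes a "type", value pair, in the same order.
utf8 := "abc" ; sample input; plain ASCII is also valid UTF-8
needed := DllCall("MultiByteToWideChar"
    , "uint", 65001   ; CodePage: 65001 = UTF-8
    , "uint", 0       ; dwFlags: none
    , "str", utf8     ; lpMultiByteStr: the input string
    , "int", -1       ; cbMultiByte: -1 = input is null-terminated
    , "uint", 0       ; lpWideCharStr: no output buffer for this call
    , "int", 0)       ; cchWideChar: 0 = only report the required size
MsgBox % "UTF-16 characters needed (including the terminator): " . needed
```

Once the types in the prototype are matched up this way, most of the other functions on MSDN follow the same pattern.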