AutoHotkey Community
How to work with UTF-8

 
Icarus



Joined: 24 Nov 2005
Posts: 824

Posted: Tue Aug 18, 2009 11:33 am    Post subject: How to work with UTF-8

Hi everybody,

I am using the nice xpath library to read some RSS feeds and all is nice.
I bumped into an obstacle, when reading RSS feed that is UTF-8 encoded.

I am guessing the solution does not have anything to do with xpath but with how to work with UTF-8 in general.

So, what do I need to do in order to show a UTF-8 encoded string on the GUI properly?

Specifically, I have this string that does not show correctly:
Code:

You Won’t Find Putin On Russia’s Gogul

(I hope this preserves its qualities when pasted here)

Also, since my script reads several XML files, some will be UTF-8 encoded and some won't, so I hope there is a solution that accounts for this.
_________________
Sector-Seven - Freeware tools built with AutoHotkey
hugov



Joined: 27 May 2007
Posts: 2474

Posted: Tue Aug 18, 2009 11:37 am

Look at YMP's and VxE's solutions in http://www.autohotkey.com/forum/topic47400.html. Before processing the XML files you could convert them using either of these functions (to ANSI or to "html").
_________________
Tut 4 Newbies
TF : Text file & string lib, TF Forum
Icarus



Joined: 24 Nov 2005
Posts: 824

Posted: Tue Aug 18, 2009 12:06 pm

Hugo, you are the main man!

There is a lot of information to read there, but I was partially successful with YMP's code.

I adapted it into something like this:
Code:

#SingleInstance Force

FileRead bad, test.txt ; Reading a UTF-8 encoded file
good := Decode( bad )
msgbox %bad%`n%good%

Return

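Decode( str ) {
  RawLen := StrLen(str)             ; Length of the UTF-8 input in bytes (ANSI build)

  ; UTF-16 never needs more code units than the UTF-8 input has bytes.
  BufSize := (RawLen + 1) * 2
  VarSetCapacity(Buf, BufSize, 0)

  ; Step 1: UTF-8 (code page 65001) -> UTF-16 in Buf.
  DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
                               , "int", -1, "uint", &Buf, "uint", RawLen + 1)
  ; Step 2: UTF-16 -> ANSI code page 1252, written back into str.
  DllCall("WideCharToMultiByte", "uint", 1252, "int", 0, "uint", &Buf, "int", -1
                               , "str", str, "uint", RawLen + 1
                               , "int", 0, "int", 0)
  Return str
}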


This works fine for some characters, but not for others.
I need to investigate further.

Thanks man (and YMP and VxE)
(And if anyone else has a better/other solution, please post)
YMP



Joined: 23 Dec 2006
Posts: 329
Location: Russia

Posted: Tue Aug 18, 2009 2:02 pm

Icarus
If UTF-8 text contains letters that belong to different ANSI charsets, it can't be translated to ANSI correctly. Another possibility is that your system's default code page differs from 1252, which I used in my code. Try replacing 1252 with 0. That instructs WideCharToMultiByte to use your default code page. Maybe that's all you need.
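
To see which code page the value 0 actually stands for on a given machine, something like the following could be used. It is only a small sketch: GetACP is the Win32 function that returns the system's default ANSI code page, and the variable name acp is just for illustration.

```autohotkey
; GetACP takes no arguments and returns the identifier of the
; system default ANSI code page (the one that 0 selects).
acp := DllCall("GetACP")
MsgBox Default ANSI code page: %acp%
```

If the number shown is not 1252, a hard-coded 1252 in the conversion would target a different charset than the one the GUI displays, which would explain characters coming out wrong.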
Icarus



Joined: 24 Nov 2005
Posts: 824

Posted: Tue Aug 18, 2009 2:09 pm

YMP, you are great.
The 0 change indeed fixed it all, that's terrific.

If you ask me, I think that many things done with DllCalls like this in AHK should already make it into the official build as official functions.

String handling in AHK is lacking a little.
Icarus



Joined: 24 Nov 2005
Posts: 824

Posted: Sun Aug 23, 2009 9:51 am

@YMP (or someone who understands this encoding stuff...)

Now that I have used this function to make the UTF-8 text displayable in AHK GUI windows, let's say I want to save it as part of an XML file.

What is the encoding I need to use?
Is it
Code:
<?xml version="1.0" encoding="Windows-1252"?>


I followed YMP's advice and am using 0 instead of 1252 in the UtfDecode function. I am guessing that needs to be taken into account when dumping back to XML?

Of course, the best would probably be for me to have the opposite function so that I can generate UTF-8 XMLs.

Any help is appreciated.
YMP



Joined: 23 Dec 2006
Posts: 329
Location: Russia

Posted: Sun Aug 23, 2009 10:38 am

There is probably not so much to be changed:
Code:

Encode( str ) {
  RawLen := StrLen(str)

  ; One ANSI character can need up to 3 bytes in UTF-8 (for example the
  ; curly apostrophe, U+2019), so the UTF-8 buffer gets 3 bytes per character.
  WideSize := (RawLen + 1) * 2
  Utf8Size := RawLen * 3 + 1
  VarSetCapacity(Buf1, WideSize, 0)   ; For UTF-16.
  VarSetCapacity(Buf2, Utf8Size, 0)   ; For UTF-8.

  DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
                               , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
                               , "int", -1, "str", Buf2, "uint", Utf8Size
                               , "int", 0, "int", 0)
  Return Buf2
}

I assume the byte order mark (BOM) is not needed at the beginning of the UTF-8 file, since the encoding is specified in the XML header. But I am not quite sure. If you need it, you can add it like this:
Code:

BOM := Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
FileAppend, %BOM%, %A_Desktop%\test.xml

utf8 := Encode(Text)
FileAppend, %utf8%, %A_Desktop%\test.xml
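
The reverse case may also come up: a UTF-8 file that was saved with a BOM will have those three bytes at the start of whatever FileRead returns. A minimal sketch for dropping them before conversion, assuming an ANSI build of AutoHotkey where each byte is one character (StripUtfBom is a hypothetical helper name, not part of any library):

```autohotkey
; Sketch for ANSI builds of AutoHotkey 1.0.x.
; StripUtfBom is a hypothetical helper, named here for illustration only.
StripUtfBom( str ) {
  BOM := Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
  ; == compares case-sensitively, so the three bytes must match exactly.
  If (SubStr(str, 1, 3) == BOM)
    Return SubStr(str, 4)
  Return str
}
```

It would be called on the raw text before Decode, for example good := Decode( StripUtfBom(bad) ).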
Icarus



Joined: 24 Nov 2005
Posts: 824

Posted: Sun Aug 23, 2009 10:55 am

YMP,

This is excellent. Appreciate the fast and complete response.
Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences...

In any case, thanks a lot, you are very helpful.

EDIT:
If anyone else reaches this thread, here is what I am now using to encode / decode UTF-8, based on YMP's collection of solutions.
Code:

/*
; Tester
#SingleInstance Force

FileRead utfText, test.txt           ; Reading a UTF-8 encoded file
ansiText := UtfDecode( utfText )     ; Convert to ANSI
msgbox %utfText%`n`nConverted to`n`n%ansiText%

backToUtf := UtfEncode( ansiText )    ; Convert back to UTF-8
msgbox Back to UTF:`n%backToUtf%

FileDelete output.txt
FileAppend %backToUtf%, output.txt
Run output.txt

Return
*/


UtfDecode( str ) {
  RawLen := StrLen(str)
 
  Charset := 0    ; 0 = system default ANSI code page, or a specific one such as 1252

  BufSize := (RawLen + 1) * 2
  VarSetCapacity(Buf, BufSize, 0)

  DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
                               , "int", -1, "uint", &Buf, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", Charset, "int", 0, "uint", &Buf, "int", -1
                               , "str", str, "uint", RawLen + 1
                               , "int", 0, "int", 0)
  Return str
}

UtfEncode( str ) {
  RawLen := StrLen(str)

  ; One ANSI character can need up to 3 bytes in UTF-8 (for example the
  ; curly apostrophe, U+2019), so the UTF-8 buffer gets 3 bytes per character.
  WideSize := (RawLen + 1) * 2
  Utf8Size := RawLen * 3 + 1
  VarSetCapacity(Buf1, WideSize, 0)   ; For UTF-16.
  VarSetCapacity(Buf2, Utf8Size, 0)   ; For UTF-8.

  DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
                               , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
                               , "int", -1, "str", Buf2, "uint", Utf8Size
                               , "int", 0, "int", 0)
  Return Buf2
}


UtfBom() {
  ; Put this string at the beginning of the file, if Byte Order Mark is needed
  ; Example: FileAppend % UtfBom() . VarContainingUTF, %Filename%
  Return Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
}
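
For feeds where some files are UTF-8 and some are not, one possible approach is to test the text before decoding it. The sketch below is a heuristic, not a definitive check: it assumes an ANSI build of AutoHotkey, relies on the MB_ERR_INVALID_CHARS flag (0x8) of MultiByteToWideChar, and LooksLikeUtf8 is a hypothetical name. How the flag behaves on older Windows versions would need to be verified.

```autohotkey
; Sketch for ANSI builds of AutoHotkey 1.0.x.
; LooksLikeUtf8 is a hypothetical helper, named here for illustration only.
LooksLikeUtf8( str ) {
  If (str = "")
    Return 0
  ; MB_ERR_INVALID_CHARS (0x8) makes the call fail (return 0) on malformed UTF-8.
  ; cchWideChar = 0 means "only report the required size", so no buffer is needed.
  Return DllCall("MultiByteToWideChar", "uint", 65001, "uint", 0x8, "str", str
                                      , "int", -1, "uint", 0, "int", 0) > 0
}

; Possible usage (feed.xml is a placeholder file name):
; FileRead text, feed.xml
; If ( LooksLikeUtf8(text) )
;   text := UtfDecode(text)
```

Plain ASCII text also passes this test, which is harmless because decoding it changes nothing. ANSI text with accented characters will usually fail the test and be left alone.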

YMP



Joined: 23 Dec 2006
Posts: 329
Location: Russia

Posted: Sun Aug 23, 2009 11:13 am

Icarus wrote:

Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences...

Well, the only alternative to MSDN I have used is downloading a Platform SDK, but its documentation contains the same sentences. Yes, it's sometimes a bit hard to get through. But practice makes it easier.
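
As one worked example of turning an MSDN description into a DllCall, here is a sketch of the documented "call twice" pattern: when the output size argument is 0, both MultiByteToWideChar and WideCharToMultiByte only return the size they would need, so the buffers can be allocated exactly. It assumes an ANSI build of AutoHotkey, and AnsiToUtf8 is a hypothetical name used for illustration.

```autohotkey
; Sketch for ANSI builds of AutoHotkey 1.0.x.
; AnsiToUtf8 is a hypothetical name, used here for illustration only.
AnsiToUtf8( str ) {
  ; Pass 1: cchWideChar = 0, so the function only reports how many UTF-16
  ; characters it needs (terminator included, because the length is -1).
  WideLen := DllCall("MultiByteToWideChar", "uint", 0, "uint", 0, "str", str
                                          , "int", -1, "uint", 0, "int", 0)
  If (!WideLen)
    Return ""
  VarSetCapacity(Wide, WideLen * 2, 0)
  ; Pass 2: the real conversion, ANSI (default code page) -> UTF-16.
  DllCall("MultiByteToWideChar", "uint", 0, "uint", 0, "str", str
                               , "int", -1, "uint", &Wide, "int", WideLen)

  ; Same idea on the UTF-8 side: cbMultiByte = 0 returns the byte count.
  Utf8Len := DllCall("WideCharToMultiByte", "uint", 65001, "uint", 0, "uint", &Wide
                                          , "int", -1, "uint", 0, "int", 0
                                          , "uint", 0, "uint", 0)
  VarSetCapacity(Utf8, Utf8Len, 0)
  DllCall("WideCharToMultiByte", "uint", 65001, "uint", 0, "uint", &Wide
                               , "int", -1, "str", Utf8, "int", Utf8Len
                               , "uint", 0, "uint", 0)
  Return Utf8
}
```

Each "type", value pair in the DllCall corresponds to one parameter of the C prototype on MSDN, in the same order. Asking the function for the size avoids having to guess how many bytes a character may expand to.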
All times are GMT