How to work with UTF-8

 
Icarus



Joined: 24 Nov 2005
Posts: 851

Posted: Tue Aug 18, 2009 10:33 am    Post subject: How to work with UTF-8

Hi everybody,

I am using the nice xpath library to read some RSS feeds and all is nice.
I ran into an obstacle when reading an RSS feed that is UTF-8 encoded.

I am guessing the solution does not have anything to do with xpath but with how to work with UTF-8 in general.

So, what do I need to do in order to show a UTF-8 encoded string on the GUI properly?

Specifically, I have this string that does not show correctly:
Code:

You Won’t Find Putin On Russia’s Gogul

(I hope this preserves its qualities when pasted here)

Also, since my script reads several XML files, some will be UTF-8 encoded and some won't, so I hope there is a solution that accounts for this.
_________________
Sector-Seven - Freeware tools built with AutoHotkey
SoLong&Thx4AllTheFish



Joined: 27 May 2007
Posts: 4999

Posted: Tue Aug 18, 2009 10:37 am

Look at the YMP and VxE solutions in http://www.autohotkey.com/forum/topic47400.html. Before processing the XML files you could convert them using either of these functions (ANSI or "html").
_________________
AHK Wiki FAQ
TF : Text files & strings lib, TF Forum
Icarus



Joined: 24 Nov 2005
Posts: 851

Posted: Tue Aug 18, 2009 11:06 am

Hugo, you are the main man!

There is a lot of information to read there, but I was partially successful with YMP's code.

Made it into something like this:
Code:

#SingleInstance Force

FileRead bad, test.txt ; Reading a UTF-8 encoded file
good := Decode( bad )
msgbox %bad%`n%good%

Return

Decode( str ) {
  RawLen := StrLen(str)

  BufSize := (RawLen + 1) * 2
  VarSetCapacity(Buf, BufSize, 0)

  DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
                               , "int", -1, "uint", &Buf, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 1252, "int", 0, "uint", &Buf, "int", -1
                               , "str", str, "uint", RawLen + 1
                               , "int", 0, "int", 0)
  Return str
}


Which works fine for some characters, but not others.
Need to further investigate.

Thanks man (and YMP and VxE)
(And if anyone else has a better/other solution, please post)
YMP



Joined: 23 Dec 2006
Posts: 418
Location: Russia

Posted: Tue Aug 18, 2009 1:02 pm

Icarus
If the UTF-8 text contains letters that belong to different ANSI charsets, it can't be translated to ANSI correctly. Another possibility is that your system's default codepage differs from 1252, which I used in my code. Try replacing 1252 with 0. It will instruct WideCharToMultiByte to use your default codepage. Maybe that's all you need.
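
To see which code page the value 0 stands for on a particular machine, a small check like the one below may help. It is only a sketch: it assumes the Win32 GetACP function, which takes no parameters and returns the number of the system's ANSI code page.

```autohotkey
; Ask Windows for the active ANSI code page (the one that code page 0 resolves to).
acp := DllCall("GetACP")
MsgBox System ANSI code page: %acp%
```

A result other than 1252 would be consistent with the hard-coded 1252 producing wrong characters.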
Icarus



Joined: 24 Nov 2005
Posts: 851

Posted: Tue Aug 18, 2009 1:09 pm

YMP, you are great.
The 0 change indeed fixed it all, that's terrific.

If you ask me, I think that many things done with DllCalls like this in AHK should already make it into the official build as official functions.

String handling in AHK is lacking a little.
Icarus



Joined: 24 Nov 2005
Posts: 851

Posted: Sun Aug 23, 2009 8:51 am

@YMP (or someone who understands this encoding stuff...)

After I have used this function to make the UTF-8 text displayable in the AHK GUI windows, let's say I want to save it as part of an XML file.

What is the encoding I need to use?
Is it
Code:
<?xml version="1.0" encoding="Windows-1252"?>


I followed YMP's advice and am using 0 instead of 1252 in the UtfDecode function. I am guessing that needs to be considered when writing back to XML?

Of course, the best would probably be for me to have the opposite function so that I can generate UTF-8 XMLs.

Any help is appreciated.
YMP



Joined: 23 Dec 2006
Posts: 418
Location: Russia

Posted: Sun Aug 23, 2009 9:38 am

There is probably not so much to be changed:
Code:

Encode( str ) {
  RawLen := StrLen(str)

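  ; UTF-16 needs at most one 2-byte unit per ANSI byte, plus the terminator.
  VarSetCapacity(Buf1, (RawLen + 1) * 2, 0)    ; For UTF-16.
  ; UTF-8 can need up to 3 bytes per character (e.g. the euro sign), plus the terminator.
  BufSize := RawLen * 3 + 1
  VarSetCapacity(Buf2, BufSize, 0)             ; For UTF-8.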

  DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
                               , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
                               , "int", -1, "str", Buf2, "uint", BufSize
                               , "int", 0, "int", 0)
  Return Buf2
}

I assume the byte order mark (BOM) is not needed at the beginning of the UTF-8 file, since the encoding is specified in the XML header. But I am not quite sure. If you need it, you can add it like this:
Code:

BOM := Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
FileAppend, %BOM%, %A_Desktop%\test.xml

utf8 := Encode(Text)
FileAppend, %utf8%, %A_Desktop%\test.xml
Icarus



Joined: 24 Nov 2005
Posts: 851

Posted: Sun Aug 23, 2009 9:55 am

YMP,

This is excellent. Appreciate the fast and complete response.
Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences...

In any case, thanks a lot, you are very helpful.

EDIT:
If anyone else reaches this thread, here is what I am now using to encode / decode UTF-8, based on YMP's collection of solutions.
Code:

/*
; Tester
#SingleInstance Force

FileRead utfText, test.txt           ; Reading a UTF-8 encoded file
ansiText := UtfDecode( utfText )     ; Convert to ANSI
msgbox %utfText%`n`nConverted to`n`n%ansiText%

backToUtf := UtfEncode( ansiText )    ; Convert back to UTF-8
msgbox Back to UTF:`n%backToUtf%

FileDelete output.txt
FileAppend %backToUtf%, output.txt
Run output.txt

Return
*/


UtfDecode( str ) {
  RawLen := StrLen(str)
 
  Charset := 0    ; Put 1252 or 0

  BufSize := (RawLen + 1) * 2
  VarSetCapacity(Buf, BufSize, 0)

  DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
                               , "int", -1, "uint", &Buf, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", Charset, "int", 0, "uint", &Buf, "int", -1
                               , "str", str, "uint", RawLen + 1
                               , "int", 0, "int", 0)
  Return str
}

UtfEncode( str ) {
  RawLen := StrLen(str)

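  ; UTF-16 needs at most one 2-byte unit per ANSI byte, plus the terminator.
  VarSetCapacity(Buf1, (RawLen + 1) * 2, 0)    ; For UTF-16.
  ; UTF-8 can need up to 3 bytes per character (e.g. the euro sign), plus the terminator.
  BufSize := RawLen * 3 + 1
  VarSetCapacity(Buf2, BufSize, 0)             ; For UTF-8.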

  DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
                               , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
                               , "int", -1, "str", Buf2, "uint", BufSize
                               , "int", 0, "int", 0)
  Return Buf2
}


UtfBom() {
  ; Put this string at the beginning of the file, if Byte Order Mark is needed
  ; Example: FileAppend % UtfBom() . VarContainingUTF, %Filename%
  Return Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
}
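
For the mixed feeds mentioned in the opening post (some UTF-8, some not), one possible approach is to check each file before calling UtfDecode. The sketch below is an assumption rather than something from this thread: LooksLikeUtf8 is a hypothetical helper, feed.xml is a placeholder name, and the check only looks for a UTF-8 byte order mark or a UTF-8 encoding attribute in the XML declaration. It is meant for AutoHotkey Basic (ANSI build), where FileRead returns the raw bytes of the file.

```autohotkey
; Assumes UtfDecode() from above is included in the same script.
FileRead xml, feed.xml            ; feed.xml is a placeholder file name
If LooksLikeUtf8(xml)
  xml := UtfDecode(xml)           ; convert only the files that claim to be UTF-8
MsgBox %xml%
Return

; Hypothetical helper, not part of the functions above.
LooksLikeUtf8(ByRef data) {
  ; A UTF-8 byte order mark (EF BB BF) at the very start of the data.
  bom := Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
  If (SubStr(data, 1, 3) == bom)
    Return true
  ; Otherwise check the encoding named in the XML declaration.
  Return RegExMatch(data, "i)^\s*<\?xml[^>]*encoding\s*=\s*[""']utf-8[""']") > 0
}
```

A file with neither a BOM nor an encoding declaration is treated as non-UTF-8 here. XML's own default for such files is UTF-8, so this part of the check may need adjusting for real feeds.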

YMP



Joined: 23 Dec 2006
Posts: 418
Location: Russia

Posted: Sun Aug 23, 2009 10:13 am

Icarus wrote:

Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences...

Well, the only alternative to MSDN I have used is downloading a Platform SDK, but its documentation contains the same sentences. Yes, it's sometimes a bit hard to get through. But practice makes it easier.
luetkmeyer



Joined: 26 Feb 2010
Posts: 38

Posted: Sun Apr 04, 2010 8:46 pm

Thank you! Works for me too.
majstang



Joined: 29 Aug 2008
Posts: 385

Posted: Thu Feb 17, 2011 4:51 pm

Icarus wrote:
If anyone else reaches this thread, here is what I am now using to encode / decode UTF-8, based on YMP's collection of solutions.

Icarus and/or YMP, could this great script be converted for use with AHK_L Unicode? I can't convert to ANSI any longer after moving from AutoHotkey Basic to AHK_L.
Lexikos



Joined: 17 Oct 2006
Posts: 7279
Location: Australia

Posted: Fri Feb 18, 2011 9:28 am

See FileEncoding. There's probably no need for you to explicitly encode or decode it in this case. If there was, I'd suggest you look at StrPut / StrGet which are also available as scripts for AutoHotkey Basic.
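
As a rough illustration of the FileEncoding route, reading and writing UTF-8 in AutoHotkey_L might look like the sketch below. It assumes an AutoHotkey_L Unicode build. The file names are placeholders, and UTF-8-RAW is the encoding name that writes UTF-8 without a byte order mark.

```autohotkey
; Sketch for AutoHotkey_L (Unicode build); file names are placeholders.
FileEncoding, UTF-8                          ; FileRead now decodes files as UTF-8
FileRead, text, test.txt
MsgBox %text%

FileDelete, output.xml
FileAppend, %text%, output.xml, UTF-8-RAW    ; UTF-8-RAW writes no byte order mark
```

Because strings are already Unicode inside the script, no separate decode or encode step is needed on this path.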
majstang



Joined: 29 Aug 2008
Posts: 385

Posted: Fri Feb 18, 2011 3:13 pm

I'm sorry Lexikos, I'm such a noob. I cannot get anything to work, neither FileEncoding nor StrPut, and yes, I do need some conversion, as you can see in the non-working code below. My string is in UTF-8 and I have to convert it to ANSI somehow. Can anyone help me with this? I use AutoHotkey_L Unicode on Win7.
Edit: If I save the script below as ANSI the code works, but if I save it as UTF-8 it doesn't. I plan to use this code in a script saved as UTF-8 (other parts of the script require UTF-8), so any ideas on how to make it work?

Code:
ConvertUtf8(string)
{
    var := "x"
    ; Ensure capacity.
    len := StrPut(string, "UTF-8")   
    VarSetCapacity( var, len)
    ; convert the string.
    StrPut(string, &var, len, "CP0")
    return StrGet(&var, len, "CP0")       
}

myvar := "Det står klart att någon har tjallat om gömstället."
; the converted string should read "Det står klart att någon har tjallat om gömstället."
myANSI := ConvertUtf8(myvar)
msgbox % myANSI
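
One possible way to approach this in a Unicode build, offered only as a sketch: the string above looks like UTF-8 bytes that were read as code page 1252 characters. Writing the characters back out as code page 1252 bytes and then reading those bytes as UTF-8 should undo that. FixMojibake is a hypothetical name, and the choice of CP1252 is an assumption about how the text was originally misread. It uses the AutoHotkey_L forms of StrPut and StrGet that take an encoding name.

```autohotkey
; Hypothetical helper for AutoHotkey_L Unicode builds.
FixMojibake(str) {
    ; Number of bytes needed for the CP1252 form, including the null terminator.
    len := StrPut(str, "CP1252")
    VarSetCapacity(buf, len, 0)
    ; Turn each character back into the single byte it was read from.
    StrPut(str, &buf, "CP1252")
    ; Interpret those bytes as the UTF-8 text they originally were.
    return StrGet(&buf, "UTF-8")
}

myvar := "Det stÃ¥r klart att nÃ¥gon har tjallat om gÃ¶mstÃ¤llet."
MsgBox % FixMojibake(myvar)
```

The result is a normal Unicode string, so a further conversion to ANSI is only needed if some other component requires ANSI bytes.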