AutoHotkey Community


All times are UTC [ DST ]




 Post subject: How to work with UTF-8
Posted: August 18th, 2009, 11:33 am

Joined: November 24th, 2005, 8:16 am
Posts: 851
Hi everybody,

I am using the nice xpath library to read some RSS feeds, and all is well.
I hit an obstacle, though, when reading an RSS feed that is UTF-8 encoded.

I am guessing the solution does not have anything to do with xpath but with how to work with UTF-8 in general.

So, what do I need to do in order to show a UTF-8 encoded string on the GUI properly?

Specifically, I have this string that does not show correctly:
Code:
You Won’t Find Putin On Russia’s Gogul

(I hope this preserves its qualities when pasted here)

Also, since my script reads several XML files, some of which are UTF-8 encoded and some of which are not, I hope there is a solution that accounts for this.

_________________
Sector-Seven - Freeware tools built with AutoHotkey


Posted: August 18th, 2009, 11:37 am

Joined: May 27th, 2007, 9:41 am
Posts: 4999
Look at YMP's and VxE's solutions in http://www.autohotkey.com/forum/topic47400.html. Before processing the XML files, you could convert them using either of these functions (ANSI or "html").

_________________
AHK FAQ
TF : Text files & strings lib, TF Forum


Posted: August 18th, 2009, 12:06 pm

Joined: November 24th, 2005, 8:16 am
Posts: 851
Hugo, you are the main man!

There is a lot of information to read there, but I was partially successful with YMP's code.

I made it into something like this:
Code:
#SingleInstance Force

FileRead bad, test.txt ; Reading a UTF-8 encoded file
good := Decode( bad )
msgbox %bad%`n%good%

Return

Decode( str ) {
  RawLen := StrLen(str)

  BufSize := (RawLen + 1) * 2
  VarSetCapacity(Buf, BufSize, 0)

  DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
                               , "int", -1, "uint", &Buf, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 1252, "int", 0, "uint", &Buf, "int", -1
                               , "str", str, "uint", RawLen + 1
                               , "int", 0, "int", 0)
  Return str
}


This works fine for some characters, but not others.
I need to investigate further.

Thanks man (and YMP and VxE)
(And if anyone else has a better/other solution, please post)

Posted: August 18th, 2009, 2:02 pm

Joined: December 23rd, 2006, 6:02 pm
Posts: 424
Location: Russia
Icarus
If the UTF-8 text contains letters that belong to different ANSI charsets, it can't be translated to ANSI correctly. Another possibility is that your system's default code page differs from 1252, which I used in my code. Try replacing 1252 with 0. That instructs WideCharToMultiByte to use your default code page. Maybe that's all you need. :roll:
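
To see which code page the value 0 stands for on a particular machine, a quick check along these lines may help. This is only a sketch: GetACP is a standard kernel32 function, and the variable name is arbitrary.

```autohotkey
; Ask Windows for the system's default ANSI code page. This is the code page
; that WideCharToMultiByte uses when 0 (CP_ACP) is passed as its first argument.
acp := DllCall("GetACP")
MsgBox The default ANSI code page on this system is %acp%.
```

If the number shown is not 1252, hard-coding 1252 in the conversion would target a code page that the GUI is not using, which fits the symptom of only some characters coming out right.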


Posted: August 18th, 2009, 2:09 pm

Joined: November 24th, 2005, 8:16 am
Posts: 851
YMP, you are great.
The 0 change indeed fixed it all, that's terrific.

If you ask me, I think that many things done with DllCalls like this in AHK should already make it into the official build as official functions.

String handling in AHK is lacking a little.

Posted: August 23rd, 2009, 9:51 am

Joined: November 24th, 2005, 8:16 am
Posts: 851
@YMP (or someone who understands this encoding stuff... :) )

After I have used this function to make the UTF text displayable in the AHK GUI windows, let's say I want to save it as part of an XML file.

What is the encoding I need to use?
Is it
Code:
<?xml version="1.0" encoding="Windows-1252"?>


I have followed YMP's advice and am using 0 instead of 1252 in the UtfDecode function. I am guessing that needs to be considered when dumping back to XML?

Of course, the best would probably be for me to have the opposite function so that I can generate UTF-8 XMLs.

Any help is appreciated.

Posted: August 23rd, 2009, 10:38 am

Joined: December 23rd, 2006, 6:02 pm
Posts: 424
Location: Russia
There is probably not much to change:
Code:
Encode( str ) {
  RawLen := StrLen(str)

  BufSize  := (RawLen + 1) * 2        ; UTF-16: two bytes per character, plus terminator.
  Utf8Size := RawLen * 3 + 1          ; UTF-8: up to three bytes per ANSI character, plus terminator.
  VarSetCapacity(Buf1, BufSize, 0)    ; For UTF-16.
  VarSetCapacity(Buf2, Utf8Size, 0)   ; For UTF-8.

  DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
                               , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
                               , "int", -1, "str", Buf2, "uint", Utf8Size
                               , "int", 0, "int", 0)
  Return Buf2
}

I assume the byte order mark (BOM) is not needed at the beginning of the UTF-8 file, since the encoding is specified in the XML header. But I am not quite sure. :roll: If you need it, you can add it like this:
Code:
BOM := Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
FileAppend, %BOM%, %A_Desktop%\test.xml

utf8 := Encode(Text)
FileAppend, %utf8%, %A_Desktop%\test.xml


Posted: August 23rd, 2009, 10:55 am

Joined: November 24th, 2005, 8:16 am
Posts: 851
YMP,

This is excellent. Appreciate the fast and complete response.
Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences... :)

In any case, thanks a lot, you are very helpful.

EDIT:
If anyone else reaches this thread, here is what I am now using to encode / decode UTF, based on YMP's collection of solutions.
Code:
/*
; Tester
#SingleInstance Force

FileRead utfText, test.txt           ; Reading a UTF-8 encoded file
ansiText := UtfDecode( utfText )     ; Convert to ANSI
msgbox %utfText%`n`nConverted to`n`n%ansiText%

backToUtf := UtfEncode( ansiText )    ; Convert back to UTF-8
msgbox Back to UTF:`n%backToUtf%

FileDelete output.txt
FileAppend %backToUtf%, output.txt
Run output.txt

Return
*/


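UtfDecode( str ) {
  RawLen := StrLen(str)    ; Length of the UTF-8 input in bytes (ANSI build).

  Charset := 0    ; Target ANSI code page: 1252, or 0 for the system default.

  BufSize := (RawLen + 1) * 2    ; UTF-16 never needs more than two bytes per UTF-8 byte.
  VarSetCapacity(Buf, BufSize, 0)

  ; UTF-8 (code page 65001) to UTF-16, then UTF-16 to the ANSI code page.
  DllCall("MultiByteToWideChar", "uint", 65001, "int", 0, "str", str
                               , "int", -1, "uint", &Buf, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", Charset, "int", 0, "uint", &Buf, "int", -1
                               , "str", str, "uint", RawLen + 1
                               , "int", 0, "int", 0)
  Return str
}

UtfEncode( str ) {
  RawLen := StrLen(str)

  BufSize  := (RawLen + 1) * 2        ; UTF-16: two bytes per character, plus terminator.
  Utf8Size := RawLen * 3 + 1          ; UTF-8: up to three bytes per ANSI character, plus terminator.
  VarSetCapacity(Buf1, BufSize, 0)    ; For UTF-16.
  VarSetCapacity(Buf2, Utf8Size, 0)   ; For UTF-8.

  ; ANSI (system default code page) to UTF-16, then UTF-16 to UTF-8.
  DllCall("MultiByteToWideChar", "uint", 0, "int", 0, "str", str
                               , "int", -1, "uint", &Buf1, "uint", RawLen + 1)
  DllCall("WideCharToMultiByte", "uint", 65001, "int", 0, "uint", &Buf1
                               , "int", -1, "str", Buf2, "uint", Utf8Size
                               , "int", 0, "int", 0)
  Return Buf2
}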


UtfBom() {
  ; Put this string at the beginning of the file, if Byte Order Mark is needed
  ; Example: FileAppend % UtfBom() . VarContainingUTF, %Filename%
  Return Chr(0xEF) . Chr(0xBB) . Chr(0xBF)
}
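
For the case raised at the start of the thread, where some XML files are UTF-8 and some are not, a guard like the following could decide whether to call UtfDecode at all. This is only a sketch for AutoHotkey Basic: IsUtf8Xml and feed.xml are made-up names, and the check looks only at the byte order mark and the XML declaration. A file with neither is also UTF-8 by default under the XML specification, which this sketch does not handle.

```autohotkey
; Hypothetical helper for AutoHotkey Basic (ANSI build), where FileRead
; returns the file's raw bytes. Returns true if the XML text declares UTF-8.
IsUtf8Xml(ByRef xml) {
  ; A UTF-8 byte order mark is the byte sequence EF BB BF.
  if (NumGet(xml, 0, "UChar") = 0xEF && NumGet(xml, 1, "UChar") = 0xBB && NumGet(xml, 2, "UChar") = 0xBF)
    return true
  ; Otherwise look for encoding="utf-8" inside the XML declaration.
  return RegExMatch(xml, "i)^\s*<\?xml[^>]*encoding\s*=\s*[""']utf-8[""']") > 0
}

FileRead, xml, feed.xml        ; placeholder file name
if (IsUtf8Xml(xml))
  xml := UtfDecode(xml)        ; UtfDecode as defined above
```

Files that declare another encoding, such as Windows-1252, are left untouched by this check.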

Posted: August 23rd, 2009, 11:13 am

Joined: December 23rd, 2006, 6:02 pm
Posts: 424
Location: Russia
Icarus wrote:
Is there any good resource where I can learn a bit more about how to use DllCalls?
I easily get lost on MSDN and don't understand half their sentences...

Well, the only alternative to MSDN I have used is downloading a Platform SDK, but its documentation contains the same sentences. :) Yes, it's sometimes a bit hard to get through. But practice makes it easier.


Posted: April 4th, 2010, 9:46 pm

Joined: February 26th, 2010, 1:11 am
Posts: 38
Thank you! Works for me too. :D


Posted: February 17th, 2011, 5:51 pm

Joined: August 29th, 2008, 9:14 pm
Posts: 386
Icarus wrote:
If anyone else reaches this thread, here is what I am now using to encode / decode UTF, based on YMP's collection of solutions.

Icarus and/or YMP, could this great script be converted for use with Ahk_L Unicode? I can't convert to ANSI any longer after going from Basic to Ahk_L.


Posted: February 18th, 2011, 10:28 am

Joined: October 17th, 2006, 4:15 pm
Posts: 7502
Location: Australia
See FileEncoding. There's probably no need for you to explicitly encode or decode it in this case. If there was, I'd suggest you look at StrPut / StrGet which are also available as scripts for AutoHotkey Basic.
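
A minimal sketch of what the FileEncoding route might look like in AutoHotkey_L. The file names are placeholders, and it assumes the input file really is UTF-8:

```autohotkey
; Requires AutoHotkey_L; test.txt and output.xml are placeholder names.
FileEncoding, UTF-8            ; default encoding for FileRead when the file has no BOM
FileRead, text, test.txt       ; text now holds the decoded string
MsgBox % text

FileDelete, output.xml
FileAppend, %text%, output.xml, UTF-8-RAW   ; write UTF-8 without a byte order mark
```

With this approach the decoding happens at the file boundary, so no DllCall conversion is involved. Using UTF-8 instead of UTF-8-RAW in FileAppend would write a byte order mark when the file is created.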


Posted: February 18th, 2011, 4:13 pm

Joined: August 29th, 2008, 9:14 pm
Posts: 386
I'm sorry Lexikos, I'm such a noob :oops: I cannot get anything to work, neither FileEncoding nor StrPut, and yes, I do need some converting, as you can see in the non-working code below. My string is in UTF-8 and I have to convert it to ANSI somehow. Can anyone help me with this? I use AutoHotkey_L Unicode on Win7.
Edit: If I save the script below as ANSI, the code works, but if I save it as UTF-8, it doesn't. I plan to use this code in a script saved as UTF-8 (other parts of this script require UTF-8), so any ideas on how to make it work?

Code:
ConvertUtf8(string)
{
    var := "x"
    ; Ensure capacity.
    len := StrPut(string, "UTF-8")   
    VarSetCapacity( var, len)
    ; convert the string.
    StrPut(string, &var, len, "CP0")
    return StrGet(&var, len, "CP0")       
}

myvar := "Det står klart att någon har tjallat om gömstället."
; the converted string should read "Det står klart att någon har tjallat om gömstället."
myANSI := ConvertUtf8(myvar)
msgbox % myANSI
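
One possible reading of the code above is that it encodes and decodes with the same code page (CP0 both times), so the string would come back unchanged. The sketch below is an untested alternative, not a confirmed fix. It assumes AutoHotkey_L Unicode, a script file saved as UTF-8 with a byte order mark, and a string that was produced by reading UTF-8 bytes as Windows-1252. FixMojibake is a made-up name.

```autohotkey
; Hypothetical helper, AutoHotkey_L Unicode build only.
FixMojibake(str) {
    ; Ask StrPut how many bytes the CP1252 form needs (terminator included).
    VarSetCapacity(buf, StrPut(str, "CP1252"))
    StrPut(str, &buf, "CP1252")     ; turn each character back into its original byte
    return StrGet(&buf, "UTF-8")    ; reinterpret those bytes as UTF-8
}

myvar := "Det stÃ¥r klart att nÃ¥gon har tjallat om gÃ¶mstÃ¤llet."
MsgBox % FixMojibake(myvar)
```

Characters that cannot be represented in Windows-1252 would be lost in the first step, so this only suits strings that were garbled in exactly this way.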

