UTF-8 HEX Text Conversion



9 replies to this topic
mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008

Hi all, I have what I think might be "UTF-8 hex" text which I just want to look normal.

 

For example, this would be the input text:

 

=?UTF-8?Q?Discover_Santa_Cruz_for_Mother=E2=80=99s_Day?=

 

This is the desired output text:

 

Discover Santa Cruz for Mother’s Day

 

Any suggestions of how to approach this would be appreciated.

 

- Mike



rbrtryn
  • Members
  • 1177 posts
  • Last active: Sep 11 2013 08:04 PM
  • Joined: 22 Jun 2011
For that particular input:
; Autoexecute
    #NoEnv
    #SingleInstance force
    
    txt := "=?UTF-8?Q?Discover_Santa_Cruz_for_Mother=E2=80=99s_Day?="
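    ; Strip the "=?UTF-8?Q?" prefix, each =XX escape, and any leftover "=".
    ; The escaped bytes are deleted, not decoded, so the apostrophe is lost.
    out := RegExReplace(txt, "^.*Q\?|=\w{2}|=")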
    StringReplace out, out, _, %A_Space%, All
    MsgBox % out
return

My Scripts are written for the latest released version of AutoHotkey.



Linear Spoon
  • Members
  • 842 posts
  • Last active: Sep 29 2015 03:56 AM
  • Joined: 29 Oct 2011
✓  Best Answer

Wrote this up... It's not the most elegant solution, but it's working for me (AHK_L 1.1.09.04).

UTF-8 hex "escape sequences" take the form =hex1=hex2=hex3, and so on. This should handle variable-length sequences, and as many sequences as the string contains.

 

Note that this doesn't use the same input string as the original post; I threw in some extra escape sequences for testing.

The expected output of this example is: =?UTF-8?Q?Discover⁴ Sant¥a Cruz forW Mother’s Day?=

input := "=?UTF-8?Q?Discover=E2=81=b4_Sant=c2=A5a_Cruz_for=57_Mother=E2=80=99s_Day?="

;While there are still escape sequences...
while(oldpos := RegexMatch(input, "=\K[0-9a-fA-F]{1,2}", num))
{
  bytes := {}  ;Set bytes to an empty object
  Loop
  {
    ;Find the next number
    foundpos := RegexMatch(input, "=\K[0-9a-fA-F]{1,2}", num)
    if (foundpos = oldpos)
    {
      ;Erase what we just found
      StringReplace, input, input, =%num%
      ;Insert this hex value into an array
      bytes.Insert("0x" num)
      ;To determine if the next number found is part of the same sequence
      oldpos := foundpos
    }
    else
    {
      ;Basically inserts our byte array converted to unicode at the last position a number was found in
      input := SubStr(input, 1, oldpos-2) UTF8ToUnicode(bytes*) SubStr(input, oldpos-1)
      break
    }
  }
}

;Replace _ with spaces
StringReplace, input, input, _, %A_Space%, 1
;substr or stringtrimleft/right to remove the rest if it is not wanted
Msgbox % input

return

;This is a variadic function: it takes a list of byte values, writes them into a buffer, and uses StrGet to decode the buffer as UTF-8
;This lets it handle multiple back-to-back escape sequences, because there's no limit on the number of input bytes
;(try input := "=?UTF-8?Q?Discover=_Santa_Cruz_for_Mother=E2=80=99=E2=80=99s_Day?="; it will produce two apostrophes: Mother’’s)
UTF8ToUnicode(bytes*)
{
  VarSetCapacity(data, bytes.MaxIndex()+1, 0)
  for k,v in bytes
    NumPut(v, data, k-1, "char")
  return StrGet(&data, "utf-8")
}

Join us at the new forum - http://www.ahkscript.org/

 


mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008

Thank you both for your replies, it is appreciated. Linear Spoon, this is excellent. You didn't just point me in the right direction, you completely solved the problem. I appreciate the commenting as well. I am struggling to understand how this is working but will continue to study the code until it makes more sense (probably within a few hours). I may have some follow-up questions for you, if you don't mind. Thank you again.

 

- Mike



mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008

To aid others with search results: this code is also very useful for decoding quoted-printable email text.

 

- Mike



Linear Spoon
  • Members
  • 842 posts
  • Last active: Sep 29 2015 03:56 AM
  • Joined: 29 Oct 2011

Here is an alternative version that should do the same thing, but with far less code and much faster for long inputs.

input := "=?UTF-8?Q?Discover=E2=81=b4_Sant=c2=A5a_Cruz_for=57_Mother=E2=80=99s_Day?="
Msgbox % UnEscape(input)

UnEscape(input)
{
  input := RegexReplace(input, "=([0-9a-fA-F]{1,4})", "&#x$1;") ;Exchange these escape sequences for html escape sequences
  doc := ComObjCreate("HTMLfile")
  doc.write(input) ;write our input to an html document
  return doc.body.innerText  ;get the translated input
}


 


atnbueno
  • Members
  • 91 posts
  • Last active: Feb 16 2016 07:04 PM
  • Joined: 24 Mar 2007

Hello there.

 

A bit of extra info and a tiny correction to Linear Spoon's code: the string from the OP is formatted following section 2 of RFC 2047, where each escape is "=" followed by exactly two hex digits, so the proper regular expression would be

=([0-9a-fA-F]{2})
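
For anyone who wants to see the pieces of such a string, here is a minimal sketch that splits an encoded-word into its charset, encoding, and encoded text. It assumes AutoHotkey v1.1 and the basic "=?charset?encoding?text?=" layout; the variable names (word, part) are made up for illustration, and the pattern does not attempt to cover every detail of the RFC.

```autohotkey
; Sketch for AutoHotkey v1.1; variable names are illustrative only.
word := "=?UTF-8?Q?Discover_Santa_Cruz_for_Mother=E2=80=99s_Day?="

; encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
; RegExMatch stores the three subpatterns in part1, part2 and part3.
if (RegExMatch(word, "^=\?([^?]+)\?([BbQq])\?([^?]*)\?=$", part))
  MsgBox % "charset:  " part1 "`nencoding: " part2 "`ntext:     " part3
else
  MsgBox Not an RFC 2047 encoded-word.
return
```

Only the third part still needs decoding: with the Q encoding, "_" stands for a space and each =XX is one raw byte in the named charset.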

Regards,
Antonio

Linear Spoon
  • Members
  • 842 posts
  • Last active: Sep 29 2015 03:56 AM
  • Joined: 29 Oct 2011

Thanks atnbueno. I did not know where this came from, so I assumed it could take any unicode character (1-4 hex digits). It seems they also recommend upper case for hex digits (4.2.1), so:

input := "=?UTF-8?Q?Discover=E2=81=b4_Sant=c2=A5a_Cruz_for=57_Mother=E2=80=99s_Day?="
Msgbox % UnEscape(input)

UnEscape(input)
{
  input := RegexReplace(input, "=([0-9A-F]{2})", "&#x$1;") ;Exchange these escape sequences for html escape sequences
  doc := ComObjCreate("HTMLfile")
  doc.write(input) ;write our input to an html document
  return doc.body.innerText  ;get the translated input
}


 


mikek
  • Members
  • 161 posts
  • Last active: Nov 09 2015 05:02 PM
  • Joined: 21 Nov 2008

Hi Linear Spoon, thank you very much for trying to improve the code. You already have my awe and respect from your original code. A speed enhancement would be a good thing, but the output is not coming out as expected. Try changing the input to the following:

 

input := "=C3=A9" ; é
 

Your original code returns the desired output of "é". However, both versions of the new code output "Ã©".
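
A small sketch of the difference, assuming a Unicode build of AutoHotkey v1.1: the same two byte values are read once as a UTF-8 byte sequence and once as two separate code points. The buffer name (data) is arbitrary.

```autohotkey
; Sketch for a Unicode build of AutoHotkey v1.1.
VarSetCapacity(data, 3, 0)          ; 2 bytes plus a terminating zero
NumPut(0xC3, data, 0, "UChar")
NumPut(0xA9, data, 1, "UChar")

asUtf8   := StrGet(&data, "UTF-8")  ; C3 A9 decoded together as one UTF-8 character
asPoints := Chr(0xC3) . Chr(0xA9)   ; U+00C3 and U+00A9 as two separate characters

MsgBox % "As UTF-8 bytes: " asUtf8 "`nAs code points: " asPoints
return
```

On a Unicode build, the first line of the message box should show é and the second Ã©.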

 

- Mike



Linear Spoon
  • Members
  • 842 posts
  • Last active: Sep 29 2015 03:56 AM
  • Joined: 29 Oct 2011

You're right. I didn't test this very well. My first solution works because it blindly inserts bytes into an array and then uses StrGet to interpret them as UTF-8. The HTMLfile COM object interprets each escape as an individual character (whereas in UTF-8, the two bytes combine into a single é). I'll see if it can be fixed and post back here later.
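
One possible direction, as a sketch only and not a tested replacement: keep the byte-buffer idea from the first script, but grab each run of adjacent =XX escapes with a single regular expression and decode the whole run with StrGet. This assumes a Unicode build of AutoHotkey v1.1 and UTF-8 input; DecodeQ and BytesToStr are hypothetical names, and a =00 escape would cut the decoded run short because StrGet stops at the first zero byte.

```autohotkey
; Sketch for a Unicode build of AutoHotkey v1.1.
; DecodeQ and BytesToStr are hypothetical helper names.
input := "=?UTF-8?Q?Discover=E2=81=B4_Sant=C2=A5a_Cruz_for=57_Mother=E2=80=99s_Day?="
MsgBox % DecodeQ(input)
return

DecodeQ(str)
{
  ; "_" stands for a space; do this first so a decoded =5F stays an underscore
  StringReplace, str, str, _, %A_Space%, All
  out := "", pos := 1
  ; Each match is a whole run of adjacent escapes, e.g. "=E2=80=99"
  while (found := RegExMatch(str, "(?:=[0-9A-Fa-f]{2})+", seq, pos))
  {
    ; Copy the plain text before the run, then the decoded run
    out .= SubStr(str, pos, found - pos) . BytesToStr(seq)
    pos := found + StrLen(seq)
  }
  return out . SubStr(str, pos)        ; append whatever follows the last run
}

BytesToStr(seq)
{
  count := StrLen(seq) // 3            ; every escape is 3 characters: "=XX"
  VarSetCapacity(data, count + 1, 0)   ; +1 leaves a terminating zero byte
  Loop % count
    NumPut("0x" . SubStr(seq, A_Index * 3 - 1, 2), data, A_Index - 1, "UChar")
  return StrGet(&data, "UTF-8")        ; decode the buffer as one UTF-8 string
}
```

Because each run is decoded as a unit, a multi-byte character such as =C3=A9 is handed to StrGet as two bytes rather than two characters, and the string is scanned only once.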

