AutoHotkey Community

All times are UTC [ DST ]




Posted: August 23rd, 2010, 7:09 pm

Joined: August 23rd, 2010, 2:50 pm
Posts: 6
Location: Melbourne, Australia
Hotstrings that use Unicode glyphs appear to malfunction in AutoHotkey_L.

Consider the hotstring: ::a::bbbbb. When the 'a' glyph in the A-string is given its normal ASCII codepoint, the hotstring correctly returns: bbbbb. But when it is coded to a Unicode codepoint (in this example in the Unicode Private Use Area) the hotstring incorrectly returns: abbbb.

Likewise consider the hotstring: ::aa::bbbbb. When the 'a' glyphs in the A-string are given their normal ASCII codepoint, the hotstring correctly returns: bbbbb. But when they are coded to a Unicode codepoint the hotstring incorrectly returns: aabbbb.

It seems (1) the initial character(s) of the returned string are the same as the A-string, and (2) the returned string has about the same number of characters as the B-string.

This suggests that AutoHotkey_L is not backspacing correctly. It leaves the A-string in place instead of deleting it, and then appends only part of the B-string. The resulting string is roughly the right length but contains a mixture of A-string and B-string characters.

Has anyone experienced the same problem, and if so, do you have a solution?


Posted: August 23rd, 2010, 9:58 pm

Joined: June 18th, 2008, 8:36 am
Posts: 4923
Location: AHK Forum
Can you post an example? And what do you mean by "Unicode codepoint"?

Posted: August 24th, 2010, 9:46 am

Joined: August 23rd, 2010, 2:50 pm
Posts: 6
Location: Melbourne, Australia
First, I want to say how much I appreciate your help.

I’ll address the "unicode codepoint" issue first as that will make the rest of it clearer. This lucid and witty article gives the background better than I ever could: http://www.joelonsoftware.com/articles/Unicode.html

So, a codepoint is a numerical label for each character (a ‘glyph’ is strictly the drawn shape of a character, but let’s not worry about that), and there are thousands of them in the Unicode system.

As I understand it, standard vanilla Autohotkey (but not Autohotkey_L) accepts only characters with codepoints U+0000 thru U+007F in Unicode terms, that is, the first 128 codepoints – and maybe not all of those. These make up the ASCII character set. It may also accept the next 128 codepoints as well, being U+0080 thru U+00FF – I don’t know as I haven’t tried it. These, in conjunction with ASCII, make up the ANSI character set.

Either way, standard vanilla Autohotkey doesn’t accept the vast majority of characters.

That’s where Autohotkey_L comes in. It’s supposed to accept characters with all Unicode codepoints, so in theory you can input strings coded to non-ANSI codepoints.

Unfortunately, it didn’t work for me. Here’s what I did. Consider this six-line script:

::a::ABCDEFGHIJ
::ab::ABCDEFGHIJ
::abc::ABCDEFGHIJ
::abcd::ABCDEFGHIJ
::abcde::ABCDEFGHIJ
::abcdef::ABCDEFGHIJ

The input strings look like ASCII characters but they’re not. I assigned them to codepoints in the Unicode ‘Private Use Area’, U+E000 thru U+F8FF. Naturally I cannot do so in this email as they would not display properly (or at all), but trust me, that's what I did. FYI I used FontCreator.

This is what I got back when I tested each of the hotstrings in turn:

aABCDEFGHI
abABCDEFGH
abcABCDEFG
abcdABCDEF
abcdeABCDE
abcdefABCD

The output string always has the correct length (10 characters), but the front of the output string is made up of the input string.

Weird, don’t you think?

If anyone can tell what’s going on here I’d be glad to know. Also, can anyone tell me if standard vanilla Autohotkey handles ‘extended ASCII’ (U+0080 thru U+00FF), because if so, I may have a workaround.


Posted: August 24th, 2010, 6:07 pm

Joined: March 19th, 2008, 12:43 am
Posts: 5482
Location: the tunnel(?=light)
Are those supported by UTF-8? Did you try changing the codepage?

Posted: August 25th, 2010, 1:44 am

Joined: August 23rd, 2010, 2:50 pm
Posts: 6
Location: Melbourne, Australia
Thanks for the response, sinkfaze.

You ask, 'Are those [my codepoints, I assume] supported by UTF-8?'

As far as I know, yes. UTF-8 supposedly supports every Unicode codepoint that exists or ever could exist. It just needs ever more bytes to do it: the higher the codepoint, the more bytes needed. But that's okay because the UTF-8 encoding algorithm knows how many bytes each codepoint needs. As far as I'm aware, my AHK scripts are all encoded in UTF-8 -- at least that's what it says under 'encoding' in the Windows dialog box when I save it off. So I assume that the codepoints I use, which all lie in the ‘Private Use Area’, U+E000 thru U+F8FF (hereafter 'PUA'), are indeed correctly coded.

The question then is: Does AHK_L support UTF-8? More precisely, Does it do so correctly? In case it matters, I note that the PUA (which contains my codepoints) requires three bytes per character under UTF-8. This is rare: ASCII needs only one byte, and nearly all other scripts (though not the East Asian ones) need only two. So maybe there's an error in AHK_L's algorithm for decoding UTF-8 whereby it assumes all non-ASCII characters use two bytes, whereas mine use three? Unless you used AHK_L with Chinese, you wouldn't notice it. Of course Chinese speakers use AHK, but my guess is their hotkeys and hotstrings are all ASCII. Am I right, anyone?

If so, as a workaround I could use codepoints in the range that UTF-8 has two bytes for. This is strictly speaking bad practice but, well, who's watching?
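For reference, the byte counts discussed above follow directly from the UTF-8 encoding ranges. Here is a minimal C++ sketch of that mapping; the function name `utf8_length` is made up for illustration and is not part of AutoHotkey_L.

```cpp
#include <cstdint>

// Hypothetical helper (not AutoHotkey_L code): number of bytes UTF-8 needs
// to encode one Unicode code point, or 0 if the value cannot be encoded.
int utf8_length(std::uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; // surrogates are not valid scalar values
    if (cp <= 0x7F)     return 1; // ASCII
    if (cp <= 0x7FF)    return 2; // U+0080..U+07FF
    if (cp <= 0xFFFF)   return 3; // rest of the BMP, including the PUA U+E000..U+F8FF
    if (cp <= 0x10FFFF) return 4; // supplementary planes
    return 0;                     // beyond the Unicode range
}
```

Under these ranges, `utf8_length(0xE000)` is 3, and any code point from U+0080 to U+07FF would take two bytes.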

Lastly, you ask if I'd tried 'changing the codepage'.

No. As I understand it (and correct me if I'm wrong) codepages only affect codepoints U+0080 thru U+00FF, that is the ANSI character set. My codepoints are way higher than that. Besides, aren't codepages history now that we have Unicode?

Over to you. La lucha continua!


Posted: August 25th, 2010, 10:02 am

Joined: October 17th, 2006, 4:15 pm
Posts: 7503
Location: Australia
stentor wrote:
The question then is: Does AHK_L support UTF-8?
AutoHotkey_L uses the following to determine the number of bytes following a given possible lead byte:
Code:
if (*mPos < 0x80)
   // single byte UTF-8 character
   return (TCHAR) *mPosA++;
// The size in bytes of UTF-8 characters.
if ((*mPos & 0xE0) == 0xC0)
   iBytes = 2;
else if ((*mPos & 0xF0) == 0xE0)
   iBytes = 3;
else if ((*mPos & 0xF8) == 0xF0)
   iBytes = 4;
else {
   // Invalid in current UTF-8 standard.
   mPosA++;
   return INVALID_CHAR;
}
After this, the indicated number of bytes are passed to MultiByteToWideChar to do the conversion. This is a standard Windows API, so as long as the above is correct, it will support whatever Windows supports.
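To make the lead-byte test above concrete, here is a stand-alone sketch that applies the same bit masks and then decodes the sequence by hand. This is an illustration only: per the description above, AutoHotkey_L hands the bytes to MultiByteToWideChar rather than decoding them itself, and the names `decode_utf8` and `kInvalid` are invented for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sentinel for malformed input (an assumption of this sketch).
constexpr std::uint32_t kInvalid = 0xFFFFFFFF;

// Decodes one UTF-8 sequence starting at p (at most `avail` bytes).
// On success returns the code point and stores the sequence length in `used`.
// Simplification: overlong forms and surrogates are not rejected.
std::uint32_t decode_utf8(const unsigned char* p, std::size_t avail, std::size_t& used)
{
    used = 0;
    if (avail == 0) return kInvalid;

    std::size_t n;
    std::uint32_t cp;
    if (p[0] < 0x80)                { n = 1; cp = p[0]; }        // 0xxxxxxx
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; cp = p[0] & 0x1F; } // 110xxxxx
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; } // 1110xxxx
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; } // 11110xxx
    else return kInvalid;           // continuation byte or invalid lead byte

    if (n > avail) return kInvalid; // truncated sequence
    for (std::size_t i = 1; i < n; ++i) {
        if ((p[i] & 0xC0) != 0x80) return kInvalid; // must be 10xxxxxx
        cp = (cp << 6) | (p[i] & 0x3F);             // append 6 payload bits
    }
    used = n;
    return cp;
}
```

Fed the three bytes EE 80 80, this returns 0xE000, the first code point of the Private Use Area, so a three-byte character passes through the lead-byte check like any other.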
Quote:
Unless you used AHK_L with Chinese, you wouldn't notice it.
I believe jackieku, who did most of the work for Unicode support (AutoHotkeyU), is Chinese or at least has a Chinese codepage as the system default.
Quote:
Besides, aren't codepages history now that we have Unicode?
If the source text is in UTF-8 and you're running a Unicode build (natively UTF-16), ANSI codepages should be entirely irrelevant.


Hotstrings weren't designed with Unicode in mind. For the Unicode build, I think we've made only the minimum changes required to get the code to compile. Specifically, ToUnicodeEx is called in place of ToAsciiEx and some variable types were changed as appropriate. Looking over the code, it appears the hotstring text must match--at a binary level--the string of characters in AutoHotkey's internal buffer, which is based on the aforementioned functions. (However, note that the entire content of the script is converted to UTF-16 as it is read from the file.) CharLower is used to support case-insensitivity. As for problems with backspacing/replacing, I wouldn't have a clue as I've barely skimmed over that part of the code.
Quote:
But when it is coded to a Unicode codepoint (in this example in the Unicode Private Use Area) the hotstring incorrectly returns: abbbb.
I would be surprised that the hotstring is even recognized. Perhaps MultiByteToWideChar is normalizing the UTF-8 string as it is converted to UTF-16?
Quote:
Also, can anyone tell me if standard vanilla Autohotkey handles ‘extended ASCII’ (U+0080 thru U+00FF),
Non-Unicode applications under Windows use ANSI code pages, where bytes 0x80..0xFF have different meanings depending on the system's default ANSI code page. Additionally, U+0000..U+00FF are directly equivalent to ISO-8859-1; "extended ASCII" is something else entirely, and both differ from what standard AutoHotkey uses (ANSI).

(Edit: To be clear, U+xxxx is common notation for a Unicode code point with hexadecimal value xxxx, so U+0080 is necessarily Unicode while 0x80 is just a hexadecimal number. The meaning of ANSI character code 0x80 is different depending on the code page, whereas U+0080 is by definition Unicode, which is not affected.)
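As a small illustration of that distinction, the sketch below lists what the single byte 0x80 stands for under three single-byte encodings. The enum and function names are invented for this example; the three mappings themselves are the published ones for ISO-8859-1, Windows-1252 and Windows-1251.

```cpp
#include <cstdint>

// Hypothetical names, used only for this illustration.
enum class CodePage { Latin1_ISO8859_1, Windows1252, Windows1251 };

// Unicode code point that the byte 0x80 represents under each encoding.
std::uint32_t meaning_of_0x80(CodePage cp)
{
    switch (cp) {
    case CodePage::Latin1_ISO8859_1: return 0x0080; // a C1 control character
    case CodePage::Windows1252:      return 0x20AC; // EURO SIGN
    case CodePage::Windows1251:      return 0x0402; // CYRILLIC CAPITAL LETTER DJE
    }
    return 0xFFFD; // unreachable for the values above; replacement character
}
```

The byte is the same in all three cases; only U+0080 names one fixed character regardless of the system's code page.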

Edit: If you want to know how your multi-byte UTF-8 character ends up in memory, something like this should work:
Code:
v= ; Place your character here.
SetFormat IntegerFast, Hex  ; Display integers in hexadecimal.
Loop, Parse, v  ; <- Probably not needed.
    s .= Asc(A_LoopField) " "  ; Append each character code, separated by a space.
MsgBox % s


Posted: August 26th, 2010, 8:26 am

Joined: August 23rd, 2010, 2:50 pm
Posts: 6
Location: Melbourne, Australia
Thanks, Lexikos.

As I understand it, you're telling me that if my source code is encoded in UTF-8 (and I believe it is), and since AHK_L runs UTF-16 natively (I'll take your word for it), then everything should work.

But you also say you've 'barely skimmed over' the matter of backspacing, so perhaps the problem lies there? The symptoms may support this view. At least it's worth looking at.

I have noticed that the left-hand (LH) string is not replaced if and only if it's coded to PUA codepoints. Instead what happens is the characters of the right-hand (RH) string are appended to the LH string until the output string reaches the length of the RH string. The remaining RH string characters are discarded.

To me it looks for all the world as if an output buffer of the correct size is created to accommodate the RH string, but the LH string refuses to vacate the buffer. It's a bit like extending a house to suit a new tenant but the old tenant refuses to leave. So when the new tenant moves in, they don't all fit; some household members are left outside to be taken by wolves.

As you can no doubt tell from this thoroughly unprofessional description, I'm no programmer. Which makes me all the more appreciative of your efforts and advice.

In case you're curious about what it is I'm trying to do, I'll explain. I have a glossary of terms. The input strings are composed of characters of my own devising, hence I put them in the PUA; the output strings are plain old ASCII. The script is ludicrously simple: just a list of hotstrings.

Lastly, thanks for putting me right on Unicode. Codepoints are one thing, encodings another.


Posted: August 26th, 2010, 10:44 am

Joined: December 23rd, 2006, 6:02 pm
Posts: 424
Location: Russia
Why not save your script in UTF-16? Thus you will avoid the conversion.


Posted: August 26th, 2010, 11:21 pm

Joined: October 17th, 2006, 4:15 pm
Posts: 7503
Location: Australia
YMP wrote:
Why not save your script in UTF-16?
It will take almost twice the space with no real benefit. If I insert character U+E000 into a script, save it as UTF-8 and run it, the character comes out as U+E000. I don't have to care how it's encoded.
stentor wrote:
As I understand it, you're telling me that [...] everything should work.
I never said the hotstring should work. My point was that how the character is represented in UTF-8 isn't relevant to the hotstring since the hotstring only sees the final UTF-16 string.
Quote:
I have noticed that the left-hand (LH) string is not replaced if and only if it's coded to PUA codepoints.
I think any characters which don't have corresponding keycodes will have the same problem. Btw, exactly how does one type a PUA character in a way that it can trigger a hotstring?

I think I've found (and fixed) the issue. However, it applied to any character in the RH text which has no corresponding keycode. The LH text was irrelevant. This also reproduced the problem:
Code:
SendInput % "a" Chr(0xE000) "b" ; I get {U+E000}ab

When any character was encountered which has no corresponding keycode, it was sent immediately using SendInput(). With SendEvent this isn't a problem, since keystrokes are sent one at a time. However, SendInput (not to be confused with SendInput()) and SendPlay use a buffer -- the entire string is converted to an array of events which are sent all in one go. Since these "special" characters were sent immediately when they were encountered in the string, they were actually sent before every other character. In the case of hotstrings, the backspace characters obviously have a corresponding keycode, so were sent after any "special" characters.

The solution has two parts: Firstly, if SendInput is the current mode, insert the Unicode pseudo-keystroke into the buffer rather than sending it immediately. Secondly, if SendPlay is the current mode, use the Alt+Numpad method instead since it is likely to have better results (though it relies on the target application to support Unicode).
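The ordering problem described above can be modelled in a few lines. This is a toy simulation, not AutoHotkey_L source: it assumes that only ASCII characters "have a keycode", and the names `has_keycode` and `buffered_send` are invented for the example. It shows only the first part of the fix (queueing the character in buffered mode).

```cpp
#include <string>

// Toy assumption for this sketch: only ASCII characters map to a keycode.
bool has_keycode(char32_t c) { return c < 0x80; }

// Returns the order in which a target window would receive `text` when the
// send mode is buffered. With fixed == false, characters without a keycode
// bypass the buffer and are delivered at once (the old behaviour); with
// fixed == true they are queued in the buffer like everything else.
std::u32string buffered_send(const std::u32string& text, bool fixed)
{
    std::u32string delivered; // what has already reached the target
    std::u32string buffer;    // events queued to be sent in one batch
    for (char32_t c : text) {
        if (has_keycode(c) || fixed)
            buffer += c;      // queued until the whole string is converted
        else
            delivered += c;   // sent immediately, ahead of the queue
    }
    delivered += buffer;      // the batch is flushed at the end
    return delivered;
}
```

With `fixed == false`, the input "a", U+E000, "b" is delivered as U+E000, "a", "b", matching the `{U+E000}ab` result quoted above; with `fixed == true` the original order is preserved.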

Note that the problem didn't occur for hotstrings if I used the "SE" (SendEvent) option or left my hotkey scripts running (since the presence of another keyboard hook causes SendInput to fall back to SendEvent).

Please try the latest build and let me know your results.


Posted: August 27th, 2010, 3:33 am

Joined: December 23rd, 2006, 6:02 pm
Posts: 424
Location: Russia
Lexikos wrote:
It will take almost twice the space with no real benefit.

I meant for testing. To leave out the conversion and see if it somehow affects the result.


Posted: August 28th, 2010, 7:23 am

Joined: August 23rd, 2010, 2:50 pm
Posts: 6
Location: Melbourne, Australia
Lexikos, I salute you and all the others who work so selflessly for Autohotkey. It now appears to be working.

I would seek your advice on another matter -- not a bug, I think, but an unwanted feature (at least for my purposes). Consider two hotstrings, one of which is a proper subset of the other and constitutes its final characters, such as
Code:
::de::DE
::abcde::ABCDE

When I input 'abcde' it returns abcDE instead of the desired 'ABCDE', as if the input string is prematurely triggering the first hotstring rather than the second. I haven't tested it rigorously, but I think the problem only arises in very long scripts (mine has thousands of lines).

This may be deliberate, but it's not what I need. I need each input string to be indivisible, much like a phone number. I have searched for a suitable Hotstring directive to force this interpretation, and experimented with some, but without success.

A solution I found was to order my hotstrings in such a way that the longest were encountered first. This works because the longer hotstring always triggers first. But it isn't altogether ideal, for if you mistakenly enter an incorrect string, the correct hotstring naturally isn't triggered. Instead AHK is liable to recognise the last few characters of the incorrect input string as a shorter legitimate hotstring and so return a mishmash. I'd prefer it to return nothing at all, or better, an error message.

Any ideas?

Lastly, in answer to your question: '... exactly how does one type a PUA character in a way that it can trigger a hotstring?' I used MSKLC to create a custom keyboard. When I want to enter PUA characters, I just toggle on the keyboard and bingo!


Posted: August 28th, 2010, 8:00 am

Joined: August 27th, 2010, 1:53 pm
Posts: 3
Perhaps someday I'll write one killer article all about AutoHotkey and how it can accelerate your productivity, but for now I'll just discuss one problem it can (help) solve: inputting special characters such as α or → with ease. I put "help" in quotation marks, because AHK does not have an easy, built-in way of inputting Unicode characters (yet), but it can do effective text auto-replace (a.k.a. hotstrings). There are several clever solutions on the AHK forums for the Unicode deficit (SendU, etc.), and here's mine...


Posted: August 29th, 2010, 2:17 am

Joined: October 17th, 2006, 4:15 pm
Posts: 7503
Location: Australia
stentor wrote:
When I input 'abcde' it returns abcDE instead of the desired 'ABCDE',
When I input 'abcde', it returns 'ABCDE'. It should only return 'abcDE' if you are using the '?' option.
Quote:
? (question mark): The hotstring will be triggered even when it is inside another word; that is, when the character typed immediately before it is alphanumeric.
Source: Hotstrings and Auto-replace (similar to AutoText and AutoCorrect)

Quote:
... but I think the problem only arises in very long scripts (mine has thousands of lines).
Maybe you've used #Hotstring ? somewhere? It can be turned off using #HotString ?0.
Quote:
I used MSKLC to create a custom keyboard.
Interesting.
kittu wrote:
I put "help" in quotation marks, because AHK does not have an easy, built-in way of inputting Unicode characters (yet),
This topic is about AutoHotkey_L, which does have an easy, built-in way of inputting Unicode characters. If you were aware of that, perhaps I misunderstood your post.
Code:
; Must be saved as UTF-8.
Send α→
::(a)::α
::->::→

Quote:
and here's mine...
Where?


Posted: August 29th, 2010, 4:13 am

Joined: August 23rd, 2010, 2:50 pm
Posts: 6
Location: Melbourne, Australia
You (Lexikos) mention the '?' option:
Quote:
It should only return 'abcDE' if you are using the '?' option.

Did I use it? Yes and no. I experimented with it. The way I read it, the default state is that hotstrings are indivisible, but that the '?' option allows them to be divisible. Rereading it, I still make this interpretation. I therefore couldn't, and still can't, understand why my script doesn't seem to work that way. But thinking that perhaps I had misunderstood the instructions, as an experiment I tried using '?' in case it meant precisely the opposite of what I thought: namely that '?' turned indivisibility on, not off. It made no difference that I could see, so I removed it.

You say:
Code:
Maybe you've used #Hotstring ? somewhere? It can be turned off using #HotString ?0.

Again, yes and no. When I was experimenting with the various hotstring 'switches' I invariably used the 'directive' syntactic form as I wanted the switch to apply to every hotstring. I would place the directive at the very top of my script. (I did not, however, turn off the directive at the end. Is this an error? Since I was at the end of my script, I assumed it didn't matter). But otherwise, and now, my script consists of 10,000 hotstrings and nothing else -- no commands of any description. This has to be perhaps the longest and simplest AHK script in history.


Posted: August 29th, 2010, 7:23 am

Joined: October 17th, 2006, 4:15 pm
Posts: 7503
Location: Australia
I prefer not to think of it as dividing the hotstring or dividing what you type; the option when applied to any given hotstring purely affects whether any previously typed text is considered when recognizing that hotstring. The quote in my previous post describes precisely how it works.

If you type 'abcde' and that option is not enabled, it cannot trigger the hotstring 'de' since the character typed immediately before 'd' is 'c', which is alphanumeric. If the option is enabled, the character before 'd' is ignored, so 'abcde' may trigger the 'de' hotstring.
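The rule in the paragraph above can be written as a small predicate. This is a simplified sketch and not how AutoHotkey implements hotstring recognition: it ignores ending characters and case-insensitivity, handles ASCII only, and the name `hotstring_matches` is invented for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Simplified model: does the text typed so far trigger the abbreviation?
// `inside_word_option` stands for the '?' hotstring option.
bool hotstring_matches(const std::string& typed,
                       const std::string& abbrev,
                       bool inside_word_option)
{
    if (abbrev.empty() || typed.size() < abbrev.size())
        return false;
    const std::size_t start = typed.size() - abbrev.size();
    if (typed.compare(start, abbrev.size(), abbrev) != 0)
        return false;            // the typed text does not end with the abbreviation
    if (inside_word_option || start == 0)
        return true;             // previous character is ignored, or there is none
    const unsigned char prev = static_cast<unsigned char>(typed[start - 1]);
    return !std::isalnum(prev);  // blocked when the previous character is alphanumeric
}
```

Under this model, typing 'abcde' matches the 'de' hotstring only when the option is on, while 'abcde' itself matches either way.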

If you add the ?0 option to the hotstring, as in...
Code:
:?0:bc::BC
...and typing 'abc' triggers that hotstring, there's probably a bug or something else resetting hotstring recognition. On that note, it might help to use the Z option, or remove it if you're already using it.
Quote:
I did not, however, turn off the directive at the end.
That isn't a problem. The problem is when you enable the option and forget you've done so. Adding the option to the hotstring itself (as above) overrides the global setting, but only for that hotstring.

If a given hotstring shows incorrect behaviour, try removing it from that script and adding it into a new one.

