Optimized access to string characters

Propose new features and changes
User avatar
Coiler
Posts: 114
Joined: 29 Nov 2020, 09:06

Optimized access to string characters

25 Feb 2021, 08:34

There is currently no efficient method to test strings for a prefix, suffix, or random-index character. As far as I can tell, we only have the ability to use SubStr to extract a new string containing the specific character, then compare it to our target. It would be great to have the ability to simply check a specific index character value. Something like if mystring[A_Index] == "X" .

Thanks!
lexikos
Posts: 9553
Joined: 30 Sep 2013, 04:07
Contact:

Re: Optimized access to string characters

26 Feb 2021, 20:09

As far as I can tell, we only have the ability to use SubStr to extract a new string containing the specific character, then compare it to our target.
Is that a problem? Your suggestion is to add a shortcut (shorthand syntax) for doing exactly the same thing.

There are other options, but SubStr-comparison is likely to be the most efficient, especially when you are only extracting one character.

You can make string[] a shortcut for SubStr by adding a __Get meta-function to the default base object (v1) or adding a parameterized __Item property to String.Prototype (v2).

How do you gauge efficiency? Is it about your efficiency, or the program's? What is the criteria for a method to be judged as not efficient?
HotKeyIt
Posts: 2364
Joined: 29 Sep 2013, 18:35
Contact:

Re: Optimized access to string characters

26 Feb 2021, 22:57

Like this?

Code: Select all

text:="randomtext"
needle:="ex"
MsgBox % DllCall("msvcrt\wcsncmp","PTR",(&text)+14,"Str",needle,"Int",2)
lexikos
Posts: 9553
Joined: 30 Sep 2013, 04:07
Contact:

Re: Optimized access to string characters

27 Feb 2021, 06:20

That is less efficient.
iPhilip
Posts: 796
Joined: 02 Oct 2013, 12:21

Re: Optimized access to string characters

02 Mar 2021, 12:58

lexikos wrote:
26 Feb 2021, 20:09
You can make string[] a shortcut for SubStr by adding a __Get meta-function to the default base object (v1) or adding a parameterized __Item property to String.Prototype (v2).
This is how I implemented the above for v1.

Code: Select all

"".base.__Get := Func("Default_Get")
Default_Get(var, index) {
   return SubStr(var, index, 1)
}

Str := "hello world", Index := 5
MsgBox % Str[Index] " == " "hello world"[Index] " == " SubStr(Str, Index, 1)
Similarly, this is for v2.

Code: Select all

DefProp := {}.GetMethod("DefineProp")
%DefProp%("".base, "__Item", {get : (var, index) => SubStr(var, index, 1)})

Str := "hello world", Index := 5
MsgBox Str[Index] " == " "hello world"[Index] " == " SubStr(Str, Index, 1)
Windows 10 Pro (64 bit) - AutoHotkey v2.0+ (Unicode 64-bit)
iPhilip
Posts: 796
Joined: 02 Oct 2013, 12:21

Re: Optimized access to string characters

02 Mar 2021, 21:28

Hi @lexikos,

I have a follow up question: How would I go about removing a __Get meta-function to restore the default behavior?

Code: Select all

"".base.__Get := Func("Default_Get")
adds a __Get meta-function to the default base object but it's not clear how I would undo that.
Windows 10 Pro (64 bit) - AutoHotkey v2.0+ (Unicode 64-bit)
lexikos
Posts: 9553
Joined: 30 Sep 2013, 04:07
Contact:

Re: Optimized access to string characters

02 Mar 2021, 23:26

All you did was assign (and thereby create) a property/key-value pair. To undo that, delete what you created.
iPhilip
Posts: 796
Joined: 02 Oct 2013, 12:21

Re: Optimized access to string characters

03 Mar 2021, 16:13

lexikos wrote:
02 Mar 2021, 23:26
All you did was assign (and thereby create) a property/key-value pair. To undo that, delete what you created.
Hi @lexikos,

Thank you for your reply. I figured as much. The sticky issue for me was not realizing that I had to allow for the object base to be returned by the Default_Get function after assigning the __Get meta-function to the default base object.

The code below works for v1.

Code: Select all

#NoEnv

Str := "hello world", Index := 5
MsgBox % Str[Index] " == " "hello world"[Index] " != " SubStr(Str, Index, 1)

"".base.__Get := Func("Default_Get")
MsgBox % Str[Index] " == " "hello world"[Index] " == " SubStr(Str, Index, 1)

"".base.Delete("__Get")
MsgBox % Str[Index] " == " "hello world"[Index] " != " SubStr(Str, Index, 1)

Default_Get(var, key) {
   if (key != "base")
      return SubStr(var, key, 1)
}
The code below works for v2.

Code: Select all

Str := "hello world", Index := 5
try
   MsgBox Str[Index] " == " "hello world"[Index] " != " SubStr(Str, Index, 1)
catch
   MsgBox 'Error:  This value of type "String" has no property named "__Item".'

DefProp := {}.GetMethod("DefineProp")
%DefProp%("".base, "__Item", {get : (var, index) => SubStr(var, index, 1)})
MsgBox Str[Index] " == " "hello world"[Index] " == " SubStr(Str, Index, 1)

DelProp := {}.GetMethod("DeleteProp")
%DelProp%("".base, "__Item")
try
   MsgBox Str[Index] " == " "hello world"[Index] " != " SubStr(Str, Index, 1)
catch
   MsgBox 'Error:  This value of type "String" has no property named "__Item".'
Cheers!
Windows 10 Pro (64 bit) - AutoHotkey v2.0+ (Unicode 64-bit)
lexikos
Posts: 9553
Joined: 30 Sep 2013, 04:07
Contact:

Re: Optimized access to string characters

03 Mar 2021, 21:27

You may also use Func("SubStr").Bind(,, 1) in place of (var, index) => SubStr(var, index, 1).

With a future release (current 'x' branch), SubStr.Bind(,, 1) will work.
iPhilip
Posts: 796
Joined: 02 Oct 2013, 12:21

Re: Optimized access to string characters

03 Mar 2021, 21:40

lexikos wrote:
03 Mar 2021, 21:27
You may also use Func("SubStr").Bind(,, 1) in place of (var, index) => SubStr(var, index, 1).

With a future release (current 'x' branch), SubStr.Bind(,, 1) will work.
Thank you, @lexikos. I had forgotten about the new Bind() feature.
Windows 10 Pro (64 bit) - AutoHotkey v2.0+ (Unicode 64-bit)
User avatar
Chunjee
Posts: 1400
Joined: 18 Apr 2014, 19:05
Contact:

Re: Optimized access to string characters

09 Mar 2021, 14:06

Coiler wrote:
25 Feb 2021, 08:34
There is currently no efficient method to test strings for a prefix, suffix, or random-index character.

Perhaps biga.ahk can be of some use:

Code: Select all

A := new biga() ; requires https://www.npmjs.com/package/biga.ahk

msgbox, % A.startsWith("NOT REQUIRED", "NOT")
; => true

msgbox, % A.endsWith("NOT REQUIRED", "D")
; => true

; get a random character:
msgbox, % A.sample(strSplit("NOT REQUIRED"))
; => "E"
I find this fits efficiently in my brain toolbox; but is probably not as fast computationally than a dll

https://biga-ahk.github.io/biga.ahk/#/?id=startswith
https://biga-ahk.github.io/biga.ahk/#/?id=endswith
https://biga-ahk.github.io/biga.ahk/#/?id=sample
User avatar
Coiler
Posts: 114
Joined: 29 Nov 2020, 09:06

Re: Optimized access to string characters

16 Mar 2021, 15:17

lexikos wrote:
26 Feb 2021, 20:09
As far as I can tell, we only have the ability to use SubStr to extract a new string containing the specific character, then compare it to our target.
Is that a problem? Your suggestion is to add a shortcut (shorthand syntax) for doing exactly the same thing.

There are other options, but SubStr-comparison is likely to be the most efficient, especially when you are only extracting one character.

You can make string[] a shortcut for SubStr by adding a __Get meta-function to the default base object (v1) or adding a parameterized __Item property to String.Prototype (v2).

How do you gauge efficiency? Is it about your efficiency, or the program's? What is the criteria for a method to be judged as not efficient?
I'm sorry, I didn't mean to touch a soft spot. I apologize for my poor choice of words. I wasn't implying that any of the existing functionality is inefficient. Just that SubStr() is doing a lot more work than what is necessary to achieve the desired result. If we just want to look at a single character's value, we are required to create an entirely new, separate string in memory. I assume the new string is held in heap memory, etc? Even if it uses the stack or a very efficient recycling system, its still doing a lot more than necessary to compare a single character value to another value (memory + index lookup).

If it would take too much time to implement, or would be too difficult to implement, then that makes sense. Otherwise, I personally think the functionality should be available in such a string-friendly language. You can do tons of amazing things with strings, but not the most basic operation.

EDIT: I personally won't be adding a shortcut to pretend this operation exists, but I do recommend faking it officially. If for no other reason, just to represent a possible future upgrade. This would allow script writers to make use of the optimized string[index] syntax, allowing things to fall into place if/when the functionality is eventually added. Or if you don't like array syntax, maybe GetCharAtIndex(string,index).
lexikos
Posts: 9553
Joined: 30 Sep 2013, 04:07
Contact:

Re: Optimized access to string characters

16 Mar 2021, 21:57

I think you misread the tone of my post and have made some poor assumptions about how strings are implemented and what the performance would be.

What makes you think that indexing into the string and having it return a single-character string (for comparison, as in your hypothetical example) would be implemented or perform any differently to SubStr indexing into the string and having it return a single-character string? It's literally doing the exact same thing, but the indexer would be invoked through an additional layer of abstraction (invoking a virtual property vs calling a function directly) whereas SubStr has one more parameter.
swagfag
Posts: 6222
Joined: 11 Jan 2017, 17:59

Re: Optimized access to string characters

16 Mar 2021, 23:18

sounds like he wants std::string_view implemented
User avatar
Coiler
Posts: 114
Joined: 29 Nov 2020, 09:06

Re: Optimized access to string characters

19 Mar 2021, 05:44

lexikos wrote:
16 Mar 2021, 21:57
I think you misread the tone of my post and have made some poor assumptions about how strings are implemented and what the performance would be.

What makes you think that indexing into the string and having it return a single-character string (for comparison, as in your hypothetical example) would be implemented or perform any differently to SubStr indexing into the string and having it return a single-character string? It's literally doing the exact same thing, but the indexer would be invoked through an additional layer of abstraction (invoking a virtual property vs calling a function directly) whereas SubStr has one more parameter.
More apologies. Apparently the tone of all of your posts come off as angry. I haven't looked at the source code, but I'm assuming the comparison operators are not required to create variables for the expressions. There are several situations where a comparison operator can "look inside" of another variable, so I guess it depends on how AHK handles those situations. Does it create new variables to represent a portion of the other variables, or does it compare the data directly? And in either case, there's no reason to create new variables - just compare the data directly. Especially in those simple situations where a<50> == b<30>. And in cases where the user is assigning the data to a new variable, use SubStr().

I didn't realize this much depth was necessary to ask for a feature. Just ignore it if you don't think it will benefit AHK.
lexikos
Posts: 9553
Joined: 30 Sep 2013, 04:07
Contact:

Re: Optimized access to string characters

20 Mar 2021, 23:31

If all you wanted was to ask for a feature, you achieved that with the first post. I assume you posted because you want it implemented. But if you wanted it for the reason you stated, then implementing it wouldn't actually help you since SubStr would likely still be more efficient (depending on how you measure "efficiency"). I posted not to reject your request, but to address misinformation.

if mystring[A_Index] == "X" has two variables: mystring and A_Index. The equality operator takes two values and compares them. The sub-expression mystring[A_Index] invokes the default property of the value contained by mystring, with the parameter A_Index. If the invocation does not throw an exception (e.g. because the property doesn't exist), the value it produces is then compared to "X" by ==. No variables are created by this expression at runtime.

The default property for an Object is __Item, but for COM objects it has no name.

There is no such thing as "a portion of a variable". A variable contains a value. mystring presumably contains a string. You want mystring[A_Index] to return a single-character substring of the string contained by the variable, not a portion of the variable itself.

It seems that you think mystring[A_Index] == "X" could compare the character at that index directly, without having any need for an intermediate string containing that character. That would mean resolving this to a single operation, similar to CompareCharAt(mystring, A_Index, "X"), not evaluating each side of the operator and then evaluating the operator. But why would you assume it could work that way for mystring[A_Index] == "X" and not for SubStr(mystring, A_Index, 1) == "X"?

Actually, it is not practical for mystring[A_Index] because the indexing operation must be dispatched according to the runtime type of the value, and mystring could contain anything (e.g. an Array, Map or Gui). The CompareCharAt operation would need to perform the equivalent of evaluating mystring[A_Index] itself if mystring does not contain a string. Furthermore, String.prototype.__Item might have been redefined with different behaviour, and checking for that would possibly defeat the optimization. By contrast, it's always known at load-time whether SubStr refers to the built-in function (i.e. hasn't been shadowed by an inane user-defined one) and that it will return a string, so it would be technically possible to make the comparison work this way. However, it would almost certainly be a waste of time ("premature optimization"), because "creating" a single-character string for return is a very fast operation.


With the current AutoHotkey v2 implementation, there are a number of different ways strings might be returned from a built-in function. For SubStr, only the following are relevant:
  • If the substring is empty, the constant "" is returned directly.
  • If the substring is a suffix of the input, it is returned directly. (Otherwise, the string must be copied in order to null-terminate it.)
  • If the substring is less than 256 characters, it is copied into pre-allocated stack-based memory provided by the caller.
  • Otherwise, memory is allocated from the heap and the substring is copied into it.
After return, there are a number of possibilities that might be taken by the expression evaluator. In this case, it will copy the single-character substring from the temporary buffer into a more persistent temporary buffer prior to the comparison.

A null-terminated single-character UTF-16 string is (typically) 4 bytes including the null-terminator. Copying these values repeatedly is not a performance concern, and is almost certainly faster than allocating new memory.

At some point in the future, the code may be reviewed to remove the reliance on null-termination. Then, SubStr can just return a pointer and length, without copying any strings. (The expression evaluator may still need to make a copy, as it does now.)


In any case, mystring[A_Index] will never be more CPU-efficient than SubStr(mystring, A_Index, 1). The difference in CPU-efficiency has nothing to do with allocating or copying strings; it is the cost of dynamic dispatch. Built-in properties are implemented the same way as user-defined ones; the only difference is that a built-in function is called instead of a user-defined one.

For a very fast and simple operation like retrieving a character from a string, the difference is more pronounced percentage-wise because the overall operation is still very fast. I did some testing with a simple "custom built-in" function ChrAt, functionally identical to SubStr when its third parameter is 1, but simpler. The results were:

Code: Select all

============  ChrAt
=============  SubStr
====================================  ChrAt via []
========================================  SubStr.Bind via []
============================================================  UDF via []
Shorter = faster. The UDF was ((s, i) => SubStr(s, i, 1)). The test was to iterate over mystring := A_AhkPath with Loop StrLen(mystring) and one of the above. The slowest was still under 5µs per iteration (averaged over 100k iterations), and that's iterating over the entire string, not just a single call to SubStr/ChrAt/[].

So even if you're concerned about the performance of such a trivial operation, there is very little to be gained by adding this feature as built-in over implementing it yourself. If it is added, it will not be for performance reasons, but perhaps (subjective) readability or convenience.
User avatar
Coiler
Posts: 114
Joined: 29 Nov 2020, 09:06

Re: Optimized access to string characters

29 Mar 2021, 12:16

lexikos wrote:
20 Mar 2021, 23:31
There is no such thing as "a portion of a variable". A variable contains a value. mystring presumably contains a string. You want mystring[A_Index] to return a single-character substring of the string contained by the variable, not a portion of the variable itself.
I was referring to a portion of memory that belongs to the variable. The variable we are referring to is a string of character bytes, not a self-contained chunk of memory. There actually is such a thing as a portion of the variable - it would be any single or any subset of those bytes. I do not want the operation to return a substring at all. I want the operation to return a single byte or character. But since AHK does not have variables to represent bytes (AFAIK), an optimized-internal-comparison would also work, if such a thing is possible. If its not possible with AHK's setup, or the changes needed to perform such a comparison are too extreme, then that makes sense.
lexikos wrote:
20 Mar 2021, 23:31
It seems that you think mystring[A_Index] == "X" could compare the character at that index directly, without having any need for an intermediate string containing that character. That would mean resolving this to a single operation, similar to CompareCharAt(mystring, A_Index, "X"), not evaluating each side of the operator and then evaluating the operator. But why would you assume it could work that way for mystring[A_Index] == "X" and not for SubStr(mystring, A_Index, 1) == "X"?
I haven't assumed that. I would be fine using the function if I knew it was (or will ever be) capable of directly comparing the characters. But according to the description of the function, that is not what it does. It reports that it generates a second string in memory for my single character, even if I intend to throw that character away immediately after the comparison.
lexikos wrote:
20 Mar 2021, 23:31
Actually, it is not practical for mystring[A_Index] because the indexing operation must be dispatched according to the runtime type of the value, and mystring could contain anything (e.g. an Array, Map or Gui). The CompareCharAt operation would need to perform the equivalent of evaluating mystring[A_Index] itself if mystring does not contain a string.
The syntax doesn't really matter. Anything that directs execution to the right place without conflicts would work: if( string@3 == "X" ) or if( string@3isX ) or if( string@3[X] ) or if( string@[3]is[X] ), etc. You guys are much better suited to know what would conflict and what would be difficult. I'm just a begger on the street asking for change 😆
lexikos wrote:
20 Mar 2021, 23:31
At some point in the future, the code may be reviewed to remove the reliance on null-termination. Then, SubStr can just return a pointer and length, without copying any strings. (The expression evaluator may still need to make a copy, as it does now.)
That would be fantastic. You could generate substrings on the fly without allocating memory. I implemented a string class like this once in C++, but it was strictly a container class - it wasn't allowed to own memory. So there were normal strings, and container strings. Container strings worked like sort of a const reference, which could only be modified in its "containment" sense. Sort of like a window that can be opened and closed on both ends to see more or less of the outside. Something like this (two string types) might be possible for AHK using its impressive reference system. You could start strings out as containers, then upgrade them when necessary.

lexikos wrote:
20 Mar 2021, 23:31
I did some testing...
I appreciate the amount of time you put into it. And I respect the efficiency of AHK - it really is impressive. Especially for such a string-heavy language. Efficiency is important, but its not really the reason I want dedicated syntax to operate on a character vs character level in AHK. It's more like a promise of efficiency. Or having the right tool for the job kind of thing. Even if AHK is using SubStr in the background, having syntax that enforces a character-level operation is communicating to the programmer "I have this covered - don't worry about this". From a programmer's perspective, it feels like this tool is missing from the shelf. Maybe it is my specific background that makes it feel that way.

Return to “Wish List”

Who is online

Users browsing this forum: No registered users and 36 guests