Geek's HashMap.ahk

geek
Posts: 1055
Joined: 02 Oct 2013, 22:13
Location: GeekDude
Contact:

Geek's HashMap.ahk

10 Jan 2024, 07:05

HashMap.ahk

This library provides a hash map based alternative implementation of AutoHotkey's Map class, with the goal of allowing drop-in replacement for situations where the native Map's performance limitations have been reached.

Why

AutoHotkey's native Maps, as well as its basic Objects, use sorted arrays and binary search as their item lookup strategy. When dealing with only a few items, up to the low thousands, that approach works fine for almost any purpose. However, there comes a point where having to re-arrange the dense sorted array on every insert, and to perform a logarithmically growing number of search steps on every lookup, slows down data operations noticeably. This can often be seen when loading large amounts of JSON data, or when dumping a lot of SQL data into memory.

Hash maps have better performance characteristics at large scale, because they do not require the entire items array to be re-arranged whenever a new item is inserted, and because their lookups scale linearly with the key size rather than logarithmically with the Map size. They work by putting each key through a hashing function that converts it into a predictable number between 0 and the length of the items array. Once it has that number, the hash map can jump directly to the item without performing any search steps. The items array is only re-arranged when the quantity of items grows too large for the space originally allocated for them. In this implementation, deleting items can also trigger a partial re-arrangement.

How

This library uses machine code to construct objects compatible with the COM IDispatch standard, so that they may be used directly by AutoHotkey without the added scripting complexity of foreign function interface code like DllCall.

The objects provide standard Map methods, but implement them by passing the keys and values as COM VARIANTs into a generic hash map implementation. The hash map implementation has had its hashing and equality functions overridden to provide unique hashes of VARIANT objects of different types, and to allow string arguments to be hashed and compared by value rather than by reference. The hashing function is a variant on FNV-1a which has been modified to accept a seed value. Arguments are staged for hashing by dereferencing appropriate VARIANT fields to form a densely packed buffer that includes both the type and bytes representing the data in whatever format it is best compared. For example, a string pointer will be dereferenced to the string itself. By this method, two VARIANTs with equal strings but differing string pointers will still hash the same, and an integer VARIANT whose bytes happen to match up with the characters of a string will not hash the same.

All this machine code has been compiled using the MCL MCode library, so that it may easily import a wide variety of COM functions from the Windows API, and so that the standard C code hash map implementation may be compiled with no special workarounds. Once compiled, plain C constructor functions for the custom COM objects are automatically exported to AutoHotkey and get wrapped by the HashMap class to convert the IDispatch pointer returned by the constructor function into an AutoHotkey ComObject wrapper, in order that it may be used like any other object.

This approach is in stark contrast to the approach used by previous hash map implementations, which have either used mostly pure AHK code, or used MCode that exposes a C API thickly wrapped in layers of functions and DllCalls that introduce a lot of overhead. Although to my knowledge it has not yet been done, this library also aims to be more stable than one written using the Native.ahk library by the infamous @thqby. Native.ahk attempts to generate objects from machine code that implements AutoHotkey's internal and officially unstable APIs, so it can be broken easily by any updates made to the AutoHotkey interpreter. At best it just does not work, but at worst this can lead to corruption of internal data structures which may not be revealed right away. Because this library's object integration is COM-based, it is unaffected by changes to internal APIs, and should instead be portable across AHK versions (and even non-AHK languages) so long as the MCode can be loaded into memory.

Who

This library builds on the work of many authors and people who have come before.

@G33kDude who has built this library

@CloakerSmoker who co-authored the MCL library, without whom this C code could almost never have been compiled, let alone efficiently compiled.

@nothings (Sean Barrett) and the Public Domain Data Structures project contributors, who engineered the majority of the hash map implementation.

Uncountable other AHKers who have worked to teach new users and provide previous iterations of tooling.

And an honorable mention to @HelgeffegleH who built the original hashTable implementation for AutoHotkey. This project takes no code or inspiration from Helgef's project. We simply appreciate that Helgef beat us to the punch by several years.

Compatibility Notes

This library supports 32- and 64-bit AutoHotkey v2.0, and makes no assurances about compatibility with future releases of AutoHotkey. However, it has been designed with future compatibility in mind.

Still left to be implemented:
CaseSense

Differences from native Maps:

You cannot set OwnProps on the HashMap. This may be supported in future releases; however, it is not currently on the roadmap.

The HashMap supports Float keys directly, rather than implicitly casting them to String keys. This may be configurable in future releases.

There is no __New alias to Set.

There is no Capacity property, as the backing hash map implementation does not directly expose this information, and even if it did, it does not allow direct manipulation of the capacity.

CaseSense will not support Locale mode.

Default cannot be a dynamic property or determined by meta-function, but instead must always be set by value. If allowing OwnProps to be set on the HashMap makes it onto the roadmap, some support for dynamic properties may be considered if only for the Default OwnProp.

In AHK v2.0, using a standard Map, calling Set with an item key or value of unset causes that assignment to be ignored, while other assignments in the same parameter list may be accepted. Trying to assign an item value of unset will cause AutoHotkey to raise a missing parameter exception. The HashMap, however, treats unset more like a null value, and will gladly accept setting a value to unset. That unset will be passed back to AHK when that key is asked for, rather than an exception being raised that the item has no value. When AHK v2.1 becomes stable and the allowed usage of unset in native AHK is expanded, this library will likely be modified to behave more in line with however v2.1 handles these scenarios.

Documentation

See the GitHub. Mostly, it's just like a normal Map.

Downloads

See the releases on GitHub
crocodile
Posts: 100
Joined: 28 Dec 2020, 13:41

Re: Geek's HashMap.ahk

28 May 2024, 02:43

I compared the two.
HashMap() is 3x faster than Map() for adding keys and values.
loop 100000
    m[a_index ""] := a_index

For enumeration, Map() is 2x faster than HashMap().
For reading, m["k"], Map() is 25% faster than HashMap().

I wonder if there is any possibility of optimizing the latter two?
geek
Posts: 1055
Joined: 02 Oct 2013, 22:13
Location: GeekDude
Contact:

Re: Geek's HashMap.ahk

28 May 2024, 10:46

crocodile wrote:
28 May 2024, 02:43
HashMap() is 3x faster than Map() for adding keys and values.
loop 100000
m[a_index ""] := a_index
This is not a useful test for comparing the two data structures, as the two structures have very different performance characteristics.

AHK's Map is implemented as a sorted array on which Binary Search is performed. Map increases its capacity by 1 each time you add an item over the current capacity, which (as I recall) defaults to 1. If you pre-set the capacity of a map and run your same loop again, it will run much faster. However, your loop is the best case scenario for inserting into a map, as you are inserting the keys in order. This allows AHK to almost always resize the map without having to relocate all the map's memory to a new location. If instead you insert the keys in reverse order (m[(100000 - A_Index) ""] := A_Index), so that it has to shift all the other items down 1 to fit each new item, it will run much more slowly, and it will continue to slow proportionally to how large it becomes, because a larger items array means more memory to move in order to push a new item to the front.

HashMap is implemented as a hash table, which means all the keys are converted into pseudo-random "hash" numbers and then placed into an array at the indexes corresponding to the hashes. That array is resized exponentially, doubling each time it approaches its target load ratio. I don't recall the exact number, but I think I set it around 70% full. I did not add support to change the capacity manually, so you have to live with the doubling (for now anyways). Instead of slowing down by how sorted or unsorted your inserts are, it slows down by how long it takes to compute the hash of the input. Which means that longer string keys take longer to insert than shorter string keys.

The overhead of computing the hash for the key is potentially much higher than the overheads related to moving memory around in a small Map, up to a certain size after which manipulating the map is slower than manipulating the hash map. To test these two structures against each other effectively, you need to test inserting many different quantities of data, and many different types of data. At minimum, I'd recommend testing "sorted", "reverse-sorted", and "randomized" keys. It would be useful to test both integer keys and string keys of varying sizes. You could add some semi-sorted data as well, though it's not as clear what a semi-sorted set might look like. Test these, doubling the amount of data each time until each run takes a few seconds, then chart the values and go back and re-run the test with adjusted sizes to fill in additional data points around areas where the lines cross (assuming they do end up crossing). Then you should have a set of graphs that can help you make useful predictions about whether a Map or HashMap will be faster for your situation.
crocodile wrote:
28 May 2024, 02:43
For enumeration, Map() is 2x faster than HashMap().
I think it is not likely this can be made faster, as I imagine most of the overhead here comes from the COM layer. Something like thqby's Native.ahk could probably reduce this overhead, but I avoid Native.ahk as libraries built with it are not portable between AHK releases, even within the same minor version.
crocodile wrote:
28 May 2024, 02:43
For reading, m["k"], Map() is 25% faster than HashMap().
This is not a useful test for comparing the two data structures, as the two structures have very different performance characteristics.

AHK's Map is implemented as a sorted array on which Binary Search is performed, meaning the time to retrieve a value from a key is dependent on how long it takes to perform a search. The larger the Map is, the more items it has to search through. Binary Search's performance is logarithmic compared to the size of the data set, so it's much faster than just checking each item until the correct one is found, but for large data sets it can still be slow. Especially if your Map contains a lot of string keys where the strings start with the same text, as this can slow down the search by making it waste time on near-matches.

HashMap's hash table implementation can search in constant time with respect to the size of the data set. Instead, its search time is linear with respect to the size of the search key. Once the hash number is generated from the search key, it can get the item out of the corresponding array index basically instantly.

The overhead of computing the hash for the key is potentially much higher than the overheads related to performing Binary Search in a small Map, up to a certain size after which searching the map becomes slower than computing key hashes. To test these, you basically need to do the same thing as for inserting -- test them at many different sizes, with different key types, including multiple lengths of string keys. I don't think sortedness matters as much for retrieving items, as Map sorts the items automatically on insert, and HashMap automatically arranges the items pseudo-randomly on insert.

I don't have the time to go through and do all this testing right now, but the point is that neither Map nor HashMap are "faster" or "slower" in a meaningful way without additional context. One is faster at doing some things, the other is faster at doing other things.
lexikos
Posts: 9690
Joined: 30 Sep 2013, 04:07
Contact:

Re: Geek's HashMap.ahk

30 May 2024, 20:17

geek wrote:
28 May 2024, 10:46
Map increases its capacity by 1 each time you add an item over the current capacity, which (as I recall) defaults to 1.
If you are adding a single item with m[index] := value and the Map is at capacity, it doubles its current capacity, unless the previous capacity is 0, in which case it increases to 4. Capacity defaults to 0, but initializing it with elements (Map(k1, v1, k2, v2)) is effectively the same as defaulting to capacity equal to half the length of the parameter list.

If Set/__New reaches capacity, it increases to original capacity + (parameter count / 2). It is more suited to a once-off initialization or injection of a set of key-value pairs, conserving memory but wasting time if you call it repeatedly. It should probably use the default exponential expansion under some conditions. Even I forget about this performance characteristic and sometimes use Set in favour of assignment (__Item). The documentation only says "Capacity is automatically adjusted to avoid expanding multiple times during a single call."

However, the HashMap treats unset more like a null value, and will gladly accept setting a value to unset. That unset will be passed back to AHK when that key is asked for, rather than raising an exception that the item has no value.
Functions/methods/properties in v2.0 aren't intended to be able to return unset, and some expressions or functions may not handle it safely or sensibly, let alone correctly. For v2.0, I intended to permit unset only in the immediate context of an unset variable with ? or ??, or unset in a parameter; but some cases slipped through. Once it is present as a "value", it can silently propagate (in v2.0), which is not what was intended.

v2.1 makes it legal to return unset but also requires that the calling expression be marked to either show that the parameter default value might be used (f(m[key]?)) or provide a fallback value (m[key] ?? value). For Map, it does not matter whether a slot exists for a key-value pair. m[key] returns unset rather than throwing, and the expression throws UnsetItemError if appropriate.

As with properties and function parameters in general, Map(a, b ? c : unset) or Map(a, b?) is intended to conditionally omit the key-value pair, which can be much more convenient than using a separate conditional assignment, especially within a list of pairs, which can be in optimal sort order.

When AHK v2.1 becomes stable and the allowed usage of unset in native AHK is expanded, this library will likely be modified to behave more in line with however v2.1 handles these scenarios.
:thumbup:

With v2.1-alpha.12, HashMap 1.0.0 and m := HashMap(1, unset), m[1] throws UnsetItemError and m[1] ?? 0 yields 0 as expected. m[2] ?? 0 understandably throws an error. It seems that m.Default := unset and m.Default := ComValue(10, 0x80020004) are treated as though there is no default (m[2] still throws), rather than being equivalent to m[2] := unset.
