# Generating unique random integers

86 replies to this topic

### #1 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 29 March 2005 - 12:05 AM

In some applications we need random numbers, which are all different. There is a simple method to generate them one by one: compare each new random number to all previously generated- and in a list stored- numbers. If no collision is found, attach the new number to the list. You can see that the running time grows at least quadratically with the length of the list, n.

Using priority queues, heaps or other complicated data structures speeds this up to n log(n). Similar expected running time can be achieved if we generate more numbers than needed, sort the list (with attached indices), so the equal numbers get next to each other. A simple scan now finds and removes repetitions. If enough numbers are left, just sort the list, now according to the original indices and we get a random sequence without repetitions. Unfortunately, we either have to write very complicated scripts or use external programs, like SORT.

But there is a way around. Hopefully, AHK uses internally those fancy data structures to find variables. Even if not, the internal search is done by compiled code, not by an interpreted script. We should achieve a great speedup by taking advantage of the variable naming capabilities of AHK. Here is how:

As the firs number is generated, say R_1 = 10, we create a variable called Index_10. We assign the index of the generated number (1 from R_1) to it. At a later stage, if we encounter the same value again, like R_8 = 10, we check Index_10. Now it exists and contains an index: 1. So, we found a repetition. We have to try generating a new R_8. If R_8 is now 9, and this is the first time the value 9 occurs, there will be no Index_9 defined, so we create it, and store the index 8 of R_8 in there.

Is there a problem here? (Think a bit before you read further.) What happens if there is an Index_9 environment variable already defined by some Windows process or by our own script, in a previous run? It is quite slow to clear all possible Index_x values before the script runs. But is it necessary?

NO! To protect ourselves from a leftover Index_9 variable, with some wrong value j in it, we check if j is a number, if it is between 1 and the current index of the generated random number (8 from R_8) and if it really points to an R_j, which has the value 9. If all these are true, we know that Index_9 is not garbage, we can safely use it.

Here is the script, which should generate unique random integers really fast.
```MIN = 0
MAX = 99
N   = 10
If (Max - Min + 1 < N)
{
MsgBox Cannot have %N% different numbers between %MIN% and %MAX%
return
}

Loop %N%
{
i := A_Index
loop
{
Random R, %MIN%, %MAX%     ; R = random number
j := Index_%R%             ; get value from Indexes
If j is number
If j between 1 and % i - 1
If (R_%j% = R)
continue             ; repetition found, try again
Index_%R% := i             ; store index
R_%i% := R                 ; store in R_1, R_2...
break                      ; different number
}
}
MsgBox %R_1%`n%R_2%`n%R_3%`n%R_4%`n%R_5%`n%R_6%`n%R_7%`n%R_8%`n%R_9%`n%R_10%```

### #2 Chris

Chris
• 10727 posts

Posted 29 March 2005 - 12:44 PM

Very nice; great description.

If j between 1 and % i - 1

At first I was surprised the % worked in there, but I'm glad it does. The part after the "and" is actually a parameter for the "if between" command, so that probably explains it.

Hopefully, AHK uses internally those fancy data structures to find variables. Even if not, the internal search is done by compiled code, not by an interpreted script

Currently, dynamic variables in AutoHotkey are somewhat slow because they require a sequential scan of the variable list. Therefore, the more variables a script has, the slower these dynamics get (this does not affect normal variables, only ones such as Array%i%).

I'm planning to implement binary search on the variable list very soon which should help scripts such as this one and the Tetris scripts. Binary search was chosen because it is very simple to implement while giving a very large speedup (log(N) vs. N I think). However, I'm open to other methods if they're signficantly better.

### #3 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 29 March 2005 - 03:21 PM

Thanks! What happens when a dynamic variable is referenced? Its name has to be looked up in a list, where pointers are kept to the current data belonging to those variables, and residing in the dynamically allocated heap. If the name is not there, we have to insert it in the appropriate place. Currently there is no AHK command to remove a variable name from the list. "Var =" should not remove the name Var, because this variable is likely to be used again. In some situations, like with the script above, it might be desirable to prune a branch of the name-tree, like "Clear Index_*" or "Clear Index_0 .. Index_99", which should remove every matching name. The data structure has to allow fast removal of consecutive variable names.

Sorted arrays are fast for search, but slow for insertion. Hash lists are even faster for search, but slow for branch removal. Without pruning they provide the fastest and simplest solution. For supporting fast branch-cut some kind of balanced tree should be appropriate. Binary heaps are the simplest, but priority queues are more universal. They store the names in a tree, which is re-balanced after each change. The speedup is from n to log(n) for an average operation, with some overhead, so more than 10 fold improvement is expected at 1000 variables.

### #4 Invalid User

Invalid User
• Members
• 447 posts

Posted 30 March 2005 - 08:41 AM

```MsgBox, Hit OK to start
Redo:
ResultListTotal = 0
Result =
ResultList =

Loop, 94
{
Random, Result, 0, 99999
If A_Index = 1
{
ResultList = %Result%
}
If %Result% in  %ResultList%
{
Break
}
If A_Index > 1
{
ResultList = %Result%,%ResultList%
}
ResultListTotal += 1
}
If ResultListTotal <> 94
{
Goto, Redo
}
Loop, Parse, ResultList, `,
{
FileAppend, %A_LoopField%`n, %A_ScriptDir%\EFormat.txt
}
MsgBox Finished
ExitApp```
I did it like this, however it works great for 94 different numbers, the fact that is restarts the whole process over may be time issue if thousands and thousands of numbers are needed. just a thought

### #5 Chris

Chris
• 10727 posts

Posted 30 March 2005 - 01:39 PM

In some situations, like with the script above, it might be desirable to prune a branch of the name-tree, like "Clear Index_*" or "Clear Index_0 .. Index_99", which should remove every matching name. The data structure has to allow fast removal of consecutive variable names.

This might well get implemented someday thanks to your suggestion.

Sorted arrays are fast for search, but slow for insertion.

That's a very good point and somewhat dampens my enthusiasm for binary search. However, the insertion penalty only happens the first time the array is created, since subsequent assignments to that array will be against variables that already exist, and thus don't have to be physically inserted.

Hash lists are even faster for search, but slow for branch removal. Without pruning they provide the fastest and simplest solution.

Hmm, so hash lists are nearly as good as binary search for searching, but much better than binary search for insertion performance? At http://msdn.microsof... ... ngcode.asp it says: "Array-based hash tables (so-called closed hashing) is an often-overlooked implementation which frequently has superior performance." So maybe that would be a good approach.

Thanks for your advice. If you have more, it would be appreciated.

### #6 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 30 March 2005 - 05:28 PM

In a Random Access Machine hash tables provide constant expected time search and insertion (assuming a much larger table than the number of items), and so they are better in speed but worse in memory use than binary search based methods. AHK should not be used for processing GB arrays, and so allocating a several MB large hash table should be enough and should not be an issue. (If we grow out the initial table, that is, there are many hash collisions, a larger table can always be allocated and data transferred there.)

A simple sorted array gets very slow at the definition of thousands of dynamic variables. In some applications they are accessed many times, so fast access is more important. Other applications, like generation of unique integers, don't access them much, so the time spent at their definition is not negligible.

There is a difference in the speed of Clear, and the initialization. In a hash table you store the names of the variables and pointers (to their content) in an array indexed by the hash of the name. The whole table (which is mostly empty) has to be scanned for removing unwanted variables, if their names are not known (Clear Index_*). Erasing everything needs to zero the entire table, which implies that the load time of AHK increases, as the table needs to be all zero initially.

Some balanced tree data structures could erase consecutive variable names in time proportional to the logarithm of the size of the tree. They also load fast initially, because only the root of the tree need to be set up, the rest can contain garbage.

As always, there is no single best solution for all the usage scenarios. I prefer balanced trees, because they load faster, use less memory, their running time is predictable, can grow easily and Clear is faster. On the other hand hash tables are faster at normal operations. The complexity of their code is similar and manageable.

### #7 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 30 March 2005 - 06:39 PM

[to Invalidus(er)] Your script is OK if in a large interval very few random numbers are needed. It gets too slow when thousands of numbers are generated or the probability of a collision is high. We can improve it:
```N   = 10

MIN = 0

MAX = 9

!z::

Random ResultList, %MIN%, %MAX%

Loop % N - 1

{

Loop

{

Random Result, %MIN%, %MAX%

If Result not in %ResultList%

Break

}

ResultList = %ResultList%,%Result%

}

MsgBox %ResultList%

return```
Your script throws away everything generated so far, when a new random number happens to be equal to a previous one. It is enough to regenerate the last one only. Also, GoTo's should be avoided, when possible.

### #8 Invalid User

Invalid User
• Members
• 447 posts

Posted 30 March 2005 - 07:09 PM

yeah i know it could use some improvement, but it took less than 2 min to make :lol: and i only needed it once

### #9 Chris

Chris
• 10727 posts

Posted 31 March 2005 - 07:35 PM

I prefer balanced trees, because they load faster, use less memory, their running time is predictable, can grow easily and Clear is faster. On the other hand hash tables are faster at normal operations. The complexity of their code is similar and manageable.

Thanks for all the tips. You obviously have a lot of experience in this area and it's already been a great help to me.

Originally, I was just going to do binary search, figuring it had to be way better than sequential scan (even inserting would probably be faster until the number of variables gets really huge, say over 10000). Then I saw your earlier post and thought hash tables can't be that much harder, and I understand them, so why not go that route.

By contrast, balanced trees are something that seems conceptually more complicated than hashing to me, such as the need to rebalance them periodically (or "as needed"?). I'd welcome any algorithms, code, or URLs you might have that give an overview of how to implement them (the simpler the better) and how they should be kept balanced.

Also, I wonder if dynamic-variable scripts, which typically contain less than 10000 variables each, would run significantly faster with hashing than they would with balanced trees, just from the reduced overhead.

If you have more time to help me, we can continue here or you can e-mail me at support@autohotkey.com . Thanks again.

### #10 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 01 April 2005 - 12:04 AM

(Tell me what you like to do and I will explain why it is the best choice… or the worst.)

There are many more things to consider. For example, comparing long variable names is slow, like WindowUnderTheMouseCursor_01. Using a fast hash function makes these comparisons very rare; therefore, hash tables are faster with long names. Long names might be necessary to avoid conflicts until AHK will have block structure and local variables.

If there are really many names, a search in a balanced tree causes a lot of page faults, and extra time to get mostly unneeded data into the cache. A hash table usually takes just two random data accesses.

With few dynamic variables small hash tables are faster in the average than balanced trees. However, large hash tables do not fit in the cache, and could cause a couple of page faults, while balanced trees have size proportional to the number of dynamic variables, so they could remain in the cache.

Let's assume we use a 16-bit hash. Limit the length of a variable to 55 characters, Null terminated, 4-byte continuation pointers and 4-byte pointers to the data. They make 64 bytes per table entry, giving a table size of 4MB. At start up we only need to Null the first entries in the table, which is a loop of 64K iterations of storing single bytes, a barely noticeable time. This table is fine until a few thousand dynamic variables. If some pervert defines more, it slows down the process, as we often have to follow continuation pointer chains. It could be a stated policy, that whoever uses more than 10,000 dynamic variables should accept gradual slow down. In this case there will be no need for allocating a larger hash table if it gets too crowded.

"Clear" instructions should not occur frequently under typical use, so if they take a few milliseconds it does not really matter. Accordingly, implement the hash table and everybody will be happy.

Some better balanced tree info URL's:

http://en.wikibooks....tructures:Trees
http://www.delorie.c...libavl_276.html
http://www.delorie.c...libavl_150.html

http://www.bluerwhite.org/btree/
http://libredblack.sourceforge.net/

http://www.site.uott...nced-trees.html
http://www.brpreiss....ml/page320.html

### #11 Chris

Chris
• 10727 posts

Posted 01 April 2005 - 12:40 PM

hash tables are faster with long names.

I would guess that a typical (median) variable name is only 20 characters long, though they can be up to 255 characters long.

Long names might be necessary to avoid conflicts until AHK will have block structure and local variables.

Local variables will exist in the next version, along with function parameters and return values.

One of the things that's slowing down the next release is that I only recently realized that whenever a function is running more than once simultaneously, either through a recursive call to itself or because a thread gets interrupted by another, each new caller must backup the previous caller's block of local variables and restore it when its call completes. This backup and restore is skipped when the function has no other instances on the call-stack, which should greatly improve average-case performance.

Therefore, the data structure chosen for each function's local variable list (which will probably be the same structure used for the global list) is also a consideration if it impacts how quickly a block of local variables can be backed up and restored. Hash tables might be fairly slow to back up and restore. By contrast, a sorted binary-search array would probably be one of the fastest. I'm not sure about a balanced tree.

"Clear" instructions should not occur frequently under typical use, so if they take a few milliseconds it does not really matter. Accordingly, implement the hash table and everybody will be happy.

Because of the future need to clear/delete variables from the list, and because of the backup/restore issue mentioned above, I'm back to favoring the binary search idea (perhaps fallaciously). It seems that binary search supports fast clearing and fast backup/restore. In addition, since each entry in the sorted array is only 4 bytes (since it's just a pointer to content), even a script with 10000 variables only requires an average of 20 KB of memory to be moved for an insert, which might not be unacceptably slow (though certainly much slower than hash or b-tree).

Let's assume we use a 16-bit hash. Limit the length of a variable to 55 characters, Null terminated, 4-byte continuation pointers and 4-byte pointers to the data. They make 64 bytes per table entry, giving a table size of 4MB.

(If you think binary-search mentioned above has merit, don't spend much time replying to the following)
I wouldn't use hashing if it required 4 MB for every script. Why can't an entry consist solely of a pointer-to-data and a pointer-to-continuation? With each such entry being only 8 bytes, 64K entries would only consume 512 KB of memory.

Some better balanced tree info URL's:

### #12 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 02 April 2005 - 03:19 AM

(Still no clear winner…)

Sorted lists either consist of pointers to the names *and* to the data, or only to the names, which have attached pointers to the current data. In the first case insertion and deletion in an n-byte long list requires moving 4n bytes in the average, in the second case only 2n bytes. If accessing dynamic variables dominates the creation and deletion operations, it is OK. I can imagine a kind of *lazy* approach: keep the last few created variables in a separate list. If it grows beyond a certain limit, sort and merge this one to the main list, what can be done in a single sweep. It reduces the total amount of work proportionally. Similarly, if variables are deleted in blocks of consecutive names, the actual removal from the main list can be delayed until the ghost variables slow down the search (like every other variable in the main list is actually a deleted entry.) This assumes that mostly local variables are deleted in blocks, and the variables of a terminated block are not accessed any more.

If variable names can be long, hash tables are also better organized containing just one pointer to a corresponding record. The record contains the name of the variable, a pointer to the data contained in the variable (which could cahnge) and a continuation pointer to the record of the next variable name with the same hash (if any). This way we need a minimum of 3 indirections instead of 2, but save a lot of table memory. With 16 bit hash the table is only 256KB. The records and the data are allocated from the heap, with the usual garbage collection, needed for repeated Clear operations.

One way to handle local variables is to prepend some distinguishing string to their names, like `BlockID` (with the quotes forbidden otherwise in names). Each new block (function, procedure) receives a unique ID number at creation. At sorted list-, and balanced tree - dynamic variable management the pointers to local variables will be next to each other, and so a Clear is fast. Hash tables are of fixed size, and we have to scan them fully to remove local variables, which is slow. Accordingly, it is better to allocate a small (8-bit) hash table to each instance of a function or block. If it becomes overcrowded, we replace it with a larger one. When the function exits, the whole local hash table is deleted. It does not take time at all, but more memory is consumed.

Still, any of the (lazy) sorted lists, balanced trees or hash tables can do the work.

Another difficulty with multiple, parallel created instances of functions is the large number of deleted local variables, which necessitate a fast garbage collection. The OS might or might not do a good job with them. Currently a block (subroutine) cannot be entered until a possible previous call has not terminated. I understand that in the next release this limitation will be lifted. It means, we cannot assume that local variables are allocated in contiguous blocks, so, when the heap is exhausted we have to go through our live tables to find still needed data. Traversing sorted lists or (dense) balanced trees is faster than traversing mostly empty hash tables (but we can scan fast through Null pointers).

### #13 Chris

Chris
• 10727 posts

Posted 02 April 2005 - 01:15 PM

If accessing dynamic variables dominates the creation and deletion operations, it is OK.

Since currently, variables are never deleted (not even local variables, see below) this might be the case.

This assumes that mostly local variables are deleted in blocks, and the variables of a terminated block are not accessed any more.

Hash tables are of fixed size, and we have to scan them fully to remove local variables, which is slow. Accordingly, it is better to allocate a small (8-bit) hash table to each instance of a function or block. If it becomes overcrowded, we replace it with a larger one. When the function exits, the whole local hash table is deleted. It does not take time at all, but more memory is consumed.

It means, we cannot assume that local variables are allocated in contiguous blocks, so, when the heap is exhausted we have to go through our live tables to find still needed data. Traversing sorted lists or (dense) balanced trees is faster than traversing mostly empty hash tables (but we can scan fast through Null pointers).

The current approach is that local variables (and formal parameters, which are basically locals themselves) are never deleted, though their contents might be. This is done primarily so that at load-time, any line in the script that references such a variable can be given its permanent address. Thus at runtime, such a script never searches for variables (except dynamics) because all the addresses have already been resolved.

I can imagine a kind of *lazy* approach: keep the last few created variables in a separate list. If it grows beyond a certain limit, sort and merge this one to the main list, what can be done in a single sweep.

Very interesting. Are you saying that the lazy list, when it reaches the size limit, should be sorted into the main list via q-sort or some other fast sort? Perhaps the lazy list itself would be a sorted list (binary search capable), and whose size limit is fairly large, say 1000 variables.

With 16 bit hash the table is only 256KB.

The idea of performing an extra indirection in exchange for halving the memory required is great. Hashing is more attractive now.

I have two questions: 1) Would even a memory restricted hash, that isn’t that sparse and thus has a lot of continuation entries, be generally superior to the binary search approach in both inserts and searches?; 2) Which hash function is generally considered the best way to transform a variable name into a hash index? I was naively thinking that each letter in the variable name could be multiplied by an order of magnitude, so that something like abc would be ascii-of-a*100 + ascii-of-b*10 + ascii-of-c*1.

Currently a block (subroutine) cannot be entered until a possible previous call has not terminated.

There is no such restriction. A subroutine can have any number of instances on the call stack (either through recursion or an interrupting thread). You might be thinking of hotkeys, which default to one instance each. This default can be changed on a per-hotkey basis with #MaxThreadsPerHotkey or the Hotkey command.

One way to handle local variables is to prepend some distinguishing string to their names, like `BlockID` (with the quotes forbidden otherwise in names). Each new block (function, procedure) receives a unique ID number at creation. At sorted list-, and balanced tree - dynamic variable management the pointers to local variables will be next to each other, and so a Clear is fast.

Wonderful. I can see how this would allow merging of all local and global variables into a single hash table, which would definitely save memory and probably reduce complexity. However, I’m still stuck with the need to backup and restore blocks of local variables, for which such a large hash table might be ill-suited. To clarify, here are the details of the current plan:

When a function is called and there are no other instances of it on the call stack (which is probably true 90% of the time), no backup and restore is necessary and everything below is skipped.

But when a function is called recursively, or called by a new thread that suspended the previous thread, we backup the list of local variables of the instance that lies immediately beneath us on the call stack. The primary thing that needs to be backed up is the variable’s pointer-to-contents (previously allocated). In other words, the address of each local variable never changes. But the address of each local variable’s contents does change, either through recursion or an interrupting thread.

Because recursed layers and underlying threads are suspended while the current layer is running, they do not need their local variables until they're resumed. Thus, when the topmost instance of a function completes, all it has to do is free any memory that its layer of variables allocated and restore each variable's pointer-to-contents prior to returning to its caller.

Perhaps this approach is rather hair-brained and should be abandoned completely. If so, please be honest, but keep in mind that it was adopted so that the address of all non-dynamic variables, including local variables, could be resolved once at load-time to accelerate runtime performance.

Thanks for everything.

### #14 Chris

Chris
• 10727 posts

Posted 02 April 2005 - 03:24 PM

After posting the above, I thought of a couple more things:

I was at first confused about how the lazy sort would actually save any CPU time (it seemed like it would just procrastinate). But perhaps you had in mind a "reverse merge sort", which relies on the main list and the lazy list being already sorted. For each item in the lazy list (starting at the right/far end), binary search the main list to find the insertion point, move enough memory to handle all remaining lazy items (making a large gap), then copy the current lazy item into the far right side of the gap. Repeat this process, which results in a moving but shrinking gap, until the merge is completed. The net amount of memory moved should be much less (perhaps Log(N) vs. N?) than would have been required without the lazy approach.

Also, since the hashing approach still seems attractive, I thought about how to backup and restore the part of a hash table that corresponds to a block of local variables: Perhaps it can be done simply by storing the address of each backed up variable along with the data that is backed up. This would making restoring very fast, but backing up slower because a large chunk of the hash table must be scanned for variables whose prefix matches that of the function's block ID (assuming that the the hash function can be adapted to isolate a small range of indices where the variables must reside).

### #15 Laszlo

Laszlo
• Fellows
• 4713 posts

Posted 02 April 2005 - 07:22 PM

Lazy insertion: what you describe is the in-place merge. It is a good idea! It has the advantage that if you have enough room for the list to grow, no new memory needs to be allocated. Also, if the lists are of different length (n >> m), the running time is slightly better than with the normal merging of sorted lists: compare the topmost elements of the two lists and move the larger one to a new location. Repeat this until the lists become empty. (Actually, pointers to the records are copied.) The normal merge takes n+m pointer copy operations, the in-place merge saves moving the beginning of the list, an expected saving of n/(m+1), a few percent. The normal merge performs n+m comparisons, the in-place merge performs less than m*log(n). If m < n/(1+log(n)), we save work with the binary search. The overall speedup is not much, but with the memory allocation advantage it is worth doing it. (There are endless heuristic/probabilistic speedup possibilities. E.g. the binary search could start at the element located in the position n/(m+1) from the end of the larger list, not at the median.)

Merging saves time, because inserting m names one-by-one into a length-n list takes roughly m*n/2 data movements in the average, instead of n+m with merging.

Static and dynamic variables need to be handled differently. There is no need for backup and restore the values of dynamic variables if the process ID is prefixed to their names (they get created at the first reference and purged form the table some time after the process terminates). For static local variables there are alternatives to saving/restoring values. For example, each static local variable reference could be replaced by an offset in a table, to be added to the address of the table. The table address is given by the main controller at the activation of the particular instance of the block. When the block (subroutine) is first activated and the unique process ID is generated, a new table is allocated, with all Null pointers to the local data. When the process terminates, the memory used by its local variable table gets freed, which is automatic, if the table is on the stack.

By the way, if subroutines are started asynchronously by external events (timer, interrupts), and terminate in different order, is a call stack still the right data structure to manage the processes? It looks like a series of independent call stacks are needed, created at each asynchronous event.

Hash: The book D. E. Knuth, Art of Computer Programming, Volume 3: Sorting and Searching discusses a few hash functions. You could calculate the CRC of the name, or consider the name as a long integer and compute the remainder modulo a prime number, and so on. Hash tables do not need cryptographic hash functions, which are way too slow. Here are some references:

http://en.wikipedia....i/Hash_function
http://www.partow.ne...hfunctions/#top
http://www.concentri...ech/inthash.htm
http://www.cs.yorku.ca/~oz/hash.html

Extendible Hashing: http://www.istis.uno...a3300/fs009.htm

http://burtleburtle....hash/doobs.html
http://www.azillionm...m/qed/hash.html

SourceForge.Net: http://sourceforge.n...rojects/ga-lib/
http://sourceforge.net/projects/libtc/
http://sourceforge.n...yetanotherclib/
http://sourceforge.n...cts/hackerlabs/
http://sourceforge.n...rojects/gcslib/
http://sourceforge.n...rojects/dbprim/