
Generating unique random integers


Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004

...the in-place merge. It is a good idea!

Thanks for the details about it. They may well get applied.

the binary search could start at the element located at position n/(m+1) from the end of the larger list, not at the median.

If this means that the binary search should start in the middle of the part of the large list that remains (that which lies to the left of the current insertion point), that certainly makes sense.

There is no need to back up and restore the values of dynamic variables if the process ID is prefixed to their names (they get created at the first reference and purged from the table some time after the process terminates).

That's a good point but the current design relies on the address of each local variable being known at load-time, and never changing throughout the life of the script (i.e. variables are never destroyed, only set to NULL contents). I'm not saying this is a good design, but rather than spending a week to redesign the whole thing, I think I'll try to patch it up. That's where save-and-restore comes in: I think it offers superior performance when there are no other instances of a function on the call stack (which is probably 90% of the time). As a trade-off, it requires more memory because local variables are all static in duration.

if subroutines are started asynchronously by external events (timer, interrupts), and terminate in different order, is a call stack still the right data structure to manage the processes?

They don't terminate in different order. When a thread interrupts another, the interrupted thread stays suspended until the current thread finishes. In this sense, it's the same as recursion because interrupted threads are always resumed in reverse order (newest to oldest) and threads never run simultaneously in a true multithreaded sense.

Here are some references:
...
http://www.cs.yorku.ca/~oz/hash.html

Of the first five you posted, I liked the above the best because it comes right out and recommends djb2 as a great compromise between speed and distribution.
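
For reference, djb2 itself is only a couple of lines. A rough AutoHotkey port might look like this (illustrative only; the function and variable names are mine, and this is not how AutoHotkey stores variables internally):

djb2(str) {                 ; djb2: hash := hash*33 + c, starting from 5381
   hash := 5381
   Loop, Parse, str
      hash := (hash * 33 + Asc(A_LoopField)) & 0xFFFFFFFF   ; keep it within 32 bits
   Return hash
}

A 256-entry table would then use something like bucket := djb2(name) & 255 to pick a slot.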

In summary (from what I know now), a balanced tree is probably the best overall approach, but the need to periodically rebalance it adds conceptual complexity that makes it seem harder to implement (probably due to inexperience).

Hash tables seem very good too, though perhaps a little wasteful of memory if every script requires a new 256 KB table (perhaps the table could grow instead: starting at 6400 entries and growing to 64K/640K entries when the number of script variables goes beyond a certain limit, say 4000/40000). But anyway, hashing might be abandoned for now because the need to "back up" local variables would require each function to have its own hash table, which seems wasteful of memory. Individual hash tables seem necessary because backing up one block of local variables within a huge central hash table seems like it would perform very poorly, requiring a complete traversal of the entire table unless a separate, unsorted array of pointers-to-vars were maintained solely for backup purposes (which is probably less wasteful than separate hash tables).

So unless you see some way to salvage either of the above approaches and make it workable under my constraints (and you're probably well tired of this topic by now :)), that leaves the binary search idea with your wonderful lazy-sort enhancement, which will hopefully improve its performance scalability by an order of magnitude or more.

Finally, although I'd welcome further advice if you care to give it, you've already done so much that I don't want to ask for more. I might contact you via e-mail with specific questions in the future, but I'll try to keep them to a minimum.

Thanks for everything.

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
One trivial note

…the current design relies on the address of each local variable being known at load-time, and never changing throughout the life of the script…
That's where save-and-restore comes in: I think it offers superior performance when there are no other instances of a function on the call stack

Instead of global pointers, offsets into a changing table are no slower; they are simply accessed by relative addressing. The table is so small (the script names each variable individually) that there is no need to allocate it from the heap: the local variables can be on the stack (the table address is the stack frame). It saves you the whole backup/restore nightmare, in both time and memory.

Chris
  • Administrators
  • 10727 posts
  • Last active:
  • Joined: 02 Mar 2004
That's a good point. It might be somewhat complicated by the fact that local variables (like globals) can be dynamically created via something like Array%i% = String. But even so, it's something to consider. The real issue is that the entire program should probably be revamped someday to use hashing or a balanced tree for variables. But that's something I'm going to defer so that higher priority items can be worked on.

By the way, binary-search is working well for the upcoming release. It immediately increased the practical capacity of each script from about 10,000 to 100,000 dynamic variables, and increased their access speed nearly 1000-fold. Beyond 100,000 variables the performance dropped off a cliff, I believe due to the CPU cache size (256 KB in my case) having been exceeded for each memmove().

But when I added in your lazy list idea, this problem was solved and the scalability increased again: now scripts can have 5 million or more variables without a significant degradation in performance.

Thanks for breathing new life into an old design -- and for your advice about hashing and b-trees, which will definitely help me work smarter in the future.

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
It has been my pleasure.

PhiLho
  • Moderators
  • 6850 posts
  • Last active: Jan 02 2012 10:09 PM
  • Joined: 27 Dec 2005
Great topic! I found it by accident, while searching for the word "heuristic" in the forum (not many occurrences...).
So now I know why Laszlo is credited in the source code of AutoHotkey... :-)
Laszlo, I appreciate the insights on data structures (it is always hard to choose one) and the links. Using high-level languages like Java, we tend to forget the subtleties of these implementations... I know you prefer lower-level languages offering better control...

I also like the first script; it is a problem I often ask myself about. I always wondered if one could create a mathematical function to generate semi-random numbers without repetition. I suppose there isn't.

You use AutoHotkey features to create a sparse array, i.e. a big array with very few used entries (if the number of generated values is much smaller than the given range), optimized to use much less space than the full array.
That is not obvious to do in C, for example: either you allocate the big block of memory (fast, but memory-hungry for large ranges) or you use clever algorithms (hashing).

Chris, when you write that you tested with thousands of variables, I suppose you tested using pseudo-arrays... Can you confirm that using foo_%i% with i being 1, 2, 3 is the same as using the variables foo_1, foo_2, foo_3, i.e. that there are three variable creations?

Last note: nowadays, with the #NoEnv directive, the algorithm is even simpler; there is no need for value checking.
The issue, as you pointed out, is that if you have to generate several series, you have to clear all the entries.
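
For what it's worth, here is a minimal sketch of that sparse-array technique with #NoEnv (variable names such as Used_%n% are illustrative, not taken from the original script, and it assumes Count does not exceed the size of the range):

#NoEnv
Min := 1, Max := 1000, Count := 10
List := ""
Loop, %Count% {
   Loop {                            ; retry until an unused value comes up
      Random, n, %Min%, %Max%
      If (Used_%n% = "") {           ; with #NoEnv an unset variable simply reads as empty
         Used_%n% := 1               ; mark this value as used
         Break
      }
   }
   List .= (List = "" ? "" : ",") . n
}
MsgBox %List%                        ; Count distinct numbers between Min and Max

To generate a second series, the Used_* entries would indeed have to be cleared (or given a different prefix).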
vPhiLho := RegExReplace("Philippe Lhoste", "^(\w{3})\w*\s+\b(\w{3})\w*$", "$1$2")

majkinetor
  • Moderators
  • 4512 posts
  • Last active: May 20 2019 07:41 AM
  • Joined: 24 May 2006
Is it possible to create some kind of black box for this, so we can switch implementations and feel the difference? This is theory, and it would not be the first time that something theory predicts turns out wrong in real life. I suggest binary trees, balanced if you have time, but a black box with experiments would be great. This could be implemented through some directive which we could use for testing in the beginning, like #SearchMethod [tree|btree|hash...]. This is a very important aspect of AHK and it should be thoroughly tested.

Note that binary trees are great for random data, but for sorted data you get N instead of log(N) complexity, which is the same as we have now. Also, NTFS is implemented using binary trees. MSDN also states that, compared to FAT32, file access on NTFS is much faster except when iterating through a whole directory, in which case FAT32 is better since it reads sequentially.

2Laszlo
Great theory here. Thank you for refreshing my memory, and thank you for some new information.

majkinetor
  • Moderators
  • 4512 posts
  • Last active: May 20 2019 07:41 AM
  • Joined: 24 May 2006
Oh.. this is some old shit.. what it is doing on the first page I don't know...

and I was wondering how I could have missed this entire conversation...
and I was so happy about the new search.. :)

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

a mathematical function to generate semi-random numbers without repetition

Take any invertible function in the range [0,n-1], and generate f(0), f(1), ..., f(k). If k < n, the function values are all different. The difficulty is to pick a function that behaves (pseudo-)randomly, that is, one whose generated sequence looks random to common statistical tests.

One possibility is to pick two random numbers p and q in [0,n-1], with p relatively prime to n, and define f(x) := mod(p*x + q, n). This works, but the randomness is not great:
N   = 10
Min = 0
Max = 9
M := Max-Min+1

Random Q, Min, Max ; Pick random Q
Loop {
   Random P, Min, Max
   If gcd(P,M) = 1 ; Pick random P, gcd(P,M)=1
      Break
}                  ; f(x) = P*x+Q
R := mod(Q,M)+Min  ; f(0), kept consistent with the formula used in the loop below
Loop % N-1
   R := R "," mod(P*A_Index+Q,M)+Min
S = %R%
Sort R, ND,
MsgBox %S%`n%R%
Return

gcd(x,y) {         ; Greatest Common Divisor
   If (x*y)        ; neither is 0
      Return gcd(y,mod(x,y))
   Return abs(x+y)
}
You can define more complex functions, depending on more random parameters. If the function contains shifts and logical and arithmetic operations, the resulting sequence will look more random. If anyone is interested, I can post a few good-enough (but quite complicated) functions.

with the #NoEnv directive, the algorithm is even simpler, no need for value checking

True. Using functions and local arrays, we can be sure that any array element that exists was created by the current call of the function, so collisions are easier to detect. Still, the described technique is useful with paging memory: only the pages that contain data get physically allocated, but those pages could contain garbage. The algorithm tolerates it, disproving the myth that uninitialized memory is always bad.
--- ---

...binary trees, balanced if you have time, but a black box with experiments would be great. This could be implemented through some directive which we could use for testing in the beginning, like #SearchMethod [tree|btree|hash...].

An interpreted language like AHK should not be used for huge problems where high speed is important. AHK just has to be fast enough; algorithms that are 20% faster or slower do not matter. Even if you prototype complex functions in AHK, you test them with small data sets and run the real algorithm in a compiled language. I would not complicate the syntax (and the documentation) with directives about internal data representation.

binary trees are great for random data, but for sorted data you get N instead of log(N) complexity

You probably mean that binary-sorting an already sorted array is slower than it could be. In the name-search problem, which is what AHK has to handle for finding variables, the binary tree is always sorted: we build it as new dynamic variables come up. Finding the right place for a new name takes log(N) steps on average, and using binary trees the worst case is of the same complexity.

It is true that in a program variables are not defined in random order; it is common to have Array1, Array2, Array3... in that order. Some kind of heuristic could speed up the tree-building process. One is implemented in AHK: lazy insertion. Keep a certain number of newly created names in a scratch pad, in the order they were created, and merge them into the main tree (sorted list) when the scratch pad is full (see the sketch below). You could have more sophisticated heuristics, but it does not really matter in practice. Defining new names is usually less frequent than accessing them, and binary search is reasonably fast for finding something. You could argue that hash tables are better for certain applications, but then we would need to figure out whether those applications form the majority or not.
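
To make the idea concrete, here is a toy AutoHotkey sketch of lazy insertion over pseudo-arrays (the names Main%i%, Pad%i%, InsertName, FindName and MergePad are purely illustrative; AutoHotkey's real variable storage is C++ and differs in detail):

PadMax := 3, MainCount := 0, PadCount := 0
names := "Array1,Array2,Count,Total,Index"
Loop, Parse, names, `,
   InsertName(A_LoopField)
MsgBox % "Position of Count in the main list: " FindName("Count")
Return

InsertName(name) {                    ; new names just go into the unsorted scratch pad
   global
   PadCount += 1
   Pad%PadCount% := name
   If (PadCount >= PadMax)            ; pad full: merge it into the sorted main list
      MergePad()
}

FindName(name) {                      ; binary search of the sorted main list...
   global
   lo := 1, hi := MainCount
   While (lo <= hi) {
      mid := (lo + hi) // 2
      If (Main%mid% = name)
         Return mid
      If (Main%mid% < name)           ; alphabetic comparison for non-numeric strings
         lo := mid + 1
      Else
         hi := mid - 1
   }
   Loop, %PadCount%                   ; ...then a short linear scan of the pad
      If (Pad%A_Index% = name)
         Return "pad " A_Index
   Return ""                          ; not found
}

MergePad() {                          ; sort the pad, then merge it in from the high end
   global
   list := ""
   Loop, %PadCount%
      list .= (A_Index = 1 ? "" : "`n") Pad%A_Index%
   Sort, list
   i := 0
   Loop, Parse, list, `n
      i += 1, Pad%i% := A_LoopField
   i := MainCount, j := PadCount, k := MainCount + PadCount
   While (j > 0) {                    ; single backward pass over both lists
      If (i > 0 && Main%i% > Pad%j%)
         Main%k% := Main%i%, i -= 1
      Else
         Main%k% := Pad%j%, j -= 1
      k -= 1
   }
   MainCount += PadCount, PadCount := 0
}

A lookup costs a log(N) binary search of the main list plus a short linear scan of the pad, and the merge touches each element only once, much like the in-place merge discussed earlier.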

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
For math lovers:

The script in my previous post needs a P in [0,M-1] relatively prime to M (no common divisor), otherwise there would be repetitions in the sequence f(0), f(1), ..., f(M-1). We can avoid the random search loop with a suitable choice of P. P = 1 and P = M-1 are relatively prime to M, but they lead to overly simple sequences. P around M/2 would be better.

Prove that P = M//2 + 1 + (M&3=2) is relatively prime to M and is the closest such value to M/2.
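
Not a proof, but a quick brute-force check of the claim; this is just a sketch, and it repeats the gcd() function from the script above so it runs on its own:

bad := ""
Loop, 1000 {
   M := A_Index + 1                  ; test M = 2..1001
   P := M//2 + 1 + (M&3 = 2)
   If (gcd(P, M) != 1)
      bad .= M " "
}
MsgBox % bad = "" ? "gcd(P,M) = 1 held for every tested M" : "Failed for M: " bad
Return

gcd(x,y) {                           ; same gcd as in the earlier script
   If (x*y)
      Return gcd(y, mod(x,y))
   Return abs(x+y)
}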

majkinetor
  • Moderators
  • 4512 posts
  • Last active: May 20 2019 07:41 AM
  • Joined: 24 May 2006

You probably mean that binary-sorting an already sorted array is slower than it could be

Yeah, I was talking about sorted input: the very same example you already mentioned, A1, A2, etc...

I was talking about a new directive that would be there for testers, not for the public.

An interpreted language, like AHK should not be used for huge problems

I think AHK is good enough to be used for serious things, not only "click this, click that". I think I proved that with the Favmenu code. Check it out if you have time.

For math lovers:

Oh man... I loved math until I finished my math studies at university. Now.... heh... too many nerves lost because of it. I am sure it is because of how things are run at the University of Belgrade, but it doesn't matter for me now.... That is a story named "how to kill a true math lover" :(

David
  • Members
  • 25 posts
  • Last active: Mar 17 2008 06:37 PM
  • Joined: 08 Aug 2004
Great discussion of the sample-with-replacement problem. To sample without replacement, which eliminates the issue of duplicates: fill an n-element array with candidates (e.g., x(i)=i for i=1..n) and set m equal to the number of pseudo-random numbers you need from the array (m<=n). In a loop, select a random positive integer r where 1<=r<=m, display x(r), and decrement m. The loop ends when m numbers have been selected.

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005
Two problems: if r is not unique, you get repetitions (so you still need a way to generate unique random indices), and for large n you need huge arrays.

For moderate n values (like 16-bit numbers) any random permutation of x1..xn would work. The simplest algorithm generates n random numbers (r) in a loop and swaps x(A_Index) and x(r), after the array x has been filled with the desired values, like 0..n-1.
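
In AHK terms, that swap loop might look something like this (a rough sketch over a pseudo-array x1..xn; it implements the swap idea described above rather than any particular textbook shuffle):

n := 10
Loop, %n%
   x%A_Index% := A_Index - 1         ; fill x1..xn with the values 0..n-1
Loop, %n% {
   i := A_Index
   Random, r, 1, %n%                 ; pick a random index to swap with
   t := x%i%, x%i% := x%r%, x%r% := t
}
out := ""
Loop, %n%
   out .= x%A_Index% " "
MsgBox %out%                         ; a permutation of 0..n-1, no repetitions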

David
  • Members
  • 25 posts
  • Last active: Mar 17 2008 06:37 PM
  • Joined: 08 Aug 2004
In your second point, you are correct -- a large array is required to hold a large candidate pool, although with RAM sizes today this may or may not be an issue. The uniqueness of r isn't an issue, since even a constantly-repeating value of r will generate unique values from x().

Laszlo
  • Moderators
  • 4713 posts
  • Last active: Mar 31 2012 03:17 AM
  • Joined: 14 Feb 2005

constantly-repeating value of r will generate unique values from x().

I did not get it. Could you post some code?

jetskijoe
  • Members
  • 7 posts
  • Last active: Jun 11 2007 06:28 PM
  • Joined: 02 Jun 2006
I found this post. My problem is that it seems to run slowly; I was wondering if anyone has any way to make it faster:

totalitem = 200000
startitem = 100000
MAX := startitem + totalitem
Loop, 100000 {
    Loop {                                  ; retry until a value not already in the list comes up
        Random, itemno, %startitem%, %MAX%
        If itemno not in %itemList%
            Break
    }
    itemList = %itemList%,%itemno%          ; the list keeps growing, so each "not in" check gets slower
    FileAppend, `nItem = %itemno%, items.txt
}

Any help would be great. Is there anything that I can add to the top of the script to make it run faster, or any way to speed up all my scripts?

Thanks
Joe