Jump to content


Photo

Fast 64- and 128-bit hash functions


  • Please log in to reply
6 replies to this topic

#1 Laszlo

Laszlo
  • Fellows
  • 4713 posts

Posted 01 January 2007 - 06:59 PM

Here are two of hash functions, which
- are programmed fully in AHK
- are much faster than cryptographic hash functions
- provide long enough hash values (64 or 128 bits), so a collision is very unlikely

They are modeled after LFSR based hash functions, like CRC-32. LFSR stands for Linear Feedback Shift Register. It can be programmed with a CPU register, holding an integer. This register is shifted to the left at processing the next bit of the input stream. The input bit is written to the now empty position of the register. The leftmost bit of the shifted register is removed. If it is 1, a certain integer is XOR-ed to the register. We repeat this until all the input bits are processed, when the final content of the register is returned as the hash value.

This procedure can be accelerated if more bits are processed at once. E.g. the register is shifted to the left by 8 bits, the whole next input byte is written to the end of the register, and one of 256 numbers is XOR-ed to the register, indexed by the shifted out byte. The presented 64-bit hash function works this way, with the table filled up with pseudorandom entries.

We can replace the XOR with another arithmetic function, like subtraction. Using the same constant table we get very different hash values. Concatenating this to the XOR version gives 128-bit hash values (32 hex digits).

The global pseudorandom table is initialized only at the first call, the values are re-used until the script is terminated. It can be set up using the built in random number generator, if an explicit seed is provided (otherwise hash values will be different at each run of the script). If hash values are used as integrity check (as CRC-32), we want them to be the same on different computers, regardless if the hash functions are programmed in AHK or any other language. Thus, we have to explicitly program the pseudorandom number generator. One really good and simple one calls the TEA cipher recursively, with an arbitrary key. It requires only a handful of AHK commands. The speed is not critical, because the table is created with 256 calls of TEA, only once at the first call, finishing in less than a second.

This hash functions are very fast, requiring only a handful operations for each input character. On my 2 GHz Centrino laptop I got the following results:

64-bit hash of 2.1 MB string = 10.8 sec
128-bit hash of 2.1 MB string = 21 sec

100,000 calls of 64-bit hash operations on 12 character strings: 8.25 sec
100,000 calls of 64-bit hash operations on 30 character strings: 19.25 sec
MsgBox % L64Hash("12345678")
MsgBox % L128Hash("12345678")

L64Hash(x) {                        ; 64-bit generalized LFSR hash of string x
   Local i, R = 0
   LHashInit()                      ; 1st time set LHASH0..LHAS256 global table
   Loop Parse, x
   {
      i := (R >> 56) & 255          ; dynamic vars are global
      R := (R << 8) + Asc(A_LoopField) ^ LHASH%i%
   }
   Return Hex8(R>>32) . Hex8(R)
}

L128Hash(x) {                       ; 128-bit generalized LFSR hash of string x
   Local i, S = 0, R = -1
   LHashInit()                      ; 1st time set LHASH0..LHAS256 global table
   Loop Parse, x
   {
      i := (R >> 56) & 255          ; dynamic vars are global
      R := (R << 8) + Asc(A_LoopField) ^ LHASH%i%
      i := (S >> 56) & 255
      S := (S << 8) + Asc(A_LoopField) - LHASH%i%
   }
   Return Hex8(R>>32) . Hex8(R) . Hex8(S>>32) . Hex8(S)
}

Hex8(i) {                           ; integer -> LS 8 hex digits
   SetFormat Integer, Hex
   i:= 0x100000000 | i & 0xFFFFFFFF ; mask LS word, set bit32 for leading 0's --> hex
   SetFormat Integer, D
   Return SubStr(i,-7)              ; 8 LS digits = 32 unsigned bits
}

LHashInit() {                       ; build pseudorandom substitution table
   Local i, u = 0, v = 0
   If LHASH0=
      Loop 256 {
         i := A_Index - 1
         TEA(u,v, 1,22,333,4444, 8) ; <- to be portable, no Random()
         LHASH%i% := (u<<32) | v
      }
}
                                    ; [y,z] = 64-bit I/0 block, [k0,k1,k2,k3] = 128-bit key
TEA(ByRef y,ByRef z, k0,k1,k2,k3, n = 32) { ; n = #Rounds
   s := 0, d := 0x9E3779B9
   Loop %n% {                       ; standard = 32, 8 for speed
      k := "k" . s & 3              ; indexing the key
      y := 0xFFFFFFFF & (y + ((z << 4 ^ z >> 5) + z  ^  s + %k%))
      s := 0xFFFFFFFF & (s + d)     ; simulate 32 bit operations
      k := "k" . s >> 11 & 3
      z := 0xFFFFFFFF & (z + ((y << 4 ^ y >> 5) + y  ^  s + %k%))
   }
}


#2 Chris

Chris
  • Administrators
  • 10727 posts

Posted 02 January 2007 - 02:04 PM

Nice! I'm sure there are many uses for this, but it might bring more attention to this topic to provide a real-world application/example (for those unfamiliar with hashing).

#3 PhiLho

PhiLho
  • Fellows
  • 6850 posts

Posted 02 January 2007 - 03:04 PM

I understand you used TEA as random number generator for portability, which is nice as I might reuse this in other languages.
But perhaps you can show a version using the Random command. It is of high quality too in AutoHotkey, since it uses the Mersenne Twister (MT19937) algorithm.
Real-world example of use is in [TIPS & TRICKS] Poor man's associative array.

#4 Laszlo

Laszlo
  • Fellows
  • 4713 posts

Posted 02 January 2007 - 07:26 PM

Here are the functions hashing *binary* variables, like the clipboard or arbitrary files read into AHK variables. We have to use ByRef parameters, otherwise the input is truncated at the first NUL character. The length of data is needed (default to the allocated memory for the variable), because the StrLen function does not work with binary data.
MsgBox % B128Hash(x:=12345678,StrLen(x))

B64Hash(ByRef x, n=0) {             ; 64-bit generalized LFSR hash of n bytes of binary x
   Local i, p, R = 0
   LHashInit()                      ; 1st time setup LHASH0..LHAS256 global table
   p := &x
   Loop % n < 1 or n > VarSetCapacity(x) ? VarSetCapacity(x) : n
   {
      i := (R >> 56) & 255          ; dynamic vars are global
      R := (R << 8) + *p++ ^ LHASH%i%
   }
   Return Hex8(R>>32) . Hex8(R)
}

B128Hash(ByRef x, n=0) {            ; 128-bit generalized LFSR hash of n bytes of binary x
   Local i, p, S = 0, R = -1
   LHashInit()                      ; 1st time setup LHASH0..LHAS256 global table
   p := &x
   Loop % n < 1 or n > VarSetCapacity(x) ? VarSetCapacity(x) : n
   {
      i := (R >> 56) & 255          ; dynamic vars are global
      R := (R << 8) +  *p  ^ LHASH%i%
      i := (S >> 56) & 255
      S := (S << 8) + *p++ - LHASH%i%
   }
   Return Hex8(R>>32) . Hex8(R) . Hex8(S>>32) . Hex8(S)
}
; … Hex8(), LHashInit(),TEA()
In reasonable running time several MB can be hashed. Even a 30MB file can be easily read into an AHK variable. Larger files would take too long to hash, so a file hash function, which reads and processes one block of data at a time, makes no sense.

One application of hashing is the simulated associative arrays, mentioned above. Another one is in websites providing checksums with files, offered for download. This is customarily done with the cryptographic hash function, MD5. Although MD5 is broken, it is still much more secure than our simple hash functions. (It means, it is very-very hard for an attacker to write a malicious program with the given MD5 hash, so user would trust and use it, instead of the original.)

However, we can improve the security of the LHASH family: publish the creation time and the size of the file next to its hash, and modify the data before hashing it. This modification can be pre-pending the time/size information to the data, XOR this time/size to a constant number and append it to the data. Finding collisions in this special form (fix length, given beginning and end of the data) is much harder than without this restrictions.

#5 Laszlo

Laszlo
  • Fellows
  • 4713 posts

Posted 02 January 2007 - 08:08 PM

you can show a version using the Random command. It is of high quality too in AutoHotkey, since it uses the Mersenne Twister (MT19937) algorithm.

We have to be very careful about the quality claims of random number generators. It does not matter if their cycle length is 2**80 or 2**600 or even 2**19937. We will never be able to generate that many random numbers when they start repeating. Another claim of the MT19937 generator is the lack of simple correlations in low dimensional spaces. It is only important in very special applications. However, having seen several hundred generated random numbers one can reconstruct the content of the internal tables, and be able to predict all future generated numbers. It is bad for applications even remotely related to security.

Another problem of the random number generator MT19937 is that it is initialized by a single 32 bit number. It means, there are only 2**32 different random number sequences it can generate. In a fast computer you can try them all, and seeing just a few output values you can find the seed, and thus the whole sequence it will generate.

There are other problems, too. In our case, we only need 256 "uncorrelated" numbers. The MT19937 generator builds a large table with very simple linear congruential expressions (x[i+1] := 69069 * x[i] + 1, and later combines these entries to update the table and provide the output. In the beginning the generated numbers are simple combinations of the results of this simple recursion, therefore they are related by straightforward formulas. This does not ruin the good statistical properties of the generator, which considers millions of generated numbers, among them the first few thousand does not matter, but it is a bad idea to use such generators for a few hundred numbers, only.

#6 PhiLho

PhiLho
  • Fellows
  • 6850 posts

Posted 02 January 2007 - 09:00 PM

The length of data is needed (default to the allocated memory for the variable), because the StrLen function does not work with binary data.

VarSetCapacity works, though (it gives the size of the allocated data, not of the data, of course.

In reasonable running time several MB can be hashed. Even a 30MB file can be easily read into an AHK variable.

I used this (in Java with MD5) for make a quick hack to find and eliminate identical files (Gif icons) with different names.

#7 Laszlo

Laszlo
  • Fellows
  • 4713 posts

Posted 02 January 2007 - 09:22 PM

VarSetCapacity works, though (it gives the size of the allocated data, not [the length] of the data, of course.

This is what I meant by "default to the allocated memory", and it was implemented with VarSetCapacity calls.

to find and eliminate identical files (Gif icons) with different names.

This is another good application: it is very rare, that different objects give the same hash. Collect the hash values of the objects in a variable, sort it, and a linear scan comparing neighboring hash values find equal ones. Only the objects with equal hash values (and equal size) have to be compared byte-by-byte.