Collision Risk of Fast Rolling Hash

Fast Rolling Hash

When implement Rabin–Karp rolling hashing algorithms, it is important to pick the right power base p and modulo m in the following formula:

Hash(X, n) = (Hash(X, n − 1) * p + Xnmod m

On one side, we want the period length of the $ p^n mod m $ as close to m as possible, while on the other side we want it to be computational efficient.

Usually we pick m as a big prime while p be the primitive element of the finite field GF(m). Despite it can maximize the period length, it usually require division computation.

Another popular choice (fast rolling hash) is to use 2n as the modulo number while picking an odd prime as p that making sure 2n and p is co-prime. Please note that this will limit the maximal period length to (2n)/4, but the module computation can be replaced by integer truncation.

Hash Collision

In general, the hash collision rate should be low if we choose the p and m wisely, and the collision rate will be reduced if you create multiple hash functions with different p. However, I just discovered a magic sequence X that can cause “Hash collision” easily, if you chose the fast rolling hash implementation.

For that magic sequence, if you pick any m = 2n as the modulo number and any p that co-prime with the m, the collision WILL HAPPEN, no matter how many hash functions you use. Please check out Collision Example for details, and feel free to try it out yourself.