If I have a list of key values from 1 to 100 and I want to organize them in an array of 11 buckets, I've been taught to use the mod function

$$H = k \bmod 11$$

Now the values will be placed one after another, 9 or 10 per bucket. For example, the first bucket (index 0) will hold $11, 22, 33, \dots$, the second will hold $1, 12, 23, \dots$, and so on.

Let's say I decided to be a bad boy and use a non-prime modulus instead, say 12. Using the hash function

$$H = k \bmod 12$$

would result in a hash table with $12, 24, 36, \dots$ in the first bucket, $1, 13, 25, \dots$ in the second, and so on.

Essentially the two tables are the same thing. I didn't reduce collisions and I didn't spread things out any better by using the prime modulus, so I can't see how it is ever beneficial.
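
A quick sketch in Python (added for illustration, not part of the original question) confirms this for uniformly distributed keys:

from collections import Counter

keys = range(1, 101)  # keys 1..100, uniformly spread

for m in (11, 12):
    buckets = Counter(k % m for k in keys)
    print(f"m = {m}: bucket sizes = {[buckets[b] for b in range(m)]}")

# Both moduli give buckets of 8-10 keys each; for uniform keys,
# the prime modulus offers no visible advantage.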

  • Relevant question: why we use xor in hash functions: stackoverflow.com/questions/5889238/… – KRoy
  • Same question on Stack Overflow: Why should hash functions use a prime number modulus?
  • You are right that there is no difference between a prime and a non-prime modulus when the keys are uniformly distributed. But in some cases a small prime is worse than a big non-prime even for uniform keys: for instance, try 5 (prime) against 12 (non-prime) with your example and compare. So the best approach is to pick a big number. When the keys (12, 24, 36, 3, 15, 27, ...) are multiples of a factor (3) of the modulus (12), a prime modulus is better, since it has no factor in common with them.

7 Answers

Answer (score: 128)

Consider the set of keys $K=\{0,1,...,100\}$ and a hash table where the number of buckets is $m=12$. Since $3$ is a factor of $12$, the keys that are multiples of $3$ will be hashed to buckets that are multiples of $3$:

  • Keys $\{0,12,24,36,...\}$ will be hashed to bucket $0$.
  • Keys $\{3,15,27,39,...\}$ will be hashed to bucket $3$.
  • Keys $\{6,18,30,42,...\}$ will be hashed to bucket $6$.
  • Keys $\{9,21,33,45,...\}$ will be hashed to bucket $9$.

If $K$ is uniformly distributed (i.e., every key in $K$ is equally likely to occur), then the choice of $m$ is not so critical. But, what happens if $K$ is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of $3$. In this case, all of the buckets that are not multiples of $3$ will be empty with high probability (which is really bad in terms of hash table performance).

This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of $4$. Needless to say, choosing $m$ to be a multiple of $4$ would be a terrible choice: you would have $3m/4$ buckets completely empty, and all of your keys colliding in the remaining $m/4$ buckets.

In general:

Every key in $K$ that shares a common factor with the number of buckets $m$ will be hashed to a bucket that is a multiple of this factor.

Therefore, to minimize collisions, it is important to reduce the number of common factors between $m$ and the elements of $K$. How can this be achieved? By choosing $m$ to be a number that has very few factors: a prime number.
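
To see the rule in action, here is a small Python check (added for illustration, echoing the word-size example above): word-aligned addresses, i.e. multiples of $4$, hashed with a composite and with a prime number of buckets:

from collections import Counter

keys = [4 * i for i in range(100)]  # word-aligned addresses: multiples of 4

for m in (12, 11):
    used = Counter(k % m for k in keys)
    print(f"m = {m}: {len(used)} of {m} buckets used")

# m = 12 shares the factor gcd(4, 12) = 4 with every key, so only the
# buckets {0, 4, 8} are ever used; m = 11 (prime) uses all 11 buckets.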

  • I just saw that my query is in alignment with your answer. Do you think the hash function in my query holds good?
  • @overexchange: I replied to your question. This answer may also be of interest to you.
  • Why does the choice of $m$ matter only if $K$ is skewed? Isn't it true that we'll have worse performance with a bad $m$ even if $K$ is uniformly distributed? – vorou
  • It depends on what you mean by "bad $m$". If you mean "small compared to the number of elements in the hash table" (i.e., a high load factor), then performance will be poor. However, if you mean "not prime", then this is not so important when all keys are equally likely, because they will be distributed evenly across the hash table. The question itself provides an example.
  • In most cases it would be sufficient if the modulus had no prime factors smaller than 13. The larger the smallest prime factor of the modulus, the smaller the chance that keys sharing a common factor with the modulus are unexpectedly common in the key population. Most of the time it is overkill to insist that the modulus be prime. You may be interested in this article: linkedin.com/pulse/… – WaltK
Answer (score: 19)

Whether a collision is less likely using primes depends on the distribution of your keys.

If many of your keys have the form $a + k \cdot b$ and your hash function is $H(n) = n \bmod m$, then these keys go to a small subset of the buckets whenever $b$ and $m$ share a common factor greater than $1$ (they fill only $m/\gcd(b,m)$ buckets). So you should minimize the number of such $b$, which can be achieved by choosing $m$ prime.

If, on the other hand, you would like to have about 11 or 12 buckets and you know that differences that are multiples of $11$ are more likely than differences that are multiples of $2$ or $3$, you may choose $12$ for your very special application.
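
To quantify that "small subset" (added for illustration): keys of the form $a + k \cdot b$ reach exactly $m/\gcd(b, m)$ distinct buckets:

from math import gcd

def buckets_hit(a, b, m, n=1000):
    # distinct buckets reached by keys a, a+b, a+2b, ... under k mod m
    return len({(a + k * b) % m for k in range(n)})

for b, m in [(4, 12), (6, 12), (4, 11)]:
    print(f"b = {b}, m = {m}: {buckets_hit(0, b, m)} buckets, m/gcd(b, m) = {m // gcd(b, m)}")

# A prime m gives gcd(b, m) = 1 for every b not divisible by m,
# so all buckets get used.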

  • But if my keys do not have the form $a + k \cdot b$, then $m$ doesn't matter? Is that right?
  • @lmray, if your keys are uniformly distributed, $m$ doesn't matter. If they are not, whether $m$ matters will depend on the precise distribution.
  • Just reverted the last edit; I forgot that $12 > 11$. – frafl
  • Did you mean "go to a small subset of the buckets iff $b$ divides $m$"?
Answer (score: 11)

Whether this has an impact (also) depends on how you treat collisions. With some variants of open addressing, using a prime table size guarantees that an empty slot is found as long as the table is sufficiently empty.

Try to show the following, for instance:

Assume we want to insert an element that hashes to address $a$ and resolve collisions by trying positions $a + i^2$ subsequently for $i=1,2,\dots$.

Show that this procedure always yields an empty position if the hash table is of size $p$, $p$ a prime larger than $3$, and at least half of all positions are free.

Hint: Use the fact that the residue class ring modulo $p$ is a field if $p$ is prime and therefore $i^2=c$ has at most $2$ solutions.
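
A small numerical check of this exercise (added for illustration):

def quadratic_probes(a, p):
    # distinct slots visited by probing a + i^2 mod p, for i = 0..p-1
    return {(a + i * i) % p for i in range(p)}

for p in (5, 7, 11, 13, 101):
    visited = len(quadratic_probes(0, p))
    assert visited >= (p + 1) // 2  # at least half the table is reachable
    print(f"p = {p}: quadratic probing reaches {visited} of {p} slots")

# For an odd prime p there are exactly (p + 1) / 2 distinct values of i^2 mod p,
# so if more than half of the positions are free, one of them must be reachable.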

Answer (score: 5)

First of all, the question is phrased incorrectly. The following are equivalent and correct expressions of the intended question:

  • why must we use a prime number as the modulus of the hash value (not "in the hashing function")

or equivalently and more succinctly:

  • why must the size of a hash table be a prime number?

The proper answer to this question lies in the context, and that context is open addressing with double hashing collision resolution.

Where you compute:

hashval1 = hashfcn1(k) // for initial location
hashval2 = hashfcn2(k) // probing increment

hashval1 is used for the initial insertion probe. If that location is empty, then you insert the (k, v) and you’re done, never using hashval2. If that location is occupied, then you jump hashval2 addresses to (hashval1 + hashval2) % arraysize, and repeat to see if the location is empty. If that location is again not empty, then you jump another hashval2 addresses to (hashval1 + 2 * hashval2) % arraysize. And so on.

Here’s the key: HOW MANY TIMES DO YOU PROBE BEFORE GIVING UP?

If every element of the array is potentially a target, then the answer is: you probe n times. This is easy to visualize with linear probing: you write a for (i = 0; i < n; ++i) loop that breaks early if an empty slot is found. It's no different with double hashing, except that the probe increment is hashval2 rather than 1. But this leads to a problem: you're not guaranteed to probe every slot in the array when hashval2 > 1. Simple example: array size = 10 and hashval2 = 2; you only ever probe the even (or only the odd) slots, and in n probes you end up probing each of them twice, which is pointless. Even if you make the array size an odd number like 15, if the probe increment is 5 then the probe sequence (taking hashval1 = 0) is: 0, 5, 10, 0, 5, 10, 0, 5, 10, 0, 5, 10, 0, 5, 10. You keep probing the same three occupied slots over and over.

What's the problem here? The hash table size and hashval2 share a common factor, in this case 5. So we need the table size and hashval2 to share no factors. Since hashval2 is a function of k and, after the modulo, lies in the range 0 < hashval2 < array size, there's nothing we can do about hashval2. That leaves the array size: in double hashing it must be guaranteed to share no factor with any possible hashval2. By simple math, that means the array size must be a prime number.

Example: Array size = 11 (a prime number). hashval1 = 1.

hashval2 = 1; // 1,2,3,4,5,6,7,8,9,10,0 <-- probing sequence
hashval2 = 2; // 1,3,5,7,9,0,2,4,6,8,10
hashval2 = 3; // 1,4,7,10,2,5,8,0,3,6,9
hashval2 = 4; // 1,5,9,2,6,10,3,7,0,4,8
hashval2 = 5; // 1,6,0,5,10,4,9,3,8,2,7
hashval2 = 6; // 1,7,2,8,3,9,4,10,5,0,6
hashval2 = 7; // 1,8,4,0,7,3,10,6,2,9,5
hashval2 = 8; // 1,9,6,3,0,8,5,2,10,7,4
hashval2 = 9; // 1,10,8,6,4,2,0,9,7,5,3
hashval2 = 10;// 1,0,10,9,8,7,6,5,4,3,2

You probe at most the number of slots in the hash table; i.e., every slot has an equal chance of hosting a value. And since the probe increment is itself a hash value, it is uniformly distributed over the size of the hash table, which eliminates the clustering you get with other collision-resolution algorithms.

TL;DR: a hash table size that is a prime number is the ONLY way to guarantee that you do not accidentally re-probe a previously probed location.
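
A minimal sketch of this guarantee (added for illustration; the probe function mirrors the description above):

def probe_sequence(h1, h2, size):
    # slots visited by double hashing: (h1 + i * h2) % size, for i = 0..size-1
    return [(h1 + i * h2) % size for i in range(size)]

for size in (10, 11):
    covers_all = all(
        len(set(probe_sequence(1, h2, size))) == size
        for h2 in range(1, size)
    )
    print(f"size = {size}: every increment covers all slots? {covers_all}")

# size = 10 prints False (h2 = 2 or 5 keeps revisiting the same slots);
# size = 11, a prime, prints True: every increment permutes all 11 slots.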

Answer (score: 2)

If your hash function is of the form $h(k) = (a \cdot k) \bmod m$, where $m$ is prime and $a$ is chosen at random, then the probability that two distinct keys (not congruent modulo $m$) hash to the same bucket is $\frac{1}{m}$. So for $m = 1009$, $\Pr\{h(x) = h(y),\, x \neq y\} = 0.00099108027$, which is very small.

This scheme is known as universal hashing.
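
For reference, a sketch of the standard universal family $h_{a,b}(k) = ((a k + b) \bmod p) \bmod m$ (added for illustration; the prime $p$ and the keys below are arbitrary choices):

import random

def make_universal_hash(m, p=2**31 - 1):
    # p must be a prime larger than any key; 2^31 - 1 is prime
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(1009)
print(h(12345), h(67890))  # distinct keys collide with probability <= 1/1009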

Answer (score: 1)

Just to complement Mario Cervera's answer.

Let's consider a hash function given by hash(k) = k mod 12, so our hash table has 12 buckets.

First, we need to understand how collisions occur in hash tables. Two values with keys $k_1$ and $k_2$ collide when hash(k_1) = hash(k_2), that is, when $k_1 \bmod 12 = k_2 \bmod 12$.

Now imagine that we have a list of values to add to our hash table, with the following keys respectively: $ [ 0, 1, 2, \dots, 100]. $

The following plot shows the size of each bucket in the hash table after adding the above values.

[Plot: bucket sizes for keys 0 through 100 under mod 12; every bucket holds 8 or 9 keys.]

As can be seen, the bucket sizes are fairly even across the hash table. In practice, however, it is seldom the case that the data is so neatly distributed.

It is more common, for example, that some values are more likely to occur than others. This happens if the keys are strings, say, and the hash function just adds up the ASCII codes of the characters and then takes the result modulo the table size.

In any case, let's consider the extreme case where the set of keys consists only of multiples of 3.

So we construct the list as follows:

[0, 3, 6, 9, 12, 15, 18, ...]

Now we fill the hash table again, using the same hash function as before:

[Plot: bucket sizes for multiples of 3 under mod 12; only buckets 0, 3, 6, and 9 are non-empty.]

As you can see, the result is much less balanced: all the keys fall into buckets 0, 3, 6, and 9.

To see why, suppose we only have multiples of 3 as keys, and observe how they are assigned to buckets:

Keys:     0  3  6  9  12 15 18 ...
          ↓  ↓  ↓  ↓  ↓  ↓  ↓ 
Bucket:   0  3  6  9  0  3  6  ...

Which explains the uneven buckets above.

Now let's try a table size of 10 buckets with the same unbalanced data.

[Plot: bucket sizes for multiples of 3 under mod 10; the keys spread evenly across all 10 buckets.]

It is again well balanced. And 10 is not a prime. So what's going on?

For the unbalanced data, the main difference between table sizes 10 and 12 is that with 10 buckets the multiples of 3 no longer line up with a fixed subset of buckets:

Keys:     0  3  6  9  12 15 18 ...
          ↓  ↓  ↓  ↓  ↓  ↓  ↓ 
Bucket:   0  3  6  9  2  5  8  ...

Which explains the nicely balanced result above.

So the main issue is that 12 (the first table size) is a multiple of 3.

So the general rule is: since we know that the data might be biased, we want to decrease the chance that it is biased towards numbers that divide the bucket size. And which numbers are famous for having few divisors? The prime numbers!
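
The bucket counts behind these plots can be reproduced in a few lines of Python (added for illustration):

from collections import Counter

keys = [3 * i for i in range(34)]  # multiples of 3: 0, 3, ..., 99

for m in (12, 10, 11):
    counts = Counter(k % m for k in keys)
    print(f"m = {m}: {[counts.get(b, 0) for b in range(m)]}")

# m = 12 piles all 34 keys into buckets 0, 3, 6, and 9;
# m = 10 and m = 11 (prime) spread the same keys evenly.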

Answer (score: -2)

I think there is a short answer using probability. We use % (modulo) in hashing to compute the remainder of a division (our data % a chosen number). If our modulus is a prime number, it is more likely that the division leaves a remainder, and fewer keys share a factor with the divisor, so fewer collisions happen.
