16

I feel like I'm way overthinking this problem, but here goes anyway...

I have a hash table with M slots in its internal array. I need to insert N elements into the hash table. Assuming that I have a hash function that randomly inserts am element into a slot with equal probability for each slot, what's the expected value of the total number of hash collisions?

(Sorry that this is more of a math question than a programming question).

Edit: Here's some code I have to simulate it using Python. I'm getting numerical answers, but having trouble generalizing it to a formula and explaining it.

import random
import pdb

N = 5
M = 8

NUM_ITER = 100000

def get_collisions(table):
    col = 0
    for item in table:
        if item > 1:
            col += (item-1)
    return col

def run():
    table = [0 for x in range(M)]

    for i in range(N):
        table[int(random.random() * M)] += 1

    #print table
    return get_collisions(table)

# Main

total = 0
for i in range(NUM_ITER):
    total += run()

print float(total)/NUM_ITER
3
  • how do you want "triplet" collisions measured ? Commented Feb 1, 2012 at 22:47
  • Whatever makes the most sense I guess. So I'll go with counting it as two collisions (one per new element added after the first)
    – numegil
    Commented Feb 1, 2012 at 22:48
  • The best measure appears to be the amount of work to retrieve all items, which is SUM(x * (x+1) /2) with X is the number of items in a bucket, and the sum is over all buckets. Commented Feb 1, 2012 at 22:55

2 Answers 2

25

You'll find the answer here: Quora.com. The expected number of collisions for m buckets and n inserts is

n - m * (1 - ((m-1)/m)^n).

5
  • 3
    There is also a proof of this on the Math StackExchange. Commented Dec 6, 2013 at 9:14
  • 2
    Answer should include the proof.
    – MVTC
    Commented Oct 21, 2014 at 3:23
  • Is there a table available for generic m values (such as 2^32) ?
    – Cyan
    Commented Apr 23, 2015 at 9:41
  • 4
    IMHO, number of collision is not same as number of elements sharing same bucket/slot. In the context of B'day paradox, if 4 persons share same B'day, then answer to the latter question (# of person sharing same B'day) would be 4. However, for the former question, the # of B'day collision is often considered to be 4-1=3. The rationale behind this is that, without any three of the four persons, there is no collision. The difference is minor and yet worth noting so as to not get confused.
    – KGhatak
    Commented Jul 31, 2016 at 14:44
  • Is there a way to show the variance of the number of collisions?
    – zyxue
    Commented Mar 11, 2020 at 21:22
0

The formula for the SUM(x*(x+1)/2) metric can be found here, and the expected value appears to be (n/2m)* (n+2m -1).

Don't know about the variance, IANAM.

Not the answer you're looking for? Browse other questions tagged or ask your own question.