Expected number of hash collisions

Question

I feel like I'm way overthinking this problem, but here goes anyway...

I have a hash table with M slots in its internal array. I need to insert N elements into the hash table. Assuming that I have a hash function that randomly inserts am element into a slot with equal probability for each slot, what's the expected value of the total number of hash collisions?

(Sorry that this is more of a math question than a programming question).

Edit: Here's some code I have to simulate it using Python. I'm getting numerical answers, but having trouble generalizing it to a formula and explaining it.

import random
import pdb

N = 5
M = 8

NUM_ITER = 100000

def get_collisions(table):
    col = 0
    for item in table:
        if item > 1:
            col += (item-1)
    return col

def run():
    table = [0 for x in range(M)]

    for i in range(N):
        table[int(random.random() * M)] += 1

    #print table
    return get_collisions(table)

# Main

total = 0
for i in range(NUM_ITER):
    total += run()

print float(total)/NUM_ITER

Whatever makes the most sense I guess. So I'll go with counting it as two collisions (one per new element added after the first) — numegil, Commented Feb 1, 2012 at 22:48
The best measure appears to be the amount of work to retrieve all items, which is SUM(x * (x+1) /2) with X is the number of items in a bucket, and the sum is over all buckets. — wildplasser, Commented Feb 1, 2012 at 22:55

Felix Gessert · Accepted Answer · 2012-07-06 12:16:17Z

25

You'll find the answer here: Quora.com. The expected number of collisions for m buckets and n inserts is

n - m * (1 - ((m-1)/m)^n).

answered Jul 6, 2012 at 12:16

Felix Gessert

2823 silver badges3 bronze badges

3

There is also a proof of this on the Math StackExchange.
– ShreevatsaR
Commented Dec 6, 2013 at 9:14
2

Answer should include the proof.
– MVTC
Commented Oct 21, 2014 at 3:23
Is there a table available for generic m values (such as 2^32) ?
– Cyan
Commented Apr 23, 2015 at 9:41
4

IMHO, number of collision is not same as number of elements sharing same bucket/slot. In the context of B'day paradox, if 4 persons share same B'day, then answer to the latter question (# of person sharing same B'day) would be 4. However, for the former question, the # of B'day collision is often considered to be 4-1=3. The rationale behind this is that, without any three of the four persons, there is no collision. The difference is minor and yet worth noting so as to not get confused.
– KGhatak
Commented Jul 31, 2016 at 14:44
Is there a way to show the variance of the number of collisions?
– zyxue
Commented Mar 11, 2020 at 21:22

Add a comment |

Vivek Jain · Accepted Answer · 2015-09-28 10:05:19Z

0

The formula for the SUM(x*(x+1)/2) metric can be found here, and the expected value appears to be (n/2m)* (n+2m -1).

Don't know about the variance, IANAM.

edited Sep 28, 2015 at 10:05

Vivek Jain

3,8816 gold badges31 silver badges47 bronze badges

answered Feb 1, 2012 at 22:58

wildplasser

43.9k9 gold badges69 silver badges115 bronze badges

Add a comment |

Collectives™ on Stack Overflow

Expected number of hash collisions

2 Answers 2

Not the answer you're looking for? Browse other questions tagged
python
hash
collision
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Not the answer you're looking for? Browse other questions tagged pythonhashcollision or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
python
hash
collision
or ask your own question.