Approximating a probability threshold for hash collisions

Question

First off: I know this is on the verge of being subjective and i'll try my best to make it as specific as possible.

I am in a situation where I need to generate a fixed number of hashes (k), say 10,000. My goal is to choose a small as possible number space for those hashes (N) while keeping collision probability below a fixed threshold (p).

The collision probability can be approximated by: $p\left(k,N\right)=1-e^{\frac{-k\left(k-1\right)}{2N}}$

The key question now is: How low is low enough for my collision threshold? I've tried numerous things to come up with a number:

Comparing various unlikely events like winning the Powerball, meteorite hits, lightning strikes, etc. (While doing this I found out that China is going to be hit by a meteorite in 2027)
Looking into the failure rates of technical items like uncorrectable bit errors in hard drives
Read a lot of Stack, Mathematics and CS posts about the topic which all explain the birthday problem and show me why I have a problem but are unfortunately not helping to solve it.

Yet nothing seems to do the trick for me. The system I am implementing relies on the hashes generated and collision would corrupt the functionality of the system at its core. Unfortunately perfect hashing is not an option for reasons not relevant here.

Does anyone have an Idea how I can tackle this problem and come up with a solid, profound probability threshold that is not just made up from the top of my head?

mihaild · Accepted Answer · 2019-04-26 20:19:24Z

1

It all depends on cost of increasing space of hashes and cost of collision (and number of trials you'll have, below I'll assume it's $1$) - there is no universal answer. If you, say, can make a chip at cost of one dollar with space of $10^9$ hashes, chip at cost of two dollars with space of $10^{18}$ or chip at cost of three with space of $10^{30}$, then the first is better if cost of failure is less then 10 dollars, the second if it is from 10 dollars to 10 billion dollars, and the third is better if cost of failure is greater then 10 billion dollars.

answered Apr 26, 2019 at 20:19

mihaild

15.9k1 gold badge22 silver badges37 bronze badges

$\begingroup$ This seems to bee a good place to start. But system failure cost is depend of the upfront invest in the chip. In my case: System failure cost increases linear with the number of hashes produced. Yet you don't want collisions. This seems to be a bit paradox. But I am very restricted when it comes to storage, so with an increase in storage the system failure probability decreases but the failure cost increases. $\endgroup$
– Lennart
Commented Apr 29, 2019 at 5:53
$\begingroup$ Well, then you need to somehow describe set of all possible chips you can use, for each calculate probability and cost of failure and optimize $(\text{cost of chip}) - (\text{cost of failure} $\cdot$ \text{probability of failure})$. If you have a specific model for this prices and costs and need help with optimization - feel free to ask new question about it. Otherwise I am afraid your problem is too general to give a useful general advice. $\endgroup$
– mihaild
Commented Apr 29, 2019 at 9:58

Add a comment |

Stack Exchange Network

Approximating a probability threshold for hash collisions

1 Answer 1

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
probability
hash-function
collision-detection
.

Hot Network Questions

Approximating a probability threshold for hash collisions

1 Answer 1

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged probabilityhash-functioncollision-detection.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
probability
hash-function
collision-detection
.