4
$\begingroup$

I'm trying to extend the birthday problem to detect collision probability in a hashing scheme. Here is my problem. I use the letters and numbers [A-Z][a-z][0-9] to make a set of keys by randomly choosing from that set 10 times and concatenating them all together. So A24cv30kf2 could be one key for example. This implies that the base set has $62^{10}$ possible keys. Next, I want to compute the probability that I would see a collision if I picked a key from that set and issued it to $n$ users. If $M$ represents the size of the set (= $62^{10}$) and $P(M,n)$ the probability then

$$P(M,n) = \frac{M!}{M^{n}(M-n)!}$$

This follows by replacing 365 in the well known birthday problem shown here

What I want to understand is the following: I want to check and see if my choice of concatenating 10 is right, when I know that my $n$ is of the order of a few millions and my $1 - P(M,n)$ needs to be less than 0.0001 or some small number. The large values of M & n make the above intractable to compute. Is there an easier way to compute this?

$\endgroup$
6
  • $\begingroup$ a) It's not clear what it would mean for there to be a collision if you picked a key and issued it to $n$ users. I suspect that would you meant to say is that you pick $n$ keys, one for each of $n$ users. b) As you can readily see by substituting $n=1$, the probability $P(M,n)$ that you've written down is the probability of not getting a collision. $\endgroup$
    – joriki
    Commented Mar 19, 2013 at 22:13
  • $\begingroup$ Right, I meant 1 key for each of the $n$ users at random. I'm making do with the approximation given in that wikipedia link I have in there for now. $\endgroup$
    – broccoli
    Commented Mar 19, 2013 at 22:31
  • $\begingroup$ You didn't respond to my remark that what you've written down is the probability of not getting a collision. The surrounding text is written as if it were the probability of getting a collision. $\endgroup$
    – joriki
    Commented Mar 19, 2013 at 22:39
  • $\begingroup$ Sorry. $P(M,n)$ is the probability of not getting a collision so, $ 1 - P(M,n)$ is what I'm after. It is still intractable. $\endgroup$
    – broccoli
    Commented Mar 19, 2013 at 22:50
  • $\begingroup$ You only changed one instance; it still says above the formula that this is the probability of getting a collision. $\endgroup$
    – joriki
    Commented Mar 19, 2013 at 22:58

3 Answers 3

6
$\begingroup$

As in my answer here, the Poisson approximation is pretty accurate. $$\mathbb{P}(\text{no collision})\approx \exp(-n^2/2M)\approx 1-{n^2/2M}.$$ Setting this equal to .9999 and $M=62^{10}$, we find that we can issue around $n=12.95$ million keys without worrying.

$\endgroup$
5
  • $\begingroup$ We are using the same approximation but at opposite ends of the probability spectrum. I hope that between us we help OP. $\endgroup$ Commented Mar 20, 2013 at 2:48
  • $\begingroup$ @Ross I hope so too. $\endgroup$
    – user940
    Commented Mar 20, 2013 at 3:45
  • $\begingroup$ Thanks. This is what I ended up using in my code $\endgroup$
    – broccoli
    Commented Mar 20, 2013 at 17:36
  • $\begingroup$ @broccoli Glad to be of help. $\endgroup$
    – user940
    Commented Mar 20, 2013 at 18:01
  • $\begingroup$ If read this properly, at 12.5m keys, you have a probability of 99.99% that there will be no collisions. At 1.3b keys (@RossMillikan answer), you have a 50% probability of collision? $\endgroup$
    – Alan
    Commented Jul 3, 2013 at 2:27
2
$\begingroup$

Roughly speaking, you expect a collision when the number of pairs is the number of hashes. If you issue $n$ hashes, you have $\frac {n^2}2$ pairs (I ignored the $-1$ on one of the $n$'s), so you can afford to issue about $\sqrt{2M}=\sqrt {2\cdot 62^{10}}=62^5\sqrt 2\approx 1.3\cdot 10^9$ keys. In the Wikipedia article it shows there is another factor of $\sqrt {\ln 2} \approx 0.83$ which I would ignore but it says you can issue one billion codes before you have a $50\%$ chance of collision.

$\endgroup$
1
  • $\begingroup$ The $\sqrt {\ln 2}$ multiplicative term or more generally $\sqrt {-\ln p}$ get you closer to the probability of no collisions being about $p$ $\endgroup$
    – Henry
    Commented Jul 27, 2021 at 17:09
1
$\begingroup$

These expressions can be evaluated numerically using the standard "trick" of writing it as the exponential ($\exp$) of their (natural) logarithm ($\log$), and remembering that $\log(ab)=\log(a)+\log(b)$:

$P(M,n)=\exp\left( \log M!-n\log M-\log(M-n)! \right)$

Now $N!=\Gamma(N+1)$, and most software packages have the function that calculates $\log\Gamma(x)$ to a reasonable accuracy.

In "R", for instance, your code would look like

P <- function(M, n) exp( lgamma(M+1)-n*log(M)-lgamma(M-n+1) )

Other possibility is cancelling out the factorials:

$P(M,n)=\frac{M(M-1)...(M-n+1)((M-n)!)}{M^n(M-n)!}=\frac{M}{M}\frac{M-1}{M}...\frac{M-n+1}{M}$

With an $M$ as large as 62^10, both approaches may lose some precision. In fact, in 64-bit double floating point, $62^{10}=62^{10}-1=62^{10}-2=...=62^{10}-60$.

So I believe you need a software that can deal with arbitrary precision rationals to get what you're looking for. You can try package GMP for "R", or mathematica, or matlab.

EDIT after reading other answers.

Computational comments:

Using arbitrary precision rationals is unfeasible for large $n$, since the this is the number of factor in the product above.

Using Stirling lower and upper bounds, as suggested by @GregMartin only worked (in double precision) up to $M=10^{14}$ and $N=10^6$.

The good news is that a multi-precision floating point package enables us to calculate this simply with the code provided above. Some results:

One million keys:

> M <- mpfr(62, precBits=2048)^10
> n <- mpfr(10, precBits=2048)^6
> P(M,n)
1 'mpfr' number of precision  2048   bits 
[1] 0.99999940426578238924518...

10 million keys:

> n <- mpfr(10, precBits=2048)^7
> P(M,n)
1 'mpfr' number of precision  2048   bits 
[1] 0.99994042828134289665284...

100 million keys: (I'd start to get worried...)

> n <- mpfr(10, precBits=2048)^8
> P(M,n)
1 'mpfr' number of precision  2048   bits 
[1] 0.994060359974671181685...

(I've used package Rmpfr)

$\endgroup$
3
  • $\begingroup$ Another possibility is replacing $\log N!$ by its approximation using Stirling's formula. $\endgroup$ Commented Mar 20, 2013 at 2:57
  • $\begingroup$ Thanks @GregMartin. Actually, the arbitrary precision alternative i've mentioned is not feasible for large $n$... There is no escape but use approximations. $\endgroup$ Commented Mar 20, 2013 at 3:12
  • $\begingroup$ Thanks much for the effort. I'd thought of using this, but decided not to pursue as I only wanted an order of estimate. btw, the Ramanujan approximation is also very handy for handling large factorials $\endgroup$
    – broccoli
    Commented Mar 20, 2013 at 17:38

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .