
A "coin" has a fixed unknown bias $0\le p\le1$ for heads, and out of $n\ge0$ tosses it yielded $0\le h\le n$ heads. Note that this occurs with probability $P(h\;|\;p,n)=\binom{n}{h}p^h(1-p)^{n-h}$. We would like a "best guess" for $p$.

The frequentist view is that $p$ should be the maximum-likelihood estimate $\frac hn$. Indeed $\frac{d}{d\rho}\binom{n}{h}\rho^h(1-\rho)^{n-h}=0$ on $(0,1)$ occurs exactly at $\rho=\frac hn$.
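As a sanity check (a sketch of my own, not part of the original post; the concrete values $n=10$, $h=3$ are arbitrary), one can verify the interior critical point symbolically:

```python
# Sketch (not from the original post): the binomial likelihood
# rho^h (1-rho)^(n-h) has its interior critical point at rho = h/n.
import sympy as sp

rho = sp.symbols('rho')
n, h = 10, 3
likelihood = rho**h * (1 - rho)**(n - h)   # binomial coefficient dropped: constant in rho
critical = sp.solve(sp.diff(likelihood, rho), rho)
print([c for c in critical if 0 < c < 1])  # [3/10], i.e. h/n
```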

The uniform Bayesian view is that $p$ should be $\frac{h+1}{n+2}$. Indeed the prior distribution is $f(p)=1$, and the posterior distribution conditional on $(n,h)$ is then the Beta$(h+1,\,n-h+1)$ distribution $f(p\;|\;n,h)=\frac{P(h\;|\;p,n)f(p)}{\int_0^1P(h\;|\;\rho,n)f(\rho)d\rho}=\frac{(n+1)!}{h!(n-h)!}p^h(1-p)^{n-h}$, hence $\mathbb E[p\;|\;n,h]=\frac{(n+1)!}{h!(n-h)!}\int_0^1\rho^{h+1}(1-\rho)^{n-h}d\rho=\frac{(n+1)!}{h!(n-h)!}\frac{(h+1)!(n-h)!}{(n+2)!}=\frac{h+1}{n+2}$.
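This is easy to confirm numerically (my own sketch; the values $n=10$, $h=3$ are arbitrary):

```python
# Sketch (my addition): under the uniform prior the posterior mean is (h+1)/(n+2).
from scipy import integrate

n, h = 10, 3
post = lambda p: p**h * (1 - p)**(n - h)      # unnormalised posterior (uniform prior)
Z, _ = integrate.quad(post, 0, 1)
mean, _ = integrate.quad(lambda p: p * post(p), 0, 1)
print(mean / Z, (h + 1) / (n + 2))            # both 0.3333...
```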

I don't yet have intuition for why these two viewpoints agree if and only if $n=2h$; let me know if you do! But my main question is: what is the frequentist's prior, i.e. what distribution $f$ satisfies $\mathbb E_f[p\;|\;n,h]=\frac hn$ for all pairs $\lbrace(n,h)\in\mathbb Z^2\;|\; 0\le h\le n\rbrace$?

Rephrased, $n\int_0^1\rho^{h+1}(1-\rho)^{n-h}f(\rho)d\rho=h\int_0^1\rho^h(1-\rho)^{n-h}f(\rho)d\rho$. Taking $(n,h)=(1,0)$ forces $f$ to obey $\int_0^1\rho(1-\rho)f(\rho)d\rho=0$, so under some natural assumptions a non-negative $f$ must vanish almost everywhere on $(0,1)$; the only remaining proper candidates are normalized sums of Dirac deltas at $0$ and $1$, and those are ruled out by other pairs $(n,h)$. I believe this is Qiaochu's answer below.
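To see numerically how stringent this constraint is (a sketch of my own; the truncation parameter $\varepsilon$ is my device, not part of the question): any proper prior with mass on the interior gives a strictly positive integral, and even a prior piling its mass toward the endpoints only drives the integral to $0$ in the limit:

```python
# Sketch (my addition): the constraint  ∫ p(1-p) f(p) dp = 0  fails for every
# proper prior with interior mass.  Truncating f(p) ∝ 1/(p(1-p)) to [eps, 1-eps]
# concentrates mass near 0 and 1, yet the integral only tends to 0 as eps -> 0.
import numpy as np

for eps in (1e-2, 1e-4, 1e-8):
    Z = 2 * np.log((1 - eps) / eps)   # normaliser: ∫ 1/(p(1-p)) dp over [eps, 1-eps]
    print(eps, (1 - 2 * eps) / Z)     # normalised ∫ p(1-p) f(p) dp; positive for eps > 0
```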

This would be poetic and intuitive: a frequentist by construction would have no a priori guess, consistent with the fact that the vacuous 0 heads out of 0 tosses has undefined quotient $\frac00$ (whereas the uniform Bayesian invokes symmetry to guess $\frac12=\frac{0+1}{0+2}$).

Comments:

  • This video addresses this question or something very similar in slightly different notation: youtube.com/… – user10478, Jun 9 at 1:47
  • @user10478 The video shows that, for a Binomial$(n,p)$ likelihood function, a Beta$(a,b)$ prior for $p$ results in a Beta$(h+a,\,n-h+b)$ posterior. Therefore the posterior mean is $h/n$ only for $a=b=0$ (an improper prior); however, with $a=b=1$ (the uniform prior), the posterior density has its maximum at $h/n$. – r.e.s., Jun 9 at 3:52

2 Answers

Answer (score 8, by N. Virgo):

There is such a prior, but it's an improper one.

It's given by $$ f(p) \propto \frac{1}{p(1-p)}. $$

Formally, you should think of this $f$ as a density function with respect to the uniform prior on $[0,1]$. You can't normalise it, because $\int_{0}^{1}\frac{1}{p(1-p)}dp$ diverges. But you can still use it to calculate posteriors, by defining $$ f(p\mid h,n ) \mathrel{:=} \frac{1}{Z}p^h(1-p)^{n-h}f(p) = \frac{1}{Z}p^{h-1}(1-p)^{n-h-1} $$ where $Z$ is whatever it needs to be to make the posterior normalised, namely $Z=\Gamma(h)\Gamma(n-h)/\Gamma(n)$ when $0<h<n$. The calculations are then basically the same as for the beta distribution, since this prior is in some informal sense "a beta distribution with $\alpha=\beta=0$": the posterior is $\mathrm{Beta}(h,\,n-h)$, whose mean is $h/n$, as desired.
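A quick numeric confirmation (my own sketch; the values $n=10$, $h=3$ are arbitrary):

```python
# Sketch (my addition): under the improper prior f(p) ∝ 1/(p(1-p)), the
# posterior after h heads in n tosses is Beta(h, n-h) for 0 < h < n,
# and its mean equals the frequentist estimate h/n.
from scipy.stats import beta

n, h = 10, 3
posterior = beta(h, n - h)       # Beta(3, 7)
print(posterior.mean(), h / n)   # 0.3 0.3
```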

This improper prior is known as Haldane's prior, after a 1932 paper by J.B.S. Haldane. (Hat tip r.e.s. in the comments.) I originally learned about it from a paper by E. T. Jaynes called Prior Probabilities (1968), which apparently reinvents it but gives some nice invariance arguments in its favour.

Unfortunately, although improper priors are often used in practice, they seem not to be studied much in modern probability theory, so there isn't much formal theory about them.

Comments:

  • Improper priors are still used all the time in practice, because many Bayesian sampling algorithms only use the unnormalized posterior anyway. Nice reference, +1. – whpowell96, Jun 9 at 3:22
  • @whpowell96 That's a good point; I've added a remark that they are used in practice, even though it seems they're not formally studied much. – N. Virgo, Jun 9 at 3:24
  • @N.Virgo This improper prior is historically known as Haldane's prior (published in 1932). The reference is given in the WP article. – r.e.s., Jun 9 at 3:39
  • OK, this all makes sense when I start by computing the posterior under a general $\mathrm{Beta}(\alpha,\beta)$ prior and taking the limit $(\alpha,\beta)\to(0,0)$. This limiting prior is improper, whereas the uniform-Bayes prior ($\alpha=\beta=1$) is proper because the random variable $p\in[0,1]$ is compactly supported. In contrast, the uniform prior for linear regression recovers the OLS solution and is improper because the coefficient vector (slope, intercept) $\in\mathbb R^2$ is not compactly supported (and a non-uniform prior represents a regularized OLS that makes a choice of penalization). – Jun 9 at 19:15
  • A big problem with the Haldane prior is that if $h=0$ or $h=n$ then it concentrates all the posterior distribution at a single point. So if you flip the coin once and see heads then you are almost certain that $p=1$, while if you see tails then you are almost certain that $p=0$; this is not a practical approach. – Henry, Jun 10 at 1:41
Answer (score 5, by Qiaochu):

There is no such (proper) prior. If you flip a single heads then the frequentist says $p = 1$, and if you flip a single tails then the frequentist says $p = 0$, but this is not compatible with any prior in which $\mathbb{P}(\varepsilon < p < 1 - \varepsilon) > 0$ for some $\varepsilon > 0$: if $f(p)$ is the prior, then $\int_{\varepsilon}^{1-\varepsilon} f(p) \, dp > 0$ implies $\int_{\varepsilon}^{1-\varepsilon} p f(p) \, dp > 0$ and similarly for $(1 - p) f(p)$, these being the (unnormalized) posterior masses on $(\varepsilon, 1 - \varepsilon)$ after a single heads and a single tails respectively. The only prior that could produce these results is a discrete prior taking the value $0$ with some probability, the value $1$ with some other probability, and no other values; but that prior isn't compatible with flipping heads and then tails.
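A tiny numeric illustration of the first step (my own sketch; the uniform prior here stands in for any prior with interior mass):

```python
# Sketch (my addition): any prior with mass on the interior of [0,1] gives a
# posterior mean strictly below 1 after a single head, so it cannot reproduce
# the frequentist estimate p = 1 for (n, h) = (1, 1).
from scipy import integrate

f = lambda p: 1.0                                # uniform prior, as an example
Z, _ = integrate.quad(lambda p: p * f(p), 0, 1)  # posterior ∝ p·f(p) after one head
m, _ = integrate.quad(lambda p: p**2 * f(p), 0, 1)
print(m / Z)                                     # 0.666..., strictly less than 1
```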

Comments:

  • Qiaochu! Hilariously, I was discussing the problem an hour ago with Alex Zorn, and all three of us were Berkeley "officemates"; hope you remember our secret office seminars. Let me mull over your answer. – Jun 9 at 1:46
  • Hey, nice to run into you two again on the internet, hope you're both doing well. – Jun 9 at 1:47
  • Indeed, we're both in algorithmic trading for the time being, hence this genre of math problems ;-) Come join. – Jun 9 at 1:57
  • If you allow improper priors then there is such a prior; see my answer. – N. Virgo, Jun 9 at 3:16
