Setting up simulation algorithm to check calibration of Bayesian posterior probabilities

Question

Figuring out how to simulate something is often the best way to understand the underlying principles. I am a bit at a loss on exactly how to simulate the following.

Suppose that $Y \sim N(\mu, \sigma^{2})$ and that $\mu$ has a prior distribution that is $N(\gamma, \tau^{2})$. Based on a sample of $n$ observations $Y_{1}, \dots, Y_{n}$ abbreviated by just $Y$, I am interested in showing to a non-Bayesian that the posterior probability that $\mu > 0 | Y$ is well calibrated, e.g., Prob$(\mu > 0 | P) = P$ where $P$ is the posterior probability. A related discussion is here

What I really want to show is that if one were to do sequential testing and stop sampling when the posterior probability exceeds some level such as 0.95 the probability that $\mu > 0$ is not $< 0.95$.

I am trying to convince frequentists that Bayesian probabilities are meaningful without getting into any discussion about type I error. I suppose there is a philosophical problem when talking to a frequentist who entertains null hypotheses in that if the prior is continuous (as above) the probability that $\mu = 0$ is zero and simulations are not needed. I would appreciate some suggestions about how to think of the whole problem and how to design demonstration simulations. I am used to doing frequentist simulations where $\mu$ is just set to a single constant; Bayesians don't condition on $\mu$.

For the sequential situation we set a maximum possible sample size, e.g., $n=1000$.

There is a subtlty to the problem that I always have trouble thinking about. A real skeptic is sometimes worried about a false claim of effectiveness ($\mu > 0$) when the process really has exactly no effect ($\mu=0$). The subtlty is that the skeptic is "singling out" zero as a special value, and perhaps is giving non-zero probability to the event $\mu = 0$ (?). Our method of showing that the posteriors is calibrated may not make such a skeptic happy because the skeptic really seems to want to condition on $\mu = 0$ and as Bayesians we only condition on what is knowable. Perhaps this is a case where the prior distribution that the statistician is using conflicts with a discontinuous prior distribution the skeptic is using?

Juho Kokkala · Accepted Answer · 2016-10-07 05:45:49Z

Simulation results will depend on how the parameter is sampled in the simulation. I don't think there is any dispute over whether the posterior probabilities will be calibrated (in the frequency sense) if the prior probabilities are, so I suspect a simulation will not convince anyone of anything new.

Anyway, in the sequential sampling case mentioned in the question (third paragraph) can be simulated "as is" by drawing $\mu$ from the prior, drawing samples given this $\mu$ until $p(\mu>0\mid \textrm{samples})>0.95$ or some other termination criterion occurs (another termination criterion is needed since there is positive probability that the running posterior probability will never exceed $0.95$). Then, for every $p(\mu>0\mid \textrm{samples})>0.95$ claim, check whether the underlying sampled $\mu$-parameter is positive and count the number of true positives vs. false positives. So, for $i=1,2,\ldots$:

Sample $\mu_i \sim N(\gamma, \tau^2)$
For $j=1,\ldots^\ast$:
- Sample $y_{i,j} \sim N(\mu_i, \sigma^2)$
- Compute $p_{i,j} := P(\mu_i>0 \mid y_{i,1:j})$
- If $p_{i,j}>0.95$
  - If $\mu_i>0$, increment true positive counter
  - If $\mu_i\leq0$, increment false positive counter
  - Break from the inner for loop
- $\ast$ some other breaking condition, such as $j\geq j_{\max}$

The ratio of true positives to all positives will be at least $0.95$, which demonstrates calibration of the $P(\mu>0 \mid D)>0.95$ claims.

A slow-and-dirty Python implementation (bugs very possible + there is a potential stopping bias in that I debugged until I saw the expected calibration property holding).

# (C) Juho Kokkala 2016
# MIT License 

import numpy as np

np.random.seed(1)

N = 10000
max_samples = 50

gamma = 0.1
tau = 2
sigma = 1

truehits = 0
falsehits = 0

p_positivemus = []

while truehits + falsehits < N:
    # Sample the parameter from prior
    mu = np.random.normal(gamma, tau)

    # For sequential updating of posterior
    gamma_post = gamma
    tau2_post = tau**2

    for j in range(max_samples):
        # Sample data
        y_j = np.random.normal(mu, sigma)

        gamma_post = ( (gamma_post/(tau2_post) + y_j/(sigma**2)) /
                       (1/tau2_post + 1/sigma**2) )
        tau2_post = 1 / (1/tau2_post + 1/sigma**2)

        p_positivemu = 1 - stats.norm.cdf(0, loc=gamma_post,
                                          scale=np.sqrt(tau2_post))

        if p_positivemu > 0.95:
            p_positivemus.append(p_positivemu)
            if mu>0:
                truehits += 1
            else:
                falsehits +=1
            if (truehits+falsehits)%1000 == 0:
                print(truehits / (truehits+falsehits))
                print(truehits+falsehits)
            break

print(truehits / (truehits+falsehits))
print(np.mean(p_positivemus))

I got $0.9807$ for the proportion of true positives to all claims. This is over $0.95$ as the posterior probability will not hit exactly $0.95$. For this reason the code tracks also the mean "claimed" posterior probability, for which I got $0.9804$.

One could also change the prior parameters $\gamma,\tau$ for every $i$ to demonstrate a calibration "over all inferences" (if the priors are calibrated). On the other hand, one could perform the posterior updates starting from "wrong" prior hyperparameters (different than what are used in drawing the ground-truth parameter), in which case the calibration might not hold.

This is very clear and very helpful. I'm adding another paragraph to my question with one remaining issue. In addition to the counting method I am interested in plotting the probability of a false claim against the true (sampled) $\mu$ possibly loess-smoothed to show a calibration curve. — Frank Harrell, Commented Oct 7, 2016 at 11:01
Instead of changing the 2 parameters in the prior, I wonder if it would be meaningful and interpretable to plot the drawn $\mu$ against the maximum posterior probability over the enlarging sample sizes in the sequential assessment. This doesn't get at false and true positives but perhaps is another form of calibration? — Frank Harrell, Commented Oct 8, 2016 at 13:21

Frank Harrell · Accepted Answer · 2016-10-13 16:55:25Z

Expanding on the excellent answer by @juho-kokkala and using R here are the results. For a prior distribution for the population mean mu I used an equal mixture of two normals with mean zero, one of them very skeptical about large means.

## Posterior density for a normal data distribution and for
## a mixture of two normal priors with mixing proportions wt and 1-wt
## and means mu1 mu2 and variances v1 an
## Adapted for LearnBayes package normal.normal.mix function

## Produces a list of 3 functions.  The posterior density and cum. prob.
## function can be called with a vector of posterior means and variances
## if the first argument x is a scalar

mixpost <- function(stat, vstat, mu1=0, mu2=0, v1, v2, wt) {
  if(length(stat) + length(vstat) != 2) stop('improper arguments')
  probs      <- c(wt, 1. - wt)
  prior.mean <- c(mu1, mu2)
  prior.var  <- c(v1,  v2)

  post.precision <- 1. / prior.var + 1. / vstat
  post.var       <- 1. / post.precision
  post.mean <- (stat / vstat + prior.mean / prior.var) / post.precision
  pwt       <- dnorm(stat, prior.mean, sqrt(vstat + prior.var))
  pwt       <- probs * pwt / sum(probs * pwt)

  dMix <- function(x, pwt, post.mean, post.var)
    pwt[1] * dnorm(x, mean=post.mean[1], sd=sqrt(post.var[1])) +
    pwt[2] * dnorm(x, mean=post.mean[2], sd=sqrt(post.var[2]))
  formals(dMix) <- z <-
    list(x=NULL, pwt=pwt, post.mean=post.mean, post.var=post.var)

  pMix <- function(x, pwt, post.mean, post.var)
    pwt[1] * pnorm(x, mean=post.mean[1], sd=sqrt(post.var[1])) +
    pwt[2] * pnorm(x, mean=post.mean[2], sd=sqrt(post.var[2]))
  formals(pMix) <- z

  priorMix <- function(x, mu1, mu2, v1, v2, wt)
    wt * dnorm(x, mean=mu1, sd=sqrt(v1)) +
    (1. - wt) * dnorm(x, mean=mu2, sd=sqrt(v2))
  formals(priorMix) <- list(x=NULL, mu1=mu1, mu2=mu2, v1=v1, v2=v2, wt=wt)
  list(priorMix=priorMix, dMix=dMix, pMix=pMix)
}

## mixposts handles the case where the posterior distribution function
## is to be evaluated at a scalar x for a vector of point estimates and
## variances of the statistic of interest
## If generates a single function

mixposts <- function(stat, vstat, mu1=0, mu2=0, v1, v2, wt) {
  post.precision1 <- 1. / v1 + 1. / vstat
  post.var1       <- 1. / post.precision1
  post.mean1      <- (stat / vstat + mu1 / v1) / post.precision1

  post.precision2 <- 1. / v2 + 1. / vstat
  post.var2       <- 1. / post.precision2
  post.mean2      <- (stat / vstat + mu2 / v2) / post.precision2

  pwt1 <- dnorm(stat, mean=mu1, sd=sqrt(vstat + v1))
  pwt2 <- dnorm(stat, mean=mu2, sd=sqrt(vstat + v2))
  pwt <- wt * pwt1 / (wt * pwt1 + (1. - wt) * pwt2)

  pMix <- function(x, post.mean1, post.mean2, post.var1, post.var2, pwt)
    pwt        * pnorm(x, mean=post.mean1, sd=sqrt(post.var1)) +
    (1. - pwt) * pnorm(x, mean=post.mean2, sd=sqrt(post.var2))
  formals(pMix) <-
    list(x=NULL, post.mean1=post.mean1, post.mean2=post.mean2,
         post.var1=post.var1, post.var2=post.var2, pwt=pwt)
 pMix
}

## Compute proportion mu > 0 in trials for
## which posterior prob(mu > 0) > 0.95, and also use a loess smoother
## to estimate prob(mu > 0) as a function of the final post prob
## In sequential analyses of observations 1, 2, ..., N, the final
## posterior prob is the post prob at the final sample size if the
## prob never exceeds 0.95, otherwise it is the post prob the first
## time it exceeds 0.95

sim <- function(N, prior.mu=0, prior.sd, wt, mucut=0, postcut=0.95,
                nsim=1000, plprior=TRUE) {
  prior.mu <- rep(prior.mu, length=2)
  prior.sd <- rep(prior.sd, length=2)
  sd1 <- prior.sd[1]; sd2 <- prior.sd[2]
  v1 <- sd1 ^ 2
  v2 <- sd2 ^ 2
  if(plprior) {
    pdensity <- mixpost(1, 1, mu1=prior.mu[1], mu2=prior.mu[2],
                        v1=v1, v2=v2, wt=wt)$priorMix
    x <- seq(-3, 3, length=200)
    plot(x, pdensity(x), type='l', xlab=expression(mu), ylab='Prior Density')
    title(paste(wt, 1 - wt, 'Mixture of Zero Mean Normals\nWith SD=',
                round(sd1, 3), 'and', round(sd2, 3)))
  }
  j <- 1 : N
  Mu <- Post <- numeric(nsim)
  stopped <- integer(nsim)

  for(i in 1 : nsim) {
    # See http://stats.stackexchange.com/questions/70855
    component <- sample(1 : 2, size=1, prob=c(wt, 1. - wt))
    mu <- prior.mu[component] + rnorm(1) * prior.sd[component]
    # mu <- rnorm(1, mean=prior.mu, sd=prior.sd) if only 1 component

    Mu[i] <- mu
    y  <- rnorm(N, mean=mu, sd=1)
    ybar <- cumsum(y) / j
    pcdf <- mixposts(ybar, 1. / j, mu1=prior.mu[1], mu2=prior.mu[2],
                     v1=v1, v2=v2, wt=wt)
    if(i==1) print(body(pcdf))
    post    <- 1. - pcdf(mucut)
    Post[i] <- if(max(post) < postcut) post[N]
               else post[min(which(post >= postcut))]
    stopped[i] <- if(max(post) < postcut) N else min(which(post >= postcut))
  }
  list(mu=Mu, post=Post, stopped=stopped)
}

# Take prior on mu to be a mixture of two normal densities both with mean zero
# One has SD so that Prob(mu > 1) = 0.1
# The second has SD so that Prob(mu > 0.25) = 0.05
prior.sd <- c(1 / qnorm(1 - 0.1), 0.25 / qnorm(1 - 0.05))
prior.sd
set.seed(2)
z <- sim(500, prior.mu=0, prior.sd=prior.sd, wt=0.5, postcut=0.95, nsim=10000)

mu   <- z$mu
post <- z$post
st   <- z$stopped
plot(mu, post)
abline(v=0, col=gray(.8)); abline(h=0.95, col=gray(.8))
hist(mu[post >= 0.95], nclass=25)
k <- post >= 0.95
mean(k)   # 0.44 of trials stopped with post >= 0.95
mean(st)  # 313 average sample size
mean(mu[k] > 0)  # 0.963 of trials with post >= 0.95 actually had mu > 0
mean(post[k])    # 0.961 mean posterior prob. when stopped early
w <- lowess(post, mu > 0, iter=0)
# perfect calibration of post probs 
plot(w, type='n',         # even if stopped early
     xlab=expression(paste('Posterior Probability ', mu > 0, ' Upon Stopping')),
     ylab=expression(paste('Proportion of Trials with ',  mu > 0)))
abline(a=0, b=1, lwd=6, col=gray(.85))
lines(w)

Stack Exchange Network

Setting up simulation algorithm to check calibration of Bayesian posterior probabilities

2 Answers 2

Not the answer you're looking for? Browse other questions tagged
bayesian
simulation
posterior
or ask your own question.

Linked

Hot Network Questions

Setting up simulation algorithm to check calibration of Bayesian posterior probabilities

2 Answers 2

Not the answer you're looking for? Browse other questions tagged bayesiansimulationposterior or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
bayesian
simulation
posterior
or ask your own question.