
This is the second follow-up question to these two previous questions:

Consider again the model of the previous question, which I will repeat here for clarity.

$$ \text{Likelihood:}\\ \\ y \sim \mathcal{N}(\mu_1, \sigma_1)\\ x \sim \mathcal{N}(\mu_2, \sigma_2)\\[2em] \text{Prior:}\\ \begin{aligned} \mu_1 &\sim \mathcal{N}(0, 1000)\\ a &\sim \mathcal{U}(0,2)\\ \mu_2 &\leftarrow \mu_1 + a\\ \sigma_1 &\sim \mathcal{U}(0, 100)\\ \sigma_2 &\sim \mathcal{U}(0, 100) \end{aligned} $$

where $\mathcal{N}()$ denotes a Gaussian and $\mathcal{U}()$ a uniform distribution. Here is the implementation in rjags:

library(rjags)
model <- "
model {
  for (i in 1:length(x)){
    x[i] ~ dnorm(mu1, tau1)
  }

  for (i in 1:length(y)){
    y[i] ~ dnorm(mu2, tau2)
  }

  mu1 ~ dnorm(0, .00001)
  a ~ dunif(0, 2)
  mu2 <- mu1 + a

  sigma1 ~ dunif(0,100)
  tau1 <- pow(sigma1, -2)

  sigma2 ~ dunif(0,100)
  tau2 <- pow(sigma2, -2)
}
"

Now suppose we have infinite data from a data-generating process that cannot be captured by this model. Below I show such an example in R (here "infinite" is of course approximated by a large sample and a low standard deviation).

n <- 1e3
dat <- list(x = rnorm(n, mean = 2, sd = .1),
            y = rnorm(n, mean = 10, sd = .1))

jags.model <- jags.model(textConnection(model), data = dat)
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 2000
#>    Unobserved stochastic nodes: 4
#>    Total graph size: 2012
#> 
#> Initializing model
samp <- coda.samples(jags.model, n.iter = 1e4, 
                       variable.names = c("mu1", "mu2", "sigma1", "sigma2"))
post  <- as.data.frame(samp[[1]])
summary(post$mu1)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   7.988   7.999   8.002   8.003   8.006   8.048
summary(post$mu2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   9.986   9.995   9.997   9.997   9.999  10.009

Now note that, as expected, the posterior does not converge to the true values of 2 and 10, since the model cannot represent means that are more than 2 units apart. But the model does "converge" to something specific: $\mu_1 = 8$ and $\mu_2 = 10$. If you run a different chain, it "converges" to $\mu_1 = 2$ and $\mu_2 = 4$. What characterizes these solutions? What should the theoretical posterior distribution be in this case? Are these the only peaks, so that the posterior should converge to a 50% point mass on each?
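
As a side note, a minimal way to see both modes in one run is to use several chains with dispersed starting values (a sketch only; the initial ranges and object names below are arbitrary and purely for illustration, using standard rjags/coda calls):

inits <- function() list(mu1 = runif(1, -10, 20), a = runif(1, 0, 2),
                         sigma1 = runif(1, 0.5, 20), sigma2 = runif(1, 0.5, 20))
jm <- jags.model(textConnection(model), data = dat, inits = inits, n.chains = 4)
samp4 <- coda.samples(jm, n.iter = 1e4,
                      variable.names = c("mu1", "mu2", "sigma1", "sigma2"))
summary(samp4)     # chains settle near (mu1, mu2) = (2, 4) or (8, 10)
gelman.diag(samp4) # large R-hat values flag that the chains disagree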

More generally, when the true data-generating process cannot be captured by your Bayesian model (which in practice is almost always the case), what characterizes the solution the posterior eventually converges to?

  • You didn't show anything proving that the model "converged" to this solution. Your simulation has resulted in one of two equivalent, equally bad, solutions. That's not convergence yet. Your question does not have a good answer beyond the obvious: a bad model will at best give you the "least bad" solution that is possible under it. – Tim, May 24, 2020 at 8:54
  • Also, unless I'm missing something, you used a single chain for the simulation, which just picked one local maximum. – Tim, May 24, 2020 at 8:58
  • @Tim I'm not trying to prove anything, I'm asking what it converges to. What is the "least bad" solution in this case, and where should the posterior peak? – user272422, May 24, 2020 at 13:09
  • I'd assume much the same thing as under a misspecified frequentist model: you should get a posterior that minimizes the difference (in some sense) to the true posterior under the "true" model, to the extent that is possible with the model you've specified (I know, not very precise; I'm sure someone must have looked this up more rigorously). Of course, any inference could be way, way off from the truth, depending on how badly misspecified your model is. – Björn, May 24, 2020 at 13:21
  • @Tim When you say "least bad", do you know what actual minimization/maximization is being performed? Least bad with respect to what, exactly? – user272422, May 24, 2020 at 14:15

3 Answers


I think you can simplify your specific problem for the asymptotic case. The normal distribution is summarised by two sufficient statistics, so the data can be reduced to six numbers: the two sample sizes $n_y,n_x$ and the sample means and variances, given as

$$\overline{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}y_i$$ $$s^2_y=\frac{1}{n_y}\sum_{i=1}^{n_y}(y_i-\overline{y})^2$$ $$\overline{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}x_i$$ $$s^2_x=\frac{1}{n_x}\sum_{i=1}^{n_x}(x_i-\overline{x})^2$$

With these you can write the posterior as

$$p(\mu_1,\mu_2,\sigma_1,\sigma_2,a|DI)\propto p(\mu_1,\mu_2,\sigma_1,\sigma_2,a|I)\sigma_1^{-n_y}\sigma_2^{-n_x}\exp\left(-\frac{n_y[s_y^2+(\mu_1-\overline{y})^2]}{2\sigma_1^{2}}-\frac{n_x[s_x^2+(\mu_2-\overline{x})^2]}{2\sigma_2^{2}}\right)$$

Now asymptotically, the only part of the prior that "survives" the large sample size is the range restriction $\mu_1<\mu_2<\mu_1+2$. This means we can analytically integrate out the variance parameters, and $a$ is redundant, since we can write $(\mu_2|\mu_1)\sim \mathcal{U}(\mu_1,\mu_1+2)$ (by the properties of the uniform distribution). The joint marginal distribution of $(\mu_1,\mu_2)$ is then a truncated t distribution, which is asymptotically a truncated normal.

$$p(\mu_1,\mu_2|DI)\propto I_{\mu_1<\mu_2<\mu_1+2}\exp\left(-\frac{n_y(\mu_1-\overline{y})^2}{2s_y^{2}}-\frac{n_x(\mu_2-\overline{x})^2}{2s_x^{2}}\right)$$
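
As a sketch of the integration step mentioned above (assuming the $\mathcal{U}(0,100)$ upper bound on the $\sigma$'s is not binding, so the integral can be taken over $(0,\infty)$): substituting $u = c/(2\sigma_1^2)$ with $c = n_y[s_y^2+(\mu_1-\overline{y})^2]$ gives

$$\int_0^{\infty}\sigma_1^{-n_y}\exp\left(-\frac{n_y[s_y^2+(\mu_1-\overline{y})^2]}{2\sigma_1^{2}}\right)d\sigma_1 \;\propto\; \left[s_y^2+(\mu_1-\overline{y})^2\right]^{-\frac{n_y-1}{2}} \;\propto\; \left(1+t_y^2\right)^{-\frac{n_y-1}{2}},$$

with $t_y=(\mu_1-\overline{y})/s_y$, and similarly for $\sigma_2$. This is the truncated t form used in the update below, and it reduces to the truncated normal above when the usual large-$n$ normal approximation to the t density is applied (the update discusses when that approximation breaks down).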

The maximum can be found via constrained least squares. The unconstrained maximum is $(\hat{\mu}_1,\hat{\mu}_2)=(\overline{y},\overline{x})$. If this violates the constraint, we move to the nearest point on the boundary. So if the data have $\overline{x}>\overline{y}+2$, we set $\hat{\mu}_2=\hat{\mu}_1+2$ and maximise with respect to $\hat{\mu}_1$, giving $\hat{\mu}_1=w\overline{y}+(1-w)(\overline{x}-2)$ where $w=\frac{n_ys_y^{-2}}{n_xs_x^{-2}+n_ys_y^{-2}}$.

For your specific case we have $w=\frac{1}{2}$ (because the sample sizes and standard deviations are equal), so $$\hat{\mu}_1=w\overline{y}+(1-w)(\overline{x}-2)=\tfrac{1}{2}\cdot 2+\tfrac{1}{2}\cdot(10-2)=5, \qquad \hat{\mu}_2=\hat{\mu}_1+2=7.$$

Your posterior should concentrate around this point. To see this, you simply evaluate the likelihood function. The only part that differs is the term $(\mu_1-\overline{y})^2+(\mu_2-\overline{x})^2$. This evaluates to $36$ for either $(\mu_1,\mu_2)=(2,4)$ or $(\mu_1,\mu_2)=(8,10)$, but to $18$ for $(\mu_1,\mu_2)=(5,7)$: much smaller!
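
A quick numerical check of these values (a minimal sketch in R; the names are mine, and the plug-in values follow this answer's notation, where $\overline{y}=2$ and $\overline{x}=10$):

ybar <- 2;  xbar <- 10        # sample means, in this answer's notation
sy <- 0.1;  sx <- 0.1         # sample standard deviations
ny <- 1e3;  nx <- 1e3         # sample sizes

## precision weight and the constrained optimum on the boundary mu2 = mu1 + 2
w <- (ny / sy^2) / (nx / sx^2 + ny / sy^2)   # = 1/2 here
mu1_hat <- w * ybar + (1 - w) * (xbar - 2)   # = 5
mu2_hat <- mu1_hat + 2                       # = 7

## the squared-distance term at the three candidate points
ss <- function(mu1, mu2) (mu1 - ybar)^2 + (mu2 - xbar)^2
c(ss(2, 4), ss(8, 10), ss(5, 7))             # 36 36 18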

You can also see this geometrically, since the precisions are equal. On a simple x-y graph, draw the line $y=x-2$ and mark the point $(10,2)$ (here the x-axis is $\mu_2$ and the y-axis is $\mu_1$). The point on the line closest to $(10,2)$ is $(7,5)$. The likelihood "wants" to concentrate the posterior around $(10,2)$, and $(7,5)$ is the closest point on the line.
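
The projection can be checked directly (a trivial sketch; the line is parametrised as $(t, t-2)$ and the point is $(\overline{x},\overline{y})=(10,2)$):

## orthogonal projection of (10, 2) onto the line y = x - 2
t_star <- optimize(function(t) (t - 10)^2 + ((t - 2) - 2)^2, c(0, 20))$minimum
c(t_star, t_star - 2)    # approximately (7, 5)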

Not quite sure why your chain isn't converging to this point... under this approximation the posterior only has one mode... maybe bad starting points? (But see the update below.)

Also, your code does not quite match your equations: your equations give $y$ the lower mean, but your simulation gives $x$ the lower mean.

update

In light of the answer by @Sextus Empiricus, I looked into my answer again. If I take the marginal without making the normal approximation, we have

$$p(\mu_1,\mu_2|DI)\propto I_{\mu_1<\mu_2<\mu_1+2}\left(1+t_y^2\right)^{-\frac{n_y-1}{2}}\left(1+t_x^2\right)^{-\frac{n_x-1}{2}}$$

where $t_y=\frac{\mu_1-\overline{y}}{s_y}$ and $t_x=\frac{\mu_2-\overline{x}}{s_x}$. This is the product of two independent (truncated) t distributions. If we take the negative log of this posterior, we get the function

$$-\log\left[p(\mu_1,\mu_2|DI)\right]=-\log\left[I_{\mu_1<\mu_2<\mu_1+2}\right]+\frac{n_y-1}{2}\log\left(1+t_y^2\right)+\frac{n_x-1}{2}\log\left(1+t_x^2\right)$$

Interestingly, the function $\log\left(1+t_x^2\right)$ behaves like $t_x^2$ when $t_x$ is small (i.e., least squares, the normal distribution in my earlier response), but like $2\log\left(|t_x|\right)$ when it is large. This is what drives the bimodal behaviour: an extreme deviation is not penalised much more severely than a large deviation, which makes it better to "dismiss as noise" one of the two samples and fit the other one exactly.

Plugging in some numbers from the example shows this. We have $\log\left(1+t_x^2\right)=5.9$ when $\mu_2=4$ and it equals $4.5$ when $\mu_2=7$. Compare to least squares where $t_x^2=360$ when $\mu_2=4$ and it equals $90$ when $\mu_2=7$.
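
A small sketch of this behaviour along the boundary $\mu_2=\mu_1+2$ (on the same scale as the quoted numbers, which corresponds to $s_y^2=s_x^2=0.1$, with $n=1000$; the names and grid are mine): the t-based penalty has two equally deep minima, while the least-squares penalty has a single minimum at $\mu_1=5$.

ybar <- 2; xbar <- 10; s2 <- 0.1; n <- 1e3
mu1 <- seq(0, 10, by = 0.01)     # profile along the binding constraint mu2 = mu1 + 2
ty2 <- (mu1 - ybar)^2 / s2
tx2 <- (mu1 + 2 - xbar)^2 / s2

nlp_t    <- (n - 1) / 2 * (log(1 + ty2) + log(1 + tx2))  # exact t form
nlp_norm <- (n - 1) / 2 * (ty2 + tx2)                    # normal approximation

mu1[which.min(nlp_t)]      # near 2 or 8: the two minima are equally deep
mu1[which.min(nlp_norm)]   # 5: the single least-squares optimum
plot(mu1, nlp_t, type = "l", ylab = "negative log posterior (t form)")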

Further, asymptotically this does not converge to the truncated normal I outlined above. The large-$n$ approximation $(1+t_y^2)^{-\frac{n_y-1}{2}}\approx\exp\left(-\frac{(n_y-1) t_y^2}{2}\right)$ does not work here, because there is another term that cannot be ignored. If we set $n_x=n_y=n$ then we can write the posterior as $$p(\mu_1,\mu_2|DI)\propto I_{\mu_1<\mu_2<\mu_1+2}\left(1+t_y^2+t_x^2+t_y^2t_x^2\right)^{-\frac{n-1}{2}}$$$$ \approx I_{\mu_1<\mu_2<\mu_1+2}\exp\left(-\frac{(n-1)(t_y^2+t_x^2+t_y^2t_x^2)}{2}\right)$$

This is not a normal distribution, because we have the term $t_y^2t_x^2$ in the exponent (a bivariate normal would have $t_yt_x$). Without the range restriction this term becomes negligible, because it is possible to set $t_y=0$ and $t_x=0$ simultaneously. When the range restriction applies, we can no longer assume $t_y^2t_x^2\approx 0$. This also clearly shows the bimodal nature of the posterior, because we can make the term $t_y^2t_x^2=0$ by setting either $t_x=0,t_y\neq 0$ or $t_x\neq 0, t_y=0$. Including this additional term, $t_y^2+t_x^2+t_y^2t_x^2$ evaluates to $360$ for either $\mu_1=2,\mu_2=4$ or $\mu_1=8,\mu_2=10$, compared to $8280$ when $\mu_1=5,\mu_2=7$.
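
A quick check of these figures (again on the scale implied by the quoted numbers, $s_y^2=s_x^2=0.1$; the function name is mine):

s2 <- 0.1; ybar <- 2; xbar <- 10
penalty <- function(mu1, mu2) {
  ty2 <- (mu1 - ybar)^2 / s2
  tx2 <- (mu2 - xbar)^2 / s2
  ty2 + tx2 + ty2 * tx2    # the term in the exponent, cross term included
}
c(penalty(2, 4), penalty(8, 10), penalty(5, 7))   # approximately 360, 360, 8280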

I personally found this very interesting, and thanks to @Sextus Empiricus for his answer!

  • Thank you! (+1) I'm not sure I follow some of the steps. Are you sure it should converge to (5, 7)? It makes sense in a certain way, but the posterior seems to converge to (8, 10) in some chains and to (2, 4) in others; I never obtained (5, 7). – user272422, May 24, 2020 at 17:53
  • So maybe there is a convergence problem with Gibbs sampling here? – user272422, May 24, 2020 at 17:55
  • Also, when you say "(7, 5) is the closest", close in what sense specifically? (Thanks again for the answer!) Another thing I noticed is that the standard deviations also do not converge to the true values. – user272422, May 24, 2020 at 17:57
  • I think if you set mu2 ~ dunif(mu1, mu1+2) in your JAGS code it might work better (a sketch of this reparametrisation is given after these comments). Your Gibbs sampler will get stuck, I think, because the conditional distribution of $(\mu_1 \mid a,\mu_2)$ is a point mass at the value $\mu_1=\mu_2-a$, so the chain can't move. It might be the same problem with tau1 and sigma1, i.e. use x[i] ~ dnorm(mu1, pow(sigma1,-2)). – May 24, 2020 at 22:03
  • Regarding distance, this is Euclidean distance in the sense of being close to the MLEs. The point $(7,5)$ is also the asymptotic posterior mode under your prior. You will see this if you plot a histogram of your simulated values. – May 24, 2020 at 22:09
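
For reference, a minimal sketch of the reparametrisation suggested in the comments above (drawing $\mu_2$ directly from $\mathcal{U}(\mu_1,\mu_1+2)$ instead of introducing $a$); this is an illustration of that suggestion, not code from the original posts, and the name model2 is arbitrary:

model2 <- "
model {
  for (i in 1:length(x)){
    x[i] ~ dnorm(mu1, tau1)
  }
  for (i in 1:length(y)){
    y[i] ~ dnorm(mu2, tau2)
  }

  mu1 ~ dnorm(0, .00001)
  mu2 ~ dunif(mu1, mu1 + 2)   # replaces a and the deterministic link mu2 <- mu1 + a

  sigma1 ~ dunif(0, 100)
  tau1 <- pow(sigma1, -2)
  sigma2 ~ dunif(0, 100)
  tau2 <- pow(sigma2, -2)
}
"

This gives the sampler a stochastic node for $\mu_2$ to update directly, rather than the point-mass conditional that the comment identifies as the reason the chain gets stuck.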

I found parts of an answer to the question in this paper by Gelman and Shalizi, so I will post it here for reference (relevant bits below). Basically, the "best attainable" solution is given by the "distance" as measured by the likelihood function, in accordance with probabilityislogic's answer. We still have the unsolved puzzle of whether the solutions are $(2,4)$ and $(8,10)$, or $(5,7)$ as argued by probabilityislogic.

References:

Gelman, Andrew, and Cosma Rohilla Shalizi. "Philosophy and the practice of Bayesian statistics." British Journal of Mathematical and Statistical Psychology 66.1 (2013): 8-38.

Quotes:

[screenshots of the relevant passages from Gelman and Shalizi (2013)]


Bimodal likelihood function

The reason that you get "convergence" to either $(\mu_1,\mu_2) = (8,10)$ or $(\mu_1,\mu_2) = (2,4)$ is that the likelihood is very high when

  • either $\mu_1$ sits at the sample mean of the $x$ points (giving $\mu_1=2$),
  • or $\mu_2$ sits at the sample mean of the $y$ points (giving $\mu_2 = 10$).

$${ -\log\mathcal{L}(\mu_1,\mu_2,\sigma_1,\sigma_2) = n \log(\sigma_1) +\frac{1}{2 \sigma_1^2} \sum_{1\leq i \leq n} (x_i-\mu_1)^2 + n \log(\sigma_2) +\frac{1}{2 \sigma_2^2} \sum_{1\leq i \leq n} (y_i-\mu_2)^2}$$

In this case, optimizing the likelihood function (or the posterior, but this approaches the likelihood for large samples) is not just a matter of minimizing the least-squares terms

$$\sum_{1\leq i \leq n} (x_i-\mu_1)^2 + \sum_{1\leq i \leq n} (y_i-\mu_2)^2$$

(which would give the point $(\mu_1,\mu_2)=(5,7)$ as probabilityislogic argues).

It is also about the role of $\sigma_1$ and $\sigma_2$ in the likelihood function.

When one of them is large and the other small, roughly $(\sigma_1,\sigma_2) = (8, 0.1)$ or $(0.1, 8)$ in this example, you get a maximum of the likelihood. So there are two maxima, which is why you get these two different results (and you will not get convergence to a single point, because there are two solutions).
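
A quick way to see the two maxima numerically (a complementary check, not part of the computation below): for fixed means, the per-observation MLEs of the variances are $\hat\sigma_1^2 = s_x^2+(\mu_1-\bar x)^2$ and $\hat\sigma_2^2 = s_y^2+(\mu_2-\bar y)^2$, and plugging these in gives a profile negative log-likelihood (per observation, up to a constant) that can be compared at the candidate points:

## profile negative log-likelihood with sigma1, sigma2 replaced by their MLEs
xbar <- 2; ybar <- 10; sx2 <- 0.01; sy2 <- 0.01
prof_nll <- function(mu1, mu2) {
  0.5 * log(sx2 + (mu1 - xbar)^2) + 0.5 * log(sy2 + (mu2 - ybar)^2)
}
prof_nll(2, 4)     # about -0.51: fit x exactly, treat y as noise
prof_nll(8, 10)    # about -0.51: fit y exactly, treat x as noise
prof_nll(5, 7)     # about  2.20: the compromise point is much worse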


Example computation

Let's simplify the likelihood expression by replacing the sum terms with expressions in the sample moments (which are sufficient statistics) and dividing by $n$:

$$\log(\sigma_1) +\frac{\overline{x^2} - 2 \mu_1 \overline{x} + \mu_1^2}{2 \sigma_1^2} + \log(\sigma_2) +\frac{\overline{y^2} - 2 \mu_2 \overline{y} + \mu_2^2}{2 \sigma_2^2} $$

In your example we have $\overline{x} \to 2$, $\overline{y} \to 10$, $\overline{x^2} \to 2^2 + 0.01$ and $\overline{y^2} \to 10^2 + 0.01$ (the sample variances are $0.01$).

Let's look for the minimum when we keep $\sigma_1$ and $\sigma_2$ fixed and set $\mu_2 = \mu_1 + 2$ (the constraint is binding), so that the expression is a function of a single free parameter $\mu_1$:

$$\log(\sigma_1) +\frac{\overline{x^2} - 2 \mu_1 \overline{x} + \mu_1^2}{2 \sigma_1^2} + \log(\sigma_2) +\frac{\overline{y^2} - 2 (\mu_1+2) \overline{y} + (\mu_1+2)^2}{2 \sigma_2^2} $$

The minimum can be found by differentiating with respect to $\mu_1$ and setting the derivative equal to zero, which gives:

$$\mu_1 = \frac{\sigma_2^2 \bar{x} + \sigma_1^2 (\bar{y}-2)}{\sigma_2^2 + \sigma_1^2}$$

When we plug this back into the likelihood, we get a function that depends only on $\sigma_1$ and $\sigma_2$. It is a bit difficult to find the minimum analytically, so let's do it computationally:

[contour plot "bimodal optimum likelihood" of the profiled likelihood over $(\sigma_1,\sigma_2)$, produced by the code below]

and you see that you get the optimal likelihood for roughly $(\sigma_1,\sigma_2) = (0.1,8)$ or $(\sigma_1,\sigma_2) = (8,0.1)$, and this puts the optimal mean at one of the two sample means, but not in the middle.

optlikelihood <- function(sigma_1,sigma_2) {

  ### distribution parameters (sample moments)
  xm <- 2
  x2m <- xm^2+0.01
  ym <- 10
  y2m <- ym^2+0.01

  ### compute optimal mu1 given sigma_1, sigma_2 (with mu2 = mu_opt + 2)
  mu_opt <- (sigma_2^2*xm + sigma_1^2*(ym-2)) / (sigma_2^2 + sigma_1^2)

  ### compute the negative log-likelihood per observation (up to a constant)
  L <- log(sigma_1) + log(sigma_2) +
          (x2m-2*mu_opt*xm + mu_opt^2) / (2*sigma_1^2) + 
          (y2m-2*(mu_opt+2)*ym + (mu_opt+2)^2) / (2*sigma_2^2)  
  return(L)
}


### choose variable range
s1 <- 10^seq(-2,2,0.25)
s2 <- 10^seq(-2,2,0.25)
n <- length(s1)


### compute results on a matrix
z <- matrix(rep(0,n*n),n)
for (i1 in 1:n) {
  for (i2 in 1:n) {
    z[i1,i2] = optlikelihood(s1[i1],s2[i2])
  }
}


#plotting parameters
levs <- 10^seq(-1,4,0.5)   # contour levels
collevs <- 10^seq(-2,5,0.1)   # colour levels
axislevs <- 10^seq(-2,2,1)  # axis levels

labs <- (matrix(levs[-1],1/0.5))  # for contour labels
labs[-1/0.5,] <- ""
labs <- c("",as.character(labs))

# contour plot (call dev.off() first if a previous plotting device is still open)
filled.contour(log(s1),log(s2),log(z),
               xlab="s1",ylab="s2", border = NULL,       
               color.palette=function(n) {hsv(c(seq(0.15,0.7,length.out=n),0),
                                              c(seq(0.7,0.2,length.out=n),0),
                                              c(seq(1,0.7,length.out=n),0.9))},
               levels=log(collevs),  
               key.axes=axis(4,at=log(levs),labels=labs),
               plot.axes= c({
                 contour(log(s1),log(s2),log(z),add=1, levels=log(levs), 
                         labels= labs, vfont = c("sans serif", "plain"))
                 axis(1, at = log(axislevs),labels=axislevs)
                 axis(2, at = log(axislevs),labels=axislevs)
                 title("bimodal optimum likelihood")
               },"")
)
  • Interesting answer. I think this is also consistent with my own, because the posterior I was working with was the marginal, whereas your plot is a joint posterior. If you average over the variance parameters, you get something in between the two modes. – Jun 18, 2020 at 13:24
  • @probabilityislogic It is not so clear how the equation that you used is a marginal distribution. It should at least depend on the prior distribution of the $\sigma$'s; I intuitively get your formulation when those priors are for fixed/constant $\sigma$, but I do not see how the formulation follows from some marginal distribution for non-fixed $\sigma$. – Jun 18, 2020 at 14:20
  • From my final graph it is obvious that when you assume $\sigma_1=\sigma_2$ you will be looking in a different region, and you do not get the effect of the bimodal likelihood function. But should we do it like that? At least this explains why the OP gets these different results for different chains: the OP is using a method that does not look at the marginal distribution. – Jun 18, 2020 at 14:39
  • I was thinking more in terms of using $E[\mu_1]=E[E(\mu_1|\sigma_1,\sigma_2)]\approx E\left[\frac{\sigma_2^2\overline{x}+\sigma_1^2(\overline{y}-2)}{\sigma_1^2+\sigma_2^2}\right]$. This uses the mode as an approximation for the inner expectation. Then the outer expectation will be a simple average of the two modes (asymptotically), and we have $E[\mu_1]\approx \frac{8+2}{2}=5$. – Jun 18, 2020 at 21:43
  • In terms of the prior for $\sigma_1,\sigma_2$, it would only matter asymptotically if the upper bound on the uniform distribution were lower than the large mode. That doesn't happen in this case. – Jun 18, 2020 at 21:52