
UPDATE: Thanks for the many thoughtful responses and questions! I've made edits here to clarify further, and I have also responded to each respondent individually.

Original Post

I have two sets of posterior samples of size 5000 each, summarized by ($\mu_1$, $sd_1$) and ($\mu_2$, $sd_2$).

These are posterior means and posterior standard deviations respectively. The posterior samples come from different prediction models (say, one for sales at Walmart and another for sales at JC Penney), and from each model I have 5000 samples from the posterior distribution of the percentage prediction error. NOTE: The percentage error is based on posterior predictive sampling for a held-out set - i.e. it is a fair measure of generalization error.

I can therefore compute the mean percentage error and the standard deviation of the percentage error. I want to compare whether one model predicts better than the other - in other words, I want to test whether the means $\mu_1$ and $\mu_2$ are statistically different.

Option 1) In the frequentist world, I would treat these posterior samples as data and run a two-sample t-test, converting $sd_1$ and $sd_2$ into standard errors (by dividing each by the square root of the sample size, i.e. $\sqrt{5000}$).

Option 2) However, my friend believes that the posterior samples represent the posterior distribution, and that $sd_1$ (and $sd_2$ respectively) is the estimate of the population standard deviation. So there is no need to divide by the square root of the sample size.

Further, she argues that I should simply take a difference between the two means (with arbitrary pairings - I don't get this part fully) and plot the distribution. If the highest-density interval includes 0, then the two are statistically the same. Basically: randomly draw 5000 times a percentage error value from each posterior, compute the difference, then a) plot the differences and b) compute the HPD/HDI and check whether 0 lies within the 95% HDI/HPD. If it does include 0, then the two models have similar percentage errors; if not, the two are different.
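For concreteness, here is a minimal sketch of her procedure as I understand it; the two error arrays below are just placeholders standing in for my 5000 posterior draws of percentage error from each model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders for the 5000 posterior draws of percentage error per model.
err1 = rng.normal(loc=5.0, scale=1.2, size=5000)
err2 = rng.normal(loc=6.0, scale=1.5, size=5000)

# "Arbitrary pairings": shuffle one set so the pairing carries no information,
# then take elementwise differences.
diff = err1 - rng.permutation(err2)


def hdi(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples (empirical HDI)."""
    x = np.sort(samples)
    n = len(x)
    k = int(np.floor(prob * n))
    widths = x[k:] - x[:n - k]
    i = np.argmin(widths)
    return x[i], x[i + k]


lo, hi = hdi(diff)
print(f"95% HDI of the difference: ({lo:.2f}, {hi:.2f})")
print("0 inside the HDI:", lo <= 0 <= hi)
```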

I've seen these two links, but somehow the answers are not quite clicking with me.

Calculating posterior of difference given posterior of two means

How should I compare posterior samples of the same parameter from two Bayesian models?

Any references would be most welcome.

Question 1) Who do you think is right, and why? Question 2) A reference would help a lot.

  • Where do the two posterior samples come from? Are they different parameters in the same model (with the same data)? Are they the same parameter but with the model fit to two different datasets? Or is it the same parameter, same data, but two separate MCMC runs? (assuming you're using MCMC here) Commented Oct 14, 2021 at 20:11
  • See updated question. Commented Oct 14, 2021 at 20:49
  • In short: completely different data, different models, but the same predictive metric. The percentage error is measured on held-out data (different for each store). Commented Oct 14, 2021 at 21:00

2 Answers


You have two sets of posterior samples that you want to compare. The problem is that you haven't defined the joint distribution of the two sampled quantities. Why? Because the two quantities are derived from independent models/data. The two models were developed independently, so there is no assumed/modelled link between your two sampled quantities. Hence assuming independence of the two sets of samples is not unreasonable, in which case the proposed 'arbitrary pairings' approach does give you what you're looking for. Having said that, it's worth bearing in mind the assumptions you're making here. If the two models had explanatory variables/parameters in common, then you could have fitted a single combined model for both datasets, and in that case the posterior samples for your two quantities would potentially no longer be independent.

An alternative approach which I've found helpful is the following. What is each of your 5000 prediction error samples telling you? It's telling you about the performance of the model with fixed parameters equal to those of the specific sample - but you wouldn't actually use such a fixed-parameter model for prediction. So, you could instead construct a performance metric which is a property of the fitted model rather than a property of a specific sample. For example, you could define prediction error for Walmart store no. 1 to be the difference between its average sales and the expected average sales where the expectation is over the posterior predictive distribution. Or whatever metric is most meaningful in your line of work. I have done something similar using metrics derived from the ROC curve. This approach is briefly discussed in the book Bayesian Data Analysis, section 7.3, subsection "Evaluating predictive error comparisons":

Sometimes it may be possible to use an application-specific scoring function that is so familiar for subject-matter experts that they can interpret the practical significance of differences. For example, epidemiologists are used to looking at differences in area under receiver operating characteristic curve (AUC) for classification and survival models

Finally, you compare the metrics you've calculated for your two models.
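For concreteness, here is a rough sketch of what such a model-level metric could look like. The held-out sales and posterior predictive draws below are placeholders, and the specific metric (observed average vs. posterior-predictive expected average) is just one example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholders: observed held-out sales for one store, and posterior predictive
# draws for those same held-out points (n_draws x n_holdout).
y_holdout = rng.normal(loc=100.0, scale=10.0, size=50)
y_pred_draws = rng.normal(loc=98.0, scale=12.0, size=(5000, 50))

# A property of the fitted model: compare the observed average to the
# expected average under the posterior predictive distribution, rather than
# scoring each posterior draw separately.
expected_avg = y_pred_draws.mean()   # E[average sales] under the posterior predictive
observed_avg = y_holdout.mean()

model_level_error = observed_avg - expected_avg
print(f"Model-level prediction error: {model_level_error:.2f}")
```

You would compute the same scalar for the second model on its own held-out data and then compare the two numbers directly.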

    $\begingroup$ "You could try assuming independence, which is what your proposed 'arbitrary pairings' approach does - but is this assumption reasonable?" This basically amounts to the question whether the two data sets behind the posteriors can be assumed to be independent. That's probably often the case, at least from a frequentist point of view. $\endgroup$ Commented Oct 15, 2021 at 12:18
  • If you had independent posterior samples for simple parameters that we are used to (e.g., a regression coefficient or a mean), you can get posterior samples from arbitrarily paired absolute differences in the two parameters, and from that get the posterior density of absolute differences. The proportion of absolute differences $> \epsilon$ is the simulated posterior probability of an absolute difference $> \epsilon$. Individual posterior means and SDs are to be ignored. Commented Oct 15, 2021 at 13:16
  • @FrankHarrell Yes, what you describe would definitely make sense for parameters belonging to the same model. In this case, the two quantities are associated with different models/data, so I don't see how to define a joint distribution for them. Commented Oct 15, 2021 at 13:33
  • If the prior distributions for the two parameters are independent, the data are independent, and there is no bivariate model for the joint distribution of the two models' data, then the posterior draws are independent. I.e., if there is no assumption that links the two parameters you are making inference about, then it seems reasonable to assume the posterior draws are independent. Commented Oct 15, 2021 at 13:44
  • You can use the posterior distribution of any parameter that is a function of fundamental model parameters. Commented Oct 17, 2021 at 12:14

As a frequentist you have to construct statistics whose distribution (or at least asymptotic distribution) you know, so that you can calculate p-values in order to do inference.

Now, what the appropriate statistic is depends on the model and assumptions, and how correct the inference will be depends on whether the data correspond to the modelling assumptions (or on how robust the inference is to specific deviations).

In your case it looks like you have data with potentially unequal variances. The frequentist approach would be either Welch's t-test or weighted least squares.
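For reference, a minimal sketch of that frequentist route on hypothetical data, using scipy's unequal-variance option:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical samples with unequal variances.
x = rng.normal(loc=5.0, scale=1.2, size=200)
y = rng.normal(loc=6.0, scale=2.5, size=200)

# Welch's t-test: does not assume equal variances across the two groups.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_value:.4f}")
```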

As a Bayesian you don't have to worry about existence of asymptotic results for a particular statistic; if you can correctly produce samples from the posterior you can do direct inference on the parameters in your model.

"But I must take the variances into account don't I?!" you may be thinking. Recall that your samples of $\mu_1$ and $\mu_2$ are from the joint distribution which include the standard deviations. As an exercise for such a simple model you can derive the closed form posterior distributions (assuming conjugate priors) and you will see how the posterior draws of the $\mu$s depend on the variances.

So in that sense your colleague is correct: looking at the posterior of the difference is sufficient, although I think it's preferable if the samples come from the same MCMC run.

  • Thanks for your response. To be sure, I have two separate sets of posterior samples of percentage errors (size 5000), so they cannot be from the same MCMC for sure. So in your view "randomly draw 5000 times a percentage error value from each posterior, compute the difference and then a) plot the differences and b) compute the HPD/HDI and check if 0 lies within the 95% HDI/HPD. If it does include 0, then the two models have similar percentage errors. If not the two are different." is correct. Commented Oct 14, 2021 at 22:01
  • Not quite sure what you mean by "population distribution". The posterior of $\mu_1$ is $p(\mu_1|data)$; if $\mu_1$ is the average number of sales at Walmart, then using the posterior you can make inferences about the average number of sales at Walmart. – bdeonovic Commented Oct 15, 2021 at 0:39
  • Re: reference. This seems like a pretty fundamental concept in the Bayesian paradigm, so I'll point you towards stats.stackexchange.com/questions/125/… and stats.stackexchange.com/questions/22/… – bdeonovic Commented Oct 15, 2021 at 0:42
  • How is % error computed? Is it really a fundamental parameter? Commented Oct 15, 2021 at 13:11
  • Percentage error is simply (y_actual - y_predict)/y_actual. The metric depends on the underlying regression model's parameters, but it is an outcome measure and not a fundamental parameter. Note: the percentage error is based on posterior predictive sampling for a held-out set - i.e. it is a fair measure of generalization error. Commented Oct 15, 2021 at 16:21
