UPDATE: Thanks for the many thoughtful responses and questions! I've made edits here to clarify further, and have also responded to each respondent individually.
Original Post
I have two sets of posterior samples, each of size 5000, summarized by ($\mu_1$, $sd_1$) and ($\mu_2$, $sd_2$).
These are the posterior means and posterior standard deviations, respectively. The posterior samples come from two different prediction models (say, one for sales at Walmart and another for sales at JC Penney), and from each model I have 5000 samples from the posterior distribution of the percentage prediction error. NOTE: The percentage error is based on posterior predictive sampling for a held-out set, i.e. it is a fair measure of generalization error.
I can therefore compute the mean percentage error and the standard deviation of the percentage error for each model. I want to assess whether one model is better at prediction than the other; in other words, I want to check whether the means $\mu_1$ and $\mu_2$ are statistically different.
Option 1) In the frequentist world, I would run a two-sample t-test: treat these posterior samples as data, convert $sd_1$ and $sd_2$ into standard errors (by dividing each by the square root of the sample size, i.e. $\sqrt{5000}$), and run the test.
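To make Option 1 concrete, here is a minimal sketch of that calculation. The draws are simulated stand-ins (the means and spreads are made up for illustration, not my real posterior samples), and `welch_t` is a hypothetical helper name; I use the unequal-variance (Welch) form of the statistic, which is an assumption on my part.

```python
import math
import random
import statistics

def welch_t(draws_1, draws_2):
    """Welch two-sample t statistic, treating the posterior draws as if they were data."""
    n1, n2 = len(draws_1), len(draws_2)
    mu_1, mu_2 = statistics.fmean(draws_1), statistics.fmean(draws_2)
    sd_1, sd_2 = statistics.stdev(draws_1), statistics.stdev(draws_2)
    # Option 1: turn each sd into a standard error by dividing by sqrt(n)
    se_1 = sd_1 / math.sqrt(n1)
    se_2 = sd_2 / math.sqrt(n2)
    se_diff = math.sqrt(se_1**2 + se_2**2)
    return (mu_1 - mu_2) / se_diff, se_diff

# Simulated stand-ins for the two sets of 5000 percentage-error draws (assumed values)
rng = random.Random(0)
draws_1 = [rng.gauss(10.0, 3.0) for _ in range(5000)]
draws_2 = [rng.gauss(10.5, 3.0) for _ in range(5000)]

t_stat, se_diff = welch_t(draws_1, draws_2)
print(f"t = {t_stat:.2f}, standard error of the difference = {se_diff:.4f}")
```

Note that under this option the standard error depends on the number of draws: it scales with $1/\sqrt{n}$, so it shrinks if I simply draw more posterior samples.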
Option 2) However, my friend believes that the posterior samples represent the posterior distribution, and that $sd_1$ (and likewise $sd_2$) is an estimate of the population standard deviation, so there is no need to divide by the square root of the sample size.
Further, she argues that I should simply take a difference between the two means (with arbitrary pairings; I don't fully get this part) and plot the distribution. If the highest density interval includes 0, then the two are statistically the same. Basically: randomly draw a percentage error value from each posterior 5000 times, compute the difference each time, and then a) plot the differences and b) compute the HPD/HDI and check whether 0 lies within the 95% HDI/HPD. If it does include 0, then the two models have similar percentage errors. If not, the two are different.
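As far as I understand it, Option 2 would look roughly like the sketch below. Again the draws are simulated stand-ins with made-up values, and `hdi` is a hypothetical helper that takes the narrowest interval containing 95% of the sorted differences (one common empirical way to get an HDI; my friend may have a different estimator in mind). Pairing the draws by index is my reading of "arbitrary pairings", and it assumes the two sets of draws are independent of each other.

```python
import math
import random

def hdi(samples, mass=0.95):
    """Empirical HDI: the narrowest interval containing `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = math.ceil(mass * n)  # number of points that must fall inside the interval
    # Slide a window of k consecutive sorted points and keep the narrowest one
    widths = [xs[i + k - 1] - xs[i] for i in range(n - k + 1)]
    i_min = min(range(len(widths)), key=widths.__getitem__)
    return xs[i_min], xs[i_min + k - 1]

# Simulated stand-ins for the two sets of 5000 percentage-error draws (assumed values)
rng = random.Random(0)
draws_1 = [rng.gauss(10.0, 3.0) for _ in range(5000)]
draws_2 = [rng.gauss(10.5, 3.0) for _ in range(5000)]

# Arbitrary pairing: pair the draws by index and take the difference of each pair
diffs = [a - b for a, b in zip(draws_1, draws_2)]

lo, hi = hdi(diffs, mass=0.95)
zero_in_hdi = lo <= 0.0 <= hi
print(f"95% HDI of the difference: ({lo:.2f}, {hi:.2f}); contains 0: {zero_in_hdi}")
```

Here the width of the interval is driven by $sd_1$ and $sd_2$ themselves, not by $sd/\sqrt{5000}$, which is exactly where the two options disagree.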
I've seen these two links, but somehow the answers are not quite clicking with me.
Calculating posterior of difference given posterior of two means
How should I compare posterior samples of the same parameter from two Bayesian models?
Any references would be most welcome
Question 1) Who do you think is right, and why? Question 2) A reference would help a lot.