Consider a regression task (e.g. predicting house prices) with given train and test sets.
We start by constructing a linear regression model, in which we assume $y_i=x_i^T\beta+\epsilon_i$ with $E[\epsilon_i]=0$ and usually $\epsilon\sim\mathcal{N}(0,\sigma^2I)$. As we know the real value of the dependent variable $y$, we can define the residuals as $e_i=y_i-\hat y_i$. The residuals are estimates of the "real" errors $\epsilon_i$. This is all part of the definition of the linear model (a special case of GLMs).
Next, we construct another model (e.g. XGBoost) for the same task and using the same data. We can calculate the residuals for this model in a similar manner.
Now, with the two models at hand, we would like to assess whether or not the models have the same output distribution - that is, whether or not the two sets of residuals have the same distribution. Of course, the two distributions will never be exactly identical (I handle this using an equivalence testing approach), but we can test them for a certain extent of similarity. I can think of some tests for central tendency and dispersion metrics, but as we all know, that's not enough.
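To make the equivalence idea concrete, here is a minimal sketch of how one summary statistic could be compared under an equivalence margin. The function name `bootstrap_equivalence`, the margin of 0.2 and the synthetic residuals are all hypothetical, chosen only for illustration. It uses a paired percentile bootstrap (resampling test rows, since both residual sets come from the same test set) and declares equivalence when the $(1-2\alpha)$ interval for the difference lies inside the margin, which mirrors the two one-sided tests (TOST) logic. This is one possible approach under these assumptions, not a definitive procedure.

```python
import numpy as np


def bootstrap_equivalence(res_a, res_b, stat, margin,
                          n_boot=2000, alpha=0.05, seed=0):
    """Check whether stat(res_a) - stat(res_b) lies within (-margin, margin).

    res_a and res_b are residuals of two models on the SAME test rows,
    so rows are resampled jointly (paired bootstrap).
    """
    res_a = np.asarray(res_a, dtype=float)
    res_b = np.asarray(res_b, dtype=float)
    if res_a.shape != res_b.shape:
        raise ValueError("paired residuals must have the same shape")

    rng = np.random.default_rng(seed)
    n = res_a.size
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test rows with replacement
        diffs[b] = stat(res_a[idx]) - stat(res_b[idx])

    # Percentile interval at level 1 - 2*alpha (90% for alpha = 0.05)
    lo, hi = np.quantile(diffs, [alpha, 1 - alpha])
    equivalent = bool(-margin < lo and hi < margin)
    return (float(lo), float(hi)), equivalent


# Synthetic stand-ins for the two residual sets (assumption: not real model output)
rng = np.random.default_rng(1)
res_lm = rng.normal(0.0, 1.0, 500)
res_gbm = res_lm + rng.normal(0.0, 0.1, 500)

ci_sd, equivalent_sd = bootstrap_equivalence(res_lm, res_gbm, np.std, margin=0.2)
print("interval for SD difference:", ci_sd, "equivalent:", equivalent_sd)
```

The same function accepts any statistic (`np.mean`, `np.median`, a quantile), but each one still only looks at a single feature of the distribution, which is exactly the limitation described above.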
Unlike GLMs, in other models (especially ensembles) we can't make assumptions about the residual distribution, which is the main issue here. If both models have normal residuals with close enough $\mu,\sigma$, that's one thing; if the parameters differ by much, that's another story. Of course, if one model has (for example) normal residuals $\sim\mathcal{N}(0,1)$ and the other has uniform residuals $\sim\mathcal{U}[-3,3]$, I want to be able to spot this. The same applies to tail differences, as sub-exponential distributions are not necessarily sub-Gaussian. I think you get the point.
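A small simulation of the normal-versus-uniform example may help show what needs to be detected. The third sample, $\mathcal{U}[-\sqrt3,\sqrt3]$, is an assumption added for illustration: it is rescaled to unit variance so that mean and standard deviation match $\mathcal{N}(0,1)$ while the shape and tails do not. The helper `excess_kurtosis` and the choice of the 99.9% quantile are likewise illustrative, not a recommended test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

samples = {
    "normal N(0,1)": rng.normal(0.0, 1.0, n),
    "uniform U[-3,3]": rng.uniform(-3.0, 3.0, n),
    # Assumption (not in the example above): uniform rescaled to unit variance
    "uniform U[-sqrt3,sqrt3]": rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n),
}


def excess_kurtosis(x):
    """Sample excess kurtosis: 0 for a normal, negative for lighter tails."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


summary = {}
for name, x in samples.items():
    summary[name] = {
        "mean": float(x.mean()),
        "sd": float(x.std()),
        "excess_kurtosis": excess_kurtosis(x),
        "q99.9": float(np.quantile(x, 0.999)),  # upper-tail quantile
    }
    print(name, summary[name])
```

In this setup $\mathcal{U}[-3,3]$ already differs from $\mathcal{N}(0,1)$ in dispersion, whereas the variance-matched uniform agrees on mean and SD and differs only in shape, so a check based on location and scale alone would not separate it from the normal.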
I had some thoughts, such as:
- Using nonparametric two-sample tests for distributions (KS or AD, although both are heavily criticized)
- Calculating the KL divergence (KLD) and then doing inference on it (a possible problem: the estimate has a known sampling distribution only in some cases)
- Calculating empirical CDFs and then measuring the area between them (but how do I test it? I can't simply use the Raju method for ECDFs)
- Drawing a QQ plot for the sets of residuals, but then I'm not sure what to do with deviations from the identity line
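
As a rough illustration of the ECDF idea in the list above, the sketch below computes the area between two empirical CDFs (for one-dimensional samples this integral of $|F_a-F_b|$ equals the 1-Wasserstein distance) and attaches a permutation p-value. The function names `ecdf_area` and `paired_permutation_pvalue` are hypothetical. The permutation scheme swaps the two residuals within each test row, which assumes the pair is exchangeable under the null, and that is a stronger condition than equal marginal distributions. It also tests for a difference, not for equivalence, so an equivalence version would still need a margin on the area.

```python
import numpy as np


def ecdf_area(a, b):
    """Area between the empirical CDFs of a and b (integral of |F_a - F_b|)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.sort(np.concatenate([a, b]))
    # Both ECDFs are constant on each interval [grid[i], grid[i+1])
    f_a = np.searchsorted(a, grid[:-1], side="right") / a.size
    f_b = np.searchsorted(b, grid[:-1], side="right") / b.size
    return float(np.sum(np.abs(f_a - f_b) * np.diff(grid)))


def paired_permutation_pvalue(a, b, n_perm=500, seed=0):
    """Permutation p-value for ecdf_area, swapping residuals within each row."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = ecdf_area(a, b)
    count = 0
    for _ in range(n_perm):
        swap = rng.random(a.size) < 0.5  # swap model labels row by row
        a_perm = np.where(swap, b, a)
        b_perm = np.where(swap, a, b)
        if ecdf_area(a_perm, b_perm) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (count + 1) / (n_perm + 1)


# Synthetic residuals matching the normal-vs-uniform example (assumption)
rng = np.random.default_rng(2)
res_normal = rng.normal(0.0, 1.0, 300)
res_uniform = rng.uniform(-3.0, 3.0, 300)

area, p_value = paired_permutation_pvalue(res_normal, res_uniform)
print("area between ECDFs:", area, "permutation p-value:", p_value)
```

For the KS and AD options, SciPy provides `scipy.stats.ks_2samp` and `scipy.stats.anderson_ksamp`, and `scipy.stats.wasserstein_distance` computes the same area. Note that both tests assume independent samples, which residuals from the same test rows are not.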
Any ideas?