
I have a collection of bond distances from a series of $Fm\bar{3}m$ crystal structures that I would like to compare against each metal's ionic radius using a linear regression.

The data quality for some of the structures is worse than for others, so the associated ESDs/error bars for those bond distances are larger.

Weighting a linear regression fit with the associated ESDs from the crystal structures causes the abnormal data points to appear off the fit, whereas an unweighted regression captures those points well.

My understanding is that an unweighted analysis assumes the distribution of errors estimated from the residuals is not only normal but also uniform. This non-uniformity in error "appears" to be how the data are (in the sense that there is one particular data point with much larger ESDs than the rest), but my intuition is that there is nothing "intrinsic" or systematic affecting these data: the large ESDs are simply a result of poor crystal quality, and if a good crystal could be supplied the point would have "normal" ESDs.

Is there a statistical test to determine when one should or should not perform a weighted analysis?

Should one reflexively always perform a weighted linear analysis if the errors at each point are known?

  • Normally, in linear regression one would plot the standard deviation of the dependent variable $y$ vs. the independent variable $x$. If the larger $x$ have larger errors and the standard deviations look roughly linear in $x$, then weighted least squares is a good idea. The key point is that if you actually suspect outliers in the data, you might resort to even more advanced fitting methods: instead of minimizing the 2-norm (least squares), one can minimize the 1-norm. This method gives low importance to outliers.
    – ACR
    Commented May 21 at 0:55
  • Unless you know exactly what the weightings are, it's best not to use them. As AChem suggests, you could use least absolute deviation, i.e. minimise $\sum_i |\mathrm{data}_i - \mathrm{fit}_i|$, as this is good at ignoring outliers; the penalty is that it is hard to estimate goodness of fit, and you cannot use least-squares software for this. Have a look at Numerical Recipes by Press et al. for some code (a sketch of such a fit is given after these comments).
    – porphyrin
    Commented May 21 at 7:48
  • @porphyrin there is no mention in the post of outliers, only of "abnormal" data, which can be interpreted many ways but need not imply serious flaws. Without a plot showing the data in question we can only guess how serious the deviations are. Regardless, the point of a reported uncertainty is, among other things, to determine which data should be considered more trustworthy and, when performing a model fit, how much one or another data point should influence the choice of fitting parameters.
    – Buck Thorn
    Commented May 21 at 8:51
  • @Buck Thorn Abnormal data or outliers, who cares; LLS is nonetheless very sensitive to one or two points deviating greatly from the trend, and fitting shows this clearly when the residuals are plotted, i.e. you can see that a poor fit is produced. Least absolute deviation is not nearly so sensitive and gives a better fit in these instances, as common sense will show.
    – porphyrin
    Commented May 21 at 14:24
  • Yes, the actual data point is not a statistical outlier, and if I assume that there is no error in $y$ (the interatomic distance) then those pesky data points are well fit by an OLS with $r^2 \approx 0.98$. The issue arises in that one particular data point has a very high standard deviation compared to everything else, so when I do a WLS it falls off the line.
    – legolizard
    Commented May 22 at 15:26
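
For concreteness, here is a minimal sketch of the 1-norm (least-absolute-deviation) straight-line fit mentioned in the comments above, using scipy.optimize.minimize; the radius and distance arrays are hypothetical placeholders, not the data from the question.

```python
# Least-absolute-deviation (1-norm) line fit: minimise sum |y_i - (m*x_i + c)|.
# All numbers below are hypothetical placeholders.
import numpy as np
from scipy.optimize import minimize

x = np.array([0.72, 0.86, 1.00, 1.02, 1.12])  # ionic radii / Angstrom (hypothetical)
y = np.array([2.10, 2.24, 2.38, 2.35, 2.51])  # bond distances / Angstrom (hypothetical)

def l1_loss(params):
    slope, intercept = params
    return np.sum(np.abs(y - (slope * x + intercept)))

# Start from the ordinary least-squares solution and refine under the 1-norm.
p0 = np.polyfit(x, y, 1)
result = minimize(l1_loss, p0, method="Nelder-Mead")
slope, intercept = result.x
print(f"LAD fit: d = {slope:.3f} r + {intercept:.3f}")
```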

1 Answer


This non-uniformity in error "appears" to be how the data are (in the sense that there is one particular data point with much larger ESDs than the rest), but my intuition is that there is nothing "intrinsic" or systematic affecting these data: the large ESDs are simply a result of poor crystal quality, and if a good crystal could be supplied the point would have "normal" ESDs.
[.....]

Normality and heteroscedasticity are separate issues. Normality means that samples at a given value of the independent variable $x$ are normally distributed, even if the population standard deviation might differ at different $x$; applying LLS assumes normality. Heteroscedasticity means that the population standard deviation differs at different values of $x$. This appears to be the case for your data, so you should use weighted LLS.
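
As an illustration (not your actual data), a minimal sketch of how the reported ESDs can enter a weighted fit, using the sigma argument of scipy.optimize.curve_fit; the arrays, including the one deliberately imprecise point, are hypothetical.

```python
# Weighted vs. unweighted straight-line fit; all numbers are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def line(x, slope, intercept):
    return slope * x + intercept

x     = np.array([0.72, 0.86, 1.00, 1.02, 1.12])       # ionic radii / Angstrom
y     = np.array([2.10, 2.24, 2.38, 2.35, 2.51])       # bond distances / Angstrom
sigma = np.array([0.002, 0.003, 0.002, 0.020, 0.003])  # ESDs; one point is much worse

# Unweighted (ordinary least squares): every point counts equally.
p_ols, cov_ols = curve_fit(line, x, y)

# Weighted least squares: residuals are scaled by 1/ESD, so the poorly
# determined point has little influence on slope and intercept.
p_wls, cov_wls = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)

print("OLS slope, intercept:", p_ols, "+/-", np.sqrt(np.diag(cov_ols)))
print("WLS slope, intercept:", p_wls, "+/-", np.sqrt(np.diag(cov_wls)))
```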

Should one reflexively always perform a weighted linear analysis if the errors at each point are known?

Yes. In your case the larger uncertainty for some values implies that these should be weighted less.

Is there a statistical test to determine when one should or should not perform a weighted analysis?

If you had sample replicates at each $x$ you could perform a test for heteroscedasticity (non-uniformity of the standard deviations). Instead you have reported standard deviations and presumably no information about the size of the samples on which they were based, so those are essentially estimates of the population standard deviation obtained at different $x$ values under what might be different data acquisition conditions and sample sizes. If you had a measure of the sample sizes you could perform the statistical tests, but your options are more limited in their absence. You might want to ask at the statistics Stack Exchange site (Cross Validated) to see whether there are other options.
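
Purely as a what-if sketch: if replicate distance measurements did exist at each $x$, a test for equal variances such as Levene's test (available in scipy.stats) could be applied; the replicate groups below are entirely hypothetical.

```python
# Hypothetical replicate measurements at three x values; a small p-value from
# Levene's test would indicate unequal variances (heteroscedasticity) and
# favour a weighted fit.
import numpy as np
from scipy.stats import levene

replicates = [
    np.array([2.101, 2.099, 2.102]),  # replicates at x1 (hypothetical)
    np.array([2.240, 2.238, 2.243]),  # replicates at x2 (hypothetical)
    np.array([2.35, 2.31, 2.39]),     # replicates at x3, visibly noisier
]

stat, p_value = levene(*replicates)
print(f"Levene statistic = {stat:.3f}, p = {p_value:.3f}")
```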

