
I used generalized linear mixed models (with the glmmADMB package) to identify environmental factors related to parasite abundance in rodents. I used stepwise backward elimination to sequentially simplify the full model until only significant factors and interaction terms remained (Crawley, 2007). However, in the final model's summary, one of the factors (temperature) is no longer significant (p = 0.087), even though it was significant in the ANOVA model comparisons that led to the final model:

Final model's summary:

Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   3.131658   0.213630     15.33   < 2e-16
temperature  -0.004019   0.0033143    -1.71   0.08673 .

ANOVA model comparison:

Model 1: parasites ~ age + vegetation + synchrony
Model 2: parasites ~ temperature + age + vegetation + synchrony

  NoPar   LogLik  Df  Deviance  Pr(>Chi)
1     7  -550.75
2     8  -546.81   1     7.898  0.004949 **

My questions are: which significance value should I report? Does this mean that temperature is not important? I've seen manuscripts that report values from the final model's summary, but also some that report significance values from model comparisons. Which one is correct?

  • Stepwise regression has been widely discredited. What about your situation makes you think it works correctly? – Commented Mar 2, 2014 at 20:47
  • I wouldn't say $p$-values are simply frowned upon by optimally fair-minded statisticians; the real danger with them is misinterpretation, or arguably over-reliance on them for purposes to which they're not suited. Stepwise regression is probably harder to defend (see also this, this, and this). Sorry about that advisor of yours :\ Anyway, these look like $p$-values from two different kinds of tests. – Commented Mar 2, 2014 at 21:27
  • To allow an advisor to advocate terrible statistical practice is simply unacceptable. – Commented Mar 2, 2014 at 22:20

3 Answers

Answer 1 (score 7):

You should report the significance (along with everything else) from the initial full model. As @Frank Harrell notes, stepwise model selection is invalid. (If you want to understand this more fully, it may help you to read my answer here.) People often believe that leaving 'non-significant' variables in a model will cause it to be overfitted, but it is more likely that looking at your data and continuing to change your model until it gives a picture you like is what will yield an overfitted model. If you are worried that the model's predictive accuracy may be affected by the inclusion of the other variables, or that model fit statistics (like $R^2$) will be over-optimistic, you can estimate these via cross-validation. A general rule of thumb is that you will be OK if you have at least 10 observations per variable in your model.
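
For instance, here is a minimal k-fold cross-validation sketch. The rodents data frame and its column names are assumptions taken from the question's model formulas, and for brevity it fits a plain negative-binomial GLM with MASS::glm.nb; in practice you would refit your glmmADMB mixed model inside the loop in the same way:

library(MASS)  # provides glm.nb() for a negative-binomial GLM

set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(rodents)))  # random fold labels

cv_mse <- sapply(1:k, function(i) {
  train <- rodents[folds != i, ]
  test  <- rodents[folds == i, ]
  fit   <- glm.nb(parasites ~ temperature + age + vegetation + synchrony,
                  data = train)
  pred  <- predict(fit, newdata = test, type = "response")
  mean((test$parasites - pred)^2)  # held-out squared error for this fold
})

mean(cv_mse)  # cross-validated estimate of the full model's predictive error

Comparing this cross-validated error across candidate models addresses the predictive-accuracy worry directly, without any data-driven screening of p-values.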

Regarding why you get these seemingly counter-intuitive results: the stepwise algorithm doesn't have any way of knowing what the variables are, what their relationships are to each other, or what needs to remain in the model. That's one of the reasons it produces such poor results. My guess is that the other variables, when included in the model, absorb more of the residual variability than they consume in degrees of freedom (see @whuber's answer here). I also suspect there may be some collinearity amongst them, such that none shows up as 'significant', but all should be included anyway. That sort of thing can happen very easily, and it is another example of something stepwise selection methods cannot detect. You can check for it by dropping all three and performing a nested model test.
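
Here is a sketch of that nested test under the glmmADMB setup from the question; the rodents data frame and the (1 | site) random effect are placeholders for whatever you actually fitted:

library(glmmADMB)

## Full model vs. the model with age, vegetation & synchrony all dropped at once
full    <- glmmadmb(parasites ~ temperature + age + vegetation + synchrony +
                      (1 | site), data = rodents, family = "nbinom")
reduced <- glmmadmb(parasites ~ temperature + (1 | site),
                    data = rodents, family = "nbinom")

## A single likelihood-ratio test on 3 df for the whole block: a significant
## result suggests the block matters jointly, even if no single member of it
## looks 'significant' on its own.
anova(reduced, full)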

Answer 2 (score 6):

The only correct answer is neither.† As @Nick says, p-values aren't frowned upon per se, & stepwise techniques can't be guaranteed to produce bad models; but why on earth look at p-values calculated after following a procedure that invalidates them?

The idea behind variable selection is to get a model that predicts better by introducing a little bias into the predictions but reducing their variance a lot. Even if you manage to achieve this, it's senseless to "forget" you did it & carry out inference on the coefficients of the reduced model as if you hadn't.

(Worse than using stepwise is using stepwise to try to fix a problem you may not care about much, without checking whether you in fact have that problem, & then not checking whether stepwise made things better or worse. That's why you should consider what your model's going to be used for, &, if you still want to use stepwise, validate the full & reduced models.)
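
The crudest form of such validation is a single train/test split; in this sketch the rodents data frame is hypothetical and MASS::glm.nb stands in for the mixed model:

library(MASS)

set.seed(2)
idx   <- sample(nrow(rodents), size = round(0.7 * nrow(rodents)))
train <- rodents[idx, ]
test  <- rodents[-idx, ]

full    <- glm.nb(parasites ~ temperature + age + vegetation + synchrony,
                  data = train)
reduced <- glm.nb(parasites ~ temperature, data = train)  # the stepwise survivor

holdout_mse <- function(fit)  # held-out squared prediction error
  mean((test$parasites - predict(fit, newdata = test, type = "response"))^2)

c(full = holdout_mse(full), reduced = holdout_mse(reduced))

If the reduced model doesn't predict clearly better here, the selection procedure bought you nothing.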

[PS: Crawley (2007) gives poor advice on modelling. To go from Occam's Razor to stepwise selection is a clear non sequitur: though there are often reasons other than potentially improved predictive performance to want to fit simpler models, there's no reason at all to suppose that stepwise selection takes those reasons into account. If you think a small effect is local, temporary, or artefactual, you've every right to propose a scientific model excluding it (invoking the "Principle of Parsimony" if you want to sound grand); this has nothing to do with whether the variable in question drops out during stepwise selection.]

† Unless Model 2 in the ANOVA test in question is the full model.

  • +1; however, I suspect the ANOVA p-value is from the initial full model, meaning that it is a valid p-value. – Commented Mar 2, 2014 at 21:52
  • @gung: I was taking it to be the penultimate model in the stepwise procedure (the interaction terms mentioned aren't in it), but worth clarifying. – Commented Mar 2, 2014 at 22:44
  • Hmmm, good point, maybe it's not the full model. – Commented Mar 2, 2014 at 22:56
Answer 3 (score 0):

It seems you are obtaining two different p-values because two different tests are being conducted. The p-value in the model's summary comes from a Wald test, while the p-value in the anova output comes from a likelihood-ratio test. The two tests are asymptotically equivalent, but in smaller samples there can be differences, especially in the tails of the sampling distributions. The Wald test is the simpler one: it is easy to compute from just the parameter estimates and their (asymptotic) standard errors. The likelihood-ratio test, on the other hand, requires the likelihoods of both a full and a reduced model, so "the LRT is computationally more demanding, but it's more powerful and reliable. The likelihood ratio test is almost always preferable to the Wald test, unless computational demands make it impractical to refit the model" (SAS documentation: http://support.sas.com/documentation/cdl/en/statug/66859/HTML/default/viewer.htm#statug_glimmix_details31.htm).
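
In R terms, the two tests arise like this; m1 and m2 are hypothetical glmmADMB fits mirroring the two models in the question's anova output, with a placeholder (1 | site) random effect:

library(glmmADMB)

m1 <- glmmadmb(parasites ~ age + vegetation + synchrony + (1 | site),
               data = rodents, family = "nbinom")
m2 <- glmmadmb(parasites ~ temperature + age + vegetation + synchrony +
                 (1 | site), data = rodents, family = "nbinom")

summary(m2)    # Wald z-tests: Estimate / Std. Error, referred to N(0, 1)
anova(m1, m2)  # likelihood-ratio test for temperature, on 1 df

## The LRT by hand, using the log-likelihoods reported in the question:
2 * (-546.81 - (-550.75))                 # = 7.88, the reported deviance (up to rounding)
pchisq(7.88, df = 1, lower.tail = FALSE)  # ~ 0.005, matching the reported 0.004949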

As an aside, the question doesn't actually have anything to do with the arguments and counterarguments regarding stepwise selection procedures, even though the stepwise police immediately jumped on that. The issue the question refers to can occur even when no predictor selection is conducted.

Now, perhaps there are tweaks that can improve the performance of the maligned stepwise procedures: for instance, not using a static alpha, but a significance level that changes according to the number of model refits conducted, e.g. FDR corrections that account for dependency among the test statistics, or alpha-spending functions. Other options could include resampling and model averaging. All I'm saying is: keep an open mind, guys. There is nothing wrong with trying to remove noise or useless predictors from a model; as George Box said: "Since all models are wrong the scientist cannot obtain a 'correct' one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity." (http://en.wikiquote.org/wiki/George_E._P._Box)
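
To make the FDR idea concrete, a toy sketch: the per-step p-values below are made up, and method = "BY" requests the Benjamini-Yekutieli correction, which controls the false discovery rate under arbitrary dependence between the test statistics:

p_steps <- c(0.004949, 0.032, 0.180, 0.430)  # hypothetical p-value from each elimination step
p.adjust(p_steps, method = "BY")             # dependency-robust FDR adjustment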

That said, I agree that inference on the 'final' model might require some adjustments, or at least a cautionary note.

  • Good point, though there's still no reason to suppose an LRT that a coefficient is different from zero should give the same results in different models, nor a Wald test. And how do you adjust inference on the final model? – Commented Mar 28, 2014 at 9:51
  • Sorry, what do you mean by different models? Generalizability of the result? – user16263, Commented Mar 28, 2014 at 17:33
  • There's a model with temperature as the only predictor (the "final model"), & a model with temperature, age, vegetation & synchrony, fitted at some stage in the stepwise process. So the coefficient estimate for temperature will be different in each model in any case. – Commented Mar 28, 2014 at 17:41
  • As for how to adjust inference on a 'final' model after stepwise selection, that's the issue no one has yet figured out satisfactorily (that I know of). Some statisticians simply recommend advanced procedures such as penalized regression instead of stepwise selection, which is fine. However, those procedures are not readily available in other situations, e.g. mixed-effects models. So it would be great to be able to trim a model quickly and easily with some sort of stepwise elimination and adjust SEs, CIs and p-values accordingly. – user16263, Commented Mar 28, 2014 at 17:42
  • The resulting model might not be optimal, but it could be good enough: a reasonable compromise between the complexity/effort of a procedure and the end results. A false-discovery-rate approach to each model refit could be a candidate for this, as in this 2009 paper by no less than Yoav Benjamini: arxiv.org/pdf/0905.2819.pdf – user16263, Commented Mar 28, 2014 at 17:42
