$\begingroup$

I am well aware of the problems that stepwise regression causes. I want to demonstrate some of them via simulation in a particular situation.

I am thinking of a regression where I have some categorical variable of interest and then some covariates. I am operating under the assumption that some combination of these covariates matters, and that the stepwise selection will select a good combination of them (probably not, but that failure is what I want to show in my simulation).

However, if I just run a standard stepwise regression such as `MASS::stepAIC`, I run the risk that the selection drops the categorical variable of interest.

What I would ideally like to do is run the stepwise selection on just the covariates and then compute the t-stat for the categorical variable of interest in the final model (after the stepwise elimination or inclusion), as if I had specified that model from the beginning. There will be no t-stat, however, if the variable is excluded entirely.

What would be the remedy? Sure, I can code my own stepwise selection algorithm that does not consider a particular variable, but I am not even totally sure what I would do if the "correct" stepwise elimination step is to remove that main variable of interest. Is that the end of my backward elimination?

Citing a simulation study (or mathematical derivation) showing what happens to the test statistic distributions would make for an interesting answer.

$\endgroup$
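One way to keep the variable of interest in every candidate model, sketched here under assumptions rather than as a definitive recipe: `MASS::stepAIC` takes a `scope` list whose `lower` formula names terms that are never dropped. The data, the names `group` and `x1` to `x5`, and the effect sizes below are all invented for illustration.

```r
library(MASS)  # provides stepAIC

set.seed(1)
n <- 200

# Hypothetical data: 'group' is the categorical variable of interest,
# x1..x5 are candidate covariates; only x1 and x2 affect y here.
dat <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
)
dat$y <- 1 + 0.5 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

full <- lm(y ~ group + x1 + x2 + x3 + x4 + x5, data = dat)

# Terms in 'lower' can never be removed; 'upper' is the largest model allowed.
sel <- stepAIC(
  full,
  scope = list(lower = ~ group,
               upper = ~ group + x1 + x2 + x3 + x4 + x5),
  direction = "backward",
  trace = FALSE
)

# Naive post-selection inference, computed as if 'sel' had been
# specified in advance (this is the quantity whose distribution is in doubt).
summary(sel)$coefficients
anova(update(sel, . ~ . - group), sel)  # naive F-test for 'group'
```

Wrapping this in a loop over simulated data sets and collecting the naive statistics for `group` would give the test-statistic distribution the question asks about.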
  • 1
    $\begingroup$ It seems like this might be a request for code, which would be off topic. In SAS, you can use the INCLUDE option on the SELECTION statement in PROC GLMSELECT. There is probably a way to do this in R, but I don't know what it is. $\endgroup$
    – Peter Flom
    Commented Jun 13 at 11:32
  • 2
    $\begingroup$ This has been studied enough, for example onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780100504 - we don’t really need more demonstrations of what a disaster stepwise regression is. $\endgroup$ Commented Jun 13 at 12:29
  • 1
    $\begingroup$ The simplest workaround perhaps is to replace all variables--explanatory and response--by their residuals in a regression against the variable of interest. Run the model selection process on the residuals in place of the original variables. This is mathematically identical to forcing the variable of interest to be included in all the models visited in the stepwise algorithm. $\endgroup$
    – whuber
    Commented Jun 20 at 21:33
  • 1
    $\begingroup$ @whuber That's totally straightforward to implement! Why is that equivalent, though? $//$ You mean something like L1 <- lm(y ~ x1 + x2 + x3) and then L2 <- lm(resid(L1) ~ x1 + x2 + x3), right? $\endgroup$
    – Dave
    Commented Jun 20 at 21:35
  • 1
    $\begingroup$ Replace the $x_i$ with their residuals, too. This is explained at stats.stackexchange.com/a/46508/919 which includes R code to do the job (the take.out function). The idea merely generalizes the familiar concept of centering all variables for multiple regression to avoid including an explicit constant term. A mathematical demonstration is given at stats.stackexchange.com/a/113207/919. $\endgroup$
    – whuber
    Commented Jun 20 at 21:40
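The residual-replacement idea from the comments can be checked numerically. The following is a minimal sketch on invented data, not the `take.out` function from the linked answer; `partial_out` is a hypothetical helper name. It regresses the response and each covariate on the variable of interest, then compares the fit on the residuals with the fit that includes the variable directly.

```r
set.seed(2)
n <- 200

# Hypothetical data: x1 is deliberately correlated with 'group'.
group <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x1 <- rnorm(n) + (group == "b")
x2 <- rnorm(n)
y  <- 1 + (group == "c") + 0.5 * x1 + rnorm(n)

# Residualise a variable against 'group' (and the intercept).
# 'partial_out' is an illustrative name, not an existing function.
partial_out <- function(v) resid(lm(v ~ group))

ry  <- partial_out(y)
rx1 <- partial_out(x1)
rx2 <- partial_out(x2)

fit_forced <- lm(y ~ group + x1 + x2)   # 'group' included explicitly
fit_resid  <- lm(ry ~ 0 + rx1 + rx2)    # 'group' partialled out beforehand

# The covariate coefficients and the residual sum of squares coincide.
all.equal(unname(coef(fit_forced)[c("x1", "x2")]), unname(coef(fit_resid)))
all.equal(deviance(fit_forced), deviance(fit_resid))
```

Because the residual sums of squares match for every subset of covariates, a selection criterion based on them should visit the same models either way. The residual fit does not know that degrees of freedom were spent on `group`, so its reported standard errors and t-statistics use the wrong residual degrees of freedom and would need adjusting.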
