I am well aware of the issues that stepwise regression causes. I want to demonstrate some of them via simulation in a particular situation.

I am thinking of a regression with a categorical variable of interest plus some covariates. I am operating under the assumption that some combination of these covariates matters and that stepwise selection will pick a good combination of them (probably not, but that failure is exactly what I want to show in my simulation).
However, if I just run an ordinary stepwise regression such as `MASS::stepAIC`, I run the risk of the final model not including that original categorical variable of interest.
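To make the problem concrete, here is a minimal sketch of the situation (the setup below, with one null categorical variable `g` and five pure-noise covariates, is my own illustrative assumption, not a prescribed design). Unconstrained backward elimination is free to drop `g`, in which case there is no t-statistic for it afterwards.

```r
library(MASS)

set.seed(1)
n <- 100
g <- factor(rep(c("a", "b"), each = n / 2))  # categorical variable of interest
X <- matrix(rnorm(n * 5), n, 5)              # pure-noise covariates
colnames(X) <- paste0("x", 1:5)
y <- rnorm(n)                                # null model: nothing truly matters

dat  <- data.frame(y, g, X)
full <- lm(y ~ ., data = dat)

# Unconstrained backward elimination: g is treated like any other term
sel <- MASS::stepAIC(full, direction = "backward", trace = FALSE)

# g may or may not survive; when it is dropped, no t-statistic for it exists
"g" %in% attr(terms(sel), "term.labels")
```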
What I would ideally like to do is run the stepwise selection on just the covariates and then, in the final model (after the stepwise elimination or inclusion), find the t-statistic for the original categorical variable of interest, as if I had specified that model from the beginning. But there will be no t-statistic if the variable is excluded entirely.
What would be the remedy? Sure, I could code my own stepwise selection algorithm that never considers a particular variable, but I am not even sure what I would do if the "correct" stepwise elimination step is to remove that main variable of interest. Is that the end of my backward elimination?
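One possible remedy, sketched below: `MASS::stepAIC` takes a `scope` argument, and supplying a `lower` formula containing the variable of interest forces it into every candidate model, so backward elimination only ever considers dropping the covariates (the simulated data here are again an illustrative assumption).

```r
library(MASS)

set.seed(1)
n <- 100
g <- factor(rep(c("a", "b"), each = n / 2))  # variable of interest
X <- matrix(rnorm(n * 5), n, 5)
colnames(X) <- paste0("x", 1:5)
y <- rnorm(n)
dat <- data.frame(y, g, X)

full <- lm(y ~ ., data = dat)

# lower = ~ g keeps g in every model the search visits
sel <- MASS::stepAIC(full, direction = "backward", trace = FALSE,
                     scope = list(lower = ~ g))

# g is guaranteed to be in the final model, so its t-statistic always exists
summary(sel)$coefficients["gb", "t value"]
```

This sidesteps the "what if the correct step is to remove g" dilemma entirely: that step is simply never on the table, which is exactly the behavior wanted when g is the inferential target rather than a selection candidate.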
Citing a simulation study (or mathematical derivation) showing what happens to the test statistic distributions would make for an interesting answer.
Comments:

- `L1 <- lm(y ~ x1 + x2 + x3)` and then `L2 <- lm(resid(L1) ~ x1 + x2 + x3)`, right?
- R code to do the job (the `take.out` function). The idea merely generalizes the familiar concept of centering all variables for multiple regression to avoid including an explicit constant term. A mathematical demonstration is given at stats.stackexchange.com/a/113207/919.