About regression analysis with categorical variables

Question

Suppose my dependent variable is a continuous variable and is normally distributed. And I have three IVs: one is a continuous variable, and the other two independent variables are categorical. What type of test and analysis should be used? The sample size is 400.

While others are suggesting multiple linear regression, which is the appropriate direction to go, I want to add that for linear regression it is not a requirement that your dependent variable be normally distributed. Rather it is the residuals of the model that must have a normal distribution. This can be checked using a qqplot or histogram. — Jack, Commented Feb 15 at 22:08
To illustrate Jack's point, see my answer here: stats.stackexchange.com/a/635609/345611 — Shawn Hemelstrand, Commented Feb 23 at 4:09

Christian Geiser · Accepted Answer · 2024-02-15 21:03:29Z

2

Multiple linear regression analysis could be an option. For polytomous nominal predictor variables, you would have to use binary code variables in the regression model (e.g., using dummy coding [0, 1] and one dummy variable less than there are categories). Equivalently, you could use analysis of covariance (ANCOVA).

answered Feb 15 at 21:03

Christian Geiser

3,7514 silver badges12 bronze badges

$\begingroup$ I think this answer is good (+1) but I would argue more for regression simply because it doesn't require as many assumptions and is more flexible in general. $\endgroup$
– Shawn Hemelstrand
Commented Feb 23 at 4:19

Add a comment |

Marko Lalovic · Accepted Answer · 2024-02-15 21:05:01Z

You can attempt to build a multiple regression model. A standard approach to perform regression with categorical variables is called one hot encoding. You encode each categorical variable with $k$ levels into $k - 1$ indicator variables. Here, each indicator variable is 1 if observation takes the value of that level and 0 otherwise. For example, the blood type of a person can be A, B, AB, or O. Thus, it will be converted into 3 indicator variables.

You don't have to do the encoding yourself. For example, this is pretty fast and easy in R. Since R will automatically do this for you as long as you treat categorical variables as factors. For example, if x2 and x3 are categorical, then you can call:

lm(y ~ x1 + x2 + x3, data = yourdata)

Finally, as a rule of thumb, it's recommended to have about 10-20 observations per independent variable.

I think the sample size rule of thumb may not hold for all conditions, probably principal among those factors being the amount of error present in the model (which will change how accurate the model will be wrt sample size differences). But in general I think this answer is still good (+1). — Shawn Hemelstrand, Commented Feb 23 at 4:17

Demetri Pananos · Accepted Answer · 2024-02-15 23:00:25Z

1

This really depends on what your research question is.

If you're simply interested in the effect of the continuous variable, you can just run a regression and look at the Wald test for the coefficient.

If your hypothesis is that the coefficients for the categorical variables are 0, you can run an F test.

And on and on and on.

Can you say more about what question you want to answer with this analysis?

answered Feb 15 at 23:00

Demetri Pananos

38k1 gold badge60 silver badges145 bronze badges

$\begingroup$ I'm not as much a fan of this answer because the wording seems to put primacy on $p$ values for modeling (wrt Wald and F tests) rather than understanding more important elements like the coefficients in and of themselves. $\endgroup$
– Shawn Hemelstrand
Commented Feb 23 at 4:15

Add a comment |

Stack Exchange Network

About regression analysis with categorical variables

3 Answers 3

Not the answer you're looking for? Browse other questions tagged
multiple-regression
nonparametric
stata
or ask your own question.

Linked

Hot Network Questions

About regression analysis with categorical variables

3 Answers 3

Not the answer you're looking for? Browse other questions tagged multiple-regressionnonparametricstata or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
multiple-regression
nonparametric
stata
or ask your own question.