3
$\begingroup$

Suppose my dependent variable is continuous and normally distributed, and I have three independent variables (IVs): one continuous and two categorical. What type of test or analysis should I use? The sample size is 400.

$\endgroup$
2
  • 2
    $\begingroup$ While others are suggesting multiple linear regression, which is the appropriate direction to go, I want to add that linear regression does not require your dependent variable to be normally distributed. Rather, the normality assumption applies to the model's errors, which you assess through the residuals. This can be checked using a Q-Q plot or a histogram of the residuals. $\endgroup$
    – Jack
    Commented Feb 15 at 22:08
  • $\begingroup$ To illustrate Jack's point, see my answer here: stats.stackexchange.com/a/635609/345611 $\endgroup$ Commented Feb 23 at 4:09
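
A minimal sketch of the residual check mentioned in the comments, using simulated data. The variable names (`x1`, `g1`, `g2`, `y`) and the effect sizes are made up for illustration. The outcome is deliberately built with a large group effect, so `y` itself is far from normal even though the errors are normal.

```r
# Simulated data; all names and effect sizes are hypothetical
set.seed(1)
n  <- 400
x1 <- rnorm(n)
g1 <- factor(sample(c("a", "b"), n, replace = TRUE))
g2 <- factor(sample(c("low", "mid", "high"), n, replace = TRUE))

# Large group effect makes y bimodal marginally, while the errors are normal
y <- 1 + 0.5 * x1 + 6 * (g1 == "b") + rnorm(n)

fit <- lm(y ~ x1 + g1 + g2)
res <- resid(fit)

# Visual checks on the residuals, not on y itself
qqnorm(res)
qqline(res)
hist(res, breaks = 30, main = "Residuals")
```

Running `hist(y)` on the same data should show two humps, which is why inspecting the raw outcome can be misleading.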

3 Answers

2
$\begingroup$

Multiple linear regression analysis could be an option. For polytomous nominal predictor variables, you would have to enter binary-coded variables in the regression model (e.g., dummy coding with values 0 and 1, using one fewer dummy variable than there are categories). Equivalently, you could use analysis of covariance (ANCOVA).
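
As a rough illustration of that equivalence, the sketch below fits the same simulated data with `lm()` and with `aov()` and compares the estimated coefficients. The data and the names `x1`, `g1`, `g2` are invented for the example.

```r
# Simulated data; names and effect sizes are hypothetical
set.seed(2)
n  <- 400
x1 <- rnorm(n)
g1 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
g2 <- factor(sample(c("u", "v"), n, replace = TRUE))
y  <- 2 + 0.8 * x1 + 1.5 * (g1 == "b") - 1 * (g2 == "v") + rnorm(n)

fit_lm  <- lm(y ~ x1 + g1 + g2)   # regression with dummy-coded factors
fit_aov <- aov(y ~ x1 + g1 + g2)  # ANCOVA-style call on the same formula

# Both calls rest on the same least-squares fit, so the coefficients agree
all.equal(coef(fit_lm), coef(fit_aov))
```

The two functions mainly differ in how they summarise the fit: `summary(fit_lm)` reports coefficients, while `summary(fit_aov)` reports a sums-of-squares table.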

$\endgroup$
1
  • $\begingroup$ I think this answer is good (+1) but I would argue more for regression simply because it doesn't require as many assumptions and is more flexible in general. $\endgroup$ Commented Feb 23 at 4:19
2
$\begingroup$

You can attempt to build a multiple regression model. A standard approach to regression with categorical variables is dummy coding (a close relative of one-hot encoding, which would use $k$ indicators instead). You encode each categorical variable with $k$ levels into $k - 1$ indicator variables, leaving one level out as the reference category. Each indicator variable is 1 if the observation takes the value of that level and 0 otherwise. For example, the blood type of a person can be A, B, AB, or O, so it is converted into 3 indicator variables.
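
To make the blood type example concrete, here is a small sketch that builds the design matrix R would use. The six observations are made up, and A is chosen as the reference level only for illustration.

```r
# Hypothetical blood types for six people; "A" is set as the reference level
blood <- factor(c("A", "B", "AB", "O", "A", "O"),
                levels = c("A", "B", "AB", "O"))

# Intercept plus k - 1 = 3 indicator columns
X <- model.matrix(~ blood)
colnames(X)
X
```

Rows for people with type A have zeros in all three indicator columns, which is what makes A the baseline that the other coefficients are compared against.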

You don't have to do the encoding yourself. In R, for example, this is fast and easy, because R does it automatically as long as you treat the categorical variables as factors. If x2 and x3 are categorical, you can call:

lm(y ~ x1 + x2 + x3, data = yourdata)
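
A self-contained sketch of that call on simulated data. The contents of `yourdata`, the category labels, and the effect sizes are all assumptions made up for the example.

```r
# Simulated stand-in for 'yourdata'; values and labels are hypothetical
set.seed(3)
yourdata <- data.frame(
  x1 = rnorm(400),
  x2 = sample(c("A", "B", "AB", "O"), 400, replace = TRUE),
  x3 = sample(c("yes", "no"), 400, replace = TRUE)
)
yourdata$y <- 1 + 0.5 * yourdata$x1 + (yourdata$x2 == "O") + rnorm(400)

# Declare the categorical predictors as factors
yourdata$x2 <- factor(yourdata$x2)
yourdata$x3 <- factor(yourdata$x3)

fit <- lm(y ~ x1 + x2 + x3, data = yourdata)

# One row per coefficient: intercept, x1, 3 indicators for x2, 1 for x3
summary(fit)$coefficients
```

If a categorical variable is stored as numeric codes (say 1 to 4), wrapping it in `factor()` matters, since otherwise `lm()` would treat it as a single continuous predictor.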

Finally, as a rule of thumb, it's recommended to have about 10-20 observations per independent variable, counting each indicator variable separately.

$\endgroup$
1
  • $\begingroup$ I think the sample size rule of thumb may not hold under all conditions. Probably the most important factor is the amount of error in the model, which affects how accurate the model will be at a given sample size. But in general I think this answer is still good (+1). $\endgroup$ Commented Feb 23 at 4:17
1
$\begingroup$

This really depends on what your research question is.

If you're simply interested in the effect of the continuous variable, you can just run a regression and look at the Wald test for the coefficient.

If your hypothesis is that the coefficients for the categorical variables are all 0, you can run an F test that compares the model with and without those variables.
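
Both tests can be sketched in R as follows. The data are simulated, and the names `d`, `full`, and `reduced` are placeholders rather than anything from the question.

```r
# Simulated data; names and effect sizes are hypothetical
set.seed(4)
n <- 400
d <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x3 = factor(sample(c("u", "v"), n, replace = TRUE))
)
d$y <- 1 + 0.5 * d$x1 + 0.7 * (d$x2 == "c") + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- lm(y ~ x1, data = d)

# Wald-type t test for the continuous predictor's coefficient
summary(full)$coefficients["x1", ]

# F test of H0: all coefficients for x2 and x3 are 0
cmp <- anova(reduced, full)
cmp
```

The F test here has 3 numerator degrees of freedom: two indicators for `x2` plus one for `x3`.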

And on and on and on.

Can you say more about what question you want to answer with this analysis?

$\endgroup$
1
  • $\begingroup$ I'm not as much a fan of this answer because the wording seems to put primacy on $p$ values for modeling (wrt Wald and F tests) rather than understanding more important elements like the coefficients in and of themselves. $\endgroup$ Commented Feb 23 at 4:15
