$\begingroup$

I need your help with some work I am doing.

Some context first: I am writing a dissertation for my master's degree. The topic is perceived trust in Smart Home technology. I launched a survey with closed-ended questions for demographic data and a Likert scale that asks 8 questions on a scale of 1 to 5. I gathered 159 responses in total.

The 8 questions in the Likert scale actually measure 4 different dependent variables: Q1/Q2 make up dependent variable 1, Q3/Q4 dependent variable 2, and so on. Since it's a Likert scale, the data are not interval-scaled, so I took the sum of Q1 and Q2 and divided it by 2, which gave me a mean. This mean is one of the 4 dependent variables. I did the same for the other 3. Here is an example of the Likert scale: [image: example of the Likert scale]
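The averaging step described above can be sketched as follows. This is only an illustration: the respondent data are made up, and the assignment of construct names to item pairs (Q1/Q2 to predictability, and so on) is an assumption based on the order in which the variables are listed.

```python
# Hypothetical responses: one dict per respondent, each item coded 1 to 5.
responses = [
    {"Q1": 4, "Q2": 5, "Q3": 2, "Q4": 3, "Q5": 3, "Q6": 3, "Q7": 5, "Q8": 4},
    {"Q1": 2, "Q2": 2, "Q3": 4, "Q4": 4, "Q5": 1, "Q6": 2, "Q7": 3, "Q8": 3},
]

# Assumed mapping of item pairs to constructs (the names are illustrative).
CONSTRUCTS = {
    "predictability": ("Q1", "Q2"),
    "dependability": ("Q3", "Q4"),
    "faith": ("Q5", "Q6"),
    "usefulness": ("Q7", "Q8"),
}


def composite_scores(row):
    """Return the mean of the items belonging to each construct."""
    return {
        name: sum(row[item] for item in items) / len(items)
        for name, items in CONSTRUCTS.items()
    }


scores = [composite_scores(row) for row in responses]
for s in scores:
    print(s)
```

Each composite takes values in steps of 0.5 between 1 and 5, so it is still a bounded, discrete score.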

The IVs: age (integer from 18 to 99), gender (0 = male, 1 = female), educational level (0 = low, 1 = mid, 2 = high), income in ranges (below 24,999 = 0 = low; 25,000 to 39,999 = 1 = mid; more than 40,000 = 2 = high), and household size.

The DVs: predictability of the technology, dependability of the technology, faith in the technology, and technology usefulness.

I have 4 different hypotheses for this, one for each dependent variable. Here is an example: There is a relationship between at least one of the independent variables and predictability of smart home technology.

The idea is to test each of these dependent variables and see whether it can be predicted from the independent variables (and control variables) that I have (age, gender, educational attainment, household size and income). I read that a multiple linear regression would be enough for that. So I started reading about the method and saw that some assumptions need to be met before I can use it. For normality, 3 of the 4 dependent variables were normally distributed, but the last one was not quite normally distributed. Secondly, testing the four variables for linearity showed that none of them is linear.

Now I need to start the analysis part of my dissertation, but I have no clue which method I should use, since the assumptions of multiple linear regression are not met. I know about non-parametric tests, but I can't find any non-parametric alternative to multiple linear regression. If you need more info about the variables etc., let me know and I will provide it! Thanks for your help and time.

$\endgroup$
7
  • 1
    $\begingroup$ 1. There is no need for dependent or independent variables to be normally distributed; it is helpful (but not strictly necessary) for the model residuals to be normally distributed $\endgroup$
    – mkt
    Commented May 11, 2023 at 18:24
  • $\begingroup$ 2. What do you mean by linearity? What exactly did you test? $\endgroup$
    – mkt
    Commented May 11, 2023 at 18:24
  • $\begingroup$ 3. It's not clear why adding your Likert scale responses is useful $\endgroup$
    – mkt
    Commented May 11, 2023 at 18:25
  • $\begingroup$ 4. Explaining your models, variables, and questions/hypotheses in more detail (or at least in one example) may be useful for getting more feedback. $\endgroup$
    – mkt
    Commented May 11, 2023 at 18:27
  • $\begingroup$ @mkt Hello there, thank you for your quick reply. I will try to answer all of your questions. But what do you mean by 3.? I don't quite get your question; care to elaborate? $\endgroup$
    – Abdel Kdj
    Commented May 11, 2023 at 18:31

1 Answer

$\begingroup$

As mkt said in a comment, linear regression does not assume that the dependent variable itself is normally distributed; the normality assumption concerns only the residuals.
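To illustrate the difference, here is a small simulation sketch with made-up data (a single binary predictor and normal errors; none of this comes from the survey). The dependent variable is clearly bimodal, yet the residuals look normal. Excess kurtosis is used only as a crude shape summary here: it is about 0 for a normal distribution and strongly negative for a well-separated bimodal one.

```python
import random
import statistics


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (approximately 0 for normal data)."""
    m = statistics.fmean(values)
    var = statistics.fmean([(v - m) ** 2 for v in values])
    fourth = statistics.fmean([(v - m) ** 4 for v in values])
    return fourth / var ** 2 - 3.0


random.seed(0)
n = 2000
group = [i % 2 for i in range(n)]  # hypothetical binary predictor
# Simulated response: large group effect plus normal noise.
y = [1.0 + 4.0 * g + random.gauss(0.0, 0.5) for g in group]

# With one binary predictor, the OLS fitted value is simply the group mean.
group_mean = {
    g: statistics.fmean([yi for yi, gi in zip(y, group) if gi == g])
    for g in (0, 1)
}
residuals = [yi - group_mean[gi] for yi, gi in zip(y, group)]

dv_kurtosis = excess_kurtosis(y)
resid_kurtosis = excess_kurtosis(residuals)
print("excess kurtosis of the DV:       ", round(dv_kurtosis, 2))
print("excess kurtosis of the residuals:", round(resid_kurtosis, 2))
```

A normality check applied to `y` would flag a violation here even though the model's error assumption holds by construction.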

Regarding the non-linearity, you can consider a non-linear transformation of the predictor variables. For continuous variables like age, taking a logarithm often helps. You can also try a polynomial or splines, e.g. restricted cubic splines.
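As a rough sketch of the logarithm idea, the following fits a one-predictor least squares line to made-up, noise-free data in which the response depends on log(age), once with raw age and once with log(age), and compares $R^2$. The data, the coefficients, and the variable name `trust` are all assumptions for illustration.

```python
import math


def simple_ols(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Share of the variance of y explained by the fitted line."""
    intercept, slope = simple_ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


ages = list(range(18, 80, 4))                    # hypothetical ages
trust = [1.0 + 0.8 * math.log(a) for a in ages]  # made-up response, no noise

r2_raw = r_squared(ages, trust)
r2_log = r_squared([math.log(a) for a in ages], trust)
print("R^2 with raw age:", round(r2_raw, 4))
print("R^2 with log age:", round(r2_log, 4))
```

With real data the gain will be smaller and should be judged from residual plots rather than from $R^2$ alone.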

For discrete variables, whose values are just arbitrary encodings, you are free to change the encoding. E.g. instead of 0, 1, and 2 for low, mid, and high, you may take 0, 1, and 5, if it makes the dependency more linear.
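A minimal sketch of such a recoding, with made-up data chosen so that the jump from mid to high is four times the jump from low to mid. The variable names and values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: education coded 0 = low, 1 = mid, 2 = high.
education = [0, 0, 1, 1, 2, 2]
trust = [2.0, 2.0, 2.5, 2.5, 4.5, 4.5]

# Alternative encoding that stretches the distance between mid and high.
recode = {0: 0, 1: 1, 2: 5}
education_recoded = [recode[e] for e in education]

r_original = pearson_r(education, trust)
r_recoded = pearson_r(education_recoded, trust)
print("r with coding 0/1/2:", round(r_original, 3))
print("r with coding 0/1/5:", round(r_recoded, 3))
```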

Update:

"Normality" is a mathematical ideal which can never be truly achieved in practice. Even for your data where the Shapiro-Wilk test is negative! The negative test result does not confirm that the data are normally distributed; it just fails to reject that hypothesis.

So the real question you should be interested in is whether the violation of the normality assumption is so strong that an alternative gives better predictions than ordinary least squares regression.

As for alternatives, I can think of regression that minimizes the mean absolute deviation (MAD) or SVM regression with a linear kernel (which is basically MAD regression with some tolerance $\epsilon$), but I don't know whether they are available in Stata.
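To show what minimizing absolute instead of squared residuals does, here is a toy sketch for a single predictor. It relies on the fact that, with one predictor, some line through two of the data points always attains the minimum sum of absolute residuals, so a brute-force search over all pairs is enough. This is for illustration on tiny made-up data only, not a practical implementation; the function names are hypothetical.

```python
from itertools import combinations


def lad_line(x, y):
    """Intercept and slope minimizing the sum of absolute residuals.

    Brute force: try every line through two data points and keep the best.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, not a valid regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yk - (intercept + slope * xk)) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


def ols_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data on the line y = 2x + 1, with the last point turned into an outlier.
x = list(range(10))
y = [2.0 * xi + 1.0 for xi in x]
y[-1] = 100.0

lad_intercept, lad_slope = lad_line(x, y)
ols_intercept, ols_slope = ols_line(x, y)
print("absolute-deviation fit:", lad_intercept, lad_slope)
print("least squares fit:     ", round(ols_intercept, 2), round(ols_slope, 2))
```

The absolute-deviation fit ignores the single outlier, while the least squares slope is pulled strongly towards it.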

$\endgroup$
3
  • $\begingroup$ Thank you for your answer, I will test it out. I tested the normality of the residuals for the 4 regressions (one for each of my dependent variables), and Stata shows that 2 of the 4 are normally distributed, both with the Shapiro-Wilk test and visually with the commands kdensity, pnorm and qnorm. So my other question is: if these assumptions keep being unmet, what alternatives do I have? $\endgroup$
    – Abdel Kdj
    Commented May 11, 2023 at 19:32
  • 1
    $\begingroup$ @AbdelKdj Tests for normality are bad: stats.stackexchange.com/questions/2492/… . I'd recommend taking a step back and learning a little more about this; you are making several questionable assumptions, and it's not possible to address them all in any single question or answer. $\endgroup$
    – mkt
    Commented May 12, 2023 at 10:03
  • $\begingroup$ If the response variable is a Likert scale, then obviously normality is just as violated for the residuals as it is for the DV. The assumption is indeed that the DV is normally distributed - conditionally. Focusing on residuals often just obscures this important fact. $\endgroup$ Commented May 12, 2023 at 13:44
