$\begingroup$

I have run multiple tests to determine normality on my dataset, but I am unsure which one to trust, especially since my histograms, density plots, and QQ plots leave much to be desired in terms of looking normal. To sum up my data, I have $37$ total observations in $2$ groups (Group 1: $n=16$, Group 2: $n=21$). If I apply a t-test, ANOVA, linear regression, or Mann-Whitney U (to determine any difference between groups), I get a significant result, but I'm not actually sure which one to use! I know the difference between the tests; I just mean I'm not sure whether I can use parametric tests in this scenario.

Here is what I have gotten out of my normality tests in R:

  • Shapiro-Wilk: normal
  • Kolmogorov-Smirnov: normal
  • Cramér-von Mises: normal
  • Anderson-Darling: not normal
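For what it's worth, these checks can be reproduced outside R as well; the sketch below runs the same four tests in Python with scipy on made-up data of the same group sizes (all variable names and values here are illustrative, not my actual data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group1 = rng.normal(10, 2, 16)   # hypothetical Group 1, n = 16
group2 = rng.normal(12, 3, 21)   # hypothetical Group 2, n = 21
x = np.concatenate([group1, group2])   # all 37 observations

sw_stat, sw_p = stats.shapiro(x)                               # Shapiro-Wilk
# K-S against a normal with estimated parameters (strictly speaking,
# estimating mean/sd from the same data calls for a Lilliefors correction)
ks_stat, ks_p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
cvm = stats.cramervonmises(x, 'norm', args=(x.mean(), x.std(ddof=1)))
ad = stats.anderson(x, dist='norm')   # compare statistic to critical values

print(f"Shapiro-Wilk p = {sw_p:.3f}")
print(f"Kolmogorov-Smirnov p = {ks_p:.3f}")
print(f"Cramer-von Mises p = {cvm.pvalue:.3f}")
print(f"Anderson-Darling stat = {ad.statistic:.3f}, "
      f"5% critical value = {ad.critical_values[2]:.3f}")
```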

[Figures omitted: histogram and density plot separated by group (group 1 = red, group 2 = blue); histogram and density line for all observations; QQ plot for each group; QQ plot for all observations]

Would really appreciate some input as to how people more skilled than I would interpret these!

$\endgroup$
  • $\begingroup$ Am I correct in assuming that the first and third plot are plotting the groups you are testing, whereas the second and fourth plot are the data as a whole? $\endgroup$ Commented Sep 11, 2023 at 23:34
  • $\begingroup$ @ShawnHemelstrand Yes you are. Sorry, I added that description to the pictures when I uploaded them, but was unaware it wouldn't actually show up $\endgroup$
    – Kimber
    Commented Sep 11, 2023 at 23:36
  • $\begingroup$ @ShawnHemelstrand I am very appreciative of your answer, thank you. I have also taken a look at the references which were quite helpful. In terms of how I should go about using other tests (e.g. a correlation test), would you advise me to consider the data normal and use a Pearson's correlation, or a Spearman rank? Quite honestly they give the same answer (except for one relationship being a moderate negative vs a weak negative), but if I use a Welch's t-test, would one make more sense to use than the other? Unfortunately I am very limited in sample size due to the species we are working with. $\endgroup$
    – Kimber
    Commented Sep 14, 2023 at 23:38
  • $\begingroup$ What are you getting Pearson/Spearman correlations for? Usually that is only useful for two continuous variables. Here you have a categorical variable and a continuous variable. $\endgroup$ Commented Sep 15, 2023 at 0:54

1 Answer

$\begingroup$

Normality Concerns

As I understand it, you would normally run a QQ plot for each group rather than for the distribution as a whole, since in the case of a t-test you are testing two separate distributions against each other: the operations behind a two-sample t-test assume probabilities related to the area under the curve of each respective distribution. One of your distributions is "approaching" normality, with its bell shape and mostly linear QQ line. The other is slightly bimodal, with some oddball tailing/splitting going on in its QQ plot, so it is really the only part of this that may be worth worrying about. As for your normality tests, I generally find them less helpful than they seem: each is sensitive to different aspects of a sample, and I personally find them fairly unreliable at actually painting a picture of normality. The Anderson-Darling test in particular is more prone to Type I errors (finding a false positive), which seems consistent with the other three tests here not flagging anything.
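If you want a numeric companion to the per-group QQ plots, scipy's `probplot` reports the correlation of the least-squares fit through the QQ points, which quantifies how straight each line is. The data below are invented stand-ins for your two groups (one roughly bell-shaped, one bimodal), not your actual observations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(10, 2, 16)                      # stand-in: bell-shaped, n = 16
group2 = np.concatenate([rng.normal(8, 1, 10),
                         rng.normal(14, 1, 11)])    # stand-in: bimodal, n = 21

# probplot returns the QQ points plus a least-squares fit; the fit's r
# measures how straight the QQ line is (closer to 1 = more normal-looking)
_, (slope1, intercept1, r1) = stats.probplot(group1, dist='norm')
_, (slope2, intercept2, r2) = stats.probplot(group2, dist='norm')

print(f"group 1 QQ correlation: {r1:.3f}")
print(f"group 2 QQ correlation: {r2:.3f}")
```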

The sample size here concerns me more than the normality testing. The visualizations aren't all that surprising given how small your samples are, as any misplacement in the quantiles will produce odd histograms and QQ plots (Oppong & Agbedra, 2016). The most important issue, I think, is the bimodality, which could be either a simple artifact of a very small sample or the result of some hidden grouping in your data influencing this shape. You consequently have some inequality of variance as well. The biggest problem, though, is that whatever t-test you run may only detect chance findings, because 1) it won't have the statistical power to detect real effects, and 2) sampling variation will make whatever differences appear fairly erratic and prone to higher Type I error rates (Delacre et al., 2017).

It would have been better to run a power analysis beforehand, with a specific effect size in mind, to determine an adequate sample size, and then collect enough data to have a robust sample to work with. You have already collected the data at this point, so it's probably too late for that, but it would be advisable next time before you run a t-test. I've seen people say you need anywhere from 30 participants per group (Gravetter et al., 2021) to 100 per group (Brysbaert, 2019), but most of that is contingent on the expected effect size, the group sizes, and the other assumptions of the test. For example, a t-test with 35 degrees of freedom (what your sample has) and an expected Cohen's $d$ of $.40$ would have only about $20$% power, whereas the same sample with an expected strong Cohen's $d$ of $2.0$ would have nearly perfect power. You can play around with a power calculator to get an idea, but it should alert you to the limitations of your sample here.
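The power figures quoted above can be checked with statsmodels (assuming a two-sided $\alpha = .05$; the solver call is a sketch of the a priori calculation, not your actual analysis):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of the actual design: n1 = 16, n2 = 21 (ratio gives nobs2 = nobs1 * ratio)
power_small = analysis.power(effect_size=0.40, nobs1=16, ratio=21 / 16, alpha=0.05)
# Same design but a very large expected effect, d = 2.0
power_large = analysis.power(effect_size=2.0, nobs1=16, ratio=21 / 16, alpha=0.05)
# A priori calculation: n per group needed for 80% power at d = 0.40
n_needed = analysis.solve_power(effect_size=0.40, power=0.80, alpha=0.05)

print(f"Power at d = 0.40: {power_small:.2f}")   # roughly .20, as noted above
print(f"Power at d = 2.0:  {power_large:.2f}")   # near 1
print(f"n per group for 80% power at d = 0.40: {n_needed:.0f}")
```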

Solution?

As for what you should do now, I think your best bet is probably a Welch t-test, with some strong caveats. Simulations have shown that it is fairly robust to departures from both normality and equality of variance while still maintaining the power of a parametric test (Delacre et al., 2017). Because your sample is obviously quite small and it is too late to run an a priori power analysis, I would certainly list this as a major limitation in your results/discussion if this is for a planned paper.
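As a minimal sketch of what that looks like in practice (shown in Python with scipy here; in R, `t.test()` uses the Welch version by default), on invented placeholder data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(10, 2, 16)   # placeholder data, n = 16
group2 = rng.normal(13, 4, 21)   # placeholder data, n = 21, different variance

# equal_var=False requests Welch's t-test (no equal-variance assumption)
welch = stats.ttest_ind(group1, group2, equal_var=False)
# Mann-Whitney U shown alongside for comparison
mwu = stats.mannwhitneyu(group1, group2)

print(f"Welch t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
print(f"Mann-Whitney U = {mwu.statistic:.1f}, p = {mwu.pvalue:.4f}")
```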

References

  • Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition, 2(1), 1–38. https://doi.org/10.5334/joc.72
  • Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch’s t-test instead of Student’s t-test. International Review of Social Psychology, 30(1), 92. https://doi.org/10.5334/irsp.82
  • Gravetter, F. J., Wallnau, L. B., Forzano, L.-A. B., & Witnauer, J. E. (2021). Essentials of statistics for the behavioral sciences (10th ed.). Cengage.
  • Oppong, F. B., & Agbedra, S. Y. (2016). Assessing univariate and multivariate normality, A guide for non-statisticians. Mathematical Theory and Modeling, 6(2).
$\endgroup$
  • $\begingroup$ These plots and diagnostics cannot possibly allow the "clearly normally distributed" conclusion. Perhaps edit for correctness? $\endgroup$ Commented Sep 17, 2023 at 21:10
  • $\begingroup$ I would be interested to hear your thoughts on why the first distribution's linear QQ plot and normally distributed histogram are wrong, but I hope it's clear from my post that I do not advise assuming the samples are in fact normal, given the restraints of the sample size and, of course, the second distribution's oddball behavior. $\endgroup$ Commented Sep 17, 2023 at 22:58
  • $\begingroup$ Distributions of data we can observe are a priori non-normal, with 100% probability, so there is never any justification for saying that they are normal. "Close to normal" is acceptable phrasing, with caveats about what "close" means. $\endgroup$ Commented Sep 17, 2023 at 23:09
  • $\begingroup$ I have edited the wording in hopes that it may better suit what you say. $\endgroup$ Commented Sep 17, 2023 at 23:14
