$\begingroup$

I have a dataset that contains human metabolite concentrations in a fluid. One group has about 12 samples, while the other has only 5. My question is whether I can assume normality for this data and use ANOVA/t-tests or whether, given the small dataset and large number of features, I should use non-parametric tests.

So, for example, one of the features looks like this for the control group:

Q-Q plot

It might follow a normal distribution, but with so few points, can I really say that it does? Also, should each feature follow a normal distribution within each of the 2 groups separately?

Then there are features that look like this:

Q-Q plot that looks bimodal if 5 points can be bimodal

If I plot their densities, I get plots like this (note that the data are log2-transformed):

Density plot

$\endgroup$
  • $\begingroup$ Are these two groups an experimental group and a control group? $\endgroup$ Commented Sep 5, 2023 at 5:26
  • $\begingroup$ @Subhash C. Davar Yes, they are; the smaller one is the control group, for some bizarre reason. $\endgroup$ Commented Sep 5, 2023 at 8:29
  • $\begingroup$ Can you add a plot of your data? $\endgroup$
    – mkt
    Commented Sep 5, 2023 at 8:51
  • $\begingroup$ Sorry, I meant the raw data, rather than a QQ plot. $\endgroup$
    – mkt
    Commented Sep 5, 2023 at 9:15
  • $\begingroup$ Oh, like a scatter plot? Or a density plot? $\endgroup$ Commented Sep 5, 2023 at 9:34

1 Answer

$\begingroup$

In general, if the assumptions are met, parametric tests are more powerful than their non-parametric equivalents.
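
As a rough way to check that power comparison for a specific situation, here is a minimal simulation sketch in R. It assumes normally distributed data with equal SDs, group sizes of 12 and 5 as in the question, and a true shift of 2 SD. The helper name `reject_rates` and the number of simulations are illustrative choices, not part of any package.

```r
# Hypothetical helper: estimate how often each test rejects at level alpha
# when the data really are normal and the true difference is `delta` SDs.
reject_rates <- function(n1 = 12, n2 = 5, delta = 2, n_sim = 2000, alpha = 0.05) {
  p_t <- numeric(n_sim)
  p_w <- numeric(n_sim)
  for (i in seq_len(n_sim)) {
    x <- rnorm(n1, mean = 0, sd = 1)
    y <- rnorm(n2, mean = delta, sd = 1)
    p_t[i] <- t.test(x, y)$p.value       # Welch t-test (R's default)
    p_w[i] <- wilcox.test(x, y)$p.value  # Wilcoxon rank-sum (Mann-Whitney) test
  }
  # Proportion of simulated data sets in which each test rejects
  c(welch_t = mean(p_t < alpha), wilcoxon = mean(p_w < alpha))
}

set.seed(1)
rates <- reject_rates()
print(rates)
```

Lowering `delta`, or swapping `rnorm` for a skewed generator, shows how the comparison between the two tests changes when the effect is smaller or the normality assumption fails.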

But the problem here is not parametric vs. non-parametric; it's sample size, power, and possibly overfitting.

Suppose you have only one feature to test. Then the difference would have to be huge to be statistically significant, and the parameter will be estimated very imprecisely. The exact numbers you get will depend on the exact situation, but let's say the groups have means that are 2 SD apart.

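set.seed(1234)  # make the simulation reproducible

x <- rnorm(12, 10, 1)  # 12 samples, mean 10, SD 1
y <- rnorm(5, 12, 1)   # 5 samples, mean 12, SD 1

t.test(x, y)  # Welch two-sample t-test (R's default)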

The result is statistically significant, but the 95% CI for the difference is about -3.25 to -1.48, which is quite wide for a true difference of -2.

Welch Two Sample t-test

data:  x and y
t = -5.94, df = 10.461, p-value = 0.0001193
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.250345 -1.484742
sample estimates:
mean of x mean of y 
 9.557737 11.925281 
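
A single simulated data set shows only one possible interval. To get a feel for how much the interval varies from sample to sample, the simulation can be repeated. The sketch below assumes the same setup as the snippet above (normal data, SD 1, means 10 and 12); the variable names and the number of repetitions are illustrative.

```r
set.seed(1234)
n_sim <- 2000

# Each column holds the lower and upper limits of one 95% Welch CI
# for mean(x) - mean(y); the true difference is 10 - 12 = -2.
ci <- replicate(n_sim, {
  x <- rnorm(12, 10, 1)
  y <- rnorm(5, 12, 1)
  t.test(x, y)$conf.int
})

widths  <- ci[2, ] - ci[1, ]             # width of each interval
covered <- ci[1, ] <= -2 & ci[2, ] >= -2  # does the interval contain -2?

print(summary(widths))  # spread of interval widths across simulated samples
print(mean(covered))    # share of intervals that contain the true difference
```

The summary of `widths` gives a sense of how imprecise the estimate typically is at these sample sizes, rather than relying on one draw.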

And if you move to more complex models, even slightly more complex ones, you run the risk of overfitting.
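
To see how quickly a model uses up the information in 17 samples, here is a minimal sketch on simulated data. The covariates `age` and `sex`, their values, and the effect size are all made up for illustration; only the group sizes (12 and 5) come from the question.

```r
set.seed(1234)

# Simulated stand-in for a single metabolite feature; all values are hypothetical.
dat <- data.frame(
  group = factor(rep(c("experimental", "control"), times = c(12, 5))),
  age   = runif(17, 20, 70),
  sex   = factor(rep(c("F", "M"), length.out = 17))
)
dat$y <- 10 + 2 * (dat$group == "control") + rnorm(17)

fit_simple <- lm(y ~ group, data = dat)              # 2 coefficients
fit_full   <- lm(y ~ group * sex + age, data = dat)  # 5 coefficients

# Residual degrees of freedom left for estimating the error variance
df_left <- c(simple = df.residual(fit_simple), full = df.residual(fit_full))
print(df_left)
```

Adding one interaction and one covariate already drops the residual degrees of freedom from 15 to 12, and each extra term costs more when there are only 5 samples in one group.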

$\endgroup$
  • $\begingroup$ But how can I know if the assumptions are met when the sample size is so small? $\endgroup$ Commented Sep 5, 2023 at 8:30
  • $\begingroup$ It can be tricky, for sure. One more problem with small samples. $\endgroup$
    – Peter Flom
    Commented Sep 5, 2023 at 10:03
  • $\begingroup$ Could I try fitting an ANOVA for each feature and then doing Q-Q plots of the residuals? I don't think I could do a single ANOVA on something like 50 features with the sample size I have, right? I say ANOVA because I could add factors like age and sex. $\endgroup$ Commented Sep 5, 2023 at 10:13
  • $\begingroup$ Why would metabolite concentrations have a normal distribution? What biomathematics dictates that? Nonparametric methods are very efficient even if normality were to miraculously hold. $\endgroup$ Commented Sep 5, 2023 at 12:36
  • $\begingroup$ Why do you think anything should be Gaussian? It is true that heights tend to be more normally distributed than most variables, but normality of raw measurements is pretty much an accident of nature. The shape of the distribution also depends on study enrollment criteria. $\endgroup$ Commented Sep 5, 2023 at 15:49
