
I'm a master's student currently working on the taxonomy and cultivation of wild edible mushrooms. In my experiment, I'm evaluating the effect of 4 distinct solid culture media and 4 temperatures on the mycelium growth and biomass production of different mushroom strains. A separate experiment was done for each strain, with 10 replicates per treatment. Statistical analyses of the experiments are being performed in R.

The data for the dependent variables (mycelium growth in diameter and dry biomass in grams) from one of my strains were homoscedastic and normally distributed, according to Levene and Shapiro-Wilk tests, so I ran a two-way ANOVA and t-tests with no problems. But the data for another strain are barely homoscedastic according to the Levene test, and not parametric according to Shapiro-Wilk, even though the scatter plot of the errors seems to follow a normal distribution with some outliers at each end. I tried removing some apparent outliers to see if that would fix things, but it doesn't seem to work.

So ... What should I do? I'm not really fluent in statistics or programming, but I know the basics. I've spent the whole day stressed out trying to fix this, but I just wasn't able to.

I know that you can do a Box-Cox transformation of non-parametric data, but honestly, I just don't know how to do it. I was able to obtain the lambda value, but I don't know the command for transforming the dependent variable; the tutorials I've seen online only show how to transform the independent variables. I also know that the Kruskal-Wallis test is a non-parametric alternative to the t-test, but is it the best option for my case?

[Diagnostic plots for the problematic strain: residuals vs. fitted, normal Q-Q, and scale-location]

  • Can you share (a link to) your data, or at least some plots? Omitting outliers is probably not the way to go. – kjetilbhalvorsen Commented Jun 22, 2023 at 8:26
  • It would help a lot to see some examples of the apparent problems, as @kjetilbhalvorsen requested. Also, with respect to the Shapiro-Wilk test, you might want to read the discussion on whether normality testing is essentially useless. A Q-Q plot of residuals against their theoretical normal distribution and a plot of residuals against predicted values are useful aids for making a judgment call about whether the normality and homoscedasticity assumptions hold well enough. There are alternatives, if not. – EdM Commented Jun 23, 2023 at 18:43
  • Added some plots of my data. Commented Jun 24, 2023 at 14:57
  • One tiny comment: "Nonparametric" is used to describe or classify statistical tests, not to describe the distributions of data. I think you mean to say that one of your data sets is clearly not sampled from Normal distributions. Commented Jun 29, 2023 at 19:23

1 Answer


A couple of pages on this site have useful discussions of how to evaluate the diagnostic plots. This page discusses all 4 of the default R diagnostic plots for lm models, and this page goes into more detail about the normal Q-Q plot.
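For reference, those four default plots come straight from calling plot() on a fitted lm object. A minimal sketch, assuming a hypothetical data frame `dat` with outcome `diam` and factor columns `medium` and `temp` (substitute your own names):

```r
## Fit the two-way ANOVA as a linear model; `dat`, `diam`, `medium`,
## and `temp` are hypothetical names standing in for your own data.
fit <- lm(diam ~ medium * temp, data = dat)

## Show all four default diagnostic plots at once: residuals vs. fitted,
## normal Q-Q, scale-location, and residuals vs. leverage.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))  # reset the plotting layout
```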

The residuals-versus-fitted plot and the scale-location plot both indicate some heteroscedasticity, in that the residuals tend to be somewhat smaller around lower predicted values than around higher predicted values. The apparent "outliers" on the Q-Q plot suggest a possibly "light-tailed" distribution of residuals, with fewer extreme values than might have been expected. Also, it's hard to interpret a normal Q-Q plot if there isn't homoscedasticity to start with. As Kjetil Halvorsen said in a comment, you certainly don't want to omit those "outliers": removal should only be done if there's a very good reason, like evidence of a technical measurement error, never arbitrarily.

The Kruskal-Wallis test is only for one-way ANOVA, while you have a two-way design (4 media by 4 temperatures). So that won't work here. Also, non-parametric models make it harder to do comparisons among different specific treatment combinations, or to make predictions.
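To illustrate the limitation, base R's kruskal.test() accepts only a single grouping factor (same hypothetical names as above):

```r
## Kruskal-Wallis can test medium OR temperature, but not both at once,
## and it cannot model their interaction.
kruskal.test(diam ~ medium, data = dat)  # ignores temperature entirely
kruskal.test(diam ~ temp, data = dat)    # ignores medium entirely
```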

It's not clear that the violations of homoscedasticity and normally distributed errors are bad enough to matter much in practice here. In particular, if your main interest is to identify the growth conditions that maximize diam, then whatever combination is leading to predicted values of >80 is a clear winner, more than 20 units greater than the next-best combination while only having maximum absolute residuals of about 4 units. Tweaks to the model aren't likely to affect that interpretation at all.

If you are nevertheless worried about the apparent violation of homoscedasticity, you might consider a simple transformation of the outcome values, like square-root or log (both special cases of Box-Cox), and perform ANOVA on the transformed data. That can often help in situations where the error magnitudes increase with the fitted values. The downside is that you are then modeling the means of the square-roots or logs of the data values rather than the mean values themselves.
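In R, one way to do this (a sketch, again with the hypothetical `dat`, `diam`, `medium`, and `temp` from above) is MASS::boxcox(), which profiles the log-likelihood over lambda; you then apply the transformation to the dependent variable yourself:

```r
library(MASS)

fit <- lm(diam ~ medium * temp, data = dat)

## Profile the Box-Cox log-likelihood and take the lambda that maximizes it
## (in practice you might round to a nearby interpretable value such as
## 0 = log, 0.5 = square root, or 1 = no transformation).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))
lambda <- bc$x[which.max(bc$y)]

## Transform the dependent variable: (y^lambda - 1)/lambda, or log(y)
## when lambda is (effectively) zero.
dat$diam_bc <- if (abs(lambda) < 1e-8) {
  log(dat$diam)
} else {
  (dat$diam^lambda - 1) / lambda
}

## Re-run the two-way ANOVA on the transformed outcome.
anova(lm(diam_bc ~ medium * temp, data = dat))
```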

An alternative would be to use a regression method that takes heteroscedasticity into account; see this page and its links for approaches. For example, the answer from gung - Reinstate Monica recommends weighted least squares, possibly combined with robust standard errors. Depending on the specific approach, the differences from your model will be primarily or even solely in terms of standard errors, not the point estimates of treatment effects.
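As one concrete possibility among those discussed there (a sketch, not necessarily the exact approach any single answer recommends), heteroscedasticity-consistent "robust" standard errors are available through the sandwich and lmtest packages, with car::Anova() accepting the same robust covariance matrix; hypothetical names as above:

```r
library(sandwich)  # heteroscedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for tests with a user-supplied covariance
library(car)       # Anova() also accepts a custom covariance matrix

fit <- lm(diam ~ medium * temp, data = dat)

## Point estimates are unchanged; only the standard errors differ.
coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))

## An ANOVA-style table using the same robust covariance matrix.
Anova(fit, vcov. = vcovHC(fit, type = "HC3"))
```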

You also could consider an ordinal logistic regression model, as explained for example in Chapters 13 and 14 of Frank Harrell's Regression Modeling Strategies. It can be thought of as a way to generalize the Kruskal-Wallis and similar non-parametric tests to more complex designs, in a way that is better suited for specific treatment comparisons and for predictions. It works with any ordered outcomes (even if there are as many outcome levels as observations) and makes no assumptions about error distributions.
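A sketch of that approach with Harrell's rms package, using the same hypothetical data frame and variable names as above:

```r
library(rms)

## orm() fits an ordinal regression model (proportional odds by default)
## and copes with a continuous outcome having many distinct values.
fit_orm <- orm(diam ~ medium * temp, data = dat)

## Wald tests for medium, temperature, and their interaction.
anova(fit_orm)
```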

If the fundamental results of a transformed, weighted, or robust model aren't substantially different from those of the simple linear model, then you might simplify the presentation by showing the simple linear model (as you do for the other strains) while stating that, given the potential violation of linear-model assumptions, you evaluated another approach without finding substantial differences in interpretation. You could provide the results of the transformed, weighted, or robust model as supplemental data to support that claim.

