
We recently received peer review on a manuscript that contained reasonable feedback but also this item:

"when the p-value is larger than the alpha level chosen (here is .05), it is not necessary to report effect size."

To me it seems clear that effect sizes should be reported regardless of categorical significance/non-significance. For a start, meta-analyses would be relatively uninformative if non-significant effect sizes were not available.

I usually have a high bar for not complying with reviewer suggestions, but this seems like something we should resist. I'm struggling, though, to come up with a way of expressing that which won't excessively call the reviewer's understanding into question.

  1. What would be an ideal non-combative yet cogent response to this query?
  2. Is there a useful paper we could provide as an authoritative reference on this issue, so it doesn't come across as a battle of opinions?
  • I agree with you. I think you should just say that, regardless of the significance level chosen, the actual p-value should be reported, along with any other relevant information. Commented Feb 27, 2018 at 2:10
  • Point the reviewer towards this: stats.stackexchange.com/questions/16218/… Commented Feb 27, 2018 at 2:27
  • How did you formulate the report about the effect size(s)? Did you provide a neutral representation of all the effects, reported in a table? Or did you cherry-pick several but not all effect sizes and add a discussion about the findings? Commented Oct 2, 2023 at 6:16
  • @SextusEmpiricus More than five years after posting this question, I don't even recall which manuscript it arose from. But given our standard practices, I can be pretty confident that no, we weren't cherry-picking. The whole point of the question is about trying to report significant and non-significant results consistently. Commented Oct 2, 2023 at 22:57
  • @MichaelMacAskill Maybe calling it "cherry-picking" is too strong, but what it refers to is the practice of spending more text than necessary on particular observations and explaining those observations as if they had been significant. I feel there is a difference between a report that is only a simple table and one that also spends words explaining the observed effect (beyond explaining why it was non-significant, for instance by stating that the power was lower than expected). Commented Oct 3, 2023 at 8:39

3 Answers


A plausible answer:

While some researchers prefer to report hypothesis-test significance rather than effect sizes, it is often beneficial to complement coefficient estimates with effect-size measures, significant or not. A nonsignificant estimate could emerge simply from sampling error and the resulting estimation uncertainty, especially when the sample size is small, and it may suggest the same effect size as another study that reports a significant coefficient. Therefore, a nonsignificant estimate should not be used as evidence for the absence of any effect. Making dichotomous decisions based solely on p values, though convenient, may suffer from the multiple-testing problem and risks omitting important effects, especially ones that are difficult to measure precisely. In meta-analyses that synthesize results from multiple studies with different sample sizes, the sign and magnitude of effect-size measures are usually more important than p values for judging the consistency or disparity of results.

As others have mentioned, reporting confidence intervals is another good practice for conveying uncertainty. Confidence intervals are not the same quantity as effect-size measures, which can themselves have confidence intervals. Along with coefficients, standard errors, and p values, reporting effect sizes with their confidence intervals is beneficial, although some of these quantities can be calculated from the others and need not all be presented. A typical reporting style might be: β = 0.51 (0.11), p = 3.55e-06, HR = 1.67, 95% CI [1.34, 2.07] for a Cox hazards model with a sample size of n = 90, although the p value may be redundant, since readers can readily derive it from a normal table.
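As a quick sanity check, all of those numbers can be recovered from just the coefficient and its standard error; the following is a minimal, purely illustrative sketch:

```python
# Minimal illustrative sketch: recovering HR, its 95% CI, and the p value
# from the reported coefficient and standard error (beta = 0.51, SE = 0.11).
from math import exp, sqrt, erf

def two_sided_p(z):
    """Two-sided p value for a standard-normal Wald statistic."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

beta, se = 0.51, 0.11                                    # log hazard ratio and its SE
z = beta / se                                            # Wald statistic, about 4.64
hr = exp(beta)                                           # effect size on the hazard-ratio scale
lo, hi = exp(beta - 1.96 * se), exp(beta + 1.96 * se)    # 95% CI for the hazard ratio

print(f"p = {two_sided_p(z):.2e}")                       # about 3.5e-06
print(f"HR = {hr:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")     # 1.67, [1.34, 2.07]
```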

In a replication with a small sample of n = 10 due to budget constraints, the same underlying population effect would yield β = 0.51 (0.33), HR = 1.67, 95% CI [0.87, 3.18]. That is, the standard error triples because the sample size shrinks by a factor of nine, as standard errors are inversely proportional to the square root of the sample size. The same coefficient and effect size emerge, only with greater uncertainty. Although the coefficient estimate is no longer significant, since z = 0.51 / 0.33 = 1.55 < 1.96, p = .12, the point estimates β = 0.51 > 0 and HR > 1 are very useful information that provides further evidence for a plausibly positive effect on hazards. Therefore, one cannot dismiss β = 0.51 and HR = 1.67 from the replication as meaningless just because p = .12, although one can argue that the original study provides stronger and more powerful evidence of the same effect size. Likewise, p > .050 cannot be used as evidence of a truly zero effect, as in H0: β = 0 and HR = 1, although one can say that such parameter values are compatible with the observed data. Thus, the replication supports rather than contradicts the conclusion of the original study, despite the lack of statistical significance at common thresholds. In summary, reporting effect sizes with their confidence intervals is good practice regardless of significance level.
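The rescaling step can be made explicit in the same way; the sketch below is again purely illustrative and simply restates the n = 10 numbers above:

```python
# Illustrative sketch of the replication scenario: same beta, but the standard
# error is rescaled for n = 10 instead of n = 90 (SE is proportional to 1/sqrt(n)).
from math import exp, sqrt, erf

beta, se_90 = 0.51, 0.11
se_10 = se_90 * sqrt(90 / 10)                    # ninefold smaller sample -> SE triples to 0.33

z = beta / se_10                                 # 1.55, below the 1.96 cutoff
p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p value, about .12
lo, hi = exp(beta - 1.96 * se_10), exp(beta + 1.96 * se_10)

print(f"HR = {exp(beta):.2f}, 95% CI [{lo:.2f}, {hi:.2f}], z = {z:.2f}, p = {p:.2f}")
# HR = 1.67, 95% CI [0.87, 3.18], z = 1.55, p = 0.12
```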

Useful reference articles that advocate for all of the above practices:

Chu, B., Liu, M., Leas, E. C., Althouse, B. M., & Ayers, J. W. (2021). Effect size reporting among prominent health journals: A case study of odds ratios. BMJ Evidence-Based Medicine, 26(4), 184–184. https://doi.org/10.1136/bmjebm-2020-111569

Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift Für Psychologie/Journal of Psychology, 217(1), 15–26. https://doi.org/10.1027/0044-3409.217.1.15

Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLOS Biology, 13(3), e1002106. https://doi.org/10.1371/journal.pbio.1002106

Lee, D. K. (2016). Alternatives to p value: Confidence interval and effect size. Korean Journal of Anesthesiology, 69(6), 555–562. https://doi.org/10.4097/kjae.2016.69.6.555

Rosenthal, R., & DiMatteo, M. R. (2001). Meta-analysis: Recent developments in quantitative methods for literature reviews. Annual Review of Psychology, 52, 59–82. https://doi.org/10.1146/annurev.psych.52.1.59


I think it is essential to report confidence intervals with all results, regardless of p-value. If the p-value is high, the confidence interval could ....

  • ...range from a tiny decrease (scientifically trivial, if true) to a tiny increase (scientifically trivial, if true). Such data provide strong evidence that either there is no effect or there is a scientifically trivial effect. This is a pretty useful conclusion.
  • ...range from a huge decrease (scientifically important, if true) to a huge increase (scientifically important, if true). Such data are consistent with a large decrease, no change, and a large increase -- so lead to no useful conclusion. The fact that the p-value is large really does not help you interpret the findings.

(I use "decrease" and "increase" assuming we are looking at a difference between means. If looking at an odds ratio or relative risk, the question would instead be how far below 1.0 the lower limit is and how far above 1.0 the upper limit is.) A short numeric sketch below makes this contrast concrete.
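The two results below use invented, purely hypothetical mean differences and standard errors; they share essentially the same large p-value, yet only the first supports a useful conclusion:

```python
# Hypothetical numbers, invented purely for illustration: two results with the
# same large p value but very different confidence intervals for a mean difference.
from math import sqrt, erf

def summarize(diff, se):
    z = diff / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))      # two-sided p value
    return p, diff - 1.96 * se, diff + 1.96 * se          # p and 95% CI limits

for diff, se in [(0.05, 0.10),    # tight CI: any effect is scientifically trivial
                 (5.0, 10.0)]:    # wide CI: compatible with large effects in either direction
    p, lo, hi = summarize(diff, se)
    print(f"difference = {diff}, p = {p:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# difference = 0.05, p = 0.62, 95% CI [-0.15, 0.25]
# difference = 5.0, p = 0.62, 95% CI [-14.60, 24.60]
```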


Whether or not the reviewer's comment makes sense depends on how you formulated the reporting of the non-significant result(s).

  • Neutral: a representation of all the effects in a table; simply the raw results, with no conclusions attached.

  • Subjective: did you cherry-pick a few, but not all, of the tested effects in a results section that partially blends with some discussion and conclusions?

    You can report non-significant results. But if the discussion and conclusions go further than a plain "the observations are less accurate than expected" or "the effect is not as large as expected", and instead you add a discussion that over-interprets the results (a discussion written as if the effect were significant), then the reviewer has a point.

In the first case you can write a reply that downplays the reporting of the non-significant effect size, e.g. stating that you agree that the reported effect size is not statistically relevant, but that you report the sizes of all the effects to help the reader get a better idea of the results in their entirety, and for uniformity across all the presented results.

  • I don't really follow. What does a high p-value have to do with observations being "accurate" or "statistically relevant"? Commented Oct 2, 2023 at 18:44
  • @HarveyMotulsky If you report some cherry-picked effect size of interest but it is not significant, then the experiment was underpowered for that effect size of interest. A particular observed effect size might be interesting, but if there are large inaccuracies/variations in the measurements, then the observed effect size is, statistically speaking, not remarkable. Commented Oct 2, 2023 at 19:01
