
I'm currently working on my master's thesis, analysing attributes obtained from digital elevation models (DEMs). I'm comparing two point sets for which I extracted altitude values from two DEM rasters with different resolutions.

Long story short: I conduct a Wilcoxon signed-rank test on the two attributes (no normal distribution, paired samples). Now, the boxplots look extremely similar and the means differ by about one meter. I've already learned that significance is highly sensitive to large n, and therefore I'm focusing on the effect size. I'd expect a small effect size, given the similar boxplots and means. However, because the shift is entirely one-sided, the effect size ($Z/\sqrt{n}$) comes out rather large, even though the two sets are actually pretty similar.

I know that these tests are designed to detect even the smallest differences, and the test achieves that goal here, as there is a one-directional shift. Even though there is a difference, it is rather small, and I'm looking for an effect size measure that reflects this. In other words, it should be normalized not only by sample size but also by the attribute's range.

Is there an effect size measure that considers the range of the attribute?


Here is some R code that illustrates this behavior with simulated data:

# install.packages("coin")
library(coin)

set.seed(1)
a <- runif(1000, 900, 1100)
b <- a + runif(1000, 0, 1)

# Wilcoxon signed-rank test; with this seed it reports Z = -27.393
wt <- wilcoxsign_test(a ~ b)
wt
statistic(wt) / sqrt(length(a))  # effect size r = Z / sqrt(n)

# The same Z computed by hand from the signed ranks
diffs <- a - b             # 'diffs' rather than 'diff', to avoid masking base::diff
diffs <- diffs[diffs != 0]
n <- length(diffs)
signed.ranks <- rank(abs(diffs)) * sign(diffs)

W <- sum(signed.ranks)
Z <- W / sqrt(n * (n + 1) * (2 * n + 1) / 6)  # Var(W) = n(n+1)(2n+1)/6
Z / sqrt(n)

dev.new()  # portable replacement for windows()
d <- stack(list(a = a, b = b))
boxplot(values ~ ind, d)
dev.new()
boxplot(a - b)
Comments:
  • I have never encountered an effect size for a signed-rank test. However, you might consider something like the difference in medians divided by the IQR of the aggregated sample (a sketch of this idea follows the comment list). – Gregg H (Apr 26, 2018 at 18:27)
  • Possibly stats.stackexchange.com/questions/133077/… will help? – jbowman (Apr 26, 2018 at 18:46)
  • At gis.stackexchange.com/questions/1551/… I posted an example of how you might go about doing this kind of comparison. Although it focuses on slopes, most of the ideas translate to comparing other derived characteristics. Since DEMs almost always exhibit considerable spatial autocorrelation, the applicability of univariate tests like the ones you use here is doubtful. Additional considerations are described at gis.stackexchange.com/questions/55507/…. – whuber (Apr 26, 2018 at 23:27)
  • @whuber I'm aware of the resolution influence; that's actually what I want to show on my data. Several other papers (e.g. Thompson et al. or Soerensen & Seibert) also used non-parametric tests. The reason is that the measures do not follow a normal distribution and the attributes are paired. Anyhow, that's more or less unrelated to my question. – Chris (Apr 27, 2018 at 6:44)
  • @GreggH: Check this paper out. On page 12 they cover effect sizes for non-parametric tests. That would be something similar to Cohen's d then, I guess. I will look into it. – Chris (Apr 27, 2018 at 6:52)
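A minimal R sketch of Gregg H's suggestion, reusing the simulated a and b from the question's code (the function name robust_es is my own, not an established statistic):

# Difference in medians, scaled by the IQR of the aggregated sample,
# as suggested by Gregg H above
robust_es <- function(x, y) {
  (median(x) - median(y)) / IQR(c(x, y))
}
robust_es(a, b)  # about -0.005: tiny, matching the near-identical boxplots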

3 Answers

Answer 1

The effect size statistic $Z/\sqrt{N}$, sometimes called $r$, is related, in the paired-observations case, to the probability that one group is larger than the other or, if you'd rather, that the differences are consistently greater than zero.

It doesn't measure the size of the difference in values between the two groups. Other effect size statistics, such as Cohen's d, are related to the difference in means.

To me, the most practical approach is to consider the practical importance of the results. If the mean difference is 1 meter, is this large enough to matter? This is subjective but, honestly, the practical conclusions from any research have to be subjective. You can report p and r, and then leave the hard work of thinking about what the results actually mean to your own judgment.

Another approach using effect size statistics would be to use Cohen's d, or something you create akin to Cohen's d. Cohen's d is essentially the difference in means divided by the standard deviation of the observations. There are some variants; you can look up their precise calculations if you want. The interpretation is that a Cohen's d of 1 indicates that the means differ by one standard deviation. If you are comfortable using means and standard deviations with your data, you could use this statistic. Otherwise, you could create an effect size statistic of your own, such as the difference in medians divided by the median value (which gives a percentage), or the difference in medians divided by the IQR, as @GreggH suggests.
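As a minimal sketch, assuming the simulated a and b from the question (the function name cohens_d is mine, and this variant scales by the pooled-sample standard deviation; other variants divide by the standard deviation of the differences instead):

# One variant of Cohen's d: difference in means divided by the
# standard deviation of the pooled observations
cohens_d <- function(x, y) {
  (mean(x) - mean(y)) / sd(c(x, y))
}
cohens_d(a, b)  # about -0.009: a very small standardized difference

# A home-made median-based analogue: difference in medians
# as a fraction of the pooled median
(median(a) - median(b)) / median(c(a, b))  # about -0.0005, i.e. -0.05%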

Answer 2

After some digging and talking to my professor, I came up with a solution, for future reference.

The problem was that I had the wrong idea about the Wilcoxon signed-rank test. The purpose of the test is to indicate whether there is a shift between the two variables. The p-value suggests that there is a statistically significant shift. Since p-values lose much of their meaning with very large sample sizes, an effect size measure is needed (see, e.g., Wasserstein & Lazar (2016)). The calculated effect size only indicates whether there is a consistent shift in one direction. It does not, however, indicate how strong this shift is in terms of values.

As for how strong the shift is, there are currently no widely accepted effect size measures; in general, there is a lack of effect size measures for non-parametric tests (see, e.g., Leech & Onwuegbuzie (2002)). Effect measures for non-parametric tests do exist, though (the one suggested by Gregg H, for example). Other tests, such as the two-sample Kolmogorov-Smirnov or Anderson-Darling tests, might help to give a better understanding of the distribution shift. Otherwise, with increasing sample size, it is also possible to use a t-test, justified by the central limit theorem. Then other effect measures, e.g. Cohen's d, can be calculated.
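As a rough sketch of those two routes in R, again reusing the simulated a and b from the question (note that ks.test treats the two samples as independent, so the pairing is ignored):

# Two-sample Kolmogorov-Smirnov test: compares the whole distributions
ks.test(a, b)

# Paired t-test, justified for large n by the central limit theorem,
# with Cohen's d for paired data (mean difference / sd of the differences)
t.test(a, b, paired = TRUE)
delta <- a - b
mean(delta) / sd(delta)  # about -1.7: small in meters, but extremely consistent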

This answer is along the lines of what Sal Mangiafico explained in slightly different words. I hope it helps someone else trying to figure this out.

Happy to edit if anyone has anything to add.

References:

Leech, Nancy L.; Onwuegbuzie, Anthony J. (2002). A Call for Greater Use of Nonparametric Statistics.

Wasserstein, Ronald L.; Lazar, Nicole A. (2016). The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2).

Answer 3

If you have this many data, you really could have used a $t$-test with no problems. It's worth noting that the Wilcoxon signed-rank test is actually testing a slightly different null hypothesis [1, 2]. Often the reason for choosing the Wilcoxon signed-rank test is that people are not willing to assume the numbers are equal-interval. In your case, it seems you think they are.

Another issue is that I would not focus on the individual boxplots; they are more likely to mislead than illuminate. At a minimum, consider plotting a boxplot of the differences along with the boxplots of the original data [3].

In general, measures of effect size are designed to disentangle the magnitude of the effect from the amount of data we have (a statistical test necessarily combines the two), and thereby to communicate the size of the shift in a manner that is intuitive and simple (i.e., typically as a single number). Thus, the effect size you list is not trying to "normalize by the sample size" but to extract the sample size from the test statistic (although in this case that is admittedly opaque).

Since you believe the units are reliable, some version of a mean difference should be fine. If you believe your audience will be sufficiently familiar with the units, a raw mean difference would be appropriate. (Consider that when people talk about weight loss or stunted growth, they always use everyday measures, say, pounds or kilograms.) If your audience wouldn't be conversant with these units, a standardized mean difference is needed to give them an interpretable context. The typical way to do this is to divide the mean difference by a standard deviation (computed in one of several ways); in fact, this procedure has become the meaning of "standardized mean difference". The notion of standardization is much broader than that, of course, and there is no reason you need to be bound by that procedure. If the mean differences that are possible in your situation are limited (e.g., by some physical constraint) and can be defined, you could divide your observed mean difference by the possible range and present that; you just need to explain clearly what you did. I would probably combine this with the common-language effect size [4] and say something like:

The rasters showed significant improvement due to the manipulation (z=-30, p<0.001). One hundred percent of the differences between paired pixels were negative, with the mean shift, -0.49, constituting an improvement equal to X% of what is physically possible.
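To make the range-normalization concrete, here is a minimal R sketch using the question's simulated data; the 10 m "physically possible range" is a made-up constraint, purely for illustration, and would have to come from your domain knowledge:

delta <- a - b  # paired differences

# Common-language effect size [4]: proportion of negative differences
mean(delta < 0)  # 1.0 here: every single difference is negative

# Mean difference as a fraction of the physically possible range
# (the 10 m bound is hypothetical, for illustration only)
possible_range <- 10
mean(delta) / possible_range  # about -0.05, i.e. 5% of what is possible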

References:

  1. Why would parametric statistics ever be preferred over nonparametric?
  2. What exactly does a non-parametric test accomplish & What do you do with the results?
  3. Is using error bars for means in a within-subjects study wrong?
  4. Effect size to Wilcoxon signed rank test?
