
I wanted to conduct a Wilcoxon signed rank test but stumbled upon two questions that I am unable to solve on my own. I tested two types of interfaces for a piece of software with the same ten people, and I want to compare whether there is a difference between using these two interfaces. Each person answered the same ten questions for both interfaces. The questions were rated on a range from one to five (1 = strongly disagree, 5 = strongly agree).

My first question is whether my data are appropriate for a Wilcoxon signed rank test. My questions were on a Likert scale, so they are ordinal data. I have read that ordinal data are not sufficient for a Wilcoxon signed rank test, but I have also read the opposite. I am rather confused now and need some confirmation for my use case.

My second question is something I couldn't find in my literature. Since I want to compare these two interfaces, my idea was to sum up all the answers for a question across the ten people and then take the differences / calculate the ranks based on those sums. For example, question one got 10 values for Interface 1 and 10 for Interface 2; the totals are 30 points for Interface 1 and 35 points for Interface 2. I would then take the difference for each question and rank all the questions based on their sums. Is this possible to do, or should I rather take the differences / rank the answers for each question individually?
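To make the summing idea concrete, here is a minimal sketch with SciPy, using invented totals purely for illustration (whether summing like this, rather than pairing the individual answers, is appropriate is part of what I am asking):

```python
# Invented per-question totals over the ten people, purely for illustration:
# e.g. question one totals 30 points for Interface 1 and 35 for Interface 2.
import numpy as np
from scipy.stats import wilcoxon

interface1_totals = np.array([30, 27, 41, 33, 38, 29, 35, 32, 36, 31])
interface2_totals = np.array([35, 29, 40, 37, 41, 30, 38, 36, 35, 33])

# Wilcoxon signed rank test on the ten per-question differences;
# the alternative would be to pair the individual answers instead.
stat, p = wilcoxon(interface1_totals, interface2_totals)
print(stat, p)
```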


3 Answers


An excellent question. As @Glen_b implied, the signed rank test, unlike the Wilcoxon unpaired two-sample test, is metric-dependent. A better test is the Kornbrot rank difference test discussed here. It has slight problems with discrete data (which create a lot of ties) but is invariant to any monotonic transformation of the scale. Here you'll find a section on using regression models for paired data. This generalizes paired tests and opens the way for repeated measures. You'll see an example there where an ordinal random-effects model is attempted, with unsatisfactory results (because of the large number of parameters) if one uses the standard frequentist approximation for handling random effects. A Bayesian hierarchical model worked very well, with subjects as random effects (intercepts). Since you are accounting for within-subject correlation, you could extend this approach to a joint analysis of all the individual items without summing them.

Models generalize tests in many ways, and often lead to better analyses. Plus with models you can estimate secondary parameters such as means and quantiles.
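This is not the ordinal Bayesian hierarchical model described above (and not taken from the linked notes), but a minimal sketch of the random-intercept idea with statsmodels, assuming invented long-format data (subject, interface, item, rating) and, as a simplification, treating the 1-5 ratings as numeric in a linear mixed model:

```python
# Minimal random-intercept sketch on invented data; a proper analysis of
# Likert items would use an ordinal (e.g. proportional-odds) model instead
# of treating the ratings as numeric.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(10), 20),              # 10 subjects
    "interface": np.tile(np.repeat(["A", "B"], 10), 10),  # 2 interfaces
    "item": np.tile(np.arange(10), 20),                   # 10 items each
    "rating": rng.integers(1, 6, size=200),               # invented 1-5 ratings
})

# A random intercept per subject accounts for within-subject correlation
m = smf.mixedlm("rating ~ interface + C(item)", data=df,
                groups=df["subject"]).fit()
print(m.summary())
```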

  • Would it be possible to use an ordinal regression model but just use a cluster-robust standard error to adjust for the within-subject correlations, rather than fixed/random effects for subject?
    – Noah
    Commented Aug 3, 2023 at 15:52
  • Good question. It's possible, but the cluster sandwich estimator may require a certain sample size to work. I'll add an example to my BBR notes at the above link.
    Commented Aug 3, 2023 at 16:27
  • I added a cluster sandwich method for both a linear model and a proportional odds model. I think the standard errors are a little too small. Part of the problem is that we don't know which degrees of freedom to use in a $t$-statistic after doing the robust covariance calculation. By convention, analysts just use $z$-statistics ($\infty$ d.f.), as I've done.
    Commented Aug 3, 2023 at 17:11
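As an illustration of the linear-model half of the cluster-sandwich idea discussed in these comments, a minimal sketch on invented data (my own illustration with statsmodels, not the code from the notes linked above):

```python
# Ordinary least squares on invented long-format data, with a cluster-robust
# (sandwich) covariance clustered on subject instead of subject-level
# fixed/random effects. A proportional-odds version would need an ordinal
# regression routine that supports clustered covariances.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(10), 20),
    "interface": np.tile(np.repeat(["A", "B"], 10), 10),
    "item": np.tile(np.arange(10), 20),
    "rating": rng.integers(1, 6, size=200),
})

fit = smf.ols("rating ~ interface + C(item)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["subject"]})
print(fit.summary())
```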

A signed rank test relies on taking pair differences. Consequently, you're asserting that the difference between a 1 and a 3 on a corresponding pair of Likert items is the same as the difference between a 2 and a 4, and so on (in order to be able to then rank those differences). That is to say, you are in effect asserting an interval scale on the Likert items.

On the other hand, this treatment is common, even pretty standard; Likert designed Likert scales as sums of such Likert items, and sums and differences alike rely on an assumption of equal intervals.

Whether such a treatment of these Likert items is a reasonable thing to do in your context is perhaps more a matter for your colleagues (or whoever else your research is aimed at) and their attitudes to measurement, rather than being a directly statistical question.

  • I feel like agreeing, but not fully. A difference on a Likert rating scale is not just like a sum on it. When we sum points 1 and 3, we are summing the distances from the origin of measurement, (3-0)+(1-0); we are thus clearly in interval-measurement territory, because the above expression pushes off from zero: (3+1)-(0+0). But a difference between two points, say 3 and 1, the 3-1, dismisses zero: (3-0)-(1-0) is 3-1+0-0, so zero cancels out. Algebraically, 0+0 and 0-0 amount to the same thing; metaphysically, I doubt they are equivalent.
    – ttnphns
    Commented Aug 2, 2023 at 21:49
  • (cont.) If 0 does have a positive sign, they are not. It must have the same sign as 1 or 3 in order to belong to the same measuring ruler or tool that a Likert scale presents. What am I getting at? If 3-1 has no memory of the origin (unlike 3+1), then a difference, like 3-1 or 4-2, is still compatible with the notion of an ordinal scale. 3-1 means there is a count of two intervals between them, i.e., some category labelled "2" falls somewhere in between (we don't bother where).
    – ttnphns
    Commented Aug 2, 2023 at 21:49
  • (cont.) Subtracting Likert ratings, we are counting intervals (or points, the notches), not measuring with them. The Wilcoxon signed rank test thus does not seem to me to betray the ordinal scale for an interval scale. But I may be completely wrong.
    – ttnphns
    Commented Aug 2, 2023 at 21:50
  • @ttnphns I think there's no need to assume a 0 even exists for sums or differences; the "1" can be arbitrary (as long as it's a common point to the items you're differencing).
    – Glen_b
    Commented Aug 3, 2023 at 3:07
  • My point is that in an ordinal scale, A B C D, where we only know that each next value is somehow greater than the previous, the difference or distance B-A can still be regarded as equal to C-B in a specific counting sense: counting the number of grades between these ordered (but not fixed) notches. Likewise, C-A is equal to D-B; A-A is equal to B-B.
    – ttnphns
    Commented Aug 3, 2023 at 6:39

Unlike summing or averaging two scores from a k-point ordinal scale, differencing two such scores does not, to my mind, necessarily require assuming the scale to be metric (interval). "3" - "1" = 2 can easily mean "3 is two steps (of one point each) higher in rank than 1", rather than "the magnitude of 3 is greater than the magnitude of 1 by 2". Under the second way of thinking, the difference gets measured: there is a ruler with an origin 0 and an implied gauge. Under the first way, there is only an ordered sequence of k notches from smaller to bigger; the difference is the result of counting positions "from here" to the left or to the right. Such a difference will still have a known size (say, "3" - "1" = "4" - "2" = 2), but that size comes from counting positions from a given locus on the scale; it is not a gap that was measured.

Possibly, an ordinal rating scale can allow computing differences on it (differences of this position-counting nature) without becoming an interval scale, whereas directly summing scores on it calls for assuming it is interval. (My reasoning, or rather intuition, may be flawed.)

  • I have to disagree. Differencing requires just as much of an interval-scale metric assumption as averaging.
    Commented Aug 3, 2023 at 16:25
