
I have pre-post matched Likert-scale data (~20 questions) that I need to analyze. What is the best way to analyze these data? I initially thought it would be OK to use means and do a paired-samples t-test, but this doesn't seem appropriate with Likert-scale data. Is a paired-samples Wilcoxon test appropriate? Some context: this is evaluation data looking at students' knowledge and skills before and after a course.


2 Answers


[TL;DR: If you want to perform a hypothesis test, you must start with a hypothesis to test, then choose a suitable test for that hypothesis. A hypothesis will be a statement about one or more population parameters.]

I take a fairly strong stance in what follows, but this is not directed at the asker. My vexation arises from what I see as a distressing convention in teaching and practice in some application areas, which seems to make a point of phrasing hypotheses as vaguely as possible, so that they can stand for any number of distinct hypotheses (about different population parameters) in the end. Leaving this choice of actual hypothesis open then leads to a number of distinct tests that might relate to the encompassed possible hypotheses. This practice naturally leads to p-hacking once the specific test is chosen by looking at the data - and with it, some actual hypothesis is finally selected.

Considering tests that relate to quite distinct hypotheses as if they were readily-substituted options is, I argue, poor scientific practice. For example, the effects that you're usually taken to be estimating under a signed rank test (population pseudo-median of differences) and paired t-test (population mean difference) might not even be in the same direction.

If the two distinct quantities don't even necessarily move in the same direction when $H_0$ is false, what thing are we even talking about in the hypothesis and the conclusion? Presumably you care enough about which parameter you're discussing, and which direction (up or down) you claim the difference is in, to take some care over which effect you're hypothesizing about and making claims about in a conclusion. This is not a matter to be disposed of as casually as "which one of these should I throw at my data?"; it is rather more fundamental than that.
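
To make that concrete, here is a minimal simulation (my illustration, not part of the original answer) in which the population mean of the paired differences is positive while the pseudo-median - the quantity the signed rank test targets - is negative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Paired differences drawn from a two-point distribution:
# 90% are -1, 10% are +20, so the population mean difference is
# 0.9*(-1) + 0.1*20 = +1.1 (positive), while the population
# pseudo-median (median of Walsh averages) is -1 (negative).
d = rng.choice([-1.0, 20.0], size=200, p=[0.9, 0.1])

# Hodges-Lehmann estimate: median of all Walsh averages
# (d[i] + d[j]) / 2 over pairs i <= j - the effect the signed
# rank test is aimed at.
i, j = np.triu_indices(len(d))
walsh = (d[i] + d[j]) / 2.0

print("sample mean difference:", d.mean())          # near +1.1
print("sample pseudo-median:  ", np.median(walsh))  # near -1
```

Here the paired t-test and the signed rank test are aimed at genuinely different population quantities, with opposite signs, not at two interchangeable measurements of one "effect".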

You can of course make an additional assumption that the effect of interest is a pure location shift, in which case the parameter being targeted should coincide (if the population means are finite), as indeed would that of a number of other possible tests. But if a test can respond to other classes of alternative, your additional assumption may mislead you when those other kinds of alternatives actually hold. There's no particular need for either test to make such a pure-shift assumption; both can maintain their usual type I error rate (which doesn't involve the effect under the alternative) and also maintain good power against a wide variety of alternatives without that specific requirement.

You'd have no way to be confident that this additional restriction holds in general, but in many cases you can tell immediately that it can't. One case where it doesn't make sense is when the response is a Likert item: there you have to consider alternatives as a change in the probabilities of each outcome, and the shape of the distribution changes as a result. You might hypothesize a pure shift on some underlying scale which is transformed to a Likert item, but then you need a model relating the observed variable to that underlying scale, where the shifts would exist.
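
As a sketch of that kind of latent-scale model (the normal latent variable and the thresholds below are arbitrary choices of mine, purely for illustration): a pure shift on the underlying scale, cut at fixed thresholds into a 5-point item, changes the category probabilities rather than moving the observed distribution by a constant.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical latent-variable model (thresholds chosen arbitrarily
# for illustration): a continuous N(shift, 1) trait is cut at fixed
# thresholds into a 5-point Likert item.
cuts = np.array([-1.5, -0.5, 0.5, 1.5])

def likert_probs(shift):
    """P(item = 1..5) when a N(shift, 1) latent trait is thresholded."""
    cdf = norm.cdf(cuts - shift)
    return np.diff(np.concatenate(([0.0], cdf, [1.0])))

print("before:", likert_probs(0.0).round(3))
print("after: ", likert_probs(0.7).round(3))
# The +0.7 shift on the latent scale reshapes the category
# probabilities; the observed 1-5 distribution is not the 'before'
# distribution moved up by a constant.
```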

In this case you have a Likert scale (composed of multiple items). You already assumed the items were interval-scale when you added them (I'll leave discussion of the plausibility of this aside, simply noting that it was necessarily assumed when you declared the collection of statements like "1" + "4" = "2" + "3" to all be true, as you must have done when calling both those things "5").

However, this interval assumption still doesn't automatically make the pure-location-shift alternative correct (it certainly is not the case for the components - whence for the sum?), but if you don't get too close to the ends of the scale it might perhaps be more or less okay as an approximation. Even if you simply assume that it holds approximately, you still won't know that it does, and we're still in the bind of "what thing are we really talking about?"

We're relying on a lot there to treat the two hypotheses as relating to the same thing and thereby to treat the tests as potential alternatives. From a scientific point of view that's a really big "if". If we don't even know which parameter we're talking about, and even its direction might be a coin flip based on the happenstance of which of those two tests we plump for, why should anyone accept our claims as meaningful?


In short, choose your parameter first (what population quantity are we hypothesizing about), then choose how to measure it in practice, then choose how to test that hypothesis. Then when you actually know what it is you're trying to find out about the population, you're in a position to go ahead and design experiments/studies for that particular question and consequently, to collect data to answer it.

If your research hypothesis was really about shift in means (noting that you already assumed an interval scale when you added the items), then a t-test may well be reasonable, though you don't have to use a t-test to test such a hypothesis.

You might use a different statistic, or you might use the same statistic without needing to assume that it has close to a t-distribution under $H_0$. However such a test (whichever way you go about it) will be sensitive to some alternatives that are not simple mean-shifts.
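
As one concrete possibility (a minimal sketch, with simulated scores standing in for the real data): a sign-flipping randomization test uses the same statistic as the paired t-test - the mean of the paired differences - but draws its null distribution from resampling rather than from an assumed t distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for the real paired scores (same statistic as a
# paired t-test, but the null distribution comes from random sign
# flips of the differences, not from assuming a t distribution).
before = rng.integers(1, 6, size=30).astype(float)
after = np.clip(before + rng.integers(0, 3, size=30), 1, 5).astype(float)
d = after - before

observed = d.mean()
flips = rng.choice([-1.0, 1.0], size=(10000, len(d)))
null = (flips * d).mean(axis=1)

# Two-sided p-value: how often random sign assignments give a mean
# difference at least as extreme as the one observed.
p = np.mean(np.abs(null) >= abs(observed))
print(f"mean difference = {observed:.3f}, permutation p = {p:.4f}")
```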

If your research hypothesis was really about picking up a non-zero pseudo-median (or indeed the other sorts of alternative that the signed rank test is sensitive to), that's fine too. You would need to deal with the heavy ties in carrying out the test.

There are numerous other options besides these two tests. There are many potential population parameters that you might have been interested in and many tests that are more or less sensitive to them. A number of such tests exist already and designing and implementing new tests in this sort of circumstance is fairly easy.

  • This post by @Glen_b needs to become a community wiki. It has far-ranging implications. (Commented Jun 13 at 12:44)

I do share @Glen_b's strong sentiments about the need to choose one's hypotheses. But I assume from the context that this is a classroom assignment, and that the instructor wants to test the student's understanding of which test can be used under given conditions. So I will try to provide some practical answers.

  1. You cannot treat Likert-scale data as interval-scale data (see Glen_b's answer for his scepticism about this). I will be stronger than him: performing any arithmetic on such data is not mathematically valid. It assumes that the distance between, say, Strongly agree and Agree, and between Agree and Neither agree nor disagree (5 to 4, and 4 to 3), is the same; it is clearly not, and may actually differ between subjects. So summing them is mathematical nonsense. If you had used a VAS (Visual Analogue Scale), a case could have been made to treat it as interval scale, but that is not your case, and would be a topic for another post. Likert is purely an ordinal scale (Strongly agree is "more" than Agree, and that is all we can say about it; how much "more" it is, we do not know). It is very poor practice to assign numbers to Likert-scale levels, which induces researchers to treat them as interval/ratio scales, which they definitely are not; but that is a rant for another post.
  2. Given that you need to treat your data as ordinal, this limits your choice of tests (a short code sketch of the three tests below follows this list).
  • Sign test (or rather, paired sign test). It is basically a binomial test, requires essentially no assumptions, and is a test of the median of the paired differences: it can tell you whether the After answers tend to be above or below the Before answers, student by student (note that this is not the same thing as comparing the median of After to the median of Before). You say you have ~20 questions, but we do not know how many students. If it is a reasonable number (e.g. ~20), you would have 400 samples per group, so plenty of power even for a binomial test. Honestly, that is what I would use.
  • Wilcoxon signed rank test (WSRt) (or rather, its paired version). But there is a strong caveat: it is fundamentally a test of the pseudo-median (see here, or here, on Wikipedia). Now, the pseudo-median is an acceptable measure of centrality, but not a very intuitive one, and, depending on your audience, one with which they may not be familiar at all. So, use with caution...
    Now, the WSRt can also be a test of the median (and, by the way, of the mean as well!), but only under a fairly stringent condition, namely that the distribution of the paired differences be symmetric. Given that (one would hope) the students' skills improved, that distribution is most likely positively skewed, so that assumption will not be met.
    Lastly, it is debatable whether you can use the WSRt on ordinal data at all; it assumes that you can quantify the difference between Before and After answers (what is Agree minus Neither agree/disagree? Is that 1? And is it the same as Strongly agree minus Agree?). You can probably do it, but there is disagreement about whether that is valid (see e.g. here on CV).
  • Now, you could also abandon the notion of pairing (because it requires subtracting the Before score from the After score, which is debatable for ordinal data), and simply use a 2-sample test. A Mann-Whitney U test (MWUt) can be used on ordinal data (because all it does is compare values from the 2 samples). But... it is NOT a test of medians (as it is sadly too often described; it can only be considered a test of medians under some stringent assumptions, which are unlikely to hold in your case). It is a test of stochastic superiority (not dominance!); Mann & Whitney titled their paper "On a test of whether one of two random variables is stochastically larger than the other". The alternative is that P(After > Before) > 0.5; i.e. the After scores are greater than the Before scores significantly more often than not. And that may be a good demonstration (better than the pseudo-median, certainly) that the course had a positive effect.
  3. TL;DR: I would use a sign test; it requires no assumptions, is a paired test, and definitely works for ordinal data. I would avoid the WSRt. And I would try, just for fun, the MWUt; given that the sample size is rather large, I have a strong hunch that the results will be similar. And if ever they are not, then this would require more digging into the data (accompanied by appropriate head scratching).
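
A minimal sketch of the three tests above (the data here are simulated stand-ins for the real responses; the `scipy.stats` function names are those in recent SciPy versions):

```python
import numpy as np
from scipy.stats import binomtest, wilcoxon, mannwhitneyu

rng = np.random.default_rng(42)

# Simulated stand-ins for the real data: 5-point responses for
# ~20 students x ~20 questions, with a modest improvement after.
n = 400
before = rng.integers(1, 6, size=n)
after = np.clip(before + rng.choice([0, 0, 1, 1, 2], size=n), 1, 5)
d = after - before

# 1) Paired sign test: a binomial test on the direction of the
#    nonzero differences (ties are dropped, as is conventional).
nonzero = d[d != 0]
res_sign = binomtest(int(np.sum(nonzero > 0)), n=len(nonzero), p=0.5)
print("sign test p =", res_sign.pvalue)

# 2) Wilcoxon signed rank test; note the heavy ties and the many
#    zero differences (handled here by Pratt's method).
print("signed rank p =", wilcoxon(after, before, zero_method="pratt").pvalue)

# 3) Mann-Whitney U test, ignoring the pairing: tests whether
#    P(After > Before) exceeds 1/2 (stochastic superiority).
print("Mann-Whitney p =", mannwhitneyu(after, before, alternative="greater").pvalue)
```
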
  • To solve the important issue of the appropriateness of subtracting one ordinal variable from another, I suggest that the Wilcoxon signed rank test be replaced by the rank difference test, or its model equivalent, the proportional odds model accounting for intra-cluster correlation. The signed rank test assumes that the variable is perfectly transformed before subtraction; the rank difference test, on the other hand, is indifferent to monotonic transformations. See here. (Commented Jun 13 at 12:42)
