$\begingroup$

I have some data on how often a particular claim is referenced by two distinct types of sources. The first is traditional news media, and the second is conspiracy theory sites. I want to test the hypothesis that conspiracy sites cite a given claim more frequently than conventional mainstream media sites.

I have data for (a) how many times the claim was cited in 83 news sources, given below

news = [1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   2   2   1   1   1   1   1   1   3   1   1   1   2   1   1   1   2   1   1   1   1   1   7   1   1   1   6   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   2   1   1   1   1   1   1   1   1   1   1   5   1   1   4   1   1];

and for how often the same claim was cited in 17 conspiracy theory sources:

ct = [1 2 1 6 2 20  1 1 1 2 9 1 4 2 2 7 7]

I do not have any a priori reason to believe these data will be normally distributed, nor am I interested in means, so my understanding is that a non-parametric test ought to be employed. From my initial reading, I think the Wilcoxon Rank Sum test might be appropriate to test this hypothesis, and that I should use a right-tailed version because the hypothesis has a specific direction. Running this in MATLAB I get

[p,h,stats] = ranksum(ct,news,'tail','right'); 

which yields $p = 3.5166 \times 10^{-6}$ and strong rejection of the null that they're from populations with the same median. But is this the right test to employ, or would another non-parametric method be better for analysing data of this sort? I will have to run similar analyses in future, so I am open to suggestion or correction!

$\endgroup$

1 Answer

$\begingroup$

Data: (You might want to check whether I captured your data correctly. The news vector as posted has 82 values, not the 83 you mention.)

news = c(1,1,1,1,1,1,1,1,1,1,2,1,
  1,1,1,1,2,2,1,1,1,1,1,1,3,1,1,1,2,
  1,1,1,2,1,1,1,1,1,7,1,1,1,6,1,1,1,
  1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,   
  2,1,2,1,1,1,1,1,1,1,1,1,1,5,1,1,4,1,1)

ct = c(1,2,1,6,2,20,1,1,1,2,9,1,4,2,2,7,7)

Data summaries:

summary(news)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.341   1.000   7.000 
table(news)
news
 1  2  3  4  5  6  7 
69  8  1  1  1  1  1 

summary(ct)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   4.059   6.000  20.000 
table(ct)
ct
 1  2  4  6  7  9 20 
 6  5  1  1  2  1  1 

From boxplots (News at bottom), there seems little doubt that the claim was cited more frequently by CT sources. But maybe CT sources cite all their claims more heavily as a matter of style, so it is not clear in what sense the two samples are comparable.

boxplot(news, ct, horizontal=TRUE, pch=19, col="skyblue")

[Figure: horizontal boxplots of the two samples, News at bottom, CT at top]

Also, the two sample distributions, both right skewed, are of remarkably different shapes, so the Wilcoxon Rank Sum test cannot be interpreted as a straightforward comparison of sample medians.

Using the implementation of this Wilcoxon test in R, I get results similar to yours for the one-sided test. For smaller sample sizes than yours, the Wilcoxon rank sum test does not always give reliable P-values in the presence of as many ties as in your data. However, there is no warning message for your data.

wilcox.test(news, ct, alt="less")

        Wilcoxon rank sum test 
        with continuity correction

data:  news and ct
W = 336, p-value = 4.139e-06
alternative hypothesis: 
 true location shift is less than 0
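Given the concern about ties mentioned above, one possible cross-check is a permutation version of the rank sum test. The following is only a sketch under assumptions that are not in the original post: the data are rebuilt from the frequency tables shown earlier (order does not matter for ranks), and the seed and the number of relabelings `B` are arbitrary choices.

```r
# Data rebuilt from the frequency tables above (an assumption of this sketch).
news <- rep(1:7, times = c(69, 8, 1, 1, 1, 1, 1))
ct   <- rep(c(1, 2, 4, 6, 7, 9, 20), times = c(6, 5, 1, 1, 2, 1, 1))

set.seed(2022)                    # arbitrary seed, for reproducibility only
r    <- rank(c(news, ct))         # midranks, so tied values share a rank
n.ct <- length(ct)
is.ct <- rep(c(FALSE, TRUE), times = c(length(news), n.ct))
obs  <- sum(r[is.ct])             # observed rank sum of the CT sample

B <- 10000                        # number of random relabelings (arbitrary)
# Under the null, any n.ct of the pooled ranks are equally likely to be "CT".
perm <- replicate(B, sum(sample(r, n.ct)))

# One-sided P-value (CT tends to be larger); it cannot go below 1/(B + 1).
p.perm <- (sum(perm >= obs) + 1) / (B + 1)
p.perm
```

Because the ranks are permuted as they are, ties need no special handling here. The resolution of the estimate is limited by `B`, so it can only confirm that the P-value is very small, not reproduce a value near $10^{-6}$.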

In such cases, one does not directly compare medians, but tests whether one sample (in your case CT) stochastically dominates the other.
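To make the idea of one sample tending to be larger than the other concrete, the statistic W = 336 in the output above can be rebuilt by comparing every News value with every CT value. This is a small sketch, with the data rebuilt from the frequency tables shown earlier (my choice for brevity, not part of the original analysis).

```r
# Data rebuilt from the frequency tables above (an assumption of this sketch).
news <- rep(1:7, times = c(69, 8, 1, 1, 1, 1, 1))
ct   <- rep(c(1, 2, 4, 6, 7, 9, 20), times = c(6, 5, 1, 1, 2, 1, 1))

# All pairwise differences news[i] - ct[j], as an 82 x 17 matrix.
d <- outer(news, ct, "-")

# Count pairs where the News value is larger; a tied pair counts one half.
W <- sum(d > 0) + 0.5 * sum(d == 0)
W                                  # 336, as in wilcox.test(news, ct)

# Proportion of pairs: estimates P(news > ct) + 0.5 * P(news = ct).
W / (length(news) * length(ct))    # about 0.24
```

A proportion near 0.5 would suggest neither sample tends to be larger. Here it is about 0.24, so in roughly three out of four pairings (ties split evenly) the CT source cites the claim more often.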

The empirical CDF (ECDF) of a sample is made by sorting the data, starting with 0 at the left and making an upward jump of $1/n$ at each observed value, reaching height $1$ at the right. (If there are ties of multiplicity $k$ at a value, then the jump is of size $k/n$.)
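The construction just described can be carried out by hand for the CT sample and compared with R's built-in `ecdf`. This is a minimal sketch, with `ct` rebuilt from its frequency table shown earlier.

```r
# CT data rebuilt from its frequency table (an assumption of this sketch).
ct <- rep(c(1, 2, 4, 6, 7, 9, 20), times = c(6, 5, 1, 1, 2, 1, 1))
n  <- length(ct)

# A value observed k times contributes a jump of k/n.
jumps   <- table(ct) / n
# Height of the ECDF just after each jump; the last height is 1.
heights <- cumsum(jumps)
round(heights, 3)

# The hand-built heights should agree with ecdf() at the distinct values.
all.equal(as.numeric(heights), ecdf(ct)(sort(unique(ct))))
```

For example, the value 1 occurs 6 times among 17 observations, so the first jump is 6/17, and the value 2 adds a further 5/17.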

A dominant sample plots to the right of a non-dominant one, and thus the dominant ECDF plots below the other ECDF. Here, the ECDF for CT (brown) plots consistently below the ECDF for News (blue).

plot(ecdf(ct), col="brown")
plot(ecdf(news), add=TRUE, col="blue")

[Figure: ECDF plots of CT (brown) and News (blue)]
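The visual comparison of the two ECDFs can also be checked numerically. Since both are step functions that jump only at observed values, it should be enough to evaluate them at the pooled distinct values. This is a sketch, with the data rebuilt from the frequency tables shown earlier; the names `Fn`, `Fc` and `x` are only illustrative.

```r
# Data rebuilt from the frequency tables above (an assumption of this sketch).
news <- rep(1:7, times = c(69, 8, 1, 1, 1, 1, 1))
ct   <- rep(c(1, 2, 4, 6, 7, 9, 20), times = c(6, 5, 1, 1, 2, 1, 1))

Fn <- ecdf(news)                     # ECDF of News, as a function
Fc <- ecdf(ct)                       # ECDF of CT, as a function
x  <- sort(unique(c(news, ct)))      # every point where either ECDF jumps

# Side-by-side heights of the two ECDFs at each jump point.
cbind(x, news = round(Fn(x), 3), ct = round(Fc(x), 3))

# TRUE if the CT ECDF never rises above the News ECDF.
all(Fc(x) <= Fn(x))
```

This describes the two samples only. It is not a formal test of stochastic dominance in the populations.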

$\endgroup$
2
  • 1
    $\begingroup$ Thank you, this is really useful! Is there a condition that the two samples should have broadly similar shapes for the RS test, or does it simply mean that it's not a straightforward median difference issue? $\endgroup$
    – DRG
    Commented May 31, 2022 at 8:26
  • 1
    $\begingroup$ The Wilcoxon Rank Sum test (also called the Mann-Whitney-Wilcoxon test) can be considered a test of different population medians only if the two samples are of similar shapes and variances. Otherwise, it is a test of stochastic dominance (whether values in one dist'n tend to be larger than in the other)--which is a somewhat broader concept. (Looking at a comparison of ECDF plots as at the end of my Answer is an easy way to judge this.) This article from The American Statistician may be helpful. $\endgroup$
    – BruceET
    Commented May 31, 2022 at 10:00
