7
$\begingroup$

I know that Pearson correlation is sensitive to outliers, unlike Spearman correlation. I am trying to generate data (let's say with at least 30 points) that will maximize the difference between these two methods. I was trying to generate some points in an ascending linear direction, and then one outlier point in the opposite direction. It indeed showed a big difference in the coefficients, but not as much as I wanted. Any ideas?

$\endgroup$
3
  • $\begingroup$ Maybe something that maximizes the difference between a linear trend and a monotonic trend? Perhaps, in R, x = 0:10 + 0.000000001; y = 1/x; plot(x,y); cor(x,y, method="pearson"); cor(x,y, method="kendall"). This can be run at: rdrr.io/snippets/. Or change that decimal to e.g. 0.3 to make the plot more obvious. $\endgroup$ Commented Feb 1, 2022 at 21:04
  • $\begingroup$ Thank you for the answer. I tried that, but both methods return values of the same sign. I am trying to get opposite signs, with at least one of the values significant. $\endgroup$ Commented Feb 1, 2022 at 21:22
  • $\begingroup$ Your idea is fine. $\endgroup$
    – whuber
    Commented Feb 1, 2022 at 22:33

3 Answers

17
$\begingroup$

Pearson correlation depends on the values of the data; Spearman correlation depends only on their (marginal) ranks. Thus, the former is (far) more sensitive to outlying data.

What kind of outlying data? Those with high leverage. These are far to the left or right of the rest of the points in a plot, as in the left panel of the figure.

Figure

That isolated point at $(-20,20)$ pulls the least-squares line close to it (for otherwise the squared penalty would be huge). As a result, the Pearson correlation (which is the standardized slope of this line) must be large and negative.

However, that same point no longer has the same leverage in a plot of the ranks of the data: yes, it is off to the left again, but it cannot be far to the left. It pulls the least squares line up only a little. The Spearman correlation is large and positive, because the $30$ points already have high positive Spearman correlation and altering the value of one point cannot change those ranks all that much.

Flip these pictures upside-down for an example of a switch from a large positive Pearson correlation to a large negative Spearman correlation.

Fixing the rightmost 30 points along a line segment from $(-1,-1)$ to $(1,1),$ we may vary that outlying point $(-a,a)$ and plot the correlations as a function of $a.$

Figure

The black curve tracks the Pearson correlation. When $a=0,$ the point $(0,0)$ fits in perfectly with the other $30$ points and both correlations are $1.$ But for extremely negative and positive values of $a,$ this leverage phenomenon occurs and the two correlation coefficients separate.

The dotted red curve tracks the Spearman correlation, which stays high no matter what value $a$ might have.

In the limit, the Pearson correlation can approach $-1.$ The Spearman correlation reaches a lower limiting value that depends only on the amount of data: in the figure, it's about $0.806.$ With sufficiently large datasets, the Spearman correlation will stay very close to $1.$ For instance, repeating this example with $300+1$ points rather than $30+1$ points, the Spearman coefficient is never less than $0.980.$
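
That limiting value can be checked by hand. The following is a sketch under stated assumptions: no ties, $|a|>1$ so that the outlier is extreme in both coordinates, $n$ is the total number of points (outlier included), and $d_i$ is the difference between the $x$-rank and the $y$-rank of point $i.$ The outlier then has $|d_i| = n-1$ and each of the other $n-1$ points has $|d_i| = 1,$ so

$$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)} = 1 - \frac{6\left[(n-1)^2 + (n-1)\right]}{n(n^2-1)} = 1 - \frac{6}{n+1}.$$

This gives $25/31 \approx 0.806$ for $n=30$ and $295/301 \approx 0.980$ for $n=300.$ If the outlier is counted as an extra point, $n=31$ gives $0.8125$ and $n=301$ still gives about $0.980.$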

The gray (Pearson) and dotted blue (Spearman) curves show the situation with the $y$ values negated.

Thus, by making $n$ sufficiently large and pulling just a single point away from a highly correlated dataset, you can make the two correlation coefficients as close to $\pm 1$ as you want, but with opposite signs.
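
The construction above can be tried out with a short script. This is only a minimal sketch in plain Python, not the code behind the figures: the helper names (`pearson`, `ranks`, `spearman`, `line_plus_outlier`) and the choice $a = 1000$ are assumptions made for illustration, and the rank helper assumes there are no ties.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def line_plus_outlier(m, a):
    """m equally spaced points from (-1,-1) to (1,1), plus the single point (-a, a)."""
    t = [-1 + 2 * i / (m - 1) for i in range(m)]
    return t + [-a], t + [a]

# Hypothetical settings: a large a, and two dataset sizes to compare.
for m in (30, 300):
    x, y = line_plus_outlier(m, a=1000.0)
    print(m, round(pearson(x, y), 4), round(spearman(x, y), 4))
```

With a large $a,$ the Pearson value printed should be close to $-1$ while the Spearman value stays positive and moves toward $1$ as `m` grows. Negating `y` flips both signs.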

$\endgroup$
7
$\begingroup$

I know that Pearson correlation is sensitive to outliers, unlike Spearman correlation.

There is a more striking difference between the two: Pearson measures the strength of a linear relationship between the variables, whereas Spearman only measures whether the relationship is monotonic (see the image below, taken from Wikipedia). Generating data via a non-linear process is thus a way to show that these are not equivalent.

A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. This means that all data points with greater x values than that of a given data point will have greater y values as well. In contrast, this does not give a perfect Pearson correlation.

Figure
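
As a small illustration of the monotonic-but-nonlinear case, here is a sketch in plain Python. The data ($x = 1,\dots,10$ with $y = e^x$) and the helper names are assumptions chosen for the example, not taken from the Wikipedia figure, and the rank helper assumes no ties.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Hypothetical example data: strictly increasing, but far from linear.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r = pearson(x, y)                   # linear association on the raw values
rho = pearson(ranks(x), ranks(y))   # Spearman: Pearson on the ranks

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because $y$ increases whenever $x$ does, the ranks agree exactly and Spearman's coefficient is $1,$ while Pearson's coefficient is noticeably below $1.$ Both are positive here, so this alone does not produce opposite signs.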

$\endgroup$
0
$\begingroup$

This is the basic idea. In this example Spearman's correlation is obviously 1, and Pearson's correlation is 0.65. You can generate "step data" that will look like almost a straight line, then add an outlier.

Pearson vs. Spearman

$\endgroup$
1
  • $\begingroup$ This is somewhat similar to what was suggested in the comment on the original question, which used x from 0 to 10 and y = 1 / x. But you have to add a small number to x to be able to calculate 1 / x. Here I got rho = 1 and r = 0.5. $\endgroup$ Commented Feb 4, 2022 at 18:35
