$\begingroup$

What is a medcouple? I understand that it is the median of a couple of data points, but it is not clear to me what these pairs of data actually are. The paper at https://wis.kuleuven.be/stat/robust/papers/2008/adjboxplot-revision.pdf explains it, but I am struggling to understand what its kernel is doing.

Can somebody explain?

$\endgroup$

2 Answers

$\begingroup$

This concept concerns a batch of data $(x_1, x_2, \ldots, x_n):$ the medcouple is a way to measure how much a batch deviates from being symmetric.

The center of symmetry, should it exist, would be the median $M.$ To study symmetry, then, it suffices to examine how far each value is from the median. Accordingly, recenter the data to their median residuals

$$y_i = x_i - M.$$

By the very definition of the median, at least half the $y_i$ are zero or greater ("non-negative") and at least half the $y_i$ are zero or smaller ("non-positive").

In a perfectly symmetric distribution, each nonzero $y_i$ has a counterpart $y_{i^\prime} = -y_i$ an equal distance away from $0$ but of the opposite sign. (Let's say the corresponding $x_i$ and $x_{i^\prime}$ are counterparts of each other, too.)

We may therefore measure the imbalance of any $y_j \ge 0$ compared to any $y_i \le 0$ by comparing their absolute values $|y_j| = y_j$ and $|y_i| = -y_i.$

Your reference adopts a relative measure of imbalance,

$$h(y_i, y_j) = \frac{|y_j| - |y_i|}{|y_j| + |y_i|} = \frac{y_j + y_i}{y_j - y_i} = \frac{(x_j - M) + (x_i - M)}{x_j - x_i}.$$

(This is half the "relative percent difference" of the absolute values of the median residuals. It is not, by far, the only such relative measure one could use. See https://stats.stackexchange.com/a/201864/919 for a discussion and a characterization of all such possible measures.)

Your reference remarks there will be problems whenever the denominator is zero, a situation it (incorrectly) dismisses as being of no interest in its intended applications (to samples of distributions that are continuous near their medians). (This remark is incorrect because in any sample of odd size $n$ there will always be one fraction with denominator $0;$ namely, $h(M,M).$ For a full definition of $h,$ see Wikipedia on medcouples.)
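As a minimal sketch of the kernel displayed above (not the full definition with its special cases), the following Python function evaluates $h$ for one pair and makes the zero-denominator case explicit instead of resolving it. The name `two_point_skewness` and the choice to raise an error are assumptions made purely for illustration.

```python
def two_point_skewness(x_i, x_j, M):
    """Kernel h for a pair with x_i <= M <= x_j (hypothetical helper name)."""
    if not (x_i <= M <= x_j):
        raise ValueError("expected x_i <= M <= x_j")
    if x_j == x_i:
        # Both values equal the median: the fraction is 0/0. The full
        # definition treats this case separately; this sketch does not.
        raise ValueError("h is undefined by this formula when x_i == x_j == M")
    # ((x_j - M) + (x_i - M)) / (x_j - x_i), as in the displayed formula.
    return ((x_j - M) + (x_i - M)) / (x_j - x_i)
```

For instance, `two_point_skewness(4, 12, 5)` evaluates $(7 - 1)/8$, while `two_point_skewness(5, 5, 5)` raises an error because the denominator vanishes.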

The salient properties of this measure are

  1. Location invariance: when a constant is added to all $x_i,$ $h$ does not change. This is by construction: the $y_i$ are unaffected by this change of location of the $x_i.$

  2. Scale invariance: when all $x_i$ are multiplied by a positive value, $h$ does not change.

  3. Universal finite range: $-1 \le h \le 1$ always. This is obvious from the expression for $h$ in terms of absolute values (apply the triangle inequality for the Euclidean line $\mathbb R$ for a rigorous proof).

  4. Small values of $h(x_i,x_j)$ indicate $x_i$ and $x_j$ are close to being counterparts. ("Small" of course means relative to $1,$ the largest possible absolute value of $h.$)

  5. Sign equivariance: when all the data are negated, all the $h(x_i,x_j)$ are negated, too, because $h(x_i,x_j) = -h(-x_j, -x_i).$

  6. Indication of skewness. The sign of $h(x_i, x_j)$ is positive when $x_j$ is further above the median than $x_i$ is below the median.

Absolute values near $1$ indicate one of the values is much further from $M$ than the other is, relative to the distance between $x_j$ and $x_i.$ Positive values mean $x_j$ is further and negative values mean $x_i$ is further.
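The invariance properties listed above can be checked numerically on a single pair. The sketch below uses exact fractions; the helper `h` and the values $4, 12, 5$ are illustrative assumptions, and the median is shifted, scaled, or negated together with the data, as it would be for a real batch.

```python
from fractions import Fraction

def h(x_i, x_j, M):
    # Two-point kernel from the formula above; assumes x_i <= M <= x_j and x_i < x_j.
    return Fraction((x_j - M) + (x_i - M), x_j - x_i)

x_i, x_j, M = 4, 12, 5                     # illustrative values only
base = h(x_i, x_j, M)

shifted = h(x_i + 10, x_j + 10, M + 10)    # add a constant to everything
scaled = h(3 * x_i, 3 * x_j, 3 * M)        # multiply by a positive value
negated = h(-x_j, -x_i, -M)                # negation swaps the roles of the pair

print(base, shifted, scaled, negated)      # 3/4 3/4 3/4 -3/4
```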

This all justifies calling $h(x_i,x_j)$ something like a "two-point skewness measure" whenever $x_i \le M \le x_j.$ However, it's only one indication of the overall distribution of the data. The medcouple summarizes these two-point skewnesses.

Thus, if there is an overall tendency for positive deviations of data to exceed the magnitudes of negative deviations, an average of the $h(x_i, x_j)$ will measure the "overall skewness" (again restricting to $x_i\le M$ and $x_j\ge M$).

Continuing in the spirit of using robust statistics, for the average we may use the median. Thus,

the medcouple of the batch $(x_1, x_2, \ldots, x_n)$ is the median of all the two-point skewness measures.


Consider, as a simple example, the batch $(4, 4, 6, 12).$ Its median can be taken to be midway between $4$ and $6,$ equal to $5.$ The deviations $y_i$ are $(-1,-1,1,7).$ The two nonpositive deviations $(y_1,y_2)=(-1, -1)$ can be taken to be the $y_i$ and the two nonnegative deviations $(y_3,y_4)=(1,7)$ will serve as $y_j,$ thereby giving four possible two-point skewness indicators:

$$\begin{aligned} h(y_1,y_3) &= h(-1,1) = 0;\\ h(y_1,y_4) &= h(-1,7) = 6/8;\\ h(y_2,y_3) &= h(-1,1) = 0;\\ h(y_2,y_4) &= h(-1,7) = 6/8. \end{aligned}$$

The resulting batch of two-point skewness indicators $(0, 6/8, 0, 6/8)$ has $3/8$ as its median: this is the "medcouple" of the original batch $(x_1, \ldots, x_4).$ It tells us a typical two-point skewness measure is $3/8:$ this batch is positively skewed by this amount.
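The calculation in this example can be reproduced with a naive, quadratic-time sketch in Python. It is only an illustration of the definition, not an efficient or complete algorithm: the function name `naive_medcouple` is an assumption, and the sketch deliberately refuses batches in which a data value equals the median, since that case needs the full definition of $h.$

```python
from statistics import median

def naive_medcouple(data):
    """Median of h over all pairs x_i <= M <= x_j (quadratic-time sketch)."""
    M = median(data)
    lower = [x for x in data if x <= M]   # candidates for x_i
    upper = [x for x in data if x >= M]   # candidates for x_j
    values = []
    for x_i in lower:
        for x_j in upper:
            if x_i == x_j:
                # Only possible when both equal M; needs the full definition.
                raise ValueError("data value equal to the median is not handled")
            values.append(((x_j - M) + (x_i - M)) / (x_j - x_i))
    return median(values)

print(naive_medcouple([4, 4, 6, 12]))  # 0.375
```

A batch that is symmetric about its median, such as $(1, 2, 4, 5),$ gives $0$ with the same function.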

$\endgroup$
$\begingroup$

I'm sorry, I'm not very good with formulas or formatting, but I will still try my best to share my understanding of the medcouple.

We have the MC itself as MC = med h(xi, xj)

and we have h as h(xi, xj) = ((xj−Q2)−(Q2−xi)) / (xj−xi), where Q2 is the median.

We have two indices, i and j, which are used to form the couples to be compared. As I understand it, the goal is to compare the biggest value in the dataset to the smallest one, then the second-biggest to the second-smallest, the third-biggest to the third-smallest, and so on.

This can be done by sorting the dataset in descending order, with j starting at 1 and i starting at [length of data]. With every step, j is increased by 1 and i is decreased by 1.

Example dataset, already sorted in descending order: {10, 8, 5, 2, 1}

x[j=1] starts on the left side and selects 10. x[i=5] starts on the right side and selects 1. That's the first couple for the medcouple calculation.

For step 2, j is increased by 1 and i is decreased by 1.

x[j=2] selects 8 and x[i=4] selects 2. That's the second couple for the medcouple calculation (and also the final one in our example, since no pairs remain to the left and right of the median).

Now we can plug these couples into the formula h(xi, xj) = ((xj−Q2)−(Q2−xi)) / (xj−xi). Here (xj−Q2) measures how far the bigger of the two couple values lies above the median, and (Q2−xi) measures how far the smaller of the two couple values lies below the median. You can also see these as the distances to the median of the two couple values.

(xj−Q2) − (Q2−xi) evaluates the difference between the distances of the couple values to the median. If this value is negative, you know that the smaller value of the couple lies further away from the median than the bigger value of the couple.

By dividing (xj−Q2) − (Q2−xi) by (xj−xi), you standardise the difference between the couple values' distances to the median by their distance to each other.
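As a small illustration, here is a Python sketch that evaluates this formula for the two couples formed in the example above, using exact fractions. The helper name `h` and the variable names are only assumptions for the sketch.

```python
from fractions import Fraction

def h(x_i, x_j, q2):
    # ((x_j - Q2) - (Q2 - x_i)) / (x_j - x_i), with the numerator parenthesised.
    return Fraction((x_j - q2) - (q2 - x_i), x_j - x_i)

q2 = 5                       # median of {10, 8, 5, 2, 1}
couples = [(1, 10), (2, 8)]  # (x_i, x_j) for step 1 and step 2 above
values = [h(x_i, x_j, q2) for x_i, x_j in couples]
print(values)  # [Fraction(1, 9), Fraction(0, 1)]
```

The first couple gives a small positive value because 10 lies slightly further from the median than 1 does; the second gives 0 because 8 and 2 are equally far from it.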

For each of your value couples (from step 1 to step n, see above) you now have a value that tells you which of the two values lies further away from the median and how impactful this difference is in relation to the distance between the two values of the couple (remember: when this value is < 0, the lower value is further away from the median than the higher value).

Now we take the median of all these calculated values to get a measure of the skewness of the distribution. If this median lies below 0, we know that there are more couples where the smaller value is further away from the median than the larger value; in other words, the left tail of the distribution is longer than the right one.

$\endgroup$
