$\begingroup$

I have two one-dimensional samples that I'm trying to distinguish quantitatively (or show that no such distinction exists). That is, the null hypothesis is that they come from the same population (distribution?), and the alternative is that they don't. After some reading I figured that the K-S test is what I need. To implement it, I'm following the general instructions here (as well as Wikipedia).

I compute the distribution functions as the first link shows (for each value on the X axis, the number of members of each sample with a smaller value). The result can be seen in the picture.

Then it gets a bit confusing: do I calculate the test statistic as simply the maximum "vertical" distance between these distribution functions? That is, I build the list of all the Y differences (one for every X) and take the maximum of these as the final test statistic (to be compared against the critical value)? In my case the differences look like the second picture, so I get 9 as the resulting value.

The reason I'm confused is that, if I'm to believe Wikipedia, 9 (my result) is much higher than the critical value. I have n = m = 137 (both samples have 137 elements, even though they represent independent events), so the square root comes out to a measly 0.12; hence even at a crazy 0.1% significance level my statistic rejects the null with flying colors. In fact, I could have pushed my significance level all the way down to Exp(-11097). Yes, that's e to the power of minus eleven thousand...

This is more than suspicious, so I want to make sure that I'm doing everything correctly. Or maybe I am correct, but the test itself is unfit here because it is clearly too prone to type I errors in my situation. In that case, any advice on good alternatives?

$\endgroup$

1 Answer 1

$\begingroup$

The K-S test is defined in terms of the maximum difference between cumulative distribution functions. The CDF is defined as $F_X(t)= P(X\leq t)$, so its range is from 0 to 1, not 0 to the sample size, as you have. Your link has it defined correctly, as a fraction with the sample size in the denominator. You need to divide your distribution functions by their respective sample sizes to get them into the interval $[0,1]$.
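As a rough illustration of that normalization, here is a minimal Python sketch of the two-sample statistic built from empirical CDFs. The helper names `ecdf` and `ks_statistic` and the toy samples are made up for this example; they are not taken from your data or from any library.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F(t) = (number of sample values <= t) / len(sample)."""
    data = sorted(sample)
    n = len(data)
    # bisect_right counts how many sorted values are <= t.
    return lambda t: bisect_right(data, t) / n


def ks_statistic(a, b):
    """Maximum vertical distance between the two empirical CDFs."""
    fa, fb = ecdf(a), ecdf(b)
    # Both ECDFs are step functions that only jump at observed values,
    # so it is enough to check the pooled sample points.
    return max(abs(fa(t) - fb(t)) for t in set(a) | set(b))


# Toy samples (hypothetical, for illustration only).
a = [1.0, 2.0, 3.0, 4.0]
b = [2.5, 3.5, 4.5, 5.5]
d = ks_statistic(a, b)
print(d)  # 0.5
```

Because each count is divided by its own sample size, the result always lies between 0 and 1, whatever the sample sizes are.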

Your difference should be 9/137, not 9.
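To see what this does to the conclusion, the sketch below compares 9/137 with the usual large-sample rejection threshold $c(\alpha)\sqrt{(n+m)/(nm)}$, where $c(\alpha)=\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$. I am assuming this is the Wikipedia formula you used, since it reproduces your 0.12; the function name `ks_critical_value` and the choice of $\alpha = 0.05$ are mine.

```python
import math


def ks_critical_value(n, m, alpha):
    """Large-sample rejection threshold for the two-sample K-S statistic."""
    # c(alpha) = sqrt(-ln(alpha / 2) / 2), the asymptotic approximation.
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


n = m = 137
d = 9 / 137  # normalized statistic, about 0.066
threshold = ks_critical_value(n, m, alpha=0.05)  # about 0.16
reject = d > threshold
print(round(d, 4), round(threshold, 4), reject)
```

Under these assumptions the statistic is well below the threshold, so the null hypothesis is not rejected at the 5% level.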

$\endgroup$
