
I generated distributions of travel times of commuters using transportation simulation tools (for different scenarios). The distributions are attached below. I wish to statistically compare each pair of these non-parametric distributions.

[Figure: simulated travel-time distributions for each scenario]

Null hypothesis: the distributions belong to the same population and differ only by chance (randomness).

Alternative hypothesis: the distributions do not belong to the same population, i.e., the factors varied in each simulation affected the outcome distribution.

Q1. Which test should I use? Some tests compare medians, but these distributions can have multiple peaks, so a similar median does not mean they belong to the same population.

Q2. I am currently using the Kolmogorov–Smirnov test, which looks at the maximum gap between the empirical distribution functions. Can I use a chi-square test instead?
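
For context, here is a minimal sketch (not from the original post) of what both questions look like in R. The samples t1 and t2, the sample sizes, and the binning are all hypothetical choices for illustration: the two samples have similar medians (both near 30) but clearly different shapes, so comparing medians is uninformative, while the Kolmogorov–Smirnov test and a chi-square test on binned counts both detect the difference.

set.seed(42)

# Two hypothetical travel-time samples (minutes): similar median, different shape
t1 <- rnorm(2000, mean = 30, sd = 5)                 # unimodal
t2 <- c(rnorm(1000, 22, 2), rnorm(1000, 38, 2))      # bimodal

median(t1); median(t2)   # medians are close, so a median comparison misses the difference

# Kolmogorov-Smirnov test: maximum gap between the two empirical CDFs
ks.test(t1, t2)

# Chi-square test of homogeneity: bin both samples with common edges and compare counts
breaks <- quantile(c(t1, t2), probs = seq(0, 1, 0.1))   # pooled deciles as bin edges
counts <- rbind(table(cut(t1, breaks, include.lowest = TRUE)),
                table(cut(t2, breaks, include.lowest = TRUE)))
chisq.test(counts)

The chi-square version depends on the (arbitrary) binning, which is one reason the answers and comments below focus on comparing the full distribution functions instead.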

  • 1) What does "nonparametric distributions" mean to you? // 2) What about the KS test do you not like? // 3) What would you like about a chi-squared test?
    – Dave
    Commented Oct 28, 2021 at 15:38
  • Did you mean to include histograms or smoothed density estimates of your simulated distributions? How many distributions have you generated and are testing? For instance, if you have 4 distributions, are you calculating 6 pairwise comparisons or do you want 1 global test?
    – AdamO
    Commented Oct 28, 2021 at 15:46
  • If there is no family of distributions specified, you can't use hypothesis testing, because there are an infinite number of hypotheses to be tested. Hypothesis testing works on specific parameters of interest, and you can look at those.
    – Paul
    Commented Oct 28, 2021 at 15:56
  • @Dave I am not an expert in statistics, but I will try to answer based on limited knowledge. 1. Non-parametric distribution: no assumption about the shape of the underlying distribution is made. The distributions are generated from simulations.
    – SiH
    Commented Oct 28, 2021 at 15:58
  • 2
    $\begingroup$ 4) I suggest that you ask about the question you have about your data, not about your approach to solving a question that you do not know how to solve. $\endgroup$
    – Dave
    Commented Oct 28, 2021 at 16:17

1 Answer


The problem with comparing simulated distributions as you describe is that $n$ is arbitrary. In other words, you are only using simulation as a way of calculating the distribution function. So $n$ can be set to 100, 1,000, or 10,000, and the power to reject the null hypothesis becomes arbitrarily high. Conversely, fixing $n$ at some arbitrary value is of no use. From a testing perspective, a distribution is a population-level summary whereas a random sample is a sample-level summary, and so it doesn't make sense to perform statistical inference on populations when there is no "super"population to generalize to.
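
As a small illustration of that point (not part of the original answer): the sketch below applies the KS test to two simulated populations that differ by a practically negligible location shift of 0.05; the shift and the sample sizes are arbitrary assumptions. With enough simulated observations, even this trivial difference is eventually declared "significant".

set.seed(1)

# Two populations differing by a negligible shift of 0.05 standard deviations;
# the p-value is driven entirely by how much we choose to simulate
for (n in c(100, 1000, 10000, 100000)) {
  pval <- ks.test(rnorm(n, mean = 0), rnorm(n, mean = 0.05))$p.value
  cat(sprintf("n = %6d   KS p-value = %.3g\n", n, pval))
}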

Having said that, you should crank up the number of simulation iterations as high as your CPU can handle to get as precise an estimate of the distribution functions as possible. Then "finding a difference" boils down to no more than actually finding some difference in the curve(s), and your job is done.

The matter of what comprises a "difference" is interesting. The supremum norm is not an intuitive comparison in practice. The KS test has interesting operating characteristics: it is a distribution-free test of the strong null hypothesis $F_1 = F_2$, where $F_1$ and $F_2$ are the distribution functions of the respective populations. That null is false if $F_1(x) \ne F_2(x)$ for even a single $x$. However, you can easily calculate $\int (F_1(x) - F_2(x))^2 \, dx$ and call this the integrated squared difference. You can then rank these differences, or display them in a heatmap by plotting higher intensities of color for relatively larger pairwise differences. This will show you which distributions are more disparate than the others.

set.seed(123)

BIGNUM <- 1e2                    # simulated observations per distribution (increase for precision)
p <- 50                          # number of distributions to compare
b <- rnorm(p, 0, 1)              # each distribution gets its own mean
x <- sapply(b, rnorm, n=BIGNUM, sd=1)   # BIGNUM draws per distribution, one column each
d <- apply(x, 2, ecdf)           # list of empirical CDFs, one per distribution
pairs <- combn(1:p, 2)           # all pairwise comparisons

# integrated squared difference between the ECDFs of each pair
delta <- apply(pairs, 2, function(ind) {
    integrate(function(x) (d[[ind[1]]](x) - d[[ind[2]]](x))^2,
              lower=-Inf, upper=Inf, subdivisions=10000)$value
  })

# heatmap of pairwise differences: bluer squares mark more disparate pairs
plot(t(pairs), pch=22, bg=rgb(1-delta/max(delta), 1-delta/max(delta), delta/max(delta)))

# overlay the ECDFs of the most discordant pair
ind <- pairs[, which.max(delta)]
plot(d[[ind[1]]], xlim=c(-7, 7))
lines(d[[ind[2]]])

[Figure: heatmap of pairwise integrated squared differences, and the ECDFs of the most discordant pair]

Note that this image lets you pick out distributions 44 and 18, which have the most discordant means in this simple normal example (-1.97 and 2.17).

  • It's unclear why you reject the KS statistic in favor of the $L_2$ norm, because the KS statistic is even easier to calculate. Regardless, a permutation test would be reasonably fast and quickly could determine the null distribution of any pairwise test statistic and of its maximum among all pairs, thereby enabling formal hypothesis testing.
    – whuber
    Commented Oct 28, 2021 at 19:13
  • @whuber I am advising to buck the idea of hypothesis testing altogether. According to OP, they have "generated distributions", which means this is not a statistical problem of performing inference on a sample, but rather needing a method to quantify differences between populations. Let me know if you read otherwise.
    – AdamO
    Commented Oct 28, 2021 at 20:28
  • Fair enough; your comments about that situation are good. I was trying to interpret the question more generally as supposing these were real data for which the OP does not have the opportunity to generate any more.
    – whuber
    Commented Oct 28, 2021 at 20:59
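
For readers in the situation whuber describes in the first comment above (the data are fixed and cannot be regenerated), here is a minimal sketch, not taken from the comment authors, of a two-sample permutation test for one pair using the KS statistic; the samples x and y and the number of permutations are arbitrary assumptions for illustration.

set.seed(7)

# Hypothetical pair of fixed samples to compare
x <- rnorm(200, 0, 1)
y <- rnorm(200, 0.3, 1)

# Observed KS statistic for the pair
obs <- ks.test(x, y)$statistic

# Null distribution of the statistic under random relabelling of the pooled data
pooled <- c(x, y)
nperm <- 2000
perm <- replicate(nperm, {
  idx <- sample(length(pooled), length(x))
  ks.test(pooled[idx], pooled[-idx])$statistic
})

# Permutation p-value
mean(perm >= obs)

To handle all pairs at once, as the comment suggests, one could record the maximum pairwise statistic in each permutation and compare the observed maxima against that null distribution, which controls for the multiple comparisons.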
