When using the function hoeffd in the CRAN package Hmisc I get unusual p-values for pairs of data sets that are independent. The function hoeffd is an implementation of Hoeffding's $D$ statistic. Given a pair of independent data sets, the p-value from the $D$ statistic should be a uniform random variable. However, I get a distribution that looks uniform up to $0.4$, then the density spikes and decreases linearly to $1$.
Suppose I run the following code, which generates $10{,}000$ pairs of independent Gaussian data sets (each set of length $100$), calculates the p-value for the independence test on each of them, and then plots a histogram of the p-values:
library(Hmisc)
N = 10000
n = 100
set.seed(n)
pvals = numeric(N)
for (i in 1:N) {
pvals[i] = hoeffd(rnorm(n), rnorm(n))$P[1, 2]
}
hist(pvals, freq = FALSE, main = paste("P-Values: n = ", n, sep = ""), xlab = "P-Values")
Running the same simulation, but with each data set in each pair having $1{,}000$ observations:
N = 10000
n = 1000
set.seed(n)
pvals = numeric(N)
for (i in 1:N) {
pvals[i] = hoeffd(rnorm(n), rnorm(n))$P[1, 2]
}
hist(pvals, freq = FALSE, main = paste("P-Values: n = ", n, sep = ""), xlab = "P-Values")
gives
It takes a long time to compute this with $n = 10{,}000$ but a pattern like this is still visible with a small number of simulations.
If you run the code above but replace
rnorm(n)
with, for example,
rexp(n, 1)
you get essentially the same results.
I'm assuming that this is due to the fact that the p-values are asymptotic approximations of the true p-values, which is stated in the reference manual. But shouldn't the distribution look more uniform than this? Why do the p-values have this distribution?
I am using R version 3.5.2 ("Eggshell Igloo") on my Mac (but I get the same distribution on Windows) and I'm using Hmisc version 4.2-0.