$\begingroup$

I am a bit confused about how people calculate p-values for odds ratios.

The log-odds ratio (LOR) for a contingency table with two entries is $L = \log \frac{p_{1}}{p_{0}}$, and it has a natural (plug-in) estimator using sampled frequencies: $\hat{L} = \log \frac{n_{1}}{n_{0}}$. This estimator has asymptotic standard error $\sqrt{\frac{1}{n_1} + \frac{1}{n_0}}$, which allows you to assign confidence intervals to the estimated LOR. If you also want to assign a p-value to the observed sample LOR, then you need the standard error under the null hypothesis of a LOR of zero. In this case, since $n_1+n_0=N$ and the null corresponds to $n_1 = n_0 = N/2$, it is equal to $\sqrt{\frac{2}{N} + \frac{2}{N}} = \frac{2}{\sqrt{N}}$. This is independent of the population parameters since it depends only on the total number of samples, which makes the standardized estimate a pivotal statistic. This means you can shift the distribution to zero to calculate probabilities under the null hypothesis of a LOR of zero, and assign p-values. No problems there.

However, the LOR for a contingency table with four entries is $L = \log \frac{p_{11}p_{00}}{p_{10}p_{01}}$, and it has a natural (plug-in) estimator using sampled frequencies: $\hat{L} = \log \frac{n_{11}n_{00}}{n_{10}n_{01}}$. This estimator has asymptotic standard error $\sqrt{\frac{1}{n_{11}} + \frac{1}{n_{00}} + \frac{1}{n_{01}} + \frac{1}{n_{10}}}$.
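For concreteness, here is a minimal sketch of the Wald-type calculation that these formulas lead to, using only the Python standard library. The function name `wald_log_odds_ratio` and the example counts are made up for illustration, and the normal approximation is assumed to hold.

```python
import math

def wald_log_odds_ratio(n11, n10, n01, n00, z_crit=1.959963984540054):
    """Wald estimate, standard error, ~95% CI and two-sided p-value
    for the log-odds ratio of a 2x2 table (hypothetical helper)."""
    if min(n11, n10, n01, n00) <= 0:
        raise ValueError("all four cell counts must be positive")
    # Plug-in estimate of the log-odds ratio
    lor = math.log((n11 * n00) / (n10 * n01))
    # Asymptotic standard error evaluated at the observed counts
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    # z-statistic for H0: LOR = 0, with a two-sided normal tail probability
    z = lor / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (lor - z_crit * se, lor + z_crit * se)
    return lor, se, ci, p_value

# Made-up counts, purely for illustration
lor, se, ci, p_value = wald_log_odds_ratio(20, 10, 10, 20)
print(f"LOR = {lor:.3f}, SE = {se:.3f}, CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.4f}")
```

Note that the standard error here is evaluated at the observed counts rather than under the null, which is exactly the point the question is about.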

While this still allows you to construct a confidence interval, it is (if I understand correctly) no longer a pivotal statistic: the variance depends on the observed frequencies and thus the population parameters.

Still, I see people calculate p-values associated with nonzero LORs (see for example this discussion: How to calculate the p.value of an odds ratio in R?). How is that possible? Am I missing something? Are there hidden assumptions?

$\endgroup$

1 Answer

$\begingroup$

If you use a likelihood-based binomial regression, as suggested by Frank Harrell and Ben Bolker on the page you cite, or use log-linear analysis of counts in a contingency table, the p-values are based on the asymptotic normality of the maximum-likelihood estimator. The test statistic is then an (asymptotically) pivotal z-statistic, from which p-values and confidence intervals can be calculated. There remains a question of whether there are enough cases to be close enough to asymptotic normality, but that's an issue for all maximum-likelihood estimation.
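As a rough illustration of a likelihood-based p-value for a 2x2 table, the sketch below computes the likelihood-ratio ($G^2$) statistic for independence by hand and refers it to a chi-square distribution with 1 degree of freedom. This is a swapped-in, related technique rather than the Wald z-test itself: for a binomial regression with a single binary predictor, the likelihood-ratio test of the slope should give the same statistic. The function name and counts are hypothetical, and only the standard library is used.

```python
import math

def lr_test_2x2(n11, n10, n01, n00):
    """Likelihood-ratio (G^2) test of LOR = 0 in a 2x2 table (hypothetical helper)."""
    obs = [[n11, n10], [n01, n00]]
    n = n11 + n10 + n01 + n00
    rows = [n11 + n10, n01 + n00]
    cols = [n11 + n01, n10 + n00]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = obs[i][j]
            if o > 0:  # empty cells contribute zero to the sum
                expected = rows[i] * cols[j] / n  # fitted count under independence
                g2 += 2 * o * math.log(o / expected)
    g2 = max(g2, 0.0)  # guard against tiny negative rounding error
    # Upper tail of chi-square with 1 df: P(Z^2 > g2) = erfc(sqrt(g2 / 2))
    p_value = math.erfc(math.sqrt(g2 / 2))
    return g2, p_value

# Made-up counts, purely for illustration
g2, p_value = lr_test_2x2(20, 10, 10, 20)
print(f"G^2 = {g2:.3f}, p = {p_value:.4f}")
```

In large samples the Wald and likelihood-ratio p-values are expected to be close; noticeable disagreement between them is one informal sign that the asymptotic approximation may be strained.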

Agresti devotes Chapter 3 of the second edition of Categorical Data Analysis to "Inference for Contingency Tables." Sections 3.5 and 3.6 discuss relative advantages of different methods for small samples, where the highly discrete nature of the data poses particular problems.
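For small tables, one standard exact approach is Fisher's exact test, which conditions on the table margins so that the null distribution no longer depends on unknown parameters. It is named here only as an illustration of a small-sample method. A minimal standard-library sketch follows; the function name is hypothetical, and the two-sided rule used (summing all tables no more probable than the observed one) is one common convention among several.

```python
from math import comb

def fisher_exact_two_sided(n11, n10, n01, n00):
    """Two-sided Fisher exact p-value for a 2x2 table (hypothetical helper)."""
    r1, r2 = n11 + n10, n01 + n00   # row totals
    c1 = n11 + n01                  # first column total
    denom = comb(r1 + r2, c1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k,
        # given fixed margins, under the null of no association
        return comb(r1, k) * comb(r2, c1 - k) / denom

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(n11)
    # Sum over tables at most as probable as the observed one;
    # the small relative tolerance absorbs floating-point ties
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Made-up small table, purely for illustration
print(fisher_exact_two_sided(3, 1, 1, 3))
```

Because the conditional distribution is highly discrete for small counts, the attainable p-values come in coarse steps, which is the kind of small-sample issue mentioned above.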

$\endgroup$
