$\begingroup$

For categorical variables with $l \ge 2$ categories, what is the sampling distribution of the proportion of events in each category? These are obviously not independent, since they add up to 1.

Does it matter if these variables are ordinal?

For binary variables, the well-known result is that the proportion of events has a normal distribution with mean $p$ and variance $p(1-p)/n$.

With an arbitrary number of categories, is the result a generalization of this?

  • I've seen the multivariate normal mentioned, with a particular covariance matrix. But I am not sure how that works, since the draws are not guaranteed to sum up to 1.
  • I know the Dirichlet distribution is used for this in Bayesian estimation. Ideally, for my application, I am looking for a frequentist solution, just to keep things simple.
  • A "natural" solution could be to draw from the multinomial (with the probability parameter set to the sample proportions) and then divide by the number of trials. I've not seen this mentioned anywhere. Does this have a name? If this were indeed the solution, why isn't it also the solution for $l = 2$ categories (where we would use the binomial)?

A good reference is a plus. A solution that someone has already implemented in R is also a plus.

$\endgroup$
  • 1
    $\begingroup$ For binary variables, I would have thought the proportion of one of the categories had a binomial distribution scaled by $\frac1n$ rather than a normal distribution. So with more categories, I would have thought you had a multinomial distribution, again scaled by $\frac1n$. $\endgroup$
    – Henry
    Commented Jul 7 at 23:15
  • 1
    $\begingroup$ If certain assumptions hold (that observations arise from a Bernoulli process), then you'd have a scaled binomial (which is approximately normal in sufficiently large samples). Under analogous assumptions, this would be a scaled multinomial, which also has an asymptotic normal distribution (it's degenerate because of the sum-to-one constraint; it lives on a hyperplane of dimension $k-1$ for $k$ categories). The mean and variance of each term are as for the scaled binomial, and the covariances are $-p_i p_j/n$... ctd $\endgroup$
    – Glen_b
    Commented Jul 7 at 23:25
  • 2
    $\begingroup$ I think @Henry was very kind in his comment. The well-known result for $l=2$ (binary variables) is the binomial distribution, which, as $n$ gets larger and provided $p$ is not too close to 0 or 1, is increasingly well approximated by a Gaussian (but why use an approximation when one can use the true binomial distribution?). For $l \ge 3$, this is the multinomial distribution. Variables can be categorical or ordinal (e.g. a 6-sided die). Wikipedia has good descriptions of both distributions (binomial and multinomial). $\endgroup$
    – jginestet
    Commented Jul 7 at 23:26
  • 1
    $\begingroup$ ctd ... I don't know of any reference that directly discusses the scaled multinomial in detail. It would be a waste of time: it's just a linear rescaling of the multinomial, so any work with it would be a simple undergrad exercise. Some references discuss the multivariate normal approximation to either the multinomial or the scaled multinomial, including Pearson's original 1900 paper on the chi-squared test for multinomial goodness of fit. Also see the refs here: stats.stackexchange.com/questions/34547/… $\endgroup$
    – Glen_b
    Commented Jul 7 at 23:39
  • $\begingroup$ Thank you all. I think the correct answer is the binomial / multinomial rescaled by $1/n$. I just can't find anything that explicitly says so. I am also confused about why people bring up the normal approximation. If we have the actual distribution, what's the point of the approximation? $\endgroup$
    – Jessica
    Commented Jul 8 at 14:57
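
The moments quoted in the comments can be checked numerically. What follows is only a minimal sketch, assuming independent draws with fixed category probabilities. It uses nothing beyond the Python standard library. The function names (`sample_proportions`, `empirical_moments`, `theoretical_cov`) and the values of `p`, `n` and the number of replications are illustrative choices, not anything taken from the thread. It draws multinomial counts, divides by $n$, and compares the empirical mean and covariance of the proportions with $p_i$, $p_i(1-p_i)/n$ and $-p_i p_j/n$.

```python
import random
from collections import Counter


def sample_proportions(p, n, rng):
    """One draw of the sample proportions: multinomial(n, p) counts divided by n."""
    draws = rng.choices(range(len(p)), weights=p, k=n)
    counts = Counter(draws)
    return [counts[i] / n for i in range(len(p))]


def empirical_moments(samples):
    """Mean vector and (unbiased) covariance matrix of a list of equal-length vectors."""
    m = len(samples)
    k = len(samples[0])
    mean = [sum(s[i] for s in samples) / m for i in range(k)]
    cov = [
        [
            sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (m - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]
    return mean, cov


def theoretical_cov(p, n):
    """Covariance of the scaled multinomial: p_i(1-p_i)/n on the diagonal, -p_i p_j/n off it."""
    k = len(p)
    return [
        [(p[i] * (1 - p[i]) if i == j else -p[i] * p[j]) / n for j in range(k)]
        for i in range(k)
    ]


rng = random.Random(0)      # fixed seed so the run is repeatable
p = [0.5, 0.3, 0.2]         # assumed true category probabilities (illustrative)
n = 50                      # trials per sample (illustrative)
samples = [sample_proportions(p, n, rng) for _ in range(20000)]

mean, cov = empirical_moments(samples)
theory = theoretical_cov(p, n)

print("empirical mean:", [round(x, 4) for x in mean])
for emp_row, th_row in zip(cov, theory):
    print([round(x, 5) for x in emp_row], "vs", [round(x, 5) for x in th_row])
```

Each row of the theoretical covariance matrix sums to zero, which is the sum-to-one constraint showing up as a singular (degenerate) covariance. In R, the same draw would presumably be written as `rmultinom(1, n, p) / n`.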
