$\begingroup$

I'm looking for a suitable distribution for 4D data in which the values in each sample sum to one. So far I have used the Dirichlet distribution. However, after fitting it, the marginal distribution implied by the Dirichlet does not match the data well in the first dimension: the influence of the other dimensions appears to shift the fitted marginal to the left. Surprisingly, when I exclude the fourth dimension, or both the fourth and third dimensions, the fit improves significantly. This discrepancy seems to be due to interactions between the dimensions.
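For readers who want to reproduce this kind of marginal check, here is a minimal sketch. Under a Dirichlet with parameters $\alpha$, the $i$-th marginal is $\mathrm{Beta}(\alpha_i, \alpha_0 - \alpha_i)$ with $\alpha_0 = \sum_j \alpha_j$. The fitting method below (method of moments), the function names, and the synthetic data are all assumptions made for illustration; the post does not state how the parameters were actually estimated.

```python
import numpy as np
from scipy import stats


def fit_dirichlet_moments(x):
    """Method-of-moments Dirichlet fit; rows of x are assumed to sum to one."""
    m = x.mean(axis=0)
    v = x.var(axis=0, ddof=1)
    # Each component yields its own estimate of alpha_0 = m(1 - m)/v - 1;
    # combine them with a geometric mean (one common choice, not the only one).
    alpha0 = np.exp(np.mean(np.log(m * (1.0 - m) / v - 1.0)))
    return alpha0 * m


def marginal_qq(x, alpha, dim=0):
    """Theoretical Beta quantiles vs. sorted data for one dimension."""
    a = alpha[dim]
    b = alpha.sum() - a
    n = x.shape[0]
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions
    theoretical = stats.beta.ppf(probs, a, b)
    empirical = np.sort(x[:, dim])
    return theoretical, empirical


# Synthetic stand-in for the real data: 200 samples on the 4D simplex.
rng = np.random.default_rng(0)
x = rng.dirichlet([2.0, 3.0, 4.0, 5.0], size=200)

alpha_hat = fit_dirichlet_moments(x)
theo, emp = marginal_qq(x, alpha_hat, dim=0)
print(alpha_hat)
```

Plotting `emp` against `theo` gives the Q-Q plot for the first dimension; repeating the fit on a renormalized 3D or 2D subset gives the comparison described above.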

My question is whether I should explore alternatives to the Dirichlet distribution, or whether there is a way to introduce an additional parameter that accommodates this discrepancy, assuming it is in fact caused by the variance.

If you're aware of any articles or methods in this field, your guidance would be greatly appreciated.

Please note that I'm working with a limited sample size, approximately 200 samples, so a distribution with a large number of parameters may not be suitable.

Edit:

Thanks to @whuber, I have removed the histogram plots to make the question clearer.

Q-Q plots: [figure: 4D data] [figure: 3D data]

Update: another example of modeling 4D data with a Dirichlet distribution, comparing the 4D, 3D, and 2D fits.

[figure: 4D]

[figure: 3D]

[figure: 2D]

$\endgroup$
  • $\begingroup$ Welcome to CV, roan. What you label a "frequency" evidently is a relative frequency. This makes it difficult to assess whether the data are reasonably well approximated by the curves. Could you plot the actual frequencies in each bin or use a clearer method of indicating the marginal distributions, such as QQ plots? Moreover, even if the marginals aren't perfect, please tell us why that matters: just how closely do you need to approximate the distribution and for what purpose? $\endgroup$
    – whuber
    Commented Nov 5, 2023 at 15:52
  • $\begingroup$ @whuber, Thank you for your comment. I have included a Q-Q plot figure in response to your suggestion. If possible, I would like to model the first dimension of the 4D data in a manner similar to the first dimension of the 3D data. The goal is to use the fitted Dirichlet pdf later for pdf calculations on another dataset. Would it be possible to add a weight that increases the impact of some dimensions relative to others when estimating the Dirichlet parameters? $\endgroup$
    – roan
    Commented Nov 5, 2023 at 16:37
  • $\begingroup$ Could you explain what you see in these plots that might be problematic? Also, "PDF calculations" is not sufficiently specific to help us answer your question. For instance, it doesn't help anyone choose between fitting a parametric distribution and (say) resampling from your data. $\endgroup$
    – whuber
    Commented Nov 5, 2023 at 16:41
  • $\begingroup$ In the Q-Q plots for the 4D data, we observe deviations between the data and the fitted distribution at the extreme high values. Regarding your second question, my goal is to fit a parametric distribution to my data. Does this answer your question? $\endgroup$
    – roan
    Commented Nov 5, 2023 at 17:06
  • $\begingroup$ All QQ plots will exhibit some deviations. Thus, you should (a) test whether they can be attributed to random variation in the sampling or observation process and (b) assess whether the sizes of those deviations might matter in your application. If your goal really is only to "fit a parametric distribution," then that's a mere textbook exercise, so just go ahead and do it. $\endgroup$
    – whuber
    Commented Nov 5, 2023 at 17:15
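
Point (a) in the last comment can be approached with a simulation envelope. The following is a rough sketch, not something taken from the thread: it draws repeated samples of the same size from a Dirichlet with assumed parameters and records the pointwise range of the sorted values in one dimension. The function name, the parameter values, and the choice of 500 simulations are hypothetical, and for simplicity the parameters are not re-estimated in each simulation.

```python
import numpy as np


def qq_envelope(alpha, n, dim=0, n_sim=500, level=0.95, seed=0):
    """Pointwise envelope for the order statistics of one Dirichlet marginal."""
    rng = np.random.default_rng(seed)
    # Shape (n_sim, n, k): n_sim independent samples of size n.
    sims = rng.dirichlet(alpha, size=(n_sim, n))[:, :, dim]
    sims = np.sort(sims, axis=1)  # order statistics within each simulation
    lo, hi = np.quantile(sims, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi


# Assumed parameters standing in for a fitted Dirichlet; n matches ~200 samples.
alpha_fit = np.array([2.0, 3.0, 4.0, 5.0])
lo, hi = qq_envelope(alpha_fit, n=200, dim=0)

# Synthetic "observed" data; with real data, use the sorted first column instead.
observed = np.sort(np.random.default_rng(42).dirichlet(alpha_fit, size=200)[:, 0])
outside = np.mean((observed < lo) | (observed > hi))
print(f"fraction of order statistics outside the envelope: {outside:.3f}")
```

Because the envelope is pointwise, a few points falling outside it is expected even when the model is correct; a long run of points outside it in the upper tail would be more informative.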

1 Answer

$\begingroup$

Over the past few decades, at least a dozen papers have explored variations on the Dirichlet that allow more complex interactions among the dimensions than the original. As Ian R. James and James E. Mosimann showed in "A New Characterization of the Dirichlet Distribution Through Neutrality", the original Dirichlet maximizes independence among the dimensions through "complete neutrality": removing some dimensions always leaves the relative proportions among the remaining dimensions unaffected.
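
As a rough numerical illustration of this neutrality property (a sketch on simulated data with arbitrarily chosen parameters, not an analysis of your data): if the fourth component of a Dirichlet sample is dropped and the rest renormalized, the resulting proportions should be uncorrelated with the dropped component and should have the means of a Dirichlet on the remaining parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 4.0, 5.0])  # arbitrary illustrative parameters
x = rng.dirichlet(alpha, size=100_000)

dropped = x[:, 3]
# Renormalize the remaining three components so each row sums to one again.
sub = x[:, :3] / x[:, :3].sum(axis=1, keepdims=True)

# Neutrality: the renormalized proportions carry no information about x_4.
corr = np.array([np.corrcoef(dropped, sub[:, j])[0, 1] for j in range(3)])
expected_mean = alpha[:3] / alpha[:3].sum()

print("correlation with dropped component:", corr)
print("subcomposition means:", sub.mean(axis=0), "expected:", expected_mean)
```

Running the same correlation check on real data is one informal way to see whether neutrality is plausible; clearly nonzero correlations would point toward one of the generalizations discussed below.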

Since you are interested in modeling more complex interactions among your four dimensions, I would recommend that you consult David D. K. Chow's superb unifying overview of the various ways that the Dirichlet has been generalized. Chow explains:

Through a special mixture distribution based on the Schlömilch integral, we shall provide a unified framework for several distributions appearing in the literature. Special cases include the G3D generalized Dirichlet distribution [17] or scaled Dirichlet distribution [18], the shifted-scaled Dirichlet distribution [19] or simplicial generalized beta distribution [20], the flexible Dirichlet distribution [21], and the tilted Dirichlet distribution [22]. In a similar manner, we shall also obtain a family of distributions that includes the Concrete or Gumbel-softmax distribution [23, 24]. By including additional parameters, these generalizations of the Dirichlet distribution allow for behaviours that the Dirichlet distribution itself cannot model, for example positive covariances $\operatorname{Cov}(X_i, X_j)$.
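
The remark about covariances can be checked directly. For the standard Dirichlet, $\operatorname{Cov}(X_i, X_j) = -\alpha_i \alpha_j / \big(\alpha_0^2 (\alpha_0 + 1)\big)$ for $i \ne j$, where $\alpha_0 = \sum_k \alpha_k$, so every off-diagonal covariance is negative regardless of the parameters. The short sketch below (the helper name and parameter values are chosen only for illustration) computes the full covariance matrix and compares it with a simulated sample.

```python
import numpy as np


def dirichlet_cov(alpha):
    """Exact covariance matrix of a Dirichlet(alpha) random vector."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    m = alpha / a0  # component means
    # Diagonal: m_i (1 - m_i) / (a0 + 1); off-diagonal: -m_i m_j / (a0 + 1).
    return (np.diag(m) - np.outer(m, m)) / (a0 + 1.0)


alpha = np.array([2.0, 3.0, 4.0, 5.0])  # arbitrary illustrative parameters
cov = dirichlet_cov(alpha)
off_diag = cov[~np.eye(len(alpha), dtype=bool)]

rng = np.random.default_rng(2)
sample_cov = np.cov(rng.dirichlet(alpha, size=100_000), rowvar=False)

print("all off-diagonal covariances negative:", bool((off_diag < 0).all()))
print("max abs difference from simulation:", np.abs(cov - sample_cov).max())
```

If the sample covariance matrix of your own data has clearly positive off-diagonal entries, that is something no choice of Dirichlet parameters can reproduce.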

Chow makes good on his claim to organize and unify the whole Dirichlet family through helpful charts like this:

[image: helpful_chart]

While I don't pretend to understand all the details, Chow provides an invaluable framework within which to compare different ways to add parameters and induce the kinds of interactions about which you are inquiring. I hope you find it as useful as I did when trying to make sense of the various generalizations of the Dirichlet that have been proposed.

The book Dirichlet and Related Distributions: Theory, Methods and Applications by Kai Wang Ng, Guo-Liang Tian, and Man-Lai Tang also devotes chapters to some of these variations, but does not tie them together as nicely as Chow.

$\endgroup$