$\begingroup$

I was watching this video (https://www.youtube.com/watch?v=UBiaLq5V7mE), which discusses a nonparametric Bayesian approach for deciding the number of clusters in a dataset.

Essentially, the Dirichlet process can be described through "customers entering a restaurant and deciding whether to sit at an empty table or a non-empty table, depending on the current seating arrangement at the restaurant" (the Chinese Restaurant Process). In this analogy, individual data points are the "customers" and clusters are the "tables": data points are probabilistically assigned to existing or new clusters, so that the total number of clusters does not need to be specified in advance.
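For concreteness, here is a minimal sketch of the seating scheme described above, under the usual formulation where customer number i + 1 joins an existing table with probability proportional to its current size and opens a new table with probability proportional to a concentration parameter. The names `simulate_crp` and `alpha` are my own choices for illustration and are not taken from the video.

```python
import random


def simulate_crp(n_customers, alpha, seed=None):
    """Seat customers one at a time according to a Chinese Restaurant Process.

    alpha (> 0) is the concentration parameter (an illustrative name).
    Returns (assignments, table_sizes), where assignments[i] is the table
    index of customer i and table_sizes[k] is the occupancy of table k.
    """
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # Existing table k has weight table_sizes[k]; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, table_sizes


assignments, table_sizes = simulate_crp(n_customers=100, alpha=1.0, seed=0)
print("number of tables:", len(table_sizes))
print("table sizes:", table_sizes)
```

Note that this simulation uses no data values at all: it only draws a random partition of the customers, and the number of tables varies from run to run.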


My Question: What are the advantages of doing this compared to the standard methods used to decide the number of clusters, such as an "elbow plot" or a "silhouette plot"?


Does anyone know why such complicated Bayesian nonparametric methods (a Dirichlet process via the Chinese Restaurant Process) need to be used to infer the true number of clusters in a dataset, compared to the more standard methods?

Are these Bayesian nonparametric methods more "powerful" on higher-dimensional data than the standard methods? Do they allow you to place "probabilistic uncertainty" on the number of clusters? Do they attempt to better account for the fact that new data might not belong to any of the existing clusters?

Thanks!

$\endgroup$

1 Answer

$\begingroup$

You are comparing apples to oranges. The nonparametric “methods” you mention are part of the definition of the underlying probabilistic clustering model. They cannot be compared to model-agnostic methods for deciding on the number of clusters. It's like comparing regularization to feature selection algorithms: the former is an integral part of the model and produces a model-specific solution, whereas the latter is a general-purpose algorithm.
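To illustrate what "part of the model" means, one common way to write a Dirichlet process mixture uses the Chinese Restaurant Process as the prior over cluster assignments. The notation below ($z_i$ for the cluster label of observation $x_i$, $n_k$ for the current size of cluster $k$, $\alpha$ for the concentration parameter, $G_0$ for the prior over cluster parameters, $F$ for the component distribution) is assumed here for illustration:

$$
P(z_i = k \mid z_{1:i-1}) =
\begin{cases}
\dfrac{n_k}{i - 1 + \alpha} & \text{for an existing cluster } k, \\[2ex]
\dfrac{\alpha}{i - 1 + \alpha} & \text{for a new cluster},
\end{cases}
\qquad
\theta_k \sim G_0,
\qquad
x_i \mid z_i, \theta \sim F(\theta_{z_i}).
$$

The prior on $z$ ignores the data, but the posterior over assignments (and hence over the number of occupied clusters) combines this prior with the likelihood $F$, so it does depend on the observed values.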

$\endgroup$
  • $\begingroup$ @Tim: Thank you for your answer! Does the Chinese restaurant process only decide the number of clusters? In the end, how do we decide which observations are assigned to which cluster? I find it a bit difficult to grasp the idea that in nonparametric clustering, the number of clusters and the cluster assignments are decided not from the covariates but only from the number of data points. Thanks! $\endgroup$
    – stats_noob
    Commented Dec 27, 2021 at 16:18
  • $\begingroup$ @stat555 it serves as a prior over the number of clusters, see stats.stackexchange.com/q/348522/35989 $\endgroup$
    – Tim
    Commented Dec 27, 2021 at 16:44
  • $\begingroup$ @Tim: Thank you for your reply! So the number of clusters can be decided using the Chinese restaurant process (this does not depend on the covariate values), and then a clustering algorithm like Gaussian mixture clustering or k-means can be run with that number of clusters? Thanks! $\endgroup$
    – stats_noob
    Commented Dec 27, 2021 at 16:54
  • $\begingroup$ @stats555 no, that's exactly the opposite of what I said. It is used as a prior within a probabilistic model. It is one of the priors used in the model. There's no easy way to use it to pick the number of clusters for an arbitrary clustering algorithm. $\endgroup$
    – Tim
    Commented Dec 27, 2021 at 17:16
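
The point made in the comments, that the Chinese restaurant process acts as a prior over the number of clusters, can be made concrete with a small sketch. Assuming the standard seating rule (the customer arriving after i others opens a new table with probability alpha / (i + alpha)), the prior distribution of the number of occupied tables can be computed exactly by dynamic programming. The function name `crp_num_tables_pmf` and the parameter `alpha` are illustrative choices, not part of the thread.

```python
def crp_num_tables_pmf(n_customers, alpha):
    """Prior pmf of the number of tables K under a CRP with concentration alpha.

    Returns a list pmf where pmf[k] = P(K = k) after n_customers have been
    seated (pmf[0] is always 0). Assumes n_customers >= 1 and alpha > 0.
    """
    pmf = [0.0, 1.0]  # after one customer there is exactly one table
    for i in range(1, n_customers):
        # i customers are seated; the next one opens a new table w.p. p_new.
        p_new = alpha / (i + alpha)
        nxt = [0.0] * (i + 2)
        for k in range(1, i + 1):
            nxt[k] += pmf[k] * (1.0 - p_new)   # joins an existing table
            nxt[k + 1] += pmf[k] * p_new       # opens a new table
        pmf = nxt
    return pmf


pmf = crp_num_tables_pmf(n_customers=50, alpha=1.0)
prior_mean = sum(k * p for k, p in enumerate(pmf))
print("prior mean number of clusters:", prior_mean)
```

This is only the prior: it depends on the number of data points and on alpha, not on the data values. In a full model it is combined with a likelihood, and the resulting posterior is what expresses uncertainty about the number of clusters given the data.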
