$\begingroup$

I am struggling to understand zero inflated distributions. What are they? What's the point?

If I have data with many zeroes, I could first fit a logistic regression to calculate the probability of a zero, then remove all the zeroes and fit a regular regression using my choice of distribution (e.g., Poisson).

Then somebody told me, "Hey, use a zero-inflated distribution," but from what I have read, it does not seem to do anything different from what I suggested above. It has a regular parameter $\mu$ and another parameter $p$ to model the probability of zero. Doesn't it just do both things at the same time?

$\endgroup$
  • $\begingroup$ Why remove all the zeros? You can do both together: first calculate the probability of 0 versus 1, then use that as a weight on your Poisson distribution. That is the zero-inflated model (distribution). Read this, it is quite clear: en.wikipedia.org/wiki/Zero-inflated_model $\endgroup$
    – Deep North
    Commented May 12, 2017 at 23:34

3 Answers

$\begingroup$

fit a logistic regression first calculate the probability of zeroes, and then I could remove all the zeroes, and then fit a regular regression using my choice of distribution (poisson e.g.)

You're absolutely right. This is one way to fit a zero-inflated model (or as Achim Zeileis points out in the comments, this is strictly a "hurdle model", which one could view as a special case of a zero-inflated model).

The difference between the procedure you described and an "all-in-one" zero-inflated model is error propagation. As with other two-step procedures in statistics, the uncertainty you report for your predictions in step 2 won't account for the uncertainty in step 1, that is, whether the prediction should be 0 or not.

Sometimes this is a necessary evil. Fortunately, it's not necessary in this case. In R, you can use pscl::hurdle() or fitdistrplus::fitdist().

$\endgroup$
  • $\begingroup$ Can you explain this: "the overall uncertainty of your predictions in step 2 won't take into account the uncertainty as to whether the prediction should be 0 or not"? When you fit a ZIP (zero-inflated Poisson) model, you multiply the probability from the first part into the likelihood function of the Poisson model, so step 2 does take into account the uncertainty of the 0 or 1. $\endgroup$
    – Deep North
    Commented May 12, 2017 at 23:49
  • $\begingroup$ @DeepNorth if by "uncertainty of the 0 or 1" you mean something like $P(Y=1|X=x) = 0.51$, then that statement is itself an estimate. Being an estimate, there is some degree of uncertainty around it. What is the range of plausible values? How confident are we that $0.51$ is correct? That is the uncertainty which does not propagate in a simple two-step procedure. $\endgroup$
    Commented May 13, 2017 at 2:47
  • $\begingroup$ @ssdecontrol Usually this is not called a zero-inflated model but a hurdle model (e.g., pscl::hurdle()). And to obtain a proper fit the distribution employed for the data without zeros should be zero-truncated (or not lead to any zeros in the first place). See my reply for more details. $\endgroup$
    Commented May 21, 2017 at 19:36
$\begingroup$

The basic idea you describe is a valid approach and it is often called a hurdle model (or two-part model) rather than a zero-inflated model.

However, it is crucial that the model for the non-zero data accounts for having the zeros removed. If you fit a Poisson model to the data without zeros, this will almost certainly produce a poor fit, because the Poisson distribution always has a positive probability for zero. The natural alternative is to use a zero-truncated Poisson distribution, which is the classic approach to hurdle regression for count data.
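To make the point concrete, here is a minimal Python sketch (an illustration only, not taken from pscl or any other package). It simulates Poisson counts with an assumed rate of 1.5, drops the zeros, and compares the plain Poisson estimate (the mean of the positive counts) with the zero-truncated estimate, which solves $\bar{y} = \lambda / (1 - e^{-\lambda})$. The function names, the rate, and the sample size are all assumptions chosen for the example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def truncated_poisson_mle(mean_positive, tol=1e-10):
    """Solve lam / (1 - exp(-lam)) = mean_positive for lam by bisection.

    The left-hand side is increasing in lam and always exceeds lam,
    so the root lies in (0, mean_positive) whenever mean_positive > 1.
    """
    lo, hi = 1e-12, mean_positive
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # -expm1(-mid) equals 1 - exp(-mid), computed stably.
        if mid / -math.expm1(-mid) < mean_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


rng = random.Random(0)          # fixed seed so the run is repeatable
true_lam = 1.5                  # assumed rate for the simulation
counts = [sample_poisson(true_lam, rng) for _ in range(20000)]

# Step 2 of the two-step procedure: keep only the non-zero counts.
positives = [y for y in counts if y > 0]
mean_positive = sum(positives) / len(positives)

naive_lam = mean_positive                         # plain Poisson MLE, ignores truncation
trunc_lam = truncated_poisson_mle(mean_positive)  # zero-truncated Poisson MLE

print(f"true rate:              {true_lam}")
print(f"plain Poisson estimate: {naive_lam:.3f}")
print(f"zero-truncated estimate: {trunc_lam:.3f}")
```

On simulated data like this, the plain Poisson estimate lands well above the true rate, while the zero-truncated estimate should recover it closely.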

The main difference between zero-inflated models and hurdle models is which probability is modeled in the binary part of the regression. For hurdle models it is simply the probability of zero vs. non-zero. In zero-inflated models it is the probability of an excess zero, i.e., the probability of a zero that is not caused by the un-inflated distribution (e.g., Poisson).
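Written out for the Poisson case, the two models differ only in how the zero is handled. The notation here is an assumption for illustration: $\mu$ is the Poisson mean, $\pi$ the probability of an excess zero in the zero-inflated model, and $p_0$ the probability of a zero in the hurdle model.

```latex
% Zero-inflated Poisson: zeros come from two sources,
% the excess-zero process and the Poisson distribution itself.
\begin{align*}
P(Y = 0) &= \pi + (1 - \pi)\, e^{-\mu}, \\
P(Y = k) &= (1 - \pi)\, \frac{e^{-\mu} \mu^{k}}{k!}, \quad k = 1, 2, \dots
\end{align*}

% Hurdle Poisson: all zeros come from the binary part,
% and positive counts follow a zero-truncated Poisson.
\begin{align*}
P(Y = 0) &= p_0, \\
P(Y = k) &= (1 - p_0)\, \frac{e^{-\mu} \mu^{k}}{k!\,\bigl(1 - e^{-\mu}\bigr)}, \quad k = 1, 2, \dots
\end{align*}
```

So $p_0$ is the whole probability of observing a zero, whereas $\pi$ only covers the zeros beyond what the Poisson part already produces.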

For a discussion of both hurdle and zero-inflation models for count data in R, see our manuscript published in JSS and also shipped as a vignette to the pscl package: http://dx.doi.org/10.18637/jss.v027.i08

$\endgroup$
$\begingroup$

What ssdecontrol said is very correct. But I'd like to add a few cents to the discussion.

I just watched the lecture on Zero Inflated models for count data by Richard McElreath on YouTube.

It makes sense to estimate p while controlling for the variables that explain the rate of the pure Poisson model, especially if you consider that the chance of an observed zero originating from the Poisson distribution is not 100%.
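As a small worked example of that last point, Bayes' rule gives the share of observed zeros that come from the Poisson part. This Python sketch assumes p is the probability of an excess (structural) zero and lam is the Poisson rate; the function name and the numbers are made up for illustration.

```python
import math


def prob_zero_from_poisson(p, lam):
    """P(the zero came from the Poisson part | y = 0) in a zero-inflated Poisson.

    p   : assumed probability of an excess (structural) zero
    lam : assumed rate of the Poisson part
    """
    poisson_zero = (1 - p) * math.exp(-lam)  # zero produced by the Poisson part
    total_zero = p + poisson_zero            # any zero, from either source
    return poisson_zero / total_zero


# Hypothetical values: 20% excess zeros and a Poisson rate of 1.
share = prob_zero_from_poisson(p=0.2, lam=1.0)
print(f"share of observed zeros due to the Poisson part: {share:.3f}")
```

With these made-up values, roughly 60% of the observed zeros come from the Poisson part and the rest are excess zeros, which is why the two sources cannot be separated by simply looking at which observations are zero.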

[Figure: Zero-inflated distributions as a multilevel model]

It also makes sense when you consider the parameters of the model: you end up with two parameters to estimate, p and the rate of the Poisson model, and two equations, one for the case when the count is zero and one for the case when the count is different from zero.

Image source : Statistical Rethinking - A Bayesian Course with Examples in R and Stan by Richard McElreath


$\endgroup$
  • $\begingroup$ References to learning materials are appreciated... but how does this answer the question at hand? This looks like a comment posted as an answer... $\endgroup$
    – user35780
    Commented Feb 12, 2019 at 16:03
