Zero inflated distributions, what are they really?

Question

I am struggling to understand zero-inflated distributions. What are they? What's the point?

If I have data with many zeroes, then I could first fit a logistic regression to calculate the probability of zeroes, and then I could remove all the zeroes, and then fit a regular regression using my choice of distribution (Poisson, e.g.).

Then somebody told me "hey, use a zero-inflated distribution", but looking it up, it does not seem to do anything different from what I suggested above. It has a regular parameter $\mu$, and then another parameter $p$ to model the probability of zero? It just does both things at the same time, no?

Why do you remove all the zeros? You can do both together: you first calculate the probability of 0 and 1 and use that as a weight on your Poisson distribution. That is a zero-inflated model (distribution). Read this, it is quite clear: en.wikipedia.org/wiki/Zero-inflated_model — Deep North, Commented May 12, 2017 at 23:34

shadowtalker · Accepted Answer · 2017-05-21 23:51:35Z

16

first fit a logistic regression to calculate the probability of zeroes, and then I could remove all the zeroes, and then fit a regular regression using my choice of distribution (Poisson, e.g.)

You're absolutely right. This is one way to fit a zero-inflated model (or as Achim Zeileis points out in the comments, this is strictly a "hurdle model", which one could view as a special case of a zero-inflated model).

The difference between the procedure you described and an "all-in-one" zero-inflated model is error propagation. Like all other two-step procedures in statistics, the overall uncertainty of your predictions in step 2 won't take into account the uncertainty as to whether the prediction should be 0 or not.

Sometimes this is a necessary evil. Fortunately, it's not necessary in this case. In R, you can use pscl::hurdle() or fitdistrplus::fitdist().

edited May 21, 2017 at 23:51

answered May 12, 2017 at 23:39


Can you explain this: "the overall uncertainty of your predictions in step 2 won't take into account the uncertainty as to whether the prediction should be 0 or not"? When you fit a zero-inflated Poisson (ZIP), you multiply the probability from the first part into the likelihood function of the Poisson model, therefore step 2 will take into account the uncertainty of the 0 or 1.
– Deep North, Commented May 12, 2017 at 23:49

@DeepNorth if by "uncertainty of the 0 or 1" you mean something like $P(Y=1|X=x) = 0.51$, then that statement is itself an estimate. Being an estimate, there is some degree of uncertainty around it. What is the range of plausible values? How confident are we that $0.51$ is correct? That is the uncertainty which does not propagate in a simple two-step procedure.
– shadowtalker, Commented May 13, 2017 at 2:47

@ssdecontrol Usually this is not called a zero-inflated model but a hurdle model (e.g., pscl::hurdle()). And to obtain a proper fit, the distribution employed for the data without zeros should be zero-truncated (or not lead to any zeros in the first place). See my reply for more details.
– Achim Zeileis, Commented May 21, 2017 at 19:36

Achim Zeileis · Accepted Answer · 2017-05-21 19:34:24Z

The basic idea you describe is a valid approach and it is often called a hurdle model (or two-part model) rather than a zero-inflated model.

However, it is crucial that the model for the non-zero data accounts for having the zeros removed. If you fit a Poisson model to the data without zeros, this will almost certainly produce a poor fit, because the Poisson distribution always assigns positive probability to zero. The natural alternative is to use a zero-truncated Poisson distribution, which is the classic approach to hurdle regression for count data.
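
A minimal Python sketch of why the truncation matters, using only the standard library. The function names, the rate of 1.5, and the sample size are illustrative assumptions and have nothing to do with the pscl implementation. For a zero-truncated Poisson, the maximum-likelihood rate $\mu$ solves $\mu / (1 - e^{-\mu}) = \bar{y}_{+}$, where $\bar{y}_{+}$ is the mean of the positive counts; a plain Poisson fit would instead report $\bar{y}_{+}$ itself, which is too high.

```python
import math
import random


def sample_poisson(mu, rng):
    """Draw one Poisson(mu) variate (Knuth's method; fine for small mu)."""
    limit = math.exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def fit_zero_truncated_poisson(mean_positive, tol=1e-10):
    """Solve mu / (1 - exp(-mu)) = mean_positive for mu by bisection."""
    if mean_positive <= 1.0:
        raise ValueError("the mean of positive counts must exceed 1")
    lo, hi = 1e-12, mean_positive  # the root always lies below the mean
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid / (-math.expm1(-mid)) < mean_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)
true_mu = 1.5  # assumed rate, for illustration only
counts = [sample_poisson(true_mu, rng) for _ in range(20000)]
positives = [c for c in counts if c > 0]  # zeros removed, as in step 2

naive_mu = sum(positives) / len(positives)  # plain Poisson MLE on positives
truncated_mu = fit_zero_truncated_poisson(naive_mu)

print(f"plain Poisson fit on positives: {naive_mu:.3f}")
print(f"zero-truncated Poisson fit:     {truncated_mu:.3f}")
```

The plain fit overshoots because it has to explain data that contain no zeros with a distribution that expects some; the truncated fit should land close to the rate used to simulate the counts.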

The main difference between zero-inflated models and hurdle models is which probability is modeled in the binary part of the regression. For hurdle models it is simply the probability of zero vs. non-zero. In zero-inflated models it is the probability of an excess zero, i.e., the probability of a zero that is not caused by the un-inflated distribution (e.g., Poisson).
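
To make the distinction concrete, here is a small Python sketch of the two probability mass functions with a Poisson count part. The function names are hypothetical, and `p` and `mu` follow the symbols used in the question. In `zip_pmf`, `p` is the probability of an excess zero; in `hurdle_pmf`, it is the probability of any zero.

```python
import math


def poisson_pmf(y, mu):
    """Ordinary Poisson probability of observing the count y."""
    return math.exp(-mu) * mu**y / math.factorial(y)


def zip_pmf(y, p, mu):
    """Zero-inflated Poisson: p is the probability of an excess zero."""
    from_count_part = (1.0 - p) * poisson_pmf(y, mu)
    return p + from_count_part if y == 0 else from_count_part


def hurdle_pmf(y, p, mu):
    """Hurdle Poisson: p is the probability of any zero."""
    if y == 0:
        return p
    # Positive counts follow a zero-truncated Poisson.
    return (1.0 - p) * poisson_pmf(y, mu) / (1.0 - math.exp(-mu))


p, mu = 0.3, 2.0  # illustrative values
print("P(Y = 0), zero-inflated:", zip_pmf(0, p, mu))
print("P(Y = 0), hurdle:       ", hurdle_pmf(0, p, mu))
```

With the same `p`, the zero-inflated model puts more mass on zero, because the Poisson part contributes zeros of its own on top of the excess ones.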

For a discussion of both hurdle and zero-inflation models for count data in R, see our manuscript published in JSS and also shipped as a vignette to the pscl package: http://dx.doi.org/10.18637/jss.v027.i08

Guilherme Marthe · Accepted Answer · 2017-05-22 12:10:55Z

What ssdecontrol said is correct, but I'd like to add my two cents to the discussion.

I just watched the lecture on Zero Inflated models for count data by Richard McElreath on YouTube.

It makes sense to estimate p while controlling for the variables that explain the rate of the pure Poisson model, especially if you consider that the chance that an observed zero originated from the Poisson distribution is not 100%.

It also makes sense when you consider the parameters of the model, since you end up with two unknowns to estimate, p and the rate of the Poisson model, and two equations: the case when the count is zero and the case when the count is different from zero.

Image source: Statistical Rethinking - A Bayesian Course with Examples in R and Stan by Richard McElreath
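
For reference, the two cases can be written out for a zero-inflated Poisson. This is the standard form of the likelihood, written with the question's symbols $p$ and $\mu$ (here $\mu$ is assumed to be the Poisson rate); it is a sketch and not a reproduction of the figure:

$$\Pr(y = 0 \mid p, \mu) = p + (1 - p)\,e^{-\mu}$$

$$\Pr(y = k \mid p, \mu) = (1 - p)\,\frac{\mu^{k} e^{-\mu}}{k!}, \qquad k = 1, 2, \dots$$

A zero can arise either from the inflation process or from the Poisson process, which is why $p$ and $\mu$ have to be estimated together.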


References to learning materials are appreciated... but how does this answer the question at hand? This looks like a comment posted as an answer... — user35780, Commented Feb 12, 2019 at 16:03
