
It is claimed by many authors that if we fit a Poisson GLM to an overdispersed dataset of count data, the standard errors of the estimated coefficients will be underestimated.

Could you please explain why doing so leads to underestimated standard errors? In other words, what is the mechanism that makes the standard errors of the estimated coefficients too small? And "underestimated" in comparison with which quantity?

Thank you very much for your help!


1 Answer

Under the Poisson probability model, the variance and the mean of the distribution are equal. Whenever this condition is not met by the data, there will be a mismatch between the uncertainty (confidence intervals) suggested by the model and what the data actually show.

When characterizing count data, the term "overdispersed" applies when the data show a higher dispersion (in other words, a higher variance) than would be expected under a Poisson model. To see what this overdispersion looks like, check out this small R snippet:

library(tidyverse)

set.seed(4242)
# Draw 600 values each from a Poisson and a negative binomial, both with mean 8
df <- 
  tibble(`poisson (mean 8)` = rpois(600, 8), 
         `neg_binomial (mean 8)` = rnbinom(600, mu = 8, size = 4)) %>% 
  gather(dist, val)  # long format: distribution label in `dist`, draws in `val`

df %>% 
  ggplot(aes(val)) + 
  geom_histogram(fill = 'skyblue', color = 'black',
                 binwidth = 1) + 
  facet_wrap(~dist, ncol = 1) + 
  theme_minimal()

# Empirical mean and variance of each sample
df %>% group_by(dist) %>% 
  summarise(mean = mean(val), variance = var(val))
#> # A tibble: 2 x 3
#>   dist                   mean variance
#>   <chr>                 <dbl>    <dbl>
#> 1 neg_binomial (mean 8)  7.98    23.6 
#> 2 poisson (mean 8)       8.02     8.97

Created on 2021-09-20 by the reprex package (v2.0.1)

As you can see, compared to the negative binomial, the Poisson sample has a much lower variance, even though both have the same mean. This capacity of the negative binomial to accommodate count data with higher dispersion makes it preferable in situations of overdispersion. That is why negative binomial regression is a thing.


Now, in order to see the effect of overdispersion on the standard errors of the parameters, let's simulate data for a negative binomial regression with a binary covariate that shifts the mean by 3 units, starting from a baseline of 8.

set.seed(123)
# Binary covariate x; the response mean is 8 when x = 0 and 11 when x = 1
df <- 
  tibble(x = rbinom(600, size = 1, p = .5),
         y = rnbinom(600, mu = 8 + if_else(x == 1, 3, 0), size = 4))

df %>%
  mutate(variable = if_else(x==1, 'x = 1', 'x = 0')) %>% 
  ggplot(aes(y)) + 
  geom_histogram(fill='skyblue', color='black',
                 binwidth = 1 ) + 
  facet_wrap(~variable, ncol=1) + 
  theme_minimal()

And as we can see from the mean of each group, the mean is indeed greater when x = 1 (and so is the variance, as shown by the histograms, which is characteristic of count data):

df %>% 
  mutate(variable = if_else(x==1, 'x = 1', 'x = 0')) %>% 
  group_by(variable) %>% 
  summarise(mean = mean(y))
#> # A tibble: 2 x 2
#>   variable  mean
#>   <chr>    <dbl>
#> 1 x = 0     8.34
#> 2 x = 1    11.0
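To confirm the variance claim numerically as well, here is a quick check (the theoretical values follow from the negative binomial variance formula given at the end of this answer):

df %>% 
  mutate(variable = if_else(x == 1, 'x = 1', 'x = 0')) %>% 
  group_by(variable) %>% 
  summarise(variance = var(y))
# With mu = 8 or 11 and size = 4, the theoretical variances are
# 8 + 8^2/4 = 24 (x = 0) and 11 + 11^2/4 = 41.25 (x = 1)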

In order to show the underestimation of the standard errors, let's model this negative binomial data with a Poisson GLM, and afterwards with the correct model, a negative binomial regression:

library(tidyverse)
set.seed(123)
df <- 
  tibble(x = rbinom(600, size = 1, p = .5),
         y = rnbinom(600, mu = 8 + if_else(x == 1, 3, 0), size = 4))
glm(y~x, family = 'poisson', data = df) %>% 
    summary()
#> 
#> Call:
#> glm(formula = y ~ x, family = "poisson", data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -4.6970  -1.6597  -0.3155   0.8781   6.7168  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  2.12091    0.01967  107.83   <2e-16 ***
#> x            0.27980    0.02645   10.58   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 2157.7  on 599  degrees of freedom
#> Residual deviance: 2045.0  on 598  degrees of freedom
#> AIC: 4395.9
#> 
#> Number of Fisher Scoring iterations: 5
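
Before trusting these standard errors, a quick diagnostic is the Pearson dispersion statistic, which should be near 1 if the Poisson variance assumption holds; a minimal check:

fit_pois <- glm(y ~ x, family = 'poisson', data = df)
# Sum of squared Pearson residuals over residual degrees of freedom;
# values well above 1 flag overdispersion (here it should come out around 3)
sum(residuals(fit_pois, type = 'pearson')^2) / fit_pois$df.residual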

MASS::glm.nb(y~x, data = df) %>% summary
#> 
#> Call:
#> MASS::glm.nb(formula = y ~ x, data = df, init.theta = 3.89401969, 
#>     link = log)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.2348  -0.9097  -0.1631   0.4797   2.8902  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  2.12091    0.03486  60.840  < 2e-16 ***
#> x            0.27980    0.04913   5.696 1.23e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for Negative Binomial(3.894) family taken to be 1)
#> 
#>     Null deviance: 657.73  on 599  degrees of freedom
#> Residual deviance: 625.25  on 598  degrees of freedom
#> AIC: 3686.3
#> 
#> Number of Fisher Scoring iterations: 1
#> 
#> 
#>               Theta:  3.894 
#>           Std. Err.:  0.319 
#> 
#>  2 x log-likelihood:  -3680.255

As you can see, the point estimates of the two models agree. Where they don't agree is in the standard errors: the Poisson model reports a much smaller value (0.026 vs. 0.049 for x), and since we simulated the data ourselves we know that the Poisson one is wrong. The reason is that GLM standard errors are model-based: they come from the Fisher information, which is computed under the model's assumed variance function. The Poisson family assumes $\text{Var}(Y_i) = \mu_i$, so when the true variance is larger the assumed information is too large, and the reported standard errors are too small relative to the actual sampling variability of the estimator. That is the quantity they are underestimated against: the true standard deviation of the estimates across repeated samples.
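
We can check this directly with a small replication sketch (the seed and the number of replications below are my own choices): simulate many datasets from the same process, and compare the spread of the estimates across simulations with the standard error the Poisson model reports.

library(tidyverse)

set.seed(321)
# For each of 500 simulated datasets, refit the Poisson GLM and record
# the estimate for x and the standard error the model reports for it
sims <- map_dfr(1:500, function(i) {
  x <- rbinom(600, size = 1, p = .5)
  y <- rnbinom(600, mu = 8 + if_else(x == 1, 3, 0), size = 4)
  fit <- glm(y ~ x, family = 'poisson')
  tibble(estimate    = coef(fit)['x'],
         reported_se = summary(fit)$coefficients['x', 'Std. Error'])
})

sims %>% 
  summarise(empirical_sd     = sd(estimate),      # true sampling variability
            mean_reported_se = mean(reported_se)) # what the Poisson model claims

The empirical standard deviation of the estimates should land near the negative binomial model's standard error (~0.05), while the Poisson model keeps reporting roughly half of that.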


Usually, this underestimation of standard errors is stated as: if there is actual overdispersion in our data-generating process, then Poisson regression will underestimate the uncertainty. One of the distributions that we know has overdispersion is the negative binomial, and that is exactly where we found the underestimation above.

First of all, the parametrization of the negative binomial usually used for negative binomial regression arises from the gamma-Poisson mixture interpretation, and you can learn more about it here. Under this parametrization, if $Y_i \sim NB(\mu_i, \phi)$ then:

$$ \text{E}(Y_i) = \mu_i \\ \text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\phi} $$

Comparing that with the Poisson case, if $Y_i \sim \text{Poisson}(\mu_i)$, we have:

$$ \text{E}(Y_i) = \mu_i \\ \text{Var}(Y_i) = \mu_i $$
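As a quick sanity check against the first simulation (where $\mu = 8$ and $\phi = \texttt{size} = 4$): the theoretical negative binomial variance is $8 + 8^2/4 = 24$, close to the empirical $23.6$ in the table above, while the Poisson variance would be just $8$.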

This should start giving you hints of why the standard errors differ: the variance function the model assumes enters directly into the Fisher information, and hence into the reported standard errors, and the Poisson variance understates the true one here by a factor of $1 + \mu_i/\phi$.
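
One way to see this mechanism in action within the GLM framework is a quasi-Poisson fit, which keeps the Poisson mean structure but estimates a dispersion factor from the Pearson residuals and scales the standard errors by its square root; a minimal sketch on the same simulated df:

glm(y ~ x, family = quasipoisson, data = df) %>% 
  summary()
# The point estimates are identical to the Poisson fit, but the standard
# errors are inflated by sqrt(dispersion), landing close to the negative
# binomial ones (~0.05 for x)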

  • Hi, thank you very much for your answer. However, I think that it does not answer the question "why are the standard errors of the coefficients underestimated when we fit a Poisson GLM to overdispersed data". Could you kindly add a section to your answer that addresses that question? Commented Sep 20, 2021 at 15:57
  • Hi, thank you so much for the answer. However, it seems to me that your reasoning is based strictly on empirical evidence, which does not totally satisfy me. Could you kindly add some theoretical justification, please? It is nice to have your graphs, but I think it would be much more persuasive if we talked about the theory. Thank you very much for your help! Commented Sep 20, 2021 at 17:01
  • +1, however I think the last sentence regarding AIC is incorrect. AIC here is computed by dropping irrelevant constants specific to the likelihood function, but these constants will differ if the likelihood function or link function used in the GLM differs. AIC as output by the GLM summary can only compare models of the same family fit on the same data. Commented Sep 20, 2021 at 17:19
  • I don't have much time right now to write out the derivation of the standard errors for the parameter estimates. But I can later, and it will prove this for a small specific model. The results are connected to the difference between the variances of the two distributions. Commented Sep 20, 2021 at 18:22
