
I am trying to look at how the likelihood of event X occurring is affected by three factors: A (5 levels), B (2 levels) and C (3 levels). To do this, I have run a binomial GLM followed by a type 3 ANOVA to determine the significance of each term in the model, as follows:

type3.Li.Full <- list(A = contr.sum, B = contr.sum, C = contr.sum)
model.lik <- glm(X ~ A*B*C, family = binomial(link = "logit"), data = likelihood, contrasts = type3.Li.Full, method = brglmFit)
summary(model.lik)

Call:
glm(formula = X ~ A * B * C, 
    family = binomial(link = "logit"), data = likelihood, method = brglmFit, 
    contrasts = type3.Li.Full)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2534  -1.0731   0.5415   0.9738   1.6651  

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         0.54796    0.16566   3.308 0.000940 ***
A1                                  0.64720    0.34125   1.897 0.057886 .  
A2                                  0.70109    0.29486   2.378 0.017420 *  
A3                                  0.75656    0.35739   2.117 0.034267 *  
A4                                 -0.33962    0.21995  -1.544 0.122572    
B1                                  0.38930    0.16566   2.350 0.018771 *  
C1                                  0.62192    0.16566   3.754 0.000174 ***
A1:B1                               0.18165    0.34125   0.532 0.594520    
A2:B1                              -0.58628    0.29486  -1.988 0.046771 *  
A3:B1                               0.39197    0.35739   1.097 0.272747    
A4:B1                              -0.09776    0.21995  -0.444 0.656725    
A1:C1                               0.12427    0.34125   0.364 0.715740    
A2:C1                               0.11557    0.29486   0.392 0.695086    
A3:C1                               0.04053    0.35739   0.113 0.909714    
A4:C1                              -0.70748    0.21995  -3.216 0.001298 ** 
B1:C1                              -0.13459    0.16566  -0.812 0.416535    
A1:B1:C1                           -0.53189    0.34125  -1.559 0.119081    
A2:B1:C1                           -0.13863    0.29486  -0.470 0.638249    
A3:B1:C1                            0.82033    0.35739   2.295 0.021713 *  
A4:B1:C1                            0.38356    0.21995   1.744 0.081193 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 583.11  on 435  degrees of freedom
Residual deviance: 486.65  on 416  degrees of freedom
AIC:  526.65

Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
Number of Fisher Scoring iterations: 7

library(car)
Anova(model.lik, type = 3) 

Analysis of Deviance Table (Type III tests)

Response: X
                                LR Chisq Df Pr(>Chisq)    
A                                 44.904  4  4.164e-09 ***
B                                  5.810  1   0.015939 *  
C                                 20.969  1  4.667e-06 ***
A:B                                6.359  4   0.173862    
A:C                               13.550  4   0.008877 ** 
B:C                               -2.479  1   1.000000    
A:B:C                             13.371  4   0.009597 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

According to the ANOVA table, the chi-squared value for the interaction between B and C is negative. However, I was under the impression that chi-squared values should be positive. Would anyone be able to explain why this error may have occurred?

Is this a result of my use of penalized maximum likelihood (method = brglmFit)? I used this because complete separation was detected in my original model, as follows (I believe that infinite values in the output indicate the presence of complete separation):

model.lik.orig <- glm(X ~ A*B*C, family = binomial(link = "logit"), data = likelihood, contrasts = type3.Li.Full)
library("detectseparation")
update(model.lik.orig, method = "detect_separation")

Implementation: ROI | Solver: lpsolve 
Separation: TRUE 
Existence of maximum likelihood estimates
                       (Intercept)                                 A1 
                              -Inf                                Inf 
                                A2                                 A3 
                               Inf                                Inf 
                                A4                                 B1 
                               Inf                                Inf 
                                C1                              A1:B1 
                               Inf                               -Inf 
                             A2:B1                              A3:B1 
                              -Inf                               -Inf 
                             A4:B1                              A1:C1 
                              -Inf                               -Inf 
                             A2:C1                              A3:C1 
                              -Inf                               -Inf 
                             A4:C1                              B1:C1 
                              -Inf                               -Inf 
                          A1:B1:C1                           A2:B1:C1 
                               Inf                                Inf 
                          A3:B1:C1                           A4:B1:C1 
                               Inf                                Inf 
0: finite value, Inf: infinity, -Inf: -infinity
  • Do you observe the same situation if you use the default fitting method, rather than method = brglmFit, which uses penalized maximum likelihood to do the estimation?
    – dipetkov
    Commented Apr 24 at 11:57
  • @dipetkov If I were not to use brglmFit, the value for the B:C interaction would be 0. I think this is because there is complete separation for some combinations of the levels of my three factors; update(model.lik, method = "detect_separation") in the detectseparation package shows this to be so.
    Commented Apr 24 at 12:11
  • Thanks for the details. These might be relevant, so I encourage you to edit your question with this extra information.
    – dipetkov
    Commented Apr 24 at 12:20
2 Answers

The statistic being reported is $-2\cdot\ln(L_0/L_1)$, where $L_0$ and $L_1$ are the likelihoods of the two models being compared. For type I sums of squares these would be the models without and with the parameter of interest, respectively; for type III it's a bit more involved, but the idea remains the same. As an aside, I suspect this is something that might only occur with type III SS.

Observing $L_0/L_1>1$ means $\ln(L_0/L_1)>0$, and hence a negative LR statistic; it implies that the model including B:C has a lower likelihood -- explains the data less well -- than one that doesn't.
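A small base-R simulation illustrates the point (toy data invented here, not the original likelihood dataset): with plain maximum likelihood, the full model fits at least as well as any nested reduced model, so the LR statistic is non-negative by construction.

```r
# With plain ML, the full model's maximized log-likelihood is at least that of
# any nested reduced model, so -2*ln(L0/L1) cannot be negative.
set.seed(1)
d <- data.frame(y = rbinom(100, 1, 0.5), a = rnorm(100), b = rnorm(100))
full    <- glm(y ~ a * b, family = binomial(), data = d)
reduced <- glm(y ~ a + b, family = binomial(), data = d)
lr <- as.numeric(-2 * (logLik(reduced) - logLik(full)))
lr >= 0  # TRUE for ML fits; a penalized fit need not satisfy this
```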

It's true that a theoretical $\chi^2$ variable cannot be negative, but an observed likelihood ratio statistic can take such a value in edge cases: the LR statistic only follows (or is assumed to follow) a $\chi^2$ distribution asymptotically. The density of a $\chi^2$ variable is zero for negative values, so a negative statistic results in a P-value of 1. Or, to say it with the R code that car::Anova effectively runs:

pchisq(-2.479, 1, lower.tail = FALSE)
#> 1
  • The OP uses penalized maximum likelihood to deal with a complete separation issue. Not sure if this is the case, but I suspect this matters for how the results are to be interpreted.
    – dipetkov
    Commented Apr 24 at 12:19
  • Having thought more about it, I'm not sure this has to do with asymptotics. Maximum likelihood (and iteratively weighted least squares) is an estimation method. As long as the models are nested and there is no penalty and/or constraints, likelihood(full model) should be at least as good as likelihood(reduced model).
    – dipetkov
    Commented May 25 at 18:16

"Is this as a result of my use of a penalized maximum likelihood (method = brglmFit)?"

Yes.

This needs a bit of theory. {brglm2::brglmFit} uses a bias reduction method to fit generalized linear models. The mathematical details of this adjustment strategy are intimidating; see Kosmidis et al. (2020). For this discussion it may be sufficient to note two facts:

  • The adjustment is a function of the design matrix $X$. Changing $X$ — for example by omitting terms from the model — changes the adjustment.
  • Since the method adjusts the estimates $\hat{\beta}$ of the regression coefficients, the bias-reduction (BR) estimates are not equal to the maximum likelihood (ML) estimates.

The BR estimates and the ML estimates tend to be markedly different in the case of complete separation — the situation you are facing — where ML estimates can be infinite while bias reduction produces finite estimates and valid inferences for $\beta$.
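The divergence of ML estimates under complete separation can be reproduced in base R with a toy dataset (invented here for illustration; no packages required). The fitted slope is not truly infinite only because glm stops iterating:

```r
# Toy data with complete separation: x > 0 perfectly predicts y = 1.
x <- c(-2, -1, -0.5, 0.5, 1, 2)
y <- c(0, 0, 0, 1, 1, 1)
# glm warns about fitted probabilities of 0 or 1; the slope estimate is a
# very large finite number only because iteration stops, while the true ML
# estimate is +Inf.
fit <- suppressWarnings(glm(y ~ x, family = binomial()))
unname(coef(fit)["x"])
```

A bias-reducing fit (for example with brglm2 or logistf) would instead return a finite, moderate slope for the same data.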

Why can this explain the negative value in the type III ANOVA table? As @PBulls points out, each likelihood ratio (LR) statistic in the table is a difference of two log likelihoods: $-2(\ln L_0 - \ln L_1)$ where $L_1$ is the likelihood of the full model and $L_0$ is the likelihood of a reduced model with one term (either a main effect or an interaction) dropped.

With maximum likelihood estimation, the full model has equal or higher likelihood than the reduced model, so the LR statistic is non-negative. With bias-reduction adjustment (and penalized maximum likelihood methods in general), the full model can have a lower likelihood than the reduced model depending on how the adjustment (penalty) is applied. Consider this: Since the likelihood adjustment depends on the design matrix and the full and reduced models have different design matrices by construction, we might be adjusting the two likelihoods differently, making them not directly comparable.
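The key fact behind this argument can be checked directly in base R (toy data invented here): evaluating a model's likelihood at any estimates other than the MLE, such as adjusted ones, increases the deviance, so an adjusted full model can end up with a larger deviance than an unadjusted reduced model.

```r
# The MLE minimizes the deviance over the model's coefficient space, so any
# perturbed ("adjusted") estimates give a larger deviance.
set.seed(3)
d <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
fit <- glm(y ~ x, family = binomial(), data = d)
eta_ml  <- predict(fit, type = "link")  # linear predictor at the MLE
eta_adj <- eta_ml + 0.3                 # perturbation mimicking an adjustment
# -2 * binomial log-likelihood (up to the saturated-model constant)
dev <- function(eta) -2 * sum(d$y * eta - log(1 + exp(eta)))
dev(eta_adj) > dev(eta_ml)  # TRUE: moving away from the MLE raises deviance
```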

This is what happens in this example, due in part to using an ANOVA function that doesn't know how to handle a model fitted with the brglmFit method.

Let's show this with an example. I've generated a fake dataset (with factors A, B, C and binary outcome Y) that exhibits complete separation.

Here is the type III ANOVA reported by {car::Anova} for the model fitted with brglmFit and bias reduction adjustment.

Anova(fit, type = 3)
#> Analysis of Deviance Table (Type III tests)
#> 
#> Response: Y
#>       LR Chisq Df Pr(>Chisq)    
#> A     16.05273  4  0.0029492 ** 
#> B      4.19371  1  0.0405743 *  
#> C      3.79310  1  0.0514643 .  
#> A:B    3.36712  4  0.4983678    
#> A:C   27.89279  4 1.3113e-05 ***
#> B:C   -1.01833  1  1.0000000    
#> A:B:C  2.50766  4  0.6432640    

Alternatively, we can obtain the type III ANOVA with the standard R function {stats::drop1}, which makes the choice to lower-bound the LR statistics and report 0 for the B:C term.

drop1(fit, .~., test = "Chisq")
#> Single term deletions
#> 
#> Model:
#> Y ~ A * B * C
#>        Df Deviance     AIC      LRT   Pr(>Chi)    
#> <none>     554.459 594.459                        
#> A       4  570.511 602.511 16.05273  0.0029492 ** 
#> B       1  558.652 596.652  4.19371  0.0405743 *  
#> C       1  558.252 596.252  3.79310  0.0514643 .  
#> A:B     4  557.826 589.826  3.36712  0.4983678    
#> A:C     4  582.351 614.351 27.89279 1.3113e-05 ***
#> B:C     1  553.440 591.440  0.00000  1.0000000    
#> A:B:C   4  556.966 588.966  2.50766  0.6432640   

The Deviance column gives the sum of squared deviance residuals (deviance = -2 × log-likelihood, up to an additive constant). You can check for yourself that LRT = Deviance(reduced model) - Deviance(full model), and that the model without the B:C interaction has a smaller deviance than the full model (the <none> row, with no terms dropped).
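The deviance arithmetic is easy to verify on a small ML fit (toy data invented here, base R only):

```r
# For nested ML fits, the LRT for dropping a term equals the difference in
# residual deviances between the reduced and full models.
set.seed(2)
d <- data.frame(y = rbinom(200, 1, 0.5), x = rnorm(200))
full <- glm(y ~ x, family = binomial(), data = d)
red  <- glm(y ~ 1, family = binomial(), data = d)
lrt <- deviance(red) - deviance(full)
# The same value appears in the analysis-of-deviance table for the x term.
all.equal(lrt, anova(full, test = "Chisq")["x", "Deviance"])  # TRUE
```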

And here is the type III ANOVA for the unadjusted model fitted with the default method, iteratively weighted least squares.

drop1(fit_mle, .~., test = "Chisq")
#> Single term deletions
#> 
#> Model:
#> Y ~ A * B * C
#>        Df Deviance     AIC      LRT   Pr(>Chi)    
#> <none>     553.365 593.365                        
#> A       4  570.511 602.511 17.14658  0.0018102 ** 
#> B       1  558.652 596.652  5.28756  0.0214783 *  
#> C       1  558.252 596.252  4.88695  0.0270605 *  
#> A:B     4  557.826 589.826  4.46097  0.3472001    
#> A:C     4  582.351 614.351 28.98664 7.8664e-06 ***
#> B:C     1  553.440 591.440  0.07552  0.7834659    
#> A:B:C   4  556.966 588.966  3.60151  0.4626117    

Comparing the ANOVAs with and without bias reduction reveals that only the full model is adjusted. This doesn't seem like a correct way to compare the full and reduced models; the negative $\chi^2$ statistic is a symptom of the problem. How do we deal with it?

One option is to use the maximum likelihood fit. Separation poses a challenge for ML estimation of the regression coefficients and their standard errors, but the likelihood ratio tests are valid. Update: This is the recommendation by @FrankHarrell, written in response to the follow-up question, What is the difference between using logistf and brglm2 when dealing with complete separation in a logistic regression?

Another option is to use an ANOVA function designed for penalized logistic regression. The {logistf} package implements Firth's bias-reduced logistic regression as well as dedicated anova.logistf and drop1.logistf methods. Here is the logistf ANOVA for the same fake data:

drop1(fit_firth, test = "Chisq")
#>            ChiSq df      P-value
#> A     14.0865189  4 7.023982e-03
#> B      3.7387676  1 5.316364e-02
#> C      3.3839266  1 6.583503e-02
#> A:B    3.0051851  4 5.569581e-01
#> A:C   25.2351577  4 4.511854e-05
#> B:C    0.6169475  1 4.321839e-01
#> A:B:C  2.1597337  4 7.064075e-01

I also found this explanation in the {logistf::anova} documentation helpful:

Comparing models fitted by penalized methods, one must consider that the penalized likelihoods are not directly comparable, since a penalty is involved. Or in other words, inserting zero for some regression coefficients will not lead to the same penalized likelihood as if the corresponding variables are simply "unknown" to a model. The anova method takes care that the same penalty is used for two hierarchically nested models.

References

I. Kosmidis, E. C. Kenne Pagui, and N. Sartori. Mean and median bias reduction in generalized linear models. Statistics and Computing, 30(1):43–59, 2020.
Using Chi-squared test when perfect separation in logistic model?
Likelihood ratio vs. score vs. Wald test: Different p values, which to use?
Why does logistic regression become unstable when classes are well-separated?


The R code to reproduce the analysis:

library("brglm2")
library("logistf")

set.seed(1234)

# C has one unobserved level; this is not what causes the complete separation.
# Treat C as a two-level factor so that the sum contrast is applied correctly
# when constructing the design matrices.

A <- factor(1:5, levels = 1:5)
B <- factor(1:2, levels = 1:2)
C <- factor(1:2, levels = 1:2)

# Use sum contrasts to compute the type 3 anova correctly.
contrasts(A) <- contr.sum(5)
contrasts(B) <- contr.sum(2)
contrasts(C) <- contr.sum(2)

n <- 436

dat <- data.frame(
  Y = sample(0:1, n, replace = TRUE),
  A = sample(A, n, replace = TRUE),
  B = sample(B, n, replace = TRUE),
  C = sample(C, n, replace = TRUE)
)
dat$Y <- with(
  dat,
  ifelse(A == "5" & C == "2", rbinom(n, 1, 0.9), Y)
)

# maximum likelihood with IWLS
fit_mle <- glm(
  Y ~ A * B * C,
  family = binomial(),
  data = dat
)
# bias-reducing score adjustment
fit_brglm <- glm(
  Y ~ A * B * C,
  family = binomial(),
  data = dat,
  method = brglmFit
)
# Firth's penalized likelihood
fit_firth <- logistf(
  Y ~ A * B * C,
  data = dat
)

drop1(fit_brglm, . ~ ., test = "Chisq")
drop1(fit_mle, . ~ ., test = "Chisq")
drop1(fit_firth, test = "Chisq")
  • The Firth regression penalty is easier to describe than the bias-reducing adjustment of $\beta$. For example: Seeking a Theoretical Understanding of Firth Logistic Regression, Seeking to understand using the Firth correction, Calculation of p-values in logistf package. In both cases the Fisher information matrix is involved, which is how the design matrix enters the adjustment / penalty.
    – dipetkov
    Commented May 26 at 8:45
  • How does the type-III ANOVA get incorporated into the anova.logistf function? I would like to obtain p-values for each term in the model using a type-III ANOVA.
    Commented Jun 3 at 15:44
  • Why do you insist on the type III ANOVA? You just got one more answer (by @FrankHarrell to your follow-up question) explaining why the type III ANOVA -- probably -- doesn't make sense.
    – dipetkov
    Commented Jun 3 at 15:49
  • That being said, it seems to be your choice. drop1 effectively does type III ANOVA. So using drop1.logistf you get type III for a logistf fit.
    – dipetkov
    Commented Jun 3 at 15:51
  • I apologize for my lack of understanding. I was advised to run a type-III ANOVA given my data is unbalanced, and my main focus is on the 3-way interaction (I had read that type-III ANOVA is better when you have significant interactions in your model).
    Commented Jun 3 at 15:54
