
I have a dataset of a few million observations of a binary response with a low "success" probability of, on average, 1% to 2%. The dataset encompasses several categorical (~20, some with up to 50 categories) and numerical (~10) variables. I fitted a main-effects logistic Generalised Linear Model (GLM) as a baseline and a Gradient Boosted Tree (GBT). The GBT clearly outperforms the GLM as measured by log loss on a test set. Yet both models seem quite similar when comparing only their marginal effects (judged by eyeballing suitable plots). So one potential reason for the outperformance of the GBT over the GLM may be the inclusion of interactions. I would like to verify this and ideally find (some of) those interactions.

My questions

  1. What are possible ways to find interactions in the GBT model?
  2. More generally: Where can I find information on the state of the art with respect to finding interactions, and on what is currently doable and what is not?

My goals are quite pragmatic:

  • I don't need to find ALL interactions; a few "important" ones would be a great start.
  • I do not need to do hypothesis tests on the findings.
  • But methods need to be implementable given "standard" computational resources.

My attempts so far

  • Given the size of the dataset and the number of inputs, any exhaustive search method such as stepwise regression seems futile.
  • The same problem applies to selection by regularization such as the lasso, in particular because sparse design matrices are not possible due to the numerical inputs.
  • I am aware of, but have not yet tried, Friedman's H-statistic. The problems I see there are that it is based on variance decomposition rather than log loss, that it is a kind of exhaustive search and doable (at best?) only for pairwise interactions, and that its estimates are based on permutations while some of the inputs show strong dependence.
  • The dataset is complex, and there is no a priori reason why interactions should be limited to pairs. My success at "guessing" interactions based on my general domain knowledge and verifying them by inclusion in the GLM has been limited.
  • $\begingroup$ The better performance of the tree based method could be due to nonlinear relationships in the numeric features also, right? $\endgroup$
    – N Brouwer
    Commented Apr 22 at 0:50
  • $\begingroup$ Yes, potentially all types: between categoricals, between numericals and also mixed interactions between categorical and numerical. $\endgroup$
    – g g
    Commented Apr 22 at 7:22

2 Answers


From what I understand, there aren't that many variables compared to the number of observations, and the sheer number of observations can be burdensome for many common approaches. The goal is to actually find the interactions. Keep in mind that finding three- or even four-way interactions relies heavily on the number of instances of the minority class, since:

  • detecting interactions requires a lot more data than estimating the main effects alone. See this SO answer about this.
  • models with a binary response have a suggested maximum number of variables that depends on the sample size of the minority class. See this SO answer.

With all that said, this is how I would approach the problem. It isn't rigorous in and of itself, but it's principled.

Shallow trees

There is some background on using decision trees to find interactions, such as CHAID trees. I wouldn't go after an actual $\chi^2$-computing algorithm, since those tend to be slow. I would:

  1. maybe lump infrequent categories into "other" just for stability
  2. divide my sample into a few sets, maybe preserving class proportion, maybe not, depending on the results
  3. fit a shallow decision tree to each set: grow it all the way down and then prune it back
  4. compare the most common leaves and see if there is any pattern emerging

I'd be looking for common variables that end up together, common ranges of continuous variables, etc. This would hint at the variables I should test in an actual logistic regression model. Remember that a leaf at the end is just presenting you with an elaborate indicator variable, i.e. in this leaf live the observations that had $X_1 > k_1 \text{ and } X_2 == 1 \text{ and } X_3 > k_3$. This is just describing an interaction between those three variables.
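For illustration, here is a minimal sketch of that idea with scikit-learn. It assumes X and y are NumPy arrays (categoricals already numerically encoded) and feature_names is a list of column names; all names are placeholders, not part of your data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def leaf_rules(tree, feature_names):
    """Map each leaf node id to a rule like 'X1 <= k1 and X2 > k2'."""
    t = tree.tree_
    rules = {}

    def recurse(node, conditions):
        if t.children_left[node] == -1:                 # leaf node
            rules[node] = " and ".join(conditions) or "(root)"
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        recurse(t.children_left[node],  conditions + [f"{name} <= {thr:.3g}"])
        recurse(t.children_right[node], conditions + [f"{name} > {thr:.3g}"])

    recurse(0, [])
    return rules

# Fit a shallow tree on each of a few disjoint, stratified subsamples and
# compare the leaves (= elaborate indicator variables) that emerge.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (_, idx) in enumerate(skf.split(X, y)):       # idx = one quarter of the data
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2000,
                                  class_weight="balanced").fit(X[idx], y[idx])
    leaves = tree.apply(X[idx])                         # leaf id per observation
    print(f"--- subsample {fold} ---")
    for node, rule in leaf_rules(tree, feature_names).items():
        mask = leaves == node
        print(f"n={mask.sum():>7}  pos_rate={y[idx][mask].mean():.4f}  IF {rule}")
```

Leaves built from the same two or three variables that keep reappearing across the subsamples are the interaction candidates worth trying in the GLM.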

Grouped lasso

OK, I see your angle, but the LASSO was literally built to help us find sparse effects, meaning I have a bunch of potential variables in the model and I want just a few to be included. In this specific case, I would work with the group lasso: penalize two-way interactions, penalize three-way interactions even more strongly, and leave the main effects without regularization (this is why you need the group lasso). Pick the regularization hyperparameters conservatively so that you can be more confident that the interactions selected by the optimization are not noise.

Again, I'd split my sample and compare the results across the splits.
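As a rough sketch of how this could look, assuming the third-party group-lasso package (pip install group-lasso), whose documentation states that columns with a negative group index are left unregularized; df, y and all column names below are illustrative, and the interface should be checked against the package version you actually install.

```python
import itertools
import numpy as np
import pandas as pd
from group_lasso import LogisticGroupLasso        # third-party: pip install group-lasso

# df: DataFrame of raw inputs, y: 0/1 NumPy array (both assumed to exist).
# Main effects: one-hot encoded and marked with group -1 so they stay unpenalized.
X_main = pd.get_dummies(df, drop_first=True).astype(float)
blocks, groups = [X_main], [-1] * X_main.shape[1]

# One penalized group per candidate two-way interaction between categorical inputs
# (for numeric-by-categorical pairs, multiply the numeric column into the dummies instead).
cand = ["region", "product", "channel"]           # illustrative column names
for g, (a, b) in enumerate(itertools.combinations(cand, 2)):
    inter = pd.get_dummies(df[a].astype(str) + "_" + df[b].astype(str),
                           prefix=f"{a}x{b}", drop_first=True).astype(float)
    blocks.append(inter)
    groups += [g] * inter.shape[1]

X = pd.concat(blocks, axis=1)
X = (X - X.mean()) / X.std().replace(0, 1)        # comparable penalty across columns

model = LogisticGroupLasso(groups=np.array(groups), group_reg=0.05, l1_reg=0.0,
                           supress_warning=True)  # pick group_reg conservatively
model.fit(X.values, y)

# Interaction groups whose coefficients survive the penalty are the candidates to inspect.
coef = np.asarray(model.coef_)
for g, (a, b) in enumerate(itertools.combinations(cand, 2)):
    if np.abs(coef[np.array(groups) == g]).max() > 1e-8:
        print(f"interaction kept: {a} x {b}")
```

Three-way terms could be added as additional groups; penalizing them more strongly would need per-group penalty weights, or a separate and more heavily regularized fit, depending on what the package supports.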

WOE encoding or other target encoding transformation

The idea is to turn every categorical variable into a numeric one so that you only have to study numeric interactions: replace each category with a number that is a function of the response prevalence in that category (such as the proportion or the log-odds). To avoid spurious findings, I'd add noise to those variables. Read more about it here, here or here. Again, treat the statistically significant terms just as a hint as to which categorical feature interacts with what, be it other categorical or other numerical features. Lasso regularization can be helpful here as well.

The same idea applies: divide the dataset and see what consistently comes up.
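Here is a compact sketch of the encoding idea with a hand-rolled, smoothed WOE plus noise, followed by an ordinary logistic regression with product interactions via statsmodels; df, y and the column names (region, product, amount) are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def woe_encode(cat, y, noise_sd=0.05):
    """Weight-of-evidence encoding with mild smoothing plus noise against overfitting."""
    tmp = pd.DataFrame({"cat": cat, "y": y})
    stats = tmp.groupby("cat")["y"].agg(["sum", "count"])
    pos = stats["sum"] + 0.5
    neg = stats["count"] - stats["sum"] + 0.5
    woe = np.log((pos / pos.sum()) / (neg / neg.sum()))
    woe = woe + rng.normal(0.0, noise_sd, size=len(woe))
    return cat.map(woe).astype(float)

# df: DataFrame with categorical "region", "product" and numeric "amount"; y: 0/1 array.
# Ideally, compute the encoding on one data split and fit the model on another.
enc = df.assign(region_woe=woe_encode(df["region"], y),
                product_woe=woe_encode(df["product"], y),
                y=y)

# Everything is numeric now, so interactions are plain products; the ':' terms are the candidates.
fit = smf.logit("y ~ region_woe + product_woe + amount"
                " + region_woe:product_woe + region_woe:amount", data=enc).fit()
print(fit.summary())
```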

Finally, don't look into higher-order interactions. Even focusing on three-way interactions is pushing it: with a minority class of about 1% out of a few million observations, anything you find at that depth is still quite likely to be noise.

Conclusion

The boosted tree is already doing a lot of this heavy work for you, but it hides it inside the black box that is the swarm of trees it combines. I'm just suggesting a few ideas for exploring the interactions more closely. Do compare any of the results with the feature importances gathered from the GBT model to confirm the interaction.

Finally, all of these approaches may help you find the interactions that show up consistently across the data splits and help you sort them out. I would still check the benefit of adding them to the final model, be it through cross-validation or through more statistically sound methods such as likelihood ratio tests. However, I wouldn't expect the GLM to outperform the GBT, since the GBT is literally searching over interactions, and this is very powerful for binary outcomes.


What are possible ways to find interactions in the GBT model?

The extreme gradient boosting algorithm already has interaction search built in: every time a tree splits below an earlier split, that split interacts with the upstream branches. To illustrate the interactions discovered, you can make plots of predictions; see the tutorial at https://marginaleffects.com/vignettes/machine_learning.html. You can also make a table of predicted probabilities for typical groups. Because the linear predictor is connected to the binary outcome through a link function, typically logit but possibly probit, nonlinear and interaction effects should be assessed on the logit or probit scale, not on the probability scale, which is already nonlinear. Each time, compare four groups defined by two variables, as illustrated in Frank Harrell's interpretation of interaction in regression results.
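As one way to make such plots with scikit-learn, assuming the GBT is an sklearn-style booster gbt fitted on a DataFrame X (the feature names here are illustrative), a two-way partial dependence plot on the decision-function (log-odds) scale looks like this:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# gbt: a fitted sklearn gradient-boosting classifier (e.g. HistGradientBoostingClassifier)
# trained on the DataFrame X; "age" and "amount" are illustrative column names.
fig, ax = plt.subplots(figsize=(6, 5))
PartialDependenceDisplay.from_estimator(
    gbt, X,
    features=[("age", "amount")],             # a pair gives a two-way partial dependence
    response_method="decision_function",      # log-odds scale, not the probability scale
    grid_resolution=20,
    ax=ax,
)
plt.show()
```

If the slices of the surface along one feature look the same, only shifted, at different values of the other feature, the two act additively on the log-odds scale; slices that change shape indicate an interaction.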

More generally: Where can I find information on the state of the art with respect to finding interactions, and on what is currently doable and what is not?

For a binary outcome, hypothesis testing and variable selection results depend strongly on which predictors the model already includes prior to selection. Thus, an interaction may be significant if some main effects are dropped and nonsignificant if all main effects are retained. Every time the model specification changes, all coefficients are rescaled, so coefficient sizes under different specifications are not directly comparable, and an interaction that seems important in one specification may not be important in another. See Williams, R., & Jorgensen, A. (2023). Comparing logit & probit coefficients between nested models. Social Science Research, 109, 102802. https://doi.org/10.1016/j.ssresearch.2022.102802.

I don't need to find ALL interactions; a few "important" ones would be a great start.

There are many ways to assess variable importance for binary responses. It can be about coefficient size, variable significance (z or Wald statistic), log-likelihood contribution, marginal effects on the probability, differences in ROC AUC, PR AUC, lift, gain, ... If you are doing business intelligence, maybe none of these truly matter. You may want to make a four-cell table of the relative cost or benefit of true positive, false positive, true negative, and false negative predictions and associate the consequences of incorporating interaction effects with such financial indicators. Estimating interactions accurately for binary responses also requires a very large sample size. See Likelihood ratio vs. score vs. Wald test: Different p values, which to use? and https://stats.stackexchange.com/a/641263/284766. Your effective sample size is roughly 3M × 1% × 99% ≈ 30k, and it may not be enough to test certain interactions.

I do not need to do hypothesis tests on the findings.

Although you did not intend to test any hypothesis, deciding whether an interaction effect is present is a hypothesis test. Therefore, the usual assumptions and caveats of null-hypothesis testing apply. See Heinze, G., & Dunkler, D. (2017). Five myths about variable selection. Transplant International, 30(1), 6–10. https://doi.org/10.1111/tri.12895 and Heinze, G., Wallisch, C., & Dunkler, D. (2018). Variable selection: A review and recommendations for the practicing statistician. Biometrical Journal, 60(3), 431–449. https://doi.org/10.1002/bimj.201700067.

Given the size of the dataset and the number of inputs, any exhaustive search method such as stepwise regression seems futile.

Stepwise selection is not exhaustive: it is stepwise and, by chance, may not find the best or the true model. But it can be useful for exploratory analysis. You can use it to find two-way and three-way interactions in a binary logit model among all possible combinations while retaining all main effects. You can also try multimodel inference, which provides an exhaustive model search for simple specifications. Since you have about 30 predictors, a genetic algorithm can be useful. See Calcagno, V., & Mazancourt, C. de. (2010). glmulti: An R package for easy automated model selection with (generalized) linear models. Journal of Statistical Software, 34, 1–29. https://doi.org/10.18637/jss.v034.i12.
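To make the "retain all main effects, search only over interactions" idea concrete, here is a sketch of the screening step of a forward search with statsmodels, best run on a subsample for speed; data is assumed to be a DataFrame containing the 0/1 response y and the predictors, and the predictor names are illustrative.

```python
import itertools
import statsmodels.formula.api as smf

# data: DataFrame with the binary response "y" and the predictors; names are illustrative.
main_terms = ["region", "product", "channel", "age", "amount"]
base = smf.logit("y ~ " + " + ".join(main_terms), data=data).fit(disp=0)

# Score every candidate two-way interaction by its log-likelihood gain over the
# main-effects model; all main effects are always retained.
gains = []
for a, b in itertools.combinations(main_terms, 2):
    fit = smf.logit(f"y ~ {' + '.join(main_terms)} + {a}:{b}", data=data).fit(disp=0)
    gains.append((fit.llf - base.llf, f"{a}:{b}"))

for gain, term in sorted(gains, reverse=True)[:10]:
    print(f"log-likelihood gain {gain:10.1f}  for {term}")
# A greedy forward search would now add the best term and repeat with the enlarged model.
```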

The same problem applies to selection by regularization such as the lasso, in particular because sparse design matrices are not possible due to the numerical inputs.

I do not see how the LASSO is not possible. You can leave all main effects unpenalized and only penalize the interaction terms. For continuous variables, you can build square, cubic, logarithm, and reciprocal terms for nonlinear effects after centering and scaling. See https://glmnet.stanford.edu/articles/glmnet.html.

I am aware of, but have not yet tried, Friedman's H-statistic. The problems I see there are that it is based on variance decomposition rather than log loss, that it is a kind of exhaustive search and doable (at best?) only for pairwise interactions, and that its estimates are based on permutations while some of the inputs show strong dependence.

The H statistic is not just for two-way interactions: it is a comparison between model predictions including all interactions and those excluding the interaction(s) of interest. It works not only for continuous responses but also for binary ones. See the tutorial at https://christophm.github.io/interpretable-ml-book/interaction.html: "The H-statistic tells us the strength of interactions, but it does not tell us how the interactions look like. That is what partial dependence plots are for. A meaningful workflow is to measure the interaction strengths and then create 2D-partial dependence plots for the interactions you are interested in."
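For reference, Friedman's H for a single pair of features can also be computed by hand on a modest subsample, which avoids the full pairwise search. This sketch assumes a fitted scikit-learn style booster gbt exposing decision_function (the raw log-odds score) and a NumPy feature matrix X; the feature indices are illustrative.

```python
import numpy as np

def centered_pd(model, background, cols, points):
    """Centered partial dependence of the model's raw score (log-odds) over `cols`,
    evaluated at each row of `points`, averaging over the `background` rows."""
    vals = np.empty(len(points))
    grid = points[:, cols].reshape(len(points), -1)
    for i, row in enumerate(grid):
        Xmod = background.copy()
        Xmod[:, cols] = row                      # clamp the chosen features everywhere
        vals[i] = model.decision_function(Xmod).mean()
    return vals - vals.mean()

# gbt: fitted sklearn gradient-boosting classifier; X: NumPy feature matrix.
# Use a modest subsample, since the computation is quadratic in its size.
rng = np.random.default_rng(0)
Xs = X[rng.choice(len(X), size=300, replace=False)]

j, k = 2, 5                                      # illustrative feature indices
pd_jk = centered_pd(gbt, Xs, [j, k], Xs)
pd_j  = centered_pd(gbt, Xs, [j], Xs)
pd_k  = centered_pd(gbt, Xs, [k], Xs)

# Friedman's H^2: the share of the joint PD's variance not explained additively.
h2 = np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)
print(f"H^2 for features {j} and {k}: {h2:.3f}")
```

Values near 0 mean the pair acts roughly additively on the log-odds scale; values near 1 mean most of their joint effect is interaction.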

The dataset is complex, and there is no a priori reason why interactions should be limited to pairs. My success at "guessing" interactions based on my general domain knowledge and verifying them by inclusion in the GLM has been limited.

Multiway interactions can be decomposed into two-way interactions. Seeking multiway interactions can be numerically dangerous because it increases the chance of getting infinite but nonsignificant coefficients, which is related to the complete-separation problem. See Yee, T. W. (2021). On the Hauck–Donner effect in Wald tests: Detection, tipping points, and parameter space characterization. Journal of the American Statistical Association, 0(0), 1–12. https://doi.org/10.1080/01621459.2021.1886936.

  • $\begingroup$ Some questions on your response: 1. Paragraph 1) What do you mean by "To illustrate the interactions discovered, you can make plots of predictions." 2) You write "See tutorials marginaleffects.com/vignettes/machine_learning.html. ". But I could not find a tutorial on detection of interactions. $\endgroup$
    – g g
    Commented Apr 26 at 9:34
  • $\begingroup$ Paragraph 5: Stepwise regression You write "You can use it to easily find two-way and three-way interactions in a binary logit model among all possible combinations" The "among all possible combinations" is the issue. The number of possible combinations is in the order of millions. So I do not see how to do it "easily". $\endgroup$
    – g g
    Commented Apr 26 at 9:43
  • $\begingroup$ Paragraph 6 Lasso: See 5. above, it is the number of combinations. $\endgroup$
    – g g
    Commented Apr 26 at 9:44
  • $\begingroup$ Last Paragraph: Can you explain how multiway interactions can be decomposed? And how this relates to Wald Tests? $\endgroup$
    – g g
    Commented Apr 26 at 9:50
  • $\begingroup$ Paragraph 2 and 3: Those are generally true and generic statements. But I do not see how this relates to my specific question. In particular, Paragraph 3 seems to be off-topic. Or did I misunderstand something? $\endgroup$
    – g g
    Commented Apr 26 at 9:52
