What are possible ways to find interactions in the GBT model?
Extreme gradient boosting already has interaction search built in: every time a tree splits below an earlier split, that split interacts with the upstream branches. To illustrate the interactions discovered, you can plot model predictions; see the tutorial at https://marginaleffects.com/vignettes/machine_learning.html. You can also make a table of predicted probabilities for typical groups. Because the linear predictor is connected to the binary outcome through a link function, typically logit but possibly probit, nonlinear and interaction effects should be assessed on the logit or probit scale, not on the probability scale, which is already nonlinear. Each time, compare four groups defined by two variables, as illustrated in Frank Harrell's interpretation of interaction in regression results.
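As a rough sketch of the four-group comparison on the logit scale (hypothetical simulated data; scikit-learn's GradientBoostingClassifier stands in for XGBoost here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 4000
x1 = rng.integers(0, 2, n)          # two binary predictors, for illustration
x2 = rng.integers(0, 2, n)
# true data-generating process has an interaction on the logit scale
eta = -2 + 0.8 * x1 + 0.8 * x2 + 1.5 * x1 * x2
y = rng.random(n) < 1 / (1 + np.exp(-eta))

model = GradientBoostingClassifier(random_state=0).fit(np.column_stack([x1, x2]), y)

# 2x2 table of predicted probabilities for the four groups, moved back to
# the logit scale before judging the interaction:
grid = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
p = model.predict_proba(grid)[:, 1]
lo = np.log(p / (1 - p))
# if effects were additive on the logit scale, this double difference would be ~0
interaction = (lo[3] - lo[2]) - (lo[1] - lo[0])
print(dict(zip(["00", "10", "01", "11"], np.round(lo, 2))), round(interaction, 2))
```

The double difference recovers roughly the 1.5 used in the simulation; computed on the probability scale instead, it would be nonzero even for a purely additive logit model.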
More generally: where can I find information on the state of the art with respect to finding interactions, and on what is currently doable and what is not?
For a binary outcome, hypothesis-testing and variable-selection results depend heavily on which predictors the model already includes before selection. Thus an interaction may be significant when some main effects are dropped and become nonsignificant when all main effects are retained. Moreover, whenever the model specification changes, all coefficients are rescaled, so coefficient sizes are not directly comparable across specifications; an interaction that seems important under one specification may be unimportant under another. See Williams, R., & Jorgensen, A. (2023). Comparing logit & probit coefficients between nested models. Social Science Research, 109, 102802. https://doi.org/10.1016/j.ssresearch.2022.102802.
I don't need to find ALL interactions; a few "important" ones would be a great start.
There are many ways to assess variable importance for binary responses: coefficient size, variable significance (z or Wald statistic), log-likelihood contribution, marginal effects on the probability scale, differences in ROC AUC, PR AUC, lift, gain, and so on. If you are doing business intelligence, maybe none of these truly matters. You may instead want to make a four-cell table of the relative cost or benefit of true-positive, false-positive, true-negative, and false-negative predictions, and tie the consequences of including interaction effects to such financial indicators. Estimating interactions accurately for binary responses also requires a very large sample size; see Likelihood ratio vs. score vs. Wald test: Different p values, which to use? and https://stats.stackexchange.com/a/641263/284766. Your effective sample size is 3M × 1% × 99% ≈ 30k, and that may not be enough to test certain interactions.
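A minimal sketch of the cost-based comparison; the four cost figures are placeholders to be replaced with your own business numbers, and two competing models would be compared by this quantity on the same holdout predictions:

```python
import numpy as np

# Hypothetical per-case costs/benefits (negative = benefit); placeholders only.
COST = {"TP": -50.0, "FP": 5.0, "TN": 0.0, "FN": 20.0}

def expected_cost(y_true, y_prob, threshold=0.5):
    """Average cost per case at a given decision threshold."""
    pred = y_prob >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return (tp * COST["TP"] + fp * COST["FP"]
            + tn * COST["TN"] + fn * COST["FN"]) / len(y_true)

# Toy holdout predictions: one TP, one FN, one FP, one TN
y_true = np.array([1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.1])
print(expected_cost(y_true, y_prob))  # (-50 + 20 + 5 + 0) / 4 = -6.25
```

Whether adding interaction terms lowers this number is then a direct business criterion, independent of AUC-style rankings.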
I do not need to do hypothesis tests on the findings.
Although you did not intend to test any hypothesis, deciding whether an interaction effect is present is a hypothesis test, so the usual assumptions and caveats of null-hypothesis testing apply. See Heinze, G., & Dunkler, D. (2017). Five myths about variable selection. Transplant International, 30(1), 6–10. https://doi.org/10.1111/tri.12895 and Heinze, G., Wallisch, C., & Dunkler, D. (2018). Variable selection: A review and recommendations for the practicing statistician. Biometrical Journal, 60(3), 431–449. https://doi.org/10.1002/bimj.201700067.
Given the size of the dataset and the number of inputs, any exhaustive search method such as stepwise regression seems futile.
Stepwise selection is not exhaustive: it proceeds stepwise and by chance may not find the best or true model, but it can still be useful for exploratory analysis. You can use it to find two-way and three-way interactions in a binary logit model among all possible combinations while retaining all main effects. You can also try multimodel inference, which provides an exhaustive model search for simple specifications, and since you have 30 predictors, a genetic algorithm can be useful. See Calcagno, V., & Mazancourt, C. de. (2010). glmulti: An R package for easy automated model selection with (generalized) linear models. Journal of Statistical Software, 34, 1–29. https://doi.org/10.18637/jss.v034.i12.
Same problem with selection by regularization such as LASSO, in particular since sparse design matrices are not possible due to the numerical inputs.
I do not see why LASSO is not possible. You can leave all main effects unpenalized and penalize only the interaction terms. For continuous variables, you can build square, cubic, logarithm, and reciprocal terms for nonlinear effects after centering and scaling. See https://glmnet.stanford.edu/articles/glmnet.html.
I am aware of, but have not yet tried, Friedman's H-statistic. The problems I see there are that it is based on variance decomposition rather than log loss, that it is also a kind of exhaustive search, doable (at best?) only for pairwise interactions, and that its estimates are based on permutations while some of the inputs show strong dependence.
The H-statistic is not just for two-way interactions: it compares model predictions that include all interactions against predictions that exclude a given interaction, and it works for binary as well as continuous responses. See the tutorial https://christophm.github.io/interpretable-ml-book/interaction.html: "The H-statistic tells us the strength of interactions, but it does not tell us how the interactions look like. That is what partial dependence plots are for. A meaningful workflow is to measure the interaction strengths and then create 2D-partial dependence plots for the interactions you are interested in."
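For concreteness, a bare-bones sketch of the pairwise H² following Friedman and Popescu's formula: the share of variance of the two-way partial dependence not explained by the two one-way partial dependences. The `predict` argument should return predictions on the scale you care about, e.g. the logit scale for a binary model; the two hand-built "models" below are illustrative assumptions:

```python
import numpy as np

def friedman_h2(predict, X, j, k, n_grid=100, seed=0):
    """Pairwise Friedman H^2 for features (j, k)."""
    rng = np.random.default_rng(seed)
    pts = X[rng.choice(len(X), size=min(n_grid, len(X)), replace=False)]

    def pd(cols):
        # partial dependence on `cols`, evaluated at the sampled points
        vals = np.empty(len(pts))
        for i, row in enumerate(pts):
            Xt = X.copy()
            Xt[:, cols] = row[cols]
            vals[i] = predict(Xt).mean()
        return vals - vals.mean()          # centered, as in the original paper

    pd_j, pd_k, pd_jk = pd([j]), pd([k]), pd([j, k])
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
h_add = friedman_h2(lambda Z: Z[:, 0] + Z[:, 1], X, 0, 1)   # purely additive
h_mul = friedman_h2(lambda Z: Z[:, 0] * Z[:, 1], X, 0, 1)   # pure interaction
print(round(h_add, 3), round(h_mul, 3))
```

The additive model yields H² ≈ 0 and the pure product yields H² ≈ 1, which is the contrast the statistic is designed to detect.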
The dataset is complex, and there is no a priori reason why interactions should be limited to pairs. My success at "guessing" interactions based on my general domain knowledge and verifying them by inclusion in the GLM has been limited.
Multiway interactions can be decomposed into two-way interactions. Seeking multiway interactions can also be numerically dangerous, because it increases the chance of getting infinite but nonsignificant coefficients, related to the complete-separation problem. See Yee, T. W. (2021). On the Hauck–Donner effect in Wald tests: Detection, tipping points, and parameter space characterization. Journal of the American Statistical Association, 0(0), 1–12. https://doi.org/10.1080/01621459.2021.1886936.
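One cheap diagnostic before fitting a high-order interaction among discrete predictors is to cross-tabulate the cells and flag any cell where the outcome is constant, since quasi-complete separation drives the corresponding coefficient toward infinity. A sketch on simulated data:

```python
import numpy as np
from collections import defaultdict

def separation_risk(y, X_discrete):
    """Flag cells of the full cross-classification where the outcome is
    constant: with the full multiway interaction in the model, such cells
    push the fitted coefficients toward infinity."""
    cells = defaultdict(list)
    for row, yi in zip(map(tuple, X_discrete), y):
        cells[row].append(yi)
    return {cell: len(v) for cell, v in cells.items() if len(set(v)) == 1}

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))
y = rng.integers(0, 2, size=200)
y[(X[:, 0] == 1) & (X[:, 1] == 1) & (X[:, 2] == 1)] = 1   # force one pure cell
risk = separation_risk(y, X)
print(risk)   # cell (1, 1, 1) has a constant outcome -> separation risk
```

With continuous predictors the analogous check is whether the regions implied by the interaction contain both outcome classes; sparse or pure cells are exactly where Wald statistics become unreliable per the Hauck–Donner effect.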