Anomaly-aware summary statistic from data batches

Gaia Grosso 1,2.3111Correspondence E-mail: gaia.grosso@cern.ch
1NSF AI Institute for Artificial Intelligence and Fundamental Interactions
2MIT Laboratory for Nuclear Science, Cambridge, MA
3School of Engineering and Applied Sciences, Harvard University, Cambridge, MA
Abstract

Signal-agnostic data exploration based on machine learning could unveil very subtle statistical deviations of collider data from the expected Standard Model of particle physics. The beneficial impact of a large training sample on machine learning solutions motivates the exploration of increasingly large and inclusive samples of acquired data with resource efficient computational methods. In this work we consider the New Physics Learning Machine (NPLM), a multivariate goodness-of-fit test built on the Neyman–Pearson maximum-likelihood-ratio construction, and we address the problem of testing large size samples under computational and storage resource constraints. We propose to perform parallel NPLM routines over batches of the data, and to combine them by locally aggregating over the data-to-reference density ratios learnt by each batch. The resulting data hypothesis defining the likelihood-ratio test is thus shared over the batches, and complies with the assumption that the expected rate of new physical processes is time invariant. We show that this method outperforms the simple sum of the independent tests run over the batches, and can recover, or even surpass, the sensitivity of the single test run over the full data. Beside the significant advantage for the offline application of NPLM to large size samples, the proposed approach offers new prospects toward the use of NPLM to construct anomaly-aware summary statistics in quasi-online data streaming scenarios.

1 Introduction

Numerous experimental tests have confirmed the Standard Model as the best description of the particle physics world, reaching levels of precision up to few parts per thousand. New physical processes, if detectable, should therefore manifest as very subtle statistical deviations of the data distributions with respect to the Standard Model expectations. Increasing the phase space acceptance and the statistics of the analysed datasets could unveil interesting data structures that were missed so far. Along with the regular signal-specific statistical analyses, signal-agnostic anomaly detection tools have grown of interest in the high energy physics community, and machine learning has proved to be a crucial ingredient to scale their discovery power.

Anomaly detection broadly refers to the problem of identifying data patterns that do not align with their expected behaviours. Events can be interpreted as anomalous for multiple reasons; either because their appearance in the dataset is rare, or because they manifest out of the expected data support. Moreover, anomalies can emerge as a collective behaviour affecting the statistical model describing the data occurrence in the experiment. In this work we focus on collective anomalies, namely phenomena that rise as an unexpected deformation of the data probability distribution. These kind of anomalies are tackled by means of statistical anomaly detection tools. Statistical anomaly detection refers to the problem of recognising and quantifying distributional deviations of a dataset from its nominal expected behaviour. Besides its use in new physics searches, statistical anomaly detection finds a wide spectrum of applications in experimental particle physics. The most striking examples are data quality monitoring of experimental setups at data taking time, and validation of simulated samples.

When a tractable model of the data nominal behaviour is available, statistical anomaly detection reduces to a goodness-of-fit test: a test statistic is defined according to a notion of similarity between the data distribution and the nominal model, and the level of compatibility is quantified in terms of the test statistic p𝑝pitalic_p-value with respect to the test distribution in nominal conditions. However, in most of the applications of statistical anomaly detection in high energy physics, a tractable model for the nominal data distribution is not available and needs to be replaced by a dataset of reference ( denoted as {\cal{R}}caligraphic_R in this work), simulated according to the nominal model or built in a data-driven fashion. In these cases, the problem of statistical anomaly detection is solved as a two-sample-test, introducing a test statistic that measures the distance between two samples.

High energy physics, and in particular collider data, pose extremely challenging problems for statistical anomaly detection due to the level of precision required to test the Standard Model. For instance, the accuracy of the statistical models employed in the process of scientific discovery is a particularly delicate matter that requires powerful testing tools. Similarly, the lack of new interesting tensions with the Standard Model hypothesis after the Higgs boson discovery seem to indicate that, if detectable, new physics should manifest at the LHC as a very rare process, whether in time or in phase space. Anomaly detection algorithms must therefore be able to capture very subtle anomalous behaviours, and they should be run over the most inclusive set of data allowed by the computational and methodological constraints. An extensive review of goodness-of-fit algorithms routinely adopted in the high energy physics community can be found in Ref. [1].

Machine learning (ML) has provided significant improvements to the statistical tools employed in data analysis at the LHC, allowing to deal with increasingly complex data structures, and exploiting a wider amount of the physics information produced by the experiments. In the specific context of goodness-of-fit, machine learning based algorithms have opened the path to multivariate statistical analyses, able to capture mismodeling also in the correlations between variables Ref. [2].

In absence of constraining inductive biases, as it is the case in unknown anomalous signals searches, the power of machine learning strategies resides in their ability to learn from the training examples coming directly from the experimental data. However, the latter could be either scarce or completely lacking of anomalous events due to their rarity. The impact of machine learning is therefore expected to become more and more evident by scaling the training sample size and, as a consequence, the rate of collected anomalies. While beneficial to the training, larger sample sizes introduce computational and storage constraints, that need to be addressed. In this work, we focus on the New Physics Learning Machine (NPLM) algorithm, an ML-empowered goodness-of-fit method based on the Neyman-Pearson hypothesis test (Ref. [3, 4, 5, 6, 7]), and we address the problem of how to test large datasets under computational and storage constraints. The approach presented in this paper proposes parallel computing of the test over batches of the original sample and a final aggregation strategy that combines the local approximation of the data-to-reference ratios learnt by each instance of the algorithm. We outline three operational approaches for different resource constraints settings, aiming at maximally exploiting the available experimental data both in offline and quasi-online configurations.

We apply the split-aggregation approach to a one-dimensional toy model and two realistic LHC datasets of increasing complexity. The former is a dimuon final state characterised by 5 observables and sample size of O(104)𝑂superscript104O(10^{4})italic_O ( 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT ) events, and a 24-dimensional problem defined by the tri-momenta of eight objects selected at the L1 trigger of the CMS experiment with sample size of O(106)𝑂superscript106O(10^{6})italic_O ( 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT ) events.

Our numerical experiments for one-dimensional and five-dimensional problems show that the aggregation strategy can fully recover or, in same cases, even outperform the single batch implementation, thanks to the effect of regularization against statistical noise that the aggregation strategy introduces.

Furthermore, the studies conducted on the 24-dimensional problem show the discovery potential of the various approaches in high-dimensional and large statistic scenarios.

The paper is organised as follows. In Section 2 we describe the NPLM algorithm and the proposed batch solution for handling large sample sizes. In Section 3 we study the performances of the split-aggregation approach on the one-dimensional and five-dimensional problems and we compare them with the original implementation of NPLM. In Section 4, we apply the split-aggregation strategy to a large size dataset simulating the CMS L1 trigger, showing the scalability of the algorithm to more complex scenarios. We conclude in Section 6 discussing our main findings and future research directions.

2 Algorithm

In this Section we briefly summarise the main features of the NPLM algorithm and we provide the statistical foundations for the new proposed split-aggregation solution.

2.1 New Physics Learning Machine (NPLM)

New Physics Learning Machine (NPLM) is a ML-based signal-agnostic strategy to detect and quantify statistically significant deviations of the observed data 𝒟𝒟{\cal{D}}caligraphic_D from a defined reference model RR{\rm{R}}roman_R Ref. [3]. The NPLM method is inspired by the classical approach to hypothesis testing based on the log-likelihood-ratio Ref. [8]

t(𝒟)=2max𝐰x𝒟log(x|H𝐰)(x|R),𝑡𝒟2subscript𝐰subscript𝑥𝒟conditional𝑥subscriptH𝐰conditional𝑥Rt({\cal{D}})=2\max\limits_{{\rm{\bf{w}}}}\sum\limits_{x\in{\cal{D}}}\log\frac{% {\cal L}(x|{{\rm{H}}_{\rm{\bf{w}}}})}{{\cal L}(x|{\rm{R}})}\,,italic_t ( caligraphic_D ) = 2 roman_max start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D end_POSTSUBSCRIPT roman_log divide start_ARG caligraphic_L ( italic_x | roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ) end_ARG start_ARG caligraphic_L ( italic_x | roman_R ) end_ARG , (1)

where the two compared hypotheses are the reference, RR{\rm{R}}roman_R, and a composite alternative, H𝐰subscriptH𝐰{\rm{H}}_{\rm{\bf{w}}}roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT, whose nature is not known a priori. A model f𝐰(x)subscript𝑓𝐰𝑥f_{\rm{\bf{w}}}(x)italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) with trainable parameters 𝐰𝐰{\rm{\bf{w}}}bold_w is defined on the space of the data x𝑥xitalic_x to parametrize the density distribution of the data in the family of alternatives

n(x|H𝐰)=n(x|R)ef𝐰(x).𝑛conditional𝑥subscriptH𝐰𝑛conditional𝑥Rsuperscript𝑒subscript𝑓𝐰𝑥n(x|{\rm{H}}_{\rm{\bf{w}}})=n(x|{\rm{R}})e^{f_{\rm{\bf{w}}}(x)}\,.italic_n ( italic_x | roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ) = italic_n ( italic_x | roman_R ) italic_e start_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) end_POSTSUPERSCRIPT . (2)

Since the total number of events in x𝑥xitalic_x is a Poissonian random variable whose expectation depends on the theory model, the test statistic in Eq. 1 is computed as the extended log-likelihood-ratio, which in terms of f𝐰subscript𝑓𝐰f_{\rm{\bf{w}}}italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT is written as

t(𝒟)=2max𝐰[N(R)N(H𝐰)+x𝒟f𝐰(x)],𝑡𝒟2subscript𝐰NRNsubscriptH𝐰subscript𝑥𝒟subscript𝑓𝐰𝑥t({\cal{D}})=2\max\limits_{{\rm{\bf{w}}}}\left[{\rm{N}}({\rm{R}})-{\rm{N}}({% \rm{H}}_{\rm{\bf{w}}})+\sum\limits_{x\in{\cal{D}}}f_{\rm{\bf{w}}}(x)\right]\,,italic_t ( caligraphic_D ) = 2 roman_max start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT [ roman_N ( roman_R ) - roman_N ( roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ) + ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) ] , (3)

with N(R)NR{\rm{N}}({\rm{R}})roman_N ( roman_R ) and N(H𝐰)NsubscriptH𝐰{\rm{N}}({\rm{H}}_{\rm{\bf{w}}})roman_N ( roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ) the expectation values of the total number of events in 𝒟𝒟{\cal{D}}caligraphic_D under the reference and the alternative hypotheses respectively. The maximisation problem described by Eq. 3 can be solved by a machine learning task

t𝐰(𝒟)=2min𝐰L[f𝐰],subscript𝑡𝐰𝒟2subscript𝐰𝐿delimited-[]subscript𝑓𝐰t_{{\rm{\bf{w}}}}({\cal{D}})=-2\min\limits_{{\rm{\bf{w}}}}L[f_{{\rm{\bf{w}}}}]\,,italic_t start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( caligraphic_D ) = - 2 roman_min start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT italic_L [ italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ] , (4)

where the loss function

L[f𝐰]=[xw(x)(ef𝐰(x)1)x𝒟f𝐰(x)].𝐿delimited-[]subscript𝑓𝐰delimited-[]subscript𝑥subscript𝑤𝑥superscript𝑒subscript𝑓𝐰𝑥1subscript𝑥𝒟subscript𝑓𝐰𝑥L[f_{{\rm{\bf{w}}}}]=\left[\sum\limits_{x\in{\cal{R}}}w_{\cal{R}}(x)(e^{f_{\rm% {\bf{w}}}(x)}-1)-\sum\limits_{x\in{\cal{D}}}f_{\rm{\bf{w}}}(x)\right]\,.italic_L [ italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ] = [ ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_R end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ) ( italic_e start_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) end_POSTSUPERSCRIPT - 1 ) - ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) ] . (5)

takes as input two training datasets: the dataset of interest 𝒟𝒟{\cal{D}}caligraphic_D, and a sample of reference {\cal{R}}caligraphic_R drawn according to the reference hypothesis and weighted by w(x)subscript𝑤𝑥w_{\cal{R}}(x)italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ) to match the experimental luminosity . Thus, the machine learning task learns the optimal values of the trainable parameters of f𝑓fitalic_f, 𝐰^^𝐰{\widehat{\rm{\bf{w}}}}over^ start_ARG bold_w end_ARG, to approximate the log-density-ratio of the two samples

f𝐰^(x)=logn(x|H𝐰^)n(x|R)subscript𝑓^𝐰𝑥𝑛conditional𝑥subscriptH^𝐰𝑛conditional𝑥Rf_{\widehat{\rm{\bf{w}}}}(x)=\log\frac{n(x|{\rm{H}}_{{\widehat{\rm{\bf{w}}}}})% }{n(x|{\rm{R}})}italic_f start_POSTSUBSCRIPT over^ start_ARG bold_w end_ARG end_POSTSUBSCRIPT ( italic_x ) = roman_log divide start_ARG italic_n ( italic_x | roman_H start_POSTSUBSCRIPT over^ start_ARG bold_w end_ARG end_POSTSUBSCRIPT ) end_ARG start_ARG italic_n ( italic_x | roman_R ) end_ARG (6)
Refer to caption
Figure 1: NPLM training over a single dataset. Schematic representation of a single run of the NPLM algorithm. The NPLM model, a kernel method in this work, takes as input two datasets, a reference sample ({\cal{R}}caligraphic_R) and a data sample 𝒟𝒟{\cal{D}}caligraphic_D, and it fit via maximum-likelihood the log-ratio of their densities f(x;𝐰^)𝑓𝑥^𝐰f(x;\,{\widehat{\rm{\bf{w}}}})italic_f ( italic_x ; over^ start_ARG bold_w end_ARG ). The resulting model is evaluated in-sample on 𝒟𝒟{\cal{D}}caligraphic_D and {\cal{R}}caligraphic_R to compute the test statistic t𝑡titalic_t.

As for any test statistic, the value of the test obtained for 𝒟𝒟{\cal{D}}caligraphic_D has to be compared to the distribution of the test under the reference hypothesis, p(t|R)𝑝conditional𝑡Rp(t|{\rm{R}})italic_p ( italic_t | roman_R ), and the p𝑝pitalic_p-value is finally used as a GoF metric. We build p(t|R)𝑝conditional𝑡Rp(t|{\rm{R}})italic_p ( italic_t | roman_R ) empirically, running pseudo-experiments on datasets that are generated according to the RR{\rm{R}}roman_R hypothesis. The model f𝐰(x)subscript𝑓𝐰𝑥f_{\rm{\bf{w}}}(x)italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) could be a neural network as in Ref. [3, 4, 5], or it could be built with kernel methods Ref. [6, 9]. In this work we employ kernel methods since they have been showed to be less time demanding than neural networks. Details about the NPLM training strategy based on kernel methods are postponed to Section 2.4.

It should be mentioned that the test statistic in Eq. 4 has good statistical properties (i.e. its null distribution approximates a χ2superscript𝜒2\chi^{2}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT in the asymptotic regime) only if the number of events representing the reference hypothesis, NsubscriptN{\rm N}_{{\cal{R}}}roman_N start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT is large enough that the statistical fluctuations affecting the reference sample are negligible with respect to the one affecting the data. We refer the reader to Ref. [4] and Ref. [6] for a detailed discussion on the impact of the reference sample size on the NPLM test statistic properties.

Moreover, in realistic particle physics scenarios uncertainties affect the knowledge of the reference hypothesis requiring the introduction of nuisance parameters, defining families of transformations of the reference distribution that should be seen as not anomalous. A solution to extend the NPLM strategy to account for systematic errors has been introduced in Ref. [5]. In our studies, we will consider systematic uncertainties as negligible and leave the treatment of systematic uncertainties within the batch approach for future studies. Finally, often times simulations are not accurate enough to provide a good reference model. In those circumstances, data driven techniques need to be implemented to build {\cal{R}}caligraphic_R. The problem of building an accurate reference model RR{\rm{R}}roman_R is a distinct problem and out of the scope of this work.Comment on possible solutions to this problem in Section 4. In the next Section we introduce the batching idea to address the problem of large sample size.

2.2 Learning New Physics from batches

Let’s assume that the experimental data 𝒟𝒟{\cal{D}}caligraphic_D are acquired in Nbsubscript𝑁bN_{\rm b}italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT batches 𝒟isubscript𝒟𝑖{\cal{D}}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT

𝒟=i𝒟i,𝒟subscript𝑖subscript𝒟𝑖\displaystyle{\cal{D}}=\cup_{i}{\cal{D}}_{i}\,,caligraphic_D = ∪ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , (7)

and the size Nisubscript𝑁𝑖N_{i}italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of each batch is a Poissonian random variable whose expectation is a function of the integrated data acquisition time.222more precisely, the expected number of observations for a given physics process (H𝐻Hitalic_H) is a function of its physics cross section (σ(H)𝜎𝐻\sigma(H)italic_σ ( italic_H )), the integrated luminosity (L𝐿Litalic_L), and the acceptance (a𝑎aitalic_a) and efficiency (ϵitalic-ϵ\epsilonitalic_ϵ) of the experimental setup: N(H)=σ(H)×L×a×ϵ𝑁𝐻𝜎𝐻𝐿𝑎italic-ϵN(H)=\sigma(H)\times L\times a\times\epsilonitalic_N ( italic_H ) = italic_σ ( italic_H ) × italic_L × italic_a × italic_ϵ. In addition, let’s assume that each batch 𝒟isubscript𝒟𝑖{\cal{D}}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT can be stored in a buffer long enough time to undergo the NPLM training task.

Since the proton-proton collisions are independent events, the optimal test of the full dataset according to Neyman and Pearson would be the sum of the log-likelihood-ratio test performed over each single batch, with alternative hypothesis set to the True model (TT\rm Troman_T) underlying the data:

tid(𝒟=i𝒟i)=2i=1Nblog(𝒟i|T)(𝒟i|R).subscript𝑡𝑖𝑑𝒟subscript𝑖subscript𝒟𝑖2superscriptsubscript𝑖1subscript𝑁bconditionalsubscript𝒟𝑖Tconditionalsubscript𝒟𝑖Rt_{id}({\cal{D}}=\cup_{i}{\cal{D}}_{i})=2\sum\limits_{i=1}^{N_{\rm b}}\log% \frac{{\cal L}({\cal{D}}_{i}|{\rm T})}{{\cal L}({\cal{D}}_{i}|{\rm{R}})}\,.italic_t start_POSTSUBSCRIPT italic_i italic_d end_POSTSUBSCRIPT ( caligraphic_D = ∪ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_log divide start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_T ) end_ARG start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_R ) end_ARG . (8)

Clearly, the True hypothesis T is not known a priori and we want to exploit the NPLM strategy to estimate it. The solution we propose in this work is to run separate instances of the NPLM algorithm over different batches and recombine the information encoded in the log-density-ratio functions

f𝐰^,i(x)=logn(x;H𝐰^i)n(x;R)subscript𝑓^𝐰𝑖𝑥𝑛𝑥superscriptsubscriptH^𝐰𝑖𝑛𝑥Rf_{{\widehat{\rm{\bf{w}}}},\,i}(x)=\log\frac{n(x;\,{\rm{H}}_{{\widehat{\rm{\bf% {w}}}}}^{i})}{n(x;\,{\rm{R}})}italic_f start_POSTSUBSCRIPT over^ start_ARG bold_w end_ARG , italic_i end_POSTSUBSCRIPT ( italic_x ) = roman_log divide start_ARG italic_n ( italic_x ; roman_H start_POSTSUBSCRIPT over^ start_ARG bold_w end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_n ( italic_x ; roman_R ) end_ARG (9)

learnt by the different instances. Recalling the parametrization defined in Eq. 2, for each learnt model f𝐰,isubscript𝑓𝐰𝑖f_{{\rm{\bf{w}}},\,i}italic_f start_POSTSUBSCRIPT bold_w , italic_i end_POSTSUBSCRIPT we can write an estimate of the density ratio between the data batch and the reference {\cal{R}}caligraphic_R

n(x;H𝐰^i)=n(x;R)ef𝐰^,i(x),𝑛𝑥superscriptsubscriptH^𝐰𝑖𝑛𝑥Rsuperscript𝑒subscript𝑓^𝐰𝑖𝑥n(x;\,{\rm{H}}_{{\widehat{\rm{\bf{w}}}}}^{i})=n(x;\,{\rm{R}})e^{f_{{\widehat{% \rm{\bf{w}}}},i}(x)}\,,italic_n ( italic_x ; roman_H start_POSTSUBSCRIPT over^ start_ARG bold_w end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = italic_n ( italic_x ; roman_R ) italic_e start_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT over^ start_ARG bold_w end_ARG , italic_i end_POSTSUBSCRIPT ( italic_x ) end_POSTSUPERSCRIPT , (10)

Each density n(x;H𝐰^i)𝑛𝑥superscriptsubscriptH^𝐰𝑖n(x;\,{\rm{H}}_{{\widehat{\rm{\bf{w}}}}}^{i})italic_n ( italic_x ; roman_H start_POSTSUBSCRIPT over^ start_ARG bold_w end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) can be interpreted as an approximated view of the same True physical model. If the learning process outputs an accurate enough representation of T, then Eq. 8 can be straightforward approximated summing the likelihood-ratio-tests output by NPLM for the separate batches

tSUMNb(𝒟)=i=1Nbt(𝒟i)=2i=1Nblog(𝒟i|H𝐰i)(𝒟i|R)=2i=1Nb[xw(x)(1efi,𝐰^(x))+x𝒟ifi,𝐰^(x)].superscriptsubscript𝑡SUMsubscript𝑁b𝒟superscriptsubscript𝑖1subscript𝑁b𝑡subscript𝒟𝑖2superscriptsubscript𝑖1subscript𝑁bconditionalsubscript𝒟𝑖superscriptsubscriptH𝐰𝑖conditionalsubscript𝒟𝑖R2superscriptsubscript𝑖1subscript𝑁bdelimited-[]subscript𝑥subscript𝑤𝑥1superscript𝑒subscript𝑓𝑖^𝐰𝑥subscript𝑥subscript𝒟𝑖subscript𝑓𝑖^𝐰𝑥t_{\rm SUM}^{N_{\rm b}}({\cal{D}})=\sum_{i=1}^{N_{\rm b}}t({\cal{D}}_{i})=2% \sum_{i=1}^{N_{\rm b}}\log\frac{{\cal L}({\cal{D}}_{i}|{\rm{H}}_{\rm{\bf{w}}}^% {i})}{{\cal L}({\cal{D}}_{i}|{\rm{R}})}=2\sum_{i=1}^{N_{\rm b}}\left[\sum% \limits_{x\in{\cal{R}}}w_{\cal{R}}(x)(1-e^{f_{i,{\widehat{\rm{\bf{w}}}}}(x)})+% \sum\limits_{x\in{\cal{D}}_{i}}f_{i,{\widehat{\rm{\bf{w}}}}}(x)\right]\,.italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( caligraphic_D ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_t ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_log divide start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) end_ARG start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_R ) end_ARG = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_R end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ) ( 1 - italic_e start_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_i , over^ start_ARG bold_w end_ARG end_POSTSUBSCRIPT ( italic_x ) end_POSTSUPERSCRIPT ) + ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_i , over^ start_ARG bold_w end_ARG end_POSTSUBSCRIPT ( italic_x ) ] . (11)

This approach produces a powerful test if the exact shape and normalization of the signal is captured by all models. However, when the signal is very rare its impact in a small size data sample can be overlooked by the model. We can exploit two simple assumptions to provide a better solution. First, we assume that the new physics source is static, namely its occurence rate is uniform over time. Second, we assume that biases affecting single batch views are due to statistical noise whose impact over time is on average null. Under these assumptions, averaging the density-ratio terms learnt in different batches provide a single alternative hypothesis H𝐰NbsuperscriptsubscriptH𝐰subscript𝑁b{\rm{H}}_{\rm{\bf{w}}}^{N_{\rm b}}roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT shared among batches, that helps to enhance systematic discrepancies, interpreted as signals, while suppressing random ones. The density of the data under H𝐰NbsuperscriptsubscriptH𝐰subscript𝑁b{\rm{H}}_{\rm{\bf{w}}}^{N_{\rm b}}roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT would then have the following form:

n(x;H𝐰Nb)=1Nbi=1Nbn(x;H𝐰i).𝑛𝑥superscriptsubscriptH𝐰subscript𝑁b1subscript𝑁bsuperscriptsubscript𝑖1subscript𝑁b𝑛𝑥superscriptsubscriptH𝐰𝑖n(x;\,{\rm{H}}_{\rm{\bf{w}}}^{N_{\rm b}})=\frac{1}{N_{\rm b}}\sum_{i=1}^{N_{% \rm b}}n(x;\,{\rm{H}}_{\rm{\bf{w}}}^{i})\,.italic_n ( italic_x ; roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_n ( italic_x ; roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) . (12)

By expressing the model of the data density distribution in each batch in terms of the set of functions {f𝐰,i}i=1Nbsuperscriptsubscriptsubscript𝑓𝐰𝑖𝑖1subscript𝑁b\{f_{{\rm{\bf{w}}},i}\}_{i=1}^{N_{\rm b}}{ italic_f start_POSTSUBSCRIPT bold_w , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, we can define an “aggregated” model of the log-ratio of the densities as

FNb(x;𝐖)=logn(x;H𝐰Nb)n(x;R)=log[1Nbi=1Nbefi(x;𝐰)],superscript𝐹subscript𝑁b𝑥𝐖𝑛𝑥superscriptsubscriptH𝐰subscript𝑁b𝑛𝑥R1subscript𝑁bsuperscriptsubscript𝑖1subscript𝑁bsuperscript𝑒subscript𝑓𝑖𝑥𝐰F^{N_{\rm b}}(x;\,{\rm{\bf{W}}})=\log\frac{n(x;\,{\rm{H}}_{\rm{\bf{w}}}^{N_{% \rm b}})}{n(x;\,{\rm{R}})}=\log\left[\frac{1}{N_{\rm b}}\sum_{i=1}^{N_{\rm b}}% e^{f_{i}(x;\,{\rm{\bf{w}}})}\right],italic_F start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ; bold_W ) = roman_log divide start_ARG italic_n ( italic_x ; roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_n ( italic_x ; roman_R ) end_ARG = roman_log [ divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ; bold_w ) end_POSTSUPERSCRIPT ] , (13)

and use it to compute the log-likelihood-ratio test statistics over the set of batches:

tAGGRNb(𝒟)=2i=1Nblog(𝒟i|H𝐰Nb)(𝒟i|R)=2i=1Nb[xw(x)(1eFNb(x;𝐖))+x𝒟iFNb(x;𝐖)].superscriptsubscript𝑡AGGRsubscript𝑁b𝒟2superscriptsubscript𝑖1subscript𝑁bconditionalsubscript𝒟𝑖superscriptsubscriptH𝐰subscript𝑁bconditionalsubscript𝒟𝑖R2superscriptsubscript𝑖1subscript𝑁bdelimited-[]subscript𝑥subscript𝑤𝑥1superscript𝑒superscript𝐹subscript𝑁b𝑥𝐖subscript𝑥subscript𝒟𝑖superscript𝐹subscript𝑁b𝑥𝐖t_{\rm AGGR}^{N_{\rm b}}({\cal{D}})=2\sum_{i=1}^{N_{\rm b}}\log\frac{{\cal L}(% {\cal{D}}_{i}|{\rm{H}}_{\rm{\bf{w}}}^{N_{\rm b}})}{{\cal L}({\cal{D}}_{i}|{\rm% {R}})}=2\sum_{i=1}^{N_{\rm b}}\left[\sum\limits_{x\in{\cal{R}}}w_{\cal{R}}(x)(% 1-e^{F^{N_{\rm b}}(x;\,{\rm{\bf{W}}})})+\sum\limits_{x\in{\cal{D}}_{i}}F^{N_{% \rm b}}(x;\,{\rm{\bf{W}}})\right]\,.italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( caligraphic_D ) = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_log divide start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) end_ARG start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_R ) end_ARG = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_R end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ) ( 1 - italic_e start_POSTSUPERSCRIPT italic_F start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ; bold_W ) end_POSTSUPERSCRIPT ) + ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_F start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ; bold_W ) ] . (14)

Batching solutions for computationally intensive calculations with kernel methods have already been proposed in the literature for supervised approaches (see for instance Ref. [10]). In Ref. [10] the authors show theoretical guarantees for recovering the performances of the full sample training solution, provided that the number of splitting is not too large.

2.3 Operational strategies in resource constrained settings

Despite the advantage provided by splitting the dataset in batches, the computational requirements for the algorithm execution might still pose significant challenges to its application in realistic scenarios. In fact, computing tAGGRNbsuperscriptsubscript𝑡AGGRsubscript𝑁bt_{\rm AGGR}^{N_{\rm b}}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT requires evaluating F𝐖M(x)superscriptsubscript𝐹𝐖𝑀𝑥F_{{\rm{\bf{W}}}}^{M}(x)italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ( italic_x ) over each element x𝑥xitalic_x in the total set of acquisitions 𝒟=i𝒟i𝒟subscript𝑖subscript𝒟𝑖{\cal{D}}=\cup_{i}{\cal{D}}_{i}caligraphic_D = ∪ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. This means that all data batches have to be stored for the whole duration of the NPLM algorithm execution, and the memory locations at which the models are stored need to be accessible by all batches. Moreover, the execution time largely varies depending on several factors concerning the algorithm setup, such as the size of the batches, the choice of input variables, but also to the computational resources available and the way these are dispatched. The possibility of running NPLM tasks in parallel on multiple GPUs provides, for instance, the potential to significantly reduce the computational time.

On the other hand, if the storage resources are limited, it could happen that not all the data can be saved long enough for the NPLM routines to be all completed. In this situation, the test in Eq. 14 cannot be evaluated exactly. Two alternatives can then be contemplated. The first one is to run NPLM over all the batches and save the final models {f𝐰,i}i+1Nbsuperscriptsubscriptsubscript𝑓𝐰𝑖𝑖1subscript𝑁b\{f_{{\rm{\bf{w}}},i}\}_{i+1}^{N_{\rm b}}{ italic_f start_POSTSUBSCRIPT bold_w , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT to estimate H𝐰NbsuperscriptsubscriptH𝐰subscript𝑁b{\rm{H}}_{\rm{\bf{w}}}^{N_{\rm b}}roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, but running the test only on a subset of Ntestsubscript𝑁testN_{\rm test}italic_N start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT batches that could be stored:

tAGGRNb,Ntest(𝒟)=2i=1Ntestlog(𝒟i|H𝐰Nb)(𝒟i|R)=2i=1Ntest[xw(x)(1eF𝐖Nb(x))+x𝒟iF𝐖Nb(x)].superscriptsubscript𝑡AGGRsubscript𝑁bsubscript𝑁test𝒟2superscriptsubscript𝑖1subscript𝑁testconditionalsubscript𝒟𝑖superscriptsubscriptH𝐰subscript𝑁bconditionalsubscript𝒟𝑖R2superscriptsubscript𝑖1subscript𝑁testdelimited-[]subscript𝑥subscript𝑤𝑥1superscript𝑒superscriptsubscript𝐹𝐖subscript𝑁b𝑥subscript𝑥subscript𝒟𝑖superscriptsubscript𝐹𝐖subscript𝑁b𝑥t_{\rm AGGR}^{N_{\rm b},N_{\rm test}}({\cal{D}})=2\sum_{i=1}^{N_{\rm test}}% \log\frac{{\cal L}({\cal{D}}_{i}|{\rm{H}}_{\rm{\bf{w}}}^{N_{\rm b}})}{{\cal L}% ({\cal{D}}_{i}|{\rm{R}})}=2\sum_{i=1}^{N_{\rm test}}\left[\sum\limits_{x\in{% \cal{R}}}w_{\cal{R}}(x)(1-e^{F_{{\rm{\bf{W}}}}^{N_{\rm b}}(x)})+\sum\limits_{x% \in{\cal{D}}_{i}}F_{{\rm{\bf{W}}}}^{N_{\rm b}}(x)\right]\,.italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( caligraphic_D ) = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_log divide start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_H start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) end_ARG start_ARG caligraphic_L ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | roman_R ) end_ARG = 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_R end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ) ( 1 - italic_e start_POSTSUPERSCRIPT italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT ) + ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ) ] . (15)

The second alternative is considering Eq. 10 as the limit of a smart multidimensional binning approach, where the bins centres are the elements of the reference sample {\cal{R}}caligraphic_R. In this limit, each bin has counting equal to the unity under the reference hypothesis and value eF𝐖Nb(x)superscript𝑒superscriptsubscript𝐹𝐖subscript𝑁b𝑥e^{F_{{\rm{\bf{W}}}}^{N_{\rm b}}(x)}italic_e start_POSTSUPERSCRIPT italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT under the alternative. A “saturated” binned likelihood-ratio test [11] can then be computed:

tSATNb()=2xw(x)[(1eF𝐖Nb(x))+eF𝐖Nb(x)logF𝐖Nb(x)].superscriptsubscript𝑡SATsubscript𝑁b2subscript𝑥subscript𝑤𝑥delimited-[]1superscript𝑒superscriptsubscript𝐹𝐖subscript𝑁b𝑥superscript𝑒superscriptsubscript𝐹𝐖subscript𝑁b𝑥superscriptsubscript𝐹𝐖subscript𝑁b𝑥t_{\rm SAT}^{N_{\rm b}}({\cal{R}})=2\sum\limits_{x\in{\cal{R}}}w_{\cal{R}}(x)% \left[(1-e^{F_{{\rm{\bf{W}}}}^{N_{\rm b}}(x)})+e^{F_{{\rm{\bf{W}}}}^{N_{\rm b}% }(x)}\log F_{{\rm{\bf{W}}}}^{N_{\rm b}}(x)\right]\,.italic_t start_POSTSUBSCRIPT roman_SAT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( caligraphic_R ) = 2 ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_R end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ) [ ( 1 - italic_e start_POSTSUPERSCRIPT italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT ) + italic_e start_POSTSUPERSCRIPT italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT roman_log italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_x ) ] . (16)

Notice that the test does not depend on the data 𝒟𝒟{\cal{D}}caligraphic_D directly, but only via F𝐖Nbsuperscriptsubscript𝐹𝐖subscript𝑁bF_{{\rm{\bf{W}}}}^{N_{\rm b}}italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, that has been learnt exploiting the dataset 𝒟𝒟{\cal{D}}caligraphic_D in batches. Moreover, the test does not compute the likelihood-ratio of the data, but it rather evaluates the distance between the density of the reference and that of the alternative hypothesis on the data points belonging to the reference sample {\cal{R}}caligraphic_R. Eq. 16 is a powerful metric when the model F𝐖Nbsuperscriptsubscript𝐹𝐖subscript𝑁bF_{{\rm{\bf{W}}}}^{N_{\rm b}}italic_F start_POSTSUBSCRIPT bold_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is able to capture the signal shape well enough. Its performances deteriorates for signal events that lay in rare regions of the input space, or whose shape is too complex compared to the expressivity of the family of universal approximators used to define the models {f𝐰,i}subscript𝑓𝐰𝑖\{f_{{\rm{\bf{w}}},i}\}{ italic_f start_POSTSUBSCRIPT bold_w , italic_i end_POSTSUBSCRIPT }. An alternative approach of online data summarization based on machine learning was proposed in Ref. [12]. There are two main conceptual differences between the approach presented in Ref. [12] and the one presented in the current paper. The first is that the authors of Ref. [12] propose to learn directly the probability distribution of the data, while via NPLM we propose to learn the ratio of the data density with respect to a reference. The second difference concerns the training strategy. The authors of Ref. [12] suggest to train a unique model (specifically, a normalizing flow) feeding the acquired data batches sequentially. Conversely, in the present work we propose to perform a complete and independent training for each batch, thus preventing catastrophic forgetting.

In the experiments reported in Section 4 we consider three representative resource constrained scenarios:

  1. 1.

    No resource constraints (NPLM-ALL): all batches are used both for training and for testing. The final test is the one in Eq. 14.

  2. 2.

    Long term storage constraint (NPLM-ONE): there are enough computational resources to train multiple instances of the algorithm in a quasi-online fashion but only one batch is available for testing. For this configuration, the final test is the one in Eq. 15, with Ntest=1subscriptNtest1\rm N_{\rm test}=1roman_N start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT = 1.

  3. 3.

    Saturated test (NPLM-SAT): all data are temporarily available for a quasi-online training of the algorithm but they cannot be stored and tested. In this case, we only test the distance between the density of the data and that of the reference distribution by computing the saturated test in Eg. 16.

Refer to caption
Figure 2: Operational strategies in resource constrained scenarios. Schematic representation of the three scenarios considered in this work. The experimental data are collected in a storage unit or buffer (orange box) and used to train multiple instance of the NPLM model (purple squares). In the NPLM-ALL scenario all data batches and trained models are sent downstream to compute the test statistic. In the NPLM-ONE scenario all the models but only one batch are sent downstream for testing. Finally, in the NPLM-SAT scenario only the models are stored and can be used to test for significant deviations. Due to the information reduction imposed by the resource constraints, the power of the test generally decreases from left to right.

2.4 Time-efficient NPLM implementation with non-parametric models

To harness the execution time of a single instance of the NPLM algorithm we consider the kernel-based implementation of NPLM presented in Ref. [6], and applied in Ref. [9], which exploits the Falkon library [13], a modern solver for kernel methods that leverages several algorithmic solutions developed in Ref. Ref. [14, 15, 13] to allow for their use in large scale problems. The model used in Falkon is a Nystrom approximated Kernel method defined as a finite weighted sum of kernel functions

f𝐰(x)=i=1Mwik(x;xi,σFLK).subscript𝑓𝐰𝑥superscriptsubscript𝑖1𝑀subscript𝑤𝑖𝑘𝑥subscript𝑥𝑖subscript𝜎FLKf_{\rm{\bf{w}}}(x)=\sum_{i=1}^{{M}}w_{i}k(x;\,x_{i},\,\sigma_{\rm FLK})\,.italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_k ( italic_x ; italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT ) . (17)

The kernel function, k(x;xi,σFLK)𝑘𝑥subscript𝑥𝑖subscript𝜎FLKk(x;\,x_{i},\,\sigma_{\rm FLK})italic_k ( italic_x ; italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT ), adopted in this implementation is a Gaussian kernel

k(x;xi,σFLK)=exp(xx22σFLK2),𝑘𝑥subscript𝑥𝑖subscript𝜎FLKsuperscriptnorm𝑥superscript𝑥22superscriptsubscript𝜎FLK2\displaystyle k(x;\,x_{i},\,\sigma_{\rm FLK})=\exp\left(-\frac{\|x-x^{\prime}% \|^{2}}{2\sigma_{\rm FLK}^{2}}\right)\,,italic_k ( italic_x ; italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT ) = roman_exp ( - divide start_ARG ∥ italic_x - italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) , (18)

where the hyper parameter σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT denotes the Gaussian width.

To approximate the log-ratio between the distribution of the 𝒟𝒟{\cal{D}}caligraphic_D and {\cal{R}}caligraphic_R datasets the training exploits a weighted binary cross-entropy loss

L(f𝐰)=xw(x)log[1+ef𝐰(x)]+x𝒟log[1+ef𝐰(x)],𝐿subscript𝑓𝐰subscript𝑥subscript𝑤𝑥1superscript𝑒subscript𝑓𝐰𝑥subscript𝑥𝒟1superscript𝑒subscript𝑓𝐰𝑥L(f_{\rm{\bf{w}}})=\sum_{x\in{\cal{R}}}w_{\cal{R}}(x)\log\left[1+e^{f_{\rm{\bf% {w}}}(x)}\right]+\sum_{x\in{\cal{D}}}\log\left[1+e^{-f_{\rm{\bf{w}}}(x)}\right],italic_L ( italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_R end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ) roman_log [ 1 + italic_e start_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) end_POSTSUPERSCRIPT ] + ∑ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D end_POSTSUBSCRIPT roman_log [ 1 + italic_e start_POSTSUPERSCRIPT - italic_f start_POSTSUBSCRIPT bold_w end_POSTSUBSCRIPT ( italic_x ) end_POSTSUPERSCRIPT ] , (19)

with a L2 regularisation term. The NPLM method works under the assumption that the reference sample is large enough to be a good representation of the RR{\rm{R}}roman_R hypothesis, which typically leads to an imbalanced dataset with N>N𝒟subscriptNsubscriptN𝒟{\rm{N}}_{\cal{R}}>{\rm{N}}_{\cal{D}}roman_N start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT > roman_N start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT. The weight w(x)subscript𝑤𝑥w_{\cal{R}}(x)italic_w start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT ( italic_x ), is thus introduced to balance the two classes’ contributions.

Hyper parameters choice.

The hyper parameters choice for a given family of universal approximators defines the “modes” of the data that the NPLM algorithm is able to capture. For the kernel-methods considered in this work, there are three hyper parameters that need to be defined: the number of centres M𝑀Mitalic_M, the Gaussian width σ𝜎\sigmaitalic_σ, and the regularisation parameter λ𝜆\lambdaitalic_λ. In Ref. [6] it has been showed that there exists a “valley” of (M𝑀Mitalic_M, σ𝜎\sigmaitalic_σ, λ𝜆\lambdaitalic_λ) configurations that guarantee the convexity of the loss function, arbitrarily small prediction error and reasonable convergence time. However, no further indication can be provided to enhance the sensitivity to statistical anomalies. Different hyper parameters choices could be optimal to capture different types of data deviations from the reference distribution. For instance, small values of the Gaussian kernels width, σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT, would allow to better capture localised deviations, like narrow resonances, while larger values would be better suited to represent effects affecting the distribution on a wider range, like scale mismatches. If the nature of the anomaly is unknown and no good prior is available, then the most inclusive way of testing the data is combining multiple choices of the hyper-parameters configurations. In this work we fix M𝑀Mitalic_M and λ𝜆\lambdaitalic_λ following the prescription proposed in Ref. [6]. Namely, we chose M=cN𝑀𝑐𝑁M=c\sqrt{N}italic_M = italic_c square-root start_ARG italic_N end_ARG with c(1)similar-to𝑐1c\sim(1)italic_c ∼ ( 1 ) such that the training time is reasonable given the available computational resources, and we choose λ𝜆\lambdaitalic_λ as small as possible, provided no instabilities are encountered.

For the Gaussian width, σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT, we adopt the approach recently proposed in Ref. [NPLM-aggreg] rather than limiting to a single value we consider five values corresponding to the 5, 25, 50, 75 and 95 % quantiles of the pairwise distance distribution between the elements in the reference set. We combine the results obtained with different values of σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT by selecting the minimum:

P=min{σFLK}(pσFLK)𝑃subscriptsubscript𝜎FLKsubscript𝑝subscript𝜎FLK\displaystyle P=\min\limits_{\{\sigma_{\rm FLK}\}}(p_{\sigma_{\rm FLK}})italic_P = roman_min start_POSTSUBSCRIPT { italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT } end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) (20)

It is worth noticing that the authors of Ref. [10] comment on the impact of regularisation on the batched solutions, highlighting how the correct level of generalisation can be recovered if the trainings run on the batches is under-regularised. The NPLM algorithm is a signal-agnostic strategy and therefore an optimal regularisation level cannot be determined prior to testing the data. Nevertheless, we observe the benefit of the aggregation as a regularisation in our studies. Details are reported in Section 3.

3 Numerical experiments

In this Section we study the properties of the batch-based strategy outlined in Section 2.3 over two proof-of-concept applications. We start considering a one-dimensional problem characterised by a smoothly falling reference distribution (EXPO-1D) for which the optimal test statistic according to Neyman and Pearson can be computed analytically and used as a target. We then move to the multidimensional problem of a dimuon final state observed at collider experiments (DIMUON-5D). A detailed description of the datasets’ properties and signal benchmarks is given in Appendix 7.1. The details about the algorithm implementation, execution time and resource consumption are given in Section 5.

3.1 Improving regularisation and sensitivity by aggregating over batches.

We start by studying the impact of the number of batches Nbsubscript𝑁bN_{\rm b}italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT on the power of the NPLM-ALL aggregated test statistic (Eq. 14). We run the NPLM algorithm for different values of Nbsubscript𝑁bN_{\rm b}italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT and we compare the power curves of the aggregated test (Eq. 14) and the simple sum of tests (Eq. 11) resulting by combining them. We run the experiments on the various signal benchmarks reported in Table 2 for the EXPO-1D dataset and on Table 3 for the DIMUON-5D dataset.

Interestingly, the aggregated test shows similar or higher power as the number of batches increases. Conversely, we observe that the simple sum of tests suffers from significant power degradation as the number of batches increases. We argue that the regularisation benefit introduced by averaging over the batches has higher impact than the sensitivity loss due to data splitting. One possible explanation for such result can be found in the in-sample nature of the NPLM algorithm. Each trained model is by construction induced to (over)fit fluctuations in the data batch distribution even when those are not statistically significant in the single batch test. Combining the outcome of each batch training at the level of the functional forms of the model allow therefore to restore the suite of information carried along by each model, and exploit it to recover sensitivity to very rare signals. An illustrative example of the dynamics mentioned above is given in Figure 3. The panels represent the data distribution (black histograms) and reference distribution (light blue histograms) for the one-dimensional EXPO-1D dataset, when the data are split in four batches. A rare Gaussian signal is injected in the tail of the exponentially falling distribution. The light green solid lines represent the models learnt in each training. On the right hand side of the Figure, we report four panels showing the individual batches and the relative trained models. The models follow the statistical fluctuations of the training sample producing an imprecise view of the true model of the data. On the left hand side, the four models are combined to produce the aggregated version, showed in dark green. It is easy to notice the improved stability of the aggregated model with respect to the ones trained on single batches.

Refer to caption
Figure 3: Understanding the regularising effect of the batching strategy. Visual representation of the densities functions reconstructed by the NPLM models trained on different batches (light green lines), and after combining them (dark green line in the main panel). The reconstruction after combination is more robust to statistical fluctuations and approximates the true signal model better than each single one.

The observed behaviour is consistent across all the studied signal benchmarks for both datasets. We report in Figure 4 the power curves for one signal benchmark of the EXPO-1D dataset, and in Figure 5 the power curves for one signal benchmark of the DIMUON-5D dataset. Additional power plots for the aggregated and simple-sum tests for different values of σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT on different datasets and signal benchmarks are reported in Appendix 7.2.

Refer to caption
Refer to caption
Figure 4: NPLM-ALL. EXPO-1D – localised signal in the tail. Power curves for the aggregated test tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT (left panel) and the simple sum of tests tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT (right panel) at different number of batches. Data splitting improves the performances of tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT, while degrades tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT.
Refer to caption
Refer to caption
Figure 5: NPLM-ALL. DIMUON-5D – Z’ resonance at 300 GeV. Power curves for tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT (left panel) and tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT (right panel) at different number of batches. Data splitting improves the performances of tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT, while degrades tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT.

3.2 Improving anomaly detection on a single batch exploiting quasi-online learning.

The results presented in Section 3.1 suggest that the aggregation improves the quality of the density-ratio modelling. This motivates the use of NPLM in a quasi-online setup to extract powerful information from data even when only a fraction of the latter can be eventually stored (storage constrained scenario). This is always the case at collider experiments, where trigger algorithms are designed to filter the data. We study the impact of aggregating the modelling over multiple batches on the power of the NPLM test for a single data batch (NPLM-ONE scenario). For the EXPO-1D dataset, the data are split in eight batches, all of them are analysed via NPLM but only one is saved for testing. For the DIMUON-5D dataset, instead, the data are split in four batches. Figure 6 shows the results of our study for the narrow (left panel) and broad (right panel) signal benchmarks in the EXPO-1D dataset, whereas Figure 7 shows the results for two different injections of a Z’ resonant signal with invariant mass of 300 GeV, considered in the DIMUON-5D dataset. While not competitive with the ideal scenario of unlimited storage availability in which the full dataset can be tested (green line), the power curve of our experiment (lightblue line) shows significant improvement over the simple use of NPLM on a single batch (black line). Similar behaviours are found for different signal benchmarks in both the univariate and multivariate datasets. Additional plots are reported in Appendix 7.3.

Refer to caption
Refer to caption
Figure 6: NPLM-ONE. EXPO-1D. Power curves for tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT when evaluated over one single data batch (light blue line) and over the full dataset (green line), compared with the power of the test learnt and evaluated over one data batch (black line). The aggregation improves the accuracy over the learnt alternative model enhancing the sensitivity with respect to the single batch test. The two panels show the results of our tests for two signal benchmarks, the narrow peak (left panel) and the broad peak (right panel).
Refer to caption
Refer to caption
Figure 7: NPLM-ONE. DIMUON-5D. Power curves for tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT when evaluated over one single data batch (light blue line) and over the full dataset (green line), compared with the power of the test learnt and evaluated over one data batch (black line). The aggregation improves the accuracy over the learnt alternative model enhancing the sensitivity with respect to the single batch test.The two panels show the results of our tests for two signal benchmarks, the a Z’ resonance in the bulk of the mass spectrum (left panel) and and a Z’ resonance in the tail of the mass spectrum (right panel).

3.3 Anomaly-preserving data compression via likelihood-ratio

Refer to caption
Refer to caption
Figure 8: NPLM-SAT. EXPO-1D. Comparison of the aggregated and saturated test statistic power curves for a localized signal in the bulk (left panel) and in the tail (right panel). The experiments have been performed aggregating over 4 batches.

Finally, we consider the extreme scenario in which data cannot be permanently stored but are only available quasi-online for a limited time window. In this case, the proposed solution is storing the NPLM models {f𝐰,i}i+1Nbsuperscriptsubscriptsubscript𝑓𝐰𝑖𝑖1subscript𝑁b\{f_{{\rm{\bf{w}}},i}\}_{i+1}^{N_{\rm b}}{ italic_f start_POSTSUBSCRIPT bold_w , italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT as compressed representations of the data batches, and build the saturated likelihood-ratio test described in Eq. 16. The saturated test is not properly a likelihood-ratio test because the likelihood is not directly evaluated on the data points in 𝒟𝒟{\cal{D}}caligraphic_D. This in principle could lead to inefficiencies. A trivial failure example, for instance, is that of a signal localised in a low density region of the reference data distribution. Since the likelihood-ratio reconstructed by F𝐹Fitalic_F is only evaluated in the data points of the reference sample {\cal{R}}caligraphic_R, if the shape of the signal is underestimated then the test would not be sensitive to it. This intuition can be verified running numerical experiments of different nature. Figure 8 shows two representative experiments run on the EXPO-1D dataset. In the Figure we compare the power curve of the saturated test (black line) with that of the aggregated test evaluated over the full dataset (light blue line), or on one batch only (green line). For the signal located in the bulk of the reference distribution (left panel), the saturated test has comparable power to the aggregated test evaluated over the full dataset. Instead, for the Gaussian signal in the tail of the reference distribution (right panel), the power of the saturated test deteriorates, as fewer reference data points are available in the signal region to properly quantify the discrepancy. Similar results are found for the DIMUON-5D dataset (see Figure 9). Additional Figures for the remaining signal benchmarks are reported in Appendix 7.4.

Refer to caption
Refer to caption
Figure 9: NPLM-SAT. DIMUON-5D. Comparison of the aggregated and saturated test statistic power curves for a Z’ resonant signal in the bulk (left panel) and in the tail (right panel) of the mass spectrum considered in this work. The experiments have been performed aggregating over 4 batches.

4 Scaling up to big data LHC scenarios

In this Section we apply the batch-aggregation approach to a larger scale dataset, emulating a typical data stream acquired at the CMS experiment with a one light lepton (electron or muon) requirement at the first level of data filtering (L1 trigger). The dataset, initially introduced in Refs. [16, 17] and later adapted for anomaly detection [18] is publicly available on Zenodo [19, 20, 21, 22, 23]. It consists of simulated events for the Standard Model (SM) and four beyond the Standard Model signatures that we report below333Additional information about the dataset can be found in Ref. [18].:

  • A4l𝐴4𝑙A\rightarrow 4litalic_A → 4 italic_l: a neutral scalar boson (A𝐴Aitalic_A) decaying to two off-shell Z bosons, each forced to decay to two leptons;

  • h±τ±νsuperscriptplus-or-minussuperscript𝜏plus-or-minus𝜈h^{\pm}\rightarrow\tau^{\pm}\nuitalic_h start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT → italic_τ start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT italic_ν: a charged scalar boson (h±superscriptplus-or-minush^{\pm}italic_h start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT) decaying to a τ𝜏\tauitalic_τ lepton and a neutrino;

  • h0τ+τsuperscript0superscript𝜏superscript𝜏h^{0}\rightarrow\tau^{+}\tau^{-}italic_h start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT → italic_τ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT italic_τ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT: a scalar boson (h0superscript0h^{0}italic_h start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT) decaying to two τ𝜏\tauitalic_τ leptons;

  • LQbτLQ𝑏𝜏{\rm LQ}\rightarrow b\tauroman_LQ → italic_b italic_τ: a leptoquark (LQ) decaying to a b𝑏bitalic_b quark and a τ𝜏\tauitalic_τ lepton.

The dataset contains a list of 19 physics objects (4 muons, 4 electrons, 10 jets and the event missing transverse energy) for each collision event. Each object is characterised by three kinematic variables and a class label, for a total of 76 features. We run the batched version of NPLM over the 24 dimensional problem defined by considering the kinematic features of eight out of the nineteen objects: the missing transverse energy, the first two most energetic muons, the first two most energetic electrons, and first three most energetic jets. We consider a data sample 𝒟𝒟{\cal{D}}caligraphic_D of 1111 million SM events, on top of which signal events of the order of few parts per mill are injected to test the algorithm sensitivity. The marginal distributions over the input features for the Standard Model background and the four signal benchmarks are reported in Figures 12 and 13 in Appendix 7.1.

The reference sample {\cal{R}}caligraphic_R is taken to be 10101010 million SM events (ten times larger than the full 𝒟𝒟{\cal{D}}caligraphic_D sample), and the 𝒟𝒟{\cal{D}}caligraphic_D sample is split in 5 batches (Nb=5subscript𝑁b5N_{\rm b}=5italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT = 5). As for the kernel methods hyper parameters, we set M=10 000𝑀10000M=10\,000italic_M = 10 000 and λ=106𝜆superscript106\lambda=10^{-6}italic_λ = 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT. For sake of time, we study the performances of the model for a single value of the kernel width σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT, corresponding to the 90% quantile of the distribution of pair-wise distance between SM data points. We run the three versions of the algorithm presented in Section 3: train and test over all batches (NPLM-ALL), train over all and test only one batch (NPLM-ONE), and the saturated test on the model obtained averaging over all batches (NPLM-SAT).

Figure 10 shows the median Z-score reached by the three approaches for a 0.2% and 0.4% signal injection of the four different signal benchmarks. For completeness, we also show the sensitivity of the algorithm if only a fifth of the luminosity, namely one batch, is exploited by NPLM. The results of our tests confirm the main finding already outlined in Section 3. First, the NPLM-ONE approach exhibits increased performances with respect to the application of NPLM to a single batch, remarking that the aggregation via average proposed in this work allows to unveil sensitive information about the signal signature, even when the full set of data is not available at testing time. Moreover, when all the data are available at testing time (NPLM-ALL), the NPLM algorithm can take advantage of the full sample statistics to make discovery possible. This is the case for the A4l𝐴4𝑙A\rightarrow 4litalic_A → 4 italic_l and h±τ±νsuperscriptplus-or-minussuperscript𝜏plus-or-minus𝜈h^{\pm}\rightarrow\tau^{\pm}\nuitalic_h start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT → italic_τ start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT italic_ν signal benchmarks, where a 2 per mill signal injection (corresponding to a 2 sigma global normalization effect in our studies), can be raised to evidence (for A4l𝐴4𝑙A\rightarrow 4litalic_A → 4 italic_l) or discovery (for h±τ±νsuperscriptplus-or-minussuperscript𝜏plus-or-minus𝜈h^{\pm}\rightarrow\tau^{\pm}\nuitalic_h start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT → italic_τ start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT italic_ν) level of significance 444The Z-score values reported in Figure 10 are computed empirically up to 3. For larger values of the default NPLM, NPLM-ALL and NPLM-ONE, we rely on the asymptotic χ2superscript𝜒2\chi^{2}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT distribution of the test statistic under the null hypothesis (the compatibility of the NPLM test with a χ2superscript𝜒2\chi^{2}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT for kernel methods was previously studied in Refs. [6, 9]). For NPLM-SAT we don’t observe the emergence of an asymptotic χ2superscript𝜒2\chi^{2}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. However the null distribution has a good agreement with a normal distribution. We therefore use the normal asymptotic to extrapolate an estimate of Z-scores above 3. Points above 3 in the Figure should therefore be taken more as a qualitative trend than an accurate estimate.. For the A4l𝐴4𝑙A\rightarrow 4litalic_A → 4 italic_l and h±τ±νsuperscriptplus-or-minussuperscript𝜏plus-or-minus𝜈h^{\pm}\rightarrow\tau^{\pm}\nuitalic_h start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT → italic_τ start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT italic_ν signal benchmarks, we observe impressive reaches for the NPLM-SAT approach as well. This result aligns with what observed in the one-dimensional and five-dimensional experiments, namely that NPLM-SAT is as successful as NPLM-ALL when the signal support is mainly in the bulk of the reference sample support, as it is the case for the signal benchmarks considered here (see marginal distributions reported in Figures 12 and 13). Additionally, this result points out the potential of the NPLM split-aggregation method as a tool to compute anomaly-preserving summary statistics, allowing to save information about data that are only temporarily available and eventually discarded by the data acquisition system. We conclude this Section with commenting on the performances of NPLM on the h±τ±νsuperscriptplus-or-minussuperscript𝜏plus-or-minus𝜈h^{\pm}\rightarrow\tau^{\pm}\nuitalic_h start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT → italic_τ start_POSTSUPERSCRIPT ± end_POSTSUPERSCRIPT italic_ν and LQbτLQ𝑏𝜏{\rm LQ}\rightarrow b\tauroman_LQ → italic_b italic_τ signals, that are found to be poorer. We believe that the reason could reside in the choice of the features adopted in these studies – only a subset of the original nineteen physics objects is retained for this analysis. Moreover, the algorithm performances are sensitive to the size of the reference sample {\cal{R}}caligraphic_R – our studies suggest that the larger the sample the more accurate the result becomes, though more computationally demanding. Finally, one peculiar feature of this dataset is the so called “zero padding”, namely filling with zeros the entries corresponding to specific objects in the event that are not observed. The zero padding gives rise to sharp features that project great part of the data on lower dimensional surfaces of the input manifold. The problem of variable size events could be solved choosing a different data representations, model architecture, or by mapping the data to a lower dimensional embedding. These interesting directions are out of the scope of this paper and are left for future work.

Refer to caption
Figure 10: CMS-L1-24D Summary plot. Sensitivity reach in term of the median Z-scores as a function of the fraction of signal injection. The total number of background events sum to 1 million. The inspected signal fractions are of the order of few per mill. Each panel corresponds to a different signal benchmark.

5 Computational resources and training time

In this Section, we report the details of the algorithm implementation, highlighting the typical execution time and computational resources needed for the numerical experiments studied in this work.

Execution time.

The execution time of a single instance of the NPLM algorithm based on kernel methods depends on several factors: the total number of data points, the number of kernels M𝑀Mitalic_M, the kernel width σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT, the smoothness parameter λ𝜆\lambdaitalic_λ, and on the hardware resources employed. Our implementation relies on the Falkon library [13], that allows to speed up the fitting time by optimally exploiting GPUs [6].

dataset M𝑀Mitalic_M σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT λ𝜆\lambdaitalic_λ NsubscriptN{\rm{N}}_{\cal{R}}roman_N start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT N𝒟subscriptN𝒟{\rm{N}}_{\cal{D}}roman_N start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT (all) Time (single batch)
EXPO-1D 1k [0.1, 0.3, 0.7, 1.4, 3.0] 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT 200k 16k 0.5–1.5 s
DIMUON-5D 10k [0.8, 1.9, 3.0, 4.4, 6.6] 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT 40k 8k 3–12 s
L1CMS-24D 10k [10.5] 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT 10M 1M 40-70 s
Table 1: Model hyper parameters and execution time. Unless stated otherwise, all instances of the NPLM algorithm for each dataset follow the settings described in this table.

We report in Table 1 the typical time range for a single batch execution for the hyper parameters choice adopted in our experiments. The variance within experiments on the same dataset is mainly due to the σFLKsubscript𝜎FLK\sigma_{\rm FLK}italic_σ start_POSTSUBSCRIPT roman_FLK end_POSTSUBSCRIPT value, smaller values require a higher number of iterations of the Falkon algorithm to reach convergence. The main difference in execution time between the univariate and multivariate cases is due to the model size, being the number of kernels used in the five-dimensional problems a factor 10 larger than in the one-dimensional case. The training time is not significantly affected by the batch size, as the reference sample size is dominating the overall size of the training sample. For large samples and complex models the execution time increases exponentially with the dataset size (see studies reported in Ref. [6]), and the benefit of batching the dataset in terms of execution time become more striking. This has been observed, for instance, when studying the L1CMS-24D.

Storage resources.

Aggregating over batches requires both temporary and permanents computational and storage resources. The temporary resources are used to train each single instance of the algorithm and should be sufficient to store the data batch input to the model, and the model itself. Importantly, the model location should be accessible by all data batches to allow estimating FNbsuperscript𝐹subscript𝑁bF^{N_{\rm b}}italic_F start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT according to Eq. 13. The reference sample {\cal{R}}caligraphic_R and the values that the aggregated model FNbsuperscript𝐹subscript𝑁bF^{N_{\rm b}}italic_F start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT takes in the test points should instead be stored in a permanent location. The reference sample is a static input to the NPLM algorithm and is common to all training instances.555As previously stated, this work doesn’t cover the case of an imperfect reference, namely the case in which the reference model is affected by systematic uncertainties and its optimal configuration given the experimental condition is not constant. We leave this case for future work. Together with the test data, it is used to asses the value of the test in all three versions described in Section 3. The amount of memory resources to be allocated for the aggregated model depends on the number of data points required by the test. For the test in Eq. 14, the aggregated function FNbsuperscript𝐹subscript𝑁bF^{N_{\rm b}}italic_F start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is evaluated over all data points and reference points, requiring a vector of length N+N𝒟subscriptNsubscriptN𝒟{\rm{N}}_{\cal{R}}+{\rm{N}}_{\cal{D}}roman_N start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT + roman_N start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT. For the test in Eq. 15, only a subset Ntest<Nbsubscript𝑁testsubscript𝑁bN_{\rm test}<N_{\rm b}italic_N start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT < italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT of batches is used for evaluation and therefore the vector length reduces to N+NtestNbN𝒟subscriptNsubscript𝑁testsubscript𝑁bsubscriptN𝒟{\rm{N}}_{\cal{R}}+\frac{N_{\rm test}}{N_{\rm b}}{\rm{N}}_{\cal{D}}roman_N start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT + divide start_ARG italic_N start_POSTSUBSCRIPT roman_test end_POSTSUBSCRIPT end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_b end_POSTSUBSCRIPT end_ARG roman_N start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT. Finally for the saturated test in Eq. 16, the aggregated function is only evaluated over the reference set, requiring a NsubscriptN{\rm{N}}_{\cal{R}}roman_N start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT length vector only. For the first two cases, the storage resources needed scale linearly with the data taking time, while in the last case they are constant. The implications of information loss on the sensitivity of the different summary statistics are discussed in Section 3.2 and 3.3.

6 Conclusions

In this work we explore the possibility of scaling the use of NPLM to test large size data samples in presence of variable computational and storage resource constraints. We propose to split the dataset in batches and run parallel instances of the NPLM algorithm on each of them. The final results are combined averaging the density-ratio functions learnt by each instance of the NPLM algorithm. We apply this strategy to both univariate and multivariate benchmarks representative of typical data distributions in particle physics experiments. Our experiments show that the split-aggregation strategy preserves or even surpass the sensitivity power of the NPLM test applied to the full sample (original implementation). The reason for the success is mainly related to the uniqueness of its training strategy, aiming at finding the maximum likelihood fit of the model to the input data, a representation of the true model underneath the data that over-fits the specific sample used for training. While over-fitting is commonly evaded in machine learning algorithms, here it assumes a crucial role in preserving data structures that do not generalise, either because they resemble statistical fluctuations or because they are especially rare traits of the data. Preserving rare structures is what makes the NPLM batching successful in recombining subtle hints of anomalous events into a statistically significant signal, also suggesting that the density-ratio functions learnt by the NPLM algorithm could be good candidates to build anomaly-preserving summary statistics.

Moreover, the new NPLM split-aggregation strategy offers a solution to offline signal-agnostic analyses of collider data, often characterised by large sample sizes. It is beneficial both in terms of training time and sensitivity performances. thanks to its effect of regularization against statistical fluctuations, able in some cases to increase the significance of the observed discrepancy. It should be mentioned that rigorously treating systematic uncertainties affecting the reference sample in this context is crucial. In this work we don’t address the problem of systematic uncertainties. A way to address systematic uncertainties in the original NPLM implementation was proposed in Ref. [5] for neural network based models. In the split-aggregation approach additional challenges arise from the fact that each batch has access only to “local” information about uncertainties. An accurate global assessment of the systematic effects via nuisance parameters estimation needs therefore the design of a successful aggregation strategy over the batches (see for instance Ref. [24]). The extension of the approach proposed in Ref. [5] to handle systematic uncertainties in the split-aggregation NPLM strategy is subject of ongoing study and left for future work.

The potential application of the new batch-based NPLM strategy extends beyond the offline analysis, to quasi-online signal-agnostic analyses of streamed data that are only temporarily available. Online approaches to data analyses allow to inspect collider data at the experiments prior the stage of filtering applied by the triggers, opening the way to the exploration of phase space regions never analysed before, at significantly high rates. Efforts in this direction are ongoing at the LHC experiments. The CMS experiment is investigating new approaches to analyse the experimental data at the collision rate of 40 MHz with focus on the muon detector chambers and the electromagnetic calorimeters [25, 26]. Moreover, the LHCb experiment has recently upgraded to a triggerless data readout at a rate of roughly 30 MHz, with partial event reconstruction that allows to reduce the rate down to 1 MHz prior the final trigger stage [27]. In this work we proposed two possible solutions to run the NPLM algorithm over a continuous stream of data under storage and computational resources constraints. In case of limited storage, we propose to train NPLM over multiple batches and evaluate the test only on one batch available for long term storage (NPLM-ONE). In complete absence of permanent storage, we propose to train NPLM over multiple batches and perform a goodness-of-fit test of the aggregated density model following the saturated likelihood-ratio approach (NPLM-SAT). Experiments run under constrained storage scenarios (NPLM-ONE) show that performing the aggregation over batches improves the modelling of the signal increasing the power of the NPLM test applied to a single batch. As for the saturated test, we observe sensitivity powers comparable to those of the NPLM algorithm applied to the full statistics when the signal lays on high density regions, while the power is only partially recovered for signals localised in the tails. In both cases, the sensitivity is expected to scale as the statistics of the processed data increases. Online analysis is a promising avenue for anomaly detection. The smart use of machine learning algorithms can leverage large statistics to efficiently navigate unexplored data and recognise novel rare processes of unexpected nature or location. The studies presented in this work are a first step in the direction of investigating online or quasi-online solutions for statistical anomaly detection with the NPLM algorithm. Detailed studies on scalability and uncertainties quantification in typical LHC data streaming scenarios are left for future work.

Acknowledgments

The author would like to thank Siddharth Mishra-Sharma, Phil Harris, Marco Zanetti and Marco Letizia for the useful conversations and constructive feedback on the manuscript.

References

  • [1] M. Williams, “How good are your fits? Unbinned multivariate goodness-of-fit tests in high energy physics,” JINST, vol. 5, p. P09004, 2010.
  • [2] C. Weisser and M. Williams, “Machine learning and multivariate goodness of fit,” 12 2016.
  • [3] R. T. D’Agnolo and A. Wulzer, “Learning New Physics from a Machine,” Phys. Rev. D, vol. 99, no. 1, p. 015014, 2019.
  • [4] R. T. D’Agnolo, G. Grosso, M. Pierini, A. Wulzer, and M. Zanetti, “Learning multivariate new physics,” Eur. Phys. J. C, vol. 81, no. 1, p. 89, 2021.
  • [5] R. T. d’Agnolo, G. Grosso, M. Pierini, A. Wulzer, and M. Zanetti, “Learning new physics from an imperfect machine,” Eur. Phys. J. C, vol. 82, no. 3, p. 275, 2022.
  • [6] M. Letizia, G. Losapio, M. Rando, G. Grosso, A. Wulzer, M. Pierini, M. Zanetti, and L. Rosasco, “Learning new physics efficiently with nonparametric methods,” Eur. Phys. J. C, vol. 82, no. 10, p. 879, 2022.
  • [7] G. Grosso, M. Letizia, M. Pierini, and A. Wulzer, “Goodness of fit by Neyman-Pearson testing,” 5 2023.
  • [8] J. Neyman and E. S. Pearson, “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Phil. Trans. Roy. Soc. Lond. A, vol. 231, no. 694-706, pp. 289–337, 1933.
  • [9] G. Grosso, N. Lai, M. Letizia, J. Pazzini, M. Rando, L. Rosasco, A. Wulzer, and M. Zanetti, “Fast kernel methods for data quality monitoring as a goodness-of-fit test,” Mach. Learn. Sci. Tech., vol. 4, no. 3, p. 035029, 2023.
  • [10] Y. Zhang, J. Duchi, and M. Wainwright, “Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates,” Journal of Machine Learning Research, vol. 16, no. 102, pp. 3299–3340, 2015.
  • [11] R. D. Cousins, “Lectures on Statistics in Theory: Prelude to Statistics in Practice,” 7 2018.
  • [12] A. Butter, S. Diefenbacher, G. Kasieczka, B. Nachman, T. Plehn, D. Shih, and R. Winterhalder, “Ephemeral Learning - Augmenting Triggers with Online-Trained Normalizing Flows,” SciPost Phys., vol. 13, no. 4, p. 087, 2022.
  • [13] G. Meanti, L. Carratino, L. Rosasco, and A. Rudi, “Kernel methods through the roof: handling billions of points efficiently,” Advances in Neural Information Processing Systems, vol. 33, pp. 14410–14422, 2020. arXiv:2006.10350 [cs.LG].
  • [14] A. Rudi, L. Carratino, and L. Rosasco, “Falkon: An optimal large scale kernel method,” Advances in neural information processing systems, vol. 30, 2017.
  • [15] U. Marteau-Ferey, F. Bach, and A. Rudi, “Globally convergent newton methods for ill-conditioned generalized self-concordant losses,” Advances in Neural Information Processing Systems, vol. 32, 2019. arXiv:1907.01771 [math.OC].
  • [16] O. Cerri, T. Q. Nguyen, M. Pierini, M. Spiropulu, and J.-R. Vlimant, “Variational Autoencoders for New Physics Mining at the Large Hadron Collider,” JHEP, vol. 05, p. 036, 2019.
  • [17] O. Knapp, O. Cerri, G. Dissertori, T. Q. Nguyen, M. Pierini, and J.-R. Vlimant, “Adversarially Learned Anomaly Detection on CMS Open Data: re-discovering the top quark,” Eur. Phys. J. Plus, vol. 136, no. 2, p. 236, 2021.
  • [18] E. Govorkova, E. Puljak, T. Aarrestad, M. Pierini, K. A. Woźniak, and J. Ngadiuba, “LHC physics dataset for unsupervised New Physics detection at 40 MHz,” Sci. Data, vol. 9, p. 118, 2022.
  • [19] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: Training Dataset,” June 2021.
  • [20] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: A -¿ 4 leptons Signal Benchmark Dataset,” Oct. 2022.
  • [21] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: LQ -¿ b tau Signal Benchmark Dataset,” Oct. 2022.
  • [22] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: h0superscript0h^{0}italic_h start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT -¿ tau tau Signal Benchmark Dataset,” Oct. 2022.
  • [23] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: h+ -¿ tau nu Signal Benchmark Dataset,” Oct. 2022.
  • [24] L. Heinrich, S. Mishra-Sharma, C. Pollard, and P. Windischhofer, “Hierarchical Neural Simulation-Based Inference Over Event Ensembles,” 6 2023.
  • [25] R. Ardino et al., “A 40 MHz Level-1 trigger scouting system for the CMS Phase-2 upgrade,” Nucl. Instrum. Meth. A, vol. 1047, p. 167805, 2023.
  • [26] M. Migliorini, J. Pazzini, A. Triossi, and M. Zanetti, “40 MHz triggerless readout of the CMS Drift Tube muon detector,” JINST, vol. 19, no. 02, p. C02050, 2024.
  • [27] R. Aaij et al., “Allen: A high level trigger on GPUs for LHCb,” Comput. Softw. Big Sci., vol. 4, no. 1, p. 7, 2020.
  • [28] G. Grosso, R. T. D’Agnolo, M. Pierini, A. Wulzer, and M. Zanetti, “Nplm: Learning multivariate new physics,” Jan. 2021.
\beginsupplement

7 Supplementary Materials

7.1 Datasets details

EXPO 1D.

This simple univariate setup introduced in Ref. [3] and further studied in Ref. [5, 7] represents an energy or transverse-momentum spectrum that falls exponentially. Such type of distributions are fairly common in collider physics experiments. Studying GoF techniques in this setup is thus illustrative of some of the challenges associated with the search for new physics at these experiments. The density distribution under the reference model is defined as

n(x|R)=N(R)ex,𝑛conditional𝑥RNRsuperscript𝑒𝑥\displaystyle n(x|{{\rm{R}}})={\rm N(R)}\,e^{-x}\,,italic_n ( italic_x | roman_R ) = roman_N ( roman_R ) italic_e start_POSTSUPERSCRIPT - italic_x end_POSTSUPERSCRIPT , (21)

where N(R)NR{\rm N(R)}roman_N ( roman_R ) denotes the number of expected events in the dataset. In our studies we set N(R)NR{\rm N(R)}roman_N ( roman_R ) at 16k16k16{\rm k}16 roman_k events for the full dataset 𝒟𝒟{\cal{D}}caligraphic_D, and we consider batches of size 8k8k8{\rm k}8 roman_k (2 batches), 4k4k4{\rm k}4 roman_k (4 batches) or 2k2k2{\rm k}2 roman_k (8 batches) events. To test the model ability to detect features of various narrowness and in different locations, we consider four Gaussian signals

n(x|Gμ,σ)=N(S)12πσNP,ie(xμ)22σ2,𝑛conditional𝑥subscriptG𝜇𝜎NS12𝜋subscript𝜎NPisuperscript𝑒superscript𝑥𝜇22superscript𝜎2\displaystyle n(x|{\rm G}_{\mu,\sigma})={\rm N(S)}\frac{1}{\sqrt{2\pi}\sigma_{% \rm NP,i}}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\,,italic_n ( italic_x | roman_G start_POSTSUBSCRIPT italic_μ , italic_σ end_POSTSUBSCRIPT ) = roman_N ( roman_S ) divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG italic_σ start_POSTSUBSCRIPT roman_NP , roman_i end_POSTSUBSCRIPT end_ARG italic_e start_POSTSUPERSCRIPT - divide start_ARG ( italic_x - italic_μ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_POSTSUPERSCRIPT , (22)

and an excess in the tail defined by

n(x|E)=N(S)2x2ex.𝑛conditional𝑥ENS2superscript𝑥2superscript𝑒𝑥\displaystyle n(x|{\rm E})=\frac{\rm N(S)}{2}x^{2}e^{-x}\,.italic_n ( italic_x | roman_E ) = divide start_ARG roman_N ( roman_S ) end_ARG start_ARG 2 end_ARG italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT - italic_x end_POSTSUPERSCRIPT . (23)

The data distribution under the alternative hypotheses has therefore the following form

n(x|Hi)=n(x|R)+n(x|S),𝑛conditional𝑥subscriptHi𝑛conditional𝑥R𝑛conditional𝑥S\displaystyle n(x|{\rm H_{i}})=\,n(x|{\rm R})+n(x|{\rm S})\,,italic_n ( italic_x | roman_H start_POSTSUBSCRIPT roman_i end_POSTSUBSCRIPT ) = italic_n ( italic_x | roman_R ) + italic_n ( italic_x | roman_S ) , (24)

with S={Gμ,σ,E}SsubscriptG𝜇𝜎E{\rm S}=\{{\rm G}_{\mu,\sigma},{\rm E}\}roman_S = { roman_G start_POSTSUBSCRIPT italic_μ , italic_σ end_POSTSUBSCRIPT , roman_E }. For this toy model, the ideal test statistic according to Neyman–Pearson can be computed analytically as

tidS(𝒟)=2log[eN(S)x𝒟n(x|S)+n(x|R)n(x|R)].superscriptsubscript𝑡𝑖𝑑S𝒟2superscript𝑒NSsubscriptproduct𝑥𝒟𝑛conditional𝑥S𝑛conditional𝑥R𝑛conditional𝑥Rt_{id}^{\rm S}({{\cal{D}}})=2\log\left[e^{\rm-N(S)}\prod_{x\in{{\cal{D}}}}% \frac{n(x|{\rm S})+n(x|{\rm R})}{n(x|{\rm R})}\right]\,.italic_t start_POSTSUBSCRIPT italic_i italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_S end_POSTSUPERSCRIPT ( caligraphic_D ) = 2 roman_log [ italic_e start_POSTSUPERSCRIPT - roman_N ( roman_S ) end_POSTSUPERSCRIPT ∏ start_POSTSUBSCRIPT italic_x ∈ caligraphic_D end_POSTSUBSCRIPT divide start_ARG italic_n ( italic_x | roman_S ) + italic_n ( italic_x | roman_R ) end_ARG start_ARG italic_n ( italic_x | roman_R ) end_ARG ] . (25)

The fraction of signal injection over reference events, N(S)/N(R)NSNR\rm N(S)/N(R)roman_N ( roman_S ) / roman_N ( roman_R ), the Z-score666The Z-score is defined as the quantile of a standard normal distribution whose survival function matches the p𝑝pitalic_p-value p𝑝pitalic_p: Z=Γ1(1p)𝑍superscriptΓ11𝑝Z=\Gamma^{-1}(1-p)italic_Z = roman_Γ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( 1 - italic_p ) for the ideal test (Zidsubscript𝑍𝑖𝑑Z_{id}italic_Z start_POSTSUBSCRIPT italic_i italic_d end_POSTSUBSCRIPT), as well as the location μ𝜇\muitalic_μ and scale σ𝜎\sigmaitalic_σ of the Gaussian signals are reported in Table 2.

μ𝜇\muitalic_μ σ𝜎\sigmaitalic_σ N(S)/N(R) Zidsubscript𝑍𝑖𝑑Z_{id}italic_Z start_POSTSUBSCRIPT italic_i italic_d end_POSTSUBSCRIPT (full batch)
Gaussian peaks bulk 1.6 0.16 1.5021.5superscript021.5\cdot 0^{-2}1.5 ⋅ 0 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT 4.9
broad 4 0.64 6.5036.5superscript036.5\cdot 0^{-3}6.5 ⋅ 0 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 4.2
narrow 4 0.01 1.5031.5superscript031.5\cdot 0^{-3}1.5 ⋅ 0 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 4.8
tail 6.4 0.16 1.5031.5superscript031.5\cdot 0^{-3}1.5 ⋅ 0 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 4.0
Tail excess excess 1.5021.5superscript021.5\cdot 0^{-2}1.5 ⋅ 0 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT 4.5
Table 2: Summary of the 1D signal benchmarks.
Refer to caption
Figure 11: Expo-1D signal benchmarks

Running the NPLM algorithm requires a reference sample ({\cal{R}}caligraphic_R) to compare the data 𝒟𝒟{\cal{D}}caligraphic_D with. Our {\cal{R}}caligraphic_R sample is composed of N=200ksubscriptN200𝑘{\rm{N}}_{\cal{R}}=200kroman_N start_POSTSUBSCRIPT caligraphic_R end_POSTSUBSCRIPT = 200 italic_k events, sampled according to the RR{\rm{R}}roman_R hypothesis. It should be noticed that this studied is conducted under the assumption that we can sample synthetic data from the RR{\rm{R}}roman_R hypothesis, which is perfectly known, at will.

Dimuon 5D.

The second benchmark considered in this work consists of a set of Monte Carlo simulated proton-proton collisions happening at 13 TeV at the LHC, with two opposite charged muons in the final state Ref. [28]. In this dataset, each event is represented by five variables describing the kinematics of the dimuon system: the transverse momentum of each muon (pT,1(2)subscript𝑝𝑇12p_{T,1(2)}italic_p start_POSTSUBSCRIPT italic_T , 1 ( 2 ) end_POSTSUBSCRIPT), their pseudorapidities (η1(2)subscript𝜂12\eta_{1(2)}italic_η start_POSTSUBSCRIPT 1 ( 2 ) end_POSTSUBSCRIPT), and the relative azimuthal angle between the two objects (Δϕ12=ϕ1ϕ2Δsubscriptitalic-ϕ12subscriptitalic-ϕ1subscriptitalic-ϕ2\Delta\phi_{12}=\phi_{1}-\phi_{2}roman_Δ italic_ϕ start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT = italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_ϕ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT). We focus on final states with transverse momenta greater than 20 GeV, pseudorapidities lower than 2.1 in absolute value, and with invariant mass of the dimuon system larger than 60 GeV. The dominant background process in this configuration is the Drell-Yan, that for sake of simplicity we consider as the only source of background. In our experiments the total number of expected events in the SM background hypothesis is around N(R)=8000NR8000{\rm{N}}({\rm{R}})=8000roman_N ( roman_R ) = 8000, after the acceptance selections are applied. We then study the impact of splitting the data in four batches and combining them via the batch-based strategy proposed in Section 3. As signal benchmarks, we considered a set of Zsuperscript𝑍Z^{\prime}italic_Z start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bosons with variable mass and width and a set of EFT scenarios described by the dimension-6 4-fermions Lagrangian cWΛJLμaJLaμsubscript𝑐𝑊Λsuperscriptsubscript𝐽𝐿𝜇𝑎superscriptsubscript𝐽𝐿𝑎𝜇\frac{c_{W}}{\Lambda}J_{L\mu}^{a}J_{La}^{\mu}divide start_ARG italic_c start_POSTSUBSCRIPT italic_W end_POSTSUBSCRIPT end_ARG start_ARG roman_Λ end_ARG italic_J start_POSTSUBSCRIPT italic_L italic_μ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT italic_J start_POSTSUBSCRIPT italic_L italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT, with variable Wilson coefficients. Details on the signal benchmarks are given in Table 3.

Mass (GeV) 𝚺𝚺\boldsymbol{\Sigma}bold_Σ (GeV) N(S)/N(R) Zidsubscript𝑍𝑖𝑑Z_{id}italic_Z start_POSTSUBSCRIPT italic_i italic_d end_POSTSUBSCRIPT (full batch)
Z’ scenarios 600 16 61046superscript1046\cdot 10^{-4}6 ⋅ 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT 3
16 1.51031.5superscript1031.5\cdot 10^{-3}1.5 ⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 7.1
300 8 2.51032.5superscript1032.5\cdot 10^{-3}2.5 ⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 5
8 51035superscript1035\cdot 10^{-3}5 ⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 8.8
200 5 51035superscript1035\cdot 10^{-3}5 ⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 5.1
5 11021superscript1021\cdot 10^{-2}1 ⋅ 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT 9.1
𝒄𝒘(𝐓𝐞𝐕𝟐)subscript𝒄𝒘superscript𝐓𝐞𝐕2\boldsymbol{c_{w}({\rm TeV^{-2}})}bold_italic_c start_POSTSUBSCRIPT bold_italic_w end_POSTSUBSCRIPT bold_( bold_TeV start_POSTSUPERSCRIPT bold_- bold_2 end_POSTSUPERSCRIPT bold_) N(D)/N(R)
EFT scenarios 1.0 1.002 3.7
Table 3: Summary of the MUMU signal benchmarks. The signal yields, N(S), and data yileds, N(D), are the ones expected for the luminosity of one batch in our experiments configuration.

CMS-L1 24D.

As supplementary material we report the marginal distribution of the SM dataset and the four signal benchmarks over the 24 input variables considered in this work (Figures 1213). The narrow peak at zero present in all plots is the effect of “zero padding” (see main text for more comments).

Refer to caption
(a) Leading muon.
Refer to caption
(b) Subleading muon.
Refer to caption
(c) Leading electron.
Refer to caption
(d) Subeading electron.
Figure 12: Input features for the CMS-L1-24D dataset (panel 1). In each row we show the transverse momentum, pseudorapidity and azymuthal angle of an object.
Refer to caption
(a) Missing transverse energy.
Refer to caption
(b) Leading jet.
Refer to caption
(c) Second leading jet.
Refer to caption
(d) Third leading jet.
Figure 13: Input features for the CMS-L1-24D dataset (panel 2). In each row we show the transverse momentum, pseudorapidity and azymuthal angle of an object.

7.2 Improving regularization and sensitivity by aggregating over batches.

In this Appendix we provide additional plots showing the power curves of the aggregation method NPLM-ALL proposed in this work, both for the univariate EXPO-1D dataset (Figure 14) and for the multivariate MUMU-5D one (Figures 15 and 16). The power of the method is compared with a simple sum of tests (details in Sections 3 and 3.1).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 14: EXPO-1D, NPLM-ALL. Power curves for tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT (top row) and tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT (bottom row) at different number of batches. Each column shows a different signal benchmark. Data splitting improves the performances of tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT, while degrades tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 15: MUMU-5D, NPLM-ALL (panel 1). Power curves for tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT (top row) and tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT (bottom row) at different number of batches. Each column shows a different signal benchmark. Data splitting maintains or improves the performances of tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT, while degrades tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 16: MUMU-5D, NPLM-ALL (panel 2). Power curves for tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT (top row) and tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT (bottom row) at different number of batches. Each column shows a different signal benchmark. Data splitting maintains or improves the performances of tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT, while degrades tSUMsubscript𝑡SUMt_{\rm SUM}italic_t start_POSTSUBSCRIPT roman_SUM end_POSTSUBSCRIPT.

7.3 Impact of aggregation on a single batch test

We provide in Figure 17 additional plots showing the power curves of the aggregation method NPLM-ONE proposed in this work. The power of the method is compared with a simple sum of tests (details in Sections 3 and 3.2).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 17: NPLM-ONE. DIMUON-5D. Power curves for the aggregated test tAGGRsubscript𝑡AGGRt_{\rm AGGR}italic_t start_POSTSUBSCRIPT roman_AGGR end_POSTSUBSCRIPT when evaluated over one single data batch (light blue line) and over the full dataset (green line), compared with the power of the test learnt and evaluated over one data batch (black line). The aggregation improves the accuracy over the learnt alternative model enhancing the sensitivity with respect to the single batch test.The two panels show the results of our tests for two signal benchmarks, the a Z’ resonance in the tail of the mass spectrum (left panel) and and a EFT scenario (right panel).

7.4 Saturated test statistic

In this Appendix we provide additional plots showing the power curves of the aggregation method NPLM-SAT proposed in this work (see Figures 18 and 19). The power of the method is compared with a simple sum of tests (details in Sections 3 and 3.3).

Refer to caption
Refer to caption
Refer to caption
Figure 18: NPLM-SAT. EXPO-1D. Comparison of the aggregated and saturated test statistic power curves for a narrow gaussian signal (left panel), a broad gaussian signal (central panel), and a non resonant excess in the tail (right panel) of the exponential falling reference distribution. The experiments have been performed splitting the dataset in 8 batches. We show in black the power curve of the saturated test (NPLM-SAT), in lightblue the power curve of the aggregated test evaluated over the full dataset (NPLM-ALL), and in green the one of the aggregated test evaluated over one batch (NPLM-ONE).
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 19: NPLM-SAT. DIMUON-5D. Comparison of the aggregated and saturated test statistic power curves for a Z’ resonant signal in the bulk (left panel) and in the tail (right panel) of the mass spectrum considered in this work. The experiments have been performed splitting the dataset in 4 batches