At the time of this study at ]Sapienza University of Rome, Physics Department, Piazzale Aldo Moro 5, 00185 Rome, Italy At the time of this study at ]Sapienza University of Rome, Physics Department, Piazzale Aldo Moro 5, 00185 Rome, Italy Also at ]Complexity Science Hub (CSH) Vienna, A-1080 Vienna, Austria

Inference through innovation processes tested in the authorship attribution task

Giulio Tani Raffaelli [ Institute of Computer Science of the Czech Academy of Sciences, Pod Vodárenskou věží 2, 182 07 Prague, Czech Republic    Margherita Lalli [ Scuola Normale Superiore, Piazza dei Cavalieri 7, Pisa, Italy    Francesca Tria [ francesca.tria@uniroma1.it Sapienza University of Rome, Physics Department, Piazzale Aldo Moro 5, 00185 Rome, Italy
(July 5, 2024)
Abstract

Abstract

Urn models for innovation capture fundamental empirical laws shared by several real-world processes. The so-called urn model with triggering includes, as particular cases, the urn representation of the two-parameter Poisson-Dirichlet process and the Dirichlet process, seminal in Bayesian non-parametric inference. In this work, we leverage this connection to introduce a general approach for quantifying closeness between symbolic sequences and test it within the framework of the authorship attribution problem. The method demonstrates high accuracy when compared to other related methods in different scenarios, featuring a substantial gain in computational efficiency and theoretical transparency. Beyond the practical convenience, this work demonstrates how the recently established connection between urn models and non-parametric Bayesian inference can pave the way for designing more efficient inference methods. In particular, the hybrid approach that we propose allows us to relax the exchangeability hypothesis, which can be particularly relevant for systems exhibiting complex correlation patterns and non-stationary dynamics.

preprint: PREPRINT NUMBER

I Introduction

Innovation enters a wide variety of human activities and natural processes, from artistic and technological production to the emergence of new behaviours or genomic variants. At the same time, the encounter with novelty permeates our daily lives more extensively than we typically realise. We continuously meet new people, learn and incorporate new words into our lexicon, listen to new songs, and embrace new technologies. Although innovation and novelties (i.e., new elements at the individual or local level) operate at different scales, we can describe their emergence within the same framework, at least in certain respects [1]. Shared statistical features, including the well-known Heaps’ [2], Taylor’s [3, 4, 5, 6] and Zipf’s [7, 8] laws, suggest a common underlying principle governing their emergence. In this respect, an intriguing concept is the expansion into the adjacent possible [9]. The adjacent possible refers to the set of all the potential innovations or novelties attainable at any given time. When one of these possibilities is realised, the space of the actual enlarges, making additional possibilities achievable and thus expanding the adjacent possible. The processes introduced in [1] provide a mathematical formalisation of these concepts, extending Polya’s urn model [10] to accommodate infinitely many colours. They generate sequences of items exhibiting Heaps’, Zipf’s, and Taylor’s laws. The most general formulation of the modelling scheme proposed in [1], the urn model with semantic triggering, also captures correlations in the occurrences of novelties, as observed in real-world systems. Further generalisations have been explored to capture the empirical phenomenology in diverse contexts: network growth and evolution [11], the varied destinies of different innovations [12], and mutually influencing events [13]. Additionally, the proposed modelling scheme can be cast within the framework of random walks on graphs, offering further intriguing perspectives and broadening its scope of applications [14, 15, 16, 17].

We now want to address the question of whether these generative models can also be successfully used in inference problems. This question is further motivated by the precise connection that has been established [5, 6] between the urn models in [1] and seminal processes in Bayesian nonparametrics. The latter is a powerful tool for inference and prediction in innovation systems, where possible states or realisations are not predefined and fixed once and for all. Nonparametric Bayesian inference enables us to assign probabilities to unseen events and to deal with an ever-increasing number of new possibilities. Various applications have been proposed in diverse fields, including (but not limited to) estimation of diversity [18, 19, 20, 21, 22], classification problems [23, 24], Bayesian modelling of complex networks [25, 26], and they take a considerable role in Natural Language Processing [27, 28].

The simplest model described in [1], the urn model with triggering (UMT), reproduces, with a specific parameter setting, the conditional probabilities that define the two-parameter Poisson-Dirichlet process [29], referred to as PD hereafter, that generalises the Dirichlet process [30]. The PD and the Dirichlet processes have gained special relevance as priors in Bayesian nonparametrics due to their generality and manageability [31], and the PD process predicts the Heaps, Zipf and Taylor laws, making its use more convenient in linguistically motivated problems.

Here, we aim to explore the potential of the outlined connection between urn models for innovation and priors for Bayesian nonparametric inference. As a sample application, we address the authorship attribution task [32].

The PD and Dirichlet processes have already been considered as underlying models for natural language processing and for authorship attribution purposes. The proposed procedures interpret the outputs of PD (or Dirichlet) processes as sequences of identifiers for distributions over words (i.e., topics) [33] and measure similarity among texts or authors based on topics’ similarity [34, 35]. We briefly discuss topic models in the Methods section. It is worth stressing here that these approaches have led to hierarchical formulations that require efficient sampling algorithms for solving the problem of computing posterior probabilities [33, 28, 36, 37]. Moreover, these methods strongly rely on exchangeability, mainly due to the property of conditional independence it implies, through the de Finetti and Kingman theorems [38, 39], and for guaranteeing the feasibility of the Gibbs sampling procedure [27, 28]. Exchangeability refers to the property of the joint probability of a sequence of random variables being invariant under permutations of the elements. Notwithstanding the powerful tools it provides, this assumption is often unrealistic when modelling real-world processes.

We take a different perspective by interpreting the outputs of the underlying stochastic processes directly as sequences of words in texts or, more generally, tokens. Language serves as a paradigmatic example where novelty enters at different scales, ranging from true innovation—creation and diffusion of new words or meanings—to what we denote as novelties—the first time an individual adopts or encounters (or an author uses in their production) a word or expression. We thus borrow from information theory [40, 41] the conceptualization of a text as an instance of a stochastic process and consider urn models for innovation processes as underlying generative models. Specifically, we here consider the UMT model in its exchangeable version, which is equivalent to the PD process. We opt out of a fully Bayesian approach and use a heuristic method to determine the base distribution of the process—that is, the prior distribution of the items expected to appear in the sequence.

The overall change in perspective we adopt allows us to avoid the Monte Carlo sampling required in hierarchical methods. Moreover, while we consider here an exchangeable model, exchangeability is not crucial in our approach, paving the way for an urn-based inferential method that considers time-dependent correlations among items.

When comparing our method to various approaches used in authorship attribution tasks, we find promising results across different datasets (ranging from literary texts to blogs and emails), demonstrating that the method can scale to large, imbalanced datasets and remains robust to language variation.

II Results

II.1 The authorship attribution task

To demonstrate a possible application of the UMT generative model for an inference problem, we used the probabilities of token sequences derived from the process to infer the authorship of texts. In the authorship attribution task, one is presented with a set of texts with known attribution – the reference corpus – along with a text T𝑇Titalic_T from an unknown author. The goal is to attribute T𝑇Titalic_T to one of the authors represented in the corpus (closed attribution task) or more generally, to recognise the author as one of those represented in the corpus or possibly as a new, unidentified author (open attribution task) [42]. Here we explicitly consider the case of the closed attribution task, although several strategies can be adopted to apply the method in open attribution problems as well.

Following the framework of Information Theory [40, 41], we can think of an author as a stochastic source generating sequences of characters. In particular, a written text is regarded as a sequence of symbols, which can be dictionary words or, more generally, short strings of characters (e.g., n𝑛nitalic_n-grams if such strings have a fixed length n𝑛nitalic_n), with each symbol appearing multiple times throughout the sequence. Each symbol constitutes a novelty the first time it is introduced.

We evaluate the similarity between two symbolic sequences by computing the probability that they are part of a single realisation from the same source. More explicitly, let x1nsuperscriptsubscript𝑥1𝑛x_{1}^{n}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and x2msuperscriptsubscript𝑥2𝑚x_{2}^{m}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT be two symbolic sequences with length n𝑛nitalic_n and m𝑚mitalic_m respectively. Given their generative process—their source—we can compute the conditional probability P(x1nx2m)𝑃conditionalsuperscriptsubscript𝑥1𝑛superscriptsubscript𝑥2𝑚P(x_{1}^{n}\mid x_{2}^{m})italic_P ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ), that is, the probability that x1nsuperscriptsubscript𝑥1𝑛x_{1}^{n}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the continuation of x2msuperscriptsubscript𝑥2𝑚x_{2}^{m}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. In the authorship attribution task, the anonymous text T𝑇Titalic_T is represented by a symbolic sequence xTsubscript𝑥𝑇x_{T}italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, while an author A𝐴Aitalic_A by the symbolic sequence xAsubscript𝑥𝐴x_{A}italic_x start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT obtained by concatenating the texts of A𝐴Aitalic_A in the reference corpus. It is worth noting that an author A𝐴Aitalic_A affects the probability of T𝑇Titalic_T both by defining the source and through the sequence xAsubscript𝑥𝐴x_{A}italic_x start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT. We will use the notation P(TA)PA(xTxA)𝑃conditional𝑇𝐴subscript𝑃𝐴conditionalsubscript𝑥𝑇subscript𝑥𝐴P(T\mid A)\equiv P_{A}(x_{T}\mid\!x_{A})italic_P ( italic_T ∣ italic_A ) ≡ italic_P start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) for the conditional probability of T𝑇Titalic_T to continue the production of A𝐴Aitalic_A. The anonymous text T𝑇Titalic_T is attributed to the author A~~𝐴\tilde{A}over~ start_ARG italic_A end_ARG that maximises such conditional probability: A~=maxAP(TA)~𝐴subscript𝐴𝑃conditional𝑇𝐴\tilde{A}=\max\limits_{A}P(T\mid\!A)over~ start_ARG italic_A end_ARG = roman_max start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_P ( italic_T ∣ italic_A ). We thus need to specify the processes generating the texts and the elements xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of the symbolic sequences, i.e., the tokens.

II.1.1 The tokens

We can make several choices for defining the variables—or tokens—xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In what follows, we consider two alternatives: first, we consider Overlapping Space-Free N𝑁Nitalic_N-Gram [43] (OSF). These are strings of characters of fixed length N𝑁Nitalic_N as tokens, including spaces only as the first or last characters, thereby discarding words shorter than N2𝑁2N-2italic_N - 2. This choice has often yielded the best results. Secondly, we explore a hybrid approach where we exploit the structures captured by the Lempel and Ziv compression algorithm (LZ77) [44]. We define LZ77 sequence tokens as the repeated sequences extracted through a modified version of the Lempel and Ziv algorithm, which has been previously used for attribution purposes [45]. For each dataset, we select the token specification that provides the best performance. In the Supplementary Results, we compare the achieved accuracy when using the token definitions discussed above as well as when using simple dictionary words as tokens.

II.1.2 The generative process and the posterior probabilities

We consider the UMT model in its exchangeable version, which provides an urn representation of the PD process. The latter is defined by the conditional probabilities of drawing at time t+1𝑡1t+1italic_t + 1 an old (already seen) element y𝑦yitalic_y and a new one (not seen until time t𝑡titalic_t). They are given, respectively, by:

P(xt+1=yxt)=ny,tαθ+t,ifny,t>0P(xt+1=yxt)=θ+αDtθ+tP0(y),ifny,t=0formulae-sequenceformulae-sequence𝑃subscript𝑥𝑡1conditional𝑦superscript𝑥𝑡subscript𝑛𝑦𝑡𝛼𝜃𝑡ifsubscript𝑛𝑦𝑡0𝑃subscript𝑥𝑡1conditional𝑦superscript𝑥𝑡𝜃𝛼subscript𝐷𝑡𝜃𝑡subscript𝑃0𝑦ifsubscript𝑛𝑦𝑡0\begin{split}&P(x_{t+1}=y\mid x^{t})=\frac{n_{y,t}\!-\alpha}{\theta\!+\!t},\,% \,\,\text{if}\,\,\,n_{y,t}>0\\ &P(x_{t+1}=y\mid x^{t})=\frac{\theta\!+\!\alpha D_{t}}{\theta\!+\!t}P_{0}(y),% \,\,\,\text{if}\,\,\,n_{y,t}=0\end{split}start_ROW start_CELL end_CELL start_CELL italic_P ( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = italic_y ∣ italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) = divide start_ARG italic_n start_POSTSUBSCRIPT italic_y , italic_t end_POSTSUBSCRIPT - italic_α end_ARG start_ARG italic_θ + italic_t end_ARG , if italic_n start_POSTSUBSCRIPT italic_y , italic_t end_POSTSUBSCRIPT > 0 end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_P ( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = italic_y ∣ italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) = divide start_ARG italic_θ + italic_α italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_θ + italic_t end_ARG italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) , if italic_n start_POSTSUBSCRIPT italic_y , italic_t end_POSTSUBSCRIPT = 0 end_CELL end_ROW (1)

where ny,tsubscript𝑛𝑦𝑡n_{y,t}italic_n start_POSTSUBSCRIPT italic_y , italic_t end_POSTSUBSCRIPT is the number of elements of type y𝑦yitalic_y at time t𝑡titalic_t and Dtsubscript𝐷𝑡D_{t}italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the total number of distinct types appearing in xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT; 0<α<10𝛼10<\alpha<10 < italic_α < 1 and θ>α𝜃𝛼\theta>-\alphaitalic_θ > - italic_α are two real-valued parameters and P0()subscript𝑃0P_{0}(\cdot)italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( ⋅ ) is a given distribution on the variables’ space, called the base distribution. The UMT model does not explicitly define the prior probability for the items’ identity, i.e., the base distribution P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. The latter can be independently defined on top of the process, in the same way as for the Chinese restaurant representation of the Dirichlet or PD processes [46] (please refer to section UMT and PD processes in the Methods for a thorough discussion on the urn models for innovation and their relation with the PD process).

Crucially, Eqs. (1) are only valid when P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is non-atomic, which implies that each new token can be drawn from P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT at most once with probability one. On the contrary, when P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is a discrete probability distribution (it has atoms), an already seen value y𝑦yitalic_y can be drawn again from it, and the conditional probabilities no longer have the simple form shown in Eqs. (1) (as detailed in the Methods). In a problem of language processing, the tokens are naturally embedded in a discrete space, which has led to the development of hierarchical formulations of the PD process [47, 48]. In these approaches, the P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is the (almost surely) discrete outcome of another PD process with a non-atomic base distribution. Here we follow a different approach. We regard P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as a prior probability on the space of new possibilities. In this view, the tokens take values from an uncountable set, and thus the probability of drawing the same token y𝑦yitalic_y from P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT more than once is null. As a consequence, we can use the simple Eqs. (1), where we need to make some arbitrary choices for the actual definition of the base distribution. In the following, we identify P0(y)subscript𝑃0𝑦P_{0}(y)italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) with the frequency of y𝑦yitalic_y in each dataset, while still treating P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as a non-atomic distribution by ensuring that each item can be drawn at most once from it. However, this raises a tricky question of normalisation, which strongly depends on the dataset, resulting in the arbitrary modulation of the relative importance of innovations and repetitions. We have addressed this problem heuristically by introducing an additional parameter δ>0𝛿0\delta>0italic_δ > 0 that multiplies P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT: it suppresses (δ<1𝛿1\delta<1italic_δ < 1) or enhances (δ>1𝛿1\delta>1italic_δ > 1) the probability of introducing a novelty in T𝑇Titalic_T. In addition, we consider an author-dependent base distribution by discounting the vocabulary already appearing in A𝐴Aitalic_A (details are given in section The strategy of P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT in the Methods section). To summarise, the conditional probabilities P(TA)𝑃conditional𝑇𝐴P(T\mid\!A)italic_P ( italic_T ∣ italic_A ) are derived from Eqs. (1), where the base distribution P0(y)subscript𝑃0𝑦P_{0}(y)italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) is defined as discussed above. Different values of α𝛼\alphaitalic_α and θ𝜃\thetaitalic_θ characterise the specific distribution associated with each author. We fix αAsubscript𝛼𝐴\alpha_{A}italic_α start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT and θAsubscript𝜃𝐴\theta_{A}italic_θ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT for each author A𝐴Aitalic_A to the values that maximise her likelihood (refer to the Supplementary Methods for details). We denote by DKsubscript𝐷𝐾D_{K}italic_D start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT (with K=A,T𝐾𝐴𝑇K=A,Titalic_K = italic_A , italic_T) the number of types (i.e., distinct tokens) in A𝐴Aitalic_A and T𝑇Titalic_T, and by DTADAsubscript𝐷𝑇𝐴subscript𝐷𝐴D_{T\cup A}-D_{A}italic_D start_POSTSUBSCRIPT italic_T ∪ italic_A end_POSTSUBSCRIPT - italic_D start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT the number of types in T𝑇Titalic_T that do not appear in A𝐴Aitalic_A. The conditional probability of a text T𝑇Titalic_T to be the continuation of the production of an author A𝐴Aitalic_A reads:

P(TA)=(θA+αADAαA)DTADA(θA+m)nj=1DTQj,𝑃conditional𝑇𝐴subscriptsubscript𝜃𝐴conditionalsubscript𝛼𝐴subscript𝐷𝐴subscript𝛼𝐴subscript𝐷𝑇𝐴subscript𝐷𝐴subscriptsubscript𝜃𝐴𝑚𝑛superscriptsubscriptproduct𝑗1subscript𝐷𝑇subscript𝑄𝑗\displaystyle P(T\mid\!A)=\frac{(\theta_{A}+\alpha_{A}D_{A}\mid\alpha_{A})_{D_% {T\cup A}-D_{A}}}{(\theta_{A}+m)_{n}}\prod_{j=1}^{D_{T}}Q_{j},italic_P ( italic_T ∣ italic_A ) = divide start_ARG ( italic_θ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT + italic_α start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ∣ italic_α start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_T ∪ italic_A end_POSTSUBSCRIPT - italic_D start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG ( italic_θ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT + italic_m ) start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_Q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ,
Qj{(1αA)njT1P0(yj)if yjA(njAαA)njTotherwise.subscript𝑄𝑗casessubscript1subscript𝛼𝐴subscriptsuperscript𝑛𝑇𝑗1subscript𝑃0subscript𝑦𝑗if subscript𝑦𝑗𝐴subscriptsubscriptsuperscript𝑛𝐴𝑗subscript𝛼𝐴subscriptsuperscript𝑛𝑇𝑗otherwise.\displaystyle Q_{j}\equiv\begin{cases}(1-\alpha_{A})_{n^{T}_{j}-1}P_{0}(y_{j})% &\text{if }y_{j}\notin A\\ (n^{A}_{j}\!-\!\alpha_{A})_{n^{T}_{j}}&\text{otherwise.}\end{cases}italic_Q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≡ { start_ROW start_CELL ( 1 - italic_α start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_CELL start_CELL if italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∉ italic_A end_CELL end_ROW start_ROW start_CELL ( italic_n start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_α start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL otherwise. end_CELL end_ROW (2)

where njKsubscriptsuperscript𝑛𝐾𝑗n^{K}_{j}italic_n start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is the number of occurrences of yjsubscript𝑦𝑗y_{j}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in K𝐾Kitalic_K (with K=A,T𝐾𝐴𝑇K=A,Titalic_K = italic_A , italic_T), such that jnjA=msubscript𝑗subscriptsuperscript𝑛𝐴𝑗𝑚\sum_{j}n^{A}_{j}=m∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_n start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_m and jnjT=nsubscript𝑗subscriptsuperscript𝑛𝑇𝑗𝑛\sum_{j}n^{T}_{j}=n∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_n start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_n. The Pochhammer symbol and the Pochhammer symbol with increment k𝑘kitalic_k are defined respectively by (z)nz(z+1)(z+n1)=Γ(z+n)/Γ(z)subscript𝑧𝑛𝑧𝑧1𝑧𝑛1Γ𝑧𝑛Γ𝑧(z)_{n}\equiv z(z+1)\ldots(z+n-1)=\Gamma(z+n)/\Gamma(z)( italic_z ) start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≡ italic_z ( italic_z + 1 ) … ( italic_z + italic_n - 1 ) = roman_Γ ( italic_z + italic_n ) / roman_Γ ( italic_z ) and (zk)nz(z+k)(z+(n1)k)subscriptconditional𝑧𝑘𝑛𝑧𝑧𝑘𝑧𝑛1𝑘(z\mid k)_{n}\equiv z(z+k)\ldots\left(z+(n-1)k\right)( italic_z ∣ italic_k ) start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≡ italic_z ( italic_z + italic_k ) … ( italic_z + ( italic_n - 1 ) italic_k ).

In practice, when attributing the unknown text, we adopt the procedure of dividing it into fragments and evaluating their conditional probability separately. The entire document is then attributed either to the author that maximises the probabilities of most fragments or to the author that maximises the whole document probability computed as a joint distribution over independent fragments (i.e., as a product of the probabilities of its fragments). We optimise this choice for each specific dataset, as described in the Supplementary Methods.

Refer to caption
Figure 1: Corpora sizes and the impact of model parameters on attribution accuracy. In panel (a) we offer a pictorial view of various characteristics related to the size of the considered corpora. The size of the triangles is proportional to the logarithm of the corpus size, measured as number of documents. In the x𝑥xitalic_x and y𝑦yitalic_y axes we represent for each corpus the distribution of the numbers of texts (x𝑥xitalic_x axis) and of the numbers of characters (y𝑦yitalic_y axis) per author. Specifically, the continuous line bars represent the interquartile range of the distributions and the dotted lines show the 95% interval, to highlight their long tails. Panels (b) to (f) report the attribution accuracy varying the length of the fragments and the δ𝛿\deltaitalic_δ value. The colour scale refers to the difference relative to the maximum attribution accuracy obtained in each dataset. In the upper band, the considered length of fragments is of a single token. In the lower band, the text is not partitioned in fragments (Full Text).
Refer to caption
Figure 2: Attribution accuracy. For each of the considered datasets and attribution methods, thick lines show the average accuracy in the ten-fold stratified cross-validation experiment, while shaded circles refer to the attribution accuracy on each of the ten test sets separately. An exception is the E-mail dataset, where a unique test set is considered (see main text). We compare the accuracy achieved by our method (the Constrained Probability 2-parameters Poisson-Dirichlet, in both its versions with and without including the parameter δ𝛿\deltaitalic_δ: the CP2D and δ𝛿\deltaitalic_δ-CP2D) with the Cross-Entropy based approach (CE), the Latent Dirichlet Allocation plus Hellinger distance (LDA-H), the Disjoint Author-Document Topic model in its Probabilistic formulation (DADT-P), and the Topic Drift Model (TDM). On the literary corpora, the LDA-H accuracy is computed using our implementation; please refer to Supplementary Methods for details. For the informal corpora, the results are are available from a previous study [49, Table 1]. Results for the DADT-P and the TDM algorithms were available in the works by Seroussi et al. [34, Tables 4, 5] and Yang et al. [35, Table 1], respectively.

II.2 Results

We test our approach on literary corpora and informal corpora. To challenge the generality of our method versus language variation [50], we consider three corpora of literary texts in three different languages, English, Italian, and Polish, belonging to distinct Indoeuropean families and bearing a diverse degree of inflection (refer to the Supplementary Note 1 for details). We further consider informal corpora mainly composed of English texts. They are particularly challenging for the attribution task due to the strong unbalance in the number of samples per author and the texts’ lengths (refer to Fig. 1, panel a). We consider, in particular, an email corpus and a blog corpus. The first is part of the Enron Email corpus proposed during the PAN’11 contest [51]. It is still used as a valuable benchmark, and we compare the accuracy of our method with those reported in [35, 34]. The Blog corpus is one of the largest datasets used to test methods for authorship attribution [52]. This is a collection of 678,161 blog posts by 19,320 authors taken from [53]. Additionally, in line with [54, 49], we test our method on the subset of 1,000 most prolific authors of this corpus. For more details on the corpora, please refer to the Supplementary Note 1.

In Fig. 1 (panels b–f), we illustrate the dependency of the attribution accuracy on the value of two free parameters of our model, specifically the normalisation δ𝛿\deltaitalic_δ and the length of the fragments in which we partition the text to be attributed. In particular, we report the accuracy achieved on each dataset in a leave-one-out experiment, where we select each text in turn and attribute it by training the model on the rest of the corpus (refer to the Supplementary Methods for more details). We note that, although simply setting δ=1𝛿1\delta=1italic_δ = 1 often gives the most or nearly the most accurate results, in a few datasets using a different value of δ𝛿\deltaitalic_δ significantly improves the accuracy. Indeed an effect of δ𝛿\deltaitalic_δ is also to correct for a non-optimal choice of the length of the fragments, as is evident in the literary English dataset. When attributing an anonymous text, we optimise these two parameters—as well as the selection of P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, the definition of the tokens, and the strategy to attribute the whole document from the likelihood of single fragments—on the training and validation sets, as detailed in the Supplementary Methods.

In the case of informal corpora, we compare our method with state-of-the-art methods in the family of topic models [33]. Topic models are among the most established applications of nonparametric Bayesian techniques in natural language processing, and different authors’ attribution methods rely on this approach. The underlying idea is to consider each document as a mixture of topics and to compute the similarity between two documents in terms of a measure of overlap between them (as detailed in the Methods section and the Supplementary Methods). Those methods were proposed to address challenging situations, particularly in informal corpora with many reference authors and typically short texts. Moreover, they have a similar ground to the method we propose. We consider the Latent Dirichlet Allocation plus Hellinger distance (LDA-H) [54], the Disjoint Author-Document Topic model in its Probabilistic version (DADT-P) [34] and the Topic Drift Model (TDM) [35] since their performances are available on the informal corpora. LDA-H is a straightforward application of topic models to the authorship attribution task. The DADT-P algorithm is a generalisation of the LDA-H characterising both the topics associated with texts and with authors. TDM merges topic models with machine learning methods [55, 56] to account for dynamical correlations between words.

For the literary corpora, there is no direct comparison available in the literature. In the family of topic models, we considered the LDA-H approach, whose implementation is available with the need for minor intervention (please refer to the Supplementary Methods for details on our implementation). In addition, we consider a cross-entropy (CE) approach [57, 58] in the implementation used in previous research [45]. Compression-based methods are general and powerful tools to assess similarity between symbolic sequences and have been at the forefront of authorship attribution for considerable time [59].

When comparing the aforementioned methods and ours, we optimise the free parameters of our model (i.e., δ𝛿\deltaitalic_δ, length of fragments, attribution criterion, type of tokens, and P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT) on the training set, as detailed in the Supplementary Methods. The email corpus already provides training, validation, and test sets. For the remaining corpora, we use ten-fold stratified cross-validation [54, 49, 34]: in turn, one-tenth of the dataset is treated as a test set and the other nine-tenths as training, and the number of samples per author is kept constant across the different folds. In Fig. 2, we report the accuracy obtained on each of the ten partitions, as well as the average value over them. We show the results obtained by either switching off the parameter δ𝛿\deltaitalic_δ (that is, by fixing it to 1111) or optimising it on each specific corpus. The first scenario is denoted by CP2D (Constrained Probability 2-parameters Poisson-Dirichlet), the latter by δ𝛿\deltaitalic_δ-CP2D. The second procedure yields better performances in all the datasets except for the Polish literary dataset, where the number of texts per author is too low to prevent overfitting in this simple training setting. In the literary corpora, the attribution accuracy is overall high, and that of our method consistently higher than that of the other techniques. In the informal corpora, our method achieves an accuracy slightly lower than the best-performing algorithm on the email corpus, while it is the most accurate on the blog corpus. This latter corpus presents a very large number of candidate authors, and our approach appeared more robust in these extreme conditions. In Table 1, we present the numerical value of the average accuracy over the ten partitions, as shown in Fig. 2 (additional evaluation metrics can be found in the Supplementary Results). We also add the attribution accuracy on the training set. We observe that in the literary corpora, only in the Polish dataset, the accuracy on the test set is significantly lower than that in the training set, pointing to overfitting, as discussed above. For the informal corpora, we conversely notice an increase in attribution rate from the training to the test corpora. For the email corpus, also other methods exhibit a similar behaviour [49, 34]. This is probably related to the particular partition considered. For the Blog corpus, the attribution accuracy on the test set is not available for the other methods. Our method features a slightly greater accuracy on the test set than on the training, suggesting that, on the one hand, the corpus is sufficiently large to prevent overfitting. On the other hand, the method increases accuracy when increasing the length of the reference authors’ sequences.

III Conclusion

We present a method for authorship attribution based on urn models for innovation processes. We interpret texts as instances of stochastic processes, where the generative stochastic process represents the author. The attribution relies on the posterior probability of the anonymous text being generated by a particular author and continuing their production. We consider the UMT model [1] in its exchangeable version [5, 6], which is equivalent to the two-parameter Poisson-Dirichlet process. While the latter process is widely used in Bayesian nonparametric inference, it is often employed in a hierarchical formulation. In the case of attribution tasks, this approach has led to topic models, where the output of the stochastic process is a sequence of topics, i.e., distribution over words. Here, we follow a more direct approach, where the stochastic process directly generates words. By relying on a heuristic approach, we can explicitly write posterior probabilities that can be computed exactly. Besides its computational convenience, the method we propose is easily adaptable to incorporate more realistic models for innovation processes.

For instance, one avenue we intend to explore in future research is leveraging the urn model with semantic triggering [1].

We evaluate the performance of our approach by employing the simple UMT exchangeable model against various related approaches in the field. Specifically, we compare it with information theory-based methods [57, 58, 45] and probabilistic methods based on topic models [34, 35]. Our method achieves overall better or comparable performance in datasets with diverse characteristics, ranging from literary texts in different languages to informal texts.

We acknowledge that our method may not compete with deep learning-based models (DL) when large pre-training datasets are available [60, 61]. Nonetheless, it exhibits robustness in challenging situations for DL, for example when only a few texts are available for many authors [61] or in languages where pre-training is less extensive [62]. A deeper comparison with deep learning-based approaches, perhaps by concurrently exploring more sophisticated urn models in our approach, is in order but beyond the scope of the present work (refer to the Supplementary Results for a more detailed discussion and a preliminary analysis).

As a final remark, we also note that we have here considered the so-called closed-set attribution [32], where the training set contains part of the production of the author of the anonymous text. In open-set attribution [63, 64], the anonymous text may be of an author for which no other samples are available in the dataset. Despite the conceptual differences and nuances between the two tasks, approaches based on closed-set attribution [64] are sometimes used also in open-set problems, for instance, by assigning the text to an unknown author if a measure of confidence falls below a given threshold. Similar strategies can be employed with our method by leveraging the conditional probabilities of documents.

We finally note that the method presented here is highly general and can be valuable beyond authorship attribution tasks. Although we expect it to be particularly suitable when elements take values from an open set and follow an empirical distribution close to that produced by the model, it can be applied to assess the similarity between any class of symbolic sequences.

IV Methods

IV.1 UMT and PD processes

In [1], a family of urn models with infinitely many colours was proposed to reproduce shared statistical properties observed in real-world systems featuring innovations. In this context, a realisation of the process is a sequence xt=x1,,xtsuperscript𝑥𝑡subscript𝑥1subscript𝑥𝑡x^{t}=x_{1},\dots,x_{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of extractions of coloured balls, where xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the colour of the element drawn at time t𝑡titalic_t, and the space of colours available at a given time – the urn – represents the adjacent possible space. The urn model with triggering (UMT) [1] (and in a more general setting in [5, 6]) operates as follows: the system evolves by drawing items from an urn initially containing a finite number N0subscript𝑁0N_{0}italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT of balls of distinct colours. At each time step t𝑡titalic_t, a ball is randomly selected from the urn, its colour registered into the sequence, and returned to the urn. If the colour of the drawn ball is not in the sequence xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, ρ~~𝜌\tilde{\rho}over~ start_ARG italic_ρ end_ARG balls of the same colour and ν+1𝜈1\nu+1italic_ν + 1 balls of entirely new colours, i.e., not yet present in the urn, are added to the urn. Thus, the occurrence of new events facilitates others by enlarging the set of potential novelties. Conversely, if the colour of the drawn ball already exists in xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, ρ𝜌\rhoitalic_ρ balls of the same colour are added to the urn. Given the history of extractions xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, the probabilities btsubscript𝑏𝑡b_{t}italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and qc,tsubscript𝑞𝑐𝑡q_{c,t}italic_q start_POSTSUBSCRIPT italic_c , italic_t end_POSTSUBSCRIPT that the drawing at time t𝑡titalic_t results in a new colour or yields a colour c𝑐citalic_c already present in xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT are easily specified for this model:

bt=N0+νDtN0+ρt+aDtqc,t=ρnc,t+aνN0+ρt+aDtsubscript𝑏𝑡subscript𝑁0𝜈subscript𝐷𝑡subscript𝑁0𝜌𝑡𝑎subscript𝐷𝑡subscript𝑞𝑐𝑡𝜌subscript𝑛𝑐𝑡𝑎𝜈subscript𝑁0𝜌𝑡𝑎subscript𝐷𝑡\begin{split}&b_{t}=\frac{N_{0}+\nu D_{t}}{N_{0}+\rho t+aD_{t}}\\ &q_{c,t}=\frac{\rho n_{c,t}+a-\nu}{N_{0}+\rho t+aD_{t}}\end{split}start_ROW start_CELL end_CELL start_CELL italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_ν italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_ρ italic_t + italic_a italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_q start_POSTSUBSCRIPT italic_c , italic_t end_POSTSUBSCRIPT = divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT italic_c , italic_t end_POSTSUBSCRIPT + italic_a - italic_ν end_ARG start_ARG italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_ρ italic_t + italic_a italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG end_CELL end_ROW (3)

where Dtsubscript𝐷𝑡D_{t}italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and nc,tsubscript𝑛𝑐𝑡n_{c,t}italic_n start_POSTSUBSCRIPT italic_c , italic_t end_POSTSUBSCRIPT are the number of distinct colours and the number of extractions of colour c𝑐citalic_c in the sequence xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, respectively, and a=ρ~ρ+ν+1𝑎~𝜌𝜌𝜈1a=\tilde{\rho}-\rho+\nu+1italic_a = over~ start_ARG italic_ρ end_ARG - italic_ρ + italic_ν + 1. Different choices of the parameters (ρ,ρ~,ν)𝜌~𝜌𝜈(\rho,\tilde{\rho},\nu)( italic_ρ , over~ start_ARG italic_ρ end_ARG , italic_ν ) lead to different scenarios, enabling the UMT model to capture the empirical properties summarised by Heap’s, Zipf’s and Taylor’s laws. In the original formulation [1], only two values for the parameter ρ~~𝜌\tilde{\rho}over~ start_ARG italic_ρ end_ARG were discussed: ρ~=ρ~𝜌𝜌\tilde{\rho}=\rhoover~ start_ARG italic_ρ end_ARG = italic_ρ or ρ~=0~𝜌0\tilde{\rho}=0over~ start_ARG italic_ρ end_ARG = 0; the special setting ρ~=ρ(ν+1)~𝜌𝜌𝜈1\tilde{\rho}=\rho-(\nu+1)over~ start_ARG italic_ρ end_ARG = italic_ρ - ( italic_ν + 1 ), which makes the model exchangeable, was later pointed out [5]. We remind that exchangeability refers to the property that the probability of drawing any sequence xtx1,,xtsuperscript𝑥𝑡subscript𝑥1subscript𝑥𝑡x^{t}\equiv x_{1},\dots,x_{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ≡ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of any finite length t𝑡titalic_t does not depend on the order in which the elements occur: P(x1,,xt)=P(xπ(1),,xπ(t))𝑃subscript𝑥1subscript𝑥𝑡𝑃subscript𝑥𝜋1subscript𝑥𝜋𝑡P(x_{1},\dots,x_{t})=P(x_{\pi(1)},\dots,x_{\pi(t)})italic_P ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_P ( italic_x start_POSTSUBSCRIPT italic_π ( 1 ) end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_π ( italic_t ) end_POSTSUBSCRIPT ) for each permutation π𝜋\piitalic_π and each sequence length t𝑡titalic_t. In this case, upon a proper redefinition of the parameters, namely ν/ρα𝜈𝜌𝛼\nu/\rho\equiv\alphaitalic_ν / italic_ρ ≡ italic_α and N0/ρθsubscript𝑁0𝜌𝜃N_{0}/\rho\equiv\thetaitalic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT / italic_ρ ≡ italic_θ, the UMT model reproduces the conditional probabilities associated with the PD process (expressed in Eqs. (1)). We note here that such probabilities include the Dirichlet process as a special case, where α=0𝛼0\alpha=0italic_α = 0 and Dtsubscript𝐷𝑡D_{t}italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT grows logarithmically with t𝑡titalic_t. In the framework of urn models, the Dirichlet process finds its counterpart in the Hoppe model [65] and in the exchangeable version of the UMT model with the additional choice ν=0𝜈0\nu=0italic_ν = 0. The PD process is defined by 0<α<10𝛼10<\alpha<10 < italic_α < 1 and predicts the asymptotic behaviour Dttαsimilar-tosubscript𝐷𝑡superscript𝑡𝛼D_{t}\sim t^{\alpha}italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ italic_t start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT [46]. We note that the probabilities in Eqs. (3) coincide, when renaming the parameters as stated above, with those in Eqs. (1).

IV.2 The strategy for P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT

When P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is a discrete probability distribution (it has atoms), an already seen value y𝑦yitalic_y can be drawn again from it, and the conditional probabilities no longer have the simple form as in Eqs. (1). In this case, the conditional probabilities depend not only on the sequence xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT of observable values but also on latent variables indicating, for each element in xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, whether it has been drawn from P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT or arose from the reinforcement process [66]. In particular, we can define, for each type yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (i=1,,Dt𝑖1subscript𝐷𝑡i=1,\dots,D_{t}italic_i = 1 , … , italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT) in xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, a latent variable λi,tsubscript𝜆𝑖𝑡\lambda_{i,t}italic_λ start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT that counts the number of times yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is drawn from the base distribution P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. The probabilities conditioned on the observable sequence xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and on the latent variables sequence λDtsuperscript𝜆subscript𝐷𝑡\lambda^{D_{t}}italic_λ start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT read:

P(xt+1=yxt,λDt)=ny,tλi,tαθ+t+θ+αΛtθ+tP0(y)ifny,t>0P(xt+1=yxt,λDt)=θ+αΛtθ+tP0(y)ifny,t=0formulae-sequence𝑃subscript𝑥𝑡1conditional𝑦superscript𝑥𝑡superscript𝜆subscript𝐷𝑡subscript𝑛𝑦𝑡subscript𝜆𝑖𝑡𝛼𝜃𝑡𝜃𝛼subscriptΛ𝑡𝜃𝑡subscript𝑃0𝑦ifsubscript𝑛𝑦𝑡0𝑃subscript𝑥𝑡1conditional𝑦superscript𝑥𝑡superscript𝜆subscript𝐷𝑡𝜃𝛼subscriptΛ𝑡𝜃𝑡subscript𝑃0𝑦ifsubscript𝑛𝑦𝑡0\begin{split}P(x_{t+1}=y\mid x^{t},\lambda^{D_{t}})=\frac{n_{y,t}\!-\lambda_{i% ,t}\alpha}{\theta\!+\!t}+\frac{\theta\!+\!\alpha\Lambda_{t}}{\theta\!+\!t}P_{0% }(y)&\\ \text{if}\,\,\,n_{y,t}>0&\\ P(x_{t+1}=y\mid x^{t},\lambda^{D_{t}})=\frac{\theta\!+\!\alpha\Lambda_{t}}{% \theta\!+\!t}P_{0}(y)\;\;\;\;\;\;\text{if}\,\,\,n_{y,t}=0&\end{split}start_ROW start_CELL italic_P ( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = italic_y ∣ italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_λ start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) = divide start_ARG italic_n start_POSTSUBSCRIPT italic_y , italic_t end_POSTSUBSCRIPT - italic_λ start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT italic_α end_ARG start_ARG italic_θ + italic_t end_ARG + divide start_ARG italic_θ + italic_α roman_Λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_θ + italic_t end_ARG italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL if italic_n start_POSTSUBSCRIPT italic_y , italic_t end_POSTSUBSCRIPT > 0 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_P ( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = italic_y ∣ italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_λ start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) = divide start_ARG italic_θ + italic_α roman_Λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_θ + italic_t end_ARG italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) if italic_n start_POSTSUBSCRIPT italic_y , italic_t end_POSTSUBSCRIPT = 0 end_CELL start_CELL end_CELL end_ROW (4)

Where ΛtiDtλi,tsubscriptΛ𝑡superscriptsubscript𝑖subscript𝐷𝑡subscript𝜆𝑖𝑡\Lambda_{t}\equiv\sum_{i}^{D_{t}}\lambda_{i,t}roman_Λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≡ ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT is the total number of extractions from P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT till time t𝑡titalic_t. To compute the probabilities conditioned to the observable sequence xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, we must integrate out the latent variables. This is an exponentially hard problem and efficient sampling algorithms [33, 36, 37] have been developed for an approximate solution.

By taking the perspective of the urn model, we investigate the possibility of bypassing the problem by imposing that each element can be extracted only once from P0()subscript𝑃0P_{0}(\cdot)italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( ⋅ ), which is equivalent to fixing all the latent variables λi,t=1subscript𝜆𝑖𝑡1\lambda_{i,t}=1italic_λ start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT = 1 and set to zero the last term in the first equation in Eqs. (4).

The latter procedure effectively replaces P0(y)subscript𝑃0𝑦P_{0}(y)italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) with a history-dependent probability, normalised at each time over all the elements y𝑦yitalic_y not already appeared in xtsuperscript𝑥𝑡x^{t}italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. It reads:

P0t(y)P0(yyxt)={P0(y)1y~xtP0(y~)if yxt0otherwisesuperscriptsubscript𝑃0𝑡𝑦subscript𝑃0conditional𝑦𝑦superscript𝑥𝑡casessubscript𝑃0𝑦1subscript~𝑦superscript𝑥𝑡subscript𝑃0~𝑦if 𝑦superscript𝑥𝑡0otherwiseP_{0}^{t}(y)\equiv P_{0}(y\mid\!y\notin x^{t})=\begin{cases}\frac{P_{0}(y)}{1-% \sum\limits_{\tilde{y}\in x^{t}}P_{0}(\tilde{y})}&\text{if }y\notin x^{t}\\ 0&\text{otherwise}\end{cases}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_y ) ≡ italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ∣ italic_y ∉ italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) = { start_ROW start_CELL divide start_ARG italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) end_ARG start_ARG 1 - ∑ start_POSTSUBSCRIPT over~ start_ARG italic_y end_ARG ∈ italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( over~ start_ARG italic_y end_ARG ) end_ARG end_CELL start_CELL if italic_y ∉ italic_x start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise end_CELL end_ROW (5)

where the sum is over all the elements already drawn at time t𝑡titalic_t. Note that this choice breaks the exchangeability of the process with respect to the order in which novel elements are introduced. In the implementation of our algorithm, we follow an even simpler and fast procedure, which yielded equivalent results. We simply introduce an author-dependent base distribution by considering, for each author A𝐴Aitalic_A, the frequency of the tokens that do not appear in A𝐴Aitalic_A. Such procedure translates into replacing P0(y)subscript𝑃0𝑦P_{0}(y)italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) with P0(A)(y)=P0(y)P(A𝒞)superscriptsubscript𝑃0𝐴𝑦subscript𝑃0𝑦𝑃superscript𝐴𝒞P_{0}^{(A)}(y)=\frac{P_{0}(y)}{P(A^{\mathcal{C}})}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_A ) end_POSTSUPERSCRIPT ( italic_y ) = divide start_ARG italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_y ) end_ARG start_ARG italic_P ( italic_A start_POSTSUPERSCRIPT caligraphic_C end_POSTSUPERSCRIPT ) end_ARG, where A𝒞superscript𝐴𝒞A^{\mathcal{C}}italic_A start_POSTSUPERSCRIPT caligraphic_C end_POSTSUPERSCRIPT denotes the set of all distinct tokens that do not appear in A𝐴Aitalic_A. This author-dependent base distribution proved to be preferable to simply using the original frequency, especially in datasets with short texts and few samples for each author.

IV.3 LDA and topic models

LDA is a generative probabilistic model [67], which generates corpora of documents. A document is a finite sequence of words w1,w2,,wNsubscript𝑤1subscript𝑤2subscript𝑤𝑁w_{1},w_{2},...,w_{N}italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT and it is represented as a random mixture over latent topics. Each topic corresponds to a categorical probability distribution over the set of all possible words. Topics can be shared by different documents. The total number k𝑘kitalic_k of topics is fixed a priori and to each topic i𝑖iitalic_i in each document d𝑑ditalic_d is associated a probability θi,dsubscript𝜃𝑖𝑑\theta_{i,d}italic_θ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT, extracted independently for each document from a k𝑘kitalic_k-dimensional Dirichlet distribution D(α1,,αk)𝐷subscript𝛼1subscript𝛼𝑘D(\alpha_{1},\dots,\alpha_{k})italic_D ( italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ). Each document d𝑑ditalic_d is generated as follows: first, its length Ndsubscript𝑁𝑑N_{d}italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT is extracted from a Poissonian distribution with a given mean. Then, the document is populated with words using the following procedure: a topic i𝑖iitalic_i is extracted with probability θi,dsubscript𝜃𝑖𝑑\theta_{i,d}italic_θ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT and a word w𝑤witalic_w is extracted from i𝑖iitalic_i with the probability associated to it in topic i𝑖iitalic_i. The probabilities pi(w)subscript𝑝𝑖𝑤p_{i}(w)italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_w ) of a word w𝑤witalic_w in the topic i𝑖iitalic_i is in turn extracted independently from a W𝑊Witalic_W-dimensional Dirichlet distribution D(β1,,βW)𝐷subscript𝛽1subscript𝛽𝑊D(\beta_{1},\dots,\beta_{W})italic_D ( italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_β start_POSTSUBSCRIPT italic_W end_POSTSUBSCRIPT ), where W𝑊Witalic_W is the total number of words W𝑊Witalic_W in the corpus.

As in Eqs. (4), we can introduce latent variables [67], now with a different meaning. To each word wi,dsubscript𝑤𝑖𝑑w_{i,d}italic_w start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT in document d𝑑ditalic_d, i=1,,Nd𝑖1subscript𝑁𝑑i=1,\dots,N_{d}italic_i = 1 , … , italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT, we associate a latent variable λi,dsubscript𝜆𝑖𝑑\lambda_{i,d}italic_λ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT that is the identifier of the topic j𝑗jitalic_j from which the word wi,dsubscript𝑤𝑖𝑑w_{i,d}italic_w start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT is extracted. The joint distribution of the sequence of words wNdw1,d,,wNd,dsuperscript𝑤subscript𝑁𝑑subscript𝑤1𝑑subscript𝑤subscript𝑁𝑑𝑑w^{N_{d}}\equiv w_{1,d},\dots,w_{N_{d},d}italic_w start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ≡ italic_w start_POSTSUBSCRIPT 1 , italic_d end_POSTSUBSCRIPT , … , italic_w start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_d end_POSTSUBSCRIPT and latent variables λNdλ1,d,,λNd,dsuperscript𝜆subscript𝑁𝑑subscript𝜆1𝑑subscript𝜆subscript𝑁𝑑𝑑\lambda^{N_{d}}\equiv\lambda_{1,d},\dots,\lambda_{N_{d},d}italic_λ start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ≡ italic_λ start_POSTSUBSCRIPT 1 , italic_d end_POSTSUBSCRIPT , … , italic_λ start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_d end_POSTSUBSCRIPT in a document d𝑑ditalic_d thus read:

P(wNd,λNd)=n=1Ndp(wi,d|λ1,d)p(λi,d)𝑃superscript𝑤subscript𝑁𝑑superscript𝜆subscript𝑁𝑑superscriptsubscriptproduct𝑛1subscript𝑁𝑑𝑝conditionalsubscript𝑤𝑖𝑑subscript𝜆1𝑑𝑝subscript𝜆𝑖𝑑P(w^{N_{d}},\lambda^{N_{d}})=\prod_{n=1}^{N_{d}}p(w_{i,d}|\lambda_{1,d})p(% \lambda_{i,d})italic_P ( italic_w start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_λ start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) = ∏ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_p ( italic_w start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT 1 , italic_d end_POSTSUBSCRIPT ) italic_p ( italic_λ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT ) (6)

where p(λi,d)θi,d𝑝subscript𝜆𝑖𝑑subscript𝜃𝑖𝑑p(\lambda_{i,d})\equiv\theta_{i,d}italic_p ( italic_λ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT ) ≡ italic_θ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT. To compute the posterior probability of the observable sequence wNdsuperscript𝑤subscript𝑁𝑑w^{N_{d}}italic_w start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUPERSCRIPT we must integrate out the latent variables. This is an exponentially hard problem and is solved with methods for numerical approximation by using for instance Markov Chain Monte Carlo algorithms. A more flexible approach is to use the Dirichlet or PD processes instead of the Dirichlet distributions over topics. This allows the number of topics k𝑘kitalic_k to remain unspecified a priori.

The probabilities θi,dsubscript𝜃𝑖𝑑\theta_{i,d}italic_θ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT are the elements of a sequence generated by a Dirichlet or PD process, for each document d𝑑ditalic_d. The processes characterising each document share the same discrete base distribution, which is, in turn, generated by a Dirichlet or PD process with a non-atomic P0subscript𝑃0P_{0}italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Again, efficient sampling algorithms for computing the posterior distributions [33, 36, 37] have been developed in this framework.

In the framework of authorship attribution, methods relying on LDA are more widely adopted than those based on the Dirichlet or PD processes, primarily due to their simplicity and comparable accuracy [34].

The procedure followed by the LDA-H algorithm to address the author attribution task is described in the Supplementary Methods.

V Data Availability

The corpora used to validate our approach, with the exception of the Italian literature, are available online at the following addresses: Blog corpus (https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm), PAN’11 email corpus (https://doi.org/10.5281/zenodo.3713246), Polish literature (https://github.com/computationalstylistics/100_polish_novels), English literature (https://github.com/GiulioTani/InnovationProcessesInference/tree/main/sample_data/English_literature). The Italian literature corpus is currently covered by copyright, we make accessible the list of included titles (https://github.com/GiulioTani/InnovationProcessesInference/tree/main/sample_data/Italian_literature). Please refer to the Supplementary Note 1 for details about which parts of the public datasets where used in this study.

VI Code Availability

All the code we used to compute all the attributions with the CP2D approach is publicly available under the GNU GPL v3.0 license at https://github.com/GiulioTani/InnovationProcessesInference.

VII Acknowledgements

M.L. acknowledges support by the European Project: XAI: Science and technology for the eXplanation of AI decision making (https://xai-project.eu/), ERC-2018-ADG G.A. 834756. G.T.R. was supported by the Czech Science Foundation project No. 21-17211S. This work has been realised in the framework of the agreement between Sapienza University of Rome and the Sony Computer Science Laboratories.

VIII Author contributions

F.T. conceived and designed research; F.T. and M.L. performed a preliminary investigation, collected in M.L. master thesis. M.L. and G.T.R. collected the datasets. G.T.R. wrote the code and performed the numerical calculations. F.T. and G.T.R. analyzed the data, refined the model and discussed the results. All authors contributed to write the paper and revise the manuscript.

IX Competing interests

The authors declare no competing interests.

X Additional information

Supplementary Information is attached to this manuscript.

References

  • Tria et al. [2014] F. Tria, V. Loreto, V. Servedio, and S. Strogatz, Scientific reports 4, 1 (2014).
  • Heaps [1978] H. S. Heaps, Information retrieval, computational and theoretical aspects (Academic Press, 1978).
  • Taylor [1961] L. Taylor, Nature 189, 732 (1961).
  • Gerlach and Altmann [2014] M. Gerlach and E. G. Altmann, New Journal of Physics 16, 113010 (2014).
  • Tria et al. [2018] F. Tria, V. Loreto, and V. Servedio, Entropy 20, 752 (2018).
  • Tria et al. [2020] F. Tria, I. Crimaldi, G. Aletti, and V. D. P. Servedio, Entropy 22, 573 (2020).
  • Zipf [1935] G. K. Zipf, The Psychobiology of Language (Houghton-Mifflin, New York, NY, USA, 1935).
  • Moreno-Sánchez et al. [2016] I. Moreno-Sánchez, F. Font-Clos, and Á. Corral, PloS one 11, e0147073 (2016).
  • Kauffman [2000] S. A. Kauffman, Investigations (Oxford University Press, New York/Oxford, 2000).
  • Pólya [1930] G. Pólya, Annales de l’I.H.P. 1, 117 (1930).
  • Ubaldi et al. [2021] E. Ubaldi, R. Burioni, V. Loreto, and F. Tria, Communications Physics 4, 28 (2021).
  • Monechi et al. [2017] B. Monechi, Ã. Ruiz-Serrano, F. Tria, and V. Loreto, PloS one 12, e0179303 (2017).
  • Aletti et al. [2023] G. Aletti, I. Crimaldi, and A. Ghiglietti, Scientific Reports 13, 17187 (2023).
  • Iacopini et al. [2018] I. Iacopini, S. c. v. Milojević, and V. Latora, Physical Review Letters 120, 048301 (2018).
  • Iacopini et al. [2020] I. Iacopini, G. Di Bona, E. Ubaldi, V. Loreto, and V. Latora, Physical Review Letters 125, 248301 (2020).
  • Di Bona et al. [2022] G. Di Bona, E. Ubaldi, I. Iacopini, B. Monechi, V. Latora, and V. Loreto, arXiv preprint arXiv:2202.05099  (2022).
  • Di Bona et al. [2023] G. Di Bona, A. Bellina, G. De Marzo, A. Petralia, I. Iacopini, and V. Latora, arXiv preprint arXiv:2307.06147  (2023).
  • Lijoi et al. [2007] A. Lijoi, R. H. Mena, and I. Prünster, Biometrika 94, 769 (2007).
  • Favaro et al. [2009] S. Favaro, A. Lijoi, R. H. Mena, and I. Prünster, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 993 (2009).
  • Chakraborty et al. [2019] S. Chakraborty, A. Arora, C. B. Begg, and R. Shen, Nature communications 10, 5506 (2019).
  • Holec et al. [2019] P. V. Holec, J. Berleant, M. Bathe, and M. E. Birnbaum, Bioinformatics 35, 1318 (2019).
  • Masoero et al. [2022] L. Masoero, F. Camerlenghi, S. Favaro, and T. Broderick, Biometrika 109, 17 (2022).
  • Gershman and Blei [2012] S. J. Gershman and D. M. Blei, Journal of Mathematical Psychology 56, 1 (2012).
  • Ni et al. [2020] Y. Ni, P. Müller, M. Diesendruck, S. Williamson, Y. Zhu, and Y. Ji, Journal of Computational and Graphical Statistics 29, 53 (2020).
  • Schmidt and Morup [2013] M. N. Schmidt and M. Morup, IEEE Signal Processing Magazine 30, 110 (2013).
  • Hu et al. [2019] L. Hu, K. C. Chan, X. Yuan, and S. Xiong, IEEE Transactions on Knowledge and Data Engineering 32, 2115 (2019).
  • Teh and Jordan [2010] Y. W. Teh and M. I. Jordan, Bayesian nonparametrics 1, 158 (2010).
  • Blei et al. [2010] D. M. Blei, T. L. Griffiths, and M. I. Jordan, Journal of the ACM (JACM) 57, 1 (2010).
  • Pitman [1997] J. Pitman, Annals of Probability 25, 855 (1997).
  • Ferguson [1973] T. S. Ferguson, The Annals of Statistics 1, 353 (1973).
  • De Blasi et al. [2013] P. De Blasi, S. Favaro, A. Lijoi, R. H. Mena, I. Prünster, and M. Ruggiero, IEEE transactions on pattern analysis and machine intelligence 37, 212 (2013).
  • Fadel et al. [2020] A. Fadel, H. Musleh, I. Tuffaha, M. Al-Ayyoub, Y. Jararweh, E. Benkhelifa, and P. Rosso (ACM, 2020) pp. 4–8.
  • Blei [2010] D. M. Blei, IEEE Signal Processing Magazine 27, 55 (2010).
  • Seroussi et al. [2014] Y. Seroussi, I. Zukerman, and F. Bohnert, Computational Linguistics 40, 269 (2014).
  • Yang et al. [2017] M. Yang, D. Zhu, Y. Tang, and J. Wang, in Thirty-First AAAI Conference on Artificial Intelligence (2017).
  • Teh et al. [2006a] Y. Teh, D. Newman, and M. Welling, Advances in neural information processing systems 19 (2006a).
  • Porteous et al. [2008] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling, in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008) pp. 569–577.
  • de Finetti [1937] B. de Finetti, Annales de l’institut Henri Poincaré 7, 1 (1937).
  • Kingman [1978] J. F. C. Kingman, Proceedings of the Royal Society 361, 1 (1978).
  • Shannon [1948] C. E. Shannon, The Bell system technical journal 27, 379 (1948).
  • Cover and Thomas [2006] T. M. Cover and J. A. Thomas, Elements of Information Theory 2nd Edition (Wiley-Interscience, 2006).
  • Stamatatos [2009] E. Stamatatos, Journal of the American Society for information Science and Technology 60, 538 (2009).
  • Koppel et al. [2011] M. Koppel, J. Schler, and S. Argamon, Language Resources and Evaluation 45, 83 (2011).
  • Ziv and Lempel [1977] J. Ziv and A. Lempel, IEEE Transactions on information theory 23, 337 (1977).
  • Lalli et al. [2018] M. Lalli, F. Tria, and V. Loreto, in Drawing Elena Ferrante’s Profile: Workshop Proceedings, Padova, 7 September 2017, edited by A. Tuzzi and M. A. Cortelazzo (Padova UP, 2018).
  • Pitman [2006] J. Pitman, Combinatorial Stochastic Processes: Ecole d’Eté de Probabilités de Saint-Flour XXXII-2002 (Springer, 2006).
  • Teh et al. [2006b] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, Journal of the American Statistical Association 101, 1566 (2006b).
  • Teh [2006] Y. W. Teh, in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44 (Association for Computational Linguistics, USA, 2006) p. 985?992.
  • Seroussi et al. [2012] Y. Seroussi, F. Bohnert, and I. Zukerman, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2012) pp. 264–269.
  • Rybicki and Eder [2011] J. Rybicki and M. Eder, Literary and Linguistic Computing 26, 315 (2011)https://academic.oup.com/dsh/article-pdf/26/3/315/3955977/fqr031.pdf .
  • Argamon and Juola [2011] S. Argamon and P. Juola, Overview of the International Authorship Identification Competition at PAN-2011 (2011).
  • Saedi and Dras [2021] C. Saedi and M. Dras, Computer Speech & Language 70, 101241 (2021).
  • Schler et al. [2006] J. Schler, M. Koppel, S. Argamon, and J. W. Pennebaker, in AAAI spring symposium: Computational approaches to analyzing weblogs, Vol. 6 (2006) pp. 199–205.
  • Seroussi et al. [2011] Y. Seroussi, I. Zukerman, and F. Bohnert, in Proceedings of the fifteenth conference on computational natural language learning (2011) pp. 181–189.
  • Yang et al. [2016] M. Yang, J. Mei, F. Xu, W. Tu, and Z. Lu, in Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (2016) pp. 801–804.
  • Mnih and Kavukcuoglu [2013] A. Mnih and K. Kavukcuoglu, Advances in neural information processing systems 26 (2013).
  • Benedetto et al. [2002] D. Benedetto, E. Caglioti, and V. Loreto, Phys. Rev. Lett. 88, 048702 (2002).
  • Baronchelli et al. [2005] A. Baronchelli, E. Caglioti, and V. Loreto, Journal of Statistical Mechanics: Theory and Experiment 2005, P04002 (2005).
  • Neal et al. [2017] T. Neal, K. Sundararajan, A. Fatima, Y. Yan, Y. Xiang, and D. Woodard, ACM Computing Surveys (CSUR) 50, 1 (2017).
  • Fabien et al. [2020] M. Fabien, E. Villatoro-Tello, P. Motlicek, and S. Parida (2020) pp. 127–137.
  • Bauersfeld et al. [2023] L. Bauersfeld, A. Romero, M. Muglikar, and D. Scaramuzza, PLOS ONE 18, e0287611 (2023).
  • Romanov et al. [2021] A. Romanov, A. Kurtukova, A. Shelupanov, A. Fedotova, and V. Goncharov, Future Internet 1310.3390/fi13010003 (2021).
  • Kestemont et al. [2019] M. Kestemont, E. Stamatatos, E. Manjavacas, W. Daelemans, M. Potthast, and B. Stein, in Working Notes Papers of the CLEF 2019 Evaluation Labs, CEUR Workshop Proceedings, Vol. 2380, edited by L. Cappellato, N. Ferro, D. Losada, and H. Müller (2019).
  • Stamatatos et al. [2023] E. Stamatatos, K. Kredens, P. Pezik, A. Heini, J. Bevendorff, B. Stein, and M. Potthast, in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), CEUR Workshop Proceedings, Vol. 3497, edited by M. Aliannejadi, G. Faggioli, N. Ferro, and M. Vlachos (2023) pp. 2476–2491.
  • Hoppe [1984] F. M. Hoppe, Journal of Mathematical Biology 20, 91 (1984).
  • Buntine and Hutter [2010] W. Buntine and M. Hutter, A Bayesian view of the Poisson-Dirichlet Process, Tech. Rep. arXiv:1007.0296 (NICTA and ANU, Australia, 2010).
  • Blei et al. [2003] D. M. Blei, A. Y. Ng, and M. I. Jordan, Journal of Machine Learning Research 3, 993–1022 (2003).
Eng Ita Pol Email Blog1K Blog
LDA-H .87711footnotemark: 1 .83011footnotemark: 1 .76911footnotemark: 1 .426 .216 .079
CE .853 .900 .870
DADT-P .594 .437 .286
TDM .542 .308
CP2D .913 .929 .899 .521 .494 .369
δ𝛿\deltaitalic_δ-CP2D .918 .935 .838 .558 .495 .393
CP2Dtrtr{}^{\textsc{tr}}start_FLOATSUPERSCRIPT tr end_FLOATSUPERSCRIPT .924 .927 .949 .497 .489 .358
δ𝛿\deltaitalic_δ-CP2Dtrtr{}^{\textsc{tr}}start_FLOATSUPERSCRIPT tr end_FLOATSUPERSCRIPT .926 .929 .956 .518 .490 .386
11footnotemark: 1 Our implementation.
Table 1: Attribution results. Numerical values of the average accuracy, depicted in Fig. 2, are here listed for all considered methods: our Constrained Probability 2-parameters Poisson-Dirichlet, in both its versions with and without including the parameter δ𝛿\deltaitalic_δ (the CP2D and δ𝛿\deltaitalic_δ-CP2D), the Cross-Entropy based approach (CE), the Latent Dirichlet Allocation plus Hellinger distance (LDA-H), the Disjoint Author-Document Topic model in its Probabilistic formulation (DADT-P), and the Topic Drift Model (TDM). In addition, we report the average accuracy obtained on the training sets (TR) by our method, as discussed in the text. We highlight in bold the best performance on each dataset.