Statistical signatures of abstraction in deep neural networks

Carlo Orientale Caputo
SISSA - International School for Advanced Studies, 34136 Trieste, Italy
Matteo Marsili
Quantitative Life Sciences Section
The Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy
marsili@ictp.it
Abstract

We study how abstract representations emerge in a Deep Belief Network (DBN) trained on benchmark datasets. Our analysis targets the principles of learning in the early stages of information processing, starting from the “primordial soup” of the under-sampling regime. As the data is processed by deeper and deeper layers, features are detected and removed, transferring more and more “context-invariant” information to deeper layers. We show that the representation approaches a universal model – the Hierarchical Feature Model (HFM) – determined by the principle of maximal relevance. Relevance quantifies the uncertainty on the model of the data, thus suggesting that “meaning” – i.e. syntactic information – is that part of the data which is not yet captured by a model. Our analysis shows that shallow layers are well described by pairwise Ising models, which provide a representation of the data in terms of generic, low order features. We also show that plasticity increases with depth, in a similar way as it does in the brain. These findings suggest that DBNs are capable of extracting a hierarchy of features from the data which is consistent with the principle of maximal relevance.

Perception and knowledge could never be the same [Plato, Theaetetus 186e]

In the study of hierarchies of abstraction in cognitive sciences, Marr’s hierarchy [1] distinguishes three levels: the first is the conceptual one of defining the functions that a system performs, the second is the algorithmic one, and the third is how both the first two can be implemented in biophysical terms. Artificial neural networks have reached levels of abstraction that endow a computer (the second and third levels of Marr’s hierarchy) with linguistic proficiency and knowledge that challenge the Turing test. Artificial neural networks thus offer a playground for understanding how abstraction emerges from data, abstracting away from biophysical constraints. Indeed, the conceptual level alone contains fantastic richness. At this level, a hierarchy of deeper levels of abstraction opens up, from perception to knowledge, ultimately making us conscious beings [2]. (Footnote 1: Vision is the prototype example of this hierarchy, whereby the recognition of a particular face, which is an abstract concept, depends on lower level concepts such as edges and wedges [1].) This paper addresses the tip of this iceberg by asking the questions “what is abstraction in the first place, and how can we measure it?”.

We do this by studying how the data is represented in layers of different depth inside a Deep Belief Network (DBN). This architecture is certainly outdated compared, for example, to models that enable linguistic capabilities, but its simplicity allows us to study how abstraction emerges in a controlled way. The optimisation of the algorithms is not a relevant issue, precisely because we want to focus on how abstraction emerges independently of the algorithm. For this reason we adopt standard choices and well-known benchmark datasets (see the Appendix), so that our results can be easily reproduced. We shall rather focus on how information flows inside the network. It is known that these models are capable of some level of abstraction, because the activity of deeper “neurons” reproduces higher order features with respect to those of shallower layers [3, 4].

Having delimited precisely our domain of action, we shall focus on the statistical properties of the representation that emerges from the data along the hierarchy of levels of the DBN and search for traces of abstraction. DBNs are able to reproduce images which are indistinguishable from those of the dataset they are trained with, carrying “meaning” back and forth between the visible and the deepest layer. In this process, intermediate layers discover “hidden” features, filtering out noise and non-semantic information.

Abstraction has to do with developing a representation that is more and more independent of the data, and which represents only “meaning”. (Footnote 2: In biological brains, independence extends to the sensory modality with which data is acquired. For example, we recognise a person whether we see her or hear her voice.) This leads us to the question: “what is the statistical structure of languages which carry meaning, i.e. semantic information?”. In order to answer this question a quantitative notion of meaning is necessary. One such notion has been recently introduced [5], the relevance, which quantifies Barlow’s intuition [6] that meaning is carried by redundancy. We refer the interested reader to Ref. [7] for arguments on why this is a meaningful notion of relevance. The important point is that this notion can be easily applied within our simplified setting. We therefore test it as a candidate principle that may inform us about “what does it mean to learn?” or “what is abstraction?”. Interestingly, in the simplified setting studied here there is only one model of a representation that encodes the principle of maximal relevance and only that. This model – the Hierarchical Feature Model (HFM) – thus provides a candidate for an abstract representation that encodes only meaning. (Footnote 3: Loosely speaking, the HFM can be thought of as a toy model for the platonic realm of reality consisting in forms, ideals, or ideas.)

Besides proposing the HFM as a scaffold for meaning, Ref. [8] contrasts the classical learning modality of Restricted Boltzmann Machines (RBM) with one in which, akin to “understanding”, data is stored in a fixed, pre-existent representation. These are two quite different learning modalities: in the first, both the internal representation and the network of connections change during training, while in the second the internal representation is anchored to the HFM and only the weights are learned. (Footnote 4: Ref. [8] does not show that the HFM is better than any other model in any empirical way. The empirical evidence only shows that the HFM provides a flexible abstract representation, which also enables the network to extract relations between datasets learned in different contexts. The rationale for focusing on the HFM is the purely theoretical argument of efficient data storage. Although further analysis of the HFM would be welcome to assess its optimality in empirical analysis, an abstract theoretical justification should suffice in the present context.) The RBM enjoys a remarkable flexibility in transfer learning, which implies that information is stored in the representation and weights are generic. (Footnote 5: Transfer learning refers to the fact that a network trained on one dataset is able to perform classification tasks in supervised learning also when probed on a different dataset, by just retraining the output layer.) (Footnote 6: Although see [9].) But when the internal representation is fixed, as in the “understanding” modality, information on the data is stored uniquely in the weights.

The comparison of these two learning modalities suggests that while the RBM modality describes well “perception” in shallow layers, the HFM modality prevails as one moves to deeper layers. In testing this hypothesis, the rest of this paper will show that i) the internal representations of the layers in a DBN approach more and more the HFM as one moves from shallow to deep layers, and ii) as one draws data from a wider domain. This relation of abstraction with “width” conforms to the idea that the level of abstraction of a representation should be higher for datasets with a larger variation, not evidently reducible to invariances. We also find that iii) shallow layers are well described by pairwise Ising models (PIM) whereas deeper layers require higher order interactions. PIM encode generic representations: they have been shown to describe data from a large variety of systems, from neurons in the retina [10] to voting patterns in the US Supreme Court [11]. Moreover, iv) we show that plasticity is reduced in shallow layers and enhanced in deeper ones. In other words, we find that the weights of the shallow layers do not change much when the network is retrained on a new dataset, while those of the deeper layers change considerably. Finally, v) we estimate the number of invariances and show that they decrease with depth in a way that is consistent with the picture discussed above.

Taken together, these findings corroborate the idea that successive layers of a DBN extract statistical models of hidden features and invariances, and that “meaning” is that part of the information which cannot be reduced to features that can be properly fit by a statistical model. Indeed, relevance measures exactly the uncertainty on the model of the data. This suggests that relevance is a quantitative measure of “meaning”, intended as “irreducible” information.

The paper is organised as follows. We first try to put our paper in relation to the literature of other ways of attacking the abstraction hierarchy at the conceptual level. Then we lay down the background introducing the concept of relevance and the HFM. In Section 3 we discuss the results and we conclude with some remarks. All technical details are relegated to the Appendix.

1 Related approaches

As entry points to the subject of abstraction hierarchies, we refer to the book by Dana Ballard [2], and the review by Yee [12] which covers approaches more rooted in experimental psychology. Among the many insights that emerge from this literature, the trade-off between abstraction and complexity is relevant to our discussion: two-and-a-half-year-old children exploit reward-relevant information if it is presented in a two dimensional drawing, but not if it is displayed by a three dimensional realistic model [13]. This suggests that motor areas are connected to a rather abstract representation of the environment. Cognitive structures evolve during development: while children are more prone to thematic abstractions based on co-occurrences of memes, abstraction in adults is more frequently associated to a taxonomic (or categorical) structure. (Footnote 7: This difference is also reflected anatomically in the different regions of the brain that are putatively involved in a given abstract representation. Event-based inputs are easily stored in the “static memory” structure of the hippocampal region, whereas taxonomical abstractions may require more complex dynamical representations such as those offered by recurrent neural networks [14].) This difference is likely related to a difference in the brain areas which are involved, suggesting that taxonomic abstraction requires a higher level of abstraction with respect to thematic abstraction, which is based on event-based elements closer to empirical reality (typically stored in regions afferent to the hippocampus) [15].

Abstraction does not only have to do with classification but also with uncovering the structure of relations hidden in the data. This idea is at the basis of the concept of a cognitive map [14, 16], which is not only an efficient and flexible scaffold of data, but is also endowed with an appropriate structure of relations that makes it possible to navigate the representation efficiently. Indeed, relational structures such as “Alice is the daughter of Jim” and “Bob is Alice’s brother” allow for computations (e.g. “Jim is Bob’s father”) which are invariant with respect to the context (Alice, Jim and Bob can be replaced by any triplet of persons that stand in the same relation, in this example). How the brain builds the appropriate cognitive maps at different levels of abstraction, combining elements at lower levels, is a fascinating issue for which we refer to [14]. The reach of our analysis will not cover higher functions that involve computation, because our setting is completely static. Aspects of abstraction which are encoded in the time domain may be better attacked by an analysis similar to ours in recurrent neural networks. We shall target the very first levels of the hierarchy, which correspond to the under-sampling regime where data is so high dimensional that statistical evidence is not sufficient to identify models. Our aim is to characterise abstraction on the basis of the sole statistical signatures of the representation itself, with no reference to what is being represented in the data. In this respect, we depart from the typical “tuning curve” approach in computational neuroscience, in which levels of abstraction are assessed in terms of the features of the data – e.g. edges in vision [17] or spatial positions [18] – that a representation encodes. We shall argue that, in this setting, the relevance is a natural measure of abstraction, because it quantifies the uncertainty on a possible generative model of the data.

Research in vision has brought us considerable insights on abstraction hierarchies, and in particular on the capacity of the visual system in recognising and exploiting invariances [1, 19, 20]. This ability, when enforced in artificial neural networks either by augmenting the data using invariances [21] or by explicitly implementing them in the architecture of neural networks – as in convolutional neural networks [3] – can be learned by a deep neural network. Interestingly, even simple neural networks are able to develop a convolutional structure by themselves [22]. Convolutions are only one type of regularity that a neural network may use to capture the inner structure of the data. As argued by Tenenbaum et al. [16], knowledge is not information on the data, but rather information on the way data can be organised in representations that enable generative and computational power.

Within this larger context, our paper merely probes the very entrance of the process by which structured data acquires meaning.

2 The framework

A Deep Belief Network (DBN) is a stack of simpler units – the Restricted Boltzmann Machines – which are trained to maximise the likelihood of the data. Mathematically, the DBN can be seen as a joint probability distribution

$$p(\mathbf{x},\mathbf{s}^{(1)},\ldots,\mathbf{s}^{(L)}). \tag{1}$$

We are interested in this probability distribution after the DBN has been trained on the data. The data $\hat{\mathbf{x}}=(\mathbf{x}_1,\ldots,\mathbf{x}_N)$ consist of the $N$ vectors $\mathbf{x}_i$ that specify the $28\times 28$ B/W pixel values corresponding to handwritten digits (MNIST) and characters (eMNIST), and of stylised pictures of some commercial articles (fMNIST). Each internal layer is characterised by a binary vector $\mathbf{s}^{(\ell)}$ that encodes the activity and a vector of weights $\mathbf{W}^{\ell}$ that relates this activity to that of the previous layer $\mathbf{s}^{(\ell-1)}$. We focus on the marginal distribution $p(\mathbf{s}^{(\ell)})$ that encodes the representation in the $\ell^{\rm th}$ level of the hierarchy. (Footnote 8: In practice all our results are based on a sample $\hat{\mathbf{s}}^{(\ell)}$ of $N$ states sampled from the distribution $p(\mathbf{s}^{(\ell)})$ induced in layer $\ell$ when the data is presented in the visible layer. The value of $N$ used in all experiments was $N=60000$.) Our architecture is the same as that of Ref. [23]: it is composed of $L=10$ hidden layers, with a decreasing number of variables. More details on the data, on the architecture and on the learning algorithm are given in the Appendix.

We describe representations $p(\mathbf{s})$ in terms of the natural variable $E_{\mathbf{s}}=-\log_2 p(\mathbf{s})$, which is the minimal number of bits needed to represent the state $\mathbf{s}$. The average coding cost $H[\mathbf{s}]=\langle E_{\mathbf{s}}\rangle$ is the usual Shannon entropy and counts the number of bits available to describe one point of the dataset. Following Ref. [7], we shall call $H[\mathbf{s}]$ the resolution.
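As a minimal sketch of these definitions, the snippet below computes coding costs and the resolution from a sample of states. It assumes that $p(\mathbf{s})$ is replaced by the empirical frequency of each state (a naive plug-in estimate, which is an assumption made here for illustration and is known to be biased in the under-sampling regime). The function names are hypothetical.

```python
from collections import Counter
from math import log2

def coding_costs(sample):
    """Map each observed state s to E_s = -log2 p(s), with p(s) the empirical frequency."""
    n = len(sample)
    counts = Counter(sample)
    return {s: -log2(c / n) for s, c in counts.items()}

def resolution(sample):
    """Plug-in estimate of H[s] = <E_s>, in bits (average coding cost over the sample)."""
    n = len(sample)
    counts = Counter(sample)
    costs = coding_costs(sample)
    return sum((c / n) * costs[s] for s, c in counts.items())

# Toy sample of two binary variables (illustrative data, not from the paper).
sample = [(0, 0), (0, 0), (1, 0), (0, 1)]
costs = coding_costs(sample)
h_s = resolution(sample)
```

States are represented as tuples so that they can be used as dictionary keys; any hashable encoding of a binary vector would work equally well.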

The resolution $H[\mathbf{s}]$ is a measure of information content but not of information “quality”. We take the view that meaningful information should bear statistical signatures that allow it to be distinguished from noise. These make it possible to identify relevant information before finding out what that information is relevant for, a key feature of learning in living systems.

The relevance of a representation $p(\mathbf{s})$ is the entropy of the coding cost $E_{\mathbf{s}}$. Representations where coding costs are distributed uniformly should be promoted for the reason that, in an optimal representation, the number $W(E)$ of states $\mathbf{s}$ that require $E$ bits to be represented should match as closely as possible the number ($2^E$) of words that require $E$ bits. This principle corresponds exactly to the maximisation of the relevance

$$H[E]=-\sum_{E}p(E)\log_2 p(E)\,, \tag{2}$$

where $p(E)=W(E)\,2^{-E}$ is the probability that a random point in the data has $E_{\mathbf{s}}=E$. Note that states $\mathbf{s}$ and $\mathbf{s}'$ with very different coding costs $E_{\mathbf{s}}$ and $E_{\mathbf{s}'}$ can be distinguished by their statistics, because they would naturally belong to different typical sets. (Footnote 9: By the law of large numbers, typical samples of weakly interacting variables all have approximately the same coding cost, a fact known as the asymptotic equipartition property [24]. A trained DBN classifies the points in a dataset in different typical sets [25].) Representations that maximise the relevance harvest this benefit in discrimination ability that is accorded to us by statistics.
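The relevance can be illustrated with a short sketch, again under the assumption (made here, not taken from the text) that $p(\mathbf{s})$ is estimated by empirical frequencies. In that case all states observed the same number of times share the same coding cost, so $p(E)$ is the fraction of sample points that belong to states with a given count. The function name is hypothetical.

```python
from collections import Counter
from math import log2

def relevance(sample):
    """Plug-in estimate of H[E], the entropy of the coding cost, in bits."""
    n = len(sample)
    state_counts = Counter(sample)
    # With empirical frequencies, E_s = -log2(k/n) depends only on the count k
    # of state s, so sample points are grouped by the count of their state.
    mass_by_count = Counter()
    for k in state_counts.values():
        mass_by_count[k] += k  # a state seen k times contributes k sample points
    return -sum((m / n) * log2(m / n) for m in mass_by_count.values())

# One state seen twice and two states seen once: two coding-cost levels,
# each carrying half of the sample.
mixed = [(0, 0), (0, 0), (1, 0), (0, 1)]
# All states seen once: a single coding-cost level.
flat = [(0, 0), (1, 0), (0, 1), (1, 1)]
h_e_mixed = relevance(mixed)
h_e_flat = relevance(flat)
```

The second sample has the highest resolution that four points allow, yet its relevance vanishes because every point has the same coding cost, which illustrates that resolution alone does not measure information “quality”.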

The HFM describes the distribution $p(\mathbf{s})$ of a string $\mathbf{s}=(s_1,\ldots,s_n)$ of binary variables that we can take as indicators of whether each of $n$ features is present ($s_i=1$) or not ($s_i=0$). This distribution satisfies the property that the occurrence of a feature $s_k=1$ at level $k$ does not provide any information on whether lower order features are present or not. This means that, conditional on $s_k=1$, all lower order features are as random as possible, $H[s_1,\ldots,s_{k-1}|s_k=1]=k-1$ in bits. This requirement implies that the Hamiltonian $E_{\mathbf{s}}$ should be a function of $m_{\mathbf{s}}=\max\{k:~s_k=1\}$, with $m_{\mathbf{s}}=0$ if $\mathbf{s}=(0,\ldots,0)$ is the featureless state.

The principle of maximal relevance prescribes a degeneracy of states $W(E)=e^{\nu E}$ that increases exponentially with the coding cost [7]. So it excludes all functional relations between $E_{\mathbf{s}}$ and $m_{\mathbf{s}}$ that are not linear. This is why, combined with the previous requirement, the principle of maximal relevance leads to the HFM, which assigns a probability

$$h_n(s_1,\ldots,s_n)=\frac{1}{Z_n}e^{-g m_{\mathbf{s}}}\,, \tag{3}$$

to state $\mathbf{s}$ [here $Z_n$ is the partition function]. We refer to [8] for a detailed discussion of the properties of the HFM. In brief, in the limit $n\to\infty$ the HFM features a phase transition at $g_c=\log 2$ between a random phase where $H[\mathbf{s}]$ is of order $n$ for $g<g_c$, and a “low temperature” phase where $h_n(\mathbf{s})$ is dominated by a finite number of states (and $H[\mathbf{s}]$ is finite in the limit $n\to\infty$).
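The definition in Eq. (3) can be sketched in a few lines. The closed form used below for the partition function is not quoted from the text: it follows from counting the $2^{k-1}$ states that have $m_{\mathbf{s}}=k$ (feature $k$ present, higher features absent, lower features free), plus the single featureless state. Function names are hypothetical.

```python
from itertools import product
from math import exp, log

def m_of(s):
    """1-based index of the highest order feature present; 0 for the featureless state."""
    return max((k + 1 for k, sk in enumerate(s) if sk == 1), default=0)

def partition_function(n, g):
    """Z_n = 1 + sum_{k=1}^n 2^(k-1) exp(-g k), from counting states with m_s = k."""
    return 1.0 + sum(2 ** (k - 1) * exp(-g * k) for k in range(1, n + 1))

def hfm_prob(s, g):
    """HFM probability h_n(s) = exp(-g m_s) / Z_n of Eq. (3)."""
    return exp(-g * m_of(s)) / partition_function(len(s), g)

# Brute-force check of normalisation at the critical point g_c = log 2.
n, g = 6, log(2)
total = sum(hfm_prob(s, g) for s in product((0, 1), repeat=n))
z_critical = partition_function(n, g)
```

At $g=g_c$ every level $k\ge 1$ contributes $2^{k-1}2^{-k}=1/2$ to the partition function, so $Z_n=1+n/2$ grows linearly with $n$, which marks the boundary between the two phases described above.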

The HFM interpolates between high order features that code for meaning and low order ones, whose statistics is closer to that of noise. Indeed, marginalising over the low order features $s_1,\ldots,s_k$ returns again an HFM over the remaining $n-k$ features

$$\sum_{s_1,\ldots,s_k}h_n(s_1,\ldots,s_n)=h_{n-k}(s_{k+1},\ldots,s_n)\,. \tag{4}$$

On the other hand, marginalising over the high order ones yields a mixture between the HFM and the maximum entropy distribution

$$\sum_{s_{k+1},\ldots,s_n}h_n(s_1,\ldots,s_n)=\frac{Z_k}{Z_n}h_k(s_1,\ldots,s_k)+\left(1-\frac{Z_k}{Z_n}\right)2^{-k}\,. \tag{5}$$

When $g\leq g_c$, the ratio $\frac{Z_k}{Z_n}\to 0$ as $n\to\infty$ with $k$ finite, so the distribution of the first $k$ features converges to a state of maximal entropy in this limit. Likewise, low order features become more and more independent from higher order ones in this limit.
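The mixture form of Eq. (5) and the behaviour of the ratio $Z_k/Z_n$ can be checked numerically by brute-force enumeration for small $n$. The sketch below is only an illustration under the HFM definition of Eq. (3); the helper names, the chosen values of $n$, $k$ and $g$, and the closed form of $Z_n$ (obtained by counting the $2^{k-1}$ states with $m_{\mathbf{s}}=k$) are assumptions introduced here.

```python
from itertools import product
from math import exp, log

def m_of(s):
    """1-based index of the highest order feature present; 0 if none."""
    return max((k + 1 for k, sk in enumerate(s) if sk == 1), default=0)

def Z(n, g):
    """Partition function of the HFM with n features."""
    return 1.0 + sum(2 ** (k - 1) * exp(-g * k) for k in range(1, n + 1))

def h(s, g):
    """HFM probability of state s, with n = len(s)."""
    return exp(-g * m_of(s)) / Z(len(s), g)

def marginal_low(low, n, g):
    """Left-hand side of Eq. (5): sum h_n over s_{k+1..n} with s_1..s_k = low."""
    k = len(low)
    return sum(h(low + high, g) for high in product((0, 1), repeat=n - k))

def mixture(low, n, g):
    """Right-hand side of Eq. (5): mixture of h_k and the uniform distribution."""
    k = len(low)
    w = Z(k, g) / Z(n, g)
    return w * h(low, g) + (1.0 - w) * 2.0 ** (-k)

n, k, g = 7, 3, 0.5
max_gap = max(
    abs(marginal_low(low, n, g) - mixture(low, n, g))
    for low in product((0, 1), repeat=k)
)

# Weight of the HFM component at g = g_c for increasing n, with k = 3 fixed.
ratios = [Z(3, log(2)) / Z(n_big, log(2)) for n_big in (10, 100, 500)]
```

The list `ratios` shrinks as $n$ grows, so the marginal over the first $k$ features is increasingly dominated by the uniform term $2^{-k}$.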

We take the view of learning as progressively detecting hidden features and invariances in the data and modelling them. Ideally, this process is one where the marginal probability of a variable $\phi(\mathbf{s})$ that encodes the hidden features is sharply peaked, or where there are some variables $\theta(\mathbf{s})$ that encode some invariance (e.g. by translations). (Footnote 10: We take the loose meaning of the term “hidden features” as approximate sufficient statistics. By sharply peaked we mean that the variation of $\phi$ is constrained to a low dimensional manifold. Ansuini et al. [26] show that this is true for the variation of $\mathbf{s}^{(\ell)}$ itself.) In the former case, the maximisation of the likelihood between layers can avail of $H[\mathbf{s}^{(\ell)}|\phi]$ bits of noise, at most, to be expelled from the DBN. (Footnote 11: Note that, by construction, the conditional distribution $p(\mathbf{s}^{(\ell)}|\mathbf{s}^{(\ell-1)})$ in the $\ell^{\rm th}$ layer of the DBN is a maximum entropy distribution of independent binary variables, which is fully specified by the averages $\langle s^{(\ell)}_i|\mathbf{s}^{(\ell-1)}\rangle$. Therefore $H[\mathbf{s}^{(\ell)}|\mathbf{s}^{(\ell-1)}]$ quantifies the amount of information (per datapoint) that the DBN regards as noise.) In the case of an invariance, the marginal distribution of $\theta$ itself is a priori a maximum entropy distribution, granting $-\log_2 p_{\ell-1}(\theta)$ bits of noise to be disposed of. Incidentally, the reduction of relevant information to invariances provides substantial computational advantages. Indeed, if a representation $\mathbf{s}=(X(\mathbf{s}),Y(\mathbf{s}),\ldots)$ can be expressed in terms of two (or more) independent random variables, these can be processed independently of one another, in parallel. (Footnote 12: We refer to De Mulatier et al. [27] for an attempt to disentangle a sample $\hat{\mathbf{s}}$ of binary data in independent components that is inspired by information theoretic principles alone and addresses inference in the under-sampling domain. Our preliminary attempts to disentangle independent variables in the DBN with this method suggest that this view of learning refers to an ideal limit of an optimal learning machine.)

In this view, the features in the HFM describe “irreducible” information, not yet captured within a maximum entropy model. The relevance can then be thought of as a quantitative measure of the residual uncertainty on the model of the data.

3 The results

We compute the Kullback-Leibler divergence $D_{KL}(\hat{p}_{\ell}|h_{n_{\ell}})$ of the data from the HFM in each layer $\ell$. This can be thought of as a tax (in bits) that is charged to the data for not being storable in an efficient manner. The results summarised in Fig. 1 show that this measure responds positively to the expectation that relevance provides a quantitative measure of meaning. Indeed $D_{KL}(\hat{p}_{\ell}|h_{n_{\ell}})$ decreases with $\ell$ in all datasets, showing that the internal representation approaches the HFM as depth increases.
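The divergence from the HFM can be sketched as follows. This is only an illustration, not the estimator used for the figures (which is described in the Appendix): it assumes that $\hat{p}_{\ell}$ is the empirical distribution of the sampled states, that the variables are already ordered from low to high order features, and that the best-fit $g$ is found by a plain grid search. All function names and the grid are hypothetical.

```python
from collections import Counter
from math import exp, log, log2

def m_of(s):
    """1-based index of the highest order feature present; 0 if none."""
    return max((k + 1 for k, sk in enumerate(s) if sk == 1), default=0)

def log2_hfm(s, g):
    """log2 of the HFM probability h_n(s), with n = len(s)."""
    n = len(s)
    z = 1.0 + sum(2 ** (k - 1) * exp(-g * k) for k in range(1, n + 1))
    return (-g * m_of(s) - log(z)) / log(2)

def kl_to_hfm(sample, g):
    """Plug-in D_KL(p_hat | h_n) in bits, with p_hat the empirical distribution."""
    n_pts = len(sample)
    counts = Counter(sample)
    return sum(
        (c / n_pts) * (log2(c / n_pts) - log2_hfm(s, g))
        for s, c in counts.items()
    )

def fit_g(sample, grid=None):
    """Grid search for the g that minimises the divergence (assumed fitting scheme)."""
    grid = grid or [i / 100 for i in range(1, 301)]
    return min(grid, key=lambda g: kl_to_hfm(sample, g))

# Toy sample whose frequencies match the n = 2 HFM at g = log 2 exactly:
# h(0,0) = 1/2, h(1,0) = 1/4, h(0,1) = h(1,1) = 1/8.
sample = [(0, 0)] * 4 + [(1, 0)] * 2 + [(0, 1), (1, 1)]
d_at_gc = kl_to_hfm(sample, log(2))
d_off = kl_to_hfm(sample, 1.5)
g_hat = fit_g(sample)
```

Since the entropy of $\hat{p}_{\ell}$ does not depend on $g$, minimising this divergence over $g$ coincides with maximising the likelihood of the sample under the HFM. In the under-sampling regime the plug-in entropy term is strongly biased, so the numbers produced by this sketch should be read as qualitative.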

We expect that a dataset with a larger variety, not evidently reducible to invariances, should correspond to a more abstract representation, with respect to one trained on data drawn from a “narrower” domain. Fig. 1 corroborates this expectation, by showing that the distance of the internal representation to the HFM also decreases with “width”. We probe this behaviour in two ways: first we generate a “narrower” dataset from symmetry transformations of the digit “2” of MNIST. The distance of the internal representations of this dataset from the HFM is significantly larger than that of the MNIST dataset. Second, we train the DBN with a “wider” dataset, combining the MNIST and the eMNIST datasets. The results confirm our expectations, even though a significant reduction in the distance of the internal representations to the HFM (with respect to that of DBNs trained on the individual datasets) is only visible in the deepest layers. The inset of Fig. 1 shows that the estimate of the parameter $g$ approaches the critical point $g_c=\log 2$ with depth.

Figure 1: $D_{KL}$ between the internal representation of each layer and the best-fit HFM, normalized to the number of nodes of each layer, for 5 different datasets. Besides benchmark datasets (MNIST, eMNIST and fMNIST), we also show results for a dataset of $N=60000$ digits obtained by simple transformations (rotations and translations) of the data points in MNIST that correspond to the digit “2”, and for a DBN trained on the combined MNIST and eMNIST datasets ($N=120000$). The inset shows the distance $\delta g=(\hat{g}-g_{c})/\sqrt{\mathbb{V}(\hat{g})}$ of the estimated value of $g$ from the critical point $g_{c}$, normalised by the standard deviation of the estimator, for the MNIST dataset.

Fig. 1 shows that the HFM does not provide a good description of shallow layers. Fig. 2 shows that these are instead well described by pairwise Ising models (PIM), which contain only up-to-pairwise interactions. The PIM is defined as

$$p^{(2)}(\mathbf{s})=\frac{1}{Z}\exp\left(\sum_{i<j}J_{ij}^{l}s_{i}s_{j}+\sum_{i}h_{i}^{l}s_{i}\right), \tag{6}$$

where $Z$ is the partition function and the parameters $J_{ij}$ and $h_{i}$ are estimated using maximum likelihood (more details are given in the Appendix).

In order to measure “pairwise-ness”, we compute the Kullback-Leibler divergence between the internal representation of a layer $\ell$ and the best PIM describing that layer, i.e. the model $p_{\ell}^{(2)}(\mathbf{s})$ that minimizes $D_{KL}(\hat{p}_{\ell}||p^{(2)})$ with respect to the hidden layer distribution. Fig. 2 shows that the minimal $D_{KL}(\hat{p}_{\ell}||p^{(2)})$ for DBNs trained with the MNIST, fMNIST and eMNIST datasets is negligibly small in shallow layers, and that it ramps up with depth.

Figure 2: Kullback-Leibler divergence between the internal representation of each layer and the best pairwise model describing that representation, normalized to the number of nodes of each layer. It is estimated from the sample $\{\mathbf{s}^{\mu}_{\ell}\}_{\mu=1}^{N}$. The DBN was trained with the MNIST, fMNIST and eMNIST datasets.

This is consistent with the fact that PIM are models of very high complexity [28], which means that they can describe data from a large variety of systems (see footnote 13). In this sense, the level of abstraction of PIM is very low and closeness to the best PIM can be taken as a measure of un-abstractness. The internal representations of shallow layers are therefore rather generic, which agrees with their ability to “transfer” information to deeper layers also for other datasets, without the need of being retrained. This ability is related to the widely supported idea that shallow layers code information in terms of local, low order features of the data, which are well described by pairwise interactions.

Footnote 13: The complexity of a model, as shown in Ref. [29], is a measure of the number of different datasets that can be described with it. The complexity of the PIM grows as the number of parameters, which is proportional to $n^{2}$. That of the HFM only grows as $\log n$, which implies that the uncertainty of the parameter $g$ that a sample of $N$ points can provide is of the order of $1/(\sqrt{N}\log n)$. The couplings of the fitted PIM are small, as in Ref. [10], which is consistent with the fact that information in RBMs is passed mostly by one-spin averages because the conditional multi-information $I(s_{1},\ldots,s_{n}|\mathbf{x})$ is zero. The HFM is instead characterised by strong interactions at all orders, as shown in Ref. [8].

Taken together, the two results discussed above suggest that plasticity should increase with depth. This is because, if the representation of a deep layer is close to an abstract (data-independent) model, then the information on the data must necessarily be stored in the (data-dependent) weights that connect one layer to the next. So the weights of deep layers should change considerably when the data changes. On the contrary, the weights of shallow layers should not change much, since their representations are generic, as discussed above.

Fig. 3 shows the results of experiments in which a DBN is trained first on one dataset and then on a different one. It shows the distance between the weights $\mathbf{W}_{1}^{\ell}$ learned in layer $\ell$ for the first dataset and those ($\mathbf{W}_{2}^{\ell}$) learned for the second dataset.

Figure 3: Difference of the weights as a function of depth after training the same architecture with two different datasets (MNIST, eMNIST, fMNIST). The error bars on the MNIST-fMNIST curve were computed from 10 independent experiments.

As expected, the weights in shallow layers change less than those in deep layers, consistent with the idea that the weights of the first layers are very generic and do not capture specific features of the dataset. The deep layers instead have more specific representations, and their weights are more data dependent. This result is consistent with the observation that shallow layers tend to learn oriented and localized edge filters, whereas deeper layers are inclined to capture higher-level features [4].

Finally we extend the HFM by dividing the set of variables into two groups as

$$h_{n}^{(k)}(s_{1},\ldots,s_{n})=\frac{1}{Z_{n}^{(k)}}e^{-g\max(k,m_{\mathbf{s}})}=2^{-k}h_{n-k}^{(0)}(s_{k+1},\ldots,s_{n}), \tag{7}$$

in such a way that the first $k$ variables are described by a maximal entropy distribution $p(s_{1},\ldots,s_{k})=2^{-k}$ and the remaining $n-k$ are described by an HFM. Here $k$ is meant to provide a sharp separation between variables coding for invariances and variables that code for meaning. Therefore the size $k$ of the first group provides a rough measure of the “number of invariances” present in the representation. Fig. 4 shows that the estimated value of $k$ sharply decreases with depth, and it does so more slowly for data augmented using invariances.

Figure 4: Difference of the weights as a function of layer after training the same architecture with two different datasets. The error bars on the MNIST-fMNIST curve were calculated from 10 simulations.

4 Conclusions

Universal statistical signatures of relevance should exist because otherwise learning would be impossible. These signatures make it possible to identify relevant information without the need to know what that information is relevant for, which is precisely what learning is about. Our results suggest that it is the very uncertainty on the way in which the data may be informative, i.e. on the model which describes the data, that makes the data meaningful. And because meaning has to do with model uncertainty, it should admit a model-free, universal characterisation. The measure of relevance proposed in Refs. [5, 7] fits with this general description of an abstract measure of meaning. In addition, in the present setting, there is a unique, simple model – the HFM – that encodes the principle of maximal relevance.

This paper analyses how information is stored in the rather simple setting of a DBN in order to probe the organising principles of unsupervised learning: as data is processed in deeper and deeper layers, it is stripped of features and invariances, which, once detected, are reduced to noise. Syntactic meaning is organised in more and more data-independent representations which approach a universal abstract representation (the HFM) that encodes only principles of efficient information storage. This whole process is driven by the maximisation of the likelihood alone. In the opposite direction, the syntactic meaning generated in the deepest layers is dressed up with contextual information on its way to the visible layer. In this picture, depth – i.e. the distance of a representation from the data – correlates negatively with complexity and positively with plasticity. With respect to complexity, our results suggest a further rationale (in addition to that of Ref. [29]) for Occam’s razor, which posits simplicity as a guiding principle in learning.

This picture is largely consistent with the prevailing one in artificial neural networks [3, 4, 25] as well as in neuroscience [6, 2]. In vision, early stages of information processing are adapted to process a large variation of structured datasets. From the retina to the primary visual cortex, input is encoded in terms of generic features, such as localised filters [30]. These areas of early processing of visual stimuli exhibit suppressed levels of experience dependent plasticity after development [31]. Conversely, enhanced levels of plasticity are required for incremental learning in higher areas of visual processing [32], in order to store data specific information in the synapses (or in the couplings of models of associative memories [33]).

The main original contribution of this paper is that it offers further support to the notion of relevance [5, 7] as a quantitative measure of meaningfulness. We hope that this insight can help shed further light on deeper levels of the abstraction hierarchy [2] by, for example, constraining the search for models of cognitive maps [14, 16], or providing further insights on the statistical regularities that emerge in the analysis of correlations of neural activities across scales [34, 35].

The approach can be generalised to elucidate the principles that govern learning in more complex architectures, but it may also shed light on conceptual issues in a broader disciplinary domain (see footnote 14).

Footnote 14: It is tempting to speculate on the analogies between our setting and the evolution of a system in time. The DBN architecture is characterised by a Markovian structure where the state of each layer only depends on the state of the previous layer. Likewise, the state of a system at a given time only depends on its state at a previous time. In this respect, equilibrium states are the least informative ones, because they satisfy a maximum entropy principle. Once a system reaches equilibrium all information on its past is lost. Meaning would then encode the specific way in which a system is driven out of equilibrium. In this perspective, the relevance should help us find those features which carry meaning on the dynamics which led to the present state. Grigolon et al. [36] discuss an example of this strategy in the context of biological evolution of proteins.

5 Acknowledgements

We are grateful to Paolo Muratore and Davide Zoccolan for interesting discussions.

Appendix A Simulation details

A.1 Training of DBN

A deep belief network (DBN) consists of Restricted Boltzmann Machines (RBMs) stacked one on top of the other, as shown in Fig. 5. Each RBM is a Markov random field with pairwise interactions defined on a bipartite graph of two non-interacting layers of variables: visible variables $\mathbf{x}=(x_{1},\ldots,x_{m})$ representing the data, and hidden variables $\mathbf{s}=(s_{1},\ldots,s_{n})$ that are the latent representation of the data. The probability measure of a single RBM is:

$$p(\mathbf{x},\mathbf{s})=\frac{1}{Z}\exp\left(\sum_{i,j}W_{ij}x_{i}s_{j}+\sum_{k}x_{k}c_{k}+\sum_{l}s_{l}b_{l}\right), \tag{8}$$

where $\mathbf{W}=\{W_{ij}\}$, $\mathbf{c}=(c_{1},\ldots,c_{m})$ and $\mathbf{b}=(b_{1},\ldots,b_{n})$ are the parameters that are learned during training.
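Because the graph is bipartite, Eq. (8) implies that the hidden units are conditionally independent given the visible layer, and vice versa. The sketch below illustrates this, assuming binary units $x_{i},s_{j}\in\{0,1\}$, for which the conditionals reduce to logistic functions. The function names are illustrative and are not taken from the code used for the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_hidden_given_visible(x, W, b):
    """P(s_j = 1 | x) for the RBM of Eq. (8); x is (batch, m), W is (m, n), b is (n,)."""
    return sigmoid(x @ W + b)

def p_visible_given_hidden(s, W, c):
    """P(x_i = 1 | s); s is (batch, n), c is (m,)."""
    return sigmoid(s @ W.T + c)

def gibbs_step(x, W, b, c, rng):
    """One alternating Gibbs update x -> s -> x' (assumes 0/1 units)."""
    p_s = p_hidden_given_visible(x, W, b)
    s = (rng.random(p_s.shape) < p_s).astype(float)
    p_x = p_visible_given_hidden(s, W, c)
    x_new = (rng.random(p_x.shape) < p_x).astype(float)
    return x_new, s
```

Repeated calls to `gibbs_step` give the Markov chain on which the sampling and training procedures described in this appendix rely.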

Figure 5: A three layer Deep Belief Network

In order to train the DBN we learn the parameters one layer at a time, following the prescription of Hinton [37]. This consists of training the first RBM on the data and then propagating the input data $\hat{\mathbf{x}}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})$ forward to the first hidden layer, thus obtaining a sample of the hidden states $\hat{\mathbf{s}}^{(1)}$ for the first layer. This is then used as input for training the second hidden layer, and so on. This type of training procedure was proven [37] to increase a variational lower bound for the log-likelihood of the dataset. This allows us to use approximate training methods like Contrastive Divergence (CD) and still obtain a good generative model.
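The greedy layer-wise procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: units are taken to be binary (0/1), and the function names, learning rate and initialisation are all hypothetical choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm_cd(data, n_hidden, k=1, epochs=5, batch=64, lr=0.05, seed=0):
    """Train one binary RBM with CD-k; returns (W, b, c) in the notation of Eq. (8)."""
    rng = np.random.default_rng(seed)
    m = data.shape[1]
    W = 0.01 * rng.standard_normal((m, n_hidden))
    b = np.zeros(n_hidden)  # hidden biases
    c = np.zeros(m)         # visible biases
    for _ in range(epochs):
        for start in range(0, len(data), batch):
            x0 = data[start:start + batch]
            ph0 = sigmoid(x0 @ W + b)           # positive phase
            xk = x0
            for _ in range(k):                  # k Gibbs steps started from the batch
                ps = sigmoid(xk @ W + b)
                s = (rng.random(ps.shape) < ps).astype(float)
                px = sigmoid(s @ W.T + c)
                xk = (rng.random(px.shape) < px).astype(float)
            phk = sigmoid(xk @ W + b)           # negative phase
            W += lr * (x0.T @ ph0 - xk.T @ phk) / len(x0)
            b += lr * (ph0 - phk).mean(axis=0)
            c += lr * (x0 - xk).mean(axis=0)
    return W, b, c

def train_dbn(data, layer_sizes, **kw):
    """Greedy stacking: each RBM is trained on sampled hidden states of the previous one."""
    rng = np.random.default_rng(kw.get("seed", 0))
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm_cd(layer_input, n_hidden, **kw)
        params.append((W, b, c))
        p = sigmoid(layer_input @ W + b)
        layer_input = (rng.random(p.shape) < p).astype(float)  # sample of hidden states
    return params
```

For instance, `train_dbn(data, [500, 250, 120], k=10, epochs=1000)` would stack three RBMs on a binary array `data` of shape (N, 784).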

In order to generate samples from the trained DBN we consider the connections between the top two layers as undirected, whereas all lower layers are connected to the upper layer by directed connections. This means that, in order to obtain a sample from a DBN, we use Gibbs sampling to sample the equilibrium of the top RBM $p_{L}(\mathbf{s}^{(L)},\mathbf{s}^{(L-1)})$. Then we use this sample to draw the states of lower layers using the conditional distribution $p(\mathbf{s}_{\ell-1}|\mathbf{s}_{\ell})$. In this way, we propagate the signal down to the visible layer.
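A minimal sketch of this two-stage sampling is given below, assuming binary (0/1) units and a list `params` whose entry $\ell$ holds the weights and biases $(W,b,c)$ of the RBM joining layer $\ell$ to layer $\ell+1$. The function name, the random initialisation of the chain and the number of Gibbs steps are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_dbn(params, n_samples=16, n_gibbs=1000, seed=0):
    """Draw visible configurations from a DBN given per-layer (W, b, c) tuples."""
    rng = np.random.default_rng(seed)

    def bernoulli(p):
        return (rng.random(p.shape) < p).astype(float)

    # Gibbs sampling in the top, undirected RBM
    W, b, c = params[-1]
    lower = bernoulli(np.full((n_samples, W.shape[0]), 0.5))
    for _ in range(n_gibbs):
        upper = bernoulli(sigmoid(lower @ W + b))
        lower = bernoulli(sigmoid(upper @ W.T + c))

    # single directed top-down pass through the remaining layers
    state = lower
    for W, b, c in reversed(params[:-1]):
        state = bernoulli(sigmoid(state @ W.T + c))
    return state
```

The chain length `n_gibbs` has to be long enough for the top RBM to equilibrate; the value used here is only a placeholder.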

The DBN used in our experiments is the same as that used in Ref. [23]: it has a visible layer with $784$ nodes and $L=10$ hidden layers with the following number of nodes: $n_{\ell}=500,250,120,60,30,25,20,15,10$ and $5$, for $\ell=1,\ldots,10$. Similar results to those discussed in the main text were obtained for different architectures.

In order to learn the parameters of a single RBM we used stochastic gradient ascent of the log-likelihood, using Contrastive Divergence with $k=10$ and mini-batches of $64$ (see [38]), for $\sim 10^{3}$ epochs. Decelle et al. [39, 40] have shown that the distribution learned by an RBM trained with CD-10 does not reproduce the equilibrium distribution, but that it can be a good generative model if it is sampled out of equilibrium. They observed instead that persistent contrastive divergence (PCD-10) was able to converge to the equilibrium distribution (see footnote 15). To the best of our knowledge, the main gist of our results does not depend on the details of the algorithm used in training.

Footnote 15: In Contrastive Divergence-k (CD-k), the Markov chain used to sample the distribution is initialized on the batch used to compute the gradient and $k$ Monte Carlo steps are performed. In Persistent Contrastive Divergence-k (PCD-k) the Markov chain is initialized in the configuration of the previous epoch.

A.2 Boltzmann learning of Ising model

The Ising model is the maximum entropy model that reproduces the empirical averages

$$\langle s_{i}\rangle_{\mathcal{D}}\equiv\frac{1}{N}\sum_{n=1}^{N}s_{i}^{(n)},\qquad\langle s_{i}s_{j}\rangle_{\mathcal{D}}\equiv\frac{1}{N}\sum_{n=1}^{N}s_{i}^{(n)}s_{j}^{(n)} \tag{9}$$

of single spins and pairs of spins. For an exponential family, finding the parameters $h_{i}$ and $J_{ij}$ in Eq. (6) such that the expectations over the model match the empirical averages in Eq. (9) is the same as maximizing the log-likelihood $\mathcal{L}(\mathbf{J},\mathbf{h}|\mathcal{D})$ of the empirical data, whose gradient components are:

$$\begin{aligned}\frac{\partial\mathcal{L}}{\partial J_{ij}}&=\langle s_{i}s_{j}\rangle_{\mathcal{D}}-\langle s_{i}s_{j}\rangle_{p^{(2)}}\\ \frac{\partial\mathcal{L}}{\partial h_{i}}&=\langle s_{i}\rangle_{\mathcal{D}}-\langle s_{i}\rangle_{p^{(2)}}\end{aligned} \tag{10}$$

To find the parameters we perform gradient ascent on the log-likelihood. We used $64$ parallel Markov chains of length $10\cdot n$, with $n$ the total number of spins.
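The moment-matching update of Eq. (10) can be illustrated on a toy problem. The sketch below is not the procedure used for the experiments: it replaces the Markov chain estimate of the model averages with an exact enumeration of all $2^{n}$ states, which is feasible only for a handful of spins. It also assumes 0/1 spins; the function name, learning rate and number of iterations are hypothetical.

```python
import itertools
import numpy as np

def fit_pairwise_exact(samples, lr=0.5, n_iter=5000):
    """Gradient ascent on the log-likelihood of Eq. (6), using the gradient of Eq. (10)."""
    N, n = samples.shape
    states = np.array(list(itertools.product([0.0, 1.0], repeat=n)))  # all 2^n states
    iu = np.triu_indices(n, k=1)                                       # pairs i < j
    mean_data = samples.mean(axis=0)
    corr_data = (samples.T @ samples / N)[iu]
    pair_states = states[:, iu[0]] * states[:, iu[1]]
    h = np.zeros(n)
    J = np.zeros(len(iu[0]))
    for _ in range(n_iter):
        logw = states @ h + pair_states @ J
        p = np.exp(logw - logw.max())
        p /= p.sum()                               # model distribution p^(2)
        h += lr * (mean_data - p @ states)         # Eq. (10), fields
        J += lr * (corr_data - p @ pair_states)    # Eq. (10), couplings
    return h, J, p, states
```

At convergence the model averages `p @ states` reproduce the empirical means of Eq. (9). For realistic layer sizes the exact averages have to be replaced by Monte Carlo estimates, as described above.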

A.3 Best fitted HFM with $k$ independent spins

Given $N$ samples of a hidden layer, $\{\mathbf{s}^{(i)}\}_{i=1}^{N}$, the empirical distribution can be expressed as:

$$\hat{p}(\mathbf{s})=\frac{1}{N}\sum_{i=1}^{N}\delta(\mathbf{s}-\mathbf{s}^{(i)}). \tag{11}$$

The HFM with $k$ independent spins defined in Eq. (7) can be written as an exponential family

$$h_{n}^{(k)}(\mathbf{s})=\frac{1}{Z(k,g)}e^{-g\mathcal{H}(\mathbf{s})} \tag{12}$$

where the Hamiltonian is

$$\mathcal{H}(\mathbf{s})=\max\{m_{\mathbf{s}}-k,0\},\qquad m_{\mathbf{s}}=\max\{i:s_{i}=1\}. \tag{13}$$

The normalization factor is given by

$$Z(k,g)=\sum_{\mathbf{s}}e^{-g\mathcal{H}(\mathbf{s})}=2^{k-1}\left(1+\frac{\xi^{n-k+1}-1}{\xi-1}\right),\qquad\xi=2e^{-g}. \tag{14}$$
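The closed form in Eq. (14) can be checked against a brute-force sum over all $2^{n}$ configurations for small $n$. The sketch below assumes 0/1 spins and the convention $m_{\mathbf{s}}=0$ for the all-zero configuration; the function names are illustrative. At the critical point $g=\log 2$ one has $\xi=1$ and the ratio in Eq. (14) has to be replaced by its limit $n-k+1$.

```python
import itertools
import math

def hamiltonian(s, k):
    """H(s) = max(m_s - k, 0) as in Eq. (13); m_s is the 1-based index of the last active spin."""
    m_s = max((i + 1 for i, si in enumerate(s) if si == 1), default=0)
    return max(m_s - k, 0)

def Z_closed_form(n, k, g):
    """Partition function of Eq. (14), with the xi -> 1 limit handled explicitly."""
    xi = 2.0 * math.exp(-g)
    if abs(xi - 1.0) < 1e-12:
        ratio = n - k + 1            # limit of (xi^(n-k+1) - 1) / (xi - 1)
    else:
        ratio = (xi ** (n - k + 1) - 1.0) / (xi - 1.0)
    return 2.0 ** (k - 1) * (1.0 + ratio)

def Z_brute_force(n, k, g):
    """Direct sum of exp(-g H(s)) over all 2^n binary configurations."""
    return sum(math.exp(-g * hamiltonian(s, k))
               for s in itertools.product([0, 1], repeat=n))
```

The two functions should agree to numerical precision for any $0\leq k\leq n$; the brute-force version is only practical for $n$ up to about 20.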

The Kullback-Leibler divergence between the empirical distribution and the HFM model is defined as:

$$D_{KL}(\hat{p}_{\ell}|h_{n_{\ell}})=\sum_{\mathbf{s}}\hat{p}(\mathbf{s})\log\frac{\hat{p}(\mathbf{s})}{h_{n}^{(k)}(\mathbf{s})}=\sum_{\mathbf{s}}\hat{p}(\mathbf{s})\left[\log\hat{p}(\mathbf{s})+g\mathcal{H}(\mathbf{s})\right]+\log Z(k,g). \tag{15}$$
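Eq. (15) only requires the empirical entropy, the empirical mean energy and $\log Z(k,g)$, so it can be evaluated directly from a sample. The sketch below is a minimal illustration under stated assumptions: samples are tuples of 0/1 values, $m_{\mathbf{s}}=0$ for the all-zero state, and the names are hypothetical. The partition function is computed from the geometric sum $2^{k-1}\left(2+\sum_{j=1}^{n-k}\xi^{j}\right)$, which equals Eq. (14) and avoids the $0/0$ at $\xi=1$. The result is in nats; dividing by $\log 2$ gives bits.

```python
import math
from collections import Counter

def hfm_energy(s, k):
    """H(s) of Eq. (13), with m_s = 0 for the all-zero configuration."""
    m_s = max((i + 1 for i, si in enumerate(s) if si == 1), default=0)
    return max(m_s - k, 0)

def log_Z(n, k, g):
    """log of Eq. (14), written as a geometric sum."""
    xi = 2.0 * math.exp(-g)
    return (k - 1) * math.log(2.0) + math.log(2.0 + sum(xi ** j for j in range(1, n - k + 1)))

def dkl_to_hfm(samples, k, g):
    """D_KL of Eq. (15) between the empirical distribution of `samples` and the HFM."""
    N = len(samples)
    n = len(samples[0])
    counts = Counter(tuple(s) for s in samples)
    neg_entropy = sum(c / N * math.log(c / N) for c in counts.values())
    mean_energy = sum(hfm_energy(s, k) for s in samples) / N
    return neg_entropy + g * mean_energy + log_Z(n, k, g)
```

Dividing the result by the number of nodes $n$ gives a per-node normalisation like the one used in Fig. 1.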

For a given sample of a hidden layer and for each value of $k$, we can find the HFM that minimizes the $D_{KL}$ in Eq. (15) by finding the value $\hat{g}$ such that the expected value of the energy over the model matches the empirical one:

$$\langle\mathcal{H}(\mathbf{s})\rangle_{\mathcal{D}}\equiv\frac{1}{N}\sum_{i=1}^{N}\max\{m_{\mathbf{s}^{(i)}}-k,0\}=\sum_{\mathbf{s}}\max\{m_{\mathbf{s}}-k,0\}\,h_{n}^{(k)}(\mathbf{s})\equiv\langle\mathcal{H}(\mathbf{s})\rangle_{h_{n}^{(k)}}. \tag{16}$$

The average energy of the HFM as a function of the parameters $k$ and $g$ is:

$$\langle\mathcal{H}(\mathbf{s})\rangle_{h_{n}^{(k)}}=\xi\left(\frac{(n-k+1)\xi^{n-k}+1}{\xi^{n-k+1}+\xi-2}-\frac{1}{\xi-1}\right),\qquad\xi=2e^{-g}. \tag{17}$$
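Since the model average of the energy decreases monotonically with $g$ (its derivative is minus the variance of $\mathcal{H}$), the matching condition of Eq. (16) can be solved by bisection. The sketch below is one possible implementation, not necessarily the one used for the experiments. It assumes 0/1 spins and $m_{\mathbf{s}}=0$ for the all-zero state, uses illustrative names and search bounds, and evaluates the model average as the sum $\sum_{j}j\,\xi^{j}/\left(2+\sum_{j}\xi^{j}\right)$ over $j=1,\ldots,n-k$, which equals Eq. (17) but remains finite at $\xi=1$.

```python
import math

def hfm_energy(s, k):
    """H(s) of Eq. (13), with m_s = 0 for the all-zero configuration."""
    m_s = max((i + 1 for i, si in enumerate(s) if si == 1), default=0)
    return max(m_s - k, 0)

def mean_energy(n, k, g):
    """Model average of H, computed in log space for numerical stability."""
    log_xi = math.log(2.0) - g
    # log-weights of the energy levels H = 0, 1, ..., n-k (in units of 2^(k-1))
    logw = [math.log(2.0)] + [j * log_xi for j in range(1, n - k + 1)]
    top = max(logw)
    w = [math.exp(l - top) for l in logw]
    return sum(j * wj for j, wj in enumerate(w)) / sum(w)

def mean_energy_eq17(n, k, g):
    """Closed form of Eq. (17); singular at g = log 2, where xi = 1."""
    xi = 2.0 * math.exp(-g)
    a = n - k
    return xi * (((a + 1) * xi ** a + 1) / (xi ** (a + 1) + xi - 2) - 1 / (xi - 1))

def fit_g(samples, k, g_lo=-10.0, g_hi=10.0, tol=1e-10):
    """Bisection for the g that satisfies Eq. (16); bounds are illustrative."""
    n = len(samples[0])
    target = sum(hfm_energy(s, k) for s in samples) / len(samples)
    for _ in range(200):
        mid = 0.5 * (g_lo + g_hi)
        if mean_energy(n, k, mid) > target:
            g_lo = mid      # model energy too high: increase g
        else:
            g_hi = mid
        if g_hi - g_lo < tol:
            break
    return 0.5 * (g_lo + g_hi)
```

If the empirical energy lies outside the range attainable within the search bounds, the function returns a value at the boundary, which should be treated as a failed fit. Repeating the fit for each $k$ and comparing the resulting $D_{KL}$ of Eq. (15) gives the estimate of $k$ discussed in the main text.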

References

  • [1] David Marr. Vision: A computational investigation into the human representation and processing of visual information. MIT press, 2010.
  • [2] Dana H Ballard. Brain computation as hierarchical abstraction. MIT press, 2015.
  • [3] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning, volume 1. MIT Press, Massachusetts, USA, 2017.
  • [4] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 54(10):95–103, 2011.
  • [5] M Marsili, I Mastromatteo, and Y Roudi. On sampling and modeling complex systems. Journal of Statistical Mechanics: Theory and Experiment, 2013(09):P09003, 2013.
  • [6] Horace B Barlow. Unsupervised learning. Neural computation, 1(3):295–311, 1989.
  • [7] Matteo Marsili and Yasser Roudi. Quantifying relevance in learning and inference. Physics Reports, 963:1–43, 2022.
  • [8] Rongrong Xie and Matteo Marsili. A simple probabilistic neural network for machine understanding. Journal of Statistical Mechanics: Theory and Experiment, 2024(2):023403, 2024.
  • [9] Florentin Guth and Brice Ménard. On the universality of neural encodings in CNNs. In ICLR 2024 Workshop on Representational Alignment, 2024.
  • [10] G Tkačik, T Mora, O Marre, D Amodei, S E Palmer, M J Berry, and W Bialek. Thermodynamics and signatures of criticality in a network of neurons. Proceedings of the National Academy of Sciences, 112(37):11508–11513, 2015.
  • [11] E D Lee, C P Broedersz, and W Bialek. Statistical Mechanics of the US Supreme Court. Journal of Statistical Physics, 160:275–301, July 2015.
  • [12] Eiling Yee. Abstraction and concepts: when, how, where, what and why?, 2019.
  • [13] Judy S DeLoache. Rapid change in the symbolic functioning of very young children. Science, 238(4833):1556–1557, 1987.
  • [14] James CR Whittington, David McCaffary, Jacob JW Bakermans, and Timothy EJ Behrens. How to build a cognitive map. Nature neuroscience, 25(10):1257–1272, 2022.
  • [15] Charles P Davis and Eiling Yee. Features, labels, space, and time: Factors supporting taxonomic relationships in the anterior temporal lobe and thematic relationships in the angular gyrus. Language, Cognition and Neuroscience, 34(10):1347–1357, 2019.
  • [16] Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction. science, 331(6022):1279–1285, 2011.
  • [17] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106, 1962.
  • [18] John O’Keefe and Jonathan Dostrovsky. The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain research, 1971.
  • [19] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in cognitive sciences, 11(8):333–341, 2007.
  • [20] Davide Zoccolan. Invariant visual object recognition and shape processing in rats. Behavioural brain research, 285:10–33, 2015.
  • [21] Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural networks: tricks of the trade, pages 239–274. Springer, 2002.
  • [22] Alessandro Ingrosso and Sebastian Goldt. Data-driven emergence of convolutional structure in neural networks. Proceedings of the National Academy of Sciences, 119(40):e2201854119, 2022.
  • [23] J Song, M Marsili, and J Jo. Resolution and relevance trade-offs in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2018(12):123406, dec 2018.
  • [24] T M Cover and J A Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [25] R Shwartz-Ziv and N Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • [26] Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems, pages 6111–6122, 2019.
  • [27] Clélia de Mulatier, Paolo P Mazza, and Matteo Marsili. Statistical inference of minimally complex models. arXiv preprint arXiv:2008.00520, 2020.
  • [28] Alberto Beretta, Claudia Battistin, Clélia De Mulatier, Iacopo Mastromatteo, and Matteo Marsili. The stochastic complexity of spin models: Are pairwise models really simple? Entropy, 20(10):739, 2018.
  • [29] In Jae Myung, Vijay Balasubramanian, and Mark A. Pitt. Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences, 97(21):11170–11175, 2000.
  • [30] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
  • [31] Nicoletta Berardi, Tommaso Pizzorusso, and Lamberto Maffei. Critical periods during sensory development. Current opinion in neurobiology, 10(1):138–145, 2000.
  • [32] Daniel J Amit and Daniel J Amit. Modeling brain function: The world of attractor neural networks. Cambridge university press, 1989.
  • [33] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
  • [34] Leenoy Meshulam, Jeffrey L Gauthier, Carlos D Brody, David W Tank, and William Bialek. Coarse graining, fixed points, and scaling in a large population of neurons. Physical review letters, 123(17):178103, 2019.
  • [35] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Matteo Carandini, and Kenneth D Harris. High-dimensional geometry of population responses in visual cortex. Nature, 571(7765):361–365, 2019.
  • [36] Silvia Grigolon, Silvio Franz, and Matteo Marsili. Identifying relevant positions in proteins by critical variable selection. Molecular BioSystems, 12(7):2147–2158, 2016.
  • [37] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
  • [38] Geoffrey E Hinton. A practical guide to training restricted boltzmann machines. In Neural networks: Tricks of the trade, pages 599–619. Springer, 2012.
  • [39] Aurélien Decelle, Cyril Furtlehner, and Beatriz Seoane. Equilibrium and non-equilibrium regimes in the learning of restricted boltzmann machines. Advances in Neural Information Processing Systems, 34:5345–5359, 2021.
  • [40] Elisabeth Agoritsas, Giovanni Catania, Aurélien Decelle, and Beatriz Seoane. Explaining the effects of non-convergent sampling in the training of energy-based models. In ICML 2023-40th International Conference on Machine Learning, 2023.