Statistical signatures of abstraction in deep neural networks

Carlo Orientale Caputo
SISSA - International School for Advanced Studies, 34136 Trieste, Italy
Matteo Marsili
Quantitative Life Sciences Section
The Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy marsili@ictp.it

Abstract

We study how abstract representations emerge in a Deep Belief Network (DBN) trained on benchmark datasets. Our analysis targets the principles of learning in the early stages of information processing, starting from the “primordial soup” of the under-sampling regime. As the data is processed by deeper and deeper layers, features are detected and removed, transferring more and more “context-invariant” information to deeper layers. We show that the representation approaches a universal model – the Hierarchical Feature Model (HFM) – determined by the principle of maximal relevance. Relevance quantifies the uncertainty on the model of the data, thus suggesting that “meaning” – i.e. syntactic information – is that part of the data which is not yet captured by a model. Our analysis shows that shallow layers are well described by pairwise Ising models, which provide a representation of the data in terms of generic, low order features. We also show that plasticity increases with depth, in a similar way as it does in the brain. These findings suggest that DBNs are capable of extracting a hierarchy of features from the data which is consistent with the principle of maximal relevance.

Perception and knowledge could never be the same [Plato, Theaetetus 186e]

In the study of hierarchies of abstraction in cognitive sciences, Marr’s hierarchy [1] distinguishes three levels: the first is the conceptual one of defining the functions to be performed, the second is the algorithmic one, and the third is how the first two can be implemented in biophysical terms. Artificial neural networks have reached levels of abstraction that endow a computer (the second and third levels of Marr’s hierarchy) with linguistic proficiency and knowledge that challenge the Turing test. Artificial neural networks thus offer a playground for understanding how abstraction emerges from data, abstracting away from biophysical constraints. Indeed, the conceptual level alone contains fantastic richness. At this level, a hierarchy of deeper levels of abstraction opens up¹, from perception to knowledge, ultimately making us conscious beings [2]. This paper addresses the tip of this iceberg by asking the questions “what is abstraction in the first place, and how can we measure it?”.

¹ Vision is the prototype example of this hierarchy, whereby the recognition of a particular face, which is an abstract concept, depends on lower level concepts such as edges and wedges [1].

We do this by studying how the data is represented in layers of different depth inside a Deep Belief Network (DBN). The architecture is definitely outdated compared, for example, to models that enable linguistic capabilities. But its simplicity allows us to study how abstraction emerges in a controlled way. The optimisation of the algorithms is not a relevant issue, precisely because we want to focus on how abstraction emerges independently of the algorithm. For this reason we adopt standard, well-known benchmark datasets (see the Appendix), so that our results may be easily reproduced. We shall rather focus on how information flows inside the network. It is known that these models are capable of some level of abstraction, because the activity of deeper “neurons” reproduces higher order features with respect to those of shallower layers [3, 4].

Having delimited precisely our domain of action, we shall focus on the statistical properties of the representation that emerges from the data along the hierarchy of levels of the DBN and search for traces of abstraction. DBNs are able to reproduce images which are indistinguishable from those of the dataset they are trained with, carrying “meaning” back and forth between the visible and the deepest layer. In this process, intermediate layers discover “hidden” features, filtering out noise and non-semantic information.

Abstraction has to do with developing a representation that is more and more independent of the data², which represents only “meaning”. This leads us to the question: “what is the statistical structure of languages which carry meaning, i.e. semantic information?”. In order to answer this question a quantitative notion of meaning is necessary. One such notion has been recently introduced [5], the relevance, which quantifies Barlow’s intuition [6] that meaning is carried by redundancy. We refer the interested reader elsewhere [7] for arguments on why this is a meaningful notion of relevance. The important point is that this notion can be easily applied within our simplified setting. So we shall test it as a candidate principle that may inform us about “what does it mean to learn?” or “what is abstraction?”. Interestingly, in the simplified setting studied here there is only one model of a representation that encodes the principle of maximal relevance and only that. This model – the Hierarchical Feature Model (HFM) – thus provides a candidate for an abstract representation that encodes only meaning³.

² In biological brains, independence extends to the sensory modality with which data is acquired. For example, we recognise a person both if we see her or if we hear her voice.

³ Loosely speaking, the HFM can be thought of as a toy model for the platonic realm of reality consisting in forms, ideals, or ideas.

Besides proposing the HFM as a scaffold for meaning, Ref. [8] contrasts the classical learning modality of Restricted Boltzmann Machines (RBM) with one in which, akin to “understanding”, data is stored in a fixed, pre-existent representation. These are two quite different learning modalities: in the first, both the internal representation and the network of connections change during training, while in the second the internal representation is anchored to the HFM and only the weights are learned⁴. The RBM enjoys a remarkable flexibility of transfer learning⁵, which implies that information is stored in the representation and weights are generic⁶. But when the internal representation is fixed, as in the “understanding” modality, information on the data is stored uniquely in the weights.

⁴ Ref. [8] does not show that the HFM is better than any other model in any empirical way. The empirical evidence only shows that the HFM provides a flexible abstract representation, which also enables the network to extract relations between datasets learned in different contexts. The rationale for focusing on the HFM is the purely theoretical argument of efficient data storage. Although further analysis of the HFM would be welcome to assess its optimality in empirical analysis, an abstract theoretical justification should suffice in the present context.

⁵ Transfer learning refers to the fact that a network trained on one dataset is able to perform classification tasks in supervised learning also when probed on a different dataset, by just retraining the output layer.

⁶ Although see [9].

The comparison of these two learning modalities suggests that while the RBM modality describes well “perception” in shallow layers, the HFM modality prevails as one moves to deeper layers. In testing this hypothesis, the rest of this paper will show that i) the internal representations of the layers in a DBN approach more and more the HFM as one moves from shallow to deep layers, and ii) as one draws data from a wider domain. This relation of abstraction with “width” conforms to the idea that the level of abstraction of a representation should be higher for datasets with a larger variation, not evidently reducible to invariances. We also find that iii) shallow layers are well described by pairwise Ising models (PIM) whereas deeper layers require higher order interactions. PIM encode generic representations, because they have been shown to describe data from a large variety of systems, from neurons in the retina [10] to voting patterns in the US Supreme Court [11]. Furthermore, iv) we show that plasticity is reduced in shallow layers and enhanced in deeper ones. In other words, we find that the weights of the shallow layers do not change much when the network is retrained on a new dataset, while those of the deeper layers change considerably. Finally, v) we estimate the number of invariances and show that they decrease with depth in a way that is consistent with the picture discussed above.

Taken together, these findings corroborate the idea that successive layers of a DBN extract statistical models of hidden features and invariances, and that “meaning” is that part of the information which cannot be reduced to features that can be properly fit by a statistical model. Indeed, relevance measures exactly the uncertainty on the model of the data. This suggests that relevance is a quantitative measure of “meaning”, intended as “irreducible” information.

The paper is organised as follows. We first try to put our paper in relation to the literature of other ways of attacking the abstraction hierarchy at the conceptual level. Then we lay down the background introducing the concept of relevance and the HFM. In Section 3 we discuss the results and we conclude with some remarks. All technical details are relegated to the Appendix.

1 Related approaches

As entries in the subject of abstraction hierarchies, we refer to the book by Dana Ballard [2], and the review by Yee [12] which covers approaches more rooted in experimental psychology. Among the many insights that emerge from this literature, the trade-off between abstraction and complexity is relevant to our discussion: two-and-a-half-year-old children exploit reward-relevant information if that is presented in a two dimensional drawing, but not if that is displayed by a three dimensional realistic model [13]. This suggests that motor areas are connected to a rather abstract representation of the environment. Cognitive structures evolve during development: while children are more prone to thematic abstractions based on co-occurrences of memes, abstraction in adults is more frequently associated with a taxonomic (or categorical) structure⁷. This difference is likely related to a difference in the brain areas which are involved, suggesting that taxonomic abstraction requires a higher level of abstraction with respect to thematic abstraction, which is based on event-based elements closer to empirical reality (typically stored in regions afferent to the hippocampus) [15].

⁷ This difference is also reflected anatomically in the different regions of the brain that are putatively involved in a given abstract representation. Event-based inputs are easily stored in the “static memory” structure of the hippocampus region, whereas taxonomical abstractions may require more complex dynamical representations such as those offered by recurrent neural networks [14].

Abstraction does not only have to do with classification but also with uncovering the structure of relations hidden in the data. This idea is at the basis of the concept of a cognitive map [14, 16], which is not only an efficient and flexible scaffold of data, but is also endowed with an appropriate structure of relations that makes it possible to navigate the representation efficiently. Indeed relational structures such as “Alice is the daughter of Jim” and “Bob is Alice’s brother” allow for computations (e.g. “Jim is Bob’s father”) which are invariant with respect to the context (Alice, Jim and Bob can be replaced by any triplet of persons that stand in the same relation, in this example). How the brain builds the appropriate cognitive maps at different levels of abstraction, combining elements at lower levels, is a fascinating issue for which we refer to [14]. The reach of our analysis will not cover higher functions that involve computation, because our setting is completely static. Aspects of abstraction which are encoded in the time domain may be better attacked by an analysis similar to ours in recurrent neural networks. We shall target the very first levels of the hierarchy, which correspond to the under-sampling regime where data is so high dimensional that statistical evidence is not sufficient to identify models. Our aim is to characterise abstraction on the basis of the sole statistical signatures of the representation itself, with no reference to what is being represented in the data. In this respect, we depart from the typical “tuning curve” approach in computational neuroscience, in which levels of abstraction are assessed in terms of the features of the data – e.g. edges in vision [17] or spatial positions [18] – that a representation encodes. We shall argue that, in this setting, the relevance is a natural measure of abstraction, because it quantifies the uncertainty on a possible generative model of the data.

Research in vision has brought us considerable insights on abstraction hierarchies, and in particular on the capacity of the visual system in recognising and exploiting invariances [1, 19, 20]. This ability, when enforced in artificial neural networks either by augmenting the data using invariances [21] or by explicitly implementing them in the architecture of neural networks – as in convolutional neural networks [3] – can be learned by a deep neural network. Interestingly, even simple neural networks are able to develop a convolutional structure by themselves [22]. Convolutions are only one type of regularity that a neural network may use to capture the inner structure of the data. As argued by Tenenbaum et al. [16], knowledge is not information on the data, but rather information on the way data can be organised in representations that enable generative and computational power.

Within this larger context, our paper merely probes the very entrance of the process by which structured data acquires meaning.

2 The framework

A Deep Belief Network (DBN) is a stack of simpler units – the Restricted Boltzmann machines – which are trained to maximise the likelihood of the data. Mathematically, the DBN can be seen as a joint probability distribution

p(\mathbf{x},\mathbf{s}^{(1)},\ldots,\mathbf{s}^{(L)}).

(1)

We are interested in this probability distribution after the DBN has been trained on the data. The data $\hat{\mathbf{x}}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})$ consist of the $N$ vectors $\mathbf{x}_{i}$ that specify the $28\times 28$ B/W pixel values corresponding to handwritten digits (MNIST) and characters (eMNIST), and of stylised pictures of some commercial articles (fMNIST). Each internal layer is characterised by a binary vector $\mathbf{s}^{(\ell)}$ that encodes the activity and a vector of weights $\mathbf{W}^{\ell}$ that relates this activity to the one of the previous layer $\mathbf{s}^{(\ell-1)}$. We focus on the marginal distribution $p(\mathbf{s}^{(\ell)})$ that encodes the representation in the $\ell^{\rm th}$ level of the hierarchy⁸. Our architecture is the same as that of Ref. [23]: it is composed of $L=10$ hidden layers, with a decreasing number of variables. More details on the data, on the architecture and on the learning algorithm are given in the Appendix.

⁸ In practice all our results are based on a sample $\hat{\mathbf{s}}^{(\ell)}$ of $N$ states drawn from the distribution $p(\mathbf{s}^{(\ell)})$ induced in layer $\ell$ when the data is presented in the visible layer. The value of $N$ used in all experiments was $N=60000$.

We describe representations $p(\mathbf{s})$ in terms of the natural variable $E_{\mathbf{s}}=-\log_{2}p(\mathbf{s})$ , which is the minimal number of bits needed to represent the state $\mathbf{s}$ . The average coding cost $H[\mathbf{s}]=\langle E_{\mathbf{s}}\rangle$ is the usual Shannon entropy and counts the number of bits available to describe one point of the dataset. Following Ref. [7], we shall call $H[\mathbf{s}]$ the resolution.

The resolution $H[\mathbf{s}]$ is a measure of information content but not of information “quality”. We take the view that meaningful information should bear statistical signatures that allow it to be distinguished from noise. These make it possible to identify relevant information before finding out what that information is relevant for, a key feature of learning in living systems.

The relevance of a representation $p(\mathbf{s})$ is the entropy of the coding cost $E_{\mathbf{s}}$ . Representations where coding costs are distributed uniformly should be promoted for the reason that, in an optimal representation, the number $W(E)$ of states $\mathbf{s}$ that require $E$ bits to be represented should match as closely as possible the number ( $2^{E}$ ) of words that require $E$ bits. This principle corresponds exactly to the maximisation of the relevance

H[E]=-\sum_{E}p(E)\log_{2}p(E)\,,

(2)

where $p(E)=W(E)2^{-E}$ is the probability that a random point in the data has $E_{\mathbf{s}}=E$ (the base 2 follows from measuring the coding cost $E_{\mathbf{s}}=-\log_{2}p(\mathbf{s})$ in bits). Note that states $\mathbf{s}$ and $\mathbf{s}^{\prime}$ with very different coding costs $E_{\mathbf{s}}$ and $E_{\mathbf{s}^{\prime}}$ can be distinguished by their statistics, because they would naturally belong to different typical sets⁹. Representations that maximise the relevance harvest this benefit in discrimination ability that is accorded to us by statistics.

⁹ By the law of large numbers, typical samples of weakly interacting variables all have approximately the same coding cost, a fact known as the asymptotic equipartition property [24]. A trained DBN classifies the points in a dataset in different typical sets [25].
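As an illustration of these two quantities, the following minimal sketch estimates the resolution $H[\mathbf{s}]$ and the relevance $H[E]$ from a finite sample. It assumes a plug-in estimate in which $p(\mathbf{s})$ is replaced by the empirical frequency of each state, so that states observed the same number of times share the same coding cost; the function name `resolution_and_relevance` and the toy sample are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def resolution_and_relevance(sample):
    """Return plug-in estimates (H[s], H[E]) in bits for a list of hashable states."""
    n_points = len(sample)
    counts = Counter(sample)  # k_s: number of times each state s is observed
    # Empirical coding cost E_s = -log2(k_s / N); the resolution is its average.
    h_s = -sum(k / n_points * math.log2(k / n_points) for k in counts.values())
    # States seen the same number of times have the same coding cost,
    # so p(E) is obtained by grouping the states by their count k.
    m_k = Counter(counts.values())  # m_k: number of distinct states seen k times
    h_e = 0.0
    for k, m in m_k.items():
        p_e = k * m / n_points  # fraction of data points with this coding cost
        h_e -= p_e * math.log2(p_e)
    return h_s, h_e


# Toy sample of N = 8 two-bit states (illustrative only).
sample = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (1, 1), (1, 1)]
h_s, h_e = resolution_and_relevance(sample)
```

For this toy sample the estimates are $H[\mathbf{s}]=1.75$ bits and $H[E]=1.5$ bits. The relevance never exceeds the resolution, since $E_{\mathbf{s}}$ is a function of $\mathbf{s}$.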

The HFM describes the distribution $p(\mathbf{s})$ of a string $\mathbf{s}=(s_{1},\ldots,s_{n})$ of binary variables that we can take as indicators of whether each of $n$ features is present ($s_{i}=1$) or not ($s_{i}=0$). This distribution satisfies the property that the occurrence of a feature $s_{k}=1$ at level $k$ does not provide any information on whether lower order features are present or not. This means that, conditional on $s_{k}=1$, all lower order features are as random as possible, $H[s_{1},\ldots,s_{k-1}|s_{k}=1]=k-1$ in bits. This requirement implies that the Hamiltonian $E_{\mathbf{s}}$ should be a function of $m_{\mathbf{s}}=\max\{k:~{}s_{k}=1\}$, with $m_{\mathbf{s}}=0$ if $\mathbf{s}=(0,\ldots,0)$ is the featureless state.

The principle of maximal relevance prescribes a degeneracy of states $W(E)=e^{\nu E}$ that increases exponentially with the coding cost [7]. So it excludes all functional forms between $E_{\mathbf{s}}$ and $m_{\mathbf{s}}$ that are not linear. This is why, combined with the previous requirement, the principle of maximal relevance leads to the HFM, that assigns a probability

h_{n}(s_{1},\ldots,s_{n})=\frac{1}{Z_{n}}e^{-gm_{\mathbf{s}}}\,,

(3)

to state $\mathbf{s}$ [here $Z_{n}$ is the partition function]. We refer to [8] for a detailed discussion of the properties of the HFM. In brief, in the limit $n\to\infty$ the HFM features a phase transition at $g_{c}=\log 2$ between a random phase where $H[\mathbf{s}]$ is of order $n$ for $g<g_{c}$, and a “low temperature” phase for $g>g_{c}$ where $h_{n}(\mathbf{s})$ is dominated by a finite number of states (and $H[\mathbf{s}]$ is finite in the limit $n\to\infty$).
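The location of the transition can be read off from the partition function. The short derivation below is a sketch that uses only Eq. (3) and the definition of $m_{\mathbf{s}}$; the shorthand $\xi=2e^{-g}$ is introduced here for convenience and is not notation from the paper. There is one state with $m_{\mathbf{s}}=0$ and $2^{m-1}$ states with $m_{\mathbf{s}}=m\geq 1$, because the $m-1$ lower order features are free.

```latex
Z_{n} \;=\; \sum_{\mathbf{s}} e^{-g m_{\mathbf{s}}}
      \;=\; 1+\sum_{m=1}^{n} 2^{m-1} e^{-g m}
      \;=\; 1+\frac{\xi}{2}\,\frac{\xi^{n}-1}{\xi-1},
\qquad \xi = 2e^{-g}\neq 1,
\qquad\text{and}\qquad
Z_{n} = 1+\frac{n}{2} \quad\text{at } g=g_{c}=\log 2 .
```

For $g>g_{c}$ one has $\xi<1$ and $Z_{n}$ converges to a finite limit as $n\to\infty$, so a finite number of states carries most of the probability. For $g<g_{c}$ one has $\xi>1$ and $Z_{n}$ grows as $\xi^{n}$, so the weight shifts to the exponentially many states with $m_{\mathbf{s}}$ of order $n$.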

The HFM interpolates between high order features that code for meaning and low order ones, whose statistics is closer to that of noise. Indeed, marginalising over the low order features $s_{1},\ldots,s_{k}$ returns again an HFM over the remaining $n-k$ features

\sum_{s_{1},\ldots,s_{k}}h_{n}(s_{1},\ldots,s_{n})=h_{n-k}(s_{k+1},\ldots,s_{n})\,.

(4)

On the other hand, marginalising over the high order ones yields a mixture between the HFM and the maximum entropy distribution

\sum_{s_{k+1},\ldots,s_{n}}h_{n}(s_{1},\ldots,s_{n})=\frac{Z_{k}}{Z_{n}}h_{k}(s_{1},\ldots,s_{k})+\left(1-\frac{Z_{k}}{Z_{n}}\right)2^{-k}\,.

(5)

When $g\leq g_{c}$ , the ratio $\frac{Z_{k}}{Z_{n}}\to 0$ as $n\to\infty$ with $k$ finite, so the distribution of the first $k$ features converges to a state of maximal entropy in this limit. Likewise, low order features become more and more independent from higher order ones in this limit.
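The mixture structure of Eq. (5) can be checked numerically for a small system. The sketch below enumerates all states by brute force; the helper names (`m_of`, `partition`, `hfm`) and the values $n=6$, $k=2$, $g=0.5$ are arbitrary illustrative choices, not taken from the paper.

```python
import math
from itertools import product


def m_of(s):
    """1-based index of the highest active feature; 0 for the featureless state."""
    return max((i + 1 for i, bit in enumerate(s) if bit), default=0)


def partition(n, g):
    """Z_n = 1 + sum_{m=1}^{n} 2^(m-1) e^(-g m), from the degeneracy of m_s."""
    return 1.0 + sum(2 ** (m - 1) * math.exp(-g * m) for m in range(1, n + 1))


def hfm(s, g):
    """HFM probability of Eq. (3) for a state s of length n = len(s)."""
    return math.exp(-g * m_of(s)) / partition(len(s), g)


n, k, g = 6, 2, 0.5  # illustrative values
ratio = partition(k, g) / partition(n, g)
max_err = 0.0
for low in product((0, 1), repeat=k):
    # Left-hand side of Eq. (5): sum over the n-k high order features.
    lhs = sum(hfm(low + high, g) for high in product((0, 1), repeat=n - k))
    # Right-hand side: mixture of an HFM on k features and the uniform distribution.
    rhs = ratio * hfm(low, g) + (1.0 - ratio) * 2.0 ** (-k)
    max_err = max(max_err, abs(lhs - rhs))

# Sanity check: the HFM is normalised over all 2^n states.
total = sum(hfm(s, g) for s in product((0, 1), repeat=n))
```

Increasing `n` at fixed `k` with $g\leq g_{c}$ makes `ratio` shrink, so the uniform component of the mixture dominates, in line with the limit discussed above.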

We take the view of learning as progressively detecting hidden features and invariances in the data and modeling them. Ideally, this process is one where the marginal probability of a variable $\phi(\mathbf{s})$ that encodes the hidden features is sharply peaked¹⁰ or that there are some variables $\theta(\mathbf{s})$ that encode some invariance (e.g. by translations). In the former case, the maximisation of the likelihood between layers can avail of $H[\mathbf{s}^{(\ell)}|\phi]$ bits of noise, at most, to be expelled from the DBN¹¹. In the case of an invariance, the marginal distribution itself of $\theta$ is a-priori a maximum entropy distribution, granting $-\log_{2}p_{\ell-1}(\theta)$ bits of noise to be disposed of. Incidentally, the reduction of relevant information to invariances provides substantial computational advantages. Indeed if a representation $\mathbf{s}=(X(\mathbf{s}),Y(\mathbf{s}),\ldots)$ can be expressed in terms of two (or more) independent random variables, these can be processed independently of one another, in parallel¹².

¹⁰ We take the loose meaning of the term “hidden features” as approximate sufficient statistics. By sharply peaked we mean that the variation of $\phi$ is constrained to a low dimensional manifold. Ansuini et al. [26] show that this is true for the variation of $\mathbf{s}^{(\ell)}$ itself.

¹¹ Note that, by construction, the conditional distribution $p(\mathbf{s}^{(\ell)}|\mathbf{s}^{(\ell-1)})$ in the $\ell^{\rm th}$ layer of the DBN is a maximum entropy distribution of independent binary variables, which is fully specified by the averages $\langle s^{(\ell)}_{i}|\mathbf{s}^{(\ell-1)}\rangle$. Therefore $H[\mathbf{s}^{(\ell)}|\mathbf{s}^{(\ell-1)}]$ quantifies the amount of information (per datapoint) that the DBN regards as noise.

¹² We refer to De Mulatier et al. [27] for an attempt to disentangle a sample $\hat{\mathbf{s}}$ of binary data in independent components that is inspired by information theoretic principles alone and addresses inference in the under-sampling domain. Our preliminary attempts to disentangle independent variables in the DBN with this method suggest that this view of learning refers to an ideal limit of an optimal learning machine.

In this view, the features in the HFM describe “irreducible” information, not yet captured within a maximum entropy model. The relevance can then be thought of as a quantitative measure of the residual uncertainty on the model of the data.

3 The results

We compute the Kullback-Leibler divergence $D_{KL}(\hat{p}_{\ell}||h_{n_{\ell}})$ of the data from the HFM in each layer $\ell$. This can be thought of as a tax (in bits) that is charged to the data for not being storable in an efficient manner. The results summarised in Fig. 1 show that this measure responds positively to the expectation that relevance provides a quantitative measure of meaning. Indeed $D_{KL}(\hat{p}_{\ell}||h_{n_{\ell}})$ decreases with $\ell$ in all datasets, showing that the internal representation approaches the HFM as depth increases.
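A minimal sketch of this computation is given below. It assumes that $\hat{p}_{\ell}$ is the empirical distribution of the sampled states (a plug-in estimate, which is biased when the sample is small compared to the number of states) and that $g$ is fitted by maximum likelihood. Since the HFM depends on $\mathbf{s}$ only through $m_{\mathbf{s}}$, the fit reduces to minimising the convex function $g\langle m_{\mathbf{s}}\rangle+\log Z_{n}(g)$. The function names, the search interval $[0,10]$ for $g$ and the toy sample are assumptions of this sketch, not the procedure of the Appendix.

```python
import math
from collections import Counter

LN2 = math.log(2.0)


def m_of(s):
    """1-based index of the highest active feature; 0 for the featureless state."""
    return max((i + 1 for i, bit in enumerate(s) if bit), default=0)


def log_partition(n, g):
    """ln Z_n, computed with a log-sum-exp for numerical stability."""
    terms = [0.0] + [(m - 1) * LN2 - g * m for m in range(1, n + 1)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def fit_g(mean_m, n, lo=0.0, hi=10.0, iters=200):
    """Maximum-likelihood g by ternary search on the convex cost g*<m> + ln Z_n(g)."""
    def cost(g):
        return g * mean_m + log_partition(n, g)
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if cost(a) < cost(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


def kl_to_hfm(sample):
    """Return (g_hat, D_KL in bits) between the empirical distribution and the best-fit HFM."""
    n, n_points = len(sample[0]), len(sample)
    counts = Counter(sample)
    entropy = -sum(c / n_points * math.log2(c / n_points) for c in counts.values())
    mean_m = sum(c * m_of(s) for s, c in counts.items()) / n_points
    g_hat = fit_g(mean_m, n)
    # Cross-entropy -<log2 h_n(s)> of the data under the fitted HFM.
    cross_entropy = (g_hat * mean_m + log_partition(n, g_hat)) / LN2
    return g_hat, cross_entropy - entropy


# Toy sample whose frequencies coincide with an HFM with n = 3 and g = log 2.
sample = ([(0, 0, 0)] * 8 + [(1, 0, 0)] * 4 + [(0, 1, 0)] * 2 + [(1, 1, 0)] * 2
          + [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)])
g_hat, dkl = kl_to_hfm(sample)
```

On this toy sample the fitted value of $g$ is close to $\log 2$ and the divergence vanishes up to numerical precision. Dividing the result by $n$ gives a divergence per node, as in the normalisation used in Fig. 1.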

We expect that a dataset with a larger variety, not evidently reducible to invariances, should correspond to a more abstract representation, with respect to one trained on data drawn from a “narrower” domain. Fig. 1 corroborates this expectation, by showing that the distance of the internal representation to the HFM also decreases with “width”. We probe this behaviour in two ways: first we generate a “narrower” dataset from symmetry transformations of the digit “2” of MNIST. The distance of the internal representations of this dataset from the HFM is significantly larger than that of the MNIST dataset. Second, we train the DBN with a “wider” dataset, combining the MNIST and the eMNIST datasets. The results confirm our expectations, even though a significant reduction in the distance of the internal representations to the HFM (with respect to that of DBNs trained on the individual datasets) is only visible in the deepest layers. The inset of Fig. 1 shows that the estimate of the parameter $g$ approaches the critical point $g_{c}=\log 2$ with depth.

Figure 1: $D_{KL}$ between the internal representation of each layer and the best-fit HFM, normalized to the number of nodes of each layer for 5 different datasets. Besides benchmark datasets (MNIST, eMNIST and fMNIST), we also show results for a dataset of $N=60000$ digits which are obtained by simple transformations (rotations and translation) of the data points in MNIST that correspond to the digit “2”, and for a DBN trained on the combined MNIST and eMNIST datasets ($N=120000$). The inset shows the distance $\delta g=(\hat{g}-g_{c})/\sqrt{\mathbb{V}(\hat{g})}$ of the estimated value of $g$ from the critical point $g_{c}$, normalised by the standard deviation of the estimator, for the MNIST dataset.

Fig. 1 shows that the HFM does not provide a good description of shallow layers. Fig. 2 shows that these are instead well described by pairwise Ising models (PIM), which contain only up-to-pairwise interactions. The PIM is defined as

p^{(2)}(\mathbf{s})=\frac{1}{Z}\exp{\left(\sum_{i<j}J_{ij}^{l}s_{i}s_{j}+\sum_{i}h_{i}^{l}s_{i}\right)}\,,

(6)

where $Z$ is the partition function and the parameters $J_{ij}$ and $h_{i}$ are estimated using maximum likelihood (more details are given in the Appendix).

In order to measure “pairwise-ness”, we compute the Kullback-Leibler divergence between the internal representation of a layer $\ell$ and the best PIM describing that layer, that is, the model $p_{\ell}^{(2)}(\mathbf{s})$ that minimises $D_{KL}(\hat{p}_{\ell}||p^{(2)})$ with respect to the hidden layer distribution $\hat{p}_{\ell}$. Fig. 2 shows that the minimal $D_{KL}(\hat{p}_{\ell}||p^{(2)})$ for DBNs trained with the MNIST, fMNIST and eMNIST datasets is negligibly small in shallow layers, and it ramps up with depth.
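The idea can be illustrated on a system small enough to enumerate all states. The sketch below fits the PIM of Eq. (6) by gradient ascent on the exact log-likelihood, whose gradient is the difference between the data and the model averages of $s_{i}$ and $s_{i}s_{j}$. This is only a toy version: exact enumeration is not feasible for layers with hundreds of units, and it is not claimed to be the fitting procedure described in the Appendix. The function name, learning rate and number of steps are assumptions of this sketch.

```python
import math
from collections import Counter
from itertools import combinations, product


def fit_pim_kl(sample, lr=0.5, steps=3000):
    """Fit a pairwise Ising model exactly; return D_KL(p_hat || p2) in bits."""
    n, n_points = len(sample[0]), len(sample)
    states = list(product((0, 1), repeat=n))  # all 2^n states: small n only
    pairs = list(combinations(range(n), 2))
    # Sufficient statistics of each state: n fields s_i, then the products s_i s_j.
    feats = [list(s) + [s[i] * s[j] for i, j in pairs] for s in states]
    n_par = len(feats[0])
    counts = Counter(sample)
    p_hat = [counts[s] / n_points for s in states]
    target = [sum(p * f[a] for p, f in zip(p_hat, feats)) for a in range(n_par)]
    theta = [0.0] * n_par  # parameters (h_i, J_ij), in the same order as feats
    for _ in range(steps):
        weights = [math.exp(sum(t * x for t, x in zip(theta, f))) for f in feats]
        z = sum(weights)
        moments = [sum(w / z * f[a] for w, f in zip(weights, feats))
                   for a in range(n_par)]
        # Gradient of the log-likelihood: data moments minus model moments.
        theta = [t + lr * (d - m) for t, d, m in zip(theta, target, moments)]
    weights = [math.exp(sum(t * x for t, x in zip(theta, f))) for f in feats]
    z = sum(weights)
    return sum(p * math.log2(p * z / w) for p, w in zip(p_hat, weights) if p > 0)


# A distribution on two bits is always pairwise, so the divergence vanishes.
kl_pair = fit_pim_kl([(0, 0)] * 4 + [(0, 1)] + [(1, 0)] * 2 + [(1, 1)] * 3)
# Even-parity states of three bits carry a pure third-order interaction.
kl_parity = fit_pim_kl([(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)])
```

The parity example has the same one- and two-body averages as the uniform distribution over the eight states, so the best PIM is uniform and the divergence equals one bit. This is the kind of higher order structure that a pairwise model cannot capture.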

This is consistent with the fact that PIM are models of very high complexity [28], which means that they can describe data from a large variety of systems¹³. In this sense, the level of abstraction of PIM is very low and the distance to the best PIM can be taken as a measure of un-abstractness. The internal representations of shallow layers are therefore rather generic, which agrees with their ability to “transfer” information to deeper layers also for other datasets, without the need of being retrained. This ability is related to the widely supported idea that shallow layers code information in terms of local, low order features of the data, which are well described by pairwise interactions.

¹³ The complexity of a model, as shown in Ref. [29], is a measure of the number of different datasets that can be described with it. The complexity of the PIM grows as the number of parameters, which is proportional to $n^{2}$. That of the HFM only grows as $\log n$, which implies that the uncertainty of the parameter $g$ that a sample of $N$ points can provide is of the order of $1/(\sqrt{N}\log n)$. The couplings of the fitted PIM are small, as in Ref. [10], which is consistent with the fact that information in RBMs is passed mostly by one-spin averages because the conditional multi-information $I(s_{1},\ldots,s_{n}|\mathbf{x})$ is zero. The HFM is instead characterised by strong interaction at all orders, as shown in Ref. [8].

Taken together, the two results discussed above suggest that plasticity should increase with depth. This is because, if the representation of deep layers is close to an abstract (data independent) model, then the information on the data should necessarily be stored in the (data dependent) weights that connect one layer to the next. So the weights of deep layers should change considerably when the data changes. On the contrary, the weights of shallow layers should not change much, because their representations are generic.

Fig. 3 shows the results of experiments training a DBN first on a dataset and then on a different one. It shows the distance between the weights $\mathbf{W}_{1}^{\ell}$ learned in layer $\ell$ for the first dataset, to those ( $\mathbf{W}_{2}^{\ell}$ ) learned for the second dataset.

As expected, the weights in shallow layers change less than those in deep layers, consistently with the idea that the weights of the first layers are very generic and do not capture specific features of the dataset. The deep layers instead have more specific representations, and their weights are more data dependent. This result is consistent with the observation that shallow layers tend to learn oriented and localized edge filters, whereas the deeper layers are inclined to capture higher-level features [4].

Finally we extend the HFM by dividing the set of variables into two groups as

h_{n}^{(k)}(s_{1},\ldots,s_{n})=\frac{1}{Z_{n}^{(k)}}e^{-g\max(k,m_{\mathbf{s}})}=2^{-k}h_{n-k}^{(0)}(s_{k+1},\ldots,s_{n})\,,

(7)

in such a way that the first $k$ variables are described by a maximal entropy distribution $p(s_{1},\ldots,s_{k})=2^{-k}$ and the remaining $n-k$ are described by an HFM. Here $k$ is meant to provide a sharp separation between variables coding for invariances and variables that code for meaning. Therefore the size $k$ of the first group provides a rough measure of the “number of invariances” present in the representation. Fig. 4 shows that the estimated value of $k$ sharply decreases with depth, and it does so more slowly for data augmented using invariances.
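One possible way to estimate $k$ from a sample is sketched below. It assumes a joint maximum-likelihood fit: by Eq. (7), the log-probability of a state is $-k\log 2-g\,m^{\prime}-\log Z_{n-k}(g)$ with $m^{\prime}=\max(0,m_{\mathbf{s}}-k)$, so for each $k$ one can fit $g$ and keep the value of $k$ with the highest likelihood. Whether this matches the estimator used for Fig. 4 is not stated in this section. The function names, the search interval for $g$ and the toy sample are assumptions of this sketch.

```python
import math

LN2 = math.log(2.0)


def m_of(s):
    """1-based index of the highest active feature; 0 for the featureless state."""
    return max((i + 1 for i, bit in enumerate(s) if bit), default=0)


def log_partition(n, g):
    """ln Z_n of the HFM on n features (equal to 0 when n = 0)."""
    terms = [0.0] + [(m - 1) * LN2 - g * m for m in range(1, n + 1)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def fit_g(mean_m, n, lo=0.0, hi=10.0, iters=200):
    """Ternary search for the minimum of the convex cost g*<m> + ln Z_n(g)."""
    def cost(g):
        return g * mean_m + log_partition(n, g)
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if cost(a) < cost(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


def fit_extended_hfm(sample):
    """Return (k_hat, g_hat, log-likelihood per point in nats) for Eq. (7)."""
    n, n_points = len(sample[0]), len(sample)
    ms = [m_of(s) for s in sample]
    best = None
    for k in range(n + 1):
        # The first k features cost k bits in total; the remaining n-k follow
        # an HFM with shifted index m' = max(0, m_s - k).
        mean_shifted = sum(max(0, m - k) for m in ms) / n_points
        g = fit_g(mean_shifted, n - k)
        loglik = -k * LN2 - g * mean_shifted - log_partition(n - k, g)
        if best is None or loglik > best[2]:
            best = (k, g, loglik)
    return best


# Toy sample: s_1 is uniform, (s_2, s_3) follow an HFM with g = log 2.
sample = ([(0, 0, 0)] * 4 + [(1, 0, 0)] * 4 + [(0, 1, 0)] * 2 + [(1, 1, 0)] * 2
          + [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)])
k_hat, g_hat, loglik = fit_extended_hfm(sample)
```

On this toy sample the fit selects $k=1$ with $g$ close to $\log 2$, since the frequencies coincide with the extended model for those parameters. Note that for $k=n$ the model is uniform and the fitted $g$ is irrelevant.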

4 Conclusions

Universal statistical signatures of relevance should exist because otherwise learning would be impossible. These signatures make it possible to identify relevant information without the need to know what that information is relevant for, which is precisely what learning is about. Our results suggest that it is the very uncertainty on the way in which the data may be informative, i.e. on the model which describes the data, that makes the data meaningful. And because meaning has to do with model uncertainty, it should admit a model-free, universal characterisation. The measure of relevance proposed in Ref. [5, 7] fits with this general description of an abstract measure of meaning. In addition, in the present setting, there is a unique, simple model – the HFM – that encodes the principle of maximal relevance.

This paper analyses how information is stored in the rather simple setting of a DBN in order to probe the organising principles of unsupervised learning: as data is processed in deeper and deeper layers, it is stripped of features and invariances, which, once detected, are reduced to noise. Syntactic meaning is organised in more and more data-independent representations which approach a universal abstract representation (the HFM) that encodes only principles of efficient information storage. This whole process is driven by the maximisation of the likelihood alone. In the opposite direction, the syntactic meaning generated in the deepest layers is dressed up with contextual information on its way to the visible layer. In this picture, depth – i.e. the distance of a representation from the data – negatively correlates with complexity and positively with plasticity. With respect to complexity, our results suggest a further [29] rationale for Occam’s razor, that posits simplicity as a guiding principle in learning.

This picture is largely consistent with the prevailing one in artificial neural networks [3, 4, 25] as well as in neuroscience [6, 2]. In vision, early stages of information processing are adapted to process a large variation of structured datasets. From the retina to the primary visual cortex, input is encoded in terms of generic features, such as localised filters [30]. These areas of early processing of visual stimuli exhibit suppressed levels of experience dependent plasticity after development [31]. Conversely, enhanced levels of plasticity are required for incremental learning in higher areas of visual processing [32], in order to store data specific information in the synapses (or in the couplings of models of associative memories [33]).

The main original contribution of this paper is that it offers further support to the notion of relevance [5, 7] as a quantitative measure of meaningfulness. We hope that this insight can help shed further light on deeper levels of the abstraction hierarchy [2] by, for example, constraining the search for models of cognitive maps [14, 16], or providing further insights into the statistical regularities that emerge in the analysis of correlations of neural activities across scales [34, 35].

The approach can be generalised to elucidate the principles that govern learning in more complex architectures, but it may also shed light on conceptual issues in a broader disciplinary domain.¹⁴

¹⁴ It is tempting to speculate on the analogies between our setting and the evolution of a system in time. The DBN architecture is characterised by a Markovian structure where the state of each layer only depends on the state of the previous layer. Likewise, the state of a system at a given time only depends on its state at a previous time. In this respect, equilibrium states are the least informative ones, because they satisfy a maximum entropy principle. Once a system reaches equilibrium, all information on its past is lost. Meaning would then encode the specific way in which a system is driven out of equilibrium. In this perspective, the relevance should help us find those features which carry meaning on the dynamics which led to the present state. Grigolon et al. [36] discuss an example of this strategy in the context of the biological evolution of proteins.

5 Acknowledgements

We are grateful to Paolo Muratore and Davide Zoccolan for interesting discussions.

Appendix A Simulation details

A.1 Training of DBN

A deep belief network (DBN) consists of Restricted Boltzmann Machines (RBMs) stacked one on top of the other, as shown in Fig. 5. Each RBM is a Markov random field with pairwise interactions defined on a bipartite graph of two non-interacting layers of variables: visible variables $\textbf{x}=(x_{1},\ldots,x_{m})$ representing the data, and hidden variables $\textbf{s}=(s_{1},\ldots,s_{n})$ that are the latent representation of the data. The measure of a single RBM is:

p(\textbf{x},\textbf{s})=\frac{1}{Z}\exp{\left(\sum_{i,j}W_{ij}x_{i}s_{j}+\sum_{k}x_{k}c_{k}+\sum_{l}s_{l}b_{l}\right)},

(8)

where $\textbf{W}=\{W_{ij}\},~{}\textbf{c}=(c_{1},\ldots,c_{m})$ and $\textbf{b}=(b_{1},\ldots,b_{n})$ are the parameters that are learned during training.
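Because the graph is bipartite, Eq. (8) implies that the hidden units are conditionally independent given the visible ones, and vice versa. The following is a minimal sketch of this property, assuming binary units $x_{i},s_{j}\in\{0,1\}$ (an assumption made here for illustration). The function names and the random parameter values are hypothetical. The sketch checks the factorised conditional against a brute-force enumeration of Eq. (8) on a tiny RBM.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_weight(x, s, W, c, b):
    """Exponent of Eq. (8): log of the unnormalised probability of (x, s)."""
    return x @ W @ s + x @ c + s @ b

def p_hidden_on(x, W, b):
    """P(s_j = 1 | x) for every hidden unit j, assuming s_j in {0, 1}."""
    return sigmoid(x @ W + b)

def p_visible_on(s, W, c):
    """P(x_i = 1 | s) for every visible unit i, assuming x_i in {0, 1}."""
    return sigmoid(W @ s + c)

# Tiny RBM with arbitrary parameters (hypothetical values, illustration only).
rng = np.random.default_rng(0)
m, n = 3, 2
W, c, b = rng.normal(size=(m, n)), rng.normal(size=m), rng.normal(size=n)
x = np.array([1.0, 0.0, 1.0])

# Brute force: enumerate all hidden states and normalise exp(log_weight).
states = np.array(list(itertools.product([0.0, 1.0], repeat=n)))
weights = np.exp([log_weight(x, s, W, c, b) for s in states])
brute_force = (weights[:, None] * states).sum(axis=0) / weights.sum()

print(np.allclose(brute_force, p_hidden_on(x, W, b)))
```

The partition function $Z$ cancels in the conditional, which is why block Gibbs sampling of an RBM never needs to evaluate it.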

In order to train the DBN we learn the parameters one layer at a time, following the prescription of Hinton [37]. It consists of training the first RBM on the data and then propagating the input data $\hat{\mathbf{x}}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})$ forward to the first hidden layer, thus obtaining a sample of the hidden states $\hat{\mathbf{s}}^{(1)}$ for the first layer. This sample is then used as input for training the second hidden layer, and so on. This type of training procedure was proven [37] to increase a variational lower bound for the log-likelihood of the data set. This allows us to use approximate training methods like Contrastive Divergence (CD) and still obtain a good generative model.
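The greedy layer-wise procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: units are taken to be binary in $\{0,1\}$, and the function names, learning rate, initialisation and toy data are all hypothetical. Each RBM is trained with CD-k, and its sampled hidden states become the training data of the next RBM.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def train_rbm_cd(data, n_hidden, rng, k=1, epochs=5, batch_size=64, lr=0.05):
    """Train one RBM with CD-k (k >= 1); returns (W, c, b) as in Eq. (8)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    c, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            x0 = data[order[start:start + batch_size]]
            ph0 = sigmoid(x0 @ W + b)                 # positive phase
            h = sample_bernoulli(ph0, rng)
            for _ in range(k):                        # k Gibbs steps from the batch
                x = sample_bernoulli(sigmoid(h @ W.T + c), rng)
                ph = sigmoid(x @ W + b)
                h = sample_bernoulli(ph, rng)
            W += lr * (x0.T @ ph0 - x.T @ ph) / len(x0)
            c += lr * (x0 - x).mean(axis=0)
            b += lr * (ph0 - ph).mean(axis=0)
    return W, c, b

def train_dbn(data, layer_sizes, rng, **rbm_kwargs):
    """Greedy stacking: each RBM is trained on samples of the layer below."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        W, c, b = train_rbm_cd(inputs, n_hidden, rng, **rbm_kwargs)
        layers.append((W, c, b))
        inputs = sample_bernoulli(sigmoid(inputs @ W + b), rng)  # propagate forward
    return layers

# Toy usage on random binary data (hypothetical sizes, not the MNIST setup).
rng = np.random.default_rng(0)
toy_data = (rng.random((256, 16)) < 0.3).astype(float)
dbn = train_dbn(toy_data, layer_sizes=[8, 4], rng=rng, k=2, epochs=3)
print([W.shape for W, _, _ in dbn])
```

Note that lower layers are frozen once trained: nothing in `train_dbn` revisits an earlier RBM, which is what makes the procedure greedy.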

In order to generate samples from the trained DBN we consider the connections between the top two layers as undirected, whereas each lower layer is connected to the layer above it by directed connections. This means that, in order to obtain a sample from a DBN, we use Gibbs sampling to sample the equilibrium of the top RBM $p_{L}(\textbf{s}^{(L)},\textbf{s}^{(L-1)})$. Then we use this sample to draw the states of the lower layers from the conditional distributions $p(\mathbf{s}^{(\ell-1)}|\mathbf{s}^{(\ell)})$. In this way, we propagate the signal down to the visible layer.
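A minimal sketch of this top-down sampling scheme is given below, assuming binary $\{0,1\}$ units and a list of per-layer RBM parameters $(W, c, b)$ in the notation of Eq. (8). The function names, the number of Gibbs steps and the random (untrained) parameters are hypothetical and serve only to show the flow of the computation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(layers, n_samples, rng, gibbs_steps=200):
    """layers[l] = (W, c, b) of the RBM whose visible side is layer l and
    whose hidden side is layer l + 1. Requires gibbs_steps >= 1."""
    W_top, c_top, b_top = layers[-1]
    # Undirected top RBM: alternate Gibbs updates between s^(L) and s^(L-1).
    top = sample_bernoulli(np.full((n_samples, W_top.shape[1]), 0.5), rng)
    for _ in range(gibbs_steps):
        below = sample_bernoulli(sigmoid(top @ W_top.T + c_top), rng)
        top = sample_bernoulli(sigmoid(below @ W_top + b_top), rng)
    # Directed lower layers: a single ancestral pass down to the visible layer.
    state = below
    for W, c, _ in reversed(layers[:-1]):
        state = sample_bernoulli(sigmoid(state @ W.T + c), rng)
    return state

# Toy usage with random parameters (hypothetical sizes: 16 visible, then 8, 4).
rng = np.random.default_rng(0)
sizes = [16, 8, 4]
layers = [(rng.normal(size=(lo, hi)), np.zeros(lo), np.zeros(hi))
          for lo, hi in zip(sizes[:-1], sizes[1:])]
samples = sample_dbn(layers, n_samples=5, rng=rng)
print(samples.shape)
```

Only the top RBM requires an iterated Markov chain; every layer below it is sampled exactly once per generated configuration.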

The DBN used in our experiment is the same as that used in Ref. [23]: it has a visible layer with $784$ nodes and $L=10$ hidden layers with the following numbers of nodes: $n_{\ell}=500,250,120,60,30,25,20,15,10$ and $5$, for $\ell=1,\ldots,10$. Similar results to those discussed in the main text were obtained for different architectures.

In order to learn the parameters of a single RBM we used stochastic gradient ascent on the log-likelihood, using Contrastive Divergence with $k=10$ and mini-batches of $64$ (see [38]), for $\sim 10^{3}$ epochs. Decelle et al. [39, 40] have shown that the distribution learned by an RBM trained with CD-10 does not reproduce the equilibrium distribution, but that it can be a good generative model if it is sampled out of equilibrium. They observed instead that persistent contrastive divergence (PCD-10) was able to converge to the equilibrium distribution.¹⁵ To the best of our knowledge, the main gist of our results does not depend on the details of the algorithm used in training.

¹⁵ In Contrastive Divergence-k (CD-k), the Markov chain used to sample the distribution is initialized on the batch used to compute the gradient and $k$ Monte Carlo steps are performed. In Persistent Contrastive Divergence-k (PCD-k), the Markov chain is initialized in the configuration of the previous epoch.
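The only difference between CD-k and PCD-k lies in where the negative-phase Markov chain starts. The sketch below isolates that difference. It assumes binary $\{0,1\}$ units and the parameter convention of Eq. (8). The function names, the random parameters and the random initial persistent chain are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(x_start, W, c, b, k, rng):
    """Run k alternating Gibbs steps x -> s -> x; return the final visible state."""
    x = x_start
    for _ in range(k):
        s = sample_bernoulli(sigmoid(x @ W + b), rng)
        x = sample_bernoulli(sigmoid(s @ W.T + c), rng)
    return x

def negative_sample(batch, W, c, b, k, rng, persistent=None):
    """CD-k when `persistent` is None (chain restarts on the batch);
    PCD-k when a stored chain state is passed in and continued."""
    start = batch if persistent is None else persistent
    return gibbs_chain(start, W, c, b, k, rng)

# Toy usage with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
W, c, b = 0.1 * rng.normal(size=(12, 5)), np.zeros(12), np.zeros(5)
batch = (rng.random((64, 12)) < 0.5).astype(float)

x_cd = negative_sample(batch, W, c, b, k=10, rng=rng)                     # CD-10
chain = sample_bernoulli(np.full(batch.shape, 0.5), rng)                  # stored chain
chain = negative_sample(batch, W, c, b, k=10, rng=rng, persistent=chain)  # PCD-10
print(x_cd.shape, chain.shape)
```

In a training loop, the PCD variant would feed the returned `chain` back in at the next update, so that the chain keeps evolving instead of being reset to the data.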

A.2 Boltzmann learning of Ising model

The Ising model is the maximum entropy model that reproduces the empirical averages

\langle s_{i}\rangle_{\mathcal{D}}\equiv\frac{1}{N}\sum_{n=1}^{N}s_{i}^{(n)},\qquad\langle s_{i}s_{j}\rangle_{\mathcal{D}}\equiv\frac{1}{N}\sum_{n=1}^{N}s_{i}^{(n)}s_{j}^{(n)}

(9)

of single spins and pairs of spins. For an exponential family, finding the parameters $h_{i}$ and $J_{ij}$ in Eq. (6) such that the expectation over the model matches empirical averages in Eq. (9) is the same as maximizing the log-likelihood $\mathcal{L}(\textbf{J},\textbf{h}|\mathcal{D})$ of the empirical data, whose gradient components are:

\frac{\partial\mathcal{L}}{\partial J_{ij}}=\left<s_{i}s_{j}\right>_{\mathcal{D}}-\left<s_{i}s_{j}\right>_{p^{(2)}},\qquad\frac{\partial\mathcal{L}}{\partial h_{i}}=\left<s_{i}\right>_{\mathcal{D}}-\left<s_{i}\right>_{p^{(2)}}.

(10)

To find the parameters we perform gradient ascent on the log-likelihood. To estimate the model averages in Eq. (10) we used $64$ parallel Markov chains of length $10\cdot n$, with $n$ the total number of spins.
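The moment-matching gradient of Eq. (10) can be illustrated on a small system. The sketch below is not the Monte Carlo procedure described above: it replaces the Markov chains with exact enumeration of all $2^{n}$ states, which is only feasible for small $n$. It further assumes spins $s_{i}\in\{0,1\}$ and a model of the form $p(\mathbf{s})\propto\exp(\sum_{i}h_{i}s_{i}+\sum_{i<j}J_{ij}s_{i}s_{j})$. This sign and spin convention is an assumption, since Eq. (6) is not reproduced here, and all function names, step sizes and data are hypothetical.

```python
import itertools
import numpy as np

def all_states(n):
    return np.array(list(itertools.product([0.0, 1.0], repeat=n)))

def model_moments(h, J, states):
    """Exact <s_i> and <s_i s_j> under p(s) ~ exp(h.s + sum_{i<j} J_ij s_i s_j)."""
    log_w = states @ h + np.einsum("ai,ij,aj->a", states, np.triu(J, 1), states)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    return p @ states, np.einsum("a,ai,aj->ij", p, states, states)

def fit_ising(samples, lr=0.2, steps=5000):
    """Gradient ascent on the log-likelihood using the gradient of Eq. (10)."""
    n = samples.shape[1]
    states = all_states(n)
    mean_data = samples.mean(axis=0)
    corr_data = samples.T @ samples / len(samples)
    h, J = np.zeros(n), np.zeros((n, n))
    for _ in range(steps):
        mean_model, corr_model = model_moments(h, J, states)
        h += lr * (mean_data - mean_model)            # field component
        J += lr * np.triu(corr_data - corr_model, 1)  # coupling component, i < j
    return h, J

# Toy usage: draw samples from a known small model, then refit it.
rng = np.random.default_rng(0)
n = 4
states = all_states(n)
h_true = 0.5 * rng.normal(size=n)
J_true = 0.5 * np.triu(rng.normal(size=(n, n)), 1)
log_w = states @ h_true + np.einsum("ai,ij,aj->a", states, J_true, states)
p_true = np.exp(log_w) / np.exp(log_w).sum()
samples = states[rng.choice(len(states), size=5000, p=p_true)]

h_fit, J_fit = fit_ising(samples)
mean_fit, corr_fit = model_moments(h_fit, J_fit, states)
print(np.abs(mean_fit - samples.mean(axis=0)).max())
```

At the fixed point the model reproduces the empirical means and pairwise averages, which is the maximum entropy condition stated above. With Monte Carlo estimates of the model averages the same update applies, but the gradient becomes noisy.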

A.3 Best fitted HFM with $k$ independent spins

Given $N$ samples of a hidden layer, $\{\mathbf{s}^{(i)}\}_{i=1}^{N}$, the empirical distribution can be expressed as:

\hat{p}(\textbf{s})=\frac{1}{N}\sum_{i=1}^{N}\delta(\textbf{s}-\textbf{s}^{(i)}).

(11)

The HFM with $k$ independent spins, defined in Eq. (7), can be written as an exponential family

h_{n}^{(k)}(\mathbf{s})=\frac{1}{Z(k,g)}e^{-g\mathcal{H}(\textbf{s})}

(12)

where the Hamiltonian is

\mathcal{H}(\textbf{s})=\max\{m_{\textbf{s}}-k,0\},\qquad m_{\textbf{s}}=\max\{i:s_{i}=1\}.

(13)

The normalization factor is given by

Z(k,g)=\sum_{\textbf{s}}e^{-g\mathcal{H}(\textbf{s})}=2^{k-1}\left(1+\frac{\xi^{n-k+1}-1}{\xi-1}\right),\qquad\xi=2e^{-g}.

(14)
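The closed form in Eq. (14) can be checked numerically against a direct sum over all $2^{n}$ configurations for small $n$. The sketch below does this, assuming spins $s_{i}\in\{0,1\}$ and the convention $m_{\textbf{s}}=0$ for the all-zero configuration (an assumption made here, consistent with the closed form). The function names are hypothetical. At $\xi=1$ the ratio in Eq. (14) is replaced by its limit $n-k+1$.

```python
import itertools
import numpy as np

def hfm_energy(s, k):
    """H(s) = max(m_s - k, 0); m_s is the 1-based index of the last active
    spin, taken to be 0 when no spin is active (assumed convention)."""
    active = np.flatnonzero(s)
    m_s = active[-1] + 1 if active.size else 0
    return max(m_s - k, 0)

def z_brute_force(n, k, g):
    """Direct sum of exp(-g H(s)) over all 2^n configurations."""
    return sum(np.exp(-g * hfm_energy(np.array(s), k))
               for s in itertools.product([0, 1], repeat=n))

def z_closed_form(n, k, g):
    """Eq. (14), using the limit n - k + 1 of the geometric ratio at xi = 1."""
    xi = 2.0 * np.exp(-g)
    if np.isclose(xi, 1.0):
        ratio = n - k + 1
    else:
        ratio = (xi ** (n - k + 1) - 1.0) / (xi - 1.0)
    return 2.0 ** (k - 1) * (1.0 + ratio)

for n, k, g in [(6, 2, 0.5), (6, 0, 1.3), (5, 5, 0.2), (7, 3, np.log(2.0))]:
    print(n, k, round(float(g), 3), z_brute_force(n, k, g), z_closed_form(n, k, g))
```

The agreement follows from counting: there are $2^{k}$ configurations with zero energy and $2^{m-1}$ configurations with $m_{\textbf{s}}=m>k$, each of the latter carrying weight $e^{-g(m-k)}$.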

The Kullback-Leibler divergence between the empirical distribution and the HFM model is defined as:

D_{KL}(\hat{p}_{\ell}|h_{n_{\ell}})=\sum_{\textbf{s}}\hat{p}(\textbf{s})\log\frac{\hat{p}(\textbf{s})}{h_{n}^{(k)}(\textbf{s})}=\sum_{\textbf{s}}\hat{p}(\textbf{s})\left[\log\hat{p}(\textbf{s})+g\mathcal{H}(\textbf{s})\right]+\log[Z(k,g)].

(15)
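Eq. (15) only involves the configurations that actually occur in the sample, so it can be evaluated without enumerating all $2^{n}$ states. The sketch below is one way to do so, assuming the sample is stored as an integer array with one row of $\{0,1\}$ spins per configuration and $m_{\textbf{s}}=0$ for the all-zero row. The function names are hypothetical, and $\log Z$ is computed from the equivalent form $Z=2^{k-1}(2+\sum_{j=1}^{n-k}\xi^{j})$ of Eq. (14), which has no singularity at $\xi=1$.

```python
import itertools
from collections import Counter
import numpy as np

def hfm_energies(samples, k):
    """H(s) = max(m_s - k, 0) for every row; m_s = 0 if the row is all zeros."""
    n = samples.shape[1]
    m_s = (samples * np.arange(1, n + 1)).max(axis=1)
    return np.maximum(m_s - k, 0)

def log_z(n, k, g):
    """log of Eq. (14), written as log[2^(k-1) * (2 + sum_{j=1}^{n-k} xi^j)]."""
    xi = 2.0 * np.exp(-g)
    j = np.arange(1, n - k + 1)
    return (k - 1) * np.log(2.0) + np.log(2.0 + np.sum(xi ** j))

def kl_to_hfm(samples, k, g):
    """Eq. (15): KL divergence from the empirical distribution to the HFM."""
    n_samples, n = samples.shape
    counts = np.array(list(Counter(map(tuple, samples)).values()), dtype=float)
    p_hat = counts / n_samples
    neg_entropy = np.sum(p_hat * np.log(p_hat))
    return neg_entropy + g * hfm_energies(samples, k).mean() + log_z(n, k, g)

# Sanity check: a sample containing each of the 2^3 states once is uniform,
# and the HFM with g = 0 is uniform too, so the divergence vanishes.
uniform_sample = np.array(list(itertools.product([0, 1], repeat=3)))
print(kl_to_hfm(uniform_sample, k=1, g=0.0))

rng = np.random.default_rng(0)
random_sample = (rng.random((200, 6)) < 0.4).astype(int)
print(kl_to_hfm(random_sample, k=2, g=0.7))
```

For a sample of size $N\ll 2^{n}$ most configurations are observed at most once, so the entropy term is close to $-\log N$ and the comparison across values of $k$ is driven mainly by the energy and normalization terms.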

For a given sample of a hidden layer and for each value of $k$, we can find the HFM that minimizes the $D_{KL}$ in Eq. (15). Setting the derivative of Eq. (15) with respect to $g$ to zero shows that this amounts to finding the value $\hat{g}$ such that the expected value of the energy over the model matches the empirical one:

\langle\mathcal{H}(\textbf{s})\rangle_{\mathcal{D}}\equiv\frac{1}{N}\sum_{i=1}^{N}\max\{m_{\textbf{s}^{(i)}}-k,0\}=\sum_{\textbf{s}}\max\{m_{\textbf{s}}-k,0\}\,h_{n}^{(k)}(\textbf{s})\equiv\langle\mathcal{H}(\textbf{s})\rangle_{h_{n}^{(k)}}.

(16)

The average energy of the HFM as a function of the parameters $k$ and $g$ is:

\langle\mathcal{H}(\textbf{s})\rangle_{h_{n}^{(k)}}=\xi\left(\frac{(n-k+1)\xi^{n-k}+1}{\xi^{n-k+1}+\xi-2}-\frac{1}{\xi-1}\right),\qquad\xi=2e^{-g}.

(17)
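Since the derivative of the average energy with respect to $g$ is minus the variance of $\mathcal{H}$, the left-hand side of Eq. (17) decreases monotonically with $g$, and the matching condition of Eq. (16) has a unique solution that can be found by bisection. The sketch below illustrates one possible implementation, which is an assumption and not necessarily the root finder used for the results. The function names and the bracketing interval are hypothetical, and the fit requires $k<n$ and a target energy strictly between $0$ and $n-k$. The average energy is computed level by level rather than from Eq. (17) directly, because both terms of Eq. (17) diverge at $\xi=1$ even though their difference is finite.

```python
import numpy as np

def mean_energy(n, k, g):
    """<H> under the HFM: energy level j >= 1 has weight xi^j and the
    zero-energy level has weight 2 (both in units of 2^(k-1))."""
    j = np.arange(0, n - k + 1)
    log_w = j * (np.log(2.0) - g)
    log_w[0] = np.log(2.0)
    w = np.exp(log_w - log_w.max())   # shift by the maximum for stability
    return float(np.sum(j * w) / np.sum(w))

def mean_energy_eq17(n, k, g):
    """Closed form of Eq. (17); not usable exactly at xi = 1."""
    xi = 2.0 * np.exp(-g)
    d = n - k
    return xi * (((d + 1) * xi ** d + 1.0) / (xi ** (d + 1) + xi - 2.0)
                 - 1.0 / (xi - 1.0))

def fit_g(target_energy, n, k, lo=-30.0, hi=30.0, iters=200):
    """Bisection on g, using the fact that mean_energy decreases with g."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_energy(n, k, mid) > target_energy:
            lo = mid   # model energy too high: g must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy usage: recover a known g from its own average energy.
n, k, g_true = 20, 5, 1.0
g_hat = fit_g(mean_energy(n, k, g_true), n, k)
print(g_hat, mean_energy(10, 3, 1.0), mean_energy_eq17(10, 3, 1.0))
```

In practice `target_energy` would be the empirical average on the left-hand side of Eq. (16), and the fit would be repeated for each value of $k$ before comparing the resulting divergences.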

References

[1] David Marr. Vision: A computational investigation into the human representation and processing of visual information. MIT press, 2010.
[2] Dana H Ballard. Brain computation as hierarchical abstraction. MIT press, 2015.
[3] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning, volume 1. MIT press, Massachusetts, USA, 2017.
[4] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 54(10):95–103, 2011.
[5] M Marsili, I Mastromatteo, and Y Roudi. On sampling and modeling complex systems. Journal of Statistical Mechanics: Theory and Experiment, 2013(09):P09003, 2013.
[6] Horace B Barlow. Unsupervised learning. Neural computation, 1(3):295–311, 1989.
[7] Matteo Marsili and Yasser Roudi. Quantifying relevance in learning and inference. Physics Reports, 963:1–43, 2022.
[8] Rongrong Xie and Matteo Marsili. A simple probabilistic neural network for machine understanding. Journal of Statistical Mechanics: Theory and Experiment, 2024(2):023403, 2024.
[9] Florentin Guth and Brice Ménard. On the universality of neural encodings in CNNs. In ICLR 2024 Workshop on Representational Alignment, 2024.
[10] G Tkačik, T Mora, O Marre, D Amodei, S E Palmer, M J Berry, and W Bialek. Thermodynamics and signatures of criticality in a network of neurons. Proceedings of the National Academy of Sciences, 112(37):11508–11513, 2015.
[11] E D Lee, C P Broedersz, and W Bialek. Statistical Mechanics of the US Supreme Court. Journal of Statistical Physics, 160:275–301, July 2015.
[12] Eiling Yee. Abstraction and concepts: when, how, where, what and why?, 2019.
[13] Judy S DeLoache. Rapid change in the symbolic functioning of very young children. Science, 238(4833):1556–1557, 1987.
[14] James CR Whittington, David McCaffary, Jacob JW Bakermans, and Timothy EJ Behrens. How to build a cognitive map. Nature neuroscience, 25(10):1257–1272, 2022.
[15] Charles P Davis and Eiling Yee. Features, labels, space, and time: Factors supporting taxonomic relationships in the anterior temporal lobe and thematic relationships in the angular gyrus. Language, Cognition and Neuroscience, 34(10):1347–1357, 2019.
[16] Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction. science, 331(6022):1279–1285, 2011.
[17] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106, 1962.
[18] John O’Keefe and Jonathan Dostrovsky. The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain research, 1971.
[19] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in cognitive sciences, 11(8):333–341, 2007.
[20] Davide Zoccolan. Invariant visual object recognition and shape processing in rats. Behavioural brain research, 285:10–33, 2015.
[21] Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural networks: tricks of the trade, pages 239–274. Springer, 2002.
[22] Alessandro Ingrosso and Sebastian Goldt. Data-driven emergence of convolutional structure in neural networks. Proceedings of the National Academy of Sciences, 119(40):e2201854119, 2022.
[23] J Song, M Marsili, and J Jo. Resolution and relevance trade-offs in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2018(12):123406, 2018.
[24] T M Cover and J A Thomas. Elements of information theory. John Wiley & Sons, 2012.
[25] R Shwartz-Ziv and N Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[26] Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems, pages 6111–6122, 2019.
[27] Clélia de Mulatier, Paolo P Mazza, and Matteo Marsili. Statistical inference of minimally complex models. arXiv preprint arXiv:2008.00520, 2020.
[28] Alberto Beretta, Claudia Battistin, Clélia De Mulatier, Iacopo Mastromatteo, and Matteo Marsili. The stochastic complexity of spin models: Are pairwise models really simple? Entropy, 20(10):739, 2018.
[29] In Jae Myung, Vijay Balasubramanian, and Mark A. Pitt. Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences, 97(21):11170–11175, 2000.
[30] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
[31] Nicoletta Berardi, Tommaso Pizzorusso, and Lamberto Maffei. Critical periods during sensory development. Current opinion in neurobiology, 10(1):138–145, 2000.
[32] Daniel J Amit. Modeling brain function: The world of attractor neural networks. Cambridge university press, 1989.
[33] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
[34] Leenoy Meshulam, Jeffrey L Gauthier, Carlos D Brody, David W Tank, and William Bialek. Coarse graining, fixed points, and scaling in a large population of neurons. Physical review letters, 123(17):178103, 2019.
[35] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Matteo Carandini, and Kenneth D Harris. High-dimensional geometry of population responses in visual cortex. Nature, 571(7765):361–365, 2019.
[36] Silvia Grigolon, Silvio Franz, and Matteo Marsili. Identifying relevant positions in proteins by critical variable selection. Molecular BioSystems, 12(7):2147–2158, 2016.
[37] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[38] Geoffrey E Hinton. A practical guide to training restricted boltzmann machines. In Neural networks: Tricks of the trade, pages 599–619. Springer, 2012.
[39] Aurélien Decelle, Cyril Furtlehner, and Beatriz Seoane. Equilibrium and non-equilibrium regimes in the learning of restricted boltzmann machines. Advances in Neural Information Processing Systems, 34:5345–5359, 2021.
[40] Elisabeth Agoritsas, Giovanni Catania, Aurélien Decelle, and Beatriz Seoane. Explaining the effects of non-convergent sampling in the training of energy-based models. In ICML 2023-40th International Conference on Machine Learning, 2023.