DeepLNE++ leveraging knowledge distillation for accelerated multi-state path-like collective variables
Abstract
Path-like collective variables can be very effective for accurately modeling complex biomolecular processes in molecular dynamics simulations. Recently, we introduced DeepLNE, a machine learning-based path-like CV that provides a progression variable along the path as a non-linear combination of several descriptors, effectively approximating the reaction coordinate. However, DeepLNE is computationally expensive for realistic systems needing many descriptors and limited in its ability to handle multi-state reactions. Here we present DeepLNE++, which uses a knowledge distillation approach to significantly accelerate the evaluation of DeepLNE, making it feasible to compute free energy landscapes for large and complex biomolecular systems. In addition, DeepLNE++ encodes system-specific knowledge within a supervised multitasking framework, enhancing its versatility and effectiveness.
Classical molecular dynamics (MD) simulations at the atomistic scale provide a powerful method for studying complex physical, chemical, and biological systems. Indeed, MD simulations can reveal mechanisms at spatial and temporal scales that are difficult to observe experimentally, acting as a computational microscope.Dror et al. (2012) Advances in computational power and force field accuracy have increased the capabilities of MD, even though a full description of intricate biological phenomena remains hard to achieve due to limitations in accessible timescales. To address this, enhanced sampling algorithms have been developed to allow the exploration of high-energy transitions and their associated free energy landscapes. Laio and Gervasio (2008); Valsson, Tiwary, and Parrinello (2016); Camilloni and Pietrucci (2018) Many of these algorithms use either collective variables (CVs) or paths connecting two-states A and B, each of which has its own advantages and limitations. Laio and Parrinello (2002); Bolhuis et al. (2002)
CV-based algorithms rely on the identification of slow degrees of freedom specific to the system under investigation and are limited to applying an external bias potential only on the empirically determined relevant ones. Pietrucci (2017) Path-based methods, on the other hand, can incorporate many degrees of freedom but are highly dependent on the definition of end states and effective path search algorithms. Path-collective variables (PATHCVs) combine these approaches and are often coupled with enhanced sampling algorithms to explore high free energy barriers and successfully reconstruct free energy profiles for complex systems. Branduardi, Gervasio, and Parrinello (2007)
It is worth emphasizing that the effectiveness of PATHCVs depends on the metric and milestone parameters chosen. Recent work has explored machine learning to optimise PATHCV metrics, but linear combinations of features can be sub-optimal for complex pathways, such as ligand binding to a protein. Hovan, Comitani, and Gervasio (2019) Non-linear combinations of descriptors better capture the relevant degrees of freedom at different stages of the pathway. Bonati, Rizzi, and Parrinello (2020); Bonati et al. (2023) Data-driven methods have also emerged for learning CVs, using techniques such as variational autoencoders and kernel ridge regression. Šípka, Erlebach, and Grajciar (2023); France-Lanord et al. (2024)
We recently introduced a new machine learning approach, DeepLNE (deep-locally non-linear-embedding), which refines the PATHCV method and overcomes existing limitations Fröhlking et al. (2024). DeepLNE creates a semi-automatic 1D description of the training data by combining local linear embedding Roweis and Saul (2000) (LLE) principles with PATHCV. The algorithm uses dimensionality reduction via artificial neural networks (ANNs), differentiable k-nearest neighbour selection, and an ANN-encoded 1D latent space. By focusing on system dynamics, DeepLNE forms a path-like CV anchored by training data, expanding the tools for CV definition and improving on previous methods with respect to available input space and functional flexibility.
The algorithm can be applied to reactive MD trajectories connecting metastable states and derives a PyTorch Paszke et al. (2017) model that accurately approximates the true reaction coordinate of the system and its perpendicular distance . Therefore, exporting DeepLNE as a CV and biasing it via state-of-the-art enhanced sampling methods such as ’On-the-fly Probability Enhanced Sampling’ (OPES) Invernizzi and Parrinello (2020, 2022) or OneOPES Rizzi et al. (2023) allows efficient estimation of the free energy surface (FES). However, DeepLNE is computationally demanding. When the underlying feature space is large and models are anchored to many training datapoints, the model evaluation via PyTorch can significantly impact the computational performance of the MD engine. Additionally, when multiple metastable states are aligned along the reaction path, the sequentiality of assigned path values on the 1D projection automatically derived by the DeepLNE algorithm can be different from the desired one. In this study, we show how DeepLNE can be improved by defining a multi-tasking objective function aligning the expectations when designing the path-like CVs (e.g. sequentiality of metastable states) with the model output and how the evaluation of DeepLNE CVs can be significantly accelerated by representing and exporting them as artificial neural networks (ANN). We achieve this within a knowledge distillation framework deriving DeepLNE student models for the path-like variables and . In the paper, we also document the best practices for training DeepLNE models including cross-validation and augmented objective functions allowing the use of labeled data in a supervised multi-tasking framework as well as the option to assign importance weights to individual input features. We demonstrate the extended and improved algorithm on the toy models: 3-state Müller-Brown-potential, alanine tripeptide, and the microRNA precursor pre-miR21. Importantly we benchmark the original PyTorch-based CV against the computationally faster DeepLNE student models obtained from knowledge distillation, showing that for the solvated pre-miR21 system composed of 31 nucleotides with a total of 45543 atoms we can achieve a 3 times speed-up.
I Methods
Before introducing the DeepLNE++ algorithm and its novelties, we summarise the original DeepLNE framework, highlighting its advantages and its limitations.
I.1 DeepLNE
The DeepLNE CV aims to represent high-dimensional datasets using a single dimension. It combines ideas from PATHCV and LLE within a neural network architecture which parameters are optimized using the mean square error between the original input vector and the reconstructed one for training datapoints:
(1) |
The architecture can be seen as a generalized autoencoder. An initial ANN reduces dimensionality. Then a continuous k-nearest-neighbor (k-NN) step is applied to each data point. The neighborhood information is compressed into a one-dimensional representation . This representation is used for reconstruction via a decoder and for computing an accessory perpendicular distance Since is derived from the features of a data point’s neighbors it is robust to extrapolation. However, evaluating the model becomes slower with larger datasets due to memory requirements. The DeepLNE algorithm constructs variables and that approximate a transition path between states. It addresses several issues encountered in PATHCVs:
-
•
Automatic Metric Learning (): DeepLNE learns a lower-dimensional metric to reduce degeneracy when identifying local neighborhoods.
-
•
Automatic Neighborhood Construction: Neighborhoods are constructed using a differentiable k-NN step.
-
•
Heterogeneous Features: Distances, angles, and contact maps can be combined in without predefined landmarks.
However, there are remaining limitations that we attempted to address in this study:
-
•
Sequentiality Preservation: The 1D DeepLNE variable may not preserve desired sequentiality in multi-state systems.
-
•
Computationally Demanding Evaluation: On-the-fly evaluation for biasing MD simulations can impact performance.
I.2 DeepLNE++ in Knowledge Distillation framework
In the following, we describe all the key steps of our recommended strategy for training the DeepLNE++ CV (compare Fig. 1).
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x1.png)
I.2.1 Training data generation
We use ’ratchet-and-pawl’ restraints of the Adiabatic Bias Molecular Dynamics (ABMD) method Camilloni, Broglia, and Tiana (2011) to perform directed enhanced sampling simulations connecting all the relevant states. Here we choose the spring constant such that we reach the target state within picoseconds of simulation time. can be seen as a hyperparameter of the DeepLNE algorithm, because it regulates the directionality and the input feature space the DeepLNE model is trained on. In the limit of high , we connect all states of interest within a short simulation time. However, since the training data will be obtained from proportionally out-of-equilibrium dynamics, the minimum transition path has to be recovered a posteriori from the trajectory after performing enhanced sampling simulation using the DeepLNE student models. We will show how this can be achieved when inspecting the FES with respect to the variables and in the following chapters. Consequently, we recommend not applying harmonic restraints on the variable as long as the minimum free energy path between the states of interest is unknown. Additionally, we recommend to generate ABMD states-connecting trajectories along all possible regions of the accessible descriptor space . The more exhaustive is the exploration of the input descriptor space provided during the training of DeepLNE, the less extrapolation is required and unexpected behavior can be minimized. A novel key step in the DeepLNE knowledge distillation framework is the assignment of labels to supervise the model. Information should be provided in the form of labels whenever a region in the input descriptor space can be assigned to a metastable state or a transition state with high confidence. To minimize the computational cost of DeepLNE we use Farthest Point-Sampling (FPS)Goscinski et al. (2023) to select a maximally diverse subset of the sampled data, such that only the frames are retained. The dataset is then split into training and validation data, with 80% of the data used for training and the remaining 20% is left out for validation.
I.2.2 Teacher model training
For all the DeepLNE teacher models in this study, we choose for the embedded manifold coordinates and meaning we use 3 nearest-neighbors to construct the vector (compare chapter of DeepLNE on hyperparameter choices in Ref. Fröhlking et al., 2024). We scale the variable using a sigmoid activation in the final encoder layer, restricting its values to the range of 0–1. As shown in Fig. 1 the network parameters are optimized using the mean square discrepancy between the original input and the decoded ones, also referred to as reconstruction loss term. The model is trained on a multi-tasking objective function that includes a cross-entropy term allowing to supervise the model via labeled data and an L2-regularization term penalizing large magnitudes in the model parameters :
(2) |
Each term of the objective function has associated hyperparameters , , and that are found by scanning a range of different choices respectively. While is a hyperparameter vector that tunes the importance of each individual input feature reconstruction, and are scalar hyperparameters tuning the importance of the label reconstruction and the magnitude of the model parameters respectively. In order to obtain the most generalizing model we select the one that minimizes the multi-tasking reconstruction loss on the validation data:
(3) |
We use gradient descent using the ADAM optimizer with a learning rate of and 5000 epochs, using the machine learning library PyTorch Paszke et al. (2017).
I.2.3 Expanding training data
Since the student models will be more susceptible to overfitting, because the ANNs are not anchored to the training data as it is the case for the DeepLNE teacher model we recommend expanding the training data by performing short enhanced sampling simulations e.g. using OPES in exploration mode Invernizzi and Parrinello (2022) (OPES-EXPLORE) using the DeepLNE teacher model (in the order of ns). This step aims at increasing the robustness of the DeepLNE student models. To track the exploration of the descriptor space it can be instructive to plot the DeepLNE CVs along a few important physical descriptors or against the first principal components of the input features . Based on this inspection we empirically decide when the descriptor space has been explored sufficiently such that we are minimizing the regions in configurational space that need to be extrapolated by the DeepLNE student models. Ultimately the original training data are concatenated with the newly sampled datapoints. Also here we use FPS to select a maximally diverse subset of all available datapoints. We retain frames and we split the data into training and validation set at a 80-20 ratio.
I.2.4 Student model training
As shown in Fig. 1 we project the DeepLNE teacher model variables and on the extended training dataset and train DeepLNE student models by minimizing the mean square discrepancy between the teacher model output and the student model output:
(4) |
with and the PyTorch loss function SmoothL1 with . The hyperparameter regulates the penalty applied based on the magnitude of the student model parameters . The choice of SmoothL1 loss is motivated by its robustness against outliers and consequently reducing overfitting in trained models. As the best model, we choose the one that minimizes this objective function evaluated on the validation dataset:
(5) |
We use gradient descent using the ADAM optimizer with a learning rate of and up to 15000 epochs, using the machine learning library PyTorch Paszke et al. (2017). We recommend using a higher number of training epochs than in I.2.2 in order to obtain models with a lower validation error, which typically corresponds to models that overfit less and exhibit smooth extrapolation into unknown regions of the descriptor space .
I.2.5 Exporting the student models into PLUMED
The parameters of the trained DeepLNE student models are stored such that the models can be evaluated on-the-fly in MD simulations using the PLUMED plugin Tribello et al. (2014). We use a variation of the MULTI_ANN CV available at https://github.com/bussilab/plumed-multi-ann that provides computationally efficient evaluation of ANN as CVs. An implementation of the full knowledge distillation framework and usage as a tutorial are available at https://github.com/ThorbenF/DeepLNE2.
I.3 Interpretation of hyperparameters
In order to obtain the DeepLNE models that are most adequate for the biophysical system at hand we here focus on the choice and the interpretation of the hyperparameters that are part of the knowledge distillation framework.
I.3.1 Teacher model
The teacher model keeps the original DeepLNE hyperparameter for the continuous k-nearest-neighbors step that are the number of neighbors and the sparsity of the selection matrix , which detailed description can be found in Ref. Fröhlking et al., 2024. We recommend using and for models with high locality and numerical stability. The variable is also dependent on which is not relevant for the training of the model and can be quickly adjusted without retraining the model. The effect of is analogous to its role in the PATHCV. To decide the adequate value for all hyperparameters it can be helpful to visualize the resulting behavior of variables and in the descriptor space or the 2 first components of the PCA on the descriptor set (compare Ref. Fröhlking et al., 2024).
Beyond the DeepLNE hyperparameter in I.2.2 we introduced a new objective function for training the teacher model that contains hyperparameters and , for which we have the following limiting cases:
-
•
: DeepLNE does not prioritize descriptor reconstruction.
-
•
: Full reconstruction of the selected descriptor from .
-
•
: Assigned labels are ignored.
-
•
: Optimal matching of with all labels.
-
•
: In this limit we obtain a model with maximal flexibility, which can lead to overfitting.
-
•
: Maximal penalty for large parameters, resulting in reduced model flexibility.
I.3.2 Student model
In I.2.4 we introduced a new objective function for the training of the student model that contains hyperparameters and . In contrast to the teacher model’s objective function, which uses the hyperparameter to weight the importance of the input features, the student model’s objective function is missing . The motivation for this choice is that the student model can learn automatically and unsupervised which of the input descriptors are relevant for reconstructing the DeepLNE teacher model variables and . While behaves as just described above, we have the limiting cases for :
-
•
: the SmoothL1 loss converges to L1 loss decreasing the weight of the discrepancies between teacher and student model on the objective function.
-
•
: the SmoothL1 objective becomes the L2 norm increasing the weight of the discrepancies between teacher and student model.
In this study, we assigned the same model specific hyperparameter choices to the training and the validation objective function.
I.4 Details on MD simulations and DeepLNE training
The DeepLNE knowledge distillation framework that we introduce here (see Fig. 1) aims at representing the DeepLNE CVs and as a computationally cheap ANNs that can be straightforwardly exported into PLUMED. For all the systems investigated in this paper, we followed the same protocol: (1) sampling of the transition of interest; (2) training the original DeepLNE CV on a set of input features as a teacher model; (3) performing enhanced sampling MD simulation using the teacher model to extend the training data; (4) training a student model for and another student model for the variable; (5) exporting the CVs to PLUMED and combining them with an enhanced sampling method for free energy estimation. In the following, we outline all the details of our recommended strategy. We test the DeepLNE knowledge distillation framework across the following applications: exploration of the 3-state Müller-Brown potential by a single particle, and sampling of the conformational space of the alanine tripeptide. Lastly, as an example of a real-world biological problem, we investigated the conformational change associated to the maturation of the microRNA’s pre-miR21 (see Fig. 2), a prominent oncological target whose conformational space and interactions with a novel inhibitor has been recently studied in Ref. Aureli et al., 2024.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x2.png)
I.4.1 Müller-Brown-Potential
The simulations of a particle in the Müller-Brown potential are performed using the simple Langevin dynamics code contained in the VES module of PLUMED Valsson and Parrinello (2014). The particle is moving in two dimensions under the action of the 3-state potential depicted in Fig. 2 built out of a sum of Gaussians. We start by generating training data concatenating 2 ABMD simulations biasing the y-coordinate with a spring constant of respectively in order to connect the 3 metastable states. The DeepLNE CVs and are trained using the and position of the particle as input features also assigning labels to the 3 metastable states. We train the DeepLNE model with hyperparameters , , , setting for both input variables. The teacher model is then used to perform an exploration simulation to expand the training dataset. We deposit bias along the DeepLNE variable via the OPES-EXPLORE engine. We reduce the number of training datapoints to applying FPS. The DeepLNE student models for and are subsequently trained on this expanded training dataset choosing . Finally we export the DeepLNE student models as CVs in PLUMED in order to perform free-energy estimation by depositing bias on via OPES and applying a half parabolic harmonic constraint on with a cutoff value of 0.1 (UWALL) and . We set a bias deposition rate (PACE) of 200 steps and a barrier parameter equal to 15 kJ/mol. We estimate statistical errors on the FES via block analysis using 10 blocks.
I.4.2 Alanine Tripeptide
For the extensively studied FES of alanine tripeptide, and are commonly used to describe the metastable states of the system. However there are the additional dihedral angles , , and leading to at least 6 available dihedral angle descriptors for this system (compare Fig. 2 (b)). In this study, we investigate the transition between the 3 metastable states visible in the , descriptor space, namely the configuration close to , and . We generate training data concatenating ABMD simulations biasing with a spring constant of kJ/mol respectively in order to connect the 3 metastable states. To show the advantages of the selective reconstruction loss introduced in Eq. I.2.2 we use all 6 dihedral angles as initial features setting for and the dihedrals. To save computational costs we use the FPS tool to reduce the number of training datapoints () for the teacher model and () for the student models. The hyperparameters for the teacher model are , , , . After training we use the variable as a CV to bias a 5 ns long simulation with OPES-EXPLORE. We choose a PACE of 500 steps and a barrier parameter equal to 80 kJ/mol. We apply a harmonic constraint with a cutoff value of 0.2 and kJ/mol (UWALL) as a function of the variable. We add the newly sampled datapoints to expand the training dataset and train the DeepLNE student models for and setting . Then, we proceed by using the student model variable as a CV to bias 200 ns of MD simulation with OPES in order to estimate the free-energy of the system. We set a PACE of 1000 steps and a barrier parameter equal to 80 kJ/mol. Again, a harmonic constraint with a cutoff value of 0.2 and kJ/mol (UWALL) is applied to the variable. All simulations are carried out in vacuo with the DES-Amber ff. Piana et al. (2020). We estimate statistical errors on the FES using 10 blocks of length 20 ns.
I.4.3 Pre-miR21
As a challenging system with a high degree of configurational complexity, we choose the RNA strand pre-miR21. This system consists of 31 nucleotides containing the physiologically relevant Adenosine nucleobase at 5’-3’ position 29 (A29) that can perform rotations between a ’stacked-in’ state (i.e. A29 pointing towards the paired RNA bases) (see Fig. 2) and a ’bulged-out’ state, in which A29 is translated in the bulk solution. Recently, pre-miR21 has been studied using extensive enhanced sampling MD simulations Aureli et al. (2024) using the multi-replica variant of OPES, i.e. OneOPES, where 8 concurrent replicas of the system are simulated under the effect of an increasing thermal ramp. This study uses the FES obtained by a simulation with a length of 400 ns per replica as a reference. Moreover, in the present study, we keep the simulation protocol identical to Ref. Aureli et al., 2024, however we perform only a single enhanced sampling MD simulation (using 1 replica). To sample the transition from state A (A29 ’stacked-in’) to state B (A29 ’bulged-out’) we use ABMD simulations biasing 3 different CVs simultaneously: 1) a primary contact map built using the heavy atom distances between the nucleobase A29 and the proximal nucleobases G45, C46 with cutoff 5 and 4 respectively (CMAP1); 2) a contact map constructed using the heavy atom distances between the nucleobase A29 and the proximal nucleobases C28, G30 with cutoff 4 (CMAP2); 3) a contact map built using the heavy atom distances between the nucleobases C28, G30 with cutoff 8.5 (CMAP3). We start multiple ABMD runs from both states at CMAP3 values that span the entire possible range to allow maximum variation in this descriptor space direction, terminating the simulations upon reaching the respective opposite state. We reduce the number of training datapoints to via FPS and choose the same distance metrics used for the ABMD simulations as input features in the DeepLNE training. The teacher model is trained with hyperparameters , , , . In the enhanced sampling simulation using the DeepLNE teacher model to expand the training data for the student models, we use the and variables to deposit bias with OPES-EXPLORE for a total simulation time of 20 ns. We choose a PACE of 500 steps, bandwidths 0.1 and 1 respectively, and a barrier parameter equal to 60 kJ/mol. We concatenate all available sampled trajectories, subsample a lower number of datapoints via FPS (), and continue by training the DeepLNE student model setting . After completing the training and exporting the student model into PLUMED we use as a CV to deposit bias via OPES. We set a PACE of 5000 steps and a barrier parameter equal to 35 kJ/mol. With this setup, a trajectory of 2 has been produced. The free energies were calculated by integrating the probability of being in respective microstates of the CV by . Statistical errors of the FES were estimated using 10 blocks of 200 ns. For the reference FES, we choose 10 blocks of 23 ns skipping the first 30 ns of the OneOPES simulation.
II Results
II.1 Particle in Müller-Brown-Potential
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x3.png)
In Fig. 3 we collected the results of some steps along the proposed DeepLNE knowledge distillation algorithm. In Fig. SI 1 the remaining intermediate steps can be found. The Müller-Brown-Potential used in this study encompasses 3 states. To show the effectiveness of the DeepLNE model we choose a relatively minimal architecture with 1 hidden layer and 3 nodes for the dimensionality reduction and 2 hidden layers with 9 and 3 hidden nodes for both neighborhood compression and decoder network and train it on a concatenation of only 2 ABMD trajectories connecting the respective end states with the intermediate state. After 5000 training epochs we can project the DeepLNE teacher model on the input feature space as shown in Fig. 3 (a). We project the behavior of the variable of and can appreciate that the choice of low k=3 leads to a DeepLNE model with high locality, which can be seen especially for the regions in phase space with minimal and maximal values. The smooth extrapolation behavior of the DeepLNE model originates from the continuous k-nearest neighbor step. To achieve a similar extrapolation in the DeepLNE student model we proceed by performing an OPES-EXPLORE simulation biasing and thereby expanding the training datapoints. The results are shown in Fig. 3 (b) where we find the entire range of values explored and importantly a broader range of values is explored too. For both DeepLNE student models and we choose an architecture with 3 hidden layers and 8 nodes each. Training the DeepLNE student models on the concatenation of the data shown in Fig. 3 (a) and the new sampled data in (b) and subsequently biasing and simultaneously applying a harmonic restraint on we obtain the 1D FES along shown in Fig. 3 (c). Importantly the estimated free energy is identical to a reweighted trajectory biasing both Cartesian coordinates and using the same OPES setting as used when biasing the path-like DeepLNE CVs. In Fig. SI 1 (a) we give additional insight into the labels assigned for the supervised multi-tasking framework of DeepLNE. Each datapoint in known regions of the descriptor space is labeled allowing for tailoring the directionality learned by the model during training. In Fig. SI 1 (b) we projected the DeepLNE variable on the descriptor space as done for in Fig. 3 (a). As expected we find the reconstructed data in the center of the training datapoints and accordingly the computed variable describes a reaction channel connecting all 3 system states. In Fig. SI 1 (d), we go beyond the 1D projection of the FES and use the additional information available from the variable to create the 2D FES obtained from the same enhanced sampling simulation run that resulted in Fig. 3 (c). The 2D FES projection allows for additional a posteriori analysis revealing transition regions and more detailed location as well as the extend of the metastable states. This will be especially useful when attempting to describe systems of increasing degree of complexity.
II.2 Alanine Tripeptide
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x4.png)
The alanine tripeptide system has been widely studied showing that it can be sufficiently described via its 6 dihedral angles and . In this study, even though alanine tripeptide has more states that constitute the entire configurational space, we treat it as a 3 state system considering the configurations , and as the metastable states. Additionally we consider a specific transition path that the system has to follow. In Fig. 4 (a) we show the training data obtained from multiple ABMD simulations describing this reaction path. The trajectories are connecting all 3 states and we proceed by training a DeepLNE teacher model. The trained model takes as input all 6 and angles and its variable is projected in along and by setting the angles to 0 and . We choose an architecture with 1 hidden layer and 3 nodes for the dimensionality reduction and 1 hidden layers with 2 hidden nodes for both neighborhood compression and decoder network. Also we select k=3 to obtain a DeepLNE model with high locality, which is reflected in the regions in dihedral space with minimal and maximal value (compare Fig. 4 (a)). The DeepLNE teacher is capable of smooth extrapolation originating from the continuous k-nearest neighbor step. We proceed by expanding the training data coupling the teacher model CV to OPES-EXPLORE simulation biasing and applying a harmonic restraint on in order to exclusively sample the envisioned reaction path. The results are shown in Fig. 4 (b) where we find the entire range of values explored and importantly obtain a more densely sampled descriptor space along the path. We train the DeepLNE student models on the concatenation of the data shown in Fig. 4 (a) and the newly sampled data in (b). Because of the functional complexity differences between and in this system the DeepLNE variable can be approximated with ANNs of lower complexity compared to the DeepLNE variable. Accordingly for the student model we choose a smaller architectures with 1 hidden layer and 3 nodes compared to the student model with 2 hidden layers and 6 and 3 nodes each. We can appreciate that the training datapoints describe the transition regions between the metastable states at and respectively. This is a specific choice for the transition path, but not the only relevant one considering the entire available configurational space for alanine tripeptide. Since dihedral angles are periodic the 3 metastable states that the variable connects can be reached by crossing also other transition regions visible in (a) and (b). This has the consequence that when we want to compare FES estimations obtained from DeepLNE biased simulations to a reference simulation that can sample the entire configurational space, we need to restrict the latter such that both enhanced sampling simulations sample the same reaction path, which can be achieved by applying harmonic restraints on in both cases. Accordingly we proceed by training the student models for and on the expanded training dataset and perform an OPES simulation biasing and restraining the system along to stay within the reaction tube constituted by the variables and . In Fig. 4 (c) we then compare the FES obtained from a DeepLNE biased run with a reference run biasing all 3 dihedral angles. We can appreciate that both simulations arrive at the same estimation of the FES in the reaction channel subspace defined via the DeepLNE student model variable . In Fig. SI 2 (a) we detail how the training data were labeled in order to supervise DeepLNE while constructing the path-like variables and . Each datapoint in the 3 states within the descriptor space is labeled allowing for the desired directionality of . In Fig. SI 2 (b) we projected the DeepLNE variable on the descriptor space as done for in Fig. 4 (a). The reconstructed data and the corresponding variable that is a function of describes a reaction channel connecting all 3 system states that is a subspace of the entire configurational space. In difference to the toy model investigated in II.1 there are states with significant probabilistic weight such that comparison to reference simulations need to be performed within the same reaction channel created by and . Respecting this condition the FES can be estimated in the 2 DeepLNE dimensions and as shown in Fig. SI 2 (d). We can appreciate that all metastable states as well as the transition regions appear as regions of orthogonal orientation with respect to the variable indicating that the created reaction channel by the combination of and crosses the transition states with high orthogonality, which is indicative of a CV that is approximating well the true reaction coordinate of the system (compare committor analysis performed in Fröhlking et al. (2024)).
II.3 Pre-miR21
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x5.png)
The microRNA pre-miR21 is a system of high configurational complexity, namely the number of available descriptors, significant degrees of freedom, and metastable states is difficult to define a priori. While the challenge of reducing the descriptor dimensionality can be addressed using DeepLNE by transforming the high dimensional input feature space into a single CV , the latter challenge can only be overcome by identifying a sub-set of the whole configurational space. Pre-miR21 has been recently studied with respect to processes for which base-flipping of the A29 is of relevance Aureli et al. (2024). Following the previous paper, the current study focuses on the same configurational re-arrangement. We note that this is an important first step that decides all the following analysis and results. As described in section I.4.3, we selected three CMAPS as input features for the DeepLNE teacher model. These CMAPS describe both the distance between A29 and its neighboring nucleobases in the base-paired state and the distance between the stacked nucleobases above and below A29. Multiple ABMD simulations make up the training data distributed maximally diverse in the descriptor space. As shown in Fig. SI 3 (a), we make use of the multi-tasking framework assigning labels to both end states known from the previous study as well as an ad hoc defined intermediate that is located between the end states when projecting the states in the input feature space . We choose an architecture with 1 hidden layer and 3 nodes for the dimensionality reduction and 1 hidden layer with 3 nodes for both neighborhood compression and decoder network. Also, we select k=3 to obtain a DeepLNE model with high locality. We proceed by training the DeepLNE teacher on the labeled training data, exporting it into PLUMED, and performing OPES-EXPLORE simulation biasing exclusively the variable. Combining the new datapoints (see Fig. SI 3 for (b) DeepLNE variable and (c) DeepLNE variable) sampled with the already available ABMD trajectories the DeepLNE student models are trained on reconstructing the projection of the DeepLNE teacher on those data. As demonstrated in alanine tripeptide we choose a smaller architecture with 1 hidden layer and 3 nodes for the student model and a slightly more flexible architecture for the student model with 2 hidden layers and 3 nodes each. In Fig. 5 we collected the results of the OPES simulation performed using the trained DeepLNE student model biasing the variable . In Fig. 5 (a) we compare the FES obtained in this study with the one previously obtained in Ref. Aureli et al., 2024 biasing a harmonic linear discriminant analysis (HLDA) CV (i.e., a weighted linear combination of 10 inter-nucleobases distances). We project the FES obtained from DeepLNE biased simulation along the same CV showing that it converges to the same FES estimation. We note that while the previous study used a total simulation time of over 3 using multiple replicas we obtained the same result after 2 in a single MD realization. Additionally in Fig. 5 (b) we investigate the underlying dynamical rearrangement of A29 in more detail by projecting the FES along both DeepLNE student model variables and . We can appreciate that while the HLDA CV used in the previous study can distinguish the 2 metastable end states ’stacked-in’ and ’bulged-out’, the DeepLNE variables and reveal 3 metastable states as well as the interconnecting transition regions. In Fig. SI 4 we extracted the centroids of the structural ensemble associated with the 3 metastable states that can be identified in Fig. 5 (b). While the end states are the previously identified conformations with A29 being ’stacked-in’ (<0.2) or ’bulged-out’ respectively (>0.9) and the intermediate metastable state corresponds to the approaching of A29 into the RNA helix, with A29 starting to form the necessary contacts to achieve the fully ’stacked-in’ state.
II.4 Performance Benchmarks
To quantify the computational speed-up that can be achieved by applying the knowledge distillation framework introduced in this study we compare the MD simulation performance when biasing the DeepLNE teacher model with DeepLNE student models of increasing number of parameters. The performances of all the models are evaluated using an AMD Ryzen 7950X CPU and a Nvidia RTX3080 GPU. Errors are estimated on 3 repetitions of each benchmark simulation. In Fig. 5 we collected the results of the average DeepLNE student model performance benchmark measure in ns/day for alanine tripeptide (42 atoms) and the solvated RNA pre-miR21 (45543 atoms). The displayed values should be compared to the respective PyTorch-based CV teacher model performance of 201 ns/day for alanine tripeptide and 119 ns/day for pre-miR21 corresponding to speed-up factors of up to 20 and 3 respectively when using the student models. Importantly we can appreciate that across more than 4 orders of magnitude in student model complexity the performance remains similar. While a student model for alanine tripeptide with 16 parameters leads to an average performance 3959 ns/day and a more complex model with 20000 parameters still achieves 3774 ns/day. A student model for pre-miR21 with 16 parameters results in an average performance 329 ns/day while a model with 20000 parameters arrives at 325 ns/day. These results show that DeepLNE student models offer the potential to approximate and bias functions of increasing complexity while achieving similar MD performance.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x6.png)
III Discussion
In this study, we propose the DeepLNE++ strategy to construct a computationally fast DeepLNE CV using a knowledge distillation approach. The algorithm selects the most generalizing model at each step, supervising its directionality via the assignment of labels to the training data and representing the path-like DeepLNE variables and using ANN student models. The new algorithm can be used to capture and especially accelerate rare events in large biophysical systems with reduced computational overhead. In multiple systems, we have shown how this procedure allows exporting computationally efficient CVs in MD software and maintaining its capability to approximate well the ideal reaction coordinate. Consequently, we could demonstrate that the DeepLNE++ CVs can be effectively used to improve sampling when coupled to enhanced sampling methods such as OPES without multi-replica approaches.
Our methodology starts by obtaining reactive trajectories that cover the descriptor space in the most diverse manner possible encompassing the transitions between all metastable system states of interest. Herein, these starting trajectories have been obtained with a ratchet-and-pawl restraint Camilloni, Broglia, and Tiana (2011), which provided reactive data of sufficient quality to successfully train DeepLNE models for all the studied systems. The input trajectory is used to define a feature vector to distinguish all states encountered and their intermediates. With this featurized trajectory, one can automatically derive a path-like DeepLNE teacher model. This model has advantageous extrapolation capabilities because it is intrinsically anchored to the training data. To maximally maintain this behavior when approximating the teacher model using ANNs as DeepLNE student models, we rely on an explorative enhanced sampling step, where we use the DeepLNE teacher model to bias the system dynamics in order to expand the training dataset. The interpretation of the derived variables and is identical in DeepLNE teacher and student models, as they describe the progress along the path and its perpendicular distance, respectively.
The proposed method achieves a speed-up in computational model evaluation when biasing MD simulations with DeepLNE. It does so by representing the DeepLNE model containing a large number of fitting parameters with ANNs of lower complexity. The DeepLNE++ approach uses cross-validation to choose the models with the highest robustness when extrapolating into unknown regions of the descriptor spaces. Furthermore, we represented the DeepLNE variables and by an ANN each, instead of using a single ANN to reconstruct both variables in the output. This decision is made because one would need to normalize the outputs and also ensure for each system that the objective function is leading to equally distributing loss minimization for each variable simultaneously.
We keep some hyperparameter choices to the user as regularization options that permit tailoring a path-like CV adapt to system-specific needs. The model training steps of the algorithm are computationally fast and can additionally employ available GPUs and therefore allow fine-tuning all hyperparameter choices of DeepLNE leading to CVs of high quality. We demonstrated how the hyperparameters within the DeepLNE knowledge distillation framework allow for regulating the models. For all the DeepLNE models we tested and in order to establish a continuous k-nearest-neighbor selection step that allows to obtain DeepLNE teacher models with high locality and numerical stability. The utility of the hyperparameter was shown for alanine tripeptide where the relevant dihedral angle input descriptors could be selected to describe the desired transition path. Although not implemented in this study, the selection of for each feature can be automated by adding an additional L1-norm or Softmax term for to the objective function. This would allow for automatic assignment of feature importance weights. In all 3 toy models, it was necessary to assign a significantly large value to the hyperparameter in order to supervise DeepLNE with respect to the directionality and sequentiality of the variables and . The hyperparameter that is penalizing the magnitude of the fitting parameters in teacher as well as student models is chosen relatively low in all 3 toy models. This resulted in models of higher functional flexibility. However, because we were choosing the models that minimize the cross-validation error in all cases the models exhibited sufficiently smooth extrapolation behavior and numerical stability when applied to the task of CV-based free energy estimation using enhanced sampling MD.
We notice that our proposed knowledge distillation framework is not limited to the representation of DeepLNE CVs as computationally fast ANNs, but can be used to approximate any computationally expensive CV.
We show that our method can be straightforwardly applied to multi-state systems of biologically relevant complexity at a fraction of the computational cost of the original DeepLNE CV Fröhlking et al. (2024). A DeepLNE student model can be successfully trained as long as the training trajectories visit the entire descriptor space encompassed by the reaction channel defined by the variables and . In the systems studied here, we impose a sequentiality of the metastable states into the DeepLNE variable in order to show that path-like CVs for multi-state systems benefit from a supervised multi-tasking objective function during the training step. In summary, we believe that the advances presented in this study significantly broaden the applicability of DeepLNE++, especially for large and complex biological systems. The introduction of automatic, efficient, and fast path-like CVs in DeepLNE++ is crucial for accelerating MD sampling of their dynamics and reactions.
Supplementary Material
Sections for each toy model showing (1) Müller-Brown-Potential with additional details of the DeepLNE training steps and resulting 2D FES of the biased MD simulations, (2) Alanine Tripeptide details of the DeepLNE training steps and resulting 2D FES of the biased MD simulations, (3) Pre-miR21 training datapoints for the DeepLNE models and structural analysis of the identified metastable states from the enhanced sampling MD simulations.
Acknowledgments
We acknowledge Luigi Bonati for useful discussion and providing many suggestions. F.L.G., V.R., S.A. and T.F. acknowledge the Swiss National Supercomputing Centre (CSCS) for large supercomputer time allocations projectID:s1228. They also acknowledge the Swiss National Science Foundation and Bridge for financial support (projects numbers: , and ).
Data availability
At https://github.com/ThorbenF/DeepLNE2 we provided tutorials together with scripts that allow to reproduce the figures and the results of this study.
References
- Dror et al. (2012) R. O. Dror, R. M. Dirks, J. Grossman, H. Xu, and D. E. Shaw, “Biomolecular simulation: A computational microscope for molecular biology,” Annual Review of Biophysics 41, 429–452 (2012).
- Laio and Gervasio (2008) A. Laio and F. L. Gervasio, “Metadynamics: a method to simulate rare events and reconstruct the free energy in biophysics, chemistry and material science,” Reports on Progress in Physics 71, 126601 (2008).
- Valsson, Tiwary, and Parrinello (2016) O. Valsson, P. Tiwary, and M. Parrinello, “Enhancing Important Fluctuations: Rare Events and Metadynamics from a Conceptual Viewpoint,” Annual Review of Physical Chemistry 67, 159–184 (2016).
- Camilloni and Pietrucci (2018) C. Camilloni and F. Pietrucci, “Advanced simulation techniques for the thermodynamic and kinetic characterization of biological systems,” Advances in Physics: X 3, 1477531 (2018).
- Laio and Parrinello (2002) A. Laio and M. Parrinello, “Escaping free-energy minima,” Proceedings of the National Academy of Sciences 99, 12562–12566 (2002).
- Bolhuis et al. (2002) P. G. Bolhuis, D. Chandler, C. Dellago, and P. L. Geissler, “Transition path sampling: Throwing ropes over rough mountain passes, in the dark,” Annual Review of Physical Chemistry 53, 291–318 (2002).
- Pietrucci (2017) F. Pietrucci, “Strategies for the exploration of free energy landscapes: Unity in diversity and challenges ahead,” Reviews in Physics 2, 32–45 (2017).
- Branduardi, Gervasio, and Parrinello (2007) D. Branduardi, F. L. Gervasio, and M. Parrinello, “From A to B in free energy space,” The Journal of Chemical Physics 126, 054103 (2007).
- Hovan, Comitani, and Gervasio (2019) L. Hovan, F. Comitani, and F. L. Gervasio, “Defining an optimal metric for the path collective variables,” Journal of Chemical Theory and Computation 15, 25–32 (2019).
- Bonati, Rizzi, and Parrinello (2020) L. Bonati, V. Rizzi, and M. Parrinello, “Data-driven collective variables for enhanced sampling,” The Journal of Physical Chemistry Letters 11, 2998–3004 (2020).
- Bonati et al. (2023) L. Bonati, E. Trizio, A. Rizzi, and M. Parrinello, “A unified framework for machine learning collective variables for enhanced sampling simulations: mlcolvar,” The Journal of Chemical Physics 159, 014801 (2023).
- Šípka, Erlebach, and Grajciar (2023) M. Šípka, A. Erlebach, and L. Grajciar, “Constructing collective variables using invariant learned representations,” Journal of Chemical Theory and Computation 19, 887–901 (2023).
- France-Lanord et al. (2024) A. France-Lanord, H. Vroylandt, M. Salanne, B. Rotenberg, A. M. Saitta, and F. Pietrucci, “Data-driven path collective variables,” Journal of Chemical Theory and Computation 20, 3069–3084 (2024).
- Fröhlking et al. (2024) T. Fröhlking, L. Bonati, V. Rizzi, and F. L. Gervasio, “Deep learning path-like collective variable for enhanced sampling molecular dynamics,” The Journal of Chemical Physics 160, 174109 (2024).
- Roweis and Saul (2000) S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science 290, 2323–2326 (2000).
- Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorc,” NIPS 2017 Workshop on Autodiff (2017).
- Invernizzi and Parrinello (2020) M. Invernizzi and M. Parrinello, “Rethinking metadynamics: From bias potentials to probability distributions,” The Journal of Physical Chemistry Letters 11, 2731–2736 (2020).
- Invernizzi and Parrinello (2022) M. Invernizzi and M. Parrinello, “Exploration vs convergence speed in adaptive-bias enhanced sampling,” Journal of Chemical Theory and Computation 18, 3988–3996 (2022).
- Rizzi et al. (2023) V. Rizzi, S. Aureli, N. Ansari, and F. L. Gervasio, “Oneopes, a combined enhanced sampling method to rule them all,” Journal of Chemical Theory and Computation 19, 5731–5742 (2023), pMID: 37603295.
- Camilloni, Broglia, and Tiana (2011) C. Camilloni, R. A. Broglia, and G. Tiana, “Hierarchy of folding and unfolding events of protein G, CI2, and ACBP from explicit-solvent simulations,” The Journal of Chemical Physics 134, 045105 (2011).
- Goscinski et al. (2023) A. Goscinski, V. Principe, G. Fraux, S. Kliavinek, B. Helfrecht, P. Loche, M. Ceriotti, and R. Cersonsky, “scikit-matter : A suite of generalisable machine learning methods born out of chemistry and materials science [version 2; peer review: 2 approved, 1 approved with reservations],” Open Research Europe 3 (2023).
- Tribello et al. (2014) G. A. Tribello, M. Bonomi, D. Branduardi, C. Camilloni, and G. Bussi, “Plumed 2: New feathers for an old bird,” Computer Physics Communications 185, 604–613 (2014).
- Aureli et al. (2024) S. Aureli, F. Bellina, V. Rizzi, and F. L. Gervasio, “Investigating ligand-mediated conformational dynamics of pre-mir21: A machine-learning-aided enhanced sampling study,” bioRxiv (2024).
- Valsson and Parrinello (2014) O. Valsson and M. Parrinello, “Variational approach to enhanced sampling and free energy calculations,” Physical Review Letters 113, 1–5 (2014).
- Piana et al. (2020) S. Piana, P. Robustelli, D. Tan, S. Chen, and D. E. Shaw, “Development of a force field for the simulation of single-chain proteins and protein–protein complexes,” Journal of Chemical Theory and Computation 16, 2494–2507 (2020).
- Abraham et al. (2015) M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, “GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers,” SoftwareX 1, 19–25 (2015).
- Bussi (2014) G. Bussi, “Hamiltonian replica exchange in gromacs: a flexible implementation,” Molecular Physics 112, 379–384 (2014).
Supplementary Material - DeepLNE++ leveraging knowledge distillation for accelerated multi-state path-like collective variables
Thorben Fröhlking1,2,3, Valerio Rizzi1,2,3, Simone Aureli1,2,3, Francesco Luigi Gervasio1,2,3,4
1School of Pharmaceutical Sciences, University of Geneva, Rue Michel Servet 1, 1206, Genève, Switzerland
2Institute of Pharmaceutical Sciences of Western Switzerland (ISPSO), University of Geneva, 1206, Genève, Switzerland
3Swiss Institute of Bioinformatics, University of Geneva, 1206, Genève, Switzerland
4Department of Chemistry, University College London, London, WC1E 6BT, United Kingdom
francesco.gervasio@unige.ch
SI 1. Particle in 3-state Müller-Brown-Potential
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x7.png)
SI 2. Alanine Tripeptide
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x8.png)
SI 3. Pre-miR-21
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x9.png)
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x10.png)