subscribe to arXiv mailings

Messenger RNA Design via Expected Partition Function and Continuous Optimization

Authors: Ning Dai, Wei Yu Tang, Tianshuo Zhou, David H. Mathews, Liang Huang

Abstract: The tasks of designing RNAs are discrete optimization problems, and several versions of these problems are NP-hard. As an alternative to commonly used local search methods, we formulate these problems as continuous optimization and develop a general framework for this optimization based on a generalization of classical partition function which we call "expected partition function". The basic idea… ▽ More The tasks of designing RNAs are discrete optimization problems, and several versions of these problems are NP-hard. As an alternative to commonly used local search methods, we formulate these problems as continuous optimization and develop a general framework for this optimization based on a generalization of classical partition function which we call "expected partition function". The basic idea is to start with a distribution over all possible candidate sequences, and extend the objective function from a sequence to a distribution. We then use gradient descent-based optimization methods to improve the extended objective function, and the distribution will gradually shrink towards a one-hot sequence (i.e., a single sequence). As a case study, we consider the important problem of mRNA design with wide applications in vaccines and therapeutics. While the recent work of LinearDesign can efficiently optimize mRNAs for minimum free energy (MFE), optimizing for ensemble free energy is much harder and likely intractable. Our approach can consistently improve over the LinearDesign solution in terms of ensemble free energy, with bigger improvements on longer sequences. △ Less

Submitted 1 March, 2024; v1 submitted 29 December, 2023; originally announced January 2024.

arXiv:2311.08339 [pdf, other]

Undesignable RNA Structure Identification via Rival Structure Generation and Structure Decomposition

Authors: Tianshuo Zhou, Wei Yu Tang, David H. Mathews, Liang Huang

Abstract: RNA design is the search for a sequence or set of sequences that will fold into predefined structures, also known as the inverse problem of RNA folding. While numerous RNA design methods have been invented to find sequences capable of folding into a target structure, little attention has been given to the identification of undesignable structures according to the minimum free energy (MFE) criterio… ▽ More RNA design is the search for a sequence or set of sequences that will fold into predefined structures, also known as the inverse problem of RNA folding. While numerous RNA design methods have been invented to find sequences capable of folding into a target structure, little attention has been given to the identification of undesignable structures according to the minimum free energy (MFE) criterion under the Turner model. In this paper, we address this gap by first introducing mathematical theorems outlining sufficient conditions for recognizing undesignable structures, then proposing efficient algorithms, guided by these theorems, to verify the undesignability of RNA structures. Through the application of these theorems and algorithms to the Eterna100 puzzles, we demonstrate the ability to efficiently establish that 15 of the puzzles indeed fall within the category of undesignable structures. In addition, we provide specific insights from the study of undesignability, in the hope that it will enable more understanding of RNA folding and RNA design. △ Less

Submitted 26 February, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

arXiv:2307.09580 [pdf, other]

LinearSankoff: Linear-time Simultaneous Folding and Alignment of RNA Homologs

Authors: Sizhen Li, Ning Dai, He Zhang, Apoorv Malik, David H. Mathews, Liang Huang

Abstract: The classical Sankoff algorithm for the simultaneous folding and alignment of homologous RNA sequences is highly influential, but it suffers from two major limitations in efficiency and modeling power. First, it takes $O(n^6)$ for two sequences where n is the average sequence length. Most implementations and variations reduce the runtime to $O(n^3)$ by restricting the alignment search space, but t… ▽ More The classical Sankoff algorithm for the simultaneous folding and alignment of homologous RNA sequences is highly influential, but it suffers from two major limitations in efficiency and modeling power. First, it takes $O(n^6)$ for two sequences where n is the average sequence length. Most implementations and variations reduce the runtime to $O(n^3)$ by restricting the alignment search space, but this is still too slow for long sequences such as full-length viral genomes. On the other hand, the Sankoff algorithm and all its existing implementations use a rather simplistic alignment model, which can result in poor alignment accuracy. To address these problems, we propose LinearSankoff, which seamlessly integrates the original Sankoff algorithm with a powerful Hidden Markov Model-based alignment module. This extension substantially improves alignment quality, which in turn benefits secondary structure prediction quality, confirmed over a diverse set of RNA families. LinearSankoff also applies beam search heuristics and the A$^\star$-like algorithm to achieve that runtime scales linearly with sequence length. LinearSankoff is the first linear-time algorithm for simultaneous folding and alignment, and the first such algorithm to scale to coronavirus genomes (n $\approx$ 30,000nt). It only takes 10 minutes for a pair of SARS-CoV-2 and SARS-related genomes, and outperforms previous work at identifying crucial conserved structures between the two genomes. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2210.14982 [pdf, other]

LinearCoFold and LinearCoPartition: Linear-Time Algorithms for Secondary Structure Prediction of Interacting RNA molecules

Authors: He Zhang, Sizhen Li, Liang Zhang, David H. Mathews, Liang Huang

Abstract: Many ncRNAs function through RNA-RNA interactions. Fast and reliable RNA structure prediction with consideration of RNA-RNA interaction is useful. Some existing tools are less accurate due to omitting the competing of intermolecular and intramolecular base pairs, or focus more on predicting the binding region rather than predicting the complete secondary structure of two interacting strands. Vienn… ▽ More Many ncRNAs function through RNA-RNA interactions. Fast and reliable RNA structure prediction with consideration of RNA-RNA interaction is useful. Some existing tools are less accurate due to omitting the competing of intermolecular and intramolecular base pairs, or focus more on predicting the binding region rather than predicting the complete secondary structure of two interacting strands. Vienna RNAcofold, which reduces the problem into the classical single sequence folding by concatenating two strands, scales in cubic time against the combined sequence length, and is slow for long sequences. To address these issues, we present LinearCoFold, which predicts the complete minimum free energy structure of two strands in linear runtime, and LinearCoPartition, which calculates the cofolding partition function and base pairing probabilities in linear runtime. LinearCoFold and LinearCoPartition follows the concatenation strategy of RNAcofold, but are orders of magnitude faster than RNAcofold. For example, on a sequence pair with combined length of 26,190 nt, LinearCoFold is 86.8x faster than RNAcofold MFE mode (0.6 minutes vs. 52.1 minutes), and LinearCoPartition is 642.3x faster than RNAcofold partition function mode (1.8 minutes vs. 1156.2 minutes). Different from the local algorithms, LinearCoFold and LinearCoPartition are global cofolding algorithms without restriction on base pair length. Surprisingly, LinearCoFold and LinearCoPartition's predictions have higher PPV and sensitivity of intermolecular base pairs. Furthermore, we apply LinearCoFold to predict the RNA-RNA interaction between SARS-CoV-2 gRNA and human U4 snRNA, which has been experimentally studied, and observe that LinearCoFold's prediction correlates better to the wet lab results. △ Less

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2206.14794 [pdf, ps, other]

LinearAlifold: Linear-Time Consensus Structure Prediction for RNA Alignments

Authors: Apoorv Malik, Liang Zhang, Milan Gautam, Ning Dai, Sizhen Li, He Zhang, David H. Mathews, Liang Huang

Abstract: Predicting the consensus structure of a set of aligned RNA homologs is a convenient method to find conserved structures in an RNA genome, which has many applications including viral diagnostics and therapeutics. However, the most commonly used tool for this task, RNAalifold, is prohibitively slow for long sequences, due to a cubic scaling with the sequence length, taking over a day on 400 SARS-CoV… ▽ More Predicting the consensus structure of a set of aligned RNA homologs is a convenient method to find conserved structures in an RNA genome, which has many applications including viral diagnostics and therapeutics. However, the most commonly used tool for this task, RNAalifold, is prohibitively slow for long sequences, due to a cubic scaling with the sequence length, taking over a day on 400 SARS-CoV-2 and SARS-related genomes (~30,000nt). We present LinearAlifold, a much faster alternative that scales linearly with both the sequence length and the number of sequences, based on our work LinearFold that folds a single RNA in linear time. Our work is orders of magnitude faster than RNAalifold (0.7 hours on the above 400 genomes, or ~36$\times$ speedup) and achieves higher accuracies when compared to a database of known structures. More interestingly, LinearAlifold's prediction on SARS-CoV-2 correlates well with experimentally determined structures, substantially outperforming RNAalifold. Finally, LinearAlifold supports two energy models (Vienna and BL*) and four modes: minimum free energy (MFE), maximum expected accuracy (MEA), ThreshKnot, and stochastic sampling, each of which takes under an hour for hundreds of SARS-CoV variants. Our resource is at: https://github.com/LinearFold/LinearAlifold (code) and http://linearfold.org/linear-alifold (server). △ Less

Submitted 5 July, 2024; v1 submitted 29 June, 2022; originally announced June 2022.

arXiv:2004.10177 [pdf, other]

doi 10.1038/s41586-023-06127-z

Algorithm for Optimized mRNA Design Improves Stability and Immunogenicity

Authors: He Zhang, Liang Zhang, Ang Lin, Congcong Xu, Ziyu Li, Kaibo Liu, Boxiang Liu, Xiaopin Ma, Fanfan Zhao, Weiguo Yao, Hangwen Li, David H. Mathews, Yujian Zhang, Liang Huang

Abstract: Messenger RNA (mRNA) vaccines are being used for COVID-19, but still suffer from the critical issue of mRNA instability and degradation, which is a major obstacle in the storage, distribution, and efficacy of the vaccine. Previous work showed that optimizing secondary structure stability lengthens mRNA half-life, which, together with optimal codons, increases protein expression. Therefore, a princ… ▽ More Messenger RNA (mRNA) vaccines are being used for COVID-19, but still suffer from the critical issue of mRNA instability and degradation, which is a major obstacle in the storage, distribution, and efficacy of the vaccine. Previous work showed that optimizing secondary structure stability lengthens mRNA half-life, which, together with optimal codons, increases protein expression. Therefore, a principled mRNA design algorithm must optimize both structural stability and codon usage to improve mRNA efficiency. However, due to synonymous codons, the mRNA design space is prohibitively large, e.g., there are $\sim\!10^{632}$ mRNAs for the SARS-CoV-2 Spike protein, which poses insurmountable challenges to previous methods. Here we provide a surprisingly simple solution to this hard problem by reducing it to a classical problem in computational linguistics, where finding the optimal mRNA is akin to finding the most likely sentence among similar sounding alternatives. Our algorithm, named LinearDesign, takes only 11 minutes for the Spike protein, and can jointly optimize stability and codon usage. Experimentally, without chemical modification, our designs substantially improve mRNA half-life and protein expression in vitro, and dramatically increase antibody response by up to 23$\times$ in vivo, compared to the codon-optimized benchmark. Our work enables the exploration of highly stable and efficient designs that are previously unreachable and is a timely tool not only for vaccines but also for mRNA medicine encoding all therapeutic proteins (e.g., monoclonal antibodies and anti-cancer drugs). △ Less

Submitted 17 March, 2022; v1 submitted 21 April, 2020; originally announced April 2020.

Comments: 17 pages for main text; 24 pages of supporting information; 4+11=15 figures; 2 tables (both in supporting information)

Journal ref: Nature (621), 396-403, 2023

arXiv:2001.04020 [pdf, other]

doi 10.1093/bioinformatics/btz375

LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search

Authors: Liang Huang, He Zhang, Dezhong Deng, Kai Zhao, Kaibo Liu, David A. Hendrix, David H. Mathews

Abstract: Motivation: Predicting the secondary structure of an RNA sequence is useful in many applications. Existing algorithms (based on dynamic programming) suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Results: We present a novel alternative $O(n^3)$-time dynamic programming algorithm for RNA folding t… ▽ More Motivation: Predicting the secondary structure of an RNA sequence is useful in many applications. Existing algorithms (based on dynamic programming) suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Results: We present a novel alternative $O(n^3)$-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in $O(n)$ time and $O(n)$ space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5'-to-3') direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both of which are well known to be challenging for the current models. Availability: Our source code is available at https://github.com/LinearFold/LinearFold, and our webserver is at http://linearfold.org (sequence limit: 100,000nt). △ Less

Submitted 21 December, 2019; originally announced January 2020.

Comments: 10 pages main text (8 figures); 5 pages supplementary information (7 figures). In Proceedings of ISMB 2019

Journal ref: Bioinformatics, Volume 35, Issue 14, July 2019, Pages i295--i304

arXiv:1912.13190 [pdf, other]

LinearPartition: Linear-Time Approximation of RNA Folding Partition Function and Base Pairing Probabilities

Authors: He Zhang, Liang Zhang, David H. Mathews, Liang Huang

Abstract: RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy (MFE) methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and… ▽ More RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy (MFE) methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore a slow calculation for long sequences. This slowness is even more severe than cubic-time MFE-based methods due to a larger constant factor in runtime. Inspired by the success of our recently proposed LinearFold algorithm that predicts the approximate MFE structure in linear time, we design a similar linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base pairing probabilities, which is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g., 2.5 days vs. 1.3 minutes on a sequence with length 32,753 nt). More interestingly, the resulting base pairing probabilities are even better correlated with the ground truth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on families with the longest length sequences (16S and 23S rRNA), as well as a substantial improvement on long-distance base pairs (500+ nt apart). △ Less

Submitted 1 February, 2020; v1 submitted 31 December, 2019; originally announced December 2019.

Comments: 12 pages main text (10 figures); 6 pages SI (7 figures). Bioinformatics, July 2020 (Proceedings of ISMB 2020)

Journal ref: Bioinformatics, July 2020 (Proceedings of ISMB 2020)

arXiv:1912.12796 [pdf, other]

ThreshKnot: Thresholded ProbKnot for Improved RNA Secondary Structure Prediction

Authors: Liang Zhang, He Zhang, David H. Mathews, Liang Huang

Abstract: RNA structure prediction is a challenging problem, especially with pseudoknots. Recently, there has been a shift from the classical minimum free energy-based methods (MFE) to partition function-based ones that assemble structures using base-pairing probabilities. Two examples of the latter group are the popular maximum expected accuracy (MEA) method and the ProbKnot method. ProbKnot is a fast heur… ▽ More RNA structure prediction is a challenging problem, especially with pseudoknots. Recently, there has been a shift from the classical minimum free energy-based methods (MFE) to partition function-based ones that assemble structures using base-pairing probabilities. Two examples of the latter group are the popular maximum expected accuracy (MEA) method and the ProbKnot method. ProbKnot is a fast heuristic that pairs nucleotides that are reciprocally most probable pairing partners, and unlike MEA, can also predict structures with pseudoknots. However, ProbKnot's full potential has been largely overlooked. In particular, when introduced, it did not have an MEA-like hyperparameter that can balance between positive predictive value (PPV) and sensitivity. We show that a simple thresholded version of ProbKnot, which we call ThreshKnot, leads to more accurate overall predictions by filtering out unlikely pairs whose probabilities fall under a given threshold. We also show that on three widely-used folding engines (RNAstructure, Vienna RNAfold, and CONTRAfold), ThreshKnot always outperforms the much more involved MEA algorithm in (1) its higher structure prediction accuracy, (2) its capability to predict pseudoknots, and (3) its faster runtime and easier implementation. This suggests that ThreshKnot should replace MEA as the default partition function-based structure prediction algorithm. ThreshKnot is already available in the widely used RNAstructure software package version 6.2 (released November 27, 2019): https://rna.urmc.rochester.edu/RNAstructure.html △ Less

Submitted 8 January, 2020; v1 submitted 29 December, 2019; originally announced December 2019.

Showing 1–9 of 9 results for author: Mathews, D H