Keywords

1 Introduction

RNA secondary structure without pseudoknots can be predicted efficiently by means of dynamic programming using a well-established standard energy model. Pseudoknots, however, play an important role in RNA function, contributing in particular to the regulation of translation and splicing, and ribosomal frameshifts [3, 24]. With pseudobase there is a dedicated repository of biologically relevant RNA pseudoknots [26]. The RNA folding problem for general pseudoknotted structures and energy models that depend on stacked base pairs can be formally stated as follows:

General RNA Folding

Input: An ordered sequence of vertices \((x_1,\dots ,x_k)\), weights \(\omega (i,j;k,l)\) that are non-positive only for “stacked edges”, i.e., if \(k=i+1\) and \(l=j-1\), and a bound E.

Question: Is there a matching M, i.e., a set of edges such that each vertex is incident to at most one edge, with \(f(M)\mathop {:=} \sum _{\{i,j\},\{k,l\}\in M} \omega (i,j;k,l)\le E\)?

The General RNA Folding problem is known to be NP-complete if arbitrary stacking energies \(\omega (\,.\,)\) can be used [1]. It remains NP complete in the unweighted case, i.e., for \(\omega (i,j;i+1,j-1)=1\) if and only if \(x_ix_j\) and \(x_{i+1}x_{j-1}\) are Watson-Crick base pairs [10]. Additional hardness results can be found in [20].

Several research groups proposed dynamic programming algorithms that solve the corresponding folding problem for certain restricted classes of matchings M with restrictions on the patterns of crossing edges \(\{i,j\}\) and \(\{k,l\}\) with \(i<j<j<l\)) forming the pseudoknots. These algorithms differ drastically in their definition of admissible pseudoknot types and thus in the extent of the search space, see [4, 12] for an overview. The performance of the different algorithms is difficult to compare because they typically employ different parametrizations of the energy model and thus already differ in their prediction of structure without pseudoknots. It is hard to decide, therefore, whether differences in the prediction accuracy are the consequence of better energy parameters for the stems and loops of pseudoknot-free parts of the structure, or whether they have to be attributed to the pseudoknots. It has remained an open question, therefore, whether the choice of the search space has an important impact, and whether there is an optimal pseudoknot model that is sufficiently inclusive to cover the known structures but rules out structures that are impossible or unlikely to be realized at all.

In this contribution we consider re-implementations of different pseudoknot models in a common framework. This allows us, in particular, to ensure that all knot-free structures and substructures are handled identically. Furthermore, it makes it possible to assign the same energy contributions to matching types of pseudoknots. Dynamic programming (DP) algorithms are commonly defined as recursion relations that iteratively fill memo-tables. These tables are often indexed by complex structures that make the implementation of DP recursions a tedious and error prone task [6]. The theory of Algebraic Dynamic Programming (ADP) [7] addresses this issue for a restricted class of DP algorithms for which (i) generation of the state space, (ii) scoring of states, and (iii) selection of desired solution can be separated completely. ADP is therefore the ideal framework for our endeavor, although there are attractive alternative abstract formalisms, such as “super-grammars” [18], forward-hypergraphs as an alternative description of dependencies [13] and inverse coupled rewrite systems (ICORES) [8].

2 Algebraic Dynamic Programming and ADPfusion

ADP utilizes a grammar to specify the state space and thus the structure of the recursion without any explicit reference to indices. The original setting of ADP are context-free languages, and thus productions of the form \(A\rightarrow \alpha \), where A is a non-terminal and \(\alpha \) is an arbitrary expression formed from terminals and non-terminals [7]. More recently, the formalism was extended to so-called multi-context-free grammars [16]. The main difference is that non-terminals may now be multi-dimensional, corresponding to non-overlapping sub-objects that are parsed independently. For both CFGs and MCFGs, each production determines a partition of the non-terminal on the l.h.s. Interpreting each non-terminal on the r.h.s. as a parser alleviates the need to specify indices explicitly. For instance, the simple production \(S\rightarrow \boldsymbol{ [ }S\boldsymbol{ ] }S\) corresponds to the recursion relation \(S_{ij} \mathop {+\!\!=} \sum _{k=i+1}^{j} c_i S_{i+1,k-1} \bar{c}_k S_{k+1,j}\), (with the convention that the empty parse \(S_{k,k-1}=1\) serves as neutral element), where the sequence interval [ij] on which a structure “lives” is indicated by the index pair \(S_{ij}\). The terminals c and \(\bar{c}\) together signify a base pair.

In ADP, each production is interpreted by an evaluation algebra. Productions as grammatical objects are linked to the evaluation algebra via a common (type) signature. To understand this connection, consider the grammar \(\{S \rightarrow c S c S\}\), where the brackets ’[’ and ’]’ have been generalized to accommodate any particular character. The r.h.s. of the single rule of this grammar has the following “type”: \(c \times x \times c \times x\), while the l.h.s. holds objects that evaluate to x. The full type of the rule then is \(c \times x \times c \times x \rightarrow x\). This type signature provides a constraint for both the grammar and the evaluation of parses of inputs. The terminal types c indicate that single characters are to be matched upon, while x indicates not only that the parse has to continue recursively but also that each recursive parse can immediately be replaced by a value of type x (by means of memoization), for example a locally optimal score. This finally points to the structure of the evaluation algebra. An evaluation algebra is devoid of any structural notion. Instead it only contains functions that interpret each parse and immediately replace it by a value. In the example above, for each ikj, the parse \(c_iS_{i+1,k-1}\bar{c}_kS_{k+1,j}\) is evaluated by a function of type, say, Char\(\times \)Int\(\times \)Char\(\times \)Int\(\rightarrow \)Int, which is either \(S_{i+1,k-1}+S_{k+1,j}+1\), if \(c_i\) and \(\bar{c}_k\) are pairing, or \(-\infty \).

The same grammar thus can be used to minimize scores, compute partition functions, density of states, or enumerate a fixed number of sub-optimal solutions by simply employing a different evaluation algebra. In addition, not only is each algebra comparatively simple, the notion of product operations on algebras allows for easy combination of different algebras to calculate diverse and complex questions over grammars [25].

We employ a particular variant of the idea of ADP, namely ADPfusion [21]. This framework performs in-depth program fusion during compilation, which effectively turns a very high-level declarative description of a dynamic programming into tight loops that operate directly on flat memory. Authors of dynamic programs may freely mix different types of grammars, which can operate on diverse and heterogeneous index spaces [23] while still producing the desired, efficient loops that are required for dynamic programs that are asymptotically costly.

The latter property is very useful for the type of grammars we are interested in here. Multiple Context-Free Grammars (MCFGs) [19] are a particular type of weakly context-sensitive grammar that, in contrast to the general case, employ in their rewrite rules only total functions that concatenate constant strings and components of their arguments. As a consequence MCFGs admit polynomial-time parsing, i.e., the membership of word w of length n in a language generated by an MCFG G can be determined in \(O(n^{c(G)})\), with a constant c(G) depending only on the grammar.

Each MCFG contains rules that conform to the canonical pseudoknot-free structures – and thus substrings that are juxtaposed – and rules over substrings that contain “holes” and are interleaved with each other. The latter are represented by higher-dimensional index objects. MCFGs therefore operate on non-terminals that have an interpretation as tuples of strings over an alphabet A – rather than strings as in the case of CFGs. Due to space constraints, we cannot give a formal presentation of MCFGs here and instead refer to [16]. Instead, we use the minimal pseudoknot model of GenussFold [16] as a means of explaining the notations at an operational level. Consider the following productions:

$$\begin{aligned} \begin{aligned} S&\rightarrow \epsilon \,\mathop {\big |}\,\bullet S \,\mathop {\big |}\,\boldsymbol{ [ }S \boldsymbol{ ] }S \,\mathop {\big |}\,A_1 B_1 A_2 B_2 \\ \left( \begin{matrix}A_1\\ A_2\end{matrix}\right)&\rightarrow \left( \begin{matrix}S \boldsymbol{ [ }A_1\\ A_2 S \boldsymbol{ ] }\end{matrix}\right) \,\mathop {\big |}\,\left( \begin{matrix}\epsilon \\ \epsilon \end{matrix}\right) \qquad \left( \begin{matrix}B_1\\ B_2\end{matrix}\right) \rightarrow \left( \begin{matrix}S \boldsymbol{ [ }B_1\\ B_2 S \boldsymbol{ ] }\end{matrix}\right) \,\mathop {\big |}\,\left( \begin{matrix}\epsilon \\ \epsilon \end{matrix}\right) \end{aligned} \end{aligned}$$
(1)

In addition to the terminals \(\epsilon \), \(\bullet \), \(\boldsymbol{ [ }\), \(\boldsymbol{ ] }\), which refer to the empty string, a single unpaired nucleotide and base pair, this MCFG uses three non-terminals: the one-dimensional nonterminal S represents arbitrary structures. The two-dimensional terminals \(\left( \begin{matrix}A_1\\ A_2\end{matrix}\right) \) and \(\left( \begin{matrix}B_1\\ B_2\end{matrix}\right) \) describe the two interleaved interacting parts of an H-type pseudoknot. Note that any one-dimensional index is represented by the tuple (ij) with \(i \le j\) to fully identify a substring. A simple example of a successful parse of the string [{]} is given in Fig. 1.

Fig. 1.
figure 1

Parse tree of the string [{]}. Compared to Eq. 1, the terminal symbols for the case \(\left( \begin{matrix}B_1\\ B_2\end{matrix}\right) \) have been replaced with the symbols \(\{,\}\) to emphasize the two terminals forming base pairs: [ pairs with ], while { pairs with }. The rule \(S \rightarrow A_1B_1A_2B_2\) splits the string into four (possibly empty) substrings, say, \([_{1,1},\{_{2,2},]_{3,3},\}_{4,4}\). The two-dimensional rule \(\left( \begin{matrix}A_1\\ A_2\end{matrix}\right) ^{1,1}_{3,3}\) then operates on pairs of indices simultaneously, while the construction guarantees that only legal parses are derivable, i.e., the parse over (1, 1), (3, 3) for \(A_1\) and \(A_2\).

In [16, 22] we introduced a domain specific langauge (DSL) that makes it fairly convenient to write productions with 2-dimensional non-terminals. Here, we employ the same idea. First, the l.h.s. is “reformatted” such that the components of the 2-dimensional non-terminal are aligned:

$$\begin{aligned} \left( \begin{matrix}S \boldsymbol{ [ }A_1\\ A_2 S \boldsymbol{ ] }\end{matrix}\right) \rightsquigarrow \left( \begin{matrix} S &{} \boldsymbol{ [ }&{} A_1 &{} - &{} - \\ - &{} - &{} A_2 &{} S &{} \boldsymbol{ ] }\\ \end{matrix}\right) \end{aligned}$$

and then each column is transposed into a tuple to obtain a linear text

figure a

The “gap symbols” -​ are used to specify whether one-dimensional terminals and non-terminals nt and S refer to the first or second dimension. The DSL also suppresses the indices of the components of two dimensional non-terminals. One thus simply writes

figure b

following as far as possible the notational convention of other ADP implementations [7].

Dynamic programming can be used to answer more complicated questions than the computation of maximum likelihood (or more generally score-optimal) solutions. One important class of problems concerns the relative likelihood with which a substructure occurs, weighted by its likelihood. This question, which also appears e.g. in certain algorithms for parameter fitting, requires a combination of inside and outside algorithms. These two algorithms describe the same search space. While the inside algorithm operates bottom-up, the corresponding outside algorithm traverses the search space in top-down order. Traditionally, the outside algorithm is carefully constructed by hand to correctly match in all cases and generate exactly the same probabilities (or scores). It is possible to fully automate this construction [23] along with the required conversions of the each evaluation algebra. While not shown here, this automated construction is available in ADPfusion and thus for all grammars we consider here. This yields, for instance, algorithms to compute Boltzmann-weighted base pairing probabilities for the different classes of pseudoknotted structures.

3 Pseudoknot Grammars

The context-free grammar describing the folding algorithms for pseudoknot-free structures as implemented e.g. in the ViennaRNA package [9] can be written in the following form

$$\begin{aligned} \begin{aligned} S&\rightarrow \epsilon \,\mathop {\big |}\,\bullet S \,\mathop {\big |}\,B S \\ B&\rightarrow cr\bar{c} \,\mathop {\big |}\,c r B r' \bar{c} \,\mathop {\big |}\,c M M' \bar{c} \\ M&\rightarrow r B \,\mathop {\big |}\,M B \,\mathop {\big |}\,M\bullet \qquad \qquad M' \rightarrow B \,\mathop {\big |}\,M'\bullet \end{aligned} \end{aligned}$$
(2)

The non-terminals denote an arbitrary structure (S), a structure enclosed by a base pair (B), a component of a multiloop with at least one base pair inside (M), and a multiloop component whose initial base is paired (\(M'\)). The grammar conforms to the standard energy model for RNA secondary structures [27], which distinguishes hairpin-loops, interior loops (including base pairs) with a single enclosed base pair, and multiloops with two or more enclosed pairs. The terminals \(\bullet \), and c, \(\bar{c}\) denote an unpaired base and base pair, respectively. In addition, we write r for a region without base pairs of length at least 1 and \(r,r'\) for a pair of regions of total length at least 1. The last two lines implement the loop decomposition, i.e., distinguishes hairpin, interior, and multibranch loops and decomposes multiloops to support and energies that are linear in the number of unpaired bases and the number emanating stems.

Table 1. Overview of the Pseudoknot Grammars, adapted from [4] and [12].
Fig. 2.
figure 2

The four types of pseudoknots with topological genus 1 [15] correspond to H-type pseudoknots (H), kissing hairpins (K) and two types of rare, more complex pseudoknots. The four types of pseudoknots correspond to the four alternatives in the LP\(^+\) and RN grammars.

Many of the competing models of pseudoknots are compared in terms of their MCFG grammars and languages in [12]. Table 1 summarizes the subset considered in the contributions: The RE (Rivas & Eddy) model specifies the most inclusive class of pseudoknots for which DP algorithms have become available so far [17]. On the other end of the spectrum, Lyngsø and Pedersen [11] considered non-recursive H-type pseudoknots. Below, we write T for the non-terminal describing pseudoknot-free structures. The LP+ model [4] includes also the four types of pseudoknots shown in Fig. 2. The model of Dirks & Pierce (DP) generalizes (LP) to include recursive H-type pseudoknots [5]. Reeder and Giegerich (RG) further restrict the appearance of unpaired bases in this setting [14]. While the two grammars look identical in Table 1, they differ in the implemention of the parsers for the terminals. This is due to a (recently remedied) limitation of the ADPfusion high-level parser that did not allow for interleaved non-terminals and terminals in the same “horizontal stack”. The more efficient, original construction of RG in [14] is now possible, whereas the one in Table 1 disregards alternatives that do not fit into the RG scheme during parsing – which is semantically correct, but asymptotically suboptimal. “Simple pseudoknots” were defined by Akutsu and Uemura (AU) [1] as comprising two interleaving distinct sets of base pairs. These pairs create an interleaved stem within both groups. Base pairings are organized so that the first group’s right bases and the second group’s left bases are arbitrarily interleaved, while the other bases are all outside the interleaved area. Categorizing secondary structures by the topological genus, Reidys et al. [15] showed that there are exactly four types of pseudoknots with genus 1, the simplest of which is the H-type pseudoknot, see Fig. 2. The genus-1 structures are referred to as (RN) below.

In order to connect the pseudoknot grammars with Turner’s standard energy model [27], we interpret \(\boldsymbol{ [ }S\boldsymbol{ ] }\) and \(\boldsymbol{ [ }T\boldsymbol{ ] }\) as a nonterminal B in the ViennaRNA recursions and employ the loop-decomposition of Eq. (2). Furthermore, we use the notation \(A^{(i)}_1\) and \(A^{(i)}_2\) for the components of two-dimensional non-terminals that have isomorphic productions (albeit possibly with different values in the evaluations algebras). For the latter we simply dropped the superscript \(^{(i)}\) in Table 1.

In line with the simplified multiloop model, we consider a single parameter, namely a pseudoknot initialization penalty, \(\varPsi \), which is associated with all productions that introduce a 2-dimensional non-terminal on their left side. For all helical parts within pseudoknots, the standard stacking energies are used. Unpaired positions are assigned additive contributions corresponding to the multiloop model.

4 Computational Experiments

In order to evaluate the accuracy of pseudoknot prediction we used a subset of the RNAstrand database [2]. Due to the computational costs of the pseudoknot algorithms, which have asymptotic running times of \(O(n^6)\), we restricted ourselves to entries with at most 70 nucleotides. This leaves 131 pseudoknot-free and 63 pseudoknotted target structures.

On the pseudoknot-free subset accuracy cannot exceed the accuracy on pseudoknot-free structures.Footnote 1 Very large values of \(\varPsi \), in fact, force the predictions to be pseudoknot-free. By construction, then, there is no difference between different grammars and the ViennaRNA-like baseline. On this data set, we achieve a limiting F1-measure of about 0.85 for \(\varPsi \ge 8\) kcal/mol. We note that this value is surprisingly large in comparisons with other benchmarks of RNA folding, probably due to the short sequences.

On the subset with pseudoknots, the performance does not depend very strongly on \(\varPsi \) for moderate values, it decreases, however, for large values of \(\varPsi >12\) kcal/mol as sensitivity decreases. This is expected, since excessive energy penalties for pseudoknots cause them to become markedly underpredicted.

Fig. 3.
figure 3

Performance of five different MCFGs for pseudoknotted RNA structures in comparison with the pseudoknot-free background (VRNA). Both F1 and MCC show the best performance for \(\varPsi \approx 10\) kcal/mol and indicate a qualitative difference between the three grammars with recursive pseudoknots (LP+, RG, and RN) compared to DP and LP.

Figure 3 summarizes the results. To give a balanced picture of performance and pseudoknotted and pseudoknot-free instances despite the difference in sample sizes, we averaged the performance measure for the two samples. As expected from analyzing the two subsets separately, we observe a performance peak for \(\varPsi \approx 10\) kcal/mol. Despite the short sequences in the test set we observe a superior performance of grammars that admit recursive pseudoknots.

5 Availability

This work is accompanied by git repositories. For readers who are interested in experimenting with pseudoknot grammars, we offer the “GenussFold” repository at https://github.com/choener/Prj-GenussFold. This project comes with all necessary dependencies and two options to experiment with and build pseudoknot grammars. It provides, via a nix flake, a complete development environment. In addition, if nix is not available, a more usual path via GHC Haskell and cabal is available. We refer to the readme in the project file on how to build the project. In addition, implementations for the different grammars are made available here: https://github.com/deggers/GenussFoldEnergyMin.

6 Concluding Remarks

Different grammars for the prediction of RNA structures with pseudoknots define vastly different search spaces. Variations of the grammar, therefore, include or exclude certain types of structures and thus in general will affect the predicted structures. While much effort has been expended to study and compare different implementations, no unifying framework was available in which all relevant pseudoknot model grammars are available together with a full fledged scoring system. As a consequence it has remained unclear to what extent differences in predictive power have to be attributed to the different scoring model, and to what extent the grammars themselves play an important role.

In this work we have begun a comprehensive study of the predictive power, advantages, and disadvantages due to the choice of grammar. So far, our study has been constrained to a subset of six grammars, including the pseudoknot-free RNA folding grammars from the ViennaRNA package [9]. Furthermore, we had to restrict ourselves to the set of sequences that can be folded by all grammars within predefined resource limits in order to accurately compare the quality of predictions. In order to minimize the influence of differences in scoring models we used here the initialization energy \(\varPsi \) for a pseudoknot as the single free parameter and otherwise treated pseudoknots like multiloops. Prediction performance as a function of \(\varPsi \) suggests a plausible value of about 10 kcal/mol for the optimal choice of this parameter. Interestingly, this value matches well with regression-based multiloop initialization terms, see [28] for an overview of multiloop energy models. We also observed that the two grammars LP and DP that do not admit recursive pseudoknots are outperformed by the three grammars that include recursive pseudoknots. Given the short size of the benchmarking targets this is surprising and deserves a closer examination.

This first study exposes several avenues for further exploration. When we began this study, we noted that certain production rules did not fit immediately into our framework. We chose to rewrite grammars to fit into the framework, while keeping their semantics intact. Since then, progress in ADPfusion ameliorates these shortcomings. A forthcoming more detailed study hence will encompass the full range of pseudoknot grammars. Recent improvements in the parsing and compiler fusion system further optimizes the resulting program code, enabling a systematic benchmark on significantly longer input sequences and thus more difficult instances.

Inspection of the grammars in Table 1 shows that the grammars are composed of many common rules or parts of rules. This suggests to make systematic use of another feature of ADP, namely the capability to compose grammars by additions, subtractions, and multiplications [22]. This type of construction will provide a guarantee that subsets of grammars that are supposed to be equal, will indeed generate the same structures, while at the same time reduce the complexity of the algorithms themselves. This approach will also simplify the exploration of more sophisticated energy models for pseudoknots, which in the simplest case distinguish the initialization terms for different knot types as suggested e.g. in [15].

Finally, the ability to automatically generate outside grammars opens up the possibility of calculating ensemble quantities and provides an important building block for parameter training extensions. The latter are required as grid based searches, as we performed for the pseudoknot initialization penalty, do not scale beyond two or three independent parameters.