Keywords

1 Introduction

Let \(G=(V,E)\) be undirected graph composed of by a finite set of vertices V and a set E of unordered pair of vertices called edges, where each edge represents a relation between two vertices.

Our RNA analysis is based on dual graphs, introduced in 2003 by Gan et al. [7], to model RNA secondary structures (2D). The 2D elements of RNA molecules consist of double-stranded (stem) regions defined by base pairing such as Adenine-Uracil, Guanine-Cytosine, Guanine-Uracil, and single stranded loops; stems and loops are mapped to the vertices and edges of the corresponding dual graph, respectively (later we present an alternative definition of dual graphs). Dual graphs can represent complex RNA structures called pseudoknots (PKs); these structures involve an intertwining of two-base-paired regions of the RNA and are common elements in many biologically relevant RNAs.

In [19, 20] a linear-time partitioning algorithm was introduced based on the dual graph representation of RNA 2D. This algorithm partitions a dual graph into connected components called blocks and then determines whether each block contains a pseudoknot or is a regular region. Thus our procedure provides a systematic approach to partition an RNA 2D, into smaller classified regions, while providing a topological perspective for the analysis of RNAs.

Pseudoknots can be classified into two main groups: recursive and non-recursive pseudoknot [9, 23]. The former is distinguished from the latter because it contains an internal pseudoknotted or regular region that does not intertwine with external stems within the PK; in this work, the original algorithm is extended to classify PKs into these two main categories. In addition, as a recursive PK comprises independent regions or fragments, our modified algorithm can also identify each of these regions, to be later cataloged and applied in the analysis of RNAs with pseudoknots.

In the next section, we present background material and definitions relevant to this paper, and we review the partitioning algorithm introduced in [19, 20], as well as its applications, as for example the development of a library of building blocks for RNA design by fragment assembly [13]. Following this line of research, in Sect. 3 it is shown how the partitioning approach can be extended so if a block contains a pseudoknot, then it can be classified as either recursive or non-recursive; in the case the PK is recursive, the algorithm can also identify each independent region. We summarize the findings and outline new directions in Sect. 4. An Appendix section includes computational tests performed by the modified algorithm, on some RNA’s motifs.

2 Background

2.1 Biological and Topological Definitions

In 2003, Gan et al. [7] introduced tree and dual graph-theoretic representations of RNA 2D motifs in a framework called RAG (RNA-As-Graphs) [5, 8, 12, 16]. A pseudoknot is an intertwining of two-based-paired regions (stems) of an RNA (see for example Fig. 1).

The partition algorithm is based on topological properties of graphs, suggests an alternative way to look at the problem of detection and classification of PKs and of general RNAs. As base pairing in PKs is not well-nested, making the presence of PKs in RNA sequences more difficult to predict by the more classical dynamic programming [3] and context-free grammars standard methods [2].

Following (Kravchenko [17]), we define our biological variables as follows.

Definition 1

General terms:

  1. a.

    RNA primary structure: a sequence of linearly ordered bases \( x_1, x_2, \ldots , x_r\), where \(x_i \in \{ A, U,\) \( C, G\}\).

  2. b.

    canonical base pair: a base pair \((x_i, x_j) \in \{(A,U), (U,A), (C,G), (G,C),\) \((G,U), (U,G)\}\).

  3. c.

    RNA secondary structure without pseudoknot - or regular structure, encapsulated in the region \((i_0, \dots , k_0)\): an RNA 2D structure in which no two base pairs \((x_i, x_j), (x_l, x_m)\), satisfy \(i_0 \le i< l< j < m \le m_0\) (i.e., no two base pairs intertwined).

  4. d.

    a base pair stem: a tuple \((x_i, x_{i+1},\ldots ,x_{i+r}, x_{i+(r+1)},\ldots , x_{j-1}, x_j)\) in which \((x_i, x_j),\) \( (x_{i+1}, x_{j-1}), \ldots , (x_{i+r}, x_{i+(r+1)})\) form base pairs.

  5. e.

    loop region: a tuple \((x_1, x_2, \ldots , x_r)\) in which \(\forall _{i \le j \le r}(x_i, x_j)\) does not form a base pair.

  6. f.

    a pseudoknot encapsulated in the region \((i_0, \dots , k_0)\): if \(\exists l,m, (i_0< l< m < k_0)\) such that \( (x_{i_0}, x_m)\) and \( (x_l, x_{k_0})\) are base pairs (i.e., at least two base pairs intertwined).

A graphical representation is a natural way to describe an RNA 2D structure (see Fig. 1(a), (b)), in which the x-axis is labeled according to the primary linearly ordered sequence of bases (Definition 1a), and a stem (Definition 1d) is represented by arcs connecting base pairs. A region on the x-axis between the end-points of the arcs representing stems is called a segment.

A dual graph can be defined from the graphical representation of an RNA 2D structure as follows (Fig. 1).

Definition 2

The dual graph is defined by mapping stems and the segments between stems (x-axis), of the graphical representation of an RNA 2D structure, to the vertices and edges of the dual graph, respectively.

Fig. 1.
figure 1

This figure was originally depicted in [20].

Graphical and dual graph representations of an RNA 2D structure. (a) graphical representation of a pseudoknot-free RNA primary sequence and embedded stems or base pairs; (a\(^\prime \)) corresponding dual graph representation. (b) graphical representation of a pseudoknotted RNA 2D structure; (b\(^\prime \)) corresponding dual graph.

In the next section we present our partitioning approach as of a dual graph G, into subgraphs \(G' \subseteq G\), called blocks.

2.2 Graph Partitioning Algorithm

The graph-theoretic partitioning algorithm is based on identifying articulation points of the dual graph representation of an RNA 2D. An articulation point is a vertex of a graph whose deletion disconnects a graph or an isolated vertex remains.

We need to define the following.

Definition 3

Connectivity

  1. a.

    A vertex v is an articulation point or cut-vertex if \(G-{v}\) results in a disconnected graph (i.e., at least two connected components remain) or an isolated vertex remains.

  2. b.

    A connected component is non-separable if it does not have an articulation point (or cut-vertex). Please note that single edges or isolated points are non-separable.

  3. c.

    A block is a maximal (edge-wise) non-separable graph.

  4. d.

    An edge-set X is an edge-disconnecting set if the removal of X from G results in a disconnected graph. The edge-connectivity of a graph \(\lambda (G)\) is the size of a minimum edge-disconnecting of G.

  5. e.

    The degree of a vertex v of G is the number of \(d_G(v)\) is the number of edges incident at v.

Articulation points allow us to identify blocks (see Fig. 2); since a block is a maximally non-separable component, a pseudoknot cannot be then contained in two different blocks. Thus identification of these block components allows us to isolate pseudoknots (as well as pseudoknot-free blocks), without breaking their structural properties.

Fig. 2.
figure 2

Identification of (a) articulation points and (b) partitioning of a dual graph.

An algorithm for identifying (bi-connected) block components in a graph was introduced by Hopcroft and Tarjan [11], and runs in linear computational time.

A hairpin loop occurs when two regions of the same strand, usually complementary in nucleotide sequence when read in opposite directions, base-pair to form a double helix that ends in an unpaired loop. A self-loop in the dual graph, i.e., an edge having the same vertex as the end-points, represents a hairpin, and as it does not connect two different vertices (i.e., stems), it is formally deleted from the dual graph.

From Definition 1c, an RNA 2D structure is a regular-region (pseudoknot-free) and encapsulated in a region \((i_0, \dots , k_0)\), if no two base pairs \((x_i, x_j), (x_l, x_m)\), satisfy \(i< l< j < m\), \(i_0 \le i,j,l,m \le m_0\), otherwise the region is a pseudoknot; this definition yields the following main result.

Corollary 4

[19, 20] Given a dual graph representation of RNA 2D structure, a block represents a pseudoknot if and only if the block has a vertex of degree (Definition 3e) at least 3.

The partitioning algorithm performs the following steps.

  1. 1.

    Partition the dual graph into blocks by application of Hopcroft and Tarjan’s algorithm.

  2. 2.

    Analyze each block to determine whether contains a vertex of degree at least 3. If that is the case then the block contains a pseudoknot, according to Corollary 4. If not then the block represents a pseudoknot-free structure.

Consider as an example the dual graph shown in Fig. 2. This graph is decomposed into 2 blocks. According to Corollary 4, block 1 is a pseudoknot as it has a vertex of degree at least 3, while block 2, a cycle, corresponds to a regular region.

Our partitioning algorithm was applied recently [13] to analyze the modular units of RNAs for a representative database of experimentally determined RNA structures and to develop a library of building blocks for RNA design by fragment assembly, as done recently for tree graphs, along with supporting chemical mapping experiments [14]. Among the 22 frequently occurring motifs we found for known RNAs up to 9 vertices, 15 contain pseudoknots [13]. Thus, further classification of the pseudoknotted RNAs could help in cataloging and applications to RNA design. Another application of the partitioning algorithm to small and large units of ribosomal RNAs of various prokaryotic and eukaryotic organisms helped identify common subgraphs and ancestry relationships [13].

In the next section we extend our algorithm to classify PKs as either recursive or non-recursive; the algorithm can also identify each recursive region.

3 Classification of Pseudoknots as Either Recursive or Non-recursive and Identification of Each Recursive Region

The RNA 2D dual graph and graphical representations depicted in this section are based upon New York University’s RAG-database [12], and R-Chie visualization software [18], respectively.

Fig. 3.
figure 3

Recursive pseudoknot.

The definition of a recursive pseudoknot follows the one of Wong et al. [23]. Let \(A = x_1x_2 ... x_m\) be a sequence of linearly ordered bases, and M be the 2D of A. M is represented as a set of base pair positions, i.e., \(M = \{(i, j)| 1 \le i < j \le m,(x_i, x_j)\) is a base pair\(\}\). Let \( M_{x, y} \subseteq M\) be the set of base pairs within the sub-sequence \(x_1x_2 ... x_m\), \(1 \le x < y \le m\), i.e., \(M_{x, y}= \{(i,j) \in M | x \le i < j \le y\}\), with \(M=M_{1,m}\).

Definition 5

\(M_{x,y}\) is a recursive pseudoknot if \(M_{x,y}\) is a pseudoknot (see Definition 1f), and \(\exists a_1, b_1, \ldots , a_s, b_s, ( x< a_1< b_1< \ldots< a_s< b_s < y)\) that satisfy the followings.

Each \(M_{a_i, b_i}\) is called a recursive region.

  • \(M_{a_i, b_i} \), for \(1 \le i \le s\), is a recursive pseudoknot.

  • For each \(M_{a_i, b_i}, 1 \le i \le s\), there does not exist a base pair \((i, j) \in M\) that \(i \in [a_i, b_i]\) but \( j \notin [a_i, b_i]\), or \(i \notin [a_i, b_i]\) but \(j \in [a_i, b_i]\).

  • \(M_{x,y} - \cup _{i=1}^s M_{a_i,b_i}\) is either a regular structure or a pseudoknot.

A recursive pseudoknot is a pseudoknot \(M_{x,y}\) that contains a pseudoknotted or regular region \(M_{a,b}\), and there does not exist a base pair (cd), such that \(x_d\) is a base of \(M_{a,b}\) and \(x_c\) is a base of M external to \(M_{a,b}\) (see Fig. 3). Here we are assuming that \(M_{a,b}\) is contained in \(M_{x,y}\), that is, \(x< a< b < y\).

Wong et al. definitions [23] also incorporated the concepts of standard and non-standard pseudoknots; however it is not within the scope of this work to study them from the dual graph representation perspective.

A graph is Eulerian if there exist a trail with no repetition of edges from a vertex \(v_0\) of G, ending at vertex \(v_k\), covering all the edges of the topology; if \(v_0 = v_k\) then the graph is an Eulerian cycle (see [10], p. 64). Dual graph representations of general RNA 2D structures, and specifically of PKs, can be easily shown to be Eulerian graphs from Definition 2. By starting from the origin on the x-axis of the graphical representation and traversing to the right, a unique trail in its dual graph can be described, where all edges are covered.

Lemma 6

 [19, 20] The dual graph representations of RNA 2D structures and of PKs are Eulerian by following the primary sequence of bases.

As depicted in Fig. 1(b), the alternating sequence of stems and segments \(\{S_1, I, S_2, II, S_4, III, S_2, IV, S_1, V, S_3, VI, \) \(S_4, VII,\) \( S_3\}\) of the graphical representation (b) forms an Eulerian trail in its dual graph (b\(^\prime \)).

A pseudoknotted block can be classified as recursive by just calculating the edge-connectivity (see Definition 3d) of the block. As an example consider the Hepatitis Delta Virus Ribozyme (see Fig. 4), necessary for viral replication. The stem labeled 4 in the graphical representation (or vertex labeled 4 in the dual graph) is attached to the pseudoknot by the segments a and b in its graphical representation, or edges labeled a and b in the dual graph representation. It is clear that if the PK is recursive then the edge-connectivity of the pseudoknotted block must be 2. However it is not obvious that the converse is necessarily true, that is, if the pseudoknotted block has edge-connectivity 2 then it is recursive. The following Lemma settles this question.

Fig. 4.
figure 4

Hepatitis Delta Virus Ribozyme secondary structure. (a) Graphical representation. (b) Dual graph representation.

Lemma 7

The dual graph representation of a pseudoknotted block is recursive if and only if the block has edge-connectivity 2.

Proof

If the block \(M_{x,y}\) is a recursive pseudoknot then it contains an internal region \(M_{a,b}\) with \(x< a< b < y\) according the aforementioned definition. As there does not exist a base pair (cd) in which \(x_d\) is a base in the internal region and \(x_c\) is a base of the pseudoknot outside this internal region, then \(M_{a,b}\) must be adjacent to the remaining of the PK in the graphical representation by two segments, or equivalently, by two edges in the dual graph of the pseudoknot (see Fig. 4).

Conversely suppose that the dual graph of a PK, \(G=(V,E)\), has edge-connectivity 2 and let \(E^\prime =\{e_1,e_2\}\) represent a minimum size disconnecting set. As G is connected then deletion of \(E^\prime \) from G will result in exactly two connected components, \(G_1=(V_1,E_1)\) and \(G_2=(V_2,E_2)\) (see Fig. 5a). From Lemma 6 one knows that there exist an Eulerian path in G following the primary sequence in linear order, starting from the origin on the x-axis of the graphical representation and traversing to the right. Let the Eulerian path \(P= P_1. e_1 .P_2 .e_2. P_3\) where \(P_1\) starts at the initial base \(x_1\) (w.l.o.g. we assume that \(x_1\) is in \(G_1\)) (see Fig. 5b). It is the case that \(P_2\) must cover all the edges of \(G_2\) (following an ordered sequence of bases) because when P reaches \(e_2\) to continue with \(P_3\), P can not go back to \(G_2\) again as \(e_1\) and \(e_2\) were already used by the Eulerian path. Therefore \(G_2\) is a region composed of all edges (and vertices) corresponding to an ordered sequence of bases, that is, \(G_2\) is a well-defined region within the pseudoknotted block G.    \(\square \)

Fig. 5.
figure 5

Minimum disconnecting set of size 2. (a) Two connected subgraphs. (b) Eulerian path covering all the edges of the dual graph and following the primary sequence.

The edge-connectivity of a graph \(G=(V,E)\) can be determined in polynomial time in order \( (|V| |E|^2)\) using the max-flow min-cut theorem of network flows by Edmond and Karp [4], or Ford and Fulkerson [6].

As it is shown in the proof of by Lemma 7, we can also delete each pair of edges and determine if the graph is disconnected using Depth-First-Search [10] in time \((|E|^3)\); this variation allows us to find every internal recursive region of the recursive pseudoknot if such pair of edges exist. For example if edges a and b of Hepatitis Delta Virus Ribozyme 2D dual graph representation (Fig. 4) are deleted, then vertex labeled 4 corresponding to the stem labeled 4 of the graphical representation will be isolated. Thus the disconnecting set \(\{a, b\}\) uniquely identifies a recursive region (i.e., stem labeled 4).

Corollary 4 and Lemma 7 yield the following partitioning and classification of pseudoknots algorithm.

Algorithm for Partitioning and Classification of PKs

  1. i.

    Input dual graph \(G=(V,E)\) as the Adjacency Matrix, of a RNA 2D.

  2. ii.

    Output partitioning of the RNA 2D into recursive PK, non-recursive PK, and regular regions.

    1. 1.

      Partition the dual graph into blocks by application of Hopcroft and Tarjan’s algorithm;

    2. 2.

      Analyze each block to determine whether each contains a vertex of degree at least 3;

    3. 3.

      IF the block has a vertex of degree \(\ge 3\) then the block is a pseudoknot;

      • Apply max-flow min-cut theorem to determine edge-connectivity;

      • if edge-connectivity = 2 then the block is a recursive pseudoknot; else the pseudoknot is not recursive;

    4. 4.

      ELSE the block is a regular region;.

Fig. 6.
figure 6

A tRNA-like-structure. (a) Graphical representation. (b) Dual graph representation.

As another example consider a tRNA-like-structure [1], linked to regulation of plant virus replication (see Fig. 6). Note that the vertex labeled 4 is an articulation point, therefore this dual graph will be partitioned into two blocks, one is a pseudoknot, because it contains a vertex of degree 3 or greater (self-loops are deleted in the analysis), while the other block is a regular region. The pseudoknotted bock can be classified as recursive because it has edge-connectivity 2. In addition every disconnecting set of size 2 represents a recursive internal region of the PK.

As an example of a non-recursive pseudoknot consider the Translational repression of the Escherichia coli alpha operon mRNA [22], illustrated in Fig. 7. The dual graph representation of this motif 2D has edge-connectivity 3, thus it is not a recursive PK.

Fig. 7.
figure 7

Translational repression of the Escherichia coli alpha operon mRNA. (a) Graphical representation; (b) Dual graph representation.

The Appendix illustrates the output generated by the modified algorithm when is run on some of the aforementioned motifs. The algorithm is written in C++ and is archived for public use [21].

4 Conclusions and Ongoing Work

We have extended our partitioning algorithm of the dual graph representation of RNA 2D structures into maximal non-separable components called blocks, to classify pseudoknots as either recursive or non-recursive. In [19, 20] it was shown that an RNA 2D contains a pseudoknot if and only if the dual graph representation has a block in which one of the vertices is of degree 3 or larger. This paper showed that a pseudoknotted block is recursive if and only if the block has edge-connectivity 2. Moreover each disconnecting set of size 2 represents an internal recursive region of the pseudoknot allowing further classification of modular units for RNA design. These results also offer an alternative and simple way to visualize and classify PKs based on graph theoretical properties, allowing a systematic analysis of RNAs.

With our recent extension of our graph growing algorithm to generate dual graph libraries of possible RNA motifs, thousands more potential graphs were analyzed and classified [15]. The modified algorithm will be useful for these motifs for further studies of RNA structure and design.