
1 Introduction

In molecular biology and bioinformatics, the relationships between the structure of a molecule and its biological functions have been under intensive study for many years. As a result, a large number of models and tools have been developed to exploit the discovered relationships and improve the accuracy of structure prediction or recognition for these molecules [4, 11, 13, 21, 22, 26]. To date, proteins and noncoding RNAs (ncRNAs) have been the two most important types of such molecules.

ncRNAs are RNA molecules that do not encode proteins. Recent research in molecular biology has revealed that ncRNAs are of fundamental importance in many biological processes, such as gene regulation, chromosome replication and RNA modification [3, 12, 25]. As the amount of fully sequenced genomic data has grown explosively, computational search tools have become an important alternative for finding homologous RNA structures in genomic sequences and recognizing new ncRNAs [6, 7, 11, 13]. In general, a computational search tool scans through a genome and aligns each sequence segment in the genome to a structural model that describes the secondary structure of the searched family. A segment is reported as a potential member of the family if its alignment score is statistically significant. The structural model generally combines information from both the primary sequence content and the secondary structure.

A covariance model (CM) is a structural model that can accurately describe the secondary structure of an ncRNA family. Compared with profiling models based on Hidden Markov Models (HMMs), a CM [11] contains additional emission states that can generate the base pairs in the stems of the secondary structure, and CMs are therefore well suited to modeling RNA families. However, a well known drawback of CMs is that the computational efficiency of a CM-based search tool drops significantly even when the searched sequence is only of moderate length.

Our previous work developed a parameterized approach for efficiently computing the optimal sequence-structure alignment [13]. However, the computation time needed by the approach may rise sharply when the parameters become large integers. In [21], we developed a machine learning based approach that accurately extracts features to represent the sequences in a family; a classifier is then used to determine whether a sequence segment is a member of the searched family. Although this approach achieves higher search accuracy than CM-based methods, the computation time needed for feature extraction is of the same asymptotic order as that of the optimal sequence-structure alignment, and in practice the approach may need more computation time than CM-based search tools.

An important technique that can significantly improve the computational efficiency of search is filtering [5]. A filtering method employs a simpler sequence or structural model of the family and uses it to efficiently screen out genome segments that are unlikely to contain the searched structure. A number of filtering methods [1, 9, 23] have been developed to improve search efficiency. For example, tRNAscan-SE [9] uses two efficient tRNA detection algorithms as filters to preprocess a genome and remove most parts that are unlikely to contain a tRNA structure; the remaining parts of the genome are processed with a CM to determine the locations of tRNAs. FastR [1] treats an RNA structure as a collection of structural units and evaluates the specificity of each unit; a set of filters can then be constructed from the specificities of these structural units. In [23], base pairs in an RNA structure are safely broken by an algorithm and a set of filters is automatically generated from the resulting HMM models. TDFilter [8] uses a graph theoretic approach to construct a set of filters that reflect the most functionally important parts of the secondary structure of a family.

These approaches have significantly improved the computational efficiency of ncRNA search. However, most of them select filters based only on their specificity or recognition ability and are therefore not guaranteed to achieve the optimal computational efficiency for filtering. For large genomes, the computation time needed by filtering may significantly affect the speed of searching. Therefore, new approaches that can identify the most functionally important parts of the searched family while optimizing the computational efficiency of filtering are highly desirable for rapid search of ncRNA structures in genomes of large sizes.

In this paper, we develop a parameterized approach that constructs and selects filters based on the expected filtration ratio, which is the probability that a random sequence segment passes the filtering process. This ratio is a parameter that can be determined from the length of the genomes that need to be searched. The approach starts by evaluating the degree of conservation of each position in the structural alignment of the family and uses a dynamic programming method to find the parts that are of appropriate lengths and have the highest degree of conservation. Each such part contains a set of aligned subsequences, and a profile HMM can be constructed to describe them; each of these profile HMMs is considered a potential filter. We then employ a dynamic programming approach to evaluate the recognition ability of each potential filter. Finally, we select a set of filters from the set of all potential filters to minimize the computation time needed for filtering by solving a variant of the 0-1 knapsack problem.

We have implemented this filter selection algorithm and compared its performance with that of TDFilter [8] and the filter implemented in Infernal 1.1.1 [11, 24]. We tested the three filtering approaches on the RMARK3 benchmark dataset and compared both their search accuracy and their computational efficiency. Our testing results show that our filtering approach is significantly faster and achieves accuracy comparable with the other two filtering approaches. Specifically, it is around 6 times faster than TDFilter and 9 times faster than the filter implemented in Infernal 1.1.1.

2 The Approach

Given the structural alignment of a set of training sequences in a noncoding RNA family, the goal of the search problem is to locate potential members of the family in a given DNA genome. A filtering approach processes the sequence segments in the genome with one or a set of filters, and only segments that can be aligned to the filters with statistically significant scores are considered candidates for membership in the family. Each candidate is further processed by a more sophisticated structural model to determine whether it is indeed a member of the family.

As the first step of our filter generation approach, we use the information content to evaluate the degree of conservation of each position in the structural alignment of the training sequences and use a dynamic programming approach to compute the average information content of every subsequence whose length is within a given range. A set of potential filters can then be obtained by choosing a set of subsequences with the highest values of average information content.

In the second step, we use a dynamic programming approach to compute the mean and standard deviation of the distribution of alignment scores on each potential filter. Since a set of aligned subsequences is used to construct a potential filter, we align each subsequence in the set to the potential filter and compute a Z-score from its alignment score. We evaluate the recognition ability of each filter by the average of these Z-scores. From the average Z-score, an expected p-value can be obtained that represents the probability for a random sequence to be aligned to the potential filter with a statistically significant score.

In the last step, we select a set of filters that satisfies the expected filtration ratio while minimizing the computation time needed for filtering. We formulate this optimization problem as a variant of the 0-1 knapsack problem and use a dynamic programming approach to solve it efficiently.

2.1 Generating the Potential Filters

For each position in the structural alignment of a set of homologous sequences in an ncRNA family, we use information content as a measure of the degree of conservation for nucleotides in the position. The information content \( I(l) \) of a position l in the alignment can be computed as follows.

$$ I(l) = \sum\limits_{t = 0}^{4} {p_{l,t} } \log_{2} p_{l,t} $$
(1)

where \( p_{l,0} \) is the frequency with which a gap appears in position l, and \( p_{l,t} \) (\( 1 \le t \le 4 \)) are the frequencies with which nucleotides A, C, G, U appear in l, respectively. It is straightforward to see that the value of \( I(l) \) is minimized when the gap and each of the four nucleotides appear in l with equal probability. On the other hand, \( I(l) \) reaches its maximum when l contains only one of the four possible nucleotides. \( I(l) \) is therefore a measure of the degree of conservation of position l in the alignment.
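As a concrete illustration, the following Python sketch computes \( I(l) \) for a single alignment column; representing a column as a list of characters over {'-', 'A', 'C', 'G', 'U'} is our own assumption for illustration and not part of the method description.

```python
import math

def information_content(column):
    """Information content I(l) of one alignment column, following Eq. (1).

    `column` is assumed to be a list of characters over {'-', 'A', 'C', 'G', 'U'};
    p_{l,0} is the gap frequency and p_{l,1..4} are the nucleotide frequencies.
    Zero-frequency terms contribute nothing (the p*log2(p) -> 0 convention).
    """
    symbols = ['-', 'A', 'C', 'G', 'U']
    n = len(column)
    total = 0.0
    for s in symbols:
        p = column.count(s) / n
        if p > 0.0:
            total += p * math.log2(p)
    return total

# A perfectly conserved column attains the maximum value 0, while a column in
# which the gap and all four nucleotides are equally frequent attains the
# minimum value log2(1/5).
print(information_content(['A', 'A', 'A', 'A']))       # 0.0
print(information_content(['-', 'A', 'C', 'G', 'U']))  # about -2.32
```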

A region in an alignment consists of a set of consecutive positions in the alignment. It is not difficult to see that the average information content of a region reflects the degree of conservation of the region. Recent research has revealed that regions with a higher degree of conservation are generally more important for the biological functionality of the ncRNA. A set of regions with high values of average information content can thus be selected to construct the potential filters.

The average information content of all regions in the alignment can be computed with a dynamic programming approach. Specifically, we use a two dimensional table \( M[i,j] \) to represent the average information content of the region spanning positions i through j, where \( i \le j \). It is straightforward to see that \( M[i,j] \) can be determined with the following recurrence.

$$ M[i,i] = I(i) $$
(2)
$$ M[i,j] = \frac{(j - i)M[i,j - 1] + I(j)}{j - i + 1} $$
(3)

where Eq. (2) covers the case in which the region contains only a single position, and Eq. (3) relates \( M[i,j] \) to \( M[i,j - 1] \) when \( i < j \).
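The recurrence can be filled in \( O(L^{2}) \) time and space for an alignment with L positions. Below is a minimal Python sketch, assuming the per-position information content values are supplied in a 0-based list (the text uses 1-based positions).

```python
def average_information_content(I):
    """Fill the table M of Eqs. (2)-(3).

    I is the list of per-position information content values (0-based indices
    here, whereas the text uses 1-based positions); M[i][j] is the average
    information content of the region spanning positions i..j.
    """
    L = len(I)
    M = [[0.0] * L for _ in range(L)]
    for i in range(L):
        M[i][i] = I[i]                                   # Eq. (2)
        for j in range(i + 1, L):
            # Eq. (3): extend the region [i, j-1] by one position
            M[i][j] = ((j - i) * M[i][j - 1] + I[j]) / (j - i + 1)
    return M
```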

After the value of each element in M has been determined, we use the following algorithm to select regions that have high average information content and whose lengths are between a pair of given integers \( L_{1} \) and \( L_{2} \); a code sketch of this procedure follows the list.

1. Set F to be an empty set.

2. Compute the mean of the average information content of all regions whose lengths are between \( L_{1} \) and \( L_{2} \) based on M and store it in A.

3. Find all regions that have an average information content value larger than A and do not overlap with any region in F. Include all these regions in a set R.

4. If R is empty, go to step 6; otherwise, find the region T that has the largest average information content in R and include T in F.

5. Go to step 3.

6. Output the regions in F.
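The following Python sketch implements steps 1-6, taking the table M from the previous sketch; representing a region as an index pair (i, j) is our own illustrative choice.

```python
def select_regions(M, L1, L2):
    """Greedy selection of non-overlapping, highly conserved regions
    (steps 1-6 above).  A region is an index pair (i, j) with
    L1 <= j - i + 1 <= L2; M is the table of average information content."""
    L = len(M)
    candidates = [(i, j) for i in range(L)
                  for j in range(i + L1 - 1, min(i + L2, L))]
    if not candidates:
        return []
    A = sum(M[i][j] for i, j in candidates) / len(candidates)     # step 2

    def overlaps(r, s):
        return not (r[1] < s[0] or s[1] < r[0])

    F = []                                                        # step 1
    while True:
        R = [r for r in candidates                                # step 3
             if M[r[0]][r[1]] > A and not any(overlaps(r, s) for s in F)]
        if not R:                                                 # steps 4 and 6
            return F
        F.append(max(R, key=lambda r: M[r[0]][r[1]]))             # steps 4 and 5
```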

A profile HMM can be constructed from each region selected by the above algorithm; such a profile HMM is a potential filter. The profile HMM models each position in the region with two states: a matching state that generates a nucleotide, and a deletion state that models the gaps that may appear in the position. Given a selected region that includes all positions between positions m and n in the alignment, we use the following algorithm to construct a potential filter; a code sketch follows the list.

1. Include two states S and E in the state set H. Set the set T of transitions to be empty.

2. Generate two states \( D_{m} \) and \( M_{m} \) for position m; set the transition probability from S to \( D_{m} \) to be \( p_{m,0} \) and the transition probability from S to \( M_{m} \) to be \( 1 - p_{m,0} \). Include both transitions in T and both \( D_{m} \) and \( M_{m} \) in H.

3. Set the emission probability of nucleotide t in \( M_{m} \) to be \( p_{m,t} /\sum\limits_{q = 1}^{4} {p_{m,q} } \).

4. Set i to be \( m + 1 \).

5. Generate two states \( D_{i} \) and \( M_{i} \) for position i; set the transition probability from both \( D_{i - 1} \) and \( M_{i - 1} \) to \( D_{i} \) to be \( p_{i,0} \) and the transition probability from both \( D_{i - 1} \) and \( M_{i - 1} \) to \( M_{i} \) to be \( 1 - p_{i,0} \). Include all four transitions in T and both \( D_{i} \) and \( M_{i} \) in H.

6. Set the emission probability of nucleotide t in \( M_{i} \) to be \( p_{i,t} /\sum\limits_{q = 1}^{4} {p_{i,q} } \).

7. Increment i by one; if \( i \le n \), go back to step 5.

8. Set the transition probability from both \( D_{n} \) and \( M_{n} \) to E to be 1.

9. Output H and T as the potential filter.
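A minimal Python sketch of steps 1-9 follows, assuming the column frequencies are supplied as tuples \( (p_{l,0}, \ldots ,p_{l,4}) \); the string encoding of states (e.g. 'M12', 'D12') and the dictionary representation of H, T and the emission probabilities are our own illustrative choices.

```python
def build_potential_filter(p, m, n):
    """Construct the profile HMM for the region spanning alignment positions
    m..n, following steps 1-9 above.

    p[l] = (p_gap, p_A, p_C, p_G, p_U) gives the gap and nucleotide
    frequencies of position l.  Returns the state set H, the transition
    probabilities T (keyed by state pairs) and the emission probabilities
    of the matching states.
    """
    H, T, emissions = {'S', 'E'}, {}, {}              # step 1
    prev = ['S']                                      # predecessors of the first position
    for i in range(m, n + 1):                         # steps 2-7
        D, M = f'D{i}', f'M{i}'
        H.update([D, M])
        for u in prev:
            T[(u, D)] = p[i][0]                       # enter the deletion state with p_{i,0}
            T[(u, M)] = 1.0 - p[i][0]                 # enter the matching state otherwise
        total = sum(p[i][1:5])                        # steps 3 and 6: emission probabilities
        emissions[M] = [p[i][t] / total for t in range(1, 5)]
        prev = [D, M]
    for u in prev:                                    # step 8
        T[(u, 'E')] = 1.0
    return H, T, emissions                            # step 9
```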

Figure 1 provides an illustration of the potential filter constructed for a region in the structural alignment of a set of homologous sequences. The rectangle in (a) contains four columns of a selected region, and (b) shows the corresponding states and transitions that describe these columns in the profile HMM constructed for the region.

Fig. 1. (a) Columns \( a,a + 1,a + 2,a + 3 \) in the rectangle are selected by the algorithm; (b) the states and transitions among them in the profile HMM constructed for the region, where a square represents a matching state and a circle represents a deletion state.

2.2 Evaluating the Recognition Ability of Potential Filters

To evaluate the recognition ability of a potential filter, we consider the distribution of the alignment scores of all sequences randomly generated with the base composition of the genome to be searched. Based on this distribution, the statistical significance of an alignment score obtained on the potential filter can be evaluated by computing its associated Z-score. Specifically, given a distribution with mean m and standard deviation \( \sigma \), the Z-score associated with an alignment score A can be computed as follows.

$$ Z = \frac{A - m}{\sigma } $$
(4)

We then consider each subsequence that the region contains in the training alignment and align it to the potential filter to obtain an alignment score. The average Z-score of the alignment scores of all such subsequences is therefore a reliable measure of the recognition ability of the potential filter.

Equation (4) suggests that the mean m and standard deviation \( \sigma \) of the distribution must be computed in order to evaluate the recognition ability of each potential filter. One approach that has been used extensively to determine m and \( \sigma \) is sampling: a large number of random sequences are generated with the base composition of the genome, each is aligned to the potential filter, and the distribution is approximated by the resulting alignment scores. Sampling, however, requires generating and aligning a large number of random sequences and is therefore computationally expensive. To address this problem, we develop a dynamic programming approach that can compute the values of m and \( \sigma \) both accurately and efficiently.

Specifically, consider a potential filter constructed from a region that contains the positions between s and t in an alignment. A partial profile HMM can also be constructed from the positions between s and r, where r is an integer between s and t; we denote this partial profile HMM by \( Q_{r} \). We use a one dimensional table \( N[r] \) to store the natural logarithm of the expected alignment score obtained by aligning random sequences generated with the base composition of the genome to \( Q_{r} \). It is not difficult to see that \( N[r] \) can be computed with the following recursion relation.

$$ N[s] = \ln \left( p_{s,0} + (1 - p_{s,0} )\sum\limits_{t = 1}^{4} {p_{s,t} q_{t} } \right) $$
(5)
$$ N[r] = N[r - 1] + \ln \left( p_{r,0} + (1 - p_{r,0} )\sum\limits_{t = 1}^{4} {p_{r,t} q_{t} } \right) $$
(6)

where \( q_{t} \)’s (\( 1 \le t \le 4 \)) are the probabilities for nucleotides A, C, G, T to appear in the genome that needs to be searched. After every element in N is determined, the value of m can be computed with \( m = \exp (N[t]) \).

Based on the same principle, we can compute the expected value of the square of the alignment score. We use a one dimensional table \( S[r] \) to store the natural logarithm of the expected value of the squared alignment score obtained by aligning random sequences to \( Q_{r} \).

$$ S[s] = \ln \left( p_{s,0}^{2} + (1 - p_{s,0} )^{2} \sum\limits_{t = 1}^{4} {p_{s,t}^{2} q_{t} } \right) $$
(7)
$$ S[r] = S[r - 1] + \ln \left( p_{r,0}^{2} + (1 - p_{r,0} )^{2} \sum\limits_{t = 1}^{4} {p_{r,t}^{2} q_{t} } \right) $$
(8)

After every element in S is determined, the value of \( \sigma \) can be computed with Eq. (9).

$$ \sigma = \sqrt {\exp (S[t]) - m^{2} } $$
(9)
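Both recursions can be evaluated together in a single pass over the positions of the region. A sketch in Python, assuming the same column frequency tuples as before and a genome base composition \( q = (q_{A}, q_{C}, q_{G}, q_{T}) \):

```python
import math

def score_mean_and_std(p, q, s, t):
    """Mean m and standard deviation sigma of the alignment scores of random
    sequences on the filter built from positions s..t, following Eqs. (5)-(9).

    p[l] = (p_gap, p_A, p_C, p_G, p_T) are the column frequencies and
    q = (q_A, q_C, q_G, q_T) is the base composition of the genome.
    """
    N = 0.0      # log of the expected score, Eqs. (5)-(6)
    S = 0.0      # log of the expected squared score, Eqs. (7)-(8)
    for r in range(s, t + 1):
        gap = p[r][0]
        N += math.log(gap + (1.0 - gap) *
                      sum(p[r][k + 1] * q[k] for k in range(4)))
        S += math.log(gap ** 2 + (1.0 - gap) ** 2 *
                      sum(p[r][k + 1] ** 2 * q[k] for k in range(4)))
    m = math.exp(N)
    sigma = math.sqrt(max(math.exp(S) - m * m, 0.0))   # Eq. (9), guarded against round-off
    return m, sigma
```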

2.3 Selecting Filters

After a list of potential filters has been generated and their recognition ability has been evaluated, we are ready to select filters from the potential filters so that the computation time needed for filtering is minimized. Based on the average Z-score computed for each potential filter, we estimate the probability that a random sequence achieves an alignment score comparable with those of the training dataset; we use \( p(F) \) to denote this probability for a potential filter F. The estimation of \( p(F) \) assumes that the distribution of alignment scores on F is approximately Gaussian. In other words, \( p(F) \) is computed approximately from the average Z-score \( Z(F) \) with Eq. (10).

$$ p(F) \approx 1 - erf(Z(F)) = 1 - \frac{2}{\sqrt \pi }\int_{0}^{Z(F)} {e^{{ - t^{2} }} dt} $$
(10)

where the value of the error function can be obtained by looking up a table.
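In an implementation, the table lookup can be replaced by the error function of the standard library; the sketch below also computes the capacity \( c(F) = - \log_{2} p(F) \) used in the next subsection. The small floor in the logarithm is our own safeguard, not part of the method description.

```python
import math

def filter_pass_probability(z):
    """Approximate p(F) from the average Z-score Z(F) following Eq. (10):
    p(F) ~ 1 - erf(Z(F))."""
    return 1.0 - math.erf(z)

def filter_capacity(z):
    """Capacity c(F) = -log2 p(F) assigned to the filter in the knapsack
    formulation of Sect. 2.3; a small floor keeps the logarithm finite for
    very large Z-scores."""
    return -math.log2(max(filter_pass_probability(z), 1e-300))
```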

Since all potential filters are selected from regions that do not overlap in the alignment, the probability that a random sequence can achieve statistically significant alignment scores on a list of potential filters can be computed with Eq. (11).

$$ p(F_{1} ,F_{2} , \ldots ,F_{k} ) = \prod\limits_{c = 1}^{k} {p(F_{c} )} $$
(11)

By taking logarithm with base 2 on both sides of Eq. (11), we obtain the following equation.

$$ - \log_{2} p(F_{1} ,F_{2} , \ldots ,F_{k} ) = \sum\limits_{c = 1}^{k} ( - \log_{2} p(F_{c} )) $$
(12)

On the other hand, the computation time needed to optimally align a sequence segment to a potential filter that contains b states is \( O(b^{2} ) \). We thus assign each potential filter F a value \( W(F) \) based on the number of states in F; \( W(F) \) can be computed with Eq. (13).

$$ W(F) = b(F)^{2} $$
(13)

where \( b(F) \) is the number of states in F.

The filter selection problem can then be formulated as a variant of the 0-1 knapsack problem, where the objective is to select a set of filters from all potential filters such that the logarithm of the filtration ratio evaluated by Eq. (12) is approximately equal to the logarithm of the filtration ratio provided by the user, while the sum of the values of the selected filters is minimized. The only difference from the standard 0-1 knapsack problem is that the objective here is to minimize the total value of the "objects" placed into the "knapsack".

Based on the above observation, we assign each potential filter F a capacity \( c(F) = - \log_{2} p(F) \) and a value \( W(F) \) computed with Eq. (13). The total capacity of the knapsack is defined to be \( C = - \log_{2} f_{r} \), where \( f_{r} \) is the filtration ratio provided by the user; all capacity values are rounded to integers. The well known dynamic programming algorithm for the 0-1 knapsack problem can be slightly modified to solve this problem. Specifically, assume we have n potential filters and assign each filter a distinct integer identity in \( \{ 1,2,3, \ldots ,n\} \). We use a two dimensional table \( V(D,j) \) to store the minimum total value of a maximal set of filters selected from the potential filters with integer identities in \( \{ 1,2,3, \ldots ,j\} \) such that the sum of their capacities is at most D. It is not difficult to see that Eq. (14) holds for any \( 1 \le j \le n \) if \( 0 \le D < C_{m,j} \), where \( C_{m,j} \) is the minimum capacity of all potential filters with integer identities in \( \{ 1,2,3, \ldots ,j\} \).

$$ V(D,j) = 0 $$
(14)

We use \( F_{1} ,F_{2} , \ldots ,F_{n} \) to denote the potential filters with integer identity numbers \( 1,2,3, \ldots ,n \). It is not difficult to see that Eq. (15) holds if \( D < c(F_{j} ) \).

$$ V(D,j) = V(D,j - 1) $$
(15)

For \( D \ge c(F_{j} ) \), if \( V(D,j - 1) > 0 \), the value of \( V(D,j) \) can be computed with Eq. (16). Otherwise, its value can be computed with Eq. (17).

$$ V(D,j) = \hbox{min} \{ V(D,j - 1),V(D - c(F_{j} ),j - 1) + W(F_{j} )\} $$
(16)
$$ V(D,j) = W(F_{j} ) $$
(17)

It is not difficult to see that the above dynamic programming approach can efficiently select a set of filters that satisfies the required filtration ratio while minimizing the total value of the selected filters. A traceback table can be used to store the decision made in each step of the dynamic programming procedure, and the selected filters can be recovered with a standard traceback procedure.
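The sketch below implements one possible reading of this formulation: select a subset of filters whose rounded integer capacities sum to at least \( C = - \log_{2} f_{r} \) while the total value is minimized, using a standard knapsack-style table and traceback. The indexing differs slightly from Eqs. (14)-(17) and is our own simplification, not a literal transcription of the recurrences above.

```python
def select_filters(capacities, values, C):
    """Select filters whose rounded integer capacities sum to at least
    C = -log2(f_r) while the total value (Eq. 13) is minimized.

    A sketch of the covering reading of the knapsack variant above;
    V[j][d] is the minimum total value achievable with the first j filters
    when a capacity of at least d still has to be covered.  Returns the
    list of selected filter indices, or None if C cannot be reached.
    """
    INF = float('inf')
    n = len(capacities)
    V = [[0.0] * (C + 1) for _ in range(n + 1)]
    for d in range(1, C + 1):
        V[0][d] = INF                                   # no filters left, demand unmet
    for j in range(1, n + 1):
        for d in range(1, C + 1):
            skip = V[j - 1][d]
            need = max(d - capacities[j - 1], 0)        # demand left after taking filter j-1
            take = V[j - 1][need] + values[j - 1]
            V[j][d] = min(skip, take)
    if V[n][C] == INF:
        return None
    # Traceback: if dropping filter j-1 changes the optimum, it was taken.
    selected, d = [], C
    for j in range(n, 0, -1):
        if d > 0 and V[j][d] != V[j - 1][d]:
            selected.append(j - 1)
            d = max(d - capacities[j - 1], 0)
    return selected
```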

3 Experimental Results

We have implemented this filter selection approach in a computer program, NewFilter. The program is freely available and can be downloaded at the following link: http://www.rworldpublishing.org/software.html. We have tested its performance and compared its accuracy and computational efficiency with those of other filtering tools, including TDFilter [8] and the filters of Infernal 1.1.1 [11, 24]. TDFilter is a filter selection approach we developed previously based on graph tree decomposition; it selects functionally important structural units with a tree decomposition based dynamic programming approach and constructs filters from the selected units. Infernal 1.1.1 is a CM-based search tool that uses four successive filter stages, each implemented with a set of profile HMMs; each stage is more accurate than the previous one but requires more computation time for alignments.

The testing is performed on the RMARK3 benchmark dataset [11]. The dataset contains 106 ncRNA families. The structural alignment of the training sequences for each family is constructed based on the seed alignments of the Rfam 10.0 database [2]. The mutual identity between any pair of sequences in a structural alignment is below 70 % and the mutual identity between any pair of training and testing sequences is not higher than 60 %. The dataset contains 780 testing sequences that are inserted into ten genome-like sequences; each such sequence contains \( 10^{6} \) nucleotides, so the dataset contains around \( 1.016 \times 10^{7} \) nucleotides in total. All testing sequences are also available in the Rfam 10.0 database.

We have integrated all three filters with RNASC [21], a search tool developed based on a machine learning approach, and evaluated the accuracy and efficiency of the three combined tools on the RMARK3 benchmark dataset. For NewFilter, we generate potential filters from regions with lengths from 5 to 10 and set the filtration ratio to \( 1.0 \times 10^{ - 4} \). For each tested family, we compute the sensitivity and specificity based on the search results returned by each of the three combined tools and use the average of the sensitivity and specificity to evaluate the search accuracy of a combined search tool. Table 1 shows the distribution of the accuracy of the three combined tools and that of RNASC when no filter is used. It can be seen from the table that the combined search tool with NewFilter integrated achieves higher average search accuracy than the two other combined search tools. It is also clear from the table that the average search accuracy of RNASC does not drop significantly after NewFilter is integrated into it. We believe that selecting highly conserved regions to construct filters is an important reason why NewFilter achieves higher accuracy than both TDFilter and the Infernal filter.

Table 1. The distribution of search accuracy of RNASC (without a filter) and the three combined search tools on the RMARK3 benchmark.

In addition to search accuracy, we also evaluate the computational efficiency of all four tools. Table 2 shows the average amount of computation time needed by each of the four search tools to search a sequence that contains \( 10^{5} \) nucleotides. It can be seen from the table that NewFilter is around 6 times faster than TDFilter and 9 times faster than the Infernal filter. This is not surprising, since NewFilter selects short regions to construct filters and does not include information on base pairs in the secondary structure, which significantly reduces the computation time needed for alignments. In addition, NewFilter selects filters that optimize the computational efficiency of the filtering process, which contributes further to its advantage in computational efficiency.

Table 2. The average amount of computation time needed by each search tool to search a sequence of \( 10^{5} \) nucleotides.

4 Conclusions

In this paper, we develop a new approach to improve the computational efficiency of the annotation of non-coding RNAs. The approach uses dynamic programming to accurately evaluate the recognition ability of each potential filter and selects a set of filters that minimizes the computation time needed for filtering while maintaining a given filtration ratio. Our testing results show that the filters selected by this approach can significantly improve the computational efficiency of search without adversely affecting its accuracy. However, the evaluation has been performed on a limited number of ncRNA families and the results could thus be biased; more comprehensive testing is needed to further evaluate the accuracy of the approach. In [15-19], we have developed parameterized approaches that can solve a few NP-hard bioinformatics problems. Combining NewFilter with these approaches may further improve its search accuracy.