The Importance of Cis-regulatory Elements in mRNAs

Cells respond to the environment by changing gene expression. In human cells, gene expression is often regulated at both the transcriptional and translational levels. Steady-state levels of mRNAs and their proteins change through the combined effects of this regulation. Regulation at the posttranscriptional level is critical for rapid responses to environmental factors. Although protein levels correlate with mRNA levels, translational control is key to determining the cellular levels of specific proteins [10.1]. Cis-regulatory elements commonly affect the stability or translational efficiency of the mRNA [10.2]. Figure 10.1 shows a simplified schematic of the regulatory elements found in mRNAs.

Fig. 10.1

Simplified schematic of regulatory elements in DNA and an mRNA transcribed from the corresponding genomic region. White outlined elements (e.g., TFBS) are present but nonfunctional in the mature mRNA (CDS: coding sequence)

Many different proteins potentially bind to RNAs. In yeast there are estimated to be over 600 RNA-binding proteins (Fig. 10.2) [10.3], and in humans, based on the occurrence of RNA binding motifs in proteins, it is likely that this number is substantially higher [10.4]. The most common domains are K homology (KH, Fig. 10.3c), RNA recognition motif (RRM, Fig. 10.3d), and double-stranded RNA (dsRNA, Fig. 10.3a) binding domains [10.5]. Many of these proteins are part of other structures, for example, ribosomes or splicing complexes, but some of these have additional roles in mRNA binding. In addition, there are many proteins that do not have obvious canonical RNA binding domains but bind specifically to groups of RNAs [10.6].

Fig. 10.2

PUF3p in complex with the Cox17 mRNA. The PUF3 binding site is not base paired. (Figure rendered using PyMOL, PDB 3K49)

Fig. 10.3

Examples of RNA–protein interactions. (a) dsRNA binding domain of Staufen in complex with dsRNA (PDB 1EKZ). (b) Iron responsive element of ferritin mRNA in complex with an iron regulatory protein (PDB 2IPY). (c) KH domain (PDB 2ANR). (d) RRM domain (PDB 2L41). (All figures rendered using PyMOL)

Some of these trans-acting binding proteins bind to specific cis-regulatory elements in mRNAs, providing a regulatory mechanism determining transcript fate. The translation, stability, and localization of an mRNA may vary in response to these interactions.

Interactions with microRNAs (miRNAs) are of particular importance in determining mRNA stability. Predicted target sites for miRNAs have been reported for over 30% of human genes. However, only a small proportion of these predictions are experimentally confirmed.

The cis-regulatory elements encoded in the transcript are most often found in the untranslated regions (UTRs). This bias may partly arise because these regions are more tractable to experimental and bioinformatic analysis. There is, however, evidence that trans-acting factors such as miRNAs are easily displaced by the translational machinery and so operate more readily in the 3′ UTR [10.7]. Translation can be entirely repressed immediately after nuclear export, and so this displacement is not an issue for elements such as those found in the coding region of the mRNA for ASH1 [10.8]. Some cis-regulatory elements, such as the binding site for iron regulatory proteins, occur in the 5′ UTR [10.9].

Regulatory elements that act in the DNA may be also present in the mRNA. RNAs exported from the nucleus contain some of these elements that are nonfunctional in the mature transcript. This is particularly confounding for de novo element discovery in mRNAs. Knowledge of the genomic elements acting at the transcriptional level can help to resolve this.

Many types of structured and unstructured cis-regulatory mRNA elements act at the posttranscriptional level. Examples of such elements are the selenocysteine insertion sequence (SECIS), Histone3, PUF3, and iron responsive element (IRE). Computational methods have been developed to find each of these [10.10,11,12]. It is also useful to consider such well-characterized elements in order to estimate the variation that may exist in novel elements.

Methods developed for genes encoding structured noncoding RNAs (ncRNAs) may be applied to cis-regulatory elements [10.13,14].

Most importantly, these well-characterized elements may be used for benchmarking the performance of prediction algorithms that may then be applied in the search for novel elements. For this purpose a subset of the Rfam database [10.15] (e.g., CisReg), or small parts of databases of known RNA secondary structures, e.g., RNA STRAND [10.16] or CompaRNA [10.17], can be used.

In the past, the limitations of the methodology for determining cis-regulatory elements meant that bioinformatic prediction played a large role. In the last few years, new high-throughput RNA-Seq techniques combined with bioinformatics have been developed [10.18,19,20,21].

Searching for Cis-regulatory Elements Using Bioinformatics

Elements in mRNAs are necessary for regulation of stability, translational control, and localization. They may have structural motifs critical to their function or be characterized by primary sequence alone. These cis-regulatory elements are the target sites of miRNAs and RNA-binding proteins. Bioinformatic analysis of mRNA can be useful in proposing new models and hypotheses for experimental testing and in interpreting existing data. The aim of this analysis is to discover the sequences that regulate the fate of the mRNAs containing them.

There are many challenges in identifying cis-regulatory elements within RNAs. Their primary sequence patterns are often sparse. The binding sites for proteins may depend on just a few nucleotides with critical secondary structure. Determination of RNA structure is experimentally difficult, and prediction tools are often inaccurate. Regulatory elements are only sometimes conserved, and there are many distracting signals, such as elements operating at the transcriptional level. Some regulatory elements may be unique to a particular mRNA, but others such as the iron responsive element certainly operate within many mRNAs with divergent regulatory outcomes [10.12]. Even when effective bioinformatic models exist for a cis-regulatory element, the application of this model for the purposes of discovery will inevitably yield some false positives.

Some methods have been developed that attempt to discover new regulatory elements in mRNA sequences using only limited biological knowledge. However, the most successful methods utilize as much biological knowledge as is available in order to refine and inform the predictions. Despite the development and publication of several hundred methods (particularly for miRNA sites) only a few are commonly used (or cited). This may sometimes be because the expected or actual utility of the software is outweighed by the difficulty of installing and using it. In other cases the benefits of a particular tool may be outweighed by the familiarity of commonly used tools that do a similar job.

Summary of Tools and Data Sources

The most promising commonly used and current tools and research directions are discussed here (listed in Table 10.1). Lists with different foci can be found in the literature [10.40,41,42].

Table 10.1 Commonly used and current tools and research directions

Obtaining High-Quality mRNA Sequence Data to Analyze

Individual mRNA sequences can be obtained from the University of California, Santa Cruz (UCSC), Ensembl, or the National Center for Biotechnology Information (NCBI). RefSeq annotations of UTRs are useful to focus on, as these avoid many elements acting at the transcriptional level and also patterns in coding sequences. The RefSeq annotations are more conservatively curated than the Ensembl ones, which contain many predicted transcripts. The UTRs may be readily obtained from Ensembl using the web-based BioMart interface. RefSeq transcripts along with their annotations can be obtained from NCBI or UCSC, but the annotations must be processed to produce UTR sequences. Downstream processing may be necessary to remove duplicate sequences and to remove redundant, very similar sequences (e.g., using CD-HIT-EST).
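The duplicate-removal step can be illustrated with a minimal Python sketch. It assumes the UTR sequences are already available as FASTA-formatted lines; the function names `parse_fasta` and `drop_exact_duplicates` are hypothetical and introduced only for this illustration. It removes exact duplicates only; clustering of very similar sequences would still require a dedicated tool such as CD-HIT-EST.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence.

    Comparison ignores case and treats T and U as equivalent (an assumption
    made here so that DNA- and RNA-style records can be mixed).
    """
    seen = {}
    for header, seq in records:
        key = seq.upper().replace("T", "U")
        if key not in seen:
            seen[key] = (header, seq)
    return list(seen.values())


# Toy input: records "a" and "b" carry the same sequence.
fasta_lines = [">a", "ACGT", "acgu", ">b", "ACGTACGU", ">c", "AAAA"]
unique_records = drop_exact_duplicates(parse_fasta(fasta_lines))
```

Keeping the first record per sequence is an arbitrary choice; in practice one might prefer the record with the best-curated annotation.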

Multiple sequence alignments for several species can be obtained from the UCSC genome browser. The output is supplied in multiple alignment format (MAF). The alignments can be filtered to include only those sequences for which there is reasonable conservation. This process is simplified by using the online tool Galaxy [10.24].

Evidence for Common Regulation and Tissue-Specific Expression

mRNA Expression

To discover regulatory elements in a gene of interest it is useful to look for common sequences and structures in similarly regulated genes. Likewise, where a regulatory element is known for a particular gene, it is sensible to search for similar elements in coregulated genes. The Gene Expression Omnibus (GEO) [10.52] contains mRNA expression data for multiple species and tissues under many experimental conditions. GEO also includes data from RNA immunoprecipitation chip (RIP-chip) experiments, e.g., with Staufen1 (GSE8438) or HuR (GSE29778), and more recently RIP-RNA-Seq data (e.g., Tdrd1, GSE29418).

The web interface provided by GEO allows the researcher to identify coregulated mRNAs via a link to profile neighbors. Expression data may also be downloaded and other statistical tools used to identify and quantify these neighbors, which is worth doing if a study of special interest is available. The profile neighbors provided by GEO are simply the 200 most closely expressed genes.
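When expression data are downloaded for local analysis, neighbors can be ranked with a simple similarity measure. The sketch below uses the Pearson correlation between expression profiles; this choice is an assumption made for illustration and is not necessarily the measure used by GEO. The gene names and values are invented toy data, and `profile_neighbors` is a hypothetical helper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def profile_neighbors(profiles, query, top_n=200):
    """Return up to top_n (gene, r) pairs most correlated with the query gene."""
    target = profiles[query]
    scored = [(pearson(target, values), gene)
              for gene, values in profiles.items() if gene != query]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(gene, r) for r, gene in scored[:top_n]]


# Toy expression matrix: one list of values per gene across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 1.0, 2.0, 1.0],
}
neighbors = profile_neighbors(profiles, "geneA", top_n=2)
```

With real data, a rank-based measure or an explicit significance threshold may be more robust than a fixed number of neighbors.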

There are also some relevant sets of publicly available data from published work that are not included in GEO, e.g., co-localized genes for fruit-fly embryos [10.56] or the data from yeast RIP-Chip [10.3], which are provided as websites or supplemental data.

RNA Binding Protein Expression

The expression of RNA binding proteins in specific tissues can also be obtained from GEO. On the GEO website the protein expression data may be readily accessed using the advanced search option and specifying “Protein” for the “Sample Type” field.

miRNA Expression

Coexpression of miRNA and target may be an indication that a functional interaction occurs. Many miRNAs, like mRNAs, are expressed in specific tissues. Studies and methods that consider coexpression in the same tissues have been useful in identifying biologically relevant pairs. Large studies have determined the expression of most small RNAs in many tissues, for example, over 250 small RNA libraries from 26 different organ systems [10.57]. Several databases provide access to these expression data [10.58]. Some databases such as ElMMo combine both miRNA and mRNA data.

Narrowing the Search Space to Biologically Relevant Regions

In the search for novel cis-regulatory elements that operate at the posttranscriptional level, sequence elements that are likely to be false positives must be avoided. These include repetitive elements and elements acting purely at the transcriptional level.

When de novo element discovery is being pursued these distracting elements may overwhelm pattern prediction algorithms and so must be masked in input sequence. When searching for known elements these distractions should be considered at a later stage when assessing the likelihood of a putative hit.

Coding sequence contains confounding patterns arising from protein constraints and common combinations of amino acids in the translated product. They may be masked from analysis in a straightforward fashion where sufficient gene annotation is available. If this is not the case but sufficient protein information is available, it is possible to use tblastn (protein–nucleotide 6-frame translation) to map and mask out likely coding regions. It must be noted that this approach will inevitably fail when the interesting cis-regulatory regions are actually in the coding region.

Distinguishing and Avoiding Repetitive Elements

Repetitive sequences arising from events such as viral retrotransposition are abundant in many genomes. For example, the Alu element forms 10% of the human genome; furthermore, it is concentrated in gene regions and overlaps mRNAs [10.59]. As a result, many human 3′ UTRs contain detectable Alu remnants of the 7SL RNA.

This and other repeat elements are catalogued in Repbase [10.25]. These sequences may be masked out at an early stage of analysis using the RepeatMasker [10.60] program. Alternatively, the UCSC genome browser [10.22] may be used when assessing putative elements – this has a RepeatMasker track and allows intersection of uploaded candidate cis-regulatory elements to report overlaps.
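The overlap check between candidate elements and annotated repeats amounts to an interval intersection. The following sketch assumes that both sets have already been exported as (name, start, end) tuples with half-open coordinates on the same sequence; the function name and the toy coordinates are hypothetical.

```python
def overlapping_repeats(candidates, repeats):
    """Map each candidate element to the names of the repeats it overlaps.

    Both inputs are lists of (name, start, end) tuples using half-open
    coordinates [start, end) on the same reference sequence.
    """
    hits = {}
    for cand_name, cand_start, cand_end in candidates:
        hits[cand_name] = [
            rep_name
            for rep_name, rep_start, rep_end in repeats
            if cand_start < rep_end and rep_start < cand_end
        ]
    return hits


# Toy coordinates within a single 3' UTR.
candidates = [("motif1", 100, 120), ("motif2", 400, 430)]
repeats = [("AluSx", 90, 380), ("L2", 430, 500)]
overlaps = overlapping_repeats(candidates, repeats)
```

This quadratic scan is adequate for a few hundred candidates; genome-scale comparisons are better served by the browser intersection described above or by an interval-tree implementation.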

It is possible that a novel common repeat element may be discovered as a putative regulatory element in a set of coregulated mRNAs. The choice of an appropriate negative control set of real RNAs (rather than simulated sequences) from the same species is important to avoid this.

Distinguishing and Avoiding Elements Acting at the Transcriptional Level and Other RNA Features

Elements acting in the DNA such as transcription factor binding sites (TFBSs) and enhancers may be misidentified as elements acting posttranscriptionally. If the genome of interest is well annotated or sufficient transcript sequence data are available, a genomic search can be avoided. This will go some way to avoiding transcriptionally acting elements.

The methods discussed in this chapter may be usefully restricted to untranslated regions (UTRs). This will avoid many false positives from genomic sequences and patterns associated with coding sequence.

TFBSs in Transcription Promoters that May Overlap 5′ Regions of the mRNA

TFBS databases such as TRANSFAC [10.26] and JASPAR [10.27] contain both experimentally defined sites and computational predictions. As these databases contain many false positives, it is inadvisable to mask or remove predicted TFBSs at an early stage of analysis.

The difficulties with predicted TFBSs have been reviewed [10.61]. One approach to improving TFBS prediction is to use conservation information, although this is controversial as binding sites may not be conserved in multispecies alignments [10.62].

The web-based tool STAMP [10.28] may be employed to determine whether a set of aligned predicted cis-regulatory elements coincides with any of those TFBSs in the public databases.

Enhancers that May Overlap 3′ Regions

ChIP-seq studies can be used to identify binding sites for known enhancers [10.63]. The cited review points out that there is conflicting evidence regarding the conservation of enhancer regions and TFBSs; although many sites are conserved, there are a large number of nonconserved sites. Bearing these limitations in mind, the available data should still be considered. For example, a recent study first identified highly conserved noncoding sequences and then tested associated genes for tissue-specific expression during mouse embryogenesis [10.64]. Although the identified enhancer sites are clearly a subset of all such elements and very few occur within gene transcripts, putative cis-regulatory elements acting in mRNAs should be cross-checked against known enhancer regions.

Known Regulatory Elements

When a cis-regulatory element is characterized by a primary sequence, pairwise alignment methods, or methods based on multiple alignments such as hidden Markov models (HMMs), may be used to search for it in other RNAs. Pairwise alignment methods are best established and have many fast, readily available implementations such as the basic local alignment search tool (BLAST) [10.65]. On the other hand, HMMs, which probabilistically model state transitions and thereby account for gaps in a nonarbitrary manner, can have increased sensitivity, albeit at a computational cost [10.66].

An example where bioinformatic primary sequence analysis has been successful is the study of the localized Vg1 mRNA. This mRNA was shown to be localized to the vegetal cortex of Xenopus laevis oocytes by parts of a 340-base region in the 3′ UTR. Subsequently, this region was shown to be bound by proteins that contain multiple KH domains. Schnappʼs group first used bioinformatics to identify four repeated sequence elements, but experimental deletion showed that only one of these elements, UUCAC (E2), was critical for function [10.67].

Mowryʼs group also studied the localization of the mRNA for Vg1 and took a different approach [10.68]. They systematically deleted sequences over the entire 340-nt region. Interestingly, the result they found was quite different – a different sequence, UUUCUA (VM1), was identified as critical, and this was supported using site-specific mutagenesis [10.68]. Although this sequence corresponded to one of the three (E1) elements identified in the Schnapp laboratory, their results showed those deletion constructs had reduced but not abolished localization.

The UUCAC (E2) element was also found to be required for localization of the mRNA for vegT [10.69]. Interestingly, both the vegT and the Vg1 mRNAs had multiple (five) copies of this element. Subsequently a shorter, more generalized motif, CAC, was postulated, repeats of which are present in the majority of RNAs localized to the vegetal cortex of Xenopus laevis oocytes [10.70].

This demonstrates the utility of bioinformatic analysis, although finding repeated clusters of short sequences required the development of a utility specialized for this task. For such a short motif (3–6 bases), multiple copies may be found by chance in any mRNA, and statistical tools have been developed to analyze this [10.71]. Functionally, multiple small dispersed E2 and VM1 elements provide, in combination, a specific binding site for the RNA-binding proteins in this case.
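The chance occurrence of short motifs can be made concrete with a back-of-the-envelope calculation. The sketch below counts overlapping motif occurrences and compares them with the number expected under a deliberately crude null model of independent, equiprobable bases; that null model is an assumption for illustration and is not the statistical approach of the cited tools. The function names are hypothetical.

```python
import re


def count_motif(seq, motif):
    """Count occurrences of motif in seq, including overlapping ones."""
    return len(re.findall(f"(?={re.escape(motif)})", seq))


def expected_count(seq_len, motif_len, p_base=0.25):
    """Expected occurrences of one fixed motif in a random sequence.

    Assumes independent positions with each base at probability p_base.
    """
    positions = max(seq_len - motif_len + 1, 0)
    return positions * p_base ** motif_len


# In a 340-nt region, a 3-mer such as CAC is expected about five times by chance,
# whereas a 5-mer such as UUCAC is expected less than once.
expected_cac = expected_count(340, 3)      # 5.28125
expected_uucac = expected_count(340, 5)    # 0.328125
observed = count_motif("UUCACAAUUCACUUCAC", "UUCAC")
```

A realistic background would use the base or dinucleotide composition of real UTRs from the same species rather than equal base frequencies.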

Notably many of the experimentally determined elements collected in the RBPDb are short (4–8 bases) and alone would not provide specificity. Computational tools that consider combinations of weakly informative sites have been used in other systems, e.g., for TFBS and miRNA targets. However, the functional relationships between multiple instances of the same or different sites are not usually known. For some sites, notably the important classes of AU-rich and CU-rich elements, programs to detect these sites operate by weighting multiple nearby repeats.

RNA Binding Protein Target Sites

Known protein binding sites are available from several public databases. The Rfam [10.15] database contains a growing group of covariance models for many cis-regulatory elements including RNA binding protein sites. Transterm [10.30] also contains patterns and descriptions of many known cis-regulatory elements – particularly protein binding sites. Nonredundant sets of sequences from NCBI or other uploaded sequences of interest may be searched for matches against these patterns. UTRsite [10.31] is a similar database of patterns. The RNA-Binding Protein DataBase [10.32] (RBPDb) catalogs proteins and their binding sites curated from the literature. Unstructured sequence motifs of binding sites may be downloaded and filtered by experiment type, species, and/or binding protein affinity.

These databases (Rfam, Transterm, UTRsite, and RBPDb) are useful in identifying known elements and may be used to find interesting candidates for testing. The reasoning behind this approach is to direct the identification of putative elements by their identity or similarity to known elements that have been experimentally demonstrated in other mRNAs. Another important usage of these databases is as a source of benchmarking datasets for any algorithms or pipelines designed to predict cis-regulatory elements de novo.

A note of caution is warranted. Database entries depend on manual curation based on literature review to remain current. For example, we have recently updated the Rfam model for the IRE [10.12]; the previous model was out of date and could not be used to identify many IREs that had been more recently experimentally demonstrated. It must also be considered that the models in Rfam, Transterm, and UTRsite inevitably vary in their sensitivity and specificity when identifying elements in target sequences.

Most of the entries in RBPDb are characterized by single sequences alone, and so if these data are to be used, the user must currently confine their search to be based on primary sequence only or build their own models to include predicted/demonstrated secondary structure. RBPDb does, however, contain some position weight matrices (PWMs) from systematic evolution of ligands by exponential enrichment (SELEX) experiments – the web interface may be employed to search mRNA sequences using these.
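For readers who prefer to scan sequences locally rather than through the web interface, a PWM search can be sketched as follows. The matrix shown is an invented four-column example, not a matrix from RBPDb, and the log-odds scoring against a uniform background is an assumed, commonly used scheme. The PWM is expected to contain no zero probabilities (i.e., pseudocounts have already been applied).

```python
from math import log2


def score_window(window, pwm, background=0.25):
    """Sum of per-position log-odds scores for one window."""
    return sum(log2(column[base] / background)
               for base, column in zip(window, pwm))


def scan(seq, pwm, threshold):
    """Return (position, window, score) for windows scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGU" for base in window):
            continue  # skip windows containing ambiguous symbols
        score = score_window(window, pwm)
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


def column(preferred):
    """Hypothetical PWM column strongly favoring one base."""
    return {base: (0.85 if base == preferred else 0.05) for base in "ACGU"}


example_pwm = [column(base) for base in "UGUA"]  # invented motif, for illustration only
hits = scan("AAUGUAAAUGUC", example_pwm, threshold=5.0)
```

The threshold here is arbitrary; in practice it would be calibrated against sequences with known binding sites and a suitable negative set.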

For some types of RNA–protein interactions it may be possible to predict either the binding from the protein structure, or the reverse. For a few classes of proteins this has been possible (e.g., PUF domain-containing proteins), but research in this area is beyond the scope of this chapter [10.72,73].

miRNA Target Sites

miRNAs have short ungapped seed sequences complementary to their target sites and act to downregulate expression. These targets are characterized by primary sequence, with secondary structure normally impairing binding. Higher eukaryotic genomes may contain many hundreds of miRNAs (1,424 human miRNAs in miRBase release 17, 4/2011), and each one of those tested affects the expression of several hundred mRNAs. Some of the changes in expression are undoubtedly the result of indirect regulation. For example, transcription factors are regulated by miR-34a, so changes in miR-34a expression have effects beyond the immediate targets [10.74].

There are numerous predicted miRNA binding sites, mainly in 3′ UTRs. It has been estimated that over 30% of human genes contain target sites. However, relatively few target sites have been validated experimentally. miRTarBase [10.75] listed the greatest number (2,819) of verified targets for the 269 human miRNAs that had been tested (4/2011). Databases of less reliable high-throughput data map over 150,000 targets to genomes [10.75].

Prediction of target sites for miRNAs is more straightforward than for proteins. However, there are many methods available which will detect different subclasses of sites with different accuracies; these have been recently reviewed [10.76], and ensemble methods that combine several different tools are available [10.77,78]. Several examples of predictive tools are outlined below.

TargetScan [10.33] predicts miRNA target sites. Base pairing between the seed sequence at the 5′ end of the miRNA and the target mRNA, and evolutionary conservation of the sites, are the primary considerations. The software has been developed to additionally account for conservation at the seed region, minimum free energy of the hybridization including additional 3′ binding, flanking AU-rich sequence, proximity to additional miRNA target sites, and the position of the target site within the UTR [10.79,80]. The UCSC genome browser is a straightforward way of accessing the TargetScan predictions. These have been calculated using conservation information based on multiple species alignment. The inclusion of this information reduces false positives and is probably a good idea for genome-wide analysis. If the analysis is restricted to a few genes, it may be worthwhile to consider nonconserved miRNA target sites by running the TargetScan program on unaligned sequence.
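The core idea of seed matching can be sketched in a few lines. The example below looks only for perfect matches to the reverse complement of miRNA positions 2–8, which is an assumed, simplified definition of a seed site; it is not the TargetScan implementation and ignores conservation, hybridization energy, and sequence context. The miRNA and UTR sequences are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna):
    """mRNA sequence (5'->3') complementary to miRNA positions 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_sites(utr, mirna):
    """Return 0-based start positions of perfect seed matches in the UTR."""
    utr = utr.upper().replace("T", "U")
    site = seed_match_site(mirna)
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)
    return positions


# Hypothetical 22-nt miRNA and toy UTR (neither is a real annotated sequence).
example_mirna = "UACGGAUCCGAUUGCAAGCUAG"
example_utr = "AAAGAUCCGUAAACCCGAUCCGUGG"
sites = find_seed_sites(example_utr, example_mirna)
```

Because a 7-nt match is expected frequently by chance, hits from a scan like this are only a starting point for the filtering steps described above.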

An alternative method is PicTar [10.34]. It allows the identification of target sites that have reduced conservation but are within mRNAs coexpressed with the targeting miRNA. The authors report that 30–50% of such sites are functional [10.81].

Another method, miRanda/mirSVR [10.82], has the distinction of allowing noncanonical G–U base pairs in the seed sequence; miRanda is also available on the web and has recently been extended to include a support vector regression algorithm – mirSVR [10.83]. This incorporates the relevant biological data, including expression data, into a scoring system and avoids a strict filter based on conservation.

An alternative method offering greater sensitivity to nonconserved target sites is ElMMo [10.36]. The algorithm uses Bayesian methods to assign priors calculated from the phylogenetic distribution of target sites for each miRNA. This allows miRNA-dependent adjustment of posterior probabilities for target sites with similar conservation patterns. The benefit of this is that conservation information is automatically tuned to each specific miRNA. The disadvantage is that it will be hard to match a nonconserved target site of an miRNA that has many widely conserved target sites. ElMMo is available via a web server and includes convenient filtering of mRNA targets to search using expression levels from numerous datasets, also providing Gene Ontology enrichment information on the identified targets.

De Novo Element Discovery

The discovery of new regulatory elements is a key goal for improved understanding of gene regulation. Recurring patterns in sequence and predicted structure may be detected and assessed for statistical significance. Some approaches for dealing with false positives caused by distracting sequence patterns have already been discussed. In de novo detection these steps are of particular importance – there is no model of even low specificity with which to begin.

Many tools are available for detecting patterns in primary sequence, largely because they were developed for finding DNA regulatory regions and were subsequently applied to RNA.

Primary sequence tools remain widely used despite the demonstrated importance of structure in many regulatory elements. The reasons are that secondary structure prediction is difficult and that primary sequence patterns are still characteristic of structured elements.

Primary Sequence Analysis for Elements Lacking Significant Secondary Structure

A number of enumerative methods are available for de novo detection of primary sequence patterns. The patterns being sought are generally far shorter than the genomic and transcript sequences in which they are contained. Real primary sequence elements may be degenerate, gapped, redundant, and repetitive. Several differently patterned elements may be responsible for similar regulatory outcomes in different transcripts. All these factors contribute to the difficulty of detection.

Overrepresented patterns in a set of unaligned input sequences may be identified by multiple expectation maximization for motif elicitation (MEME) [10.37]. A background model may be provided for this analysis. The output motifs consist of position weight matrices (PWMs) showing the probability of a particular nucleotide at each position within the motif. MEME will not consider gaps in a motif. GLAM2 [10.84] allows gaps in the matched primary sequence, but it does not include these gaps in the output motifs.

An alternative approach offered by Weeder [10.38] involves building suffix trees from a set of input sequences. These are used to find all patterns of a set length, occurring in at least a certain number of sequences, with an upper limit on the number of mismatches (mutations). The program may be run in an automated way multiple times to detect patterns of different lengths.
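The search criterion described here (fixed length, minimum number of supporting sequences, bounded mismatches) can be illustrated by exhaustive enumeration. The sketch below is a brute-force stand-in that tests every possible k-mer; it does not use suffix trees and is not the Weeder algorithm, so it is only practical for short patterns. Function names and sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(pattern, seq, max_mismatch):
    """True if pattern matches some window of seq with at most max_mismatch mismatches."""
    k = len(pattern)
    return any(hamming(pattern, seq[i:i + k]) <= max_mismatch
               for i in range(len(seq) - k + 1))


def enumerate_patterns(seqs, k, min_seqs, max_mismatch):
    """Return {pattern: support} for k-mers found in at least min_seqs sequences."""
    results = {}
    for letters in product("ACGU", repeat=k):
        pattern = "".join(letters)
        support = sum(occurs(pattern, seq, max_mismatch) for seq in seqs)
        if support >= min_seqs:
            results[pattern] = support
    return results


# Toy input: the third sequence carries a one-mismatch variant (UUGAC) of UUCAC.
toy_seqs = ["AAUUCACGG", "CCUUCACAA", "GGUUGACUU"]
with_mismatch = enumerate_patterns(toy_seqs, k=5, min_seqs=3, max_mismatch=1)
exact_only = enumerate_patterns(toy_seqs, k=5, min_seqs=3, max_mismatch=0)
```

The number of candidate patterns grows as 4^k, which is one reason index structures such as suffix trees are used in practice.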

The TEIRESIAS [10.39] algorithm is not restricted to searching for patterns of specific length and can detect gapped patterns. This is computationally more intensive, in both memory and processing requirements. The large number of results requires further processing for statistical significance.

The application of these methods is shown in overview in Fig. 10.4. Further methods are reviewed and benchmarked elsewhere [10.85].

Fig. 10.4

Discovery of motifs starting from input sequences or alignments that are similarly regulated. The general goal of each component is shown, with some specific examples of currently available software named

Secondary Structure Prediction for Structured Element Discovery

Relatively few confirmed secondary or tertiary structures for cis-regulatory elements are available. Therefore, predictions of RNA structures are made computationally. High-throughput methods that may allow more structures to be determined experimentally are becoming available [10.86]. However, these methods are limited, and a combination of bioinformatic and high-throughput experiment has been successful [10.18].

In addition, it may be possible to predict the three-dimensional structures of RNAs using bioinformatics, which will become increasingly feasible as the number of known structures increases [10.87]. Packages are available to assist in tertiary structural prediction [10.88,89]. These may be used with sequences of interest alone or in combination with available experimental data on similar structures.

Single Sequences

Predicting folding on individual sequences is a common technique. This may be done globally (for the entire mRNA) or locally on windows within a biologically relevant section (e.g., 80-base windows in 3′ UTRs). A global RNA fold prediction algorithm from the Zuker laboratory is implemented by the mfold [10.43] program. These methods are commonly used, and the paper associated with this program has over 700 citations in the literature. The Vienna RNA package provides a similar program, RNAfold [10.44]. Like mfold, this calculates predicted secondary structure for RNAs based on minimum free energies (MFE) using conformations derived from published values for stacking and destabilizing energies.
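The dynamic programming idea behind single-sequence folding, applied to local windows, can be illustrated with a much simpler objective. The sketch below maximizes the number of nested base pairs (a Nussinov-style recursion) rather than minimizing free energy, so it is not the algorithm implemented in mfold or RNAfold and its output should not be read as an MFE prediction. The minimum hairpin loop size of three and the window parameters are assumptions.

```python
# Watson-Crick and G-U wobble pairs are allowed.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs with at least min_loop unpaired
    bases enclosed by every hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


def fold_windows(seq, size=80, step=40):
    """Apply max_base_pairs to sliding windows; returns (start, pairs) tuples."""
    last_start = max(len(seq) - size, 0)
    return [(start, max_base_pairs(seq[start:start + size]))
            for start in range(0, last_start + 1, step)]


hairpin_pairs = max_base_pairs("GGGAAACCC")
window_scores = fold_windows("GGGAAACCC" * 2, size=9, step=9)
```

Energy-based programs replace the pair count with stacking and loop energies, which is why their predictions can differ substantially from a pair-maximizing fold.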

The UNAFold [10.90] software is a development of mfold and further predicts hybridization and melting profiles. Like RNAfold, it produces dotplots showing pairing probabilities over the sequence (Fig. 10.5). The dotplots from both programs include the pairing probabilities corresponding to suboptimal (predicted) folding of the input sequence. Suboptimal structures are discussed later.

Fig. 10.5

Predicted secondary structures for the IRE in human CDC14A. (a) Optimal MFE structure from RNAfold (−2.9 kcal/mol). The dotplot (b) represents the ensemble of structures. Some suboptimal structures contain predicted pairs in a lower stem, their presence being more consistent with other IRE structures. The C–G base pair in the loop observed in the experimental structure (c) is not predicted by MFE methods

Local folding methods are likely to be more appropriate for small cis-regulatory elements. Several algorithms which are particularly suitable have been developed. These perform better on known cis-regulatory elements [10.91]. This is an area of active development and testing; current methods include RNAplfold and Rfold [10.92,93]. In addition, methods that combine structure prediction with comparison with known secondary structures of homologous sequences are being developed [10.94].

RNA structure is dynamic. Binding interactions with a structured cis-regulatory element are likely to further influence the structure of that element. A limited set of thermodynamic measurements provides the basis of the RNA structure predicting algorithms. Furthermore, these methods do not account for pseudoknots, G-quartets, and other structures. The predicted structures cannot be expected to entirely model the complex molecular interactions found in the in vivo environment. Approaches using homology to known secondary and tertiary structures that are not necessarily the MFE structures should assist in this [10.87,94,95].

Some elements may contain pseudoknots, but the prediction of pseudoknots remains computationally slow and is only tractable for shorter RNA sequences. The pknots [10.96] program and HotKnots [10.97] are methods that may be employed on targeted regions if such structures are expected [10.98]. There are cis-regulatory elements containing pseudoknots, for example, frameshift elements. Further examples may be found in general and specific databases [10.15,99].

Predicted structures do not always match natively observed structures. An example is the iron responsive element (IRE) found in the human mRNA for CDC14A. Both the UNAFold and RNAfold programs find the same MFE structure. Additional base pairs are seen in the dot plot, some of which would form the lower stem observed in all IREs tested. Some pairs are not predicted at all by MFE approaches, e.g., the C–G pair in the apical loop (Fig. 10.5).

When there is existing information about the structure of a particular cis-regulatory element, this information may be provided to the folding programs as constraints. This allows the free energy of a candidate element in the expected conformation to be estimated, even when that conformation is not the lowest-energy structure resulting from unconstrained prediction. Other sources, such as gene expression and phylogenetic information, may then be used in combination to arrive at strong candidates for experimental testing.

Benchmarking on ncRNA genes has shown both the sensitivity and the specificity of RNA structure prediction using MFE methods to be limited (22–63% and 20–60%, respectively) [10.100]. Newer algorithms improve on this [10.101]. Although secondary and tertiary structure is a factor in RNA interaction, the difficulty of experimentally determining these structures and of accurately predicting them must always be borne in mind.
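As an illustration of how such benchmark figures can be obtained, the following minimal sketch compares a predicted structure with a reference structure, both given in dot-bracket notation. It assumes that specificity is taken as the positive predictive value (the fraction of predicted pairs that are correct), which is a common convention in this field but is not confirmed here for the cited benchmark. The function names are hypothetical.

```python
def pairs_from_dot_bracket(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string.

    Assumes a well-formed, pseudoknot-free string using only '(', ')' and '.'.
    """
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def sensitivity_and_ppv(reference, predicted):
    """Compare predicted base pairs with reference base pairs.

    sensitivity = correctly predicted pairs / pairs in the reference
    ppv         = correctly predicted pairs / pairs in the prediction
    """
    ref = pairs_from_dot_bracket(reference)
    pred = pairs_from_dot_bracket(predicted)
    true_positives = len(ref & pred)
    sensitivity = true_positives / len(ref) if ref else 0.0
    ppv = true_positives / len(pred) if pred else 0.0
    return sensitivity, ppv


# Toy example: the prediction misses the innermost pair of the reference stem.
sens, ppv = sensitivity_and_ppv("((((...))))", "(((.....)))")
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

In this toy case every predicted pair is correct, but one reference pair is missed, so the positive predictive value is higher than the sensitivity.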

Standard MFE programs do not allow noncanonical base pairs, e.g., U–U or A–G, which have been observed in many experimentally determined RNA structures. Several algorithms do allow predictions that include these noncanonical pairs [10.89,101]. Though these are considerably slower, they may be used with short structured cis-regulatory elements.

Multiple Sequences

Calculating the consensus structure for aligned sequences can overcome some of the shortcomings in the accuracy of MFE calculations for single sequences. This approach depends on the ability to obtain a reasonable alignment of the primary sequences. Methods available include RNAalifold [10.45] – part of the Vienna RNA package.

Covariation (e.g., an A–U base pair being exchanged for some other pairing) and compatible mutations (e.g., a G–C base pair being exchanged for a G–U pair) help tools such as RNAalifold to provide an optimal structure consistent with the alignment. However, too much variation in the primary sequence will make alignment at this level impossible.
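The distinction between these two kinds of evidence can be sketched for a single pair of alignment columns. The example below is a simplified illustration, not the scoring scheme of RNAalifold: it assumes uppercase RNA sequences of equal length, treats gaps and any noncanonical combination as incompatible, and uses a hypothetical function name.

```python
from itertools import combinations

# Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def classify_columns(alignment, i, j):
    """Classify the support that alignment columns i and j give to a base pair."""
    observed = [seq[i] + seq[j] for seq in alignment]
    if any(pair not in CANONICAL for pair in observed):
        return "incompatible"
    pair_types = set(observed)
    if len(pair_types) == 1:
        return "conserved"
    # Covariation: both positions differ between two observed pair types.
    # Compatible mutation: only one position differs (e.g., G-C versus G-U).
    both_changed = any(a[0] != b[0] and a[1] != b[1]
                       for a, b in combinations(pair_types, 2))
    return "covariation" if both_changed else "compatible mutation"


alignment = ["GAAAC", "AAAAU", "GAAAU"]
print(classify_columns(alignment, 0, 4))
```

Columns showing covariation provide the strongest support for a pair, since two changes have occurred while the ability to pair has been maintained.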

A further class of algorithm simultaneously folds and aligns input sequences. This approach is computationally more intensive but may be useful where the primary sequences have limited similarity [10.41]. Dynalign, Multilign, and Turbofold are all part of the RNAstructure package [10.14]. The original Dynalign works with two sequences, whereas Multilign operates on multiple sequences. Turbofold does not produce an alignment but presents a separate structure for each of the input sequences, rather than one consensus structure. Structures are predicted based on the pairing probabilities in each individual sequence, combined with the pairing probabilities in a consensus model [10.102].

Foldalign and FoldalignM [10.103] will produce local and global alignments along with structure predictions. An interesting feature of Foldalign is that it will attempt local alignments over the input sequences (based on structure and primary sequence) and then report the best alignments found. It is therefore also an element detection tool and not limited to structural prediction. Additional discussion and comparison of RNA fold prediction algorithms can be found in recent reviews [10.14,87,104].

Comparing Secondary Structures

It is useful to compare two structures to assess their similarity. In some cases the best information about a regulatory element may be its secondary structure, and so this is key in finding similar elements. RNAdistance, RNAforester, and Cofolga2mo [10.105] allow these comparisons for simple secondary structures. A simplification of the problem is to compare overall shapes, e.g., stem loop or cloverleaf [10.106].

The Vienna RNA package provides RNAdistance [10.44], which allows not only comparison of pairs of structures but also simultaneous comparison of multiple structures, providing a comparison matrix for all input structures. The output quantifies the differences between the structures. A pipeline on the Vienna RNA servers, Structure Conservation Analysis, includes this method. Input is an alignment and RNAalifold is used to predict MFE structures that are compared with predicted structures of individual sequences.
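The idea of an all-against-all comparison matrix can be illustrated with a much simpler measure than those a dedicated tool provides. The sketch below uses the base-pair distance (the number of pairs present in one structure but not in the other) in place of tree-based or alignment-based distances. It assumes structures of equal length in well-formed dot-bracket notation, and the function names are hypothetical.

```python
def base_pairs(structure):
    """Return the set of (i, j) pairs in a pseudoknot-free dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def base_pair_distance(a, b):
    """Number of base pairs found in exactly one of the two structures."""
    return len(base_pairs(a) ^ base_pairs(b))


def distance_matrix(structures):
    """Symmetric matrix of pairwise base-pair distances."""
    return [[base_pair_distance(a, b) for b in structures] for a in structures]


structures = ["((...))", "(.....)", "......."]
for row in distance_matrix(structures):
    print(row)
```

A distance of zero indicates identical pairing; larger values indicate that more pairs would need to be opened or closed to convert one structure into the other.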

The RNAforester [10.46] algorithm builds tree-like data structures that represent RNA secondary structure. These can then be used to build multiple alignments of different RNA structures. Thresholds may be applied determining whether a particular structure is sufficiently similar to form part of an aligned group. This allows the degree of similarity between structures to be assessed as well as the grouping of RNAs into structurally determined families. Another useful output of this tool is an alignment of input sequences that is wholly determined by the given structures. This can be useful to build a seed for a covariance model.

Searching for Secondary or Tertiary Structural Elements

When there is good evidence for structure but the specificity of the primary sequence in the regulatory element is largely or completely unknown, a search based on structure alone is required. RNAMotif [10.48] allows the creation of a pattern that has little or no information about primary sequence. The resulting matches may be used to find additional sequences for testing; alignments of these will hopefully allow the incorporation of primary sequence information into the model for the regulatory element in question. Three-dimensional (3-D) motifs (for example, G-bulges from a lysine riboswitch) can be searched for with RMdetect [10.95].
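A structure-only search of this kind can be sketched as a scan for simple stem-loops defined purely by stem length and loop size, with no constraint on the nucleotides themselves beyond their ability to pair. This is a minimal illustration and does not use the descriptor language of RNAMotif; the function name and default parameters are assumptions, and bulges, mismatches, and energy calculations are ignored.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_hairpins(seq, stem=4, loop_min=3, loop_max=8):
    """Return (start, end, loop_size) for every perfect stem-loop in seq.

    start is inclusive and end is exclusive; the stem has exactly `stem`
    consecutive pairs enclosing a loop of loop_min..loop_max nucleotides.
    """
    hits = []
    n = len(seq)
    for start in range(n):
        for loop in range(loop_min, loop_max + 1):
            end = start + 2 * stem + loop
            if end > n:
                break
            # Check that the k-th base from each end of the window can pair.
            if all((seq[start + k], seq[end - 1 - k]) in PAIRS
                   for k in range(stem)):
                hits.append((start, end, loop))
    return hits


print(find_hairpins("GGGGAAAACCCC"))
```

Matches found in this way carry no primary sequence information; aligning them is the step at which conserved positions may become apparent.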

Combining Primary and Secondary Structural Search Methods

A covariance model (CM) is a stochastic context-free grammar (SCFG) that can be used to model the consensus sequence and structure of RNAs. Given an alignment to a consensus secondary structure, not only nucleotide residues at single-stranded positions but also base pairings, insertions, and deletions are scored. The Infernal [10.50] software package provides the tools to build covariance models and to search for matches to the model over a target sequence. The resulting bit score is a log-odds score: the logarithm of the ratio of the probability of the target matching the model to the probability of the target matching random sequence. This methodology is key to the Rfam database, which catalogs RNA families using these models, showing their paralogs and homologs [10.15].
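The log-odds principle can be illustrated with a deliberately reduced toy model. The sketch below scores an ungapped sequence against position-specific emission probabilities for single-stranded positions and for base-paired positions, relative to a uniform background. It is not the Infernal implementation: insertions, deletions, and the dynamic programming over the grammar are omitted, and all names, probabilities, and the floor value for unseen residues are assumptions made for illustration.

```python
import math


def log_odds_bits(seq, single_probs, pair_probs, pairs,
                  background=0.25, floor=1e-6):
    """Bit score of an ungapped sequence under a toy consensus model.

    single_probs: {position: {base: probability}} for unpaired positions
    pair_probs:   {(i, j): {two-letter pair: probability}} for paired positions
    pairs:        list of (i, j) positions paired in the consensus structure
    floor:        arbitrary small probability for residues absent from the model
    """
    paired_positions = {pos for pair in pairs for pos in pair}
    score = 0.0
    # Paired positions are scored jointly, which is how covariation is captured.
    for i, j in pairs:
        p_model = pair_probs[(i, j)].get(seq[i] + seq[j], floor)
        score += math.log2(p_model / background ** 2)
    # Remaining positions are scored independently.
    for pos, base in enumerate(seq):
        if pos not in paired_positions:
            p_model = single_probs[pos].get(base, floor)
            score += math.log2(p_model / background)
    return score


pairs = [(0, 2)]
pair_probs = {(0, 2): {"GC": 0.5, "AU": 0.5}}
single_probs = {1: {"A": 1.0}}
print(log_odds_bits("GAC", single_probs, pair_probs, pairs))
```

A positive score indicates that the sequence is better explained by the model than by the background; scoring the two paired positions jointly is what distinguishes this from a simple position weight matrix.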

When sufficient information exists about a cis-regulatory element such that known examples may be meaningfully aligned to a consensus structure, a covariance model may be constructed. This model may be used to search for matches within other mRNA sequences, as has been done in the case of the IRE [10.12] and other cis-regulatory elements in the Rfam database. This can result in new candidates for experimental testing.

An alternative to using covariance models is to build a pattern corresponding to a motif. This is the approach taken by Transterm and UTRsite. A useful tool for interpreting and searching using such patterns is scan for matches [10.51]. The pattern descriptions used by this tool can incorporate structural information. Such a pattern does not depend on being able to construct an alignment of known sequence elements. However, complex patterns can be difficult to construct, and the ability to make a good pattern depends on prior knowledge of a motif, including permissible variations at different points. The output from scan for matches does not include the statistical information provided by the Infernal software.

Combinatorial Methods

Figures 10.4 and 10.6 give an overview of how the tools discussed here may be used together in the search for cis-regulatory elements. Some combinations of tools are sufficiently novel to be considered new methods in their own right. These methods are discussed in this section.

Fig. 10.6

Preparing and obtaining sequences and alignments. Solid lines indicate basic approach. Dashed lines indicate additional steps appropriate to genome-scale analysis or de novo element discovery. Dotted lines indicate additional processes dependent on the specific problem

Unpaired bases are more likely to be directly involved in interactions with other RNAs and with certain proteins. Consequently, primary sequence patterns found in unpaired regions are of particular interest. This can be seen, for example, in the IRE, where the unpaired nucleotides have been shown to interact with the iron regulatory proteins. MEMERIS [10.53] is an extension of the MEME algorithm. A script that comes with the package first uses the Vienna RNA tools to predict secondary structure across the input sequences (which must be in FASTA format with each sequence on a single line). Altering the prior probabilities for putative motif start sites directs the MEME algorithm towards predicted single-stranded positions. MEMERIS inherits from MEME the weakness of not being able to detect gapped patterns.
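The effect of such a prior can be sketched as follows. The example assumes that a per-nucleotide probability of being unpaired has already been computed by a folding program, and it weights each candidate motif start by the mean unpaired probability over the motif window. This is a simplified stand-in for the measures of single-strandedness used by MEMERIS, and the function name and pseudocount are assumptions.

```python
def start_site_priors(unpaired_probs, width, pseudocount=0.01):
    """Prior over motif start positions that favours single-stranded windows.

    unpaired_probs: per-position probability of being unpaired (0..1)
    width:          motif width in nucleotides
    pseudocount:    keeps fully paired windows from receiving zero prior
    """
    n_starts = len(unpaired_probs) - width + 1
    if n_starts <= 0:
        return []
    weights = [sum(unpaired_probs[s:s + width]) / width + pseudocount
               for s in range(n_starts)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical profile: positions 2 and 3 lie in a loop, the rest in stems.
priors = start_site_priors([0.0, 0.0, 1.0, 1.0, 0.0], width=2)
print([round(p, 3) for p in priors])
```

With a uniform prior every start position would be equally likely; here the window covering the two loop positions receives the largest share.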

Given the importance of structure in cis-regulatory elements, another useful approach is to identify sequence regions likely to have conserved structure. RNAz [10.54] was originally developed to detect structured ncRNAs in genomic sequence, but this method is also of potential interest in cis-regulatory element discovery and can successfully detect several known elements. The software takes a sequence alignment as input and uses RNAalifold to calculate the MFE of consensus structures arising over a sliding window. MFE values for the same sequences are also calculated individually using RNAfold. The ratio of the consensus MFE to the average single-sequence MFE gives a structure conservation index (SCI). Additionally, a z-score is calculated, which represents the deviation of the MFE from that of random sequences of similar composition and length. The z-score and the SCI are used as inputs to a support vector machine, together with the number of aligned sequences and the pairwise identity of the sequences; this results in a probability value for the occurrence of a conserved structural motif. RNAz may be run from the command line or on the Vienna RNA servers. The ncRNA [10.23] site also has a version of the UCSC genome browser that includes a track for sites predicted by RNAz.
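The two descriptors can be written down in a few lines. The sketch below takes precomputed free energies as input (the values shown are hypothetical) and, for simplicity, assumes that the z-score is estimated from the MFEs of explicitly shuffled sequences; how a given tool estimates the random-sequence mean and standard deviation may differ. The function names are assumptions.

```python
from statistics import mean, stdev


def structure_conservation_index(consensus_mfe, single_mfes):
    """SCI: consensus MFE divided by the mean MFE of the individual sequences.

    Assumes the mean single-sequence MFE is non-zero.
    """
    return consensus_mfe / mean(single_mfes)


def mfe_z_score(native_mfe, shuffled_mfes):
    """Number of standard deviations by which the native MFE deviates from
    the mean MFE of shuffled sequences (needs at least two shuffled values)."""
    return (native_mfe - mean(shuffled_mfes)) / stdev(shuffled_mfes)


# Hypothetical energies in kcal/mol.
sci = structure_conservation_index(-10.0, [-10.0, -12.0, -8.0])
z = mfe_z_score(-12.0, [-6.0, -8.0, -10.0])
print(f"SCI={sci:.2f}, z={z:.2f}")
```

An SCI close to zero suggests that the sequences share no common structure, whereas values near one suggest that the consensus structure is about as stable as the individual folds; a negative z-score indicates a structure more stable than expected for random sequence.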

For multiple sequences with a common function, automated production of covariance models for elements with similar sequence and structure is provided by the CMFinder [10.49] program, which takes short unaligned sequences (<500 bp) as input. Based on MFE, the algorithm first selects a number of candidates from within these sequences. The candidates are aligned according to predicted secondary structure, and an expectation-maximization algorithm is used to refine a covariance model that identifies elements within the candidate sequences distinct from a background distribution. The input sequences are then rescanned using this covariance model, and the top hits are included as candidates. The authors of CMFinder note that identifying larger motifs is problematic and go some way to addressing this by attempting to merge smaller motifs as a final step.

The importance of incorporating biological data into the element discovery process has already been discussed. Selection of sequences for pattern discovery, together with appropriate background models, is an important first step in many analyses. The Finding Informative Regulatory Elements (FIRE) [10.55] pipeline automates this process by combining the detection of primary sequence patterns with other biological data, which may be either discrete or continuous (e.g., gene expression data). FIRE starts by looking for 7-mer seed motifs, but these can be extended by one base in either direction. The initial seeds are systematically modified using degenerate International Union of Pure and Applied Chemistry (IUPAC) codes to arrive at a motif most significantly associated with the other biological data. This pipeline also offers the convenience of displaying gene ontology terms with enriched association to the identified motifs.
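The association between a seed motif and discrete biological data can be sketched with an information-theoretic score of the kind FIRE relies on. The example below computes the mutual information, in bits, between the presence of a 7-mer in each sequence and a discrete label such as an expression cluster. It is a minimal illustration only: the sequences and labels are invented, the function names are hypothetical, and the significance testing and motif optimization steps of the pipeline are omitted.

```python
import math
from collections import Counter


def mutual_information(presence, labels):
    """Mutual information (bits) between two discrete variables."""
    n = len(presence)
    joint = Counter(zip(presence, labels))
    presence_counts = Counter(presence)
    label_counts = Counter(labels)
    mi = 0.0
    for (p, l), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log2(
            count * n / (presence_counts[p] * label_counts[l]))
    return mi


def kmer_association(sequences, labels, kmer):
    """Score how informative the presence of kmer is about the labels."""
    presence = [kmer in seq for seq in sequences]
    return mutual_information(presence, labels)


# Invented 3' UTR fragments and expression clusters.
sequences = ["GGUAUUUAUCC", "UAUUUAUGGCC", "GGGGCCCCGGGG", "CCCCGGGGCCCC"]
labels = ["up", "up", "down", "down"]
print(kmer_association(sequences, labels, "UAUUUAU"))
```

A score of zero means that knowing whether the motif is present says nothing about the label; the maximum is reached when presence and label coincide perfectly, as in this invented example.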

Conclusions and Future Prospects

The importance of posttranscriptional regulation is becoming increasingly apparent as large-scale proteomic data become available. Transcripts are translated with a wide range of efficiencies, giving differing numbers of functional proteins per message.

It has now become almost routine to measure transcript levels at a genome scale using microarrays or next-generation sequencing. The expectation was that transcript abundance would provide good estimates of protein abundance in the cell. However, early studies from over a decade ago, which suggested that mRNA levels might predict less than 45% of protein levels, have been corroborated by recent studies. This indication of widespread posttranscriptional control is seen in many organisms.

High-throughput wet-lab studies and analysis of regulatory elements will facilitate discovery of elements with widely conserved functions. However, it should also be noted that some key elements might only be found in a small number of messages or species, e.g., human-specific miRNA targets, or the targets of other noncoding RNAs. These exceptional elements are a challenge for both bioinformatic and wet-lab studies, but they may be of critical importance for cell growth and development and may have applications in biotechnology.

At least some of the variation in the amount of protein translated from individual mRNAs will be mediated by cis-regulatory elements in the mRNAs. This chapter has outlined current bioinformatic methods available for their discovery. Development of new methods and the use of high-throughput data on a genome-wide scale, particularly comparative genomic, proteomic, and high-throughput transcriptomic data, will facilitate the discovery of further elements.