Introduction

In most eukaryote lineages, introns are spliced out of protein-coding mRNAs by the spliceosome, a huge RNP complex consisting of about 200 proteins and five small noncoding RNAs (Nilsen 2003). These snRNAs exert crucial catalytic functions in the process (Valadkhan 2005, 2007; Valadkhan et al. 2007) in three distinct splicing machineries. The major spliceosome, containing the snRNAs U1, U2, U4, U5, and U6, is the dominant form in metazoans, plants, and fungi, and removes introns with GT-AG (as well as rarely AT-AC and GC-AG) boundaries. Another class of “noncanonical” introns with AT-AC (and rarely GT-AG (Sheth et al. 2006)) boundaries is excised by the minor spliceosome (Patel and Steitz 2003), which contains the snRNAs U11, U12, U4atac, U5, and U6atac. Like the major spliceosome, the minor spliceosome is present across most eukaryotic lineages and traces back to an origin very early in eukaryote evolution (Collins and Penny 2005; López et al. 2008; Lorkovíc et al. 2005; Russell et al. 2006). Recently it was found that the minor spliceosome can also act outside the nucleus and controls cell proliferation (König et al. 2007). Functional and structural differences between the two spliceosomes are reviewed in Will and Lührmann (2005). In the third type of splicing, SL-trans-splicing, a “mini-exon” derived from the noncoding spliced-leader RNA (SL) is attached to each protein-coding exon. The corresponding spliceosomal complex requires the snRNAs U2, U4, U5, and U6, as well as an SL RNA (Hastings 2005). Due to the high sequence variation of the short SL RNAs and the patchy phylogenetic distribution of SL-trans-splicing, the evolutionary origin (or origins) of this mechanism, which is active at least in chordates, nematodes, cnidarians, euglenozoa, and kinetoplastids, remains unclear.

Previous studies on the evolutionary origin of the spliceosomes have been performed based predominantly on homology of the most important spliceosomal proteins. Thus relatively little detail is known on the evolution of the snRNA sequences themselves beyond the homology of nine families of snRNAs across all eukaryotes studied so far (Collins and Penny 2005; Collins et al. 2004; Lorkovíc et al. 2005; Russell et al. 2006; Schneider et al. 2004; Shukla and Padgett 1999). This may come as a surprise since it has been known for more than a decade that at least all of the snRNAs of the major spliceosome appear in multiple copies and that these paralogues are differentially regulated in at least some species (see Bhathal et al. 1995; Lo and Mount 1990; Morales et al. 1997; Sontheimer and Steitz 1992; Stefanovic et al. 1991). Very recently, however, some of these variants have been studied in more detail (see Chen et al. 2005; Hinas et al. 2006; Kyriakopoulou et al. 2006; Pereira-Simon et al. 2004; Sierra-Montes et al. 2005; Smail et al. 2006 and references therein). The only systematic study that we are aware of is the recent comprehensive analysis of 11 insect genomes (Mount et al. 2007), which reported that phylogenetic gene trees of insect snRNAs do not provide clear support for discernible paralogue groups of U1 and/or U5 snRNAs that would correspond to the variants with tissue-specific expression patterns. Instead, the analysis supports a concerted mode of evolution and/or extreme purifying selection, a scenario previously described for snRNA evolution (Liao 1999; Liao and Weiner 1995; Nei and Rooney 2005).

In this contribution we extend the detailed analysis of the nine spliceosomal snRNAs to metazoan animals. In particular, in mammals, the analysis is complicated by a high copy number of snRNAs of the major spliceosome and an associated large number of pseudogenes (Denison et al. 1981). We focus here on four questions: (1) Is there evidence for discernible paralogue groups of snRNAs in some clades? A dominating mode of concerted evolution does not necessarily prevent this, as demonstrated by the existence of two highly diverged copies of both LSU and SSU rRNA in Chaetognatha (Papillon et al. 2006; Telford and Holland 1997), which is probably associated with a duplication of the entire rDNA cluster. (2) Are there clades with deviant snRNA structures? The prime example for a highly divergent snRNA is the U11 in a subset of the insects (Schneider et al. 2004). (3) Are there interpretable trends in the copy number of snRNAs across metazoa? (4) How mobile are snRNA genes relative to the “background” of protein coding genes? In other words, to what extent are some or all of the snRNA genes offspring of a locus that remains stably linked to its context over large timescales?

Materials and Methods

Sequence Data

Known snRNA sequences were retrieved from GenBank (Benson et al. 2007) and Rfam (Griffiths-Jones et al. 2005) and, in some cases, extracted directly from the literature. Genomic DNA sequences were downloaded from the Web sites of Ensembl, the Joint Genome Institute, the Sanger Institute, WormBase, the Genome Sequencing Center, UCSC, CAF1, Broad Institute, BGI, and the NCBI trace archive. For some species, we also performed nonexhaustive searches in the NCBI Trace Archive using Megablast. Details on the dataset are given in the electronic supplement (see footnote 1).

Overall, the published experimental evidence on metazoan snRNAs is very unevenly distributed. For example, a large and phylogenetically diverse set of U2 snRNA sequences is reported in Giribet et al. (2001), while most other snRNAs have been reported for a few model organisms only. A recent experimental screen for snRNAs in Takifugu rubripes (Myslinski et al. 2004) resulted in copies of eight snRNA families. U4atac was missing, but a plausible candidate can easily be found by Blast. Only a few sequences of minor spliceosomal snRNAs have been reported so far, mostly in a few model mammals (Tarn et al. 1995) and in drosophilids (Mount et al. 2007; Schneider et al. 2004).

Homology Search

In a first automatic step we used a local installation of NCBI blast (v.2.2.10) with default parameters and E < 10^-6 to find candidate sequences in closely related genomes. If successful, the results of this search were aligned to the query sequence using clustalw (v.1.83). After a manual inspection using clustalx, the consensus sequence of the alignment was again used as a Blast query with the same E-value cutoff.
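The E-value filter used in this first step can be illustrated with a minimal sketch. It assumes that the Blast hits are available as 12-column tab-separated text (query, subject, identity, length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score); the `Hit` record and both function names are hypothetical and are not part of the pipeline described here.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One Blast hit (hypothetical record type for this sketch)."""
    query: str
    subject: str
    s_start: int
    s_end: int
    evalue: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tab-separated hits (assumed column layout)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        hits.append(
            Hit(cols[0], cols[1], int(cols[8]), int(cols[9]), float(cols[10]))
        )
    return hits


def filter_hits(hits: List[Hit], max_evalue: float = 1e-6) -> List[Hit]:
    """Keep only hits whose E-value is strictly below the cutoff."""
    return [h for h in hits if h.evalue < max_evalue]


# Synthetic example lines, not real search results.
example = [
    "U1_query\tchr1\t98.0\t164\t3\t0\t1\t164\t1000\t1163\t2e-80\t300",
    "U1_query\tchr2\t85.0\t40\t6\t0\t1\t40\t500\t539\t0.05\t30",
]
candidates = filter_hits(parse_tabular(example))
```

With the default cutoff only the first synthetic hit is retained; passing `max_evalue=0.1` retains both.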

If this automatic search was not successful, the best Blast hits were retrieved and aligned to a set of known snRNAs from related species. Candidate sequences were retained only when a visual inspection left no doubt that they were true homologues. This manual analysis step included a check whether the phylogenetic position of the candidate sequence in a neighbor-joining tree was plausible, taking into account that the sequences are short and some parts of the alignments are of low quality.

In cases where no snRNA homologues were found as described above, we searched the genome again with the much less stringent cutoff of E < 0.1 (or even higher in a few cases) and extracted all short hits together with 200-nt flanking sequence. We used Sean Eddy’s rnabob with a manually constructed structure model to extract a structure-based match within the selected regions and attempted to align the candidate sequences manually to a structure-annotated alignment of snRNAs in the emacs editor using the ralee mode (Griffiths-Jones 2005).
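Extracting a hit together with its 200-nt flanking sequence amounts to a small coordinate computation. The following sketch assumes 1-based inclusive coordinates and clips the window at the contig boundaries; the function name and the clipping behavior are illustrative assumptions, not a description of the scripts actually used.

```python
from typing import Tuple


def flanked_window(
    start: int, end: int, contig_length: int, flank: int = 200
) -> Tuple[int, int]:
    """Return the 1-based inclusive window covering a hit plus its flanks.

    Hits on the reverse strand may be reported with start > end, so the
    coordinates are ordered first. The window is clipped to the contig.
    """
    lo, hi = min(start, end), max(start, end)
    return max(1, lo - flank), min(contig_length, hi + flank)


# A hit at 1000..1163 on a 5000-nt contig yields the window 800..1363.
window = flanked_window(1000, 1163, 5000)
```

Clipping matters mostly for short contigs and unassembled reads, where a hit may lie closer than 200 nt to the end of the available sequence.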

Finally, the resulting alignments of snRNAs were used to derive search patterns for RNA motif (Macke et al. 2001) and erpin (Gautheret and Lambert 2001). To this end, the consensus structure of the alignment was computed using RNAalifold (Hofacker et al. 2002) and converted into a form suitable as input for the two search programs.

Structure Models

Structure annotated sequence alignments were manually modified in the emacs text editor using the ralee mode (Griffiths-Jones 2005) to improve local sequence-structure features based on secondary structure predictions for the individual sequences obtained from RNAfold (Hofacker et al. 1994). Consensus structures were then computed using RNAalifold (Hofacker et al. 2002). The structure models are compiled in the online supplementary material.

Upstream Region Analysis

We used MEME (v.3.5.0) to discover sequence motifs in the regions upstream of the snRNA candidates in order to identify regulatory elements and other possible dependencies. The MEME patterns were compared visually with previously published upstream elements in related species from the following literature sources: general motifs (Hernandez 2001), human (Bark et al. 1986; Domitrovich and Kunkel 2003; Kunkel and Pederson 1988; Tarn et al. 1995), chicken (Bhathal et al. 1995; Korf and Stumph 1986), insects (Mount et al. 2007), Bombyx mori (Sierra-Montes et al. 2005), Strongylocentrotus purpuratus (Stefanovic and Marzluff 1992), and Caenorhabditis elegans (Thomas et al. 1990).

Phylogenetic Analysis

Since the snRNA sequences are short and, in addition, there are several highly variable regions, we use split decomposition (Bandelt and Dress 1992) and the neighbor net (Bryant and Moulton 2004) algorithm (as implemented as part of the SplitsTree4 package (Huson and Bryant 2006)) to construct phylogenetic networks rather than phylogenetic trees. The advantage of these methods is that they are very conservative and that the reconstructed networks provide an easy-to-grasp representation of the considerable noise in the sequence data.

Synteny Information

In order to assess whether snRNA genes are mobile in the genome, we determined their flanking protein-coding genes. We used the ensembl compara annotation (Flicek et al. 2008) to retrieve homologous proteins in other genomes and compared whether these homologues also have adjacent snRNAs. For consistency, this analysis is performed based on ensembl (release 46) (Hubbard et al. 2005) using the data integration platform BioFuice (Kirsten and Rahm 2006). More precisely, for each human snRNA G we examined the homologues L_H(G) and R_H(G) of the flanking protein-coding genes L(G) and R(G) on the left and right side of G, respectively. We only considered annotations in L_H(G) and R_H(G), respectively, if the sequence distance between G_H and L_H(G) or R_H(G) was not more than twice (five times for mammals) the distance between G and L(G) or R(G).
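The distance criterion can be written down as a one-line predicate. The sketch below is only an illustration of the rule as stated in the text: distances are assumed to be given in nucleotides, and the function and parameter names are hypothetical.

```python
def within_distance_limit(
    human_distance: int, homologue_distance: int, is_mammal: bool
) -> bool:
    """Check the synteny distance criterion for one flanking gene.

    human_distance:     distance between the human snRNA and its flanking gene.
    homologue_distance: distance between the snRNA and the homologous
                        flanking gene in the other genome.
    The homologue distance may be at most twice the human distance,
    or at most five times the human distance for mammalian genomes.
    """
    factor = 5 if is_mammal else 2
    return homologue_distance <= factor * human_distance


# A flanking gene 25 kb away passes for a mammal but not for a non-mammal
# when the corresponding human distance is 10 kb.
ok_mammal = within_distance_limit(10_000, 25_000, is_mammal=True)
ok_other = within_distance_limit(10_000, 25_000, is_mammal=False)
```

Scaling the limit by the human distance rather than using a fixed window keeps the criterion meaningful in both gene-dense and gene-poor regions.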

Results

Homology Search

Table 1 summarizes the results of the sequence homology search detailed under Materials and Methods. Only sequences that passed all filtering steps and structure checks are reported as “homologues” in the following. We found that, with few exceptions, Blast-based homology search strategies are in general sufficient to find homologues of all nine spliceosomal snRNAs in most metazoan genomes. The procedure is hard to automate, however, since in many cases the initial Blast hits have poor E-values, while a multiple sequence alignment then leaves little doubt that a true homologue has been found. This is true, in particular, for searches bridging large evolutionary distances, especially when the search extends beyond bilateria.

Table 1 Approximate copy number of snRNA genes

With very few exceptions we found multiple copies of all five major spliceosomal RNAs that exhibited the typical snRNA-like promoter elements and were hence most likely functional copies of the genes. The snRNA copy numbers varied substantially between different clades. The genus Caenorhabditis, for example, was set apart from other nematodes by a two- to threefold increase in the number of major spliceosomal snRNAs. In contrast, the snRNAs of the minor spliceosome were in most cases single-copy genes.

Many genomes, most notably mammalian genomes, contained a sizable number of major snRNA pseudogenes. Table 1 therefore lists only candidates that have plausible snRNA-like promoter structure, that fit the secondary structures of snRNAs in related species, and that exhibit strong sequence similarity in the unpaired regions of the molecule. These are rather restrictive criteria. In the online supplementary material, we therefore provide a corresponding table that is based only on sequence homology.

It is surprisingly difficult to compare the present snRNA survey with previous reports on vertebrate snRNAs. The main reason for discrepancies in the count of snRNAs is that distinguishing functional snRNAs from pseudogenes is still an unsolved problem. In this contribution, we use a very stringent criterion by insisting on a recognizable promoter structure. In some cases, however, it is known that snRNAs have internal promoters only (Tichelaar et al. 1998). These cases constitute false negatives in Table 1. On the other hand, much of the published literature considers sequence similarity to the known functional genes as the only criterion, thus most likely leading to the inclusion of a substantial fraction of pseudogenes. For instance, The Chimpanzee Sequencing Analysis Consortium (2005) counts 16 U1, 6 U2, and 44 U6 snRNAs in the human genome (compared to our 8, 3, and 7, respectively), while Domitrovich and Kunkel (2003) report 5–9 U6 snRNA genes, consistent with our list. Similarly, only a fraction of the major spliceosomal snRNAs reported for the chicken genome in (Consortium 2004) passes our promoter analysis.

For drosophilids, on the other hand, our analysis is almost identical to the results in Mount et al. (2007), Table 1, and the data reported in Sierra-Montes et al. (2005). Furthermore, we come close to the results of a comparative genomics screen for noncoding RNAs in C. elegans (Missal et al. 2006), which reported 12 U1, 19 U2, 5 U4, 13 U5, and 23 U6, i.e., only a few more candidates than our present purely homology-based approach. A comparative screen of the two Ciona species for evolutionarily conserved structured RNAs (Missal et al. 2005) missed a small number of snRNA genes that we identified as most likely functional ones.

In a few species we failed to identify individual major spliceosomal snRNAs (e.g., A. pisum U4, H. bacteriophora U4, and S. mediterranea U2). Minor spliceosomal snRNAs are more often missing. In those cases where only some of the major or minor snRNAs remain undetected, the missing family member most likely escaped our detection procedure for one of several reasons.

  1. In the case of unassembled incomplete genomes for which only shotgun reads were searched, the snRNA may be located in the not yet sequenced fraction of the genome or it might not be completely contained within at least one single shotgun read.

  2. The snRNA in question may be highly derived in sequence. For instance, the U11 snRNA in drosophilids (Schneider et al. 2004) cannot be found by a simple Blast search starting from noninsect sequences. It can be found, however, by the combination of very unspecific Blast and subsequent structure search as described under Homology Search, above.

  3. In some cases we list a “0” in Table 1 even though there is recognizable sequence homology in the genome. In these cases we were not able to identify the snRNA-like promoter elements and/or the secondary structure did not fit the expectations. These cases are marked in Table 1.

  4. It is conceivable that some species have lost a particular snRNA and replaced it with the corresponding snRNA from the other spliceosome. The observation that U4 may function in both the major and the minor spliceosomes (Shukla and Padgett 2004) shows that such a replacement mechanism might indeed be evolutionarily feasible.

In our data set, we most frequently were unable to find a U4atac homologue. We cannot know, of course, whether we missed these cases due to poor sequence conservation or due to loss of the gene. For instance, we did not recover a plausible U4atac candidate for the hemichordate Saccoglossus kowalevskii despite the fact that the U4atac sequence of the sea urchin Strongylocentrotus purpuratus was easily retrieved.

Surprisingly, we found neither a canonical U6 nor a canonical U6atac in Drosophila willistoni. A highly derived U6 homologue has no recognizable snRNA-like promoter structure and exhibits substantial deviations from the consensus structure (see Secondary Structures, below). Interestingly, it is aligned with the functional U6 RNAs of the other 11 drosophilids in the genome-wide “12-Fly” Pecan alignment (see footnote 2), which respects syntenic conservation. This strongly suggests that D. willistoni indeed has a highly derived U6 snRNA. According to known annotation the sequence is not located in an intron. The absence of external promoter elements has also been observed for one of the human U6 snRNAs (Tichelaar et al. 1998), hence the prediction is not at all implausible. Similarly, the U4atac candidate from Daphnia pulex deviates substantially from other arthropod sequences. It is possible that in some or all of these cases the snRNA is present in the genome but is not contained in the currently available genomic sequence data. This is most likely the case for the missing minor spliceosomal snRNAs of Ixodes scapularis, Pediculus humanus, or Drosophila willistoni.

In some cases, however, we failed to identify any of the four minor spliceosomal snRNAs. Consistent with previous work (Patel and Steitz 2003) we found no convincing homologues of the minor spliceosomal snRNAs U11, U12, U4atac, and U6atac in any of the nematode genomes, suggesting that the minor spliceosome was lost early in the nematode lineage. Nevertheless, we find some Blast hits for minor spliceosomal snRNAs in some nematode genomes.

Our analysis furthermore suggests the possible loss of the minor spliceosome in Oikopleura dioica, while a complete complement of minor spliceosomal snRNAs was found in the genus Ciona. It is unclear, however, whether this is an artifact due to limitations of available shotgun traces.

Our survey provides evidence that most metazoan clades for which genomic sequences are available have retained the minor spliceosome. For many groups, such as Annelida and Cnidaria, we are not aware of earlier references to the existence of a minor spliceosome.

Specific Upstream Elements

The classical snRNA-specific PSE and TATA elements that have been described in detail for several vertebrates (Domitrovich and Kunkel 2003; Hernandez 2001) are highly conserved. This appears to be an exception rather than the rule, however: the snRNA upstream elements are highly diverse across metazoa. Our analysis agrees with the recent observation that in drosophilids there is a rapid turnover in the upstream sequences. Even though the PSE is fairly well conserved within drosophilids, it already differs substantially between the major insect groups (Mount et al. 2007). Similarly, within the nematodes conservation of upstream elements is limited to the genus level. In general, the PSEs of U11, U12, and U4atac are much less conserved than their counterparts in major spliceosomal snRNA genes. For the purpose of this study, the relatively well-conserved elements were used to discriminate functional snRNAs from likely pseudogenes. We concentrated on PSE and TATA elements for this purpose because other snRNA-associated upstream elements, such as SPH, OCT, CAAT-box, GC-box, -35-element, and Inr, are even less well conserved:

A GC-box was identified in Caenorhabditis at a noncanonical position (about −68 nt). These elements are different for each single snRNA class: U1 GGACGG (44/52 sites), U2 TGGCCG (38/60 sites), and U5 CGGCCG (39/46 sites). However, this element also varies considerably within a single snRNA class: insects have a U1 GC-box GCGCTG at about −75 nt (15/39 sites). About half of the U6 sequences of basal deuterostomes show the CAAT-box motif TGCCAAGAA at the known position of −70 nt. Interestingly, we found related motifs in the upstream regions of drosophilid U11 (GACCAATAT; −33 nt) and of U5 snRNAs in other insects (TTCCAATCA; −28 nt). The Octamer motif (OCT; ATTTGCAC) was found in six of seven sequences of basal deuterostomes at the known position of −54 nt upstream of U6atac. However, in 12 of 14 drosophilid sequences, the closely related motif ATTTGCTT was found at position −33 nt. About 35 nt upstream of U11 and U12 snRNAs of teleosts we found the motifs GTGACA and TGCACA, respectively. The Inr element of U1 snRNA was found in each species. For teleost fish and drosophilids we found a complete set of this element for all snRNAs. However, the Inr shows substantial sequence variations both between different genes in the same species and between homologous genes in different species. We refer to the online supplementary material for further details and lists of identified sequence elements.
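Motif positions such as “about −68 nt” can be read as offsets relative to the first nucleotide of the snRNA. The sketch below locates exact matches of a motif in an upstream sequence and reports them in that coordinate system. It assumes that the upstream sequence ends immediately before the snRNA start and that a position refers to the first nucleotide of the motif; both conventions, the function name, and the synthetic test sequence are assumptions made for illustration.

```python
from typing import List


def motif_positions(upstream: str, motif: str) -> List[int]:
    """Return the start positions of all exact motif matches.

    Positions are negative offsets relative to the gene start, i.e. the
    last nucleotide of the upstream sequence is at position -1.
    Overlapping matches are reported as well.
    """
    upstream, motif = upstream.upper(), motif.upper()
    n = len(upstream)
    positions = []
    i = upstream.find(motif)
    while i != -1:
        positions.append(i - n)
        i = upstream.find(motif, i + 1)
    return positions


# Synthetic 78-nt upstream region with a GGACGG element placed at -68.
synthetic_upstream = "A" * 10 + "GGACGG" + "T" * 62
found = motif_positions(synthetic_upstream, "GGACGG")
```

Exact string matching is a simplification; a real upstream element search would need to tolerate mismatches, for example through a position weight matrix as produced by motif discovery tools.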

Clusters of snRNA Genes

In Mammalia, we observe linkage of tandem copies of U2 snRNAs (see also Liao et al. 1997; Pavelitz et al. 1999), while there are no clusters of distinct snRNAs. In Drosophila, there are surprisingly constant patterns of snRNA clusters: (a) U2-U5 clusters are observed four to six times per genome, (b) there are one or two U1-U2 clusters, and (c) three to nine tandem copies of snRNAs. Two species deviate from this pattern. In D. ananassae, we find no U2-U5 cluster but, instead, 7 U1-U2, 1 U4-U5 cluster, and 4 other tandem copies, while D. willistoni lacks the U4-U5 cluster but contains 10 U2-U5 pairs and 6 tandem copies. Teleost fish also have a common pattern: there are one or two U1-U2 pairs and two to six tandem copies. In general, however, snRNAs do not appear in clusters throughout metazoan genomes.

In several species, linkage of snRNAs with 5S rRNA has been observed (Cross and Rebordinos 2005; Ebel et al. 1999; Liao 1999; Liao and Weiner 1995; Manchado et al. 2006; Pelliccia et al. 2001). We found only one further example of this type: in Daphnia pulex 5S and U5 snRNAs are separated by only 308 bp.

Phylogenetic Analysis and Paralogues

Like ribosomal RNAs, spliceosomal RNAs are subject to concerted evolution (Gonzalez and Sylvester 2001; Hillis and Dixon 1991; Schlötterer and Tautz 1994), i.e., one observes that paralogous sequences in the same species are more similar than orthologous sequences of different species. Multiple molecular mechanisms may account for this phenomenon: gene conversion, repeated unequal crossover, and gene amplification (frequent duplications and losses within family); see Liao (1999) for a review. In some cases, however, paralogues can escape from the concerted evolution mechanisms as exemplified by the two paralogue groups of SSU rRNA in Chaetognatha (Papillon et al. 2006).
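The signature of concerted evolution described above, paralogues within a species being more similar than orthologues between species, can be checked on an alignment with a simple comparison of mean pairwise identities. The following is a minimal sketch under the assumption that all sequences come from one multiple alignment and therefore have equal length; the function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of identical columns; columns with gaps in both are ignored."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def mean_identities(seqs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (mean within-species, mean between-species) pairwise identity.

    seqs is a list of (species, aligned_sequence) pairs.
    """
    within, between = [], []
    for (sp1, s1), (sp2, s2) in combinations(seqs, 2):
        (within if sp1 == sp2 else between).append(identity(s1, s2))

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)


# Toy alignment: two paralogues each in two hypothetical species A and B.
toy = [
    ("A", "ACGUACGU"),
    ("A", "ACGUACGA"),
    ("B", "UCGAACGU"),
    ("B", "UCGAACGA"),
]
within_mean, between_mean = mean_identities(toy)
```

A within-species mean that clearly exceeds the between-species mean is what concerted evolution predicts; distinguishable paralogue groups would instead show up as paralogues that are closer to their orthologues than to each other.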

Distinguishable snRNA paralogues that are often differentially expressed have previously been reported for a diverse collection of major spliceosomal snRNAs including U1 snRNAs in insects (Lo and Mount 1990; Pereira-Simon et al. 2004; Sierra-Montes et al. 2005), Xenopus (Dahlberg and Lund 1988), and human (Kyriakopoulou et al. 2006), U2 snRNAs in Dictyostelium (Hinas et al. 2006), sea urchin (Stefanovic et al. 1991), and silk moth (Sierra-Montes et al. 2005), U5 snRNAs in human (Sontheimer and Steitz 1992), sea urchin (Morales et al. 1997), and drosophilids (Chen et al. 2005), and U6 snRNAs in silk moth (Smail et al. 2006) and human (Domitrovich and Kunkel 2003; Tichelaar et al. 1998).

A phylogenetic analysis of the individual snRNA families, nevertheless, does not show widely separated paralogue groups that are stable throughout larger clades. Figure 1, for example, shows that the U5 variants described in Chen et al. (2005) do not form clear paralogue groups beyond the closest relatives of Drosophila melanogaster. On the other hand, there is some evidence for distinguishable paralogues outside the melanogaster subgroup. The situation is much clearer for the drosophilid U4 snRNAs, where three paralogue groups can be distinguished (see Fig. 2). One group is well separated from the other two and internally rather diverse. The other two groups are very clearly distinguishable for the melanogaster and obscura group (see Drosophila 12 Genomes Consortium 2007). For D. virilis, D. mojavensis, D. grimshawi, and D. willistoni we have two nearly identical copies instead of two different groups of genes.

Fig. 1

Phylogenetic network of drosophilid U5 snRNAs. The eight U5 snRNAs reported in Chen et al. (2005) are the same as our predictions, indicated here by white dots. me, D. melanogaster; er, D. erecta; si, D. simulans; se, D. sechellia; ya, D. yakuba; wi, D. willistoni; gr, D. grimshawi; mo, D. mojavensis; vi, D. virilis; pe, D. persimilis; ps, D. pseudoobscura; an, D. ananassae. The phylogenetic tree is adapted from Drosophila 12 Genomes Consortium (2007)

Fig. 2

Phylogenetic tree of insect U4 snRNAs. In this case we can distinguish three paralog groups within the drosophilids. me, D. melanogaster; er, D. erecta; si, D. simulans; se, D. sechellia; ya, D. yakuba; wi, D. willistoni; gr, D. grimshawi; mo, D. mojavensis; vi, D. virilis; pe, D. persimilis; ps, D. pseudoobscura; an, D. ananassae

Table 2 summarizes the presence of recognizable paralogue groups within major animal groups. Within the genus Caenorhabditis we find evidence for the formation of U5 paralogue groups in C. remanei, C. brenneri, and C. briggsae, to the exclusion of C. elegans and C. japonica. Evidence for paralogue groups of U1 snRNA in drosophilids remains ambiguous due to the small sequence differences.

Table 2 Paralogue groups of major spliceosomal snRNAs recognizable within major animal clades

In teleost fish, we find clearly recognizable paralogue groups for U2, U4, and U5 snRNAs. Surprisingly, the medaka Oryzias latipes has only a single group of closely related sequences, despite the fact that for U4, the split of the paralogues appears to predate the last common ancestor of zebrafish and fugu (Fig. 3).

Fig. 3

Phylogenetic networks of teleost fish snRNAs. fru, Fugu rubripes; tni, Tetraodon nigroviridis; gac, Gasterosteus aculeatus; ola, Oryzias latipes; dre, Danio rerio; pma, Petromyzon marinus; bfl, Branchiostoma floridae

Neither the two rounds of genome duplications at the root of the vertebrates nor the teleost-specific genome duplication has led to recognizable paralogue groups of snRNAs. In particular, minor snRNA genes are single-copy genes in teleosts.

Secondary Structures

The spliceosomal snRNAs have evolutionarily well-conserved secondary structures (Shukla and Padgett 1999). These structures received substantial interest in the past, as exemplified by the following nonexhaustive list of references covering a diverse set of animal species: Homo sapiens U1 (Mount and Steitz 1981), U2 (Hausner et al. 1990), U4 (Krol et al. 1981), U5 (Branlant et al. 1983; Sontheimer and Steitz 1992), U6 (Hausner et al. 1990), U11 (Montzka and Steitz 1988; Russell et al. 2006; Tarn et al. 1995), U12 (Montzka and Steitz 1988; Russell et al. 2006; Tarn et al. 1995), and U4atac (Shukla et al. 2002); Rattus norvegicus U1 (Krol et al. 1981), U4 (Krol et al. 1981), and U5 (Krol et al. 1981); Gallus gallus U4 (Krol et al. 1981) and U5 (Branlant et al. 1983); Xenopus laevis U1 (Forbes et al. 1984) and U2 (Mattaj and Zeller 1983); Caenorhabditis elegans U1, U2, U5, and U4/U6 (Thomas et al. 1990); Drosophila melanogaster U1 (Mount and Steitz 1981; Myslinski et al. 1984), U2 (Myslinski et al. 1984), U4 (Myslinski et al. 1984), U5 (Myslinski et al. 1984), U4atac/U6atac, and U6atac/U12 (Otake et al. 2002); Bombyx mori U1 (Sierra-Montes et al. 2003) and U2 (Sierra-Montes et al. 2002); Asellus aquaticus U1 (Barzotti et al. 2003); and Ascaris lumbricoides U1, U2, U5, and U4/U6 (Shambaugh et al. 1994). Large changes in snRNA structures over evolutionary time were recently reported for hemiascomycetous yeasts (Mitrovich and Guthrie 2007). The comprehensive survey of snRNA sequences throughout metazoa sets the stage for a comparably detailed analysis of metazoan snRNA structures. In order to assess structural variations, we constructed structure annotated sequence alignments of all snRNA families. The complete set of alignments and consensus structure models is provided (in Stockholm format) as part of the online supplementary material.

In general we find that snRNA sequences vary more in paired regions than in the loops. The sequence variations almost exclusively comprise compensatory mutations that leave the secondary structures intact. As an example, Fig. 4 shows the structures of the U12 snRNA of Xenopus tropicalis and Capitella capitata. The sequences have few paired nucleotides in common.

Fig. 4

Predicted secondary structures of Capitella capitata and Xenopus tropicalis, and an alignment created with RNAalifold of both. Paired circles represent compensatory mutations (e.g., AT ↔ GC), while circles on only one side of a base pair indicate “consistent” mutations (e.g., GU ↔ GC)
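The distinction between compensatory and consistent mutations can be made explicit with a short sketch. It assumes two aligned, gap-free sequences and a shared consensus structure in dot-bracket notation, counts a base pair as compensatory when both partners differ and as consistent when exactly one partner differs, and treats GU pairs as valid. The function names, the category label "incompatible", and the toy sequences are illustrative assumptions rather than part of the analysis reported here.

```python
from typing import Dict, List, Tuple

# Canonical Watson-Crick pairs plus GU wobble pairs.
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_table(structure: str) -> List[Tuple[int, int]]:
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def classify_pairs(structure: str, seq1: str, seq2: str) -> Dict[str, int]:
    """Count conserved, consistent, compensatory, and incompatible pairs."""
    # Accept DNA-style input by mapping T to U.
    seq1 = seq1.upper().replace("T", "U")
    seq2 = seq2.upper().replace("T", "U")
    if not (len(structure) == len(seq1) == len(seq2)):
        raise ValueError("structure and sequences must have equal length")
    counts = {"conserved": 0, "consistent": 0, "compensatory": 0, "incompatible": 0}
    labels = ("conserved", "consistent", "compensatory")
    for i, j in pair_table(structure):
        if seq1[i] + seq1[j] not in VALID_PAIRS or seq2[i] + seq2[j] not in VALID_PAIRS:
            counts["incompatible"] += 1
            continue
        changed = (seq1[i] != seq2[i]) + (seq1[j] != seq2[j])
        counts[labels[changed]] += 1
    return counts


# Toy hairpin with one pair of each category.
result = classify_pairs("((((...))))", "GGCAAAAUGCC", "GGGAAAAACUC")
```

In the toy hairpin the outermost pair is unchanged, the second pair changes from GC to GU (consistent), the third from CG to GC (compensatory), and the innermost pair cannot form in the second sequence.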

Structural variations are typically limited. In Fig. 5 we use the U1 snRNAs as a typical example for the evolutionary variation of snRNAs across the metazoa. Overall the structures are extremely well conserved, with small variations in the length of the individual stems. With several notable exceptions this is true for all metazoan snRNAs.

Fig. 5

Secondary structure prediction of U1 snRNA, folded by RNAalifold. From left to right: protostomia without insects, insects, deuterostomes without vertebrates, vertebrates. Red: conserved sequences in all organisms, which possibly bind to proteins. Sm binding site marked separately

As reported previously (Chen et al. 2005), the second stem of U5 snRNA shows some variations. More interestingly, the minor spliceosomal snRNAs tend to be derived in insects. This has been reported previously, in particular, for U11 in drosophilids (Mount et al. 2007; Schneider et al. 2004). We found substantial structural variations also for drosophilid U12 snRNAs: there are massive insertions in and after stem III, while stems I and II show mispairings. Furthermore, stem II of U6atac is completely deleted in all examined insects. Details are compiled in the online supplementary material.

Most surprisingly, Acyrthosiphon pisum exhibits highly derived structures for all four minor spliceosomal snRNAs (see Fig. 6).

Fig. 6

Secondary structures of U11 (left), U12 (center), and U6atac (right) in Acyrthosiphon pisum, Drosophila melanogaster, and Homo sapiens. The drosophilid structures are strongly derived relative to all other minor spliceosomal snRNA structures (e.g., human). Moreover, the Acyrthosiphon pisum structures form a separate structural group for all minor snRNAs

The U2 snRNA of Schmidtea mediterranea does fit well to the structural alignment of the other U2 snRNAs. In Schistosoma mansoni we found a canonical U12 snRNA, while the sequences of the candidates for minor spliceosomal snRNAs do not fit well to the consensus secondary structure models. Details are given in the online supplementary material.

Syntenic Conservation

In order to assess the conservation of the genomic positions of the snRNAs we retrieved the protein coding genes adjacent to the 31 human snRNAs (8 U1, 3 U2, 2 U4, 5 U5, 7 U6, 1 U11, 1 U12, 3 U4atac, and 1 U6atac) and compared the position of their homologues in 14 vertebrate genomes (teleosts, frog, chicken, platypus, opossum, rodents, cow, dog, and chimp) with the 234 snRNA genes that were found in these genomes. We found syntenic conservation of snRNA and flanking genes in only 36 cases, of which 20 belong to the human-chimp comparison. Only 9 of the 31 human snRNAs preserve synteny with adjacent genes in the mouse genome, while 22,680 annotated human genes give rise to 21,480 adjacent pairs that have adjacent homologues in the mouse. Furthermore, only a single pair is conserved between human and opossum, and no syntenic conservation can be traced back further in evolutionary history, while large syntenic blocks are conserved across chordata (Putnam et al. 2008). Including the pseudogenes increases the numbers of conserved pairs to 499 of 1609. Again, most of these (453) are human/chimp pairs. The data clearly show that snRNA locations are not syntenically conserved, i.e., snRNAs behave like mobile elements in their genomic context.

Pseudogenes

As mentioned above, snRNAs are frequently the founders of families of pseudogenes. This is a property that they share with most other small RNA classes such as 7SL RNA, Y RNA, and tRNAs. Such families of pseudogenes are easily recognized in BLAST-based homology searches, where they appear as a large set of hits with intermediate E-values. Figure 7 summarizes these data; more details are provided in the online supplementary material.

Fig. 7

Double-logarithmic plot of the number of BLAST hits versus cutoff E-value for six different genomes. Pseudogene families appear as a slowly increasing curve, while genes without a “cloud” of pseudogenes have a flat distribution for E < 10^−5. Dash-dotted line, U1; dotted line, U2; dashed line, U4; dash-dot-dotted line, U5; solid line, U6
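The curves in Fig. 7 can be read as cumulative hit counts: for each E-value cutoff, the number of hits at or below that cutoff. The following Python sketch shows one way to compute such counts and a crude double-logarithmic slope. The function names, the cutoffs, and the E-values are hypothetical and chosen only to contrast a flat curve with a slowly rising one; they are not data from this study.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def hits_per_cutoff(evalues: Sequence[float],
                    cutoffs: Sequence[float]) -> List[int]:
    """For each cutoff, count the hits whose E-value is at or below it."""
    ordered = sorted(evalues)
    return [bisect_right(ordered, c) for c in cutoffs]


def loglog_slope(cutoffs: Sequence[float], counts: Sequence[int]) -> float:
    """Slope of log10(count) versus log10(cutoff) between the end points."""
    rise = math.log10(counts[-1]) - math.log10(counts[0])
    run = math.log10(cutoffs[-1]) - math.log10(cutoffs[0])
    return rise / run


# Hypothetical E-values: two functional copies only, versus two copies
# accompanied by a "cloud" of weaker, pseudogene-like hits.
cutoffs = [1e-30, 1e-20, 1e-10, 1e-5]
no_cloud = [1e-40, 1e-38]
with_cloud = [1e-40, 1e-31, 1e-22, 1e-15, 1e-9, 1e-6]

flat_counts = hits_per_cutoff(no_cloud, cutoffs)
cloud_counts = hits_per_cutoff(with_cloud, cutoffs)
print(flat_counts, loglog_slope(cutoffs, flat_counts))
print(cloud_counts, loglog_slope(cutoffs, cloud_counts))
```

A slope of zero corresponds to the flat distribution of a gene without pseudogenes, and a positive slope to the slowly increasing curve of a pseudogene family. The end-point slope is a deliberately simple summary; a regression over all cutoffs would be a more robust choice.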

Spliceosomal snRNA pseudogene families are very unevenly distributed across distinct phylogenetic groups and have clearly arisen in independent bursts multiple times during animal evolution. Within deuterostomes, almost all sequenced genomes, with the notable exception of teleosts and chicken, contain at least one large family of snRNA-derived pseudogenes.

The genus Caenorhabditis shows no pseudogenes, whereas other nematodes show nearly as many pseudogenes as primates. Annelids, molluscs, and platyhelminthes behave similarly. The Trichoplax adhaerens genome, on the other hand, contains a single copy of each of the nine spliceosomal snRNAs.

Discussion

We have reported here on a comprehensive computational survey of spliceosomal snRNAs in all currently available metazoan genomes. We thus provide a comparable and nearly complete collection of animal snRNA sequences. The dense taxon sampling allowed us to verify homology of candidate sequences. Both the major and the minor spliceosome are present in almost all metazoan clades, nematodes (and possibly Oikopleura) being the only notable exceptions. For many of the metazoan families we report here the first evidence of their spliceosomal RNAs.

Restrictive filtering of the candidates by both secondary structure and canonical promoter structure left us with a high-quality data set that was then used to construct secondary structure models. This is useful, in particular, for the snRNAs of the minor spliceosome, for which very few sequences are reported in databases; indeed, Rfam 7.0 (Griffiths-Jones et al. 2005) lists only the U11 and U12 families, with a meager set of seed sequences from a few model organisms. The sequence and secondary structure data compiled in this study provide a substantially improved database and set the stage for systematic searches of even more distant homologues.

The analysis of the genomic distribution of snRNAs reveals that discernible paralogues are not uncommon within genera or families. However, no dramatically different paralogues have been found. Spliceosomal snRNAs are prone to spawning large pseudogene families, which arose independently in many species. They behave like mobile genetic elements in that they rarely appear in syntenic positions as measured by their flanking genes. While in some genomes snRNAs appear in tandem and/or associated with 5S rRNA genes, these clusters are not conserved over longer evolutionary timescales. Taken together, the data are consistent with a dominant duplication-deletion mechanism of concerted evolution for the genomic evolution and proliferation of snRNAs. This behavior of snRNAs is similar, in particular, to that of tRNAs, although the copy number of snRNAs is typically much smaller. Recent studies have demonstrated that snoRNAs behave like mobile genetic elements that spread via retroposition (Schmitz et al. 2008; Weber 2006). Their mode of expression from spliced-out introns, however, restricts the functional copies predominantly to introns of the same host gene, with only occasional translocations to different carriers (see Bompfünewerer et al. 2005). Spliceosomal RNAs, in contrast, appear to spread freely across the genome when they appear as multicopy genes.