Introduction

The importance of incorporating secondary structure information in phylogenetic studies of rRNA genes is well established (e.g., Dixon and Hillis 1993), and the need for accurate models has increased now that refined models of DNA evolution that incorporate base covariation are available (Huelsenbeck and Ronquist 2001). Inferring secondary structures from sequence data is generally undertaken either by phylogenetic (comparative) studies that search for covarying bases or by methods that minimize the free energies of putative secondary structures (Zuker 1989), although there is a growing number of hybrid techniques (e.g., Juan and Wilson 1999). One issue with the former is that analyses are generally carried out on quite divergent taxa in order for a reasonable degree of covariation to be observed. This has allowed prediction of the major structures, but identification of all base pairs within those structures may leave some scope for improvement. This can be partly attributed to (1) highly accurate alignments becoming more difficult with increased divergence between taxa and (2) more divergent taxa being more likely to have divergent secondary structures, reducing the likelihood that a comprehensive consensus model (i.e., one containing interacting sites common to most/all taxa in the study) can be obtained.

A phylogenetic comparative study of closely related taxa is unlikely to yield a good secondary structure model through insufficient base covariation (but see Parsch et al. 2000). Fortunately, a large volume of research means that the main structural elements are well established for some RNA subunits in a wide variety of taxa (Wuyts et al. 2001a). Thus highly refined models applicable to a group of closely related organisms in a phylogenetic study can be obtained by fine-scale analysis of the known structural elements, using information on base complementarities as well as covariation if present (e.g., Espinosa de los Monteros 2003). This approach is feasible for studies of large (LSU; also known as 16S) and small (SSU; also known as 12S) subunit mitochondrial rRNA thanks to the quantity of sequence data now available in genetic databanks.

Over recent years there have been many evolutionary studies of two families within the scincomorph lizard clade that are traditionally named the Lacertidae and the Scincidae. Interest has centered on the systematics of these groups and the interesting phylogeographic patterns that they show (e.g., Fu 1998; Harris et al. 1998a; Honda et al. 2000; Mausfeld et al. 2000). The former is found in Africa and most of Eurasia, while the latter is found on all continents except Antarctica. Secondary structures for the LSU mitochondrial rRNA have been proposed for 14 species of the Darevskia genus within the Lacertidae and for Eumeces egregius within the Scincidae and are available on a WWW database (see Wuyts et al. 2001a), but how representative these are of other taxa within these groups is unknown. These models are hereafter referred to as the “W&VP” models, with the Eumeces model specified as /E and the Darevskia models (which are all almost identical) denoted by /D. The main structures are common to both models, with differences restricted to the actual pattern of site interactions that form these elements (note that the term “site interaction” is used here to refer to the pairing between bases at specific positions to form secondary structures). The first objective of this work is to evaluate, compare, and refine these interactions to provide consensus models for each lineage.

Obtaining refined rRNA consensus structures allows detailed and accurate analyses of rRNA evolution, which increases the understanding of functional aspects of rRNA molecules. It also aids alignment of rRNA sequences and contributes to the increasingly sophisticated models of sequence evolution being applied in maximum likelihood and Bayesian approaches. Two of these are of particular interest here: (1) doublet models that take into account the dependent substitutions between interacting sites within an rRNA secondary structure (Savill et al. 2001) and (2) covarion models that allow variation in site-specific rates among lineages, generally known as site-specific rate variation (SSRV) or covarion-like evolution (Galtier 2001). The former requires application of a good secondary structure model, which is achieved by the first part of this work described above. There has been considerable recent interest in finding empirical evidence of the latter and studies have been carried out that involve diverse groups of taxa (Galtier 2001; Misof et al. 2002). Additional objectives of this work were examination of rate variation between sites and testing for SSRV in the mitochondrial LSU rRNA of the Scincidae and the Lacertidae.

Materials and Methods

Alignment and Assessment of Secondary Structure

Helix names follow the Wuyts et al. (2001a) nomenclature unless stated otherwise. A total of 99 Lacertidae and 126 Scincidae 16S rRNA sequences were obtained from GenBank (Appendix I). All sequences contained a homologous fragment of the 16S rRNA gene, extending from the terminal two bases of helix E23′ (helix 66′ in the Gutell model [Gutell et al. 1985; Gutell and Fox 1988; http://www.rna.icmb.utexas.edu/]) to the first base of G20′ (helix 90′ in the Gutell model). This is equivalent to 499 bases in the scincid Eumeces egregius and 496 bases in the lacertid Darevskia caucasica. Four of the scincid sequences were slightly shorter because they lacked up to 13 bases at either the 3′ or the 5′ end of the sequence, which were treated as ambiguities (although these sites were included in the analyses where possible).

An initial alignment was achieved using ClustalX on the Scincidae and the Lacertidae independently. This provided good alignments due to considerable sequence conservation in these closely related taxa. Sites were then designated to helices or loop regions according to the W&VP/D or W&VP/E secondary structure models, and the initial alignments adjusted manually (similar to the procedure outlined by Kjer [1995]). Some variable-length helices provided many indels, while some very variable-length hairpin loops could not be confidently aligned and so were not used in subsequent analyses.

It was assumed that the W&VP models were accurate unless evidence was found to support alternative models. Mutual information (MI) and potential base pairing (PP) were obtained in half-matrices representing all possible pairwise site comparisons. PP proportions were assessed for the proposed models for each lineage. For values below 0.95, the elements corresponding to neighboring bases were searched within the matrices to investigate whether alternative base pairs could provide higher PP and MI values. The latter were only informative for sites showing above-average variability. Wherever possible, helices ending in hairpin loops were extended under these criteria, conditional on the presence of at least three bases in the terminal loop. G–U wobble pairs were accepted within helices but not as initial or terminal helix pairs.
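The two quantities used above can be illustrated with a minimal sketch. The text does not give explicit formulas, so this assumes that PP is the proportion of sequences whose bases at two alignment columns could pair (Watson–Crick or G–U) and that MI is the standard mutual information between the two columns, in bits. The function names, the treatment of T as U, and the exclusion of gapped or ambiguous positions are illustrative assumptions rather than the procedure actually used.

```python
from collections import Counter
from math import log2

# Assumption: "potential pairing" covers Watson-Crick pairs plus G-U wobble.
PAIRING = {("A", "U"), ("U", "A"), ("G", "C"),
           ("C", "G"), ("G", "U"), ("U", "G")}


def _clean(col_i, col_j):
    """Pair up the two columns per sequence, reading T as U and
    dropping sequences with a gap or ambiguity code at either site."""
    pairs = []
    for a, b in zip(col_i, col_j):
        a = a.upper().replace("T", "U")
        b = b.upper().replace("T", "U")
        if a in "ACGU" and b in "ACGU":
            pairs.append((a, b))
    return pairs


def potential_pairing(col_i, col_j):
    """Proportion of sequences in which the two sites could base pair."""
    pairs = _clean(col_i, col_j)
    if not pairs:
        return 0.0
    return sum(p in PAIRING for p in pairs) / len(pairs)


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns."""
    pairs = _clean(col_i, col_j)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    freq_i = Counter(a for a, _ in pairs)
    freq_j = Counter(b for _, b in pairs)
    return sum(
        (count / n) * log2(count * n / (freq_i[a] * freq_j[b]))
        for (a, b), count in joint.items()
    )


# Hypothetical columns: one character per sequence at each of two sites.
site_5prime = "GGAACC"
site_3prime = "CCTTGG"
pp = potential_pairing(site_5prime, site_3prime)
mi = mutual_information(site_5prime, site_3prime)
```

In this toy case every sequence can pair and the two columns covary perfectly, so PP is 1 and MI equals the entropy of either column. A fully conserved pair (e.g., G–C in every sequence) also gives PP of 1 but MI of 0, which is why MI is only informative for sites showing above-average variability.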

Analysis of Secondary Structure Evolution

A total of 56 models of sequence evolution were initially assessed by comparison of corresponding likelihoods computed for a neighbor-joining topology based on Jukes–Cantor distances (MODELTEST version 3.06; Posada and Crandall 1998). Phylogenies and estimates of sequence parameters were obtained from Bayesian analyses of the sequences (MrBayes version 3.0b4; Huelsenbeck and Ronquist 2001), using a doublet model that specified interacting sites from the consensus models. Other likelihood settings for the sequence evolution model were guided by the results of MODELTEST. The Bayesian analyses were run for 800,000 generations starting from a random tree, with 1 sample per 100 generations. This was repeated three times on each lizard group, and after removal of the initial 500–2500 samples prior to stationarity being reached (as assessed by the stabilization of the likelihoods), the consistency of the resultant three trees was checked. The 50% majority rule consensus trees were compared among the three runs. This was based on 21,500 trees for the Lacertidae and 17,450 trees for the Scincidae.
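The sample counts above can be cross-checked with simple arithmetic. The sketch below assumes that each run contributes 8,000 samples (i.e., the starting state is not counted) and that the reported tree totals are pooled over the three runs; the per-run burn-in values are not given in the text, so only the total number of discarded samples is derived.

```python
GENERATIONS = 800_000   # generations per run
SAMPLE_EVERY = 100      # one sample per 100 generations
RUNS = 3                # independent runs per lizard group

# Assumption: the initial (generation 0) state is not counted as a sample.
samples_per_run = GENERATIONS // SAMPLE_EVERY
total_samples = samples_per_run * RUNS


def discarded(pooled_trees):
    """Total burn-in samples implied by the pooled post-burn-in tree count."""
    return total_samples - pooled_trees


burnin_lacertidae = discarded(21_500)
burnin_scincidae = discarded(17_450)

# Both totals should fall within 3 x (500 to 2500) discarded samples.
within_range = all(
    RUNS * 500 <= b <= RUNS * 2500
    for b in (burnin_lacertidae, burnin_scincidae)
)
```

Under these assumptions, 2,500 samples were discarded in total for the Lacertidae and 6,550 for the Scincidae, both consistent with the stated range of 500–2500 samples per run.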

Analysis of covarion-like evolution was carried out using PAML version 3.14 (Yang 1997) by comparison of base pairs between the Lacertidae and the Scincidae. The input trees were the consensus trees inferred from the Bayesian analyses. Heterogeneity in site-specific rates within each lineage was assumed to follow a gamma distribution that could be approximated by assigning each site to one of eight rate categories (Yang 1994). Homologous interacting sites in the Lacertidae and Scincidae were analyzed and the rate categories compared, as they provided relative rates within a lineage. Thus, analysis of the categories was comparable between lineages, whereas actual rates themselves would be affected by the degree of divergence within the lineage.

Results

Secondary Structures

The findings for the 17 structures in the Lacertidae and Scincidae W&VP models are summarized in Table 1, and consensus models are shown in Figs. 1 and 2, respectively. Further details on structures in which the consensus models differ from the W&VP models are given below.

Table 1 Evaluation of helix base-pairing in the Lacertidae and Scincidae relative to Wuyts et al. (2001a) models, with information on hairpin loop variability
Figure 1

Consensus secondary structure model for the Lacertidae shown for Darevskia caucasica (AF206187), based on potential pairing and mutual information analysis (note that this is based on site interactions for the majority of taxa, explaining the presence of non-Watson–Crick/GU bonds and why some potentially bonding bases do not appear to be paired). Paired sites that were included in the analysis (i.e., those homologous with the Scincidae) are connected by a large dot; other paired sites are joined by a small dot.

Figure 2

Consensus secondary structure model for the Scincidae shown for Eumeces egregius. See Fig. 1 legend for details.

E28

For Lacertidae, the W&VP/D models contain an additional A–U base pair (compared to W&VP/E) at the hairpin loop end of the helix, but the very low PP value (0.22) suggests that this should not be included in a consensus model. The interaction could be due to helix length extension (and consequent loop reduction) in a small number of taxa, i.e., primarily Darevskia. This runs against expectation given that the entire hairpin loop sequence is relatively highly conserved (conserved in >60% of the Lacertidae and >70% of the Scincidae). The W&VP/E model is the best consensus in both cases.

F1

The high PP (0.96) allows a one-base pair extension of the base of the helix in the Scincidae. A similar extension was included for the Lacertidae, despite slightly lower support (PP = 0.91).

G2

An additional terminal A–U base pair (prior to G3) is conserved in all Scincidae and Lacertidae, and this is included in the consensus model.

G3

The extreme variability and high number of A–U pairings make this one of the most difficult secondary structures to align (Buckley et al. 2000). This is the only structure in which the W&VP models for closely related Darevskia differ slightly from one another. The model shown (Fig. 1) provides higher PPs (>0.91 in all cases, as opposed to values as low as 0.51 for the Darevskia models) and is supported by compensatory substitutions for some base pairs. The initial three base pairs (all A–U) are conserved for all taxa, while the terminal end of the helix appears to be variable in length. The W&VP/E model does not appear to be entirely appropriate for the Scincidae. Similar to the Lacertidae, the first A–U bond for the proposed structure is invariant for all Scincidae, while the subsequent two bonds (both A–U) are present in 90% of taxa. A further base pair (also supported by MI) is followed by an internal loop. A conserved A–U base pair at/near the terminal end appears to be present.

G15

The W&VP/D model is supported for the Lacertidae, although there is PP and MI support for additional pairing between three more pairs of sites currently situated in the hairpin loop, leading to lengthening of the helix at the terminal end. An alternative to the W&VP/E is strongly supported in the Scincidae, with evidence for a four-base pair helix provided by high PPs and also high MI contents at two sites. None of the W&VP/E base pairs are incorporated and the proposed change makes it more similar to the W&VP/D model.

G18

High PPs support helix extension into the hairpin loop by one G–C base pair in the Lacertidae.

rRNA Evolution

Some base positions were removed from the analyses because they were in highly variable loops and so difficult to align and/or in areas of putative helix length variation. These were found in the following structures: E25 (terminal loop and helix), noninteracting sites between G2 and G3, G3 (terminal loop and helix), some noninteracting sites between G8 and G13 and between G13 and G15, most of the terminal G15 loop, one indel site within G16, and one (Scincidae) or two (Lacertidae) indel sites between G17′ and G19′. For the Lacertidae, this left 446 bases for analysis, of which 176 were involved in secondary structure interactions, while corresponding values for the Scincidae were 450 and 184, respectively.

Independent of the criterion used, comparisons of likelihoods for a Jukes–Cantor tree suggested that a general time reversible model (Rodriguez et al. 1990) incorporating invariant sites (I) and gamma-distributed site rate heterogeneity (G) provided the best model of sequence evolution for the Scincidae. The same model was also selected for the Lacertidae under the Akaike information criterion, although the TrN model (Tamura and Nei 1993) with I and G was favored by hierarchical likelihood ratio testing. Thus all Bayesian analyses allowed different rates for all substitution types and gamma-distributed rate variation across sites (six categories), with a proportion of the sites invariable. The doublet model was applied to interacting sites, with the “4by4” model for other sites. The Bayesian strict consensus trees (see Appendix II) provided the topologies for subsequent analyses using PAML.

For the Lacertidae the assigned site rate categories were either 1, 5, 6, 7, or 8, while in the Scincidae the rates were either 1, 6, 7, or 8. Site rate assignment to categories does have an error term associated, although this decreases with the number of categories used and is less of a problem when the number of rate categories is greater than three, as in this case (Misof et al. 2002). As in previous studies (Misof et al. 2002), rate categories sometimes differed between sites within a pair (Figs. 1 and 2). Several explanations of this are possible, including deviations from the consensus structures, incorrect assignment, and substitutions in only one of the two sites that did not affect base pairing (e.g., G–C to G–U). Evidence of the latter was found, but examination of the original sequences revealed a more persistent cause: conserved sites with only a single substitution (potentially because of an error in the original sequence or an aberrant individual) could be assigned to rate category 5 or 6, while the corresponding site may be completely conserved and thus assigned to category 1.

Site-specific rate categories obtained for homologous interacting sites were used to obtain rate category values from the mean rate of the site pair (although a single value was used when only one of the paired sites was present) (Fig. 3). Deviations between the Scincidae and the Lacertidae rate category assignments allow an assessment of the degree of covarion-like evolution between the two lineages (Fig. 4). To overcome the potential problem of differing rate categories between paired sites outlined above, additional conservative analyses considered rate variation to exist only when one site pair was invariant (i.e., rate category 1) while the other site pair averaged more than two substitutions each (i.e., corresponding to rates above 6). The proportion of between-lineage deviations that differ from zero, as expected when covarion-like evolution occurs, is 0.43 (i.e., 54 of the 126 homologous site pairs compared). Using the normal approximation (for large sample sizes) for proportions, the 95% confidence interval for this is (0.34, 0.51), suggesting that approximately one-third or more of sites could show covarion-like evolution. The corresponding estimate for the more conservative analysis is 0.07 (i.e., 9 site pairs), with a corresponding 95% confidence interval of (0.04, 0.13), which substantiates the overall finding. Whether covarion-type evolution is more concentrated in internal parts of the helices than at their termini was tested using a likelihood ratio (G) test. The presence or absence of substantial differences in rate assignment was not significantly contingent on whether the interacting sites were internal (five deviations of >5 in 90) or not (four deviations of >5 in 36) at the 5% significance level: G = 1.11, p (Exact) = 0.44. The median deviation did not differ significantly from zero (Wilcoxon signed ranks procedure: smallest sum of ranks = 637, p [Exact] = 0.34), providing no evidence of an excess of higher or lower rate category sites in one lineage relative to the other.
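The first confidence interval and the G statistic quoted above can be checked with a short sketch. It assumes that 126 homologous site pairs were compared in total (90 internal plus 36 terminal), that the interval is the simple Wald form of the normal approximation, and that both contingency counts refer to deviations greater than 5. The function names are illustrative. The interval reported for the conservative analysis is not reproduced by the Wald form and may have been obtained by a different method, so it is not checked here.

```python
from math import log, sqrt


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def g_statistic(table):
    """Likelihood ratio (G) statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to G
                expected = row_totals[i] * col_totals[j] / total
                g += observed * log(observed / expected)
    return 2 * g


# Assumed total number of homologous site pairs (internal + terminal).
N_PAIRS = 90 + 36

low, high = wald_interval(54, N_PAIRS)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # 95% CI: (0.34, 0.51)

# Rows: internal, terminal; columns: deviation > 5, deviation <= 5.
g = g_statistic([[5, 85], [4, 32]])
print(f"G = {g:.2f}")  # G = 1.11
```

With these inputs the sketch returns the interval (0.34, 0.51) and G = 1.11, matching the reported values.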

Figure 3

Summary of frequencies of paired site rate categories by helix. The rate categories are simplified to fast (black bars; rate categories >5), medium (gray bars; rate categories 2–5), and slow (open bars; rate category 1) rate categories for (A) Lacertidae and (B) Scincidae.

Figure 4

Rate category deviation between homologous interacting sites in the Lacertidae and the Scincidae, by helix. Scincidae rates were subtracted from Lacertidae rates, so positive deviations indicate a higher rate category for the latter, and vice versa. Note that points are jittered.

Discussion and Conclusions

All secondary structure elements present in the Eumeces and Darevskia models from the LSU database (Wuyts et al. 2001a) were confirmed. This is not too surprising, as these models are now known to be robust, being originally based on large-scale comparative analysis and confirmed by recent fine resolution of the crystal structure of the LSU from an archaebacterium (see Wuyts et al. 2001a). However, adjustments to the pattern of site interactions were possible and this provided more accurate specification for doublet models in phylogenetic analyses. Early comparative studies of secondary structure tended to be limited by the trade-off between using related sequences with nearly identical patterns of site interactions but highly conserved sites that provide little mutual information as support for base pairing and using highly divergent sequences with potentially high mutual information but different site interactions. Here and elsewhere (Buckley et al. 2000) it has been shown that the latter have provided good overall models but that they can be further refined by comparative analyses of taxa with lower levels of divergence.

In general, helices with the highest evolutionary rates (i.e., those with highest proportions of rapidly evolving base pairs in both lineages) were E18, E21, and F1 (Figs. 1, 2, and 3). G16 and G18 also contained significant proportions of rapidly evolving sites. It may appear surprising that the highest proportions of rapidly evolving base pairs were not detected within G3 and G15, as these are possibly the most variable helices. However, this might be explained by the conservative nature of the remaining sites once sites that were difficult to align had been removed. Higher rates in E18 and E21 are largely expected, as these correspond to domain IV, which comprises most of the subunit–subunit interface surface (Ban et al. 2000). Although these helices are not part of the prominent active cleft site, they are known to be either quite or extremely variable, respectively (Wuyts et al. 2001b). F1 is also adjacent to domain IV. Domain V encompasses F1 and subsequent helices and is involved in peptidyl transferase activity. It is found in the middle of the subunit, where variability is expected to be lower (Wuyts et al. 2001b). It helps stabilize the elongation factor-binding region of the ribosome and is generally quite conserved across higher taxa (Ban et al. 2000; Wuyts et al. 2001b).

Evidence for covarion-like evolution is provided, even though very large differences in rates between the two lizard families were only observed at approximately 7% of homologous base pairs. This represents one of the first observations of such a pattern between such closely related clades, indicating that SSRV is widespread. The distribution pattern of the major covarion sites is difficult to compare with those described in insects (Misof et al. 2002) due to the relatively small number that were detected. However, all of the helices in which SSRV was detected in this study also showed SSRV in insects, except for G17 and G18 (and G19 if all potential covarion sites are considered). Why some sites within these generally conserved helices should evolve at different rates between lizard lineages is difficult to ascertain, although it is possible that the factors determining SSRV differ at different taxonomic levels.

The impact of nonindependent evolution of bases within a base pair and site-specific rate variation on phylogenetic tree construction was not explicitly examined by the current study. Bayesian analyses that did not implement the doublet model were performed on both data sets and found to provide similar trees to those described, suggesting that the impact of correct site pair specification may not be that large. Despite this, it is clearly preferable to account for these known sequence evolution constraints and specify them as correctly as possible in any analysis. Furthermore, a detailed analysis of RNA sequence evolution shows that models incorporating secondary structure information can provide significantly higher log likelihoods for the data (under a given topology) than simpler models that ignore compensatory substitutions (Savill et al. 2001). The impact of specifying covarion-like models for LSU and SSU rRNA evolution has also been examined recently (Galtier 2001). Again, incorporation of the more sophisticated models was found to lead to a significant increase in log likelihood. The present study provides additional empirical support for use of the latter even at lower levels of sequence divergence. Future studies should be directed toward obtaining further empirical evidence of covarion-like evolution, the interdependencies of these rate changes, and the corresponding functional implications for the rRNA itself.

Finally, the phylogenetic trees used here were based on a single mtDNA fragment and so are unlikely to be as well resolved or robust as those based on larger data sets: they were constructed solely for use in subsequent analyses of rRNA evolution (this is not a systematics study). However, it is worth commenting on inferred phylogenetic relationships that either deviate from or include taxa not present in other molecular phylogenetics studies (Fu 1998, 2000; Harris et al. 1998a,b; Lin et al. 2002; Whiting et al. 2003). For the Lacertidae (Appendix IIA), the (Zootoca/Lacerta vivipara, Lacerta/Archaeolacerta bedriagae) lineage was well supported, as were the (Takydromus sexlineatus, Lacerta cappadocica wolteri) and (Archaeolacerta mosorensis, Algyroides fitzingeri) lineages. The grouping of Psammodromus algirus and Lacerta andreanskyi within the smaller of the two major lineages (corresponding to the Eremiainae [see Harris et al. 1998a]) is also surprising. For the Scincidae (Appendix IIB), the recently inferred interesting phylogeographical relationships within the large Mabuya genus (i.e., (Asia, (Cape Verde, (Africa/Madagascar, South America))) with Mabuya atlantica from South America within the Africa clade [see Mausfeld et al. 2003]) were still strongly supported, despite my inclusion of more taxa and a doublet model of sequence evolution. However, the very well-supported clade comprising Carlia, Emoia, Eugongylus, two Eumeces, and Morethia has not been observed previously. Honda et al. (2003) included representatives of the first four of these but their analysis grouped only the first three genera within a well-supported clade, possibly because Eumeces was used as an outgroup in the analysis. Future studies could further explore some of these relationships by testing for congruence across multiple DNA markers.