Introduction

The plastid genome (plastome) comprises a wealth of data that are valuable for comparative evolutionary studies of plants. Plastomes contain many essential genes, especially those required for photosynthesis, and thus they harbour one of the few suites of characters that transcend the green plant branch of the tree of life. One apparent feature of plastome organization (structure) is that it has remained relatively constant over hundreds of millions of years. For example, the plastomes of the eusporangiate fern Angiopteris (marattioids, see Fig. 1) and the seed plant Nicotiana (tobacco) are almost (but not quite) identical in gene order (Karol et al. 2010). Given that these lineages diverged in the Devonian, over 350 million years ago (Pryer et al. 2004), the selective forces that maintain this type of structural stability over such a long period surely must be strong, especially considering the rapid structural evolution that occurred among the nuclear genomes of seed plant lineages during much shorter periods (Wei et al. 2009).

Fig. 1

Working phylogeny of relationships among ferns and seed plants, based on several published analyses (Pryer et al. 2004; Qiu et al. 2005; Qiu et al. 2007). See text for reference to lettered branches

Some plastomes, however, appear to have been destabilized (structurally reorganized) relatively recently in a few clades of angiosperms, including Geraniaceae (Chumley et al. 2006), Campanulaceae (Haberle et al. 2008), Fabaceae (Cai et al. 2008; Milligan et al. 1989), and some lineages where photosynthetic function has been lost (Funk et al. 2007; Wickett et al. 2008; Wolfe et al. 1992). Plastome structure has also undergone extensive rearrangement in a large group that includes about 90% of extant fern species. Because ferns (including horsetails) hold a critical phylogenetic position as the extant sister group to seed plants (Pryer et al. 2001), understanding the organization and evolution of fern plastomes can provide useful information for comparative studies across land plants. In this paper we review our current understanding of fern plastome organization, function, and evolution. We describe some new strategies for obtaining plastome sequences, and present new data on several fern plastomes, as well as new analyses on relative substitution rates of plastid genes.

Over one hundred complete plastome sequences are published in GenBank for seed plants, yet only five (at the time of writing) are available for the sister group to seed plants: the ferns. In addition to being an important phylogenetic contrast for studies of genomics, cell biology, reproductive biology, and morphology, ferns are also a major component of the earth’s land flora, with over 11,000 species occurring in varied ecological niches—especially the tropics (Smith et al. 2006). Thus, a better balance in the availability of genome-scale data could have wide utility. Figure 1 depicts our current understanding of vascular plant relationships, and we emphasize the major groups of ferns and the relationship of ferns to seed plants.

Methods of plastome sequencing

Early approaches to sequencing complete plastomes involved the isolation of plastids (followed by DNA extraction) or the purification of plastid DNA from total genomic DNA. Plastid DNA could then be digested with restriction endonucleases, cloned, and sequenced, first with vector primers and then by primer walking. These approaches were slow and expensive, but generally reliable. Researchers sought more efficient techniques, focusing on the early stages of separating the plastids or their DNA. Chloroplasts and mitochondria can be sorted at the cellular level with fluorescence-activated cell sorting (FACS), an approach that was used to sequence the plastome of the lycophyte Huperzia (Wolf et al. 2005). An alternative approach is to shotgun-clone genomic DNA into fosmids and probe for clones containing potential plastid genes (McNeal et al. 2006; Wickett et al. 2008). Complete plastome sequences have also been obtained from DNA amplified in long PCR reactions (Goremykin et al. 2003) or by whole-genome amplifications using Rolling Circle Amplification (RCA, Jansen et al. 2005). Details of the aforementioned approaches, and other techniques, are described extensively by Jansen et al. (2005). We are exploring a new approach that exploits second-generation DNA sequencing technology. This approach is technically the simplest and probably the most widely applicable because it does not rely on strategies for purifying plastid DNA or long PCR amplification, which can be taxon-specific. As a test case, we used the leptosporangiate fern Cheilanthes lindheimeri Hook., which is native to the southwestern United States and Mexico (Grusz et al. 2009). The plant was collected in Arizona, USA, by E. Schuettpelz (collection number 450) and a voucher specimen is deposited in the Duke Herbarium (accession number 391417; http://www.pryerlab.net/DNA_database.shtml). DNA was extracted with the DNEasy Plant Mini Kit (Qiagen, Valencia, California, USA).
We employed whole-genome shotgun sequencing using the Roche 454 GS-FLX Titanium platform to determine the complete plastome sequence of C. lindheimeri using a combination of de novo and reference-guided assembly strategies, combined with bioinformatic filtering to remove “contaminant” nuclear and mitochondrial sequences in silico. Our approach differs from most studies using second-generation sequencing to determine complete plastome sequences (Cronn et al. 2008; Moore et al. 2006) in that we did not first isolate plastid DNA from nuclear or mitochondrial DNA.

Total genomic DNA was sequenced on 1/4 of a picoTiter plate to obtain 234,428 reads averaging 355 bp long, representing a total of 83.26 Mbp of sequence data. Reads were assembled de novo using MIRA (Chevreux et al. 1999; Chevreux et al. 2004) and putative plastid-encoded contigs were identified with NCBI blastn by querying the assembly with previously published plastome sequences from GenBank, using an e-value threshold of 1e-4. Additionally, reference-guided contigs were generated using the YASRA pipeline (Ratan 2009). Putative plastid-encoded contigs from both de novo and reference-guided approaches were collected and assembled to generate a draft plastome sequence. This sequence was then used as a scaffold to identify and correct assembly errors by mapping 454-sequence reads to the draft plastome sequence using Roche GS Mapper v.2.3 software. The resulting high quality, read-supported contigs were used to produce the final complete plastome sequence in the standard circular orientation typically presented, beginning with the large single copy region (LSC) followed by the first inverted repeat (IR), the small single copy region (SSC), and the second inverted repeat. The complete plastome was assembled from 6,818 reads (2.91% of the total reads) and maps to a circle of 155,770 bp with two copies of the IR. The Mauve v.2.2.0 plugin (Darling et al. 2004) for Geneious Pro v.5.0.1 (Drummond et al. 2007) was used to generate a whole plastome alignment of Cheilanthes against three published plastomes: Adiantum, Alsophila, and Pteridium. This alignment was used to annotate the Cheilanthes plastome sequence (GenBank # HM778032). A circular gene map for the annotated sequence (Supplementary Figure 1) was generated in OGDRAW (Lohse et al. 2007). The gene order of the Cheilanthes plastome is the same as that of the fern Adiantum, which is not surprising, given that they are in the same family, Pteridaceae.
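The contig-filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the Cheilanthes assembly: it assumes the blastn results were written in the standard 12-column tabular format (`-outfmt 6`), with the published plastomes as queries and the assembled contigs as subjects, and the function name `plastid_contigs` and the example rows are hypothetical.

```python
import csv
import io


def plastid_contigs(blast_tabular: str, max_evalue: float = 1e-4) -> set[str]:
    """Return IDs of contigs with at least one hit at or below max_evalue.

    Assumes blastn tabular output (-outfmt 6): 12 tab-separated columns,
    subject (contig) ID in column 2 and e-value in column 11.
    """
    hits = set()
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        if float(row[10]) <= max_evalue:
            hits.add(row[1])
    return hits


# Hypothetical rows: two strong hits and one weak hit that is filtered out.
example = "\n".join([
    "\t".join(["Adiantum", "contig_1", "95.2", "800", "30", "2",
               "100", "899", "1", "800", "0.0", "1200"]),
    "\t".join(["Adiantum", "contig_2", "80.0", "40", "8", "0",
               "5", "44", "10", "49", "0.5", "30.1"]),
    "\t".join(["Alsophila", "contig_3", "90.1", "300", "25", "1",
               "1", "300", "50", "349", "2e-80", "310"]),
])

print(sorted(plastid_contigs(example)))
```

Contigs retained by such a filter are only putative plastid sequences. Plastid-derived fragments also occur in nuclear and mitochondrial genomes, so the read-mapping check described above remains necessary.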

Our approach allowed us to avoid the difficult and laborious sample-preparation protocols needed to isolate pure chloroplast DNA or to PCR-amplify the complete genome in overlapping fragments. In addition, we obtained valuable sequence information from the nuclear and mitochondrial genomes in this previously uncharacterized lineage. Compared to other approaches (including all those listed above) bioinformatic extraction of plastome sequences is fast and simple. Most of the steps are conveniently automated. The only steps requiring significant intervention are locating the IR and setting the sequence to “start” at the beginning of the LSC (a mere convention). These manual steps are inherent to all approaches to plastome sequence assembly. Estimating monetary costs is difficult, but as the prices of standard library preparation and second-generation DNA sequencing continue to drop, it is likely that the approach we describe (or similar ones) will become the most efficient solutions for plastome sequencing. Indeed, a similar approach has been used to extract plastid genome sequences from massively parallel sequencing projects (Nock et al. 2010).
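Because the plastome is circular, setting the conventional "start" is just a rotation of the assembled sequence. The sketch below assumes that the position of the first base of the LSC has already been located by inspection. The function name `rotate_circular` and the toy sequence are hypothetical.

```python
def rotate_circular(seq: str, lsc_start: int) -> str:
    """Rotate a circular sequence so that the 0-based index lsc_start
    becomes the first position of the linear representation."""
    if not seq:
        return seq
    k = lsc_start % len(seq)
    return seq[k:] + seq[:k]


# Toy "genome" in which each region is spelled out as a label.
assembled = "SSC-IRa-LSC-IRb-"
print(rotate_circular(assembled, assembled.index("LSC")))
```

The rotation changes only the linear representation. The circular molecule, and therefore gene order and adjacency, is unaffected.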

The evolution of plastome organization in ferns

At the nucleotide level, DNA tends to evolve in a clock-like mode within a limited window of evolutionary time, although nucleotide substitution rates vary across the genome and can differ among and within lineages. This clock-like aspect of nucleotide substitution provides a powerful tool for model-based analyses to infer phylogenetic relationships among sequences. However, over long periods of time it becomes difficult to infer changes because multiple substitutions may have occurred at the same site, resulting in “saturation”. Thus, it is often difficult to infer the order of early-diverging lineages on a phylogenetic tree. This is where non-clocklike characters can be useful. It has been suggested that because changes in plastome structure are so infrequent, rare rearrangements should be ideal for inferring ancient divergences (Karol et al. 2010; Raubeson and Jansen 1992). Several such markers have proven to be useful in this way near the base of the vascular plant phylogeny (Raubeson and Jansen 1992) and in a large clade of mosses (Goffinet et al. 2007). In the ferns, we see a similar pattern with five inversions that can be mapped across the phylogeny. One ~3.3 kb inversion in the LSC is shared among all ferns, supporting branch A of Fig. 1 (Karol et al. 2010). Two large overlapping inversions (18 kb and 21 kb) in the region of the IR support branch C, and two smaller overlapping inversions in the LSC region support branch D (Wolf et al. 2010). These findings are based on recent studies of complete plastome sequences (Karol et al. 2010) and mapping of partial plastome regions (Wolf et al. 2009; Wolf et al. 2010). The latter studies will require verification with complete plastome sequences. The pair of inversions on branch C resulted in the unusual plastome organization that is typical for most ferns, with the rRNA genes occurring in the reverse order compared to all other plants. Branch C subtends a clade that includes about 90% of fern species (Pryer et al. 2004), so most ferns have an “inverted” IR. It is difficult to infer process from such a small sample size, but the clustering of inversions in the same genomic region on the same phylogenetic branch suggests that something is acting to maintain genome organization.

If an inversion occurs at some particular point during evolution, there could then be selection pressures to revert to the ancestral structure. What features of the plastome would operate to maintain such an organization and resist change? One possibility is that a specific gene order is essential for gene expression. This could be related to the complex patterns of RNA processing observed for the plastome (Stern et al. 2010). Transcription of plastid genes is neither strictly eukaryotic nor prokaryotic, which is presumably a function of the plastome’s endosymbiotic origins. Thus, most primary transcripts consist of several genes that are sometimes degraded into multiple mature RNAs (Stern et al. 2010). The positioning of plastid promoters could put constraints on which genes are adjacent and on the same strand. Consistent with this hypothesis is the non-random tendency of adjacent genes to be on the same strand and to be co-transcribed, even in the highly rearranged plastome of Chlamydomonas reinhardtii (Cui et al. 2006). Because the inversions in fern plastomes are paired and partially overlapping, changes in the strandedness of adjacent genes are minimized, as compared to single inversions. It is possible that paired inversions occur regularly, but they leave no trace. It is only when the endpoints of the second inversion are not the same as the first that we can detect the evidence.
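The argument about paired inversions can be illustrated with a toy model in which a plastome is a list of signed genes, +1 or -1 for strand. The gene names, endpoints, and helper functions below are hypothetical and chosen only to show the logic. They are not a reconstruction of the actual fern inversions.

```python
Gene = tuple[str, int]  # (name, strand), with strand +1 or -1


def invert(order: list[Gene], start: int, end: int) -> list[Gene]:
    """Invert the segment order[start:end]: reverse gene order, flip strands."""
    segment = [(name, -strand) for name, strand in reversed(order[start:end])]
    return order[:start] + segment + order[end:]


def flipped_genes(order: list[Gene], reference: list[Gene]) -> int:
    """Count genes whose strand differs from the reference arrangement."""
    ref = dict(reference)
    return sum(1 for name, strand in order if ref[name] != strand)


ancestral = [(g, +1) for g in "ABCDEFGH"]

single = invert(ancestral, 2, 6)    # one inversion spanning C..F
same_twice = invert(single, 2, 6)   # second inversion with identical endpoints
paired = invert(single, 3, 7)       # second, partially overlapping inversion

print(same_twice == ancestral)             # identical endpoints leave no trace
print(flipped_genes(single, ancestral))    # genes on a new strand after one inversion
print(flipped_genes(paired, ancestral))    # fewer after the overlapping pair
print("".join(name for name, _ in paired)) # but gene order is visibly rearranged
```

In this toy case the single inversion places four genes on the opposite strand, whereas the overlapping pair leaves only two genes flipped while still changing gene order. Two inversions with identical endpoints restore the ancestral arrangement exactly and would be undetectable.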

Variation in substitution rates among genes and lineages

Nucleotide substitution rates vary within genes, among genes, and across lineages (Drouin et al. 2008; Muse and Gaut 1997; Muse 2000). Knowledge of the extent of this variation can aid in choosing among genes, or other genomic regions, for markers that are appropriate for inferring phylogenetic relationships at different time scales. The ratios of synonymous to nonsynonymous substitution rates can also be used to examine signatures of past selection on protein-coding genes. Here we compile a preliminary assessment of relative evolutionary rates for ferns and compare them to their sister group, the seed plants.

In addition to the newly sequenced and assembled plastomes of Pteridium (HM535629; Der 2010), Equisetum (Karol et al. 2010), and Cheilanthes (this paper), we selected plastome records from those available in GenBank (Supplementary Table 1) to represent major lineages of ferns (7 taxa), seed plants (9 taxa), and lycophytes (2 taxa—as an outgroup). Seed plants were selected to represent major lineages and avoid taxa with unusually elevated substitution rates, such as Gnetales (McCoy et al. 2008). There are differences in the timing of diversification of various seed plant and fern lineages (Bell et al. 2010; Schneider et al. 2004), but taxa were chosen so that the estimated divergence times for the seed plants included in the analysis are as proximate as possible to those for the ferns. Protein-coding DNA sequences were extracted from all 18 plastome records. For each gene (excluding pseudogenes), either a nucleotide alignment was made using MUSCLE (Edgar 2004), or a translation alignment was made using MUSCLE as implemented in Geneious v.5.0.3 (Drummond et al. 2007). Each alignment was inspected and manually refined, and errors resulting from mis-annotated features in the GenBank records were corrected. Most genes were present in all 18 taxa, but in order to include as many as possible, a gene was only included in the final analysis if it was present in at least one of the outgroup taxa and several representatives of both the seed plants and ferns.

We used PAML (Yang 1997) to estimate synonymous (dS) and nonsynonymous (dN) substitution rates, as well as the dN/dS ratio, for each of 79 genes separately, using a model that allows substitution rates to vary among lineages. We used constraint trees (Supplementary Figure 2, including only the taxa possessing a particular gene) based on published phylogenies (Pryer et al. 2004; Qiu et al. 2005; Qiu et al. 2007). Sequences for some taxa included internal stop codons due to known or suspected RNA editing. PAML does not accept internal stop codons; therefore, these positions were removed from the alignments. Average synonymous and nonsynonymous substitution rates, as well as ratios, were calculated for each gene and compared among seed plants, ferns, and the lycophyte outgroup. Wilcoxon rank-sum tests were used to determine whether differences observed between ferns and seed plants were statistically significant (Table 1).
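Removal of stop-codon positions before a PAML run could be done along the following lines. This sketch makes several assumptions that are not stated in the text: the alignment is an in-frame codon alignment held as a dictionary of equal-length strings, the whole codon column is dropped for all taxa whenever any sequence carries a stop codon there, and the stop codons are TAA, TAG, and TGA. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def drop_stop_codon_columns(alignment: dict[str, str]) -> dict[str, str]:
    """Remove every codon column in which any sequence has a stop codon.

    Assumes an in-frame codon alignment: all sequences share one length
    that is a multiple of three.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    length = lengths.pop()
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")

    keep = [
        i for i in range(0, length, 3)
        if not any(seq[i:i + 3].upper() in STOP_CODONS
                   for seq in alignment.values())
    ]
    return {name: "".join(seq[i:i + 3] for i in keep)
            for name, seq in alignment.items()}


# Hypothetical alignment: the fern sequence has an unedited internal TAA.
toy = {"fern": "ATGTAAGGC", "seed_plant": "ATGCAAGGC"}
print(drop_stop_codon_columns(toy))
```

Dropping the column for every taxon keeps the alignment in frame, at the cost of discarding information from the taxa that do not have a stop codon at that position.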

Table 1 Rates of synonymous and nonsynonymous substitutions in ferns versus seed plants

For nonsynonymous rates, 24 of the 79 genes showed significant differences between ferns and seed plants, with ferns faster than seed plants in 23 cases; psbA was the only gene significantly faster in seed plants (Table 1, Fig. 2). For synonymous rates, all 79 genes were faster in ferns, and 61 genes showed significant differences (Table 1, Fig. 3). Seed plants have higher values of the dN/dS ratio for each gene because the differences are greater for the denominator, dS, than for dN.
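The effect of the denominator on the ratio can be seen with a small worked example. The rates below are invented for illustration and are not taken from Table 1. They are chosen so that ferns are faster for both dN and dS, but proportionally much faster for dS.

```python
def dn_ds(dn: float, ds: float) -> float:
    """Return the dN/dS ratio for one gene in one group of taxa."""
    if ds <= 0:
        raise ValueError("dS must be positive")
    return dn / ds


# Hypothetical rates (substitutions per site), for illustration only.
fern_ratio = dn_ds(dn=0.06, ds=0.60)   # ferns: faster dN and much faster dS
seed_ratio = dn_ds(dn=0.05, ds=0.25)   # seed plants: slower for both

print(f"ferns: {fern_ratio:.2f}, seed plants: {seed_ratio:.2f}")
# prints "ferns: 0.10, seed plants: 0.20"
```

In this example ferns have the higher dN, yet the seed plant ratio is twice as large because the fern dS is 2.4 times the seed plant dS. A higher dN/dS ratio in seed plants therefore need not reflect weaker purifying selection.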

Fig. 2

Estimated nonsynonymous substitution rates for ferns and seed plants for 79 plastid genes

Fig. 3

Estimated synonymous substitution rates for ferns and seed plants for 79 plastid genes

We found that for our sample of taxa, substitution rates for plastid genes were higher in ferns than in seed plants (Figs. 2–4), consistent with previous findings for rbcL (Smith et al. 2001; Yatabe et al. 1998). This is a very general statement that ignores the considerable variation among lineages that exists within each of these clades. Such variation has been documented for ferns (Korall et al. 2010; Schuettpelz and Pryer 2006; Schuettpelz and Pryer 2007), and is known to be quite high within some lineages of seed plants (Guisinger et al. 2010). Analyses such as the one we present here are very sensitive to taxon sampling. However, we deliberately excluded seed plant and fern lineages with known accelerated substitution rates, so we have no reason to suspect a strong bias in our data set. As we accumulate more fern plastome sequences, we hope to refine our understanding of the differences between fern and seed plant substitution rates.

Fig. 4

Estimated dN/dS ratios for ferns and seed plants for 79 plastid genes

Analysis of RNA editing

RNA editing is the post-transcriptional modification of RNA molecules relative to their encoding DNA sequences. RNA editing in land plant organelles occurs in the form of pyrimidine exchanges in mitochondrial and plastid transcripts. In general, levels of RNA editing are higher in mitochondrial genomes than in plastomes (Sugiura 2008; Takenaka et al. 2008), with some of the highest rates observed in seed-free vascular plants and hornworts (Kugita et al. 2003; Tillich et al. 2006). Traditionally, RNA editing in plastomes has been systematically examined by PCR amplification and sequencing of transcripts for individual genes, and has only been examined genome-wide for two seed-free plants: a fern, Adiantum capillus-veneris (Wolf et al. 2004) and a hornwort, Anthoceros formosae (Kugita et al. 2003). Patterns across these taxa and a seed plant (Arabidopsis) with typical levels of RNA editing (Chateigner-Boutin and Small 2007) are provided in Table 2. We have since developed a new, rapid method to identify novel RNA editing sites using second-generation high-throughput transcriptome sequencing and bioinformatic analyses (Der 2010).

Table 2 Rates and types of plastome RNA editing in an example from each of three main lineages: ferns (Wolf et al. 2004), hornworts (Kugita et al. 2003), and seed plants (Chateigner-Boutin and Small 2007)

A normalized transcriptome library of the fern Pteridium aquilinum was prepared and sequenced using the Roche 454 GS-FLX Titanium platform (Der 2010). Transcript reads were mapped to the complete plastome sequence (GenBank # HM535629) using Roche’s GS Mapper v.2.3 software. Discrepancies between the genome and transcript sequences were used to identify putative RNA editing sites using custom Perl scripts (Der 2010). The location and consequence of each putative RNA editing site were examined in the context of gene annotations for the complete plastome. The details of these methods and our results are being prepared for publication elsewhere. However, preliminary scans for RNA editing in the plastid transcriptome of P. aquilinum identified over 1,000 putative RNA editing sites, the highest number yet detected for a plastome. Similar to Anthoceros, a large number of U to C RNA editing events were detected in addition to the more abundant C to U RNA editing events. RNA editing occurs in all types of plastid genes, including protein-coding, tRNA, and rRNA genes, as well as intergenic and intron regions. The majority of RNA editing sites occur in protein-coding sequences, with over 60 RNA editing events modifying the start or stop codons of the resulting mRNA sequences.
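The core comparison behind this kind of scan can be sketched as follows. This is not the custom Perl code of Der (2010). It is a simplified Python illustration that assumes a genome segment and a transcript consensus of equal length, already aligned position by position on the sense strand, with the transcript written as cDNA so that U appears as T. The function name and toy sequences are hypothetical.

```python
def putative_editing_sites(genome: str, transcript: str) -> list[tuple[int, str]]:
    """List 1-based positions where genome and transcript differ by a
    pyrimidine exchange, labelled as C-to-U or U-to-C editing.

    Other mismatches are ignored here; in real data they would more likely
    reflect sequencing or mapping errors than RNA editing.
    """
    if len(genome) != len(transcript):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for pos, (g, t) in enumerate(zip(genome.upper(), transcript.upper()), start=1):
        if g == "C" and t == "T":
            sites.append((pos, "C-to-U"))
        elif g == "T" and t == "C":
            sites.append((pos, "U-to-C"))
    return sites


# Toy example: editing creates an ATG start codon and removes a TAA stop codon.
genome_seq = "ACGGCTTAA"
transcript_seq = "ATGGCTCAA"
print(putative_editing_sites(genome_seq, transcript_seq))
```

A real analysis would also need to require a minimum read depth and a minimum fraction of edited reads at each site, and to account for genes encoded on the reverse strand.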

Results from several studies indicate that most seed plant plastomes have about 30–40 RNA editing sites (Stern et al. 2010). Studies on plastomes from other plant groups are limited, but it appears that RNA editing is highest in hornworts and intermediate in ferns (Table 2). However, our results here suggest that levels of RNA editing may vary greatly within each of these large clades of plants, and some fern plastomes may be edited as much as some hornworts. Our high-throughput sequencing and bioinformatic approach enabled the rapid identification of novel chloroplast RNA-editing sites without relying on previous information, while also avoiding the expensive and labour-intensive steps of PCR and sequencing for each predicted mRNA transcript. Thus, the application of this approach should allow us to examine more details of RNA editing and its variation across groups of seed-free plants.

Relationship of plastome organization to gene function

Although the gene content of land plant plastomes is largely conserved, there are some differences among taxa (Supplementary Table 1), and therefore the absence of tRNA-Lys-UUU (trnK) and its intron in the ferns Adiantum (Wolf et al. 2003), Alsophila (Gao et al. 2009), and Cheilanthes (this paper) is not particularly surprising. What is interesting is that matK is still present in these genomes. In most land plant genomes, matK and the trnK intron are closely associated and are thought to have co-evolved (Toor et al. 2001). The maturase K protein is usually encoded within the trnK intron and is required for splicing trnK (Vogel et al. 1997; Vogel et al. 1999).

In most fern plastomes (and those of a few other land plants) that lack trnK, matK is still present (Ems et al. 1995; Funk et al. 2007; Wolf et al. 2003; Wolf et al. 2004; Wolfe et al. 1992), suggesting a function for matK beyond merely splicing trnK. A lack of evidence for a shift in selective constraints on matK in ferns without trnK (Duffy et al. 2009) and in vitro RNA-binding of matK with other introns (Liere and Link 1995) together suggest that matK serves as a generalist maturase for several introns in the chloroplast genome (Zoschke et al. 2010).

The genes matK and trnK are located near the endpoint of one of the large inversions that led to the reorganized fern chloroplast gene order (branch C, Fig. 1). This seems to suggest that trnK was lost when one of the exons was disrupted by the inversion, as appears to be occurring in Jasminum (Lee et al. 2007). Furthermore, trnK is not maintained through trans-splicing in ferns (Duffy et al. 2009; Wolf et al. 2004). However, recent analyses (Wolf et al. 2010) placing the inversions and loss of trnK in a phylogenetic context indicate that trnK was lost on branch B (of Fig. 1) before the inversion occurred (branch C). Thus, the loss of trnK cannot have been caused by the inversion, and the two events are probably unrelated.

Summary and prospects

As the sister group to seed plants, ferns are an essential point of reference for comparative analyses in land plants. Fern plastomes have undergone a series of inversions, each of which marks an important branch on the fern phylogeny. Fern plastomes also undergo high levels of RNA editing, a trait that occurs at a much lower frequency in seed plants. Based on a small and preliminary sample, fern plastid genes appear to evolve at a faster rate than those in seed plants; however, this will need to be investigated further. With the application of second-generation DNA sequencing tools, the rate of discovery in all fields of plant molecular biology is rapidly increasing. This should enable plant biologists working on seed-free plant groups to catch up with the current level of understanding of plastomes in seed plants.