Introduction

The chloroplast is a plant organelle containing the entire enzymatic machinery for photosynthesis; it is thought to have evolved from ancient endosymbiotic cyanobacteria. Chloroplast genes are responsible for >50% of the total leaf soluble protein encoded by both nuclear and chloroplast genomes. Over the past two decades, the plastid genome and its structure, expression, and evolution have been extensively studied using molecular methods (Sugiura 1992; Wakasugi et al. 2001). Rapid progress in this area has been driven by a steady increase in the number of newly sequenced chloroplast genomes. The complete nucleotide sequences of the chloroplast genomes are available for more than 40 plants, including 2 ferns, Psilotum nudum and Adiantum capillus-veneris; 2 gymnosperms, Pinus thunbergii and Pinus koraiensis; and more than 20 dicots, including the putative basal angiosperms (Soltis et al. 1999) Amborella trichopoda (Goremykin et al. 2003a), Calycanthus floridus (Goremykin et al. 2003b), and Nymphaea alba (Goremykin et al. 2004). Among monocots, which have been suggested to be the most ancient branch of angiosperms (Goremykin et al. 2003a), genome sequences are available for six species: Acorus calamus (Goremykin et al. 2005) (Acoraceae), Phalaenopsis aphrodite (Chang et al. 2006) (Orchidaceae), and four grasses (Poaceae): Oryza sativa (Hiratsuka et al. 1989), Triticum aestivum (Ogihara et al. 2002), Zea mays (Maier et al. 1995), and Saccharum officinarum (Asano et al. 2004).

Physical mapping studies and available sequence data revealed that chloroplast genomes of most land plants are highly conserved with respect to their size, ranging from 120 to 217 kb, and structure (Palmer et al. 1987). The presence of a large inverted repeat (IR), which ranges from 5 to 76 kb in length (Palmer 1991), is one of the conserved structural features of chloroplast genomes. Most of the size variation between genomes can be accounted for by variation in the size of the IR and of intergenic spacers (Wakasugi et al. 2001; Raubeson and Jansen 2005). The IR consists of two identical segments, IRA and IRB, which are typically 10–25 kb long but may range from 6 to 76 kb (Palmer 1985; Shinozaki et al. 1986). The repeated segments are separated by long single-copy (LSC) and small single-copy (SSC) regions. The completely sequenced algal and plant chloroplast genomes contain from 63 to 209 genes, with an average number of 110–130 (Jansen and Palmer 1987). The gene content and the polycistronic transcription units of the chloroplast genome are also conserved among the majority of vascular plant species (Kim and Lee 2004). The gene order in the chloroplast genome is relatively conserved but is sometimes disturbed by inversions that can be mediated by intramolecular recombination or by multiple extensions or reductions of the IR sequences (Perry et al. 2002).

In this work, we present the complete sequence of the cpDNA (165,955 bp; at least 3 and up to 15 reads for each base pair) of the duckweed Lemna minor, a floating aquatic plant belonging to the monocot family Lemnaceae. L. minor is one of the most primitively organized flowering plants. Although the Lemnaceae have long been associated with the Araceae (Les et al. 2002), relationships between the Lemnaceae and other monocots remain uncertain (Mayo et al. 1997). We compared Lemna minor cpDNA with available cpDNA sequences of vascular plants to review the evolutionary modes of chloroplast genomes, with particular emphasis on the evolution of chloroplast genomes of monocots.

Duckweed is a promising organism for biotechnological applications: it can be used as an efficient gene expression system. Lemna is one of the fastest-growing higher plants, doubling its biomass every 1.5 days; it achieves this high growth rate through clonal proliferation. Protein averages 35% of the dry weight of Lemna biomass. Transgenic duckweed can be obtained using an Agrobacterium-mediated method (Yamamoto et al. 2001). Given the recent improvements in transplastomic techniques, duckweed chloroplast transformation should become possible in the near future. Knowledge of the nucleotide sequence of the Lemna minor chloroplast genome would therefore be helpful in constructing expression cassettes for the stable expression of heterologous proteins upon chloroplast transformation (Maliga 2002).

Materials and Methods

The Lemna minor specimen was obtained from its natural habitat. Its identity as Lemna minor was confirmed by morphological analysis followed by sequencing of the trnL-trnF chloroplast intergenic spacer, which can be used to discriminate between different Lemna species (Rothwell et al. 2004). About 40 g of green tissue was harvested after vegetative propagation of a single duckweed plant over 3 months. Total DNA was extracted using the CTAB-based method (Murray and Thompson 1980) and purified by electrophoresis in low-melting-point agarose.

The fragments of chloroplast genomic DNA were amplified by PCR. In brief, PCR primers were designed using the alignment of known chloroplast genomic sequences of angiosperms (sequences of primers are available online as supplementary material). Using these primers, we covered the entire chloroplast genome of Lemna minor with overlapping PCR fragments ranging in size from 1 to 8 kb. Each fragment was sequenced independently. Some fragments, nonamplifiable with the initial “consensus” primers, were amplified with primers designed based on the newly determined sequences of adjacent regions. Automated sequencing was performed on ABI 3100 and ABI 3730 sequencers using the Big Dye Terminator v.3.1 sequencing kit (ABI, USA). The sequence fragments were assembled using the Gene Studio program (http://www.genestudio.com). All fragments were sequenced 3–15 times (6 times on average). The GenBank accession number for the nucleotide sequence determined in this study is DQ400350.

Structural RNA genes were identified by BLAST search (Altschul et al. 1990) against the GenBank database. The tRNAscan-SE program (Lowe and Eddy 1997) was applied for the assignment of tRNA genes. Gene annotations were performed using the chloroplast annotation package DOGMA (Wyman et al. 2004) (http://phylocluster.biosci.utexas.edu/dogma/). The correctness of the annotation for all genes was additionally verified by similarity search against the available plant chloroplast genome sequences. Sequence alignments were performed using the ClustalX software package (Thompson et al. 1997).

Our phylogenetic analysis did not include all available plastid genomes: because the position of Lemna among monocots was of main interest in this study, representatives of eudicots were restricted to 18. A set of 61 protein-coding genes from 38 plastid genome sequences (Supplementary Table S2) was collected. Nucleotide sequences of the genes were checked for frameshift mutations and corrected when necessary, then translated into amino acid sequences, which were aligned using MUSCLE ver. 3.6 (Edgar 2004) with manual correction. Nucleotide sequences were aligned according to the aligned amino acid sequences.
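The last step above, aligning nucleotide sequences according to the amino acid alignment, can be illustrated with a minimal sketch. The function name `thread_codons` and the toy sequences are hypothetical and are not taken from the original pipeline; the sketch assumes an ungapped coding sequence in frame with its protein and "-" as the gap character.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map an unaligned coding sequence onto its aligned protein sequence.

    Each residue receives its codon; each protein gap becomes a three-base gap.
    """
    # Split the coding sequence into complete codons (a trailing partial codon is ignored).
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues > len(codons):
        raise ValueError("coding sequence is shorter than the protein")

    out = []
    pos = 0  # index of the next unused codon
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(codons[pos])
            pos += 1
    return "".join(out)


# Toy example (hypothetical data): one gap in the protein alignment.
print(thread_codons("M-KL", "ATGAAACTT"))  # ATG---AAACTT
```

Threading codons in this way keeps the reading frame intact, so codon positions can later be selected or excluded by simple indexing.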

Columns in which more than one-third of the positions were gaps were excluded from the analysis. The 5% chi-square test of nucleotide or amino acid compositional homogeneity and the tests of alternative topologies were performed with the Tree-Puzzle program (Schmidt et al. 2002). Phylogenetic analyses using the maximum parsimony (MP) method were performed using PAUP* ver. 4.0b10 (Swofford 2003). Bayesian inference of phylogeny was explored using the MrBayes program ver. 3.1.2 (Ronquist and Huelsenbeck 2003).
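The gap-exclusion rule can be sketched as follows. This is only an illustration under stated assumptions: the function name `drop_gappy_columns` is hypothetical, the alignment is a list of equal-length strings, and "-" is the only gap character.

```python
def drop_gappy_columns(alignment: list[str]) -> list[str]:
    """Remove columns in which more than one-third of the sequences have a gap."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    keep = []
    for col in range(length):
        gaps = sum(1 for seq in alignment if seq[col] == "-")
        # Integer comparison avoids floating-point issues at exactly one-third.
        if gaps * 3 <= n_seqs:
            keep.append(col)
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment (hypothetical data): the third column has 2 gaps out of 3 and is dropped.
print(drop_gappy_columns(["AC-T", "A--T", "ACGT"]))  # ['ACT', 'A-T', 'ACT']
```

A column with exactly one-third gaps is retained here, because the rule excludes only columns with more than one-third gapped positions.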

MP analysis involved a heuristic search using TBR branch swapping and 100 random addition replicates. Nonparametric bootstrap analysis (Felsenstein 1985) was performed with 100 replicates with TBR branch swapping.

A Bayesian approach was applied to both the amino acid and the nucleotide data sets. The amino acid data set was divided into 61 partitions, and for each partition the most appropriate substitution model was determined by the BIC in Modelgenerator ver. 0.43 (Keane et al. 2006). The models CPREV and JTT, with rate variation among sites (+Γ) and/or invariable sites (+I) in some partitions, were chosen. For one partition the MTREV24 model was specified (see details in Supplementary Table S3). The Bayesian inference was performed with two runs of three chains each; 2,500,000 generations were run, and trees were sampled every 100 generations. The proportion of invariable sites and the shape of the gamma distribution of the rates were unlinked across the partitions. The number of discarded trees was determined using the convergence diagnostic.

For the nucleotide data set, both partitioned and unpartitioned approaches were applied. In the partitioned analysis, the data set was divided into 61 partitions. Nucleotide frequencies and parameters of the substitution matrix were unlinked across the partitions. For each partition the most appropriate model of nucleotide substitution was determined by the AIC in Modeltest ver. 3.7 (Posada and Crandall 1998). The models GTR+Γ, HKY+Γ, K2P+Γ, and SYM+Γ, with invariable sites (+I) for some partitions, were chosen. In the unpartitioned analysis the GTR+I+Γ model was used. Bayesian analysis of nucleotide sequences was performed with two runs of three chains each. Four million generations were run, and trees were sampled every 100 generations.

To achieve nucleotide compositional homogeneity, the third codon positions were excluded and the representatives of Fabales (Medicago, Lotus, Glycine) were deleted from the data matrix. The most appropriate model of nucleotide substitution for the whole data set was then determined, and the Bayesian analysis was repeated.
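The data reduction described above can be sketched in a few lines. The function name `reduce_matrix` and the toy sequences are hypothetical; the sketch assumes a codon-aligned matrix stored as a dictionary mapping taxon names to in-frame sequences.

```python
def reduce_matrix(matrix: dict[str, str], excluded_taxa: set[str]) -> dict[str, str]:
    """Drop the listed taxa and remove every third codon position from the rest."""
    reduced = {}
    for name, seq in matrix.items():
        if name in excluded_taxa:
            continue
        # Positions 0 and 1 of each codon are kept; position 2 (the third) is skipped.
        reduced[name] = "".join(ch for i, ch in enumerate(seq) if i % 3 != 2)
    return reduced


# Toy matrix (hypothetical sequences); the excluded genera are those named in the text.
toy = {"Lemna": "ATGAAACTT", "Acorus": "ATGAAGCTA", "Medicago": "ATGAAACTC"}
print(reduce_matrix(toy, {"Medicago", "Lotus", "Glycine"}))
# {'Lemna': 'ATAACT', 'Acorus': 'ATAACT'}
```

In the toy data the two retained sequences differ only at third positions, so they become identical after reduction, which illustrates how much signal third positions can carry.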

Results and Discussion

The Overall Structure and Gene Pool of the L. minor Chloroplast Genome

The L. minor chloroplast genome includes a pair of inverted repeats of 31,223 bp (IRA and IRB) separated by 13,603-bp-long SSC and 89,906-bp-long LSC. The total genome size is 165,955 bp; thus, it is one of the largest chloroplast genomes among recently sequenced ones.
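As a quick consistency check, the reported total can be recomputed from the region lengths given above, counting the inverted repeat twice. The variable names are chosen here for illustration only.

```python
# Region lengths in bp, as reported in the text.
IR_BP = 31_223    # one copy of the inverted repeat (IRA or IRB)
SSC_BP = 13_603   # small single-copy region
LSC_BP = 89_906   # large single-copy region

# The circular genome consists of LSC + SSC + two IR copies.
total_bp = LSC_BP + SSC_BP + 2 * IR_BP
ir_fraction = 2 * IR_BP / total_bp

print(total_bp)               # 165955
print(f"{ir_fraction:.1%}")   # 37.6%
```

The two IR copies together account for roughly 38% of the genome, which is consistent with the IR being a major contributor to the large genome size.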

The overall A+T content of L. minor cpDNA is 64.3%, a value similar to those of tobacco (62.2%), A. thaliana (63.7%), rice (61.1%), and maize (61.5%). The A+T contents of the LSC and SSC regions were 66.5% and 69.9%, respectively, whereas that of the IR regions was 59.9%. The lower A+T contents of the IR regions reflect the low A+T content of rRNA genes in this region.
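The regional values can be cross-checked against the overall figure with a length-weighted average. This is a rough sketch: it assumes both IR copies are counted and uses the rounded percentages from the text, so only approximate agreement can be expected.

```python
# (length in bp, A+T fraction) for each region, as reported in the text.
regions = {
    "LSC": (89_906, 0.665),
    "SSC": (13_603, 0.699),
    "IRA": (31_223, 0.599),
    "IRB": (31_223, 0.599),
}

total_bp = sum(length for length, _ in regions.values())
# Length-weighted mean of the regional A+T fractions.
overall_at = sum(length * at for length, at in regions.values()) / total_bp

print(f"{overall_at:.1%}")  # 64.3%
```

The weighted mean reproduces the reported overall A+T content of 64.3%.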

The assignment of the potential genes was performed by similarity search, and the positions of 112 genes, including 95 unique genes and 17 duplicated in the inverted repeat regions, were localized on the map (Table 1 and Fig. 1). The L. minor chloroplast genome is colinear with those of tobacco (Shinozaki et al. 1986) and the basal angiosperm A. trichopoda (Goremykin et al. 2003a) with respect to gene order and overall homology. A number of rearrangements and deletions specific to the completely sequenced plastomes of monocots of the Poaceae family (Oryza [Hiratsuka et al. 1989], Zea [Maier et al. 1995], Triticum [Ogihara et al. 2002], and Saccharum [Asano et al. 2004]) were not found in L. minor cpDNA, namely three inversions in the LSC, the translocation of the rpl23 gene from the inverted repeat to the LSC, loss of an intron in the rpoC1 gene, and a large insertion in the rpoC2 gene.

Table 1 Genes contained in the Lemna minor chloroplast genome (a total of 112 genes)
Fig. 1 Lemna minor cpDNA. The genes shown inside the circle are transcribed clockwise; those outside the circle are transcribed counterclockwise. The genes of the genetic apparatus are shown in red, photosynthesis genes are shown in green, and genes of NADH dehydrogenase are shown in violet. Gray color marks ORFs, ycfs, and genes of unknown function. Intron-containing genes are represented by their exons. Their names are given in blue.

The gene pool of the L. minor chloroplast genome is almost identical to that of Amborella, with two exceptions: absence of the infA gene, encoding a translation initiation factor, and absence of the conserved open reading frame (ORF) ycf15. The infA gene is absent in the Arabidopsis, Lotus, and Medicago chloroplast genomes but is present as a truncated pseudogene in several other genomes (Millen et al. 2001). The putative ycf15 gene has also been lost several times during the evolution of vascular plants (it is absent in Psilotum, Adiantum, Pinus, Oryza, Triticum, Zea, Lotus, Medicago, and Arabidopsis [Kim and Lee 2004]). Recent evidence (Cai et al. 2006) suggests that ycf15 is not a functional protein-coding gene; in L. minor, ycf15 is apparently a pseudogene, since its sequence is interrupted by a stop codon located only 108 bp downstream from the start. Another conserved ORF in the duckweed chloroplast genome, ycf68 (Stoebe et al. 1998), is also a pseudogene, since it contains a frameshift mutation 42 bp downstream from the start codon. Unlike another monocot, P. aphrodite, the duckweed chloroplast genome contains a full set of ndh genes.

RNA-coding genes were identified by similarity search. Two identical copies of the rRNA gene cluster (16S–23S–4.5S–5S) were found in the inverted repeat regions. Each cluster was interrupted by two tRNA genes, trnI and trnA, in the 16S–23S spacer region. A total of 30 tRNA genes, six of them having additional copies in the inverted repeats, were identified (Table 2). These 30 tRNA types can recognize all the codons present in the chloroplast genes; therefore, no import of nuclear-encoded tRNAs is necessary to complement the chloroplast tRNA set. Six tRNA genes, trnK-UUU, trnV-UAC, trnL-UAA, trnG-UCC, trnI-GAU, and trnA-UGC, contained introns.

Table 2 The codon-anticodon recognition pattern and tRNA genes identified in the L. minor chloroplast genome

Eighteen L. minor chloroplast genes contained one or two introns. Only one of them, the trnL-UAA gene intron, belongs to the self-splicing group I, while all the others belong to group II. Two genes, clpP and rps12, possess two introns. The rps12 gene, as in the tobacco chloroplast genome, is a uniquely divided gene in which the 5′ exon is located in the LSC far away from its second and third exons, which are located as duplicates in the IR regions, thus requiring a trans-splicing mechanism between exon I and exon II to produce mature rps12 mRNA (Sugiura et al. 1987).

Earlier it was shown (Hoch et al. 1991) that RNA editing plays an important role in the translation process in chloroplasts. In at least two cases, RNA editing is expected to be performed in duckweed. Two genes, rpl2 and ndhD, possess ACG instead of ATG at the translation initiation site, as observed in maize (for rpl2), rice (for rpl2), Acorus calamus (for both rpl2 and ndhD), and several other plants. Creation of the AUG initiator codon from the ACG codon by C-to-U editing has been shown to occur in tobacco, spinach, and snapdragon (Neckermann et al. 1994).
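A simple screen for genes such as rpl2 and ndhD can be sketched as below. The function name `start_codon_status` and the example sequences are hypothetical. An ACG start codon only suggests C-to-U editing; in practice, editing would need to be confirmed at the transcript level.

```python
def start_codon_status(cds: str) -> str:
    """Classify the first codon of an annotated coding sequence (DNA alphabet)."""
    codon = cds[:3].upper()
    if codon == "ATG":
        return "canonical"
    if codon == "ACG":
        # C-to-U editing of the second base would create an AUG initiator in the mRNA.
        return "candidate C-to-U editing site"
    return "other"


# Hypothetical 5' ends of coding sequences, for illustration only.
print(start_codon_status("ACGGCTATA"))  # candidate C-to-U editing site
print(start_codon_status("ATGGCTATA"))  # canonical
```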

Phylogenetic Implications

The data set for phylogenetic analyses consisted of 61 concatenated protein-coding gene sequences for 38 taxa, including 36 angiosperms and 2 gymnosperms as outgroups. The data matrix after exclusion of ambiguously aligned positions contained 42,522 nucleotide positions.

Analysis of nucleotide composition shows that within each of 59 genes, compositional homogeneity is present (rpoC1 and ccsA are heterogeneous), but 15 of 38 concatenated sequences do not pass the 5% chi-square test of compositional homogeneity: these are Ginkgo, basal angiosperms, three of four magnoliids, Lemna, and seven eudicots. Amino acid composition is stationary and all amino acid sequences pass the 5% chi-square test.
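The idea behind a per-sequence chi-square test of compositional homogeneity can be sketched as follows. This is not the Tree-Puzzle implementation: the sketch assumes that the expected frequencies are the pooled base frequencies of the whole matrix, and the function names and toy data are hypothetical. With four nucleotides there are 3 degrees of freedom, for which the 5% critical value is 7.815.

```python
from collections import Counter

CHI2_CRIT_5PCT_DF3 = 7.815  # chi-square critical value, 3 degrees of freedom, alpha = 0.05


def composition_chi2(seq: str, expected_freqs: dict[str, float]) -> float:
    """Chi-square statistic comparing one sequence's base counts with expected frequencies."""
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    n = sum(counts.values())
    return sum(
        (counts[b] - n * expected_freqs[b]) ** 2 / (n * expected_freqs[b])
        for b in "ACGT"
    )


def failing_taxa(matrix: dict[str, str]) -> list[str]:
    """Return the taxa whose composition deviates from the pooled frequencies at the 5% level."""
    pooled = Counter()
    for seq in matrix.values():
        pooled.update(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(pooled.values())
    expected = {b: pooled[b] / total for b in "ACGT"}
    return [
        name for name, seq in matrix.items()
        if composition_chi2(seq, expected) > CHI2_CRIT_5PCT_DF3
    ]


# Toy matrix (hypothetical data): four balanced sequences and one A-rich sequence.
toy = {f"taxon{i}": "ACGT" * 25 for i in range(4)}
toy["skewed"] = "A" * 70 + "CGT" * 10
print(failing_taxa(toy))  # ['skewed']
```

Because the statistic grows with sequence length, long concatenated matrices can fail such a test even when each individual gene passes, which matches the pattern reported above.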

MP analyses of all aligned nucleotide positions resulted in a single fully resolved tree, most nodes of which gained high support in bootstrap analyses (Fig. 2A). The topologies of the Bayesian trees derived from the partitioned and unpartitioned full-length nucleotide matrix and from the amino acid sequences are identical (Fig. 2B). Although analyses of the unpartitioned and partitioned full-length nucleotide matrix gave the same tree topology, the partitioned Bayesian analysis is preferable when the harmonic means of the marginal likelihoods are compared (−387,220.98 and −388,889.80 for partitioned and unpartitioned analysis, respectively). The tree resulting from the Bayesian analysis of the unpartitioned nucleotide matrix after exclusion of the third codon positions and representatives of Fabaceae is shown in Fig. 2C.
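The comparison of the two analyses can be expressed as a log Bayes factor. The sketch below assumes that the reported harmonic means are natural-log marginal likelihoods, as is usual for MrBayes output; the variable names are chosen for illustration.

```python
# Harmonic-mean estimates of the log marginal likelihood, as reported in the text.
ln_ml_partitioned = -387_220.98
ln_ml_unpartitioned = -388_889.80

# Log Bayes factor in favor of the partitioned analysis.
ln_bf = ln_ml_partitioned - ln_ml_unpartitioned
two_ln_bf = 2 * ln_bf

print(f"ln BF = {ln_bf:.2f}, 2 ln BF = {two_ln_bf:.2f}")
# ln BF = 1668.82, 2 ln BF = 3337.64
```

A difference of this size is far above commonly used thresholds for strong support, although harmonic-mean estimates are known to be unstable and the value is best read as a qualitative preference for the partitioned model.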

Fig. 2 (A) Phylogenetic tree of 61 concatenated plastid protein-coding sequences from 38 taxa, derived from equally weighted maximum parsimony analysis of the full-length data. Numbers at nodes indicate maximum parsimony bootstrap support estimates for full-length matrix analysis and after exclusion of third codon positions, respectively; other branches have 100/100 support. The length of the tree is 71,119 steps, CI = 0.457, and RI = 0.602. (B) Bayesian tree derived from the partitioned full-length nucleotide sequences analyses. Numbers at nodes represent posterior probabilities for nucleotide and amino acid data, respectively. Other branches have 1/1 posterior probabilities. (C) Bayesian tree derived from the unpartitioned nucleotide matrix analysis after exclusion of third codon positions and representatives of Fabaceae. Numbers at nodes represent posterior probabilities; only values <1 are shown.

In all the phylogenetic trees obtained, monocots are monophyletic, and relationships among them are insensitive to the method of phylogeny inference and the amount of data analyzed: Lemna minor represents the next branch after the separation of Acorus. Among rosids, the placement of Cucumis is unstable: Cucumis is either united with Myrtaceae (Eucalyptus, Oenothera) in the Bayesian trees or nested within eurosids I in the MP tree. Another difference between the trees obtained is the position of Amborella: in the Bayesian and MP trees derived from analyses of all codon positions, Amborella occupies the basal position among angiosperms, whereas Bayesian inference based on the first two codon positions united Amborella with Nymphaeales.

Because the alternative placements of Amborella differ solely in root placement, and the two currently available gymnosperm sequences may be too divergent for appropriate rooting of the angiosperm tree, we estimated the angiosperm root using a nonstationary substitution model, which does not assume stationarity and does not require an outgroup (Yap and Speed 2005). Site variation in the substitution rates was handled by assigning the sites to three classes. We used a program designed for the analysis of nine sequences; the initial data set was therefore reduced to nine plant representatives, which included three basal angiosperms (Amborella, Nuphar, Nymphaea), two magnoliids (Calycanthus, Drimys), two monocots (Acorus, Lemna), and two eudicots (Vitis, Ranunculus), and the likelihoods of the competing rooted topologies were compared. In this analysis, rooting at the branch leading to Amborella had the higher likelihood.

To determine whether the topologies with alternative placements of Amborella can be distinguished using the full-length data, a Shimodaira-Hasegawa (1999) test was conducted and the expected-likelihood weights (Strimmer and Rambaut 2002) were calculated using RELL optimization (Kishino and Hasegawa 1989) as implemented in the Tree-Puzzle program. A basal position of monocots was not recovered in any of our analyses, but we tested it as well, because in several earlier phylogenomic analyses, under certain conditions, monocots were shown as sister to all other angiosperms (Goremykin et al. 2003a, 2004, 2005; Chang et al. 2006). According to the results of the tests, the basalmost position of monocots is significantly worse than the optimal topology with Amborella as sister to other angiosperms, but a close relationship of Amborella and Nymphaeales cannot be rejected (Δlog L = 9.1, SH test p = 0.56, expected-likelihood weight = 0.21).

In general, our inferred chloroplast phylogenies are congruent with recently published molecular trees, in which the monophyly of magnoliids, as well as of monocots and eudicots, was strongly supported (Cai et al. 2006; Saarela et al. 2007; Jansen et al. 2007).

Structural Rearrangements of the Chloroplast Genome in the Course of Evolution of Monocots

With few exceptions, the gene sets, gene orders, and nucleotide sequences of chloroplast genomes are highly conserved among land plants (Palmer 1991). The locations of the borders between the two IRs and the LSC and between the two IRs and the SSC are known to vary among cpDNAs (Maier et al. 1995; Goulding et al. 1996; Kim and Lee 2004), even among closely related species (e.g., Nicotiana tabacum and Atropa belladonna [Kim and Lee 2004]). Generally, these variations are restricted to “movement” of borders within 1 kb, but more considerable expansions and contractions of IR regions are also known. This phenomenon is responsible for wide size variations in chloroplast genomes and for apparent inversions (Perry et al. 2002) in different groups of plants. However, such rearrangements are rather rare events and are therefore usually viewed as highly reliable markers of common ancestry for the taxa in which they are found (Jansen and Palmer 1987; Raubeson and Jansen 1992). We compared the positions of IR borders in duckweed, tobacco, Arabidopsis, the basal angiosperm A. trichopoda, and several monocots (P. aphrodite, A. calamus, and grasses represented by rice), with the aim of drawing general evolutionary conclusions from the available chloroplast genomes (Fig. 3).

Fig. 3 Comparison of the border positions of LSC, SSC, and IR regions and the proposed model of the evolution of the modern arrangement of cpDNAs in monocots from an Amborella-like ancestor. The orientation and order of genes near the borders in Lemna minor are shown at the top. Open boxes immediately above and below the main line represent predicted genes that are transcribed rightward and leftward, respectively; pseudogenes at the borders are shown by the letter ψ. The bottom shows the proposed model of rearrangements. IR regions are shaded. Abbreviations of gene names are as follows: S, rps; L, rpl; t-N, trnN; t-H, trnH.

In both the basal angiosperm Amborella and tobacco, the border between LSC and IRb is located between rps19 and rpl2, and the LSC/IRa border occurs between rpl2 and trnH. The IRa/SSC border is located in the 3′ region of the ycf1 gene and produces the ycf1 pseudogene at the IRb/SSC border. In all analyzed monocots other than Lemna minor, the IR sequences are expanded within LSC so that IRs encompass trnH (Acorus), trnH and rps19 (grasses), or even trnH, rps19, and the 5′ region of rpl22 (Phalaenopsis). This configuration may result from a two-step expansion of IR. At the first step, the IRa/LSC border is moved into LSC, resulting in inclusion of trnH into IR. The second step is expansion of IRb into LSC, resulting in incorporation of rps19 (Acorus and grasses) and a part of rpl22 (Phalaenopsis) into IR. The result is the appearance of the trnH gene between rps19 and rpl2; this is a structural rearrangement specific for monocot genomes, which therefore must have occurred in the early evolution of monocots.

The proposed rearrangement in the chloroplast genomes of monocots fits the scheme presented in Fig. 3. At first, the above-mentioned two-step expansion of IRb results in the inclusion of trnH into the IR between rps19 and rpl2 and the movement of the IRb/LSC border inside rps19. The resulting structure corresponds to the A. calamus chloroplast genome. The further evolution toward the modern chloroplast genomes of grasses and Phalaenopsis involves, according to this model, the expansion of IRa into the SSC in an Acorus-like genome so that ycf1 and rps15 appear within the IRs (“intermediate”). Subsequent deletion of ycf1 in the IRs, expansion of IRb into the LSC with inclusion of rps19 into the IRs, and a set of specific rearrangements and deletions within the LSC produce the chloroplast genome of grasses. The Phalaenopsis-like genome could result from two processes: (i) contraction of IRb that would transfer ycf1 and rps15 to the SSC and (ii) expansion of IRb into the LSC up to rpl22. The final step toward the P. aphrodite genome is shortening of the SSC due to the deletion of ycf2 and several ndh genes.

The structure of the duckweed chloroplast genome is unique in that it combines two features: (i) in contrast to other monocots, the LSC/IRb border is located within rpl2, so that the normal sequence of the S10 operon remains intact, while trnH is present only in the LSC region; (ii) the IRs are extended into the SSC region to include ycf1, rps15, and the 5′ part of the ndhH gene, a pattern not observed in other completely sequenced cpDNAs of land plants. However, the structure of the Lemna minor chloroplast genome may be explained within the above model. One possible explanation for the origin of the duckweed chloroplast genome is an extension of IRa into the SSC in an Amborella-like genome, occurring independently of the similar expansion that produced an Acorus-like genome. This mechanism seems improbable, since the SSC/IRa borders in L. minor and grasses are located at almost the same sites within the ndhH gene, which is unlikely to result from evolutionarily independent events. The second, more plausible, explanation is that the L. minor genome results from contraction of IRb in the “intermediate” genome, resulting in deletion of trnH from IRb so that the original Amborella-like gene order at the IRa/LSC junction is restored. This model of rearrangements of the chloroplast genomes of monocots fits the phylogenetic relationships of monocot plants obtained in this work (Fig. 2).