Introduction

Plant nuclear genomes have acquired numerous DNA fragments from chloroplasts, which have played an important role in plant genome evolution; as a result, the majority of genes encoding chloroplast proteins reside in plant nuclear genomes (Baldauf and Palmer 1990; Gantt et al. 1991; Martin and Herrmann 1998; Rujan and Martin 2001; Martin et al. 2002). Although most of these DNA fragments were transferred at an early stage in organelle evolution, functional gene transfer events continue to occur in flowering plants (Martin et al. 1998; Adams et al. 1999, 2002; Millen et al. 2001).

In many eukaryotes, DNA transfer from organelles to the nuclear genome is ongoing (Ayliffe et al. 1998; Bensasson et al. 2001; Woischnik and Moraes 2002; Yuan et al. 2002; Huang et al. 2003, 2004; Stegemann et al. 2003). The transfer rate of chloroplast DNA to the nuclear genome of tobacco has been measured using specific marker genes that were functional only when integrated into the nuclear genome (Huang et al. 2003; Stegemann et al. 2003). Gene transfer events appear to occur more often than has been detected under experimental conditions (Martin 2003). Nuclear-localized plastid DNAs (nupDNAs) also tend to be located in close proximity to one another. Once plastid DNAs become integrated into the rice nuclear genome, they are rapidly fragmented and shuffled, and newly integrated nupDNAs tend to be eliminated rapidly. Large nupDNA fragments preferentially localize to the pericentromeric regions of chromosomes, where integration and elimination frequencies are markedly higher (Matsuo et al. 2005; Noutsos et al. 2005; Sheppard and Timmis 2009). The greatest number of chloroplast DNA insertions occurs in nuclear regions characterized by sharp changes in repetitive sequence density (Guo et al. 2008). The abundance and composition of organellar DNA fragments have been investigated in model plants, such as Arabidopsis thaliana and rice (Martin et al. 2002; Shahmuradov et al. 2003). Compared with the small Arabidopsis genome, the rice nuclear genome is essentially saturated with plastid DNA sequences, and the abundance of nupDNAs correlates positively, on average, with nuclear genome size (Smith et al. 2011). The density and pattern of nupDNA integration events have been investigated in several species, and the mechanisms of integration and genomic organization have been analyzed in detail (Timmis et al. 2004; Kleine et al. 2009).
The present genomic constitutions of nupDNAs could be explained by the combination of rapidly eliminated deleterious fragments and a few less deleterious, but more stable, fragments (Yoshida et al. 2013). However, the abundance, age, and predicted promoters of plastid genes in the nuclear genome of most plant species have not been thoroughly investigated.

In this review, we evaluated the abundance and age of nupDNAs in genomic data of 23 plant species (Arabidopsis Genome Initiative 2000; Nishiyama et al. 2003; Shrager et al. 2003; Project IRGS 2005; Tuskan et al. 2006; Jaillon et al. 2007; Ming et al. 2008; Huang et al. 2009; Paterson et al. 2009; Schnable et al. 2009; Schmutz et al. 2010; Shulaev et al. 2010; Vogel et al. 2010; Argout et al. 2011; Banks et al. 2011; Potato Genome Sequencing Consortium 2011; Young et al. 2011a; Prochnik et al. 2012; Tomato Genome Consortium 2012; Xu et al. 2013). The analysis shows that significant differences in the composition of nupDNAs exist, and that based on their age, there are two distinct distribution patterns for nupDNAs in plants. Expressed sequence tags (ESTs) indicated that certain nupDNAs may be functional. An analysis of predicted promoters of nupDNAs revealed that some were shuffled and some were eliminated. This review also reveals that the relationship between transcription output and the efficiency of nupDNA gene promoters needs to be further investigated.

Analytical approach

The plastid genome sequences of the following species were obtained from GenBank: A. thaliana (GenBank NC_000932), Brachypodium distachyon (NC_011032), Carica papaya (NC_010323), Chlamydomonas reinhardtii (NC_005353), Cucumis sativus (NC_007144), Citrus sinensis (NC_008334), Eucalyptus grandis (NC_014570), Fragaria vesca (NC_015206), Glycine max (NC_007942), Manihot esculenta (NC_010433), Medicago truncatula (NC_003119), Oryza sativa Japonica group (NC_001320), Physcomitrella patens (NC_005087), Populus trichocarpa (NC_009143), Panicum virgatum (NC_015990), Phaseolus vulgaris (NC_009259), Sorghum bicolor (NC_008602), Solanum lycopersicum (NC_007898), Selaginella moellendorffii (NC_013086), Solanum tuberosum (NC_008096), Theobroma cacao (NC_014676), Vitis vinifera (NC_007957), and Zea mays (NC_001666) (Hiratsuka et al. 1989; Maier et al. 1995; Sato et al. 1999; Maul et al. 2002; Sugiura et al. 2003; Gargano et al. 2005; Saski et al. 2005; Bausher et al. 2006; Jansen et al. 2006; Kahlau et al. 2006; Tuskan et al. 2006; Guo et al. 2007; Pląder et al. 2007; Saski et al. 2007; Bortiri et al. 2008; Daniell et al. 2008; Smith 2009; Shulaev et al. 2010; Paiva et al. 2011; Young et al. 2011a, b).

Pairwise comparisons of plastid genes and nuclear DNA sequences were performed using BLAST (http://www.phytozome.net; Goodstein et al. 2012). The number of substitutions per nucleotide site (K) between each nupDNA and the corresponding chloroplast gene was calculated from the BLAST alignment (Matsuo et al. 2005; Yoshida et al. 2013). For every plastid and nupDNA gene, the 1 kb upstream of the translation start site was considered the promoter sequence. Promoters were predicted using the TSSP program (http://softberry.com). Using BLAST, the nupDNA fragments and plastid DNA were searched against the Expressed Sequence Tags (EST) database (http://www.ncbi.nlm.nih.gov/), allowing no mismatches, to determine whether ESTs were derived from nupDNA or from the plastid genome.
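As an illustration of how a per-site substitution count such as K can be derived from a pairwise alignment, the following Python sketch computes the uncorrected proportion of differing sites and then applies the Jukes-Cantor correction. The choice of the Jukes-Cantor model is an assumption made here for simplicity; the text does not state which correction was applied, and the function names are hypothetical.

```python
import math


def p_distance(aligned_a: str, aligned_b: str) -> float:
    """Proportion of mismatched sites, skipping any column that contains a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # gapped columns carry no substitution information
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def jukes_cantor_k(p: float) -> float:
    """Substitutions per site under the Jukes-Cantor model (assumed, not from the source)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For closely related sequences the corrected value is only slightly larger than the raw proportion of mismatches, so the choice of model matters mainly for older, more diverged fragments.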

Plastid genes are abundant in plant nuclear genomes

To evaluate the abundance of plastid genes in plant nuclear genomes, we used the plastid genes as the query when searching plant nuclear genome databases (http://www.phytozome.net). BLASTN identified many complete and partial gene sequences with high levels of sequence identity. Matches with E values lower than 10^-10 were defined as nupDNA fragments. The sequences were related to photosynthesis, energy metabolism, fatty acid metabolism, transporters, cellular processes, and biosynthesis of cofactors. This analysis was applied to nupDNA fragments larger than 50 bp because it was difficult to confirm the origin of smaller fragments. The analysis revealed that the number of nupDNAs varies among plant species (Fig. 1 and Online Resource 1).
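The two filters described above (E value and fragment length) can be sketched in Python as follows. The sketch assumes the hits are available in the standard 12-column BLAST tabular format; the function name and the input layout are assumptions for illustration, not a description of the pipeline actually used.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-10  # matches with lower E values are treated as nupDNA fragments
MIN_LENGTH_BP = 50      # only fragments larger than 50 bp are analysed

# Column order of the standard 12-column BLAST tabular output (assumed input format).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_nupdna_hits(tabular_text):
    """Return the hits that pass both the E-value and the length filter."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) < E_VALUE_CUTOFF and int(hit["length"]) > MIN_LENGTH_BP:
            kept.append(hit)
    return kept
```

Counting the retained hits per query gene and per species would then give the kind of tally summarized in Fig. 1.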

Fig. 1

Distribution of plant nuclear-localized plastid DNA (nupDNA) and the presence or absence of complete or partial nuclear homologs (Online Resource 1). The black, gray, and white boxes indicate nupDNAs containing intact coding DNA sequences (CDSs), truncated CDSs, and CDSs with only partial nuclear homologs, respectively. a Genes related to photosynthesis; b genes related to energy metabolism; c genes related to fatty acid metabolism, transporters, cellular processes, and the biosynthesis of cofactors

Previous work determined that plants with relatively large genomes contain more nupDNAs than those with smaller genomes (Shahmuradov et al. 2003; Smith et al. 2011), and we obtained similar results in our present analysis. F. vesca, G. max, O. sativa, P. vulgaris, S. bicolor, and Z. mays were found to contain more nupDNAs than A. thaliana, C. reinhardtii, P. patens, and S. moellendorffii (Fig. 1). For example, G. max contains 1718 nupDNAs related to photosynthesis, energy metabolism, fatty acid metabolism, transporters, cellular processes, and the biosynthesis of cofactors, of which 485 are involved in photosynthesis. By contrast, A. thaliana contains only 85 nupDNAs, of which 30 are related to photosynthesis (Online Resource 1). Compared with the genomes of lower plants, such as C. reinhardtii and P. patens, the nuclear genomes of higher plants, such as A. thaliana, F. vesca, and G. max (with the exception of S. moellendorffii), have more plastid DNA sequences (Fig. 1). A lower level of nupDNAs may be a characteristic of lower plant genomes, and sequencing additional lower plant genomes could reveal whether this is a typical difference between lower and higher plants.

Interestingly, the ratio of complete coding DNA sequences (CDSs) to total nupDNAs was not constant among plant species. For O. sativa, 51.29 % of the identified nupDNA genes related to photosynthesis contained intact CDSs (Fig. 1a and Online Resource 1). In contrast, in G. max, most of the identified nupDNA genes were partial sequences, or truncated CDSs, and only 8.45 % contained intact CDSs, even though G. max has a larger number of nupDNAs than O. sativa. For O. sativa, 15 of the 21 genes related to photosynthesis and 10 of the 25 genes related to energy metabolism had at least one intact nuclear CDS copy without mutations (Online Resource 1). By contrast, the only nuclear copy of plastid atpI in A. thaliana contained several single nucleotide deletions, which produced mutations. Similar results were found for the plastid gene petG in C. sinensis and for the plastid gene ndhG in T. cacao.

The numbers of intact nuclear copies of different plastid genes varied among plants. For example, B. distachyon psbI, encoding photosystem II protein I, had one intact CDS copy (Online Resource 2), but psbF, encoding photosystem II protein VI, had eight intact CDS copies (four were nonmutated intact CDSs). Interestingly, atpF, petB, petD, and ndhB had no intact CDS copies according to our data. By contrast, G. max plastid-derived nuclear sequences covered almost the entire plastid genome. Online Resource 1 shows that these nupDNA fragments became integrated into the nuclear genome at different frequencies. A. thaliana had 12 nupDNAs corresponding to rbcL but only two nupDNAs corresponding to psbC, indicating that these chloroplast genes differed in their propensity to undergo integration into the nuclear genome.

Characteristics of intact nuclear copies of plastid genes

To study the features of intact plastid genes in plant nuclear genomes, we classified them as nonmutated or mutated (Fig. 2, Online Resource 1, and Online Resource 2). In A. thaliana, G. max, O. sativa, P. vulgaris, S. bicolor, and Z. mays, we subclassified mutated intact genes into those containing nonsynonymous or synonymous substitutions. The analysis clearly showed that the ratio of nonmutated intact genes to total intact genes varied among plant species. For O. sativa, 42.86 % of genes had nonmutated intact CDSs (Fig. 2a and Online Resource 1), whereas for G. max, 24.39 % of genes had nonmutated intact CDSs.
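The classification described above can be illustrated with a short Python sketch that compares a plastid CDS with an intact nuclear copy codon by codon. It assumes that the two sequences have equal length, are in frame, contain only A, C, G, and T, and differ by substitutions only; it uses the standard codon assignments and ignores plastid RNA editing. The function name is hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard codon assignments, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_intact_copy(plastid_cds: str, nuclear_cds: str) -> str:
    """Return 'nonmutated', 'synonymous', or 'nonsynonymous' for an in-frame pair."""
    plastid_cds = plastid_cds.upper()
    nuclear_cds = nuclear_cds.upper()
    if len(plastid_cds) != len(nuclear_cds) or len(plastid_cds) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by three")
    if plastid_cds == nuclear_cds:
        return "nonmutated"
    for pos in range(0, len(plastid_cds), 3):
        plastid_aa = CODON_TABLE[plastid_cds[pos:pos + 3]]
        nuclear_aa = CODON_TABLE[nuclear_cds[pos:pos + 3]]
        if plastid_aa != nuclear_aa:
            return "nonsynonymous"  # at least one codon changes the amino acid
    return "synonymous"  # sequences differ, but every codon encodes the same residue
```

A copy is reported as synonymous only when every differing codon still encodes the same amino acid; a single amino acid change is enough to label the copy nonsynonymous.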

Fig. 2

Distribution of intact plant coding DNA sequences (CDSs) in nuclear-localized plastid DNA (nupDNA) and the presence or absence of mutations (Online Resource 1 and 2). The black and white boxes indicate nupDNAs containing intact CDSs, either with mutations or without, respectively. a Genes related to photosynthesis; b genes related to energy metabolism; c genes related to fatty acid metabolism, transporters, cellular processes, and biosynthesis of cofactors

The analysis also demonstrated that the nupDNAs included copies of plastid genes with synonymous substitutions in the intact CDSs. O. sativa had 19 synonymous substitutions in intact CDSs, and Z. mays had 10 such synonymous substitutions. However, A. thaliana and P. vulgaris had no synonymous substitutions in any intact CDS in their nupDNAs (Online Resource 1). This finding suggested that at least some of these genes have undergone strong positive natural selection.

Estimation of age distribution reveals variation in transfer frequency among different plants

To estimate when individual nupDNA fragments became integrated into the nuclear genome, we compared the nucleotide substitutions in nupDNAs with those present in the chloroplast genome. Based on the rate of substitution, we estimated the age (million years, Myr) of nupDNA fragments (Matsuo et al. 2005; Yoshida et al. 2013). We excluded data from species with a low level of nupDNA, such as C. reinhardtii, P. patens, and S. moellendorffii. The age distribution profiles of the nupDNA fragments in plants suggested that nupDNAs were repeatedly integrated into the nuclear genome. Furthermore, the proportion of nupDNAs of specific ages (Myr) varied among plant species (Fig. 3 and Online Resource 1).
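One simple way to convert a substitution count into an age is sketched here in Python. It assumes that, after transfer, the plastid gene and its nuclear copy accumulate substitutions independently at constant rates, so that K = (r_plastid + r_nuclear) × t. This molecular-clock formulation and the rate values in the usage note are assumptions for illustration; the exact procedure and rates of the cited studies are not reproduced here.

```python
def estimate_age_myr(k: float, plastid_rate: float, nuclear_rate: float) -> float:
    """Estimate the age of a nupDNA fragment in Myr.

    k            -- substitutions per site between the nupDNA and the plastid gene
    plastid_rate -- substitutions per site per year in the plastid lineage (assumed)
    nuclear_rate -- substitutions per site per year in the nuclear lineage (assumed)
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    combined_rate = plastid_rate + nuclear_rate
    if combined_rate <= 0:
        raise ValueError("the combined substitution rate must be positive")
    years = k / combined_rate  # divergence accumulates along both lineages
    return years / 1e6
```

With placeholder rates of 1e-9 (plastid) and 7e-9 (nuclear) substitutions per site per year, a fragment with K = 0.008 would be dated to about 1 Myr. These rate values are illustrative only and are not taken from the text.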

Fig. 3

Age distribution of plant nuclear-localized plastid DNA (nupDNA) by millions of years (Myr) (see Online Resource 1). a Genes related to photosynthesis; b genes related to energy metabolism

There were two distinct age distribution patterns of nupDNAs in the plant species we analyzed (Fig. 3). In one pattern, a large number of nupDNAs were transferred either within the past 1 Myr or from 1 to 10 Myr ago, and the amount of nupDNA decreased as age increased. This pattern was illustrated by O. sativa (Matsuo et al. 2005) and Z. mays. In the other pattern, young nupDNAs made up a very low proportion of the total, and the amount of nupDNA decreased slowly with age. This pattern was found in A. thaliana, F. vesca, G. max, P. vulgaris, and S. bicolor, a result similar to that of Yoshida et al. (2013).
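An age profile of the kind compared here can be summarized by binning fragment ages and reporting the fraction in each bin, as in the following Python sketch. The first two bins follow the ranges named in the text (under 1 Myr and 1 to 10 Myr); the open-ended third bin and the function name are assumptions, and the bins used in Fig. 3 may differ.

```python
from collections import Counter

# (lower bound, upper bound, label) in Myr; the last bin is a hypothetical catch-all.
AGE_BINS = [
    (0.0, 1.0, "<1"),
    (1.0, 10.0, "1-10"),
    (10.0, float("inf"), ">=10"),
]


def age_profile(ages_myr):
    """Return the fraction of fragments falling into each age bin."""
    counts = Counter()
    for age in ages_myr:
        for low, high, label in AGE_BINS:
            if low <= age < high:
                counts[label] += 1
                break  # each fragment is counted in exactly one bin
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for _, _, label in AGE_BINS}
```

A profile dominated by the youngest bins would correspond to the first pattern, whereas a profile with few young fragments and a slow decline would correspond to the second.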

Analysis of predicted promoters of nuclear copies of plastid genes

To study the expression of the nupDNA genes, we analyzed EST sequences corresponding to the nuclear copies of plastid genes. This was performed using only nupDNA fragments with mutated intact genes because it was difficult to confirm the origin of ESTs of nonmutated intact genes as they may have been derived from either nuclear or plastid genes. The analysis suggested that some nuclear-localized plastid genes are transcribed and functional (Online Resource 1 and Online Resource 2). To understand the promoters of nuclear-localized plastid genes, we searched the predicted promoter sequences in the plant nuclear genome database (http://www.phytozome.net) and plastid genome database (http://www.ncbi.nlm.nih.gov/). For every plastid gene and nuclear-localized plastid gene, the region up to 1 kb upstream of the translation start site was considered as the predicted promoter region, unless it was determined to be smaller.
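The extraction of the region upstream of the translation start site can be sketched in Python as follows. The sketch assumes 0-based, half-open CDS coordinates on the forward strand and shortens the region only when the gene lies near the end of the sequence; other reasons for a shorter region, such as a neighboring gene, are not handled. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string containing A, C, G, and T."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq: str, cds_start: int, cds_end: int,
                    strand: str, max_len: int = 1000) -> str:
    """Return up to max_len bases upstream of the start codon, read 5' to 3'.

    cds_start and cds_end delimit the CDS as [cds_start, cds_end) on the
    forward strand (an assumed coordinate convention).
    """
    if strand == "+":
        begin = max(0, cds_start - max_len)  # truncate at the sequence start
        return chrom_seq[begin:cds_start]
    if strand == "-":
        end = min(len(chrom_seq), cds_end + max_len)  # truncate at the sequence end
        return reverse_complement(chrom_seq[cds_end:end])
    raise ValueError("strand must be '+' or '-'")
```

The returned sequence could then be passed to a promoter prediction tool; for genes on the reverse strand it is reverse-complemented so that it reads in the direction of transcription.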

The analysis showed that many of the predicted promoter sequences of nuclear-localized plastid genes had been shuffled or eliminated after integration into the nuclear genome (Fig. 4, Online Resource 1 and Online Resource 2). For example, in O. sativa, 55.93 and 70.67 % of the predicted promoter sequences of genes related to photosynthesis and energy metabolism, respectively, had been eliminated. Interestingly, in G. max, 70.73 % of the predicted promoter sequences of genes related to photosynthesis had been eliminated, yet only 33.33 % of those related to energy metabolism had been eliminated. By contrast, in M. truncatula, 37.5 % of the predicted promoter sequences of genes related to photosynthesis had been eliminated, but 96.08 % of those related to energy metabolism had been eliminated.

Fig. 4

Comparison of predicted promoters between plastid genes and their nuclear-localized homologs. The black, gray, and white boxes indicate high similarity, partial similarity, and dissimilarity, respectively. a Genes related to photosynthesis; b genes related to energy metabolism; c genes related to fatty acid metabolism, transporters, cellular processes, and biosynthesis of cofactors

Some genes had ESTs corresponding to both the nupDNA gene and the plastid gene (Table 1, Online Resource 1 and Online Resource 2), such as O. sativa psbK (nupDNA gene EST: JK503631.1; plastid gene EST: CI746041.1) and M. truncatula psbM (nupDNA gene EST: CO516909.1; plastid gene EST: EX528553.1). By contrast, for other genes, ESTs were found for only the nupDNA gene or only the plastid gene, such as O. sativa rbcL (nupDNA gene EST not found; plastid gene EST: CB672943.1) and O. sativa psbZ (nupDNA gene EST: CI741169.1; plastid gene EST not found). More data are shown in Online Resources 1 and 2. Interestingly, the predicted rbcL promoters of the nupDNA gene and the corresponding plastid gene in O. sativa were very similar (Online Resource 3). However, although both the M. truncatula nupDNA psbM and the corresponding plastid gene were transcribed, their predicted promoters differed.

Table 1 Promoter analysis of plastid genes and homologous intact coding DNA sequences in nuclear DNA