Introduction

Plastids are a unique but common feature of plant cells. All plastids of red algae, diatoms, green algae, and green plants derive from a single origin: the uptake of a free-living, photosynthetic, cyanobacterium-like prokaryote into a eukaryotic cell about 1.5 billion years ago (Martin and Kowallik 1999; Ochoa de Alda et al. 2014). The main function of the plastid is photosynthesis, on which all life forms depend directly or indirectly. In addition to photosynthesis, the plastid is the organelle where other major cellular functions take place, including the synthesis of lipids, amino acids, and pigments, which are also vital for the plant itself (Neuhas and Emes 2010). Over evolutionary time, the plastids in each plant lineage have shaped their own genomes by transferring many genes to the nuclear genome, retaining only about 5–10% of the genes of the ancestral cyanobacterial genome (ca. 2000–3000 genes) (Martin et al. 2002). As a result, the two cellular genomes are intertwined for the efficient function of plant cells (Keeling 2013). The plastid genomes (plastomes) of streptophytes (land plants and their close algal relatives) vary in size between 120 and 160 kb and have a quadripartite structure in which a large single copy region (LSC) and a small single copy region (SSC) are separated by two large repeats (Wicke et al. 2011). The repeats are mostly inverted, as Inverted Repeat A (IRA) and Inverted Repeat B (IRB), but direct repeats are found in rare exceptions such as lycophytes (Zhang et al. 2018). The inverted repeats are absent from the plastomes of some lineages, such as Ulvophyceae (Cai et al. 2017) in algae, Leguminosae (Kolodner and Tewari 1979; Palmer et al. 1987a) in angiosperms, and Pinaceae and cupressophytes (Wu et al. 2011; Li et al. 2016a) and Taxaceae (Zhang et al. 2014) in gymnosperms. The IRs in the gymnosperm genera Picea and Pinus are reduced to < 500 bp (Li et al. 2016b).

Each plastome carries about 120–130 genes, or about one gene per kb on average, which is highly gene-dense compared with nuclear genomes (Ruhlman and Jansen 2014). There are 80–90 protein coding genes that can be categorized into five groups: (1) those involved in photosynthetic pathways, (2) those not involved in photosynthetic pathways, (3) those involved in transcription and translation, (4) those for structural proteins, and (5) others, including genes for post-transcriptional modification (matK), genes for protein turnover or modification, and the ycf genes of unassigned function (Wicke et al. 2011). There are about 30 tRNA genes in a typical angiosperm plastome, but the number varies from 20 to 40 (Wicke et al. 2011). According to the standard wobble rules, thirty-two tRNA species are needed to read all codons (Crick 1966), but organelle genomes often encode fewer than the required set of tRNAs, so “superwobbling” between codon and anticodon was proposed to explain the insufficient plastid tRNAs (Barbrook et al. 2006; Rogalski et al. 2008; Mohanta et al. 2019). In a recent report on the distribution of tRNA isotypes in six monocots, the tRNA for lysine was not present in the plastomes of Sorghum bicolor and Oryza sativa (Mohanta et al. 2019). Non-photosynthetic or minimally photosynthetic plants usually have small plastomes and retain only a fraction of the tRNA set (Nickrent and García 2009; Wicke et al. 2011). For rRNA genes, two sets of ribosomal RNA genes (rrn23, rrn16, rrn5, rrn4.5) are encoded in most plastomes, one in each of the large repeats IRA and IRB (Wicke et al. 2011; Ruhlman and Jansen 2014).

Genes are interrupted by introns, which are found in the major groups of organisms. Since the first report of a plastid intron, in the rpl2 gene for ribosomal protein L2 in Nicotiana debneyi (Zurawski et al. 1984), introns have been identified in 20 plastid genes, including 14 protein coding genes and six trn genes, most of which have a single intron (Plant and Gray 1988; Brouard et al. 2016). The rps12 gene, however, is split by two introns: the intron between exons 2 and 3 is about 540 bp, and the intron between exons 1 and 2 is about 100 kb; the latter is trans-spliced to produce the mature rps12 mRNA (Hildebrand et al. 1988). The intron in trnK-UUU contains the matK gene in many plastomes (Wicke et al. 2011). The matK gene encodes a maturase that acts as a splicing factor for plastid group II introns (Liere and Link 1995). All plastid introns of land plants are group II introns except for the group I intron in trnL-UAA (Plant and Gray 1988). Introns are classified by phase as phase 0, 1, or 2. Phase 0 introns lie between codons and thus do not interrupt a codon, whereas phase 1 and phase 2 introns split a codon, lying between its 1st and 2nd nucleotides (phase 1) or its 2nd and 3rd nucleotides (phase 2) (Long et al. 1995; Long and Rosenberg 2000).

The current study presents a plastome genomics analysis of 115 species from algae to angiosperms. It covers the length and GC content of the quadripartite regions, intron gain and loss, the matK gene in the trnK-UUU intron, intron phases, and length variation of exons and introns, in order to gain insight into plastome evolution in green plants.

Materials and methods

Taxon selections

One hundred and fifteen species were selected from green algae, bryophytes, pteridophytes (vascular plants that disperse spores), gymnosperms, and angiosperms. Taxa were selected at random to represent each plant lineage. Supplementary Table 1 shows their taxonomic orders, common names, and the GenBank accession numbers of the plastomes.

Plastid genes and structure analysis

The quadripartite structure was analyzed by identifying the IRs in each plastome with the RepEx program (Michael et al. 2019; http://pranag.physics.iisc.ac.in/RepEx/). The default options of the program were ‘inverted’ for the type of repeat to be extracted, with a minimal repeat length of ‘500 bp’. The repeats can be identical or degenerate.

For gene identification, protein coding, tRNA, and rRNA genes were identified from the information in the General Feature Format (GFF) files and the FASTA files of each plastome using in-house Python code developed by Phyzen Inc., Korea (http://www.phyzen.com). The mined genes were then checked against the annotation of each plastome in the NCBI database. Intergenic regions were the regions not defined as genic in the GFF file. GC content was calculated by counting the G and C nucleotides and dividing by the total number of nucleotides in each sequence.
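The GC content calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the in-house code used in the study; the function name `gc_content` is hypothetical, and the sketch assumes that every character of the sequence counts toward the total.

```python
def gc_content(seq: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage.

    Assumption: every character in the sequence counts toward the total,
    mirroring "G and C divided by the total number of nucleotides".
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


# Toy sequences, not taken from any plastome.
print(gc_content("ATGC"))
print(gc_content("ggccat"))
```

The same function can be applied to a whole plastome, to a single region (LSC, SSC, IR), or to the concatenated genic or intergenic sequences.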

Intron analyses

The presence of introns was checked by manually searching individual genes one by one in the annotation in the NCBI database. Exon and intron sequences of each intron-containing gene were retrieved as FASTA files, and the boundary sequences of the exons were then identified. If a gene was found to have intron(s) in the plastome of any species, the same gene in the other plastomes was also checked for the presence of introns. To confirm the exact exon and intron sequences in protein coding genes, in silico translation was conducted after the intronic sequences were removed.
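The in silico translation check can be illustrated as follows. This is a hedged sketch rather than the procedure actually used: the helper names (`splice_exons`, `translate`, `is_full_length`) and the toy gene are hypothetical, exon coordinates are assumed to be 0-based half-open intervals, and the sketch ignores RNA editing and alternative start codons, both of which occur in plastid genes.

```python
# Codon table in NCBI layout (first, second, third base each in TCAG order).
# The amino acid assignments are the same in translation tables 1 and 11.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice_exons(gene: str, exons: list[tuple[int, int]]) -> str:
    """Join exon slices (0-based, half-open) in order, dropping the introns."""
    return "".join(gene[start:end] for start, end in exons).upper()


def translate(cds: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def is_full_length(cds: str) -> bool:
    """True if the CDS is in frame, starts with ATG, and has one terminal stop."""
    protein = translate(cds)
    return (
        len(cds) % 3 == 0
        and protein.startswith("M")
        and protein.endswith("*")
        and "*" not in protein[:-1]
    )


# Toy gene: exon 1 (5 nt) + intron (6 nt) + exon 2 (7 nt).
gene = "ATGGC" + "GTAAGT" + "TTACTAA"
cds = splice_exons(gene, [(0, 5), (11, 18)])
print(translate(cds), is_full_length(cds))
```

A wrongly placed exon/intron boundary typically shows up as a frameshift or an internal stop codon, which is why a full-length translation is a useful consistency check on the annotated junctions.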

Intron phase was calculated by dividing the cumulative number of exon nucleotides upstream of the intron by three and taking the remainder: a remainder of 0 places the intron between codons (phase 0), a remainder of 1 places it between the 1st and 2nd nucleotides of a codon (phase 1), and a remainder of 2 places it between the 2nd and 3rd nucleotides of a codon (phase 2).
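As a minimal sketch of this calculation, the phase of each intron is the remainder, modulo three, of the coding length that precedes it. The function name `intron_phases` and the exon lengths in the example are hypothetical and chosen only for illustration; the sketch assumes that the first exon starts at the first nucleotide of the start codon.

```python
def intron_phases(exon_lengths: list[int]) -> list[int]:
    """Return the phase (0, 1, or 2) of each intron in a gene.

    exon_lengths lists the coding length of every exon in order; the
    intron after exon i has phase (sum of exons 1..i) mod 3.
    """
    phases = []
    upstream = 0
    for length in exon_lengths[:-1]:  # no intron follows the last exon
        upstream += length
        phases.append(upstream % 3)
    return phases


# Hypothetical three-exon gene: 71 + 292 + 228 nt of coding sequence.
print(intron_phases([71, 292, 228]))  # [2, 0]
```

In this toy example the first intron splits a codon after its 2nd nucleotide (phase 2), and the second intron falls between codons (phase 0).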

Results

Plastome length

The plastomes varied in length from 96.0 kb in Ulva fasciata to 203.83 kb in Chlamydomonas reinhardtii, both chlorophyte algae, with a mean of 146.65 ± 17.98 kb (Table 1, Supplementary Table 2). For convenience, the 115 species were grouped into five groups: green algae, including chlorophyte and charophyte algae; bryophytes, including hornworts, mosses, and liverworts; pteridophytes, including monilophyte ferns, horsetails, and lycophytes; gymnosperms; and angiosperms. Among the five major plant groups, the average plastome length ranged from 125.94 ± 15.7 kb in gymnosperms to 164.79 ± 38.9 kb in green algae. The plastomes of bryophytes were relatively short, with an average length of 129.13 ± 15.2 kb, whereas those of pteridophytes (150.16 ± 5.26 kb) and angiosperms (154.15 ± 10.0 kb) varied within a narrow range.

Table 1 The length and GC contents of quadripartite structures in five plant groups

Quadripartite structure

The quadripartite structure was a prominent character of plastomes from photosynthetic algae to seed plants. Of the 115 plastomes, however, seven lacked the inverted repeats, and the remaining 108 showed the canonical quadripartite structure with LSC, SSC, and two IRs (Supplementary Table 2). The IR-lacking plastomes comprised two in green algae (Ulva fasciata, sea lettuce; Spirogyra maxima, green algae), three in gymnosperms (Calocedrus formosana, Taiwan cedar; Taxus baccata, English yew; Taxus mairei, yew), and one in angiosperms (Medicago papillosa, alfalfa). Of the 108 quadripartite plastomes, 23 had IRs that were difficult to define, either because the boundaries between the IRs and the single copy regions were obscure or because the two IRs did not match clearly in sequence or length owing to SNPs and InDels. These obscure-IR plastomes were 5 of the 10 bryophytes, 11 of the 20 gymnosperms, and 7 of the 45 angiosperms.

Table 1 shows the average length of each quadripartite region in each plant group. The average length of the LSC of the 115 plastomes was 84.88 ± 10.79 kb, ranging from 60.64 kb in the gymnosperm Ephedra equisetina to 135.82 kb in the charophyte alga Chara vulgaris (Supplementary Table 2). Among the five groups, the average size of the LSC ranged from 70.29 ± 10.77 kb in gymnosperms to 94.36 ± 24.81 kb in green algae. Two green algal species (U. fasciata and S. maxima) did not have IRs, so the LSC and SSC were not defined in these two species. The LSC of the four remaining green algal species ranged widely from 72.77 kb in the charophyte alga Klebsormidium flaccidum to 135.82 kb in the charophyte alga C. vulgaris. The average lengths of the LSC in bryophytes, pteridophytes, and angiosperms fell within a narrow range, from 86.87 ± 7.72 kb in pteridophytes to 89.45 ± 9.9 kb in bryophytes. Compared with the LSC, the length of the SSC was highly variable, ranging from 1.82 kb in K. flaccidum to 81.31 kb in C. reinhardtii, a 45-fold difference (Supplementary Table 2). The SSC in C. reinhardtii is even longer than the LSC. The average length of the SSC of the 115 plastomes was 22.74 ± 13.3 kb. Among the five plant groups, the average length of the SSC ranged from 17.06 ± 2.93 kb in angiosperms to 47.10 ± 33.79 kb in green algae. In gymnosperms, the average length of the SSC was 35.67 ± 20.01 kb, and size variation was very high: the SSC was 8–9 kb in the genera Gnetum and Ephedra but over 50 kb in Picea and Pinus species, which is more than twice the overall average SSC length of the 115 plastomes analyzed. The average lengths of the SSC in bryophytes and pteridophytes were similar, at 19.29 ± 2.27 kb and 21.64 ± 4.86 kb, respectively.
Variation in the IRs was remarkable, ranging from complete absence in the genera Ulva and Spirogyra in green algae, Calocedrus and Taxus in gymnosperms, and Medicago in angiosperms to 51.12 kb in K. flaccidum in green algae (Supplementary Table 2). The IRs in the gymnosperm genera Picea and Pinus were reduced to as little as 299–473 bp. The IRs in bryophytes and gymnosperms were similarly short, at 8.11 ± 9.71 kb for IRA in gymnosperms and 10.19 ± 1.95 kb in bryophytes. The IRs in the other plant lineages ranged from 20.63 ± 6.25 kb for IRA in pteridophytes to 25.19 ± 4.71 kb for IRA in angiosperms.

GC contents

The GC content of the 115 plastomes varied from 24.9% in the chlorophyte alga U. fasciata to 51.0% in the lycophyte Selaginella moellendorffii, with an overall average of 38.21 ± 3.8% (Table 1, Supplementary Table 2). Among the plant groups, the plastomes of pteridophytes showed a high GC content, averaging 41.49 ± 3.27%, whereas the plastomes of green algae and bryophytes showed low GC contents, averaging 32.47 ± 6.1% and 33.12 ± 4.2%, respectively. The two seed plant groups, gymnosperms and angiosperms, showed similar GC contents of about 38%. Within the quadripartite structure, the GC content of the IRs averaged 41.4 ± 10.2%, whereas the single copy regions showed lower GC contents, averaging 37 ± 4.0% in the LSC and 34.53 ± 4.8% in the SSC (Supplementary Table 2). Protein coding regions (CDS) had a GC content similar to that of the total plastome sequences (38.6 ± 3.6% in CDS versus 38.2 ± 3.85% in total plastomes), whereas the tRNA and rRNA genes showed higher GC contents (53.6 ± 1.6% in tRNA genes and 54.7 ± 1.4% in rRNA genes). Genic regions had a higher GC content (41.9 ± 3.1%) than intergenic regions (34.8 ± 5.2%).

Gene contents: protein coding, tRNA, and rRNA genes

The number of protein coding genes ranged from 40 in the Gleicheniales fern Diplopterygium glaucum to 92 in the charophyte alga C. vulgaris, with an average of 79.43 ± 5.84 protein coding genes per plastome (Fig. 1; Table 2). Duplicated genes were counted as one gene. Among the plant groups, gymnosperm plastomes carried the fewest protein coding genes, with an average of 72.7 ± 6.92 genes per plastome, whereas bryophytes and pteridophytes carried the most, with averages of 84.3 ± 2.61 and 83.66 ± 2.7 genes, respectively (Table 2). Pseudogenes were found for several genes in all plant groups. While some genes were pseudogenized in only a few species (e.g., accD, petB, rpl32), others were pseudogenized in multiple species in different plant groups (e.g., ndhB, ycf12) (Fig. 1).

Fig. 1

Presence/absence polymorphisms of plastid protein coding genes. The filled boxes are genes present and empty boxes are genes absent, respectively. The yellow boxes are pseudogenes. The order of species is the same as the Supplementary Table 1

Table 2 Number of protein coding, tRNA and rRNA genes in plastomes in five plant groups

Of the 86 protein coding genes, 17 genes (atpA, atpB, atpE, atpI, psaC, psbF, psbH, psbJ, psbN, petD, petG, rbcL, rpl2, rpl20, rpl36, rps14, rps18) were present in all species (Fig. 1). These genes encode proteins for the light-dependent reactions of photosynthesis (atp for F-type ATP synthase, psa for photosystem I, psb for photosystem II, pet for the cytochrome b6/f complex), a protein for the light-independent (dark) reactions of photosynthesis (rbc for the large subunit of RuBisCo), and proteins for translation (rpl for large subunit ribosomal proteins, rps for small subunit ribosomal proteins). Some genes were lost either sporadically in individual species across the five plant groups or in specific plant lineages. None of the angiosperm species had any of the chl genes (chlB, chlN, chlL), which were also absent in the chlorophyte alga U. fasciata, the pteridophyte Psilotum nudum (whisk fern), and Gnetophyta in gymnosperms. The psaM gene was lost in most of the Polypodiopsida ferns, in Cycadophyta, Ginkgophyta, and Gnetophyta of gymnosperms, and in all angiosperms. Most or all ndh genes were lost in distantly related species such as C. reinhardtii, U. fasciata, and Characiochloris acuminata in chlorophyte algae, Aneura mirabilis (liverwort) in bryophytes, Schizaea elegans (comb fern), a monilophyte fern, in pteridophytes, and all species of Gnetophyta and Pinophyta in gymnosperms. In angiosperms, Phalaenopsis equestris had no functional ndh genes because they were either lost or pseudogenized. The rpl21 gene was present in some species of algae and in most of the bryophytes and ferns, but it was lost in all seed plants. While ycf2 was not present in algae and angiosperms, its presence was polymorphic in the other plant groups.

The number of tRNA genes varied from 13 in the lycophyte Selaginella moellendorffii to 33 in the horsetail Equisetum arvense, with an overall average of 29.19 ± 2.42 tRNA genes (Table 2; Fig. 2). Several tRNA genes were duplicated, with an average of 6.7 duplicated genes per plastome. Of the 20 amino acids, ten (Ala, Cys, Asp, Glu, His, Lys, Asn, Gln, Trp, Tyr) had only a single tRNA gene. While trnF-AAA and trnI-UAU were each present in only a single species, trnQ-UUC and trnF-GAA were present in all species. Some tRNA genes were absent from certain plant lineages; for example, trnK-UUU, trnL-CAA, trnR-CCG, trnS-CGA, and trnV-GAC were lost in all Polypodiales ferns (lanes 20 to 41 in Fig. 2). Of the five plant groups, bryophytes had slightly more tRNA genes, with an average of 31.4 ± 0.46, and the other groups had similar numbers, from about 27.5 ± 3.25 in algae to 29.8 ± 1.83 in gymnosperms (Table 2).

Fig. 2

Presence/absence polymorphisms of plastid tRNA genes. The filled and empty boxes denote the presence and absence of the trn genes, respectively. The order of species is the same as the Supplementary Table 1

The rRNA genes were located in the IR regions and were therefore duplicated in most species. Table 2 shows the rRNA genes among the five plant groups. The number of rRNA genes ranged from 2 in U. fasciata to 10 in C. reinhardtii in algae and Trachelium caeruleum in angiosperms. U. fasciata has rrs, encoding 16S rRNA, and rrl, encoding 23S rRNA. C. reinhardtii has rrnS for 16S rRNA, rrn7 for 7S rRNA, rrn3 for 3S rRNA, rrnL for 23S rRNA, and rrn5 for 5S rRNA, and these five genes are duplicated. The plastomes of bryophytes and ferns had rrn16, rrn23, rrn4.5, and rrn5, and these four genes were duplicated. The rrn gene for the small ribosomal subunit (rrn16) and the genes for the large ribosomal subunit (rrn23, rrn4.5, rrn5) were separated by two tRNA genes (trnI and trnA) in most plastomes, with a few exceptions such as the genus Taxus in gymnosperms, in which the rrn gene order was rrn16-rrn23-rrn4.5-rrn5 without the intervening trn genes (Fig. 3). The plastomes of gymnosperms have a variable number of rRNA genes. Ginkgo biloba has seven rRNA genes (rrn16, rrn23, rrn45, rrn5, rrn4), of which rrn16 and rrn23 are duplicated but the others are not. The species in the genera Pinus, Picea, Calocedrus, and Taxus have only four genes (rrn16, rrn23, rrn4.5, rrn5), none of them duplicated, whereas cycads and gnetophytes have them in duplicate. Angiosperm plastomes also have the four basic rRNA genes in duplicate, except for Medicago papillosa, which has only one copy of each.

Fig. 3

The rrn gene arrangements in K. flaccidum (green algae), C. formosana and T. mairei (gymnosperms), and M. papillosa (angiosperm). The plastomes of these species had lost one of the IRs so that the rrn genes were not duplicated

Introns

The number of genes containing introns ranged from 2 in C. reinhardtii and U. fasciata in green algae to 18 in the bryophyte Apopellia endiviifolia, and the number of introns ranged from 3 in U. fasciata to 19 in A. endiviifolia (Fig. 4; Supplementary Table 3). The trans-spliced intron of the rps12 gene was present in plastomes from charophyte algae to flowering plants but absent in the chlorophyte algae. Most of the intron-containing genes had a single intron, except for clpP and ycf3 with two introns, rrnL in the chlorophyte alga Characiochloris acuminata with 3 introns, and psbA in the chlorophyte alga C. reinhardtii with 4 introns. The intron-containing genes consisted of 13 protein coding genes, six trn genes, and one rrn gene. Of the 13 protein coding genes, introns in ten genes (atpF, clpP, petD, rpl16, rpl2, ycf3, ndhB, ndhA, petB_1, rps16) were present in all five plant groups from green algae to angiosperms. Intron gains were detected in two cases, one intron each in cysT and ycf66, which appeared from bryophytes to flowering plants but not in green algae. Most intron losses were sporadic across all plant groups, but some were confined to particular lineages: for example, the introns in clpP were lost in the genera Gnetum, Ephedra, Welwitschia, Pinus, Picea, and Calocedrus in gymnosperms and in Gramineae species in angiosperms. The Gramineae species also lost the intron in rpoC1. Unlike nuclear genes, which can have introns in their 5’- and 3’-untranslated regions (UTRs), plastid genes had no introns in their UTRs. In structural RNA genes, the introns in the six trn genes (trnG-UCC, trnK-UUU, trnL-UAA, trnI-GAU, trnA-UGC, trnV-UAC) were present in all groups, whereas the intron in the rrnL gene was present only in C. reinhardtii.

Fig. 4

Presence/absence polymorphism of plastid introns. The filled and empty boxes denote the genes present and absent, respectively. The yellow boxes denote genes present but intron absent in the corresponding genes. The introns present only in a plastome in single species are not included in the figure. The order of species is the same as the Supplementary Table 1

The numbers of intron-containing genes and introns varied among the plant groups (Table 3). The seed plants (gymnosperms and angiosperms) showed fewer intron-containing genes than the spore-bearing plants (bryophytes and pteridophytes). Green algae showed the fewest, with an average of 6.67 intron-containing genes and 8.0 ± 5.35 introns per plastome. Gymnosperms also showed a low number, with an average of 12.25 intron-containing genes and 13.35 ± 2.59 introns per plastome, because species in Pinophyta and Gnetophyta had lost either the introns or the genes themselves for clpP, ndhA, ndhB, rps16, and several trn genes. Bryophytes, pteridophytes, and angiosperms showed similar numbers, with about 15 intron-containing genes and 17 introns per plastome (Supplementary Table 3).

Table 3 Number of introns in five plant groups

trnK-UUU and MatK

Of the 115 plastomes in our study, the matK gene was present within the trnK-UUU intron in 82 plastomes, and in 30 plastomes, which did not have trnK-UUU, matK was present outside a trnK-UUU intron (Fig. 2; Supplementary Table 3). The matK gene was present at exactly the same location within the trnK-UUU intron from green algae to seed plants (Fig. 5a), implying that the insertion of this intron into the trnK-UUU gene occurred at an early stage of green plant evolution. In green algae, the chlorophyte algae (C. reinhardtii, U. fasciata, and Characiochloris acuminata) had an intronless trnK-UUU gene and therefore no matK gene within it, whereas the charophyte algae (Chara vulgaris and Spirogyra maxima) had a trnK-UUU gene with matK in its intron. In pteridophytes, two lycophyte species had matK in the trnK-UUU intron, whereas two species did not have trnK-UUU but had matK. In the genus Huperzia, H. serrata had trnK-UUU with matK in its intron, whereas H. lucidula had only the matK gene and no trnK-UUU gene. Sequence comparison of the two MatK proteins in the genus Huperzia revealed that they were identical except for 5 of 535 amino acids (AAs) (Supplementary Fig. 1). Of the 31 monilophyte ferns in pteridophytes, six species had both trnK-UUU and matK, whereas the rest did not have trnK-UUU but had the matK gene. In angiosperms, Cucurbita maxima and Dioscorea collettii had lost the trnK-UUU gene but retained matK. The matK genes inside and outside the trnK-UUU intron were present at the same genomic location in the LSC (Fig. 5b), implying that trnK-UUU was erased, leaving only the matK gene.

Fig. 5

(A) Intronless trnK-UUU is 72 nts long. The intron containing matK is inserted between the 37th and 38th nucleotides (AA/CC) of trnK-UUU from algae to flowering plants

(B) The genomic locations of matK in several plastomes. The matK gene is not in trnK-UUU in S. moellendorffii, L. japonicum, D. collettii, and A. trichocarpa, whereas the matK in E. arvense and H. serrata is within the trnK-UUU intron

Introns in protein coding genes: length, phase, intron sliding

Introns were present in 13 protein coding genes (Fig. 4; Supplementary Table 3). In protein coding genes, introns were about twice as long as exons, with average exon and intron lengths of 376.38 ± 348.94 nucleotides (nts) and 751.31 ± 168.22 nts, respectively (Table 4). Exon length ranged from 5 nts in the first exon of the petD gene in several species to 1785 nts in the second exon of rpoC1 of Spirogyra maxima in algae (Supplementary Table 3). Intron length ranged from 264 nts in the second intron of clpP of H. lucidula to 2,724 nts in rpl2 of C. vulgaris. The average exon length did not vary much among the five plant groups, at about 376 ± 348.55 nts, whereas the average intron length varied from 600.78 ± 182.01 nts in angiosperms to 1138 ± 570.44 nts in green algae.

Table 4 Exon and intron length in plastome genes in five plant groups

Because more than one intron phase was found for the same intron among orthologous genes in the NCBI annotation, the exact positions of the introns were analyzed manually in all protein coding genes. Most of the introns started with a GT dinucleotide at their 5’-end, but some introns with odd phases did not start with GT in the annotation, although a GT dinucleotide was present close to the annotated exon/intron junction. We therefore redefined the exon/intron junction at the 5’-end (donor) site so that the intron started with GT and adjusted the 3’-end (acceptor) site accordingly. In silico translation of the messages with the adjusted exon/intron boundaries gave full-length translations. We thus concluded that the introns whose phase differed from that of their orthologous introns were artifacts of mis-annotation. When the trans-spliced intron in rps12 is excluded, 13 genes have 17 introns, of which eight are phase 0, six are phase 1, and three are phase 2. The genes with two introns had intron phases (2,0) in clpP and (1,0) in ycf3 (Supplementary Table 3).
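The search for a GT dinucleotide near an annotated intron start can be sketched as below. This illustrates the idea only and is not the manual procedure used here: the function name `nearest_gt`, the default search window of 10 nts, and the toy sequence are all assumptions introduced for the example.

```python
from typing import Optional


def nearest_gt(seq: str, annotated_start: int, window: int = 10) -> Optional[int]:
    """Return the 0-based position of the GT closest to annotated_start.

    Positions are searched outward from the annotated intron start, up to
    `window` nucleotides in either direction; for equal distances the
    upstream position is checked first. Returns None if no GT is found.
    """
    seq = seq.upper()
    for offset in range(window + 1):
        for pos in (annotated_start - offset, annotated_start + offset):
            if 0 <= pos <= len(seq) - 2 and seq[pos:pos + 2] == "GT":
                return pos
    return None


# Toy sequence: the annotated intron start (position 6) is two nucleotides
# upstream of the nearest GT (position 8).
seq = "ATGGCTTTGTAAGA"
new_start = nearest_gt(seq, 6)
shift = new_start - 6
print(new_start, shift)
```

The shift between the annotated and the adjusted start changes the length of the upstream exon by the same amount, so the intron phase has to be recomputed and the adjusted message checked by translation.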

Figure 6 illustrates the intron phases of clpP, which has two introns in most plastomes. The first intron is phase 2, residing between the 2nd and 3rd nucleotides of a TAC (tyrosine) codon. The second intron is phase 0 in most species, residing between the GCT (alanine) and AGG (arginine) codons, with CT/TT at the 5’ (donor) junction and AT/AG or TT/AG at the 3’ (acceptor) junction. In the liverwort Ptilidium pulcherrimum, however, the second intron is phase 2, residing between the 2nd and 3rd nucleotides of a TTT (phenylalanine) codon; the intron starts with GT and ends with AT, giving TT/GT at the 5’ (donor) junction and AT/TA at the 3’ (acceptor) junction. An insertion of 22 nucleotides, TAGGATATTACTATTTGTTAAT, before the 3rd exon created the GT dinucleotide, so that the intron starts with GT and ends with AT. However, this gene can also be spliced with CT/TT at the 5’ junction and AT/AG at the 3’ junction, because an AT/AG junction was created at the end of the inserted 22 nucleotides. After the introns were removed, both messages produced a full-length polypeptide. It is therefore not clear which message is used in vivo, which could be checked with transcriptome data for this plastome.

Fig. 6

Intron position polymorphisms in clpP. The clpP gene has two introns, of which the first is phase 2 and the second is phase 0 in most species except the liverwort P. pulcherrimum. The clpP of P. pulcherrimum can be spliced into two possible messages because of the insertion of 22 nucleotides between the GCT (alanine) and AGG (arginine) codons. This insertion created a phase 2 intron with TT/GT at the 5’-end (donor) junction and AT/TA at the 3’-end (acceptor) junction, resulting in intron phases (2,2). However, the gene can also be spliced with the (2,0) intron phases, like the clpP genes of other species


Introns in trn genes

Six trn genes (trnG-UCC, trnK-UUU, trnL-UAA, trnI-GAU, trnA-UGC, and trnV-UAC) contained introns. The average trn intron length was 708.08 ± 182.26 nts, almost 19 times the average trn exon length of 37.65 ± 7.16 nts (Table 4). The trnK-UUU intron, which contains the matK gene, was about 2400 nts. The average trn introns was twice longer than that of protein coding gene. The average intron length (excluding the trnK-UUU intron) ranged from 643 ± 190.81 nts in bryophytes to 753.52 ± 178.24 nts in pteridophytes.

Of the six trn genes with introns, all had the intron in the anticodon loop except trnG-UCC, in which the intron was located in the D-loop between the 23rd and 24th sites (23/24). The introns in the anticodon loop were between the 1st and 2nd nucleotides after the anticodon (usually 37/38) in trnI-GAU, trnK-UUU, and trnV-UAC, or between the 2nd and 3rd nucleotides after the anticodon in trnA-UGC. The trnL-UAA intron was located within the anticodon, between its 1st and 2nd nucleotides (Fig. 7).

Fig. 7

Positions of the plastid trn gene introns. The cloverleaf is the general tRNA structure, and the three nucleotides of the anticodon are highlighted in blue. The numbers denote the positions of nucleotides in the consensus structure

Discussion

There are about 3,500 plastome sequences in the NCBI database, with plastome length varying from 19.4 kb in the holoparasitic angiosperm Cytinus hypocistis (Roquet et al. 2016) to 1.35 Mb in the microalga Haematococcus lacustris (Bauman et al. 2018). In the current study of 115 plastomes, both the smallest and the largest plastomes were found in green algae, which might be attributed to the polyphyletic and paraphyletic nature of algae (Gibbs 1990). Parasitic or nonphotosynthetic plants have small plastid genomes (Barbrook et al. 2006; Wicke et al. 2011, 2013). The plastome of the parasitic angiosperm Epifagus virginiana is 70 kb (Wolfe et al. 1992), which is half the average angiosperm plastome size in our study. E. virginiana was not included in the current study. The simple-thalloid liverwort Aneura mirabilis is a parasitic bryophyte (Wickett et al. 2008), and it has the smallest plastome among the 10 bryophyte species in our study. In gymnosperms, the highly reduced IRs in Picea species and the absence of IRs in the genus Taxus resulted in small plastomes. Plastid genomes of cupressophyte species in gymnosperms evolved toward reduced size by shrinking of the intergenic spaces, and synonymous mutations were negatively associated with plastome size and with the frequency of structural rearrangements in the cupressophyte species (Wu and Chaw 2014). Plastid genome size is considered a highly adaptive trait because hyperploidy of organelles is related to the positive scaling of cell size with plastid genome size, and genome size variation in plastids and mitochondria has been attributed to the rates of organelle mutation and genetic drift (Smith 2016, 2017).

The quadripartite structure is a feature of plastomes from unicellular photosynthetic organisms to flowering plants; for example, the euglenoid Eutrepiella pomquetensis has a quadripartite plastome that resembles most plastomes in green algae and green plants (Dabbagh et al. 2017). However, loss of one of the IRs has also been reported in some lineages, including Ulvophyceae chlorophyte algae (Cai et al. 2017), Taxaceae (Zhang et al. 2014) and cupressophytes (Wu and Chaw 2014) in gymnosperms, and Fabaceae in angiosperms (Palmer et al. 1987a). The quadripartite structure has undergone many structural changes, including expansion and contraction of each region in specific plant lineages (Wicke et al. 2011; Ruhlman and Jansen 2014; Wu and Chaw 2014). For example, extensive duplications and inversions resulted in the 217 kb plastome of geranium (Pelargonium hortorum), which is about 50% larger than the average land plant plastome (Palmer et al. 1987b). The IR of P. hortorum was 76 kb long, three times the average length of angiosperm IRs. We found frequent shifts of IR boundaries into the SSC or LSC; this “ebb and flow” of IR boundaries has been attributed to the effect of recombination and gene conversion between repeats (Goulding et al. 1996). Xu et al. (2015) reported 30 gene inversion and 33 gene translocation events in a comparative study of the plastomes of 24 representative species from algae to flowering plants, in which rearrangements were more frequent in algae than in land plants. Wu and Chaw (2014) showed in their plastome study that removal of a specific IR influenced the minimal rearrangements in gnetopines and cupressophytes. Inversions spanning the LSC and IRB converted the IRs into direct repeats (DRs) in a few species of the genus Selaginella (Xu et al. 2018). Sequence transposition also occurred in the lycophyte genus Selaginella, in which a sequence was transposed from the LSC to the SSC between the ancestor of S. uncinata and the current plastome of S. uncinata (Tsuji et al. 2007).

One of the prominent features of organelle genomes is their low GC content, which is consistent with our results. Smith (2009) reported that 43.1% (average 36.2 ± 4.6%) among 150 plastomes registered in NCBI in 2009. As of November 2019, 3,507 plastomes and 304 mitogenomes of plants were registered in the NCBI organelle database (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/), with average GC contents of 37.38 ± 2.26% in plastomes and 42.36 ± 5.25% in mitogenomes. The AT-richness of organelle genomes might be related to the endosymbiotic history of cellular genomes, given that genic regions have relatively higher GC contents than intergenic regions and that many genes of the organelle genomes were lost during green plant evolution (Smith 2012). Organelles are energy-producing sites for cellular activity, and the resulting high levels of reactive oxygen species promote GC → AT mutations (Shokolenko et al. 2009). Multiple rounds of replication entail replication errors that are not properly purged because repair systems are inefficient in the absence of recombination (Bircky 2001). Organelle genomes of some plant lineages have higher GC than AT contents; for example, the GC content is 57% in the mitogenome of the unicellular alga Polytomella capuana (Smith and Lee 2008) and over 50% in plastomes of Selaginella species in lycophytes (Tsuji et al. 2007; Smith 2009). These exceptional GC contents were explained by the high level of RNA editing in the organelles of these species (Smith 2009). Ruhfel et al. (2014) analyzed 360 plastomes from algae to seed plants and showed wide variation in GC content in non-seed plants but low variation in seed plants, which is corroborated by our analysis. Moreover, in their study the species with high GC contents had significantly more amino acids encoded by GC-rich codons (i.e., G, A, R, and P) than amino acids encoded by AT-rich codons (i.e., F, Y, M, I, N, and K), implying that GC content is closely associated with translation efficiency.
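The two quantities discussed above, the GC content of a sequence and the GC-richness of the codons for a given amino acid, can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names gc_content and mean_codon_gc are hypothetical, and the codon assignments are those of the standard genetic code, assumed here to apply to plastid genes.

```python
# Illustrative sketch only; function names are hypothetical, not from the study.
BASES = "TCAG"
# Standard genetic code in TCAG order ("*" marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def gc_content(seq: str) -> float:
    """Return the GC percentage of a sequence, ignoring ambiguous bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


def mean_codon_gc(amino_acid: str) -> float:
    """Return the mean GC fraction over all codons of one amino acid."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    gc = sum(c.count("G") + c.count("C") for c in codons)
    return gc / (3 * len(codons))


if __name__ == "__main__":
    print(f"GC of ATGGCCAAATTT: {gc_content('ATGGCCAAATTT'):.1f}%")
    for aa in "GARP" + "FYMINK":
        print(aa, round(mean_codon_gc(aa), 3))
```

Under the standard code, every amino acid in the GC-rich group (G, A, R, P) has a higher mean codon GC fraction than any amino acid in the AT-rich group (F, Y, M, I, N, K), which is the contrast the comparison of Ruhfel et al. (2014) relies on.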

During green plant evolution, the plastid gene set was reduced to about 5–10% of that of the ancestral cyanobacteria, leaving the chloroplast semiautonomous in function (Martin et al. 2002). We found seventeen genes that were present in all 115 plastomes; they might have essential roles in chloroplast function. The genes for the photosynthetic apparatus (atp, psa, psb, rbcL, pet) were deleted in the nonphotosynthetic parasitic plant Epifagus virginiana (Wolfe et al. 1992). In our analysis, the nonphotosynthetic bryophyte A. mirabilis had six atp genes (atpA, B, E, F, H, I) as functional copies, whereas the genes of the pet, psa, and psb families were all present, but some of them were pseudogenized. Thus, the loss of photosynthesis in A. mirabilis might be a relatively recent event compared with that in E. virginiana, as suggested by Wickett et al. (2008). Gene and intron losses were frequent among angiosperms. Ruhlman and Jansen (2014) demonstrated multiple independent losses of 41 plastid genes and introns in a joint analysis with the angiosperm phylogeny. However, gene loss was not the only gene dynamic among plastid genes during plant evolution. Two genes, matK and ycf2, were gained after the endosymbiotic event. The ancestral cyanobacteria did not have matK, which first appeared in algae (matK is discussed in detail below). The molecular functions of the ycf-encoded proteins have not been defined; ycf3 appeared in land plants (Wicke et al. 2011), but it was absent from all angiosperms in our results.

Sixty-one sense codons encode the 20 amino acids, but only 37 plastid tRNA species were recognized in our analysis. No cases of structural RNAs imported from the cytosol into the plastid are known; thus, the plastid tRNA genes are insufficient to read all possible codons, and the superwobbling theory was proposed to explain plastid translation with the reduced tRNA set (Rogalski et al. 2008). Many tRNA species are able to read all four nucleotides (A, G, C, and U) in the third codon position by the “two out of three” mechanism, which might allow a single tRNA gene for each of ten amino acids (Ala, Cys, Asp, Glu, His, Lys, Asn, Gln, Trp, Tyr) to read the corresponding codons (Pfitzinger et al. 1990; Rogalski et al. 2008). The trnE-UUC, present in all 115 plastomes, is known to have a dual function: it acts in protein biosynthesis during translation and in tetrapyrrole biosynthesis, which is involved in haem and chlorophyll synthesis (Barbrook et al. 2006). Unlike translation on cytosolic 80S ribosomes, translation on the 70S plastid ribosome is initiated with N-formylmethionine (fMet), which is carried by the tRNA encoded by trnMf-CAU; this gene was also present in all plastomes in the current study.
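The codon arithmetic behind this argument can be checked with a short Python sketch. It assumes the standard codon assignments for plastid genes, and the names SENSE_CODONS and family_boxes are hypothetical. The sketch counts the sense codons and lists the codon families in which the third position is fully degenerate, that is, the families in which a single tRNA reading all four third-position nucleotides would suffice. It does not model the tRNA set observed in this study.

```python
# Illustrative sketch only; names are hypothetical, not from the study.
BASES = "TCAG"
# Standard genetic code in TCAG order ("*" marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

SENSE_CODONS = [c for c, aa in CODON_TABLE.items() if aa != "*"]
N_PLASTID_TRNAS = 37  # number of plastid tRNA species reported in the text


def family_boxes() -> dict:
    """Map each two-base prefix whose four codons encode one amino acid to it."""
    boxes = {}
    for first in BASES:
        for second in BASES:
            prefix = first + second
            encoded = {CODON_TABLE[prefix + third] for third in BASES}
            if len(encoded) == 1 and "*" not in encoded:
                boxes[prefix] = encoded.pop()
    return boxes


if __name__ == "__main__":
    print("sense codons:", len(SENSE_CODONS))
    print("codons per tRNA species:", round(len(SENSE_CODONS) / N_PLASTID_TRNAS, 2))
    print("fully degenerate families:", family_boxes())
```

With 61 sense codons and 37 tRNA species, each tRNA has to read more than one codon on average, which is why relaxed third-position reading is needed to account for plastid translation.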

The four ribosomal RNA genes (rrn16, rrn23, rrn4.5, and rrn5) reside in the inverted repeat (IR) in most plastomes, so they are present in duplicate. Duplication of the rrn genes was attributed to the demand for large quantities of ribosomes during early plant development (Bendich 1987). The gene for the small-subunit ribosomal RNA (rrn16) and the genes for the large-subunit ribosomal RNAs (rrn23, rrn4.5, and rrn5) are separated by two tRNA genes (trnI and trnA) in most plastomes. Exceptions include species of the genus Taxus in gymnosperms, which have lost the IR and have the gene order rrn16-rrn23-rrn4.5-rrn5 without the intervening trn genes. Calocedrus formosana, a cupressophyte in gymnosperms, has also lost the IR (Wu and Chaw 2014), but the same two trn genes separate the small-subunit and large-subunit ribosomal RNA genes. Ulva species have also lost the IR, and they have only rrn16 and rrn23, which are separated by the same two trn genes (Cai et al. 2017). Thus, loss of these trn genes between the rrn genes might be specific to the genus Taxus, because trnI and trnA were present between the rrn genes even in the euglenoid Etl. pomquetensis. The Pinaceae species have highly reduced IRs and the rrn genes are in duplicate, but they retained the same gene order, rrn16-trnI-trnA-rrn23-rrn4.5-rrn5, as others (Li et al. 2016b).

The trnK-UUU and matK have been used extensively in phylogenetic studies for plant classification because of their fast sequence evolution compared with other chloroplast genes (Hilu and Liang 1997; Kar et al. 2015). The intron of trnK-UUU contains the matK gene in many plastid genomes from algae to flowering plants, except in the chlorophyte algae (Wicke et al. 2011). The matK gene encodes the maturase for splicing of the trnK-UUU intron and a few other plastid group II introns (Liere and Link 1995). Hausner et al. (2005) showed that matK ORFs were derived from a mobile group II intron and had undergone progressive structural changes and sequence degeneration during plant evolution. Our analysis showed that the integration of matK into the trnK-UUU intron was ancient, because it was present at the same location in plastomes from green algae to flowering plants. The MatK protein encoded by matK is required for trnK intron splicing (Vogel et al. 1999). We report here that many ferns lost the trnK-UUU intron but retained matK. The matK genes within a trnK-UUU intron and those not within a trnK-UUU intron were at the same genomic sites, implying that the latter shed the flanking trnK-UUU intron sequences. It is not known how the trnK-UUU intron sequences were shed from these matK loci. The absence of trnK-UUU with retention of matK led to speculation that MatK functions beyond trnK-UUU intron splicing (Duffy et al. 2009). Indeed, Zoschke et al. (2010) showed that MatK is associated with the splicing of multiple group II introns.

Self-splicing introns are categorized into three groups: I, II, and III (Lambowitz and Zimmerly 2004). In plastid genes, group III introns are not present because they were degenerated to group II introns (Dabbagh et al. 2017). The plastid introns in land plants are mostly group II introns, except for a single group I intron in trnL-UAA (Vogel et al. 1999). The introns in rrs, rrl, and psbA in the chlorophyte green alga Characiochloris acuminata were also group I introns. Kuhsel et al. (1990) showed that the structure and insertion site of the group I intron in trnL-UAA are conserved among diverse cyanobacteria and plastomes of several major plant lineages, and proposed that this intron has been maintained in plastomes for at least 1 billion years. Group II introns are also very old, since they are found in bacteria and eukaryotic organelles (Bonen and Vogel 2001). Hausner et al. (2005) demonstrated by phylogenetic analysis that the organellar group II introns were derived from bacterial-like introns that had the standard RNA structure. The rps12 gene is split by two introns: the intron between exons 2 and 3 is about 540 bp, whereas exons 1 and 2 are separated by about 100 kb, so that the latter intron is trans-spliced to produce the mature rps12 mRNA (Hildebrand et al. 1988). Because the rps12 gene in chlorophyte algae did not have introns, the trans-splicing appears to have started in charophyte algae. The chlorophyte algae did not have the matK gene, whose product is necessary for splicing of group II introns in plastid genes (Liere and Link 1995), which might explain why the chlorophyte algae did not have introns in most of the protein and tRNA genes in our analysis (Fig. 4). McNeal et al. (2009) demonstrated the necessity of matK for group II intron splicing in parasitic angiosperms of the genus Cuscuta, in which the subgenera Monogyna and Cuscuta did not have trnK-UUU but retained matK, whereas species in the subgenus Grammica had neither trnK-UUU nor matK. The presence or absence of group II introns corresponded with the presence or absence of matK among the species of the genus Cuscuta.

Because precise removal of intronic sequences is critical for proper protein synthesis, intron length is an important feature in RNA splicing. Several reports on the lengths of introns and exons are available for nuclear genes, whereas similar studies on organelle genes have not been reported. Intron lengths in nuclear genes range from 13 to 300,000 nucleotides (nts) (Hawkins 1988). Deutsch and Long (1999) surveyed nuclear intron lengths in model eukaryotes and found that most introns were 40–125 nts long and that intron length was weakly related to genome size. Moreover, introns < 50 nts were significantly less frequent than longer introns, from which they argued that a minimum intron size is required for splicing. “Minimal” introns were proposed because introns often clustered at a species-specific low end of the size distribution, such as 92 ± 14 nts in human, 89 ± 12 nts in Arabidopsis, 61 ± 10 nts in fruit fly, and 48 ± 9 nts in C. elegans (Yu et al. 2002). Recently, Cheng et al. (2018) reported the smallest intron, of only 8 nts, in Arabidopsis. The tRNA introns are also small, ranging from 11 to 129 nts in archaea (Yoshihisa 2014) and from 6 to 11 nts in plant nuclear tRNA genes (Michaud et al. 2011). Compared with the small nuclear introns, the plastid introns are much longer, with average lengths of 751.72 ± 167.40 nts in protein-coding genes and 707.73 ± 182.71 nts (without the trnK-UUU) in tRNA genes. Introns in mitochondrial genes are also long. For instance, intron length varied from 953 to 4100 nts in ccmF, 732 to 4080 nts in rps10, and 842 to 2829 nts in mitochondrial protein genes among Fabaceae species (Choi et al. 2019). Nuclear introns are spliceosomal introns, whereas the organelle introns are group II introns that were derived from mobile elements (Zimmerly and Semper 2015). Thus, there might be different constraints on intron length in nuclear genes and organelle genes. Introns in tRNA genes are found in all three domains of life, and gain and loss of tRNA introns are evident at various evolutionary stages (Yoshihisa 2014). Exon length is also constrained, because small exons (< 51 nts) can cause exon skipping and very small exons may hinder recognition of the next spliceosome binding site (Dominski and Kole 1991; Hwang and Cohen 1997). However, Guo and Liu (2015) demonstrated an exon of only 1 nt in Arabidopsis. The average exon size of nuclear genes was reported to be 180 nts in plants (Brown and Simpson 1998), whereas the average length of plastid exons was 375 nts in our analysis. Three plastid genes had mini exons (< 10 nts), and these mini exons were the 5′-first exons: 5–6 nts in petB, 5–9 nts in petD, and 8–10 nts in petD.

The self-splicing group II introns are believed to have originated from mobile genetic elements (Lambowitz and Zimmerly 2011; Zimmerly and Semper 2015), but the origin of spliceosomal introns remains contentious between the intron-early (Doolittle 1978; Gilbert 1987) and intron-late (Cavalier-Smith 1991; Logsdon 1998) views. The intron-early theory, in which genes were formed by exon shuffling, was proposed on the basis of the biased distribution of phase 0 introns over phase 1 and phase 2 introns and of more symmetric exons than expected at random (Doolittle 1978; Gilbert 1987; Long et al. 1995). The intron-late view instead advocated relatively recent random insertion of introns into previously intronless genes, so that intron phases would be distributed rather evenly. However, phylogenetic distributions of introns showed recent acquisition of a vast number of introns, which favored the intron-late theory (Cavalier-Smith 1991; Logsdon 1998). The plastid introns were also biased toward phase 0 over phases 1 and 2 in our analyses. Koonin (2006) proposed a scenario that combined aspects of both theories, in which the earliest self-splicing group II introns in bacteria invaded the genes of early eukaryotes and spread into the organelle genomes through endosymbiosis, followed by lineage-specific loss and gain of introns.
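For readers unfamiliar with the convention, intron phase is the position of an intron relative to the reading frame: phase 0 introns lie between two codons, phase 1 introns after the first nucleotide of a codon, and phase 2 introns after the second. A minimal Python sketch, assuming exon lengths are counted within the coding sequence starting at the first nucleotide of the start codon, is given below. The function name intron_phases and the exon lengths in the example are hypothetical and are not data from this study.

```python
# Illustrative sketch only; the function name and example lengths are hypothetical.
from typing import List


def intron_phases(exon_lengths: List[int]) -> List[int]:
    """Return the phase of each intron in a gene.

    exon_lengths lists the coding length (in nucleotides) of each exon in
    5' to 3' order. The phase of an intron is the number of coding
    nucleotides upstream of it, modulo 3.
    """
    phases = []
    upstream = 0
    for length in exon_lengths[:-1]:  # no intron follows the last exon
        upstream += length
        phases.append(upstream % 3)
    return phases


if __name__ == "__main__":
    # A 6-nt first exon ends on a codon boundary, so the intron is phase 0.
    print(intron_phases([6, 642]))
    # A gene with two introns: 71 nt upstream gives phase 2, 363 nt gives phase 0.
    print(intron_phases([71, 292, 228]))
```

Because the phase depends on the cumulative upstream coding length, an insertion or deletion in an exon whose length is not a multiple of three, or a shifted boundary annotation, changes the apparent phase of every downstream intron.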

Intron sliding was reported to contribute to the diversity of intron positions (Stoltzfus et al. 1997; Bocco and Csűrös 2016; Fekete et al. 2017). Intron phase polymorphisms among homologous introns in the NCBI database were mostly caused by false annotation: Bocco and Csűrös (2016) found false annotation of exon/intron boundaries rather than intron sliding in a study of the introns of 1917 orthologous genes of Oomycetes in the NCBI. We observed a case of possible intron phase change in the second intron of clpP in the liverwort P. pulcherrimum, which was caused by a 22-nucleotide insertion. Because in silico translation produced full-length polypeptides from transcripts derived from the possibly falsely annotated NCBI information, plastome transcriptome analysis may resolve whether such cases result from intron sliding or false annotation.

Plastid trn introns have been reported by others (Plant and Gray 1988; Brouard et al. 2016), but detailed analyses of their length and position had not been attempted. Introns are also present in archaeal and eukaryotic nuclear tRNA genes (Michaud et al. 2011; Yoshihisa 2014). These archaeal and eukaryotic nuclear tRNA introns are short (< 56 nts) (Merchant et al. 2007; Michaud et al. 2011) compared with the plastid trn introns (average 708.08 ± 182.26 nts) in our analysis. The intron insertion sites seemed to be conserved among the tRNA genes of archaea, nucleus, and plastid. We observed three of the six trn introns at position 37/38 (one nucleotide after the anticodon), which is the most frequent intron site in archaeal and nuclear tRNA genes (Yoshihisa 2014). Introns at 23/24 and within the anticodon (34/35), as found in the plastid trn genes, have also been reported in archaeal and nuclear tRNA genes.

Summary

We analyzed 115 plastomes from algae to flowering plants. The quadripartite structure was retained except in a few plastomes that had lost an inverted repeat (IR). IR boundaries were obscure in some species, and expansion, reduction, or deletion of IRs resulted in length variation among the plastomes. Protein-coding genes had also been lost in specific lineages, for example the ndh genes in chlorophyte algae and in Gnetophyta and Pinophyta in gymnosperms. Plastid tRNAs were not sufficient to translate plastid proteins, which might be compensated by superwobbling. Ribosomal RNA genes were located in the IRs, so they were present in pairs except in the species that had lost one of the IRs. Nucleotide composition was biased toward AT in most plastomes, and structural RNA genes showed higher GC contents than other regions. Plastid introns were present in 13 protein-coding genes, six trn genes, and one rrn gene. Presence or absence of plastid introns varied among orthologous genes in different plant lineages. The plastid introns were long compared with nuclear introns, which might be related to the difference between spliceosomal nuclear introns and mobile-element-derived group II plastid introns. The trnK-UUU intron contained the maturase-encoding matK gene, and trnK-UUU was lost in chlorophyte algae and monilophyte ferns. Phase 0 introns were more frequent than phase 1 and phase 2 introns in protein genes. Intron phase polymorphism was observed in one protein-coding gene, clpP, which might have resulted from intron sliding.