Introduction

Plastids (including chloroplasts) first appeared as cyanobacterial endosymbionts of eukaryotic organisms 1 billion years ago (Mereschkowski 1905; McFadden and van Dooren 2004). This symbiosis allowed eukaryotic organisms to capture solar energy through photosynthesis. The cyanobacterial genome was maintained as an extranuclear genome through the endosymbiosis, but over evolutionary time it underwent significant changes, with numerous genes being transferred to the nuclear genome (Stegemann et al. 2003; Olejniczak et al. 2016). However, over 80 protein-coding genes usually still persist in the plastids of plants, including those encoding elements of the genetic apparatus and the proteins involved in photosynthesis (Bock 2007; Olejniczak et al. 2016).

Markers extracted from plastid genomes (hereafter plastomes) have been widely used in studies of plant phylogenetic relationships, biogeography and species identification (e.g. Schaal et al. 1998; APG III 2009; Hollingsworth et al. 2009; Moore et al. 2010; Nguyen et al. 2015). Their high copy number makes them easy to isolate and sequence. Indeed, one plant cell usually possesses tens of plastids, and each contains multiple copies of the plastome, resulting in a relatively high number of copies per cell compared to the nuclear genome (Burgess 1989). Plastid genes encode proteins and RNA molecules that are crucial to the functioning of plant metabolism, and are consequently subject to selective pressures. Purifying selection acts to maintain protein functions, while positive selection may come into play in response to environmental changes.

The plastid gene most studied in terms of selective pressures is probably the one encoding the large subunit of RubisCO, rbcL. Previous studies on various plant groups have highlighted positive selection on rbcL in relation to temperature, drought, and carbon dioxide concentration, particularly following C4 photosynthesis evolution (e.g. Miller 2003; Kapralov and Filatov 2007; Christin et al. 2008; Iida et al. 2009; Kapralov et al. 2011, 2012; Wang et al. 2011; Young et al. 2012; Galmés et al. 2014a, b, 2015; Orr et al. 2016). C4 photosynthesis is a novel assemblage of biochemical and anatomical components that result in an increased CO2 concentration around RubisCO (Hatch 1987; von Caemmerer and Furbank 2003). This reduces the selective pressure for higher substrate specificity, allowing the evolution of more efficient versions of the enzyme at the expense of CO2 specificity (Tcherkez et al. 2006; Nisbet et al. 2007; Christin et al. 2008; Whitney et al. 2011b; Studer et al. 2014). Despite this clear evidence for non-neutral substitutions in rbcL, investigations of the selective pressures driving the evolutionary diversification of other plastid genes are still lacking. In this study, we address this gap by assessing the selective pressures acting on all protein-coding genes of the grass plastome.

The grass family (Poaceae), which encompasses more than 12,000 species, is one of the most important families of flowering plants, both economically and ecologically (Gibson 2009; Kellogg 2015). Grasses dominate most open biomes across the world, and constitute a major food source for numerous animals, including humans (Gibson 2009). Over the last 65 million years, grasses have come to dominate a variety of habitats, and these transitions were accompanied by major functional innovations (Gibson 2009). Among the two main clades of Poaceae, the so-called PACMAD clade includes all 22–24 known origins of C4 photosynthesis in grasses (GPWG II 2012). C4 grasses now dominate large open biomes such as savannas across the globe thanks to C4 photosynthesis, which increases photosynthetic efficiency in warm conditions (Griffith et al. 2015; Atkinson et al. 2016). Other PACMAD lineages adapted to a variety of contrasting conditions, from the shade of tropical forests to arid deserts and cold climates (e.g. Humphreys and Linder 2013; Taylor et al. 2014). These functional and ecological conditions alter the quantity of light available to the plant as well as the amount of CO2 received by RubisCO. Indeed, CO2 availability decreases with increasing temperature, aridity and salinity, which lead to stomatal closure, limiting gas exchange and entailing CO2 depletion inside the leaves (Galmés et al. 2005; Sage et al. 2012). It is therefore possible that the optimum of the photosynthetic apparatus of grasses shifted with changing environments. In this case, adaptation of proteins encoded by plastid genes would have left traces in the form of adaptive amino-acid substitutions.

In this study, we used codon-substitution models to assess the past selective pressures acting on protein-coding genes from complete plastomes of grasses. These models can detect the fingerprint of episodes of adaptive evolution along specific branches of a phylogenetic tree (Yang et al. 2005). The hypothesis that C3–C4 transitions were accompanied by adaptive changes in plastid genes was tested. We also asked whether there were adaptive changes in the plastome of PACMAD grasses that are unrelated to the evolution of C4 photosynthesis. We thus analyzed all protein-encoding plastid genes of 113 grass species of the PACMAD clade. For each of the 76 protein-coding genes present in grass plastomes, we tested the hypothesis of non-neutral evolution, and specifically of positive selection leading to an excess of functionally adaptive amino-acid substitutions. We first evaluated the occurrence of positive selection across the whole phylogeny, without prior expectations regarding the selective drivers. We then specifically tested whether the evolution of C4 photosynthesis altered the selective pressures on plastid genes in addition to rbcL. Overall, our efforts provide the first systematic evaluation of selective pressures across plastomes in a functionally and ecologically diverse group of plants. We were able to detect widespread positive selection acting on plastid protein-coding genes, with important consequences for our understanding of the adaptive significance of plastid genes and their usefulness as phylogenetic markers.

Materials and methods

Taxon sampling and dataset assembly

Complete plastomes or complete sets of protein-coding plastid genes (76 genes) were retrieved from GenBank in June 2016, and complemented with the sequencing and assembly of plastomes of 21 additional species (Table S1). The final sampling includes 113 grass species (or subspecies in the case of Alloteropsis semialata, the only known species with C3 and C4 lineages) belonging to the PACMAD clade (Table S1). Of these, 77 are taxa performing C4 photosynthesis. Among these C4 accessions, three belong to Aristidoideae, two to Micrairoideae, 14 to Chloridoideae and 58 to Panicoideae. Our sampling covers 16 separate C4 lineages reported in the PACMAD clade (GPWG II 2012) and was thus particularly appropriate for testing the impact of C3–C4 photosynthetic transitions on the selective pressures acting on protein-coding genes.

The newly sequenced plastomes were assembled from shotgun data (following the “genome skimming” approach; Straub et al. 2012), as described by Besnard et al. (2013). Briefly, we generated millions of paired-end sequence reads with HiSeq 2000 or HiSeq 3000 (Illumina Inc., San Diego). Plastomes were then assembled using the Organelle Assembler package, version 00.01.000 (https://pythonhosted.org/ORG.asm/) in Python. Protein-coding genes of Arabidopsis thaliana as available in the package were used as seeds to initiate plastome assembly. Then, assembled plastomes were annotated using Geneious v.6.1.8 (Biomatters Ltd., Auckland; Kearse et al. 2012) by transferring annotations from Zea mays (NC_001666.2). All protein-coding genes were extracted from the different plastomes and each dataset was aligned separately as codons using MUSCLE (Edgar 2004).
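The text does not detail how codon structure was preserved during alignment. One common approach, shown here only as a minimal sketch and not as the procedure actually used in this study, is to align the translated proteins and then thread the original codons back onto the gapped protein alignment. The function name `thread_codons` and the toy sequences are hypothetical.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map an ungapped coding sequence onto a gapped protein alignment.

    Each residue in `aligned_protein` consumes one codon from `cds`;
    each gap character '-' becomes a three-nucleotide gap '---'.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("protein and coding sequence lengths disagree")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)


# Toy example (made-up sequences): a one-residue gap becomes a codon gap.
print(thread_codons("M-K", "ATGAAA"))
```

Keeping gaps in multiples of three preserves the reading frame, which codon-substitution models require.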

The 76 gene alignments were concatenated, and the best substitution model was determined for each gene (partition) using PartitionFinder v.1.1.1 (Lanfear et al. 2014). In all cases, the general time reversible model of nucleotide substitution with a gamma shape parameter (GTR + G) was selected, with or without a correction for invariant sites (+ I) (Table S2). A phylogenetic tree was estimated from this alignment using Bayesian inference as implemented in MrBayes v.3.2.4 (Huelsenbeck and Ronquist 2001). The dataset was partitioned per group of genes identified by PartitionFinder, and each partition was allowed to have different parameter values. Three independent Markov Chain Monte Carlo analyses, each with four parallel chains, were run for 100 million generations, sampling a tree every 20,000 generations. The first 25% were discarded as burn-in, after visual inspection of the log files in Tracer v1.6 (Rambaut et al. 2014), and a consensus tree was computed from the posterior distribution. We rooted the tree with Aristidoideae as outgroup, following GPWG II (2012).
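As a quick check of the sampling scheme described above, the number of trees retained after burn-in can be worked out from the run length, the sampling interval and the burn-in fraction. This is a minimal arithmetic sketch: it ignores the initial tree that some programs also record, and pooling the three runs is an assumption made only for illustration. The function name `retained_samples` is hypothetical.

```python
def retained_samples(generations: int, sample_every: int,
                     burnin_fraction: float, runs: int):
    """Return (samples per run, kept per run, kept across all runs)."""
    per_run = generations // sample_every
    discarded = int(per_run * burnin_fraction)
    kept = per_run - discarded
    return per_run, kept, kept * runs


# Values taken from the text: 100 million generations, one tree every
# 20,000 generations, 25% burn-in, three independent analyses.
per_run, kept, pooled = retained_samples(100_000_000, 20_000, 0.25, 3)
print(per_run, kept, pooled)
```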

Tests for positive selection

Past selective pressures were inferred using different codon-substitution models, as implemented in the PAML package v.4 (Yang 2007). In these models, the rate of fixation of non-synonymous substitutions (dN) is compared to the rate of fixation of synonymous substitutions (dS). Under neutrality, the ratio dN/dS (ω) will be equal to one, while purifying selection will remove non-synonymous substitutions and therefore lead to ω < 1. Conversely, the preferential fixation of adaptive, non-synonymous substitutions under positive selection will inflate dN, leading to ω > 1. The phylogeny inferred above was used for the different models. Note that the maximum likelihood is calculated on unrooted trees, so that the choice of the outgroup does not influence the outputs.
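The interpretation of ω can be illustrated with a small sketch. This is only a didactic illustration of the three regimes defined above: in the actual analyses, ω is estimated by maximum likelihood within codon models and assessed statistically, not classified from point values. The function names and the numerical inputs are hypothetical.

```python
def omega(dn: float, ds: float) -> float:
    """Ratio of non-synonymous to synonymous substitution rates."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute omega")
    return dn / ds


def regime(w: float, tol: float = 1e-9) -> str:
    """Name the selective regime suggested by an omega value."""
    if abs(w - 1.0) <= tol:
        return "neutral"
    return "purifying" if w < 1.0 else "positive"


# Made-up rates, for illustration only.
for dn, ds in [(0.02, 0.10), (0.10, 0.10), (0.30, 0.10)]:
    w = omega(dn, ds)
    print(f"dN={dn}, dS={ds}, omega={w:.2f}: {regime(w)}")
```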

To statistically test for positive selection, we compared the performance of four codon models, individually for each gene. The null site model M1a allows codons to evolve under either purifying selection or neutrality, while the alternative site model M2a adds a third category corresponding to positive selection. In addition, we used branch-site models allowing variation among branches of the phylogenetic tree. In the null model MA’, some sites evolve under purifying selection or neutrality across the whole tree, while others switch to neutrality on a priori defined branches (i.e. foreground branches). The alternative model MA is identical except that sites switch to positive selection on foreground branches. It therefore specifically tests for positive selection on these branches. In our case, we specifically tested for positive selection in C4 grasses, so all branches within monophyletic C4 clades were set as foreground branches. Because the different models are not nested, they cannot be compared using likelihood ratio tests. Instead, the fit of the different models was compared using the Akaike information criterion (AIC). A ΔAIC threshold of 2 was used to accept alternative models. Posterior probabilities of belonging to the different classes of codons were estimated by the Bayes Empirical Bayes procedure (Yang et al. 2005).
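The model comparison described above can be sketched as follows, using the standard definition AIC = 2k − 2 lnL, where k is the number of free parameters and lnL the maximized log-likelihood. The log-likelihoods and parameter counts below are made-up values for illustration, and treating the threshold as a strict inequality is an assumption, since the text does not specify it. The function names are hypothetical.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood


def compare(lnl_null: float, k_null: int,
            lnl_alt: float, k_alt: int, threshold: float = 2.0):
    """Return (delta AIC, accepted) for an alternative versus a null model.

    delta AIC = AIC(null) - AIC(alternative); the alternative is accepted
    when this difference exceeds the threshold (assumed strict here).
    """
    delta = aic(lnl_null, k_null) - aic(lnl_alt, k_alt)
    return delta, delta > threshold


# Illustrative numbers only, not results from this study.
print(compare(-1000.0, 10, -995.0, 12))   # alternative clearly better
print(compare(-1000.0, 10, -998.5, 12))   # gain too small for 2 extra parameters
```

Because AIC penalizes each additional parameter by two units, an alternative model with two extra parameters must improve the log-likelihood by more than two units before its AIC even equals that of the null model.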

Results

Twenty-one new plastomes were released in our study. All plastomes show a similar organization, except that of Loudetia simplex, which harbors a long inversion in the Large Single Copy region (Fig. S1). In addition, we also detected the insertion of a mitochondrial region in the plastome of Paspalum paniculatum (Fig. S1), as already reported in this lineage by Burke et al. (2016). Seventy-six protein-coding plastid genes were detected in all plastomes.

Positive selection tests

The phylogenetic tree inferred from the 76 protein-coding plastid genes of 113 PACMAD species (Figs. 2, S2) is congruent with previous topologies inferred from a reduced number of markers or taxa (e.g. GPWG II 2012; Burke et al. 2016). Every node is well supported, with posterior probabilities of 1 except for a few nodes (Fig. S2). Based on this topology, codon-substitution models were then tested separately on the 76 protein-coding plastid genes of the 113 PACMAD species. These analyses revealed significant signatures of positive selection on 27 genes (Fig. 1; Table 1). These positively selected genes do not appear to be clustered in particular regions of the plastome (Fig. 1).

Fig. 1

The 27 protein-coding genes found to evolve under positive selection in this study mapped on the plastome of Zea mays

Table 1 Akaike information criterion differences (ΔAIC) for the four codon substitution models applied separately to each of the 76 protein-coding genes of the plastome

First, the site model M2a, allowing positive selection on every branch of the phylogenetic tree, was found to be the best fitting model for 25 of those genes (Tables 1, S3). These genes encode proteins involved in diverse plastid functions: post-transcriptional modification, subunits of the NAD(P)H dehydrogenase complex (hereafter ‘NDH complex’) involved in the light-dependent photosynthesis phase (nine out of the 11 ndh genes), photosystem II, transcription, proteins involved in the light-independent phase of photosynthesis, and translation, more precisely large and small ribosomal proteins. All plastid genes encoding proteins involved in transcription and post-transcriptional modification (matK, rpoA, rpoB, rpoC1 and rpoC2) were thus found to evolve under positive selection across the phylogeny.

Second, the branch-site model MA, allowing positive selection only on C4 branches of the phylogenetic tree, was found to best represent sequence evolution for two additional genes, rbcL and psaJ (Table 1). For rbcL, the branch-site model MA fits the data better than the site model M2a (ΔAIC = 2.2), the site model M1a (ΔAIC = 143.7) and the branch-site model MA’ (ΔAIC = 136.1; Table 1). Ten codons were identified as evolving under positive selection in C4 species, and these were recurrently mutated in independent C4 lineages, yet mostly conserved in C3 species (Fig. 2; Table 2). Three sites were detected in the 3′ end (sites 468–477) that encodes the C-terminal part. In addition, all species belonging to the C4 tribe Andropogoneae and the C4 species Tristachya humbertii show a codon insertion at position 469, which encodes different amino acids (i.e. Thr, Ser, Asn, or Gly) between the two lineages (Fig. S2). The last codon (site 477) is also missing in a few C4 species and was independently lost at least four times (Fig. S2). For psaJ, the branch-site model MA fitted better than site model M2a (ΔAIC = 6.0), site model M1a (ΔAIC = 3.1), and branch-site model MA’ (ΔAIC = 2.6; Table 1). However, the site model M2a did not perform better than site model M1a. Only one codon (at position 2, numbered following the psaJ coding sequence of Z. mays) was found to be under positive selection on C4 branches with posterior probability > 0.95. Overall, evidence of positive selection on psaJ is weak, as differences between all models are small.
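The ΔAIC values reported in this paragraph are all expressed relative to model MA, so differences between any two of the other models follow by subtraction. The sketch below works this out, assuming that every reported value equals AIC(model) − AIC(MA) for the same gene; the dictionary and function names are hypothetical.

```python
# Reported delta AIC of each model relative to MA (values from the text).
reported = {
    "rbcL": {"M2a": 2.2, "M1a": 143.7, "MA'": 136.1},
    "psaJ": {"M2a": 6.0, "M1a": 3.1, "MA'": 2.6},
}


def delta_between(gene: str, model_a: str, model_b: str) -> float:
    """AIC(model_a) - AIC(model_b); negative means model_a fits better."""
    values = reported[gene]
    return round(values[model_a] - values[model_b], 1)


for gene in reported:
    d = delta_between(gene, "M2a", "M1a")
    verdict = "M2a better" if d < 0 else "M2a not better"
    print(f"{gene}: AIC(M2a) - AIC(M1a) = {d} ({verdict})")
```

Under this assumption, M2a outperforms M1a by 141.5 AIC units for rbcL, whereas for psaJ it is 2.9 units worse, in line with the statement that M2a did not perform better than M1a for that gene.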

Fig. 2

Codon sites of rbcL (relative to Zea mays NC_001666.2) identified to have evolved under positive selection in C4 species plotted against the phylogeny inferred from 76 protein-coding plastid genes for 113 PACMAD species. Codon changes are highlighted in black. Branches leading to C4 species are in red. The scale represents substitutions per site. The phylogeny is well supported; for precise support values and classification according to GPWG II (2012), refer to Fig. S2

Table 2 Codon sites in rbcL (relative to Zea mays NC_001666.2) found to be under positive selection

Finally, the branch-site model MA’, assuming a shift to neutrality on C4 branches, performed better than the null site model M1a for nine genes (atpA, atpB, clpP, ndhE, psbD, rps3, rps8, rps16 and ycf4) and the model MA did not perform significantly worse (Table 1). These genes therefore switched to different selective pressures in C4 taxa, but the nature of the new selective pressure cannot be established with confidence. They encode proteins involved in the light-dependent phase of photosynthesis, translation and enzyme modification.

Discussion

Positive selection on grass plastomes

Our codon models indicate that positive selection acted on some sites of roughly one-third of all protein-coding plastid genes (25 out of 76) across the whole PACMAD tree (Tables 1, S3). These genes are involved in different plastid functions, including self-replication (subunits of the ribosome, post-transcriptional processes and RNA polymerase) and photosynthesis (photosystem II, subunits of the NDH complex, cytochrome assembly and chloroplast membrane protein). Interestingly, all plastid genes encoding DNA-dependent RNA polymerases (rpo) evolved under positive selection in the PACMAD clade. These proteins are involved in transcription processes and therefore control gene expression. Six out of the 21 genes encoding subunits of the ribosome (rpl and rps), which carries out translation, were found to evolve under positive selection across the whole tree. Along with matK, involved in post-transcriptional modifications, these genes have tremendous importance in self-replication and gene expression. Nine of the 11 genes encoding NDH subunits were also found to evolve under positive selection. In chloroplasts, NDH subunits form the NDH complex and are involved in chlororespiration and photosystem I cyclic electron transport. Additionally, four psb genes encoding different subunits of photosystem II evolved under positive selection. While the importance of the NDH complex remains obscure in plastids (Martín and Sabater 2010), proteins of photosystem II are necessary for the light-dependent phase of photosynthesis. The last two genes found to evolve under positive selection across the whole tree were ccsA, involved in cytochrome assembly (Xie and Merchant 1996), and cemA, which encodes a chloroplast envelope membrane protein whose exact role in plastids is not clearly determined.

The factors driving the detected selective pressures on these genes are so far unknown. Positive selection acting on these genes is probably evidence for adaptation to novel ecological conditions. Another likely hypothesis is that positive selection results from coevolutionary processes, as identified in arms races (e.g. Burri et al. 2010). Indeed, coevolution of nuclear and plastid genes can occur when different subunits of the same enzyme are encoded by each of the two genomes (Zhang et al. 2015; Weng et al. 2016). However, there is no reason to expect that positive selection would be sustained throughout the whole tree. Instead, we hypothesize that selective shifts occurred in disparate branches across the tree, after changes in the catalytic context of photosynthesis following alteration of the external environment (i.e. ecological shifts) or internal conditions (i.e. metabolomic shifts). Since our branch model only tested for C4-specific shifts, positive selection on other sets of branches would favour the model assuming phylogeny-wide adaptive evolution. Specification of the foreground branches in this analysis was focused on C3 versus C4 comparisons, but it is likely that other environmental conditions (e.g. low temperatures or shaded habitats) have also driven the adaptive evolution of plastid genes, not only in C4 species. Testing such hypotheses, with adequate species sampling and foreground branch specification, could be the subject of future studies.

C4 photosynthesis only affects rbcL

Plastid genes code for proteins involved in vital functions (De Las Rivas et al. 2002). In C3 species, photosynthesis is greatly optimized and plastid genes were, until recently, believed to evolve under purifying selection in order to eliminate any deleterious mutations (Clegg et al. 1994; Bock et al. 2014). However, C3–C4 photosynthetic transitions led to important anatomical and biochemical modifications in leaves of C4 species and to changes in ecological conditions (Christin and Osborne 2014). Our hypothesis was that these changes could have relaxed selective pressures acting on some amino-acid sites under purifying selection and led to their adaptive evolution to face novel environmental conditions. Despite significant changes, the evolution of C4 photosynthesis was not accompanied by obvious adaptive evolution on protein-coding plastid genes in grasses, except for rbcL, which encodes the large subunit of RubisCO. Even though the model of positive selection on C4 branches (MA) was the best model to represent the data for psaJ, the small AIC differences between models do not provide clear evidence of positive selection acting on this plastid gene in PACMAD grasses. For rbcL, AIC differences between alternative models (M2a and MA) and null models (M1a and MA’) are clear, demonstrating strong positive selection acting on this gene. However, the small difference between the two alternative models (M2a and MA) suggests that the C3–C4 photosynthetic transition is not the only factor driving positive selection on rbcL, and that other ecological conditions probably also drive the adaptive evolution of this gene.

For the ten codons evolving under positive selection in rbcL of PACMAD grasses, the same codon substitutions appear in different C4 lineages, suggesting convergent evolution. However, not every C4 species shows substitutions at the same site or in the same codon. This pattern indicates that not all C4 species evolved alike and that several evolutionary pathways exist to adapt to C4 photosynthesis. Our identification of the putative amino acids positively selected in C4 species is not fully consistent with the results of a related study of rbcL on grasses and sedges (Christin et al. 2008). Six sites out of eight were still detected as evolving under positive selection (Table 2). Four additional amino acids are found here to evolve under positive selection, and three of them are located at the end of the gene (positions 468–477), a region not included in the study of Christin et al. (2008) owing to the difficulty of amplifying the 3′ end of rbcL by PCR on a large sample of species. Other differences in the identification of positively selected amino-acid sites can also be explained by sampling differences, as we performed our analysis on 113 grass species of the PACMAD clade whereas Christin et al. (2008) used more than 200 species, covering the entire grass (Poaceae) and sedge (Cyperaceae) families.

The rbcL gene encodes the large subunit of the RubisCO enzyme, which fixes CO2 to RuBP in the Calvin cycle. This enzyme is subject to a trade-off between its specificity for CO2 and its catalytic activity (Tcherkez et al. 2006). Thanks to the carbon concentrating mechanism in C4 species, CO2 specificity of RubisCO may have decreased, allowing its catalytic activity to increase (Sage 2002). Selective pressure changes would have led to the relaxation of the purifying selection acting on amino-acid sites of RubisCO that were important for the maintenance of CO2 specificity. This change would have allowed the directional evolution of some amino-acid sites and especially those having a role in the catalytic activity of RubisCO, which evolved under positive selection (Whitney et al. 2011b; Studer et al. 2014). It is known that the C-terminal part of RbcL (positions after 460), and in particular site 471 (referred to as 470 in Burisch et al. 2007), is involved in the opening/closing mechanism of the active site of the RubisCO enzyme (Burisch et al. 2007). Amino-acid changes in this protein region affect RubisCO’s catalytic activity by acting on the opening speed of the active site (Gutteridge et al. 1993; Burisch et al. 2007). A shorter opening time of the active site allows for higher CO2 specificity, because CO2 fixes to the binding niche more rapidly than O2 (Schlitter and Wildner 2000). Positively selected substitutions at three codons (468, 471 and 476; Fig. 2) plus recurrent insertions (site 469) and deletions (site 477) in the 3′ end of rbcL should thus be major determinants to optimize RubisCO in a C4 catalytic context. These changes likely modify the energy needed to open the active site of the RubisCO enzyme (Burisch et al. 2007), leading to an increased catalytic activity at the expense of CO2 specificity. The precise role and adaptive evolution of these terminal sites in rbcL need to be confirmed with biochemical analyses but hold potential for biotechnological applications such as crop improvement for global change adaptation (Whitney et al. 2011a; Carmo-Silva et al. 2015; Olejniczak et al. 2016).

Consequences of adaptive selection for the study of plant evolution

Our study revealed widespread positive selection on plastid protein-coding genes of grasses. These genes are widely used in plant phylogenetics and phylogeography. In particular, three genes (matK, ndhF, and rbcL) that were shown here to evolve under positive selection, either across the whole PACMAD phylogeny, or specifically on C4 branches, are among the most widely used in phylogenetic studies, both within grasses (GPWG II 2012) and in other groups (e.g. APG III 2009). It has been shown previously that adaptive evolution can bias phylogenetic reconstructions if it leads to the convergent fixation of some amino-acid residues against a background of limited neutral substitutions (Kellogg and Giullano 1997; Christin et al. 2012). Since the plastid substitution rate is usually much lower than the nuclear substitution rate (Wolfe et al. 1987), we hypothesize that selection could alter inferred phylogenetic relationships, especially when studying closely related species with a few markers. The risk of spurious groupings is decreased by considering multiple genes, a practice that is becoming widespread with the accumulation of complete plastomes (e.g. Moore et al. 2010; Washburn et al. 2015; Burke et al. 2016). For inferences of phylogenetic relationships among close relatives based on a limited number of sites, however, we suggest minimizing the effects of non-neutral evolution by favoring non-coding plastid genes.

Conclusion

In this study, we provide the first analysis of the selective pressures acting across plastomes in grasses. Our results suggest that positive selection is driving the evolution of about one-third of all protein-coding plastid genes. These encode proteins involved in vital functions such as self-replication of plastids and photosynthesis and are therefore of tremendous importance for plant survival. However, factors driving this adaptive evolution are still unknown and new hypotheses regarding ecological adaptation are needed to define appropriate foreground branches and taxon sampling. Secondly, we found that the establishment of C4 photosynthesis in grasses did not lead to major adaptive evolution of protein-coding plastid genes except for the one encoding the large subunit of RubisCO, rbcL. Finally, despite the common thinking that plastid markers evolve slowly and are well conserved, we suggest that the evolution of plastid genes was shaped, to a great extent, by changing ecological conditions. Given their adaptive potential, plastid genes provide powerful genetic markers to study plant evolution, but possible bias in phylogenetic analyses needs to be taken into account.

Author contribution statement

PAC and GB conceived and designed research. GB prepared DNA for shotgun sequencing. AP, JH and GB analyzed data. AP wrote the manuscript with the help of all co-authors.