Introduction

DNA base composition (GC content), defined as the proportion of cytosines and guanines relative to the total number of nucleotides in the genome, is a key feature of genome architecture and is believed to play an important role in genome evolution and species biology (Nishio et al. 2003; Vinogradov 2003; Šmarda et al. 2014). Hence, exploring the dynamics of GC content evolution and its drivers could provide insight into genome adaptation in response to ecological fluctuation. In recent genomic studies, variation in GC content has often been shown to correlate with various factors, including phylogenetic relatedness (Stackebrandt and Liesack 1993), GC-biased gene conversion (gBGC, a process associated with recombination; Holmquist 1992; Eyre-Walker 1993; Duret and Galtier 2009; Muyle et al. 2011), mutational biases (Filipski 1987; Sueoka 1988), chromosome/genome structure (e.g., isochores; Eyre-Walker and Hurst 2001; Duret and Arndt 2008; Glémin et al. 2014), and ecological selection (Bernardi and Bernardi 1986; Eyre-Walker 1999; Hildebrand et al. 2010). In addition, DNA methylation (Smith et al. 2009; Mugal et al. 2015), life-history traits (e.g., plant growth form; Trávníček et al. 2019), and genome size (Veselý et al. 2012; Lipnerová et al. 2013) have been found to influence GC content. To date, the dynamics of GC content evolution have been extensively studied in prokaryotes, vertebrates, and plants (Bentley and Parkhill 2004; Mann and Chen 2010; Eyre-Walker and Hurst 2001; Šmarda et al. 2014). However, most of these studies focused on nuclear genomes, and relatively few have concentrated on organelle genomes (i.e., chloroplast and mitochondrial genomes), in which the GC content is believed to be maintained independently of nuclear genomes (Kusumi and Tachida 2005).

Plastid GC content varies greatly among species, from a low of 22.67% in the parasitic plant Pilostyles hamiltonii (Bellot and Renner 2015) to a high of 56.50% in the spikemoss Selaginella remotifolia (Zhang et al. 2019). Several hypotheses have been proposed to explain this variation (Wicke et al. 2013, 2016; Mower et al. 2019; Yu et al. 2020). Wicke et al. (2013, 2016) noted that the reduced GC content accompanying a lifestyle-specific shift to heterotrophy was caused by relaxed functional constraints on codon usage or nutrient economy. In addition, the highest plastid GC content in spikemoss is probably a consequence of a large number of RNA editing sites and reduced AT-mutation pressure (Smith 2009; Mower et al. 2019), although empirical data supporting the latter hypothesis are still lacking. In a recent phylogenomic study of liverworts, Yu et al. (2020) pointed out that variation in plastid GC content not only reflected phylogenetic relatedness but also correlated with the diversity of poly-(G)/(C) tracts (G/C tracts; deletions/replications caused by DNA polymerase slippage, Viguera et al. 2001). Nevertheless, little is known about whether such observations are lineage specific or a more widespread phenomenon across all plants. Here, we explored the dynamics of GC content evolution and its putative mechanisms in plastid genomes of angiosperms using a robust phylogeny and broad taxon sampling, with a particular focus on the hypothesis of phylogenetic dependence and the correlation with deletion mutations (both G/C tracts and A/T tracts).

As the most successful land plants and an important component of terrestrial ecosystems, angiosperms have been the focus of a large number of plastid genomic and phylogenomic studies (Qiu et al. 1999; Moore et al. 2007; Davis et al. 2014; Gitzendanner et al. 2018; Li et al. 2019). These studies documented a large number of plastid genomes: as of the end of September 2020, a total of 4116 angiosperm plastid genomes were available in GenBank (https://www.ncbi.nlm.nih.gov/genome/organelle/). These reports provide not only a large number of genetic characters, enabling the investigation of the origin and diversification of angiosperms (Davis et al. 2014; Gitzendanner et al. 2018; Li et al. 2019), but also considerable information on plastome architecture and assembly (Smith 2009; Wicke et al. 2013; Li et al. 2016; Niu et al. 2017; Mower et al. 2019). To achieve the above aim, we measured three genetic traits, namely, GC content, G/C tracts, and A/T tracts (per kb), in the coding region of plastid genomes for 1382 angiosperm species representing 350 families and 64 orders (APG IV 2016). Using a recently published, well-resolved phylogeny of angiosperms (Li et al. 2019), we tested the phylogenetic signal of the three selected traits and performed correlation analyses. We then revealed variation in the evolutionary rate of the selected traits across the phylogeny using a modified phylogenetic comparative method, RRphylo, and discuss possible drivers or causes in biological and ecological contexts.

Materials and Methods

Sampling, Phylogeny, and Data Collection

The chloroplast phylogenomic study of angiosperms performed by Li et al. (2019) provided not only the most likely family-level backbone of this group to date, but also a suitable genome-scale dataset consisting of 80 plastid genes, allowing the evolution of GC content to be explored without accounting for the noisy signals arising from frequent reshuffling in noncoding regions (Glémin et al. 2014). To reduce the biases caused by missing data, we pruned the maximum clade credibility (MCC) tree of Li et al. to include 1382 species (Table S1) representing 350 families (84% of family diversity) and 64 orders (100% of order diversity, APG IV 2016) using the function “drop.tip” (package ape; Paradis et al. 2004; Paradis 2012). From each genus, only the species with the longest sequence was sampled. The sampling represents not only the greatest order and family diversity of angiosperms to date, but also ecological, lifestyle, and life-history diversity. Using the dataset of Li et al., we estimated three genetic traits, namely, GC content, G/C tracts, and A/T tracts [the number of poly-(dN) tracts per kb; only poly-(dN) tracts with a length ≥ 3 bp were counted; Table S1], for all samples used in the phylogeny.
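The three traits can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `tract_summary` is hypothetical, and the sketch assumes that a tract is a maximal run of one identical base of length ≥ 3, that G and C runs are pooled as G/C tracts and A and T runs as A/T tracts, and that densities are normalized by the number of unambiguous bases.

```python
from itertools import groupby


def tract_summary(seq, min_len=3):
    """Return GC content (%) and G/C and A/T tract densities (tracts per kb).

    Assumption: a tract is a maximal run of a single base with length >= min_len.
    """
    seq = seq.upper()
    # Count only unambiguous bases so that N or gap characters do not bias totals.
    n_valid = sum(seq.count(base) for base in "ACGT")
    if n_valid == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / n_valid

    gc_tracts = 0
    at_tracts = 0
    # groupby yields maximal runs of identical characters.
    for base, run in groupby(seq):
        if sum(1 for _ in run) >= min_len:
            if base in ("G", "C"):
                gc_tracts += 1
            elif base in ("A", "T"):
                at_tracts += 1

    per_kb = 1000.0 / n_valid
    return {
        "gc_percent": gc_percent,
        "gc_tracts_per_kb": gc_tracts * per_kb,
        "at_tracts_per_kb": at_tracts * per_kb,
    }


if __name__ == "__main__":
    # Toy 16-base sequence: one G/C tract (GGG) and two A/T tracts (AAA, TTTT).
    print(tract_summary("AAAGGGCCTTTTACGT"))
```

In practice the input would be the concatenated coding sequence of one species; the toy string above only demonstrates the counting rule.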

Test of Phylogenetic Signal, Correlations, and Evolutionary Rate Variation

We tested the phylogenetic signal (a tendency of related species to resemble each other more than species drawn at random from the phylogeny; the concept follows Münkemüller et al. 2012) of the three genetic traits using two statistics: Blomberg’s K (Blomberg et al. 2003) and Pagel’s λ (Pagel 1999). The K-statistic provides a reliable effect size measure and performs well in all conditions (Münkemüller et al. 2012). K is estimated as the ratio of the observed mean-squared error of the tip data to the mean-squared error calculated using the variance–covariance matrix derived from a given phylogeny, standardized by the ratio expected under a Brownian motion model (BM, Blomberg et al. 2003). K < 1 indicates that a trait has less phylogenetic signal than expected under BM, whereas K > 1 indicates more. The λ-statistic is most suitable to capture the effect of changing evolutionary rates in simulation experiments (Münkemüller et al. 2012). λ is defined as the transformation of the phylogeny that best fits the trait data to BM (Pagel 1999; Freckleton et al. 2002). λ < 1 indicates that relatives show less similarity than expected, while λ > 1 suggests the opposite. To assess the effects of topological uncertainty on the estimated phylogenetic signal (Revell et al. 2008), we randomly selected 100 trees from the posterior sample after excluding burn-in trees. Both statistics were computed for the MCC tree and the 100 randomly selected trees using the function “phylosig” (package phytools; Pagel 1999; Blomberg et al. 2003; Ives et al. 2007; Revell 2012). All trees are deposited in Figshare Digital Repository (https://doi.org/10.6084/m9.figshare.12901517).
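To make the definition of K concrete, the following Python sketch computes it from a trait vector and a variance–covariance matrix. It is an illustration under stated assumptions, not the phytools implementation: the helper names `invert` and `blomberg_k` are hypothetical, and `C` is assumed to hold root-to-tip path lengths on the diagonal and shared path lengths between tips off the diagonal. The sketch uses the standard form of the statistic, in which the observed ratio MSE0/MSE is divided by its BM expectation, (tr(C) − n / ΣC⁻¹) / (n − 1).

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def blomberg_k(x, C):
    """Blomberg's K for tip values x and phylogenetic covariance matrix C."""
    n = len(x)
    inv_c = invert(C)
    total = sum(sum(row) for row in inv_c)
    # Phylogenetically corrected (generalized least squares) mean.
    a = sum(inv_c[i][j] * x[j] for i in range(n) for j in range(n)) / total
    d = [xi - a for xi in x]
    # MSE0 ignores the tree; MSE weights deviations by the inverse covariance.
    mse0 = sum(di * di for di in d) / (n - 1)
    mse = sum(d[i] * inv_c[i][j] * d[j]
              for i in range(n) for j in range(n)) / (n - 1)
    # Expected MSE0/MSE under Brownian motion on this tree.
    expected = (sum(C[i][i] for i in range(n)) - n / total) / (n - 1)
    return (mse0 / mse) / expected


if __name__ == "__main__":
    # Two clades of two tips each; tips within a clade share 0.9 of their history.
    C = [[1.0, 0.9, 0.0, 0.0],
         [0.9, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.9],
         [0.0, 0.0, 0.9, 1.0]]
    print(blomberg_k([0.0, 0.0, 1.0, 1.0], C))  # trait follows the clades
    print(blomberg_k([0.0, 1.0, 0.0, 1.0], C))  # trait varies within clades
```

In this toy example the clade-structured trait gives K above 1 and the within-clade pattern gives K below 1, which matches the interpretation used in this study.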

DNA polymerase slippage can generate both G/C tracts and A/T tracts (Viguera et al. 2001). To determine whether GC content accelerates the accumulation of G/C tracts or of all deletion events, we tested the correlation between GC content and the two deletion events, separately and as a whole, using independent phylogenetic contrasts (IPC) analysis. We performed the correlation analyses in all angiosperms as well as in three major subclades, namely, monocots, superrosids, and superasterids, using the PDAP plugin (Midford et al. 2005) for Mesquite (Maddison and Maddison 2018).
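Assuming that the IPC analysis refers to Felsenstein's phylogenetically independent contrasts, the approach can be sketched in Python as follows. This is an illustration only, not the PDAP code: the nested-tuple tree encoding and the function names `contrasts` and `contrast_correlation` are hypothetical, and the tree is assumed to be fully bifurcating with positive branch lengths on all non-root branches.

```python
import math


def contrasts(tree, trait):
    """Standardized independent contrasts for one trait.

    Tree encoding (an assumption of this sketch):
      leaf     = (name, branch_length)
      internal = (left_subtree, right_subtree, branch_length)
    """
    out = []

    def visit(node):
        if len(node) == 2:  # leaf
            name, length = node
            return trait[name], length
        left, right, length = node
        x1, v1 = visit(left)
        x2, v2 = visit(right)
        # Contrast between the two daughters, scaled by its expected SD.
        out.append((x1 - x2) / math.sqrt(v1 + v2))
        # Weighted ancestral estimate; the stem is lengthened to reflect
        # the uncertainty of that estimate.
        value = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
        return value, length + v1 * v2 / (v1 + v2)

    visit(tree)
    return out


def contrast_correlation(tree, trait_a, trait_b):
    """Correlation of two sets of contrasts, computed through the origin."""
    u = contrasts(tree, trait_a)
    v = contrasts(tree, trait_b)
    num = sum(a * b for a, b in zip(u, v))
    return num / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


if __name__ == "__main__":
    tree = ((("A", 1.0), ("B", 1.0), 1.0),
            (("C", 1.0), ("D", 1.0), 1.0),
            0.0)
    gc = {"A": 36.0, "B": 37.5, "C": 39.0, "D": 41.0}   # made-up values
    at = {"A": 55.0, "B": 52.0, "C": 49.0, "D": 44.0}   # made-up values
    print(contrast_correlation(tree, gc, at))
```

Both traits are traversed in the same order, so the arbitrary sign of each contrast is consistent between them; that is why the correlation is taken through the origin rather than around the mean.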

RRphylo is a modified phylogenetic comparative method recently developed by Castiglione et al. (2018). Its advantages are that it assigns an evolutionary rate to each branch of a phylogeny, handles phylogenies containing both extinct and extant taxa, and has low Type I and Type II error rates (Castiglione et al. 2018). Recently, this method has been successfully applied in macroevolutionary studies of animals and humans (Piras et al. 2018; Raia et al. 2018; Sansalone et al. 2020). In this study, we used RRphylo to reveal the variation in the evolutionary rate of the three selected traits and to identify potential rate shifts. This analysis was performed using the functions “RRphylo” and “search.shift” (Castiglione et al. 2018; Piras et al. 2018).

Results

The three genetic traits varied widely: 28.10%–43.20% in GC content, 8.27–23.37 per kb in G/C tracts, and 34.93–65.32 per kb in A/T tracts (Fig. 1, Table S1). Interestingly, the amount of A/T tracts was almost three times that of G/C tracts (the mean ratio of A/T tracts to G/C tracts was 2.7, Table S1). Using the K-statistic, only A/T tracts displayed a strong phylogenetic signal, with K = 1.30 [1.25, 1.34] (p < 0.001), while the other two did not, as indicated by K < 1 (K = 0.64 [0.64, 0.68], p < 0.001 in GC content, and 0.61 [0.59, 0.63], p < 0.001 in G/C tracts, Table S2). In contrast, using the λ-statistic, all three genetic traits showed a nearly BM pattern, with λ ≈ 1.0 (λ = 0.962 [0.960, 0.966], p < 0.001 in GC content, 0.988 [0.987, 0.990], p < 0.001 in G/C tracts, and 0.954 [0.951, 0.957], p < 0.001 in A/T tracts, Table S2).

Fig. 1

Variation in GC content and in A/T and G/C tracts per kb in the coding region of plastid genomes across 1382 angiosperm species, using the dataset of Li et al. (2019). Box plots show the minimum-to-maximum range (whiskers), interquartile range (blue boxes), median (black bars), and outliers (empty circles)

GC content was positively correlated with G/C tracts, with a coefficient > 0.6 (p < 0.001), but negatively correlated with A/T tracts, with a coefficient < −0.75 (p < 0.001), and with the total deletion mutations, with a coefficient < −0.50 (p < 0.001, Table 1). These correlations were consistently supported in all angiosperms and in the three major subclades (Table 1). In addition, a negative correlation between G/C tracts and A/T tracts was found only in monocots.

Table 1 Correlations between GC content and two deletion events separately or as a whole

Across the angiosperm phylogeny, we found significant increases in all three traits on the internal branches of monocots, especially Poales, but decreases in superrosids (e.g., Fabids). Additional increases in GC content and decreases in A/T tracts occurred in superasterids (e.g., Lamiids, Fig. 2).

Fig. 2

Reconstructed evolution of GC content and of the diversity of A/T and G/C tracts across the angiosperm phylogeny of Li et al. (2019). Colored circles indicate significant increases and decreases (p < 0.01) in each trait identified using RRphylo. Branches in red subtend non-photosynthetic genera

Discussion

Variation in Plastid GC Content, G/C Tracts, and A/T Tracts Across the Phylogeny

The chloroplast genome contains a subset of genes encoding proteins that are crucial for photosynthesis and some other metabolic processes, such as the cytochrome b6f complex and ATP synthase (Martin et al. 2002). Hence, understanding its evolution is fundamental to comprehending the adaptation, diversification, and ecomorphospace evolution of modern plants. A comparative analysis of 1382 plastomes, focusing on the coding region and spanning the breadth of extant angiosperms, revealed some important characteristics of the dynamic evolution of GC content and deletion mutations. The relatively low GC content occurring in a few non-photosynthetic species coincides with massive gene arrangements (e.g., partial or complete loss or transfer to other genomes, Wicke et al. 2013, 2016; Schneider et al. 2018a, b; Wicke and Naumann 2018). These findings not only reconfirmed the hypothesis of functional constraints on photosynthesis as previously proposed (Wicke and Naumann 2018) but also implied that other processes determining which genes or proteins are retained in plastids, such as inefficient protein import and regulatory coupling of genes, may be responsible for variation in plastid GC content across non-green plants (Daley and Whelan 2005; Barbrook et al. 2006; Wicke et al. 2016).

The accumulation of both G/C tracts and A/T tracts is regulated by the DNA polymerase/mismatch repair system (Akashi and Yoshikawa 2013), but the two events differed greatly in abundance: the amount of A/T tracts was nearly three times that of G/C tracts (Table S1). This observation suggests a bias toward the accumulation of A/T tracts over G/C tracts in the coding region of angiosperm plastomes. A similar conclusion was reached in previous studies (Eyre-Walker 1999; Smith and Eyre-Walker 2001; Massouh et al. 2016). This pattern was considered to be a consequence of the lower biochemical stability and higher energy cost of a G/C pair compared with an A/T pair, the limitation of available resources (Rocha and Danchin 2002; Akashi and Yoshikawa 2013), and selection for protection against inactivation and high mutability, considering the relative mutation rates of mononucleotide repeats (Boyer et al. 2002; Gragg et al. 2002).

Phylogenetic Signal, Correlations, and Variation in the Evolutionary Rate

The taxonomic value of GC content has been widely recognized in taxonomic studies of micro-organisms (Stackebrandt and Liesack 1993; Johnson and Whitman 2007; Tindall et al. 2010) and phylogenomic studies of plants (Šmarda et al. 2014; Yu et al. 2020). However, it was also argued recently that distinct species living in the same environmental conditions tend to show similar GC content (Foerstner et al. 2005; Mann and Chen 2010). In this study, we failed to detect strong phylogenetic signals in GC content and G/C tracts in plastid genomes of angiosperms using the K-statistic (Table S2), indicating that close relatives are less similar than expected under a Brownian motion model of trait evolution. This pattern could result either from adaptive radiations, in which close relatives rapidly differentiate to fill new niches, or from convergent evolution (Kamilar and Cooper 2013). In contrast, the evolution of A/T tracts (per kb) showed a strong phylogenetic signal, making it an informative feature that could be used in the taxonomy of flowering plants. Nevertheless, we cannot completely rule out the possibility of “measurement errors” in the present dataset (Blomberg et al. 2003), related to disproportional sampling, topological uncertainty, and errors in branch length.

Mutation biases are considered to be among the major causes of variation in GC content (Filipski 1987; Sueoka 1988), and high GC content in turn was assumed to accelerate the rate of all mutations, including single base substitutions and deletions/replications (Kiktev et al. 2018). These assumptions were partly supported in this study, as we found that GC content was positively correlated with G/C tracts. However, the negative correlation with A/T tracts identified in this study raised the possibility of a trade-off between the accumulation of the two deletion events, G/C and A/T tracts, a process that was probably associated with competition for limited energy/resources (Rocha and Danchin 2002; Hellweger et al. 2018). Under this assumption, the accumulation of G/C tracts is directly affected by GC content, while the accumulation of A/T tracts depends heavily on the availability of energy/resources. In addition, the heterogeneity of energy costs for different base pairs (i.e., higher energy cost for a G/C pair than for an A/T pair) may be responsible for the reduced number of all deletion events as GC content increases, as long as the above trade-off is taken into account. In this respect, variation in plastid GC content might be a mixed strategy for species to optimize fitness in fluctuating environments, partly through influencing the trade-off between GC → AT and AT → GC mutations (both single base substitutions and deletions). Nevertheless, little is currently known about how GC content responds to ecological fluctuation.

The heterogeneity of the evolutionary rate of plastid GC content across angiosperms suggested that factors other than functional constraints on photosynthesis have shaped the evolution of this trait. One possibility is selection for a broader tolerance range. In monocots, Šmarda et al. (2014) proposed that increased GC content in grasses (Poaceae) may facilitate complex gene regulation and, consequently, favor the growth of these groups in seasonally cold and dry climates. If so, an increase in plastid GC content in the grass family may also be a response to such stressful environments, given the significant functions of plastid genes in energy and material metabolism, as well as the connections between plastid and nuclear genomes in gene regulation and assembly (Martin et al. 2002). Another possibility is plant size. This trait is often associated with mutation rate across angiosperms (Lanfear et al. 2013) and ferns (Barrera-Redondo et al. 2018): taller vascular plants tend to have slower substitution rates than smaller ones. One line of evidence consistent with this hypothesis is that taller palms showed both a slower mutation rate (Barrett et al. 2016) and lower GC content than their herbaceous relatives, e.g., Poales (Table S1). Coincidentally, in this study, we found a significant increase in GC content in angiosperm groups of small mean size, for example, Poales and Lamiids, in which most members are herbs, and a decrease in a few clades of large mean size, for example, Rosids, in which most members are taller trees. Apart from these two factors, others that have been proposed to explain the variation in plastid GC content in several lineages should also be taken into account, such as the frequency of RNA editing (Smith 2009) and gBGC (Niu et al. 2017). In general, the above hypotheses still need to be verified across a broad range of species diversity.

This study explored the dynamic evolution of GC content in the coding region of plastid genomes across angiosperms using a comprehensive phylogeny and a large taxon-character dataset. Our results not only provide evidence to support the hypothesis of adaptive evolution of GC content and G/C tracts but also reveal the complex correlations between GC content and the diversity of mononucleotide repeats. This work also implies that variation in plastid GC content of angiosperms may be attributed to a combination of factors, such as functional constraints on photosynthesis, selection for a broad tolerance range, competition for available energy/resources, and plant size. Nevertheless, some crucial issues about the biological and ecological significance of plastid GC content remain unresolved, such as whether variation in plastid GC content reflects the ecological distribution range or mutation rate, whether plastid GC content has evolved independently from floral/lifestyle traits, and to what extent the variation in GC content is heritable.