Motivation of the Original Work

DNA base composition is one of the most fundamental properties of a genome. Chargaff’s measurements of base composition in double-stranded DNA (Chargaff 1951) were important for the development and acceptance of Watson and Crick’s structural model of DNA (Watson and Crick 1953) long before one could count individual guanine and cytosine residues on a sequencing trace. Organismal genomic G + C content can vary widely from less than 20% to over 75%, yet there is typically less variation between different locations within a given species genome (Bohlin et al. 2010). Over 50 years after the discovery of DNA’s structure, understanding what drives variation in genomic G + C content is still very much an open question, despite DNA sequence data from a multitude of biological entities. It is still unclear whether G + C content variation may be generated by neutral processes such as mutational bias or biased gene conversion, or is primarily the result of natural selection. Furthermore, even if such variation is the result of natural selection, is selection acting on the genomic DNA itself, or rather on the molecules (e.g. RNAs and proteins) encoded by the DNA? These questions were ultimately the subject of Galtier and Lobry’s paper published in J. Mol. Evol. in 1997 entitled ‘Relationships between Genomic G + C content, RNA Secondary Structures, and Optimal Growth Temperature’ (Galtier and Lobry 1997).

Despite the far-reaching nature of the questions outlined above, Galtier and Lobry sought to test a relatively specific hypothesis in their work. Chargaff is best known for describing the base composition of double-stranded DNA, in particular that the quantities of adenine (A) and thymine (T) are equal, and the quantities of guanine (G) and cytosine (C) are equal (Chargaff’s first parity rule) (Chargaff 1951). Somewhat surprisingly, this observation also appears to hold true for single-stranded DNA in many cases [termed Chargaff’s parity rule 2 or PR2 (Sueoka 1995)], although this rule is not as exact and there are frequently local variations that do not comply. Attributes consistent with PR2 were first described in Bacillus subtilis (Rudner et al. 1968a, b; Karkas et al. 1968), and subsequently proved true in a wide variety of different genomic sequences (Mitchell and Bridge 2006). Two hypotheses proposed to explain PR2 during the late 1990s were: (1) this phenomenon is due to mutational bias in the replicating polymerase (Sueoka 1962, 1995); and (2) this property is due to natural selection favoring the formation of self-complementary oligonucleotides within the DNA that might form hairpin structures (Forsdyke 1995). Galtier and Lobry proposed that the second hypothesis would predict that genomic G + C content should increase as organismal optimal growth temperature (OGT or Topt) increases to ensure that DNA hairpin structures would remain stable. Thus, the goal of their study was to determine whether this prediction was supported by a large set of prokaryotic genomes.
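As an illustration of what PR2 compliance means in practice, the following minimal Python sketch computes A/T and G/C skews for a single strand. The function name `pr2_skews` and the toy sequence are hypothetical and are not taken from any of the cited studies; skew values near zero are consistent with PR2, while values far from zero indicate a local deviation.

```python
from collections import Counter


def pr2_skews(strand: str) -> dict:
    """Return (A-T)/(A+T) and (G-C)/(G+C) for one DNA strand.

    Values near 0 are consistent with Chargaff's second parity rule.
    """
    counts = Counter(strand.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    # Guard against division by zero for strands lacking a base pair type.
    at_skew = (a - t) / (a + t) if (a + t) else 0.0
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    return {"AT_skew": at_skew, "GC_skew": gc_skew}


if __name__ == "__main__":
    toy_strand = "ATGCGCATTA"  # illustrative toy sequence, not real data
    print(pr2_skews(toy_strand))
```

On real genomes such a calculation would normally be applied to whole chromosomes or to sliding windows, since PR2 is an approximate, large-scale property.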

Like many bioinformaticians, Galtier and Lobry largely compiled existing data for their study (Staley et al. 1984; Dalgaard and Garret 1993; Van de Peer et al. 1994; Sprinzl et al. 1996), and the methods used to determine genomic G + C content (thermal melting curves or buoyant density centrifugation) would be considered quite crude compared to the precision that sequencing provides today. Using these data, Galtier and Lobry found that OGT and genomic G + C content do not display a clear relationship, thus casting doubt on the hypothesis that secondary structures in genomic DNA explain Chargaff’s PR2 (Galtier and Lobry 1997). Despite the specific nature of the hypothesis addressed, the two findings for which this paper is most frequently cited are quite general. The first is the lack of relationship between OGT and genomic G + C content. The second is that G + C content in the stems of the 16S and 23S rRNAs, and generally in the 5S rRNA and tRNAs, does correlate with organismal OGT. Both of these trends had previously been established in the context of hyperthermophilic archaea (Dalgaard and Garret 1993). However, the work of Dalgaard and Garrett included a small number of organisms (about twenty vs. over one hundred in Galtier and Lobry), which belonged to a limited phylogenetic distribution with narrow environmental diversity (thermophilic archaea with a few additional model species for comparison). Galtier and Lobry extended the findings of Dalgaard and Garrett across significantly more bacterial species, and in so doing extended the story beyond thermophilic archaea to a much more general phenomenon that attracted significantly more interest.

In the years since the publication of Galtier and Lobry’s manuscript, work toward understanding the forces that shape genome composition has continued. The debate regarding the relationship between genome composition and thermostability was by no means settled by this work, and satisfying explanations for Chargaff’s PR2 and the diversity of G + C content observed across diverse genomes remain elusive over 20 years later. The two major findings of Galtier and Lobry have spurred significant further work that encompasses a range of different applications that take advantage of the relationships between OGT, structured RNA G + C content, and genomic G + C content. These include: prediction of organism OGT based on 16S rRNA sequence, separation or enrichment of DNA extracted from microbial communities for particular sub-populations based on G + C content, and computational methods for structured RNA identification.

Resolving the Relationship Between Genomic G + C Content and Thermoadaptation

The work of Galtier and Lobry provided evidence against adaptation to growth at higher temperature directly impacting genomic G + C content. However, this premise was further assessed in many additional studies over the years, using several different genomic subsets or better controlled sets of genomes. Analysis of the three codon positions in coding sequences separately (under the assumption that the third codon position is less likely to be under selection for protein function) showed that the G + C content of the third codon position closely mirrors that of the genome as a whole and does not correlate with OGT (Hurst and Merchant 2001). However, analysis of coding sequence dinucleotide frequencies indicated some OGT-correlated changes, suggesting that thermoadaptation could directly impact genome dinucleotide frequencies (Nakashima et al. 2003). Several additional studies have assessed whether better phylogenetically informed sampling (comparing pairs of genomes from within the same class) enables better detection of a correlation between OGT and G + C content (Musto et al. 2004, 2006; Wang et al. 2006). However, findings from such works remain controversial and are not necessarily robust across many bacterial genera. It is clear that many factors, such as codon bias (Knight et al. 2001) and changes in protein composition associated with thermoadaptation (Singer and Hickey 2000), may impact genomic G + C content (Hickey and Singer 2004). However, none of these factors yield a clear relationship between genomic G + C content and OGT.
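The codon-position analysis described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the procedure used in the cited studies; the function name `gc_by_codon_position` and the toy coding sequence are hypothetical, and the sketch assumes an in-frame coding sequence beginning at the first codon position.

```python
def gc_by_codon_position(cds: str) -> list:
    """Return the G + C fraction at codon positions 1, 2, and 3.

    Assumes `cds` is an in-frame coding sequence; any trailing
    partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete final codon
    fractions = []
    for position in range(3):
        bases = cds[position:usable:3]  # every third base from this offset
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


if __name__ == "__main__":
    toy_cds = "ATGGCCTAA"  # illustrative toy sequence: ATG GCC TAA
    gc1, gc2, gc3 = gc_by_codon_position(toy_cds)
    print(f"GC1={gc1:.2f} GC2={gc2:.2f} GC3={gc3:.2f}")
```

In an analysis of this kind, the third-position value (often called GC3) would be pooled over all coding sequences in a genome before being compared with genomic G + C content or OGT.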

Alternative Explanations for Chargaff’s Second Parity Rule

Although Galtier and Lobry concluded that ssDNA hairpins are not likely a significant contributor to Chargaff’s second parity rule (PR2), during the decades since its original formulation Chargaff’s PR2 has largely proven robust as additional sequence data have been collected. It applies to most complete genomes (Mitchell and Bridge 2006), although genomes of organelles (Mitchell and Bridge 2006; Nikolaou and Almirantis 2006) and ssDNA viruses (Mitchell and Bridge 2006) are notably not compliant. Furthermore, although most complete genomes do follow PR2, there are significant local deviations. In bacterial genomes, the direction of replication and ori position significantly impact genome composition (McLean et al. 1998; Nikolaou and Almirantis 2005), sequences that are actively transcribed also tend to display purine loading (Szybalski et al. 1966; Bell and Forsdyke 1999), and exons tend to conform to PR2 more than intronic sequence in eukaryotes (Touchon et al. 2004). Despite such local variations, the rule has been extended from symmetry of mononucleotide frequencies to include symmetry of oligonucleotide frequencies (Qi and Cuticchia 2001; Baisnée et al. 2002; Shporer et al. 2016). The most satisfying explanations for the maintenance of Chargaff’s second rule invoke frequent duplication, inversion, and transposition events in the genome (Albrecht-Buehler 2006, 2007; Okamura et al. 2007).
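The extension of PR2 to oligonucleotides states that, on a single strand, each k-mer occurs about as often as its reverse complement. A minimal Python sketch of that comparison is given below; the helper names `revcomp` and `kmer_symmetry` and the toy input are hypothetical, and the code is not drawn from the cited works.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_symmetry(strand: str, k: int = 2) -> dict:
    """Map each observed k-mer to (its count, count of its reverse complement).

    Under the oligonucleotide form of PR2 the two counts should be similar.
    k-mers containing characters other than A, C, G, T are skipped.
    """
    strand = strand.upper()
    counts = Counter(strand[i:i + k] for i in range(len(strand) - k + 1))
    return {
        kmer: (n, counts.get(revcomp(kmer), 0))
        for kmer, n in counts.items()
        if set(kmer) <= set("ACGT")
    }


if __name__ == "__main__":
    # Illustrative toy sequence, far too short to say anything about PR2.
    for kmer, pair in sorted(kmer_symmetry("AATTGCGC", k=2).items()):
        print(kmer, pair)
```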

Causes for G + C Content Variability: Neutral Processes or Natural Selection?

The potential causes of diverse genomic G + C content essentially reduce to whether the observed variation is due to neutral processes (Sueoka 1962, 1999) or natural selection. It is easy to imagine how neutral processes may contribute to nucleotide content, and several studies have assessed the viability of this option across different species (Zhao et al. 2007; Wu et al. 2012). However, most bacterial polymerases, even those from high G + C content organisms, display a bias toward conversion of G–C pairs into A–T pairs (Lind and Andersson 2008; Hershberg and Petrov 2010; Hildebrand et al. 2010; Wielgoss et al. 2011), although this may not be universally true (Dillon et al. 2015). Increasingly it appears that G + C content in genomes may be the result of a combination of neutral and selective processes that are quite subtle (Reichenberger et al. 2015). In prokaryotes, coding sequences tend to be more G + C rich than non-coding regions (Bohlin et al. 2008), and coding regions that are part of the core genome have higher G + C content than those of the periphery genome (Bohlin et al. 2017), but modeling studies of substitution rates in the core genome still suggest a universal G–C to A–T mutational bias (Bohlin et al. 2018). Symbiotic bacteria whose genes are under less selective pressure have both highly reduced and very A + T rich genomes (McCutcheon and Moran 2011), suggesting that lack of selection leads to A + T richness.

An alternative neutral process that has been invoked to explain variation in G + C content is biased gene conversion. In eukaryotes, G–C alleles are more likely to be maintained than A–T alleles during gene conversion events (Mugal et al. 2015). Such events are also proposed to impact bacterial genomes, and a positive correlation is observed between G + C content and evidence of recombination for genes in the core genome (Lassalle et al. 2015). Furthermore, the presence of machinery necessary for non-homologous end joining (NHEJ) is also correlated with increased G + C content (Weissman et al. 2019). The combination of these studies with the observation that increased genomic G + C content may correlate with environmental conditions such as aerobiosis (Naya et al. 2002; Romero et al. 2009) suggests that DNA damage may play a role in prokaryotic genomic G + C content. Thus, the essential question of what causes the strikingly large range of G + C content over diverse prokaryotic genomes likely has a quite nuanced answer, and remains open even as more, and more diverse, genomes become available.

rRNA G + C Content and Optimal Growth Temperature

The observation of Galtier and Lobry that structured non-coding RNAs, and in particular their double-stranded regions, display a strong correlation between G + C content and OGT has been widely verified. Additional work shows that the correlation between G + C content and OGT in rRNAs is most noticeable in the regions expected to be base-paired, but also extends to loop regions (although with a small effect size) (Wang et al. 2006). The effect occurs among sequences chosen to control for differences in G + C content due to taxonomy (from the same genus), and cold-adapted organisms (in contrast to just mesophiles and thermophiles) display similar trends in their rRNA (Wang et al. 2006) and tRNA (Dutta and Chaudhuri 2010). Furthermore, the same observation can also be made for other structured RNAs such as the signal recognition particle (SRP) RNA (Miralles 2010). Additionally, it has been found that the expression of different copies of the rRNAs with differing G + C composition in the same organism may be tuned to temperature, with higher G + C content rRNAs enabling increased fitness at higher temperatures (Sato et al. 2017; Sato and Kimura 2019).
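To make the distinction between base-paired and loop regions concrete, the following Python sketch computes G + C fractions separately for paired and unpaired positions. It assumes the secondary structure is supplied in dot-bracket notation, which is a convention chosen here for illustration and not necessarily the representation used in the cited studies; the function name `gc_paired_unpaired` and the toy hairpin are hypothetical.

```python
def gc_paired_unpaired(seq: str, structure: str) -> tuple:
    """Return (G + C fraction of paired bases, G + C fraction of unpaired bases).

    `structure` is assumed to be dot-bracket notation of the same length
    as `seq`: '(' and ')' mark base-paired positions, '.' marks unpaired ones.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")

    seq = seq.upper()
    paired = [b for b, s in zip(seq, structure) if s in "()"]
    unpaired = [b for b, s in zip(seq, structure) if s == "."]

    def gc_fraction(bases: list) -> float:
        # An empty category is reported as 0.0 rather than raising.
        return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0

    return gc_fraction(paired), gc_fraction(unpaired)


if __name__ == "__main__":
    # Illustrative toy hairpin: a G/C stem closing an A-rich loop.
    stem_gc, loop_gc = gc_paired_unpaired("GGCAAAAAGCC", "(((.....)))")
    print(f"stem G + C = {stem_gc:.2f}, loop G + C = {loop_gc:.2f}")
```

Applied across many organisms, per-organism stem and loop values of this kind are the quantities that would then be compared with OGT.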

The robustness of the correlation between G + C composition and OGT has also spurred efforts to more broadly understand what other factors contribute to RNA thermostability. OGT also correlates with a decrease in the prevalence of uracil (U) specifically, although this does not seem to correspond with a replacement of G·U base-pairs with more G–C base-pairs, but rather a decrease in U prevalence across the molecule, including loop regions (Khachane et al. 2005). The structure of a thermophilic ribosome also appears to be more tightly packed than that of a mesophile (Mallik and Kundu 2013), and in silico models of RNA folding suggest that tRNAs in thermophiles may also display better folding characteristics than those in psychrophiles (Dutta and Chaudhuri 2010).

Using rRNA G + C to Predict Optimal Growth Temperature

There are several applications of the observed correlation between OGT and G + C content in functional RNAs. One of these is enrichment of a sampled microbial community for organisms from a specific environment (Kimura et al. 2006). A second application of this correlation is the estimation of OGT, typically based on rRNA sequence (or its composition determined from melt-curves) (Kimura et al. 2010). This approach may be applied to single organisms, or increasingly to confirm the native environment of sequences isolated from metagenomic sequencing (Ragon et al. 2013; Kimura et al. 2013). As the amount of sequence derived from whole genome shotgun sequencing (WGS) compared with 16S rRNA has shifted in such studies, methods have expanded to include additional features from genomic sequence such as ORF composition, but the composition of the tRNA and rRNA has a significant impact on accuracy even in the context of these additional data (Sauer and Wang 2019), although prediction based on proteomic data alone can also be effective (Li et al. 2019).

Even prior to the development of quantitative regressions to predict OGT based on genomic features, the relationship between rRNA G + C content and OGT was used to speculate about the environment of the last universal common ancestor (LUCA). In an early work, Galtier et al. used a Markov model of sequence evolution coupled with maximum likelihood analysis to suggest that the ancestral rRNA contained sequence features consistent with a mesophilic origin (Galtier et al. 1999). However, this finding was rapidly disputed by others using alternative reconstruction techniques (e.g. maximum parsimony), as well as including additional molecules for analysis such as tRNA (Di Giulio 2000), or protein sequences (Di Giulio 2001, 2003). More realistic models based on both protein and rRNA reconstructed sequences indicated the potential for a mesophilic origin, followed by divergence and parallel adaptation to higher temperatures, and then by subsequent adaptation to more temperate environments (Boussau et al. 2008; Groussin and Gouy 2011). While this question is increasingly tackled by approaches that utilize far more information than what was available 20 years ago to reconstruct entire ancestral gene sets, a clear consensus still has not been reached (Weiss et al. 2016; Akanuma 2017).

Using G + C Content to Identify ncRNA

Another application of the relationship between structured RNA G + C content and organismal OGT, coupled with the lack of relationship between genomic G + C content and OGT, is the computational discovery of novel structured RNAs. It is established that stable structures may be formed by many sequences that do not encode functional RNA structures (Rivas and Eddy 2000). However, the premise that in a high A + T genome structured RNAs should be encoded by regions with higher G + C content, so that such molecules retain their stability, is valid. Several different methods for ncRNA identification across a range of different species use some variation of this premise. Deviation from genomic G + C content has been used alone to identify ncRNAs within the extreme hyperthermophiles Methanococcus jannaschii and Pyrococcus furiosus (which have modest genomic G + C contents of ~ 30% and ~ 40%, respectively) (Klein et al. 2002), in combination with dinucleotide frequencies to find similar results in M. jannaschii (Schattner 2002), and to screen intergenic regions in A + T rich prokaryotic genomes that are further processed by other ncRNA comparative genomic approaches (Meyer et al. 2009; Stav et al. 2019). Other approaches used genome composition as one of many features to identify putative ncRNAs in genomes of mesophiles with less genome composition bias such as E. coli (Carter et al. 2001). Finally, several A + T rich eukaryotic genomes have also been screened in a similar manner, including Plasmodium falciparum (Upadhyay et al. 2005) and Dictyostelium discoideum (Larsson et al. 2008). Thus, although any given mRNA may fold into a stable structure, when combined with other information G + C content has proven to be a good screening tool for ncRNA identification in specific situations where the G + C content due to structured RNA stability may rise above the genomic background.
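The screening premise, flagging regions whose G + C content rises above the genomic background, can be illustrated with a simple sliding-window scan. The Python sketch below is a simplified stand-in and does not reproduce any of the cited methods; the function names, the window and step sizes, and the `excess` threshold are all hypothetical choices, and the input is a synthetic toy sequence.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in `seq` (0.0 for empty input)."""
    return sum(base in "GC" for base in seq.upper()) / len(seq) if seq else 0.0


def gc_rich_windows(genome: str, window: int = 50, step: int = 25,
                    excess: float = 0.15) -> tuple:
    """Scan `genome` and report windows exceeding the background G + C.

    Returns (background, hits), where each hit is (start, end, window_gc)
    and window_gc - background >= excess. The threshold is an arbitrary
    illustrative value, not one taken from the literature.
    """
    background = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc_fraction(genome[start:start + window])
        if window_gc - background >= excess:
            hits.append((start, start + window, window_gc))
    return background, hits


if __name__ == "__main__":
    # Synthetic toy "genome": an A + T only background with a G + C island.
    toy_genome = "AT" * 50 + "GC" * 10 + "AT" * 50
    background, hits = gc_rich_windows(toy_genome, window=20, step=10, excess=0.3)
    print(f"background G + C = {background:.3f}")
    for start, end, window_gc in hits:
        print(start, end, window_gc)
```

As the text notes, G + C enrichment alone is a coarse filter, so candidate windows from a scan like this would typically be passed on to comparative or structure-based analyses.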

Conclusions

The major findings of Galtier and Lobry have proven robust more than 20 years and many additional genomes later. They were not the first to observe the relationship between G + C content of structured RNA and OGT and contrast it with that between genomic G + C content and OGT, but they placed this observation into a much larger context than Dalgaard and Garrett (Dalgaard and Garret 1993), and in doing so made the finding accessible to a larger audience and ultimately seeded several other fruitful areas of research. The specific hypothesis that motivated this work has long since been superseded by other explanations, but the root questions remain largely unresolved. Thus, this work remains highly cited today, and will likely continue to be in the future.