
1 Introduction

The molecular evolutionary clock has had a profound influence on molecular evolutionary theory, while also providing an indispensable tool for inferring evolutionary rates and timescales. Starting from the simple premise that evolutionary change at the molecular level proceeds at a relatively constant rate, the molecular clock has undergone considerable evolution over the past six decades (Fig. 1.1). The history of research on the molecular clock has featured an extensive debate over molecular evolutionary theory, persistent challenges to its assumptions and predictions, and applications to questions about the timing of major biological events. Throughout this time, researchers have devoted substantial efforts to understanding the causes of evolutionary rate variation across the tree of life, and to applying the principle of the molecular clock in methods for estimating evolutionary timescales. The molecular clock has now secured an important role in research in the life sciences, finding applications in such diverse fields as evolutionary biology, molecular ecology, archaeology, and epidemiology.

Fig. 1.1

Timeline of advances throughout the history of the molecular clock, beginning with its application to amino acid sequences (Zuckerkandl and Pauling 1962). The left side of the timeline lists some of the key developments in molecular evolutionary theory (stars) and in molecular dating methods and models of evolutionary rate variation (squares). The term ‘molecular evolutionary clock’ was introduced in 1965. Most of the developments listed here are referred to explicitly in the main text of this chapter. The right side of the timeline lists the first use of different data types for molecular dating (circles; for references, see Ho et al. 2016). Nucleotide sequences are now the most widely used type of genetic data in molecular dating analyses

The idea of a molecular clock emerged from studies of proteins in the mid-twentieth century, a time when new biochemical and genetic data were bringing important insights into evolutionary biology. In particular, efforts to determine the amino acid sequences of proteins were yielding valuable data sets that could inform evolutionary thinking. A series of innovative studies in the early 1960s gave rise to the molecular clock (Zuckerkandl and Pauling 1962, 1965; Margoliash 1963; Doolittle and Blombäck 1964), which soon grew to become an integral part of the neutral theory of molecular evolution (Kimura 1968, 1969). In the ensuing decades, the molecular clock played a central role in the debates between neutralists and selectionists, who supported opposing theories of molecular evolution (Ohta and Gillespie 1996). In the present genomic age, the molecular clock is perhaps most widely recognized as a tool for estimating the timing of evolutionary events (Bromham and Penny 2003).

This book provides an overview of the molecular evolutionary clock, including its theory and practice. It attempts to cover a huge field of research that cannot be satisfactorily summarized in an individual review article; nevertheless, this book can only be considered as an introductory text. Many of the chapters in this book focus on recent developments in this fast-moving field, including the latest endeavours to cope with genome-scale data sets and to combine molecular, phenotypic, and palaeontological data in a biologically meaningful way.

In this opening chapter, I describe the origins of the molecular clock and its evolution over the past six decades. I then provide an overview of the different forms of evolutionary rate variation across the tree of life, ranging from viruses and bacteria to eukaryotes. The chapter concludes with a description of how molecular clocks are used to infer evolutionary timescales, including a summary of some of the major applications of molecular dating. Throughout this chapter, I introduce the contents of the remaining chapters of the book.

2 The Molecular Clock Hypothesis

2.1 Origins of the Molecular Clock

The term ‘molecular evolutionary clock’ was proposed by Emile Zuckerkandl and Linus Pauling in 1965. Zuckerkandl had joined Pauling at the California Institute of Technology in late 1959 and the two worked on the sequencing and analysis of the haemoglobin protein (Morgan 1998). Less than a decade earlier, the first amino acid sequence of a protein, insulin, had been determined. Zuckerkandl and Pauling (1962) noted that the divergence in the amino acid sequence of haemoglobin increased with the evolutionary distance between species. They made the inspired assumption that a simple linear relationship existed between the two quantities.

Zuckerkandl and Pauling (1962) raised the possibility of using this clocklike property to develop a tool for estimating the timing of divergence between haemoglobin chains and between vertebrate species. Based on a palaeontological estimate of 100–160 million years (Myr) for the divergence between human and horse, they inferred an evolutionary rate of 1 amino acid substitution per 14.5 Myr (Fig. 1.2a). Their application of this rate to the amino acid sequences yielded estimates of the divergence times between haemoglobin chains, with the α chain splitting from the β and γ chains about 565–600 Myr ago in the late Precambrian. The divergences between the β chain and the γ and δ chains were estimated to have occurred much more recently, at 260 Myr ago in the Permian and 44 Myr ago in the Eocene, respectively.
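The arithmetic behind this kind of calibration can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: it takes the midpoint of the 100–160 Myr calibration, uses a hypothetical count of 18 amino acid differences (a number chosen for illustration, not taken from the text above), and ignores repeated substitutions at the same site. The function names are likewise hypothetical.

```python
def myr_per_substitution(n_differences, divergence_myr):
    """Time (Myr) per amino acid substitution along a single lineage.

    Differences between two sequences accumulate along both lineages
    since their split, so each lineage carries about half of them.
    """
    substitutions_per_lineage = n_differences / 2
    return divergence_myr / substitutions_per_lineage


def divergence_time(n_differences, myr_per_sub):
    """Divergence time (Myr) implied by a count of differences under a strict clock."""
    return (n_differences / 2) * myr_per_sub


# Calibration: midpoint of the 100-160 Myr horse-human estimate.
calibration_myr = (100 + 160) / 2
# ASSUMPTION: 18 differences is an illustrative count, not a figure from the text.
clock = myr_per_substitution(18, calibration_myr)
print(round(clock, 1))  # Myr per substitution along one lineage

# ASSUMPTION: a hypothetical pair of chains differing at 36 sites.
print(round(divergence_time(36, clock)))
```

With these illustrative inputs the calibrated rate comes out close to one substitution per 14.5 Myr, and doubling the number of differences doubles the inferred divergence time, which is the linear relationship that the clock assumes.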

Fig. 1.2

(a) The earliest use of the molecular clock to infer evolutionary divergence times, based on amino acid sequences of haemoglobin (Zuckerkandl and Pauling 1962). The evolutionary rate was calibrated using a palaeontological estimate of the horse–human divergence at 100–160 million years ago. Assuming a constant rate of amino acid replacements, the divergence time of gorilla and human was estimated (represented by two data points corresponding to the divergences between the two α chains and between the two β chains), along with the divergence times of various pairs of haemoglobin chains (denoted by Greek letters). Data from Zuckerkandl and Pauling (1962). (b) Clocklike evolution in fibrinopeptides, based on pairwise comparisons of amino acid sequences in sheep, goat, reindeer, ox, pig, and human. Pairwise amino acid sequence identity (%) is plotted against the time of divergence estimated from the fossil record. Data from Doolittle and Blombäck (1964)

In their analysis of haemoglobin, Zuckerkandl and Pauling (1962) also obtained an estimate of 11 Myr for the evolutionary split between gorilla and human (Fig. 1.2a). They noted that this estimate was at the lower end of the timing of 11–35 Myr ago suggested by the fossil record. Their estimate, and other molecular estimates of the hominid evolutionary timescale reported in the 1960s (Sarich and Wilson 1967a), were controversial because they were inconsistent with the prevailing notion of a large evolutionary distance between modern humans and the other great apes (Wilson et al. 1977). However, reports would soon emerge of constant evolutionary rates in the amino acid sequences of cytochrome c (Margoliash 1963) and fibrinopeptides (Fig. 1.2b; Doolittle and Blombäck 1964), lending support to the molecular clock hypothesis.

In addition to developing a tool for inferring evolutionary timescales, Zuckerkandl and Pauling (1962) foresaw some of the problems that would beset molecular clock analyses in subsequent decades. They referred to the problems posed by repeated substitutions at the same amino acid site (including back-mutations), the potentially confounding impacts of natural selection, and the influence of population size. Their idea of the molecular clock acknowledged an important role for natural selection, although they later surmised that ‘the changes that occur at a fairly regular over-all rate would be expected to be those that change the functional properties of the molecule relatively little’ (p. 148, Zuckerkandl and Pauling 1965). This statement seemed to anticipate the close association that would soon form between the molecular clock and the neutral theory of molecular evolution (e.g., Kimura 1968; King and Jukes 1969; Wilson and Sarich 1969).

The neutral theory, put forward by Motoo Kimura in 1968, made the bold assertion that the majority of mutations are neutral. This contradicted the dominant view that such mutations are rare or transient (Fisher 1936; Mayr 1963), although the importance of neutral mutations in molecular evolution had been suggested earlier in the same decade (Freese 1962; Sueoka 1962). In Kimura’s proposal, the term ‘neutral’ was not intended to suggest that the corresponding gene lacked function (e.g., Zuckerkandl 1978), but instead meant that the mutation conferred neither an advantage nor disadvantage to the organism and that the fate of the mutation would be governed by genetic drift. Although the molecular clock was influential in the development of the neutral theory (Takahata 2007), Kimura’s case for the theory largely rested on estimates of enzyme variability from electrophoretic studies and rates of protein evolution inferred from analyses of amino acid sequences. He argued that these high evolutionary rates greatly exceeded the limits imposed by the ‘cost of natural selection’ (Haldane 1957), thus suggesting that many of the mutations must be neutral (Kimura 1968).

A significant consequence of the neutral theory is that the rate at which neutral mutations are fixed in the population (known as the ‘substitution rate’) is approximately equal to the rate at which the mutations are spontaneously generated (Kimura 1968). For this reason, the molecular clock was regarded as an additional source of evidence for the neutral theory (Kimura 1969, 1983). In Chap. 2, Soojin Yi provides an introduction to molecular evolution, including the neutral theory and its later developments, as well as some of the principles behind the molecular clock. She also explains the relationship between the mutation rate and substitution rate under the neutral theory.
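The equality between the substitution rate and the mutation rate follows from a short calculation. The sketch below assumes a diploid population of constant size N, a neutral mutation rate μ per gene copy per generation, and a substitution rate k per generation; these symbols are introduced here for illustration and do not appear in the text above.

```latex
k \;=\; \underbrace{2N\mu}_{\text{new neutral mutations per generation}}
\;\times\; \underbrace{\frac{1}{2N}}_{\text{fixation probability of each}}
\;=\; \mu
```

Population size cancels, so under strict neutrality the substitution rate depends only on the mutation rate.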

The initial reactions to the proposal of the molecular clock were largely negative (e.g., Stebbins and Lewontin 1972), with criticisms being levelled by a number of eminent evolutionary biologists. For example, Ernst Mayr argued that ‘evolution is too complex and too variable a process, connected with too many factors, for the time dependence of the evolutionary process at the molecular level to be a simple function’ (p. 137, Zuckerkandl and Pauling 1965). At the time, the evolutionary biologist Morris Goodman was one of the few to recognize the potential applications of the clock (Morgan 1998). With further evidence for the constancy of molecular evolutionary rates, as well as growing appreciation of its great potential for reconstructing the timescale of evolution, the notion of a molecular clock endured. By the late 1970s, Allan Wilson et al. (1977) declared that the ‘discovery of the evolutionary clock stands out as the most significant result of research in molecular evolution’ (p. 577).

2.2 Decades of Evolution

The molecular clock was a prominent source of contention in the molecular evolutionary debates throughout the 1970s to 1990s, an era that also saw a shift in focus from protein sequences to DNA sequences (Fig. 1.1; Ohta and Gillespie 1996; Nei et al. 2010). In the early part of this period, there was growing evidence of a discrepancy between the evolutionary dynamics of ‘silent’ (synonymous or non-coding) and ‘replacement’ (nonsynonymous) changes in DNA. Replacement substitutions occurred at a constant rate per year, which was cited as support for the neutral theory (Kimura 1969). However, silent substitutions, which are expected to be under much lower selective constraint, appeared to occur at a constant rate per generation (Laird et al. 1969; Kohne 1970). There was evidence of a slowdown in evolutionary rates of both proteins and DNA in hominoids compared with other primates and mammals, particularly rodents (Goodman 1961; Kikuno et al. 1985; Wu and Li 1985), in accordance with the differences in generation times among these organisms.

Kimura (1983) later recognized that the neutral theory should predict a constant substitution rate per generation rather than per year, while admitting that evidence of the constancy of evolutionary change per unit time presented a ‘difficult problem’ (p. 246) for the theory. The different dynamics observed for silent and replacement substitutions were partly reconciled in the nearly neutral theory, developed by Tomoko Ohta (1972, 1973). The nearly neutral theory proposes that many mutations have a small impact on fitness and are mildly deleterious or mildly advantageous (see Chap. 2), and it predicts a constant evolutionary rate per unit of time. However, this prediction relies on a negative correlation between population size and generation time, which was assumed but not explicitly demonstrated by Ohta (1972, 1973). In any case, as described by Gillespie (1991), Kimura ‘quickly retreated from the [per-year constancy of mutation rates] when he adopted Ohta’s mildly deleterious theory’ (p. 274). Nevertheless, upon considering the evidence of a generation-time effect, Kimura (1987) noted that the departures from rate constancy across lineages were not as great as would be expected on the basis of differences in generation time.

A somewhat different challenge to the hypothesis of a molecular clock was that substitutions were often found to occur more erratically than expected. Zuckerkandl and Pauling (1965) had suggested that amino acid substitutions occur stochastically, following a Poisson point process. Under this stochastic process, the variance in the number of substitutions per unit time is equal to the expected number of substitutions per unit time. The ratio of these quantities, known as the index of dispersion, provides a measure of the departure from a Poisson process; values exceeding 1 indicate overdispersion. Overdispersion was found to be widespread among proteins (Ohta and Kimura 1971; Langley and Fitch 1974; Gillespie 1984, 1989), contradicting the expectations under the molecular clock. One attempt to explain this overdispersion within the framework of the neutral theory was based on a model of fluctuating neutral space (Takahata 1987), in which each neutral mutation changes the rate of neutral mutations. However, most explanations appealed to the effects of natural selection, with overdispersion being a potential outcome under some conditions of episodic, fluctuating, or negative selection (Gillespie 1984, 1993; Cutler 2000). There is now a body of evidence showing that some features of molecular and genomic evolution cannot be adequately explained by the neutral theory (e.g., Kreitman and Akashi 1995; Kern and Hahn 2018).
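The index of dispersion is simple to compute from counts of substitutions along lineages of equal duration. The following sketch uses hypothetical counts, and it uses the sample variance as the estimator, which is an assumption made here rather than a detail taken from the studies cited above.

```python
from statistics import mean, variance


def index_of_dispersion(counts):
    """Ratio of the sample variance to the mean of substitution counts."""
    return variance(counts) / mean(counts)


# Hypothetical numbers of substitutions along lineages of equal duration.
poisson_like = [7, 13, 10, 6, 14, 10]   # variance close to the mean
erratic = [2, 25, 7, 1, 18, 7]          # same mean, far larger variance

print(index_of_dispersion(poisson_like))
print(index_of_dispersion(erratic))
```

Both sets of counts have the same mean, but only the second has a ratio well above 1, which is the signature of overdispersion.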

The molecular clock gradually moved away from its conspicuous role in the selectionist–neutralist debate and became increasingly appreciated for its practical applications in evolutionary biology. Although there is continued interest in the causes of evolutionary rate variation, the molecular clock is now most widely known as a tool for inferring evolutionary timescales. However, the utility of the molecular clock as a dating tool is potentially diminished by the presence of evolutionary rate variation. There have been considerable efforts to rescue the molecular clock from this quagmire, leading to major advances in molecular dating methods over the past two decades.

3 Evolutionary Rate Variation

3.1 Partitioning Variation in Rates

Evolutionary rate variation occurs in different modes and across a range of temporal, molecular, and biological scales. Early studies considered differences in rates across nucleotide or amino acid sites (site effects), across genes or loci (gene or locus effects; Fig. 1.3a), and across lineages (lineage effects; Fig. 1.3b). For a given gene, any overdispersion that remained after accounting for lineage effects was ascribed to residual effects (e.g., Langley and Fitch 1974; Gillespie 1991). In their comprehensive review of evolutionary rate variation in plants, Gaut et al. (2011) used an approach inspired by an analysis of variance that had been conducted 8 years earlier (Smith and Eyre-Walker 2003). Specifically, in addition to site effects, gene effects, and lineage effects, they considered the two- and three-way interactions among these three components: site-by-gene effects, site-by-lineage effects, gene-by-lineage effects, and site-by-gene-by-lineage effects. For molecular clocks, the most important of these effects are caused by gene-by-lineage interactions (Fig. 1.3c); these are analogous to residual effects (Gillespie 1991).
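A simplified version of this partitioning can be sketched for a table of rates with one row per gene and one column per lineage. The example below is not the procedure used in the studies cited above; it is a minimal two-way decomposition on log-transformed rates that leaves out site effects, with a hypothetical function name and hypothetical rate values. On the log scale, gene effects and lineage effects are additive, and whatever remains is the gene-by-lineage interaction.

```python
import math


def partition_log_rates(rates):
    """Split log rates (rows = genes, columns = lineages) into additive effects.

    Returns (grand_mean, gene_effects, lineage_effects, interactions), where
    log(rates[g][j]) = grand_mean + gene_effects[g] + lineage_effects[j]
                       + interactions[g][j].
    """
    logs = [[math.log(r) for r in row] for row in rates]
    n_genes, n_lineages = len(logs), len(logs[0])
    grand = sum(map(sum, logs)) / (n_genes * n_lineages)
    gene = [sum(row) / n_lineages - grand for row in logs]
    lineage = [sum(row[j] for row in logs) / n_genes - grand
               for j in range(n_lineages)]
    inter = [[logs[g][j] - grand - gene[g] - lineage[j]
              for j in range(n_lineages)] for g in range(n_genes)]
    return grand, gene, lineage, inter


def largest_interaction(rates):
    """Largest absolute gene-by-lineage interaction term."""
    inter = partition_log_rates(rates)[3]
    return max(abs(x) for row in inter for x in row)


# Hypothetical rates: each entry is (gene multiplier) x (lineage multiplier),
# so there are gene effects and lineage effects but no interaction.
gene_mult = [1.0, 2.0, 4.0]
lineage_mult = [0.5, 1.0, 3.0]
rates = [[g * m for m in lineage_mult] for g in gene_mult]
print(largest_interaction(rates))   # effectively zero

rates[0][2] *= 5                    # one gene speeds up in one lineage only
print(largest_interaction(rates))   # now clearly nonzero
```

When every gene shares the same lineage multipliers, the interaction terms vanish; a rate change confined to one gene in one lineage shows up as a nonzero interaction.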

Fig. 1.3

Evolutionary rate variation depicted in phylogenetic trees with branch lengths proportional to the amount of genetic change. Each tree consists of six taxa, labelled A to F. (a) Gene effects lead to rate variation across genes, so that there are different total amounts of genetic change in the phylogenetic trees from genes 1, 2, and 3. (b) Lineage effects lead to rate variation across the branches of the tree, but this effect is shared by all of the genes. Accordingly, the branch lengths of the three phylogenetic trees share the same proportions. (c) Gene-by-lineage interactions lead to different patterns of among-lineage rate variation across the phylogenetic trees from the three genes. (d) Punctuated evolution proposes that bursts of genetic change occur at speciation events, denoted by circles. This leads to a pattern in which the length of the path between the root and any tip of the tree is approximately proportional to the number of speciation events along that path. (e) Epoch effects occur when there is a different evolutionary rate during a particular period of time, as indicated by the grey shaded area. (f) Time-dependent rates lead to a bias whereby evolutionary rates are higher when estimated over recent, short-term timescales than over longer periods of time

3.2 Site Effects and Gene Effects

Site effects can be caused by differences in selective constraints on individual nucleotides or amino acids and by heterogeneity in mutation rates (Hodgkinson and Eyre-Walker 2011). Functionally or structurally important sites tend to evolve more slowly than other sites, or might even be invariant to change, and such amino acid sites in cytochrome c and haemoglobin were discussed at length by Zuckerkandl and Pauling (1965). Differences in the proportions of such constrained sites were argued to be the main cause of rate variation across proteins under the neutral theory (King and Jukes 1969). In the nucleotide sequences of protein-coding genes, nonsynonymous mutations are more likely to be selected against than are synonymous mutations. The distinction between ‘silent’ and ‘replacement’ dynamics in DNA sequences was already well appreciated by the 1970s (e.g., King and Jukes 1969; Jukes and Kimura 1984), and its varying effects on rates at the three codon positions in protein-coding genes are now routinely taken into account in analyses of nucleotide sequences (e.g., Shapiro et al. 2006).

Mutation rates can also vary among nucleotides and according to the local context, with mutations at cytosine-guanine dinucleotides (‘CpG’) occurring at higher rates than at other dinucleotides partly because of the vulnerability of the cytosine to deamination (see Chap. 2). Studies of genomic data have revealed other forms of site effects, such as higher mutation rates in parts of the genome linked to insertions and deletions (Tian et al. 2008). In analyses of molecular sequence data, site effects are typically accommodated by modelling the site rates using a gamma distribution (Yang 1996). As with many models in biology, this approach aims to capture an important feature of sequence evolution without attempting to resolve the underlying mechanisms.
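To give a feel for what a gamma model of site rates implies, the sketch below draws relative rates from a gamma distribution scaled to have a mean of 1. The shape values, the number of sites, and the function name are illustrative assumptions. Phylogenetic software typically works with a discretized form of the distribution instead of random draws, so this is a picture of the model's behaviour and not of any particular implementation.

```python
import random
from statistics import mean, variance


def draw_site_rates(alpha, n_sites, seed=0):
    """Draw relative rates for n_sites from a gamma distribution with mean 1.

    Shape alpha and scale 1/alpha give mean 1 and variance 1/alpha, so a
    smaller alpha means stronger rate heterogeneity among sites.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 5.0):  # illustrative shape values
    rates = draw_site_rates(alpha, 20000)
    print(alpha, round(mean(rates), 2), round(variance(rates), 2))
```

A single shape parameter therefore controls how unequal the site rates are, without any statement about why particular sites are fast or slow.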

Gene effects were widely recognized during the development of the molecular clock (Fig. 1.3a), with evidence of evolutionary rate variation across haemoglobin, cytochrome c, fibrinopeptides, and other gene products (e.g., Zuckerkandl and Pauling 1965; Dickerson 1971). In her extensive survey of protein sequences, the pioneering biochemist Margaret Dayhoff (1978) found nearly 400-fold variation in evolutionary rates across proteins. Many of the causes of site effects also lead to rate variation across genes, so the two forms of variation are closely linked. However, the evolutionary rates of genes and proteins are most strongly correlated with their levels of expression (e.g., Rocha and Danchin 2004; Park et al. 2012) and not their functional importance (Wang and Zhang 2009). A negative relationship between the expression level of a protein and its evolutionary rate has been found across a wide range of organisms, including bacteria and eukaryotes, but the specific causes of this relationship remain unclear (Zhang and Yang 2015). In contrast with the rate variation across proteins, rates of synonymous substitutions show little variation across protein-coding genes in mammalian genomes (Kumar and Subramanian 2002).

On a broader scale, evolutionary rates can show substantial disparities between nuclear and organellar genomes. A widely recognized pattern in metazoans is that mutation rates are much higher in the mitochondrial genome than in the nuclear genome (Brown et al. 1979; Miyata et al. 1982). However, the ratio of mitochondrial to nuclear evolutionary rates has been found to be considerably greater in birds, reptiles, and other vertebrates than in insects and arachnids (Allio et al. 2017). The comparatively high information content of mitochondrial DNA ensured that it held a long reign as the preferred marker in studies of population genetics, molecular systematics, and phylogenetics in humans and other animals (Avise et al. 1987). The popularity of mitochondrial DNA declined with the advent of high-throughput sequencing technologies, which enabled nuclear genome data to be obtained efficiently and on large scales, and with growing concerns about excessive reliance on a single genetic marker.

In contrast with the trends observed in animals, elevated mutation rates are not seen in the mitochondrial genomes of other eukaryotes (Baer et al. 2007). In plants, nuclear genomes evolve more rapidly than chloroplast genomes, which evolve more rapidly than mitochondrial genomes (Wolfe et al. 1987). This pattern is particularly pronounced in angiosperms, but less so in gymnosperms (Drouin et al. 2008). The reasons for the low evolutionary rates in the chloroplast and mitochondrial genomes of plants are not entirely clear, but might be related to DNA repair mechanisms (Christensen 2013). In plastid-bearing eukaryotes other than land plants, mitochondrial genomes have a higher evolutionary rate than plastid genomes (Smith 2015).

3.3 Lineage Effects and Gene-by-Lineage Interactions

Evidence of lineage effects emerged soon after the proposal of the molecular clock and continued to grow in the ensuing decades (Fig. 1.3b). The generation-time effect, as described in Sect. 1.2.2, appeared to be the most prominent form of evolutionary rate variation across lineages. The hominoid slowdown in evolutionary rates, first quantified by Goodman (1961), has been confirmed in genome-scale analyses of primates (Kim et al. 2006; Chintalapati and Moorjani 2020). A generation-time effect has now been found in a variety of organisms, including bacteria (Weller and Wu 2015), birds (Mooers and Harvey 1994), and invertebrates (Thomas et al. 2010), and broadly across animals (Allio et al. 2017). However, evolutionary rates appear to show a more complex relationship with generation time in plants, in which the germline is segregated at a late stage of their growth (Lanfear et al. 2013).

Lineage effects can be detected using a variety of methods. Sarich and Wilson (1967b, 1973) described a framework for comparing the relative rates between a pair of taxa, which was later developed into a statistical test (Fitch 1976; Wu and Li 1985). The relative-rates test has largely been superseded by methods that can test for among-lineage rate heterogeneity across an entire phylogenetic tree. These include the likelihood-ratio test, which can be used to compare a model in which the phylogeny is constrained to be ultrametric (all tips being equally distant from the root of the tree) against a model in which the branch lengths are unconstrained (Felsenstein 1981). The rapid increase in genetic data throughout the 1980s and 1990s led to an accumulation of evidence of evolutionary rate variation (Britten 1986; Drake et al. 1998). Some of the major patterns of rate variation across the tree of life are described in Sect. 1.4.
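The likelihood-ratio test for clocklike evolution can be sketched as follows. The log-likelihood values are hypothetical and would normally come from fitting both models in phylogenetic software. The sketch assumes that the unconstrained model has 2n − 3 branch lengths for n taxa and the clock model has n − 1 node ages, giving n − 2 degrees of freedom, and that the test statistic is compared with a chi-square distribution. The tail probability is computed with a small standard-library routine to keep the example self-contained.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution.

    Uses the series expansion of the regularized lower incomplete gamma
    function, which is adequate for the moderate values met here.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2, x / 2
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def clock_lrt(lnl_clock, lnl_free, n_taxa):
    """Likelihood-ratio test of a strict clock against unconstrained branch lengths."""
    stat = 2 * (lnl_free - lnl_clock)
    df = n_taxa - 2  # (2n - 3) free branch lengths versus (n - 1) node ages
    return stat, df, chi2_sf(stat, df)


# Hypothetical log-likelihoods for a six-taxon alignment.
stat, df, p = clock_lrt(lnl_clock=-2350.7, lnl_free=-2342.1, n_taxa=6)
print(round(stat, 1), df, round(p, 4))
```

A small p-value leads to rejection of the ultrametric (clocklike) model in favour of among-lineage rate variation.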

Gene-by-lineage interactions (Fig. 1.3c), which comprise the variation in evolutionary rates that is not accounted for by gene effects or lineage effects, represent an additional layer of complexity in patterns of rate variation (e.g., Gillespie 1989; Ayala 1997). These interactions have been found to be more prominent in nonsynonymous than synonymous rates in plant chloroplast genomes (Muse and Gaut 1997). Gene-by-lineage interactions appear to account for a small proportion of evolutionary rate heterogeneity in mitochondrial and nuclear genes from eutherian mammals (Smith and Eyre-Walker 2003), but are potentially important when large sets of genes are being analysed for the purposes of inferring evolutionary timescales. Variation across genes and across lineages are the dominant forms of genome-scale rate heterogeneity (Snir et al. 2012), although gene-by-lineage interactions have been detected in genomic data from eutherian mammals (Duchêne and Ho 2015) and flowering plants (Duchêne et al. 2016a). Further genomic analyses will allow the different forms of evolutionary rate variation to be characterized for other groups of organisms.

3.4 Other Forms of Evolutionary Rate Variation

The framework used in the previous section provides a helpful means of partitioning rate variation into its major components, allowing consideration of the biological and evolutionary drivers of rates of mutation and substitution (Fig. 1.3). Nevertheless, there are several important features of evolutionary rate variation that do not fit neatly into this classification. Here I describe three of these phenomena: punctuated evolution, epoch effects, and time-dependent rates. These forms of rate variation can pose substantial challenges for using molecular clocks to infer evolutionary timescales.

The punctuated equilibrium theory was put forward in an attempt to explain patterns in the fossil record, which appears to feature long periods of stasis punctuated by rapid bursts of morphological change (Eldredge and Gould 1972). Inspired by this theory, molecular evolutionary biologists have sought evidence of bursts of genetic change caused by founder effects at speciation events (Fig. 1.3d; Webster et al. 2003; Pagel et al. 2006). These can potentially be detected using a phylogenetic approach to analyse molecular sequence data, because the theory predicts that a measurable proportion of genetic change is correlated with the number of speciation events along any lineage in the evolutionary tree. However, tests of punctuated evolution have been seriously hindered by a problem known as the node-density effect, which produces patterns similar to those expected under punctuated molecular evolution (Fitch and Beintema 1990). Newly developed phylogenetic models of evolutionary rates might be able to shed further light on the occurrence of punctuated molecular evolution (Manceau et al. 2020).

Rates of molecular evolution can vary across time periods, leading to epoch effects (Fig. 1.3e; Lee and Ho 2016). For example, some external factors, such as environmental conditions, might raise evolutionary rates across an entire population or even an entire assemblage of organisms. One potential example is a several-fold increase in phenotypic and genomic evolutionary rates during the rapid diversification of metazoan phyla in the Cambrian, an event that is often referred to as the ‘Cambrian explosion’ (Lee et al. 2013). Epoch effects are particularly difficult to identify unless the period of evolutionary rate elevation can be bracketed by reliable age constraints from the fossil record. For example, epoch effects cannot be detected by a likelihood-ratio test for clocklike evolution, in which the null hypothesis is that all of the tips are the same distance from the root of the tree (Yang 2014).

The study of evolutionary rates has been hindered by a time-dependent bias, which causes rate estimates to scale negatively with the timeframe of their measurement (Fig. 1.3f). This pattern can be caused by various factors, including the effects of purifying selection and substitution saturation (Ho et al. 2011). On short timeframes, estimates of evolutionary rates can be inflated by the inclusion of deleterious mutations, which tend to be removed from the population by purifying selection over longer periods of time. Substitution saturation can cause underestimation of the amount of genetic change across longer evolutionary timescales, and this bias is exacerbated by model misspecification (Soubrier et al. 2012). The most striking disparities are seen when the short-term rate estimates from pedigrees and mutation-accumulation lines are compared with those inferred using phylogenetic analysis (e.g., Howell et al. 2003). There is evidence of a time-dependent pattern in evolutionary rate estimates from viruses (Duchêne et al. 2014; Aiewsakun and Katzourakis 2016), bacteria (Duchêne et al. 2016b; but see Gibson and Eyre-Walker 2019), and metazoan mitochondrial genomes (Molak and Ho 2015). The evidence for time-dependent biases in metazoan nuclear genomes has so far been limited, although spontaneous mutation rates appear to be greater than long-term evolutionary rates estimated using phylogenetic methods (with modern humans being at least one exception to this pattern; Scally 2016; Chintalapati and Moorjani 2020).

4 Evolutionary Rates Across the Tree of Life

4.1 Estimating Rates of Mutation and Evolution

Across the tree of life, evolutionary rates show striking variation and span multiple orders of magnitude. This variation can be considered at a range of biological scales: within individuals, between generations, between populations, among species, and across clades. Lying at one end of this spectrum are rates of spontaneous mutation, which have commonly been estimated by studying laboratory populations but are increasingly based on genome sequencing of closely related individuals or even of different tissues within the same individual. These rates have typically been difficult to estimate directly, because of the small numbers of mutations between generations and because studies often compare the genomes of somatic rather than germline cells. However, improvements in the efficiency and cost of genome sequencing have led to a stunning increase in studies of spontaneous mutation rates, even in multicellular eukaryotes that experience very few mutations per generation. In Chap. 3, Susanne Pfeifer presents an overview of the major approaches that have been used to estimate spontaneous mutation rates, along with a summary of the estimates that have been published so far. These studies have revealed considerable variation in mutation rates across species (Drake et al. 1998; Baer et al. 2007).

Given that most mutations have negative impacts on fitness, the question arises as to why mutation rates are nonzero (Sturtevant 1937). This can be understood in terms of the fitness costs of reducing mutation rates, because cellular and energetic resources are needed for proofreading and error correction (Kimura 1967). A nonzero mutation rate also provides genetic variation, allowing populations of organisms to adapt to changes in environmental conditions. These factors have led to the idea that mutation rates themselves are evolvable; the optimal mutation rate is expected to vary along the genome and across species (Baer et al. 2007). However, some have argued that mutation rates represent a balance between genetic drift and selection for reduced copying errors (Lynch 2010; Lynch et al. 2016). In Chap. 4, Lindell Bromham describes the current state of knowledge of the causes of rate variation across the tree of life, including the factors that affect rates of spontaneous mutation and the rates of fixation of these mutations (i.e., substitution rates).

In many phylogenetic studies using molecular clock models, evolutionary rates and timescales are jointly estimated. These analyses have produced a comprehensive picture of evolutionary rate variation across the diversity of life. In these cases, evolutionary rates are averaged along branches of the phylogeny, meaning that these estimates represent long-term quantities and are partly dependent on taxon sampling (Lanfear et al. 2010). Furthermore, they are somewhat removed from the underlying rates of spontaneous mutation because they have also been shaped by the effects of selection and drift. Some researchers have attempted to use rates estimated from noncoding or synonymous sites as an approximation of mutation rates. In any case, a more complete understanding of rate variation can be achieved by considering both spontaneous mutation rates and phylogenetic estimates of evolutionary rates.

4.2 Viruses and Bacteria

The genomes of viruses and bacteria show a remarkable range of mutation rates and evolutionary rates. Among viruses, rates broadly vary with the structure and composition of the genome. Viruses with single-stranded genomes evolve more rapidly than those with double-stranded genomes (Duffy et al. 2008; Sanjuán et al. 2010), although the reasons for this pattern remain unclear (Peck and Lauring 2018). RNA viruses copy their genomes using RNA-dependent RNA polymerases, which lack proofreading ability, so most of these viruses are unable to correct any copying errors that occur during genome replication. As a consequence, they generally have higher mutation rates than DNA viruses, especially double-stranded DNA viruses. There is also a negative correlation between genome size and evolutionary rate (Sanjuán et al. 2010), which is particularly noticeable in viruses but is also seen across a broad range of taxa (Drake 1991; Drake et al. 1998).

The most rapidly evolving viruses tend to be those with single-stranded RNA genomes, such as influenza virus, dengue virus, and coronaviruses. These viruses experience substitution rates as high as 10⁻³ substitutions per site per year (Duffy et al. 2008). At the other end of the spectrum, double-stranded DNA viruses, such as variola virus (which causes smallpox), can evolve at rates below 10⁻⁵ substitutions per site per year (Firth et al. 2010). For rapidly evolving viruses, evolutionary rates can be estimated using time-structured data sets in which genomes have been sampled at different points in time (Rambaut 2000; Drummond et al. 2001). In contrast, slowly evolving viruses, such as hepatitis B virus, might not undergo a sufficient amount of genetic change over such timeframes to permit any reliable inference of their substitution rate. In some of these cases, evolutionary rates can be estimated by assuming that viruses have codiverged with their hosts (e.g., Bernard 1994; Paraskevis et al. 2013). Virus-host codivergence appears to be more common in double-stranded DNA viruses than in RNA viruses (Geoghegan et al. 2017).
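
The logic of estimating a rate from time-structured data can be illustrated with a root-to-tip regression. This is a simple heuristic, not the likelihood or Bayesian approach of the studies cited above. The sketch below assumes hypothetical sampling years and hypothetical root-to-tip distances (in substitutions per site) taken from a rooted tree; the slope of the fitted line approximates the substitution rate, and the point where the line crosses zero approximates the age of the root.

```python
from statistics import mean

def root_to_tip_regression(sampling_years, root_to_tip_distances):
    """Least-squares fit of root-to-tip distance against sampling year.

    Returns (rate, root_year): the slope in substitutions per site per year
    and the x-intercept, i.e. the year at which the fitted distance is zero.
    """
    x_bar = mean(sampling_years)
    y_bar = mean(root_to_tip_distances)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(sampling_years, root_to_tip_distances))
    sxx = sum((x - x_bar) ** 2 for x in sampling_years)
    rate = sxy / sxx
    intercept = y_bar - rate * x_bar
    root_year = -intercept / rate
    return rate, root_year

# Hypothetical data: genomes sampled over 15 years from a virus evolving at
# exactly 1e-3 substitutions per site per year since a root in 1990.
years = [2005, 2010, 2015, 2020]
distances = [1e-3 * (year - 1990) for year in years]

rate, root_year = root_to_tip_regression(years, distances)
print(f"rate = {rate:.2e} subs/site/year, root year = {root_year:.1f}")
```

With real data the points scatter around the line, and a slope that is indistinguishable from zero suggests that the sampling window is too short for the rate to be estimated, as is the case for slowly evolving viruses.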

Bacteria have larger genomes than viruses and tend to evolve more slowly. Analyses of genomic data sets have revealed a wide variation in evolutionary rates among bacterial taxa (Duchêne et al. 2016b). The most rapidly evolving bacterial species, such as Neisseria gonorrhoeae, Helicobacter pylori, and Enterococcus faecium, experience nucleotide substitution rates of about 10⁻⁵ substitutions per site per year. In contrast, rates below 10⁻⁷ substitutions per site per year are seen in Mycobacterium tuberculosis and the plague bacterium Yersinia pestis (Duchêne et al. 2016b). The variation in evolutionary rates in bacteria has been ascribed to differences in generation time (Gibson and Eyre-Walker 2019), but attempts to resolve these patterns have been hindered by strong time-dependent biases in rate estimation (Rocha et al. 2006; Duchêne et al. 2016b). Nevertheless, a generation-time effect can be seen in the lower evolutionary rates of spore-forming bacteria compared with bacteria that do not form spores (Weller and Wu 2015).

4.3 Eukaryotes

Rates of molecular evolution in eukaryotes, particularly multicellular eukaryotes with long generation times, are generally lower than those of viruses and bacteria. Estimates of mutation rates in unicellular eukaryotes include 1.9 × 10⁻¹¹ mutations per site per generation for the protist Paramecium tetraurelia (Sung et al. 2012) and about 2 × 10⁻¹⁰ mutations per site per generation for the yeasts Saccharomyces cerevisiae (Zhu et al. 2014) and Schizosaccharomyces pombe (Farlow et al. 2015). There have been relatively few estimates of spontaneous mutation rates in the nuclear genomes of animals, but these are growing rapidly with the application of high-throughput sequencing to pedigrees and parent-offspring trios (see Chap. 3).

Animal nuclear genomes evolve slowly, so per-generation mutation rates are difficult to estimate because of the confounding impacts of sequencing error. Inference of mutation rates is also complicated by rate differences between sexes and between the soma and germline. Analyses of genomes from pedigrees and parent-offspring trios have produced a range of estimates of the spontaneous mutation rate in modern humans, centred on a value of 5 × 10⁻¹⁰ mutations per site per year (Scally 2016). Spontaneous mutation rates have also been estimated for the nuclear genomes of the nematode worm Caenorhabditis elegans, the common fruit fly Drosophila melanogaster, Western honey bee Apis mellifera, collared flycatcher Ficedula albicollis, house mouse Mus musculus, and common chimpanzee Pan troglodytes, among other animal species (see Chap. 3; Smeds et al. 2016).

An alternative approach to estimating mutation rates has involved analyses of rates of synonymous substitutions and changes at third codon positions, which are under weaker selective constraints and so are believed to provide an approximation of mutation rates. These analyses have revealed that mitochondrial mutation rates vary considerably across birds and mammals (Nabholz et al. 2008, 2009) and invertebrates (Thomas et al. 2010). In contrast, studies of mitochondrial substitution rates in birds and mammals have identified a relative degree of constancy across lineages, with a mean rate of about 0.01 substitutions per site per Myr (Weir and Schluter 2008; but see Pereira and Baker 2006; Nguyen and Ho 2016). This has led to the notion of a 1% mitochondrial clock in birds and mammals. A similar ‘universal’ mitochondrial clock has been widely used in studies of invertebrates (Brower 1994; but see Papadopoulou et al. 2010).
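
To see how such a clock is applied, consider two sequences that have diverged under a strict clock. Both lineages accumulate substitutions, so a pairwise distance d corresponds to a divergence time of t = d / (2r), where r is the rate along a single lineage. The sketch below assumes that the rate of 0.01 substitutions per site per Myr applies to each lineage, and uses a hypothetical pairwise distance; the function name is illustrative only.

```python
def divergence_time_myr(pairwise_distance, rate_per_lineage_per_myr=0.01):
    """Time since two lineages split under a strict clock.

    pairwise_distance: substitutions per site separating the two sequences.
    rate_per_lineage_per_myr: substitutions per site per Myr along one lineage
    (the default is the mean rate quoted above; treating it as a per-lineage
    rate is an assumption of this sketch).
    """
    if rate_per_lineage_per_myr <= 0:
        raise ValueError("rate must be positive")
    # Both lineages accumulate substitutions, hence the factor of two.
    return pairwise_distance / (2 * rate_per_lineage_per_myr)

# Hypothetical pair of species whose mitochondrial sequences differ by
# 0.04 substitutions per site after correcting for multiple hits.
print(divergence_time_myr(0.04))
```

The estimate is only as good as the borrowed rate, which is why the studies cited after "but see" caution against applying a single rate across taxa.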

Evolutionary rates show considerable heterogeneity across plant lineages, but a few general trends can be observed. The nuclear genomes of gymnosperms evolve at rates that are several times lower, on average, than those of angiosperms (De La Torre et al. 2017). This pattern can potentially be explained by the longer generation times and larger genomes of gymnosperms. Within flowering plants, there is evidence of a substantial increase in evolutionary rates in the early evolution of the grasses (Christin et al. 2014), whereas palms have evolved much more slowly (Gaut et al. 1992). Evolutionary rates are higher in annual plants than in perennial plants, a pattern that has been found in sequence analyses of the internal transcribed spacer of nuclear ribosomal DNA (Kay et al. 2006) and in larger sets of chloroplast and nuclear genes (Yue et al. 2010). Similarly, herbaceous flowering plants have higher rates of molecular evolution than woody plants with shrub or tree habits (Smith and Donoghue 2008). These patterns in rate variation between annual and perennial plants, and between herbaceous and woody plants, are believed to reflect broad differences in generation time.

Mutation rates in nuclear genomes have been estimated for a number of plant species, including thale cress Arabidopsis thaliana (Ossowski et al. 2010), common oak Quercus robur (Schmid-Siegert et al. 2017), Sitka spruce Picea sitchensis (Hanlon et al. 2019), and yellow box eucalypt Eucalyptus melliodora (Orr et al. 2020). Some of these studies were able to trace somatic mutations across the plant, such as along tree branches. For example, 300 mutations were identified along 90.1 metres of branch length in an individual tree of Eucalyptus melliodora, allowing the somatic mutation rate to be calculated at 2.75 × 10⁻⁹ mutations per nucleotide for each metre of tree branch (Orr et al. 2020). A detailed genomic analysis of eight plant species revealed evidence of higher per-year mutation rates in roots than in shoots in perennial plants, but such a pattern was not seen in annual plants (Wang et al. 2019). In addition, mutation rates were found to be higher in petals than in leaves. These studies have revealed the complexities of mutation rate variation in plants, while highlighting the difficulty in understanding the relationships of these rates to the long-term evolutionary rates in these taxa.
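
The arithmetic behind a per-metre somatic mutation rate can be sketched as a count of mutations divided by the product of the number of nucleotide sites and the branch length. The number of sites is not stated here, so the example below back-calculates the value implied by the three quoted figures. That value is an illustration of the calculation and should not be read as the published genome size.

```python
def mutations_per_site(rate_per_site_per_metre, branch_length_metres):
    """Expected somatic mutations per nucleotide over a length of branch."""
    return rate_per_site_per_metre * branch_length_metres

def implied_sites(mutation_count, rate_per_site_per_metre, branch_length_metres):
    """Number of nucleotide sites implied by a count, a rate, and a length."""
    return mutation_count / mutations_per_site(
        rate_per_site_per_metre, branch_length_metres)

RATE = 2.75e-9        # mutations per nucleotide per metre (quoted above)
BRANCH_LENGTH = 90.1  # metres of branch (quoted above)
MUTATIONS = 300       # mutations identified (quoted above)

per_site = mutations_per_site(RATE, BRANCH_LENGTH)
sites = implied_sites(MUTATIONS, RATE, BRANCH_LENGTH)
print(f"{per_site:.3g} mutations per nucleotide across the sampled branches")
print(f"{sites:.3g} nucleotide sites implied by the quoted figures")
```

Expressing the rate per metre of growth rather than per year or per generation sidesteps the need to know how many cell divisions separate two branch tips.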

5 The Molecular Clock as a Tool for Inferring Timescales

5.1 Molecular Dating

In modern genetics and genomics, the molecular clock has its most prominent role as a tool for inferring evolutionary timescales. This application of the molecular clock is sometimes referred to as molecular clock dating, divergence-time estimation, or simply molecular dating. There is a rich history of development of molecular dating methods (Fig. 1.1), with much of the progress in this field being tied to advances in phylogenetic methods and computational power (Bromham and Penny 2003; Kumar 2005). In Chap. 5, Susana Magallón describes the principles behind molecular dating methods and the steps involved in using these methods to infer evolutionary timescales from molecular sequence data.

Research on molecular dating has led to the development of a range of phylogenetic dating methods and statistical models of evolutionary rates (Heath and Moore 2014; Ho and Duchêne 2014; Yang 2014; Kumar and Hedges 2016). These have included methods to cope with among-lineage rate variation, such as nonparametric rate smoothing (Sanderson 1997) and penalized likelihood (Sanderson 2002), as well as models of evolutionary rate variation across branches (Hasegawa et al. 1989; Thorne et al. 1998). Notably, much of the recent progress in molecular clocks has focused on phenomenological rather than mechanistic models, leaving these developments somewhat decoupled from the earlier theoretical context of the molecular clock.

Molecular dating was first performed using amino acid sequences (Zuckerkandl and Pauling 1962) and immunological comparisons by microcomplement fixation (Sarich and Wilson 1967a), but is now overwhelmingly based on the analysis of nucleotide sequences. The most important developments have been in the use of genome-scale data sets for inferring evolutionary timescales. Alongside these efforts, there have been various attempts to use other forms of genetic, genomic, and protein data for molecular dating (Fig. 1.1; Ho et al. 2016). For example, the timing of intraspecific events has been estimated using molecular clocks based on microsatellites (Goldstein et al. 1995), whereas deeper events have been dated using protein folds (Wang et al. 2011).

The application of Bayesian approaches to phylogenetic analysis has led to major developments in molecular dating (dos Reis et al. 2016; Bromham et al. 2018). In Chap. 6, Tianqi Zhu provides an introduction to the Bayesian framework for molecular dating, which permits the application of complex, parameter-rich models that would not be tractable using other methods. These include sophisticated models of evolutionary rate heterogeneity (clock models), models of lineage diversification (in the form of the tree prior), and various means of incorporating data from the fossil record (dos Reis et al. 2016; Bromham et al. 2018).

In Bayesian molecular dating, models of among-lineage rate variation have seen particularly active development. The most widely used are the relaxed-clock models, which allow a distinct rate of evolution along each branch of the phylogenetic tree. The earliest relaxed-clock models were inspired by the work of Gillespie (1991), who suggested that the substitution rate might evolve along lineages. Relaxed-clock models that allow such autocorrelation in the evolutionary rate were implemented in Bayesian dating methods in the late 1990s and subsequently expanded (e.g., Thorne et al. 1998; Kishino et al. 2001; Aris-Brosou and Yang 2002). Later work saw the appearance of relaxed-clock models that allow independent or uncorrelated rates across branches (e.g., Drummond et al. 2006; Rannala and Yang 2007).
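
The difference between these two families of relaxed-clock models can be illustrated by simulating branch rates along a single lineage. The sketch below is a deliberately simplified stand-in and does not reproduce the parameterization of any of the cited models: the uncorrelated version draws each branch rate independently from a lognormal distribution, whereas the autocorrelated version lets the log rate of each branch drift from that of its parent. The function names, the rate of 10⁻³, and the spread `sigma` are all hypothetical.

```python
import math
import random

def uncorrelated_rates(n_branches, mean_rate, sigma, rng):
    """Draw each branch rate independently from a lognormal distribution."""
    # Shift mu so that the expected rate equals mean_rate.
    mu = math.log(mean_rate) - 0.5 * sigma ** 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]

def autocorrelated_rates(n_branches, root_rate, sigma, rng):
    """Let the log rate drift from ancestor to descendant along one lineage."""
    rates = []
    log_rate = math.log(root_rate)
    for _ in range(n_branches):
        # Each branch inherits its parent's rate, perturbed on the log scale.
        log_rate = rng.gauss(log_rate, sigma)
        rates.append(math.exp(log_rate))
    return rates

rng = random.Random(1)  # fixed seed so that the run is repeatable
independent = uncorrelated_rates(8, mean_rate=1e-3, sigma=0.5, rng=rng)
inherited = autocorrelated_rates(8, root_rate=1e-3, sigma=0.5, rng=rng)
for label, rates in (("uncorrelated", independent), ("autocorrelated", inherited)):
    print(label, " ".join(f"{r:.1e}" for r in rates))
```

In the autocorrelated series, neighbouring branches tend to have similar rates and the lineage can wander far from its starting value; in the uncorrelated series, each branch is a fresh draw around the same mean.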

The methods developed for molecular dating have also been applied, with some modifications, to analyses of morphological data. In Chap. 7, Michael Lee describes the use of phenotypic traits for estimating evolutionary timescales, focusing on the analysis of discrete morphological characters. The use of morphological clocks has produced useful insights into the evolution of birds and other groups of organisms (e.g., Polly 2001; Lee et al. 2014), although there continue to be various shortcomings that need to be addressed (Puttick et al. 2016). For example, questions persist about the strength of the association between molecular and morphological rates of evolution (Davies and Savolainen 2006; Seligmann 2010). Nevertheless, with continued advances in models of phenotypic evolution (e.g., Álvarez-Carretero et al. 2019), phylogenetic dating analyses of morphological characters present a promising avenue for further research.

Unless there is a priori information about the evolutionary rate, molecular dating methods need to calibrate the clock so that it gives date estimates measured in absolute time. The most widely used types of calibrating information are those based on palaeontological, geological, and biogeographic evidence. In Chap. 8, Jacqueline Nguyen and I describe the use of fossil evidence for calibration, which has a rich history of development and has fostered productive collaborations between geneticists and palaeontologists. In Chap. 9, Michael Landis explains how information from biogeography and palaeogeography can be used to calibrate the molecular clock, based on the timing of geological events such as the separation of landmasses.
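
The role of a calibration can be shown with a minimal strict-clock calculation. A node of known age fixes the rate, and that rate then converts the genetic distance at an undated node into an age. In the sketch below, all distances and ages are hypothetical, and the calibrating node is given minimum and maximum bounds to show how uncertainty in the calibration carries through to the date estimate. Modern methods treat this uncertainty far more rigorously.

```python
def calibrate_rate(pairwise_distance, calibration_age_myr):
    """Rate per lineage (subs/site/Myr) from a node of known age."""
    return pairwise_distance / (2 * calibration_age_myr)

def date_node(pairwise_distance, rate):
    """Age (Myr) of the node separating two sequences, given a rate."""
    return pairwise_distance / (2 * rate)

# Hypothetical inputs: a calibrating node bracketed at 40-50 Myr, with
# 0.20 subs/site between its descendants, and an undated node whose
# descendants differ by 0.08 subs/site.
CALIBRATION_DISTANCE = 0.20
MIN_AGE, MAX_AGE = 40.0, 50.0
TARGET_DISTANCE = 0.08

fast = calibrate_rate(CALIBRATION_DISTANCE, MIN_AGE)  # younger node, faster clock
slow = calibrate_rate(CALIBRATION_DISTANCE, MAX_AGE)  # older node, slower clock
youngest = date_node(TARGET_DISTANCE, fast)
oldest = date_node(TARGET_DISTANCE, slow)
print(f"rate: {slow:.4f}-{fast:.4f} subs/site/Myr")
print(f"target node: {youngest:.1f}-{oldest:.1f} Myr")
```

Without the calibration, only the ratio of the two node ages (here 0.08 / 0.20) could be recovered from the sequences, which is why relative times need external evidence to be placed on an absolute scale.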

Some phylogenetic methods have been extended to account for the inclusion of genomes and morphological data that have been sampled at distinct points in time. In Chap. 10, Sebastián Duchêne and David Duchêne describe the use of sampling times for calibration in analyses of rapidly evolving viruses and bacteria, and when analysing data sets containing ancient DNA sequences. Distinct sampling times are also a feature of morphological data sets that include fossil taxa. In Chap. 11, Alexandra Gavryushkina and Chi Zhang describe the analysis of combined morphological and molecular data, including the development of diversification models that explicitly include extinct species and fossil sampling (e.g., Ronquist et al. 2012; Heath et al. 2014).

The past two decades have seen remarkable growth in genomic data, which has been made possible by the development of high-throughput sequencing methods. This has provided a vast wealth of molecular sequence data for understanding molecular evolution at the genomic scale, but has also brought substantial challenges to molecular dating (Ho 2014; Tong et al. 2016). In Chap. 12, Qiqing Tao, Koichiro Tamura, and Sudhir Kumar review a range of methods that are designed to perform rapid molecular dating, allowing the analysis of data sets containing large numbers of sequences. In Chap. 13, Sandra Álvarez-Carretero and Mario dos Reis describe the application of Bayesian phylogenetic dating to genome-scale data sets, including some of the techniques that have been used to improve computational feasibility. These two closing chapters present a promising picture of how the molecular clock will retain its relevance and utility in the coming years.

5.2 Evolutionary Timescales

The molecular clock has been used extensively to reconstruct evolutionary timescales across the tree of life. Early studies focused on the divergence times of humans and related primates (Zuckerkandl and Pauling 1962; Sarich and Wilson 1967a), but often included other mammals (Margoliash 1963; Doolittle and Blombäck 1964). There continued to be a focus on the evolutionary rates and timescales of mammals, particularly eutherian mammals, primarily because of the availability of molecular data for this group of organisms. Developments in automated DNA sequencing in the late 1980s and early 1990s led to rapid growth in molecular sequence data, allowing a considerable expansion of the scope of molecular dating studies.

Molecular dating gained widespread attention in the 1990s when researchers began analysing large data sets to reconstruct the timescales of major evolutionary events. These studies often involved spectacular claims about the antiquity of major branches of the tree of life. They addressed questions of perennial interest, including the timing of the divergences among the kingdoms of life (e.g., Doolittle et al. 1996), the divergences among metazoan phyla (the ‘Cambrian explosion’; e.g., Wray et al. 1996; dos Reis et al. 2015), the diversification of angiosperms (e.g., Martin et al. 1989; Magallón et al. 2015), and the radiations of eutherian mammals and modern birds (e.g., Hedges et al. 1996; Easteal 1999; Springer et al. 2003; dos Reis et al. 2013). The molecular date estimates for these events have often been at odds with the timescales supported by a literal reading of the palaeontological evidence, leading to deliberation about the relative merits of the fossil record and molecular clocks (Smith and Peterson 2002; Benton and Ayala 2003; Brochu et al. 2004). For example, many molecular estimates for the age of crown angiosperms have been greater than 200 Myr, whereas the oldest fossil evidence dates to about 136 Myr in the Early Cretaceous (Magallón et al. 2015). The debates over the discrepancies between molecular and fossil evidence identified some important shortcomings in molecular dating methods, which provided a strong impetus for methodological innovation. Improved modelling of evolutionary rate variation and use of fossil evidence has narrowed some of the gaps between molecular and palaeontological date estimates.

Molecular dating has been particularly valuable for understanding the evolutionary history and epidemiological dynamics of pathogens (Pybus and Rambaut 2009). Fine-scale sampling of pathogens, for example during contemporary virus outbreaks, can allow a detailed reconstruction of evolutionary rates, transmission dynamics, and phylogeographic spread (Pybus and Rambaut 2009). Over longer evolutionary timescales, molecular clocks can be used to determine when pathogens crossed species barriers and infected new hosts, and whether these pathogens continued to codiverge with the host populations.

One of the more surprising applications of molecular dating has been to estimate the ages of the biological samples from which genomic data have been obtained (Shapiro et al. 2011; Moorjani et al. 2016). This approach can be used to estimate or validate the ages of any samples that have uncertain or contentious dates, such as those that are beyond the 50,000-year reach of radiocarbon dating or where the cost of direct radiometric dating is prohibitive. For example, a Bayesian dating analysis was used to estimate the age of a 400,000-year-old hominin sample from Sima de los Huesos in Spain (Meyer et al. 2014). Ancient hominin genomes have also been dated using a molecular clock based on the accumulation of recombination events over time (Moorjani et al. 2016).

Continued development of molecular clocks will allow evolutionary and demographic timescales to be resolved with increasing confidence. Some of the most promising areas of research include better techniques for incorporating fossil data, mechanistic models of evolutionary rate variation among lineages, and molecular dating methods that are able to process genome-scale data sets from large numbers of taxa. At the same time, these efforts will be substantially aided by advances in understanding of genomic evolution and other biological processes.

6 Concluding Remarks

This book is intended to provide an overview of the state of the art of molecular clocks, although the continual and rapid expansion of the field prevents a comprehensive treatment from being achievable. Nevertheless, I hope that this book provides a useful starting point for researchers and students interested in molecular evolutionary clocks. The field is likely to carry on developing at a great pace in response to the growth of genomic data. With international efforts to sequence the genomes of all vertebrates, invertebrates, and other eukaryotes, we will continue to make great strides towards placing a timescale on the tree of life.