All living organisms on Earth are related by descent from common ancestors (Darwin 1859), and the main goal of systematics is to disentangle their phylogenetic relationships (Wiley and Lieberman 2011). The first phylogenetic trees were reconstructed from morphological characters (this is still the case in paleontology) using cladistics (Hennig 1966) and maximum parsimony as the optimality criterion (Fitch 1971). However, morphology-based phylogenies are normally based on only a restricted number of characters (Scotland et al. 2003) because many characters have to be discarded if they are not functionally independent, character states cannot always be defined unambiguously, and homology (similarity due to common ancestry) is difficult to ascertain between distantly related taxa. Moreover, morphological characters experiencing similar selective forces are prone to convergence, thus producing homoplasy and misleading phylogenetic inference (Wake 1991).

The discovery that protein sequences accumulate amino acid changes at a constant rate over time (the so-called molecular clock) opened the possibility of using this evolutionary information to infer phylogenetic relationships (Zuckerkandl and Pauling 1965). Molecular sequences offered a vast number of independent characters, and they could be compared among all living organisms. Moreover, most mutations are neutral, their fate governed by random genetic drift (Kimura 1983), which leads to reduced levels of homoplasy. All these valuable features explain why molecular sequences have superseded morphological traits over the years as the source data for the reconstruction of robust and reliable phylogenetic trees. Furthermore, it was suggested early on that probabilistic methods such as maximum likelihood, although computationally demanding, could be the most powerful approach for phylogenetic inference based on molecular sequences (Cavalli-Sforza and Edwards 1967). The maximum likelihood optimality criterion searches for the phylogenetic tree (topology plus branch lengths) that best explains the observed alignment of sequences given an explicit statistical Markov model of molecular evolution. It provides a statistical framework for phylogenetic inference and thus allows the application of well-known statistical tools in downstream analyses. In summary, by the end of the 1960s the theoretical foundations for molecular phylogenetics were set, but only a handful of molecular sequences were available and computing power could barely handle even the simplest maximum likelihood analyses.
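Schematically (a notational summary added here for clarity rather than a formula taken from these early papers), the maximum likelihood criterion can be written as

$$(\hat{\tau}, \hat{b}, \hat{\theta}) = \operatorname*{arg\,max}_{\tau,\, b,\, \theta} \; P(D \mid \tau, b, \theta),$$

where $D$ is the observed sequence alignment, $\tau$ the tree topology, $b$ the branch lengths, and $\theta$ the free parameters of the substitution model.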

The next decade started with a plethora of studies based on immunological techniques and protein electrophoresis assessing genetic variation within populations, such as that of Lewontin (1972), who showed that much of human genetic variation is found within local populations and rejected the use of the race concept. Moreover, the 1970s witnessed the rise of RNA sequencing (Sanger et al. 1965), which culminated in the discovery of the domain Archaea (Woese and Fox 1977). In parallel, the popularization of molecular cloning techniques using restriction enzymes and plasmid vectors (Cohen et al. 1973), together with the advent of chain-terminating sequencing (Sanger et al. 1977), provided an accurate, robust, and routine methodology for obtaining DNA sequences. Thus, at the onset of the 1980s, the complete human mitochondrial genome was sequenced (Anderson et al. 1981) and maximum likelihood algorithms to reconstruct trees from nucleotide sequences were developed (Felsenstein 1981), demonstrating that molecular phylogenetics could effectively move from theory to practice. Many influential studies on molecular evolution (several cited here; see also other commentaries in this anniversary issue) were published in the Journal of Molecular Evolution during these years.

The study of Hasegawa et al. (1985) focused on dating the divergences of orangutans, gorillas, chimpanzees, and humans. A pioneering molecular work (Sarich and Wilson 1967) based on immunological distances had estimated that the split of gorillas and chimpanzees from humans occurred about five million years ago (Ma), challenging the then commonly held paleontological view that this divergence could have occurred as far back as 30 Ma. A lively debate began, confronting molecular and paleontological evidence, and fostered the use of different types of molecular data (DNA hybridization, restriction enzyme cleavage sites, protein electrophoresis, amino acid sequences) to provide an accurate estimate of divergence dates within hominids. Hasegawa et al. (1985) was the first phylogenetic analysis tackling this evolutionary question that was based on nucleotide sequences (complete mitochondrial genomes) and used maximum likelihood as the method of phylogenetic inference. The study inferred rather young estimates for the separation of gorillas (3.7 ± 0.6 Ma) and chimpanzees (2.7 ± 0.6 Ma) from humans, which have not been confirmed by later studies. Recent studies using probabilistic methods and large genomic data sets estimate the human-chimpanzee split at between 4.98 and 7.90 Ma, depending on the calibrations and the estimates of ancestral population size (Kumar et al. 2005; Amster and Sella 2016; Moorjani et al. 2016). Similarly, a phylogenetic analysis integrating paleontological and genomic data estimated the human-chimpanzee split at 6.9–7.9 Ma (Wilkinson et al. 2011).

Although the study clearly underestimated divergence dates between apes, Hasegawa et al. (1985) has been highly influential (> 8,000 citations) because it contained a hidden jewel. In order to use a model of evolution that could best fit the sequence data, the authors made two important decisions. First, they took into account that in a protein-coding gene most synonymous substitutions (implying no amino acid replacement) occur at third codon positions, and thus they estimated the parameters of the model independently for first plus second versus third codon positions. Second, it had been observed previously that in mitochondrial DNA nucleotide composition is highly biased (G is particularly underrepresented in the L-strand), and that transitions, i.e., changes between purines (A↔G) or between pyrimidines (C↔T), are more frequent than transversions, which change a purine into a pyrimidine or vice versa. Therefore, the authors built a statistical model, henceforth named HKY85, whose so-called Q matrix (Fig. 1) separately estimates the four nucleotide frequencies as well as two instantaneous substitution rates, one for transitions and one for transversions (Hasegawa et al. 1985).
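The following minimal sketch (not the authors' original implementation; the frequency and κ values are purely illustrative) shows how such a Q matrix can be assembled numerically and exponentiated to obtain substitution probabilities along a branch; in the original study, these parameters were estimated separately for first plus second versus third codon positions (see Fig. 1 below for the matrix itself).

```python
import numpy as np
from scipy.linalg import expm

NUC = "ACGT"
PURINES = {"A", "G"}

def hky85_Q(pi, kappa):
    """HKY85 instantaneous rate matrix (rows/columns ordered A, C, G, T),
    scaled so that the mean substitution rate equals one."""
    Q = np.zeros((4, 4))
    for i, x in enumerate(NUC):
        for j, y in enumerate(NUC):
            if i == j:
                continue
            transition = (x in PURINES) == (y in PURINES)  # A<->G or C<->T
            Q[i, j] = (kappa if transition else 1.0) * pi[j]
        Q[i, i] = -Q[i].sum()              # each row sums to zero
    mean_rate = -np.dot(pi, np.diag(Q))    # expected substitutions per unit time
    return Q / mean_rate

# Illustrative values only: mtDNA-like frequencies (G underrepresented)
# and a strong transition/transversion bias.
pi = np.array([0.32, 0.30, 0.13, 0.25])
Q = hky85_Q(pi, kappa=10.0)
P = expm(Q * 0.05)   # substitution probabilities over a branch of length 0.05
print(np.round(P, 3))
```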

Fig. 1

The Q instantaneous rate matrix for the HKY85 model. The order of the nucleotides for columns and rows is A, C, G, and T. Each (i,j) entry represents the rate at which nucleotide i is substituted by nucleotide j (in a time-reversible Markov model, this rate weighted by the frequency of i equals the rate of the reverse change from j to i weighted by the frequency of j; i.e., the reversibility property). The diagonal entries are set so that each row of the matrix sums to zero. π = nucleotide frequencies; µ = mean instantaneous substitution rate; κ = transition/transversion ratio; Y = pyrimidines; R = purines
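Written out in one standard parameterization consistent with this caption (reconstructed here because the figure itself is not reproduced), the matrix is

$$Q = \mu \begin{pmatrix} \cdot & \pi_C & \kappa\,\pi_G & \pi_T \\ \pi_A & \cdot & \pi_G & \kappa\,\pi_T \\ \kappa\,\pi_A & \pi_C & \cdot & \pi_T \\ \pi_A & \kappa\,\pi_C & \pi_G & \cdot \end{pmatrix},$$

where each diagonal entry (·) equals minus the sum of the off-diagonal entries in its row, κ multiplies the two transition rates (A↔G and C↔T), and the pooled purine (π_R = π_A + π_G) and pyrimidine (π_Y = π_C + π_T) frequencies appear in the closed-form transition probabilities derived from this matrix.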

The quest for the best evolutionary model had started with the simplest model, which assumed equal base frequencies and a single type of mutation, and continued by adding parameters that distinguished different types of mutation or allowed unequal base frequencies (Fig. 2). The HKY85 model improved on all previous models while offering a good compromise between bias and variance in the estimation of the parameters. Hence, it has been the model of choice in many molecular phylogenetic studies ever since. The sophistication of evolutionary models continued after HKY85 until the most general of them, the general time-reversible (GTR) model, was built (Tavaré 1986). Afterwards, it was realized that evolutionary models would also need to consider the heterogeneity of substitution rates across the sequence, which can be incorporated into the model by estimating the proportion of invariable sites (Hasegawa and Horai 1991), the alpha parameter of a gamma distribution (Yang 1993), or both (Gu et al. 1995). Given the variety of models of nucleotide substitution available, the Akaike information criterion (Akaike 1973) has been suggested for selecting the one that best fits the data (Posada and Buckley 2004). Furthermore, the same criterion can be used to select optimal partitioning schemes for the data (Lanfear et al. 2014).
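As a minimal illustration of this model selection strategy (a sketch, not the implementation of any published program; the log-likelihood values below are invented for the example), candidate models can be ranked by their AIC scores once their maximized log-likelihoods have been computed on the same alignment and tree:

```python
# Candidate substitution models with the number of free model parameters (k)
# and an illustrative maximized log-likelihood (lnL) for each.
candidate_models = {
    "JC69":  (0, -2150.4),
    "K80":   (1, -2101.7),
    "F81":   (3, -2097.2),
    "HKY85": (4, -2045.9),
    "GTR":   (8, -2043.1),
}

def aic(k, lnL):
    """Akaike information criterion: 2k - 2*lnL (lower is better)."""
    return 2 * k - 2 * lnL

scores = {name: aic(k, lnL) for name, (k, lnL) in candidate_models.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} AIC = {score:8.1f}")
print("Best-fit model:", min(scores, key=scores.get))
```

In practice, parameters shared by all candidate models (e.g., the branch lengths of a fixed tree) add the same constant to every k and therefore do not change the ranking.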

Fig. 2

The quest for the best evolutionary model. The simplest nucleotide substitution model (JC69; Jukes and Cantor 1969) was improved in the early 1980s by adding parameters that either assumed different types of substitution (K80, K81; Kimura 1980, 1981), unequal base frequencies (F81; Felsenstein 1981), or both (HKY85; Hasegawa et al. 1985)

The development of models of amino acid replacement has followed a parallel history. In this case, the number of possible changes among the 20 amino acids makes the Q matrix much more complex, and thus researchers have normally opted to use empirical matrices that summarize the frequencies of amino acid replacements observed in large data sets, such as mtREV (Adachi and Hasegawa 1996), mtART (Abascal et al. 2007), and mtZoa (Rota-Stabelli et al. 2009) for mitochondrial data and JTT (Jones et al. 1992), WAG (Whelan and Goldman 2001), and LG (Le and Gascuel 2008) for nuclear data.

At the end of the 1980s, the advent of automated Sanger sequencing (Ansorge et al. 1987), the popularization of the polymerase chain reaction (Saiki et al. 1988), and the design of versatile primers to amplify genes across many different living organisms (e.g., Kocher et al. 1989) greatly accelerated the acquisition of DNA sequence data for molecular phylogenetics in the 1990s. Moreover, at the turn of the century phylogenetic methods came of age, first through the incorporation of likelihood ratio tests, which opened the possibility of contrasting evolutionary hypotheses (Huelsenbeck and Rannala 1997), and afterwards through the application of Bayesian inference (Yang and Rannala 1997; Huelsenbeck et al. 2001). The latter allowed the use of mixture models to accommodate across-site heterogeneity (Lartillot and Philippe 2004), the implementation of relaxed molecular clocks (Drummond and Suchard 2010), and triggered a burst of phylogenetic comparative methods (Revell 2012), among other innovations.

Since the advent of high-throughput sequencing technologies in the last decade, the new field of phylogenomics has emerged, allowing the reconstruction of phylogenies based on genomic sequences and thus on a vast number of characters (Lemmon et al. 2012; McCormack et al. 2012). Nonetheless, this new field is not exempt from challenges. Genomes encode numerous gene families, and a first serious problem is to unambiguously separate orthologs (gene copies arising from speciation) from paralogs (gene copies arising from duplication), as only the former can be used to reconstruct species trees. The concatenation of multiple genes yields robust phylogenetic trees, although it is computationally intensive and poses modeling challenges. Moreover, it disregards single-gene tree information, which could be incongruent due to diverse evolutionary phenomena. This is particularly worrisome when inferring phylogenetic relationships among closely related taxa, and new methods of phylogenetic reconstruction based on coalescent models have been devised to account for incomplete lineage sorting, hybridization, and recombination, although they will need to be improved in the coming years as they are computationally highly demanding (Jiang et al. 2020).

The possibility of reconstructing the Tree of Life as first envisioned by Darwin (1859) is closer than ever. Moreover, as more whole genomes become available throughout the Tree of Life, phylogenetic comparative methods will pave the way to link genotype and phenotype variation, thus decisively contributing to a better understanding of the evolutionary processes and mechanisms underpinning the origin and maintenance of biological diversity (Smith et al. 2020).