Introduction

The significance of extant catabolic pathways lies not only in their use by organisms as sources of carbon, nitrogen, and other elements, but also in the extraction of energy from high-energy molecules and its transfer to other biologically useful molecules such as ATP or GTP, as well as in the accumulation of reducing power in the form of molecules such as NADH or FADH2. In many cases, the end products of a catabolic pathway are crossroads metabolites, which serve as substrates for the biosynthetic routes of proteins, nucleic acids, and other biologically relevant molecules. Through catabolic routes, organisms are able to use the organic material that surrounds them as the source of essential elements to sustain their existence. Therefore, these routes must have played a key role in the early evolution of life during the establishment of the intricate network of reactions that constitute extant metabolic pathways.

The first discussion of the origin of catabolism was put forward by Oparin (1924, 1938). One of the central tenets of his ideas on the heterotrophic origin of life was the assumption that glycolysis was the oldest metabolic route of a primitive heterotrophic anaerobe that was totally dependent on the organic material present in its surroundings (Oparin 1924, 1938; Lazcano 2016). Based on the apparent ubiquity of glycolytic enzymes and on the relative simplicity of fermentative reactions, Oparin suggested an anaerobic primitive environment in which the prebiotic synthesis and accumulation of organic chemicals occurred (Lazcano and Miller 1999; Lazcano 2016). More than 20 years later, Krebs and Kornberg (1957) and Krebs (1981) argued that if Oparin’s scenario was valid, then the second catabolic process to evolve must have been the pentose phosphate pathway, because of its close relation with glycolysis, as indicated by several shared biochemical products. Later on, Clarke and Elsden (1980), based on the ease of abiotic synthesis of amino acids and the relative simplicity of amino acid fermentation, suggested a “primitive energy-yielding oxido-reduction system” that used glycine and proline as oxidizing agents, with an associated phosphorylation, from which a catabolic pathway could be assembled. The most recent proposal on the origin of enzyme-mediated degradations of cellular components such as carbohydrates, amino acids, and nucleobases was developed by Schönheit et al. (2016). They assumed an autotrophic origin of life, and argued that the most suitable molecules that could be decomposed by early heterotrophs had to be the cellular components of pre-existing autotrophs, and that the enzymatic machineries of those hypothetical autotrophic organisms must have evolved into catabolic pathways resembling, for instance, the Clostridial-type fermentations, which in turn must have been the first forms of heterotrophic carbon and energy metabolism. By assuming the autotrophic origin of life, the possibility of earlier degradations, whether enzyme-based or RNA-based, of the organic material present on the primitive Earth was dismissed.

In spite of claims to the contrary (Caetano-Anollés et al. 2007; Caetano-Anollés and Caetano-Anollés 2013), it is not easy to correlate the compounds that may have been present on the primitive Earth with the intricate network of reactions that make up current metabolism. Several attempts have been made to extrapolate extant anabolic routes, which may have been part of the metabolic traits of the LCA, to the origin of life itself; under this view, however, the origin of metabolism would appear to be closer to the LCA than to the emergence of the first living entities (Lazcano and Miller 1999). On the other hand, almost no attention has been given to understanding how the constituents of the prebiotic soup led to the first catabolic pathways (Keefe et al. 1995). It is of course reasonable to assume that the outcome of prebiotic synthesis experiments, combined with the data on organic compounds present in meteorites, provides information about the molecules that may have been available in the primitive environments as substrates of the first catabolic pathways. However, for the time being it may be wiser to limit our extrapolations of extant pathways to a period after protein biosynthesis had already evolved.

It is also important to realize that even in extant biochemical pathways, a certain number of non-enzymatic reactions occur spontaneously, and that it is possible that equivalent reactions played an important role in the establishment of early metabolic pathways (Lazcano and Miller 1999; Keller et al. 2015). It seems likely that a number of reactions of primitive catabolic pathways could have at first taken place semi-enzymatically. In other words, the approach advocated here does not allow any direct inference about the truly primordial catabolic pathways that may have existed during previous stages such as the RNA world or a pre-RNA world, where the reactions could have been entirely enzyme-free processes dependent on ribozymes or on inorganic catalysts such as mineral cations (Becerra et al. 2007).

Metabolism has traditionally been divided into anabolism and catabolism, though in a number of cases the line separating the pathways involved in the synthesis and the degradation of a compound is blurred. Good examples include the reductive TCA cycle, which involves almost the same set of reactions as the oxidative cycle running backwards, as well as the lysine degradation pathway, which begins with half the reactions used in the biosynthetic pathway (Michal 1998). In evolutionary terms, it is possible that a pathway once used to synthesize a molecule has evolved to break down the same molecule or vice versa, making it difficult to reconstruct the processes that gave rise to the complex networks of biochemical reactions we see today.

In the present study, we explore the phylogenetic distribution and molecular evolution of the enzymes that make up the extant catabolic pathways of the monosaccharides glucose and ribose, as well as of the nucleobases adenine, guanine, cytosine, uracil, and thymine. Based on their distribution, we were able to date enzymes and pathways that were most likely present in the LCA and those that evolved within particular phylogenetic groups. We have also considered the oxygen dependence of catabolic enzymes as a diagnostic character of their antiquity, since the first accumulation of O2 in the Precambrian atmosphere, known as the Great Oxygenation Event (GOE), changed the chemistry of the planet forever, leading to the development of more energetically efficient metabolisms (Goldfine 1965; Canfield 2005). Moreover, recent efforts based on paleobiological and biogeochemical evidence now recognize a second increase in atmospheric oxygen concentrations at the end of the Neoproterozoic Era, known as the Neoproterozoic Oxygenation Event (NOE), that may have had an effect on the evolution of multicellular organisms (Knoll and Nowak 2017). This approach allows us to establish a relative date for enzymes and pathways that most likely appeared before and after the accumulation of oxygen in the terrestrial atmosphere (Fig. 1), which may have started sometime during the period 2.8–2.3 billion years ago (Canfield 2005; Lyons et al. 2014).

Fig. 1

Relative dating of the catabolic enzymes and the catabolic pathways. The oxygen dependence of the enzymes and of the pathways in which they participate was used to establish a relative dating. Oxygen-dependent pathways most likely evolved into their present form after the oxygen enrichment of the Earth’s atmosphere, sometime between 2.8 and 2.3 billion years ago (Canfield 2005; Lyons et al. 2014; Knoll and Nowak 2017). The distribution of the catabolic enzymes across the phylogenetic groups of the two major domains of life, Bacteria and Archaea (Forterre and Gribaldo 2010; Embley and Williams 2015), allowed the classification of the pathways that were most likely present in the LCA as universal, and of the pathways whose enzyme distribution suggested either an archaeal or a bacterial origin as non-universal. Eukaryotes were also included in the database for completeness. The distribution results for the eukaryotic enzymes are also presented in the supplementary material or at the following web addresses https://goo.gl/dDdLfd, https://goo.gl/WDwRQE, https://goo.gl/6x7hmo and https://goo.gl/JAPI7R

Methodology

Defining Catabolic Enzymes and Pathways

The catabolic pathways, the enzymes, the substrates, and the products analyzed here have been defined based on the Biochemical Pathways Atlas (Michal and Schomburg 2012), the KEGG database of metabolisms (Kanehisa and Goto 2000; Kanehisa et al. 2014), the MetaCyc database (Caspi et al. 2014), and the BRENDA database (Schomburg et al. 2013). This information was complemented using the data of McMurry and Begley (2005) when required. In some cases, the distinction between anabolic and catabolic enzymes is not evident, since they may intervene in both synthesis and degradation of a given compound. Accordingly, when a report of a catabolic pathway was found in the literature or in the databases, a distinction was made between enzymes used exclusively for degradation, and those involved in both degradation and synthesis using the biochemical information available in KEGG, MetaCyc, and/or BRENDA databases. There are a few special cases when there are no biochemical data available that allow us to establish the directionality of a reaction. In such cases, the enzyme was annotated as catabolic if it most probably catalyzes a reaction identified as essential within the chain of reactions of the degradation pathway. All reactions of catabolic pathways under examination, including the corresponding enzymes and cofactor/coenzyme requirements, were collected in a catalogue available as supplementary material Catabolism-Catalogue or through the following web address https://goo.gl/Ts1xpD.

Recruitment of Sequences

Sequences for the homologue searches were recruited using the Enzyme Commission (EC) number annotation in the UniProt database (The UniProt Consortium 2012), which allowed us to collect high-quality, manually annotated, and non-redundant protein sequences from UniProtKB/Swiss-Prot (The UniProt Consortium 2014), and high-quality computationally analyzed sequences from UniProtKB/TrEMBL (Magrane and Consortium 2011). In the cases where no sequences associated with a particular EC identifier were found in the UniProt database, a random seed sequence was chosen from one representative each of Archaea, Bacteria, and Eukarya (when available) out of the KEGG protein web database (Kanehisa and Goto 2000; Kanehisa et al. 2014). A web search for each seed sequence was then performed using the blastp routine of the BLAST software (Altschul et al. 1990) against the non-redundant database from NCBI (Geer et al. 2010). The sequences gathered in each search comprised the one hundred hits with the best E-values and a query coverage above 70%.
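The hit-selection rule described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the `Hit` record and the `select_hits` function are hypothetical names, and we assume that the coverage filter is applied before the one hundred best E-values are taken, an order the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One blastp hit (hypothetical record; field names are illustrative)."""
    subject_id: str
    evalue: float
    query_coverage: float  # percentage of the query covered by the alignment


def select_hits(hits, min_coverage=70.0, max_hits=100):
    """Keep hits with query coverage above min_coverage, then return
    the max_hits hits with the lowest (best) E-values."""
    # Assumption: coverage is filtered first, E-value ranking second.
    covered = [h for h in hits if h.query_coverage > min_coverage]
    return sorted(covered, key=lambda h: h.evalue)[:max_hits]
```

With this ordering, a hit with an excellent E-value but a coverage of 60% never enters the collection, whereas applying the two steps in the opposite order could return fewer than one hundred sequences.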

Analyzing the Collection of Sequences

Once the sequences had been collected, they were first filtered to limit the redundancy in the database using the CD-HIT methodology (Fu et al. 2012) with values C = 0.7 and N = 5. The domain architecture of the remaining sequences was identified by recognizing Pfam domains (Punta et al. 2012) with the HMMER3 software (Mistry et al. 2013), using the hmmscan routine against the Pfam database (Punta et al. 2012; release 27.0) according to the instructions in the HMMER manual. The identification of domain architectures revealed sequences with different domain architectures annotated under a common EC number, which led us to cluster together the sequences with equal domain architectures before further processing. As a result, more than one HMMER profile was built for the sequences classified under the same EC number. Each profile represents one of the previously identified domain architectures, which in turn allowed us to perform independent searches for each profile generated this way.
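The clustering step can be illustrated with a short sketch. It assumes that the per-domain hits of the Pfam search have already been parsed into `(sequence_id, domain_name, start_position)` tuples; that input format and the function name `cluster_by_architecture` are assumptions made for illustration. An architecture is taken here to be the ordered tuple of domain names along the sequence.

```python
from collections import defaultdict


def cluster_by_architecture(domain_hits):
    """Group sequence identifiers by their ordered Pfam domain composition.

    domain_hits: iterable of (sequence_id, domain_name, start_position).
    Returns {architecture_tuple: [sequence_id, ...]}.
    """
    per_sequence = defaultdict(list)
    for seq_id, domain, start in domain_hits:
        per_sequence[seq_id].append((start, domain))

    clusters = defaultdict(list)
    for seq_id, placed in per_sequence.items():
        # Order the domains by their start position along the sequence.
        architecture = tuple(domain for _, domain in sorted(placed))
        clusters[architecture].append(seq_id)
    return dict(clusters)
```

Each key of the returned dictionary would then correspond to one profile to be built for the EC number under study.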

Building Profiles for Homologue Search

Since a multiple sequence alignment is a prerequisite for profile construction, each set of previously curated sequences was aligned with MUSCLE v3.8.31 (Edgar 2004a, b) using default settings. A profile was then built for each cluster of sequences using the hmmbuild routine according to the manual instructions (Eddy 2011).

Assigning a Limit to the Hmmsearch Results

After obtaining the profile for every enzyme under study, a search was performed using the hmmsearch routine against the original unaligned set of sequences that had been used to create that profile, in order to detect the sequence that scores worst against the model it helped to build. This allowed us to isolate that sequence and include it in the protein database as “The Limit” for each particular model, instead of assigning an arbitrary cutoff value. Finally, only those sequences that performed as well as or better than the limit sequence of each profile were included in the presence/absence results.
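The idea of a self-calibrated cutoff can be sketched as follows. The example assumes that performance is measured by the hmmsearch bit score, where higher is better; the text does not specify whether score or E-value was compared, and the function names are hypothetical.

```python
def limit_sequence(training_scores):
    """Return (sequence_id, score) of the training sequence that scores
    worst against the profile built from it ("The Limit")."""
    seq_id = min(training_scores, key=training_scores.get)
    return seq_id, training_scores[seq_id]


def passes_limit(database_scores, training_scores):
    """Return the database sequences scoring at least as well as the limit.

    Both arguments map sequence_id -> bit score (assumed: higher is better).
    """
    _, cutoff = limit_sequence(training_scores)
    return {seq for seq, score in database_scores.items() if score >= cutoff}
```

Because the cutoff is derived from the training set itself, each profile receives its own threshold, which is the property that distinguishes this approach from a fixed, arbitrary cutoff.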

Protein Database

A database of protein sequences consisting entirely of completely sequenced organisms, with representatives from the three domains of life, was assembled. It comprises 1201 Bacteria, 97 Archaea, and 146 Eukaryotes, all of them collected from the KEGG database (Kanehisa and Goto 2000; Kanehisa et al. 2014). The archaeal representatives analyzed here are mostly members of the Euryarchaeota and Crenarchaeota major groups; members of the Thaumarchaeota, Nanoarchaeota, and Korarchaeota groups were also analyzed for the sake of completeness. Although we consider that these groups are not representative of the entire domain, owing to the few species that they included at the time the local database was assembled (2011), the results of the searches against their respective organisms are also presented as supplementary material or at the following web addresses https://goo.gl/dDdLfd, https://goo.gl/WDwRQE, https://goo.gl/6x7hmo and https://goo.gl/JAPI7R. The bacterial members belong to the α-proteobacteria, β-proteobacteria, δ-proteobacteria, ε-proteobacteria, γ-proteobacteria, Firmicutes, Tenericutes, Actinobacteria, Chlamydiae, Spirochaetes, Acidobacteria, Bacteroidetes, Fusobacteria, Verrucomicrobia, Planctomycetes, Cyanobacteria, Green Sulfur Bacteria, Green non-sulfur bacteria, Deinococcus–Thermus, and Hyperthermophilic-bacteria major groups. The eukaryotic members comprise the Vertebrates, Arthropods, Nematodes, Cnidarians, Eudicots, Monocots, Algae, Ascomycetes, Basidiomycetes, Choanoflagellates, Amoebozoa, Alveolates, Euglenozoa, and Diatoms major groups, according to the classification in the KEGG database.

Defining the Antiquity of the Pathways

Relative dating of the catabolic pathways was defined using two different criteria. The first and most obvious one was based on the oxygen dependence of the enzymes that make up each catabolic pathway. If an enzyme of a particular pathway catalyzes a degradative reaction that requires molecular oxygen as a substrate, and that reaction is considered essential for the pathway to produce a compound described in the literature, but has no oxygen-independent counterpart, then we assumed that the entire pathway evolved into its present form after the oxygen enrichment of the terrestrial atmosphere (see supplementary material Catabolism-Catalogue for details of the oxygen-dependent enzymes in each catabolic pathway). The remaining pathways were then classified as universal or non-universal in accordance with the distribution of their enzymes across the major groups of the Bacteria and Archaea domains (Forterre and Gribaldo 2010; Embley and Williams 2015). The presence of enzymes in all organisms of the database was not defined as a requisite. Instead, we considered that their distribution must be representative among the major groups of Archaea and Bacteria, which means that the enzymes of a particular pathway must be present in at least half of the known major groups in the Archaea and Bacteria domains. For a particular group to be considered indicative of the presence of a given pathway, the corresponding enzymes must be present in at least one-third of the organisms that make up that major group; the latter value was defined empirically, based on the phylogenies constructed for all catabolic enzymes as a strategy to detect possible cases of horizontal gene transfer between Archaea and Bacteria. For the sake of completeness, enzyme searches in the Eukarya domain were performed and their distribution is reported in the same manner. This allowed us to distinguish between pathways that were most likely present in the LCA (or LUCA, Last Universal Common Ancestor) and those that evolved after the divergence of the two major prokaryotic domains. Detailed group-distribution tables are provided as supplementary material or at the following web addresses https://goo.gl/dDdLfd, https://goo.gl/WDwRQE, https://goo.gl/6x7hmo and https://goo.gl/JAPI7R.
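The two criteria can be summarized in a short sketch. It is an illustration under stated assumptions rather than the procedure actually implemented: the “at least half of the major groups” rule is applied here to Archaea and Bacteria separately (the text could also be read as applying to both domains combined), the function names and the label “post-oxygenation” are hypothetical, and the counts in the usage note are invented.

```python
from fractions import Fraction


def group_has_pathway(organisms_with_enzymes, organisms_in_group):
    """A major group counts as positive when at least one-third of its
    organisms carry the enzymes of the pathway."""
    return Fraction(organisms_with_enzymes, organisms_in_group) >= Fraction(1, 3)


def is_universal(counts_by_domain):
    """counts_by_domain: {"Archaea": {group: (n_with, n_total)}, "Bacteria": {...}}.

    Assumption: at least half of the major groups must be positive
    in each of the two prokaryotic domains.
    """
    for domain in ("Archaea", "Bacteria"):
        groups = counts_by_domain[domain]
        positive = sum(group_has_pathway(w, t) for w, t in groups.values())
        if 2 * positive < len(groups):
            return False
    return True


def classify(requires_oxygen, counts_by_domain):
    """Oxygen dependence is checked first, phylogenetic distribution second."""
    if requires_oxygen:
        return "post-oxygenation"
    return "universal" if is_universal(counts_by_domain) else "non-universal"
```

For instance, with the invented counts `{"Archaea": {"Euryarchaeota": (20, 60), "Crenarchaeota": (1, 30)}, "Bacteria": {"Firmicutes": (100, 200), "Cyanobacteria": (10, 40), "Actinobacteria": (50, 100)}}`, an oxygen-independent pathway would be classified as universal, since one of two archaeal groups and two of three bacterial groups pass the one-third threshold.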

Enzyme Evolution Within and Between the Pathways

We have traced enzyme ancestry within and between the catabolic pathways using sample sequences taken at random from our previously classified collections of enzymes. In this way, sets of sequences that represent all the enzymes involved in each catabolic pathway were defined. We then merged all sequences corresponding to the carbohydrate and nucleobase degradative pathways into a single “fasta” file for each category. Finally, copies of the same set of sequences that make up each category were used as the target database to emulate the same type of results as in the “bidirectional best hit” methodology, using the BLAST+ blastp routine v.2.4.0 (Camacho et al. 2009). The results of the BLAST comparisons were filtered by the following criteria: (a) only those sequences that were identified as “bidirectional hits” were considered; (b) results with E-values less than or equal to 1 × 10⁻¹⁰ were included; and (c) alignments with query coverage above 70% were taken as positives. The data collection that emerged from this comparison is presented as supplementary material or at the following web address https://goo.gl/ryvTjK.
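The filtering criteria can be expressed as a small sketch operating on rows of the form `(query, subject, evalue, query_coverage)`. The row format and function names are hypothetical, and two choices are assumptions on our part: self-hits are discarded (since the same file serves as query and database), and the E-value and coverage filters are applied before the best hit of each query is chosen.

```python
def best_hits(rows, max_evalue=1e-10, min_coverage=70.0):
    """Return {query: subject} with the lowest-E-value acceptable hit per query.

    rows: iterable of (query, subject, evalue, query_coverage).
    """
    best = {}
    for query, subject, evalue, coverage in rows:
        # Assumption: self-hits and hits failing criteria (b) or (c) are skipped.
        if query == subject or evalue > max_evalue or coverage <= min_coverage:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_pairs(rows):
    """Return the set of unordered pairs that are each other's best hit."""
    best = best_hits(rows)
    return {frozenset((q, s)) for q, s in best.items() if best.get(s) == q}
```

Representing each pair as a `frozenset` avoids reporting the same relationship twice, once in each direction.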

Comparisons of the Crystallographic Structures of Proteins

When protein sequence conservation was not enough to establish or rule out homology relationships, and the protein domain architecture suggested a possible common history based on the Pfam clan classification (Finn et al. 2006), the crystallographic structures of the proteins were compared when available (see supplementary material or the following web addresses https://goo.gl/fUklBo and https://goo.gl/nSoRlZ). Protein crystallographic structures with a resolution of 3.0 Å or better were recruited from the Protein Data Bank (Berman et al. 2000), preferably from the same organism (or a phylogenetically related organism) previously identified by our search for homologue sequences, when available. Structure alignments were performed with the PDB web tools (Prlic et al. 2010) using the jFATCAT-rigid algorithm, which is a Java port of the original FATCAT algorithm (Ye and Godzik 2003). Three parameters were considered in order to propose homology between two given enzymes:

  1. RMSD. The root-mean-square deviation, which represents the standard deviation of the differences between the equivalent α-carbons of two crystal structures, and which is widely used as a reference value to establish the likeness between two crystallographic structures (Chothia and Lesk 1986; Irving et al. 2001). In most cases, our comparisons involved proteins of organisms that belong to separate domains of life and, as noted long ago by Chothia and Lesk (1986), the extent of the structural changes is directly related to the extent of protein sequence changes and therefore to the amount of time since both proteins diverged from their common ancestor. In this work, the RMSD threshold value was therefore set at 4.0, so that distantly related proteins were not discarded without further consideration.

  2. SAS. The structural alignment score is a geometric distance measure that considers the number of aligned residues in relation to their respective RMSD (Subbiah et al. 1993). The SAS value weighs alignments with good RMSD values against the number of residues that were aligned between the crystallographic structures in one standardized measurement. In this work, the threshold value was set at 2.0.

  3. Percentage of aligned amino acids. The evolution of proteins in separate lineages can cause two homologues to be subject to different selective pressures, leading in many cases to different sizes. However, if they keep on catalyzing similar or identical reactions, then there must be regions along the peptide chain that remain alike. Therefore, we assumed that at least 50% of the residues of the smallest protein under comparison must align with their corresponding protein residues for the comparison to be significant enough to derive homology relationships.
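The three criteria can be combined in a brief sketch. We assume here the usual definition of the structural alignment score, SAS = RMSD × 100 / number of aligned residues, and that both the RMSD and the SAS thresholds act as upper bounds (lower values indicate closer structures); neither point is spelled out above, and the function names are hypothetical.

```python
def structural_alignment_score(rmsd, aligned_residues):
    """SAS, assumed here to be RMSD x 100 / number of aligned residues."""
    return rmsd * 100.0 / aligned_residues


def supports_homology(rmsd, aligned_residues, length_a, length_b,
                      max_rmsd=4.0, max_sas=2.0, min_fraction=0.5):
    """Check the three structural criteria for a pair of aligned structures."""
    shortest = min(length_a, length_b)
    return (
        rmsd <= max_rmsd                                            # criterion 1
        and structural_alignment_score(rmsd, aligned_residues) <= max_sas  # criterion 2
        and aligned_residues / shortest >= min_fraction             # criterion 3
    )
```

Under these assumptions, an alignment of 200 residues with an RMSD of 3.0 Å between proteins of 320 and 450 residues gives a SAS of 1.5 and covers 62.5% of the shorter protein, so all three criteria would be met.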

Results and Discussion

As illustrated in Fig. 1, the methodology used here, based both on the distribution of enzymes and on their dependence (or not) on molecular oxygen, has allowed us to construct a model that lists the pathways that may have been present in the LCA and those that evolved after the oxygen enrichment of the Earth’s atmosphere. The possibility of a complete replacement of oxygen-independent enzymes by oxygen-dependent enzymes (Raymond and Blankenship 2004), which could lead to erroneous interpretations of the antiquity of the pathways, should be acknowledged. In our model, enzymes and pathways that have a non-universal distribution but an oxygen-free biochemistry were placed before this major breaking point in the history of life. This does not necessarily limit their origin to that particular time, since it is possible that those routes are more recent and evolved in anaerobic organisms living in oxygen-free environments. It is important to underline that our methodology and results differ from the reconstructions of the genomic content of the LCA (or LUCA) (Harris et al. 2003; Mirkin et al. 2003; Delaye et al. 2005; Yang et al. 2005; Ranea et al. 2006; Kim and Caetano-Anollés 2011), because we are using a large number of completely sequenced cellular genomes, and the intersection of the genome content rapidly decreases as the number of cellular genomes used for comparison increases. Therefore, strict presence of all enzymes in all organisms represented in our database was not a requisite. Instead, as noted above, we have used their distribution across the major phylogenetic groups that make up the Archaea, Bacteria, and Eukarya domains, since there can always be secondary losses among certain groups, such as parasites and symbionts, that diminish the effective number of genes that can be traced back to the LCA. While it is true that this approach could introduce a bias in our results, it allows us to classify complete pathways and not only isolated genes. Horizontal gene transfer events can also affect the number of genes that are considered to be part of the genetic content of the LCA. Therefore, we analyzed the phylogenetic history of each enzyme under study; cases detected this way are discussed further in each section.

Glucose

It has long been assumed that the fermentations of glucose and ribose were the first catabolic pathways (Oparin 1924, 1938; Krebs and Kornberg 1957; Krebs 1981; Lazcano 2016), but the presence of glucose and ribose in the prebiotic environment remains an open issue, even though the synthesis of smaller sugar compounds such as glycolaldehyde and glyceraldehyde under prebiotically plausible conditions has been reported (Ritson and Sutherland 2012). Even if the first non-enzymatic carbohydrate synthesis was achieved by Butlerov in the well-known “formose reaction” more than 150 years ago (Rauchfuss 2008), it is not easy to envision the synthesis and accumulation of large amounts of carbohydrates like glucose under possible prebiotic conditions. Although ribose synthesis can be achieved through the formose reaction, the overall yield is small (Reid and Orgel 1967; Shapiro 1988), and ribose is a very unstable carbohydrate, with a half-life of ≈ 73 min at 100 °C and pH 7, and of 44 years at 0 °C (Larralde et al. 1995). The identification of by-products of the alkaline hydrolysis of ribose in extracts of the Murchison and Murray meteorites has been reported (Cooper et al. 2001), suggesting the ephemeral presence of ribose in the parent body of some carbonaceous meteorites. In the present work, we have analyzed the main catabolic pathways that degrade glucose and ribose because they are considered relevant to understanding the early evolution of carbohydrate metabolism, although we recognize that current evidence suggests that their prevalence in extant biology may have resulted from the earliest evolution of anabolic routes, and not necessarily from prebiotic synthesis or delivery from outer space.

The canonical product of glycolysis is pyruvate, a crossroads metabolite that either connects to the Krebs cycle or enters the oxygenic/anoxygenic fermentative pathways in extant organisms. It is well known that the selective advantage of glycolysis is the harvesting of reducing power in the form of two molecules of NADH, and the net gain of two molecules of ATP per molecule of glucose degraded. On the other hand, the main products of extant ribose degradation are fructose 6-phosphate and/or glyceraldehyde-3-phosphate, which are crossroads metabolites connecting directly to the glycolytic pathway (Kanehisa and Goto 2000; Michal and Schomburg 2012). While the selective advantage of the pentose phosphate cycle is the production of reducing equivalents, such as NADPH, ribose catabolism on its own appears to be related to the production of the intermediates that feed glycolysis. Although the latter is consistent with the proposal of Krebs and Kornberg (1957) and Krebs (1981) that the pentose phosphate pathway developed after the establishment of glycolysis, it does not necessarily imply that those two routes emerged during the very early stages of life.

Glycolysis starts with the phosphorylation of glucose to produce glucose 6-phosphate. There are four different enzymes that can catalyze this reaction, namely, hexokinase (HEX, EC 2.7.1.1), glucokinase (GLUK, EC 2.7.1.2), polyphosphate glucokinase (PGK, EC 2.7.1.63), and ADP-specific glucokinase (AGLU, EC 2.7.1.147). Phylogenetic analysis of the distribution of these enzymes reveals that (a) HEX is restricted to the major groups of Eukarya; (b) GLUK has a universal distribution, being present among almost all the major groups of Archaea, Bacteria, and Eukarya; (c) PGK has been found exclusively among some members of the Actinobacteria, Acidobacteria, Bacteroidetes, Verrucomicrobia, Cyanobacteria, and Deinococcus–Thermus bacterial groups; and (d) AGLU appears to be restricted to some members of the Euryarchaeota and some members of the eukaryotic groups of Nematodes and Vertebrates. Such a biological distribution suggests that the most ancient way of catalyzing this reaction was through a GLUK-like enzyme that was probably already present in the LCA. Sequence comparisons reveal no significant resemblance among these proteins, but domain composition reveals that HEX, GLUK, and PGK are composed of the hexokinase_1, hexokinase_2, glucokinase, and ROK domains, all of which belong to the same Pfam clan (CL0108), suggesting that they might share an evolutionary history (Finn et al. 2006). Comparisons of the crystallographic structures of HEX, GLUK, and PGK confirm that these proteins share a common ancestor (see Fig. 2 and supplementary material or visit https://goo.gl/fUklBo). On the other hand, AGLU possesses the ADP_PFK_GK domain, which belongs to a different Pfam clan (CL0118). Its distribution and the lack of resemblance between its crystallographic structure and those of HEX, GLUK, and PGK suggest that it is a later, independent development.

Fig. 2

Comparisons of the crystallographic structures of kinases from the Actin-like ATPase superfamily and the Ribokinase-like superfamily that are involved in the phosphorylation reactions in glycolysis and the ribose catabolic pathway. Alignments are accompanied by a representation of the Pfam domain composition of the proteins. a Alignment of the crystallographic structures of hexokinase (EC 2.7.1.1, PDB ID 4QS7, red) and glucokinase (EC 2.7.1.2, PDB ID 1SZ2, forest-green). Both structures are visualized with their respective glucose substrate (blue and magenta, respectively). Although it is difficult to establish homologous relationships based solely on their primary sequence alignments, Pfam domain classification suggests a possible common evolutionary history, which is supported by the alignment of both crystallographic structures. b Alignment of the crystallographic structures of ribokinase (EC 2.7.1.15, PDB ID 3RY7, yellow) and the ADP-specific glucokinase (EC 2.7.1.147, PDB ID 4B8R, marine-blue). Residues of the glucose binding site of the ADP-specific glucokinase are highlighted in magenta (N38, D42, E96, Q121, H184, R205, and D451). Although ADP-specific glucokinase and ribokinase phosphorylate different carbohydrates, their overall structure is the same, confirming their common ancestry as suggested by their Pfam domain composition classification. c Alignment of the crystallographic structures of ribokinase (EC 2.7.1.15, PDB ID 3RY7, pale-yellow) and the archaeal ADP phosphofructokinase (EC 2.7.1.146, PDB ID 3DRW, teal-cyan) with ADP molecule (magenta). The alignment supports the possibility that both enzymes share a common ancestor as suggested by their domain clan assignment. Images presented here were created using The PyMOL Molecular Graphics System, Version 1.8.1.2 Schrödinger, LLC implementing the CE algorithm (Shindyalov and Bourne 1998) and colors correspond to the PyMOL standard color palette. See supplementary material or visit https://goo.gl/fUklBo for details on the comparisons of the crystallographic structures

Isomerization of glucose-6-phosphate into fructose-6-phosphate is catalyzed by glucose 6-phosphate isomerase (EC 5.3.1.9), a universally distributed enzyme that is well represented among all major groups of the Archaea, Bacteria, and Eukarya. Fructose 6-phosphate undergoes a second phosphorylation to produce fructose 1,6-bisphosphate, a reaction that can be catalyzed by two non-homologous enzymes, 6-phosphofructokinase (PPFK, EC 2.7.1.11) and ADP phosphofructokinase (APPFK, EC 2.7.1.146). There is also a PPFK isoenzyme (EC 2.7.1.11/2.7.1.90) that can phosphorylate fructose-6-phosphate using pyrophosphate (PPi) in the absence of ATP (Reeves et al. 1974). This isoenzyme is so similar to the original PPFK that it is not easy to distinguish between them either by sequence comparisons or by position-specific weight matrices. Both versions of the PPFK are characterized by the PFK domain, which spans almost the whole length of their respective sequences, demonstrating their common origin. While PPFK is well distributed among the bacterial and eukaryotic groups, it is almost absent from Archaea. Although there are a few Archaea from the Methanomicrobia class in which homologues of PPFK have been identified, their limited distribution and close resemblance to the bacterial enzymes suggest that they are the outcome of two horizontal gene transfer (HGT) events, as shown in Fig. S1. By contrast, APPFK has been identified exclusively in members of the Euryarchaeota group, suggesting an archaeal origin. Sequence comparisons of the PPFK and the APPFK reveal no significant resemblance that could lead us to establish any evolutionary relationship between them. Analysis of the domain architectures reveals that PFK is the only functional domain in the PPFKs, while ADP_PFK_GK is the only domain identified in the APPFKs. Since these domains are classified under different Pfam clans (CL0240 and CL0118, respectively), common ancestry appears to be unlikely. As shown in Fig. 2, crystallographic comparison reveals no significant structural relationship between them, but in fact suggests that APPFK is related to kinases of the ribokinase-like family. We therefore conclude that there is no universally distributed phosphofructokinase whose ancestor could have catalyzed the phosphorylation of fructose 6-phosphate in the LCA.

Once fructose 1,6-bisphosphate is synthesized, it undergoes a C–C bond cleavage catalyzed by fructose-bisphosphate aldolase (FBPA, EC 4.1.2.13), releasing glycerone phosphate and glyceraldehyde-3-phosphate. Although at first the distribution of FBPA among major groups of Bacteria and Eukaryotes suggested its emergence after the two major prokaryotic groups had diverged, detailed analyses of the databases and the sequences of FBPA reveal that they actually belong to two separate classes: Class I is well represented among the major groups of Archaea and scarcely present in the bacterial lineage, while Class II was found to be well represented among γ-proteobacteria, ε-proteobacteria, Actinobacteria, Spirochaetes, Bacteroidetes, Verrucomicrobia, and Planctomycetes bacterial groups and nearly absent from archaeal organisms. Such distribution suggests independent evolutionary histories within their respective lineages. Specific sequence comparisons of the two classes of aldolases reveal that the identity and similarity values are in the so-called “twilight zone,” and it is difficult to conclude that they are homologues. However, domain architecture analysis revealed that FBPA Class I possesses the DeoC domain, while FBPA Class II possesses the F_bP_aldolase domain. Both domains are recognized as members of the same Pfam clan (CL0036), which suggests that they could share a common ancestor (Finn et al. 2006). Comparison between the crystallographic structures of both classes of aldolases does support a common origin (see supplementary material or visit https://goo.gl/fUklBo), which in this particular case could date from before the divergence of the two prokaryotic domains, followed by a later parallel development within their respective prokaryotic groups.
This hypothesis is supported by the fact that both crystallographic structures are essentially TIM barrel structures whose functional diversity has been proposed to be the result of an early diversification of the TIM barrel superfamilies prior to the LCA epoch, although the monophyletic origin of all TIM barrel superfamilies remains to be demonstrated (Goldman et al. 2016).

Isomerization of glycerone phosphate into glyceraldehyde-3-phosphate is an essential step if the route is to reach its optimum energy extraction. Although it seems that this step could have been dispensable in the very early stages of glycolysis evolution, the universal distribution of triosephosphate isomerase (TPI, EC 5.3.1.1), the enzyme responsible for this reaction, suggests that an ancestral form was already present in the LCA. Following the isomerization reaction, two molecules of glyceraldehyde-3-phosphate are further processed, either by triosephosphate dehydrogenase (TPDH, EC 1.2.1.12), or by glyceraldehyde-3-phosphate dehydrogenase (GPDH, EC 1.2.1.59), two universally distributed enzymes that catalyze the reaction where the energy released by the oxidation of the aldehyde group on the glyceraldehyde-3-phosphate is used to synthesize either NADH or NADPH and incorporate a phosphate into the molecule, releasing 1,3-bisphospho-d-glycerate. The common ancestry of TPDH and GPDH could not be established by sequence comparisons, but domain composition revealed a possible evolutionary relationship since both enzymes possess the Gp_dh_C domain (CL0139). Although this shared domain represents less than 50% of the total length of each sequence, their common ancestry could be inferred from the comparisons of the corresponding crystallographic structures (see supplementary material or visit https://goo.gl/fUklBo). This implies that the common ancestor of TPDH and GPDH dates from a pre-LCA epoch.

In a subsequent step, the 1,3-bisphospho-d-glycerate is used as the phosphate donor to synthesize ATP from ADP, releasing 3-phospho-d-glycerate. This reaction is catalyzed by phosphoglycerate kinase (EC 2.7.2.3), which according to our results is universally distributed. The resulting 3-phospho-d-glycerate is then isomerized into 2-phospho-d-glycerate by two non-homologous enzymes, glycerate phosphomutase (GPM, EC 5.4.2.11) and phosphoglycerate mutase (PGM, EC 5.4.2.12). While PGM has a universal distribution, GPM is found mostly among the major groups of Bacteria and Eukarya. Due to its very restricted distribution in Archaea and the close resemblance of the archaeal sequences with those of bacterial organisms, the few archaeal GPM homologues that were detected by our methodology probably represent a case of HGT from Bacteria to Archaea (see Fig. S2). Neither sequence comparisons nor domain composition revealed any possible evolutionary relationship between GPM and PGM, a conclusion supported by comparisons of their crystallographic structures (see supplementary material or visit https://goo.gl/fUklBo). This suggests that GPM evolved independently and after the divergence of the two major prokaryotic domains. The next step in the glycolytic pathway is the dehydration of 2-phospho-d-glycerate into phosphoenolpyruvate by the action of enolase (EC 4.2.1.11), a highly conserved enzyme whose ancestor was probably already present in the LCA, as noted before by Delaye et al. (2005) and supported by our results. Finally, the phosphoenolpyruvate is used as a phosphate donor to synthesize ATP from ADP, a reaction catalyzed by pyruvate kinase (EC 2.7.1.40), another universally distributed enzyme whose ancestor most likely was also present in the LCA, according to our results.

The detailed examination of the phylogenetic distribution of the glycolytic enzymes and their evolutionary relationships discussed here shows how complicated it is to reconstruct the mechanisms that gave rise to the extant pathway. Clearly, the assumption that glycolysis is ubiquitous is no longer valid. As discussed previously by Verhees et al. (2003), several variant enzymes have developed independently within different lineages, like ATP-Glucokinase and ADP-Glucokinase from Bacteria and Archaea, respectively, a hypothesis that appears to be supported by our results. As discussed above, there are indeed non-homologous kinases that catalyze the phosphorylation of glucose, including variants that evolved exclusively among Eukaryotes. Moreover, the phosphorylation of fructose 6-phosphate is a critical glycolytic reaction catalyzed either by PPFK or by APPFK within Bacteria and Archaea, respectively, and the restricted distribution of these enzymes leads us to conclude that the glycolytic pathway, as we know it today, was not operating in the LCA.
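The reasoning used throughout this section, in which an enzyme is considered a candidate for the LCA when it is found in all three domains once probable HGT cases are set aside, can be illustrated with a minimal sketch. The presence calls below are read off the text for a subset of the enzymes; the dictionary layout, the function name `is_lca_candidate`, and the all-three-domains rule are illustrative assumptions and do not reproduce the actual methodology used to obtain the results.

```python
# Presence calls are read off the text; archaeal homologues that the text
# attributes to horizontal gene transfer (HGT) are deliberately left out.
ALL_DOMAINS = frozenset({"Archaea", "Bacteria", "Eukarya"})

# enzyme -> domains in which vertically inherited homologues are reported
vertical_distribution = {
    "glucose 6-phosphate isomerase (EC 5.3.1.9)": {"Archaea", "Bacteria", "Eukarya"},
    "PPFK (EC 2.7.1.11)": {"Bacteria", "Eukarya"},
    "APPFK (EC 2.7.1.146)": {"Archaea"},
    "TPI (EC 5.3.1.1)": {"Archaea", "Bacteria", "Eukarya"},
    "GPM (EC 5.4.2.11)": {"Bacteria", "Eukarya"},
    "PGM (EC 5.4.2.12)": {"Archaea", "Bacteria", "Eukarya"},
    "pyruvate kinase (EC 2.7.1.40)": {"Archaea", "Bacteria", "Eukarya"},
}


def is_lca_candidate(domains):
    """Hypothetical rule: universal distribution after discounting HGT."""
    return ALL_DOMAINS <= set(domains)


lca_candidates = sorted(
    name for name, domains in vertical_distribution.items()
    if is_lca_candidate(domains)
)

for name in lca_candidates:
    print(name)
```

Under this simplified rule neither phosphofructokinase qualifies, which mirrors the conclusion that the first part of the pathway, as known today, was probably not operating in the LCA.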

The availability of completely sequenced genomes and the systematic accumulation of biochemical data in public databases allow the recognition of four kinases that catalyze the starting reaction of the glycolytic pathway. The list includes hexokinase (HEX, EC 2.7.1.1), glucokinase (GLUK, EC 2.7.1.2), polyphosphate glucokinase (PGK, EC 2.7.1.63), and ADP-specific glucokinase (AGLU, EC 2.7.1.147). Phylogenetic distribution reveals that, contrary to the canonical version of the pathway, GLUK and not HEX has a universal distribution, and it is quite likely that its ancestor was present in the LCA. Analysis of the domain architecture and crystallographic comparisons show that HEX, GLUK, and PGK most likely share a common ancestor, and that kinases like AGLU and APPFK descend from a different ancestral kinase whose evolutionary history appears to be different from that of the ancestor of HEX, GLUK, and PGK. Phosphoglycerate kinase (EC 2.7.2.3) and pyruvate kinase (PK, EC 2.7.1.40) represent examples of two other glycolytic kinases that, as shown by sequence comparisons, domain architecture, and crystallographic comparisons, do not exhibit any evolutionary relationship, either with the previous kinases of the glycolytic pathway or with each other. These results support the notion of a recruitment of different kinases with separate evolutionary origins into the glycolytic pathway, and the available data suggest that this took place on at least five different occasions. This conclusion also supports the hypothesis that the pathway was assembled in a patchwork style within phylogenetic groups, contrary to previous assumptions of a single common origin for the enzymes that catalyze equivalent reactions within a pathway.

Fothergill-Gilmore (1986) suggested that the βα-barrels of triosephosphate isomerase (TPI, EC 5.3.1.1) and pyruvate kinase (PK, EC 2.7.1.40) are in fact the consequence of physicochemical constraints that favor the chemistry of the reactions, and not the direct outcome of evolutionary heritage. The nature of the reaction mechanisms, sequence comparisons, and domain architecture analyses of these enzymes seem to support this view, but comparisons between the crystallographic structures strongly suggest a common origin between these proteins. The overall connectivity of the βα-elements between the TPI and the TIM-fold of the PK remains very much alike and the TIM-fold, on its own, is known to be associated with additional domains that precede, interrupt, or follow the βα-barrel, generating additional variety (Nagano et al. 2002). This seems to be the case of the PK, where the βα-barrel remains as the catalytic part while the following fold functions as the regulatory part of the enzyme (Fig. 3). Even if the reaction mechanisms of the TPI and the PK are different (Fig. 3), the diversity of functions associated with the TIM-fold includes five of the six primary classes of enzymes (oxidoreductases, transferases, hydrolases, lyases, and isomerases) (Goldman et al. 2016), which is an example of the great adaptability of the TIM-fold and its functional diversity. Sequences of the proteins that contain the TIM-fold are so diverse that even powerful primary sequence techniques, like the domain composition analyses, have overlooked the evolutionary relationships between TPI and PK. The TIM domain in the TPI and the PK domain of the PK are classified under separate Pfam clans, although from the crystallographic analysis it seems evident that they are homologues. The issue can be explained by the presence of a 100-residue insertion that interrupts the continuity of the βα-barrel sequence and that constitutes a separate fold within the PK structure (Fig. 3).

Fig. 3

Crystallographic structures of triosephosphate isomerase (TPI, EC 5.3.1.1, PDB ID 4YMZ) and pyruvate kinase (PK, EC 2.7.1.40, PDB ID 4HYV). a Crystallographic structure of TPI (green) associated with substrate 1,3-dihydroxyacetonephosphate (red, ligand ID 13P). Close view of the amino-acid residues Asn10, Lys12, His97, Glu169, Gly175, Ser217, Gly238, and Gly239 that according to the crystallographic structure (PDB ID 4YMZ) establish hydrogen bonds to stabilize the 13P in the catalytic core. Isomerization reaction of the 1,3-dihydroxyacetonephosphate into glyceraldehyde 3-phosphate is an example of concerted acid–base catalysis, which in this case is assisted by His97 acting as the acid and Glu169 acting as the base. b Crystallographic structure of PK associated with substrate phosphoenolpyruvate (red, ligand ID PEP), magnesium ion (cyan, ligand ID MG), and potassium ion (purple, ligand ID K). Two protein folds can be appreciated in association with the βα-barrel (green): the regulatory fold (blue), which in nature interacts with fructose 1,6-bisphosphate, an allosteric positive regulator, but in the crystallographic structure is replaced by fructose 2,6-diphosphate (orange, ligand ID FDP), and a fold (yellow) of unclear function that interrupts the βα-barrel. b1 Close view of the amino-acid residues Arg50, Lys239, Gly264, Asp265, and Thr297 that according to the crystallographic structure (PDB ID 4HYV) establish hydrogen bonds to stabilize the PEP in the catalytic core. b2 Close view of the amino-acid residues Asn52, Ser54, Asp84, Thr85, Glu241, and Asp265 that according to the crystallographic structure (PDB ID 4HYV) establish hydrogen bonds to stabilize MG and K in the catalytic core, respectively. The PK reaction couples the free energy of PEP cleavage to the generation of ATP during the synthesis of pyruvate and requires the presence of two Mg2+ ions and one K+ ion to stabilize the negative charges of PEP and ADP within the catalytic core.
Images presented here were created using The PyMOL Molecular Graphics System, Version 1.8.1.2 Schrödinger, LLC and colors correspond to the PyMOL standard color palette. See supplementary material or visit https://goo.gl/fUklBo for details on the comparisons of the crystallographic structures

Ribose

Ribose enzymatic degradation starts with a phosphorylation reaction catalyzed by ribokinase (RIBK, EC 2.7.1.15), whose universal distribution suggests that it was already present in the LCA. While sequence comparisons between glycolytic and ribose catabolic enzymes do not suggest any evolutionary relationship, domain identification does suggest common ancestry of RIBK with the glycolytic kinases AGLU (EC 2.7.1.147) and APPFK (EC 2.7.1.146), based on their respective functional domains that belong to the same Pfam clan (CL0118). Comparisons of their corresponding crystallographic structures support this hypothesis (see Fig. 2 and supplementary material or visit https://goo.gl/fUklBo). Once phosphorylated, ribose 5-phosphate reacts with a molecule of xylulose 5-phosphate, releasing glyceraldehyde 3-phosphate and sedoheptulose 7-phosphate, a reaction that is catalyzed by transketolase (TRK, EC 2.2.1.1), whose universal distribution suggests that it was already present in the LCA. This reaction requires a molecule of xylulose 5-phosphate, which can also be obtained from ribose 5-phosphate: the latter is first transformed into ribulose 5-phosphate by phosphoriboisomerase (PRI, EC 5.3.1.6), and ribulose 5-phosphate is then converted into xylulose 5-phosphate by phosphoribulose epimerase (PRE, EC 5.1.3.1). While PRI is universally distributed, PRE is restricted almost exclusively to the Bacteria and Eukarya domains. Although some methanogenic Archaea do have a PRE homologue, its restricted distribution and close resemblance to the corresponding bacterial enzymes suggest that it was acquired by HGT (see Fig. S3). The requirement of xylulose 5-phosphate makes it difficult to think of this enzymatic ribose degradation as an ancient catabolic pathway, especially since the distribution of PRE suggests that it evolved in the Bacteria lineage after the divergence of the two major prokaryotic domains.
This poses a problem, since there appears to be no other enzyme capable of synthesizing such an essential product for glyceraldehyde 3-phosphate formation. However, ribulose 5-phosphate and xylulose 5-phosphate are stereoisomers (epimers) that differ only in the relative position of a hydroxyl group on C3 (see Fig. S4), so it is possible that in earlier times TRK had a lower specificity and also used ribulose 5-phosphate as substrate to produce glyceraldehyde 3-phosphate. This suggestion implies that the catabolic pathway that degrades ribose and produces glyceraldehyde 3-phosphate at first required only three universally distributed enzymes (RIBK, TRK, and PRI), opening up the possibility that this pathway was already present in the LCA. In favor of this hypothesis, it should be mentioned that during the complementary phase of the ribose degradation the glyceraldehyde 3-phosphate and the sedoheptulose 7-phosphate produced before are transformed into erythrose 4-phosphate and fructose 6-phosphate in a reaction catalyzed by transaldolase (TRA, EC 2.2.1.2). Erythrose 4-phosphate then reacts with xylulose 5-phosphate to produce glyceraldehyde 3-phosphate and fructose 6-phosphate in a reaction also catalyzed by TRK (see Fig. S4). This can be seen as an example of the lack of absolute substrate specificity of TRK. During this last stage of ribose degradation, the limiting enzyme is TRA, whose distribution among Bacteria and Eukaryotes indicates that it evolved after the divergence of the two major prokaryotic domains. Although some homologues are present in the methanogenic Archaea, most probably they represent another case of HGT, as suggested by their limited distribution and close resemblance to the bacterial enzymes (see Fig. S5). Accordingly, it can be argued that the second phase of the enzymatic ribose degradation most probably evolved within the bacterial lineage.

Results on the nature of the kinases that take part in glycolysis and the ribose degradation pathway support the idea that these catabolic routes were assembled by a patchwork mechanism, since they belong to five different families: (a) the actin-like ATPase superfamily, which encompasses the HEX, GLUK, and PGK of the first phosphorylation reaction of glycolysis; (b) the PFK-like superfamily, which includes the PPFK, one of the enzymes responsible for the phosphorylation of fructose-6-phosphate in bacterial glycolysis; (c) the ribokinase-like superfamily, which encompasses the AGLU, APPFK, and RIBK that are involved in the first and second phosphorylation reactions of glycolysis and the first phosphorylation of the ribose catabolic pathway, respectively; (d) the phosphoglycerate kinase (EC 2.7.2.3), which defines its own orphan family PGK (PF00162, not grouped within a clan) and catalyzes the first synthesis of ATP out of 1,3-bisphospho-d-glycerate; and (e) the pyruvate kinase (EC 2.7.1.40), whose main domain (PK) is classified under the CL0151 clan and which synthesizes another molecule of ATP from the phosphoenolpyruvate.
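The grouping just described can be written down as a small lookup table, which makes the count of independent kinase recruitments explicit. The sketch below only restates the assignments given in the text; the variable names and the family labels used as dictionary values are illustrative choices, not database identifiers.

```python
from collections import defaultdict

# Kinase -> family, as listed in the text (labels are illustrative strings).
kinase_family = {
    "HEX (EC 2.7.1.1)": "actin-like ATPase superfamily",
    "GLUK (EC 2.7.1.2)": "actin-like ATPase superfamily",
    "polyphosphate glucokinase (EC 2.7.1.63)": "actin-like ATPase superfamily",
    "PPFK (EC 2.7.1.11)": "PFK-like superfamily",
    "AGLU (EC 2.7.1.147)": "ribokinase-like superfamily",
    "APPFK (EC 2.7.1.146)": "ribokinase-like superfamily",
    "RIBK (EC 2.7.1.15)": "ribokinase-like superfamily",
    "phosphoglycerate kinase (EC 2.7.2.3)": "PGK family (PF00162)",
    "pyruvate kinase (EC 2.7.1.40)": "PK domain (CL0151)",
}

# Invert the table: family -> list of member kinases.
families = defaultdict(list)
for enzyme, family in kinase_family.items():
    families[family].append(enzyme)

for family, members in families.items():
    print(f"{family}: {', '.join(members)}")

print("independent kinase families:", len(families))
```

Nine kinases collapse into five families, so at least five separate recruitment events are needed to account for the phosphorylation steps of both routes.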

Moreover, if the ribose catabolic pathway made up of RIBK, TRK, and PRI (proposed here) was present in the LCA, it is plausible to assume that it could have been feeding the second part of glycolysis with glyceraldehyde-3-phosphate, since it is not clear that the first part of glycolysis was already functioning at that time (see Fig. 4). The universality of the enzymes of the second part of the glycolytic pathway and the ribose catabolic pathway could therefore imply that they are vestiges of an older pathway that used ribose to produce NADH, ATP, and pyruvate. The existence of this hypothetical pathway could explain in part the high connectivity of the pentose phosphate pathway with glycolysis not as a later development (Krebs and Kornberg 1957; Krebs 1981), but as the remnant of an ancient pathway. A major drawback of this hypothetical scheme is the overall efficiency of the route, since under this scenario the pyruvate, ATP, and NADH production would be cut in half in comparison to the glycolytic pathway. Even so, it is difficult to extend the antiquity of ribose degradation back in time beyond the LCA, since ribose has to be present in sufficient amounts before it feeds an ancient version of the central metabolism. Therefore, one can assume that either a highly efficient abiotic source of ribose was available or that its anabolic pathway evolved prior to this epoch.
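The halving of the yield follows from simple bookkeeping, sketched below under explicit assumptions: each glyceraldehyde-3-phosphate that enters the second part of glycolysis gives one NADH (or NADPH), two ATP, and one pyruvate; glucose supplies two glyceraldehyde-3-phosphate molecules; and one TRK reaction in the proposed ribose route supplies one. Only gross production is counted, so the ATP spent in the initial phosphorylations is left out. The names `PER_G3P` and `gross_yield` are hypothetical.

```python
# Gross products of the second part of glycolysis per glyceraldehyde-3-phosphate
# (G3P): one NAD(P)H from the dehydrogenase step, one ATP from phosphoglycerate
# kinase, one ATP and one pyruvate from pyruvate kinase.
PER_G3P = {"NADH": 1, "ATP": 2, "pyruvate": 1}


def gross_yield(g3p_molecules):
    """Gross products obtained from a given number of G3P molecules."""
    return {product: n * g3p_molecules for product, n in PER_G3P.items()}


glycolysis = gross_yield(2)    # one glucose -> two G3P
ribose_route = gross_yield(1)  # one TRK reaction -> one G3P (assumption)

ratio = {p: ribose_route[p] / glycolysis[p] for p in PER_G3P}
print("glycolysis:", glycolysis)
print("ribose route:", ribose_route)
print("ribose route / glycolysis:", ratio)
```

Every ratio comes out at one half, which is the drawback noted above; a fuller accounting that includes the ATP invested in phosphorylating the substrates would need additional assumptions about the ancestral route.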

Fig. 4

Connectivity of the glucose and ribose catabolic pathways. Both produce the essential metabolite glyceraldehyde-3-phosphate. Key crossroads metabolites of the routes are written in bold letters and closed boxes, with other metabolites shown in dotted boxes. Enzymes are represented by their respective EC number. A for Archaea, B for Bacteria, and E for Eukaryotes indicate their biological distribution. (*) Actinobacteria, Acidobacteria, Bacteroidetes, Verrucomicrobia, Cyanobacteria, and Deinococcus–Thermus; (**) Nematodes and Vertebrates; (***) γ-proteobacteria, ε-proteobacteria, Actinobacteria, Spirochaetes, Bacteroidetes, Verrucomicrobia, and Planctomycetes. The distribution of some of the enzymes that catalyze the first part of glycolysis suggests that the pathway, as we know it today, was not operating in the LCA. As discussed in the text, we hypothesize that if the ribose catabolic pathway was sustained by ancestral versions of the ribokinase (EC 2.7.1.15), transketolase (EC 2.2.1.1), and phosphoriboisomerase (EC 5.3.1.6), then it could provide the glyceraldehyde-3-phosphate required for the second part of glycolysis in an ancient route that could have been present in the LCA

Nucleobases

To the best of our knowledge, the possibility that the enzymatic degradation of adenine, guanine, cytosine, uracil, or thymine is among the oldest catabolic routes has not been addressed in the literature. Nevertheless, we have analyzed these catabolic pathways since the five nucleobases play a major role in the metabolism of all living creatures as constitutive elements of RNA and DNA and as crucial components of bioenergetic processes. Some of them are easily formed in prebiotically plausible conditions, either as free nucleobases (Oró 1960, 1961; Ferris et al. 1968, 1978) or as activated ribonucleotides (Powner et al. 2009) and, as suggested by their presence in meteorites (Callahan et al. 2011; Saladino et al. 2011; Burton et al. 2012), they could be among the earliest components of the primitive environment. Direct uptake of such bases, together with that of many other prebiotically synthesized organic compounds, may be considered as the oldest form of heterotrophy (Lazcano 2010). It is therefore possible that they were available as metabolites for the first living entities, which does not necessarily imply that extant catabolic pathways are the direct evolutionary outcome of those early forms of heterotrophy. However, some of the first steps in the enzyme-based degradations of nucleobases may have their enzyme-independent counterpart, and therefore could be envisioned as an ancient enzyme-free form of degradation.

Catabolism of Adenine and Guanine

In extant organisms, adenine and guanine catabolic pathways start with two hydrolytic reactions that produce hypoxanthine by adenine deaminase (AD, EC 3.5.4.2) and xanthine by guanine deaminase (GD, EC 3.5.4.3), respectively. As suggested by Becerra and Lazcano (1998) and supported by the results presented here, the universal distribution of these two homologous enzymes suggests that they might share an ancestor that may have been present in the LCA. Domain composition analysis of the AD sequences revealed two distinctive architectures, one with Amidohydro_1 and Adenine_deam_C domains and one with the A_deaminase domain. As suggested by the domain classification of Amidohydro_1 and A_deaminase (CL0034) and confirmed by crystallographic comparisons (see supplementary material or visit https://goo.gl/nSoRlZ), these enzymes are homologues. Distribution patterns of both versions revealed that the AD whose domain architecture possesses Amidohydro_1 and Adenine_deam_C is the oldest version. For its part, GD is an enzyme that is composed solely of the Amidohydro_1 domain and whose common ancestry with AD is well supported, as confirmed by crystallographic structure comparisons (see supplementary material or visit https://goo.gl/nSoRlZ). Moreover, the enzyme-free hydrolysis of adenine and guanine produces hypoxanthine and xanthine, respectively, in significant yields (half-lives of A and G ≈ 1 year) (Miller and Orgel 1974; Levy and Miller 1998), opening the possibility that enzyme-free catabolic steps are more ancient than suggested by the distribution of their respective degradative enzymes.

Extant enzyme-mediated degradation of adenine into hypoxanthine is followed by its hydrolysis to xanthine, which in turn undergoes another hydrolysis releasing urate. Both hydrolytic reactions are catalyzed either by xanthine dehydrogenase (XDH, EC 1.17.1.4) or by xanthine oxidase (XO, EC 1.17.3.2), although the latter is an oxygen-dependent enzyme that most likely evolved after the oxygen enrichment of the atmosphere. Analysis of the XDH sequences reveals that it is composed of two subunits (small and large), and that XO is a monomer over 1000 residues long. Sequence comparison results showed that both subunits of XDH share identity values that range from 24.97 to 45.87% when compared against the XO sequences. Their domain composition reveals that the small subunit of XDH possesses the Fer2, Fer2_2, FAD_binding_5, and Co_deh_flav_C domain architecture, while the large subunit possesses the Ald_Xan_dh_C and Ald_Xan_dh_C2 domain architecture, and XO seems to be the direct outcome of the fusion between the large and small XDH subunits. The eukaryotic XO appears to be the product of a reversible post-translational modification (RPTM) of the XDH. In this RPTM, two disulfide bonds are formed between two pairs of cysteine residues. The first bond opens a solvent gate, while the second bond obstructs the NAD+ binding cavity (Nishino et al. 2005). Analyses of the crystallographic structures of XDH and XO reveal that at least three of the four conserved cysteine residues responsible for the RPTM are the evolutionary outcome of the fusion of the two subunits of XDH, since they are located within the elongated parts of the proteins that arose as a result of the fusion (see Fig. 5 and supplementary material or visit https://goo.gl/nSoRlZ). These data strongly suggest that XDH was the ancestral form that evolved in the bacterial lineage, and that XO is a later adaptation to the oxygenic conditions in the Earth’s atmosphere.
The appearance of the XDH also carried the selective advantage of harvesting the reductive power of hypoxanthine and xanthine in the form of a NADH molecule while releasing either xanthine or urate depending on the catalyzed reaction, whereas the oxidation of hypoxanthine or xanthine by the XO also releases xanthine or urate, but is coupled to the production of H2O2 instead of NADH (see supplementary material Catabolism-Catalogue or go to https://goo.gl/Ts1xpD). As suggested by the crystallographic structure of the XDH, the small subunit synthesizes NADH by gathering the energy that is generated in the active site of the large subunit, which is indeed the catalytic subunit (see Fig. 5). It is thus possible to envision a simpler form of the enzyme that could catalyze the reactions prior to the appearance of the small subunit. This possibility is supported by the observed distribution results, which show that the large subunit is universal, while the small subunit is restricted to the α-proteobacteria, β-proteobacteria, and Deinococcus–Thermus bacterial groups. Thus, one must conclude that the addition of the small subunit is a relatively new event from an evolutionary perspective.

Fig. 5

Details and alignments of the crystallographic structures of xanthine dehydrogenase (XDH) and xanthine oxidase (XO). a Bacterial xanthine dehydrogenase (PDB ID 2W3R, heterodimer), catalytic subunit (yelloworange-colored) with substrate hypoxanthine (cyan sticks), cofactor MTE molecule associated with hydroxy-(dioxo)-molybdenum (blue sticks), and energy harvesting subunit (forest-green) associated with two iron–sulfur centers (red spheres) and prosthetic group FAD (magenta sticks). b Eukaryotic xanthine dehydrogenase (PDB ID 3AMZ, chocolate-colored monomer), associated with substrate uric acid (cyan sticks), cofactor MTE molecule covalently linked to a dioxothiomolybdenum (blue sticks), two iron–sulfur centers (red spheres), and prosthetic group FAD (magenta sticks). c Alignment of the bacterial and the eukaryotic xanthine dehydrogenase. Their respective substrates, cofactors, iron–sulfur centers, and FAD prosthetic groups are also shown. Color codes and structures are the same as in (a) and (b). As can be appreciated, the eukaryotic version is the outcome of the fusion of the two different bacterial subunits. d Crystal structure of the mutant eukaryotic xanthine dehydrogenase (PDB ID 1WYG, tv_orange-colored) in which the conversion mechanism into xanthine oxidase has been described (Nishino et al. 2005). Front position shows a molecule of salicylic acid (cyan sticks) occupying the position of the natural substrates (hypoxanthine or xanthine), a molecule of acetic acid (blue sticks) in the place of the naturally occurring MTE cofactor, two iron–sulfur centers (red spheres), and a FAD molecule (magenta sticks). A 180-degree rotation about the Y axis allows the visualization of the residues involved in the conversion mechanism (light-blue dotted circles). Mutant residues C535A and C992R (green) are involved in the solvent gate formation and mutant residues C1316C and C1324S (yellow) are involved in the obstruction of the NAD binding cavity.
Images presented here were created using The PyMOL Molecular Graphics System, Version 1.8.1.2 Schrödinger, LLC implementing the CE algorithm (Shindyalov and Bourne 1998) and colors correspond to the PyMOL standard color palette. See supplementary material or visit https://goo.gl/nSoRlZ for details on the comparisons of the crystallographic structures

It is noteworthy that in the next step of purine degradation two non-homologous enzymes, urate oxidase (EC 1.7.3.3) and urate hydroxylase (EC 1.14.13.113), catalyze a reaction where urate is transformed into 5-hydroxyisourate in two strictly oxygen-dependent reactions that appear to lack an oxygen-free equivalent counterpart. This suggests that this reaction was the limiting step in the evolution of purine catabolism prior to the exploitation of O2 as a more potent oxidative agent by the bacterial organisms. Furthermore, the enzymes that release allantoin, allantoate, urea, CO2, and NH3 appear to be restricted mostly to aerobic members of the proteobacteria, as well as to several eukaryotic groups. These data correlate well with previous reports that identified these metabolites as excretion products of specific eukaryotic organisms (see Fig. S7). The only exception in the distribution patterns is the case of allantoinase (EC 3.5.2.5), which catalyzes the hydrolysis of allantoin producing allantoate. Sequences of this enzyme exhibit the same Pfam architecture as the GD (EC 3.5.4.3), where the Amidohydro_1 domain constitutes almost the whole length of the protein. Although we created a specific profile search model, its resolution proved to be insufficient to distinguish between these homologous enzymes, and in fact we failed to establish in full the exact distribution of this particular enzyme. An alternative possibility is that the GD is a moonlighting enzyme that catalyzes an equivalent hydrolytic reaction over allantoin, although to the best of our knowledge there is no experimental evidence supporting this possibility.

The above strongly suggests that the enzyme evolution of the extant catabolic pathway of purines took place in three different stages. In the first stage, the ancestor of AD (EC 3.5.4.2) and GD (EC 3.5.4.3) evolved prior to the divergence of Archaea and Bacteria and may have thus been present in the LCA. The same probably happened to the ancestral form of the large subunit of the XDH (EC 1.17.1.4), which could catalyze the formation of xanthine out of hypoxanthine and of urate from xanthine. During the second phase, urate oxidase (EC 1.7.3.3) and urate hydroxylase (EC 1.14.13.113) evolved as a result of the oxygenic conditions of the changing atmosphere, leading to the production of 5-hydroxyisourate. The rest of the enzymes that produce allantoin, allantoate, urea, CO2, and NH3 from the 5-hydroxyisourate evolved during the final phase within particular phylogenetic groups. This creates a more diverse pool of compounds that nowadays are excreted by different organisms (see Figs. 6 and S7).

Fig. 6

The rise in the concentration of atmospheric oxygen led to the evolution of new enzymes that enable more energetically efficient metabolic routes. This is the case of the purine catabolic pathway: after the great oxygenation event (GOE), enzymes including urate oxidase (EC 1.7.3.3) and FAD-dependent urate hydroxylase (EC 1.14.13.113) led to the oxidation of urate into a more diverse set of compounds that represent metabolic waste products. Closed boxes represent the metabolites along the purine catabolic pathway and waste products appear in bold letters. Dotted ellipses represent enzymes that catalyze reactions in the pathway. The oxygen dependence and the phylogenetic distribution of the purine degradative enzymes suggest that the evolution of the extant catabolic pathway took place in three different phases. In the first phase (I), the ancestor of AD (EC 3.5.4.2) and GD (EC 3.5.4.3) evolved prior to the divergence of Archaea and Bacteria. The same probably happened to the ancestral form of the large subunit of the XDH (EC 1.17.1.4), which could catalyze the formation of xanthine out of hypoxanthine and of urate from xanthine. During the second phase (II), xanthine oxidase (EC 1.17.3.2), urate oxidase (EC 1.7.3.3), and urate hydroxylase (EC 1.14.13.113) evolved as a result of the oxygenic conditions of the changing atmosphere, leading to the production of 5-hydroxyisourate. The rest of the enzymes that produce allantoin, allantoate, urea, CO2, and NH3 out of 5-hydroxyisourate evolved during the final phase (III) within particular phylogenetic groups, as described in the text and in Fig. S7. (*) The small subunit of the XDH (EC 1.17.1.4) is well distributed among Deinococcus–Thermus, α-proteobacteria, β-proteobacteria, and most eukaryotic groups.
(**) Urate oxidase (EC 1.7.3.3) is well distributed among α-proteobacteria, β-proteobacteria, Actinobacteria, Acidobacteria, Deinococcus–Thermus, and most eukaryotic groups, while urate hydroxylase (EC 1.14.13.113) is well distributed among α-proteobacteria, β-proteobacteria, γ-proteobacteria, Actinobacteria, and Cyanobacteria, as well as in Diatoms, Algae, Eudicots, Monocots, Cnidarians, and Vertebrates. (***) The systematic name of imidazole 5-carboxylate is 5-hydroxy-2-oxo-4-ureido-2,5-dihydro-1H-imidazole-5-carboxylate.

The possibility that anaerobic bacteria (mostly Clostridium) could ferment xanthine, hypoxanthine, and guanine was discussed long ago (Barker and Beck 1941). Later on, Rabinowitz et al. extended the study of purine fermentations so successfully (Rabinowitz 1956; Rabinowitz and Barker 1956a, b; Rabinowitz and Pricer 1956a, b) that their work remains a canonical reference and has served as a guide to establish the possible fermentative pathways in some metabolic databases like KEGG (Kanehisa and Goto 2000). Unfortunately, those pioneering efforts were not continued, resulting in a poorly characterized biochemistry and a lack of knowledge about the enzymes that may catalyze those fermentative reactions. As of today, there are no sequences available in public databases that would allow a phylogenomic analysis of those enzymes.

Catabolism of Cytosine, Uracil, and Thymine

In extant organisms, the cytosine catabolic pathway starts with the deamination reaction catalyzed by cytosine deaminase (EC 3.5.4.1) that produces uracil and ammonia. Spontaneous deamination of cytosine is an efficient reaction that clearly preceded the evolution of cytosine deaminase and could have been the initial step in cytosine degradation. The half-life for cytosine deamination to uracil is ≈ 340 years at 25 °C (Shapiro 1999) or 19 days at 100 °C (Levy and Miller 1998), but these rates do not compare with those achieved by cytosine deaminase (CYDA), whose activity varies from 0.1 to 94 µm/min/mg (Sakai et al. 1975; Katsuragi et al. 1986; Porter and Austin 1993). Analysis of domain composition revealed that CYDA is a single-domain protein, but its sequences are divided into those that possess the Amidohydro_1 domain and those that exhibit the dCMP_cyt_deam_1 domain. Sequence comparisons and domain classification (CL0034 and CL0109, respectively) suggest that both versions of the enzyme evolved independently. Results on their phylogenetic distribution suggest a universal distribution, and since neither of them depends on molecular oxygen, it is quite possible that both versions were already present in the LCA.
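To make the temperature dependence of the spontaneous reaction easier to compare, the two half-lives quoted above can be converted into rate constants. The following is only a minimal sketch: it assumes that spontaneous deamination follows first-order kinetics (so that k = ln 2 / t½) and uses a 365.25-day year, neither of which is stated in the text; the function and variable names are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25  # assumption: Julian year, used only for unit conversion


def rate_constant_per_second(half_life_days):
    """First-order rate constant k = ln(2) / t_half, in s^-1 (assumed kinetics)."""
    return math.log(2) / (half_life_days * SECONDS_PER_DAY)


# Half-lives quoted in the text for cytosine -> uracil deamination
half_life_25c_days = 340 * DAYS_PER_YEAR   # ~340 years at 25 degrees C
half_life_100c_days = 19                   # 19 days at 100 degrees C

k_25c = rate_constant_per_second(half_life_25c_days)
k_100c = rate_constant_per_second(half_life_100c_days)

# Ratio of the two rate constants: how much faster the reaction is at 100 degrees C
speedup = k_100c / k_25c

print(f"k(25 C)  = {k_25c:.2e} s^-1")
print(f"k(100 C) = {k_100c:.2e} s^-1")
print(f"speed-up = {speedup:.0f}-fold")
```

Under these assumptions the uncatalyzed reaction is roughly 6,500 times faster at 100 °C than at 25 °C, which illustrates why the spontaneous route is more plausible in hot environments. The enzymatic activities cited above are specific activities per milligram of protein and cannot be compared directly with these rate constants without further assumptions.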

The final product of the uracil catabolic pathway is β-alanine, which is a well-known crossroads metabolite that is connected to other metabolic pathways such as the pantothenate-, propanoate-, Coenzyme A-, acetyl-CoA-, and fatty acid biosynthetic pathways. The production of β-alanine from uracil is a three-step process catalyzed by pyrimidine reductase (PYR, EC 1.3.1.1) [or by dihydropyrimidine dehydrogenase (DPYD, EC 1.3.1.2)], dihydropyrimidinase (DHPY, EC 3.5.2.2), and β-ureidopropionase (UP, EC 3.5.1.6), respectively. While PYR (EC 1.3.1.1) requires the reducing power of NADH, DPYD (EC 1.3.1.2) requires NADPH to transform uracil into 5,6-dihydrouracil. Domain architecture analysis on the nature of the DPYD sequences reveals that these represent two different types of sequences annotated as EC 1.3.1.2. The first is a protein of ≈ 681 residues in which the Fer4_20 and Pyr_redox_2 domains have been identified. The second is a longer sequence of ≈ 1059 residues with four well-identified domains, Fer4_20, Pyr_redox_2, DHO_dh, and Fer4_21. The last two domains in this long protein have the same domain architecture as PYR sequences (DHO_dh and Fer4_21), but no significant result is obtained when the two smaller versions of the proteins PYR (≈ 500 residues) and DPYD (≈ 681 residues) are aligned. This is consistent with the Pyr_redox_2 and DHO_dh domains classification in separate clans (CL0063 and CL0036, respectively). Unfortunately, there is no crystallographic structure for PYR (EC 1.3.1.1) available in the public databases that could allow crystallographic comparisons between PYR and DPYD. Based on the sequence comparisons and domain architecture analysis, it can be concluded that PYR (≈ 500 residues) and DPYD (≈ 681 residues) are not homologous, and that the larger version of the DPYD is the evolutionary outcome of the fusion of the shorter versions of the PYR and DPYD. 
The phylogenetic distribution of the PYR suggests a recent development within the bacterial lineage. It is present mostly in organisms that belong to the α-proteobacteria, β-proteobacteria, γ-proteobacteria, and Firmicutes clades, together with a few members of the Acidobacteria, Verrucomicrobia, and Deinococcus–Thermus, and it is well represented among the major eukaryotic groups. The short version of the DPYD has been found among organisms that belong to the α-proteobacteria, β-proteobacteria, γ-proteobacteria, δ-proteobacteria, Actinobacteria, Acidobacteria, and Deinococcus–Thermus bacterial groups, as well as the major eukaryotic groups, suggesting a bacterial origin. The long version of the DPYD is restricted to a few organisms that belong to the γ-proteobacteria group and is very well represented within eukaryotic groups such as Vertebrates, Arthropods, Nematodes, Cnidarians, and Amoebozoa, suggesting that it evolved recently in evolutionary terms.
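The domain-fusion argument made above for the long DPYD can be restated as a simple comparison of domain lists. The sketch below is illustrative only: the Pfam domain names are those given in the text, but the N- to C-terminal ordering of PYR and the helper functions are assumptions introduced here, and the check is not the profile-based analysis used in this work.

```python
# Pfam domain architectures as described in the text.
# Assumption: domains are listed in N- to C-terminal order.
PYR = ["DHO_dh", "Fer4_21"]                                    # ~500 residues
DPYD_SHORT = ["Fer4_20", "Pyr_redox_2"]                        # ~681 residues
DPYD_LONG = ["Fer4_20", "Pyr_redox_2", "DHO_dh", "Fer4_21"]    # ~1059 residues


def shared_domains(arch_a, arch_b):
    """Return the domains that two architectures have in common."""
    return sorted(set(arch_a) & set(arch_b))


def is_fusion(long_arch, n_terminal_part, c_terminal_part):
    """True if long_arch is exactly the two parts joined end to end."""
    return long_arch == n_terminal_part + c_terminal_part


# The short proteins share no domain, consistent with non-homology.
print(shared_domains(PYR, DPYD_SHORT))

# The long DPYD equals short DPYD followed by PYR, consistent with a fusion.
print(is_fusion(DPYD_LONG, DPYD_SHORT, PYR))
```

A match at the level of domain names is only suggestive. The conclusion in the text also rests on the failed alignment of the two short proteins and on the assignment of Pyr_redox_2 and DHO_dh to separate clans.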

In the subsequent reaction, dihydropyrimidinase (DHPY, EC 3.5.2.2) hydrolyzes 5,6-dihydrouracil and releases 3-ureidopropionate. Domain architecture analysis revealed that DHPY exhibits the Amidohydro_1 domain, which comprises over half of the sequence length. Alignment of the crystallographic structures of the GD and DHPY supports the common ancestry of these proteins (see supplementary material or visit https://goo.gl/nSoRlZ). Phylogenetic distribution results reveal a universal distribution. However, as in the case of allantoinase (EC 3.5.2.5), whose sequences exhibit exactly the same Pfam architecture as the GD (EC 3.5.4.3), with the Amidohydro_1 domain constituting almost the whole length of the protein, the profile search model created for DHPY proved insufficient to distinguish between DHPY and GD, and we therefore could not fully establish the exact distribution of DHPY. Finally, β-ureidopropionase (UP, EC 3.5.1.6) catalyzes the hydrolysis of 3-ureidopropionate, releasing β-alanine, CO2, and NH3. UP sequences were identified as possessing two different domain architectures: one that includes Peptidase_M20 as the only domain along the sequence, and one with the CN_hydrolase domain, which comprises over half of the sequence length. Phylogenetic distribution results suggest that the version of the UP that possesses the Peptidase_M20 domain is restricted to a limited number of groups within Bacteria (mostly Proteobacteria), while the version that possesses the CN_hydrolase domain has a distribution that suggests it could have been present in the LCA.

Biochemical data indicate that the enzymes that degrade uracil are quite unspecific and, not surprisingly, can also produce 3-aminoisobutyrate as the final product of the thymine catabolic pathway. The 3-aminoisobutyrate can be incorporated into the last part of the valine catabolic pathway, which in turn produces succinyl-CoA. Although both PYR (EC 1.3.1.1) and DPYD (EC 1.3.1.2) catalyze oxygen-free reactions, they have a very restricted distribution, mostly among proteobacterial and eukaryotic groups. Such a distribution suggests a recent evolution of the extant cytosine, uracil, and thymine catabolic pathways. Although two members of the archaeal domain, Sulfolobus tokodaii and Bathyarchaeota archaeon, have the enzymes necessary to complete the pyrimidine degradation process, our phylogenetic analyses indicate that these enzymes, like pyrimidine reductase (EC 1.3.1.1), were most probably acquired by horizontal gene transfer from Bacteria (see Fig. S6).

Uracil and thymine degradation can also be achieved through other routes, such as the oxidative uracil/thymine degradation pathway (Hayaishi and Kornberg 1952; Lara 1952), which involves the action of uracil/thymine oxidase (EC 1.17.99.4), barbiturase (EC 3.5.2.1), and ureidomalonase (EC 3.5.1.95). In this oxidative pathway, uracil is degraded to malonate and urea via barbiturate. Unfortunately, this route is poorly characterized, and there are no sequences available in the public databases, associated with either uracil/thymine oxidase (EC 1.17.99.4) or ureidomalonase (EC 3.5.1.95), that could be used to study the phylogenetic distribution of the pathway. A third pathway, known as the Rut pathway, was discovered much later in Escherichia coli (Loh et al. 2006). Although the enzymes of this pathway need further characterization before being subjected to phylogenetic analyses, the fact that the pathway depends on the enzyme pyrimidine oxygenase (EC 1.14.99.46), which is a strictly oxygen-dependent catalyst, suggests that it must have evolved into its present form only after the oxygen enrichment of the Earth’s atmosphere.

Conclusions

The glucose catabolic pathway used to be considered ubiquitous but, as noted before, not all the essential enzymes have a universal distribution, suggesting an independent evolutionary origin in the archaeal and bacterial lineages (Verhees et al. 2003; Schönheit et al. 2016). However, new insights have been obtained from our analysis of recent genomic data. As discussed here, at first glance the distribution of the archaeal and bacterial versions of the fructose phosphate aldolase (class I and class II, respectively, EC 4.1.2.13) could lead to the hasty conclusion that they have independent evolutionary origins. However, their respective protein domain architectures and the comparison of their crystallographic structures strongly suggest that they may share a common ancestor that predates the divergence of the two major prokaryotic domains.

The analyses presented here suggest that the glycolytic pathway was still evolving at the time of the LCA, and that it underwent further adaptations within specific phylogenetic groups. These adaptations complicate our attempts to establish the relative antiquity of the pathway. For instance, the distribution of the non-homologous kinases PPFK and APPFK (EC 2.7.1.11 and EC 2.7.1.146) suggests an independent evolution within the archaeal and the bacterial lineages. Therefore, the enzymatic phosphorylation of fructose-6-phosphate still limits our understanding of the actual antiquity of the pathway. In addition, as suggested by its phylogenetic distribution, the recruitment of the ADP-dependent glucokinase (EC 2.7.1.147), a member of the ribokinase-like superfamily of kinases, into the glycolytic pathway took place recently in the Euryarchaeota group. The presence of kinases from the ribokinase-like superfamily in archaeal metabolisms is sometimes misinterpreted as a signal of a separate evolution within different lineages and as a signal of a late development of the carbohydrate metabolism (Schönheit et al. 2016). While the distribution of the APPFK (EC 2.7.1.146), another member of the ribokinase-like superfamily of kinases, supports the latter possibility, our results show that the universality of GLUK (EC 2.7.1.2) makes the presence or absence of the ADP-dependent glucokinase (EC 2.7.1.147) in archaeal organisms irrelevant for the discussion on the antiquity of the glycolytic pathway.

It has been assumed that Archaea lack the enzymes involved in the direct phosphorylation of free pentoses, which has been interpreted as evidence that free ribose degradation is a recent evolutionary invention (Schönheit et al. 2016). Contrary to this assumption, as shown here, core enzymes of the ribose catabolic pathway do exhibit a universal distribution, which suggests that they were already present in the LCA. The principal selective advantage of the pentose phosphate pathway appears to be NADH synthesis, while ribose enzymatic degradation represents no real selective advantage on its own other than feeding the second part of glycolysis with glyceraldehyde-3-phosphate. It is therefore possible that its ancestral function during the LCA epoch was to feed the second part of glycolysis, leading to the generation of ATP, NADH, and pyruvate through a hybrid ancestral route (see Fig. 4). Both the first part of glycolysis and the complementary enzymes of the ribose catabolic pathway appear to have evolved after the division of the two major phylogenetic groups, not necessarily at the same time. For instance, the distribution of transaldolase (EC 2.2.1.2) among the Bacteria and Eukaryotes suggests that it was incorporated into the ribose catabolic pathway after the divergence of the two major prokaryotic domains. The selective advantage of a complementary phase of the ribose catabolic pathway may have been the condensation of the first molecule of glyceraldehyde-3-phosphate and the secondary product sedoheptulose-7-phosphate into a second molecule of glyceraldehyde-3-phosphate and fructose-6-phosphate, another product that can be incorporated into the first part of the glycolytic pathway.

On the other hand, the evidence discussed here suggests that the extant adenine and guanine catabolic pathways are the outcome of a patchwork addition of enzymes at different epochs. As exemplified by the oxygen-independent, highly conserved enzymes adenine deaminase (EC 3.5.4.2), guanine deaminase (EC 3.5.4.3), and the large subunit of the xanthine dehydrogenase (EC 1.17.1.4), there are purine catabolic enzymes whose origin seems to date at least from the LCA epoch. During a second epoch, the accumulation of oxygen in the atmosphere allowed the evolution of enzymes like uric acid oxidase (EC 1.7.3.3) or urate oxidase (EC 1.14.13.113), which further degrade the urate released by xanthine dehydrogenase. Finally, several enzymes like hydroxyisourate hydrolase (EC 3.5.2.17), allantoinase (EC 3.5.2.5), or urease (EC 3.5.1.5) appear to have evolved recently within some classes of proteobacteria or even in some eukaryotic groups (see Figs. 6 and S7).

Similarly, not all enzymes of the cytosine, uracil, and thymine catabolic pathways are universally distributed. One exception is cytosine deaminase (EC 3.5.4.1), whose biological distribution suggests that its ancestor was already present in the LCA. Deamination of cytosine can take place in the absence of any enzymatic catalyst (Levy and Miller 1998; Shapiro 1999), suggesting that this could have been the very first way to convert cytosine to uracil. Given their limited distribution, in either the proteobacteria or the acidobacteria phyla, enzymes like pyrimidine reductase (EC 1.3.1.1) and dihydropyrimidine dehydrogenase (EC 1.3.1.2) appear to be recent developments. Therefore, contrary to recent proposals (Schönheit et al. 2016), the conversion of uracil into β-alanine does not represent a widespread catabolic pathway, and it is very unlikely that this pathway was in operation during the early stages of the evolution of heterotrophy.

Detailed examination of the enzyme distribution and evolution, together with the biochemical data available, has led us to estimate the relative antiquity of the glycolytic and the ribose catabolic routes, as well as the guanine, adenine, cytosine, uracil, and thymine catabolic pathways. The data presented here strongly suggest that enzymes that are currently involved in these routes come from different epochs, and that the extant pathways are the result of patchwork recruitment throughout their evolutionary history. Accordingly, the metabolic abilities of the LCA that can be derived from our data include (a) ribose degradation coupled with the second part of the glycolytic pathway that together served as a source of ATP, NADH, and pyruvate (see Fig. 4); (b) the production of hypoxanthine, xanthine, and urate from guanine and adenine (see Fig. 6) that could serve as substrates in the synthesis of free mononucleotides, as well as parts of an ancient excretion mechanism that releases the excess of nitrogen; and (c) the production of uracil from cytosine as an efficient source of uracil to be incorporated into RNA molecules and ribonucleotide derivatives.