Introduction

The complexity that we see today in biochemistry and life is believed to have originated gradually, starting with small polypeptide molecules (Dyson 1982) that were perhaps synthesized pre-biotically in hydrothermal vents (Huber and Wächtershäuser 1998; Martin and Russell 2007) or other environments (Rode 1999, 2007) and were recurrently used to build larger and more complex structures (e.g., Remmert et al. 2010). The challenge of finding an origin for this possible evolutionary progression is however considerable since many primordial polypeptide constituents are possible and there is still no metric of comparison or homology criterion in structure that can be used globally to dissect the origin and evolution of proteins (Taylor 2007). To overcome this limitation, we have shifted the focus from molecules to proteomes and in a number of studies explored the evolution of all protein structures that are known (reviewed in Caetano-Anollés et al. 2009a).

Structural genomics has produced a wealth of validated models of atomic structure that describe the packing and 3-dimensional (3D) arrangement of folded polypeptide chains (Levitt 2009). These models portray the molecular complexity of proteins and protein complexes and have been used effectively to assign folds to protein sequences in proteomes using sequence-structure comparisons and knowledge from protein domain classification (Chothia and Gough 2009). The initial genomic survey of the folds of protein domains (Gerstein and Levitt 1997) has been extended from a few microbial proteomes (Gerstein 1998) to over a thousand proteomes encompassing all three superkingdoms of cellular life (reviewed in Caetano-Anollés et al. 2009a).

Since structures endow proteins with a remarkable diversity of molecular functions and are consequently highly conserved (Chothia and Lesk 1986; Illergård et al. 2009), the structural census carries deep phylogenetic signal and is therefore useful for the construction of trees of life (e.g., Gerstein 1998; Lin and Gerstein 2000; Caetano-Anollés and Caetano-Anollés 2003; Yang et al. 2005; Wang et al. 2007). The structural census can also be used to construct phylogenies of folds and study the evolutionary history of proteins at a global level (Caetano-Anollés and Caetano-Anollés 2003; Forslund et al. 2007; Caetano-Anollés et al. 2009a). These trees uncovered reductive evolutionary tendencies in proteomes and a cellular origin for the tripartite world (Wang et al. 2007), the origin and evolution of metabolic networks (Caetano-Anollés et al. 2007, 2009b), the history of metallomes and biological metal utilization in ancient seas (Dupont et al. 2010), the metabolic origins of translation (Caetano-Anollés et al. 2011), and the emergence of aerobic metabolism (Wang et al. 2011). Trees also behave as molecular clocks, linking evolutionary patterns in structure to the geological record (Wang et al. 2011). In these studies, domains were defined at increasing levels of structural complexity and conservation. For example, trees of domain structure were generated (reviewed in Caetano-Anollés et al. 2009a) at fold family (FF), fold superfamily (FSF), and fold (F) levels of the structural classification of proteins (SCOP; Murzin et al. 1995). FFs group protein structures that are homologous at sequence level and are unambiguously linked to specific molecular functions. FSFs group FFs with common structures and functions and offer high levels of certainty that proteins belonging to this hierarchical level share a common evolutionary origin (Yang et al. 2005).
Fs group FSFs that share similarly arranged and topologically connected secondary structures, but that may not be necessarily related at the evolutionary level. FF and FSF levels are the most useful. Although proteins in FFs often diverge and obscure sequence similarities, the close packing of amino acid side chains in the buried core of the protein retains the same FSF folded structure.

The age of protein domains at all of these levels of structural complexity can be derived directly from the trees. We have previously shown that domain age has considerable predictive power in terms of patterns of structural changes that are known (Caetano-Anollés and Caetano-Anollés 2003), patterns of organismal diversification (Wang and Caetano-Anollés 2006; Wang et al. 2007), patterns in the early evolution of molecular functions (Kim and Caetano-Anollés 2010), and patterns of domain accretion in proteins (Caetano-Anollés et al. 2011). Tracing the age of domains in metabolic networks (Kim et al. 2006; Caetano-Anollés et al. 2007) confirmed the fundamental role of recruitment in enzyme evolution (Ycas 1974; Jensen 1976; Teichmann et al. 2001). A historical account of domain appearance (Caetano-Anollés et al. 2011) matched rings of gene neighbors derived from an analysis of physical clustering of conserved genes in bacterial genomes (Danchin et al. 2007). Furthermore, we have shown that the age of domains in ribosomal proteins coevolves tightly with the age of rRNA substructures, uncovering recruitment and accretion patterns, and revealing the relatively late molecular origins of the ribosome (Sun and Caetano-Anollés 2009; Harish and Caetano-Anollés 2011). Here, we focus on a historical account of domains defined at FF level (Caetano-Anollés et al. 2011) and study the emergence of the most ancient molecular functions. We show how these primordial functions are linked to the structural design of the folded protein structure. We also reveal membranes played a crucial role in early protein evolution and show translation started with the primordial discovery of RNA and thioester cofactor-mediated aminoacylation. Our findings allow elaboration of a detailed model of evolution of modern biochemistry that is firmly grounded in phylogenomic information.

Materials and Methods

Phylogenomic Constructs

We studied the functions, ligands, and fold features linked to the structure of the 54 most ancient FF domains. The age of these domains was derived from a rooted phylogenomic tree of FFs that we described previously (Caetano-Anollés et al. 2011). FF domain age was calculated directly from the tree as a node distance (nd) using Perl and Python scripts that count the number of nodes from the ancestral architecture at the base of the tree to each leaf and express it on a relative 0–1 scale. The tree was reconstructed from a structural census in the genomes of 420 free-living organisms that included 48 Archaea, 239 Bacteria, and 133 Eukarya (dataset FL420; Caetano-Anollés et al. 2011). SCOP, the gold standard used to describe the complexity of proteins and to benchmark structural prediction methods, was used to define domain structure (Andreeva et al. 2008). SCOP was selected because it partitions proteins into fewer and larger components than other structural classifications and takes into account both functional and evolutionary considerations (Holland et al. 2006). The structures of 2,397 FFs (out of 3,464 defined by SCOP 1.73) were assigned to genomic sequences using linear HMMs of structural recognition in SUPERFAMILY (Gough et al. 2001; Wilson et al. 2009) with a probability cutoff of 10^−4. The numbers of these genomic assignments were transformed, treated as multistate linearly ordered phylogenetic characters, encoded using an alphanumeric format, and used to construct a FL420 data matrix for phylogenetic analysis. A full account of phylogenetic tree reconstruction using maximum parsimony and character argumentation was given previously (Caetano-Anollés et al. 2011). The tree of FFs was built from the transformed FL420 matrix using a combined parsimony ratchet and iterative search approach with 300 ratchet iterations (10 × 30 chains) in PAUP* (Swofford 2002).
Multiple chains and iterations reduce the risk of the search for optimal trees becoming trapped in sub-optimal regions of tree space (Nixon 1999). For simplicity, domains were identified with concise classification strings (ccs). For example, in c.26.1.3 of phosphopantetheine adenylyltransferase (EC 2.7.7.3), c represents the protein class (α/β proteins), 26 the F (adenine nucleotide alpha hydrolase-like fold), 1 the FSF (nucleotidylyl transferase superfamily), and 3 the FF (adenylyltransferase fold family).
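The node-distance calculation described above can be illustrated with a minimal Python sketch. It is not the original Perl and Python scripts, which are not reproduced here. The tree representation (a dictionary of child lists), the function name `node_distances`, and the exact counting convention (internal nodes passed between the root and each leaf, divided by the maximum count) are assumptions made for illustration only.

```python
from collections import deque


def node_distances(children, root):
    """Return a relative node distance (0-1) for every leaf of a rooted tree.

    children: dict mapping each internal node to a list of its child nodes
              (hypothetical representation; leaves are simply absent as keys).
    Assumed convention: nd counts the internal nodes crossed between the
    root and a leaf, divided by the largest such count in the tree.
    """
    depth = {root: 0}
    leaf_counts = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        if not kids:
            # Internal nodes strictly between the root and this leaf.
            leaf_counts[node] = max(depth[node] - 1, 0)
        for kid in kids:
            depth[kid] = depth[node] + 1
            queue.append(kid)
    largest = max(leaf_counts.values())
    if largest == 0:
        return {leaf: 0.0 for leaf in leaf_counts}
    return {leaf: count / largest for leaf, count in leaf_counts.items()}


# Toy comb-shaped tree with made-up leaf labels: A branches off first.
toy_tree = {"root": ["A", "n1"], "n1": ["B", "n2"], "n2": ["C", "D"]}
nd = node_distances(toy_tree, "root")
```

In this toy tree the leaf that branches off at the base receives nd = 0 and the most recently derived leaves receive nd = 1, mirroring the direction of the timeline used in the text.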

Molecular Functions

To explore the molecular functions of primordial FF domains we first obtained Gene Ontology (GO) annotations in SUPERFAMILY (Wilson et al. 2009) and MANET (Kim et al. 2006) assignments for domains. Each FF is associated with a number of enzymatic activities and molecular functions, many of which are derived. Consequently, we used published phylogenies of metabolic subnetworks (Caetano-Anollés et al. 2007) and ancient GO terms (Kim and Caetano-Anollés 2010) to identify those functions that were ancestral. Manual annotations also involved queries in the UniProtKB (protein knowledgebase) database (http://www.uniprot.org/) and HMM-based structure assignments. Annotations were mapped onto the architectural chronology, generating a timeline that describes the evolution of biological functions.

Structural Analysis

Idealized structures were defined according to Taylor (2002) by expert annotation. α-helices and β-strands were considered rigid structures, and the 3D arrangement of these structures relative to others was used to define idealized layered topologies. The twist of the first β-strand relative to the last β-strand in sheets was numerically recorded on a scale of 0–3: 0, absence of twist; 1, twist of up to 45°; 2, twist of 45–90°; and 3, twist of more than 90°. The curl of sheets was numerically recorded on a scale of 0–4: 0, flat structure; 1, slightly curled; 2, markedly curled; 3, curled and forming a half barrel structure; and 4, curled and forming a full barrel structure. The number of β-strands and parallel/anti-parallel arrangements of strands in sheets was also noted. Other features were explored in PDBsum (Laskowski 2009; http://www.ebi.ac.uk/pdbsum/). The 3D structures of proteins were aligned using GANGSTA+ (Guerler and Knapp 2008) against version 1.75 of the ASTRAL40 compendium. The software uses a nonsequential structural alignment method with proper assignment of helices and strands in the structure. To examine whether or not β-sheets of FFs have become more hydrophilic along the evolutionary timeline, we calculated the percentage of hydrophobic residues that are positioned in β-sheets of a FF. From SUPERFAMILY ver. 1.75, we extracted the regions in PDB chains that are actually assigned to the FF. In protein sequences of the PDB chains retrieved from the PDB (http://www.rcsb.org), we identified the parts of the sequences corresponding to the regions. Using PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/), we identified which residues in the regions form β-sheets. Referring to a hydrophobicity table of amino acids, we also identified which residues in the β-sheets of the regions are hydrophobic. For each of the PDB chain regions assigned to a FF, we then obtained the numbers of hydrophobic residues in β-sheets and residues in β-sheets.
The sum of the former divided by the sum of the latter for all PDB chain regions of a FF resulted in the percentage of hydrophobic residues.
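The pooled ratio described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure actually used: the set of residues treated as hydrophobic (`HYDROPHOBIC`) is a placeholder because the hydrophobicity table is not specified in the text, the function name and the (sequence, secondary structure) input format are hypothetical, and strand residues are assumed to be marked with `E` as in PSIPRED's three-state output (H, E, C).

```python
# Placeholder hydrophobic set; the hydrophobicity table used in the study
# is not given in the text, so this choice is an assumption.
HYDROPHOBIC = set("AVILMFWC")


def hydrophobic_sheet_percentage(regions):
    """Percentage of beta-sheet residues that are hydrophobic for one FF.

    regions: iterable of (sequence, secondary_structure) string pairs, one
             per PDB chain region assigned to the FF (hypothetical format).
             'E' marks strand residues in the secondary structure string.
    Counts are summed over all regions before dividing, as in the text.
    """
    hydrophobic_in_sheets = 0
    residues_in_sheets = 0
    for sequence, structure in regions:
        if len(sequence) != len(structure):
            raise ValueError("sequence and structure lengths differ")
        for residue, state in zip(sequence, structure):
            if state == "E":
                residues_in_sheets += 1
                if residue in HYDROPHOBIC:
                    hydrophobic_in_sheets += 1
    if residues_in_sheets == 0:
        return None  # no strand residues, percentage undefined
    return 100.0 * hydrophobic_in_sheets / residues_in_sheets


# Made-up regions: 3 of the 7 strand residues are hydrophobic.
toy_regions = [("AVKDG", "EEECC"), ("LLSTE", "CEEEE")]
percentage = hydrophobic_sheet_percentage(toy_regions)
```

Summing counts across regions before dividing weights each residue equally, so long regions contribute more than short ones; averaging per-region percentages would give a different number.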

Macromolecular Movement Analysis

We predicted the existence of flexible hinges directly from atomic coordinates of single molecular conformations in selected FFs using FlexOracle (Flores and Gerstein 2007). FlexOracle cuts molecules in two at all positions and calculates intra-potential energies for fragments, which are then summed, and those with lowest potential energy are predicted to represent hinges. We also explored molecular motions by searching the Database of Macromolecular Movements (DMM) (http://www.molmovdb.org/molmovdb/).
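The cut-and-score idea behind the hinge prediction can be illustrated with a toy Python sketch. It does not reproduce FlexOracle or its energy function. As a loudly labeled stand-in, each fragment's "energy" is taken to be minus the number of residue contacts it retains, so the lowest summed energy corresponds to the cut that breaks the fewest contacts. The function names, the 8 Å cutoff, and the minimum fragment size are all assumptions.

```python
import math


def contact_pairs(coords, cutoff=8.0):
    """List residue pairs (i, j), j >= i + 2, closer than the cutoff.

    coords: one (x, y, z) tuple per residue (e.g. C-alpha positions).
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # skip chain neighbours
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


def cut_energies(coords, cutoff=8.0, min_fragment=2):
    """Score every cut position with a toy summed fragment energy.

    A cut at position c splits the chain into residues [0, c) and [c, n).
    Stand-in energy (an assumption, not FlexOracle's potential): each
    contact kept inside either fragment contributes -1.
    """
    pairs = contact_pairs(coords, cutoff)
    n = len(coords)
    energies = {}
    for cut in range(min_fragment, n - min_fragment + 1):
        kept = sum(1 for i, j in pairs if j < cut or i >= cut)
        energies[cut] = -float(kept)
    return energies


def predict_hinge(coords, cutoff=8.0, min_fragment=2):
    """Return the cut position with the lowest summed fragment energy."""
    energies = cut_energies(coords, cutoff, min_fragment)
    return min(energies, key=energies.get)


# Made-up coordinates: two compact four-residue lobes far apart.
toy_coords = [(0, 0, 0), (3, 0, 0), (3, 3, 0), (0, 3, 0),
              (20, 0, 0), (23, 0, 0), (23, 3, 0), (20, 3, 0)]
hinge = predict_hinge(toy_coords)
```

With these coordinates the only cut that leaves every contact intact falls between the two lobes, which is where a hinge would be expected.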

Cofactors

We examined the abundance of associations between cofactors and FFs with PROCOGNATE ver. 1.6 (Bashton et al. 2008). The associations in which cofactors are shared by more than one SCOP domain and that are not experimentally validated were excluded from this analysis. We then grouped the associations of this kind into a separate category called COGNATE. Since the large number of cofactors in COGNATE is difficult to visualize, we chose major cofactor species based on knowledge from the two CoFactor databases (Fischer et al. 2010) and Wikipedia (http://en.wikipedia.org/wiki/Cofactor). We then calculated the abundance of associations for every pair of COGNATE cofactors and the 54 FFs. Abundance values were normalized to a 0–1 scale and were then plotted for the selected major ligands as a heatmap using R ver. 2.12.
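The abundance calculation can be sketched as a small counting step followed by rescaling. The sketch below is in Python for illustration, whereas the plotting in the study was done in R. The normalization rule is not stated in the text, so dividing by the largest count is an assumption, and the function name and the cofactor–FF pairs in the example are made up.

```python
from collections import Counter


def normalized_abundance(associations):
    """Count cofactor-FF associations and rescale the counts to 0-1.

    associations: iterable of (cofactor, ff) pairs, one per association
                  record (hypothetical input format).
    Assumed normalization: each count is divided by the largest count.
    """
    counts = Counter(associations)
    if not counts:
        return {}
    largest = max(counts.values())
    return {pair: count / largest for pair, count in counts.items()}


# Made-up association records, for illustration only.
toy_records = (
    [("NAD", "c.2.1.2")] * 4
    + [("ATP", "c.37.1.12")] * 2
    + [("FAD", "c.2.1.2")]
)
abundance = normalized_abundance(toy_records)
```

The resulting dictionary can be reshaped into a cofactor-by-FF matrix for a heatmap; pairs that never occur are simply absent and would be plotted as zero.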

Results and Discussion

Uncovering Structural Origins from a Timeline of Primordial Domain Discovery

We recently assigned ages to protein domains at FF level in the proteomes of 420 free-living organisms spanning the three superkingdoms of life (Caetano-Anollés et al. 2011) using the strategy summarized in Fig. 1a. The relative age of individual domains was calculated from the published phylogenomic tree (Fig. 1b). The tree describes the history of 2,397 FF structures and was obtained from phylogenomic analysis of a set of 754,867 inferred structures. Time was measured by a relative distance in nodes from a hypothetical ancestral FF at the base of the tree. This node distance (nd) was used to construct a timeline of domain discovery, with time flowing from the origin of FFs (ndFF = 0) to the present (ndFF = 1). nd values have been shown to be linearly proportional to geological time when trees of domains defined at F and FSF levels are used as molecular clocks (Wang et al. 2011). Extending the clock to FFs showed that domain age continued to be proportional to time but with larger dispersion at high nd values (data not shown). In this study, we focus on the structure and function of the first 54 FFs that appeared in evolution and span the ndFF = 0–0.126 window (Fig. 1c, d; Tables S1, S2). These FFs (and associated FSFs) are responsible for laying the structural foundation of both modern metabolism and the translation machinery (Caetano-Anollés et al. 2009a, 2011). Our goal is to study the structural make up of these early FFs and their associated functions and use this information to build a general hypothesis that would be compatible with phylogenomic inferences and would explain the origins of an ancient protein world and of life. We note that by focusing on the FF level of structural complexity we look at protein evolution with a definition of structure and function that is quite modern and can be therefore misleading. 
Nevertheless, while inferences about the past necessarily require that we distance ourselves from those definitions, we propose that structural canalization (sensu Ancel and Fontana 2000; see below) has “frozen in time” the most prominent structural and functional features of evolving molecules. In this study we attempt to dissect those prominent and ancestral features. We also invoke the reuse of old structures for new functions through gradual change induced by mutation. We claim that the same principles of “neofunctionalization” and “subfunctionalization” that are proposed for the generation of genes with new functions can be extended to early protein structures.

Fig. 1

Timeline describing the very early evolution of protein domains. a Experimental strategy for the construction and annotation of phylogenomic trees and timelines. The structural census is used to build data matrices for the construction of trees of proteomes (not described in this article) and trees of domain structures at FF level of structural complexity. Elements of the matrix (g_mn) represent genomic abundances of domains in proteomes, and different databases (DB) are used to annotate domain structures and functions. b Universal phylogenomic tree of FFs. One optimal most parsimonious tree of FFs (177,864 steps; ensemble consistency index = 0.030; ensemble retention index = 0.749; g1 = −0.070) was reconstructed from 420 parsimony-informative phylogenetic characters (proteomes) derived from a genomic census of domain structures in free-living genomes. Terminal leaves representing 2,397 FFs are not labeled in the tree since they would not be legible. Branches colored in red depict evolutionary relationships of the 54 most ancient FFs. The Venn diagram shows occurrence of FFs analyzed in the three superkingdoms of life. c Timeline describing the evolution of FF domain structures. The timeline was derived directly from the tree of FFs. Ages are given as node distances (ndFF) and time flows from left to right. The three evolutionary epochs of the protein world defined by Wang et al. (2007), “architectural diversification” (epoch 1), “superkingdom specification” (epoch 2), and “organismal diversification” (epoch 3) are overlaid on the timeline (colored with different shades) and transitions between epochs traced back to the tree with dashed lines. Landmark discoveries defined in Caetano-Anollés et al. (2011) are identified with circles along the timeline. d Evolution of the 54 most ancient FFs. Idealized forms describing the overall structural topology of the individual FFs are traced along the timeline together with descriptions of β-sheet size and topology.
e Periodic table of idealized forms (Taylor 2002) with shaded boxes representing forms used by the first 54 FFs

The Structural Origin of Modern Proteins is Linked to Cellular Membranes and Primordial Cellular Machinery

The most ancient proteins in the timeline of FF discovery harbored the P-loop containing nucleoside triphosphate (NTP) hydrolase fold (c.37), confirming previous observations (Caetano-Anollés and Caetano-Anollés 2003; Caetano-Anollés et al. 2007, 2009a, b; Wang et al. 2006; Wang and Caetano-Anollés 2006, 2009). The ABC transporter ATPase domain-like (c.37.1.12) was the oldest protein family, but the extended and tandem AAA-ATPase domain (c.37.1.20 and c.37.1.19) FFs immediately followed, together with an additional five P-loop NTP hydrolase fold structures in the ancestral set of 54 FFs (Tables S1, S2). An analysis of recruitment in metabolic networks showed these ancient proteins were most likely hydrolase and transferase enzymes involved in nucleotide interconversion, storage and recycling of chemical energy through high energy phosphate transfer, and terminal production of nucleotides and cofactors (Caetano-Anollés et al. 2007, 2009a, b). The early appearance of these FFs is also congruent with phylogenies of molecular functions that showed the oldest proteins had ATPase, GTPase, and helicase activities (Kim and Caetano-Anollés 2010). The primordial ATPases at the start of the timeline of FFs have the potential to use the energy of nucleotide binding and hydrolysis for mechanical work, which is currently employed in extant proteins to move a wide range of molecules, from nucleotides to polypeptides (Ye et al. 2004). Their common fold structural design, exemplified in the recA protein, contains a central β-sheet flanked by α-helices, a highly conserved Walker A (P-loop) sequence motif located at the tip of the first β1-strand that binds to di- and tri-nucleotides, and a less conserved Walker B motif in the β3-strand that coordinates to Mg2+. The Walker A sequence motif is embedded in a glycine-rich loop (the phosphate-binding loop or P-loop) that spans a β-sheet and α-helix. 
This P-loop exhibits a main-chain structure that can accommodate an atom with a whole or partial negative charge, the nest (Watson and Milner-White 2002), a structure believed common in prebiotic polypeptides (Milner-White and Russell 2008). This central core is associated typically with a more or less separate bundle of four α-helices at one end of the molecule. The bundle is integrated in the fold design of the ABC transporter c.37.1.12 FF (Fig. 2) but sometimes constitutes (e.g., PDB entries 1NJG and 1UAA) one or more separate subdomains in the primordial c.37.1.20 and c.37.1.19 FFs. Subdomains in these ancient FFs are well dissected by the CATH classification of proteins (Greene et al. 2007), which splits structures into smaller domain segments, showing the 3-layer (αβα) sandwich Rossmann fold (3.40.50) and the orthogonal bundle/helicase RuvA protein-domain 3 (1.10.8) are recurrent. Remarkably, a timeline derived from a census of protein domains defined by CATH revealed the 3.40.50 multi-layered folds and the 1.10.8 and 1.10.10 bundles were the oldest domain superfamily homology structures, confirming the ancestrality of these structural designs with different domain definitions (Bukhari et al. in preparation). The structure and function of the first four FFs are revealing and important for the model of early protein evolution we propose:

Fig. 2

The structure of the ABC transporter ATPase domain-like FF. a Structure of the ABC transporter from Escherichia coli involved in vitamin B12 uptake (1L7V) showing the helical transmembrane domain with the “ABC transporter involved in vitamin B12 uptake, BtuC” (f.22.1.1) FF and the ATP-binding domain with the “ABC transporter ATPase domain-like” (c.37.1.12) FF colored according to FF age (ndFF). b Crystallographic model describing the structure of the P-loop containing ATP-binding domain of the histidine permease of Salmonella typhimurium (PDB entry 1B0U). c Wiring diagram of the secondary structure of the ATP-binding subunit. Structural elements defining the c.37.1.12 FF are described

The ABC Transporter ATPase Domain-Like Family

The oldest protein structure, c.37.1.12, is linked to ATP-binding cassette (ABC) transporters, which are universally distributed in the living world and constitute one of the largest families of proteins that are known (Higgins 1992; Linton and Higgins 2001; Locher 2009). These proteins transport a wide range of molecules across membranes, from small compounds to large polypeptides and complexes (e.g., organic and inorganic phosphate esters, nucleotides, amino acids, sugars, peptides, sulfate, polyamines, metallic cations, organo-iron complexes, and vitamins; e.g., Tam and Saier 1993) and play a variety of physiological roles. The diversity of these proteins is cataloged in over 600 family groups in the Transporter Classification Database (Saier Jr. et al. 2009). ABC transporters have a minimum of two domain regions: (1) a nucleotide-binding domain with a recA-like core structure that couples energy released by ATP catalysis with transport through a catalytic site that involves the Walker A/B motifs, and a small bundle of α-helices with a “signature” substrate (MgATP) binding site (C-site) that confers substrate specificity, and (2) a transmembrane domain containing a helical bundle of typically up to six α-helices, which facilitates the transfer of substrates across membranes (Fig. 2). Prokaryotic transporters have an extra periplasmic substrate-binding domain and domains are in different chains. A topology diagram of the ATP-binding subunit of histidine permease from Salmonella and its structure (Hung et al. 1998) exemplify the typical wiring of secondary structures in c.37.1.12 and illustrate how the helical bundle is integrated in the molecule and two layers of β-strands define the nucleotide-binding pocket (Fig. 2). Note that two of these nucleotide-binding pockets are located at the interface between the two subunits of the protein dimer. This forms the physiological interface of the transporter complex.
We propose that the helical bundle of c.37.1.12, which resembles the transmembrane domain of the ABC transporter, is an ancient remnant of a primordial transporter molecule that was integral to protocell membranes.

The evolutionary origin of ion channels and transporters has been associated with the transmembrane domain (Morris 2002; Saier 2003; Pohorille et al. 2005) and to the process of protein folding in membranes (Popot and Engelman 2000). In contrast with soluble globular proteins that fold into a highly diverse repertoire of topologies (Levitt 2009), the vast majority of membrane proteins have only two kinds of transmembrane structures, bundles of α-helices or barrels of β-strands (Popot and Engelman 2000). This lack of diversity is remarkable and has important evolutionary implications, which we will elaborate below. In particular, helical membrane proteins represent about a quarter of all proteins in a genome and mediate a wide variety of functions that are crucial for the cells (Wallin and von Heijne 1998). The structural simplicity and universality of these proteins suggests they are ancient and embody primordial structural designs (Pohorille et al. 2005; Pohorille and Deamer 2009). Biophysics, chemistry, and dynamics also provide crucial information about the physical origins of membrane protein folds. A multi-step model of folding was proposed almost two decades ago in which independently stable α-helices fold across membrane lipid bilayers, interact with each other to form high order structures, and finally partition additional polypeptide regions (e.g., coil regions or short helices) while facilitating prosthetic group binding (Popot and Engelman 1990, 2000; Engelman et al. 2003). The model has been largely confirmed by biological assays and biophysical free energy measurements of transmembrane protein interactions (e.g., using fluorescence resonance energy transfer and thiol disulfide interchange methods), revealing tendencies for transmembrane domains to self-associate in multiple hydrophobic environments and in micellar and bilayer systems (MacKenzie and Fleming 2007). 
The structural simplicity of systems that transport molecules across membranes is also evident in the properties of small peptides, such as antiamoebin, a 16-residue fungal antibiotic made of α-aminoisobutyric acid, isovaline, and hydroxyproline. The polypeptide adopts an α-helical structure, spans the length of lipid bilayers, associates in groups (generally 4-helical bundles), channels cations through membranes, and has a structure that resembles a potassium channel assembly (O’Reilly and Wallace 2003). A number of similar α-helical peptides, some 20–25 amino acids long, have been shown to self-assemble into artificial membrane channels (some reviewed in Morris 2002). Natural or synthetic lipid soluble molecules (ionophores) that make up carriers and channels increase the ionic permeability of membranes and their structures represent good candidates for primordial membrane components. Carriers have hydrophobic and hydrophilic surfaces that are used to mediate ion transfer. Channels form water-filled pores that act as trans-membrane conduits. A number of carrier peptides and antibiotics composed of l- and d-amino acids (e.g., valinomycin) are synthesized by template-free megaenzyme systems, the non-ribosomal peptide synthetases (NRPS), in microbial organisms (Marahiel 2009). Similarly, a number of antibiotics (e.g., gramicidin) form channels that are similar to channel proteins. These peptides can be synthesized under plausible prebiotic conditions and have membrane transport abilities (Pohorille et al. 2005). They also have atypical amino and imino acids, and in rare cases racemic mixtures (e.g., gramicidin), which are believed to have been common in early Earth (Rode 1999). Some of these short polypeptides, especially non-ribosomally synthesized peptides such as vancomycin, are also rich in main-chain anion-forming nest structures (Milner-White et al. 2004).

An early origin of proteins associated with membranes is therefore feasible and compatible with our phylogenomic analysis. In contrast, a link between nucleic acids and membranes in an ancient RNA world is less likely. Nucleic acids are not only difficult to synthesize prebiotically (Powner et al. 2009) but they do not behave as ionophores and are instead membrane disruptors (Vlassov et al. 2001). We note that the phospholipid constitution of modern membranes limits membrane growth and suggest ancient membranes were constructed differently, perhaps using fatty acids and alcohol and glycerol monoester derivatives, which allow passage of charged molecules such as nucleotides (Mansy et al. 2008). However, the plausible existence of primordial transport peptides that were integral to membranes, would not only explain the origins of modern carrier and channel-forming machinery (and soluble proteins that could have derived from it), but would enable the active and selective retention of metals, cofactors, nucleic acid and protein building blocks, and other primordial chemicals by protocells with a membrane structure similar to that of modern cells.

The Extended and Tandem AAA-ATPase Domain Families

The c.37.1.20 and c.37.1.19 FFs are the second and fourth structural lineages of the timeline. These families represent “ATPases associated with diverse cellular activities” (AAA or AAA+) that play important roles today in a number of cellular processes, including protein folding and transport, proteolysis, membrane trafficking, cytoskeletal regulation, intracellular motility, and DNA replication (Vale 2000; Lupas and Matin 2002; White and Lauring 2007). AAA+ proteins are mechanoenzymes, macromolecular machine components that act as molecular switches and shift reversibly from one stable conformation to another. Their ability to change shape exerts tension on other molecules, dissociating interactions between proteins, unfolding polypeptide chains, or acting as molecular motors. AAA+ proteins share a highly conserved P-loop NTP hydrolase domain architecture that defines the ATPase domain and usually forms oligomeric (often hexameric) ring complexes (e.g., proteasomal ATPases). Some AAA+ proteins have two ATPase domains and form complexes with stacked or double rings (e.g., ClpA, NSF). The ATPase domain of AAA+ proteins is 200–250 amino acids long and contains the Walker A loop, the Walker B motif, sensor-1 and sensor-2 motifs, and an arginine finger. The fold contains two subdomains, the typical Rossmann-like α/β/α layered structure of the c.37 fold and a small subdomain that is predominantly helical. Chemical energy released by the hydrolysis of ATP, sometimes through cooperative ATP hydrolysis (e.g., Hsp104; Hattendorf and Lindquist 2002), is used to remodel bound target molecules such as proteins and protein complexes. One example is the NSF-mediated disassembly of the coiled-coil SNARE complex formed during fusion of vesicles in vesicular transport (Ungermann et al. 1998). In this process, components of the SNARE complex are recycled non-destructively for further rounds of membrane fusions as they pass through the double ring structure of the NSF-complex.
In turn, AAA+ proteins such as ClpA unfold proteins and deliver them to associated proteases by translocation (Hinnerwisch et al. 2005). Conformational changes make the ring structures behave as “molecular crowbars” (Vale 2000) through concerted molecular changes that induce vectorial force (sometimes directed to the center of the rings; e.g., Clp ATPases) or as “molecular motors” (e.g., dynein) by altering the angle of a target binding domain embedded in the ring (e.g., movement of microtubule domains by dynein complexes). The versatility of mechanoenzymatic complexes is probably evolutionarily derived (Iyer et al. 2004). Most of these AAA+ proteins act in highly structured cellular environments (e.g., the export of misfolded secretory proteins in the endoplasmic reticulum for proteasome-linked degradation), are involved in processes that are relatively modern (e.g., DNA helicase activity and replication initiation), and their functions are uniquely implemented in eukaryotes and prokaryotes. However, the ability of AAA+ domains to unfold proteins is universal, crucially important, and relevant for the early origin of proteins, especially because these primordial proteins could have helped chaperone protein transfer to membranes by unfolding the polypeptide chains before assuming helical conformations. This aspect is central to the evolution of membrane proteins (Renthal 2010). We note that ClpA/ClpP proteins exhibit both chaperone dissociating activities of this kind and protease-directed translocation abilities (Pak et al. 1999), traits that could be very ancient.

The Tyrosine-Dependent Oxidoreductase Domain Family

The c.2.1.2 FF is the third structural lineage of the timeline. This protein family is known as the short-chain dehydrogenase/oxidoreductase (SDR) superfamily, a large group of NAD(P)-dependent oxidoreductases with the NAD(P)-binding Rossmann fold (Kavanagh et al. 2008). SDR proteins are approximately 250–350 amino acid residues long, have a serine, tyrosine and lysine-defined active site, and a glycine-rich N-terminal βα-turn. These globular enzymes perform a wide variety of metabolic functions but some associate with membranes as peripheral membrane proteins (e.g., ferrochelatase). The first SDR enzymes to be described were alcohol dehydrogenases, but they are widespread in metabolism. They represent mostly oxidoreductases (EC 1), lyases (EC 4), and isomerases (EC 5). MANET identifies their involvement in four major enzymatic activities (EC 1.1.1, EC 1.3.1, EC 4.2.1, and EC 5.1.3) spread throughout 32 metabolic subnetworks (Kim et al. 2006). Metabolic-wheel analysis (Caetano-Anollés et al. 2007) reveals that the most ancient of these subnetworks is “porphyrin and chlorophyll metabolism” (COF 00860), which holds a number of pathways for the biosynthesis of cofactors. While the COF 00860 c.2.1 structure is embodied in a large number of reductases (e.g., biliverdin reductase; glutamyl-tRNA reductase), transferases (e.g., adenosylcobinamide-GDP ribazoletransferase), and chelatases (e.g., ferrochelatase, cobaltochelatase) needed for cofactor biosynthesis, only EC 1.3.1.24 (biliverdin reductase) is linked directly to c.2.1.2 with nine structural entries. It has also been proposed that the β-ketoacyl [acyl carrier protein] reductase (EC 1.1.1.100) of the “fatty acid biosynthesis (path 1)” subnetwork (LIP 00061) is one of the most ancient SDR proteins (Duax et al. 2009). 
The almost exclusive use of the GC-rich half of the codon table and the fact that SDR genes have multiple open reading frames appear to indicate that these enzymes expanded very early in evolution, before the genetic code diversified (Duax et al. 2005).

Early Structures Reveal Primordial Patterns of Structural Change

The eleven most ancient FFs (ndFF = 0–0.041) share with the most ancient P-loop NTP hydrolase fold (c.37) a common α/β/α-layered architecture. These FFs embody the NAD(P)-binding Rossmann (c.2), ribonuclease H-like motif (c.55), adenine nucleotide alpha hydrolase-like (c.26), class II aminoacyl-tRNA synthetase (aaRS) and biotin synthetase (d.104), PLP-dependent transferase-like (c.67), periplasmic-binding protein-like (c.94), and thiolase-like (c.95) folds (Tables S1, S2). The “Rossmann-like” α/β/α-layered design remains recurrent during early protein evolution in our timeline and as we will show represents the structural cradle of translation. In fact, 36 out of the 54 primordial folds that we analyzed involved layered structural topologies of this kind.

In order to study structural change occurring in primordial proteins, we decided to use abstract representations of their central topological features. The core 3D topological arrangement of proteins can be summarized by a set of idealized form structures made up of layers of packed α-helices and hydrogen-bonded β-strands (β-sheets) (Taylor 2002). These idealized forms define a “periodic table” that captures the helix-sheet make-up and the curvature of β-sheets in proteins (Fig. 1e). β-sheets can twist and coil (Chothia 1973). They can also curl by incorporating a stagger between adjacent β-strands, sometimes distorting the β-sheets to form cylindrical structures in which the first β-strand is hydrogen-bonded to the last (Murzin et al. 1994a, b). These cylindrical structures are known as barrels. The columns of the “periodic table” of forms define the number of helical and/or sheet layers that exist in the structural core of the protein (the number of sheets is indicated with subscripts). The rows define how curled and staggered the layers are, ranging from flat-like (I) to curl-like (C) to barrel-like (O) in conformation. A simple nomenclature captures the position of a form in the table. For example, form I31 represents a protein structure that has three layers that are flat, one of which is a sheet. We note that the table at present cannot accommodate the design of proteins made solely from α-helices and does not capture the structural complexities of β-sheets (e.g., propellers, β-bundles, etc.) (Taylor 2002).
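The form nomenclature described above can be made concrete with a short sketch. The code below is only an illustration: the `Form` class and `parse_form` helper are hypothetical names introduced here, and the parser assumes single-digit layer and sheet counts, which covers every form label mentioned in the text (I31, I42, I21, C31, C11, O21, O11).

```python
import re
from dataclasses import dataclass

# Row symbols of the table of idealized forms, as described in the text.
CURVATURE = {"I": "flat", "C": "curled", "O": "barrel"}


@dataclass(frozen=True)
class Form:
    curvature: str  # "flat", "curled", or "barrel"
    layers: int     # total number of helical and sheet layers in the core
    sheets: int     # how many of those layers are beta-sheets

    @property
    def helical_layers(self) -> int:
        # Layers that are not sheets are layers of packed alpha-helices.
        return self.layers - self.sheets


def parse_form(label: str) -> Form:
    """Parse a form label such as 'I31' (hypothetical helper, single-digit counts only)."""
    match = re.fullmatch(r"([ICO])([0-9])([0-9])", label)
    if match is None:
        raise ValueError(f"not a recognized form label: {label!r}")
    row, layers, sheets = match.groups()
    form = Form(CURVATURE[row], int(layers), int(sheets))
    if form.sheets > form.layers:
        raise ValueError(f"more sheets than layers in {label!r}")
    return form


if __name__ == "__main__":
    for label in ("I42", "I31", "I21", "C31", "O21", "O11"):
        print(label, parse_form(label))
```

Under this reading, I31 is a flat three-layered core with one sheet and two helical layers (the α/β/α design), while O11 is a single-layer barrel made only of a sheet.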

We mapped these idealized forms to the 54 ancient FF domain structures (Table S2) revealing ancestral patterns of structural change (Fig. 1d). The basal β/α/β/α-layered architecture (form I42) embodied in c.37.1.12 has fold-defining flat sheets of parallel and antiparallel (mixed) β-strands and a peripheral α-helical bundle (Fig. 2). This architecture gave rise first to α/β/α-layered I31 forms with parallel β-strands, most probably through β-strand loss (c.37.1.20, c.2.1.2, and c.37.1.19, in that order), and then I31 forms with even shorter sheets of mixed or parallel β-strands (c.55.1.1, c.37.1.8, c.26.1.1, and d.104.1.1) without associated bundle-structures. Since c.37.1.12 transporter structures are tightly linked to membranes, the c.37.1.12 ancestor was probably integral to the primordial membranes but later on in evolution was pushed to the membrane surface for gate keeping. We speculate that an exposure to hydrophilic environments resulted in experimentation with β-strands, definition of the primordial α/β/α-layered structure, and production of more compact globular derivatives by loss of β-strands and the integral component. The timeline and functions of the oldest proteins suggest that the initial sheet reductive tendency and loss of the α-helical subdomain produced globular proteins that became more and more independent from their originating membranes. We note that initial globular FFs that appear in the timeline embody large groups of proteins that have important enzymatic, chaperone, and regulatory roles. For example and as discussed previously, the SDR superfamily with the c.2.1.2 structure embodies a large group of globular enzymes that are widely distributed in metabolism and have allosteric regulatory mechanisms of function. 
Moreover, at ndFF = 0.02, three globular structures appear together in the timeline that are central and act within the intracellular environment: (i) the actin/Hsp70 (c.55.1.1) FF of proteins that assist protein folding and manage cellular stress and provide the structural scaffold for actin and actin-like molecules that make up cellular microfilaments (Hurley 1996); (ii) the G protein (c.37.1.8) FF of proteins that act as molecular switches, sensing the environment and regulating enzymes, transporters and a wide variety of cellular processes, including ribosomal protein synthesis; and (iii) the catalytic domain of class I aaRS (c.26.1.1) FF of enzymes that aminoacylate cofactors and nucleic acids, and in modern cells define the rules of the genetic code (Ribas de Pouplana and Schimmel 2001a). The catalytic domain of class II aaRS (d.104.1.1) FF immediately follows (ndFF = 0.024). We will discuss below in detail the relevance of these crucial structural and functional discoveries.

The “globularization” of proteins appears to have left remnants of the hydrophobic-to-hydrophilic transition in the amino acid make-up of the β-sheet of the α/β/α-layered architecture. Analysis of all PDB structures associated with the first 54 FFs shows there is no clear correlation between the fraction of hydrophobic residues of the sheets and the age of the FFs (Fig. 3). However, a clear pattern of hydrophobic decrease was observed in the sheets of the first 10 FFs.

Fig. 3

Hydrophobic amino acid constitution of the β-sheets of the most ancient 54 FFs. The fraction of hydrophobic residues in the β-sheets of a FF domain (y-axis) was calculated by dividing the sum of the hydrophobic residues in β-sheets by the total number of residues in β-sheets for all PDB chain regions assigned to each FF. The 54 most ancient FFs are arranged along the evolutionary timeline (ndFF; x-axis). The first 10 FFs and the 44 remaining FFs are labeled with gray and black circles, respectively. Two dotted lines (a, b) show the correlation between the fraction of hydrophobic residues and nd values for the 54 and the 10 FFs, respectively.
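The pooled fraction described in the caption can be sketched as follows. This is a minimal illustration and not the pipeline used for Fig. 3: the function name and input format are hypothetical, and the set of residues counted as hydrophobic is an assumption, since the text does not list the residues that were used.

```python
# Assumed hydrophobic set (one-letter codes); the source does not specify it.
HYDROPHOBIC = frozenset("AVLIMFWC")


def hydrophobic_sheet_fraction(sheet_regions):
    """Return the pooled hydrophobic fraction for one FF.

    sheet_regions: iterable of strings, each holding the beta-sheet residues
    (one-letter codes) of one PDB chain region assigned to the FF.
    Counts are summed over all regions before dividing, as in the caption,
    rather than averaging per-region fractions.
    """
    hydrophobic = 0
    total = 0
    for residues in sheet_regions:
        total += len(residues)
        hydrophobic += sum(1 for aa in residues if aa in HYDROPHOBIC)
    if total == 0:
        raise ValueError("no beta-sheet residues supplied")
    return hydrophobic / total


if __name__ == "__main__":
    # Toy input: two chain regions of unequal length.
    print(hydrophobic_sheet_fraction(["VV", "KTGEDS"]))
```

Pooling matters when regions differ in size: in the toy input the pooled value is 2/8, whereas averaging the two per-region fractions (1.0 and 0.0) would give 0.5.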

At ndFF = 0.03–0.045, three new forms (C31, O21, and O11) result from a general tendency of β-sheets to curl by incorporating a stagger between adjacent β-strands. This tendency, which develops throughout the timeline, produced structures with curled sheets and complete barrels. At ndFF = 0.03, the flat I31 gave rise for the first time to slightly curled α/β/α-layered forms (C31) that resemble twisted “open” barrels. Two structures appear almost together in the timeline, the phosphate-binding protein-like (c.94.1.1) and thiolase-related (c.95.1.1) FFs with double domain structures connected by a hinge region that harbors the active site (Fig. S1). The C31 structure is present in the periplasmic binding protein c.94.1.1 FF and its ligand-binding site interacts with a wide range of ligands, including carbohydrates, amino acids, dipeptides, and polypeptides (Dwyer and Hellinga 2004). The hinge topology of the fold offers incredible adaptability in this group of proteins, which has been used in protein engineering to design biosensors and allosteric control elements. The sister c.95.1.1 FF is one of the two FFs of the thiolase superfamily, a group of enzymes that catalyze the formation of carbon–carbon bonds via a Claisen condensation reaction (Haapalainen et al. 2006). The c.95.1.1 enzymes are part of fatty acid and polyketide synthesis pathways. Enzymes include β-ketoacyl-acyl-carrier synthases, which require the activity of acyl carrier protein (ACP)(a.28.1.1; ndFF = 0.425) in type I and II enzyme-mediated fatty acid elongation steps, or are primed by acetyl coenzyme A (acetyl-CoA) during type II enzyme-mediated fatty acid synthesis initiation. They also include polyketide synthases that synthesize large numbers of biologically and medically relevant natural products. The nucleophilic groups necessary for the condensation reaction are thioesters, which are much more reactive than oxygen esters. 
The carbon–carbon bond formation results from nucleophilic attack of a thioester in the pantetheine group of coenzyme A (CoA) or ACP on a carbonyl carbon of a fatty acid, a reaction that is mediated by a reactive cysteine in the active site that is transiently acylated during catalysis.

At ndFF = 0.04, the curled C31 form gives rise for the first time to a full β/α-barrel with form O21. This FMN-linked oxidoreductase (c.1.4.1) FF is part of the triose-phosphate isomerase (TIM) (βα)8-barrel fold (c.1). The structural components of the fold are 7–8 β-strand and α-helix pairs connected by βα-turns. The tilted β-strands define a staggered sheet with shear number of 8 and complete barrel structure, surrounded by the α-helices. The fold is widely present in metabolic enzymes and makes up ~10% of all proteins with structures that are known or are inferred (Sterner and Höcker 2005). With notable exceptions, TIM (βα)8-barrel proteins are globular enzymes, mostly hydrolases, but also oxidoreductases, transferases, lyases, and isomerases. MANET identifies the involvement of c.1.4.1 in 8 enzymatic activities at the third level of EC classification (EC 1.1.2, EC 1.1.3, EC 1.3.1, EC 1.3.3, EC 1.3.99, EC 1.4.1, EC 1.4.7, EC 2.5.1, EC 5.3.3) spread in 8 metabolic subnetworks (Kim et al. 2006), including “pyrimidine metabolism” (NUC 00240) and “pantothenate and CoA biosynthesis” (COF 00770), the two most ancient in the set (Caetano-Anollés et al. 2007), and “pyruvate metabolism” (CAR 00620), one of the three most important donors of TIM (βα)8-barrel enzymes (Caetano-Anollés et al. 2009b). Of special relevance is the (S)-3-O-geranylgeranylglyceryl phosphate synthase (GGGPS) (EC 2.5.1.41), which harbors the c.1.4.1 structure and mediates an important step in the biosynthesis of polar lipids in Archaea and the ancient cellular ancestor of life (Kim and Caetano-Anollés 2011).

At ndFF = 0.045, the β-barrel-like (O11) form appears for the first time in the timeline as an integral part of a central multiple α/β/α-layered architecture with 4 subdomains in the acetyl-CoA synthetase-like FF (e.23.1.1). The barrel-like fold is a β-roll non-local structure that spans the two I31 and one I21 subdomains and in which 3 pairs of antiparallel strands linked to separate regions of the molecule are wrapped in 3D to form a barrel. The I21 subdomain also represents a new form in the timeline that likely resulted from the loss of α-helices on one side of the I31 form. Fatty acids, polyketides, and non-ribosomal peptides are synthesized by megaenzyme complexes in assembly lines in a series of iterative condensation steps (Koglin and Walsh 2009). The e.23.1.1 FF defines the central aminoacylation (A) or acyltransferase (AT) domains of modern modules in NRPSs, fatty acid synthetases (FASs), and polyketide synthetases (PKSs) that are responsible for template-free synthesis of peptides, fatty acids, and secondary metabolites, respectively (Marahiel 2009; Koglin and Walsh 2009). Proteins in this family catalyze two partial reactions that resemble reactions catalyzed by aaRSs. In a first reaction step, ATP is used to activate a carboxylate group and form a high-energy acyl-, aminoacyl-, or aryl-adenylate intermediate and inorganic pyrophosphate. In a second step, the activation energy stored in the high-energy acid anhydride is used to form a thioester by attack of the carboxylate carbon by a pantetheine thiol group [bound to a 4-helix dynamic peptidyl carrier protein (PCP) or acyl carrier protein (ACP) domains] with displacement of AMP. This second thioester-forming (e.g., adenylation module of NRPS synthetases) or oxidative decarboxylation (e.g., luciferase) reaction produces intermediates for condensations in protein or fatty acid biosyntheses or to produce light, respectively. 
These adenylation proteins rotate their C-terminal domains by 140°, exposing alternative faces of the domain to the active site to accomplish the two-step reaction (Gulick 2009). This large-scale domain rotation is highly unusual. Compared, for example, to the 10–20° rotation in the closing of catalytic groups, the rotation allows transport of the intermediate between active sites on two different faces of the same domain.

Forms I31, I21, C31, O21, and O11 described above are reused during the discovery of the rest of the 54 ancient FFs, with O11 barrel structures appearing linked to translation (regulatory factor domains, the first two ribosomal proteins, and an aaRS editing domain) and O21 barrel structures to globular enzymes. Two new structural designs appear during this timeframe, the 4-layered I42 form with unusual sheet-sheet packing in the class II glutamine amidotransferase d.153.1.1 FF and the half β-barrel C11 form in the N-terminal alcohol dehydrogenase-like domain b.35.1.2 FF.

Early Origin of Cofactor/Nucleic Acid Acylation and Peptide Ligation

The group of primordial protein families that appears soon after the membrane-facilitated origins of proteins (ndFF = 0.02–0.045) catalyzes crucial acylation and condensation reactions. Adenylating proteins include acyl- and aryl-CoA synthetases, pantothenate synthetases, adenylating domains of NRPSs, and aaRSs, which, as indicated over four decades ago (McElroy et al. 1967), share a number of functional similarities despite generally lacking sequence homology. As we will describe, the structural discovery of these enzymes generally involves large conformational changes in structure and is responsible for cofactor and nucleic acid acylation and non-ribosomal peptide ligation.

The Functionally Versatile Structures of aaRS Enzymes: Early Origin of Catalytic Domains in Structures Important for Acylation and Peptide Ligation

aaRSs are “activating” multidomain enzymes that in a two-step reaction define the algorithmic rules of the genetic code by specifically attaching amino acids to their cognate tRNAs (Ribas de Pouplana and Schimmel 2001a). In the amino acid binding site of the catalytic (aminoacylation) domain, an amino acid is first activated by condensation with ATP to form aminoacyl-adenylate (aa-AMP), releasing inorganic pyrophosphate. The activated amino acid then esterifies the 2′ or 3′-hydroxyl group of the ribose in the 3′ end of the acceptor arm of tRNA. Activation is highly specific and involves rejection of larger amino acids by the acylation site, and in about half of aaRSs, hydrolysis of incorrectly activated small amino acids in an editing site (Ling et al. 2007). Editing can occur before or after transfer of the activated amino acid to tRNA, and, as with polymerases, does not necessarily require that molecules be dissociated from the enzyme. Besides this crucial proofreading activity, aaRSs must also recognize the corresponding triplets of the anticodon arm of tRNA, which usually requires tRNA recognition by less-conserved anticodon binding domains. The pattern of domain accretion that we see in our timelines reveals that editing is an evolutionary enhancement that increases specificity of amino acid binding and that cognate tRNA recognition appears to be a derived addition to the central catalytic role of aminoacylation domains (Caetano-Anollés et al. 2011). The first of the editing domains, the ValRS/IleRS/LeuRS editing domain (b.51.1.1), was one of the last to appear in the group of the 54 most-ancient FFs (ndFF = 0.126) (Fig. 1), but the first of a number of editing and anticodon-binding domains that were discovered after the emergence of the ribosome (Caetano-Anollés et al. 2011; Harish and Caetano-Anollés 2011). 
Thus, possible interactions of aaRSs with nucleic acids during the appearance of the first 54 FFs were not completely specific and could not have contributed fundamental specificity-determinants toward the establishment of the canonical genetic code. Domain accretion seems important not only for translation functions. In eukaryotes, aaRSs have progressively incorporated a number of protein domains into their molecular make up, endowing enzymes with functions that are beyond those of translation (Guo et al. 2010). Their accretion follows tightly the increases in complexity of the eukaryotic superkingdom.
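For reference, the two-step aminoacylation reaction described at the start of this subsection can be written schematically as follows. This is a simplified sketch that omits the enzyme, Mg2+, and the editing steps; "aa" stands for the amino acid and PP_i for inorganic pyrophosphate, and the release of AMP in the second step is standard aaRS chemistry rather than something stated explicitly in the paragraph above.

```latex
\begin{align*}
\text{aa} + \text{ATP} &\longrightarrow \text{aa-AMP} + \text{PP}_i
  && \text{(activation)} \\
\text{aa-AMP} + \text{tRNA} &\longrightarrow \text{aa-tRNA} + \text{AMP}
  && \text{(transfer to the 2$'$ or 3$'$-hydroxyl of the tRNA 3$'$ end)}
\end{align*}
```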

The catalytic aminoacylation domains are not only the most ancient and highly conserved but they are also central to aaRS activity, recognizing amino acids and nucleotides in the acceptor tRNA stem (Ribas de Pouplana and Schimmel 2001a). The Rossmann-like structures of these domains split enzymes in two classes (Eriani et al. 1990), class I aaRSs with a core parallel β-sheet of 5 β-strands in order 32145 (c.26.1.1) and class II aaRSs with a mixed (parallel and anti-parallel) arrangement of 5 β-strands (d.104.1.1). Both classes of catalytic domains originate very early in the FF timeline (Fig. 1d), class I appearing concurrently with the GTP-binding domain of elongation and initiation factors (c.37.1.8), at ndFF = 0.020, and class II appearing immediately after (ndFF = 0.024). The almost concurrent appearance of domains necessary for aminoacylation and for the formation of ternary complexes with tRNA and other proteins marks the nominal origin of the translation machinery and is quite revealing. Class I and II aaRS domains bind to nucleotides in the minor and major groove side of the acceptor tRNA stem, respectively (Ribas de Pouplana and Schimmel 2001b), and are thought to have arisen in pairs encoded in opposite (i.e., complementary) strands of ancient RNA genomes (Rodin and Ohno 1995; Pham et al. 2007). Modeling suggests aaRSs can bind simultaneously to opposite sides of the tRNA acceptor stems if they are complementary (Ribas de Pouplana and Schimmel 2001b; Terada et al. 2002). Class I and II aaRSs also interact with each other and with elongation factors and “cofactor proteins” to form in some cases multi-aaRS complexes (Robinson et al. 2000; Praetorius-Ibba et al. 2007; Hausmann et al. 2007; Hausmann and Ibba 2008). These complexes assemble around individual aaRSs which act as core scaffolding proteins and enable for example the interaction of aaRS proofreading domains with elongation factors (Hausmann and Ibba 2008). 
A number of proteins with OB and SH3-like fold domains, which are known to interact tightly with RNA (e.g., in the ribosome), are a significant part of the multi-aaRS complexes (Lee et al. 2004). It has been proposed that aaRSs could have served as chaperones to protect the tRNA substrate from destruction by nucleases and phosphate bond-cleaving metal ions (Ribas de Pouplana and Schimmel 2001a). However, interactions in factor and aaRS complexes and vestigial functions in primordial class I and II aaRSs strongly suggest these proteins could have instead interacted to perform non-coded protein biosynthetic functions before the emergence of the genetic code and the ribosomal machinery.

Catalytic domains of class I and II aaRSs have thiol acylation activities, which have been interpreted as vestiges of an ancient thioester world (Jakubowski 1997, 1998, 2000). For example, CoA and tRNA minihelices composed of acceptor and TΨC arms are weakly aminoacylated by IleRS, ValRS, and LysRS, selectively or with relaxed specificity, respectively (Jakubowski 2000). Pantetheine also serves as amino acid acceptor, and both CoA and pantetheine thiol groups bind to an active site that is different from the ATP and tRNA binding sites (Jakubowski 1998). These initial observations suggest that ancestral forms of aaRSs were involved in noncoded thioester-dependent peptide synthesis.

Other reaction chemistries also support the primordial functional versatility of ancient protein structures. aaRSs share with NRPSs, firefly luciferase, and other enzymes that form an acyladenylate intermediate the ability to produce unusual hetero- or homo-dinucleoside oligophosphates in the presence of amino acids (e.g., Goerlich et al. 1982; Dieckmann et al. 2001). Molecules such as Ap4A are signal-transducing molecules with functions that are largely unknown (Kisselev et al. 1998). These side reactions appear to be ancient vestiges of nucleotide interconversion and peptide biosynthetic functions that may have existed during the origin of aaRSs. It is noteworthy that the nucleotidylyltransferase α/β-phosphodiesterase fold superfamily (c.26.1) contains domains important for nucleotide ligation. For example, the adenylyltransferase FF structure (c.26.1.3), which is quite ancient (ndFF = 0.171) but derived compared to the c.26.1.1 class I aaRS FF, embodies enzymes for the synthesis of cofactors such as pantothenic acid, NAD, and FMN (Bork et al. 1995). In particular, the nicotinamide-nucleotide adenylyltransferase (EC 2.7.7.1) catalyzes the synthesis of NAD+ (a 5′–5′ linked dinucleoside diphosphate of the form Np2N) from nicotinamide D-ribonucleotide and ATP. NAD is a cofactor believed to have jumpstarted the ancient RNA world (Yarus 2010). Phosphopantetheine adenylyltransferase (EC 2.7.7.3) catalyzes the penultimate reversible step in the biosynthesis of CoA by transferring an adenylyl group from Mg2+:ATP to 4′-phosphopantetheine to form 3′-dephospho-CoA (dPCoA) and pyrophosphate (Izard 2003). The phosphopantetheine arm of dPCoA binds to the same pocket in distinct conformations while the adenylyl moiety has distinct binding sites. This mimics peptide ligation activities described below. 
The N-terminal domain of the pantothenate synthetase enzyme (EC 6.3.2.1) that synthesizes pantothenate from pantoate, ATP and β-alanine is also a member of the nucleotidylyltransferase superfamily (von Delft et al. 2001) and belongs to the c.26.1.4 pantothenate synthetase (pantoate-β-alanine ligase) FF (ndFF = 0.220). The structure of this enzyme also contains a C-terminal subdomain similar to one of the structural repeats of the creatinase/aminopeptidase FF (d.127.1.1; ndFF = 0.078) and the location of the ATP and pantoate binding sites and the nature of hinge bending leads to a ternary enzyme-pantoate-ATP complex. Other relevant c.26.1.3 cytidylyltransferases play roles in the formation of intermediates in the biosynthesis of lipids and complex carbohydrates by producing activated CDP-alcohols or CMP-acid sugars (e.g., glycerol-3-phosphate cytidylyltransferase; EC 2.7.7.39) and in cofactor biosynthetic pathways (e.g., in de novo biosynthesis and salvage of NAD+ and NADP+; nicotinamide/nicotinate mononucleotide adenylyltransferase; EC 2.7.7.18). The class II aaRS fold also has FFs with functions unrelated to tRNA aminoacylation, such as the biotin holoenzyme synthetase FF (d.104.1.2), which is derived (ndFF = 0.392). Biotin is a cofactor in gluconeogenesis and the metabolisms of fatty acids and leucine and is responsible for carbon dioxide transfer in several carboxylase enzymes, including pyruvate and acetyl-CoA carboxylases (Zempleni et al. 2009). A SerRS distant paralog transfers biotin to a lysine in biotin carrier-proteins (Artymiuk et al. 1994) and is also derived.

Class I and II aaRS catalytic domains can also form peptide bonds. A family of small peptide bond-forming enzymes, together with NRPSs, catalyzes steps in diketopiperazine biosynthetic pathways that are responsible for a broad range of secondary metabolites (Gondry et al. 2009). In particular, cyclodipeptide synthetases (CDPSs) use activated amino acids in the form of aminoacyl-tRNA to catalyze the formation of the diketopiperazine bonds. A recent crystallographic, mutational, and biochemical study reveals that a CDPS from Mycobacterium tuberculosis that catalyzes the synthesis of tyrosine cyclodipeptides from Tyr-tRNATyr has the structure of class I aaRSs (Vetting et al. 2010). The peptide ligation reaction proceeds via a ping pong kinetic mechanism with a unique intermediate produced by an aminoacyl transesterification reaction. The ability to perform non-ribosomal synthesis of peptide bonds is therefore embedded in the c.26.1.1 structure and appears very early in protein domain evolution. A systematic analysis with sequence profile methods identified CDPSs as being derived class I aaRSs, including genes in fungi and animals (Aravind et al. 2010).

Truncated genes that encode single domain aaRSs are quite abundant in genomes. SerRS homologues in Streptomyces viridifaciens, for example, play roles in the synthesis of the antibiotic valanimycin from l-valine and l-serine (Garg et al. 2008). Enzymatic and chemical studies revealed that class II SerRS homologues catalyze the transfer of amino acid residues from seryl-tRNA to a hydroxyl moiety of isobutylhydroxylamine. Even more surprising is the observation that homologues of catalytic domains of SerRSs in methanogenic bacteria display relaxed amino acid specificity and transfer activated amino acids to 4′-phosphopantetheinyl prosthetic groups of carrier proteins (Mocibob et al. 2010). Kinetic analysis confirmed that these class II truncated aaRSs aminoacylate the carrier proteins efficiently but lack tRNA aminoacylation activities and functionally resemble aminoacylation domains of NRPSs.

The early origin of aaRS catalytic domains (Fig. 1) and the growing number of functions that have been identified linked to these enzymes (Minajigi and Francklyn 2008) suggest primordial aaRS FFs were multifunctional. We propose that these proteins used CoA, phosphopantetheine, nucleotides, and oligonucleotides as cofactors for primordial peptide biosynthesis and were ancestors of both thiol-acylating enzymes needed for fatty acid, cofactor, and nonribosomal peptide biosynthesis and modern aaRSs required for ribosomal protein synthesis (Fig. 4). The initial functions of these primordial aaRSs were replaced in some cases by other domain structures in the set of 54 FFs that appeared soon after in the timeline.

Fig. 4

Evolution of protein synthesis. The first two-step acylation reaction developed by primordial aaRSs provides the ability to donate amino acids and other monomer components to a multiplicity of substrates and the capacity of protein biosynthesis. This same chemical scheme was reused in evolution for the synthesis of fatty acids, the assembly line synthesis of peptides, and modern ribosomal protein synthesis. The time periods of these evolutionary developments are indicated.

Appearance of NRPSs and Other Non-Coded Peptide Ligases Before the Emergence of the Ribosome

Template-based ribosomal synthesis of proteins requires two committed steps that are highly specific, the aaRS-mediated loading of amino acids to the tRNA acceptors and the template-mediated proofreading interaction of the ternary complex of aminoacyl-tRNA, elongation factor and GTP with the ribosome (Rodnina and Wintermeyer 2009). While the two steps ensure translation fidelity, GTP hydrolysis-driven ribosomal translocation (which resembles an advanced molecular switch) enables the high processivity levels necessary for the operation of modern cells. Similarly, the non-ribosomal serial assembly of peptides requires two steps, an NRPS A domain-mediated aminoacylation step that tethers reactants as thioesters on phosphopantetheinyl arms of carrier proteins and a C–N bond formation step in condensation (C) domains that enables individual amino acid ligations (Finking and Marahiel 2004). The aminoacylation step resembles that of aaRSs and the condensation step plays a role similar to the peptidyl transferase center (PTC) of the ribosome. While the NRPS modules lack ribosomal proofreading and high-throughput processivity functions, their modular assembly-line strategy enables the synthesis of peptides from hundreds of different building blocks, with each NRPS module specializing in individual sets of chemistries. NRPS-synthesized peptides are cyclic or branched structures, contain atypical amino acids such as d-amino acids, carry N-methyl or N-formyl modifications, or can be glycosylated or acylated (Marahiel 2009). They have a broad range of biological activities. These peptides act as toxins, siderophores, pigments, antibiotics, cytostatics, or immunosuppressants. In turn, only the 20–22 standard amino acids are incorporated in ribosomal protein synthesis.

The combinatorial versatility of NRPS modules is encoded in the residues that line the amino acid binding pocket of the A-domain (Stachelhaus et al. 1999). In silico studies and structure–function mutagenesis confirmed the existence of general amino acid sequence-based rules for substrate recognition and selectivity that resemble those of the genetic code. Each module uses the same structural make-up but enables different chemistries through functionally flexible active sites. This functional diversity mimics the wide range of aminoacylating and condensation functions that are embodied by class I and II aaRS catalytic domains.

While the acetyl-CoA synthetase-like FF (e.23.1.1) that makes up the NRPS A-domains is probably a structural and functional derivative of acylating domains of aaRSs, the domain appears much earlier (ndFF = 0.045) than the cold shock DNA-binding domain-like FF (b.40.4.5; ndFF = 0.114) of the first ribosomal protein (Fig. 1), which signals the emergence of the ribosomal ensemble (Harish and Caetano-Anollés 2011). The NRPS modules are also composed of C-domains (CoA-dependent acyltransferase FF, c.43.1.2; ndFF = 0.567) that catalyze peptide bond formation, peptidyl carrier protein (PCP) (acyl carrier protein-like FF, a.28.1.1; ndFF = 0.424) acceptors that load amino acids, and termination modules (thioesterase domain of polypeptide, polyketide and fatty acid synthetases FF, c.69.1.22; ndFF = 0.580) that disconnect the assembly line. The C-domain and carrier proteins that facilitate non-coded peptide bond formation appeared, however, later than the PTC (Harish and Caetano-Anollés 2011) at ndFF = 0.253 (d.66.1.2). The origin of modern assembly-line peptide synthesis is therefore older than the origin of processive ribosomal protein synthesis but its evolution is quite protracted (Fig. 4).
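The ordering argument in the paragraph above rests only on comparing ndFF values, which a few lines of code make explicit. The sketch below uses the identifiers and ndFF values quoted in that paragraph; the short role labels and the variable names are ours and purely illustrative.

```python
# (role label, SCOP FF identifier, ndFF) as quoted in the paragraph above.
# Role labels are shorthand introduced for this illustration.
DOMAINS = [
    ("NRPS C-domain", "c.43.1.2", 0.567),
    ("NRPS A-domain", "e.23.1.1", 0.045),
    ("NRPS thioesterase (termination)", "c.69.1.22", 0.580),
    ("PTC-linked FF", "d.66.1.2", 0.253),
    ("first ribosomal protein", "b.40.4.5", 0.114),
    ("NRPS carrier protein (PCP)", "a.28.1.1", 0.424),
]

# Smaller ndFF means an earlier appearance in the timeline.
timeline = sorted(DOMAINS, key=lambda entry: entry[2])

if __name__ == "__main__":
    for role, scop_id, nd in timeline:
        print(f"{nd:5.3f}  {scop_id:10s} {role}")
```

Sorted this way, the A-domain comes first, the ribosomal protein and PTC-linked FFs follow, and the remaining NRPS components appear last, which is the sense in which assembly-line synthesis originates early but evolves in a protracted manner.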

Non-ribosomal peptide synthesis unlinked to assembly lines is also invoked in diverse cellular processes (Minajigi and Francklyn 2008; Iyer et al. 2009). A number of peptide ligases harbor folds related to the ATP-grasp, and the most ancient are present in purine metabolism (Iyer et al. 2009). These enzymes include for example the BC ATP-binding domain-like (c.142.1.2; ndFF = 0.061), the SAICAR synthase (c.143.1.1; ndFF = 0.188), and the ATP-binding domain of peptide synthases (c.142.1.2; ndFF = 0.285) FFs. Some of these biosynthetic activities originated before ribosomal protein synthesis, but none before the appearance of aaRS-linked peptide synthetases. Other peptide ligase activities are clearly derived. For example, the peptidoglycan polymer of the bacterial cell envelope is held together by peptide bridges that are assembled by the tRNA-mediated action of peptidyl transferases of the acyl-CoA N-acyltransferases (Nat) superfamily (d.108.1). Examples include the FemXAB non-ribosomal peptidyl transferases (d.108.1.4; ndFF = 0.557). Aminoacyl-tRNA also mediates the N-terminal tagging of proteins for degradation that is necessary to control protein turnover by L/F and R-transferases with the d.108.1 structure.

In summary, structures of class I and II aaRS catalytic domains provide the structural make-up and functional versatility needed for primordial acylation and peptide synthesis, features that later appear recurrently in protein evolution, most notably in NRPS assembly lines and ribosomal protein synthesis. Remarkably, the patterns of domain discovery we see in our timelines support the proposed progression from multifunctionality and poor specificity of ancient macromolecules to the higher specificities and efficiencies of modern biochemistry (Ycas 1974; Kacser and Beeby 1984).

Gradual Development of Macromolecular Movement, Emergence of Switches and Impact of Domain Recruitment

Modern proteins and protein complexes fold into conformations that are essential for their function. However, not all biological processes run continuously in the cell, and proteins often exist in different conformations that reflect different functional states. Conformational changes are sometimes induced by ligands that act as allosteric activators or inhibitors, blocking or facilitating access of small molecules, metal ions, cofactors, proteins, and nucleic acids to pockets in their structures. Proteins can also toggle between “on” and “off” states and thus behave as molecular switches. Canonical molecular switches involve conformational changes driven by nucleotide triphosphate binding and hydrolysis.

As described above, the most ancient proteins in the timeline are nucleotide-binding proteins, transmembrane P-loop ATPases (c.37.1.12, c.37.1.20), NADP-binding oxidoreductases (c.2.1.4), actin/Hsp70 ATPases (c.55.1.1), and GTPases (c.37.1.8), in that order. Collectively, these proteins play important roles in energy interconversion, protein folding, and metabolism. GTPases, however, are true switches that translocate proteins through membranes or operate in protein biosynthesis, signal transduction, and transport of vesicles within the cell. GTPases appear concurrently with aaRSs (c.26.1.1 and d.104.1.1) in the timeline, which are the first protein structures known to interact with RNA. We note that the GTP-binding domain of elongation and initiation factors (c.37.1.8) is necessary for the formation of ternary complexes with rRNA and exchange factors that load aminoacylated tRNA onto the ribosome (Rodnina and Wintermeyer 2009). These factors act as conformational switches that clamp tRNA via a “switch helix” that is capable of moving almost 40 Å. However, tRNA binds to the elongation factor domain (b.43.3.1), which was recruited later in evolution (ndFF = 0.073) (Fig. 1d), suggesting the initial function of c.37.1.8 was probably unrelated to nucleic acid transport and instead linked to metabolism and cofactor interactions (Caetano-Anollés et al. 2011).

The magnitude of change between conformations appears to increase gradually in early protein evolution, materializing fully in the unique large-scale rotation of the A-domain that is typical of NRPS megaenzymes (Tanovic et al. 2008; Gulick 2009). Since the vast majority of macromolecular motions are driven by hinge bending mechanisms (Flores et al. 2006), we predicted the existence of flexible hinges directly from atomic coordinates of single molecular conformations in selected FFs using FlexOracle (Flores and Gerstein 2007). The magnitude of predicted movement in these switches increases in evolution of the first FFs, as we illustrate with four representative structures (Fig. S1), showing P-loop ATPase-like and aaRSs-like structures are quite rigid, while periplasmic binding protein and NRPS A domains exhibit clear hinge mechanisms established between two sizable rigid regions in the molecules and large intramolecular movements. The accretion of domains following the discovery of the first 11 FFs results in hinges with even larger movements.

Cofactors in the Early Evolution of Proteins

We examined the abundance of associations between cofactors and the 54 most ancient FFs. For a pair of a ligand and a FF, the abundance was defined by the number of different PDB chains of the FF bound to the cofactor. All known associations between cofactors and PDB chains were retrieved from PROCOGNATE (Bashton et al. 2008), a database that contains 26,108 PDB chains, 2,329 FFs, and 4,646 cofactors. We then chose the relationships of cofactors and PDB chains that are verified in vivo, prepared a dataset called COGNATE that consists of 2,000 PDB chains and 565 cofactors, and calculated the abundance of associations for every pair of cofactors and the 54 FFs. The large number of cofactors is difficult to visualize. We therefore selected major cofactors according to biochemical knowledge and published databases. The values for the abundance of the associations were normalized to a 0–1 scale and displayed in the form of a heatmap (Fig. 5). Nine out of the 54 most ancient FFs (c.55.1.1, d.14.1.1, c.37.1.19, b.43.3.1, c.117.1.1, d.122.1.2, b.40.4.5, b.44.1.1 arranged by nd values) are absent in the dataset COGNATE. To cover these missing FFs, we also analyzed all of the associations for the 54 FFs regardless of experimental validation. The new dataset consisting of 3,788 PDB chains and 784 cofactors was analyzed and displayed in the same way (Fig. S2). A clear progression of cofactor recruitment appears in the timeline. The first two FFs, c.37.1.12 and c.37.1.20, recruit ATP and ADP, the third most ancient FF c.2.1.2 recruits a host of cofactors including NAD, NADP, NADPH, and nicotinamide derivatives, adenosine, guanosine and cytidine derivatives, biopterin and its derivatives, FMN, the heme group, and CoA-related cofactors, and c.37.1.8 recruits GTP. More derived FFs recruit pyridoxal-5′-phosphate and derivatives (c.67.1.4), CoA and acetyl-CoA (c.95.1.1), and flavin-related cofactors (c.1.4.1), in that order.
Clearly, ATP/ADP are the most ancient cofactors used by proteins in the protein world, as was intimated in previous studies (Ji et al. 2007). The ubiquitous NAD family of redox cofactors appears also to be very ancient, and as we discuss below may have been a crucial evolutionary development.
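The abundance measure described above (the number of different PDB chains of a FF bound to a cofactor, rescaled to 0–1) can be illustrated with a minimal Python sketch. The record layout, the function names, and the chain identifiers are hypothetical, and dividing by the largest count is only one plausible reading of the normalization, which the text does not specify.

```python
from collections import defaultdict

def association_abundance(records):
    """Count distinct PDB chains for each (FF, cofactor) pair.

    records: iterable of (pdb_chain, ff, cofactor) tuples.
    This layout is hypothetical, not the PROCOGNATE schema.
    """
    chains = defaultdict(set)
    for pdb_chain, ff, cofactor in records:
        chains[(ff, cofactor)].add(pdb_chain)  # a set ignores repeated chains
    return {pair: len(members) for pair, members in chains.items()}

def normalize(abundance):
    """Rescale counts to 0-1 by dividing by the largest count (an assumption)."""
    if not abundance:
        return {}
    top = max(abundance.values())
    return {pair: count / top for pair, count in abundance.items()}

# Toy records; the chain identifiers are made up for illustration only.
records = [
    ("1abc_A", "c.37.1.12", "ATP"),
    ("1abc_B", "c.37.1.12", "ATP"),
    ("1abc_A", "c.37.1.12", "ATP"),  # duplicate chain, counted once
    ("2xyz_A", "c.37.1.12", "ADP"),
    ("3def_A", "c.37.1.8", "GTP"),
]
abundance = association_abundance(records)
heat = normalize(abundance)
```

Pairs that never occur in the records are simply absent from the dictionary, which corresponds to the gray cells of the heatmap.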

Fig. 5

The evolutionary appearance of protein cofactors. The heatmap shows the quantitative association of cofactors with ancient FFs based on experimentally verified relationships between cofactors and FFs (COGNATE dataset; for a detailed definition see “Materials and Methods”). The 54 most ancient FFs are arranged by their nd values (e.g., the most ancient FF = c.37.1.12; the most recent FF = d.14.1.1). Cofactors are ordered according to their first time appearances in the timeline according to the nd values of the FFs (left: appearing earlier; right: appearing later). The values in cells indicate the numbers of PDB chains of FFs bound to the cofactors, normalized to a scale 0–1, and are visualized in a scale of color intensity. Stronger associations between a cofactor and its FF have hues that are closer to red. The complete absence of the association is labeled in gray

Emergence of Protein Structure

The existence of RNA before proteins has been seriously questioned on a number of important grounds (Kurland 2010). This view has been supported by data from genomics, protein and RNA structure, and gene ontology (Caetano-Anollés et al. 2011; Kim and Caetano-Anollés 2010; Harish and Caetano-Anollés 2011) and by the timelines we here report. The emergence of life is therefore tightly linked to the emergence of protein structure. Kauffman (1986, 1993, 2007), inspired by Dyson (1982), used the connectivity properties of random directed graphs to propose that sets of random polypeptides of different size that can convert into others by hydrolysis and condensation reactions would at any fixed probability catalyze the condensation of larger products in the set and that this process would be autocatalytic. The crucial suggestion was that the emergence of self-replicating peptide systems expresses the self-organizing collective property of critically complex systems. This theoretical formulation stands on good grounds (despite critique; Orgel 2008). The synthesis and degradation of short peptides is possible in the laboratory under different prebiotic environments. For example, amino acids and peptides can mutually catalyze peptide bond formation and clay minerals are known to favor the synthesis of longer peptides and reduce their hydrolysis (Rode 2007). Dipeptides and short protein motifs can also self-assemble to form amyloid-like structures, microfibrils, nanotubules, and even vesicles (e.g., Vauthey et al. 2002). Even proteinoid microspheres that are easily generated by repeated cycles of heating and desiccation exhibit in the laboratory a wide spectrum of weak catalytic activities (Fox 1980). Recent studies have shown that short synthetic peptides made up of assemblies of α-helices that wrap around each other with a slight left-handed super-helical twist (coiled coils) catalyze the ligation of short peptide fragments with high sequence and stereoselectivity (Lee et al. 1996; Severin et al. 1997). These replicating peptides generate networks of ligation reactions with clear patterns of self-organization (Ashkenasy et al. 2004). Thus, it appears plausible that short random peptides perhaps composed of a limited set of amino acid monomers with atypical and diverse chemistries could have quickly gained structural properties. We posit that peptides with primordial folds would have been functionally advantageous for emerging protocells in a prebiotic world.

Proteins in their native state are compactly folded, with backbones spanning a tube with a radius of about 2.7 Å and a ratio of tube thickness to range of attractive interactions close to one (Banavar and Maritan 2007). These structural confines define a marginally compact phase that in the context of magnetic systems exhibits a behavior similar to spin glasses (a frustrated system resulting in a multiplicity of ground states) and lies somewhere between crystal, semicrystal, and liquid phases of matter. Within this physical framework, the folding landscape is not determined by amino acid sequence. Instead, a finite set of fold structures emerges from geometrical constraints of folding (Hoang et al. 2004). In fact, all fold structures in the PDB can be reproduced by packing hydrogen-bonded secondary structural elements of a protein homopolymer in silico (Zhang et al. 2006). More recently, an atomistic simulation of a 60 amino acid homopolymer revealed that only a small fraction of simulated folds manifest in real fold structures (Cossio et al. 2010). However, these folds tend to have low contact order (average sequence distance between residues that form native contacts), possibly to reduce the entanglement of bundles by preferring α-helices and shorter loops between contacting residues. This pattern can be seen in the length and distribution of secondary structure elements in the structures of the PDB (Caetano-Anollés et al. 2009a). Thus, structure materializes fully in folding space regardless of sequence but with strong evolutionary bias. The common α/β/α-layered architecture of the primordial folds of our timelines and its possible origin from α-helical bundles provides phylogenetic support to the existence of this evolutionary pattern from the start.
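Contact order, as defined above (the average sequence distance between residues that form native contacts), can be made concrete with a short Python sketch. The distance cutoff of 8 Å between residue coordinates, the minimum sequence separation, and the function name are illustrative assumptions and are not taken from the cited studies.

```python
import math

def contact_order(coords, cutoff=8.0, min_sep=1):
    """Average sequence separation |i - j| over residue pairs in contact.

    coords: one (x, y, z) point per residue, e.g. C-alpha positions.
    cutoff and min_sep are assumed defaults, not values from the source.
    """
    total = 0
    n_contacts = 0
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total += j - i
                n_contacts += 1
    return total / n_contacts if n_contacts else 0.0

# Toy chain: four residues on a straight line, 3.8 Å apart (made-up geometry).
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
co_all = contact_order(chain)             # includes adjacent residues
co_nonlocal = contact_order(chain, min_sep=2)  # skips adjacent residues
```

An extended chain of this kind only makes contacts between residues that are close in sequence, so its contact order is low; a fold that brings distant residues together would raise the value.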

Simulations show polypeptides in sequence space gain unique fold conformations in few generations when forced to adopt an active site configuration that would benefit catalysis or ligand binding (Yomo et al. 1999). The prediction that functional selection induces gradual development of global structures has materialized in a number of experiments. Proteins from large random peptide libraries when subjected to a selective pressure become structured for the selected function (Martinez et al. 1997; Keefe and Szostak 2001; Seelig and Szostak 2007). In de novo protein evolution experiments, eight rounds of selection produced for example four families of independent protein sequences that bound ATP (Keefe and Szostak 2001). One of these families was optimized for improved binding affinity, and one of its clones (clone 18–19, which differed from the consensus sequence at 15 out of 80 amino acids) bound ATP with high specificity and affinity. The structure of the 18–19 “artificial nucleotide binding protein” (ANBP) was solved by crystallography (1UW1) and was found to adopt an undescribed Zn-binding α/β-fold that bound Zn to four cysteine residues (function-directed selections FF k.42.1.1; Lo Surdo et al. 2004). ANBP shared structural features with extant proteins, including a nucleotide binding pocket similar to those found in ATP-binding proteins and a Zn-binding site resembling the typical treble clef Zn binding motif (Krishna and Grishin 2004). Additional in vitro selection experiments designed to improve folding stability of ANBP revealed that surface residues are as important as those buried in the structure (Smith et al. 2007). The fine-tuned protein (2P09) retained the original structure but folded more stably (Fig. 6a). Expression of the protein in Escherichia coli inhibited cell division and disrupted the energetic balance of the cell by altering intracellular ATP (Stomel et al. 2009). The engineered protein not only folded stably but also exhibited physiological roles in vivo.

Fig. 6

Structural origins of fold structure. a Structural alignment of the in vitro evolved 18–19 “artificial nucleotide binding protein” (ANBP) selected to bind ATP with high specificity and affinity (1UW1; structure colored in green) to a structural derivative selected to improve folding stability (2P09; blue). Increases in protein stability do not change the original fold structure. A total of five secondary structure elements and 67 amino acid residues (all in the structure) were aligned with an RMSD of 0.48 Å. b GANGSTA+ structural comparison of the 2P09 structural ANBP derivative to protein folds in the ASTRAL40 compendium. The diagram correlates the number of aligned residues with individual RMSD for structure alignments. All results with more than 20 aligned residues are displayed and the dashed line delimits structures with more than 70% of residues being aligned. Alignments described in c are labeled. c The evolutionary age of structural alignments with more than 70% amino acid residues matching the 2P09 structure. d Structural models of the oldest and best (GlyRS with d.104.1.1 FF) and the oldest (CDC-like, c.37.1.20) structural alignments of 2P09 (blue) to target structures (green)

Since ANBP proteins were selected for their ATP binding functions, we explored if they shared structural features with any of our primordial FF structures. Given that selection was performed in an aqueous environment, our expectation was that possible structural analogies would not be linked to membrane proteins, including the most ancient ABC transporter c.37.1.12 structure. A non-sequential structural alignment of ANBP against all PDB entries of the ASTRAL compendium using a modern algorithmic implementation (Guerler and Knapp 2008) revealed a total of 9,521 aligned entries, 163 of which had more than 70% amino acid residues matching its structure (Fig. 6b). The average age of these structures [ndFF = 0.52 ± 0.28 (SD)] and spread of their age in the timeline suggests structural appearance of the ancient ANB motif is not biased in evolution and is recurrent (Fig. 6c). Structures also exhibit an average root mean square deviation (RMSD) of superimposed protein backbones of 3.16 ± 0.24 Å, which is not far from the average of the total set. A set of only 14 structures matched the 54 most ancient FFs, the most ancient aligning to c.37.1.20 AAA-ATPases, the second most ancient to c.37.1.8 GTPases, and the next 5 most ancient structures aligning to d.104.1.1 class II aaRS catalytic domains (Table S3). The most ancient match is to the N-terminal domain of CDC6-like protein APE0152, an AAA-ATPase of the CDC6/ORC family (PDB entry 1W5S) of cell division cycle loading proteins (RMSD = 3.41 Å, aligned residues = 49) and the most ancient and perfect 3D match is to glycyl-tRNA synthetase (GlyRS) (1ATI), a class II aaRS believed to be very ancient (RMSD = 2.83 Å; aligned residues = 49) (Fig. 6d). However, the most extended match is to aspartyl-tRNA synthetase (AspRS; 1B8A) (RMSD = 3.05 Å; aligned residues = 52). 
The structural alignment of ANBP to the second most ancient c.37.1.20 FF structures showed a good 3D match of the β-strands and the α-helix spanning the ATP binding site to the core component of c.37.1.20 (Fig. 6d). As expected, no match was revealed with c.37.1.12. These results suggest that searches for ATP-binding proteins in sequence space converge to the common α/β/α layered structure typical of primordial folds and that generation of canalized functional structure is more common than anticipated.
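The filtering and summary statistics used in this comparison (keeping alignments that cover more than 70% of the query and reporting a mean ± SD) can be sketched in Python as follows. The query length of 67 residues is taken from the Fig. 6 caption as an assumption about the denominator, the three RMSD and aligned-residue values are those quoted in the text, and the fourth hit is invented solely to show a rejected alignment. The sample standard deviation is also an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

QUERY_LENGTH = 67     # residues in the 2P09 structure (assumed from the Fig. 6 caption)
MIN_COVERAGE = 0.70   # the "more than 70% of residues aligned" threshold

# (PDB entry, FF, aligned residues, RMSD in Å)
hits = [
    ("1W5S", "c.37.1.20", 49, 3.41),
    ("1ATI", "d.104.1.1", 49, 2.83),
    ("1B8A", "d.104.1.1", 52, 3.05),
    ("XXXX", "hypothetical", 30, 2.10),  # made-up hit below the threshold
]

def well_covered(hits, query_length=QUERY_LENGTH, min_coverage=MIN_COVERAGE):
    """Keep hits whose aligned residues exceed the coverage threshold."""
    return [h for h in hits if h[2] / query_length > min_coverage]

kept = well_covered(hits)
rmsd_values = [rmsd for _, _, _, rmsd in kept]
rmsd_mean = mean(rmsd_values)
rmsd_sd = stdev(rmsd_values)  # sample SD; needs at least two values
print(f"{len(kept)} hits kept, RMSD = {rmsd_mean:.2f} ± {rmsd_sd:.2f} Å")
```

The same pattern applies to the ndFF values of the matched structures once an age is attached to each hit.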

The folding of proteins should be considered a global optimization process in which molecular conformations transition from disorder to order to achieve their native state. This occurs through a complex interplay of cooperative molecular interactions that enable folding and refolding with speed and efficiency (Onuchic and Wolynes 2004; Dill et al. 2008). Such a multidimensional energy landscape provides a statistical view of the energetics of protein conformations while it is being culled by evolution through natural selection and self-organization. Experimental and theoretical considerations suggest protein evolution is driven by a frustrated interplay of stability and function (Caetano-Anollés and Mittenthal 2010). The function of molecules influences the fitness of an organismal system (the net rate of reproduction, birth rate minus death rate) while molecular stability measures the tendency of molecular components to maintain their minimum free energy structure after a perturbation. Stability to mutational, thermodynamic or kinetic perturbations involves maintaining the minimum free energy structure despite the effects of mutations, environmental conditions, or tendencies to misfold (Fontana 2002). Optimization within this evolutionary, thermodynamic and kinetic landscape is complex (Babajide et al. 2001) but reveals the existence of: (i) “neutral networks” in which adaptive walks in sequence space produce the same fold structure despite molecules lacking significant sequence similarity, and (ii) “shape space covering,” in which sequences differing in few amino acid residues fold into different structures and all structures materialize within relatively few mutational changes. The consequence of these properties is “structural canalization” (Fontana 2002), the process by which evolved molecules lock into target structures and functions while losing conformational diversity and plasticity.
It is noteworthy that these properties have been made explicit in peptide in vitro evolution experiments (Martinez et al. 1997; Keefe and Szostak 2001; Seelig and Szostak 2007).

In our models of emergence of modern biochemistry, we invoke structural canalization as a force that molds primordial proteins with multiple conformations into fold structures that are evolutionarily, thermodynamically, and kinetically stable (Fig. 7). Once these structures materialize in evolution, the interplay of function and stability enables change within the evolving structural lineage. The discovery of ANBP and its subsequent evolution by selection embodies change within an in vitro evolution setting. However, the structural link that exists between ANBP and ancient globular structures typical of ancient AAA and aaRS globular proteins suggests structural canalization associated with the most ancient cofactor (ATP) materializes into the α/β/α-layered fold quickly and efficiently.

Fig. 7

Diagram describing structural canalization and the evolution of proteins. Primordial proteins with multiple conformations are molded in evolution into fold structures that are evolutionarily, thermodynamically, and kinetically more stable. Each shape in the diagram represents an ensemble of allowed conformations given a constant external environment and the intensity of its shade represents the average time a molecule spends in conformations. Red shades represent highly canalized structures with well-defined thermodynamic funnels. The diagram shows that evolution results in less diverse and more stable ensembles that sometimes associate with cofactors (black circles) to fulfill functions

Phylogenomic-Grounded Models

Les perceptions remarquables viennent par degrés de celles qui sont trop petites pour être remarquées [Noticeable perceptions arise by degrees from those that are too small to be noticed] (Leibniz 1923)

Ancient biological systems must emerge gradually through physical and chemical mechanisms that are plausible. A number of experimental achievements provide insights into the emergence of cellular, catalytic, and replicative functions in possible origins of life scenarios. Simple membranes have been shown in the laboratory to spontaneously self-assemble, grow and divide (Hanczyc et al. 2003; Chen et al. 2004). In vitro evolution studies (Ellington et al. 2009) reveal that RNA molecules can serve as catalysts (e.g., Robertson and Scott 2007) and as replicators (Lincoln and Joyce 2009) and proteins in random libraries can be selected to gain structure and function (Keefe and Szostak 2001). Autocatalytic processes of molecular recognition develop in simplified systems of organic molecules (Vidonne and Philp 2009), including autocatalytic hydrogenations and nucleophilic additions to unsaturated aldehydes (Kamioka et al. 2010).

While molecular experimentation of this kind can test the likelihood of primordial systems, biological information in the sequences and structures of macromolecules holds the clues of molecular history and can be mined for patterns of ancient and recent descent that were left behind in genomes as relics of ancient times (Caetano-Anollés et al. 2009a). In this study, we explore features in protein structures and functions linked to the most ancient protein domains that were identified at the base of a rooted phylogenomic tree of FF structures (Caetano-Anollés et al. 2011). The tree defines an explicit statement of history and a timeline of structural change. This timeline delineates a detailed historical account of protein domain discovery that we here use to propose models for the origin of proteins, protein synthesis, cofactors, and RNA.

A Model of Emergence of Primordial Proteins in Cellular Membranes

The infall of meteorites and comets during Earth’s late accretion period probably brought large amounts of amphiphilic molecules to the planetary surface, especially polycyclic aromatic hydrocarbons. Meteorites carrying carbonaceous chondrite amphiphiles still impact Earth sporadically. For example, the Murchison meteorite of 1969 was rich in amphiphilic lipids and amino acids (Deamer 1997; Engel and Macko 1997). These extraterrestrial lipids have been shown to form liposomes spontaneously in water (Deamer 1997). Other scenarios of origin of amphiphile molecules and amino acids are possible, including their prebiotic synthesis (Morowitz 1999). We therefore assume that lipids were available for the formation of cellular containers, either as single or multi-layer vesicles or as lining of other container systems [e.g., iron sulfide membrane compartments (Martin and Russell 2007); proteinoid microspheres (Fox 1980); peptide vesicles (Vauthey et al. 2002)], and that amino acids were abundant for pre-biotic peptide synthesis. The synthesis of macromolecules requires the formation of phosphate activated intermediates capable of providing the 2–4 kcal/mol or the ~5 kcal/mol needed for peptide or phosphodiester bond formation, respectively. The energy to make dehydration condensates is substantial and is assumed to derive from thioester and pyrophosphate intermediates that exploited the abundant redox energy of early Earth (e.g., generated by UV excitation of Fe electrons).

Clays and other mineral deposits could have also helped delimit cellular compartments. For example, montmorillonite clays accelerate the conversion of fatty acid micelles into vesicles which can trap clay particles (and perhaps autocatalytic pre-biotic systems) and act as protocells (Hanczyc et al. 2003). It is quite possible that early protocells used amphiphiles with fatty acid chains smaller than the 16–18 carbons that are typical of modern cells. However, short amphiphiles increase ion permeability many fold (Paula et al. 1996), allowing unfacilitated entry of solutes and inducing liposome instabilities. While it is likely that increases in membrane thickness offered an important selective advantage to primordial cells by avoiding lysis, other molecules embedded in the membranous structures (e.g., peptides, polyphosphates, polyhydroxybutyrates) could have facilitated the growth and persistence of protocellular structures. For example, short peptides similar to natural peptide antibiotics that insert spontaneously into lipid bilayers (e.g., alamethicin from fungi, melittins from bees, magainins from frogs, and bacterial bacteriocins and colicins) would have enhanced selective retention of crucial solutes (metals, cofactors, and pre-biotic building blocks) and enabled membranes to be formed with longer fatty acid chains. In particular, small non-polar membrane-spanning peptides could have served as mechanosensitive channels and osmotic valves, helping protocells tolerate environmental dilutions and avoiding lytic membrane disruptions (Morris 2002). The selective advantage of these two features would have crucially enhanced protocell growth and persistence in the prebiotic environment.

We propose that persistent protocells with osmotic safety valves that were capable of chemical transport provided a stable and permanently expanding container for Kauffman–Dyson’s self-replicating peptides (Fig. 8). These systems were weakly catalytic but had the capacity to increase the size of their peptide components and enhance both protocell persistence and division. The protocell peptides were probably made of few monomer chemistries, were short and more or less statistical, folded in multiple marginally stable conformations, and had average amino acid compositions of varying polarity and charge. Non-polar peptides with largely hydrophobic side chains in the emerging peptide population were spontaneously incorporated in the expanding membranes if they enhanced protocell persistence, but were diluted by membrane growth and protocell division. Thus, membranes would act effectively as natural “chemostat” bioreactors that would enable the continuous selection of those integral peptides that increased protocell fitness and at the same time would bias the self-replicating enzymatic system (Fig. 8a). The mechanism of incorporation into membranes would be similar to that of synthetic peptides, natural peptide antibiotics, or C-terminal tail-anchored proteins harboring single transmembrane segments, which is facile (Renthal 2010). The integral peptides would tend to self-aggregate into α-helical bundles to form more efficient osmotic safety valves (Fig. 8b), following the experimentally supported multi-step aggregation model of modern protein structures (Popot and Engelman 1990, 2000; Engelman et al. 2003). Ligation of peptide ends exposed to the intracellular environment could have enhanced the stability of integral peptides, diminishing intramembrane diffusion, and enhancing their carrier and channel-forming properties.
These α-helical bundles, the early ancestors of the subdomain bundle of the P-loop containing c.37.1.12 FF, adopted a wide variety of conformations but in time were structurally canalized by the chemical and physical environment of the membrane into those that enhanced protocell fitness. Peptide ligations coupled to novel polytopical peptide insertions would have generated diversity in folding conformations, producing subdomain protrusions into the intracellular aqueous environment (Fig. 8b). These subdomains were advantageous because they were more stable than other peptides and could gain catalytic functions while being anchored into membranes, facilitating the carrier and channel-forming properties of the transmembrane subdomain. We propose that the exposure of these membrane-tethered subdomains to the intracellular environment generated the α/β/α-layered structural design typical of ancient proteins, which appeared recurrently in the self-replicating emerging system and we see materialize repeatedly in our timeline. Again, physical and chemical constraints imposed by the membrane-intracellular protocell environment facilitated structural canalization of the c.37.1.12 ABC transporter structure. Finally, with the rise of the primordial structure of ABC transporters came the emergence of the first advanced catalytic function, the hydrolysis of ATP, and the first “internalization” of a prebiotic function into the emerging canalized protein system, that of energy interconversion. We consider coenzymatic ligands such as ATP (usually in coordination with heavy atoms such as Mg and metals) are the true catalytic centers. The protein scaffold provides allosteric regulation and specificity and delimits the overall kinetics. 
Thus, ATP and ADP became the first cofactors of modern biochemical catalysis, the Walker loop residues that interact with the phosphate moiety of the cofactor the first catalytic pocket in protein evolution, and the one-way “ping pong” pumping mechanisms driven by ATP the most ancient kinetics. We note that adenine recognition in these primordial pockets is responsible for the ancient adenylate-binding common motif present in ATP-, CoA-, NAD-, NADP- and FAD-dependent proteins (Denessiouk et al. 2001) and embodied in the very first FFs of the timeline (Fig. 5).

Fig. 8

The origin of structured proteins. a Protocells behave as bioreactors. The illustration describes how vesicle-entrapped peptides enhance protocell persistence and spread successfully through protocell populations when they offer improvements to membrane stability. Vesicles behave as liposomes and regulate surface area through thermal energy, sometimes increasing their surface by accretion of amphiphilic hydrocarbons (feature 1) or sometimes reducing it by shedding vesicles (2). Vesicle-entrapped peptides have higher probability of generating novelties. For example, peptides capable of inserting into membranes can enhance membrane stability (feature 2). These “adaptive” peptides are colored in red. However, protocell integrity can be challenged by environmental fluctuations. In certain cases, vesicles rupture, returning adaptive and non-adaptive membrane and cellular components to the surrounding environment (feature 3). Vesicles with membranes enriched with adaptive peptides, however, are more stable and tend to persist for longer times (feature 4). Fusion of vesicles is thermodynamically favored and can produce for example flaccid vesicles with enhanced protoplasmic content and complexity (feature 5). This enhances the ability to generate more adaptive peptides and homogenizes protocell populations. The input and output of cellular components (amphiphiles, proteins, etc.) make protocells living bioreactors that retain adaptive (feature 6) and shed non-adaptive (feature 7) components. This results in selective sweeps that preserve useful novelties (feature 8). b Ten steps describe the very early evolution of membrane and globular proteins. In the diagram we illustrate the model with a double membrane; primordial single-layered membranes are more likely. In step 1, small peptides with properties analogous to modern peptide antibiotics that were trapped in micelles or vesicles enter into primordial membranes without help and adopt α-helical structures.
In step 2, peptides associate in groups, increasing membrane stability and vesicle persistence. Vesicles become bioreactors and peptides help recruit fatty acid compounds with longer chains that enhance membrane stability. In step 3, trapped peptides act as Kauffman–Dyson’s replicators and bias peptide make-up and function, enhancing the persistence of the emerging protocells. In step 4, longer peptides associate to form bundles of α-helices and act as rudimentary carriers and channels. These more complex and adaptive structures facilitate movement of ions and molecules across the membranes and provide additional selective advantages. In step 5, protrusions of transmembrane bundles into the protoplasm become structurally canalized and give rise to the first protein fold structure (c.37.1.12). This step improves the carrier and channel properties of the primordial peptides. The structure exposed to the protoplasm now impinges on Kauffman’s replicators. In step 6, interaction of c.37.1.12 with cofactors results in the first fold-mediated catalytic property (energy interconversion), which also facilitates carrier and channel functions. Interactions of c.37.1.12 surface membrane structures further stabilize transmembrane bundle domains. In step 7, the selective advantage offered by structured proteins to the catalytic properties of Kauffman’s replicators curbs the structure of c.37.1.12 to become less and less attached to the protocell membrane and more involved in protoplasmic functions. In step 8, some structures become globular and help chaperone peptide insertions into membranes (the first polytopic proteins). These c.37.1.19 and c.37.1.20 AAA structures provide opportunities to actively control transmembrane protein content and function. In step 9, the c.2.1.2 SDR structures become globular and enhance the catalytic activities of a more mature protoplasm, the cytoplasm. 
Finally, in step 10, transmembrane domains interact with mature c.37.1.12 membrane surface domains to produce more efficient catalytic carriers and channels

We also propose that membrane-tethered subdomains formed the first globular structures with structurally canalized functions directly important for protocell persistence. The first structures of this group were the primordial c.37.1.20 and c.37.1.19 AAA-ATPases, the first intracellular globular protein structures. These peptides had the ability to interact with each other and with the ABC transporters either to self-assemble into primordial protein complexes or to facilitate protein integration into membranes. Their ability to establish protein interactions resulted in primordial chaperone and mechanoenzymatic functions that enhanced protein folding and incorporation of polytopic proteins into membranes. This ability is expressed today in the typical functions of modern AAA-ATPases, including the capacity to act as molecular crowbars and motors. The initial evolutionary separation of globular and integral proteins was followed by a second group of structures embodied in primordial c.55.1.1 actin/Hsp70 ATPases capable of assembly of more advanced structural scaffolds than those possible with the AAA domain structures and a third group of c.37.1.8 GTPases capable of acting as molecular switches. These ATPases were the early ancestors of cytoskeleton proteins such as the actin and actin-like prokaryotic homologues that make up cellular microfilaments. The ancient cytoskeletal proteins would have again enhanced protocell integrity and would have ensured survival against environmental change. The GTPases were also the ancestors of the catalytic components of translation factors needed for molecular shepherding. These structures turned into important molecular switches by domain accretion later on in evolution (ndFF = 0.073; Fig. 1).

The early rise of the NAD(P)-dependent c.2.1.2 SDR oxidoreductases generated a number of catalytic globular enzymes that fulfilled the energetic and metabolic demands of the now more structurally stable protocells. These globular proteins started to use AMP-containing cofactors, including small 5′–5′ linked ribonucleic acids such as the ubiquitous redox cofactor NAD(P) that carries electrons from one chemical reaction to another. Some structures harbored Bi Bi reaction mechanisms (i.e., one enzyme, two substrates, and two products, defined using Cleland’s nomenclature; Cleland 1963). Transferases of this kind are crucial for building hubs in metabolic networks (Pfeiffer et al. 2005) and their functions were the first to be shared by structures during modern metabolic expansion (Caetano-Anollés et al. 2007, 2009b). Metabolic functions linked to these proteins and built around cofactors enhanced cofactor and fatty acid biosyntheses, functions that again would have enhanced protocell persistence and duplication.

Under the scenario of emergence that we propose, globular membrane-tethered proteins and membrane integral proteins had a common ancestral structure and diversified in parallel. We expect α-helical bundle interactions in the membrane to have become increasingly canalized with time, bringing surface membrane nucleotide-binding subdomain/domain components together. These surface structures developed contacts that would enhance their stability and later on would enable their ATPase catalytic activities. We hypothesize that the size of initial bundles was limited and consequently only two surface subdomains could be brought together in primordial c.37.1.12 ATPases. This would explain why most modern ABC transporters contain two interacting nucleotide-binding domains, which together enable their catalytic sites. As membranes increased in thickness and bundles increased in size, larger numbers of surface components coalesced into multichain-ensembles, explaining why modern AAA-ATPases derived from primordial c.37.1.20 and c.37.1.19 FFs have typically 2–6 interacting domains and form complexes with stacked or double rings. Their globular make-up and function were, however, historically determined by the early interactions that were facilitated in the membranes. The evolution of modern transmembrane proteins appears driven by the duplication and rearrangement of genes and by a tendency to increase the number of helical segments in transmembrane domains (Saier 2003). Similarly, the need to improve transport regulation and efficiency could have driven the early evolution of the integral bundle of transporters, increasing the diversity and structural complexity of the transmembrane domains. Constrained by the membrane environment, the structure of these domains, however, could not diversify to the extent of globular proteins but was sufficiently flexible to adapt to the functional needs of the protocell. 
This would explain the limited number of structural folds we find today in membrane proteins (Popot and Engelman 2000).

In summary, while the emergence of simple membrane α-helical channels is protobiologically plausible and compatible with current biophysical and biochemical knowledge (Morris 2002; Pohorille et al. 2005; Renthal 2010), our phylogenomic timeline (Fig. 1) provides important support to the early origin of transport proteins and the centrality of membranes and proteins in evolution of early life. Phylogenomic evidence also provides a functional and evolutionary link between autocatalytic prebiotic processes, proteins and protocells, and solves the “chicken-and-egg” dilemma of the primordial need of proteins that chaperone the insertion of polytopic transmembrane helices to prevent their aggregation in the cytosol (Renthal 2010). Phylogenomic evidence is consistent with membranes and embedded peptides co-evolving from the start, beginning with small peptides that do not require translocon systems to be incorporated into membranes. Our model of evolution of modern biochemistry enables protein complexification and the early rise of AAA-ATPase-like structures with known chaperone functions that would have inserted primordial structured proteins into membranes. The initial chaperone systems were probably later replaced by more advanced translocon channels that chaperone the hydrophobic chains into the hydrophobic environment of the membrane (White and von Heijne 2005) and complexes that act for example in concert with ribosomes to insert α-helices into membrane lipid bilayers through pores (Kramer et al. 2010).

A Model of Emergence of Modern Protein Synthesis

The model of emergence of proteins in cellular membranes proposed above exploits protein features that enhance membrane growth, stability, and persistence (Fig. 8). The model also suggests that structurally canalized proteins originating in membrane-tethered structures slowly replaced the weakly catalytic, ineffective, and primordial Kauffman–Dyson peptides. Phylogenomics reveals that the appearance of the c.2.1.2 SDR oxidoreductases constitutes an important evolutionary step in structuring the components of the more or less statistical collection of intracellular peptides by endowing the population with cofactor-binding pockets that would act as binding sites for reactants. The large numbers of cofactors that currently interact with modern SDR enzymes support their primordial versatility (Fig. 5). This step improved peptide catalysis, presumably to enhance cellular energetics and the biosynthesis of cofactors and fatty acids. We now propose that the emergence of the c.26.1.1 and d.104.1.1 aaRS structures, which appeared almost together in the timeline with elongation factors, represents a second and crucial evolutionary step that curbed the structure and function of the emerging intracellular enzymatic system. A model describing the impact of these structures in protein biosynthesis is illustrated in Fig. 9. We argue that once protocells gained stability and cellular structure and reaped the benefits of cofactor-driven redox and energy transfer reactions, they improved the biosynthesis of proteins to enhance overall cellular activity. In our model of emergence of modern protein biosynthesis, the primordial aaRSs emerged as highly versatile structures with the unique ability of acting as general cofactor acylating agents. 
The fundamental improvement was the development of structures with two active sites (catalytic-editing) capable of a two-step (activation-acylation) catalytic process and the use of a random Bi Uni Uni Bi ping pong kinetic mechanism that is present in modern acyl-CoA synthetases (Bar-Tana and Rose 1968), aaRSs (Cramer et al. 1991), and NRPS acylating domains (Gulick 2009). These chemical properties enabled the donation of moieties (e.g., amino acids) to a multiplicity of substrates (Fig. 4). The aaRS structures also had the capacity to interact with larger ligands while orienting them in space. While aaRS-mediated acylation of CoA enhanced lipid biosynthesis (crucial for membrane expansion), primordial aaRSs harbored other useful enzymatic functions, including pyrophosphate interconversion, ligase, and oligonucleotide aminoacylation activities. As we described above, some of these functions are still associated with modern aaRSs, some considered relics and others typical of modern proteins. More importantly, oligonucleotides bound to primordial aaRSs acted as efficient cofactors for peptide synthesis and enhanced the diversity of the pool of cofactors in the protocell.

Fig. 9

Early evolution of protein biosynthesis. a Origin and evolution of aaRSs and NRPS modules. aaRSs and NRPS A-domain structures are derived from a primordial aaRS structure (p-aaRS unit) that was functionally versatile and had the ability to acylate a wide variety of cofactors (COF) in a two-step catalytic reaction, the formation of an activated acyladenylate intermediate [e.g., aminoacyl-adenylate (AMP-aa)] and the transfer of the acyl moiety onto cofactors such as 4′-phosphopantetheine, CoA, NADP or related derivatives. The aminoacylated COF intermediates were then delivered to biosynthetic active sites for peptide formation. The reaction at first was nonspecific and was capable of ligating amino acids (aa) or short peptides in the protoplasm or exposed in the surface of membranes, enhancing the combinatorial interplay of Kauffman–Dyson peptide synthesis. Diversification of the p-aaRS units produced p-aaRS domain complexes by aggregation (probably induced by stabilizing intermolecular interactions similar to those that exist today in the quaternary structure of aaRSs) or p-NRPS A-domain-like structures by neofunctionalization. These two lineages added functions and specificities by domain evolution and accretion and gave rise to modern aaRSs and NRPS assembly lines. b Origin and evolution of modern ribosomes and regulatory factors. Protein domain accretion in primordial p-aaRS-TF-minihelix complexes facilitated the development of the emerging functions of protein biosynthesis, including the structural canalization of a large cofactor moiety (the primordial tRNA) that enhanced ratchet motion needed for aminoacylation and peptide synthesis. Diversification of the minihelix and the protein scaffolds, including primordial translation factors (TFs), gave rise to at least two lineages, one leading to modern aaRSs and TFs and another leading to modern ribosomes. 
New protein domains added by accretion, including OB and SH3-like fold domains known to interact with membranes, also diversified within complexes giving rise to modern translation and replication machinery

We further postulate that primordial class I and II aaRS types developed the ability to interact with each other and with primordial elongation factors in fashions that resemble the loose combination of modules in some NRPS assembly lines (Fig. 9a). These interactions are preserved today in the formation of multi-aaRS complexes that enhance catalytic functions and provide a scaffold for aminoacylation and tRNA shepherding, among many other cellular functions (Lee et al. 2004). However, primordial interactions were constrained by protein structure, since aaRSs can bind simultaneously to opposite sides of RNA. We argue that as membranes facilitated the ATPase functions of ABC transporters, hydrogen bonding between the nucleotide moieties of cofactors (pantetheine, CoA, NAD, NADP, FAD, dinucleotides, oligonucleotides) facilitated the interaction and cooperativity of primordial aaRSs and, vice versa, protein–protein interactions facilitated cofactor interactions. For example, enzymatic cofactors such as NADP could have aligned their nucleotide moieties for coordinated base-pairing interactions within the protein scaffold, seeding a possible complementary replication (Yarus 2010). This would have benefited nucleotide ligations as well as acylations for synthesis of small peptides and other condensations and would have helped synthesize longer peptides, fatty acids, and nucleotide cofactors with more efficiency. As we mentioned above, aaRSs are known to form a wide range of dinucleoside oligophosphates in the presence of amino acids, some of which still carry physiological roles (e.g., Goerlich et al. 1982; Dieckmann et al. 2001), showing that their α/β/α-layered structure can induce oligonucleotide biosynthesis. 
The emergence of structured ligand–ligand interactions and protein-catalyzed peptide synthesis facilitated by the primordial aaRSs was beneficial to Kauffman–Dyson peptides but also to the synthesis of cofactors that were needed to enhance the functional repertoire of the protocells and their cellular stability. We propose that present day nucleic acid stereochemistries are derived from these very early side-by-side hydrogen-bonding interactions between cofactor moieties that were supported by the aaRS protein scaffolds (Fig. 9a). We suggest that these early stereochemical interactions were prompted by the multifunctionality and poor specificity of the emerging (fuzzy) biochemistries and were responsible for the patterns that exist in the genetic code and the complementary modes of tRNA aminoacylation (Rodin et al. 2009). We also suggest the primordial translation factors played important roles in multi-aaRS complexes by facilitating intermolecular movements of cofactor moieties. In fact, we note modern elongation factors are known to be part of aaRS complexes and play RNA binding roles (e.g., EF-1α with ProRS, LeuRS and LysRS; Hausmann and Ibba 2008), suggesting a long history of associations is still preserved in the molecular organization of these proteins.

Primordial aaRSs defined a lineage of proteins with the α/β/α-layered structural scaffold capable of propagating its functional and kinetic properties to variants within close neighborhoods in sequence space. This emerging property allowed the search for functional improvements and diversifications that would again improve protocell persistence and biosynthesis. For example, and as discussed above, the aminoacylation domains of NRPSs and aaRSs share a number of biochemical, structural, and kinetic features besides appearing very close in the timeline. Thus, the discovery of NRPS A-domains in sequence space is expected because of their ability to generate amino acid sequence-based rules for substrate recognition. These functionally flexible active sites would later on in evolution provide the selectivity needed to produce peptides in assembly lines from ~500 different amino acid monomers (Marahiel 2009). The discovery of NRPS A-domain structures with improved aminoacylating activities and an embedded substrate recognition “code” had the second and very important consequence of relieving aaRSs from their initial protein synthesis activities and allowing exploration of other functions that may be beneficial for cofactor-protein relationships. We posit this crucial neofunctionalization allowed improvements in the interaction of evolving aaRSs with oligonucleotides, which was already ongoing. These improvements included the establishment of interactions with the catalytic GTP-binding domain of elongation and initiation factors, which are known to enhance aminoacylation rates (Hausmann et al. 2007; Hausmann and Ibba 2008).

The interaction between primordial aaRSs and associated nucleic acid cofactors was enhanced by translation factors, which ratcheted cofactors from one aaRS to another in the complex (Fig. 9a). In modern NRPS assembly lines, PCP plays a similar role by inducing a “swinging” movement of the amino acid-loaded phosphopantetheine from aminoacylation to condensation sites, which moves the thiol group by 16 Å (Koglin et al. 2006). The acyl carrier protein-like FF of the PCP and ACP proteins is however a uniquely advanced cofactor protein structure (Marahiel 2009; Chan and Vogel 2010) and, as expected, is evolutionarily quite derived in our timeline (ndFF = 0.424). We suggest complexes between oligonucleotide cofactors and translation factors resembled loaded PCP structures with the difference that conformational (ratcheting) movements were more restricted and interactions accommodated growth in the size of the oligonucleotide cofactors, which sometimes folded to form helical structures. Possible duplications and folding have been proposed for early origins of tRNA (Widmann et al. 2005), supported by ancestral split genes of tRNA in Archaea (Di Giulio 2009). We suggest these ancient “minihelices” were the primordial acceptor arms of modern tRNA (Sun and Caetano-Anollés 2008b). These ancient acceptor arms developed the stereochemical rules of a primordial “operational” code (Rodin and Rodin 2008), which placed complementary codons face-to-face and favored formation of 3D RNA structure. With time, minihelices became more and more structurally canalized as their ensemble of conformations diminished in complexity (Ancel and Fontana 2000). The stability of the primordial minihelix was important for the aminoacylating function of aaRSs so it is expected that the structure of tRNA evolved quickly relative to the specificity of amino acid charging (Sun and Caetano-Anollés 2008b). 
In fact, coevolution patterns of rRNA and r-proteins suggest a modern tRNA structure capable of embodying the canonical genetic code appeared already with the ribosomal PTC (ndFF = 0.253) (Harish and Caetano-Anollés 2011). In turn, the late appearance of the editing domain (ndFF = 0.126; Fig. 1) and even later appearance of anticodon binding domain structures (ndFF = 0.196–0.650) in the timeline (Caetano-Anollés et al. 2011) suggests the development of aminoacylation specificity (and its associated genetic code) was slow, as anticipated by Di Giulio (2006).
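
The relative ages quoted in this passage can be lined up with a few lines of Python. This is only an illustrative sketch: the dictionary `landmarks`, its labels, and the helper `order_by_age` are hypothetical names introduced here, the anticodon-binding range is represented by its two quoted endpoints, and we assume that smaller ndFF values correspond to older structures in the timeline.

```python
# ndFF values as quoted in the text; labels are shorthand used only here.
# Assumption: a smaller ndFF means an earlier appearance in the timeline.
landmarks = {
    "ACP-like FF (PCP and ACP)": 0.424,
    "ribosomal PTC": 0.253,
    "aaRS editing domain": 0.126,
    "anticodon-binding domains (earliest)": 0.196,
    "anticodon-binding domains (latest)": 0.650,
}


def order_by_age(ages):
    """Return landmark names sorted from smallest to largest ndFF."""
    return sorted(ages, key=ages.get)


if __name__ == "__main__":
    for name in order_by_age(landmarks):
        print(f"{landmarks[name]:.3f}  {name}")
```

Sorting in this way simply restates the quoted values as an ordered list; it does not replace the phylogenomic trees from which the ndFF values were derived.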

While evolution of crucial stereochemistries occurred during the discovery of the first 54 FFs (Fig. 9a), cofactor interactions in the multi-aaRS complex cauldron produced other structured RNA molecules, some of which could have helped jump-start important functions (RNA replication, processing, and degradation) in interaction with novel translation factor-linked domains (harboring for example OB and SH3-like fold structures; Harish and Caetano-Anollés 2011). The possible origin and evolution of these functions benefit immensely from “RNA world” inspired research and information in RNA structure but will not be elaborated here since our focus is the emergence of protein synthesis and not the emergence of the nucleic acid replication apparatus. Instead, we postulate that one of these variant RNA structures was a primordial form of helix 44 (the ribosomal ratchet; Fig. 9b), the most ancient helical structure of rRNA in the ribosomal ensemble (Harish and Caetano-Anollés 2011), and gave rise to the ribosome. In modern ribosomes, helix 44 is the main component of the functional relay that links processes in the decoding site of the small ribosomal subunit with processes centered in the large subunit such as peptide bond formation and the release of elongation factors (Cate et al. 1999). The complex mechanism embodied in helix 44 modulates intersubunit interactions by ratcheting the ribosomal subunits relative to each other in a choreographed set of discrete movements between the body and head of the small subunit and the intersubunit interface (Zhang et al. 2009). For example, the head domain swivels 11° forward and 6° backward while moving ribosomal protein L13 a molecular distance of 20 Å. These macromolecular movements are crucial to maintain the reading frame and accuracy of translation, and are similar in magnitude to the swiveling of the PCP cofactor protein in NRPSs. 
However, we envision the movement of primordial helix 44 in the aaRS-factor complex was at the beginning limited and induced by conformational changes involving GTP hydrolysis of the primordial translation factors. A recent study shows the OB-fold barrel structure of ribosomal protein S12 contacts aminoacyl-tRNA in ternary complex with elongation factor EF-Tu and GTP while inducing conformational changes in the small subunit rRNA upon aminoacyl-tRNA codon recognition (Gregory et al. 2009). We propose this same configuration that links nucleic acid stereochemistries and GTP hydrolysis was already present in the primordial multi-aaRS complex once the cold shock DNA-binding domain-like FF structure (b.40.4.5; ndFF = 0.114) of S12 was added by domain accretion to primordial translation factor domains (Fig. 9b). Since the b.40.4.5 structure of S12 and S17 is the oldest in the ribosome and interacts tightly with helix 44 (Harish and Caetano-Anollés 2011), we further suggest the proto-ribosome originated as a “vectorial” off-shoot of the interaction of b.40.4.5 with the ternary complex of translation factors and tRNA, perhaps during the disassembled stage of multi-aaRS complexes or as an alternative complex (Fig. 9). This evolutionary transfer responsible for modern ribosomal processivity is still preserved in the mechanics of translation when aminoacyl-tRNAs are vectorially transferred from aaRSs to ribosomes as ternary complexes of elongation factors, GTP and aminoacyl-tRNA. Our model therefore has considerable explanatory power, describing how the discovery of a handful of crucial protein fold structures was responsible for originating ribosomal and non-ribosomal protein biosynthesis, the generation of modern aaRS complexes, and the origin of the ribosome.

As hinted by historical hourglass patterns in protein evolution that are reminiscent of developmental hourglasses (Caetano-Anollés et al. 2011), a major evolutionary transition occurred during the “superkingdom specification” epoch (Fig. 1c) at the time the first superkingdom-specific domain structure appeared in evolution (ndFF ~ 0.250). This transition was probably responsible for the widespread emergence of modularity in protein organization that soon ensued (Wang and Caetano-Anollés 2009), very much as the phylotypic stage in embryonic development represents the crossroads of developmental paths and a period in which interactions between components of the organismal system responsible for morphological differences are maximal (e.g., Domazet-Lošo and Tautz 2010). Remarkably, the transition occurred only after the two ribosomal subunits assembled together to form a functional ribosome with a modern PTC and the L7/L12 complex was added to the ensemble to increase ribosomal processivity (Harish and Caetano-Anollés 2011) and before modular biochemical organization, including ribosomal ribonucleoprotein growth, aaRS domain accretion responsible for pathways of genetic code expansion, the appearance of modular cofactor proteins (ACP and PCP), and modern assembly line NRPSs with modules specific for hundreds of amino acid components (Fig. 1c).

Limitations of the Models

The validity of the models depends on the accuracy of phylogenomic statements. The exact order of closely positioned FSFs in the timeline is potentially debatable in trees of this size, but trends across the phylogeny have certainly been robust and informative and consistently obtained as datasets increase with numbers of genomes that have been sequenced (Caetano-Anollés et al. 2009a). In particular, the robustness of phylogenetic relationships is high at the base of the trees. Thus, phylogenetic statements related to the origin and evolution of the 54 most ancient FFs should be considered reliable, especially the relationship of evolutionary landmarks which have been consistently obtained at F, FSF, and FF levels of structural complexity (Caetano-Anollés et al. 2011). For example, the appearance of crucial landmarks supporting our model of biochemical evolution reveals a progression that was congruently recovered at all levels of structural complexity (Table 1). Moreover, the origin of the protein world in the c.37.1.12 and c.37.1.20 FFs, which is crucial for our model of protein emergence, has been confirmed by congruence with a phylogeny of molecular functions (Kim and Caetano-Anollés 2010).

Table 1 The age of domain landmarks inferred at different levels of structural complexity

While our arguments and models are supported by phylogenomic data, our interpretations of ancient biochemistry could be biased by knowledge grounded in modern biochemistry. In other words, phylogenetic extrapolations, though powerful, depend on extant molecules and molecular functions that exist today and cannot yet exploit data from “resurrection” experiments that bring to life ancient phylogenetically derived structures in the laboratory and test their structural, functional and thermodynamic properties (e.g., Jermann et al. 1995; Gaucher et al. 2003; Ortlund et al. 2007). Unfortunately, resurrections are experimentally demanding, suffer from the limitations of sequence analysis, and are necessarily limited to the study of relatively recent proteins. Our interpretations are also limited by our ignorance about the relationship of ancient peptides, prebiotic chemistries, and structurally canalized proteins responsible for modern biochemistry. Similarly, we do not know how the structures and weak catalytic activities of Kauffman–Dyson peptides were “inherited” in the primordial system and how easily canalized structure materialized in sequence space. Recent bench experiments however have shown that random polypeptides of the size of small proteins can fold into 3D conformations in the absence of selection (LaBean et al. 2011). Self-organization in proteins thus plays an important role in generating molecular structure. Despite caveats, the model we propose will be valuable for the design of future experiments in the laboratory and future bioinformatic explorations that could attempt to answer some of these fundamental questions.

The Models and the Principle of Continuity

The “principle of continuity” best defined by Leibniz’s “Nature never makes leaps” governs causal, spatiotemporal, and conceptual connections within his theory of consciousness (Leibniz 1923) and can be used as a “fruitful principle of discovery” to put theories to the test. In life research, the principle of continuity is embodied in gradual change. The principle has already been discussed in the context of phylogenomic analysis (Caetano-Anollés et al. 2011). In extant biology, change is mostly fueled by the action of mutations that act on the genetic repository, i.e., on nucleic acid biopolymers. In the absence of a formal genetic repository, however, gradual change must occur in the chemical repertoires of cofactors, small molecules and emerging biopolymers that are needed to jumpstart modern biochemistry. Our phylogenomic-supported models posit that gradual change involved at first polypeptides and small proteins that associated with primordial membranes. These primordial molecules self-organized into cellular and energy-dissipating structures within the energetic context of early Earth and fulfilled basic principles of thermodynamics. The systems however gained functions that provided selective advantages. For example, the early interaction of polypeptides and membranes stabilized the primordial cells, extending their average life and promoting self-organization of membrane and peptide components. This process was gradual, at first linked to fulfilling energy-dissipation demands but then linked to other emerging functions that would stabilize and perpetuate the system’s structure. The principle of continuity ultimately embodied the gradual improvement of molecular interactions during the very early stages of biochemistry. Our phylogenomic reconstructions suggest these started with interactions of primordial membranes, peptides and cofactors. 
Initial interactions were gradually followed by interactions between proteins, nucleotides and nucleic acids, first as substrates, then as docking-guides and cofactors, and finally as molecular switches and actuators (Caetano-Anollés et al. 2011).

It can be argued that phylogenomic analyses are only valid if distinct lines of genetic continuity relate information in extant cellular machinery with past molecular events (Lazcano 2010). Indeed, our study suggests several primordial Kauffman–Dyson polypeptides preceded those encoded in active sites of aaRS and NRPS modules of assembly lines and later on in emerging nucleic acid molecules. How can the information in these primordial peptides be ‘fossilized’ for later phylogenomic extraction? We posit however that the emergence of genetic continuity occurred gradually, first supported by energy dissipation, self-organization and other principles that helped the system “remember” (select) best molecular designs from those that were accessible. Codes can be regarded as biases or constraints in perpetuating systems that enable selection. For example, biases in the nucleotide makeup of nucleic acids (the genetic code) enable change in life systems with genetic repositories. By the same token, constraints in the sequence and structural space of proteins (the structural code) enable changes in the stability of the self-organizing and expanding membranes of primordial cells. These codes do not need to operate in different spatiotemporal scales and can interact to remember information that we now extract with our phylogenomic methods. The fact that plausible prebiotic scenarios for origins of metabolic chemistries seem to be embodied in modern metabolism supports the contention that evolving codes are remembered and their information can still be extracted (Caetano-Anollés et al. 2009b). We note however that while random polypeptides do self-organize (LaBean et al. 2011), there is still no experimental evidence to support the existence of self-replicating polypeptides of the Kauffman–Dyson type. 
We also note that while some self-organizing physical and chemical systems exist (Bernard cells, the cyclic Belousov-Zhabotinsky chemical reactions, and notably self-organizing lipid layers and micelles) none can produce self-generating complex cellular systems (Lazcano 2010). We hope synthetic biology will make this possible in the near future.

Conclusions

A general theory describing the origin of fundamental complexity in modern biochemistry has been conspicuously absent in scientific inquiry. Numerous hypotheses and likely scenarios of origin have, however, been proposed based on fragmented statements of similarity (sometimes framed as homology) and reductionist thinking. In turn, holistic views grounded in phylogenetics and genomics have been nonexistent because of the limitations of sequence analysis and our tendency to make inferences about the past with modern definitions. Some insights, however, have been forthcoming. Fritz Lipmann, for example, lucidly anticipated (Lipmann 1971):

In the domain of the transfer of genetic information I felt naïvely that, before translation into amino acid sequences could become the target of the genetic code, the extraordinary functional pliability of proteins would have to be established… Antibiotic polypeptide synthesis, in terms of process evolution, is between fatty acid synthesis and the complex ribosomal polypeptide synthesis. It parallels features of the two reaction mechanisms; it runs in single steps through elongation to a product that grows vectorially by transpeptidation of thiol-linked peptides to amino groups of thiol-linked amino acids

The appeal of an “RNA world” however diverted attempts to explain the emergence of biochemical complexity as a coordinated evolutionary process involving not only nucleic acid molecules but also proteins and membranes. In the “RNA world” view that prevails, nucleic acid structure precedes protein structure. This scenario is incompatible with phylogenomics, parsimony thinking, RNA and ribosomal biology, and data from molecular structure and function (Kurland 2010; Caetano-Anollés et al. 2011; Kim and Caetano-Anollés 2010). The timeline we mine in this paper uncovers a very likely scenario for the origins of modern biochemistry that takes us far beyond the insight of Lipmann, this time crucially grounded in genomics and based on phylogeny. In our study, homology statements are “shared and derived” relationships of common ancestry obtained with modern bioinformatic tools of phylogenetic analysis. The models we derive are complex but portray a very likely scenario in which membranes and catalytic chemistries constrain options for protein change in the emerging biological system. The models also have considerable explanatory power and, in contrast with any that have been previously proposed, provide a detailed step-by-step description that is transformative.

One important realization derived from our models of early evolution is the progressive appearance of complexity in biochemistry, first fueled by membranes, then by primitive cofactor chemistries and macromolecular interactions (including the formation of complexes), and finally by constraints of protein structure that operate on catalysis and molecular function. We have showcased this progression with proteins but expect nucleic acids will follow a similar progressive trend. In fact, we see the gradual formation of structure unfold for example in phylogenetic analysis of tRNA (Sun and Caetano-Anollés 2008b), RNase P RNA (Sun and Caetano-Anollés 2010), 5S RNA (Sun and Caetano-Anollés 2009), and the small and large subunits of rRNA (Harish and Caetano-Anollés 2011). We also reveal progressive growth in biological information. The number and diversity of monomers, modules and macromolecules increase progressively in the timeline. Codes emerged gradually with increases in specificity and modularity, starting with codes in active sites such as those of primordial aaRSs and NRPSs, followed by a code of stereochemical rules in nucleic acid cofactors, and ending with the link of the two early codes in an emerging genetic code. The first of these three codes is used today in the modular assembly lines of non-ribosomal peptide synthesis. The last corresponds to the modern genetic code that is interpreted by aaRSs, translation factors, and the ribosome. One remarkable aspect of the timelines is the relatively late appearance of cofactor proteins and a new code associated with them that appears internalized in the structure of proteins. We suggest this code is probably developing in our contemporary world to replace the tRNA cofactors with translation factor protein molecules that mimic tRNA in structure and function and have the potential to take over the ability to decipher the genetic code (Nakamura and Ito 2003). This emerging code is already operating in the recognition of stop codons in mRNA during translation termination.

Another important realization is that evolutionary precursors of tRNA and other ancient functional RNA molecules were initially cofactors of primordial proteins. An origin of nucleic acids in cofactors is a logical idea that has been widely entertained (see for example Yarus 2010, for a recent proposal), especially because cofactor chemistries are believed to be very ancient (Denessiouk et al. 2001). Recent developments in biochemistry and molecular biology have shown that modern tRNA behaves very much as a standard cofactor (such as the ancient NAD or CoA). Modern tRNA holds a multiplicity of roles and is used as a biochemical scaffold in many biochemical reactions. tRNA is a central player not only in protein biosynthesis through its canonical role in aminoacylation and editing but also in a wide range of biological processes, including the synthesis of antibiotics, bacterial cell wall peptidoglycans and tetrapyrroles, modification of bacterial membrane lipids, protein turnover and its role as precursor for other aminoacyl-tRNAs (Francklyn and Minajigi 2010). tRNA also facilitates the central functions of aminoacyl transfer in aaRSs (Minajigi and Francklyn 2008), peptidyl transfer in the ribosome (Weinger et al. 2004) and PheRS-mediated deacylation of Thr-tRNAThr (Ling et al. 2007).

The evolutionary precursors that gave rise to RNA were probably minihelices homologous to the acceptor arm of tRNA. We have shown that rRNA and other functional RNA molecules (e.g., RNase P RNA) are derived compared to tRNA (Sun and Caetano-Anollés 2008a, b, 2009, 2010; Harish and Caetano-Anollés 2011) and that domains that associate with tRNA (e.g., editing and anticodon binding domains of aaRSs) are also derived (Caetano-Anollés et al. 2011). Thus, we hypothesize that precursors of tRNA were also precursors of ancient functional RNA molecules (including rRNA) and that their initial role was to act as ancient cofactors for primordial proteins. In this view, the modern RNA world derives from improvements in cofactor chemistries that were originally subservient to proteins. Later, this world gained enzymatic properties independent of the protein scaffold and the ability to store genetic information more efficiently. Our model is therefore compatible with an RNA world that is derived and not ancestral.

A final realization is the progressive generation of modules in biology, first with modular transmembrane domains, then with modular membrane-associated domains, and finally with globular domains and the establishment of quaternary and complex arrangements. We propose primordial aaRS complexes acted as modules in primordial assembly lines, each module developing specificities for different amino acid sets. The rise of these modules biased the constitution of emergent proteins and folds and biased protein interaction with cofactors and RNA. These biases left primordial imprints in the genetic code and the structure of the protein world that we will report elsewhere. The rise of modules in biochemistry is embodied in the modules of NRPS and aaRS complexes we see today and the codes embedded in them, but more vividly in the combinatorial complexity of domains that permeates the protein world (Wang and Caetano-Anollés 2009).