Introduction

All extant organisms use the same DNA–RNA–protein system known as the “central dogma of molecular biology”, wherein DNA contains the genetic information and proteins work as functional molecules. Therefore, many biologists believe that all extant organisms are descendants of a single common ancestor, a hypothesis first proposed by Darwin (Darwin 1859). While not all biologists necessarily believe in the single common ancestor hypothesis (Doolittle 1999; Kandler 1995; Woese and Fox 1977), all modern organisms share significantly similar mechanisms for replication and expression of genetic information, a fact that supports the existence of a single ancestor because it seems unlikely that such similar mechanisms were established many times independently. It should be kept in mind that the common ancestor is not the “oldest”, but the “most recent” common ancestor of all extant organisms, which is often referred to as the last universal common ancestor (LUCA). This distinction is necessary because the oldest ancestor that appeared on primitive Earth might have diversified over several million years (Cornish-Bowden and Cardenas 2017). Most of the subsequent primitive organisms might have become extinct for various reasons, and only LUCA might have survived (Nisbet and Sleep 2001). One study has suggested that LUCA was an anaerobic, autotrophic microorganism with a metabolic system for nitrogen fixation and carbon dioxide fixation using hydrogen as an electron donor (Weiss et al. 2016), while another has suggested that LUCA could already synthesize proteins using 20 types of amino acids, similar to extant organisms (Mat et al. 2008).

Regardless of the exact nature of LUCA, the origins of life, which emerged earlier, remain a long-running controversy because, in extant organisms, the nucleic acid polymers (i.e., DNA and RNA) carry the information for the amino acid sequences of proteins, while proteins play a central role in the replication of nucleic acid polymers. More than 50 years ago, Rich proposed the idea that RNA served as both a genetic and functional molecule in the primitive environment (Rich 1962), while Gilbert coined the ‘RNA world hypothesis’ in 1986 (Gilbert 1986). Much experimental evidence now supports the hypothesis that life on Earth began with an RNA molecule (Guerrier-Takada et al. 1983; Kruger et al. 1982; Nissen et al. 2000).

In extant organisms, proteins are generally composed of 20 types of L-amino acid, which the organisms obtain by two primary means: intracellular synthesis by the metabolic system if the corresponding amino acid synthesis pathway is available; or acquisition from the external environment if it is not. It can be reasonably assumed that primitive proteins were synthesized using only amino acids available in the environment before innovation of the corresponding intracellular amino acid synthetic pathways (Cleaves 2010; Shibue et al. 2018). To the best of our knowledge, no experimental evidence supports the plausibility that all 20 of the current proteinogenic amino acids were present at concentrations sufficient for synthesizing primitive proteins in the environment 4 billion years ago. However, classic Miller–Urey experiments simulating the hypothetical early Earth environment suggest that the synthesis of organic compounds from inorganic substances was possible (Ferus et al. 2017; Weber and Miller 1981; Miller 1953). Among various other compounds, those experiments yielded many amino acids, but only 10 of the 20 current proteinogenic amino acids were among them. The same 10 amino acids were also found in samples returned from the asteroid Ryugu by the Hayabusa2 spacecraft (Nakamura et al. 2022). Further clues to the amino acids present on early Earth have been obtained from the Murchison meteorite, which is rich in organic compounds (Wolman et al. 1972). The meteorite contains more than 70 amino acids, but only eight of them are used in current protein synthesis (Cronin 1989; Cronin and Pizzarello 1983). Thus, it can be reasonably assumed that only a subset of the current 20 amino acids was present in sufficient amounts in the environment of primitive Earth. Other evidence supports the idea that proteins with fewer than 20 amino acids were synthesized in the early stages of evolution before the emergence of LUCA (Akanuma et al. 2002; Angyan et al. 2014; Cornell et al. 2019; Giacobelli et al. 2022; Longo et al. 2013; Shibue et al. 2018; Solis 2019; Yagi et al. 2021).

It is possible that early proteins and RNA served as mutual cofactors and scaffolds, necessitating strong interactions (Lupas and Alva 2017; Vázquez-Salazar and Lazcano 2018). Because ribose, which forms the backbone of RNA, and its analogs are quickly decomposed at high temperatures (Larralde et al. 1995), Miller and Lazcano pointed out that the earliest life was unlikely to thrive in a high-temperature environment (Miller and Lazcano 1995). Conversely, Pearce and colleagues predicted that a hot early environment on Earth (50–80 °C) would favor rapid nucleotide synthesis as compared with a warm early environment (5–35 °C) (Pearce et al. 2017). Geochemical evidence suggests that early Earth underwent catastrophic meteoritic bombardment (Chyba 1990); therefore, the presumed early environment was likely to be hotter and more unstable in terms of climate change than today (Knauth and Lowe 1978; Robert and Chaussidon 2006). This unstable environment would be fatal to single-stranded RNA structures and highly likely to cause RNA inactivation and even breakage. Therefore, the role of the earliest proteins might have been to stabilize the RNA molecules that are hypothesized to play a central role in the RNA world (Shibue et al. 2018).

By comparing a large number of extant homologous protein sequences, ancestral protein reconstruction can resurrect the proteins of past organisms (Akanuma and Yamagishi 2016; Gaucher et al. 2010; Merkl and Sterner 2016; Rouet et al. 2017; Thornton 2004; Wheeler et al. 2016). Many studies have used this approach, for example, to understand the evolution of ethanol production and consumption in yeast (Thomson et al. 2005) and the trajectory of ligand-specific changes in hormone receptors (Bridgham et al. 2006, 2009; Harms and Thornton 2013; Ortlund et al. 2007), and to estimate ancient biosphere temperature (Akanuma et al. 2013; Garcia et al. 2017; Gaucher et al. 2008, 2003).

To further explore the subset of amino acids present in early proteins, here we have applied the reconstruction method to the ribosomal protein uS8 (named according to the new system for ribosomal proteins; Ban et al. 2014), which directly interacts with the central domain of 16S rRNA (Wiener et al. 1988). uS8 is a 130-residue protein essential for organization of the central domain of the small subunit of the ribosome (Collatz et al. 1976). Its deletion prevents other ribosomal proteins from assembly on the 30S small subunit, resulting in a significant loss of protein synthesis (Allmang et al. 1994; Shimojo et al. 2020). First, we inferred two potential ancestral sequences of uS8 using the information contained in a predictive phylogenetic tree of the amino acid sequences of extant uS8 proteins, and characterized the resulting ancestral uS8 proteins in terms of thermal stability and RNA-binding properties. Next, by eliminating one amino acid letter at a time from the ancestral uS8 sequence, we identified amino acids that are not essential for RNA binding, and used this information to create simplified uS8 variants lacking multiple types of amino acid to derive a minimal set of amino acids essential for a stable uS8 variant with RNA-binding activity. Lastly, we compared this minimal set with amino acids that have been identified as plausibly abundant in the prebiotic environment by earlier geochemical studies.

Materials and Methods

Phylogenetic Tree Building and Ancestral Amino Acid Sequence Inference

A BlastP search (Altschul et al. 1997) of the in-house KF database v.1.2, which contains all protein sequences of 804 organisms (Furukawa et al. 2017), was performed to construct a dataset of uS8 amino acid sequences. The amino acid sequence of Thermus thermophilus uS8 (accession number: AAB25287) was used as a query sequence because it is one of the most well-studied uS8 proteins. Methanococcus maripaludis uS8 (WP_011171358) was used as a query sequence to retrieve archaeal and eukaryotic sequences. Duplicate identical amino acid sequences were removed and the remaining sequences were used as the primary dataset. Individual sequence datasets for Bacteria, Archaea and Eukaryotes were aligned independently using MAFFT ver.7.3 (Katoh and Standley 2013). Amino acid sequences annotated as proteins other than uS8 were removed from the alignment. Sequences with a long insertion (> 50 amino acids) relative to T. thermophilus and M. maripaludis uS8 proteins were also removed. The alignment was then manually corrected by referring to secondary structure information and the known tertiary structures of uS8 proteins from T. thermophilus (PDB code: 1QD7), Bacillus anthracis (PDB code: 4PDB) and Methanocaldococcus jannaschii (PDB code: 1I6U).
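The two filtering steps described above (removing duplicate identical sequences and discarding sequences with a long insertion relative to a reference) can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the function names, the FASTA parsing and the pairwise definition of an "insertion" (query residues aligned to reference gaps) are assumptions made for the example.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_duplicates(records: List[Record]) -> List[Record]:
    """Keep only the first record for each distinct amino acid sequence."""
    seen = set()
    unique: List[Record] = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def longest_insertion(aligned_query: str, aligned_ref: str) -> int:
    """Longest run of query residues aligned to gaps in the reference.

    Columns where both sequences have a gap neither extend nor break a run.
    """
    longest = run = 0
    for q, r in zip(aligned_query, aligned_ref):
        if r == "-" and q != "-":
            run += 1
            longest = max(longest, run)
        elif r != "-":
            run = 0
    return longest


def has_long_insertion(aligned_query: str, aligned_ref: str,
                       threshold: int = 50) -> bool:
    """True if the query carries an insertion longer than the threshold."""
    return longest_insertion(aligned_query, aligned_ref) > threshold
```

The default threshold of 50 residues follows the cutoff stated in the text; everything else is a placeholder that would need adapting to the real alignment files.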

Conserved regions in the final alignment were selected via the automated1 mode, gappyout mode, and no gaps mode of trimAl (Capella-Gutierrez et al. 2009). IQ-TREE v. 1.6.9 (Nguyen et al. 2015), in conjunction with the LG + R8 amino acid substitution model, was used to build a phylogenetic tree (Fig. S1). ModelFinder (Kalyaanamoorthy et al. 2017) selected LG + R8 as the optimal amino acid substitution model. We removed eukaryotic sequences from the dataset, as well as some prokaryotic sequences that might cause long-branch attraction, and recalculated the tree. Again, IQ-TREE v. 1.6.9 (Nguyen et al. 2015) was used in conjunction with the LG + R9 amino acid substitution model, which was selected as the optimal amino acid substitution model by ModelFinder (Kalyaanamoorthy et al. 2017). We also built other trees with the LG + R7, LG + R8, and LG + R10 amino acid substitution models. The four resulting trees all showed a pectinate topology with a few sequences branching directly near the basal position of the tree (Fig. S2-1–4), probably due to long-branch attraction. We therefore used a site-heterogeneous mixture model (CAT) as an alternative amino acid substitution model (Lartillot and Philippe 2004) because this model is expected to suppress long-branch attraction artefacts (Lartillot et al. 2007). We built six more phylogenetic trees using IQ-TREE in conjunction with an LG + C10 + F + G, LG + C20 + F + G, LG + C30 + F + G, LG + C40 + F + G, LG + C50 + F + G, or LG + C60 + F + G amino acid substitution model (Fig. S2-5–10). Among the six resulting trees, the tree built with the LG + C30 + F + G model showed the best log likelihood score, although a possible long-branch attraction artefact (branch leading to Gold_HGW-Goldbacteria-1_PKL91838.1) was still observed (Fig. 1, Fig. S2-7 and Fig. S3).
Using the phylogenetic tree built with the LG + C30 + F + G model and either IQ-TREE or CodeML in PAML (Yang 2007), we inferred two ancestral uS8 sequences (named I_Bac and P_Bac, respectively; Fig. 2 and Fig. S4) that might correspond to the last bacterial common ancestor. GASP (Edwards and Shields 2004) was used to estimate the location of gaps in the ancestral sequences. The amino acid sequences of I_Bac and P_Bac are available in fasta format (Supplementary Data 1).
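The comparison of the six mixture-model trees can be illustrated with a small helper that lists the model strings, assembles an IQ-TREE command line for each, and picks the model with the highest log likelihood. This is a hedged sketch: the executable name `iqtree`, the alignment file name and the output prefixes are assumptions, only the basic `-s`, `-m` and `-pre` options are used, and no log-likelihood values from the study are reproduced here.

```python
from typing import Dict, List, Sequence


def mixture_models(classes: Sequence[int] = (10, 20, 30, 40, 50, 60)) -> List[str]:
    """Model strings for the LG + C10 ... C60 + F + G series."""
    return [f"LG+C{n}+F+G" for n in classes]


def iqtree_command(alignment: str, model: str) -> List[str]:
    """Argument list for one IQ-TREE run (executable name is an assumption)."""
    prefix = model.replace("+", "_")  # hypothetical output prefix
    return ["iqtree", "-s", alignment, "-m", model, "-pre", prefix]


def best_model(log_likelihoods: Dict[str, float]) -> str:
    """Return the model whose tree has the highest (least negative) log likelihood."""
    if not log_likelihoods:
        raise ValueError("no log-likelihood values supplied")
    return max(log_likelihoods, key=log_likelihoods.get)


if __name__ == "__main__":
    # Print the commands instead of running them; "uS8_trimmed.fasta" is a placeholder.
    for model in mixture_models():
        print(" ".join(iqtree_command("uS8_trimmed.fasta", model)))
```

The log-likelihood values would be read from each run's report file; parsing is omitted because the exact report layout is not described in the text.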

Fig. 1

Phylogenetic tree used to infer ancestral uS8 sequences. Arrow marks the node corresponding to the position of the ancestral protein. For the complete tree, see Fig. S3

Fig. 2

Amino acid sequence comparison of the two bacterial ancestral uS8 proteins. Residues that differ between the two ancestral sequences are shown in bold. Boxes indicate plausible RNA-binding residues predicted using the structure of the B. anthracis uS8–RNA complex (PDB code: 4PDB) as a guide (see Fig. S9)

Construction of Expression Plasmids for Ancestral uS8 and Simplified Variants

Nucleotide sequences encoding the last bacterial common ancestral uS8 were generated by reverse-translating the inferred ancestral amino acid sequences. Codon usage was optimized for an Escherichia coli expression system. The nucleotide sequences were artificially synthesized by Eurofins Genomics and cloned into the NdeI-BamHI site of plasmid pET23a( +) (Merck).
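Reverse translation with a fixed codon preference can be sketched as below. The codon table is an illustrative choice of one commonly used E. coli codon per amino acid; it is an assumption for this example and is not the table used by the authors or by the gene-synthesis vendor, whose optimization procedures are not described in the text.

```python
# One codon per amino acid, chosen for illustration only (assumed, not from the study).
ECOLI_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}
STOP_CODON = "TAA"


def reverse_translate(protein: str, add_stop: bool = True) -> str:
    """Convert an amino acid sequence into a DNA coding sequence."""
    codons = []
    for position, residue in enumerate(protein.upper(), start=1):
        if residue not in ECOLI_CODON:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        codons.append(ECOLI_CODON[residue])
    if add_stop:
        codons.append(STOP_CODON)
    return "".join(codons)
```

A real design would also check for internal NdeI and BamHI recognition sites, since those enzymes are used for cloning; that check is left out of this sketch.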

The genes encoding uS8 from T. thermophilus and B. anthracis were artificially synthesized by Eurofins Genomics and cloned into the NdeI-XhoI site of plasmid pET23a( +) (Merck), which expressed the proteins as a C-terminally His-tagged form.

The genes encoding simplified P_Bac variants were also synthesized by Eurofins Genomics and cloned into the NdeI-BamHI site of plasmid pET23a( +), except for those encoding simplified P_Bac variants lacking cysteine, phenylalanine, threonine or tryptophan, which were synthesized by the splicing-by-overlap-extension PCR method (Horton et al. 1993). The mutated genes were PCR-amplified in a reaction mixture containing 1 × PCR buffer for KOD Plus DNA polymerization (Toyobo), 1 mM MgSO4, 0.2 mM each of the dNTPs, 0.25 μM each of the synthetic oligonucleotides, 1.0 unit of KOD Plus DNA polymerase, and the expression plasmid for P_Bac as the template DNA. The PCR conditions were 95 °C for 3 min, followed by 25 cycles of 95 °C for 30 s, 55 °C for 30 s, and 68 °C for 1 min. The PCR product was digested with NdeI and BamHI (New England Biolabs), and cloned into the NdeI-BamHI site of pET23a( +). The genes encoding simplified P_Bac variants devoid of multiple types of amino acid were artificially synthesized by Eurofins Genomics and cloned into the NdeI-XhoI site of plasmid pET23a( +) (Merck) to produce the protein as a C-terminally His-tagged form.

Overexpression of uS8 Proteins

E. coli C41 (DE3) pLysS (Lucigen) was transformed with the expression plasmids for the bacterial ancestral proteins and simplified variants. Transformants were spread on Luria–Bertani (LB) medium plates supplemented with ampicillin (150 μg/ml) and grown overnight at 37 °C. For protein production, one colony was inoculated into 2 ml of LB liquid medium containing ampicillin (150 μg/ml) and shaken at 37 °C for 15 h. Next, 2 ml of this culture was added to 200 ml of LB medium containing ampicillin (150 μg/ml) and shaken at 37 °C for 3 h. Isopropyl β-D-thiogalactopyranoside (final concentration, 1 mM) was added and incubation was continued at 30 °C for 18 h. Finally, the cells were harvested by centrifugation at 5,000×g for 10 min, the supernatant was removed, and the cells were stored at –20 °C.

Purification of Proteins

To purify the bacterial ancestral proteins and the simplified variants lacking a single type of amino acid, each cell pellet was resuspended in 10 ml of 20 mM Tris–HCl, pH 6.8, 800 mM NaCl, disrupted by sonication, and then centrifuged at 18,000×g for 20 min at 4 °C. The supernatant was heat-treated at 70 °C for 20 min to precipitate proteins originating from E. coli, which were removed by centrifugation at 18,000×g for 20 min at 4 °C. The resulting supernatant was diluted with 20 mM Tris–HCl, pH 6.8, to a NaCl concentration of 250 mM and then passed through HiTrap-SP FF (Cytiva). Fractions containing uS8 were recovered, dialyzed against 20 mM Tris–HCl, pH 8.8, 250 mM NaCl, and then passed through HiTrap-SP FF again.

To purify the simplified variants lacking multiple types of amino acid, each cell pellet was resuspended in 10 ml of 20 mM Tris–HCl, pH 7.5, 800 mM NaCl, 30 mM imidazole and disrupted by sonication. After centrifugation at 18,000×g for 20 min at 4 °C, the supernatant was passed through a HisTrap HP column (Cytiva). T. thermophilus uS8 was purified by a similar method.

B. anthracis uS8 was purified under denaturing conditions because the protein collected in inclusion bodies. The cell pellet was resuspended in 10 ml of 20 mM Tris–HCl, pH 7.5, 800 mM NaCl, 30 mM imidazole, and disrupted by sonication. The soluble protein fraction was removed by centrifugation at 18,000×g for 20 min at 4 °C. Insoluble protein was dissolved in 10 ml of 20 mM Tris–HCl, pH 7.5, 800 mM NaCl, 1 mM dithiothreitol, 30 mM imidazole containing 7.0 M urea, and passed through a HisTrap HP column (Cytiva). The solution containing B. anthracis uS8 was step-wise dialyzed against 20 mM Tris–HCl, pH 7.5, 800 mM NaCl containing 7.0 M, 5.0 M, 3.0 M, 1.0 M and 0.5 M urea. Lastly, the protein solution was dialyzed against 20 mM Tris–HCl, pH 7.5, 800 mM NaCl twice, and the protein molecules that remained insoluble were removed by centrifugation at 18,000×g for 20 min at 4 °C.

The purity of each protein was > 95% as judged by SDS–polyacrylamide gel electrophoresis (SDS-PAGE) followed by Coomassie Brilliant Blue staining (Fig. S5). Protein concentrations were determined by measuring the A280 values of the samples as described by Pace et al. (1995) because all proteins analyzed in this study contained either tyrosine or tryptophan, or both residues.
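The A280-based concentration estimate relies on the relationship from Pace et al. (1995), in which the molar extinction coefficient at 280 nm is approximated as 5500 per tryptophan, 1490 per tyrosine and 125 per cystine (M⁻¹ cm⁻¹), combined with the Beer–Lambert law. The sketch below shows the arithmetic; the function names and the 1-cm default path length are assumptions for illustration.

```python
def molar_extinction_280(sequence: str, n_cystine: int = 0) -> float:
    """Approximate extinction coefficient at 280 nm (M^-1 cm^-1), Pace et al. 1995."""
    seq = sequence.upper()
    n_trp = seq.count("W")
    n_tyr = seq.count("Y")
    if n_trp == 0 and n_tyr == 0:
        # Without Trp or Tyr the A280 method is not applicable.
        raise ValueError("sequence contains neither tryptophan nor tyrosine")
    return 5500.0 * n_trp + 1490.0 * n_tyr + 125.0 * n_cystine


def concentration_molar(a280: float, sequence: str,
                        path_cm: float = 1.0, n_cystine: int = 0) -> float:
    """Protein concentration in mol/L from absorbance via Beer-Lambert."""
    epsilon = molar_extinction_280(sequence, n_cystine)
    return a280 / (epsilon * path_cm)
```

The check for tryptophan or tyrosine mirrors the condition stated in the text: the method applies only because every protein analyzed contains at least one of these residues.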

Circular Dichroism Measurement

Circular dichroism (CD) measurements were carried out using a J-1100 CD spectropolarimeter (Jasco) equipped with a programmable temperature controller. Proteins were diluted to 20 μM in 20 mM potassium phosphate buffer (pH 7.6), 200 mM NaCl, and placed in a quartz glass cell with a 0.1-cm path length. Far-UV CD spectra were recorded from 200 to 250 nm at 25 °C.

Temperature-induced unfolding of the proteins was measured in duplicate by monitoring the change in ellipticity at 222 nm. Proteins were diluted to 20 μM in 20 mM potassium phosphate buffer, pH 7.6, 200 mM NaCl. The temperature was increased at a rate of 1.0 °C/min. A pressure-proof cell compartment was used to prevent the solutions from bubbling and evaporating at high temperatures.
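An unfolding midpoint temperature (Tm) can be estimated from such a melting curve by normalizing the ellipticity between the native and denatured baselines and locating the temperature at which the unfolded fraction reaches 0.5. The sketch below assumes flat baselines estimated from the first and last few data points and uses linear interpolation; the authors' actual baseline treatment and fitting procedure are not specified in the text, so this is only one simple way to do it.

```python
from typing import List, Sequence


def fraction_unfolded(ellipticity: Sequence[float], n_baseline: int = 3) -> List[float]:
    """Normalize a melting curve to 0 (native) ... 1 (denatured), flat baselines assumed."""
    if len(ellipticity) < 2 * n_baseline:
        raise ValueError("not enough points to estimate both baselines")
    native = sum(ellipticity[:n_baseline]) / n_baseline
    denatured = sum(ellipticity[-n_baseline:]) / n_baseline
    span = denatured - native
    if span == 0:
        raise ValueError("no unfolding transition detected")
    return [(value - native) / span for value in ellipticity]


def midpoint_temperature(temps: Sequence[float], fractions: Sequence[float]) -> float:
    """Temperature at which the unfolded fraction first crosses 0.5."""
    for i in range(len(temps) - 1):
        f0, f1 = fractions[i], fractions[i + 1]
        if f0 < 0.5 <= f1:
            t0, t1 = temps[i], temps[i + 1]
            return t0 + (0.5 - f0) * (t1 - t0) / (f1 - f0)
    raise ValueError("unfolded fraction never crosses 0.5")
```

With sloped baselines, each baseline would instead be fitted as a line in temperature before normalizing.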

RNA-Binding Assays

Interactions between an RNA fragment and I_Bac, P_Bac and some variants of P_Bac were examined quantitatively using a BLItz System with a streptavidin sensor chip (ForteBio/Sartorius Japan) at 25 °C. We used a previously reported RNA fragment selected for binding to B. anthracis uS8 by an in vitro aptamer selection method (Davlieva et al. 2014). The sequence (5’-GGG AUG CUC AGU GAU CCU UCG GGA UAU CAG GGC AUC CC-3’) with a 5’ biotin modification was artificially synthesized by Eurofins Genomics. The sensor chip was washed with running buffer (20 mM Tris–HCl, pH 7.5, 800 mM NaCl, and 50 μM BSA), placed in RNA (50 μM) solution diluted with running buffer, and washed with buffer again. The sensor chip was then placed in uS8 (2.0 or 10 μM) solution diluted with running buffer, and binding of uS8 to RNA captured on the sensor chip was measured. Lastly, the sensor chip was placed in the running buffer to measure the dissociation of uS8 from the RNA fragment. The rate constants for association (ka) and dissociation (kd), and the dissociation constant (KD), were calculated by the BLItz System built-in software.

Interactions between the RNA fragment and P_Bac variants lacking a single type of amino acid were examined by pull-down assay. The RNA fragment was captured with Magnosphere MS300/Streptavidin (SR Life Sciences) magnetic beads. In brief, 1 µL of 50 μM RNA solution was incubated with 100 µL of magnetic beads (pre-treated according to the manufacturer’s protocol) for 15 min at 4 °C with agitation. The beads were collected on a magnetic stand, the supernatant was removed, and the RNA-bound beads were washed twice with 20 mM Tris–HCl, pH 7.5. The beads were resuspended in 100 µL of 20 mM Tris–HCl, pH 7.5, 2 mM MgCl2, 0.1% Tween 20, 800 mM NaCl, an equal volume of a solution containing 3.0 µM uS8 and 30 µM BSA was added, and the suspension was incubated for 15 min at 4 °C with agitation. The beads were then collected on the magnetic stand, the supernatant was removed, and the beads were washed three times with 20 mM Tris–HCl, pH 7.5, 800 mM NaCl, 0.1% Tween 20. Lastly, the beads were resuspended in 10 µL of ultrapure water, uS8 was dissociated by boiling for 20 min in 2% SDS, and the samples were analyzed by SDS-PAGE. The interaction of the variants with RNA-free magnetic beads was also analyzed as a negative control.

Results and Discussion

Phylogenetic Tree Building and Ancestral Amino Acid Sequence Inference

The first step in reconstructing an ancestral sequence is to perform a multiple sequence alignment using the amino acid sequences of the target protein from multiple living organisms. The alignment is then used to build a phylogenetic tree by a modeling approach, such as maximum–likelihood (ML) (Yang et al. 1995) or Bayesian (Yang and Rannala 1997) modelling. In this study, an ML method was used because ML is reported to be relatively accurate in the reconstruction of ancestral sequences (Hanson-Smith et al. 2010).

The multiple sequence alignment included the amino acid sequences of the ribosomal protein uS8 from 582 bacterial, 140 archaeal and 138 eukaryotic species. Although there were many more sequences from bacteria than from archaea or eukaryotes, this was not considered an issue because the primary aim at this stage was to infer the sequence of the bacterial common ancestor. In the resulting phylogenetic tree, built using the ML program IQ-TREE (Nguyen et al. 2015), the bacterial and archaeal sequences are clearly divided into their own monophyletic groups, while eukaryotic sequences are found among the archaeal sequences (Fig. S1). Our phylogenetic tree supports the two-domain hypothesis of all modern life, which proposes that eukaryotes emerged within the archaeal domain (Cox et al. 2008; Raymann et al. 2015; Rivera and Lake 1992; Williams et al. 2013), and is consistent with a recently reported tree built using the concatenated sequences of ribosomal proteins (Hug et al. 2016). By contrast, a phylogenetic tree based on small subunit ribosomal RNA sequences showed a monophyletic status of Bacteria, Archaea, and Eukarya (Woese et al. 1990), with Eukarya located as a sister group of Archaea. That tree, together with various molecular phylogenetic and phylogenomic studies, supports the three-domain hypothesis of all modern life (Ciccarelli et al. 2006; Fournier and Gogarten 2010; Harris et al. 2003; Rinke et al. 2013; Yutin et al. 2008).

Because the eukaryotic sequences are unlikely to influence estimation of the sequence at the deepest node, we removed them from the dataset, as well as some prokaryotic sequences that might cause long-branch attraction, and recalculated the tree. In the final tree built from 527 bacterial and 124 archaeal sequences (Fig. 1 and Fig. S3), the two major domains form two distinct monophyletic groups.

To define the root of the phylogenetic tree, we needed to include sequences that diverged from uS8 before LUCA as an outgroup. However, no such sequences were identified in our Blast search. Therefore, the sequence at the deepest bacterial node was inferred from the tree by treating the archaeal sequences as an outgroup. We used two programs to predict the sequence of the last bacterial common ancestor: the sequence predicted by CODEML in PAML (Yang 2007) was named P_Bac; and that predicted by IQ-TREE (Nguyen et al. 2015) was named I_Bac (Fig. 2). The amino acid sequences of P_Bac and I_Bac were very similar (121 of 130 residues are identical). Furthermore, most of the identical residues in P_Bac and I_Bac had an a posteriori probability of higher than 0.9 (Fig. S4); therefore, the inclusion of these residues seems likely to be correct. In contrast, the residues that differed between the two sequences had a relatively low a posteriori probability.

Thermal Stabilities of P_Bac and I_Bac

Genes encoding the two inferred ancestral amino acid sequences were artificially synthesized and the encoded proteins were individually expressed in E. coli and purified. Ellipticity at 222 nm was monitored as a function of temperature to generate the temperature-induced unfolding curves of the two ancestral proteins. As shown in Fig. 3, both P_Bac and I_Bac were less thermostable than T. thermophilus uS8, but still had very high unfolding mid-point temperatures (Tm, ~ 85 °C for both proteins), comparable to those of extant thermophilic proteins.

Fig. 3

Thermal denaturation of T. thermophilus and ancestral uS8 proteins. Change in ellipticity at 222 nm was monitored as a function of temperature for T. thermophilus uS8 (dotted line), P_Bac (solid line), and I_Bac (dashed line). The temperature was increased at a rate of 1.0 °C/min. The samples comprised 20 μM protein in 20 mM potassium phosphate (pH 7.6), 200 mM NaCl. Each experiment was conducted in duplicate, which produced identical melting profiles within experimental error. The plots have been normalized with respect to the baseline of the native and denatured states

The high thermal stability of bacterial ancestral uS8 is consistent with observations of other reconstructed ancestral proteins (Akanuma et al. 2013; Busch et al. 2016; Butzin et al. 2013; Gaucher et al. 2008; Gumulya et al. 2018). There is often a direct correlation between the unfolding temperature of a protein and the optimal environmental temperature of its host organism (Akanuma et al. 2013; Gromiha et al. 1999). Furthermore, most reconstructed ancestral proteins are very thermostable, suggesting that ancestral organisms such as the last bacterial common ancestor, last archaeal common ancestor, and LUCA were thermophilic or hyperthermophilic. The high thermal stabilities of the two reconstructed bacterial ancestral ribosomal uS8 proteins also support the idea that the last bacterial common ancestor was a thermophilic organism that thrived in a high-temperature environment. We note, however, that the environmental temperature of primitive organisms remains under debate, and a non-thermophilic ancestry of life is supported by computational studies focusing on the environmental temperatures experienced by ancient life (Boussau et al. 2008; Galtier et al. 1999; Groussin et al. 2013).

It should also be noted that an accurate tree is not always obtained and ancestral sequences cannot be reconstructed with absolute certainty, although the techniques used to infer ancestral sequences have greatly improved in the past decade. Therefore, any implications derived from the tree and the ancestral sequence are hard to verify. Williams et al. proposed that the high thermal stabilities observed for ancestral proteins might be related to the inherent nature of the ancestral sequence reconstructions (Williams et al. 2006). They asserted that an inaccurately reconstructed sequence would result in an overestimation of its thermostability. Furthermore, Tawfik and coworkers have suggested that a high environmental temperature may not have been the only factor requiring the high thermodynamic stability of ancestral proteins (Trudeau et al. 2016). They considered that the stability of ancestral proteins might have been driven by high oxidative pressure and radiation levels, the absence of cellular osmolytes and/or chaperones, or the low fidelity of the transcription–translation machinery.

Other studies based on ancestral sequence reconstruction have also connected ancient proteins to early environments. For example, Schopf and colleagues reconstructed proteins from phototrophic species, which suggested that there has been a general cooling of the Earth’s photic zone from the Archean Eon to the present. In addition, Kaçar and colleagues resurrected a Precambrian-age, ancestral RuBisCO gene from extant cyanobacteria (Kędzior et al. 2022) and found that the carbon isotope signatures of the engineered cyanobacteria being cultured under potential Precambrian environments fell within modern ranges. Therefore, uniformitarian assumptions of carbon isotope signatures over geologic time might be justified, but with an important caveat because the modern organism and its proteins might have influenced the ancestral RuBisCO phenotype. Garcia and Kaçar also warned of the pitfalls of facilely interpreting paleophenotype models and data (Garcia and Kaçar 2019).

Nevertheless, even if the inferred sequence is not the correct ancestral sequence and the high thermostability of the reconstructed proteins does not reflect a high-temperature environment of the ancient organism, proteins with high thermostability remain suitable as starting molecules to simplify amino acid usage. Therefore, one of the ancestral uS8 proteins was chosen as the scaffold on which to reduce the size of the amino acid alphabet.

Interaction Between Ancestral uS8 and RNA

The interaction of the two wild-type and two ancestral uS8 proteins with biotinylated synthetic RNA captured on a streptavidin sensor chip was quantitatively analyzed in the presence of 800 mM NaCl and 50 μM BSA to suppress nonspecific adsorption of the protein. We used an RNA fragment selected by the systematic evolution of ligands using an exponential enrichment (SELEX) technique (Davlieva et al. 2014). We note here that the interactions of protein residues with SELEX-generated RNA are not the same as those in natural protein–RNA interactions. The sensor chip provides real-time data on molecular interactions: when protein molecules interact with the RNA on the sensor chip, the binding signal shifts in the positive direction. Here, the binding signal sharply increased after the RNA-bound chip was exposed to both ancestral uS8 solutions, suggesting the formation of a protein–RNA complex on the sensor chip (Fig. 4). Upon changing the solution to protein-free buffer, the binding signal slightly decreased, indicating the dissociation of some protein molecules from RNA.

Fig. 4

Interaction of wild-type uS8 proteins, ancestral uS8 proteins and reduced-alphabet variants with RNA. RNA binding of T. thermophilus uS8 (yellow), B. anthracis uS8 (brown), P_Bac (magenta), I_Bac (blue), P_Bac-15 (green), P_Bac-14 (cyan), P_Bac-13N (purple) and P_Bac-13M (grey) was measured at a protein concentration of 2.0 μM (A) or 10 μM (B) using a streptavidin sensor chip (BLItz system). The association and dissociation curves of P_Bac-13L were almost identical to those of P_Bac-13M and have been therefore omitted. The running buffer was 20 mM Tris–HCl, pH 7.5, 800 mM NaCl, and 50 μM BSA. The RNA fragment used was previously selected for binding to B. anthracis uS8 (Davlieva et al. 2014) (Color figure online)

The BLItz System’s built-in software generated ka, kd and KD values from the association and dissociation curves (Table 1). The KD values observed for B. anthracis uS8 (33 × 10⁻⁷ M and 14 × 10⁻⁷ M at protein concentrations of 10 μM and 2.0 μM, respectively) were more than 10 times higher, indicating weaker binding, than the value (1.1 × 10⁻⁷ M) previously reported for the binding of B. anthracis uS8 to an RNA fragment with a similar but not identical sequence (Davlieva et al. 2014). This difference may have two causes: (i) the previous study measured binding to free RNA, whereas we measured binding to RNA immobilized on a sensor chip; (ii) the previous study measured RNA–protein binding at a moderate salt concentration (150 mM potassium acetate), whereas our measurement was performed in high salt (800 mM sodium chloride), which may have somewhat inhibited the binding of B. anthracis uS8 to RNA. In contrast, the binding of T. thermophilus uS8 to the RNA fragment exhibited much lower KD values, and thus tighter binding, of 6.2 × 10⁻⁸ M and 3.5 × 10⁻⁸ M at protein concentrations of 10 μM and 2.0 μM, respectively.

Table 1 Kinetic parameters for the interaction of RNA with the wild-type, ancestral uS8 proteins and simplified variants

I_Bac showed KD values of 8.4 × 10⁻⁸ M and 5.5 × 10⁻⁸ M at protein concentrations of 10 μM and 2.0 μM, respectively, while P_Bac showed KD values of 35 × 10⁻⁸ M (10 μM) and 27 × 10⁻⁸ M (2.0 μM); thus, I_Bac showed 4–5-fold stronger binding. The KD values observed for I_Bac were similar to those observed for the binding of T. thermophilus uS8 to the RNA fragment.
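For a simple 1:1 binding model, the equilibrium dissociation constant is the ratio of the rate constants, KD = kd/ka, and relative affinities can be compared as a ratio of KD values. The short sketch below illustrates both calculations; the function names are hypothetical, and the assumption of 1:1 binding is ours, since the fitting model used by the instrument software is not stated in the text.

```python
def dissociation_constant(ka: float, kd: float) -> float:
    """K_D (M) from association (M^-1 s^-1) and dissociation (s^-1) rate constants."""
    if ka <= 0:
        raise ValueError("association rate constant must be positive")
    return kd / ka


def fold_difference(kd_weaker: float, kd_stronger: float) -> float:
    """How many times tighter the second binder is, as a ratio of K_D values."""
    if kd_stronger <= 0:
        raise ValueError("K_D must be positive")
    return kd_weaker / kd_stronger


if __name__ == "__main__":
    # K_D values for P_Bac and I_Bac quoted in the text, at 10 uM and 2.0 uM protein.
    print(round(fold_difference(35e-8, 8.4e-8), 1))
    print(round(fold_difference(27e-8, 5.5e-8), 1))
```

Both ratios fall between 4 and 5, which is the basis for the "4–5-fold" comparison between I_Bac and P_Bac.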

Effect of Eliminating One Amino Acid Letter on the Stability and RNA Binding of P_Bac

In a first step toward identifying a minimal set of amino acids that would retain the stability and RNA-binding properties of uS8, we individually eliminated each type of amino acid from the inferred uS8 ancestral protein. For this experiment, we used P_Bac because, based on the a posteriori probability, the accuracy of the residues in P_Bac was higher than that in I_Bac (Fig. S4). We constructed 19 simplified variants of P_Bac, the sequences of which each lacked one amino acid letter. Because P_Bac contains no histidine residues, each variant comprised an 18-amino-acid alphabet. In each variant, amino acids were replaced with the “second-best” ancestral residue; in other words, ancestral amino acids with a posteriori probability < 1.0 were replaced by the amino acid that showed the second-highest probability. Ancestral amino acids with a posteriori probability of 1.0 were replaced by the amino acid that was second most frequent at the corresponding position in the multiple sequence alignment used for tree building and ancestral sequence inference. Completely conserved residues were replaced by physicochemically similar amino acids. For constructing variants lacking methionine, the N-terminal residue was not taken into account. The amino acid sequences of P_Bac and its 19 variants are given in Supplementary Data 1 and aligned in Fig. S6.
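The three-tier replacement rule described above can be sketched in Python as follows. This is only an illustration of the decision order: the function name choose_replacement, the input layout (a dictionary of a posteriori probabilities and a list of residues from one alignment column) and, in particular, the SIMILAR table are assumptions made here, because the text does not list the physicochemically similar pairs that were actually used.

```python
from collections import Counter

# Hypothetical similarity table (assumption): the pairs actually used are not given in the text.
SIMILAR = {"F": "Y", "W": "F", "C": "S", "T": "S", "Q": "N"}


def choose_replacement(target, posteriors, column):
    """Pick a substitute for the eliminated residue `target` at one position.

    posteriors: dict mapping residue -> a posteriori probability at the ancestral site.
    column: residues observed at this position in the multiple sequence alignment.
    """
    # Rule 1: site inferred with probability < 1.0 -> next most probable residue.
    others = {aa: p for aa, p in posteriors.items() if aa != target and p > 0.0}
    if posteriors.get(target, 0.0) < 1.0 and others:
        return max(others, key=others.get)
    # Rule 2: probability of 1.0 -> most frequent other residue in the alignment column.
    counts = Counter(aa for aa in column if aa not in (target, "-"))
    if counts:
        return counts.most_common(1)[0][0]
    # Rule 3: completely conserved position -> physicochemically similar residue.
    return SIMILAR[target]


# Toy inputs (not taken from the real alignment).
print(choose_replacement("F", {"F": 0.7, "Y": 0.2, "L": 0.1}, list("FFYFL")))
print(choose_replacement("C", {"C": 1.0}, list("CCSSA")))
print(choose_replacement("T", {"T": 1.0}, list("TTTT")))
```

With these toy inputs, the three calls exercise rules 1, 2 and 3 in turn. The special handling of the N-terminal methionine mentioned in the text is not modeled in this sketch.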

The variants that lacked glycine, glutamate, isoleucine or lysine appeared to be insoluble and could not be subjected to further analysis. It seems likely that the presence of glycine, glutamate, isoleucine and lysine is crucial for the proper folding and/or thermodynamic stability of the protein. The other 15 variants were successfully expressed and recovered as soluble forms.

We measured the temperature-induced unfolding of each protein by monitoring the change in ellipticity at 222 nm as a function of temperature (Fig. S7). For each protein, the duplicate measurements gave identical unfolding curves within experimental error (data not shown). The unfolding curves of P_Bac and its variants showed a single transition (Fig. S7), the midpoint of which was used to compare the thermal stabilities of the proteins and identify which types of amino acid are important for protein stability. We also assessed the interaction between the variants and an RNA fragment by a magnetic beads-based pull-down assay (Fig. S8).
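The text does not state how the transition midpoint was extracted from each unfolding curve. One common approach, shown here as a minimal Python sketch under that caveat, is to normalize the ellipticity between the folded and unfolded baselines and interpolate the temperature at which half of the total change has occurred. The function name unfolding_midpoint, the flat-baseline assumption and the synthetic data are all illustrative and do not come from the study.

```python
import math


def unfolding_midpoint(temps, signal):
    """Estimate the temperature at which half of the total signal change has occurred.

    Assumes a single transition with flat pre- and post-transition baselines,
    approximated here by the first and last data points.
    """
    folded, unfolded = signal[0], signal[-1]
    fraction = [(s - folded) / (unfolded - folded) for s in signal]
    for i in range(1, len(temps)):
        f0, f1 = fraction[i - 1], fraction[i]
        if f0 < 0.5 <= f1:
            # Linear interpolation between the two points that bracket 0.5.
            return temps[i - 1] + (0.5 - f0) * (temps[i] - temps[i - 1]) / (f1 - f0)
    raise ValueError("no midpoint crossing found")


# Synthetic two-state curve centered at 84 degrees C (illustrative data only).
temps = [float(t) for t in range(40, 111, 2)]
signal = [-20.0 + 15.0 / (1.0 + math.exp(-(t - 84.0) / 2.0)) for t in temps]
tm = unfolding_midpoint(temps, signal)
print(f"estimated midpoint: {tm:.1f} degrees C")
```

For curves with sloping baselines, or without a cooperative transition, this simple interpolation would not be appropriate, and a fit to a two-state model would be the more usual choice.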

Table 2 summarizes the unfolding midpoint temperature and the RNA-binding ability of the simplified variants, showing that the elimination of some amino acid letters from the sequence of P_Bac exerts a large effect on its stability and/or RNA-binding ability. For example, elimination of arginine, tyrosine, proline, alanine or serine resulted in a lower unfolding temperature and loss of RNA-binding activity. Among these residues, the hydroxyl groups on the side chains of serine at positions 105 and 107 are predicted to be closely involved in RNA-binding via the formation of hydrogen bonds with the RNA molecule (Fig. S9). In addition, valine was found to be crucial for RNA binding but not for thermal stability. Elimination of methionine, asparagine or leucine lowered the unfolding temperature but did not affect RNA-binding activity, while elimination of glutamine moderately reduced the unfolding temperature. In contrast, the remaining four amino acids (phenylalanine, tryptophan, cysteine, threonine) could be eliminated from the sequence of P_Bac without compromising its stability or RNA-binding activity. Based on the structure of the B. anthracis uS8 and RNA complex (Davlieva et al. 2014), methionine, asparagine, leucine, phenylalanine, tryptophan, cysteine and threonine are unlikely to be involved in RNA binding (Fig. 2 and Fig. S9). In contrast, the side chain of glutamine at position 56 may form hydrogen bonds with the RNA molecule (Fig. S9). Nevertheless, replacement of the glutamine residue by lysine did not affect RNA binding (Table 2 and Fig. S8). Thus, the various types of amino acid do not contribute equally to the stability and RNA binding of P_Bac; in particular, these findings suggested that phenylalanine, tryptophan, cysteine and threonine might be eliminated in combination to produce more simplified variants of ancestral uS8.

Table 2 Unfolding midpoint temperatures and RNA-binding activity of simplified P_Bac variants lacking a single type of amino acid

Construction of Simplified P_Bac Variants Lacking Multiple Amino Acid Letters

Next, we tested whether four or more types of amino acid could be eliminated in combination without substantial loss of stability or RNA-binding activity. We first eliminated phenylalanine, tryptophan, cysteine and threonine from the sequence of P_Bac by replacing them with “second-best” ancestral amino acids. The resulting protein, P_Bac-15 (Supplementary Data 1; Fig. S6), was reasonably thermally stable (Tm = 84 °C; Fig. 5A) and its RNA-binding activity was significant (Fig. 5B). We further eliminated glutamine from P_Bac-15, thus producing P_Bac-14 (Supplementary Data 1; Fig. S6). The thermal stability of P_Bac-14 (Tm = 86 °C) was comparable to that of P_Bac and P_Bac-15 (Fig. 5A), and its RNA-binding activity was also significant (Fig. 5B). These findings indicate that a reduced amino acid alphabet, comprising only 14 amino acid letters, is sufficient to achieve high thermal stability and strong RNA binding in the ribosomal protein uS8.

Fig. 5

Functional analysis of P_Bac-15 and P_Bac-14. A Thermal denaturation curves of P_Bac-15 (solid line) and P_Bac-14 (dashed line). The unfolding midpoint temperature is indicated. B RNA-binding assay. The binding buffer was 20 mM Tris–HCl (pH 7.5), 800 mM NaCl, 0.1% Tween 20. The wash buffer was 20 mM Tris–HCl (pH 7.5), 2 mM MgCl2, 0.1% Tween 20 containing 800 mM NaCl. Plus and minus signs above the gels represent the presence and absence of RNA, respectively

We further eliminated methionine, asparagine or leucine from P_Bac-14 by replacing them with “second-best” ancestral amino acids to produce P_Bac-13M, P_Bac-13N and P_Bac-13L, respectively (Supplementary Data 1; Fig. S6). It should be noted that methionine, leucine and asparagine are not directly involved in RNA binding (Fig. 2 and Fig. S9). The far-UV CD spectra of P_Bac-13M and P_Bac-13N (Fig. 6) indicated the presence of significant secondary structure in these variants; however, the ellipticities were smaller in magnitude than those of P_Bac. The smaller ellipticity may reflect a reduced secondary structure content; alternatively, it might indicate that a certain percentage of protein molecules did not fold correctly even at room temperature. The far-UV CD spectrum of P_Bac-13L indicated that this variant did not contain significant secondary structure. In the temperature-induced unfolding experiment, none of the three variants (P_Bac-13M, P_Bac-13N and P_Bac-13L) showed a cooperative secondary structure unfolding transition under the conditions used (Fig. S10). Therefore, Tm values could not be determined for these three proteins.
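The variant names follow from simple bookkeeping on the amino acid alphabet, which can be checked with a short Python sketch. The variable names and the helper alphabet_of are illustrative assumptions; the eliminated letters themselves are those stated in the text.

```python
# The 20 standard proteinogenic amino acids in one-letter code.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Alphabets of the ancestral protein and its reduced variants, as described in the text.
p_bac = STANDARD - {"H"}            # P_Bac contains no histidine
p_bac_15 = p_bac - set("FWCT")      # minus Phe, Trp, Cys, Thr
p_bac_14 = p_bac_15 - {"Q"}         # minus Gln
p_bac_13 = {
    name: p_bac_14 - {aa}
    for name, aa in (("P_Bac-13M", "M"), ("P_Bac-13N", "N"), ("P_Bac-13L", "L"))
}


def alphabet_of(sequence):
    """Return the set of distinct standard amino acid letters used in a sequence."""
    return set(sequence.upper()) & STANDARD


print(len(p_bac), len(p_bac_15), len(p_bac_14))
for name, letters in p_bac_13.items():
    print(name, len(letters), "missing:", "".join(sorted(STANDARD - letters)))
```

Applied to the actual sequences in Supplementary Data 1, alphabet_of could be used to confirm that each variant uses no more than the intended set of letters.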

Fig. 6

Far-UV CD spectra of reduced-alphabet uS8 ancestral variants. Spectra are shown for P_Bac (thick solid line), P_Bac-13M (solid line), P_Bac-13N (dashed line) and P_Bac-13L (dotted line). The samples comprised 20 μM proteins in 20 mM potassium phosphate buffer (pH 7.6), 200 mM NaCl

We also measured the interaction of P_Bac-15, P_Bac-14 and P_Bac-13N with an RNA fragment using a streptavidin sensor chip and recorded their association and dissociation curves (Fig. 4). At protein concentrations of 10 μM and 2.0 μM, the KD values of P_Bac-15 were both 4.8 × 10–7 M, similar to those of P_Bac (Table 1). Furthermore, the KD values of P_Bac-14 were 1.9 × 10–6 M (10 μM) and 2.0 × 10–6 M (2.0 μM), indicating that this simplified uS8 variant also interacted with the RNA fragment significantly, albeit with weaker binding than P_Bac and P_Bac-15. Unexpectedly, the assay indicated that P_Bac-13N also bound to RNA to some extent, although the shift in binding signal was much smaller than that observed for the ancestral uS8 protein, P_Bac-15 and P_Bac-14. Correspondingly, the KD values of P_Bac-13N were 3.3 × 10–6 M and 2.1 × 10–6 M at protein concentrations of 10 μM and 2.0 μM, respectively, similar to those of P_Bac-14 (Table 1). No change in binding signal was observed on exposure of the RNA-bound chip to P_Bac-13M or P_Bac-13L solution (Fig. 4), showing that these two simplified proteins did not bind to the RNA fragment.

Implications for the Amino Acid Repertoire in Primitive RNA-Binding Proteins

The amino acid repertoire used in primordial protein synthesis must be closely related to the origin and early evolution of the genetic code. The ‘frozen accident’ and other theories commonly predict that, by gradually incorporating new amino acids into the repertoire, the modern genetic code has progressively evolved from a primitive, simpler one involving a subset of the current 20 proteinogenic amino acids (Baumann and Oro 1993; Crick 1968; Eigen and Schuster 1977; Higgs 2009; Ikehara et al. 2002; Johnson and Wang 2010; Wong 1975). Several studies have proposed that the functionality of proteins would have been increased by the amino acids added later to the proteinogenic amino acid repertoire. For example, the side chains of subsequently added amino acids might have had higher chemical reactivity (Granold et al. 2018), and expanded the chemistry space in terms of size, charge and hydrophobicity (Ilardo and Freeland 2014). They might also have increased protein function (Francis 2013), or both protein structure and function to improve the fitness of primitive organisms (Muller et al. 2013). Consistent with these ideas, Trifonov proposed an all-encompassing order for amino acid emergence (G/A, V/D, P, S, E/L, T, R, N, K, Q, I, C, H, F, M, Y, W; Trifonov 2000). Recently, Mayer-Bacon and Freeland examined how the current set of 20 proteinogenic amino acids is distributed throughout extant life in terms of quantitative measures (Mayer-Bacon and Freeland 2021), showing that the remarkable distributions of volume, hydrophobicity and charge (pKa) become far more obscure when comparing a prebiotically plausible subset of amino acids with a much smaller subset of prebiotically plausible alternatives detected in meteorites. Lastly, Masel and colleagues performed integrated phylostratigraphy across 435 organisms with full genome sequences, observing that trends in amino acid usage among ancient domains reflect the order in which the amino acids were incorporated into the genetic code (James et al. 2021). They suggested that amino acid usage in the extant descendants of ancient sequences may reflect the availability of the amino acids when the sequences first emerged.

In our experiments to explore the properties of uS8 variants constructed from a reduced set of amino acids, elimination of histidine, phenylalanine, tryptophan, cysteine, threonine, glutamine, or methionine from the ancestral uS8 protein P_Bac had little effect on thermal stability or RNA-binding activity. Notably, these types of amino acid, with the exception of threonine, were presumably incorporated into protein synthesis at a relatively late stage of evolution (Jordan et al. 2005; Trifonov 2000).

The ancestral uS8 variants lacking glutamate, glycine, isoleucine or lysine were insoluble when they were expressed using the recombinant E. coli expression system. It is possible that these variants could not form adequate tertiary structures. In our previous experiment, in which the size of the amino acid set constituting an ancestral nucleoside kinase was systematically reduced, the two variants lacking either glycine or glutamate seemed to be insoluble (Shibue et al. 2018). Therefore, glutamate and glycine – considered members of the plausible prebiotically available amino acid set and presumably incorporated into protein synthesis at a relatively early stage of evolution – may be necessary for ensuring a stable conformation of proteins in general.

Because RNA is negatively charged, it is reasonable to predict that amino acids with positively charged side chains, such as lysine and arginine, will be important for RNA binding by proteins. Indeed, elimination of arginine from ancestral uS8 did result in loss of RNA-binding activity. As mentioned above, however, elimination of lysine from the ancestral uS8 led to no detectable level of expression in E. coli. It seems unlikely that lysine and arginine were synthesized in the prebiotic environment (McDonald and Storrie-Lombardi 2010) and therefore the positively charged amino acids were plausibly unavailable for the earliest protein synthesis. One hypothesis for this discrepancy is that amino acids with a simpler positively charged side chain, such as ornithine and 2,4-diaminobutyrate, may have been used instead of lysine and arginine in the synthesis of primitive proteins. For example, Tawfik and coworkers suggested that the first nucleic acid-binding proteins may have arisen from short simple sequences containing ornithine, which is not used in extant proteins (Longo et al. 2020). Alternatively, metal ions might have mediated the binding of proteins to RNA. In this regard, Hlouchová and coworkers generated a genetic library encoding modified amino acid sequences of the C-terminal domain of ribosomal protein uL11 by combining 10 types of amino acid without a positively charged side chain (Giacobelli et al. 2022). After selection for RNA binding by an mRNA display method, they obtained a uL11 variant in which glutamate residues, instead of positively charged amino acids, facilitated binding to the phosphate groups of RNA via Mg2+ ions. However, the possibility that either lysine or arginine or both were abiotically synthesized in some way and available in the primordial environment cannot be completely ruled out. For example, Sutherland and colleagues have reported an abiotic synthesis pathway for a precursor to arginine (Patel et al. 2015).

Our study has some limitations. First, our sequence-wide individual substitutions ignored any epistatic effects between residues: for each letter of the alphabet, all residues of that kind were substituted at once, whereas compensatory mutations would be likely to occur in nature throughout evolution. Second, we reduced the alphabet for a reconstructed ancestral protein, and it remains to be tested whether mutating wild-type ribosomal proteins would have the same impact. In the future, we will conduct studies along these lines to explore further the implications for evolution of the amino acid alphabet.

In conclusion, construction of a phylogenetic tree based on uS8 amino acid sequences from representative extant organisms enabled us to infer two amino acid sequences corresponding to the last bacterial common ancestor of uS8; the proteins reconstructed from the sequences were thermally stable and bound to an RNA fragment. Among a series of elimination variants, the most simplified sequence variant (P_Bac-13N), lacking seven amino acid letters, was still able to bind to the RNA fragment. Collectively, our findings show that the full set of 20 proteinogenic amino acids is not necessarily essential to create an RNA-binding protein, raising the possibility that primitive RNA-binding proteins in the early stage of evolution were made from a reduced amino acid set. It is impossible, however, to assert definitively that the amino acids excluded from our simplified uS8 protein were not used in the synthesis of primitive proteins. Moreover, even if not all of the current 20 proteinogenic amino acids were available, other non-proteinogenic amino acids might have existed in the primitive environment. In short, it cannot be entirely ruled out that, before the emergence of LUCA, protein synthesis involved more than 20 amino acids, which were subsequently ‘standardized’ to the current set at an early stage of evolution leading to LUCA. In that case, the early single genetic code might have specified multiple amino acids unambiguously in the primitive translation system.