Introduction

Conserved sequences and structures are commonly considered either as representatives of ancestral elements or as products of convergent evolution. In our recent publication (Sobolevsky and Trifonov 2005) we introduced a convenient quantitative measure of sequence conservation: the proportion of genomes in which a given conserved sequence element is present in its entirety. In particular, the longest protein sequence segment that is present in all prokaryotic proteomes sequenced so far, an omnipresent element, is the octamer GHVDHGKT (see also Goto et al. 2002). About 400 other nearly omnipresent elements have been found as well. In a similar study, a collection of “invariant sequence peptides” was extracted from prokaryotic genomes (Prakash et al. 2004, 2005). Such a spread and degree of conservation is hard to view as a result of convergence occurring simultaneously in so many genomes. The conservation, therefore, points rather to an ancient origin of the omnipresent elements. Indeed, calculation of the “ages” of the sequences (Sobolevsky and Trifonov 2005), based on the consensus temporal order of appearance of the 20 amino acids in evolution (Trifonov 2000, 2004), shows a very strong correlation between the conservation of a protein sequence, as defined above, and its evolutionary age.
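The conservation measure defined above (the proportion of proteomes in which an element occurs in its entirety) can be sketched as follows. This is only an illustration: the function name `conservation`, the dictionary layout, and the toy sequences are assumptions made for the example and are not taken from the original analysis.

```python
def conservation(element, proteomes):
    """Fraction of proteomes in which `element` occurs in at least one protein.

    `proteomes` maps a proteome name to a list of protein sequences
    (a hypothetical layout chosen for this sketch).
    """
    if not proteomes:
        raise ValueError("no proteomes given")
    present = sum(
        any(element in protein for protein in proteins)
        for proteins in proteomes.values()
    )
    return present / len(proteomes)


# Toy, made-up proteomes: three "genomes" with one or two short fragments each.
toy_proteomes = {
    "genome_A": ["MKTIGHVDHGKTTLTA", "MSLLRPGRFDRQ"],
    "genome_B": ["MAVGHVDHGKTSLTK"],
    "genome_C": ["MQQLSGGQQQRVAIARA"],
}

print(conservation("GHVDHGKT", toy_proteomes))  # present in 2 of the 3 toy proteomes
print(conservation("LSGGQQQR", toy_proteomes))  # present in 1 of the 3 toy proteomes
```

With this definition an omnipresent element has a value of 1.0, and an element found in 125 of 131 proteomes has a value of about 0.95.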

The question we address in this work is in what types of proteins the omnipresent sequence elements are found, and which structures contain them. The answer would constitute a first chapter of an overview of the sequence/structure elements and functions that project back to the last universal common ancestor.

Materials and Methods

Protein sequences of prokaryotes containing the omnipresent octamers were taken from the database at ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/fasta_files/proteomes/. Protein crystal structures were taken from the PDB crystal database (release July 2004). The structures harboring a given octamer were superimposed by the minimal RMSD (root mean square deviation) procedure to estimate the limits of structure conservation.
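The minimal-RMSD superposition is not specified further in the text. One standard way to obtain it is the Kabsch algorithm, and the sketch below assumes that method, NumPy, and two equally long lists of matched atom coordinates. The function name `kabsch_rmsd` and the toy coordinates are illustrative only and do not reproduce the procedure actually used in this work.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """Minimal RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove the translation by centering both sets on their centroids.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Covariance matrix of the two centered sets and its SVD.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # rotation mapping P0 onto Q0
    diff = (R @ P0.T).T - Q0
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# Made-up coordinates (in angstroms) standing in for five C-alpha atoms.
fragment = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [5.0, 3.6, 0.0],
    [3.0, 6.0, 2.0],
    [0.5, 5.0, 4.5],
])
angle = np.radians(40.0)
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
# The same fragment, rotated and shifted: the minimal RMSD should vanish.
moved = fragment @ rot_z.T + np.array([3.0, -2.0, 5.0])
print(kabsch_rmsd(fragment, moved))  # numerically close to zero
```

Applied to windows of increasing length around an octamer, such a value could be used to judge where the superimposed trajectories start to diverge.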

Results and Discussion

GHVDHGKT—Element of Translation Initiation and Elongation Factors

This sequence is encountered in all 131 prokaryotic proteomes analyzed so far (listed in Sobolevsky and Trifonov 2005). According to the sequence database descriptions (see Materials and Methods), it is found in translation initiation factors of the respective organisms (138 times) and in elongation factors (129 times), as well as in four other proteins.

The PDB protein crystal database (release July 2004) contains a total of 41 occurrences of the octamer. Many of them appear in identical chains within the same crystal. The respective factors and fragments thereof are often present in various co-crystals. Of special interest for this study are structures of pure proteins from diverse organisms, or of their complexes with natural partners, such as GDP in the case of elongation and initiation factors. The search for such structures resulted in two sets, of three and five members, respectively:

Pure factors:

1AIP, chain A. Elongation factor Tu from Thermus thermophilus.

...PHVNVGTIGHVDHGKTTLTAALTY...

1KK0, chain A. Large subunit of initiation factor Eif2 from Pyrococcus abyssi

...AEVNIGMVGHVDHGKTTLTKALTG...

1S0U, chain A. Translation initiation factor Eif2 from Methanococcus jannaschii

...AEVNIGMVGHVDHGKTSLTKALTG...

Cocrystals of the factors with GDP:

1D2E, chain A. Bovine mitochondrial Ef-Tu in complex with GDP.

...PHVNVGTIGHVDHGKTTLTAAITK...

1EFC, chain A. Elongation factor EF-Tu from E. coli (GDP form).

...PHVNVGTIGHVDHGKTTLTAAITT...

1G7S, chain A. Translation initiation factor If2/Eif5B complexed with GDP, Methanobacterium thermoautotrophicum.

...RSPIVSVLGHVDHGKTTLLDHIRG...

1KK3, chain A. Large subunit of initiation factor Eif2 from Pyrococcus abyssi complexed with GDP-Mg2+

...AEVNIGMVGHVDHGKTTLTKALTG...

1TUI, chain A. Elongation factor Tu in complex with GDP, Thermus aquaticus.

...PHVNVGTIGHVDHGKTTLTAALTY...

The 3D structures of the chains containing the octamer were extracted from the PDB and superimposed, separately for the two sets (Fig. 1A, B). The superposition shows that in both cases the chain trajectories, eight residues upstream and eight residues downstream from the octamer (turns), are very similar. Beyond these limits the trajectories diverge significantly (in Fig. 1, only two extra residues are shown at the ends). Interestingly, the structure of the octamer itself is slightly but distinctly different for the pure protein and its GDP forms. The simple consensus sequence (the residue of highest occurrence in each position) of the 24-residue-long structural fragment, calculated from the sequences of all proteins in which the octamer is found, in all genomes, is shown below:

$$ {\matrix{ {\rm R} & {\rm H} & {\rm P} & {\rm N} & {\rm V} & {\rm G} & {\rm T} & {\rm I} & {\underline{\bf G}} \cr {123} & {120} & {120} & {133} & {175} & {134} & {131} & {118} & {271} \cr {\underline{\bf H}} & {\underline{\bf V}} & {\underline{\bf D}} & {\underline{\bf H}} & {\underline{\bf G}} & {\underline{\bf K}} & {\underline{\bf T}} & {\rm T} & {\rm L} \cr {271} & {271} & {271} & {271} & {271} & {271} & {271} & {172} & {259} \cr {\rm L} & {\rm D} & {\rm A} & {\rm I} & {\rm R} & {\rm K} & & & \cr {126} & {125} & {172} & {217} & {121} & {71} & & & } } $$
Fig. 1. Structures harboring the element GHVDHGKT. A Pure elongation and initiation factors. B Same factors with GDP. The octamer is located at the turns. To show the difference between the octamers within the structures A and B, the backbones of the octamers are superimposed by root-mean-square minimization.
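The simple consensus described above (the most frequent residue in each position, together with its count) can be sketched in a few lines. The example uses only the three "pure factor" fragments quoted earlier (1AIP, 1KK0, 1S0U) rather than all 271 occurrences, so its counts are not those of the matrix above. The function name `simple_consensus` is an assumption of this sketch, and ties are broken arbitrarily in favor of the residue seen first.

```python
from collections import Counter


def simple_consensus(windows):
    """Return [(residue, count), ...]: the most frequent residue in each column."""
    if not windows:
        raise ValueError("no sequences given")
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    consensus = []
    for column in zip(*windows):
        # most_common(1) returns the residue with the highest count in this column.
        residue, count = Counter(column).most_common(1)[0]
        consensus.append((residue, count))
    return consensus


# The 24-residue fragments of the three pure factors listed in the text.
fragments = [
    "PHVNVGTIGHVDHGKTTLTAALTY",  # 1AIP
    "AEVNIGMVGHVDHGKTTLTKALTG",  # 1KK0
    "AEVNIGMVGHVDHGKTSLTKALTG",  # 1S0U
]
cons = simple_consensus(fragments)
print("".join(residue for residue, _ in cons))
print([count for _, count in cons])
```

Within the octamer every column has the maximal count, which is what the underlined positions in the consensus matrices express.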

Comparison of this sequence with the closed loop prototypes (Berezovsky and Trifonov 2002, Berezovsky et al. 2003a, 2003b) reveals that it is related to Prototype I (Aleph), with 9 matching residues of 23. Indeed, the above structures are essentially identical to Prototype I. Thus, the most conserved sequence/structure element of prokaryotic proteins belongs to the most abundant type of closed loops, PI (Aleph). This prototype corresponds to the P-loop of GTPases and ABC cassettes (Leipe et al. 2002). It contains the sequence GPSGSGKT, an instance of the Walker A motif GxxxxGK[ST] (Walker et al. 1982), which is directly involved in GTP/ATP binding and hydrolysis. Large-scale molecular evolution analysis of the P-loop GTPases suggests that the initiation and elongation factors are the earliest GTPases, descending from the last common ancestor (Leipe et al. 2002).
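The Walker A pattern GxxxxGK[ST] translates directly into a regular expression, where each x stands for any residue. The sketch below is a minimal illustration and not part of the original analysis. The helper name `has_walker_a` is hypothetical, and LSGGQQQR (an element discussed later in this article) is included only as a negative control.

```python
import re

# G, any four residues, G, K, then S or T.
WALKER_A = re.compile(r"G.{4}GK[ST]")


def has_walker_a(sequence):
    """True if the sequence contains a Walker A motif GxxxxGK[ST]."""
    return WALKER_A.search(sequence) is not None


for octamer in ("GHVDHGKT", "GPSGSGKT", "LSGGQQQR"):
    print(octamer, has_walker_a(octamer))
```

Both GHVDHGKT and the prototype sequence GPSGSGKT satisfy the pattern, whereas LSGGQQQR does not.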

SGSGKSTL—Element of ABC Family Proteins

This octamer is found in 125 of 131 proteomes, in various ABC transporters and ABC family proteins (514 times), in ATP binding proteins (182), in particular those of the lipoprotein releasing system (40), in excinucleases ABC (114), ATPases (14) and other proteins (73). It thus appears to be involved in binding mostly, if not exclusively, ATP, in contrast to the GHVDHGKT octamer, which is involved in binding GTP.

In the PDB database only one representative of the SGSGKSTL element is located:

1MT0, Chain A. ATP-binding domain of haemolysin B from Escherichia coli

...QGEVIGIVGRSGSGKSTLTKLIQR...

The sequence contains the Walker A motif GRSGSGKS, and the structure that harbors the sequence (not shown) is indistinguishable from the representatives of the GHVDHGKT element shown above (Fig. 1). The consensus sequence of the 24-residue-long common structure, calculated from all occurrences of the element SGSGKSTL in prokaryotes, is:

$$ {\matrix{ {\rm K} & {\rm G} & {\rm E} & {\rm F} & {\rm V} & {\rm A} & {\rm I} & {\rm V} & {\rm G} \cr {154} & {743} & {531} & {209} & {347} & {365} & {468} & {242} & {893} \cr {\rm P} & {\underline{\bf S}} & {\underline{\bf G}} & {\underline{\bf S}} & {\underline{\bf G}} & {\underline{\bf K}} & {\underline{\bf S}} & {\underline{\bf T}} & {\underline{\bf L}} \cr {340} & {897} & {897} & {897} & {897} & {897} & {897} & {897} & {897} \cr {\rm L} & {\rm R} & {\rm L} & {\rm L} & {\rm N} & {\rm G} & & & \cr {402} & {318} & {203} & {332} & {220} & {342} & & & \cr }} $$

After alignment it matches the consensus sequence of the Aleph prototype in 17 residues of 23. Thus, the sequence around the SGSGKSTL octamer and the structure in which it is embedded correspond to the closed loops of prototype Aleph, as is the case for the omnipresent octamer GHVDHGKT.

LSGGQQQR—Element of ABC Cassette Proteins

This octamer is identified in 125 of the total of 131 prokaryotic proteomes. It is located in ATP-binding proteins (565 times), including those involved in amino acid (161) and spermidine/putrescine (34) transport, as well as in the transport of sugars, iron(III) and polyamines. One hundred and eighty three other ABC transporters contain this element, including phosphate import proteins pstB (136). It is also found in 36 ATPase components of ABC transporters and in 38 other proteins.

In the current PDB database the element LSGGQQQR is present in seven structures, of which three, from different species, are listed below:

1OXS, chain C. ABC-ATPase of the glucose ABC transporter from Sulfolobus solfataricus

...NHFPRELSGGQQQRVALARALVKDPSLLLLDEPFSN...

1F3O, chain A. Hypothetical ABC transporter ATP-binding protein from Methanococcus jannaschii

...NHKPNQLSGGQQQRVAIARALANNPPIILADEPTGA...

1B0U, chain A. ATP-binding subunit of the histidine permease from Salmonella typhimurium

...GKYPVHLSGGQQQRVSIARALAMEPDVLLFDEPTSA...

The corresponding segments of the backbones are shown superimposed in Fig. 2. The consensus sequence that corresponds to the apparent common structure in Fig. 2, as calculated from all occurrences of the element in the prokaryotic genomes, is:

$$ {\matrix{ {\rm D} & {\rm R} & {\rm Y} & {\rm P} & {\rm S} & {\rm Q} & {\underline{\bf L}} & {\underline{\bf S}} & {\underline{\bf G}} \cr {212} & {273} & {302} & {582} & {124} & {361} & {822} & {822} & {822} \cr {\underline{\bf G}} & {\underline{\bf Q}} & {\underline{\bf Q}} & {\underline{\bf Q}} & {\underline{\bf R}} & {\rm V} & {\rm A} & {\rm I} & {\rm A} \cr {822} & {822} & {822} & {822} & {822} & {604} & {519} & {583} & {761} \cr {\rm R} & {\rm A} & {\rm L} & {\rm A} & {\rm M} & {\rm E} & {\rm P} & {\rm E} & {\rm V} \cr {792} & {677} & {621} & {444} & {197} & {182} & {709} & {182} & {354} \cr {\rm L} & {\rm L} & {\rm L} & {\rm D} & {\rm E} & {\rm P} & {\rm T} & {\rm S} & {\rm A} \cr {376} & {708} & {235} & {822} & {822} & {758} & {577} & {509} & {557} \cr } } $$
Fig. 2. Structures containing the element LSGGQQQR. The superposition is optimized over the entire length of the chains (42 residues in this case). The LSGGQQQR octamer is located at the bottom ends of the helices.

The element LSGGQ in this consensus is known to be a signature of ABC cassettes, while the LLLLD element represents the Walker B motif (Walker et al. 1982). After sequence alignment of this consensus with the closed loop Prototype II (Beth) sequence (Berezovsky et al. 2003a, b), as many as 26 of 30 overlapping residues match. The structure of the closed loop in Fig. 2 is also indistinguishable from the Prototype Beth structure (not shown). Thus, the element LSGGQQQR and the structure in which it is found belong to the Prototype Beth family of closed loops (Berezovsky et al. 2003a, b). In fact, the structure 1B0U in the list above was used as the representative of the Beth family in the original publication (Berezovsky et al. 2003a, b).

GPPGTGKT—Element of Cell Division Proteins

This element is present in 122 of 131 prokaryotes. Among the total of 386 occurrences there are 180 cell division proteins, 46 ATPases, 44 DNA related proteins, including DNA helicases (21), 50 other proteins, and 66 unidentified ones.
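As a quick consistency check, the category counts quoted above can be summed and compared with the stated total of 386. The sketch below uses only numbers from the text. The dictionary name `gpp_counts` is an assumption of the example, and the 21 DNA helicases are not listed separately because they are already included in the 44 DNA related proteins.

```python
# Occurrences of GPPGTGKT by protein category, as quoted in the text.
gpp_counts = {
    "cell division proteins": 180,
    "ATPases": 46,
    "DNA related proteins": 44,  # includes the 21 DNA helicases
    "other proteins": 50,
    "unidentified": 66,
}

total = sum(gpp_counts.values())
print(total)  # 386
for name, count in gpp_counts.items():
    print(f"{name}: {count} ({100 * count / total:.1f}%)")
```

The sum reproduces the total of 386 occurrences, which is also the count shown under each octamer position in the consensus for this element.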

The PDB database contains six entries with this element:

1OZ4, 1S3S, 1E32, 1R7R. Transitional endoplasmic reticulum ATPase of Mus musculus

...PPRGILLYGPPGTGKTLIARAVAN...

1SXJ. Eukaryotic clamp loader (replication factor C) of Saccharomyces cerevisiae

...NLPHMLFYGPPGTGKTSTILALTK...

1LV7. AAA domain of cell division protein Ftsh of Escherichia coli

...IPKGVLMVGPPGTGKTLLAKAIAG...

The consensus sequence around the element, calculated from all 386 occurrences in the 122 genomes is:

$$ {\matrix{ {\rm I} & {\rm P} & {\rm K} & {\rm G} & {\rm V} & {\rm L} & {\rm L} & {\rm V} & {\underline{\bf G}} \cr {112} & {225} & {167} & {226} & {182} & {273} & {267} & {106} & {386} \cr {\underline{\bf P}} & {\underline{\bf P}} & {\underline{\bf G}} & {\underline{\bf T}} & {\underline{\bf G}} & {\underline{\bf K}} & {\underline{\bf T}} & {\rm L} & {\rm L} \cr {386} & {386} & {386} & {386} & {386} & {386} & {386} & {192} & {247} \cr {\rm A} & {\rm K} & {\rm A} & {\rm V} & {\rm A} & {\rm G} & & & \cr {310} & {173} & {247} & {163} & {280} & {158} & & & \cr }} $$

It matches the Prototype Aleph consensus in 14 positions of 23. The respective crystal structure elements carrying the GPPGTGKT octamer are indistinguishable from those shown in Fig. 1. Thus, the GPPGTGKT element and the structure around it represent yet another family of closed loops belonging to the common general Prototype Aleph.

KMSKSLGN—Signature for Aminoacyl-tRNA Synthetases, Class I

One hundred and twenty one prokaryotic proteomes (of 131) contain this conserved element. It appears there 221 times, almost exclusively in aminoacyl-tRNA synthetases of class I (215 times). The motif KMSKS was suggested earlier as the signature of the aminoacyl-tRNA synthetases of class I, involved in ATP binding (Carter 1993).

Only two nonredundant structures containing the KMSKSLGN element can be extracted from the PDB database:

1LI7, chain A. Cysteinyl-tRNA synthetase, Escherichia coli.

...GMVMVDREKMSKSLGNFFTVRDVLKYYDAETVRYFLMSGHYRSQLN...

1QU3, chain A. Isoleucyl-tRNA synthetase of Escherichia coli.

...FVMDGEGKKMSKSLGNVIVPDQVVKQKGADIARLWVSSTDYLADVR...

These structures, as they appear in the vicinity of the octamer, are shown superimposed in Fig. 3. The structurally conserved part appears as a rather convoluted loop closed by tight contacts between the ends (several Cα-Cα contacts with a separation of 3–5 Å). The contour of the loop between the closest residues at the ends is unusually large (42 residues) compared with the typical size of 25–30 residues for closed loops (Berezovsky et al. 2000). This is probably due to the presence of two α-helical segments within the contour. Since only two trajectories are available, the limits of the common part of the trajectories are rather uncertain. The ends of the structures seem to be at the limits of the conserved structure.
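The loop closure described above (residues far apart along the chain but with Cα atoms only a few angstroms apart) can be detected with a simple distance scan. The sketch below is a minimal illustration under stated assumptions: the 5 Å cutoff follows the 3–5 Å contacts mentioned in the text, the minimal sequence separation of 15 residues is an arbitrary choice, the function name `closing_contacts` is hypothetical, and the coordinates are a made-up planar ring rather than a real structure.

```python
import math


def closing_contacts(ca_coords, max_dist=5.0, min_sep=15):
    """Pairs (i, j, distance) of residues distant in sequence but close in space."""
    contacts = []
    for i in range(len(ca_coords)):
        for j in range(i + min_sep, len(ca_coords)):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= max_dist:
                contacts.append((i, j, d))
    return contacts


# Made-up C-alpha trace: 20 points on a planar ring, neighbours 3.8 angstroms apart,
# so that the first and the last residue end up next to each other.
n = 20
radius = 3.8 / (2 * math.sin(math.pi / n))
ring = [
    (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n), 0.0)
    for k in range(n)
]

contacts = closing_contacts(ring)
i, j, d = min(contacts, key=lambda c: c[2])  # the closest pair of loop ends
print(f"closest ends: residues {i} and {j}, separated by {j - i} positions")
```

For real chains the same scan would be run on the Cα coordinates of the fragment, and the sequence separation of the closest pair gives an estimate of the contour length of the loop.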

Fig. 3. Conserved structure around the element KMSKSLGN found in ile- and cys-tRNA synthetases of E. coli. The KMSKSLGN element is shown at the bottom right.

Consensus sequence for the 221 occurrences of the octamer is shown below.

$$ {\matrix{ {\rm G} & {\rm V} & {\rm L} & {\rm D} & {\rm E} & {\rm D} & {\rm G} & {\rm E} & {\underline{\bf K}} \cr {78} & {50} & {63} & {85} & {52} & {51} & {137} & {72} & {221} \cr {\underline{\bf M}} & {\underline{\bf S}} & {\underline{\bf K}} & {\underline{\bf S}} & {\underline{\bf L}} & {\underline{\bf G}} & {\underline{\bf N}} & {\rm F} & {\rm I} \cr {221} & {221} & {221} & {221} & {221} & {221} & {221} & {59} & {99} \cr {\rm T} & {\rm P} & {\rm R} & {\rm D} & {\rm V} & {\rm I} & {\rm K} & {\rm K} & {\rm Y} \cr {60} & {96} & {51} & {102} & {89} & {75} & {58} & {46} & {85} \cr {\rm G} & {\rm A} & {\rm D} & {\rm V} & {\rm L} & {\rm R} & {\rm F} & {\rm F} & {\rm L} \cr {90} & {93} & {89} & {53} & {108} & {158} & {50} & {70} & {84} \cr {\rm L} & {\rm S} & {\rm G} & {\rm H} & {\rm Y} & {\rm R} & & & \cr {62} & {68} & {42} & {77} & {94} & {87} & & & \cr }} $$

Neither the consensus nor the structure corresponds to any of the previously described prototypes (Berezovsky et al. 2003a, b). The conserved KMSKSLGN element thus represents a new class of closed loop prototypes.

LRPGRFDR—Element of Cell Division Proteins

This element is found in 119 genomes: 132 times in cell division proteins, including Ftsh (124 times), 16 times in proteasome proteins, and 11 times in other proteins. Many of the occurrences are in the same proteins as the element GPPGTGKT (see above), but at a different sequence location. For example, in Ftsh sequences it is located about 110–120 residues downstream from the GPPGTGKT octamer.

The PDB database contains two non-redundant structures with the octamer LRPGRFDR:

1LV7, chain A. AAA domain of cell division protein Ftsh of Escherichia coli

...GIIVIAATNRPDVLDPALLRPGRFDRQVVVGLPD...

1IY2, chain A. Ftsh ATPase domain from Thermus thermophilus

...AIVVMAATNRPDILDPALLRPGRFDRQIAIDAPD...

In Fig. 4 both structures are superimposed. The borders of the conserved structure around the octamer are uncertain; they can be determined only when several structures are available, as in the case of the GHVDHGKT octamer. In Fig. 4 the limits are arbitrarily chosen at 18 residues upstream and 8 residues downstream of the octamer LRPGRFDR. The loop observed in Fig. 4 has a contour length of about 18 amino acids and is locked by several contacts with Cβ-Cβ distances of 4–10 Å between the contacting residues.
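Cutting a window with chosen upstream and downstream limits around an octamer is a simple slicing operation. The sketch below applies it to the 1LV7 fragment quoted above. The function name `window_around` is hypothetical, only the first occurrence of the element is used, and the window is truncated at the ends of the sequence.

```python
def window_around(sequence, element, upstream, downstream):
    """Slice of `sequence` covering `element` plus the given flanks (first hit only)."""
    start = sequence.find(element)
    if start == -1:
        raise ValueError(f"{element} not found in sequence")
    lo = max(0, start - upstream)
    hi = min(len(sequence), start + len(element) + downstream)
    return sequence[lo:hi]


# Fragment of 1LV7, chain A, as quoted in the text.
ftsh_ecoli = "GIIVIAATNRPDVLDPALLRPGRFDRQVVVGLPD"

# The limits used for Fig. 4: 18 residues upstream, 8 residues downstream.
print(window_around(ftsh_ecoli, "LRPGRFDR", 18, 8))
# A narrower window, for comparison.
print(window_around(ftsh_ecoli, "LRPGRFDR", 2, 2))
```

With the limits of 18 and 8 residues the window coincides with the whole quoted fragment, and the narrower call returns ALLRPGRFDRQV.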

Fig. 4. Conserved structure with the embedded element LRPGRFDR (right side of the loop).

Neither the consensus sequence around the LRPGRFDR element (shown below) nor the structure has analogs among the known closed loop prototypes.

$$ {\matrix{ {\rm E} & {\rm G} & {\rm V} & {\rm I} & {\rm V} & {\rm I} & {\rm A} & {\rm A} & {\rm T} \cr {65} & {103} & {92} & {114} & {91} & {106} & {147} & {158} & {158} \cr {\rm N} & {\rm R} & {\rm P} & {\rm D} & {\rm V} & {\rm L} & {\rm D} & {\rm P} & {\rm A} \cr {158} & {155} & {120} & {144} & {93} & {151} & {158} & {117} & {158} \cr {\rm L} & {\underline{\bf L}} & {\underline{\bf R}} & {\underline{\bf P}} & {\underline{\bf G}} & {\underline{\bf R}} & {\underline{\bf F}} & {\underline{\bf D}} & {\underline{\bf R}} \cr {145} & {159} & {159} & {159} & {159} & {159} & {159} & {159} & {159} \cr {\rm Q} & {\rm V} & {\rm V} & {\rm V} & {\rm P} & {\rm L} & {\rm P} & {\rm D} & \cr {90} & {84} & {50} & {139} & {44} & {63} & {159} & {142} & \cr } } $$

QRVAIARA—Element of ABC Cassettes, Part of the Consensus Around LSGGQQQR

This element is found in 119 of 131 proteomes. It is encountered 553 times in ATP-binding components of ABC transport proteins, including oligopeptide and amino acid transporters (216). It is also present in 51 ATPase components of ABC transporters, in 29 various ABC transporters, as well as in 13 other proteins and in 21 hypothetical proteins. The element QRVAIARA is located in essentially the same proteins as the element LSGGQQQR, and partially overlaps with the latter in the sequences: LSGGQQQRVAIARA. The corresponding consensus sequence shown below is almost identical to the consensus around the element LSGGQQQR. The respective structures embedding the QRVAIARA element are identical to the structure shown in Fig. 2.

$$ {\matrix{ {\rm D} & {\rm H} & {\rm Y} & {\rm P} & {\rm S} & {\rm Q} & {\rm L} & {\rm S} & {\rm G} \cr {191} & {133} & {243} & {573} & {132} & {257} & {621} & {666} & {660} \cr {\rm G} & {\rm Q} & {\rm Q} & {\underline{\bf Q}} & {\underline{\bf R}} & {\underline{\bf V}} & {\underline{\bf A}} & {\underline{\bf I}} & {\underline{\bf A}} \cr {666} & {500} & {334} & {667} & {667} & {667} & {667} & {667} & {667} \cr {\underline{\bf R}} & {\underline{\bf A}} & {\rm L} & {\rm A} & {\rm M} & {\rm E} & {\rm P} & {\rm K} & {\rm V} \cr {667} & {667} & {502} & {314} & {174} & {127} & {628} & {180} & {259} \cr {\rm L} & {\rm L} & {\rm A} & {\rm D} & {\rm E} & {\rm P} & {\rm T} & {\rm S} & {\rm A} \cr {320} & {521} & {231} & {666} & {666} & {560} & {508} & {354} & {445} \cr } } $$
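The partial overlap of the two octamers within LSGGQQQRVAIARA can be made explicit by locating each element and intersecting the two intervals. The sketch below is a small illustration. The helper names `element_span` and `overlap` are assumptions of the example, and spans are half-open, zero-based intervals.

```python
def element_span(sequence, element):
    """Half-open, zero-based interval (start, end) of the first occurrence."""
    start = sequence.find(element)
    if start == -1:
        raise ValueError(f"{element} not found in sequence")
    return start, start + len(element)


def overlap(span_a, span_b):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


joined = "LSGGQQQRVAIARA"
span_1 = element_span(joined, "LSGGQQQR")
span_2 = element_span(joined, "QRVAIARA")
shared = joined[max(span_1[0], span_2[0]):min(span_1[1], span_2[1])]
print(span_1, span_2, overlap(span_1, span_2), shared)
```

The two octamers share the dipeptide QR, so together they cover a 14-residue stretch of the common consensus.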

DEPTSALD—Element of ABC Cassettes, Part of the Consensus Around LSGGQQQR

One hundred and nineteen prokaryotic proteomes of 131 have this element in their proteins. The motif is found in 408 ATP binding domains of ABC transporters, including amino acid and oligopeptide (217) and phosphate (53) transporters, in 49 ABC cassettes, 40 ATPase domains of the transporters, 39 ATP binding proteins, and 20 other proteins, including hypothetical ones.

As in the previous case, the element DEPTSALD frequently belongs to the same proteins as the elements LSGGQQQR and QRVAIARA. It is located at the C-terminal ends of the respective consensus sequences. Its own consensus sequence, shown below, is almost identical to those for LSGGQQQR and QRVAIARA. The respective structures embedding the DEPTSALD element are the same as the structure in Fig. 2 (data not shown).

$$ {\matrix{ {\rm D} & {\rm R} & {\rm Y} & {\rm P} & {\rm H} & {\rm Q} & {\rm L} & {\rm S} & {\rm G} \cr {141} & {115} & {301} & {409} & {112} & {181} & {443} & {546} & {518} \cr {\rm G} & {\rm Q} & {\rm Q} & {\rm Q} & {\rm R} & {\rm V} & {\rm A} & {\rm I} & {\rm A} \cr {546} & {471} & {279} & {541} & {534} & {334} & {360} & {471} & {520} \cr {\rm R} & {\rm A} & {\rm L} & {\rm A} & {\rm M} & {\rm E} & {\rm P} & {\rm K} & {\rm V} \cr {480} & {409} & {429} & {350} & {224} & {119} & {513} & {142} & {210} \cr {\rm L} & {\rm L} & {\rm F} & {\underline{\bf D}} & {\underline{\bf E}} & {\underline{\bf P}} & {\underline{\bf T}} & {\underline{\bf S}} & {\underline{\bf A}} \cr {212} & {404} & {254} & {556} & {556} & {556} & {556} & {556} & {556} \cr {\underline{\bf L}} & {\underline{\bf D}} & & & & & & & \cr {556} & {556} & & & & & & & \cr }} $$

SIGEPGTQ—Key Element of DNA-Directed RNA Polymerases

This octamer is located in 117 proteomes. All 117 occurrences are within DNA-directed RNA polymerases. Only two crystal structures are available in the PDB database that contain the octamer SIGEPGTQ together with its sequence/structure environment (1IW7 and 1SMY). These two structures involve the same protein subunits, from the same species (T. thermophilus). One is described below:

1SMY, chain D. DNA-directed RNA polymerase of Thermus thermophilus.

...SIGEPGTQLTMRTFHTGGVAGAADITQGLPRVIELFEA...

The loop-like fragment of the chain containing the SIGEPGTQ element is shown in Fig. 5. It is very compact. The arms of the structure, including the α-helices at the ends, lie close to one another, at 5–9 Å. The fragment may or may not correspond to a structurally conserved region. Only future comparisons with RNA polymerases from other species will make it possible to find the limits of structural conservation in the vicinity of the SIGEPGTQ element. The consensus sequence that corresponds to the segment in Fig. 5 is:

$$ {\matrix{ {\underline{\bf S}} & {\underline{\bf I}} & {\underline{\bf G}} & {\underline{\bf E}} & {\underline{\bf P}} & {\underline{\bf G}} & {\underline{\bf T}} & {\underline{\bf Q}} & {\rm L} \cr {117} & {117} & {117} & {117} & {117} & {117} & {117} & {117} & {107} \cr {\rm T} & {\rm M} & {\rm R} & {\rm T} & {\rm F} & {\rm H} & {\rm I} & {\rm G} & {\rm G} \cr {115} & {106} & {115} & {111} & {116} & {116} & {41} & {107} & {116} \cr {\rm V} & {\rm A} & {\rm S} & {\rm R} & {\rm A} & {\rm A} & {\rm T} & {\rm E} & {\rm G} \cr {53} & {95} & {47} & {30} & {28} & {30} & {30} & {24} & {27} \cr {\rm L} & {\rm I} & {\rm R} & {\rm V} & {\rm K} & {\rm E} & {\rm F} & {\rm G} & {\rm E} \cr {31} & {25} & {25} & {31} & {35} & {33} & {19} & {43} & {21} \cr {\rm V} & {\rm R} & & & & & & & \cr {33} & {39} & & & & & & & \cr } } $$
Fig. 5. Polypeptide chain structure (1SMY) around conserved sequence element SIGEPGTQ (bottom left).

SGGLHGVG—Conserved Sequence Element of Topoisomerases

This octamer is found in 117 proteomes, 124 times in DNA gyrases, subunit B, and 58 times in DNA topoisomerases IV, subunit B. Three nonredundant structures containing the octamer are available in crystallized form:

1S16, chain A. Topoisomerase IV subunit B of Escherichia coli.

...GVPAVELILCRLHAGGKFSNKNYQFSGGLHGVGISVVNALSKRVEVNVRRDGQVYNIA...

1EI1, chain A. DNA gyrase B of Escherichia coli

...GVSAAEVIMTVLHAGGKFDDNSYKVSGGLHGVGVSVVNALSQKLELVIQREGKIHRQI...

1KIJ, chain A. ATPase domain of Thermus thermophilus gyrase B.

...GKPAVEVIYNTLHSGGKFEQGAYKVSGGLHGVGASVVNALSEWTVVEVFREGKHHRIA...

The first two contain nearly identical, very compact structures with the element SGGLHGVG (Fig. 6A). There are many short distances, 5–10 Å, across the structure and between its ends. The third structure (Fig. 6B) is very similar, except in the small section where the SGGLHGVG octamer is located. It is part of the gyrase B of T. thermophilus, and the difference is apparently due to the fact that the ATPase domain of the gyrase in this case is in the open form (Lamour et al. 2002).

Fig. 6. Conserved structural element containing the SGGLHGVG octamer (bottom left corners). A Normal structure, B “open form” structure.

Consensus sequence for this sequence/structure element is given below.

$$ {\matrix{ {\rm M} & {\rm T} & {\rm V} & {\rm L} & {\rm H} & {\rm A} & {\rm G} & {\rm G} & {\rm K} \cr {72} & {157} & {88} & {171} & {168} & {161} & {174} & {196} & {178} \cr {\rm F} & {\rm D} & {\rm G} & {\rm G} & {\rm S} & {\rm Y} & {\rm K} & {\rm V} & {\underline{\bf S}} \cr {172} & {87} & {40} & {63} & {65} & {182} & {122} & {107} & {182} \cr {\underline{\bf G}} & {\underline{\bf G}} & {\underline{\bf L}} & {\underline{\bf H}} & {\underline{\bf G}} & {\underline{\bf V}} & {\underline{\bf G}} & {\rm V} & {\rm S} \cr {182} & {182} & {182} & {182} & {182} & {182} & {182} & {86} & {177} \cr {\rm V} & {\rm V} & {\rm N} & {\rm A} & {\rm L} & {\rm S} & {\rm E} & {\rm X} & {\rm L} \cr {170} & {175} & {182} & {178} & {180} & {178} & {56} & {50} & {124} \cr }} $$

VEGDSAGG—Conserved sequence Element of Topoisomerases

This octamer is found in 116 proteomes, 121 times in DNA gyrases, subunit B, and 59 times in DNA topoisomerases IV, subunit B. No crystal structure that would contain the octamer is identified in the PDB database. The consensus sequence around the VEGDSAGG octamer has no visible connection with the other topoisomerase element, SGGLHGVG, and does not resemble any known prototype sequence:

$$ {\matrix{ {\rm A} & {\rm D} & {\rm C} & {\rm Q} & {\rm S} & {\rm K} & {\rm D} & {\rm P} & {\rm A} \cr {135} & {150} & {156} & {74} & {83} & {84} & {125} & {123} & {57} \cr {\rm K} & {\rm S} & {\rm E} & {\rm L} & {\rm F} & {\rm I} & {\underline{\bf V}} & {\underline{\bf E}} & {\underline{\bf G}} \cr {46} & {73} & {180} & {133} & {89} & {85} & {180} & {180} & {180} \cr {\underline{\bf D}} & {\underline{\bf S}} & {\underline{\bf A}} & {\underline{\bf G}} & {\underline{\bf G}} & {\rm S} & {\rm A} & {\rm K} & {\rm Q} \cr {180} & {180} & {180} & {180} & {180} & {170} & {177} & {175} & {131} \cr {\rm G} & {\rm R} & {\rm D} & {\rm R} & {\rm K} & {\rm F} & {\rm Q} & {\rm A} & {\rm I} \cr {136} & {179} & {117} & {140} & {52} & {101} & {179} & {176} & {156} \cr {\rm L} & {\rm P} & & & & & & & \cr {161} & {179} & & & & & & & \cr }} $$

GLPNVGKS—GTP/ATP Binding Proteins

This octamer is present in 116 proteomes, mostly in GTP/ATP binding proteins (80 times), GTPases (8) and other proteins (29). The motif is found in only one crystal structure (1JAL), and the structure around the GLPNVGKS octamer is indistinguishable (not shown) from the one in Fig. 1. It is functionally similar to the GHVDHGKT element as well. By its consensus sequence (below) it is close to the prototype Aleph, like the most conserved GHVDHGKT element.

$$ {\matrix{ {\rm M} & {\rm G} & {\rm L} & {\rm K} & {\rm C} & {\rm G} & {\rm I} & {\rm V} & {\underline{\bf G}} \cr {86} & {41} & {65} & {43} & {50} & {110} & {111} & {109} & {117} \cr {\underline{\bf L}} & {\underline{\bf P}} & {\underline{\bf N}} & {\underline{\bf V}} & {\underline{\bf G}} & {\underline{\bf K}} & {\underline{\bf S}} & {\rm T} & {\rm L} \cr {117} & {117} & {117} & {117} & {117} & {117} & {117} & {105} & {107} \cr {\rm F} & {\rm N} & {\rm A} & {\rm L} & {\rm T} & {\rm K} & & & \cr {112} & {106} & {109} & {85} & {109} & {56} & & & \cr }} $$

DEPSIGLH—Element of Excinuclease ABC (UvrA)

This element is found 128 times in 115 proteomes, 126 of these occurrences being in excinucleases ABC (UvrA). No crystal structures harboring this octamer were identified. By the consensus sequence around the element (below), it belongs to the closed loop prototype Beth (matching 11 residues of 32), like the element LSGGQQQR (Fig. 2). It is close to the LSGGQQQR element functionally as well, both belonging to ABC cassettes.

$$ {\matrix{ {\rm G} & {\rm G} & {\rm E} & {\rm A} & {\rm Q} & {\rm R} & {\rm I} & {\rm R} & {\rm L} \cr {128} & {128} & {128} & {79} & {123} & {128} & {118} & {111} & {120} \cr {\rm A} & {\rm T} & {\rm Q} & {\rm I} & {\rm G} & {\rm S} & {\rm G} & {\rm L} & {\rm T} \cr {120} & {70} & {111} & {104} & {124} & {93} & {69} & {126} & {49} \cr {\rm G} & {\rm V} & {\rm L} & {\rm Y} & {\rm V} & {\rm L} & {\underline{\bf D}} & {\underline{\bf E}} & {\underline{\bf P}} \cr {118} & {112} & {84} & {123} & {91} & {124} & {128} & {128} & {128} \cr {\underline{\bf S}} & {\underline{\bf I}} & {\underline{\bf G}} & {\underline{\bf L}} & {\underline{\bf H}} & & & & \cr {128} & {128} & {128} & {128} & {128} & & & & \cr } } $$

DLGGGTFD—Chaperone (Heat Shock) Proteins

One hundred and fifteen prokaryotic genomes harbor this conserved octamer element. Of the 163 total occurrences of the element, 129 correspond to chaperone protein dnaK (heat shock protein 70), 24 to chaperone protein HscA, and 10 to other chaperone and heat shock proteins. Three non-redundant crystal structures in the PDB database contain the element, two of which belong to eukaryotes:

1S3X, chain A. Human Hsp70 ATPase domain

...NVLIFDLGGGTFDVSILTID...

1DKG, chain D. ATPase Domain Of The Molecular Chaperone Dnak from Escherichia coli

...TIAVYDLGGGTFDISIIEID...

3HSC. ATPase fragment of a 70K heat-shock cognate protein, Bos taurus

...NVLIFDLGGGTFDVSILTIE...

These are shown superimposed in Fig. 7. This is a new structural element that does not belong to any earlier described family. The consensus sequence around the octamer has no similarity to the other profiles described in this work:

$$ {\matrix{ {\rm K} & {\rm I} & {\rm A} & {\rm V} & {\rm Y} & {\underline{\bf D}} & {\underline{\bf L}} & {\underline{\bf G}} & {\underline{\bf G}} \cr {58} & {119} & {74} & {157} & {108} & {163} & {163} & {163} & {163} \cr {\underline{\bf G}} & {\underline{\bf T}} & {\underline{\bf F}} & {\underline{\bf D}} & {\rm V} & {\rm S} & {\rm I} & {\rm L} & {\rm E} \cr {163} & {163} & {163} & {163} & {96} & {151} & {112} & {123} & {123} \cr {\rm I} & {\rm G} & & & & & & & \cr {75} & {74} & & & & & & & \cr }} $$
Fig. 7. Conserved chain structure with embedded octamer DLGGGTFD (at the turn).

GPNGAGKS—Element of ABC Transporters

This element is present in 114 of 131 proteomes: in 384 ATP-binding components of various ABC transporters, including transporters of amino acids (46), iron(III) (20), zinc (17), and manganese (13); in 39 ATPase components of ABC transporters; and in 56 other proteins. Only one representative is found in the PDB crystal database: 1L7V (vitamin B12 transporter of E. coli). The structure of the chain that includes the GPNGAGKS octamer (not shown) is identical to the one in Fig. 1. This octamer matches the Walker A motif GxxxxGKS, as do the other representatives of the Aleph family described above. The consensus sequence around the octamer is a good match to the consensus sequence of Aleph (15 residues of 23):

$$ {\matrix{ {\rm P} & {\rm G} & {\rm E} & {\rm I} & {\rm V} & {\rm G} & {\rm L} & {\rm I} & {\underline{\bf G}} \cr {94} & {406} & {214} & {172} & {94} & {213} & {205} & {184} & {479} \cr {\underline{\bf P}} & {\underline{\bf N}} & {\underline{\bf G}} & {\underline{\bf A}} & {\underline{\bf G}} & {\underline{\bf K}} & {\underline{\bf S}} & {\rm T} & {\rm L} \cr {479} & {479} & {479} & {479} & {479} & {479} & {479} & {435} & {270} \cr {\rm L} & {\rm K} & {\rm M} & {\rm I} & {\rm A} & {\rm G} & & & \cr {183} & {177} & {101} & {183} & {90} & {349} & & & \cr }} $$

GIDLGTTN—Chaperones

One hundred and thirteen proteomes contain this element. Of its occurrences, 150 are in chaperone protein dnaK (heat shock protein 70) and 7 in other chaperone proteins. In the crystal database only one structure with this octamer is found: 1DKG, the ATPase domain of the molecular chaperone Dnak of Escherichia coli. This is the same molecule as in the case of DLGGGTFD (see above), but the sequence location of the GIDLGTTN element is different. Interestingly, the chain structure in the vicinity of this element (not shown) is the same as for the DLGGGTFD element (Fig. 7). The octamers have a common DLG tripeptide in the same position within the loop structure. The consensus sequence around the GIDLGTTN octamer

$$ {\matrix{ {\rm K} & {\rm I} & {\rm I} & {\underline{\bf G}} & {\underline{\bf I}} & {\underline{\bf D}} & {\underline{\bf L}} & {\underline{\bf G}} & {\underline{\bf T}} \cr {98} & {82} & {98} & {157} & {157} & {157} & {157} & {157} & {157} \cr {\underline{\bf T}} & {\underline{\bf N}} & {\rm S} & {\rm C} & {\rm V} & {\rm A} & {\rm V} & {\rm M} & {\rm E} \cr {157} & {157} & {152} & {84} & {135} & {122} & {95} & {63} & {92} \cr {\rm G} & {\rm G} & & & & & & & \cr {106} & {93} & & & & & & & \cr }} $$

is similar to the consensus around DLGGGTFD (8 matching residues of 20).

Thus, these elements belong to the same sequence/structure/function family, perhaps a new, separate closed loop prototype.

VITVPAYF—ATPase of Heat Shock Protein 70

This element is found 146 times in 113 proteomes, mostly in ATPase domains of chaperones (HSP70; 140 times). E. coli HSP70 contains all three chaperone octamer elements described here, DLGGGTFD, GIDLGTTN (see above) and VITVPAYF, each at a different location in the same chain. The PDB crystals that contain the octamer:

1S3X, chain A. Human Hsp70 ATPase domain.

...VTNAVITVPAYFNDSQRQATKDAGVIAGLNVLRIINEPTAAA...

1DKG, chain D. ATPase domain of the molecular chaperone DnaK of Escherichia coli

...VTEAVITVPAYFNDAQRQATKDAGRIAGLEVKRIINEPTAAA...
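The similarity of the two fragments listed above can be checked by a simple position-by-position comparison. The following is only a minimal sketch, assuming the two fragments are already aligned without gaps (they have equal length as printed); the function name `count_identities` is introduced here for illustration and is not part of the original analysis.

```python
# Fragments copied verbatim from the two PDB entries listed above.
HUMAN_HSP70 = "VTNAVITVPAYFNDSQRQATKDAGVIAGLNVLRIINEPTAAA"  # 1S3X, chain A
ECOLI_DNAK = "VTEAVITVPAYFNDAQRQATKDAGRIAGLEVKRIINEPTAAA"   # 1DKG, chain D


def count_identities(seq_a, seq_b):
    """Return (number of identical positions, list of 1-based mismatch positions).

    Assumption: the fragments are aligned without gaps and have equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("fragments must have equal length")
    mismatches = [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    return len(seq_a) - len(mismatches), mismatches


identical, mismatches = count_identities(HUMAN_HSP70, ECOLI_DNAK)
print(f"{identical} of {len(HUMAN_HSP70)} positions identical; "
      f"mismatches at positions {mismatches}")
```

For these two fragments the comparison gives 37 identical positions of 42, with all five substitutions lying outside the VITVPAYF octamer.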

The chain structures around the octamer in these two cases are essentially identical (Fig. 8) and dissimilar to any of the structures discussed above, as well as to any known prototype structures. The limits of structural conservation are uncertain since only two structures are compared. Interestingly, the consensus sequence downstream from the octamer VITVPAYF resembles the sequence of the closed loop prototype Beth, matching 10 residues of 30 (bold), including the tripeptides QRQ and EPT characteristic of Beth:

$$ {\matrix{ {\rm V} & {\rm T} & {\rm E} & {\rm A} & {\underline{\bf V}} & {\underline{\bf I}} & {\underline{\bf T}} & {\underline{\bf V}} & {\underline{\bf P}} \cr {108} & {86} & {48} & {135} & {150} & {150} & {150} & {150} & {150} \cr {\underline{\bf A}} & {\underline{\bf Y}} & {\underline{\bf F}} & {\rm N} & {\rm D} & {\rm A} & {\bf Q} & {\bf R} & {\bf Q} \cr {150} & {150} & {150} & {113} & {143} & {91} & {135} & {149} & {121} \cr {\rm A} & {\rm T} & {\rm K} & {\rm D} & {\bf A} & {\rm G} & {\rm R} & {\rm I} & {\bf A} \cr {128} & {149} & {139} & {121} & {150} & {123} & {65} & {119} & {147} \cr {\rm G} & {\rm L} & {\rm E} & {\rm V} & {\bf L} & {\rm R} & {\rm I} & {\rm I} & {\rm N} \cr {148} & {140} & {70} & {148} & {68} & {148} & {119} & {102} & {145} \cr {\bf E} & {\bf P} & {\bf T} & {\rm A} & {\bf A} & {\rm A} & & & \cr {150} & {150} & {150} & {145} & {149} & {135} & & & \cr }} $$

Fig. 8. Conserved structure that contains the VITVPAYF octamer (bottom right of the loop).

The functional involvements are similar as well (both belong to ATPases). Yet the structure shown in Fig. 8 is not superimposable with the prototype Beth structure in Fig. 2, except for the α-helical segments (not shown). It is tempting to speculate that the new type of closed loop shown in Fig. 8 is an evolutionary descendant of the original Beth structure. Alternatively, it could correspond to a conformational variant of the Beth prototype. It remains to be found whether a conformational transition between the two structures exists, similar to the transitions in the cases of GHVDHGKT (Fig. 1) and SGGLHGVG (Fig. 6).

LNRAPTLH...NADFDGDQ—RNA Polymerase Beta’ Subunit

One hundred and thirteen proteomes contain these two conserved octapeptides together in the same protein: the tandem occurs 107 times in DNA-directed RNA polymerase beta’ chains and six times in RNA polymerase gamma subunits. There are two non-redundant crystal structures with these elements:

1SMY, chain D. DNA-directed RNA polymerase Rnap’ subunit from Thermus thermophilus

...LNRAPTLHRLGIQAFQPVLVEGQSIQLHPLVCEAFNADFDGDQ...

1HQM, chain D. DNA-Directed RNA Polymerase Rnap’ subunit from Thermus aquaticus

...LNRAPTLHRLGIQAFQPVLVEGQSIQLHPLVCEAFNADFDGDQ...

The sequences of the segments that include the octamers are identical for these two species. The respective chain structures, shown in Fig. 9, are nearly identical as well. The limits of structural conservation are uncertain, since two examples are not sufficient to determine them. The structure in Fig. 9 is bordered by the octamers LNRAPTLH and NADFDGDQ. The consensus sequence between the octamers (below) does not resemble any sequences listed above, nor the sequences of known closed loop prototypes.

$$ {\matrix{ {\underline{\bf L}} & {\underline{\bf N}} & {\underline{\bf R}} & {\underline{\bf A}} & {\underline{\bf P}} & {\underline{\bf T}} & {\underline{\bf L}} & {\underline{\bf H}} & {\rm R} \cr {113} & {113} & {113} & {113} & {113} & {113} & {113} & {113} & {109} \cr {\rm L} & {\rm G} & {\rm I} & {\rm Q} & {\rm A} & {\rm F} & {\rm E} & {\rm P} & {\rm V} \cr {106} & {104} & {113} & {113} & {112} & {113} & {99} & {113} & {57} \cr {\rm L} & {\rm I} & {\rm E} & {\rm G} & {\rm K} & {\rm A} & {\rm I} & {\rm Q} & {\rm L} \cr {108} & {59} & {91} & {113} & {90} & {110} & {108} & {71} & {106} \cr {\rm H} & {\rm P} & {\rm L} & {\rm V} & {\rm C} & {\rm A} & {\rm A} & {\rm F} & {\underline{\bf N}} \cr {112} & {113} & {112} & {102} & {102} & {40} & {109} & {71} & {113} \cr {\underline{\bf A}} & {\underline{\bf D}} & {\underline{\bf F}} & {\underline{\bf D}} & {\underline{\bf G}} & {\underline{\bf D}} & {\underline{\bf Q}} & & \cr {113} & {113} & {113} & {113} & {113} & {113} & {113} & & \cr } } $$

Fig. 9. Conserved structure that contains two octamers LNRAPTLH and NADFDGDQ (at the ends of the structure).

NLLGKRVD—RNA Polymerase Beta’ Subunit

Remarkably, this highly conserved octamer is found in exactly the same proteins and crystals as the above two octamers (see also No. 9, SIGEPGTQ). Their locations in the sequences, however, are different. The NLLGKRVD octamer is very different structurally. In all seven occurrences in the PDB database the local structures around the octamer are different and not superimposable (data not shown). That is, there is no structural conservation either in the vicinity of the octamers or within the octamers themselves. The structures are all extended and do not form closed loops, unlike all the above examples. The consensus sequence around the octamer (below) does show unusually high conservation. In the vicinity of the octamer nine more positions are as conserved as the octamer itself (all shown in boldface).

$$ {\matrix{ {\rm D} & {\rm M} & {\rm L} & {\rm K} & {\bf G} & {\bf K} & {\rm Q} & {\bf G} & {\rm R} \cr {82} & {80} & {70} & {103} & {113} & {113} & {96} & {113} & {108} \cr {\rm F} & {\bf R} & {\rm Q} & {\underline{\bf N}} & {\underline{\bf L}} & {\underline{\bf L}} & {\underline{\bf G}} & {\underline{\bf K}} & {\underline{\bf R}} \cr {112} & {113} & {109} & {113} & {113} & {113} & {113} & {113} & {113} \cr {\underline{\bf V}} & {\underline{\bf D}} & {\rm Y} & {\bf S} & {\rm G} & {\bf R} & {\rm S} & {\bf V} & {\bf I} \cr {113} & {113} & {105} & {113} & {105} & {113} & {112} & {113} & {113} \cr {\rm V} & {\rm V} & {\rm G} & {\bf P} & {\rm E} & & & & \cr {67} & {105} & {112} & {113} & {33} & & & & \cr }} $$
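The count of fully conserved flanking positions can be read directly off the consensus table above. The sketch below transcribes the table as (residue, count) pairs and lists the positions outside the octamer whose count equals that of the octamer itself. It assumes that the largest count in the table (113) is the total number of sequences considered; the variable names are illustrative only.

```python
# (residue, count) pairs transcribed from the consensus table above, in reading order.
CONSENSUS = [
    ("D", 82), ("M", 80), ("L", 70), ("K", 103), ("G", 113), ("K", 113),
    ("Q", 96), ("G", 113), ("R", 108),
    ("F", 112), ("R", 113), ("Q", 109), ("N", 113), ("L", 113), ("L", 113),
    ("G", 113), ("K", 113), ("R", 113),
    ("V", 113), ("D", 113), ("Y", 105), ("S", 113), ("G", 105), ("R", 113),
    ("S", 112), ("V", 113), ("I", 113),
    ("V", 67), ("V", 105), ("G", 112), ("P", 113), ("E", 33),
]
OCTAMER = "NLLGKRVD"

consensus_seq = "".join(residue for residue, _ in CONSENSUS)
start = consensus_seq.index(OCTAMER)          # 0-based start of the octamer
octamer_span = range(start, start + len(OCTAMER))

# Assumption: the maximum count equals the number of sequences compared.
total = max(count for _, count in CONSENSUS)

# Positions (1-based) outside the octamer that are as conserved as the octamer.
flanking_invariant = [
    (i + 1, residue)
    for i, (residue, count) in enumerate(CONSENSUS)
    if count == total and i not in octamer_span
]

print(len(flanking_invariant), flanking_invariant)
```

With the table as transcribed, this yields nine flanking positions (G, K, G, R, S, R, V, I, P), in agreement with the number stated in the text.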

It appears, thus, that the structure around the octamer exemplifies a naturally unfolded state. As in some other known examples of unfolded chains (Liu et al. 2002), it may become uniquely structured after engagement in a specific functional complex with other proteins and, in this case, with DNA.

AGDGTTTA—Chaperonin GroEL

One hundred and twelve proteomes harbor this octamer. It is found in the sequences of chaperonin GroEL (136 cases) and other chaperonins (9 cases). The PDB database offers 19 crystal structures of the chaperonins, of which the following four are non-redundant:

1GRL, Chaperonin GroEL of Escherichia coli.

...ENMGAQMVKEVASKANDAAGDGTTTATVLAQAIITEGLKAVAA...

1IOK, chain A. Chaperonin-60 from Paracoccus denitrificans.

...ENMGAQMVREVASRTNDEAGDGTTTATVLAQAIVREGLKAVAA...

1Q2V, chain A. Chaperonin from Thermococcus strain Ks-1.

...QHPAAKMMVEVAKTQDKEAGDGTTTAVVIAGELLRKAEELLDQ...

1WE3, chain A. Chaperonin complex Cpn60/Cpn10/(ADP)7 from Thermus thermophilus.

...ENIGAQLLKEVASKTNDVAGDGTTTATVLAQAIVREGLKNVAA...
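A column-wise consensus of the four fragments listed above can be tallied in a few lines. This is a sketch only: it treats the fragments as an ungapped alignment (they have equal length as printed and carry the octamer at the same offset), and it uses just these four PDB sequences rather than the full set of proteome sequences behind the consensus tables in this paper. The helper `column_consensus` is a name introduced here for illustration.

```python
from collections import Counter

# Fragments copied verbatim from the four non-redundant PDB entries above.
FRAGMENTS = {
    "1GRL": "ENMGAQMVKEVASKANDAAGDGTTTATVLAQAIITEGLKAVAA",
    "1IOK": "ENMGAQMVREVASRTNDEAGDGTTTATVLAQAIVREGLKAVAA",
    "1Q2V": "QHPAAKMMVEVAKTQDKEAGDGTTTAVVIAGELLRKAEELLDQ",
    "1WE3": "ENIGAQLLKEVASKTNDVAGDGTTTATVLAQAIVREGLKNVAA",
}
OCTAMER = "AGDGTTTA"


def column_consensus(seqs):
    """For each alignment column return (most common residue, its count).

    Ties are resolved in favour of the residue seen first in the column.
    """
    seqs = list(seqs)
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("fragments must have equal length")
    return [Counter(column).most_common(1)[0] for column in zip(*seqs)]


# Offset of the octamer in each fragment (expected to coincide).
offsets = {name: seq.index(OCTAMER) for name, seq in FRAGMENTS.items()}
consensus = column_consensus(FRAGMENTS.values())

# Columns in which all four fragments carry the same residue (1-based).
invariant_columns = [i + 1 for i, (_, n) in enumerate(consensus) if n == len(FRAGMENTS)]

print("".join(residue for residue, _ in consensus))
print(invariant_columns)
```

Because only four structures are compared, the counts are far coarser than those in the proteome-wide consensus; the sketch merely shows how a residue-plus-count table of this kind can be assembled.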

The structural elements that include the octamer AGDGTTTA are shown superimposed in Fig. 10. The consensus sequence around the octamer:

$$ {\matrix{ {\rm E} & {\rm N} & {\rm M} & {\rm G} & {\rm A} & {\rm Q} & {\rm L} & {\rm V} & {\rm K} \cr {128} & {129} & {98} & {140} & {141} & {99} & {78} & {113} & {85} \cr {\rm V} & {\rm V} & {\rm A} & {\rm S} & {\rm K} & {\rm T} & {\rm N} & {\rm D} & {\rm V} \cr {117} & {132} & {142} & {102} & {128} & {109} & {93} & {135} & {47} \cr {\underline{\bf A}} & {\underline{\bf G}} & {\underline{\bf D}} & {\underline{\bf G}} & {\underline{\bf T}} & {\underline{\bf T}} & {\underline{\bf T}} & {\underline{\bf A}} & {\rm T} \cr {145} & {145} & {145} & {145} & {145} & {145} & {145} & {145} & {138} \cr {\rm V} & {\rm L} & {\rm A} & {\rm Q} & {\rm A} & {\rm I} & {\rm V} & {\rm R} & {\rm E} \cr {126} & {139} & {133} & {112} & {104} & {84} & {79} & {53} & {128} \cr {\rm G} & {\rm L} & {\rm K} & {\rm N} & {\rm V} & {\rm A} & {\rm A} & & \cr {139} & {97} & {98} & {63} & {127} & {92} & {121} & & \cr }} $$

does not resemble any prototype described earlier and, thus, represents a new conserved sequence/structure element.

Fig. 10. Conserved structure that contains octamer AGDGTTTA (left end of the bottom helix).

Discussion

The presence of the conserved octamers in almost all sequenced genomes (proteomes), across many different phylogenetic lineages, makes their origin by evolutionarily late horizontal exchange quite unlikely. Rather, they are likely to have been inherited from a common ancestor of the species. In those species where the conserved octamer is not found, its analog is usually detected (data not shown), differing from the octamer sequence by one or two residues. We believe that the omnipresent sequence elements, and the structural modules they represent, should serve as a basis for molecular reconstruction of the last common ancestor of modern species. The twenty-one elements described above already indicate what the reconstruction may look like.
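Detection of such analogs amounts to a search for sequence windows that differ from the octamer at no more than two positions. A minimal sketch is given below; it is not the procedure actually used in this work, and the example protein string is hypothetical, made up purely to illustrate a one-residue variant.

```python
def find_near_matches(protein, octamer, max_mismatches=2):
    """Yield (0-based start, window, mismatch count) for every window of the
    protein that differs from the octamer at no more than max_mismatches
    positions (substitutions only, no insertions or deletions)."""
    k = len(octamer)
    for start in range(len(protein) - k + 1):
        window = protein[start:start + k]
        mismatches = sum(a != b for a, b in zip(window, octamer))
        if mismatches <= max_mismatches:
            yield start, window, mismatches


# Hypothetical sequence (not from any database) carrying a one-residue
# variant of the AGDGTTTA octamer.
EXAMPLE_PROTEIN = "MKTLAGDGSTTAVVLE"

hits = list(find_near_matches(EXAMPLE_PROTEIN, "AGDGTTTA"))
print(hits)
```

Allowing insertions or deletions as well would require an alignment-based search rather than this fixed-window comparison.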

Two Major Prototypes

Many of the last universal common ancestor (LUCA) elements appear as closed loops, described earlier as evolutionary predecessors of modern proteins (Trifonov and Berezovsky 2003). The most conserved elements correspond to the major prototypes of the closed loops, Aleph and Beth (Berezovsky and Trifonov 2002; Berezovsky et al. 2003a, 2003b).

Clustering of Omnipresent Elements

Different omnipresent octamers sometimes belong to the same structural module, as in the cases of LSGGQQQR/QRVAIARA/DEPTSALD and LNRAPTLH/NADFDGDQ. This indicates a special importance of these modules in early protein evolution. Some modern proteins carry two or more different omnipresent motifs at different locations within their structures (ABC transporters, chaperones, RNA polymerases). This, again, suggests a key role of these proteins in early evolution.

Search for Even Earlier Elements

Several highly conserved modules containing a common motif (e.g., the Walker A motif in GHVDHGKT, gpSGSGKSTL, GPPGTGKT, GLPNVGKS, GPNGAGKS) may belong, both structurally and sequence-wise, to the same prototype (Aleph in this case). This points to a possible even earlier “baby-LUCA” ancestor. The “age” of the above Walker A octamers, calculated as in Sobolevsky and Trifonov (2005), varies between 12.1 and 14.0 units, compared with 14.4 units for the corresponding octamer GPSGSGKS of the baby-LUCA prototype Aleph (Berezovsky and Trifonov 2002; Berezovsky et al. 2003a, 2003b). That is, the octamers are, indeed, younger than the prototype. This suggests a straightforward strategy for reconstructing the earliest version(s) of the Walker A octamers: exploring even “older” sequence variants of the octamer embedded in the prototype. This can be done, for example, by replacing amino acids P, S, and especially K (ages 15, 14 and 7 units, respectively) with older residues. The immediate function of the Walker A motif is binding of GTP (ATP). The estimated “oldest” variants that demonstrate this binding experimentally will represent, perhaps, the earliest functional protein elements in evolution.
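Membership of an element in the Walker A group can be tested with a simple pattern match. The sketch below assumes the commonly used form of the motif, GxxxxGK[S/T], which is slightly broader than the GxxxxGKS form quoted earlier because it also admits the terminal T of GHVDHGKT and GPPGTGKT. The element gpSGSGKSTL is written in upper case here, and three chaperone and chaperonin octamers from the sections above are included as negative controls.

```python
import re

# Assumed form of the Walker A motif: G, any four residues, G, K, then S or T.
WALKER_A = re.compile(r"G.{4}GK[ST]")

# Elements named in the text; the last three are non-Walker A controls.
ELEMENTS = [
    "GHVDHGKT", "GPSGSGKSTL", "GPPGTGKT", "GLPNVGKS", "GPNGAGKS",
    "GIDLGTTN", "VITVPAYF", "AGDGTTTA",
]

# Map each element to True if the motif occurs anywhere within it.
matches = {element: bool(WALKER_A.search(element)) for element in ELEMENTS}

for element, is_walker_a in matches.items():
    print(f"{element:<10} {'Walker A' if is_walker_a else 'no match'}")
```

The first five elements match the pattern and the three controls do not, consistent with the grouping of the Walker A octamers under the Aleph prototype.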

Sequence—Structure Relation

In most of the cases described, the structure within and around the omnipresent octamers is conserved while the sequence around the octamer is variable. This suggests that conservation of structure is at least as important as sequence conservation. On the other hand, in two cases (GHVDHGKT and SGGLHGVG) the octamers themselves adopt alternative configurations while the structure around the octamers stays unchanged. These observations thus suggest that the sequence/structure relationship is rather convoluted and far from a one-to-one correspondence. Earlier analysis of variable structures of some oligomers in protein crystals (Kabsch and Sander 1985) brought its authors to similar conclusions.

Abiotic Syntheses in Early Life

Among the 21 most conserved octamers none appears to be involved in the synthesis of amino acids, sugars or nucleobases. This is highly counterintuitive, considering the vital importance of these very basic syntheses. Were they, indeed, not in demand at the earliest steps of Life? That could only be the case if the elementary building blocks had been supplied in some other way, e.g., synthesized abiotically. The possibility of abiotic synthesis of amino acids is well demonstrated by the classical work of Miller (1953, 1987). Abiotic synthesis of nucleobases is more problematic. A very promising system has recently been developed (Saladino et al. 2003a, b) in which purine and cytosine were synthesized in the presence of titanium dioxide, an abundant natural catalyst. Further scrutiny of the less conserved octamers, towards the bottom of the list, may indicate which of the basic biosynthetic activities appeared first in evolution, and which ones remained dispensable for some time. Among the latter activities one would find, perhaps, those that are involved today in the biosynthesis of the amino acids formed in the imitation experiments of Miller (1953, 1987).

Elementary and Composite Functions

An important issue is protein function, the sequence/structure elements associated with it, and the partition of functions between the elements. The conserved omnipresent oligopeptides, as well as other frequent sequence elements, make a good basis for the functional description of a protein. The frequent sequence elements can be identified in any protein sequence of interest, so that sequence conservation plots can be generated (Aharonovsky and Trifonov 2005). Numerous sites of sequence conservation, spanning 6–10 amino acid residues each, are seen in the plots, indicating that the one nominal function ascribed to a given protein actually splits into 5–10 elementary functions, most of which are unknown. One example of such an elementary function is GTP/ATP binding by the Walker A octamers within the conserved structure of the prototype Aleph. This is, of course, only one of the components of the general composite function of, say, a translation elongation factor or an ABC transporter. A full description of any such composite function would require characterization of the elementary functions of all omnipresent and frequent oligopeptides indicated by the conservation plot for the respective protein (Aharonovsky and Trifonov 2005). Thus, the frequent elements mapped within a protein sequence are an immediate source for functional diagnostics of the protein.

The cases outlined above correspond to the 21 most conserved omnipresent octamers. The total number of octamers that are likely to represent LUCA elements is of the order of several hundred (Sobolevsky and Trifonov 2005). It is, of course, a formidable task to characterize all the elements in the same way as described in this work. We believe, however, that these highly conserved elements, rather than the variable sequences between them, carry the surviving information about the proteins of LUCA, and that the ultimate full characterization of the elements is, thus, a justified line of research.

It is quite possible that not all LUCA proteins have left signatures in the form of oligopeptides still present in a variety of proteomes. Some of them, important at simple early stages of evolution, may have been irreversibly lost in sophisticated modern species. Perhaps only after detection and reconstruction of all that can be recovered will one also be able to guess what is missing.