Introduction and Hypothesis

In its most general sense, gene duplication should have been one of the fundamental mechanisms underlying a large part of the origin and evolution of biological structures at all levels. This statement is strongly supported by the extraordinary propensity of nucleic acids for replication and duplication. Indeed, this point of view has been universally recognised since the work of Ohno (1970), who suggested that gene and genome duplications must have been one of the primary forces guiding the adaptive evolution of genetic systems. It seems evident that gene duplication must have been even more important and more operative in the very early phases of the evolution of life, those which led to the formation of fundamental molecules. That is to say, gene duplication should also be easily traceable in the genesis of several of the fundamental molecules that characterise present-day organisms, and in particular in basic molecules made of RNA. Indeed, there are some instances of basic molecules that seem to have originated by means of duplication, for example the tRNA molecule, which might have been generated by the assembly of two hairpin-like RNA structures (formally a direct duplication) (Di Giulio 1992; Branciamore and Di Giulio 2011).

The readiness with which nucleic acids replicate and duplicate, together with the existence of symmetry in a given molecule that might have arisen by gene duplication, has not prevented the proposal of gradualistic “growth” mechanisms, in which a molecule is generated through the insertion of a few monomers at each evolutionary step. For instance, Rodin et al. (2011) have suggested that the tRNA molecule originated through the gradual accumulation of small sections over six evolutionary stages, a mode of growth which appears gradualistic. A gradualistic point of view has also been proposed by Sun and Caetano-Anollés (2008), although their model differs in detail from that suggested by Rodin et al. (2011). Similarly, Sun and Caetano-Anollés (2009, 2010) have reported that the 5S ribosomal RNA molecule (5S rRNA) emerged through five evolutionary stages, in which the growth of the molecule appears to have been gradualistic and non-punctuated.

However, the almost perfect symmetry of the 5S rRNA molecule (Fox and Woese 1975; Hannock and Wagner 1982; Joachimiak et al. 1990; Luehrsen and Fox 1981; Sun and Caetano-Anollés 2009; Fig. 1) seems strongly to favour a single indirect duplication as its origin (Di Giulio 2010; Fig. 1); that is to say, its origin might have been punctuated and not gradualistic. This also seems to be the case for the origin of the tRNA molecule (Di Giulio 1992, 2004, 2006, 2009), contrary to the suggestions of other authors (Sun and Caetano-Anollés 2008; Rodin et al. 2011). The V domain of the 23S rRNA also seems to have had a segment of 110 nucleotides duplicated, placing the P and A sites of the ribosome in their present evolutionary relationship (Bokov and Steinberg 2009). Furthermore, we wish to emphasise that a single direct duplication could also have been involved in the origin of the 5S rRNA molecule, depending on the presence or absence, in the nascent stages of its origin, of the stem between the first bases at the 5′-end and the last bases at the 3′-end of the molecule, i.e. the so-called I stem (or S1) (Fig. 1).

Fig. 1

A model of the origin of the 5S rRNA molecule (Di Giulio 2010). It is shown how a simple complementary copy of the 5PH of this molecule is able to produce the 3PH, and how the ligation of these two halves creates the complete secondary structure of the 5S rRNA molecule. The indicated regions of the secondary structure are consistent with the official nomenclature (see, for instance, Sun and Caetano-Anollés 2010)

The apparent near-perfect symmetry of the 5S rRNA molecule (Fig. 1) is doubted by Sun and Caetano-Anollés (2010), who have provided evidence in favour of a model of gradualistic growth which would be more parsimonious than the one based on a unique inverse duplication (Di Giulio 2010). Their suggestion is particularly challenging in that it would imply that a single duplication could be less parsimonious than a gradualistic accretion for the formation of a symmetrical molecule like the 5S rRNA. This would contradict the logic that nucleic acids are highly predisposed to undergo discrete duplications whenever the symmetry in question is one that duplication can generate. More generally, we favour the hypothesis that a symmetrical molecule is created only by means of gene duplication. The symmetry could not arise through gradualistic growth, as the latter should be less parsimonious than a discrete duplication, given the propensity of nucleic acids to undergo such duplications.

Thus, we are not persuaded by the suggestion of Sun and Caetano-Anollés (2010) and have instead subjected the punctuated model to an empirical test by means of its main prediction, that of a unique indirect duplication. In other words, our model holds that the complementarity between the two halves of the 5S rRNA molecule reflects the way in which it originated (Di Giulio 2010; Fig. 1). We have thus analysed sequences of the 5S rRNA to see whether there are still traces of an evolutionary relationship of complementarity between the two halves of this molecule, which would be witnesses of the indirect duplication. At the same time, we have tested the direct duplication model as an alternative explanation of the origin of the 5S rRNA molecule, looking in this case not for complementarity but for similarity between the two halves of the molecule, in the expectation of finding the two halves to be homologous.

Materials and Methods

Source of 5S rRNA Sequences

5S rRNA sequences were downloaded from the database of Szymanski et al. (2000). The database provides information on phylogenetically conserved secondary structures of Archaeal, Bacterial and Eukaryal 5S rRNAs, and multiple sequence structural alignments for each of the three domains of life. The original database was cleaned by removing low-quality (i.e. incomplete) sequences, and we eliminated sequences which were too similar (i.e. belonging to closely related species). Our refined database consisted of 364 sequences from Eukarya, 64 sequences from Archaea, and 256 sequences from Bacteria.

Generation of Random 5S rRNA Sequences

Each simulated 5S rRNA sequence has the following properties: (i) it has the same base composition as the corresponding real 5S rRNA sequence, and (ii) it can fold into the same secondary structure as that real sequence, as reported in the database.

The algorithm used to produce a simulated counterpart of a real 5S rRNA sequence consists of three steps:

  (1) Read the real 5S rRNA sequence from the database and evaluate its base composition. The base frequencies are estimated independently for paired and unpaired regions.

  (2) Read the corresponding secondary structure in dot-bracket notation from the database.

  (3) Starting from the 5′-end, read the secondary structure of the 5S rRNA at the ith position:

    • if the ith position is a dot ‘.’, assign a random nucleotide N with probability proportional to the frequency of unpaired nucleotides in the real 5S rRNA;

    • if the ith position is an open bracket ‘(’, find the complementary position i′ in the secondary structure, and assign one of the possible nucleotide pairs N::N′ to positions i and i′ with probability proportional to the frequency of paired nucleotides in the real 5S rRNA.

    A position holding a close bracket ‘)’ has already been assigned together with its partner. Step 3 is repeated for the whole length of the sequence (see also Branciamore and Di Giulio 2011).
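The three steps above can be sketched in Python as follows. This is only an illustrative sketch and not the scripts used in this work: the function names `pair_table` and `simulate` are hypothetical, the “frequency of paired nucleotides” is interpreted here as the frequency of each ordered base pair observed in the real sequence, and the structure is assumed to be a balanced dot-bracket string without pseudoknots. Because nucleotides are drawn at random from the observed frequencies, the composition of a simulated sequence matches that of the real one in expectation rather than exactly.

```python
import random
from collections import Counter


def pair_table(structure):
    """Map the index of each '(' to the index of its matching ')'."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs[stack.pop()] = i
    return pairs


def simulate(seq, structure, rng=random):
    """Return a random sequence compatible with `structure` (hypothetical helper)."""
    pairs = pair_table(structure)

    # Step 1: composition, estimated separately for unpaired bases and base pairs.
    unpaired = Counter(seq[i] for i, ch in enumerate(structure) if ch == ".")
    paired = Counter((seq[i], seq[j]) for i, j in pairs.items())

    out = [None] * len(seq)
    # Step 3: walk the structure from the 5' end.
    for i, ch in enumerate(structure):
        if ch == ".":
            # Draw one nucleotide, weighted by the unpaired frequencies.
            out[i] = rng.choices(list(unpaired), weights=list(unpaired.values()))[0]
        elif ch == "(":
            # Draw one base pair, weighted by the observed pair frequencies,
            # and assign both partners at once.
            a, b = rng.choices(list(paired), weights=list(paired.values()))[0]
            out[i], out[pairs[i]] = a, b
        # A ')' position was already filled when its '(' partner was visited.
    return "".join(out)
```

For example, `simulate("GCAUUUUGC", "(((...)))")` returns a nine-nucleotide sequence whose three loop positions are U and whose stem positions are drawn from the pairs G::C, C::G and A::U.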

Definition of the Two Halves of the 5S rRNA Molecule

According to the proposed model, the molecules of the 5S rRNA were divided into two halves: the 5′ half (5PH) and the 3′ half (3PH) (Di Giulio 2010; Fig. 2). The two halves were then aligned and compared (see below).

Fig. 2

The 5S rRNA consensus secondary structure for a Eukarya, b Archaea and c Bacteria. The arrow marks the position at which the 5S rRNA molecule was divided into the 5PH and the 3PH

To check the hypothesis that insertions/deletions occurred during the evolution of the 5S rRNA, we removed nucleotides progressively from the 5′-end of the 5PH before proceeding to the alignment of the two halves. The numbers 0, 1, 2, 3, … indicate the number of nucleotides removed from the 5PH before the alignment.

Alignment of Two Halves of the 5S rRNA

Alignments of the two halves were obtained using the software MUSCLE (Edgar 2004), run from custom Perl scripts. To test the direct and inverse duplications, the percentage of identity between the 5PH and 3PH (or between the 5PH and the reverse complement of the 3PH, rc3PH) was calculated as the number of identical nucleotides in the alignment. The identity percentage obtained for each real 5S rRNA from the database was compared with that observed in a simulated sample. A library of 1,000 simulated 5S rRNA sequences was built as described above. For each 5S rRNA sequence in the database, we calculated a Z(I):

$$ Z(I) = (I^{\text{real}} - I^{\text{rand}} )/s^{\text{rand}} $$

where $I^{\text{real}}$ is the identity percentage found in the real sequence, $I^{\text{rand}}$ is the average identity percentage found in the simulated sample and $s^{\text{rand}}$ is the standard deviation of the identity percentage in the simulated ensemble [see Chamary and Hurst 2005 for a discussion of a variable equivalent to Z(I)]. The Z(I) represents the number of standard deviations by which the identity percentage observed in a real 5S rRNA sequence differs from the one observed in the simulated ensemble. Positive values of the Z(I) represent a similarity higher than that expected by chance alone. If the real sequences behaved like the simulated ones, the distribution of Z(I) values should be normal with mean zero.
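As a small worked illustration of this definition (a sketch, not the analysis code), the Z(I) of one real sequence can be computed from its identity percentage and the identity percentages of its simulated ensemble. The helper name `z_score` is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption of this sketch, since the estimator is not specified in the text.

```python
from statistics import mean, stdev


def z_score(identity_real, identities_rand):
    """Z(I) = (I_real - mean(I_rand)) / sd(I_rand), for one real sequence."""
    # stdev() is the sample standard deviation (assumption of this sketch).
    return (identity_real - mean(identities_rand)) / stdev(identities_rand)


# Toy numbers, for illustration only: the simulated ensemble has mean 50
# and standard deviation 10, so an observed identity of 60% gives Z(I) = 1.
toy_z = z_score(60.0, [40.0, 50.0, 60.0])
```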

In the intra-molecular analysis, we compared regions of the 5S rRNA belonging to the same molecule. In the inter-molecular analysis the 5PH of a molecule was compared to the 3PH of another 5S rRNA molecule taken randomly from the remaining 5S rRNAs of the same domain of life. Finally, in the case of inter-domain analysis the two halves belong to different domains of life; for example, in the comparison between Eukaryal and Bacterial domains, the first half belongs to Eukarya whereas the second half to Bacteria.
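The inter-molecular pairing described above can be sketched as follows. This is a hedged illustration only: the function name `intermolecular_pairs` and the representation of each molecule as a `(5PH, 3PH)` tuple are hypothetical, and the partner is drawn uniformly at random from the other molecules of the same domain, which is one reading of “taken randomly from the remaining 5S rRNAs”.

```python
import random


def intermolecular_pairs(halves, rng=random):
    """Pair each 5PH with the 3PH of a different, randomly chosen molecule.

    `halves` is a list of (5PH, 3PH) tuples from a single domain of life.
    """
    if len(halves) < 2:
        raise ValueError("at least two molecules are needed")
    pairs = []
    for i, (five_half, _) in enumerate(halves):
        # Choose a partner index among the remaining molecules only.
        j = rng.choice([k for k in range(len(halves)) if k != i])
        pairs.append((five_half, halves[j][1]))
    return pairs
```

For the inter-domain analysis the same idea applies, with the 3PH drawn from the list of a different domain.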

The R software package was used to perform a t test to evaluate whether the distribution of the Z(I) scores observed in the real 5S rRNA sample is significantly different from that observed, by chance alone, in a simulated sample.

Results and Specific Discussion

We tested the hypothesis that the origin of the 5S rRNA molecule was based on the duplication (direct or inverse) of an ancestral hairpin-like molecule, followed by covalent bonding of the two halves to generate the present 5S rRNA secondary structure (see Introduction and Hypothesis, and Fig. 1; Di Giulio 2010). If the hypothesis is correct, we might find traces of the common origin of the two halves in modern sequences of the 5S rRNA. We estimated the percentage of identity observed between the two halves in real 5S rRNA sequences and compared it to that expected by chance alone in the simulated sample (Chamary and Hurst 2005). In particular, according to the consensus secondary structures reported in the 5S rRNA database, the 5S rRNA secondary structure was divided into two halves, the 5PH and the 3PH, as shown in Fig. 2. The direct duplication hypothesis was tested by direct alignment of the 5PH and 3PH, whereas to test the inverse duplication hypothesis we aligned the 5PH with the reverse complementary sequence of the 3PH (rc3PH).
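The two quantities that distinguish the direct test from the inverse one can be illustrated with a short Python sketch. It is not the code used in this work: `reverse_complement` and `percent_identity` are hypothetical helper names, the two input strings of `percent_identity` are assumed to be already aligned (for instance by MUSCLE) with ‘-’ as the gap character, and the choice of the alignment length as denominator is an assumption of this sketch.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (e.g. rc3PH from 3PH)."""
    return rna.translate(COMPLEMENT)[::-1]


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same nucleotide."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("inputs must be non-empty and of equal aligned length")
    # Gap-gap columns are not counted as matches.
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)
```

Under the direct model the 5PH would be aligned with the 3PH itself, and under the inverse model with `reverse_complement(three_half)`, before `percent_identity` is applied to the aligned pair.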

As the consensus secondary structures of the 5S rRNA differ in the three domains of life (Fig. 2) the analyses were done independently for Bacterial, Archaeal and Eukaryal domains. We performed an intra-molecular, inter-molecular and inter-domain comparison.

Intra-molecular Comparison

In the case of the direct duplication model (5PH vs 3PH), the percentage of identity between the two putative halves is not significantly higher than would be expected by chance for the Archaeal, Eukaryal and Bacterial domains (Tables 2, 3, 4). Considering the inverse duplication model (5PH vs rc3PH), the percentage of identity between the two halves was significant for the Eukaryal and Bacterial domains (Tables 2, 4), whereas for the Archaeal domain it was not statistically significant (Table 3) with respect to that observed in the simulated 5S rRNA sequences.

The comparison of the two putative halves, the 5PH and 3PH (or rc3PH), as such seems to support the prediction of the inverse duplication model for the origin of the 5S rRNA molecule for the Eukaryal and Bacterial domains but not for the Archaeal domain, and with probabilities that are only marginally significant. However, before we reject the models, we observe that the 5PH in modern 5S rRNAs is longer than the 3PH (Fig. 2; Table 1). This means that the perfect symmetry assumed by the direct and inverse duplication models was lost (if ever present) during the evolution of the 5S rRNA. In particular, in the Archaeal and Eukaryal domains the 5PH is on average about 11 nucleotides longer than the 3PH, and in the Bacterial domain it is more than 17 nucleotides longer (Table 1). This observation could suggest that, during the evolutionary process leading to the modern 5S rRNA, insertions/deletions occurred, breaking the perfect symmetry assumed by the duplication models. If this is correct, it will be necessary to remove the ‘excess’ nucleotides present in the 5PH in order to align it correctly with the corresponding portions of the 3PH. More generally, the removal of nucleotides from the 5PH might favour the alignment between the 5PH and 3PH. To this end, we used three distinct strategies to remove nucleotides from the 5PH:

Table 1 Average lengths of the 5PH and 3PH

  (i) 5′ removal (5PR): nucleotides from the 5PH at the 5′-end were progressively removed before the alignment with the 3PH or rc3PH. Then the percentage of identity and Z(I) between the two halves was evaluated.

  (ii) 3′ removal (3PR): nucleotides from the 5PH at the 3′-end were progressively removed before the alignment with the 3PH or rc3PH. Then the percentage of identity and Z(I) between the two halves was evaluated.

  (iii) Random removal (RR): an increasing number of nucleotides from the 5PH were randomly selected and removed before the alignment with the 3PH or rc3PH. Then the percentage of identity and Z(I) between the two halves was evaluated.
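The three removal strategies can be expressed compactly in Python. The following is a minimal sketch under stated assumptions and not the scripts used in this work: the function names are hypothetical, `half` is the 5PH as a plain string written 5′ to 3′, and in the RR strategy the positions to remove are drawn uniformly without replacement, which is one reading of “randomly selected”.

```python
import random


def remove_5p(half, n):
    """5PR: drop n nucleotides from the 5' end of the 5PH."""
    return half[n:]


def remove_3p(half, n):
    """3PR: drop n nucleotides from the 3' end of the 5PH."""
    return half[:max(len(half) - n, 0)]


def remove_random(half, n, rng=random):
    """RR: drop n randomly chosen positions, keeping the remaining order."""
    drop = set(rng.sample(range(len(half)), n))
    return "".join(base for i, base in enumerate(half) if i not in drop)
```

In each case the shortened 5PH would then be aligned with the 3PH or rc3PH, and the procedure repeated for n = 0, 1, 2, 3, … to obtain the Z(I) as a function of the number of nucleotides removed.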

The Z(I) value between the two halves is reported as a function of the number of nucleotides removed from the 5PH (using the different strategies described above) before its alignment with the 3PH (or rc3PH). Results for the intra-molecular comparison are shown in Fig. 3.

Fig. 3

Intra-molecular analysis. Variation of the Z(I) score (see ‘Materials and methods’) as a function of nucleotides removed at the 5′-end of the 5PH (5PR strategy) in a Eukarya, b Archaea and c Bacteria. The solid line with filled circles represents the Z(I) score observed for the direct duplication model (5PH vs 3PH). The dashed line with empty squares represents the Z(I) score observed for the inverse duplication model (5PH vs rc3PH)

Fig. 4

Inter-molecular analysis. Variation of the Z(I) score (see ‘Materials and methods’) as a function of nucleotides removed at the 5′-end of the 5PH (5PR strategy) in a Eukarya, b Archaea and c Bacteria. The solid line with filled circles represents the Z(I) score observed for the direct duplication model (5PH vs 3PH). The dashed line with empty squares represents the Z(I) score observed for the inverse duplication model (5PH vs rc3PH)

Initially, we examined the case of the direct duplication model. For all three domains of life, the percentage of identity between the two halves did not increase using the 5PR strategy, and it was not higher than that observed in a sample of simulated 5S rRNA sequences (Tables 2, 3, 4). The results were similar using the 3PR or RR strategies (data not shown). The only exception observed was for the Eukaryal and Archaeal domains, which show a peak with a significantly higher percentage of identity after the removal of about 22 nucleotides (data not shown).

Table 2 Eukaryal intra-molecular analysis utilising the 5PR strategy
Table 3 Archaeal intra-molecular analysis utilising the 5PR strategy
Table 4 Bacterial intra-molecular analysis utilising the 5PR strategy

The inverse duplication model gave a different picture. Under the assumption of this model, one half was the precursor of the entire molecule, and an inverse duplication of this precursor created the complete 5S rRNA molecule (Di Giulio 2010; Fig. 1). In this case, we should expect complementarity between the 5PH and 3PH (that is, similarity between the 5PH and rc3PH). We first describe the results obtained using the 5PR strategy.

The Eukaryal domain (Fig. 3a; Table 2) shows a clear increase in the percentage of identity between the 5PH and rc3PH after the removal of nucleotides at the 5′-end of the 5PH. In particular, after removal of the first four nucleotides a clear increase is observed in the Z(I) value, which reaches a maximum of 0.82 (P = 8.2 × 10⁻⁴³) after removal of 7–8 nucleotides, remains almost constant up to the removal of 30 nucleotides and then starts to decrease (see Table 2 for a full list of results after removal of up to 40 nucleotides).

The Archaeal domain presents a similar behaviour (Fig. 3b; Table 3). The Z(I) value grows as nucleotides are removed from the 5′-end of the 5PH and reaches a maximum after removal of 18–19 nucleotides (Z(I) = 0.63, P = 1.39 × 10⁻⁹) (Table 3).

As previously observed for the Eukaryal and Archaeal domains, the Bacterial domain also shows a Z(I) value which increases after the removal of nucleotides at the 5′-end of the 5PH (Fig. 3c; Table 4). The comparison of the 5PH versus rc3PH in the Bacterial domain initially shows Z(I) values <0, but after removal of 11 nucleotides the Z(I) value suddenly increases, and reaches a maximum of Z(I) = 0.68 (P = 8.76 × 10⁻¹⁴) when 16 nucleotides are deleted (Table 4).

The situation is different if nucleotides are removed starting from the 3′-end of the 5PH (3PR strategy). In this case, the alignment of the 5PH and rc3PH did not show the similarity expected for the inverse duplication model (data not shown, see also Table 11A). The random removal of nucleotides from the 5PH (RR strategy) gives results equivalent to those observed using the 5PR strategy (data not shown, see also Table 11).

Inter-molecular Comparison

The inter-molecular comparison should reduce any correlation biases that may be present in the intra-molecular analysis owing to interactions between different regions of the same molecule. We repeated the analysis described in the previous section; in this case, however, the halves compared (the 5PH and 3PH) do not belong to the same molecule but were chosen from two different 5S rRNAs.

The results obtained for the inter-molecular comparison agree with those of the intra-molecular analysis. The identity percentages of the two aligned halves do not support the direct duplication model (for the 5PR, 3PR and RR strategies) (Tables 5, 6, 7). As in the previous case, exceptions were observed for the Eukaryal and Archaeal domains, which show a peak with a significantly higher percentage of identity after the removal of about 22 nucleotides using the 3PR strategy. However, the removal of nucleotides from the 5PH (using the 5PR, 3PR and RR strategies) before alignment with the rc3PH clearly supports the inverse duplication model (see Tables 5, 6, 7; Fig. 4) (data not shown for the 3PR and RR strategies, see Table 11).

Table 5 Eukaryal inter-molecular analysis utilising the 5PR strategy
Table 6 Archaeal inter-molecular analysis utilising the 5PR strategy
Table 7 Bacterial inter-molecular analysis utilising the 5PR strategy

Inter-domain Comparison

To further substantiate these results, we also compared the two halves of the 5S rRNA molecule choosing them from different domains of life. The finding of traces of a common origin between the two halves should represent a severe test for the proposed models, given the divergence in the 5S rRNA secondary structures between Eukaryal, Archaeal and Bacterial domains (Fig. 2).

In this inter-domain comparison the 5PH of the Eukaryal domain was compared to the 3PH of the Archaeal domain (Fig. 5a; Table 8) or to that of the Bacterial domain (Fig. 5b; Table 9), and the 5PH of the Bacterial domain was compared to the 3PH of the Archaeal domain (Fig. 5c; Table 10). The results for the 3PR and RR strategies (data not shown, see Table 11) resemble those previously described for the intra- and inter-molecular comparisons. Here, we focus on the 5PR strategy comparisons between different domains of life.

Fig. 5

Inter-domain analysis. Variation of the Z(I) score (see ‘Materials and methods’) as a function of nucleotides removed at the 5′-end of the 5PH (5PR strategy) in a 5PH of Eukarya versus 3PH of Archaea, b 5PH of Eukarya versus 3PH of Bacteria, c 5PH of Bacteria versus 3PH of Archaea. The solid line with filled circles represents the Z(I) score observed for the direct duplication model (5PH vs 3PH). The dashed line with empty squares represents the Z(I) score observed for the inverse duplication model (5PH vs rc3PH)

Table 8 Eukaryal versus Archaeal inter-domain analysis utilising the 5PR strategy
Table 9 Eukaryal versus Bacterial inter-domain analysis utilising the 5PR strategy
Table 10 Bacterial versus Archaeal inter-domain analysis utilising the 5PR strategy
Table 11 Summary of results obtained in the intra-molecular, inter-molecular (A) and inter-domain (B) comparisons, using the 5PR strategy, RR strategy and 3PR strategy (see ‘Results’)

The comparison between the Eukaryal and Archaeal domains shows that (i) the direct duplication model is not supported by the percentage of identity analysis (Table 8), whereas (ii) the inverse duplication model receives significant support (Table 8). The 5S rRNA molecules of the Eukaryal and Bacterial domains present the highest structural divergence (Fig. 2). The results obtained for this pair substantially confirm what was already observed, although the increase in the percentage of identity between the two halves is less evident, but still statistically significant, after the removal of 13 nucleotides (Table 9).

The comparison of the Bacterial and Archaeal domains confirms the trend observed in all other cases (Table 10), and this is in agreement with the predictions of the inverse duplication model (Di Giulio 2010). By contrast, the direct duplication model does not receive, as already seen, strong support from this analysis (Table 10).

Finally, in Table 11 we summarise the statistical significance observed in the intra-molecular, inter-molecular and inter-domain comparisons for the three strategies of nucleotide removal. In the inter-domain comparison we also performed a further check. Besides performing, for instance, the Eukarya versus Archaea comparison, we also performed the Archaea versus Eukarya comparison, which differs in the domain supplying the 5S rRNA molecule from which nucleotides were removed. The results are in agreement with the indirect duplication model and against the direct one (Table 11).

Summary and General Discussion

We have tested the hypotheses that the origin of the 5S rRNA molecule could have occurred through a direct or an indirect duplication. The results reported here support a hypothesis of complementarity between the two halves of the 5S rRNA molecule (Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11), favouring the indirect duplication model (Fig. 1; Di Giulio 2010).

In particular, the indirect duplication model is supported by the intra-molecular analysis in all three domains of life (Fig. 3; Tables 2, 11; see ‘Results’), whereas the direct duplication model does not receive adequate corroboration (Tables 2, 4, 11). Moreover, the inter-molecular analysis confirms the intra-molecular one, again corroborating the inverse duplication model and not the direct one (Tables 5, 6, 7, 11; Fig. 4). The severe test represented by the inter-domain comparison also favours the inverse duplication model and not the direct one (Tables 8, 9, 10, 11; see ‘Results’). In particular, the comparison between 5S rRNA molecules from the Eukaryal and Bacterial domains, which show the highest divergence in secondary structure (Fig. 2), again favours the indirect duplication model and not the direct one (Tables 9, 11).

Evidently, the overall statistical significance of the indirect model is also higher than that of the direct model (Table 11). However, there is an exception in the intra-molecular analysis (using the 3PR strategy), in which the indirect model did not reach statistical significance (Table 11A). We interpret this curious observation, and the cases in which the direct model shows some degree of significance, by assuming that they might either (i) indicate that the removal of nucleotides favours the correct alignment between the two halves of the 5S rRNA molecule, or (ii) reflect the hypothesis that a hairpin-like molecule was the precursor of these duplications. In the latter case it is impossible to distinguish between the indirect and the direct models, as we have previously discussed extensively (Branciamore and Di Giulio 2011).

More generally, these observations, along with those made for other RNA molecules showing symmetry or semi-symmetry (Di Giulio 1992, 2010; Bokov and Steinberg 2009), corroborate the hypothesis supported here: when a molecule of RNA or protein shows a symmetry, this must necessarily have been generated by means of gene duplication and not by gradualistic accretion. Thus, we reject those models that try to explain the evolution of some symmetrical molecules by means of gradualistic accretion (Sun and Caetano-Anollés 2008, 2010; Rodin et al. 2011). Instead, noting the extraordinary capacity of nucleic acids to duplicate themselves, we believe that the most parsimonious way of generating these symmetries, namely duplication, should be preferred.