Introduction

In DNA there are four different transition (ti) substitutions, in which a purine (or a pyrimidine) base is replaced by another purine (or pyrimidine) base (R → R; Y → Y). Similarly, there are eight different transversion (tv) substitutions, in which a purine (or a pyrimidine) base is replaced by a pyrimidine (or a purine) base (R → Y; Y → R) (Fersht and Knill-Jones 1981) (Fig. 1). If all 12 substitutions occurred in equal proportions, the ti/tv ratio would be 0.50. However, the ratio reported in different genomes is usually ~ 2.00 or more, which suggests that, per substitution type, a transition is ~ four times more frequent than a transversion in DNA (Seplyarskiy et al. 2012; Duchêne et al. 2015; Stoltzfus and Norris 2016; Lyons and Lauring 2017). This bias towards transitions has been known from comparisons between homologous DNA sequences (Gojobori et al. 1982; Wu and Maeda 1987). Purine:purine and pyrimidine:pyrimidine pairing distorts the geometry of the double helix, which is one reason why transversions are less frequent than transitions (Topal and Fresco 1976; Sinha and Haimes 1981). Several other factors can also contribute to the lower transversion frequency. One such factor is cytosine deamination in DNA during replication and transcription, which causes one of the most common transitions, C → T (G → A) (Lewis et al. 2016; Tubbs and Nussenzweig 2017). In coding regions, transversions at the third and first codon positions are more often non-synonymous than transitions (Vogel and Kopun 1977; Abdullah et al. 2016).

Fig. 1

Different substitution mutations in the genome. The figure presents the twelve possible directional base substitutions in a sequence. In theory, any one of the four bases A, C, G, and U can be replaced by any of the other three bases, resulting in twelve base substitutions. Of these twelve substitutions, four are transitions (ti), in which a purine (R: A/G) (or a pyrimidine (Y: C/U)) is replaced by another purine (R) (or pyrimidine (Y)); the other eight are transversions (tv), in which a purine (R) (or a pyrimidine (Y)) is replaced by a pyrimidine (Y) (or a purine (R))
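
The expected ti/tv ratio of 0.50 under equal proportions follows from simply counting the twelve directional substitutions. The short Python sketch below enumerates them; the names `PURINES`, `PYRIMIDINES` and `classify` are illustrative and are not taken from the scripts used in this study.

```python
from itertools import permutations

PURINES = {"A", "G"}       # R
PYRIMIDINES = {"C", "T"}   # Y (T in DNA, U in RNA)

def classify(ref, alt):
    """Return 'ti' for R->R or Y->Y changes and 'tv' for R->Y or Y->R changes."""
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "ti" if same_class else "tv"

# All ordered pairs of distinct bases: 4 * 3 = 12 directional substitutions.
substitutions = list(permutations("ACGT", 2))
ti = [s for s in substitutions if classify(*s) == "ti"]
tv = [s for s in substitutions if classify(*s) == "tv"]

print(len(ti), len(tv), len(ti) / len(tv))  # 4 8 0.5
```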

Though it is well known that ti is more frequent than tv in genomes, the factors influencing their occurrence remain to be explored further. Transition and transversion were observed with similar frequencies at non-methylated cytosine sites in grasshopper pseudo-genes (Keller et al. 2007). Selective constraints imposed by secondary structure can account for the relative enrichment of ti in tRNA and rRNA genes (Knies et al. 2008). In tRNA and rRNA genes of thermophilic bacteria, there is selection for G + C enrichment in the stem region but not in the loop region (Galtier and Lobry 1997; Wang and Hickey 2002; Hickey and Singer 2004). RNA secondary structure is known to play an important role in gene expression and regulation (Park et al. 2013). Transitions and transversions can have different impacts on an encoded RNA that forms a double helix, as illustrated by the schematic scenarios in Fig. 2. For example, a transition such as G → A results in A:C pairing, while the A → G substitution results in G:U pairing in the encoded transcript. A transversion such as G → U (R → Y) results in U:C (Y:Y) pairing, while the U → G (Y → R) substitution results in G:A (R:R) pairing in the transcript. The different base pairings produced by ti and tv substitutions can destabilize the secondary structure of a transcript to different degrees (Rossetti et al. 2015). Selection on secondary structure acts in favor of transitions and against transversions, which is reflected in changes in free energy (Fig. 2). Therefore, it can be assumed that tv and ti mutations are fixed or selected according to evolutionary constraints associated with maintenance of tRNA secondary structure.
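
The pairing outcomes listed above can be reproduced with a small lookup. The following Python sketch is only an illustration: the function names `pair_type` and `substitute` are hypothetical, and the three categories (Watson-Crick, wobble, mismatch) are a coarse classification that says nothing about the free energy values shown in Fig. 2.

```python
WATSON_CRICK = {frozenset("AU"), frozenset("GC")}
WOBBLE = {frozenset("GU")}

def pair_type(base1, base2):
    """Coarsely classify an RNA base pair."""
    pair = frozenset((base1, base2))
    if pair in WATSON_CRICK:
        return "Watson-Crick"
    if pair in WOBBLE:
        return "wobble"
    return "mismatch"

def substitute(pair, alt):
    """Replace the first base of a (base, partner) pair and describe the new pairing."""
    _, partner = pair
    return f"{alt}:{partner}", pair_type(alt, partner)

# The four examples from the text, each starting from a Watson-Crick pair.
examples = {
    "G->A (ti)": substitute(("G", "C"), "A"),
    "A->G (ti)": substitute(("A", "U"), "G"),
    "G->U (tv)": substitute(("G", "C"), "U"),
    "U->G (tv)": substitute(("U", "A"), "G"),
}
for change, (pairing, kind) in examples.items():
    print(change, pairing, kind)
```

In this coarse scheme only the A → G transition yields a pairing (G:U) that is commonly tolerated in RNA helices; the other three changes yield mismatches.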

Fig. 2

Effect of substitutions on secondary structure in a hypothetical RNA sequence. The figure presents the secondary structure and minimum free energy of six hypothetical sequences: the wild type (Seq 1) and five others with base substitutions (Seq 2-Seq 6). Base substitutions are marked in bold. The RNAfold webserver (Gruber et al. 2008; Lorenz et al. 2011; Mathews et al. 2004), available at http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi, was used to estimate the secondary structure and minimum free energy (∆G) of these sequences. Seq 2, which has a base substitution in the loop region, and Seq 1 have equal minimum free energy, suggesting maximum stability. Seq 3 and Seq 6 have transitions in the stem region. The secondary structures of Seq 3 and Seq 6 have lower free energy than those of Seq 4 and Seq 5, which have transversions. The predicted folded structures and their free energies support the hypothesis that a transversion is more deleterious for RNA secondary structure than a transition. The C → T (or A → G) transition is more favorable than the T → C (or G → A) transition

Earlier studies of ti and tv in tRNA genes have mainly been carried out by comparing genes across species (Higgs 2000; Savill et al. 2001; Jow et al. 2002; Hudelot et al. 2003). Their main finding is that compensatory transition or transversion substitutions are more frequent than single-site independent substitutions in the stem regions of these genes. Further, the ti to tv ratio is observed to be higher in the stem region than in the loop region, because compensatory tv is rarer than compensatory ti. However, a systematic study of ti and tv in tRNA that compares gene sequences within a species has not been reported in the literature to date. Because single substitutions are more frequent than double substitutions, the occurrence of non-compensatory substitutions in the stem region cannot be excluded when the study is carried out within a species. In addition, because G:U base pairing is often allowed in tRNA stems, a non-compensatory substitution from an amino base to a keto base might be preferred over the reverse substitution, which is yet to be explored. The availability of genome sequences from a large number of strains belonging to a species has opened up an avenue to address these questions. tRNA structures are relatively easy to survey for stem and loop regions in different organisms. Therefore, in this study, we carried out a comparative analysis of ti and tv substitutions in tRNA genes using large whole-genome datasets of five bacterial species: Escherichia coli (Ec), Klebsiella pneumoniae (Kp), Salmonella enterica (Se), Staphylococcus aureus (Sa) and Streptococcus pneumoniae (Sp). The frequency of ti relative to tv was found to be higher in the stem regions than in the loop regions of tRNA genes. Further, transitions from amino to keto bases were found to be more frequent than the reverse transitions in the stem regions. These observations indicate that secondary structure in tRNA differentially influences transition and transversion frequencies.

Materials and Methods

Extracting Intergenic Regions, tRNA Genes and Segregating Loop and Stem Regions

In this study, we considered five bacteria, Escherichia coli (Ec), Klebsiella pneumoniae (Kp), Staphylococcus aureus (Sa), Salmonella enterica (Se) and Streptococcus pneumoniae (Sp), for which a large number of genome sequences were available in public databases. We carried out a detailed computational analysis of the tRNA genes in alignments of 157 Ec strains (Thorpe et al. 2017), 208 Kp strains (Holt et al. 2015), 132 Sa strains (Reuter et al. 2016), 366 Se strains (Thorpe et al. 2017), and 264 Sp strains (Chewapreecha et al. 2014) to identify base substitutions. Python scripts were written to extract alignments of the tRNA gene sequences from the alignment of DNA sequences using the coordinates of the annotated tRNA genes. Intergenic regions (IRs), which are the stretches of DNA located between genes, were also extracted using the coordinates of the coding regions (protein coding genes, tRNA genes and rRNA genes). Genomic compositional details of the tRNA genes and IRs analyzed are given in Table 1.
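
The extraction step can be sketched as follows. This is a minimal illustration and not the scripts used in the study: it assumes the alignment is held as a dictionary of equal-length strings, that gene coordinates are 0-based half-open column ranges in that alignment, and that the chromosome is treated as linear. The function names are hypothetical.

```python
def slice_alignment(alignment, start, end):
    """Cut the same column range out of every aligned sequence."""
    return {name: seq[start:end] for name, seq in alignment.items()}

def intergenic_intervals(genes, genome_length):
    """Return the gaps between annotated genes as (start, end) intervals.

    `genes` is a list of (start, end) intervals; overlapping genes are merged
    implicitly by tracking the furthest end seen so far.
    """
    gaps, cursor = [], 0
    for start, end in sorted(genes):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        gaps.append((cursor, genome_length))
    return gaps

# Toy example with made-up sequences and coordinates.
alignment = {"strain1": "ACGTACGTACGTACGTACGT", "strain2": "ACGTACGAACGTACGTACGT"}
genes = [(2, 5), (4, 8), (12, 15)]
for start, end in intergenic_intervals(genes, 20):
    print(start, end, slice_alignment(alignment, start, end))
```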

Table 1 ti/tv ratio in tRNA genes and intergenic regions (IRs)

We extracted the predicted secondary structure of the tRNA genes using the tRNAscan-SE On-line webserver (Lowe and Chan 2016), available at http://lowelab.ucsc.edu/tRNAscan-SE/. This web database (GtRNAdb) is rich in genomic tRNA information. It confirms the cloverleaf-shaped tRNA secondary structure, gene size, anticodon position and anticodon for a given query tRNA gene sequence. tRNA sequences that were extracted from the genome alignments but not confirmed in GtRNAdb were not analyzed in this study. The numbers of genes extracted for Ec, Kp, Sa, Se and Sp were 89, 86, 61, 88 and 59, respectively, whereas the numbers confirmed in GtRNAdb were 87, 85, 61, 87 and 58, respectively. Using this webserver, we obtained the predicted secondary structures of the tRNAs in dot-bracket notation (Supplementary Table 1); the list of tRNA genes of the five bacteria is presented in Supplementary Table 2. For further analysis, we used these predicted secondary structures to classify nucleotides into two groups: (i) paired and (ii) unpaired. Paired and unpaired nucleotides are considered to be largely from stem and loop regions, respectively. In general, the paired nucleotides, or the stem region, are the bases occurring in the acceptor stem, D-arm, anticodon arm, variable region and T-arm. The unpaired nucleotides of these four arms and the variable region are considered as the bases in the loop regions (Supplementary Fig. 1).
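
Separating paired from unpaired positions in a dot-bracket string is a standard stack-matching exercise. The sketch below assumes the structure uses "(" and ")" for paired bases and "." for unpaired bases, with no pseudoknots; the function name `parse_dot_bracket` is illustrative.

```python
def parse_dot_bracket(structure):
    """Return (pairs, unpaired) for a dot-bracket string, using 0-based indices.

    `pairs` maps every paired position to its partner; `unpaired` lists loop positions.
    """
    stack, pairs, unpaired = [], {}, []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        else:
            unpaired.append(i)
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs, unpaired

# Toy hairpin: a two-base-pair stem closing a two-base loop.
pairs, unpaired = parse_dot_bracket("((..))")
print(pairs, unpaired)
```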

Segregating Compensatory and Non-compensatory Substitutions in Stem Regions

For each tRNA gene, substitution positions were mapped to the secondary structure and segregated into loop and stem regions. Further, the substitutions in the stem region were marked as compensatory or non-compensatory depending on whether there exists a pair of substitutions or only one substitution at a paired position in the stem region, respectively (Fig. 3).
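
A minimal Python sketch of this rule is given below. It assumes the stem pairing is available as a dictionary mapping each paired position to its partner and that the substituted positions of one strain are given as a set; both representations, and the function name, are assumptions made for illustration. Following the definition above, "compensatory" here means only that both members of a pair are substituted; the sketch does not check whether pairing is actually restored.

```python
def classify_stem_substitutions(substituted, pairs):
    """Label each substituted position as loop, compensatory or non-compensatory.

    substituted: set of 0-based positions that differ from the reference in one strain
    pairs: dict mapping each stem position to its pairing partner
    """
    labels = {}
    for pos in sorted(substituted):
        if pos not in pairs:
            labels[pos] = "loop"
        elif pairs[pos] in substituted:
            labels[pos] = "compensatory"
        else:
            labels[pos] = "non-compensatory"
    return labels

# Toy hairpin "((..))": positions 0-5 and 1-4 are paired, 2 and 3 are in the loop.
toy_pairs = {0: 5, 5: 0, 1: 4, 4: 1}
print(classify_stem_substitutions({0, 5, 1, 2}, toy_pairs))
```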

Fig. 3

a Polymorphisms in the secondary structure of a hypothetical sequence. The figure presents hypothetical scenarios of base substitutions in the loop, and of compensatory and non-compensatory polymorphisms in stem regions (polymorphisms are depicted in the left panel and descriptions are given in the right panel). The wild type is the original sequence and there are 7 mutant strains. The secondary structure of the sequence is in dot-bracket notation, where a dot represents a base in a loop and a bracket represents a base in a stem. In mutant strains 1 and 2, there is one ti and one tv polymorphism at the 12th and the 14th positions in the loop region, respectively. In mutant strain 3, there are two non-compensatory ti polymorphisms in the stem at the 6th and the 16th positions. In mutant strain 4, there is one non-compensatory tv polymorphism at the 22nd position. Mutant strains 5, 6 and 7 have compensatory polymorphisms. In mutant strain 5, there are two ti polymorphisms at the 9th and the 17th positions; in mutant strain 6, there are two tv polymorphisms at the 7th and the 19th positions. In mutant strain 7, there is one ti polymorphism on one side and one tv polymorphism on the other side of a compensatory pair, at the 5th and 21st positions, respectively. b Polymorphisms in the secondary structure of the Leu tRNA gene of K. pneumoniae

Finding Substitutions from the Sequence Alignments

Taking the most frequent nucleotide at each position in the alignment of a tRNA gene, we computed a reference sequence and then used it to identify substitutions in each sequence (Thorpe et al. 2017; Supplementary Document 1). Substitution frequencies were computed by dividing the total count of a given substitution by the total number of occurrences of the nucleotide that was substituted. For example, if the total number of C → U substitutions is 2 and the total number of C in a tRNA gene sequence is 10, the normalized frequency is 2/10 = 0.2. We further wrote a Python script to classify these substitutions into transitions (ti) and transversions (tv). Observed substitutions in the secondary structure of a sample tRNA gene are shown in Supplementary Fig. 1. We further classified the substitutions in stem regions as compensatory or non-compensatory. For statistical analysis and significance testing (p-values), the Mann-Whitney test was used (Mann and Whitney 1947).
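
The reference-sequence and normalization steps can be sketched in a few lines of Python. This is not the script used in the study. It makes two assumptions that the text does not settle: each distinct alternative base at an alignment column is counted once, regardless of how many strains carry it, and the denominator is the count of the reference base in the consensus sequence. It also works on DNA letters, so the C → U example appears as C → T.

```python
from collections import Counter

def consensus(sequences):
    """Most frequent base at each alignment column (ties go to the base seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))

def substitution_frequencies(sequences):
    """Normalized frequency of each ref->alt substitution against the consensus."""
    ref = consensus(sequences)
    counts = Counter()
    for r, col in zip(ref, zip(*sequences)):
        if r not in "ACGT":
            continue  # skip gap or ambiguous reference columns
        for alt in (set(col) & set("ACGT")) - {r}:
            counts[(r, alt)] += 1
    base_totals = Counter(ref)
    return {sub: n / base_totals[sub[0]] for sub, n in counts.items()}

# Toy alignment built to reproduce the worked example: 10 C in the reference,
# two of which show a C->T substitution in one strain.
seqs = ["CCCCCCCCCCAG"] * 3 + ["TCCTCCCCCCAG"]
print(consensus(seqs))
print(substitution_frequencies(seqs))  # {('C', 'T'): 0.2}
```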

Visualization of 2-D and 3-D Structures of tRNA Genes

For 2-D visualization of tRNA secondary structure, we used the tRNAscan-SE On-line software (Lowe and Chan 2016). To visualize the tRNA 3-D structures, we used two web servers. First, we gave the tRNA sequence and the secondary structure in dot-bracket notation obtained from tRNAscan-SE On-line as input to the Vfold3D webserver (Xu et al. 2014; Zhao et al. 2017) to obtain the 3-D structure in PDB format. Next, the 3-D structure of the tRNA was visualized from the PDB file using the iCn3D web server (Wang et al. 2020) (Supplementary Fig. 1).

Results

Higher Transition to Transversion Ratio in tRNA Genes than Intergenic Regions

From multiple sequence alignments of hundreds of strains of each species, the twelve possible substitutions were identified in intergenic regions (IRs) and tRNA genes of the five bacterial species Ec, Kp, Sa, Se and Sp (Table 1). The substitution frequency in tRNA genes was observed to be ten times lower than that in the IRs. The difference between IRs and tRNA genes is significant (p-value < 0.01) in all five bacteria. This finding was anticipated because tRNA genes are transcribed to make functional tRNA structures, which carry out the vital translation process inside the cell. Therefore, the low substitution frequency in tRNA genes is most likely due to stronger purifying selection on these genes than on IRs. We then compared ti and tv between the two regions. The ti/tv values were greater than 1.0 across the five bacteria, which suggested that ti is more frequent than tv in both genomic regions. The ti/tv values in tRNA genes were greater than those in the IRs (p-value < 0.05) (Table 1). There are two possible explanations for the higher ti/tv value in tRNAs: either low tv or high ti. We calculated the relative fold increase in ti and tv, separately, in the IRs in comparison to the tRNA genes (Supplementary Table 3). The fold increase of tv was greater than that of ti in the IRs (p-value < 0.05), which suggested that the higher ti/tv in tRNA is due to lower tv occurrence in tRNA than in IRs. This observation supports the notion that the impact of tv on tRNA secondary structure is higher than that of ti (Fig. 2).
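
The fold-increase comparison amounts to two divisions. The Python sketch below uses hypothetical frequencies chosen purely for illustration; they are not values from Table 1 or Supplementary Table 3, and the function name `fold_increase` is likewise an assumption.

```python
def fold_increase(ir_freq, trna_freq):
    """Fold increase of a substitution frequency in IRs relative to tRNA genes."""
    return ir_freq / trna_freq

# Hypothetical frequencies, for illustration only.
ir = {"ti": 0.040, "tv": 0.030}
trna = {"ti": 0.004, "tv": 0.002}

fold_ti = fold_increase(ir["ti"], trna["ti"])
fold_tv = fold_increase(ir["tv"], trna["tv"])
ratio_trna = trna["ti"] / trna["tv"]
ratio_ir = ir["ti"] / ir["tv"]

print(f"fold increase: ti = {fold_ti:.1f}, tv = {fold_tv:.1f}")
print(f"ti/tv: tRNA = {ratio_trna:.2f}, IR = {ratio_ir:.2f}")
```

By construction, fold_tv / fold_ti equals the ti/tv value in tRNA genes divided by the ti/tv value in IRs, so a larger fold increase for tv goes together with a higher ti/tv in tRNA genes.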

Higher Transition to Transversion Ratio at the Stem Regions than the Loop Regions Within tRNA Genes

Transfer RNA genes have well-defined secondary structures: the double-stranded helical stem regions constitute ~ 2/3 and the single-stranded loop regions constitute ~ 1/3 of a tRNA gene. We analyzed substitutions separately in stem and loop regions. The substitution frequencies in both loop and stem regions were observed to be significantly lower than that in the IRs. This difference indicated that both regions are under strong purifying selection, as mentioned above. It is known that the stem and the loop regions are functionally important and make contacts with translation factors.

To compare the stem and the loop regions with regard to ti and tv, we separated substitutions in the stem regions into compensatory and non-compensatory substitutions (Table 1). Non-compensatory substitutions were evidently more numerous than compensatory substitutions (Supplementary Fig. 2). It is pertinent to note that previous studies compared tRNA genes across species, where substitutions were often observed to be compensatory, unlike the values observed here. It may be noted that compensatory substitutions are relatively older in evolution than non-compensatory substitutions, considering the higher stability of the former. We used the ti/tv values of non-compensatory substitutions in the stem region for comparison with the substitutions in the loop regions. The ti/tv values in the stem region were observed to be significantly higher than those in the loop regions across the bacteria (p-value < 0.01) (Fig. 4). This suggested that tv is relatively more frequent in the loop region than in the stem region. This observation was in concordance with our hypothesis that tv in stem regions is more deleterious than ti.

Fig. 4

Ratio of ti to tv in loop and non-compensatory stem regions in tRNA of five bacteria. Histogram presenting the ratio of ti to tv (ti/tv) in loop and stem regions of tRNA genes. Only non-compensatory substitutions are considered in the stem region. The ti/tv values of the stem and loop regions of tRNA genes are significantly different (p-value < 0.05). The x-axis presents the five bacteria Escherichia coli (Ec), Klebsiella pneumoniae (Kp), Salmonella enterica (Se), Staphylococcus aureus (Sa) and Streptococcus pneumoniae (Sp). To avoid a division-by-zero error, the ti and tv values are each incremented by 1 in the case of Kp
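
The increment applied to Kp is a simple pseudocount. A minimal Python sketch, with a hypothetical function name and made-up counts, is:

```python
def ti_tv_ratio(ti, tv, pseudocount=0):
    """Return (ti + pseudocount) / (tv + pseudocount).

    With pseudocount=1 the ratio stays defined when no transversion is observed.
    """
    return (ti + pseudocount) / (tv + pseudocount)

# Made-up counts: 5 transitions and no transversions.
print(ti_tv_ratio(5, 0, pseudocount=1))  # 6.0
```

Adding the same pseudocount to both counts pulls the ratio towards 1, so values computed this way are not strictly comparable with ratios computed without it.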

Biased Transition Substitution Towards keto Bases in the Stem Region of tRNA Genes

In tRNA stem regions, G:U pairing is accepted more favorably than A:C pairing. Therefore, non-compensatory substitutions from amino bases (A/C) to keto bases (G/T), which facilitate G:U pairing, are likely to be favorable in the stem region. In contrast, the reverse transitions, that is, non-compensatory substitutions from keto bases (G/T) to amino bases (A/C), which facilitate A:C pairing, are likely to be less preferred in the stem region. These two pairings are described in Supplementary Fig. 3, and their stability in terms of interaction energy, calculated using GAUSSIAN 09 software (Frisch et al. 2009), is presented in Supplementary Table 4. This encouraged us to compare the stem and the loop regions further in terms of substitutions from amino bases to keto bases and vice versa. In stem regions, substitutions from amino (A/C) to keto (G/T) bases were significantly more frequent than those from keto to amino bases (p-value < 0.05) (Fig. 5). This excess of amino to keto substitutions was consistently observed in all five bacteria studied. However, in loop regions, no such significant pattern was observed for substitutions from keto to amino or amino to keto bases (Fig. 6). This provided additional support for the view that secondary structure significantly influences the different substitutions in tRNA genes. Further, comparison of keto to amino and amino to keto transitions in IRs indicated no significant difference between these two transition substitutions (p > 0.01) (Supplementary Fig. 4).
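
Every transition changes the amino/keto class of the base, so the four transitions split cleanly into two directions. The Python sketch below makes this explicit; the names are illustrative and the list of observed substitutions is made up.

```python
from collections import Counter

AMINO = {"A", "C"}
KETO = {"G", "T"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def transition_direction(ref, alt):
    """Return 'amino->keto' or 'keto->amino' for a transition, None for a transversion."""
    if (ref, alt) not in TRANSITIONS:
        return None
    return "amino->keto" if ref in AMINO and alt in KETO else "keto->amino"

# Made-up list of stem substitutions, for illustration only.
observed = [("A", "G"), ("C", "T"), ("C", "T"), ("G", "A"), ("A", "C")]
tally = Counter(d for d in (transition_direction(r, a) for r, a in observed) if d)
print(dict(tally))
```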

Fig. 5

Amino (A/C) → Keto (G/T), Keto (G/T) → Amino (A/C) polymorphism frequencies in stem region of tRNA genes. Histogram presenting the frequency values of single nucleotide polymorphism in stem regions in tRNA genes. Only non-compensatory substitutions are considered in stem region. The normalized values between Amino → Keto, Keto → Amino polymorphism frequencies in stem regions of tRNA genes are significantly different (p-value < 0.05). The x-axis presents the five bacteria Escherichia coli (Ec), Klebsiella pneumoniae (Kp), Salmonella enterica (Se), Staphylococcus aureus (Sa) and Streptococcus pneumoniae (Sp)

Fig. 6

Amino (A/C) → Keto (G/T), Keto (G/T) → Amino (A/C) polymorphism frequencies in loop region of tRNA genes. Histogram presenting the frequency values of single nucleotide polymorphism in loop regions in tRNA genes. The normalized values between Amino (A/C) → Keto (G/T), Keto → Amino polymorphism frequencies in loop regions of tRNA genes are not significantly different (p-value > 0.05). Further there is no consistent pattern here across the bacteria, unlike the stem region. The x-axis presents the five bacteria Escherichia coli (Ec), Klebsiella pneumoniae (Kp), Salmonella enterica (Se), Staphylococcus aureus (Sa) and Streptococcus pneumoniae (Sp)

Isoacceptor tRNAs have different anticodons but are charged with the same amino acid by the same aminoacyl-tRNA synthetase. We therefore compared substitution frequencies among the isoacceptor tRNA genes. In the majority of cases, the number of substitutions per tRNA gene was 0, 1 or 2 (Supplementary Table 5). Therefore, an elaborate comparison of substitutions among these isoacceptor tRNA genes was not possible. However, there were a few cases in which the number of substitutions differed noticeably among isoacceptor tRNA genes, as follows. In Se, there are five Ser tRNA genes, of which one with the CGA anticodon had 9 substitutions, one with the GCT anticodon had 1 substitution, while two genes with the GGA anticodon and one gene with the TGA anticodon had 0 substitutions. In Sa, isoacceptor tRNA genes for Ser as well as Gly were observed to have different numbers of substitutions. There are five Ser tRNA genes, of which one with the GCT anticodon had no substitutions while one with the GGA anticodon had 9 substitutions. There are seven Gly tRNA genes, of which two with the GCC anticodon had no substitutions but five with the TCC anticodon had 0 to 9 substitutions. In Sp, there are three Lys tRNA genes, one with the CTT anticodon and two with the TTT anticodon. Interestingly, the tRNA gene with the CTT anticodon had 10 substitutions, while the tRNA genes with the TTT anticodon had none. Future studies may help to elucidate these differences among the isoacceptor tRNA genes.

Discussion

Secondary structure in transcripts is important for their function, and intra-strand base pairing is important for their stability. Transfer RNA genes are known to have well-defined secondary structures, unlike IRs. Though it is known in the literature that ti frequency is higher than tv, the role of RNA secondary structure in this bias has not been explored adequately at the species level. Our aim was to study ti and tv in tRNA genes and to compare these substitutions between loop and stem regions. We observed that, in comparison with IRs, tv frequency is proportionately lower than that of ti in tRNA genes. This observation is in concordance with the assumption that regions with secondary structure are likely to have low tv frequency. Further, we compared ti and tv between the loop and the stem regions. In stem regions, ti were proportionately higher than tv when compared with the loop region. This is in concordance with the assumption made in this study that tv is more deleterious than ti in the stem regions. It is known that G:U is a more stable pair than A:C in tRNA stems. Therefore, a ti substitution from an amino base (A/C) to a keto base (G/T), which results in stable G:U pairing, is likely to be preferred over a ti substitution from a keto base (G/T) to an amino base (A/C), which results in unstable A:C pairing. In concordance with this hypothesis, in the tRNA stem region the transition from amino base (A/C) to keto base (G/T) is observed to be significantly higher than that in the loop region. This further supports the notion that secondary structure in tRNA influences base substitutions. It may be postulated that tv and ti mutations are fixed or selected according to evolutionary constraints associated with maintenance of tRNA secondary structure.

Previous researchers studied ti and tv in stem and loop regions of tRNA genes by comparing sequences across species. They observed more compensatory than non-compensatory substitutions in the tRNA stem region. They attributed the high ti/tv values in tRNA stem regions to the low frequency of compensatory tv in comparison to compensatory ti. In the present work, we analyzed ti and tv in tRNA genes by comparing sequences within a species, and consequently observed more non-compensatory than compensatory substitutions. It is likely that the non-compensatory substitutions are more recent in evolution than the compensatory substitutions. Among these substitutions, those with a strong deleterious impact on tRNA structure are removed from the population faster than those with a moderate impact. It is assumed that tv are likely to have a more deleterious impact on the secondary structure than ti. Accordingly, we observed that tv frequency relative to ti is significantly lower in the stem region than in the loop regions. Further, keto to amino ti substitutions result in A:C pairing in the stem, which destabilizes the secondary structure more than amino to keto ti substitutions, which result in G:U pairing. Therefore, keto to amino ti should be under stronger purifying selection than amino to keto ti. Though G:U pairing has been reported to be present in tRNA stem regions (Holley et al. 1965; Hou and Schimmel 1988; McClain and Foss 1988), a recent report states that G:U pairing in tRNA genes has a deleterious impact on mouse survivability (Ishimura et al. 2014). In concordance with the expected difference in selection, amino to keto transitions were observed to be more frequent than keto to amino transitions in stem regions. However, no such pattern was observed in the loop region.
Unlike the stem region, the loop region in tRNA undergoes several post-transcriptional modifications that are important for its stability as well as for various functions such as recognition by aminoacyl-tRNA synthetase, binding to translation factors and the ribosome, and codon-anticodon recognition (decoding the codons) (Agris 2008; Agris et al. 2017; Pereira et al. 2018). Therefore, base substitutions arising at modification sites are likely to be selected against in the loop regions. This might be the reason for the low substitution rates in the loop region. In future studies, such sequence alignments might be helpful for identifying residues in the loop region that undergo post-transcriptional modifications. However, the influence of secondary structure on different base substitutions is evident from this analysis. Future research may explore the possible implications of compensatory as well as non-compensatory substitutions in tRNA genes for phylogenetic studies.

We believe our findings on mutations in stem vis-à-vis loop regions of tRNA genes might be extended to understand regions in mRNA that are potentially involved in secondary structure formation and regulation of gene expression. Secondary structure in mRNA is known to be important in rho-independent (Abe and Aiba 1996) and rho-dependent (Kriner and Groisman 2017) transcription termination, and in pausing of the ribosome (Bao et al. 2020) during translation and protein folding. It is also known that secondary structure near the ribosome binding site or Shine-Dalgarno sequence plays an important role in translation regulation (Hall et al. 1982). The role of RNA secondary structure in riboswitches is well documented in the literature (Grundy and Henkin 2006; Montange and Batey 2008). Further, there are regions in other RNAs, such as miRNA (Maiti et al. 2010) and introns in eukaryotes (Libri et al. 1995), that are involved in secondary structure formation. The difference between amino to keto and keto to amino transitions in the stem regions of tRNA might be useful for identifying potential stem regions in a transcript. In the stem regions of secondary structures, amino to keto transitions are expected to be more frequent than keto to amino transitions. Exploring this in future might help to understand the contribution of RNA secondary structure to gene regulation. Further, it has been reported that the GC content of the paired stem regions of 16S rRNA genes correlates positively with the optimal growth temperature of bacteria and archaea (Wang et al. 2006). Therefore, it might be anticipated that the difference between the amino to keto and keto to amino transitions will be greater in thermophiles than in mesophiles or psychrophiles. It will be interesting to study this in bacteria in future.