Introduction

Several organisms have been isolated within the protist Acanthamoeba polyphaga. The giant virus Acanthamoeba polyphaga Mimivirus (APMV) of the genus Mimivirus, family Mimiviridae (La Scola et al. 2003), is a virus that has shaken the ideas about the tree of life (Raoult and Forterre 2008; Forterre 2010). It has a huge size for a virus and a giant genome encoding tRNAs, repair proteins, translation-related proteins and, for the first time in a virus, aminoacyl-tRNA synthetases (Legendre et al. 2011), which by definition were thought to exist only in cellular organisms and not in viral genomes (Raoult et al. 2004). APMV is in turn infected by the virophage mimivirus-dependent virus Sputnik (La Scola et al. 2008), genus Sputnikvirus, family Lavidaviridae. Other giant viruses able to infect A. polyphaga are Megavirus chilensis (Arslan et al. 2011), Megavirus courdo7 (Desnues et al. 2012), Mimivirus terra2 (Filée 2015) and Acanthamoeba polyphaga Moumouvirus (Yoosuf et al. 2012), members of the three lineages belonging to the Mimivirus genus. Another non-mimivirus virus isolated in this amoeba is the Marseillevirus marseillevirus (Boyer et al. 2009) of the Marseillevirus genus, Marseilleviridae family. Regarding cellular entoorganisms, the amoeba-resistant bacteria (ARB) category has several cases isolated from Acanthamoeba species (Greub and Raoult 2004), such as the lytic ARB Francisella tularensis (Berdal et al. 1996). Recently, an intracellular bacterium, Candidatus Babela massiliensis, was isolated in A. polyphaga (Pagnier et al. 2015).

Here we propose the prefix ento (from the Greek ἐντός, inside), denoting an organism, whether cellular or viral, inhabiting and coevolving within the membrane boundaries of a cellular host. The coexistence of these very organisms within a single cell makes the inner parts of this protist an unexpected new environment for studying evolutionary processes (Moliner et al. 2010). This set of entoorganisms (Table 1; see Methods) raised the hypothesis of a common feature lying at the genomic level.

Table 1 List of genomes and general characteristics of the amoeba, amoebal mitochondrion, amoebal entoorganisms and negative controls studied in this work

Classifications of contemporary methods for comparative genomics form two groups: parametric and phylogenetic methodologies (Ravenhall et al. 2015). Since there is no homology between all the entoorganisms of this study, phylogenetic methodologies may not be effective. Parametric methods for sequence analyses search for characteristic patterns of a particular clade and can unveil a genomic signature (GS) that reflects a “total net response to selective pressure” (Karlin and Burge 1995; Abe et al. 2003). Oligonucleotide distributions and codon usage (CU) profiles are well-known and accepted GS methodologies (Burge et al. 1992; van Passel et al. 2006). Regardless of the sequence length and region selected, composition biases are detected. This phenomenon is called pervasivity (Deschavanne et al. 2000; Jernigan and Baran 2002). This pervasivity is constant in a species genome and differs between related species (Gentles and Karlin 2001; Lerat et al. 2002).

Several GS analyses have been performed on plasmids, phages and viruses (Blaisdell et al. 1996; Campbell et al. 1999; Robins et al. 2005; Pride et al. 2006; Mrázek and Karlin 2007; Suzuki et al. 2008). An advantage of using GS instead of traditional phylogenetic methods is that results will not vary regarding the set of sequences utilized (Campbell et al. 1999). Another advantage is that the use of GS allows comparisons regarding a lack of common ancestor, independence of base composition, coding or noncoding regions, making comparison of viral and cellular organisms possible.

In this article, we put forward the hypothesis that the entoorganisms and inner organisms (amoebal in this case) coevolve and share a common genomic pattern, and we discuss possible explanations for such evolutionary phenomena.

Materials and Methods

Sequences

For this study, we used scaffolds of the host Acanthamoeba polyphaga’s genome (Apss), A. polyphaga’s mitochondrial genome (Apmt), genomes of the viruses Megavirus chilensis (Mch), Megavirus courdo7 (Mc7), Mimivirus terra2 (Mt2), Acanthamoeba polyphaga Mimivirus (APMV) and Acanthamoeba polyphaga Moumouvirus (APMoV), the virophage mimivirus-dependent virus Sputnik (Spu) and Marseillevirus marseillevirus (Mma), and genomes of the cellular organisms Candidatus Babela massiliensis (Bab) and Francisella tularensis (Ftu). As negative controls, we used the genomes of the bacterium Deinococcus radiodurans (Dra) and the human immunodeficiency virus 1 (HIV) (Table 1). All sequences were downloaded from the NCBI GenBank (Benson et al. 2017) and the Viral Genome Resource (Brister et al. 2015).

Oligonucleotide Frequencies

Oligonucleotide relative frequencies (OnRF), namely dinucleotide relative frequencies (DiRF), trinucleotide relative frequencies (TriRF) and tetranucleotide relative frequencies (TetRF), were obtained with an algorithm written in our group using Python (Rossum et al. 2010), which counted the occurrences of each oligonucleotide and returned its relative frequency.
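The counting step described above can be sketched as follows. This is a minimal illustration, not the script used in this work: the function name `oligo_relative_freqs`, the use of overlapping windows on a single strand and the skipping of windows that contain ambiguous bases are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def oligo_relative_freqs(seq, k):
    """Return {k-mer: relative frequency} from overlapping windows of one strand.

    Assumption: windows containing anything other than A, C, G or T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    # Report all 4**k possible k-mers so profiles of different genomes align.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }
```

Calling the function with k = 2, 3 or 4 would give profiles comparable to DiRF, TriRF and TetRF, respectively.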

Relative Codon Usage (RCU)

The relative codon usage test was described by Sharp and Li (1987) to examine codon usage without the confounding influence of the amino acid composition of different gene products. Here we implemented it with the modification of including methionine, tryptophan and stop codons, originally not considered by Sharp and Li because of the lack of synonymous codons. Calculations were done using the same script written in Python 3.5 to obtain the codon count of each CDS and then calculate the relative frequency among the synonymous codons. The heatmap was plotted using the PAST software v3.15 (Hammer et al. 2001).
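As a hedged sketch of this calculation, the block below computes, for one CDS, the frequency of each codon relative to the total count of its synonymous family under the standard genetic code. The names `CODON_TABLE`, `SYNONYMS` and `relative_codon_usage` are hypothetical, and the normalization (a fraction within each family, giving 1 or 0 for single-codon families such as Met and Trp) is an assumption inferred from the description; it is not the rescaled RSCU value of Sharp and Li.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Group codons into synonymous families; '*' groups the three stop codons.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def relative_codon_usage(cds):
    """Return {codon: count / total count of its synonymous family} for one CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    rcu = {}
    for codons in SYNONYMS.values():
        family_total = sum(counts[c] for c in codons)
        for c in codons:
            # Families absent from this CDS are reported as 0.0 (assumption).
            rcu[c] = counts[c] / family_total if family_total else 0.0
    return rcu
```

For example, a CDS reading ATG AAA AAA AAG TAA gives 2/3 for AAA, 1/3 for AAG, and 1 for both ATG and TAA.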

Genomic Landscape at the Codon Usage (GLCU)

The genomic landscape at the codon usage (GLCU) was obtained by calculating the average RCU frequencies of each codon on every CDS of a genome. The codon count was obtained in the same step as the RCU, and the calculations were performed in the same script written in Python 3.5. As in the RCU, stop codons were retrieved from the CDS as well.
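The averaging step can be sketched as below, taking one RCU profile (a codon-to-frequency mapping) per CDS as input. The function name `genomic_landscape` is hypothetical, and treating a codon that is missing from a CDS profile as 0 in the average is an assumption of this example.

```python
def genomic_landscape(rcu_per_cds):
    """Average each codon's RCU value over all CDS profiles of one genome.

    rcu_per_cds: list of {codon: relative frequency} dicts, one per CDS.
    """
    n = len(rcu_per_cds)
    if n == 0:
        raise ValueError("at least one CDS profile is required")
    codons = sorted(set().union(*rcu_per_cds))
    # A codon missing from a profile contributes 0.0 to the mean (assumption).
    return {c: sum(r.get(c, 0.0) for r in rcu_per_cds) / n for c in codons}
```

The result is a single row of 64 values per genome, which is the form compared between genomes in the correlation analyses.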

Correlation Analyses

Pearson’s correlation coefficients were calculated between the dinucleotide, trinucleotide and tetranucleotide frequencies of each pair of genomes. They were also calculated between the GLCU of each pair of genomes, including stop codons, using the PAST v3.15 software with ‘Linear r (Pearson)’ as the correlation statistic parameter and ‘Statistic\p(uncorr)’ as the table format parameter.
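For readers who wish to reproduce the statistic outside PAST, a minimal sketch of Pearson's r between two profiles is given below. The function names `pearson_r` and `profile_correlation` are hypothetical, and the profiles are assumed to be mappings from oligonucleotide or codon to frequency; the p values reported by PAST are not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson's linear correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def profile_correlation(p, q):
    """Correlate two {key: frequency} profiles over their shared keys."""
    keys = sorted(p.keys() & q.keys())
    return pearson_r([p[k] for k in keys], [q[k] for k in keys])
```

Applied to two DiRF profiles, the 16 paired dinucleotide frequencies are the observations; for GLCU, the 64 paired codon values play that role.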

Results

Dinucleotide Relative Frequencies

Since oligonucleotide frequencies are influenced by GC content, and GC content is usually related to the environment (Karlin 1998; Foerstner et al. 2005), a common genomic pattern may be elucidated from A. polyphaga’s entoorganisms.

Dinucleotide relative frequencies (DiRF) of the Acanthamoeba polyphaga mitochondrial genome, all Mimivirus and Sputnik genomes (i.e., the entovirals), the endobacterium Candidatus B. massiliensis and the ARB Francisella tularensis showed strikingly similar profiles, as depicted in Fig. 1. These entoorganism profiles behave very similarly given any dinucleotide combination. Other ARBs (e.g., Parachlamydia acanthamoebae) were used as well, rendering very similar results to Ftu in every test; these are not shown because of redundancy. On the other hand, the Acanthamoeba polyphaga scaffolds, Marseillevirus marseillevirus and the negative controls Deinococcus radiodurans and HIV show clearly different patterns in their dinucleotide profiles.

Fig. 1

Genomic signatures from dinucleotide relative frequencies. Dinucleotide distribution values are sorted in descending order. Black lines correspond to Acanthamoeba polyphaga’s genomic scaffolds (Ap) and mitochondrial genome (Apmt). Red lines correspond to viral genomes Acanthamoeba polyphaga Mimivirus (APMV), Acanthamoeba polyphaga Moumouvirus (APMoV), Megavirus chilensis (Mch), Megavirus courdo7 (Mc7), Mimivirus terra2 (Mt2) and Marseillevirus marseillevirus (Mma). Orange line is the Sputnik (Spu) virophage. Blue lines correspond to bacteria isolated from A. polyphaga: Candidatus Babela massiliensis (Bab) and Francisella tularensis (Ftu). Green lines correspond to negative controls: as cellular, Deinococcus radiodurans (Dra); as viral negative control, human immunodeficiency virus 1 (HIV)

Broadly, there are three groups of DiRF. First is the high group, comprising AA, TT, AT and TA, with a range of 12–18% DiRF each. Second is the medium group, comprising TG, CA, GA, TC, AG, CT and AC, with a range of 4–6% DiRF. Third is the low group, made up of GC, CC, GG and CG, with DiRF values < 4% each. The profiles of Apmt, APMV, APMoV, Mch, Mc7, Mt2, Spu, Bab and Ftu show very similar distributions, within a range of 11–18% in the high-value group, as described above. The medium and low groups show even less variance.

In closer analyses, the mitochondrial genome of A. polyphaga showed the single highest DiRF, on TT, with almost 18%; the rest of this organelle's profile is highly correlated with those of the other entoorganisms. The DiRF of the A. polyphaga scaffolds shows a pattern different from those of its mitochondrion and the entoorganisms.

Candidatus B. massiliensis and Francisella tularensis show distributions resembling the entoviral and mitochondrial DiRF patterns. Unexpectedly, with Mma—a non-Mimivirus—we observed a different GS profile, similar to the negative controls. As negative controls, Deinococcus radiodurans and human immunodeficiency virus 1 were selected. Their genomic lengths, GC content and biology render completely different profiles, as expected.

Overall Pearson correlation tests \(r\) were performed on DiRF, TriRF and TetRF, namely \({r}^{di}\), \({r}^{tri}\) and \({r}^{tet}\), respectively. Regardless of the OnRF, high correlation values \(r\ge 0.89\) were detected in pairwise viral comparisons as well as with Candidatus B. massiliensis (\(r\ge 0.95\)) and Francisella tularensis (\(r\ge 0.92\)). Similar correlations \(r\ge 0.89\) were detected between the mitochondria and entoamoebal organisms. A subtle decay in correlation values was detected with increasing OnRF complexity, TriRF and TetRF, as shown in Tables 2 and 3.

Table 2 Pearson correlation analyses of the DiRF, TriRF and TetRF of entoorganisms
Table 3 Pearson correlation analyses of GLCU between genomes

High correlation values were detected among entoviruses at \({r}^{di}\ge 0.99\), \({r}^{tri}\ge 0.98\) and \({r}^{tet}\ge 0.85\). APMV pairwise correlations were the highest detected, for example, APMV and Mt2 values \({r}^{di}=1\), \({r}^{tri}=1\) and \({r}^{tet}=1\). Also, high correlations were detected between APMV and Spu \({r}^{di}=0.98\), \({r}^{tri}=0.96\) and \({r}^{tet}=0.95\).

The virophage Sputnik showed lower correlation with the mitochondria (\({r}^{di}=0.89\), \({r}^{tri}=0.86\)) and the highest with APMoV (\({r}^{di}=0.98\), \({r}^{tri}=0.96\) and \({r}^{tet}=0.95\)). In the case of Bab, \({r}^{di}\) values are always close to 0.97 with every entoorganism, except for Apmt and APMV. For the ARB Ftu, the highest correlations are with Bab (\({r}^{di}=0.98\), \({r}^{tri}=0.97\) and \({r}^{tet}=0.96\)) and lower with Apmt (\({r}^{di}=0.92\), \({r}^{tri}=0.9\)) and APMV (\({r}^{tet}=0.87\)). For the mitochondria, the highest values are with Bab: \({r}^{di}=0.95\) and \({r}^{tri}=0.93\).

For the negative cellular control Dra, regardless of OnRF combination, every pairwise comparison resulted in negative values except at pairing with Apss. For the viral negative control HIV, values are near \(r\le 0.5\) and negatives at pairing with Apss.

An interesting case is the Mma, showing values of \({r}^{di}\le 0.52\), \({r}^{tri}\le 0.42\) and \({r}^{tet}\le 0.39\). Its higher values are constant with Spu.

Relative Codon Usage (RCU)

The RCU test on all CDSs unveiled the GSs at the codon level in every genome. The RCU of all entoorganisms showed a high preference for codons ending in A or T (darker halves of the Fig. 2 profiles) and low or no preference for codons ending in C or G (lighter halves of the Fig. 2 profiles). A bias was expected because of the low GC content (see Table 1); what was not expected was a common expression at the third position of the codons. Non-entoorganisms, namely Mma and the negative controls Dra and HIV, do not have this RCU pattern. Organisms with > 1000 CDSs (Table 1) and random samples of 1000 CDSs of the given genome were used.

Fig. 2

Genomic signature at relative codon usage (RCU). Each CDS of the given genome is displayed as a row. Codons correspond to columns and are sorted by the third position. Mt2 is not included because of its partial genome. Apss is not included because of the lack of available CDS information. RCUs were calculated according to the standard codon table. Stop codons (*) are included. Methionine (M) and tryptophan (W) codons are depicted as well. The legend for the amino acid translation is based on the standard genetic code and serves as a global reference

Broadly, neat strips of preference are common for entoorganisms, namely the Apmt, entoviruses and entobacteria. Analyzing the A/T-ending high-frequency codons, AAA (lysine), CAA (glutamine), GAA (glutamic acid), TAA (Ochre), AAT (asparagine), CAT (histidine), GAT (aspartic acid), TAT (tyrosine), TGT (cysteine) and TTT (phenylalanine) are the most frequent common entoorganism codons. The A/T-ending low-frequency codons also form a neat common pattern for entoorganisms, e.g., CGA (arginine), CTA (leucine), TGA (stop), AGT (serine), CGT (arginine), CTT (leucine) and TCT (serine).

Among the C/G-ending codons of entoorganisms, all frequencies are close to zero, but ATG (methionine), TAG (Amber) and TGG (tryptophan) show a presence-absence behavior because of the lack of synonymous codons.

Ftu presents a slightly reduced preference for GGA (glycine) with respect to all other entoorganisms. Ftu and Bab show higher preferences for GGC (glycine), TGC (cysteine), AAG (lysine) and GAG (glutamic acid) than the entoviruses and the mitochondrion.

Mma also shows a preference for the most frequent codons used by entoorganisms, but with lower values. The viral negative control, HIV, among its more homogeneous RCUs, shows some common preferences to entoorganisms, e.g., CAA (glutamine), TAT (tyrosine) and TGT (cysteine). Interestingly, CGA is rather low in all. On the other hand, AAA (lysine) and GAA (glutamic acid) are universally preferred.

Genomic Landscape at Codon Usage

Calculating the RCU bias per each CDS of every genome leads us to construct a new picture of codonic genomic values for a faster and condensed overall visualization and comparison of entoorganisms, the genomic landscape at codon usage (GLCU). Codons were sorted by the third position as well, and the common codon preference pattern is shown in Fig. 3, confirming Fig. 2's results.
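The column ordering and the A/T-ending versus C/G-ending split discussed above can be sketched as follows. Both helper names are hypothetical, the base order used for sorting (A, T, C, G) is an assumption and may differ from the order in the figures, and `at_ending_share` is only an illustrative summary, not a statistic reported in this work.

```python
from itertools import product


def codons_by_third_position(order="ATCG"):
    """List all 64 codons sorted by third base (in `order`), then alphabetically."""
    codons = ["".join(p) for p in product("ACGT", repeat=3)]
    return sorted(codons, key=lambda c: (order.index(c[2]), c))


def at_ending_share(profile):
    """Fraction of a profile's total weight carried by codons ending in A or T."""
    total = sum(profile.values())
    at_weight = sum(v for codon, v in profile.items() if codon[2] in "AT")
    return at_weight / total if total else 0.0
```

With the default order, the first 32 columns are the A/T-ending codons and the last 32 are the C/G-ending ones, which mirrors the two halves of the heatmaps.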

Fig. 3

Genomic signature at the genomic landscape at codon usage (GLCU). Rows are the average of codon frequencies of the given genome; columns represent codons. They are sorted in the same way as RCU for easy comparison. Stop codons (*), methionine and tryptophan codons are also included. The legend for the amino acid translation is based on the standard genetic code and serves as a global reference

This test shows again a general preference for A/T-ending synonymous codons in all entoorganisms. There are shared preferences of codon usage such as AAA (Lys), CAA (Gln), GAA (Glu), AAT (Asn), GAT (Asp) and TTT (Phe) and slightly lower ones such as TAA (Ochre), GAT (Asp) and TGT (Cys). The preferred C/G-ending codons are ATG (Met) and TGG (Trp) because of their uniqueness.

A second group of prevalence codons comprises those with a frequency of < 40%, namely GTT and GTA (both Val); TCT, TCA and AGT (all Ser); CCA and CCT (Pro); ACT and ACA (both Thr); GCT and GCA (both Ala); GGA and GGT (both Gly).

There are overall differences at stop codon frequencies. Despite TAA (Ochre) being the most frequent among entoorganisms, TGA (Opal) is the second most preferred, with TAG (Amber) almost avoided.

Mma shows a homogeneous distribution in preferences regarding the entoorganisms. The negative cellular control Dra is prone to G/C-ending codons. The negative viral control HIV shows a bias for A-ending codons.

Pearson correlation analyses were performed for the GLCU to compare the patterns found for each genome. Very high correlation values are detected among the entoviral genomes, closely followed by Sputnik and the entobacteria. Bab had the highest correlation with Moumouvirus (\(r=0.975\)), but correlated strongly with the other entoviruses as well as with Ftu (\(r=0.97\)) and Apmt (\(r=0.93\)). The mitochondria have the least correlated codon usage with viruses (ranging from \(r=0.907\) to \(r=0.92\)). Ftu correlation values range from \(r=0.94\) pairing with Apmt to \(r=0.97\) pairing with Bab.

As expected, correlation values of the cellular negative control Dra were nonsignificant compared with every other genome. Values were always \(r\le 0.46\). Interesting results are the pairwise comparisons of Mma and HIV with every entoorganism. As depicted in Figs. 2 and 3, HIV and Mma showed a different pattern regarding entoorganisms. In this analysis they do as well, but HIV shows higher correlation values than Mma, though not significantly regarding entoorganisms.

Discussion

Karlin and Burge (1995) defined the genomic signature as the “total net response to selective pressure.” Several factors interact to maintain the constant and coherent uniqueness of a GS: restriction avoidance (McDowall et al. 1994), core processes such as replication, recombination and repair of DNA (Moran 2002), physical constraints such as the DNA structure stacking energy (Sinden 1994) and DNA curvature (Kozobay-Avraham et al. 2006; Mrázek 2009). Mutational processes include methylation, short oligonucleotide modifications and context-dependent mutation biases (Karlin 1996). There are also environmental factors such as energy sources and temperature (Kirzhner et al. 2007), γ-radiation damage and osmolarity gradients (Prabha and Singh 2014). Even habitats and lifestyles exert selective pressure on maintaining a GS (Foerstner et al. 2005; Xia et al. 2002).

Phylogenetic-oriented genomic comparisons between A. polyphaga, the A. polyphaga mitochondrion and the entoorganisms presented here would not be effective because of their different evolutionary origins (Blaisdell et al. 1996; Serrano-Solís et al. 2016). Analyses for GS detection have advantages: they do not depend on homolog alignment, use the whole genome, present small variances and are unaffected by mutations such as rearrangements (Karlin 1998). In this study, we demonstrate an ecologic GS that transcends the species-specificity hallmark of a GS among the A. polyphaga entoorganisms.

Detection of the GS is based on a pattern such as oligonucleotide frequencies being maintained pervasively on a given species genome and accumulating variations as the phylogenetic distance increases. Dinucleotide relative abundance is a demonstrated GS with pervasive characteristics (Karlin and Cardon 1994; Karlin et al. 1994; Prasaot and Vemuri 2007; Prabha and Singh 2014).

The dinucleotide genomic patterns studied here are clearly discernable by the dinucleotide relative abundance as depicted in Fig. 1. For entoorganisms, the cellular and viral profiles show striking similarity, while the negative controls D. radiodurans and HIV are clearly differentiated.

A special case is the Marseillevirus, whose genome is hypothesized to derive from several sources because of HGT, nonetheless grouping phylogenetically with APMV (Boyer et al. 2009). While a common pattern is clear for the Mimivirus lineage, the virophage included, Mma shows a homogeneous codon usage that, in terms of the GS, is clearly different from that of the entoorganisms. An explanation for this may be a recent host range addition into A. polyphaga, with not enough time having passed for the entoorganism line of adaptation (the GS) to be adopted. It is pertinent to remember that the entoamoebales studied here, like Mma, also replicate in Acanthamoeba castellanii.

The A-T dinucleotide combinations comprise the high-frequency group, an intuitive bias given the high A-T content of these genomes. This implies a grade of compromise because the high A-T content may increase improper binding of regulatory factors such as TATA boxes and polyadenylation signals (Nussinov 1987; Burge et al. 1992).

Correlation analyses statistically support dinucleotide profiles in the sense that all entoorganisms, whether cellular or viral, are very similar generally and particularly given any oligonucleotide combination. The viruses are the most correlated organisms, followed by entobacteria and mitochondria. This higher viral correlation value may mean a longer coevolutive process, even for the virophage case.

It is worth mentioning that the entobacteria Bab is more correlated to viruses than to the mtDNA in DiRF, TriRF or TetRF comparisons. Candidatus Babela massiliensis is an obligatory intracellular bacterium that interestingly shows common adaptations to NCLDV such as the ankyrin repeats implicated at virus-host interactions (Pagnier et al. 2015). Therefore, the comparison of the entoviral results with the mitochondria and this bacterium is necessary for evaluating the common adaptation of all these entoorganisms to the amoebal host.

It has been reported that similar dinucleotide relative abundance profiles could reflect the similarity of the enzymes engaging in a replication process (Frick and Richardson 2001). As speculation, replication processes of these entoorganisms might be performed with the same replication machinery, either that of the amoeba or APMV, or a mixture of both. For APMVs, this would suggest a case where complex viral replication machinery (Raoult et al. 2004) might be recruited by bacteria for their normal genome replication without suffering a viral infection process, overtaking the hypothesis of the melting pot (Moliner et al. 2010). Our hypothesis would inevitably expand the host range definition into a new notion of the viral “accessory-host” range and a “core-host” range, both adding to a phenotypic complementation of a PAN-host range.

The codon usage patterns were clearly discernable as well. The results of CU preferences of entoorganisms were compared for detecting a GS. Codon bias is a direct consequence of dinucleotide bias (Kunec and Osterrieder 2016). CU is related to an efficiency increase in the translation speed (Plotkin and Kudla 2011; Kumar et al. 2016) and to a correlation of the tRNA repertoire (Sharp et al. 1986; Kumar et al. 2016; Duan and Antezana 2003).

Coincidental convergence would be the current scenario for the entoorganisms studied here because of crowding of the GS space (Mrázek 2009). However, this phenomenon is detected only at low order oligonucleotides such as dinucleotides and ruled out at the higher order ones (Mrázek 2009) such as trinucleotides and tetranucleotides.

Therefore, a possible explanation is through low DNA recombination and repairing activity, since reduced genomes have lost sensitive genes related to these pathways (Moran 2002; Bentley and Parkhill 2004). Their absence in virophage genomes and decreased function on APMV (Abergel et al. 2007; Silva et al. 2015) allows for mutations to accumulate. Experiments demonstrate that the most frequent random mutation occurring in cells is C to T (or G to A) because of the deamination of cytosine to form uracil, which is subsequently replicated as thymidine (Glass et al. 2000). Thus, in the absence of DNA repair, genomes tend to become more AT-rich, leading to amelioration (Paz et al. 2006). Naturally, the low GC content of Candidatus Babela massiliensis might occur through another process because of replication proteins coded into the bacterial genome (Pagnier et al. 2015).
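The drift toward AT richness argued above can be illustrated with a simple two-state mutation model, given here only as a sketch. The model, the function names and the rates in the usage note are hypothetical and are not estimates from this study: `u` stands for the per-step probability that a G/C position mutates to A/T (e.g., through cytosine deamination) and `v` for the reverse probability.

```python
def gc_trajectory(gc0, u, v, steps):
    """Iterate gc <- gc * (1 - u) + (1 - gc) * v and return all values.

    gc0: starting GC fraction; u: GC->AT rate; v: AT->GC rate (per step).
    """
    gc = gc0
    values = [gc]
    for _ in range(steps):
        gc = gc * (1 - u) + (1 - gc) * v
        values.append(gc)
    return values


def gc_equilibrium(u, v):
    """GC fraction at which gains and losses balance: v / (u + v)."""
    return v / (u + v)
```

Under this toy model, a genome starting at 50% GC with `u` twice as large as `v` would settle near one third GC, regardless of the absolute size of the rates; a stronger GC-to-AT bias lowers the equilibrium further.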

In conclusion, here we provide evidence of shared genomic signatures between A. polyphaga and its entoorganisms. It is not clear how all these organisms interact, but the presence of common GSs reveals a coevolutionary process with two probable scenarios: (1) multiple coincidental evolutionary convergences or (2) an adaptive process to selective pressures caused by the intracellular environment of the host. What seems clear is the current adaptation to the ecologic affinity and dynamics for this unique amoebal intracellular environment. Further work is needed to determine the actual mechanisms driving this coevolution.

The ability of A. polyphaga to resist harsh conditions, such as extreme temperatures, pH and osmolarity, suggests its usefulness as a safe harbor for pathogenic bacterial and viral vectors (Greub and Raoult 2004; Moliner et al. 2010; Khan and Siddiqui 2014) possibly facilitating lateral transfer events of virulence and resistance traits among concurrent entoorganisms. Understanding the amoebomics and entoecologies has the utmost animal and human biomedical importance. Further isolation and sequencing of new entoamoebal organisms, either transient or perennial, may reveal the broader hallmark of a probable wide genomic signature associated with each amoeba species.