Introduction

Bacteria belonging to the family Deinococcaceae have been isolated from diverse environments. Deinococcus radiodurans, a member of this family, is characterized by its unusually high resistance to several DNA-damaging agents, including radiation and desiccation. While the γ radiation D10 for humans is 0.005 kGy, it is 0.25 kGy for Escherichia coli and as high as 10 kGy for D. radiodurans. Ionizing radiation kills primarily by causing DNA double-strand breaks (DSBs), and resistance to ionizing radiation is seen in only a handful of organisms, notably the Deinococci, cyanobacteria like Chroococcidiopsis spp., and various fungi like Filobasidium (Slade and Radman 2011). Genome sequencing showed that D. radiodurans has a normal complement of repair proteins (White et al. 1999), and hence the question naturally arose as to how a repair complement shared with radiation-sensitive bacteria could repair the large number of DSBs created by exposing the cells to radiation. Molecular genetic experiments later showed that DSBs are repaired initially by a novel pathway, extended synthesis-dependent strand annealing (ESDSA), which involves non-reciprocal crossovers mediated primarily by RecJ, RecA, DNA pol I, and DNA pol III, followed by the RecFOR pathway of homologous recombination (Misra et al. 2006; Slade et al. 2009). ESDSA is similar to the synthesis-dependent strand annealing (SDSA) pathway seen in yeast, which itself can tolerate up to 0.8 kGy of γ radiation (Bennett et al. 2001). Furthermore, D. radiodurans RecA shares 61 % identity with E. coli RecA, but in contrast to E. coli RecA, which prefers to bind single-strand DNA, Deinococcal RecA preferentially binds double-strand DNA (Kim and Cox 2002). Thus, D. radiodurans has possibly evolved to overcome radiation damage by improving on the SDSA pathway and by evolving minor but functional changes in known repair proteins. In this context, Sghaier et al. (2008) reported that the basal DNA repair machinery, including DNA polymerases and DNA glycosylases, is under positive selection. Positive selection implies a tendency to gain mutations in a protein that confer newer functions and subsequently allow the organism to adapt better to an environment, whereas negative (purifying) selection means that the protein is under functional constraints and there is a tendency to conserve the amino acids that are involved in that particular function or in the maintenance of structure (Nielsen et al. 2005).

Recovery from radiation also involves many other aspects, such as membrane regeneration, protein recycling, regulated nucleolytic activity, and signaling mechanisms, which are poorly understood. Transcriptomics (Liu et al. 2003), proteomics (Tanaka et al. 1996), and biochemical studies have shown that many hypothetical proteins are present during the recovery from radiation damage. Almost 40 % of the annotated ORFs in Deinococcus code for “hypothetical proteins” or “putative proteins,” which by definition do not have any functionally characterized homologs (Siew and Fischer 2003). Earlier studies in other species have shown that the genes which code for hypothetical proteins (hereafter referred to as ORFans) are the major component of positively selected genes in those species, implying that these genes are important for the fitness of these species under their respective living conditions (Soyer et al. 2009; Tai et al. 2011). We therefore explored the subset of ORFans present in the recovery phase in D. radiodurans for evidence of positive selection, as a simple test of their importance to the fitness and survival of D. radiodurans. Our results show that ORFans encoding hypothetical proteins present during the radiation recovery phase are instead under purifying selection, and this tendency toward conservation indicates their essentiality in the recovery process.

Materials and Methods

Identification of Hypothetical Proteins and Orthologs

We shortlisted all hypothetical proteins reported in the proteomics papers on Deinococcus. All of these papers examined recovery from radiation except Lipton et al. (2002), which did not. Transcriptome data were obtained from Liu et al. (2003) and Tanaka et al. (1996), and all genes with induction greater than twofold were considered significant. We analyzed the hypothetical proteins on the pSORTb server to find the subcellular location of these proteins. Orthologs are usually obtained in toto from closely related genomes, but owing to the lack of a clear phylogenetic relationship between radiation-resistant species, we decided to obtain orthologs from the curated eggNOG database (Powell et al. 2012). Since the accuracy of prediction improves with the number of species involved in the study, in most cases 10 orthologs were taken for each “ORFan” under study. We carried out a similar exercise for the DNA repair proteins in this study.
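The twofold induction cutoff described above can be sketched as a simple filter. This is only an illustration: the function name is hypothetical, the first four fold-induction values are the ones quoted for these ORFans in this paper, and the two "HYPO" entries are made-up placeholders added to show genes that fail the cutoff.

```python
def significantly_induced(fold_changes, cutoff=2.0):
    """Return the sorted IDs of genes induced more than `cutoff`-fold."""
    return sorted(gene for gene, fold in fold_changes.items() if fold > cutoff)


# Fold-induction values. The first four are quoted in this paper;
# "HYPO_A" and "HYPO_B" are made-up placeholders at or below the cutoff.
fold_changes = {
    "DR0422": 20.0,
    "DR1141": 10.0,
    "DR2574": 6.0,
    "DR1440": 4.0,
    "HYPO_A": 1.5,
    "HYPO_B": 2.0,  # exactly twofold is not "greater than twofold"
}

induced = significantly_induced(fold_changes)
print(induced)
```

The comparison is strict, so a gene induced exactly twofold is not retained, which matches the "greater than twofold" wording.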

Determination of Positive Selection and Homology Model of Structures

After obtaining the pre-aligned orthologs for each query, we obtained the corresponding coding sequences (CDS) of these proteins from the EMBLCDS database and codon aligned them in PAL2NAL (Suyama et al. 2006). In a few cases, especially for proteins for which adequate homologs were not available in the database, a reciprocal BLAST search with an E-value cutoff of 0.0001 was used to find additional homologs. CLUSTALX was then used to align these additional homologs to the pre-aligned dataset obtained from the eggNOG database. Subsequently, for each query protein and its corresponding orthologs, we built a maximum likelihood (ML) tree with DNAML in the PHYLIP package, using gamma-distributed rates and a randomized input order. We obtained the 3D homology models of the proteins from the Phyre2 fold prediction server (Soding 2005) and the I-Tasser server (Roy et al. 2010). These models were visualized in PyMOL. For the positive selection test, we used the codeml program in the PAML4.6 package (Yang 2007). The F3x4 codon substitution model was used to calculate likelihoods. The likelihood ratio test (LRT) statistic was calculated from the likelihoods obtained under the M1a and M2a models. Subsequently, the positively selected sites were identified by Bayes empirical Bayes (BEB) analysis in PAML.
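The likelihood ratio test between the M1a and M2a models can be sketched as follows. This is a minimal illustration, not the authors' script: the log-likelihood values are made up, and the sketch assumes the conventional comparison of twice the log-likelihood difference against a chi-square distribution with 2 degrees of freedom, a detail the text does not state. For 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_m1a_vs_m2a(lnl_m1a, lnl_m2a):
    """Return (LRT statistic, p-value) for the nested pair M1a (null) vs M2a.

    Assumes a chi-square reference distribution with 2 degrees of freedom,
    for which the survival function is exactly exp(-x / 2).
    """
    # Numerical noise can make the difference slightly negative; clamp at 0.
    statistic = max(0.0, 2.0 * (lnl_m2a - lnl_m1a))
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods, as would be read from two codeml output files.
stat, p = lrt_m1a_vs_m2a(lnl_m1a=-2345.67, lnl_m2a=-2341.10)
print(f"2*dlnL = {stat:.2f}, p = {p:.4f}")
```

A small p-value would favor M2a, that is, the model with an extra class of sites having ω > 1; a statistic that is positive but yields a large p-value corresponds to a positive LRT value below the level of significance.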

Results and Discussion

Hypothetical Proteins were Upregulated During Recovery Phase of γ Irradiation

Deinococcus radiodurans cells exposed to γ radiation showed upregulation of ~832 genes at various time intervals of the recovery phase (Liu et al. 2003; Tanaka et al. 2004). Of these, around 375–500 are ORFans (Fig. 1a). Independent studies have demonstrated the functional significance of a number of hypothetical proteins in radiation resistance. For example, Tanaka et al. (2004) generated deletion mutants of some of the upregulated hypothetical proteins and showed that several of these mutants were unviable. In this way, the roles of hypothetical proteins like ddrA (DR0423), ddrB (DR0070), pprA (DRA0346), and DRB0100 in radiation resistance were discovered (Narumi et al. 2004; Harris et al. 2004; Kota et al. 2010). On comparing the gene expression levels at the two doses, i.e., 15 and 3 kGy, we noticed that at 15 kGy there are 473 ORFans showing greater than twofold induced expression, well above the 26 ORFans induced at 3 kGy (Fig. 1b). This shows that the extent of cellular damage has a role in the expression of ORFans. Interestingly, some of the ORFans highly induced at 3 kGy, like DR0326, DR0491, DR0533, DR1439, and DR1440, were not reported at 15 kGy, while the reverse is true for other ORFans like DR1358, DR1141, DR0697, and DR1359. This qualitative shift hints at a possible dose-dependent or DNA damage-dependent gene regulation. We surveyed the data from independent proteomics studies on D. radiodurans, which showed that around 60 different hypothetical proteins were present during the post irradiation recovery (PIR) phase (Table S1) (Das and Misra 2011; Lu et al. 2009; Zhang et al. 2005; Kota and Misra 2008; Basu and Apte 2012). This number included both proteins induced in response to γ radiation and constitutively expressed proteins. What is interesting about this small subset is that most of these proteins were reported from the first 2 h of the recovery period, when the novel ESDSA phase is highly active. Upregulation of a transcript did not necessarily lead to detectable levels of the corresponding protein. For example, the upregulation of the D. radiodurans ORFans DR0422 (20-fold), DR1141 (tenfold), DR2574 (sixfold), and DR1440 (fourfold) was reported in different transcriptomic studies, but the proteins corresponding to these ORFs have not yet been reported. Another very interesting feature was that, of the 23 paralogous genes belonging to 9 gene families that have expanded in D. radiodurans, only DR2179 was found to be present during this phase (Omelchenko et al. 2005). The predicted functions of these proteins indicated that they constitute a group involved in lipid transport, protease activity, and DNA binding, including transcriptional activators. Almost one-third of all the hypothetical proteins detected were membrane bound or extracellular proteins (Fig. S1). These proteomic studies have also reported several small proteins with an average molecular weight of 18 kDa and very few orthologs, including some, like DRA0281 and DR1977, which were present only in D. radiodurans. Overall, the upregulation of several ORFans and the presence of hypothetical proteins during the recovery phase indicated that these proteins also contribute to the recovery of D. radiodurans from the effects of γ radiation.

Fig. 1

Gene expression profile of ORFans encoding hypothetical proteins in response to γ radiation. a Box plot showing the pattern of expression of ORFans at various time intervals during the PIR phase, as reported in Tanaka et al. (1996). b Altered expression profile of ORFans involved in radiation recovery at doses of 3 and 15 kGy, showing that the expression of select genes sharply increases at the higher dose, as reported in Liu et al. (2003)

Many Hypothetical Proteins Have Evolved Altered Features on Common Structural Templates

Many of the proteins listed in Table S1 were previously annotated as “hypothetical,” but have subsequently been assigned functions based on their sequence and structural features. However, in several cases we saw structural alterations relative to what has been annotated, which could lead to altered capabilities for substrate and/or protein interaction. Since these proteins were found in the recovery phase, we tried to understand their functions in the context of recovery from radiation damage, and we highlight a few such examples here in greater detail.

DR0672 is a typical case where commonly available structural folds have been modified. It has only remote sequence homology even within the Deinococci but shows structural homology with Neisserial surface protein A (NspA) (Fig. 2a). One beta sheet of NspA, β-5, is absent in DR0672, which implies that DR0672 possibly forms a much more compact barrel. Unlike NspA, which has hydrophobic residues at this position, the domain of DR0672 oriented toward the outer side is lined with charged residues in the large loops, as seen in the extracellular lipid binding domains of OmpA proteins (LaLonde et al. 1994). This protein is highly divergent from the canonical OmpA proteins and may be involved in binding lipids or in maintaining membrane integrity, as seen in Porphyromonas gingivalis (Iwami et al. 2007).

Fig. 2

The 3D models of DR0672 and DR2377 and the multiple sequence alignment of DR0459 with its homologs. a A cartoon representation of the 3D homology model of DR0672, which was obtained from the structure of NspA (1P4T) from N. meningitidis with an RMSD of 5.0. The beta sheet in NspA that is absent in DR0672 is highlighted in red. b A space filling model of DR2377 shows the deeper substrate binding pocket of DR2377 compared to its homolog TTHA0849 from the closely related T. thermophilus. c Part of a multiple sequence alignment of DR0459 and its homologs with foot and mouth disease virus (FMDV) sequences. The conserved sites are marked with an asterisk. DR0459, Dgeo1751, Deide03390, DGoCA2467, and Deima2270 are Deinococcus orthologs, while TTHA0681 and TTC0322 are Thermus orthologs (Color figure online)

DR2377 is a homolog of TTHA0849 from Thermus spp., a member of the steroidogenic acute regulatory related lipid-transfer domain (START) superfamily (Iyer et al. 2001). Members of this family are involved in the transport of various lipids. It has been reported that the active-site cavity of TTHA0849 is small, so that it can only accommodate lipids smaller than cholesterol. A 3D model of DR2377 based on the template of TTHA0849 shows that its cavity is much larger, and thus the nature of the lipids transported should differ between Deinococcus and Thermus spp. (Fig. 2b). This protein is induced twofold during recovery from radiation and may be involved in the metabolism of lipids associated with radiation resistance.

DR0459 is a membrane bound protein with an N-terminal signal peptide. The structure of DR0459 could not be completely modeled, but the N-terminal region of the protein shows homology to the foot and mouth disease virus (FMDV) leader protease. The multiple sequence alignment (Fig. 2c) shows an interesting aspect. Homologs of DR0459 are present in both Deinococci and Thermus, and in both they retain the conserved cysteine (Cys51) and aspartate (Asp163) residues of the FMDV leader protease. However, the conserved histidine (His148), which completes the catalytic network, is replaced by alanine in Deinococcus spp. A similar mutation was observed in the plant storage protein narbonin, which is an inactive form of chitinase (Hennig et al. 1992). The C-terminal region has an adhesin domain, thus earmarking DR0459 for the cell wall. DR0459 is induced threefold after a 1 kGy dose of γ radiation, highlighting its role in the recovery of radiation-damaged cells (Lu et al. 2009). It would be worth examining whether the presence of Ala148 in place of the mostly conserved His148 makes this protein inactive in D. radiodurans.

DR2623 was found to be induced when D. radiodurans cells were irradiated at a 1 kGy dose and then allowed to undergo repair for 1 h (Zhang et al. 2005). DR2623 is structurally homologous to thioredoxin reductase (Fig. 3a, b). The thioredoxin system is critical in Deinococcus because it works with thioredoxin-dependent thiol peroxidases, which scavenge the harmful reactive oxygen species (ROS) generated during γ radiation. The active sites of this class of proteins have a conserved thiol-rich motif, “CXXC,” and during catalysis the electrons flow from nicotinamide adenine dinucleotide (NADH) to the active site disulfide via flavin adenine dinucleotide (FAD) and then to thioredoxin (Fig. 3c) (Yamamura et al. 2009). In DR2623 the conserved cysteine residues are replaced by isoleucine and threonine, and no contacts with FAD could be detected, at least in the generated 3D model, ruling it out as a reductase (Fig. 3d). The C-terminal dimerization domain is also missing, and thus DR2623 probably functions as a monomer. Nevertheless, homologs of DR2623 are widely distributed among bacteria, as seen in the 16S rDNA phylogenetic tree (Fig. 3e). These homologs do not have the conserved residues seen in their characterized counterparts, like glutathione reductase and lipoamide dehydrogenase, and hence it would be interesting to know the exact function of this protein other than binding FAD and NADH.

Fig. 3

Bioinformatic analysis of DR2623 with closely related proteins. a A cartoon representation of glutathione reductase (GSR) from B. henselae (PDB id 3T30). b A cartoon representation of the 3D model of DR2623 modeled on GSR of B. henselae. c A close up view of DR2623 superimposed on 3T3O showing active site superimposition. The flavin adenine dinucleotide (FAD) molecule, which interacts with DR2623, is shown as a collection of gray spheres; the distant cysteine residues of DR2623 are marked in red, while the catalytic cysteine residues of 3T3O are marked in green. d A multiple sequence alignment of homologs of DR2623. The site where cysteine is replaced is marked by an asterisk. e A 16S rDNA phylogenetic tree of the homologs of DR2623. The bootstrap values are indicated at the nodes (Color figure online)

Among the other interesting proteins, the 3D structure of the 16 kDa protein DR2179, which is induced during this phase, showed a domain structure similar to the 4VR domain, initially identified as a novel small molecule binding domain (SMBD) in proteins (Fig. 4a, b) (Anantharaman et al. 2001). A distinguishing feature of this domain is the presence of two conserved cysteine residues (Fig. 4c). A multiple sequence alignment (Fig. 4d) shows that DR2179 and its homologs have three conserved cysteine residues. However, the relative position of these residues differs from that in the canonical standalone 4VR domain identified in the 23 kDa 4-vinyl reductase. DR2179 has been annotated as a heme NO binding protein and is structurally homologous to the H-NOX protein SO2144 from Shewanella oneidensis. SO2144 has been shown to sense intracellular NO and to regulate a cognate histidine kinase, SO2145, by inhibiting its autophosphorylation (Price et al. 2007). DR2179 has the conserved S109 and R111, which interact with the heme moiety, but it lacks the conserved Y-X-S-X-R motif reported in H-NOX proteins. Thus, it is difficult to envisage its role as a typical H-NOX protein. Comparison of the 3D model of DR2179 with the structure of a known pyrroloquinoline quinone (PQQ) binding protein, PqqB from Klebsiella pneumoniae, ruled out the possibility that PQQ, an antioxidant and essential cofactor for quinoproteins, interacts with this protein (Puehringer et al. 2008).

Fig. 4

Bioinformatic analysis of DR2179. The homology model of DR2179 (a) based on the crystal-structure cartoon of SO2144 (PDB id 2KIL) (b). The porphyrin ring of the heme molecule is shown as a collection of gray spheres for clarity. c A close up view of the putative active site of DR2179 showing the conserved cysteine residues (C101 and C113) forming bonds with the porphyrin ring. d A multiple sequence alignment of DR2179, its orthologs, and the orthologs of SO2144, which are clustered within a purple lined box. The conserved cysteine residues of DR2179 are marked by black outlines, while the conserved cysteine residues of the H-NOX homologs are marked by red outlines (Color figure online)

DR0390 is a 2-domain protein, with its N-terminal domain matching the C-terminal nucleotide binding domain of dihydroxyacetone (Dha) kinase of Citrobacter freundii and its C-terminal domain matching DegV family lipid binding proteins (Fig. 5a). Homologs of this protein are found mainly in Gram-positive bacteria like Bacillus spp. and Clostridium spp., while they are absent in Gram-negative bacteria (Fig. 5b). In Bacilli and Clostridia, the gene is adjacent to RecG, but we could not ascertain any functional significance for this linkage. DR0390 in D. radiodurans and a homolog in Bacillus halodurans are constitutively expressed (Wallace et al. 2012). The Dha kinases utilize either ATP or PEP to phosphorylate dihydroxyacetone and other small aldoses or ketoses. The N-terminal domain of these proteins forms a barrel-shaped structure comprising 8 α-helices. In DR0390 the catalytic site is formed by amino acids like D57, D59, T60, T150, and S102, which are characterized as essential for catalytic function in the Dha kinase of C. freundii (Fig. 5a) (Siebold et al. 2003). Another conserved residue of Dha kinase, D380, is replaced by a functional analog, N51, in this protein. DR0390 has a conserved threonine in the first loop instead of a histidine, which is usually a hallmark of two-domain Dha kinases. What is unusual in DR0390 is that, instead of the well-formed capping loop seen in PEP-dependent kinases (Siebold et al. 2003) or the unstructured region seen in ATP-dependent kinases, it extends helix H1 to encircle the active site. Structural studies show that the DegV domain, both in standalone proteins and in multi-domain proteins, contains conserved serine, threonine, arginine, and histidine residues, which interact with phospholipid/lipid substrates. In the DegV domain of DR0390, the conserved threonine is replaced by a lysine (K40) and the histidine is absent. The pattern of conservation on the 3D homology model shows that the C-terminal domain of DR0390 has a different distribution of charged residues (Fig. 5a) compared to DegV proteins and that the substrate binding pocket is fairly large compared to the more compact DegV family proteins. Thus, the nature of the substrates that could bind to DR0390 appears to be quite different from that of DegV family proteins. D. radiodurans has at least three orthologs of DegV family proteins, and none of them match each other in their C-terminal regions.

Fig. 5

Structure and functional domain distribution in DR0390. a DR0390, a 2-domain protein, comprises an N-terminal kinase domain modeled on C. freundii Dha kinase (PDB id 1UN8) and a C-terminal domain modeled on DegV of S. pyogenes. The electrostatic surface potential of conserved residues of both the templates and DR0390 is mapped onto the respective structures to compare the similarities in the N-terminal domain and the differences in the C-terminal domain. b The phylogenetic distribution of the homologs of DR0390 in bacteria. The numbers on the branches are the numbers of hits to DR0390 in the BLAST search

DR2577 (SlpA) and its homolog DR1124 are annotated as S-layer proteins in D. radiodurans and have been shown to maintain the cell envelope structure (Rothfuss et al. 2006). However, both these proteins have a C-terminal porin domain and an N-terminal phenylalanine characteristic of outer-membrane proteins like OmpM1 from Mitsuokella multacida (Kalmokoff et al. 2009). Similarly, DRA0009 is an inactive homolog of a sensor histidine kinase from Thermotoga maritima (Marina et al. 2005). All the catalytic residues of the histidine kinase are conserved in this protein except asparagine, the site for phosphorylation, which is replaced by alanine in DRA0009. The subtle and pervasive alterations seen in these proteins raised a pertinent question: were these observed changes random, or were they selected during evolution?

Hypothetical Proteins Present During the Recovery Phase are Under Purifying Selection

Since many of these hypothetical proteins showed novel modifications of their structure, we wanted to see whether the corresponding genes are under positive selection and contribute to the ability of D. radiodurans to adapt to harsh environments such as high doses of γ radiation. Selection pressure is determined by the ratio of non-synonymous substitutions to synonymous substitutions (dN/dS) for a given site in a gene (for the convenience of readers, dN/dS is represented as ω). If ω is <1, mutations are deleterious and are preferentially removed; this is a case of purifying selection. On the other hand, if ω is >1, mutations are advantageous and are retained; this is a case of positive selection (Bielawski and Yang 2004). We calculated the dN/dS ratio for the hypothetical proteins and their homologs based on (a) a single ratio model (M0) for overall selection pressure and (b) a nested pair of models (M1a vs M2a) for site-specific selection. An LRT based on the nested pair of models tests whether a model that allows positively selected sites fits the sequences better. We used BEB analysis as a complementary statistical tool to the above LRT: it estimates the posterior probability of ω at each site in a given data set and thereby indicates the actual sites under positive selection. When LRT and BEB are in agreement, the evidence for selection is robust. Several multilocus-sequence typing (MLST) studies as well as whole genome studies have shown that housekeeping genes in bacteria are under purifying selection, because the corresponding proteins have to conserve their structures/active sites to participate in similar biochemical functions across living systems (Lan and Reeves 2001; Dingle et al. 2001). Conversely, it has been seen that genes encoding most hypothetical proteins have evolved recently, are species-specific, and are usually under positive selection (Ge et al. 2008).
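The interpretation of ω and the requirement that LRT and BEB agree can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis pipeline used here: the function names, the 0.05 significance level, the 0.90 posterior cutoff, and all numeric values in the example are hypothetical, and the residue labels are used only as placeholder keys.

```python
def classify_selection(dn, ds):
    """Classify selection from dN and dS; returns (omega, label)."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute omega = dN/dS")
    omega = dn / ds
    if omega > 1:
        label = "positive"
    elif omega < 1:
        label = "purifying"
    else:
        label = "neutral"
    return omega, label


def robust_positive_sites(lrt_p_value, beb_posteriors, alpha=0.05, cutoff=0.90):
    """Return sites whose BEB posterior exceeds `cutoff`, but only when the
    LRT is also significant (i.e. the two analyses agree)."""
    if lrt_p_value >= alpha:
        return []
    return sorted(site for site, pp in beb_posteriors.items() if pp > cutoff)


# Hypothetical inputs for illustration only.
omega, label = classify_selection(dn=0.02, ds=0.10)
beb = {"Ser13": 0.93, "Leu15": 0.91, "Gly40": 0.55}
agreed = robust_positive_sites(lrt_p_value=0.01, beb_posteriors=beb)
not_agreed = robust_positive_sites(lrt_p_value=0.20, beb_posteriors=beb)
print(label, agreed, not_agreed)
```

With a non-significant LRT the second function returns an empty list even if individual sites have high posterior probabilities, which reflects the idea that BEB-flagged sites are treated as robust only when the LRT also supports positive selection.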

Ortholog-based selection studies for the hypothetical proteins were difficult in this study because, although Deinococcus and Thermus are in the same phylum, there has been lineage-specific gene gain and loss in both these genera (Omelchenko et al. 2005). For example, Deinococcus has homologs of proteins, including DR1252, a saccharopine dehydrogenase with an N-terminal Rossmann fold, which are well conserved in other species but are absent in Thermus spp. Also, in several cases the orthologs in Thermus and Deinococcus differ in length, e.g., DRA0009 is half the size of a homologous sensor histidine kinase from Thermus, indicating that these proteins are under different evolutionary constraints. For these reasons we obtained the set of orthologs from a curated database, eggNOG. There is an exponential decrease in the number of hypothetical proteins having more than 100 orthologs, as compared to the DNA repair genes, which had ~900 orthologs each, indicating that the hypothetical proteins have evolved recently (Fig. 6a).

Fig. 6

Selection pressure and synteny of ORFans in D. radiodurans. a The relationship between hypothetical proteins and the number of orthologs. b The distribution of hypothetical proteins as a function of global ω values. c Relationship between site-specific ω values and the number of orthologs. Proteins that are positively selected are marked with dark circles and labeled with LRT values in brackets. d The distribution of %GC content in hypothetical proteins. e Lack of synteny in Deinococcus species

In some cases, like DR0672, where suitable orthologs were absent, we could not obtain data for selection pressure. When we applied the single ratio model (M0) to all the other cases, we found that all the ORFans encoding hypothetical proteins were under purifying selection, with ω < 1 (Fig. 6b). Thus, the whole gene per se was under purifying selection, but it was still possible that a few select sites would be under positive selection (Yang and dos Reis 2011). However, even the LRT for site-specific selection showed no statistically significant evidence of site-specific positive selection in these genes (Table S1). In the case of DR1654, DR1314, DR0423 (DdrA), DR1940, and DRA0282, the LRT values were positive but below the level of significance. Within the set of ORFans studied, the fraction with positive LRT values was small (Fig. 6c), suggesting that these are newly evolved genes. For the extracellular proteins DR1654 and DR1940, the BEB analysis additionally indicated positively selected sites with a confidence >90 %. DR1654 was found in the heparin-binding fraction of the cell lysate of D. radiodurans recovering from radiation injury (Das and Misra 2011). A low-resolution model obtained from I-Tasser (RMSD value of 10) showed it to be a homolog of the NC4 domain of collagen (Fig. 7a, b), which is known to bind various macromolecules such as proteoglycans and heparin (Leppanen et al. 2007). Thus, homology prediction corroborated the observation that DR1654 is a heparin-binding protein (Das and Misra 2011). It is a sparsely distributed protein, and its homologs are seen almost exclusively in radiation resistant bacteria (Fig. 7c). Likewise, the N-terminal region of DR1940 is homologous to HslJ, a chaperone in E. coli, and the C-terminal region is homologous to Ecotin, a trypsin inhibitor (Fig. 7d, e). The homology model also shows that the positively selected residues Ser13, Leu15, and Ala185 are exposed on the same face of the chaperone domain, suggesting that they may be involved in substrate binding. This protein is fairly well distributed among bacteria (Fig. 7f), and knowing its function can lead to insights on survival under stress. It is pertinent to note that BEB analysis showed positively selected sites only for the extracellular proteins, a class which is well documented for positive selection, as they primarily interact with a changing environment (Nielsen et al. 2005; Petersen et al. 2007). Both DdrA and DRA0282 are intracellular and interact with DNA in vitro. They are upregulated during radiation recovery (Liu et al. 2003), and their deletion mutants show no effect when grown in rich medium (Harris et al. 2004; Das and Misra 2011). DR1314 is homologous to the PRC-H barrel domain of the photosynthetic reaction center of Rhodopseudomonas viridis. This domain is reported to be a key regulator of electron transfer between quinones in the photosynthetic reaction center (Anantharaman and Aravind 2002). In D. radiodurans, a quinone, PQQ, plays an important role in countering oxidative stress, which suggests some role for DR1314 in electron transport processes during radiation recovery.

Fig. 7

Hypothetical proteins with positively selected sites in BEB analysis. a The 3D homology model of DR1654 showing the positively selected residues Gln72 and Leu104 in red. b The structural homolog of DR1654, the NC4 domain of collagen. c The 16S rRNA phylogenetic tree showing the sparse distribution of bacteria in which this protein is found. d The 3D homology model of DR1940 showing the positively selected residues Lys274, Ser13, Leu15, and Ala185 in red. e The structural homologs of DR1940: YP557733, a hypothetical protein from Burkholderia xenovorans, matches the N-terminal region of DR1940, while Ecotin, a trypsin inhibitor, matches the C-terminal region of DR1940. f The 16S rRNA phylogenetic tree showing the distribution of this protein in several bacteria (Color figure online)

Since horizontal gene transfer has been a prevalent phenomenon in this genus, we tried to find out whether any of the ORFans studied here were recently transferred. The Deinococcus genome is GC rich, and a deviation in the GC content of a gene could be evidence for horizontal gene transfer. Our studies show that the genes for the hypothetical proteins under study have the same mean GC content as that of Deinococci and prima facie are not cases of recent horizontal gene transfer (Fig. 6d). Another highlight is that genes which have evolved recently, like many of the ORFans, are usually poorly expressed (Tautz and Domazet-Loso 2011), but in our case most of these ORFans are either constitutively expressed or highly induced during the recovery process. Thus, most of the hypothetical genes present in this phase have evolved novel features that are now maintained by purifying selection, which shows a tendency to conserve these features. Since purifying selection occurs due to functional constraints on a protein, this suggests that these proteins perform a key role in the recovery process.
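The GC-content comparison described above can be sketched as follows. This is a minimal illustration, not the procedure used for Fig. 6d: the text does not say how a deviation in GC content was judged, so the z-score rule, its cutoff, the function names, and all sequences and numbers below are assumptions.

```python
def gc_content(sequence):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def deviates_from_genome(gene_gc, genome_mean_gc, genome_sd_gc, z_cutoff=2.0):
    """Flag a gene whose GC content lies more than `z_cutoff` standard
    deviations from the genome-wide mean (an assumed, simple criterion)."""
    z = abs(gene_gc - genome_mean_gc) / genome_sd_gc
    return z > z_cutoff


# Toy sequences and hypothetical genome statistics, for illustration only.
toy_gene_gc = gc_content("GGCCAT")       # 4 of 6 bases are G or C
masked_gene_gc = gc_content("ATGCNN")    # ambiguous N bases are ignored
flag_low = deviates_from_genome(50.0, genome_mean_gc=66.0, genome_sd_gc=4.0)
flag_typical = deviates_from_genome(65.0, genome_mean_gc=66.0, genome_sd_gc=4.0)
print(round(toy_gene_gc, 1), masked_gene_gc, flag_low, flag_typical)
```

A gene that is flagged by such a rule would only be a candidate for recent transfer; GC content alone cannot establish horizontal gene transfer.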

Recombination/Repair Genes are also Under Purifying Selection

Since we found that most of the ORFans in the recovery phase are under purifying selection, we decided to check the selection pressure on some of the recombination repair genes, such as the recFOR pathway genes, and a few other ubiquitous DNA repair proteins present in this phase (Table 1). A couple of earlier studies had shown that recFOR is important in the DSB repair of this bacterium (Misra et al. 2006; Bentchikou et al. 2010). Among the dozen repair genes we studied with the rigorous M1a versus M2a model comparison, we found that they too were under purifying selection. This result remains true even when very divergent orthologs were selected. It also has experimental support: when the DNA polymerase of E. coli is expressed in Deinococcus, it is able to participate in the DNA repair process and repair DSBs, which is possible only with a certain degree of conservation. Although the sample of housekeeping genes and recombination/repair genes analyzed in this study is not very large, this finding is in agreement with numerous reports showing that housekeeping genes are under purifying selection (Petersen et al. 2007). Global genome studies have shown that a lack of gene order (synteny) is usually observed between organisms that belong to distant phyla and that it correlates with increasing dN/dS values in the corresponding orthologs (Novichkov et al. 2009). Thus, the mechanisms that tend to conserve the gene order also conserve the sequence. However, Deinococci lack synteny among themselves (Fig. 6e): although protein sequences are conserved, the gene order is not, which implies that operons would also not have been conserved. The DNA transposition activity reported in D. radiodurans may possibly be responsible for such a phenomenon (Mennecier et al. 2006). Thus, in this study we have found that these hypothetical proteins in Deinococcus spp. are subject to a unique evolutionary process in which the genes have been shuffled around during speciation, but their phenotypes are conserved and they are robustly expressed.

Table 1 List of DNA repair genes checked for selection pressure in this study

Since positive selection is an indication of adaptive changes to environmental conditions, we set out to find examples of positively selected hypothetical proteins present during the recovery of D. radiodurans from γ radiation. We found wide diversity in the phylogenetic distribution of these hypothetical proteins, ranging from conserved proteins like DR2623, which has a wide phylogenetic distribution, to proteins like DRA0281, which was present only in D. radiodurans. Homology models for a number of these hypothetical proteins showed distinct adaptations of the active site or binding surfaces, or rearrangement of domains, which should lead to newer capabilities. All the ORFans encoding these hypothetical proteins and the key DNA repair genes present in this phase were found to be under purifying selection. We suspect that these proteins have evolved adaptations for performing novel functions that are necessary to overcome the effects of ionizing radiation, and hence these adaptations are conserved. A functional study of these proteins and their novel biochemical properties could throw new light on the phenomenal radiation recovery of this organism. Moreover, earlier studies have shown that a significantly low dN/dS value is an indication of a functional exon in eukaryotes (Nekrutenko et al. 2002), and we have seen evidence of purifying selection in many of these functional hypothetical proteins. Thus, it may be possible in the future to find functional hypothetical proteins in the Deinococcus-Thermus phylum by testing for purifying selection.