Introduction

Viral diseases have become a limiting factor to the sustainable growth of the global shrimp culture industry [1]. White spot syndrome virus (WSSV), the sole member of the monotypic family Nimaviridae, genus Whispovirus [2], is an extremely lethal pathogen that can cause cumulative mortality of up to 100 % within 2–10 days after the onset of clinical signs [3, 4]. Furthermore, it has been reported that WSSV is capable of infecting most of the commercially cultivated shrimp species and has consistently emerged as a highly prevalent and widespread virus. The WSSV virion consists of an enveloped nucleocapsid containing a circular double-stranded DNA genome that shows size variations among different geographic isolates (307,287, 305,107, and 292,967 base pairs [bp] for the Taiwan, China, and Thailand isolates, respectively) [5–7].

It has long been known that all organisms have a specific codon usage signature, and the degeneracy of the genetic code implies that multiple codons specify the same amino acid. Several factors have been proposed to explain the deviations in codon patterns, and such uneven usage is not selectively neutral as previously proposed, but related to compositional constraints and translational selection, gene expression [8], protein structure formation [9], viral packaging [10], GC-biased gene conversion [11] and even tRNA base modifications [12]. It has been proposed that new viruses and strains may emerge as a result of selection pressures from environmental fluctuations or by host shifts [13].

Viral recombination, here defined as the exchange of genetic material between at least two viral genomes [14], is a process that influences viral fitness at many different levels [15, 16]. It has been suggested that viral recombination may be a way in which viruses adapt quickly to changing environmental conditions, new hosts and ecological niches [17, 18], as recombination can enable access to evolutionary innovations that would otherwise be inaccessible by mutation alone [19].

Because of the rising prevalence of WSSV and the global implications of this virus in the aquaculture trade, it is imperative to understand its genome dynamics, which may facilitate its evolution to create novel variants that can adapt to a changing environment and host genotype [20]. We studied all three of the different geographical isolates sequenced so far to illustrate the genome dynamics of WSSV. The findings of the present study established the presence of compositional constraints in the WSSV genome. Interestingly, we found that most of the genes that are under the influence of positive selection are associated with the control of virus replication, which confers some characteristics to the viral genome that may ensure its efficient replication and may consequently provide increased fitness. The presence of recombination hotspots in the WSSV genome was also evaluated, and some factors that might influence the recombination rate are proposed.

Materials and methods

Genome sequence data and multivariate analysis

The W-70, W-93 and W-29 genome sequences were retrieved from the GenBank database (accession numbers AF440570, AF332093, and AF369029). A threshold of 100 codons was applied to sequence filtering, and finally, 260, 252, and 146 complete coding sequences (CDS) from W-70, W-93 and W-29, respectively, were extracted directly to avoid sampling bias in calculations of codon usage [21]. The G + C frequency distribution was calculated as described earlier [22]. The effective number of codons (Nc) was calculated as described previously [21]. The relative synonymous codon usage (RSCU) values were calculated to normalize and identify the intra-genomic variations with differing amino acid compositions [23]. Correspondence analysis (COA) for RSCU was implemented using Codon W (http://codonw.sourceforge.net) [24].
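The RSCU normalization described above can be illustrated with a minimal Python sketch. It assumes the usual definition (observed count of a codon divided by the mean count of all synonymous codons for the same amino acid) and the standard genetic code; it is not the CodonW implementation, and the function name `rscu` is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return RSCU values for the degenerate codons observed in one CDS."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group codons into synonymous families.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp are excluded (59 codons remain)
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for codon in family:
            # observed count / expected count under equal synonymous usage
            values[codon] = counts[codon] * len(family) / total
    return values


if __name__ == "__main__":
    print(rscu("GCTGCTGCTGCC"))
```

A value above 1 means the codon is used more often than expected under equal usage of synonymous codons, and a value below 1 means it is used less often.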

The codon adaptation index—A measure of gene expression and evolution

The codon adaptation index (CAI) measures the bias of codon usage relative to a reference set of highly expressed genes. For the calculation of CAI, a set of 16 highly expressed genes was selected for each genome. These included wsv151 (latency related), wsv427 (latency related), wsv366 (latency related), wsv230 (ICP11), wsv360 (VP664), wsv421 (VP28), wsv069 (iE1), wsv254 (VP37), wsv514 (DNA polymerase), wsv129 (VP357), wsv214 (VP15), wsv311 (VP26), wsv414 (VP19), wsv002 (VP24), wsv386 (VP68), wsv001 (Collagen-like) as previously suggested [25]. All correlations were based on the nonparametric Spearman’s rank correlation (ρ) analysis method using R (http://www.r-project.org/). In order to compute orthologs, the reciprocal best BLAST hit (RBH) approach was used to find the best bidirectional hits. To calculate dN/dS ratios, amino acid sequences were first aligned using ClustalW1.83 with default parameters [26], and the corresponding codon alignment and dN/dS ratio were then calculated using an in-house Perl script.
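As an illustration of the CAI calculation, the following Python sketch assumes the standard geometric-mean formulation of Sharp and Li, in which each codon receives a relative adaptiveness weight derived from the reference set of highly expressed genes. The source does not state which implementation was used, so the function names and the pseudocount of 0.5 for codons absent from the reference set are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Split a CDS into codons, ignoring a trailing partial codon."""
    seq = cds.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_cds, pseudocount=0.5):
    """Weight w = count / max count within each synonymous family.

    The pseudocount for unobserved codons is an assumption of this sketch.
    """
    counts = Counter(c for cds in reference_cds for c in split_codons(cds))
    weights = {}
    for aa in set(AMINO) - {"*", "M", "W"}:  # skip stops, Met and Trp
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        adjusted = {c: counts[c] or pseudocount for c in family}
        best = max(adjusted.values())
        for codon in family:
            weights[codon] = adjusted[codon] / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scored codons in a CDS."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


if __name__ == "__main__":
    w = relative_adaptiveness(["GCTGCTGCTGCC"])  # toy reference set
    print(cai("GCTGCC", w))
```

In practice the reference list would hold the coding sequences of the 16 highly expressed genes of one isolate, and `cai` would be applied to every CDS of that isolate.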

Characterization of potential recombination events

Detection of potential recombinant sequences, identification of potential parental sequences, and localization of possible recombination breakpoints were performed using the GENECONV, BOOTSCAN, MaxChi, CHIMAERA, SISCAN and 3SEQ methods embedded in the RDP3 software package [27]. A multiple-comparison-corrected P-value cutoff of 0.01 was used throughout the study.

Results and discussion

Codon usage pattern in the WSSV genome and highly expressed genes

It has been shown that strong codon bias is common in highly expressed genes compared to those that are not highly expressed within the same genome [23, 28]. The overall RSCU values of the 59 sense codons in the whole genome of the WSSV isolates and for 16 highly expressed genes are shown in Table 1. Codon usage in the WSSV isolates is preponderantly A- or T-ended (W-29, 67 % T-ended, 22 % A-ended, and 11 % G- or C-ended; W-70 and W-93, 56 % T-ended, 39 % A-ended, and 5 % G-ended), which correlates with the low overall GC3 content in all three of the isolates (~39.0 %). It was further observed that the average GC content of the three isolates at the first position was higher than at the second and third codon positions, which clearly demonstrates the GC compositional pressure on the biased codon usage. We further evaluated the relationship between nucleotide content and codon usage using an effective number of codons (Nc) plot. The Nc-GC3s plot has been widely used to study codon usage variation among different genes, as it has been shown that this index has a relationship to GC3. The Nc value showed a wide variation, ranging from 23 to 61 in W-93. This shows that genes with low Nc values have a stronger bias than genes with higher Nc values, which supports the presence of compositional constraints and a bias gradient in WSSV genomes.
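For readers unfamiliar with the Nc-GC3s plot, the sketch below shows the two quantities it relates. It assumes Wright's expected curve, Nc = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3s value, as the reference against which observed Nc values are compared. The function names are hypothetical and the code is an illustration, not the procedure used in the study.

```python
def gc3s(cds):
    """GC fraction at the third position of synonymous codons.

    Met (ATG), Trp (TGG) and the stop codons are excluded because they
    have no synonymous alternatives (standard genetic code assumed).
    """
    excluded = {"ATG", "TGG", "TAA", "TAG", "TGA"}
    seq = cds.upper().replace("U", "T")
    thirds = [
        seq[i + 2]
        for i in range(0, len(seq) - len(seq) % 3, 3)
        if seq[i:i + 3] not in excluded
    ]
    if not thirds:
        return float("nan")
    return sum(base in "GC" for base in thirds) / len(thirds)


def expected_nc(s):
    """Expected Nc when codon usage is shaped by GC3s composition alone."""
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


if __name__ == "__main__":
    # Expected Nc at a GC3s of 0.39, the approximate genome-wide value in the text.
    print(round(expected_nc(0.39), 1))
```

Genes whose observed Nc lies well below `expected_nc(gc3s(cds))` show more codon bias than base composition alone would predict.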

Table 1 Summary of the average relative synonymous codon usage (RSCU) of the 59 degenerate codons used in the three WSSV genomes for whole-genome analysis (all) and for 16 HE genes (high)

Multivariate analysis of codon usage

Correspondence analysis on RSCU (COA) for isolates W-29, W-70 and W-93 revealed that axis 1 accounted for ~12.79, ~9.06, and ~9.09 %, respectively, of the total variation of the 59-dimensional space. Interestingly, the RSCU values observed in the W-29 isolate were higher than those observed in the W-70 and W-93 isolates. These results may indicate that the reduction of the genome size of W-29 has favored the usage of T-ending codons, while isolates W-70 and W-93 (both of which have larger genome sizes) seem to have more varied options for codon usage. Taken together, this may reflect that compositional limitations played a central role in shaping the codon usage pattern of WSSV. It seems probable that the reduction of the genome size of WSSV over time has selectively driven the eradication of background nucleotide content, specifically affecting the number of A-ending codons, but also favoring the appearance of C-ending codons. This can be interpreted as an adaptive strategy that may offer an advantage to W-29 by allowing unrestricted access to the full pool of tRNAs of the host to exploit its translation machinery in order to replicate unrestrictedly. Aragonès et al. [29] found that poliovirus shows a highly optimized codon usage that conforms to that of the host cell, confirming that viral replication reaches its maximum level when the correspondence between codon usage (demand) and tRNA availability (supply) is optimal.

Gene expression in WSSV

To confirm the assumption that highly expressed genes are clustered along the first major axis, the codon adaptation index (CAI) was calculated for all of the genes identified in the WSSV genome. CAI was calculated taking highly expressed genes as a reference (see “Materials and methods”). A weak but significantly positive correlation (W-29; r = 0.065, W-70; r = 0.010, and W-93; r = 0.002, P < 0.001) was observed between the positions of the genes along the first major axis and their corresponding CAI values in all isolates. The CAI value, which ranges between 0 and 1, indicates that genes with a CAI value close to 1 are composed of very frequently occurring codons. In this case, all WSSV isolates showed CAI values that range from ~0.5 to ~0.85, but most of the WSSV genes showed CAI values between 0.7 and 0.8. It is worth noting that the CAI values for W-29 span a narrower range than those of W-70 and W-93, which may indicate that the WSSV genes avoid the use of rare codons, resulting in a codon usage bias. It is known that the introduction of rare codons, or pairs of rare codons, into an ORF reduces viral translation efficiency [30, 31].
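The rank correlation between axis-1 positions and CAI values can be sketched in plain Python as follows. The study used R for this step; the code below is only an illustration of Spearman's ρ (the Pearson correlation of average ranks, with ties sharing their mean rank), and the function names and toy numbers are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    axis1 = [-0.4, -0.1, 0.2, 0.5]    # hypothetical axis-1 coordinates
    cai = [0.62, 0.70, 0.74, 0.81]    # hypothetical CAI values
    print(spearman_rho(axis1, cai))
```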

Substitution rate and evolutionary constraints

Differences in the synonymous and non-synonymous nucleotide substitution ratio (Ka/Ks, termed as the “acceptance rate”) between WSSV isolates were also investigated. The acceptance rate has been widely used as an estimator of the stringency of the purifying selection or the strength of adaptive evolution. Among the 132 orthologous genes, a total of 51 genes appear to be under positive selection. Furthermore, it was found that the average synonymous (Ks) rate for the orthologs under positive selection is 0.111 ± 0.0023, and the average non-synonymous (Ka) substitution rate is 0.040 ± 0.0036. Interestingly, only five of the WSSV orthologs under positive selection showed a relatively high ratio of synonymous substitutions over non-synonymous substitutions (ORF134, Ks = 4/1218, Ka = 1/1218; ORF42, Ks = 3/1279, Ka = 2/1279), while most of these orthologs showed high ratios of non-synonymous to synonymous substitutions (ORF14, Ks = 39/301, Ka = 5/301; ORF30, Ks = 190/1683, Ka = 37/1683; ORF183, Ks = 147/496, Ka = 31/496; ORF61, Ks = 68/579, Ka = 16/579; ORF40, Ks = 73/1534, Ka = 19/1534). According to van Hulten et al. [6], ORF30 encodes a collagen-like protein, and ORF61 encodes a putative serine/threonine protein kinase.

Furthermore, nine out of the 51 genes (~18 %) showing the most evidence for positive selection have inferred putative functions. Most of these orthologs appear to be associated with the replication of WSSV, collagen-like protein, serine/threonine protein kinase, class I cytokine receptor, DNA metabolism, and transcription. For example, ORF27 encodes a DNA polymerase, ORF92 and ORF98 encode the large and small subunits of ribonucleotide reductase, respectively, ORF171 encodes a chimeric thymidine kinase-thymidylate kinase, and ORF149 putatively encodes a TATA box binding protein. Thus, WSSV seems to be under the influence of a balance of different selective forces at different regions and sites that display different functional constraints. Similar results have been described previously for other viruses. In a recent study, it was found that one gene (tat) of simian immunodeficiency virus exhibits positive selection, while the overlapping gene (vpr) shows signs of strong purifying selection [32].

Characterization of potential recombination events

A recombination detection analysis using RDP3 identified three potential events. The genome segment affected by the first putative recombination event starts at position 23,227 and ends at position 44,587. It has been reported previously that this part of the genome includes both a highly variable region, which is located at position 22,961-23,619 in the W-29 isolate, and a genomic deletion when compared with the W-70 and W-93 isolates [33]. In addition, it has been suggested that the chimeric thymidine kinase (TK) and thymidylate kinase (TMK) genes were incorporated into the WSSV genome via homologous recombination [34]. Moreover, the vast majority of the informative sites on which the respective recombination signals were based lie in these variable regions. Since RDP assumes that differences between sequences arise from independent point mutations, the first recombination event was discarded as unrealistic. As the remaining two recombination events were located in regions of low variability and few deletions, and as low P-values were obtained during the analysis (Table 2), they were considered reliable.

Table 2 The average P-values of three recombination events occurring in WSSV calculated by six different recombination detection methods

Based on the findings of RDP and the distribution of sequence similarity (Fig. 1), it was concluded that the following recombination events (event II and event III) are the most plausible. W-93 resulted from a recombination of W-29, W-70 and an unknown WSSV sequence (probably an ancestral WSSV variant): the segments at positions 1-96,803 and 201,825-280,000 (measured in reference to the alignment used) stem from W-29, the segment 96,803-201,825 was derived from W-70, and the segment 280,000-end originated from an unknown sequence. Furthermore, according to the results obtained in this study, it is proposed that W-29 resulted from a recombination of W-70 and an unknown WSSV isolate, in which the segment 1-285,484 stems from an unknown sequence and the segment 285,484-end originated from W-70. These results contrast with those reported recently, in which a model of gradual WSSV genome shrinkage has been proposed [35].

Fig. 1

Identification of recombination events II and III occurring in the WSSV genome, using the results of bootscan analysis for the recombination origin on the basis of pairwise distance, modeled with a window size of 200, a step size of 20 and 100 bootstrap replicates
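The windowing parameters in the caption (window size 200, step size 20) can be illustrated with a simplified Python sketch. This is not the bootscan method itself, which builds bootstrapped trees for each window; as a stated simplification, the sketch only slides a window along an alignment and reports which candidate parent is most similar to the query in each window. All names and the toy sequences are hypothetical.

```python
def window_identity(query, parent, start, size):
    """Fraction of identical, non-gap positions within one alignment window."""
    pairs = [
        (a, b)
        for a, b in zip(query[start:start + size], parent[start:start + size])
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0  # window made only of gaps
    return sum(a == b for a, b in pairs) / len(pairs)


def scan(query, parents, size=200, step=20):
    """Return (window_start, closest_parent) for each sliding window.

    `parents` maps a parent name to its aligned sequence; all sequences
    are assumed to come from the same alignment and have equal length.
    """
    results = []
    for start in range(0, len(query) - size + 1, step):
        scores = {
            name: window_identity(query, seq, start, size)
            for name, seq in parents.items()
        }
        results.append((start, max(scores, key=scores.get)))
    return results


if __name__ == "__main__":
    # Toy alignment: the query matches P1 in its first half and P2 in its second.
    query = "A" * 300 + "C" * 300
    parents = {"P1": "A" * 600, "P2": "C" * 600}
    for start, closest in scan(query, parents):
        print(start, closest)
```

A switch in the closest parent along the alignment marks a candidate breakpoint region, which a full bootscan analysis would then assess with bootstrap support.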

According to that model, the WSSV genome has been shrinking by removing some variable regions, while at the same time its virulence has increased due to the faster replication of a smaller genome. Thus, the suggested expansion of the WSSV genome is not paradoxical. A genome is a collection of genes that controls and coordinates the essential functions of an organism through a dynamic interaction of its elements [36]. Thus, it is clear that a virus containing a small genome will depend extensively on the host cell as a provider of the elements needed for its replication. It is also clear that a reduction in the genome size may confer some evolutionary advantages to the virus. However, a virus genome reduction is not necessarily a straightforward process. According to the results obtained in the present study, a reduction in the WSSV genome indeed occurred early during its evolution; however, successive recombination events have caused an increase in the genome size. Similar findings involving recombination events during viral genome size increase have been reported previously, suggesting that some genome components of the geminiviruses may have experienced homologous and non-homologous recombination events that finally caused a size increase [37]. If this hypothesis is proven correct, it would be clear that the WSSV genome is non-static, and that recombination is an important force in its evolution, conferring an outstanding ability to adapt to any given environment.