Introduction

Hepatitis B virus (HBV) is a common and serious viral pathogen of humans that belongs to the family Hepadnaviridae. It is estimated that there are approximately 300 million HBV carriers globally, with Asia particularly badly affected (WHO 2005). Chronic carriers may eventually succumb to serious disease manifestations such as cirrhosis and hepatocellular carcinoma. Despite its importance, the evolutionary origins of HBV and the timescale of its spread remain uncertain. For example, it is still unclear whether HBV co-diverged with the various hominoid primate species with which it is associated (chimpanzees, gorillas, humans, orang-utans, and gibbons) over many millions of years, or whether its spread reflects more recent patterns of contact among human populations following cross-species transmission (Simmonds 2001). This evolutionary complexity is evidently due, in part, to two features of the virus. First, HBV is a unique enveloped double-stranded (ds) DNA virus with a circular genome of ~3200 bp that employs an error-prone reverse transcriptase (RT) for part of its replication process. Second, its four open reading frames (ORFs), which encode the polymerase (P), surface antigen (S), nucleocapsid (C), and X proteins, exhibit substantial overlap with each other.

A central aspect in resolving the origin of HBV is obtaining an accurate rate of nucleotide substitution from which divergence times can be estimated. Although variable, previous estimates of evolutionary rates in HBV are intermediate between those of RNA and DNA viruses, ranging from 1.5 × 10^-5 to 7.9 × 10^-5 nucleotide substitutions per site per year (subs/site/year) (Fares and Holmes 2002; Hannoun et al. 2000; Okamoto et al. 1987; Orito et al. 1989; Osiowy et al. 2006). On the basis of these rates, the upper bound for the time to the most recent common ancestor (tMRCA) of sampled isolates of human HBV is in the region of 3000–6000 years (Fares and Holmes 2002; Orito et al. 1989). Such a timescale implies that modern human dispersal is responsible for the phylogeographic pattern of HBV, with those viruses observed in other hominoid primates generated by cross-species transmission rather than long-term host-virus codivergence (Starkman et al. 2003). Related hepadnaviruses are also found in some avian species (geese, ducks, herons) and are assigned to the genus Avihepadnavirus. However, as yet, no work has been done to estimate evolutionary parameters in these avian viruses, even though they make an excellent counterpart to the study of human HBV.

Most studies of rates of nucleotide substitution in HBV undertaken to date have used serially sampled sequences from one or several individuals who have carried the virus for extended time periods (Okamoto et al. 1987; Orito et al. 1989; Osiowy et al. 2006) or from mother-child pairs (Fares and Holmes 2002; Hannoun et al. 2000). Consequently, these results are only indicative of the short-term, and often intrahost, evolutionary rate in chronic HBV carriers, which does not necessarily reflect the process of nucleotide substitution in the long term. Indeed, extensive intrahost genetic diversity has been observed on occasion (Osiowy et al. 2006). Similarly, no previous study has convincingly accounted for the occurrence of lineage-specific variation in rates of nucleotide substitution. Such variation may be particularly important in HBV, as it has been demonstrated that viruses that do not express the e antigen (HBeAg–) evolve more rapidly than those that are HBeAg+ (Hannoun et al. 2000).

Several other characteristics of HBV also contribute to the complexity of inferring the timescale of its evolutionary history. Perhaps of most importance is that the hepadnaviruses are characterized by a complex pattern of overlapping reading frames. In particular, the P ORF occupies 75% of the genome, with the S ORF entirely embedded within it, and the nucleocapsid and X ORFs overlapping its distal regions. Such overlap obviously acts to constrain viral evolution, as synonymous substitutions in one frame are often nonsynonymous in another (and vice versa), leading to complex patterns of purifying selection (Bollyky and Holmes 1999; Fares and Holmes 2002; Mizokami et al. 1997). It is also likely that the relatively high rate of recombination in HBV has adversely affected estimates of substitution rate and hence divergence times. Since its discovery, recombination has been widely identified across the HBV genome and in the eight circulating human genotypes (Bollyky et al. 1996; Bowyer and Sim 2000; Fares and Holmes 2002; Kurbanov et al. 2005; Luo et al. 2004; Olinger et al. 2006; Simmonds and Midgley 2005; Suwannakarn et al. 2005; Wang et al. 2005; Yang et al. 2006). Indeed, some recombinant HBV isolates have reached high prevalence; for example, the dominant HBV type in Tibet is a C/D recombinant, and the genotype B isolates sampled from China similarly have a recombinant origin (Cui et al. 2002; Luo et al. 2004).

In an attempt to infer the dynamics and timescale of evolution in the human and avian hepadnaviruses with as much accuracy as possible, we employed a recently developed Bayesian Markov chain Monte Carlo (MCMC) method that can account for lineage-specific rate variation through the use of a “relaxed” molecular clock. This method analyzes the distribution of mutational differences among viruses sampled at different times (serial sampling) and provides a natural measure of statistical uncertainty, as all estimates are based on a very large number of sampled trees (Drummond et al. 2002, 2005). We apply these methods to human and avian HBV for the first time.

Materials and Methods

Sequence Data

Complete genomes of human HBV where the date (year) of sampling was available were downloaded from GenBank and aligned manually (a full list is provided in Supplementary Table 1). Where dating information was unavailable from GenBank we either referred to the primary literature or contacted the relevant authors. Known recombinants were excluded from the data, as were those viruses sampled from patients subject to anti-viral therapy or suffering immunosuppression (i.e., AIDS patients or organ transplant patients), as these are likely to have “artificially” altered evolutionary rates. This resulted in a data set of 145 human HBV sequences. A number of different data subsets were then constructed for evolutionary analysis (see below). Using the same criteria we collected the complete genomes of 35 avian hepadnaviruses for which the year of sampling was known (full details given in Supplementary Table 2).

Identification of HBV Recombinants

We used the RDP2 program (Martin et al. 2005a) to identify recombinant HBV sequences. This employs six different algorithms to identify recombinants: RDP, SiScan, Bootscan, Chimaera, MaxChi, and GENECONV. As there is no consensus on the optimal method for detecting recombination, we deemed that sequences identified as recombinant by three or more methods within RDP2 should be excluded from any further analysis (a list of recombinant sequences is provided in Supplementary Table 1). This approach therefore represents a balance between being overly conservative and including false-positive results. Following the exclusion of recombinants, our data set comprised 76 complete genome sequences of human HBV, representing genotypes A, C, D, E, F, and H, which may be considered our “global” human HBV data set (all viruses assigned to genotype B were found to be recombinant and no dated genotype G sequences were available for study). For these 76 complete genome sequences we also constructed alignments of the (concatenated) overlapping and nonoverlapping regions of the viral genome. Similarly, we also extracted and concatenated the third codon positions of the nonoverlapping reading frame regions, as these are expected to comprise the least constrained set of nucleotide sites in the viral genome. Finally, because sufficient sequences were available, we also analyzed separately all those sequences assigned to genotypes C (n = 26) and D (n = 22). All sequence alignments are available from the authors on request.
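The exclusion rule described above (discard any sequence flagged as recombinant by three or more of the six methods) can be sketched as follows. This is a minimal illustration, not the procedure run inside RDP2: the function name, the dictionary layout of the per-method results, and the toy isolate identifiers are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def filter_recombinants(
    flags: Dict[str, Set[str]],
    all_ids: Iterable[str],
    min_methods: int = 3,
) -> Tuple[List[str], List[str]]:
    """Split sequence IDs into (kept, excluded).

    flags maps a detection-method name to the set of sequence IDs that
    the method flagged as recombinant (hypothetical input format).
    A sequence is excluded when at least `min_methods` methods flag it.
    """
    ids = list(all_ids)
    counts = {seq_id: 0 for seq_id in ids}
    for flagged in flags.values():
        for seq_id in flagged:
            if seq_id in counts:
                counts[seq_id] += 1
    kept = [s for s in ids if counts[s] < min_methods]
    excluded = [s for s in ids if counts[s] >= min_methods]
    return kept, excluded


# Toy example with made-up isolate names.
example_flags = {
    "RDP": {"s1", "s2"},
    "MaxChi": {"s1"},
    "GENECONV": {"s1", "s3"},
    "SiScan": {"s2"},
}
kept, excluded = filter_recombinants(example_flags, ["s1", "s2", "s3", "s4"])
```

Raising or lowering `min_methods` makes the screen more permissive or more conservative, which is the trade-off the threshold of three is intended to balance.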

During this recombination screen we also observed that most recombination events occurred in one region of the viral genome. Therefore, to improve the power of our phylogenetic analysis, we divided the HBV genome into two halves, reflecting their differing frequencies of recombination (Fig. 1): (i) nucleotides 1–1626, covering the pre-S2 region, the S ORF, 1623 bp of the P ORF, and the first 226 nucleotides of the X ORF, which had a lower rate of recombination; and (ii) nucleotides 1627–3252, covering the whole pre-C/C gene, the pre-S1 region, the first 915 bp of the P ORF, and the last 228 bp of the X ORF, which contained the majority of the recombination events. By removing the recombination-prone region ii from our initial data set of 145 human HBV genomes, and excluding identical sequences, we were able to produce a data set of 108 sequences comprising region i described above and characterized by a relatively low recombination rate (Supplementary Table 1).
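The construction of the low recombination rate data set can be illustrated with a short sketch. It assumes, purely for illustration, that alignment columns coincide with the genome numbering used above (so that region i is columns 1–1626) and that sequences are held in a dictionary keyed by isolate name; both the function names and this data layout are hypothetical.

```python
from typing import Dict, Tuple


def split_genome(aligned_seq: str, boundary: int = 1626) -> Tuple[str, str]:
    """Return (region_i, region_ii) for one aligned genome.

    region_i  = positions 1..boundary (1-based, inclusive)
    region_ii = positions boundary+1..end
    """
    return aligned_seq[:boundary], aligned_seq[boundary:]


def low_recombination_dataset(
    aligned_genomes: Dict[str, str], boundary: int = 1626
) -> Dict[str, str]:
    """Keep region i only, dropping sequences identical to one already kept."""
    seen = set()
    region_i_only = {}
    for name, seq in aligned_genomes.items():
        region_i, _ = split_genome(seq, boundary)
        if region_i in seen:
            continue  # identical in region i: exclude the duplicate
        seen.add(region_i)
        region_i_only[name] = region_i
    return region_i_only


# Toy example: 8-column "genomes" split after column 4.
toy = {"iso_a": "ACGTAAAA", "iso_b": "ACGTCCCC", "iso_c": "ACGAGGGG"}
toy_low = low_recombination_dataset(toy, boundary=4)
```

In the toy example, `iso_a` and `iso_b` differ only in region ii, so they collapse to a single entry once that region is removed.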

Fig. 1 Genome structure of human and avian HBV. The outer circle represents the human HBV genome, while the inner circle depicts the avian hepadnavirus genome. Individual ORFs, the recombination “hot spot,” and the region of low recombination rate are also indicated (although, because of the variable distribution of recombination break-points, the demarcation of these regions is only approximate).

Analysis of Evolutionary Dynamics

Overall rates of evolutionary change (nucleotide substitutions per site per year) and the tMRCA for each data set were estimated using the Bayesian MCMC approach available in the BEAST package (http://www.evolve.zoo.ox.ac.uk/Beast/) (Drummond and Rambaut 2003). In all cases, we employed the most general GTR+I+Γ4 model of nucleotide substitution, as this consistently provides the best description of viral evolution. Data sets were also analyzed under demographic models of (a) constant population size, (b) exponential population growth, (c) expansion population growth, and (d) logistic population growth. For all models, we used both strict and relaxed (uncorrelated exponential) molecular clocks. In each case, MCMC chains were run for sufficient time to achieve convergence (assessed using the TRACER program: http://www.evolve.zoo.ox.ac.uk/software.html?id=tracer), and models were compared using Akaike’s information criterion (AIC), with uncertainty in parameter estimates reflected in the 95% highest posterior density (HPD) interval.
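For readers unfamiliar with HPD intervals, the following sketch shows one common way to compute a 95% HPD interval from a set of posterior samples: take the narrowest window that contains 95% of the sorted samples. This is the generic shortest-interval definition and is not claimed to reproduce the exact routine used by TRACER; the function name and the toy samples are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def hpd_interval(samples: Sequence[float], mass: float = 0.95) -> Tuple[float, float]:
    """Shortest interval containing at least `mass` of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < mass <= 1.0:
        raise ValueError("mass must be in (0, 1]")
    xs = sorted(samples)
    n = len(xs)
    # Number of samples the interval must cover; the small epsilon guards
    # against floating-point noise in mass * n.
    k = max(1, math.ceil(mass * n - 1e-9))
    best_lo, best_hi = xs[0], xs[k - 1]
    for i in range(n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


# Toy, right-skewed "posterior": the single large value is left outside
# the 80% interval because excluding it gives the narrowest window.
toy_lower, toy_upper = hpd_interval([0.0, 1.0, 2.0, 3.0, 100.0], mass=0.8)
```

For skewed posteriors, such as those typical of substitution rates and tMRCA values, the HPD interval is not symmetric about the mean, which is why lower and upper bounds are reported separately in the text.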

Analysis of Avian Hepatitis B Virus

For the 35 avian hepadnavirus genome sequences collected herein, we performed the same process of recombinant identification as with the human HBV data. Strikingly, the majority of recombination events in the avihepadnaviruses occurred in the same (homologous) region as in human HBV (Fig. 1) (although these regions are numbered differently in the GenBank accessions). Hence, we were again able to construct a “low recombination rate” data set reflecting approximately 50% of the viral genome for the avian viruses in the same manner as for the human HBV sequences. These data were then analyzed using the same Bayesian coalescent methods as in the analysis of the human hepadnavirus data sets.

Phylogenetic Analysis

Maximum likelihood (ML) phylogenetic trees were inferred both for the complete genome data sets of 76 human and 35 avian HBV sequences and for the low recombination rate region of 108 human HBV and 35 avian HBV sequences. In all cases phylogenetic trees were estimated using the ML method implemented in the PAUP* package (Swofford 2003), assuming the most general GTR+I+Γ4 model of nucleotide substitution. A neighbor-joining bootstrap resampling analysis (1000 replications) was performed to assess the support for specific nodes, again utilizing the ML substitution model. Finally, average pairwise distances among data sets were calculated under the Tamura-Nei model with a Γ-distribution (shape parameters taken from the ML phylogenetic analysis) using the MEGA package (Kumar et al. 2004).
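To make the notion of an average pairwise distance concrete, the sketch below computes the mean of all pairwise distances in an alignment. As a deliberate simplification it uses the uncorrected p-distance (the proportion of differing sites) rather than the Tamura-Nei model with a Γ-distribution used in the analysis, so its values would generally be lower than model-corrected distances for divergent sequences. Function names and the toy alignment are hypothetical.

```python
from itertools import combinations
from typing import Sequence

VALID_BASES = set("ACGT")


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, ignoring gaps and ambiguity codes."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in VALID_BASES and y in VALID_BASES:
            compared += 1
            if x != y:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def mean_pairwise_distance(seqs: Sequence[str]) -> float:
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment of three 4-bp sequences.
toy_pi = mean_pairwise_distance(["ACGT", "ACGA", "ACGT"])
```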

Results and Discussion

Recombination in Human and Avian HBV

The six methods contained within the RDP2 package identified different subsets of HBV sequences as recombinants (Supplementary Table 1). On average, across all six methods, 41% of the 145 human HBV sequences were identified as putative recombinants, suggesting that this process occurs frequently in the hepadnaviruses. The highest number of recombinants (92) was identified by the RDP method, while Chimaera gave the lowest number (27). Notably, nine sequences were identified as recombinant by all six methods, all of which belong to genotype B. A further seven genotype B sequences were identified as recombinant by five of the six methods, four of which were sampled from a Vietnamese family following mother-to-child transmission (Hannoun et al. 2000). Overall, we found only 21 sequences with no detectable recombination and 29 identified as recombinant by only one method.

In the case of the 35 avian hepatitis B viruses, no genome sequence was entirely clear of recombination. Two sequences were identified as recombinant by all six methods, with GENECONV identifying the highest number of recombinant sequences (n = 30). Although recombination is clearly frequent within avian HBV, the small number of HBV sequences available necessitated that all were retained for further analysis.

Perhaps the most notable result of our recombination analysis was its bias toward a specific genomic region (Table 1). In human HBV, approximately 80% of the recombination break points were located in a region (approximately 50%) of the viral genome that covers the pre-S1 region, the pre-C/C gene, the second half of the X gene, and the first part of the polymerase gene (Fig. 1). Hence, our results suggest that this region comprises a “hot spot” for recombination. A similar recombination hot spot was identified in the avian hepadnaviruses and covers orthologous gene regions (pre-S gene and partial S gene, pre-C/C gene, and the first part of the polymerase gene) as human HBV, although avian viruses do not contain the X gene (Fig. 1). We therefore designate the remaining ~50% of the HBV genome as a “low recombination rate” region, although (i) sporadic recombination events may still occur in this region, and (ii) the varied distribution of recombination break-points across the HBV genome means that the demarcation of these regions is only approximate.

Table 1 Analysis of recombination frequency in human and avian HBV using RDP2

There are a number of reasons why recombination could occur more frequently in our putative recombination hot spot. First, it could be subject to weaker selective constraints than other regions, such that recombinants are more likely to survive the selective sieve once they are generated. However, this scenario seems unlikely for two reasons: the recombination hot spot has a high concentration of overlapping reading frames, which will increase selective constraints, and the average pairwise distance (π) in this region is not substantially higher than the genome average (0.121 versus 0.105 and 0.215 versus 0.183 for the human and avian viruses, respectively), whereas a clearly elevated π would be expected under relaxed selective constraints. Alternatively, it may be that recombination rates are higher in the hot spot due to increased positive selection, such that recombination accelerates the generation of advantageous genotypes, although adaptive evolution is only rarely documented in HBV (Fares and Holmes 2002). Finally, it is possible that recombination rates reflect patterns of genome modularity, such that most recombination events occur at gene boundaries (Martin et al. 2005b; Simmonds and Midgley 2005). Although the use of overlapping reading frames in HBV undoubtedly complicates this phenomenon, it is notable that the notional boundary of the recombination hot spot in human HBV is within one nucleotide of the boundary of the P ORF and that the low recombination rate region roughly coincides with the single-stranded (ss) DNA region of the HBV genome (Fig. 1).

Rates of Evolutionary Change in Human HBV

After recombinant sequences were excluded, we were left with a data set of 76 human HBV genome sequences, from which we could estimate rates of nucleotide substitution and the tMRCA using a Bayesian coalescent approach. Given the inherent difficulties in inferring evolutionary rates in HBV, we focus not only on the mean rates of evolutionary change, but also on the lower 95% HPD values, as these represent the lowest (i.e., most conservative) plausible rates of viral evolution.

In all cases except the concatenated nonoverlapping regions of the viral genome, a model of exponential population growth had the highest posterior probability (detailed results presented in Supplementary Table 3). For the nonoverlapping regions of the HBV genome, a logistic growth model provided the best fit to the data. Further, in all cases the strict molecular clock was rejected as a valid description of the evolutionary dynamics of HBV in favor of a relaxed (uncorrelated) molecular clock. Under the best-fit models the mean evolutionary rate of the 76 complete genomes (global human HBV data) was 7.72 × 10^-4 subs/site/year, with a lower 95% HPD value of 8.36 × 10^-5 subs/site/year. These rates are approximately one order of magnitude higher than those estimated previously and fall closer to the substitution rates observed in RNA than DNA viruses (Hanada et al. 2004; Jenkins et al. 2002). Notably, broadly similar (and high) mean substitution rates were observed in different data subsets: 2.60 × 10^-4 subs/site/year for the low recombination rate region of the HBV genome (lower 95% HPD = 3.22 × 10^-5 subs/site/year) (Fig. 2) and 3.73 × 10^-4 subs/site/year (lower 95% HPD = 4.68 × 10^-5 subs/site/year) and 4.30 × 10^-4 subs/site/year (lower 95% HPD = 1.16 × 10^-4 subs/site/year) for genotypes C and D, respectively (Table 2).

Fig. 2 Phylogenetic tree of the low recombination rate region (~50%) of the human HBV genome (n = 108). The tree is midpoint rooted for clarity only and all horizontal branch lengths are drawn to a scale of nucleotide substitutions per site. Bootstrap support values (>90%) are shown for key nodes and the genotypes of human HBV are indicated. The year of sampling is given in the isolate name.

Table 2 Substitution rates and tMRCA for human and avian HBV sequences

We next considered evolutionary rates in the overlapping and nonoverlapping regions of the HBV genome. As expected, the average pairwise distance of the concatenated nonoverlapping regions (π = 0.14) is approximately 1.5 times greater than that of the overlapping regions (π = 0.09), indicating that the latter are subject to stronger selective constraints. Likewise, the substitution rates in the nonoverlapping regions are approximately twice those of the overlapping regions (Table 2) but still close to those estimated for the remainder of the viral genome. The π value for the concatenated third codon positions from the nonoverlapping regions was greater still (0.48), and these sites produced the highest rate of evolutionary change observed in our analysis of human HBV, at 9.63 × 10^-4 subs/site/year, although with very wide HPD values (lower 95% HPD = 4.13 × 10^-5 subs/site/year). Finally, it is noteworthy that our estimates of mean substitution rates under the strict molecular clock (although a less favored description of the data) range from 2.29 × 10^-5 to 1.57 × 10^-4 subs/site/year, so that they are both lower and closer to those HBV rates estimated previously.

Our observation that HBV evolves relatively rapidly is not unexpected given its reliance on reverse transcriptase-mediated replication, in which intrinsic mutation rates of ~0.2 per genome per replication have been reported (Drake et al. 1998). Thus, although arranging genomes into overlapping reading frames is likely to reduce the long-term substitution (fixation) rate in HBV, it is unlikely to be sufficient to reduce this rate by the orders of magnitude required for the virus to have codiverged with hominoid species (Bollyky and Holmes 1999). Indeed, the evolutionary rate in the overlapping region of the viral genome is only about half that in the nonoverlapping region.
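The scale of the discrepancy can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a strict clock with no saturation, so that two lineages separated for T years differ by roughly d = 2rT substitutions per site, and asks what rate r would be needed for the genome-average pairwise distance reported above (π = 0.105) to have accumulated over T years. The value T = 10^6 years is an illustrative stand-in for “many millions of years,” not an estimate from this study, and the function names are hypothetical.

```python
import math


def required_rate(mean_pairwise_distance: float, divergence_time_years: float) -> float:
    """Rate r (subs/site/year) such that d = 2 * r * T under a strict clock."""
    if divergence_time_years <= 0:
        raise ValueError("divergence time must be positive")
    return mean_pairwise_distance / (2.0 * divergence_time_years)


def orders_of_magnitude_gap(estimated_rate: float, needed_rate: float) -> float:
    """log10 of the ratio between the estimated and the needed rate."""
    return math.log10(estimated_rate / needed_rate)


PI_GENOME = 0.105          # genome-average pairwise distance, human HBV (from the text)
ESTIMATED_RATE = 7.72e-4   # mean whole-genome rate, subs/site/year (from the text)
ASSUMED_T_YEARS = 1.0e6    # illustrative assumption only

needed = required_rate(PI_GENOME, ASSUMED_T_YEARS)
gap = orders_of_magnitude_gap(ESTIMATED_RATE, needed)
```

Under these simplifying assumptions the rate needed is about four orders of magnitude below the mean rate estimated here, far more than the roughly two-fold reduction seen in the overlapping regions.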

Evolutionary Dynamics of Avian Hepadnaviruses

Our study also considered, for the first time, the evolutionary dynamics of avian hepadnaviruses, focusing on the rates of evolutionary change and the tMRCA of the available samples. The evolutionary rates in avian HBV estimated here are similar to those observed in human HBV (Table 2). Specifically, our mean estimate for the substitution rate in the complete genome of avian HBV is 7.32 × 10^-4 subs/site/year (lower 95% HPD = 1.56 × 10^-5 subs/site/year) with an estimated tMRCA of 811 years (upper 95% HPD = 2335 years). Substitution rates were again lower in the low recombination rate region of the viral genome at 4.85 × 10^-4 subs/site/year (lower 95% HPD = 9.20 × 10^-6 subs/site/year), which resulted in a deeper mean tMRCA of 1351 years (upper 95% HPD = 5389 years) for the viruses sampled here. Further evidence for a recent evolutionary history in avian hepadnaviruses is that those isolates sampled from duck species (the major hosts of avian HBV) are intermixed with isolates sampled from other species, including geese and storks, indicative of frequent cross-species transmission (Fig. 3). However, it is also clear that a far larger sample of viruses is required to fully understand the evolutionary dynamics of avian hepadnaviruses.

Fig. 3 Phylogenetic tree of the avian hepadnaviruses (n = 35). The tree is midpoint rooted for clarity only and all horizontal branch lengths are drawn to a scale of nucleotide substitutions per site. Bootstrap support values (>90%) are shown for key nodes. The year of sampling is given in the isolate name.

The tMRCA of Human HBV

We next inferred the tMRCA of human HBV using the same Bayesian coalescent approach (Table 2). Under the strict (constant) molecular clock, estimates for the tMRCA of the sampled isolates of human HBV were highly variable, although most fell in the range 2000–4000 years. These dates are similar to those previously estimated for HBV, all of which similarly utilized a strict molecular clock (Fares and Holmes 2002; Hannoun et al. 2000; Okamoto et al. 1987; Orito et al. 1989; Osiowy et al. 2006). However, far more recent tMRCA estimates were obtained using our best-fit relaxed molecular clock: 229 years (upper 95% HPD = 580 years) for our 76 whole-genome sequence data set, 394 years (upper 95% HPD = 1127 years) for the low recombination rate region, 282 years (upper 95% HPD = 697 years) for the concatenated overlapping regions, 237 years (upper 95% HPD = 461 years) for the concatenated nonoverlapping regions, 477 years (upper 95% HPD = 1509 years) for the third codon positions from the concatenated nonoverlapping regions, 96 years (upper 95% HPD = 271 years) for genotype C, and, finally, 65 years (upper 95% HPD = 140 years) for genotype D. Thus, our estimate of the tMRCA for the low recombination rate region is almost twice that of the whole genome (as was also the case in the analysis of avian HBV), indicating that recombination, especially at the intergenotypic level, strongly affects estimates of both the rate of nucleotide substitution and the age of genetic diversity. Hence, it is clearly important to purge recombinant sequences from any evolutionary analysis of HBV gene sequence data. As a corollary, we suggest that the results from the low recombination rate region of the HBV genome are the more reliable.

Although there is considerable variation in our estimates of the tMRCA of human HBV, the oldest possible estimate from our analysis (under a nonfavored model) was 7228 years (third codon positions, nonoverlapping regions under a model of constant population size, and a relaxed molecular clock; upper 95% HPD = 23,610 years). Although this represents a deeper time estimate than most studies of HBV evolution, it is still clearly incompatible with the hypothesis that the timescale of HBV evolution follows patterns of hominoid speciation and the subsequent dispersal of anatomically modern humans. Further, the mix of human and nonhuman primate isolates on phylogenetic trees of HBV provides strong evidence against the codivergence theory (Starkman et al. 2003). As such, the results presented here strongly suggest that the evolutionary history of HBV is one in which cross-species transmission has been the norm, with the current distribution of relatively rapidly evolving genotypes following patterns of modern human dispersal. Finally, a similarly shallow evolutionary history was observed in the case of avian HBV, although the implications of this recent history are less clear and the small sample size means that the full effects of recombination have yet to be accounted for.