Introduction

Phylogenetic analyses have been used extensively in the molecular epidemiology of HIV-1: for tracing the epidemic spread worldwide (http://www.unaids.org, [Holmes et al. 1995]), for dating the origin of the HIV-1 epidemic (Korber et al. 2000; Salemi et al. 2001), or for investigating the HIV-1 transmission in several cases of uncertain history (Ou et al. 1992; Holmes et al. 1993; DeBry et al. 1993; Hillis and Huelsenbeck 1994; Albert et al. 1994; Blanchard et al. 1998; Metzker et al. 2002). Additionally, phylogenetic analyses were used to test the ability of different methods to reconstruct a known HIV-1 transmission network (Leitner and Albert 1996) and to estimate the HIV-1 nucleotide substitution rate (Leitner et al. 1997; Leitner and Albert 1999).

Phylogenies of known transmission histories provide a framework to improve our knowledge about the evolution of the HIV-1 virus after sequential transmissions and, thus, a way to trace epidemiological relationships in cases with unknown links. Given a number of uncertainties, most notably that the transmission of a population of viruses makes it unclear what the “true tree” really is (Leitner and Fitch 1999), it is important to know the levels of discordance between the evolutionary tree (inferred from DNA sequences) and the transmission tree (based on epidemiological information) using different phylogenetic inference methods, as well as how they are affected by the sequence data in hand. Until now, this issue has been addressed only once in HIV-1 viral sequences with known transmission histories (Leitner and Albert 1996).

In this study we phylogenetically analyzed a known HIV-1 transmission network belonging to CRF04_cpx in the Thessalonica area in northern Greece (Paraskevis et al. 2002; Papa et al. 2002), for which the transmission events and dates were accurately known. CRF04_cpx, initially designated subtype I (Kostrikis et al. 1995), is one of the 16 HIV-1 circulating recombinant forms representing recombinant HIV-1 genomes that have infected three or more persons who are not epidemiologically related (Robertson et al. 2000). CRF04_cpx was documented in Cyprus and Greece and it was found to be comprised of subtypes of at least five distinct groups, A, G, H, K, and unclassified regions (Gao et al. 1998; Nasioulas et al. 1999; Paraskevis et al. 2001).

More specifically, we compared the levels of discordance between the inferred phylogenies and the transmission history in different genomic regions, using maximum likelihood (ML) and Bayesian methods; we examined whether phylogenetic discordance in different genomic regions could be due to recombination; and, finally, we investigated how accurately we could estimate the dates of the transmission events using a Bayesian method.

Materials and Methods

Study Population

Viral strains were isolated from six HIV-1-infected individuals as shown in Table 1. The history of transmission events for all the HIV-1 patients infected by a common source is shown in Fig. 1.

Table 1 Characteristics of the study population
Figure 1

Schematic representation of the three-generation HIV-1 transmission network from a single source (S). Squares and circles denote males (m) and females (f), respectively. The first and second suffixes denote the generation of the transmission and the sequential number of the patients, respectively. Filled symbols correspond to patients for whom biological samples were available and HIV RNA was detectable in plasma, whereas gray symbols correspond to samples in which HIV RNA was undetectable. Transmission dates are shown in parentheses.

RNA Extraction and cDNA Synthesis

RNA was extracted from plasma samples using the Total RNA Isolation Kit (Ambion Inc., Austin, TX) and cDNA was synthesized using the GeneAmp RNA PCR Kit (Perkin Elmer, USA), according to the manufacturer’s recommendations.

PCR

Two fragments of approximately 400 and 660 bp from the C2–C4 of gp120 and p17/5′ p24 of gag, respectively, were amplified by a single-step PCR using HIV-1 specific primers as described previously (Paraskevis et al. 2001).

Cloning and Nucleotide Sequencing

The PCR products from partial gag were cloned into the pCRII plasmid (TA Cloning Kit; Invitrogen Corp., San Diego, CA). An individual clone and the PCR products for partial gag and partial env, respectively, were sequenced on a VGI automated DNA sequencer using the Cy5/Cy5.5 Dye Primer Kit (Visible Genetics Inc., Toronto, Ontario, Canada), according to the manufacturer’s instructions.

Sequence Analysis

DNA sequence alignment was performed using Clustal W version 1.74 (Thompson et al. 1994). The best-fitting nucleotide substitution model was chosen according to the Akaike information criterion (AIC; Akaike 1973) among 64 different models using the Modeltest (Posada and Crandall 1998) and PAUP*4.0b10 (Swofford 1998) programs. Phylogenetic analysis was accomplished using ML with the best-fitting evolutionary model as implemented in PAUP*. Bootstrapping was performed using neighbor joining (NJ) or ML (1000 replicates) to assess the reliability of the obtained topologies. Phylogenetic trees were also obtained using Bayesian inference with the general time reversible (GTR) model including Γ-distributed rate heterogeneity among sites as implemented in MrBayes v. 3.0 (Huelsenbeck et al. 2001). For Bayesian inference, four Markov chains run for 10^6 generations with a burn-in of 2 × 10^4 were used to reconstruct the consensus tree. Dissimilarities between the transmission tree and the inferred topologies, between the ML and Bayesian inferred trees, and between the ML topologies inferred in gag and env were quantified using the quartets option in COMPONENT (Page 1993). To assess any significant differences between candidate topologies in the gag and env regions, we used the approximately unbiased (AU) test (Shimodaira 2000) as implemented in CONSEL (Shimodaira and Hasegawa 2001). For each region, sitewise log-likelihoods were estimated for the candidate trees using PAUP* with the best-fitting evolutionary model.
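As an illustration of the AIC-based model choice described above, the following minimal sketch ranks a handful of candidate models by AIC = 2k − 2 ln L. The log-likelihood values are made up for illustration, the parameter counts assume that branch lengths are not counted, and the function name `aic` is hypothetical; this is not the Modeltest implementation.

```python
def aic(log_likelihood: float, n_free_params: int) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_free_params - 2 * log_likelihood

# Hypothetical log-likelihoods (NOT values from the study) and the number of
# free substitution-model parameters, assuming branch lengths are not counted.
candidates = {
    "JC69": (-1520.4, 0),
    "HKY85": (-1489.7, 4),   # kappa + 3 free base frequencies
    "GTR": (-1486.9, 8),     # 5 free rates + 3 free base frequencies
    "GTR+G": (-1471.2, 9),   # GTR + gamma shape parameter
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best_model = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} AIC = {score:.1f}")
print("Best-fitting model under AIC:", best_model)
```

A richer model is preferred only when its gain in log-likelihood outweighs the penalty of two units per additional parameter.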

Simulations

Nucleotide sequences in partial gag and env were simulated (100 replicates) over the ML estimated branch lengths and evolutionary parameters, as implemented in the Evolver program of PAML (Yang 1997).

Molecular Clock Calculations

Estimation of the substitution rate was performed by simultaneously inferring population and substitution parameters and tree topology using Bayesian inference as implemented in BEAST v. 1.0.3 (Drummond et al. 2002; Drummond and Rambaut 2003), with a separate substitution and evolutionary model (GTR+Γ) for the gag and env partitions. Five separate MCMC runs were made for 5 × 10^6 generations, with a burn-in of 5 × 10^5.
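Posterior samples from MCMC runs of this kind are usually summarized, after discarding the burn-in, by a mean and a 95% highest posterior density (HPD) interval. The sketch below shows one common way to compute an empirical HPD interval as the shortest window containing the requested fraction of samples. The function names and the toy chain are hypothetical, and this is not BEAST's own code.

```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Shortest interval that contains at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

def summarize_chain(chain, burn_in):
    """Discard the first `burn_in` samples, then return (mean, HPD interval)."""
    kept = chain[burn_in:]
    mean = sum(kept) / len(kept)
    return mean, hpd_interval(kept)

# Toy chain of substitution-rate samples (synthetic, for illustration only).
random.seed(1)
toy_chain = [random.gauss(2.5e-3, 1.0e-3) for _ in range(5000)]
mean_rate, (lower, upper) = summarize_chain(toy_chain, burn_in=500)
print(f"mean = {mean_rate:.2e}, 95% HPD = {lower:.2e} to {upper:.2e}")
```

Unlike an equal-tailed interval, the HPD interval need not be symmetric around the mean, which matters for skewed posteriors such as those of node ages.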

Results

Viral Isolates

In the current study, we characterized six newly identified CRF04_cpx isolates in addition to the four sequences which have been described previously (Kostrikis et al. 1995; Gao et al. 1998; Nasioulas et al. 1999; Paraskevis et al. 1999) (Table 1). The virus isolated from individual f1.5 in 1991 was classified as CRF04_cpx according to phylogenetic analysis of the partial reverse transcriptase (RT) region, which was initially examined for monitoring genotypic resistance to antiretroviral drugs.

Transmission Network

Subject f1.5 attended the AIDS Clinic at AHEPA University Hospital in Thessalonica, and according to detailed epidemiological contact tracing, she belonged to a transmission network of 7 females who were infected with HIV-1 by a common source and further spread the infection to at least 12 persons. All 19 subjects acquired the HIV-1 infection heterosexually. The history of the transmission events was reconstructed by interviews with physicians and is depicted in Fig. 1. Overall, the transmission events consisted of three distinct “spread generations”: 7 females who were initially infected by a common source (first generation), 11 males (second generation) who acquired the infection from the females of the previous generation, and 1 female (third generation) who was infected by a male from the second generation (Fig. 1). The dates of the first documented HIV-positive sample of the patients included in this study are shown in Table 1. The “source” subject was a male infected by an unknown source, who was hospitalized for the first time in 1993 with progressive multifocal encephalopathy and died 1 month later. The HIV-1-seropositive status was initially documented in 1993 during the hospitalization period. Blood samples for PCR amplification were available for 10 subjects, whereas the rest either had died or were no longer attending AHEPA University Hospital. Among the 10 individuals for whom blood samples were available, PCR amplification was attempted only for the 6 in whom HIV RNA was detectable in plasma. Among these six individuals, PCR was unsuccessful for f1.4.1999 in partial gag, and this sample was therefore excluded from phylogenetic analysis. HIV-1 DNA sequences were obtained directly from the PCR-amplified products in the C2–V3 region of env or after DNA cloning for partial gag spanning p17 and the 5′-terminus of p24.

Phylogenetic Analyses

Figure 2 depicts the transmission history of the CRF04_cpx sequences as reconstructed from the transmission events, including information on isolation dates of viral samples. We investigated the levels of discordance between transmission and evolutionary trees constructed by phylogenetic analyses using partial gag (660 nt), env (327 nt), and the concatenated alignment of gag+env. Phylogenetic trees were inferred by ML using the best-fitting model (Table 2) and by Bayesian inference using the general time reversible (GTR) model including Γ-distributed rate heterogeneity among sites.

Figure 2

Tree representing the true transmission history of viral samples collected within the transmission network according to a detailed epidemiological tracing of HIV-1 infections. Internal branches in black represent first-generation transmission events by the source. Dashed lines indicate branches corresponding to viral sequences from patients belonging to the second generation of infections. The first part of the f1.2.1999 branch, until 1993, represents the history of the virus in the source.

Table 2 Best-fitting evolutionary model for different genomic regions including a different set of sequences

According to the phylogenetic analysis in partial gag using ML and Bayesian methods, the dissimilarity between the quartets of the transmission tree and those of the inferred tree was 0.87 (Table 3). More specifically, one misclassified branch between f1.5.1999 and m2.5.1999 was obtained, supported by a bootstrap value of 72% using ML (Fig. 3A) and a Bayesian posterior probability (PP) of 0.95 (Fig. 3B). In phylogenetic trees inferred by both methods there was an additional misclassified branch between f1.3.1999 and m2.6.1999, albeit with low support (Fig. 3A and B). Phylogenetic analysis in partial env showed a lower dissimilarity index between the transmission and the inferred tree than in partial gag (0.47 vs 0.87, respectively; Table 3): the link between the donor (f1.5.1999) and the recipient (m2.6.1999) was accurately reconstructed, whereas a misclassified branch was obtained between f1.3.1999 and m2.5.1999 (Fig. 3C and D). The f1.5.1999/m2.6.1999 cluster inferred by ML received 80 and 64% bootstrap support using the NJ and ML methods (BPNJ and BPML), respectively, similar to the support for the misclassified branch between f1.3.1999 and m2.5.1999 (Fig. 3C).

Table 3 Quantitative quartet comparisons between the transmission tree and the maximum likelihood (ML) estimated trees (Fig. 2) and between the ML and the Bayesian inferred trees in partial gag, env, and gag+env
Figure 3

Phylogenetic analysis (A, B) in partial gag, (C, D) in partial env, and (E, F) in gag+env using ML with the best-fitting evolutionary model (tree in upper part) or Bayesian inference (tree in lower part). Numbers at the tree nodes (upper part) outside and inside parentheses indicate bootstrap values obtained from 1000 replicates using ML and NJ, respectively. Numbers at the tree nodes inferred by the Bayesian method indicate posterior probabilities. Arrows indicate clusters that match the transmission tree.

Unlike partial gag and env, the combined gag+env region resulted in an improved estimate of the transmission history (Fig. 3E and F), as indicated by the lower dissimilarity score of 0.27 (Table 3). The only discrepancy between the inferred and the transmission tree was the indirect relationship between the two noncontemporaneous sequences isolated from a single individual (f1.5.1991, f1.5.1999). The donor–recipient relationship between f1.5.1999 and m2.6.1999 was accurately reconstructed, albeit with low support (Fig. 3E and F). The low levels of support between f1.5.1999 and m2.6.1999 (posterior probability of 0.43 and 32% bootstrap by ML) (Fig. 3E and F) could be explained by the very short time interval (12 months) in which the infection of patients f1.5.1999, m2.6.1999, and m2.5.1999 occurred (Table 1), suggesting that the evolutionary distance between the viral sequences at that time was very small. This is in accordance with the very short branch separating the node of the clade f1.5.1999–m2.6.1999 from that of m2.5.1999, as well as the high PP support (0.99) for the common clade connecting f1.5.1999, m2.6.1999, and m2.5.1999 (Fig. 3F).

We should note that in the two regions (gag and env) examined here, along with the combined data set of both regions, the trees inferred by the ML and Bayesian methods were highly similar, as indicated by comparisons among all possible quartets using COMPONENT (Table 3).
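To make the quartet comparison concrete, the sketch below computes a simple quartet dissimilarity: the fraction of four-taxon subsets whose induced topology differs between two trees. Trees are written as nested tuples, the taxon labels A to E are hypothetical, and the function names are invented for illustration. COMPONENT reports several quartet-based measures, so this should be read as one plausible variant rather than the exact statistic in Table 3.

```python
from itertools import combinations

def clades(tree):
    """Leaf sets of all internal nodes of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found

def quartet_topology(tree_clades, quartet):
    """Pairing (e.g. AB|CD) induced on four taxa, or None if unresolved."""
    q = set(quartet)
    for clade in tree_clades:
        inside = q & clade
        if len(inside) == 2:  # the clade splits the quartet two against two
            return frozenset([frozenset(inside), frozenset(q - inside)])
    return None

def quartet_dissimilarity(tree_a, tree_b):
    """Fraction of quartets with different induced topologies in two trees."""
    clades_a, clades_b = clades(tree_a), clades(tree_b)
    taxa = sorted(max(clades_a, key=len))  # the root clade holds every taxon
    quartets = list(combinations(taxa, 4))
    differing = sum(
        quartet_topology(clades_a, q) != quartet_topology(clades_b, q)
        for q in quartets
    )
    return differing / len(quartets)

# Hypothetical five-taxon trees (not the trees from the study).
reference_tree = (("A", "B"), ("C", ("D", "E")))
inferred_tree = (("A", "C"), ("B", ("D", "E")))
print(quartet_dissimilarity(reference_tree, inferred_tree))
```

A value of 0 means every quartet is resolved identically in both trees; larger values indicate that more four-taxon relationships disagree.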

The confidence of the inferred topology in gag+env, reflected by the number of trees within the 95% credible set (95% CI), was higher than in the phylogenetic trees estimated in gag and env separately, where a larger number of inferred trees fell within the 95% CI (3 vs 12 and 10, respectively) (Fig. 3B, D, and F). Unlike ML, which infers the single most likely tree, the Bayesian method provides a set of best trees: given a large enough number of generations, the frequency of sampled trees reflects the frequency of those trees in the posterior probability distribution. Thus, the posterior probability of any sampled topology is the probability of this particular topology conditional on the data (Huelsenbeck et al. 2001). Finally, a credible set of trees (95% CI) can be defined as the trees with the highest posterior probabilities whose cumulative probability reaches 0.95. Thus, for a genomic region, the smaller the number of trees with cumulative posterior probability 0.95, the higher the confidence of the inferred topologies for this particular region.
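The construction of a 95% credible set described above can be sketched as follows. The sampled topologies are represented by placeholder strings (in practice, canonical Newick strings would be needed so that identical topologies compare equal), the counts are invented, and the function name `credible_set` is hypothetical; this is not MrBayes code.

```python
from collections import Counter

def credible_set(sampled_topologies, mass=0.95):
    """Most frequent topologies whose cumulative frequency reaches `mass`."""
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    chosen, cumulative = [], 0
    for topology, count in counts.most_common():  # highest frequency first
        chosen.append((topology, count / total))
        cumulative += count
        if cumulative >= mass * total:
            break
    return chosen

# Hypothetical post-burn-in sample of 100 topologies (placeholders T1..T4).
sample = ["T1"] * 70 + ["T2"] * 20 + ["T3"] * 8 + ["T4"] * 2
trees_in_set = credible_set(sample)
print(len(trees_in_set), "trees in the 95% credible set:", trees_in_set)
```

The size of the returned list is the quantity compared across genomic regions: fewer trees in the set means the posterior is concentrated on fewer topologies.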

Testing for Phylogenetic Discordance Between gag and env

The significance of the dissimilar topologies in partial gag and env was also tested by the approximately unbiased (AU) test (Shimodaira 2000), according to which the gag and env topologies were significantly rejected for the env and gag alignment regions, respectively (p < 0.05) (Table 4). Phylogenetic discordance between gag and env may have been caused by recombination between different viral lineages in the donor or by other mechanisms, such as differential selective pressure between gag and env. Interestingly, we similarly found that in the Swedish transmission network (Leitner et al. 1996), the topologies of env and gag were significantly rejected for the gag and env alignment regions, respectively (p < 0.05). However, as in the results we present here, the similarity score between the inferred and the transmission trees increased progressively from gag to env to gag+env.

Table 4 AU test for assessing the significance of different trees in different pieces of the alignment

To examine further whether the high similarity index in gag+env could be observed by chance, by combining the gag and env regions, we concatenated 100 simulated data sets for partial gag and env and then phylogenetically analyzed them separately using the ML method. In this case, the consensus tree (based on the inferred topologies of the 100 concatenated data sets) was identical to the topology inferred in env but not in gag+env: The f1.5.1999/m2.6.1999 cluster was observed at a frequency of 84%, and the m2.5.1999/f1.3.1999 and f1.5.1991/f1.2.1999 clusters at a frequency of 73 and 62%, respectively. Thus, after concatenating two data sets simulated according to the topology and branch lengths inferred in gag and env, the phylogenetic signal is biased toward the more dominant gene signal (partial env).

Molecular Clock Calculations

To estimate the date of the root of the tree and to examine whether the divergence dates of the coalescent events correspond to the transmission dates, we estimated the substitution rate by simultaneously inferring population and substitution parameters and tree topology using Bayesian inference as implemented in BEAST v. 1.0.3 (Drummond et al. 2002; Drummond and Rambaut 2003). More specifically, the divergence dates were estimated in the topology inferred in gag+env using a separate substitution and evolutionary model (GTR+Γ) for the gag and env partitions. All independent MCMC runs converged to almost identical values for all parameters (data not shown). The mean substitution rates (combined runs) for gag and env were estimated as 2.48 × 10^-3 (95% highest posterior density interval [HPD]: 6.3 × 10^-4 to 4.67 × 10^-3) and 6.55 × 10^-3 (95% HPD: 1.88 × 10^-3 to 1.21 × 10^-2) substitutions per site per year, respectively, which were almost identical to those estimated previously for the Swedish transmission network (Leitner and Albert 1999). The mean date of the most recent common ancestor (MRCA) of the tree (t_root) was estimated as 1985.1 (95% HPD: 1973.5–1991).

Interestingly, for patients with known infection dates the coalescent events were dated close to the transmission events: the f1.5.1999/m2.6.1999 node as 1987.5 (95% HPD: 1974.4–1995), which was close to the infection date (12/1989) (Table 1), and the f1.2.1999/f1.3.1999 cluster as 1989.3 (95% HPD: 1981.6–1995.3) (infection date of f1.3: 6/1992). In both cases, the MRCA estimates predated the respective transmissions, but the transmission dates fell within the credible intervals.
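The comparison above can be reproduced as simple arithmetic. The sketch below converts the reported infection dates to decimal years and checks them against the reported MRCA estimates and 95% HPD intervals. The mid-month conversion is an assumption made here for illustration (the source gives only month and year), and the helper name `decimal_year` is hypothetical.

```python
def decimal_year(month: int, year: int) -> float:
    """Decimal year at mid-month (an assumption; only month/year are given)."""
    return year + (month - 0.5) / 12

# MRCA estimates, 95% HPD bounds, and infection dates as reported in the text.
nodes = {
    "f1.5.1999/m2.6.1999": {
        "mrca": 1987.5, "hpd": (1974.4, 1995.0),
        "transmission": decimal_year(12, 1989),
    },
    "f1.2.1999/f1.3.1999": {
        "mrca": 1989.3, "hpd": (1981.6, 1995.3),
        "transmission": decimal_year(6, 1992),
    },
}

results = {}
for name, node in nodes.items():
    lag = node["transmission"] - node["mrca"]  # years by which MRCA predates
    low, high = node["hpd"]
    inside = low <= node["transmission"] <= high
    results[name] = (lag, inside)
    print(f"{name}: MRCA predates transmission by {lag:.1f} years; "
          f"transmission within 95% HPD: {inside}")
```

Under this conversion, both MRCA estimates precede the known transmission dates by roughly 2.5 and 3.2 years, and both transmission dates lie inside the corresponding HPD intervals.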

Moreover, the MRCA of the f1.5.1999/m2.6.1999/m2.5.1999 cluster was estimated as 1986.9 (95% HPD: 1973.6–1994.9), and for the f1.2.1999/f1.3.1999/f1.5.1991 branch it was estimated as 1986.6 (95% HPD: 1979–1991).

Discussion

In this study we examined the levels of discordance between the transmission and the evolutionary trees, inferred by ML and Bayesian methods in different genomic regions, for HIV-1 CRF04_cpx sequences isolated from patients belonging to a well-documented transmission network. Furthermore, we explored the possibility of recombination between gag and env and its potential to bias phylogenetic inference in the concatenated alignment.

All the HIV-1 sequences isolated from patients that, according to their personal interview, were infected by a common source were classified as CRF04_cpx, in agreement with epidemiological information for a common source of infection. The transmission network included at least 19 HIV-infected subjects, of whom 6 were characterized by phylogenetic analysis to belong to the CRF04_cpx group. For all these patients, except for the f1.5.1991 sample, viral RNA was isolated approximately 10 years after the infection dates that occurred within the time period between December 1988 and December 1989. The only exceptions were patients f1.2 and f1.3, who were infected in 1993 and 1992, respectively. Thus, most of the infections occurred in a short period of time, and viral sequences were isolated distantly from infection dates; therefore it was challenging to investigate whether the transmission history was in accordance with the evolutionary tree inferred by phylogenetic analysis.

Using two different approaches we found that the similarity index between the transmission tree and the inferred topologies increased for partial gag, env, and gag+env, which was in accordance with a previous study (Leitner and Albert 1996). These findings could be explained by the fact that in env, which evolves faster than gag, several substitutions were possibly accumulated between the transmission events, thus rendering the env region phylogenetically more informative than gag.

The inferred topology in gag+env was 73% identical to the transmission tree, which was in accordance with the Swedish transmission network (Leitner and Albert 1996). This kind of incongruence might have been caused by any differences between the viral phylogeny and the transmission events. For instance, branch nodes that reflect the separation of different viral lineages do not necessarily correspond with the time period of transmission events (Leitner and Fitch 1999). In our case, it seems that the viral lineage sampled in 1991 from patient f1.5 (f1.5.1991) was distantly related to the virus sampled in 1999 from the same individual (f1.5.1999) and the virus transmitted to her recipient (m2.6.1999), a finding which is consistent with an earlier estimated coalescent event of the f1.5.1991/f1.2.1999/f1.3.1999 branch node compared to the f1.5.1999/m2.6.1999 node.

Thus our findings based on phylogenetic analysis using ML and Bayesian methods provide evidence that the transmission history is almost entirely compatible with the evolutionary tree of the virus, given that enough phylogenetic signal is present in the alignment. We should stress that in our case, first, most of the transmission events occurred in a very short time interval, and second, viral sequences were isolated distantly from the infection dates, thus probably rendering the reconstruction of the transmission tree rather complex. One of the issues that still remains open is how the ability of phylogenetic methods to identify clusters of epidemiologically linked individuals among unrelated patients is affected by these parameters.

Interestingly, we found evidence for significant phylogenetic discordance between the gag and the env regions, in a similar way as in the Swedish transmission network, which might have been caused by recombination or other mechanisms. To examine whether the high similarity score between the transmission and the inferred tree in gag+env could be observed by chance, which would be the case if recombination had occurred between gag and env, thus biasing the phylogenetic inference in the concatenated alignment, we performed the following analysis. We reconstructed the phylogeny in gag+env for 100 replicates simulated according to the evolutionary trees in gag and env, separately. In this case, we found evidence that the particular topology in gag+env was not obtained by combining two random segments simulated according to the evolutionary process in gag and env. Thus, although the hypothesis of recombination cannot be entirely excluded, it provides a less parsimonious explanation for the phylogenetic incongruence between gag and env. On the other hand, differential selective pressure in these genes might have caused this discrepancy, or alternatively the limited number of accumulated mutations in gag, within the short time interval in which most of the infections occurred, may seriously diminish the ability to infer the evolutionary process in this region correctly. The latter is in accordance with the relatively low confidence of the phylogenetic tree in partial gag.

Given that the real tree is unknown in most phylogenetic reconstruction problems, the way that the different measures of phylogenetic confidence reflect accuracy is of crucial importance, since in molecular phylogenetics, apart from the levels of confidence of inferred trees, we also need to know how close we are to the truth. Interestingly, in our case the overall confidence of the inferred trees, reflected by the number of trees within the 95% CI, gradually increased from gag to env to gag+env and directly correlated with the similarity score between the inferred topologies and the transmission tree. Assuming that we are closer to the true tree in gag+env, this means in turn that the overall confidence of the tree may provide a better estimate of phylogenetic accuracy than PP or BPML alone, both of which may be misleading for assessing phylogenetic accuracy, especially in genomic regions where there is limited phylogenetic signal. However, it is important to stress that in the case of a transmission network the transmission tree may not be entirely compatible with the evolutionary tree, which suggests that we cannot draw general conclusions about how phylogenetic confidence reflects accuracy in this case.

Dating of the transmission events, using a recently developed Bayesian method, was accurate in at least two cases where direct comparisons could be made between estimated and actual transmission events, although the credible intervals of the estimated dates were wide. In our case we also observed differences between the estimated and the actual transmission dates (the pretransmission interval), as described previously (Leitner and Albert 1999). This can be explained by the population diversity in the donor patient, which suggests that the timing of a divergence between sequences in two different patients can only be a maximum estimate of transmission time (Shankarappa et al. 1999).