Keywords

1 Introduction

A number of potentially life-threatening infectious diseases are caused by RNA viruses, including human immunodeficiency virus (HIV), hepatitis C virus (HCV), influenza and Ebola. RNA viruses have a relatively high mutation rate due to both their error-prone replication process and the lack of sophisticated repair mechanisms [1]. Consequently, they rapidly evolve and exist as a set of non-identical but closely related genetic variants, known as a viral quasispecies. Viral populations can readily adapt to dynamic environments and develop resistance to antiviral drugs and vaccines, which makes the design of effective and long-lasting treatments for RNA viral diseases exceedingly difficult [2]. Determining the structure of viral populations helps the understanding of viral diseases and provides guidance in the development of effective medical therapeutics. Quasispecies spectrum reconstruction (QSR) aims to assemble individual haplotype sequences in a population and estimate their prevalence using sequencing reads generated from a sample containing a set of viral variants. High-throughput next-generation sequencing (NGS) technologies have enabled affordable acquisition of data needed to assemble quasispecies. However, relatively short length of the NGS reads and the presence of errors in sequencing data render the QSR problem difficult. The QSR problem is particularly challenging when the strains in a viral population are highly similar, i.e., the sequences are characterized by low mutual genetic distances, and further exacerbated if some of those strains are relatively rare [3].

Several software tools for solving the QSR problem by analyzing NGS data have been developed in recent years. ShoRAH [4], the earliest publicly available such software, was developed by combining a path cover based approach and probabilistic clustering in [5, 6], respectively, and applied to analysis of HIV data [7]. Read-graph approach was the basis for ViSpA [8], developed as a variant of the network flow method proposed in [9]. [10], proposed a combinatorial method for QSR and the resulting software, QuRe, was provided by [11]. An approach that resulted in the software package PredictHaplo [12] relied on a Dirichlet Process mixture model and was developed specifically targeting HIV population reconstruction; QuasiRecomb [13] is based on a hidden Markov model that explicitly models recombination events. In [14], a benchmarking study that compares the performance of several publicly available quasispecies reconstruction softwares was presented. The study demonstrated that none of the tested methods could reconstruct populations characterized by low pairwise distance between the haplotype sequences. Following this study other softwares, including HaploClique [15], based on max-clique enumeration of a read alignment graph, and VGA [16], a graph-coloring based heuristic method, were developed. Most recently, a reference-assisted de novo assembly pipeline, ViQuaS, was proposed in [17]. ViQuaS extends an existing algorithm, QuRe [10], and outperforms various other techniques on a wide range of dataset. However, performance of these more recent methods deteriorates dramatically in the scenarios where the genetic diversity of a population is low [3].

Both [3, 14] have pointed out that the existing methods for viral quasispecies reconstruction struggle in the scenarios where the populations are characterized by low diversity. This, in part, is due to the presence of relatively long genetic regions that are common to pairs of closely related viral sequences; clearly, this makes distinguishing different strains challenging. The problem becomes even more difficult when the frequency of one (or more) of the close strains is low; in such settings small genetic distances may be confused for sequencing errors and hence remain undetected. Such failures to detect may have serious consequences in antiviral treatment studies since undetected strains cannot be properly targeted for drug and vaccine design. It has been shown that even the viral strains existing at low frequencies can cause a drug treatment failure due to their resistance to the drug [18, 19]. Therefore, complete recovery of the composition of viral populations is of critical importance for effective antiviral therapies.

In this paper, we propose a novel QSR algorithm, aBayesQR (combining agglomerative hierarchical clustering and Bayesian inference), that overcomes limitations of the existing methods and reliably reconstructs quasispecies characterized by low diversity. The algorithm performs reconstruction of a quasispecies from next-generation sequencing (NGS) data in two stages. In the first stage, conflict-free short reads are hierarchically merged and assembled into longer sequences (contigs) which we refer to as super-reads. In the second stage, likelihoods of the probable quasispecies are computed using the assembled super-reads (rather than using the original set of short reads), and the most likely set of viral strains is selected. Note that the super-reads synthesized in the first stage of aBayesQR allow us to distinguish between closely related strains which share long genetic regions as well as reduce the search space and enable computational tractability of the Bayesian inference conducted in the second stage. The second stage of aBayesQR involves sequential pruning of the solution space; in particular, the likely set of partial viral strains comprising n single nucleotide variants (SNVs) is generated by extending previously inferred partial viral strains having \(n-1\) SNVs. The number of sequences in a set (i.e., the size of a viral population) is dynamically updated at each step by evaluating quality of the set of partially reconstructed viral strains, and ultimately precisely inferred at the end of the search process. The relative frequencies of each strain are determined by counting the numbers of reads unambiguously associated with each of the reconstructed strains. Our tests on both simulated and experimental data demonstrate superior performance compared to state-of-the-art methods for quasispecies reconstruction. In particular, it is shown that unlike the competing methods, aBayesQR is capable of detecting and reliably reconstructing viral haplotypes having very small mutual genetic distances.

2 Proposed Method

Our algorithm for inferring spectrum of a viral population consists of the following two steps: (1) constructing super-reads by hierarchically clustering aligned paired-end reads, (2) inferring the most likely quasispecies from the set of super-reads and estimating the frequencies of the strains in the quasispecies.

2.1 Super-Reads Construction via Agglomerative Clustering

In the first stage of aBayesQR, paired-end reads uniquely mapped to a reference genome are grouped into super-reads via agglomerative hierarchical clustering. This is facilitated by a weighted graph \(G=(\mathcal {V},\mathcal {E})\) which is constructed and recursively updated as the clustering proceeds. In particular, each vertex of G is associated with a cluster collecting reads that originated from a single strain in a quasispecies; we denote the set of reads in the \(i^{th}\) cluster (i.e., the cluster associated with the \(i^{th}\) vertex) as \(V_{i} = \{v_{i}^{j}, j=1,\cdots ,|V_{i}|\}\). Let \(sr_{i}\) denote a consensus sequence (i.e., a super-read) constructed from the reads in \(V_{i}\). The \(i^{th}\) and \(j^{th}\) vertex of G are connected by an edge \(e_{ij}\in \mathcal {E}\) if all the reads in \(V_{i}\) and \(V_{j}\) (or, equivalently, \(sr_{i}\) and \(sr_{j}\)) are conflict-free and an overlap criterion, specified later in this subsection, is satisfied. The weight \(w_{ij}\) of the edge \(e_{ij}\) is a measure of similarity between \(V_{i}\) and \(V_{j}\) at each step, the algorithm merges a pair of vertices connected by the edge having the largest weight to form a new vertex and agglomerates the corresponding clusters.

The alleles at homozygous sites, common to all the components of a quasispecies, are not utilized in the reconstruction procedure. Instead, we separate reads having originated from different strains by clustering them using heterogeneous sites with reliable SNV information. An SNV information is considered reliable if the relative abundance of the allele is above a pre-determined threshold, as in [20]; alleles whose abundance is below the threshold are treated as sequencing errors and disregarded in the process of clustering. For convenience, let us denote the set of pre-processed paired-end reads by \(R = \{r_{i},i=1,\cdots ,|R|\}\). The agglomerative clustering is initialized with |R| clusters, one for each read; in other words, we start with \(V_{1}=r_{1},\cdots ,V_{|R|}=r_{|R|}\), implying that \(|\mathcal {V}|=\sum \limits _{i=1}^{|\mathcal {V}|} |V_{i}| = |R|\), and proceed by sequentially merging judiciously chosen pairs of vertices (i.e., agglomerating the corresponding clusters). Intuitively, it is meaningful to reduce the number of vertices in the graph by merging those associated with conflict-free consensus sequences that have a large overlap. To formalize this, let \(L_{i} = \{l_{1},\cdots ,l_{|L_{i}|}\}\) denote an index set of the SNV positions covered by \(sr_{i}\), let \(L_{i \cap j} = \{l_{1},\cdots ,l_{|L_{i \cap j}|}\}\) be the index set of SNV positions covered by both \(sr_{i}\) and \(sr_{j}\), and let \(L_{i \cup j} = \{l_{1},\cdots ,l_{|L_{i \cup j}|}\}\) be the index set of SNV positions covered by either \(sr_{i}\) or \(sr_{j}\). Then the pairs of vertices (ij) that we consider as candidates for merging and thus connect by an edge are those satisfying either

$$\begin{aligned} |L_{i \cap j }| \ge \theta \cdot |L_{ i \cup j }| \quad \text {or} \quad |L_{i \cap j }| = \min (|L_{i}|,|L_{j}|), \end{aligned}$$

where the \(2^{nd}\) condition promotes merger of short super-reads, and the choice of \(\theta \) is discussed below. To quantify uncertainty inherent to a clustering solution due to existence of non-overlapping positions among the reads in each cluster, we define a position-specific confidence score

$$\begin{aligned} score_{i}[l] = \frac{cr_{i}[l]-cr[l]}{1-cr[l]} \end{aligned}$$

where l denotes the position, \(cr[\cdot ]\) is the overall coverage rate, and \(cr_{i}[\cdot ]\) denotes cluster-specific coverage rate for \(V_{i}\) (i.e., \(cr_{i}[l]\) is the fraction of reads in \(V_{i} = \{v_{i}^j, j = 1,\cdots ,|V_{i}|\}\) covering position l). On the one hand, this score is penalized at a site where the fraction of cluster members (short reads) covering the site is low; the score is negative if the cluster-specific coverage rate is below the global coverage rate which implies uncertainty of the clustering decision. On the other hand, positive scores indicate high confidence in the decision to group the reads into the same cluster. Note that the highest possible score of 1 at position l is achieved when all the reads in a cluster cover the \(l^{th}\) position. Using the confidence scores, we define the weight \(w_{ij}\) assigned to an edge \(e_{ij}\) to quantify similarity between \(V_{i}\) and \(V_{j}\) as

$$\begin{aligned} w_{ij} = \frac{1}{ |L_{ i \cup j } | } \sum \limits _{l\in L_{i \cup j}} score_{i \cup j}[l]. \end{aligned}$$

Given the weights \(w_{ij}\), we can now specify the clustering procedure. In each step, the pair of vertices connected by the edge with maximum weight is merged; the newly constructed vertex inherits edges from the merged vertices and the weights on those edges are re-evaluated. A new (longer) consensus sequence is constructed by combining the two super-reads associated with the merged vertices; recall that there are no conflicts between the super-reads being merged. If after such an update step no edges connect the new vertex with the rest of the graph (because no inherited edges satisfy the connectivity condition), \(\theta \) is decreased and the above process is repeated. We initially set \(\theta \) to 0.9 and gradually decrease it by 0.1 while \(\theta>\) 0. The above procedure is repeated until no pairs of vertices satisfy the connectivity condition. By that point, a set of long consensus sequences (the final super-reads) has been formed from the clusters of reads associated with the nodes of the final graph. While the complexity of agglomerative clustering is, in general, \(O(N^{3})\) where N denotes the input data size [21], it has been shown that its time complexity can be reduced to \(O(N^2)\) with accuracy equal to that of the brute-force method by using the partial maximum array technique [22]. We exploit this to efficiently construct super-reads. The algorithm for super-read construction is formalized as Algorithm 1.

figure a

2.2 ML Reconstruction of Quasispecies from Super-Reads

Here we describe how to reconstruct the most likely set of strains in a viral quasispecies using super-reads from Sect. 2.1 and their confidence scores. While in principle the method outlined in this section could be applied directly to the short reads provided by a sequencing platform, such an approach would in general not only be computationally prohibitive due to a very large number of short reads but also limit the ability of the algorithm to distinguish strains with small mutual genetic distances due to having long conserved regions. Relying on a relatively small number of long super-reads constructed from short reads circumvents both of these problems and makes the reconstruction more accurate and practically feasible. Note that sequencing errors may undesirably prevent clusters of reads from being merged with other clusters due to a violation of conflict-free requirement; consequently, a set of short reads in a small cluster is likely to have a disproportionate amount of sequencing errors. For this reason, we ignore clusters with very small memberships (in particular, those containing fewer than \(0.001 \cdot |R|\) reads), which limits the detection of strains to those constituting more than \(0.1\%\) of the quasispecies.

Let \(\mathcal {C}=\{C_{m},m=1,\cdots ,M\}\) denote the collection of clusters that remain after deleting clusters having only few reads; moreover, for convenience let us re-label the reads in \(C_{m}\) as \(c_{m}^j\), i.e., \(C_{m}=\{c_{m}^{j},j=1,\cdots ,|C_{m}|\}\) where \(c_{m}^{j} \in R\). We organize the super-reads obtained by Algorithm 1 in Sect. 2.1 into the rows of an \(M \times N\) matrix \(\mathbf {S}=\{s_{mn}, m=1,\cdots ,M, n=1,\cdots ,N\}\) with entries \(s_{mn} \in \{\mathsf {A},\mathsf {C},\mathsf {G},\mathsf {T},-\}\) where − denotes a site not covered by a super-read and N denotes the total number of SNV sites in the strains of a quasispecies. A nucleotides in the (mn) position of \(\mathbf {S}\) is assigned confidence \(score_{m}[n]\) defined in Sect. 2.1; the scores for the entire matrix are normalized so that they fall between 0 and 1 in order to use them in our Bayesian approach to assembly. Let \(\varepsilon _{mn}\) be the probability that \(s_{mn}\) was estimated erroneously due to either a sequencing error in reads on the \(n^{th}\) SNV position or the uncertainty induced by reads not covering the \(n^{th}\) SNV position. Note that negative scores indicates low confidence resulting from insufficient cluster-specific coverage rate while positive scores imply relatively confident information. In order to map \(score_{m}[n] \in (-\infty ,1]\) to the set [0, 1], we set \(\varepsilon _{mn} = 1-e^{score_{m}[n]}\) for \(score_{m}[n] < ln(1-\epsilon )\), where \(\epsilon \) denotes the error rate of a sequencing platform. Otherwise, we set \(\varepsilon _{mn} = \epsilon \).

Let \(Q=\{q_{k},k=1,\cdots ,K\}\) denote the set of K strains of a viral quasispecies. The goal in the second stage of our method is to determine Q from the super-reads matrix \(\mathbf {S}\) using a probabilistic framework. An exhaustive search over the entire solution space is computationally intractable even for small \(\mathbf {S}\); instead, we reconstruct the set of K viral strains sequentially, extending partially estimated strains one SNV position at each step. Since maintaining and extending all possible partial strains inevitably increases their number exponentially, unlikely sets of candidate strains are pruned in each step. Each step consists of three basic parts: (a) extension of the partially reconstructed strains, (b) selection of probable sets comprising K strains chosen among those generated in step (a), and (c) evaluation of the quality of the selected sets of strains and an update of K. The sequential Bayesian inference procedure in step t is illustrated in Fig. S1 in Appendix A.

Extending Partially Reconstructed Strains. Let \(F_{1:t\text {-}1} = \{f_{1:t\text {-}1}^{i},i=1,\cdots ,\) \(|F_{1:t\text {-}1}|\}\) be the collection of partially reconstructed strains covering the first \(t-1\) SNV sites and let \(B_{t} = \{b_{t}^{j}, j=1,\cdots ,|B_{t}|\}\) be the lists of distinct bases in the \(t^{th}\) column of \(\mathbf {S}\), where \(b_{t}^{i} \in \{\mathsf {A},\mathsf {C},\mathsf {G},\mathsf {T}\}\) and \( 2 \le |B_{t}| \le 4\). Then, all the possible extensions of \(f_{1:t-1}^{i}\) to the SNV site t can be enumerated as \(\{[f_{1:t\text {-}1}^{i},b_{t}^{1}],\cdots ,[f_{1:t\text {-}1}^{i},b_{t}^{|B_{t}|}]\}\). Let \(S_{1:t\text {-}1}^{i} = \{s_{1:t\text {-}1}^{i_{c'}},c' = 1,\cdots ,|S_{1:t\text {-}1}^{i}| \}\) be the collection of super-reads covering some of the first t SNV sites which are consistent with \(f_{1:t\text {-}1}^{i}\) (ignoring “−” in \(s_{1:t\text {-}1}^{i_{c'}}\)) where \(\{i_{c'}\}\) denote indices of rows of \(\mathbf {S}\) that are placed in \(S_{1:t\text {-}1}^{i}\), and let \(S_{t}^{i}=\{s_{t}^{i_{c}},c = 1,\cdots ,|S_{t}^{i}| \}\) denote the collection of nucleotides (\(s_{t}^{i_{c}} \in \{\mathsf {A},\mathsf {C},\mathsf {G},\mathsf {T}\}\), not “−”) observed at the \(t^{th}\) SNV site of the super-reads in \(S_{1:t\text {-}1}^{i}\) where \(\{i_{c}\}\) denote the indices of rows in \(\mathbf {S}\) that contribute to \(S_{t}^{i}\). Given \(S_{1:t\text {-}1}^{i},S_{t}^{i}\) and \(f_{1:t\text {-}1}^{i}\), the probability of \(b_{t}^{j}\) being the true extension of \(f_{1:t\text {-}1}^{i}\) is given by

$$\begin{aligned} P(S_{t}^{i}|b_{t}^{j},S_{1:t\text {-}1}^{i},f_{1:t\text {-}1}^{i}) = \prod \limits _{c=1}^{|S_{t}^i|} P(s_{t}^{i_{c}}|b_{t}^{j}), \end{aligned}$$
$$\begin{aligned} P(s_{t}^{i_{c}}|b_{t}^{j}) =\left\{ \begin{array}{ll} 1-\varepsilon _{i_{c} t}, &{} \text {if } b_{t}^{j}=s_{t}^{i_{c}} ,\\ \frac{\varepsilon _{i_{c} t}}{|B_{t}|}, &{} \text {otherwise.}\\ \end{array} \right. \end{aligned}$$

We extend \(f_{1:t\text {-}1}^{i}\) to \([f_{1:t\text {-}1}^{i},b_{t}^{j}] \in F_{1:t\text {-}1,t}\) by appending the \(b_{t}^{j} \in B_{t}\) which satisfies \(\frac{P(S_{t}^{i}|b_{t}^{j}, S_{1:t\text {-}1}^{i},f_{1:t\text {-}1}^{i})^{\frac{1}{|S_{t}^{i}|}}}{\sum \limits _{B_{t}} P(S_{t}^{i}|b_{t}^{j}, S_{1:t\text {-}1}^{i},f_{1:t\text {-}1}^{i})^{\frac{1}{|S_{t}^{i}|}}} \ge \delta _{0}\), where the exponent ensures proper normalization and is needed since the number of super-reads, \(|S_{1:t\text {-}1}^{i}|\), varies for each \(\{f_{1:t\text {-}1}^{i},i=1,\cdots ,|F_{1:t\text {-}1}|\}\). For \(f_{1:t\text {-}1}^{i}\) which has no matched super-reads, i.e., \(|S_{1:i\text {-}1}^{i}|=0\), we keep all of \(|B_{t}|\) possible extensions of \(f_{1:t}^{i\text {-}1}\). By collecting probable extensions for each \(f_{1:t\text {-}1}^{i} \in F_{1:t\text {-}1}\), we obtain the set of partial strains stretching over the first t SNV sites, \(F_{1:t\text {-}1,t}\). This procedure is formalized as function ExtendFrag in Appendix A.

Inferring Likely Sets of K Partial Strains. Having generated the probable partial strains \(F_{1:t\text {-}1,t}\), we denote the set of all its possible subsets of K strains (i.e., the quasispecies population candidates) as \(\mathcal {Q}_{1:t\text {-}1,t}=\{Q_{1:t\text {-}1,t}^{i}, i=1,\cdots ,\left( {\begin{array}{c}|F_{1:t\text {-}1,t}|\\ K\end{array}}\right) \}\) where \(Q_{1:t\text {-}1,t}^{i}=\{q_{kn}^{i},k=1,\cdots ,K,n=1,\cdots ,t\}\) and \(q_{kn}^{i} \in F_{1:t\text {-}1,t}\). The log-likelihoods of \(Q_{1:t\text {-}1,t}^{i}\) can be expressed as

$$\begin{aligned} \text {ln} P(\mathbf {S}|Q_{1:t\text {-}1,t}^{i})&= \sum \limits _{m=1}^M \text {ln} P(s_{m\cdot }|Q_{1:t\text {-}1,t}^{i}), \\ P(s_{m\cdot }|Q_{1:t\text {-}1,t}^{i})&= \frac{1}{K} \Bigg ( \sum \limits _{k=1}^{K}\bigg (\prod \limits _{n=1}^{t} P(s_{mn}|q_{kn}^{i}) \bigg ) \Bigg ), \end{aligned}$$

where \(s_{m\cdot }\) denotes the \(m^{th}\) row vector of the matrix of super-reads \(\mathbf {S}\) and

$$\begin{aligned} P(s_{mn}|q_{kn}^{i}) =\left\{ \begin{array}{ll} 1-\varepsilon _{m n}, &{} \text {if }q_{kn}^{i}=s_{mn} ,\\ \frac{\varepsilon _{m n}}{|B_{n}|}, &{} \text {if }q_{kn}^{i} \ne s_{mn} \quad \text {for} \quad s_{mn} \ne - .\\ \end{array} \right. \end{aligned}$$

Let \(Q_{1:t}^{max}= \max \limits _{Q_{1:t\text {-}1,t}^{i} \in \mathcal {Q}_{1:t\text {-}1,t}}P(\mathbf {S}|Q_{1:t\text {-}1,t}^{i})\). Among the \(\left( {\begin{array}{c}|F_{1:t\text {-}1,t}|\\ K\end{array}}\right) \) sets in \(\mathcal {Q}_{1:t\text {-}1,t}\), we keep only those that satisfy \(P(\mathbf {S}|Q_{1:t\text {-}1,t}^{i}) > \delta _{1} \cdot Q_{1:t}^{max}\) while the others are discarded; let us denote the collection of candidate sets that pass this test as \(\mathcal {Q}_{1:t}\). For practical feasibility of the scheme, the collection of partially reconstructed strains \(F_{1:t\text {-}1,t}\) is trimmed by excluding from it all the strains that are not part of at least one of the sets in \(\mathcal {Q}_{1:t}\); we denote the resulting collection of partial strains by \(F_{1:t}\in F_{1:t\text {-}1,t}\) and use it when extending the strains onto the \({t+1}\) SNV site. The described procedure is formalized as function InferQuasi in Appendix A.

Determining the Number of Strains K in a Quasispecies. In this step, we assess appropriateness of K used in the inference of \(\mathcal {Q}_{1:t}\) and update it if necessary. To this end, we rely on the minimum error correction (MEC) score which has previously been broadly used as a criterion in the design of methods for haplotype assembly [23, 24]. In the context of polyploid haplotype assembly, the MEC score is defined as the smallest number of nucleotides that needs to be changed in data (i.e., in observed reads) so that the corrected reads are consistent with having originated from K haplotypes. Let \(HD_{t}(\cdot ,\cdot )\) denote the Hamming distance between two sequences counted over the observed nucleotides in the first t SNV positions.Footnote 1 Then the MEC score of the most likely set \(Q_{1:t}^{max}\)of K viral strains evaluated on the first t SNVs is

$$\begin{aligned} \text {MEC}_{t}(K) = \sum \limits _{m=1}^{M} \min \limits _{k \in \{1,\cdots ,K\}} \sum _{j=1}^{|C_{m}|}HD_{t}(c_{m}^{j}, q_{k \cdot }^{max}), \end{aligned}$$

where \(q_{k \cdot }^{max}\) is the \(k^{th}\) row vector of \(Q_{1:t}^{max}\). Let \(N_t\) be the total number of nucleotides observed in the first t SNV positions of all the reads of the dataset. Note that the smaller the MEC scores, the higher the accuracy of a clustering. If \(\text {MEC}_{t}(K)/N_t < 2 \epsilon \), we use the same value K in the next step where the likely set of viral strains stretching over the first \(t+1\) SNV positions is inferred. Otherwise, we increase K by 1, repeat the estimation of \(\mathcal {Q}_{1:t}\), and evaluate the improvement rate of MEC score as

$$\begin{aligned} MECimpr(K)&= \frac{\text {MEC}_{t}(K)-\text {MEC}_{t}(K+1)}{\text {MEC}_{t}(K)}. \end{aligned}$$

The reason for selecting K based on the MEC improvement rate (MECimpr) is that the MEC score drops significantly once K matches the actual number of clusters; our scheme attempts to detect that change in order to infer population size. If \(MECimpr(K) > \eta \), where \(\eta \) denotes a pre-specified threshold, the number of species is updated as \(K \leftarrow \min \{K+n,|F_{1:t\text {-}1,t}|\}\) where n is the smallest integer number such that \(MECimpr(K+n) < \eta \). If \(MECimpr(K) < \eta \), we update the number of species as \(K \leftarrow \max \{K-n,2\}\) where n is the smallest integer such that \(MECimpr(K-n) \ge \eta \). The choice of threshold \(\eta \) is discussed in the Appendix B. The updated value of K is used for the inference of \(\mathcal {Q}_{1:t+1}\). Note that the probable set of viral strains, \(\mathcal {Q}_{1:t}\), is stored for each K to avoid performing redundant \(MECimpr(\cdot )\) calculations.

Once we obtain the most likely set of K viral sequences covering N SNVs, \(Q_{1:N}^{max}\), the full-length K quasispecies strains are reconstructed by inserting the consensus nucleotides observed in R into the non-SNV sites. We estimate relative frequencies \(p_k\), \(1 \le k \le K\), of quasispecies strains based on the Hamming distance between super-reads and the reconstructed sequences. In particular, for each super-read \(sr_{i}\) we determine the nearest assembled strain \(q_{j}\) where \(j = \arg \min \limits _{k \in \{1,\cdots ,K\}} HD(sr_{i},q_{k})\) and the number of reads involved in constructing the super-read \(sr_{i}\) is counted towards \(p_j\). The entire scheme proposed in this subsection is summarized as Algorithm 2.

figure b

3 Results and Discussion

3.1 Performance Comparison on Simulated Data

To evaluate performance of the proposed method for quasispecies reconstruction, we use metrics Recall, Precision, Predicted Proportion, and Reconstruction Rate. Recall is defined as the ratio of the number of correctly reconstructed strains to the total number of true strains in the quasispecies, i.e., Recall = \(\frac{TP}{TP+FN}\), while Precision is defined as the fraction of correctly reconstructed strains among all the assembled sequences, i.e., Precision = \(\frac{TP}{TP+FP}\). Noting that Precision usually reports high scores when the number of strains is underestimated while penalizing overestimation of the population size, we also report the ratio of the number of reconstructed sequences to the true population size, Predicted Proportion. The closer Predicted Proportion to 1, the more accurate the number of reconstructed strains. Moreover, to assess the degree of reconstruction accuracy, we define \(Reconstruction \,Rate = \frac{1}{K}\sum _{k=1}^{K}\bigg (1-\frac{HD(q_{k},\hat{q_{k}})}{G}\bigg ),\) where G is the length of a genome, K is the number of strains in a quasispecies and \(q_{k}\) and \(\hat{q_{k}}\) denote the \(k^{th}\) true strain and its nearest sequence among the K estimated ones, respectively. To assess the accuracy of estimated frequencies, we use Jensen-Shannon divergence (JSD) which quantifies similarity between two distributions. Given a true distribution P and its approximation Q, the Kullback-Leibler (KL) divergence \(D(P||Q)= \sum \limits _{i=1}^{n} P(i) \text {log} \frac{P(i)}{Q(i)}\) is undefined when \(Q(i) = 0\). JSD, a symmetrized and smoothed version of the KL divergence, circumvents this problem by defining similarity of P and Q as \(JSD(P||Q)= \frac{1}{2} D(P||M) + \frac{1}{2} D(Q||M),\) where M is defined as \(M=\frac{1}{2}(P+Q).\)

We compare our algorithm with publicly available ShoRAH [4], PredictHaplo [12], and ViQuaS [17]. Since ViQuaS is an extension of the algorithm in [10, 11], and was shown to have superior performance compared to its predecessor, we omit the comparison with the software QuRe in [10, 11]. It is worth pointing out that for the synthetic data sets we study, ShoRAH could not reconstruct strains in the regions where the simulated sequencing coverage is relatively low compared to the average, resulting in reconstruction of strains that are shorter than the true length G. To facilitate a fair comparison with ShoRAH, we aligned its reconstructed strains to the reference genome and completed missing sites with bases from the reference. ViQuaS, on the other hand, tends to reconstruct many more strains than actually present; thus we followed ViQuaS’s authors recommendation and retained only those having frequencies greater than \(f_{min}\) when calculating \(Precision \). Finally, not all of the synthetic data sets could be processed with PredictHaplo, preventing us from reporting its performance in some of the scenarios.

Table 1. Performance comparison of different methods for varied diversities (div) on simulated data. Performance comparison of aBayesQR, ShoRAH, ViQuaS and PredictHaplo in terms of Recall, Precision, Predicted Proportion (PredProp), Reconstruction Rate (ReconRate) and JSD on the simulated data with \(err=0.1\%\) and \(cov=500\times \) vs. div for a mixture of 5 and 10 viral strains. Averaged PredictHaplo results are reported if it provides answers for more than 50% of data sets. Boldface values indicate the best performance for each \(div(\%)\).

We generated synthetic datasets by emulating high-throughput sequencing of a viral population consisting of a number of closely related viral genomes having length of 1300bp; this particular length was chosen to coincide with the longest region of the HIV pol gene. Quasispecies sequences are generated by introducing independent mutations at uniformly random locations along the length of a randomly generated reference genome so as to obtain a predefined level of diversity (\(div\%\)), i.e., a predefined average Hamming distance between quasispecies strains. Simulating Illumina’s MiSeq data, \(2\times 250\)bp-long paired-end reads are sampled uniformly from each viral strain with a mean coverage of \(cov\times \) per strain. Inserts of the paired-end reads are on average 150bp long with standard deviation of 30. In our benchmarking tests, we focus on exploring the effects of diversity (\(div\%\)) on the accuracy of the quasispecies reconstruction. Two sets of viral populations are considered: (1) a mix of 5 viral strains with abundance levels \(50\%, 30\%, 15\%, 4\%\) and \(1\%\); and (2) a mix of 10 strains with abundance levels \(36\%,24\%,16\%,8\%,5.5\%,4\%,3\%,2\%,1\%\) and \(0.5\%\). Note that the abundances are chosen to approximately follow geometric distribution and that the populations include low abundant strains. For each combination of the parameters, 100 data sets were generated and the reported results were obtained by averaging over those data instances. For PredictHaplo, which did not produce results in each instance, the averaged results are reported if more than 50 instances were successfully processed.

In all of the following experiments, potential SNVs are called if their abundance is higher than 1%, which is set relatively high to avoid false positives (FPs); FPs prevent reads to be merged with existing clusters in Sect. 2.1. We execute the function ExtendFrag with parameter \(\delta _{0}=0.1\). Parameter \(\delta _{1}\) in function InferQuasi is initially set to 0.001, but adaptively increases if the number of combinations of partially reconstructed strains exceeds 10000; this is done to limit the number of likelihood calculations performed in each run of InferQuasi. The recommended value of \(\eta \), a threshold used to determine population size K based on MECimpr( \(\cdot \) ), is discussed in Appendix B.

We compare performances of aBayesQR, ShoRAH, ViQuaS and PredictHap when applied to the reconstruction of a quasispecies spectrum with diversity levels varying between \(1\%\) and \(5\%\) (i.e., \(div\in \{1\%,2\%,3\%,4\%,5\%\}\)). To test the ability of different methods to reconstruct quasispecies with low diversity, we assume low sequencing error rate of \(err=0.1\%\) (median mismatch error rates for 454 Life Sciences and Illumina platforms are 0.1% and 0.12%, respectively [25]). Coverage per strain \(cov\times \) is set to \(500\times \), implying total coverage of \(2500\times \) and \(5000\times \) for the 5-strain and 10-strain population, respectively; strains having frequencies 0.23% or higher in the 5-strain case and those with frequencies 0.46% or higher in the 10-strain case are covered with probability 0.99 [5].

Table 1 demonstrates that the proposed aBayesQR algorithm outperforms existing schemes. In terms of Recall and Precision, aBayesQR exhibits exceptionally good performance compared to competing methods when reconstructing quasispecies strains with diversity \(div < 4\%\). The performance of ViQuaS deteriorates at low diversities in terms of most of the criteria (i.e., Recall, Precision, Predicted Proportion and JSD). PredictHaplo could not perform reconstruction in most of the low diversity instances yet it overall achieves the highest Precision because it typically underestimates the number of strains as shown by Predicted Proportion (e.g., estimating only 2–3 out of 10 strains), which is in agreement with the results reported by a previous study [14]. Among all methods, ShoRAH has the lowest performance in terms of Recall and Precision. As indicated by Predicted Proportion, aBayesQR is the most accurate method in terms of estimating the population size although it often misses a strain with the lowest frequency when applied to reconstruction of a quasispecies consisting of 10 strains. ViQuaS and ShoRAH typically overestimates the number of strains especially at low diversity levels. aBayesQR is the best method in terms of Reconstruction Rate at all levels of diversity. In terms of frequency estimation, aBayesQR overall outperforms all the other methods whereas PredictHaplo shows the highest JSD due to its drawback of underestimating the number of strains. Note that both ViQuaS and ShoRAH exhibit significantly increased (i.e., deteriorated) JSD at low diversity levels. This fact, along with the low Recall and Precision scores they have in low diversity settings, indicates that state-of-the-art methods experience major difficulties when attempting to reconstruct viral quasispecies in those settings, as also observed in [5, 14, 17].

We further study the effects that sequencing error rate (\(err\%\)) and coverage per strain (\(cov\times \)) have on the performance of the algorithms. Those results are reported in Table S2 and S3 in the Appendix C, demonstrating superiority of aBayesQR as compared to the competing methods. The runtimes of the tested algorithms are shown in Table S4 in the Appendix C.

3.2 Performance Comparison on Real HIV Data

To further test the performance of our proposed method, we employ it for the analysis of the HIV 5-virus-mix dataset published in [20]. Specifically, we apply our algorithm to reconstruct an in vitro generated quasispecies population consisting of 5 known HIV-1 strains: HIV-\(1_{HXB2}\), HIV-\(1_{89.6}\), HIV-\(1_{JR-CSF}\), HIV-\(1_{NL4-3}\) and HIV-\(1_{YU2}\). Compared to the simulated data set, relative frequencies of the 5 HIV-1 strains are more evenly distributed (about 10–30%) and the pairwise distances between strains are higher (2.61–8.45%) [20]. We use the \(2 \times 250\)bp-long paired-end reads provided by Illumina’s MiSeq Benchtop Sequencer. The reads are aligned to the HIV-\(1_{HXB2}\) reference genome; the reads shorter than 150nt and those having bases with quality scores less than a PHRED threshold of 60 are discarded. We compare the performance of our method applied to gene-wise quasispecies reconstruction of the above described HIV data with that of the competing techniques. Since the current version of ViQuaS software does not support specifying genomic regions, we could not use it in this experiment. When running aBayesQR, we set the parameter \(\eta \) to 0.09 (the setting recommended in Appendix B). Other parameters are set to the same values as the ones used in Sect. 3.1.

Table 2. Performance comparisons on a real HIV-1 5-virus-mix data set. Predicted Proportion (PredProp) and Reconstruction Rate (RR) for aBayesQR, ShoRAH and PredictHaplo applied to reconstruction of HIV-\(1_{{\tiny \text {HXB2}}}\), HIV-\(1_{{\tiny \text {89.6}}}\), HIV-\(1_{{\tiny \text {JR-CSF}}}\), HIV-\(1_{{\tiny \text {NL4-3}}}\) and HIV-\(1_{{\tiny \text {YU2}}}\) for all 13 genes of the HIV-1 dataset. (note: RR are expressed in percentages.) Boldface values indicate the genes where all the strains are perfectly reconstructed. The inferred frequencies are shown in Table S5 in Appendix C.

We evaluate and report the Predicted Proportion (i.e., the fraction of correctly estimated strains as defined in Sect. 3.1) and Reconstruction Rate in Table 2. On this real HIV-1 data set which (as pointed above) has different properties than the simulated data set in Sect. 3.1, aBayesQR is the most accurate among the considered methods in terms of Predicted Proportion. PredictHaplo underestimates the population size and reconstructs three or four strains in the 8 considered genes and ShoRAH greatly overestimates the population size for all 13 genes of the HIV-1 data set (e.g., it reconstructs 119 strains in gp120), which is consistent with our simulation results as well as with the results in [14]. aBayesQR and PredictHaplo are tied for the number of genes where all the strains are perfectly reconstructed (5 each); for the remaining genes, PredictHaplo provides a larger number of perfectly reconstructed strains. However, it is worth pointing out that PredictHaplo, designed for identification of HIV haplotypes, missed at least one strain in each of the remaining 8 genes while aBayesQR reconstructed most of the strains on all but two genes, gp120 and gp41. ShoRAH did not perfectly reconstruct any of the 13 genes, which is consistent with the simulation results. Moreover, overestimating the number of strains negatively affects the accuracy of ShoRAH’s frequency estimation; for instance, the sum of frequencies corresponding to the most abundant 5 strains does not exceed 50% in 9 out 13 genes (71% is the largest such sum, on vpu) (see Table S5 in the Appendix C).

To complement the gene-wise quasispecies reconstruction study with that of a global reconstruction, we consider the HIV-1 gap-pol region spanning 4307bp. To efficiently process 355241 paired-end reads that remain after applying a quality filter, we organize the region into a sequence of windows of length 400bp where the consecutive windows overlap by 150bp and run aBayesQR on those windows. The entire region is assembled by connecting strains in the consecutive windows while testing consistency in the overlapping intervals. The number of strains retrieved in the global reconstruction is decided by majority voting of the number of strains obtained in each window. The frequencies are estimated by counting reads nearest (in terms of Hamming distance) to each of the reconstructed strains. Following this procedure, both aBayesQR and PredictHaplo could reconstruct all 5 HIV-1 strains in the gap-pol region correctly, i.e., they both achieved Reconstruction rate of 100 for all 5 strains and Predicted Proportion of 1. The frequencies estimated by aBayesQR are 15.21%, 19.34%, 25.56%, 27.61% and 12.27% while those estimated by PredictHap are 13.21%, 13.56%, 25.67%, 19.69% and 27.86%. ShoRAH highly overestimated the number of strains and reported Predicted Proportion of 41.8; its five most abundant strains estimated are reported to have frequencies 8.51%, 5.04%, 3.41%, 3.24% and 3.09%.

4 Conclusions

In this paper, we presented a novel maximum-likelihood based approximate algorithm for reconstructing viral quasispecies from high-throughput sequencing data. aBayesQR assembles paired-end short reads into longer fragments based on similarity of the read overlaps and the uncertainty level of non-overlapping regions. The probable sets of partially reconstructed strains are inductively searched and a subset of those strains is extended to efficiently deduce the most likely set of strains in a quasisepcies. Detection of the population size is embedded into the algorithm and is empirically shown to be very accurate; the number of strains is dynamically adjusted based on the reliability of the partially assembled quasispecies in each extension step. Performance of the developed method is tested on both synthetic datasets and a real HIV-1 dataset. In both settings, the new algorithm outperforms existing techniques in terms of accuracy of the quasispecies size estimation, perfect reconstruction of strains, proportion of correct bases in each reconstructed strain and the estimation of their abundance. A particularly high accuracy is observed in estimating the population size (i.e., the number of strains) and their relative abundance. Tests on synthetic datasets demonstrates that aBayesQR is capable of reconstructing quasispecies at low diversity, showing superior performance in those settings compared to state-of-the-art algorithms. Furthermore, the study on a real HIV-1 dataset demonstrates that our proposed algorithm outperforms or has performance comparable to that of the existing methods in the general setting of viral quasispecies reconstruction.

aBayesQR can be extended and applied to the problem of estimating the population size and the degree of variation among the constituent species in related fields such as immunogenetics. On a related note, bacterial populations are characterized by having relatively lower mutation rates than viral and thus typically have fewer segregating sites on the sequences in a population. The ability of our method to perform highly accurate reconstruction in such settings should be further investigated.

A software aBayesQR is available at https://github.com/SoyeonA/aBayesQR. An appendix can be found in a bioRxiv version of this paper which is available at http://biorxiv.org/content/early/2017/01/27/103630.