Background

Humans are diploid, i.e., their chromosomes, except the sex chromosomes in males, come in two copies, one inherited from the mother and the other from the father. At most positions, the two copies of a chromosome pair carry the same nucleotide. According to the International HapMap Consortium (2007), about one in every 1000 nucleotides differs between the two chromosomes of an individual or between individuals in a population [1–4]. If this variation is statistically meaningful (i.e., the site differs in at least 5 percent of people), the site is called a single nucleotide polymorphism (SNP). Most SNPs are bi-allelic, i.e., only two of the four possible nucleotides appear in almost all of the population. The high-frequency allele is called major and encoded as 0, and the low-frequency allele is called minor and encoded as 1 [5]. Sites at which the nucleotides of the two chromosome copies are identical across the whole population are ignored, and the SNPs are listed sequentially. The resulting binary string is called a haplotype.

Haplotypes have many applications, including disease prediction [6], association studies [7–9], and drug design [10]. Unfortunately, the presence of sequencing errors makes it difficult and expensive to experimentally examine a single chromosome. This has motivated researchers to design algorithms for obtaining haplotypes from aligned reads. To deal with errors when looking for the best reconstruction of haplotypes, we have to fix an objective function for evaluating candidate haplotypes and solve an optimization problem (called the haplotype assembly problem). Many different objective functions have been proposed, including Minimum Fragment Removal (MFR), Minimum SNP Removal (MSR), Longest Haplotype Reconstruction (LHR), Minimum Implicit SNP Removal (MISR), Minimum Implicit Fragment Removal (MIFR), MEC [11, 12], and Maximum Fragments Cut (MFC) [13]. Of special interest among the proposed functions is MEC, which aims at minimizing the total number of conflicts (errors) between the reads and the constructed haplotypes. MFC is another interesting function. However, Chen et al. [14] found that optimizing MEC leads to better output haplotypes than optimizing MFC (in terms of popular measures such as the number of switch errors and the QAN50 score).

As one can expect, the problem of minimizing MEC is NP-hard even for gapless reads [12, 15]. This has motivated researchers to design both heuristic and exact algorithms for the problem. Wang et al. [16] designed a branch-and-bound algorithm to solve the problem to optimality; due to its exponential running time, they also proposed a genetic algorithm for large-scale instances. Using Sanger sequencing technology, Levy et al. [17] introduced the first diploid genome sequence of an individual human, referred to as HuRef, and proposed a greedy heuristic method for concatenating the reads with minimum conflicts. Although their method is fast, it becomes inaccurate when errors are present in reads. A heuristic method (called Fast Hare) was proposed in [18]. He et al. [19] introduced a dynamic-programming algorithm with time complexity \(O(2^k mn)\), where k is the length of the longest read, m is the number of reads, and n is the total number of SNPs in the haplotypes. Their algorithm works well when k≤15. However, when k is large, they have to model the problem as a MaxSAT problem, which is then solved by a MaxSAT solver. In comparison to dynamic programming, Fast Hare is much faster and more accurate. The complexity of Fast Hare is \(O(n \log n + mn)\), where here n and m denote the numbers of fragments and SNPs, respectively. Bansal et al. [20] introduced a software package (named HapCUT), whose algorithm tries to minimize the MEC score by iteratively computing max-cuts in graphs derived from the sequenced fragments. Bansal et al. [21] developed a Markov chain Monte Carlo algorithm (called HASH). Both HASH and HapCUT work well in practice, but may not find optimal haplotypes. Mousavi et al. [22] developed an improved maximum likelihood model which solves the problem approximately. They also showed that the maximum likelihood model is approximated by the MEC model when minor allele frequencies are omitted from the model, thereby providing theoretical support for the MEC model despite some criticisms of it [13]. Bonizzoni et al. [23] provided new results on the fixed-parameter tractability and approximability of MEC; they showed that MEC is fixed-parameter tractable (FPT) when parameterized by the number of corrections and, on gapless instances, also FPT when parameterized by the length of the fragments. Ahn et al. [24] developed ParticleHap, a probabilistic framework with a deterministic sequential Monte Carlo algorithm that associates SNPs with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. Das et al. [25] formulated the problem as a semi-definite program and, exploiting the low rank of the underlying solution, solved it quickly and with high accuracy. Rhee et al. [26] surveyed the development of computational haplotype determination methods.

Chen et al. [27] introduced a novel approach that solves the problem to optimality. Their algorithm proceeds as follows. First, the size of the input read matrix is reduced and the problem is partitioned into a number of independent smaller problems whose optimal solutions can be combined into an optimal solution of the original problem. Each smaller problem is then formulated as an ILP problem and solved with a fast ILP solver (such as CPLEX or GUROBI). With this approach, they succeeded in finding optimal haplotypes for the difficult HuRef dataset on a single PC within 31 h in total.

When solving the haplotype assembly problem, we usually make the all-heterozygous assumption, namely that if a column of the input read matrix contains at least one 0 and one 1, then it corresponds to a heterozygous SNP site. It is well known that the all-heterozygous assumption is correct for most columns in the read matrix, but it may be incorrect due to errors in reads. So, it is natural to consider the general case of the problem, where we do not make the assumption. The general case is much harder than the all-heterozygous case, because it allows many more candidate haplotypes. Nonetheless, Chen et al. [27] showed that their ILP-based approach to the all-heterozygous case can be generalized to solve the general case as well. In particular, they showed that even in the general case, the HuRef dataset (except a single subproblem for the 15th chromosome) can be solved on a single PC within a total of 12 days. For convenience, we refer to their model for the all-heterozygous (respectively, general) case as OldModelA (respectively, OldModelG).

To the best of our knowledge, Chen et al.'s approach [27] yields the fastest previously known algorithm for optimally solving the problem. However, their program can take hours or even days for difficult instances. Hence, it is of great interest to design better ILP models that significantly speed up Chen et al.'s approach. In this paper, we come up with such models for both the all-heterozygous and the general cases. The new models are theoretically better because they contain significantly fewer constraints. More importantly, our experimental results show that for both simulated and real datasets, the new model for the all-heterozygous (respectively, general) case can usually be solved via CPLEX (a popular ILP solver) at least 5 times (respectively, twice) faster than the previous best models. Indeed, the speed-up can sometimes reach a factor of 41. The real datasets we use are HuRef [17] and MP1 [28], while the simulated datasets are Geraci's dataset [29] and Chen et al.'s Sim95_MP1 [14].

One may argue that exact algorithms are much slower than heuristic algorithms for difficult instances. However, exact algorithms are still of great interest for the following reasons. First, as found in [14], exact algorithms usually output significantly better solutions than heuristic algorithms in terms of popular measures such as the MEC score, the number of switch errors, and the QAN50 score. Secondly, exact algorithms can be used to assess how good a heuristic algorithm is. Thirdly, as found in [14], ILP models of the problem can be used to improve the output quality of any heuristic algorithm at a small cost in speed. Thus, better ILP models also lead to faster improvement of heuristic algorithms.

Methods

One method for obtaining haplotype information is to collect genotype data from an individual. Genotype data indicate the set of alleles at each locus, but not which chromosome a particular allele occurs on [30–32]. Other methods are based on the data of shotgun sequence assembly [12, 21, 33]. Given a reference genome sequence and a set of reads containing sequences of both chromosomes, we first align the reads to the reference genome. We keep only those columns with multiple (different) values in the alignment. These columns are called the heterozygous sites and correspond to alleles that differ between chromosomes. For each heterozygous site, the major and minor alleles are encoded as 0 and 1, respectively. In this way, we obtain aligned SNP fragments as the input data. Then, we need to partition the aligned SNP fragments into two sets according to a certain objective function and further determine a haplotype from each set via the majority rule. Formally, the input data is a matrix A whose rows and columns correspond to reads and SNP sites, respectively. Each entry of A is a 0, 1, or '−', where '−' is called a gap and either corresponds to missing data or serves to connect disjoint parts of a read. The first (respectively, last) entry of a read that is not a '−' is called the start (respectively, end) position of the read. Note that there may exist gaps between the start and the end positions of a read. The following is an example read matrix:

$$A=\left[ \begin{array}{cccccc} - & 1 & 0 & 0 & - & - \\ 1 & 1 & - & 0 & - & - \\ 0 & 1 & - & 1 & 0 & 1 \\ - & 0 & 1 & - & - & - \\ 1 & - & - & 0 & 1 & - \\ 1 & 0 & 1 & - & - & 0 \\ \end{array} \right] $$

Each row of A can be viewed as a ternary string of length m over {0,1,−}, where m is the number of columns in A. For two ternary strings s and t of the same length, we use d(s,t) to denote the total number of positions at which the characters of s and t both belong to {0,1} but differ. Given A, the haplotype assembly problem requires the computation of a pair (h,h′) of binary strings of length m whose MEC score, i.e., \(\sum_{r}\min\{d(h,r), d(h',r)\}\), is minimized, where r ranges over all rows of A. For convenience, we call (h,h′) an optimal pair of haplotypes, and say that r is aligned to h (respectively, h′) if d(h,r)≤d(h′,r) (respectively, d(h′,r)<d(h,r)).
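To make the objective concrete, here is a minimal Python sketch that computes d(s,t) and the MEC score of a candidate pair on the example matrix above; the candidate haplotypes shown are hypothetical, not an optimal pair.

```python
def d(s, t):
    # Number of positions where both characters are in {0,1} and differ
    return sum(a != b for a, b in zip(s, t) if a != '-' and b != '-')

def mec(reads, h, h2):
    # MEC score: each read is charged its distance to the closer haplotype
    return sum(min(d(h, r), d(h2, r)) for r in reads)

# The example read matrix A, one string per row
A = ["-100--", "11-0--", "01-101", "-01---", "1--01-", "101--0"]

h = "110001"   # hypothetical candidate haplotype
h2 = "001110"  # its complement (all-heterozygous case)
print(mec(A, h, h2))
```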

In the all-heterozygous case of the problem, it is required that h and h′ are complementary, i.e., each letter of h differs from the letter of h′ at the same position, while in the general case, a letter of h can equal the letter of h′ at the same position. The general case is more difficult to solve than the all-heterozygous case because it allows many more candidate haplotypes. To make the general case less difficult, Chen et al. [27] prove a lemma which can be used to find intrinsically heterozygous columns, i.e., those columns in A at which the letters of h and h′ can be assumed to differ without altering the optimal MEC score. Of course, if a column in A is not known to be intrinsically heterozygous, then the column may be heterozygous or homozygous, i.e., the letters of h and h′ at the column may differ or be equal.

Reductions

Suppose that A is the given read matrix. Clearly, we can assume that each row of A contains at least one 0 or 1, because otherwise the row can be deleted without changing the problem. We want to reduce the problem for A to smaller problems whose optimal solutions can be combined into an optimal solution of the original problem. The following three reductions are obvious and have been used before [27]. The first reduction removes all monotone columns, i.e., those columns in which at least one of 0 and 1 does not appear. Note that if one does not want such a column to be deleted by our program because the column is known to be heterozygous, then one should simply add a new read which starts and ends at the column so that both 0 and 1 appear in the column. The second reduction is called Singleton-Removal; it removes all rows in which only one column is not a '−' and that column is known to be (intrinsically) heterozygous. Let m be the number of columns in A. The third reduction first finds the set S of all i∈{1,…,m−1} such that for each read r in A, either the end position of r is in the j-th column for some j≤i or the start position of r is in the k-th column for some k≥i+1; it then partitions A into |S|+1 (smaller) matrices by cutting A vertically at the border between the i-th and the (i+1)-st columns for each i∈S. Each of the |S|+1 matrices is called a block. A sketch of this third reduction follows.
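As an illustration, here is a small Python sketch of the third reduction (the function names are ours); it computes the cut set S from the reads' start and end positions and splits A into blocks. Indices are 0-based, unlike the 1-based indices used in the text.

```python
def find_cuts(A):
    # S = positions i such that every read ends in a column <= i
    # or starts in a column >= i + 1
    spans = []
    for row in A:
        pos = [j for j, ch in enumerate(row) if ch != '-']
        spans.append((pos[0], pos[-1]))  # start and end position of the read
    return [i for i in range(len(A[0]) - 1)
            if all(e <= i or s >= i + 1 for s, e in spans)]

def split_into_blocks(A):
    # Cut A vertically at the border after each i in S
    blocks, start = [], 0
    for i in find_cuts(A):
        blocks.append([row[start:i + 1] for row in A])
        start = i + 1
    blocks.append([row[start:] for row in A])
    return blocks
```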

Consider a block B of A. Let \(\hat{m}\) be the number of columns in B. Chen et al. [27] suggest a less obvious reduction which can be used to reduce the problem for B to even smaller problems whose optimal solutions can be combined into an optimal solution of the problem for B. More specifically, the reduction first finds the set T of all \(i\in \{2,\ldots,\hat{m}-1\}\) such that (1) the i-th column of B is known to be (intrinsically) heterozygous and (2) for each read r in B, either the end position of r is in the j-th column with j≤i or the start position of r is in the k-th column with k≥i. The reduction then modifies B by making a copy of the i-th column and putting it immediately after the original i-th column for all i∈T. In this way, the modified B has |T| more columns than the original B. Finally, the reduction splits the modified B into |T|+1 (smaller) matrices by cutting the modified B vertically at the border between the original i-th column and its duplicate for each i∈T. Each of the |T|+1 matrices is called a reduced block. If T≠∅, then the Singleton-Removal reduction can be applied to each reduced block \(\tilde{B}\), because at least one read in \(\tilde{B}\) starts and ends at the same (intrinsically) heterozygous column which is the first or the last column of \(\tilde{B}\). The sketch below illustrates the column-duplicating split.
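The column-duplicating step can be sketched as follows (again a simplified Python illustration with our own naming; computing T itself is analogous to computing S above).

```python
def split_at_het_columns(B, T):
    # T: 0-based indices of intrinsically heterozygous cut columns.
    # Column i is duplicated: it ends one reduced block and also starts
    # the next one, so |T| + 1 reduced blocks are produced.
    pieces, start = [], 0
    for i in sorted(T):
        pieces.append([row[start:i + 1] for row in B])
        start = i  # column i reappears as the first column of the next piece
    pieces.append([row[start:] for row in B])
    return pieces
```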

Solving reduced blocks via ILP

Let \(\tilde{B}\) be a reduced block. We want to formulate the haplotype assembly problem for \(\tilde{B}\) as an ILP problem in both the general and the all-heterozygous cases. Towards this end, we first need a definition. Two columns c and c′ of \(\tilde{B}\) are complementary if c′ can be obtained from c by flipping each 0 (respectively, 1) in c to a 1 (respectively, 0). The point is that if c and c′ are complementary, then there must exist an optimal pair (h,h′) of haplotypes for \(\tilde{B}\) such that the letters of h (respectively, h′) at c and c′ are complementary, and hence it suffices to compute the letter of h (respectively, h′) at c instead of computing the letters of h (respectively, h′) at both c and c′. Clearly, similar observations hold for identical rows and columns in \(\tilde{B}\). Based on these observations, we first partition the rows (respectively, columns) of \(\tilde{B}\) into disjoint sets so that the rows (respectively, columns) in the same set are identical (respectively, identical or complementary) but no two rows (respectively, columns) in different sets are, and then for each set S in the partition, modify \(\tilde{B}\) by merging the rows (respectively, columns) in S into a single row (respectively, column) to which we assign the multiplicity |S|. A sketch of the row-merging step follows.
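The row-merging step can be sketched as follows; columns are handled analogously once complementary columns are put into a canonical form (e.g., by flipping a column whenever its first non-gap entry is a 1). This is our own illustration, not Chen et al.'s code.

```python
from collections import Counter

def merge_rows(rows):
    # Merge identical rows; the multiplicity of a surviving row is the
    # number of original rows merged into it (the weight w_i below)
    counts = Counter(rows)
    merged = list(counts)
    weights = [counts[r] for r in merged]
    return merged, weights
```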

Suppose that we have modified \(\tilde{B}\) as above. Let p (respectively, q) be the number of rows (respectively, columns) in \(\tilde{B}\). For each i∈{1,…,p}, let \(w_i\) be the multiplicity of the i-th row of \(\tilde{B}\). Similarly, for each j∈{1,…,q}, let \(c_j\) be the multiplicity of the j-th column of \(\tilde{B}\).

The all-heterozygous case

For each integer i with 1≤i≤p, let \(J_{i,0}\) (respectively, \(J_{i,1}\)) be the set of integers j∈{1,2,…,q} such that the entry in the i-th row and the j-th column of \(\tilde{B}\) is a 0 (respectively, 1). Since we want to compute an optimal pair (h,h′) of complementary haplotypes for \(\tilde{B}\), we introduce a binary variable \(x_j\) for the j-th column whose value is supposed to be 1 if and only if the j-th bit of h is a 1 (and hence the j-th bit of h′ is a 0). Moreover, we introduce a binary variable \(z_i\) for the i-th row of \(\tilde{B}\) whose value is supposed to be 1 if and only if the i-th row of \(\tilde{B}\) is aligned to h. Furthermore, we introduce a binary variable \(t_{i,j}\) for the i-th row and the j-th column of \(\tilde{B}\) whose value is supposed to be 1 if and only if the i-th row of \(\tilde{B}\) is aligned to h but its j-th entry differs from that of h. Then, we claim that the problem of finding an optimal pair (h,h′) of complementary haplotypes for \(\tilde{B}\) has the following ILP formulation (denoted by NewModelA):

$$\begin{array}{@{}rcl@{}} \text{Minimize} & &\sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,0}} c_{j} \left(1 - x_{j} - z_{i} + 2 t_{i,j}\right) \\ &+& \sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,1}} c_{j} \left(x_{j} - z_{i} + 2 t_{i,j}\right) \\ \text{Subject to} & & \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0}} \ \ x_{j} + z_{i} - 1 \le t_{i,j} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in J_{i,1}} \ \ z_{i} - x_{j} \le t_{i,j} \\ & & \text{all the variables \(x_{j}\), \(z_{i}\), \(t_{i,j}\)~:~binary} \end{array} $$

To prove the above claim, we first show that for each i∈{1,…,p} and each \(j\in J_{i,0}\), \(1 - x_j - z_i + 2t_{i,j} = 1\) if and only if either \(z_i=1\) and \(x_j=1\), or \(z_i=0\) and \(x_j=0\). Note that \(z_i=1\) and \(x_j=1\) together mean that the i-th row is aligned to h but its j-th entry differs from that of h, while \(z_i=0\) and \(x_j=0\) together mean that the i-th row is aligned to h′ but its j-th entry differs from that of h′. In Table 1, we enumerate the four possible values of \((x_j, z_i)\). As can be seen from the table, \(t_{i,j}\) also happens to be the smallest binary value bounding \(x_j + z_i - 1\) from above. Now, since the coefficient of \(t_{i,j}\) in the objective function is positive, minimizing the objective function leads to minimizing \(t_{i,j}\), and hence the constraint \(x_j + z_i - 1 \le t_{i,j}\) (together with the fact that \(t_{i,j}\) is binary) guarantees that \(t_{i,j}\) is indeed as defined.

Table 1 The value of \(1 - x_j - z_i + 2t_{i,j}\) for \(j\in J_{i,0}\)

To finish the proof of the claim, it suffices to show that for each i∈{1,…,p} and each \(j\in J_{i,1}\), \(x_j - z_i + 2t_{i,j} = 1\) if and only if either \(z_i=1\) and \(x_j=0\), or \(z_i=0\) and \(x_j=1\). The proof is similar to the case where \(j\in J_{i,0}\); the main difference is to use Table 2.

Table 2 The value of \(x_j - z_i + 2t_{i,j}\) for \(j\in J_{i,1}\)
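The following is a minimal sketch of NewModelA using the open-source PuLP modeling library with its default CBC solver rather than CPLEX; the function name and interface are ours. For simplicity, a \(t_{i,j}\) variable is created only for the non-gap entries of the merged matrix.

```python
import pulp

def solve_new_model_a(rows, w, c):
    # rows: merged read matrix over {'0','1','-'} (one string per row)
    # w: row multiplicities w_i, c: column multiplicities c_j
    p, q = len(rows), len(rows[0])
    prob = pulp.LpProblem("NewModelA", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(q)]
    z = [pulp.LpVariable(f"z{i}", cat="Binary") for i in range(p)]
    terms = []
    for i in range(p):
        for j in range(q):
            e = rows[i][j]
            if e == '-':
                continue
            t = pulp.LpVariable(f"t{i}_{j}", cat="Binary")
            if e == '0':  # j in J_{i,0}
                terms.append(w[i] * c[j] * (1 - x[j] - z[i] + 2 * t))
                prob += x[j] + z[i] - 1 <= t
            else:         # j in J_{i,1}
                terms.append(w[i] * c[j] * (x[j] - z[i] + 2 * t))
                prob += z[i] - x[j] <= t
    prob += pulp.lpSum(terms)  # the MEC objective
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    h = "".join(str(int(v.value())) for v in x)
    return h, "".join('1' if b == '0' else '0' for b in h)  # (h, h')
```

Note that no upper-bound constraints on the t-variables are needed: since each t-variable appears with a positive coefficient in the objective, minimization drives it to the smallest binary value allowed by its single lower-bound constraint, which is exactly the argument made above.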

We next compare NewModelA against Chen et al.’s OldModelA [27] in terms of the numbers of variables and constraints. OldModelA is as follows:

$$\begin{array}{@{}rcl@{}} \text{Minimize} & &\sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,0}} c_{j} \left(1 - x_{j} - z_{i} + 2 t_{i,j}\right) \\ &+& \sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,1}} c_{j} \left(z_{i} + x_{j} - 2 t_{i,j}\right) \\ \text{Subject to} & & \\ & & \forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0} \cup J_{i,1}} \ \ t_{i,j} \le z_{i} \\ & & \forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0} \cup J_{i,1}} \ \ t_{i,j} \le x_{j} \\ & & \forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0} \cup J_{i,1}} \ \ t_{i,j} \ge z_{i} + x_{j} -1 \\ & & \text{all the variables \(x_{j}\), \(z_{i}\), \(t_{i,j}\)~:~binary} \end{array} $$

The number of variables in NewModelA is p+q+pq, and so is that in OldModelA. On the other hand, the number of constraints in NewModelA is \(\sum_{i=1}^{p} (|J_{i,0}| + |J_{i,1}|)\), while that in OldModelA is \(3\sum_{i=1}^{p} (|J_{i,0}| + |J_{i,1}|)\). So, NewModelA has significantly fewer constraints, and it is thus promising that NewModelA can almost always be solved in much less time. In a later section, we will see that this is actually the case.

The general case

For each integer i with 1≤i≤p, let \(J_{i,0}\) (respectively, \(J_{i,1}\)) be the set of integers j∈{1,2,…,q} such that the j-th column of \(\tilde{B}\) is known to be intrinsically heterozygous and the entry in the i-th row and the j-th column of \(\tilde{B}\) is a 0 (respectively, 1). Further let \(\overline{J_{i,0}}\) (respectively, \(\overline{J_{i,1}}\)) be the set of integers j∈{1,2,…,q} such that the j-th column of \(\tilde{B}\) is not known to be intrinsically heterozygous and the entry in the i-th row and the j-th column is a 0 (respectively, 1). As in the all-heterozygous case, we introduce a binary variable \(z_i\) for the i-th row of \(\tilde{B}\) whose value is supposed to be 1 if and only if the i-th row of \(\tilde{B}\) is aligned to h. For each j∈{1,…,q} such that the j-th column of \(\tilde{B}\) is known to be intrinsically heterozygous, we introduce a binary variable \(x_j\) whose value is supposed to be 1 if and only if the j-th bit of h is a 1 (and hence the j-th bit of h′ is a 0). Moreover, for each j∈{1,…,q} such that the j-th column of \(\tilde{B}\) is not known to be intrinsically heterozygous, we introduce two binary variables \(x_j\) and \(y_j\), where the value of \(x_j\) (respectively, \(y_j\)) is supposed to be 1 if and only if the j-th bit of h (respectively, h′) is a 1. Furthermore, we introduce a binary variable \(t_{i,j}\) for the i-th row and the j-th column of \(\tilde{B}\) whose value is supposed to be 1 if and only if the i-th row of \(\tilde{B}\) is aligned to h but its j-th entry differs from that of h. In addition, if the j-th column of \(\tilde{B}\) is not known to be intrinsically heterozygous, we further introduce a binary variable \(u_{i,j}\) whose value is supposed to be 1 if and only if the i-th row of \(\tilde{B}\) is aligned to h′ but its j-th entry differs from that of h′.

Then, we claim that the problem of finding an optimal pair (h,h′) of haplotypes for \(\tilde{B}\) has the following ILP formulation (denoted by NewModelG):

$$\begin{array}{@{}rcl@{}} \text{Minimize} & &\sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,0}} c_{j} \left(1 - x_{j} - z_{i} + 2 t_{i,j}\right) \\ &+& \sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,1}} c_{j} \left(x_{j} - z_{i} + 2 t_{i,j}\right) \\ &+& \sum_{i=1}^{p} w_{i} \cdot \sum_{j \in \overline{J_{i,0}}\cup \overline{J_{i,1}}} c_{j} \left(t_{i,j} + u_{i,j}\right) \\ \text{Subject to} & & \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0}\cup\overline{J_{i,0}}} \ \ x_{j} + z_{i} - 1 \le t_{i,j} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in J_{i,1}\cup\overline{J_{i,1}}} \ \ z_{i} - x_{j} \le t_{i,j} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in \overline{J_{i,0}}} \ \ y_{j} - z_{i} \le u_{i,j} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in \overline{J_{i,1}}} \ \ 1 - y_{j} - z_{i} \le u_{i,j} \\ & & \text{all the variables \(x_{j}\), \(y_{j}\), \(z_{i}\), \(t_{i,j}\), \(u_{i,j}\)~:~binary} \end{array} $$

To prove the above claim, we first show that for each i∈{1,…,p} and each \(j\in \overline{J_{i,0}}\), \(t_{i,j}=1\) if and only if \(z_i=1\) and \(x_j=1\), while \(u_{i,j}=1\) if and only if \(z_i=0\) and \(y_j=1\). Note that \(z_i=1\) and \(x_j=1\) together mean that the i-th row is aligned to h but its j-th entry differs from that of h, while \(z_i=0\) and \(y_j=1\) together mean that the i-th row is aligned to h′ but its j-th entry differs from that of h′. In Table 3, we enumerate the eight possible values of \((x_j, y_j, z_i)\). As can be seen from the table, \(t_{i,j}\) (respectively, \(u_{i,j}\)) also happens to be the smallest binary value bounding \(x_j + z_i - 1\) (respectively, \(y_j - z_i\)) from above. Now, since the coefficients of \(t_{i,j}\) and \(u_{i,j}\) in the objective function are positive, minimizing the objective function leads to minimizing \(t_{i,j}\) and \(u_{i,j}\), and hence the constraint \(x_j + z_i - 1 \le t_{i,j}\) (respectively, \(y_j - z_i \le u_{i,j}\)), together with the fact that \(t_{i,j}\) (respectively, \(u_{i,j}\)) is binary, guarantees that \(t_{i,j}\) (respectively, \(u_{i,j}\)) is indeed as defined.

Table 3 The value of \(t_{i,j} + u_{i,j}\) for \(j \in \overline{J_{i,0}}\)

Continuing the proof of the claim, we next show that for each i∈{1,…,p} and each \(j\in \overline{J_{i,1}}\), \(t_{i,j}=1\) if and only if \(z_i=1\) and \(x_j=0\), while \(u_{i,j}=1\) if and only if \(z_i=0\) and \(y_j=0\). The proof is similar to the case where \(j\in \overline{J_{i,0}}\); the main difference is to use Table 4.

Table 4 The value of \(t_{i,j} + u_{i,j}\) for \(j \in \overline{J_{i,1}}\)

Now, by the analysis in the all-heterozygous case, the claim holds.
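Extending the earlier PuLP sketch to NewModelG only requires the extra y- and u-variables for the columns not known to be intrinsically heterozygous. Below is a simplified illustration under the same assumptions as before (our own naming; `intrinsic` is the set of intrinsically heterozygous column indices).

```python
import pulp

def build_new_model_g(rows, w, c, intrinsic):
    # intrinsic: set of 0-based indices of columns known to be
    # intrinsically heterozygous
    p, q = len(rows), len(rows[0])
    prob = pulp.LpProblem("NewModelG", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(q)]
    y = {j: pulp.LpVariable(f"y{j}", cat="Binary")
         for j in range(q) if j not in intrinsic}
    z = [pulp.LpVariable(f"z{i}", cat="Binary") for i in range(p)]
    terms = []
    for i in range(p):
        for j in range(q):
            e = rows[i][j]
            if e == '-':
                continue
            # t = 1 iff row i is aligned to h and mismatches h at column j
            t = pulp.LpVariable(f"t{i}_{j}", cat="Binary")
            if e == '0':
                prob += x[j] + z[i] - 1 <= t
            else:
                prob += z[i] - x[j] <= t
            if j in intrinsic:
                # heterozygous column: same objective terms as NewModelA
                terms.append(w[i] * c[j] * ((1 - x[j] - z[i] + 2 * t)
                                            if e == '0' else
                                            (x[j] - z[i] + 2 * t)))
            else:
                # u = 1 iff row i is aligned to h' and mismatches h'
                u = pulp.LpVariable(f"u{i}_{j}", cat="Binary")
                if e == '0':
                    prob += y[j] - z[i] <= u
                else:
                    prob += 1 - y[j] - z[i] <= u
                terms.append(w[i] * c[j] * (t + u))
    prob += pulp.lpSum(terms)
    return prob
```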

We next compare NewModelG against Chen et al.’s OldModelG [27] in terms of the numbers of variables and constraints. OldModelG is as follows:

$$\begin{array}{@{}rcl@{}} \text{Minimize} & &\sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,0}} c_{j} \left(1 - x_{j} - z_{i} + 2 t_{i,j}\right) \\ &+& \sum_{i=1}^{p} w_{i} \sum_{j \in J_{i,1}} c_{j} \left(z_{i} + x_{j} - 2 t_{i,j}\right) \\ &+& \sum_{i=1}^{p} w_{i} \cdot \sum_{j \in \overline{J_{i,0}}} c_{j} \left(y_{j} + t_{i,j} - u_{i,j} \right) \\ &+& \sum_{i=1}^{p} w_{i} \cdot \sum_{j \in \overline{J_{i,1}}} c_{j} \left(1 - y_{j} - t_{i,j} +u_{i,j}\right) \\ \text{Subject to} & & \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0}\cup J_{i,1}\cup\overline{J_{i,0}}\cup\overline{J_{i,1}}} \ \ t_{i,j}\le z_{i} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0}\cup J_{i,1}\cup\overline{J_{i,0}}\cup\overline{J_{i,1}}} \ \ t_{i,j}\le x_{j} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in J_{i,0}\cup J_{i,1}\cup\overline{J_{i,0}}\cup\overline{J_{i,1}}} \ \ t_{i,j}\ge x_{j} + z_{i} -1 \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in \overline{J_{i,0}}\cup\overline{J_{i,1}}} \ \ u_{i,j}\le z_{i} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in \overline{J_{i,0}}\cup\overline{J_{i,1}}} \ \ u_{i,j}\le y_{j} \\ & &\forall_{1\le i\le p} \ \ \forall_{j\in \overline{J_{i,0}}\cup\overline{J_{i,1}}} \ \ u_{i,j}\ge y_{j} + z_{i} -1 \\ & & \text{all the variables \(x_{j}\), \(y_{j}\), \(z_{i}\), \(t_{i,j}\), \(u_{i,j}\)~:~binary} \end{array} $$

Let I be the set of all j∈{1,…,q} such that the j-th column of \(\tilde{B}\) is known to be intrinsically heterozygous. The number of variables in NewModelG is \(p + 2q - |I| + \sum_{i=1}^{p}\left(|J_{i,0}|+|J_{i,1}| + 2|\overline{J_{i,0}}| + 2|\overline{J_{i,1}}|\right)\), and so is that in OldModelG. Moreover, the number of constraints in NewModelG is \(\sum_{i=1}^{p} \left(|J_{i,0}| + |J_{i,1}| + 2|\overline{J_{i,0}}| + 2|\overline{J_{i,1}}|\right)\), while that in OldModelG is \(\sum_{i=1}^{p} \left(3|J_{i,0}| + 3|J_{i,1}| + 6|\overline{J_{i,0}}| + 6|\overline{J_{i,1}}|\right)\). So, NewModelG has significantly fewer constraints, and it is thus promising that NewModelG can almost always be solved in much less time. In a later section, we will see that this is actually the case.

Results and discussion

To evaluate the efficiency of our new models and compare them against the previous best models, we have implemented both our models and Chen et al.'s models [27] with CPLEX v12.6.3 (an ILP solver by IBM). In our experiments, we use both real and simulated datasets, some of which include hard instances, because we expect the new and the old models to show a significant difference in performance on hard instances. Since solving hard instances takes a long time and we want to speed up our experiments, we use two Linux (x64) desktop PCs. One PC has an Intel Xeon E5-2687W v4 CPU (3.00 GHz, 48 threads) and 252 GB RAM, while the other has an Intel i7-3960X CPU (3.30 GHz, 12 threads) and 32 GB RAM. The former is used to solve real datasets, while the latter is used to solve simulated datasets. To guarantee that we can finish the experiments within a reasonable amount of time, we put a time limit of 1 day on the running time of CPLEX for each reduced block.

CPLEX is always fast enough to find a solution within the time limit, although the solution is not necessarily optimal. In general, there is no guarantee on the MEC score of the solution when CPLEX fails to solve a model to optimality within the time limit. Nonetheless, for the datasets used in our experiments, we explicitly state (in tables) the total MEC scores of the solutions found by CPLEX even when CPLEX fails to optimally solve one or more reduced blocks within the time limit. As will be seen from the tables, the MEC scores are quite close to optimal when CPLEX fails.

Results for real datasets

The real datasets used in our experiments are HuRef [17] and MP1 [28]. Both of them have been extensively used in previous studies to compare methods against each other [14, 17, 19–21, 27].

The HuRef dataset

This dataset is known to contain very hard instances. Indeed, He et al.'s program [19] fails to completely solve the all-heterozygous case of the problem for HuRef even after taking 15 h on a PC cluster. Chen et al.'s ILP-based program [27] was the first to completely solve the all-heterozygous case of the problem for HuRef on a desktop PC. It turns out that our new models lead to much shorter running times than Chen et al.'s models [27]. More specifically, Table 5 compares the performance of CPLEX on our new ILP models against that on Chen et al.'s old ILP models [27].

Table 5 Comparing the performance of CPLEX on our new ILP models against that on Chen et al.’s models [27] for HuRef

We first focus on the all-heterozygous case. As can be seen from Table 5, CPLEX can solve NewModelA 9 times faster than OldModelA on average. Indeed, CPLEX failed to optimally solve OldModelA for one reduced block of Chromosome 15 within the time limit. If we exclude Chromosome 15, then CPLEX can solve NewModelA 12 times faster on average. In particular, for Chromosome 22, CPLEX can solve NewModelA 41 times faster.

We next focus on the general case. As can be seen from Table 5, CPLEX can solve NewModelG twice as fast as OldModelG on average. Indeed, CPLEX failed to optimally solve NewModelG for one reduced block of Chromosome 15 within the time limit, while it failed to optimally solve OldModelG for a total of two reduced blocks of Chromosomes 15 and 22. If we exclude these two chromosomes, then CPLEX can solve NewModelG 8 times faster on average. In particular, for Chromosome 1, CPLEX can solve NewModelG 11 times faster.

The MP1 dataset

MP1 is the most complete haplotype-resolved genome generated by fosmid pool-based next-generation sequencing. The completeness of its phasing allows the determination of the molecular haplotype pairs for 81% of all autosomal protein-coding genes and provides haploid DNA segments significantly larger than those obtained with standard shotgun sequencing technologies [34]. Solving the instances in MP1 optimally is much more difficult than solving those in HuRef optimally, because a number of chromosomes in MP1 contain very large reduced blocks with very large MEC scores. Previously, only Chen et al. [14] tried to optimally solve the all-heterozygous case for MP1. To the best of our knowledge, nobody has tried to optimally solve the general case for MP1.

As in [14], we focus on the all-heterozygous case for MP1. Chen et al. [14] assume that all columns of the input matrices in MP1 are heterozygous even if some of them are monotone. In contrast, we assume that monotone columns are homozygous and hence remove them from the input matrices when applying reductions. The removal of monotone columns makes the problem smaller and allows us to finish the experiments faster. In other words, the difference is only technical rather than essential. Consequently, the MEC scores obtained in this paper will be smaller than those obtained in [14].

With NewModelA, we have tried to use CPLEX to optimally solve the all-heterozygous case for MP1. It turns out that NewModelA leads to much shorter time than OldModelA. More specifically, Table 6 compares the performance of CPLEX on NewModelA against that on OldModelA.

Table 6 Comparing the performance of CPLEX on NewModelA against that on OldModelA for MP1

As can be seen from Table 6, CPLEX can solve NewModelA almost twice as fast as OldModelA on average. Indeed, CPLEX failed to optimally solve NewModelA for a total of 12 reduced blocks of 11 chromosomes within the time limit, while it failed to optimally solve OldModelA for a total of 22 reduced blocks of 17 chromosomes. If we exclude the 17 chromosomes for which CPLEX failed to optimally solve OldModelA, then CPLEX can solve NewModelA 6.5 times faster on average.

We note that the solutions found by CPLEX with OldModelA for the 5th and the 10th chromosomes happen to be optimal although CPLEX fails to finish within the time limit for some blocks of the two chromosomes; their optimality is witnessed by the results found by CPLEX with NewModelA. So, we suspect that with OldModelA, CPLEX spends too much time verifying the optimality of the solutions of some blocks of the two chromosomes.

Simulated datasets

The simulated datasets used in our experiments are Geraci’s dataset [29] and Chen et al.’s Sim95_MP1 [14].

Geraci’s dataset

The dataset was generated from 3 parameters: the haplotype length ℓ∈{100,350,700}, the coverage c∈{3,5,8,10}, and the error rate e∈{0,0.1,0.2,0.3}. For each triple (ℓ,c,e), 100 matrices were generated. Intuitively speaking, the larger c (respectively, e) is, the more reads the matrix has (respectively, the larger the optimal MEC score of the matrix is).

As Chen et al. [27] did, we only use the matrices generated from (ℓ,c,e) with ℓ∈{100,350}, c∈{3,5,8,10}, and e=0.1, because these matrices contain both easy and difficult instances. It turns out that our new models lead to much shorter running times than Chen et al.'s models [27]. More specifically, Table 7 compares the performance of CPLEX on our new ILP models against that on Chen et al.'s old ILP models [27].

Table 7 Comparing the performance of CPLEX on our new ILP models against that on Chen et al.’s models [27] for Geraci’s dataset

We first focus on the general case. As can be seen from Table 7, CPLEX can solve NewModelG 10 times faster than OldModelG on average for hard instances (namely, those generated with (ℓ,c)=(350,10)). Indeed, CPLEX failed to optimally solve OldModelG for a total of 2 reduced blocks of two instances generated with (ℓ,c)=(350,10) within the time limit. Even for the other (easier) instances, CPLEX can solve NewModelG at least 4 times faster than OldModelG on average. We note that the solutions found by CPLEX with OldModelG for the two failed reduced blocks happen to be optimal, as witnessed by the results found by CPLEX with NewModelG. So, we suspect that with OldModelG, CPLEX spends too much time verifying the optimality of the solutions.

For the all-heterozygous case, the results in Table 7 show that CPLEX can solve our new model about twice as fast on average.

Chen et al.’s Sim95_MP1

The matrices in MP1 are hard to solve to optimality. The main reason for this seems to be that the optimal MEC scores for the matrices are too large (i.e., the quality of the reads is poor). Since we believe that reads will be of higher quality in the future, we want a simulated dataset derived from MP1 in which the reads cover the same SNP sites as in MP1 but have better quality. More specifically, Chen et al. [14] generated Sim95_MP1 from MP1 as follows. Initially, Sim95_MP1 is a copy of MP1. The reads of Sim95_MP1 are then randomly partitioned into two sets R and \(\bar{R}\). For each read r∈R, each 1 in r is changed to 0; for each read \(r\in\bar{R}\), each 0 is changed to 1. Now, the true solution for each matrix is simply a pair of complementary strings, one of which consists of 0's only. Finally, for each read r in R (respectively, \(\bar{R}\)) and each 0 (respectively, 1) of r, the bit is flipped with probability 1−t, where t is chosen by first drawing it from a normal distribution with mean 0.95 and standard deviation 0.05 and then resetting t=1 (respectively, t=0.6) if t>1 (respectively, t<0.6). Intuitively speaking, t is the quality of the bit, and hence we want it to be around 0.95 and no smaller than 0.6.
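For concreteness, the generation procedure can be sketched as follows; this is our own illustration of the steps described above, not Chen et al.'s original code.

```python
import random

def simulate_sim95(reads):
    out = []
    for r in reads:
        bit = random.choice('01')  # assign the read to R ('0') or R-bar ('1')
        noisy = []
        for ch in r:
            if ch == '-':
                noisy.append(ch)
                continue
            # per-bit quality t ~ N(0.95, 0.05), truncated to [0.6, 1]
            t = min(1.0, max(0.6, random.gauss(0.95, 0.05)))
            if random.random() < 1 - t:  # flip the bit with probability 1 - t
                noisy.append('1' if bit == '0' else '0')
            else:
                noisy.append(bit)
        out.append(''.join(noisy))
    return out
```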

With NewModelA, we have tried to use CPLEX to optimally solve the all-heterozygous case for Sim95_MP1. It turns out that NewModelA leads to much shorter time than OldModelA. More specifically, Table 8 compares the performance of CPLEX on NewModelA against that on OldModelA.

Table 8 Comparing the performance of CPLEX on NewModelA against that on OldModelA for Sim95_MP1

As can be seen from Table 8, CPLEX can solve NewModelA 8.3 times faster than OldModelA on average. Indeed, CPLEX never failed to optimally solve NewModelA within the time limit, while it failed to optimally solve OldModelA for a total of 3 reduced blocks.

Conclusion

As mentioned above, solving the haplotype assembly problem to optimality is of great interest but very time-consuming. To speed up the computation of optimal solutions, we have designed better ILP models for both the all-heterozygous and the general cases of the problem. Our experimental results on both real and simulated datasets confirm that the new models can be solved in significantly less time than the previous best models. In this paper, we focused on finding optimal solutions. As future work, we plan to determine how much speed-up the new models provide when they are used to improve heuristic algorithms.