
1 Introduction

In computational biology, we constantly need to process various biological data to extract meaningful biological relations, for example when building a phylogenetic tree. Such a process sometimes involves computing the genomic distance between two genomes, which was first investigated as early as 1926 [61, 62]. The problem was studied more formally in the 1990s and is in general polynomially solvable for signed genomes, e.g., under the signed translocation distance [7, 37, 49, 56], under the signed reversal distance [3, 38, 48, 63, 64], and under the DCJ distance [69]. For unsigned genomes, the problems are typically NP-hard, e.g., sorting by reversals [17], sorting by translocations [71], sorting by DCJ operations [19], and sorting by transpositions [16]. But these problems on sorting unsigned genomes do admit small-factor (≤1.5) polynomial-time approximations, e.g., sorting by reversals [9, 28], sorting by translocations [31, 46], sorting by DCJ operations [19, 20, 42], and sorting by transpositions [33].

The above results are all under the assumption that each genome is given in a form where there is no loss and duplication of genes and a genome is represented as a permutation of genes. For many genomes, due to the fast evolution/self-reproduction process, duplicated (paralogous) genes are common. So it is useful to select the ancestral ortholog of a gene family on an evolutionary basis. In 1999, David Sankoff first formulated this problem as an algorithmic problem, now known as the Exemplar Breakpoint/Genomic Distance problem [59]. In Sect. 9.2, we will survey the development of the follow-up research since 1999, mostly with negative complexity results. Some of these methods and results have already been applied in other (biological and non-biological) problems [6, 58, 67].

In some eukaryotic genomes, in many situations, such as sequencing errors or errors due to an inappropriate design of the biological experiments, we might have noise and redundant genes. Before these redundancies are eliminated, using the given genomes for many biological studies might introduce further errors. While this problem was known to biologists a long time ago, in 2007 David Sankoff again first formulated it as an algorithmic problem, now known as the Maximal Strip Recovery and the Complementary Maximal Strip Recovery problems [27, 70]. This again led to a series of studies on fixed-parameter tractable and approximation algorithms, performed by several groups in the US, Canada, Europe and China. In Sect. 9.3, we will survey the most recent developments of this research.

Genome sequencing has been a hot research area for the last 20 years. Behind the huge success a commonly ignored fact is that most genomes sequenced are not really ‘sequences’; in fact, most of them are made of scaffolds, i.e., composed of incomplete gene markers. David Sankoff and his group initiated this problem of scaffold filling in 2010 [54]. My group and a group led by Prof. Daming Zhu at Shandong University (China) have been following up this research. While initially the work was done on filling scaffolds with no gene repetitions, which is a problem polynomially solvable, recently a lot of effort has been put on filling scaffolds with gene repetitions (which is in general NP-hard). In Sect. 9.4, we will survey the current status of this research.

In the area of bioinformatics and computational biology, one would typically apply three methods to handle NP-complete problems. One is to find an approximation solution, with the requirement being that the approximation factor is small (preferably close to one). Another is to look for an exact solution (FPT algorithm) when some parameter (say, the solution size) of the problem is small. Finally, the vast majority of practical solutions for bioinformatics and computational biology are heuristic ones, which are possibly based on some formal methods like integer linear programming, branch-and-bound, etc.

In this survey, we focus on the approximability and fixed-parameter tractability results for the above three general problems related to computing genomic distance with some preprocessing. In these problems, we are given some genomes or genetic maps and we try to optimize some solution values by deleting some genes or gene markers. So these problems fit naturally for approximation and/or FPT solutions. Unfortunately, as we will review a bit later, some of these problems are very hard in both aspects. In other words, it might be impossible to design good approximation and/or FPT algorithms for them, unless P = NP, NP = ZPP or FPT = W[1]. On the other hand, many problems are still open along these lines.

The paper is organized as follows. In Sect. 9.2, we first review the approximability and fixed-parameter tractability for the Exemplar Breakpoint Distance (EBD) problem. We then review the approximability and fixed-parameter tractability for the Exemplar Non-breaking Similarity (ENbS) problem (which is the dual of EBD). In Sect. 9.3, we review the approximability and fixed-parameter tractability for the Maximal Strip Recovery (MSR) problem and its complement, the Complementary Maximal Strip Recovery (CMSR) problem. In Sect. 9.4, we review the approximation results for the Scaffold Filling Problem, focusing on the One-sided Scaffold Filling Problem with Gene Repetitions. In Sect. 9.5, we list a set of open problems to conclude this paper.

2 The Exemplar Breakpoint Distance and Related Problems

As we have covered in the introduction, in the genome comparison and rearrangement area, a standard problem is to compute the number (i.e., genetic distance) and the actual sequence of genetic operations which converts a source genome to a target genome. This problem is important in evolutionary molecular biology as it gives some useful information on genome evolution. Typical genetic distances include edit [53], signed reversal [4, 38, 52, 57] and breakpoint [66], etc. In fact, the idea of signed reversal and, implicitly, breakpoint, was initiated as early as in 1926 by Sturtevant [61]. In the past years, conserved interval distance was also proposed to measure the similarity of multiple sequences of genes [8]. Interested readers are referred to [35] for a summary of the research performed in this area.

In genome rearrangement research, it is usually assumed that each gene appears in a genome exactly once. Under this assumption, the genome rearrangement problem is in essence the problem of comparing and sorting signed permutations [35, 38]. However, this assumption is very restrictive and is only justified in several small virus genomes. For example, this assumption does not hold on eukaryotic genomes where paralogous genes exist [55, 59]. So we have to handle this gene duplication problem.

David Sankoff first considered the problem of computing the breakpoint distance with duplicated genes. In [59], Sankoff proposed a way to select, from the duplicated copies of genes, the common ancestor gene such that the breakpoint distance between the reduced genomes (exemplar genomes) is minimized. The distance is called the exemplar breakpoint distance henceforth. A general branch-and-bound algorithm was also implemented in [59]. In [55], Nguyen, Tay and Zhang proposed to use a divide-and-conquer method to compute the exemplar breakpoint distance empirically.

On the theoretical side, it was shown that both of the problems of computing the signed reversal and breakpoint distances between exemplar genomes are NP-complete [14]. A few years ago, Blin and Rizzi further proved that computing the conserved interval distance between exemplar genomes is NP-complete [11]; moreover, it is NP-complete to compute the minimum conserved interval matching (i.e., without deleting the duplicated copies of genes). Starting in 2005, we showed much stronger inapproximability results for the exemplar breakpoint and conserved interval distance problems (even under a weaker model of approximation) [21, 24]. (In fact, a series of workshops was organized at the University of Texas-Pan American between 2005 and 2008, focusing on this topic.) While various exemplar genomic distances have been researched before, in this survey we will focus on the exemplar breakpoint distance. In fact, all the inapproximability results for the exemplar breakpoint distance hold for any other genomic distance d(−,−) satisfying that d(G,H)=0 implies G=H or G=−H.

2.1 Problem Definitions

In the genome comparison and rearrangement problem, we are given a set of genomes, each of which is a signed sequence of genes where the order of the genes corresponds to their positions on the linear chromosome and the signs indicate on which of the two DNA strands the genes are located. Here we interpret a genome as a set of such sequences (chromosomes), though we focus mostly on singleton genomes, i.e., a single sequence, in this paper. When the input genomes contain gene repetitions, Sankoff proposed a method to select an exemplar genome, by deleting redundant copies of a gene, such that in an exemplar genome any gene appears exactly once; moreover, the resulting exemplar genomes should have the property that a given genetic distance between them is minimized [59].

The following definitions closely follow those in [11]. Given n gene families (alphabet) \(\mathcal{F}\), a genome \(\mathcal{G}\) is a sequence of elements of \(\mathcal{F}\) such that each element carries a sign (+ or −). In general, we allow the repetition of a gene family in any genome. Each occurrence of a gene family is called a gene, though we will not try to distinguish a gene and a gene family if the context is clear. Given a genome with no repetition of any gene \(G=g_{1}g_{2}\cdots g_{m}\), we say that gene \(g_{i}\) immediately precedes \(g_{j}\) if j=i+1. Given genomes G,H (with no gene repetition), if gene a immediately precedes b in G and neither a immediately precedes b nor −b immediately precedes −a in H, then they constitute a breakpoint in G. The breakpoint distance is the number of breakpoints in G (symmetrically, it is the number of breakpoints in H), denoted as \({\rm bd}(G,H)\).
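The breakpoint definition above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a genome without gene repetitions is encoded as a list of nonzero integers whose sign gives the strand; the function name `breakpoint_distance` is hypothetical and not taken from the cited works.

```python
def breakpoint_distance(G, H):
    """Count the breakpoints in G with respect to H.

    G and H are assumed to be signed genomes without gene repetitions over
    the same gene set, encoded as lists of nonzero ints (sign = strand).
    """
    # Every ordered pair (a, b) such that a immediately precedes b in H,
    # together with its reversed and negated form (-b, -a).
    h_pairs = set()
    for a, b in zip(H, H[1:]):
        h_pairs.add((a, b))
        h_pairs.add((-b, -a))
    # A consecutive pair of G that is found in neither form is a breakpoint.
    return sum(1 for a, b in zip(G, G[1:]) if (a, b) not in h_pairs)


G = [1, 2, 3, 4, 5]
H = [1, -3, -2, 4, 5]   # the segment 2 3 of G appears reversed and negated
print(breakpoint_distance(G, H))  # 2: the pairs (1, 2) and (3, 4) are broken
```

In this small example the count is the same when the roles of G and H are swapped, in line with the symmetry noted in the definition.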

The number of occurrences of a gene g in a genome \(\mathcal{G}\) is called the cardinality of g in \(\mathcal{G}\), written as card(\(g,\mathcal{G}\)). A gene g in \(\mathcal{G}\) is called trivial if it has cardinality exactly 1; otherwise, it is called non-trivial. A genome \(\mathcal{G}\) is called r-repetitive, if all the genes from the same gene family appear at most r times in \(\mathcal{G}\). For example, \(\mathcal{G}=c-adc-bdeb\) is 2-repetitive.

Given a genome \(\mathcal{G}\) over \(\mathcal{F}\), an exemplar genome of \(\mathcal{G}\) is a genome G′ obtained from \(\mathcal{G}\) by deleting duplicated genes such that each gene family in \(\mathcal{G}\) appears exactly once in G′. For example, for \(\mathcal{G}=-bcaadag-e\) there are two exemplar genomes: −bcadg−e and −bcdag−e.
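As a sketch of this definition, the following Python fragment enumerates all exemplar genomes of a small genome by brute force. The encoding (a list of strings such as `"-b"` or `"c"`, where a leading minus marks the sign) and the name `exemplar_genomes` are assumptions made only for this illustration; the enumeration is exponential in general and is not an algorithm from the cited works.

```python
from itertools import product


def exemplar_genomes(genome):
    """Return the set of exemplar genomes of `genome`.

    `genome` is a list of signed genes written as strings, e.g. "-b" or "c".
    Exactly one occurrence of every gene family is kept, and the kept genes
    stay in their original order with their original signs.
    """
    positions = {}
    for index, gene in enumerate(genome):
        family = gene.lstrip("-")
        positions.setdefault(family, []).append(index)
    results = set()
    # Choose one occurrence per family; different choices may coincide.
    for choice in product(*positions.values()):
        results.add(tuple(genome[i] for i in sorted(choice)))
    return results


genome = ["-b", "c", "a", "a", "d", "a", "g", "-e"]
for exemplar in sorted(exemplar_genomes(genome)):
    print(" ".join(exemplar))
```

For this input, keeping the first or the second copy of a yields the same string, so only two distinct exemplar genomes remain.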

The Exemplar Breakpoint Distance (EBD) problem is defined as follows:

Instance: Genomes \(\mathcal{G}\) and \(\mathcal{H}\), each is of length O(m) and each covers n gene families (i.e., at least one gene from each of the n gene families appears in both \(\mathcal{G}\) and \(\mathcal{H}\)); integer K.

Question: Are there two respective exemplar genomes of \(\mathcal{G}\) and \(\mathcal{H}\), G and H, such that \({\rm bd}(G,H)\leq K\)?

2.2 Algorithmic Foundations

In the next subsection, we present some hardness results on the approximability and fixed-parameter tractability for EBD, namely, the hardness to compute or approximate the minimum value K in the above formulation. Here we give some standard definitions regarding approximation and FPT algorithms. Given a minimization (maximization) problem Π, let the optimal solution value of Π be \({\rm OPT}\). We say that an approximation algorithm \(\mathcal{A}\) provides a performance guarantee of α for Π if for every instance I of Π, the solution value returned by \(\mathcal{A}\) is at most \(\alpha\times {\rm OPT}\) (at least \({\rm OPT}/\alpha\)). Usually we say that \(\mathcal{A}\) is a factor-α approximation for Π. For the obvious reason, we are only interested in polynomial-time approximation algorithms. Readers are referred to [29, 34] for more details regarding the definitions related to approximation algorithms and NP-completeness.

As is also well known, an FPT algorithm for a decision problem with parameter k is an algorithm which solves the problem in \(O(f(k)n^{c})\) time, where f is any function depending only on k and c is some fixed constant not related to k. More details on FPT algorithms can be found in [32].

2.3 Hardness Results

In [21], we presented the first set of inapproximability results for the Exemplar Breakpoint Distance problem, given two genomes each containing only one sequence of genes drawn from n gene families. We showed that even if a gene appears at most three times, deciding whether the optimal exemplar breakpoint distance is zero, i.e., whether G=H, is NP-complete. It was left as an open problem whether the result holds when each gene appears at most twice in each of the input genomes [2, 21]. Recently, this open question was finally answered: the problem remains NP-complete even when each gene appears at most two times [13, 47]. Combining these results, we have the following inapproximability result.

Theorem 1

If both \(\mathcal{G}\) and \(\mathcal{H}\) are 2-repetitive genomes, then the Exemplar Breakpoint Distance problem does not admit any polynomial-time approximation (regardless of its approximation factor), unless P = NP.

Proof

If we view the Exemplar Breakpoint Distance problem as a minimization problem, then the result in [13], with an example presented at the end of this subsection, implies that deciding whether \({\rm OPT}=0\) is NP-complete (even if the input genomes are 2-repetitive). Let \(\mathcal{A}\) be any approximation algorithm for EBD with factor α. By definition, \(\mathcal{A}\) returns an approximation solution value \({\rm APP}\), with

$${\rm APP}\leq \alpha\times {\rm OPT}. $$

When \({\rm OPT}=0\), clearly \({\rm APP}\) must also satisfy \({\rm APP}=0\). In other words, \(\mathcal{A}\) would be able to solve the instance in [13] in polynomial time. This, however, contradicts the corresponding NP-completeness result (unless P = NP). □

Regarding the fixed-parameter intractability for EBD, we have the following theorem.

Theorem 2

If both \(\mathcal{G}\) and \(\mathcal{H}\) are 2-repetitive genomes, then the Exemplar Breakpoint Distance problem does not admit any FPT algorithm, unless P = NP.

Proof

Again, if we view the Exemplar Breakpoint Distance problem as a minimization problem, then the result in [13, 47] implies that deciding whether \({\rm OPT}=0\) is NP-complete (even if the input genomes are 2-repetitive). Let \(\mathcal{B}\) be any FPT algorithm for EBD which runs in \(O(f(k)n^{c})\) time. When \({\rm OPT}=k=0\), \(\mathcal{B}\) solves EBD in \(O(f(0)n^{c})=O(n^{c})\) time. In other words, \(\mathcal{B}\) would be able to solve the instance in [13] in polynomial time. This, again, contradicts the corresponding NP-completeness result, unless P = NP. □

On the other hand, it is worth pointing out that the reduction in [21, 24] is much simpler than the one in [13, 47]. As a matter of fact, it has been applied to show the NP-hardness of other problems in computational geometry [6], computational biology [67] and program download [58]. We show a simple example of this reduction.

Given a 3SAT formula \(\phi=F_{1}\wedge F_{2}\wedge F_{3}\wedge F_{4}\), where \(F_{1}=(x_{1}\vee \overline{x_{2}}\vee x_{3})\), \(F_{2}=(\overline{x_{1}}\vee x_{2}\vee \overline{x_{4}})\), \(F_{3}=(\overline{x_{2}}\vee \overline{x_{3}}\vee x_{4})\), and \(F_{4}=(x_{1}\vee \overline{x_{3}}\vee \overline{x_{4}})\), we want to find a truth assignment for ϕ. For each variable \(x_{i}\), define \(S_{i}\) (resp. \(S'_{i}\)) as the list of clauses containing \(x_{i}\) (resp. \(\overline{x_{i}}\)) followed by clauses containing \(\overline{x_{i}}\) (resp. \(x_{i}\)). So \(S_{1}=F_{1}F_{4}F_{2}\) and \(S'_{1}=F_{2}F_{1}F_{4}\), etc.

Then we construct two sequences \(\mathcal{G}=S_{1}g_{1}S_{2}g_{2}S_{3}g_{3}S_{4}\), \(\mathcal{H}=S'_{1}g_{1}S'_{2}g_{2}S'_{3}g_{3}S'_{4}\), where the \(g_{j}\)’s are peg genes, each appearing only once. Each gene appears at most three times as each clause contains three literals. The truth assignment can be set as follows: if \(x_{i}\)=TRUE, then keep the clauses in \(S_{i}\) and \(S'_{i}\) which contain \(x_{i}\); if \(x_{i}\)=FALSE, then keep the clauses in \(S_{i}\) and \(S'_{i}\) which contain \(\overline{x_{i}}\). If there are still duplicated clauses after this, then keep one such clause and delete the remaining ones arbitrarily. Regarding the above example, we can have \(x_{1}=x_{3}\)=TRUE, \(x_{2}=x_{4}\)=FALSE. So the corresponding exemplar genomes obtained are \(G=H=F_{4}g_{1}F_{3}g_{2}F_{1}g_{3}F_{2}\), whose breakpoint distance is zero.
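The construction in this example can be replayed with a short Python sketch. The clause encoding (positive integer i for the variable with index i, negative for its negation), the helper names, and the brute-force search for a common exemplar genome are all assumptions introduced here for illustration; they are not the procedure used in [21, 24]. Since all genes in this example are positive, a breakpoint distance of zero amounts to the two exemplar genomes being identical.

```python
from itertools import product

# Clauses of the example formula; literal i stands for x_i, -i for its negation.
clauses = {
    "F1": (1, -2, 3),
    "F2": (-1, 2, -4),
    "F3": (-2, -3, 4),
    "F4": (1, -3, -4),
}
NUM_VARS = 4


def block(var, positive_first):
    """Clauses containing x_var and clauses containing its negation, in the requested order."""
    pos = [name for name, lits in clauses.items() if var in lits]
    neg = [name for name, lits in clauses.items() if -var in lits]
    return pos + neg if positive_first else neg + pos


def build(positive_first):
    """Concatenate the blocks of all variables, separated by the peg genes g1, g2, g3."""
    genome = []
    for var in range(1, NUM_VARS + 1):
        genome += block(var, positive_first)
        if var < NUM_VARS:
            genome.append(f"g{var}")
    return genome


def exemplars(genome):
    """All exemplar genomes, obtained by keeping one occurrence per gene."""
    positions = {}
    for index, gene in enumerate(genome):
        positions.setdefault(gene, []).append(index)
    return {tuple(genome[i] for i in sorted(choice))
            for choice in product(*positions.values())}


G = build(positive_first=True)    # S_1 g_1 S_2 g_2 S_3 g_3 S_4
H = build(positive_first=False)   # S'_1 g_1 S'_2 g_2 S'_3 g_3 S'_4
common = exemplars(G) & exemplars(H)
print(("F4", "g1", "F3", "g2", "F1", "g3", "F2") in common)  # True
```

The exhaustive enumeration is only feasible because the instance is tiny; it is meant to confirm the worked example, not to suggest an efficient method.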

In different applications, the \(F_{i}\)’s and \(g_{j}\)’s can be constructed to fit the corresponding problems, for instance as geometric points [6, 67] or programs to be downloaded [58].

2.4 The Complement Problem—ENbS

We comment that the negative results in Sect. 9.2.3 hold for any genomic distance d(−,−) satisfying that d(G,H)=0 implies G=H or G=−H. This, of course, implies that all the exemplar genomic distance problems (like exemplar reversal, exemplar transposition, and exemplar conserved interval distances) do not admit any polynomial-time approximation algorithms or any FPT algorithm, unless P = NP.

There have been two ways to handle this problem. One is to use a weak model of approximation, which will be covered in relation to the open problems in Sect. 9.5. The other is to use a different similarity measure. In this case, one would try to maximize a certain similarity measure. The most notable of such measures include non-breaking similarity (or number of adjacencies) [23] and the number of common intervals [12]. (A common interval is a pair of substrings appearing in the two genomes with the same genes, but possibly in different orders. For example, for G=abced and H=deacb, (abc,acb) is a length-3 common interval.) We will focus on the non-breaking similarity, which is really the complement of the breakpoint distance.
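To make the notion of a common interval concrete, here is a minimal Python sketch that lists all common intervals of two unsigned genomes without gene repetitions. The function name `common_intervals`, the string encoding of genomes, and the brute-force enumeration are assumptions of this illustration only.

```python
def common_intervals(G, H, min_len=2):
    """List pairs (substring of G, substring of H) that contain the same genes.

    G and H are assumed to be strings over the same gene set in which every
    gene occurs exactly once, so a substring is determined by its gene set.
    """
    found = []
    for length in range(min_len, len(G) + 1):
        # Index the windows of H of this length by their gene set.
        h_windows = {}
        for j in range(len(H) - length + 1):
            window = H[j:j + length]
            h_windows[frozenset(window)] = window
        for i in range(len(G) - length + 1):
            window = G[i:i + length]
            match = h_windows.get(frozenset(window))
            if match is not None:
                found.append((window, match))
    return found


for pair in common_intervals("abced", "deacb"):
    print(pair)
```

On the genomes of the example above, the output includes the pair ("abc", "acb") mentioned in the text, next to shorter and longer common intervals.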

For two exemplar genomes G and H over the same alphabet of size n, recall that a breakpoint in G is a two-gene substring \(g_{i}g_{i+1}\) such that neither \(g_{i}g_{i+1}\) nor \((-g_{i+1})(-g_{i})\) is a substring in H. A non-breaking point (or an adjacency) is a common two-gene substring \(g_{i}g_{i+1}\) that appears either as \(g_{i}g_{i+1}\) or as \((-g_{i+1})(-g_{i})\) in G and H. The number of non-breaking points between G and H is also called the non-breaking similarity between G and H, denoted as \({\rm nbs}(G,H)\). Clearly, we have \({\rm nbs}(G,H)+{\rm bd}(G,H)=n-1\). For two genomes \(\mathcal{G}\) and \(\mathcal{H}\), their exemplar non-breaking similarity \({\rm enbs}(\mathcal{G},\mathcal{H})\) is the maximum \({\rm nbs}(G,H)\), where G and H are exemplar genomes derived from \(\mathcal{G}\) and \(\mathcal{H}\). Again we have \({\rm enbs}(\mathcal{G},\mathcal{H})+{\rm ebd}(\mathcal{G},\mathcal{H})=n-1\).

The Exemplar Non-breaking Similarity (ENbS) problem is formally defined as follows:

Instance: Genomes \(\mathcal{G}\) and \(\mathcal{H}\), each is of length O(m) and each covers n gene families (i.e., at least one gene from each of the n gene families appears in both \(\mathcal{G}\) and \(\mathcal{H}\)); integer K.

Question: Are there two respective exemplar genomes of \(\mathcal{G}\) and \(\mathcal{H}\), G and H, such that the non-breaking similarity between them is at least K?

We have the following negative results which have been proved in [23, 26].

Theorem 3

If one of \(\mathcal{G}\) and \(\mathcal{H}\) is exemplar and the other is 2-repetitive, then the Exemplar Non-breaking Similarity problem does not admit any factor-\(n^{0.5-\epsilon}\) polynomial-time approximation unless NP = ZPP.

Proof

We give a sketch of the proof from [23, 26]. In [23, 26], it was shown that Independent Set can be linearly reduced to ENbS; i.e., the input graph has an independent set of size k iff the constructed ENbS instance has a non-breaking similarity (or number of adjacencies) equal to k. As Independent Set cannot be approximated within a factor of \(|V|^{1-\epsilon}\) unless NP = ZPP [39] and as in the reduction we use \(\Theta(|V|^{2})\) genes (where |V| is the number of vertices in the input graph), the theorem follows. □
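The step from the exponent 1−ϵ for Independent Set to the exponent 0.5−ϵ for ENbS can be spelled out as follows. This is only a sketch of the arithmetic, writing n for the number of genes in the constructed instance and introducing the auxiliary symbol ϵ′, which does not appear in the original proof:

$$n=\Theta\bigl(|V|^{2}\bigr)\ \Longrightarrow\ |V|=\Theta\bigl(n^{1/2}\bigr)\ \Longrightarrow\ |V|^{1-\epsilon}=\Theta\bigl(n^{(1-\epsilon)/2}\bigr)=\Theta\bigl(n^{0.5-\epsilon'}\bigr),\qquad \epsilon'=\epsilon/2. $$

Since the reduction preserves the solution value, an approximation factor expressed in n for ENbS translates, up to constants, into a factor expressed in |V| for Independent Set with the exponent doubled.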

In [26], a factor-\(O(\sqrt{n})\) approximation was presented for ENbS, showing that the above inapproximability result is tight.

Theorem 4

If one of \(\mathcal{G}\) and \(\mathcal{H}\) is exemplar and the other is 2-repetitive, the Exemplar Non-breaking Similarity problem does not admit an FPT algorithm unless FPT = W[1].

Proof

It is noted that the reduction from Independent Set to ENbS in [23, 26] is in fact an FPT reduction. As Independent Set is W[1]-complete [32], the theorem simply follows. □

In fact, with the lower bound results proved in [18], Independent Set (hence ENbS) cannot be solved in \(O(f(k)n^{o(k)})\) time even if k is bounded by an arbitrarily small function of n, unless ETH fails. (ETH, the Exponential Time Hypothesis, states that 3SAT cannot be solved in subexponential time.)

In the next section, we will survey another problem initiated by David Sankoff on computing syntenic blocks from genetic maps.

3 Maximal Strip Recovery and Its Complement

In a genome or physical map, the distance between two genes is exact. This is different in a genetic map, where only the relative positions between gene markers along chromosomes are indicated. A genetic map is usually constructed from DAGs (Directed Acyclic Graphs) which represent the partial order of gene markers. We omit the construction of genetic maps and interested readers are referred to [10, 68]. It should be noted that in a genetic map all the gene markers are distinct.

Given two genetic maps G and H represented by a sequence of n gene markers, a strip (syntenic block) is a sequence of distinct markers of length at least two which appear as subsequences in both of the input maps, either directly or in reversed and negated form. The problem Maximal Strip Recovery (MSR) is to find two subsequences G′ and H′ of G and H, respectively, such that the total length of disjoint strips in G′ and H′ is maximized. An example is as follows: G=abcdefgh, H=hgfcbdae and the optimal solution is G′=cdefg and H′=−gfcde, each containing two syntenic blocks cde and fg.

The MSR problem was proposed to handle the elimination of noise and ambiguities in genetic maps. This is related to the well-known problem in comparative genomics of decomposing two given genomes into syntenic blocks, i.e., segments of chromosomes which are deemed to be homologous in the two input genomes. In 2007, a heuristic method was proposed to handle the MSR problem [27, 70]. In [25], a factor-4 polynomial-time approximation algorithm was proposed for the problem. This was done by applying the Maximum Weight Independent Set on 2-interval graphs, which admits a factor-4 approximation [5]. We also proved that several close variants of MSR, namely MSR-d (with d>2 input maps), MSR-DU (with marker duplications), and MSR-WT (with markers weighted), are all NP-complete. It was left as an open problem whether MSR itself can be solved in polynomial time or is NP-complete [25].

Recently, in [65] we showed that MSR is in fact NP-complete, via a polynomial-time reduction from One-in-Three 3SAT (which was shown to be NP-complete in [34, 60]). We summarize the results in [25, 65] as follows.

Theorem 5

MSR is NP-complete, and it admits a factor-4 polynomial-time approximation.

As an effort to solve the MSR problem practically, we tried to handle MSR by solving its complement (CMSR) with FPT algorithms, i.e., showing that CMSR is fixed-parameter tractable [65]. Note that CMSR is a minimization problem where one deletes some markers such that the remaining ones in the genetic maps all belong to some syntenic blocks. With the previous example G=abcdefgh and H=hgfcbdae, the optimal CMSR solution is to delete markers a,b,h.

Let k be the minimum number of markers deleted in some optimal solution of CMSR. The running times of the known algorithms are \(O(3^{k}n+n^{2})\) [43] and \(O(2.36^{k}n+n^{2})\) [15]. In [45], we proved an 18k parameterized search space for CMSR and subsequently obtained a linear kernel (the actual size should be 78k, slightly better than in the conference version). Combining all these results, we have the following theorem.

Theorem 6

Let k be the optimal number of gene markers deleted from the input genetic maps. CMSR can be solved in \(O(2.36^{k}k+n^{2})\) time; i.e., CMSR is fixed-parameter tractable.

Note that as k is typically greater than 50 in real datasets, our FPT algorithms are not yet practical.

At the same time, approximation algorithms have been presented for CMSR in the last couple of years. In [43], a factor-3 approximation was presented. The current best approximation factor is 2.33 [50]. Further improvement of approximation and FPT algorithms for CMSR remains open.

In the next section, we will survey the scaffold filling problem, again initiated by David Sankoff. Due to the technical difficulty of handling breakpoints and adjacencies in sequences (which was not completely given in [44]), this time we focus more on the details.

4 Approximation for Scaffold Filling with Gene Duplications

With respect to a target singleton genome, possibly with gene repetitions, a scaffold is simply an incomplete sequence. It was found that most of the sequenced genomes are in fact in the form of scaffolds. Muñoz et al. first formulated the problem of filling an incomplete scaffold H into H′, using a reference genome G, such that a certain genomic distance between H′ and G is minimized [54]. More specifically, they showed that, for multichromosomal genomes, this (one-sided) scaffold filling problem under the DCJ distance is polynomially solvable. David Sankoff visited Montana State University in early 2010 and gave a talk on this topic. We then started to collaborate by showing that for singleton genomes without gene repetitions, under the breakpoint distance, even the two-sided scaffold filling problem (i.e., both G,H are incomplete scaffolds or permutations) is polynomially solvable [40]. This result was then generalized to multichromosomal genomes under the DCJ distance [44].

When genomes contain some duplicated genes, the scenario is completely different. There are three general criteria (or distances) to measure the similarity of genomes: the exemplar genomic distance [59], the minimum common string partition (MCSP) distance [30] and the maximum number of common string adjacencies [2, 41, 44]. Unfortunately, as covered in Sect. 9.2, unless P = NP, there does not exist any polynomial-time approximation (regardless of the factor) for computing the exemplar genomic distance even when each gene is allowed to repeat three times [21, 24] or even two times [13, 47]. The MCSP problem is NP-complete even if each gene repeats at most two times [36] and the best known approximation factor for the general problem is \(O(\log n\log^{*} n)\) [30]. Based on the maximum number of common string adjacencies, Jiang et al. proved that the one-sided scaffold filling problem is also NP-complete, and designed a 1.33-approximation algorithm with a greedy strategy [41, 44]. As some of the details on handling breakpoints/adjacencies for sequences are missing in [44], we try to present the complete solution here. We comment that handling breakpoints/adjacencies for permutations is much easier.

4.1 Preliminaries

We first revisit some necessary definitions, which are also given in [44], though not in a perfect way. (Also, note that the breakpoint and adjacency definitions are more general than those in Sect. 9.2, which only handle permutations.) We assume that all genes and genomes are unsigned; it is straightforward to generalize the results to signed genomes. Given a gene set Σ, a string P is called a permutation if each element in Σ appears exactly once in P. We use c(P) to denote the set of elements in permutation P. A string A is called a sequence if some genes appear more than once in A, and c(A) denotes the genes of A, which is a multi-set of elements in Σ. For example, Σ={a,b,c,d}, A=abcdacd, c(A)={a,a,b,c,c,d,d}. A scaffold is an incomplete sequence, typically obtained by some sequencing and assembling process. A substring with m genes (in a sequence) is called an m-substring, and a 2-substring is also called a pair. As the genes are unsigned, the relative order of the two genes of a pair does not matter, i.e., the pair xy is equal to the pair yx. Given a scaffold \(A=a_{1}a_{2}a_{3}\cdots a_{n}\), let \(P_{A}=\{a_{1}a_{2},a_{2}a_{3},\ldots,a_{n-1}a_{n}\}\) be the set of pairs in A.

Definition 1

Given two scaffolds \(A=a_{1}a_{2}\cdots a_{n}\) and \(B=b_{1}b_{2}\cdots b_{m}\), if \(a_{i}a_{i+1}=b_{j}b_{j+1}\) (or \(a_{i}a_{i+1}=b_{j+1}b_{j}\)), where \(a_{i}a_{i+1}\in P_{A}\) and \(b_{j}b_{j+1}\in P_{B}\), we say that \(a_{i}a_{i+1}\) and \(b_{j}b_{j+1}\) are matched to each other. In a maximum matching of pairs in \(P_{A}\) and \(P_{B}\), a matched pair is called an adjacency, and an unmatched pair is called a breakpoint in A and B, respectively.

It follows from the definition that scaffolds A and B contain the same set of adjacencies but distinct breakpoints. The maximum matched pairs in B (or equally, in A) form the adjacency set between A and B, denoted as a(A,B). We use \(b_{A}(A,B)\) and \(b_{B}(A,B)\) to denote the set of breakpoints in A and B, respectively. A gene is called a bp-gene if it appears in a breakpoint. A maximal substring T of A (or B) is called a bp-string if each pair in it is a breakpoint. The leftmost and rightmost genes of a bp-string T are called the end-genes of T; the other genes in T are called the mid-genes of T. We illustrate the above definitions in Fig. 9.1.

Fig. 9.1 An example for adjacency, breakpoint and the related definitions

Given two scaffolds \(A=a_{1}a_{2}\cdots a_{n}\) and \(B=b_{1}b_{2}\cdots b_{m}\), as we can see, each gene except the four ending ones is involved in two adjacencies or two breakpoints or one adjacency and one breakpoint. To get rid of this imbalance, we add “#” to both ends of A and B, which fixes a small bug in [41, 44]. From now on, we assume that \(A=a_{0}a_{1}\cdots a_{n}a_{n+1}\) and \(B=b_{0}b_{1}\cdots b_{m}b_{m+1}\), where \(a_{0}=a_{n+1}=b_{0}=b_{m+1}=\#\).
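The adjacency and breakpoint sets of two unsigned scaffolds can be computed with a few lines of Python. This is a sketch under stated assumptions: scaffolds are plain strings that do not contain the character "#", pairs are stored as sorted tuples so that xy and yx coincide, and the maximum matching is obtained as a multiset intersection, which suffices here because two pairs can only be matched when they are equal as unordered pairs. The function names are hypothetical.

```python
from collections import Counter


def pair_multiset(scaffold):
    """Multiset of unordered pairs of the scaffold after adding '#' to both ends."""
    padded = "#" + scaffold + "#"
    return Counter(tuple(sorted(pair)) for pair in zip(padded, padded[1:]))


def adjacencies_and_breakpoints(A, B):
    """Return (a(A,B), breakpoints in A, breakpoints in B) as multisets of pairs."""
    pairs_a, pairs_b = pair_multiset(A), pair_multiset(B)
    # Each pair type can be matched min(count in A, count in B) times.
    adjacencies = pairs_a & pairs_b
    return adjacencies, pairs_a - adjacencies, pairs_b - adjacencies


adj, bp_a, bp_b = adjacencies_and_breakpoints("abcdacd", "acdbadc")
print(sum(adj.values()))      # 6 adjacencies
print(sorted(bp_a.elements()))  # [('#', 'd'), ('b', 'c')]
print(sorted(bp_b.elements()))  # [('#', 'c'), ('b', 'd')]
```

Both scaffolds have eight pairs after padding, so the six shared pairs leave two breakpoints on each side.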

For a sequence A and a multi-set of elements X, let A+X be the set of all possible resulting sequences after filling all the elements in X into A. Now, we define the problems we study in this paper formally.

Definition 2

Scaffold Filling to Maximize the Number of (String) Adjacencies (SF-MNSA).

Input: Two scaffolds A and B over a gene set Σ and two multi-sets of elements X and Y, where X=c(B)−c(A) and Y=c(A)−c(B).

Question: Find \(A'\in A+X\) and \(B'\in B+Y\) such that \(|a(A',B')|\) is maximized.

The one-sided SF-MNSA problem is a special instance of the SF-MNSA problem where one of X and Y is empty.

Definition 3

One-sided SF-MNSA.

Input: A complete sequence G and an incomplete scaffold I over a gene set Σ, a multi-set X=c(G)−c(I)≠∅ with c(I)−c(G)=∅.

Question: Find \(I'\in I+X\) such that \(|a(I',G)|\) is maximized.

Note that while the two-sided SF-MNSA problem is more general and more difficult, the One-sided SF-MNSA problem is more practical, as many genome analyses are based on some reference genome [54].
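For very small instances, the One-sided SF-MNSA problem can be explored by exhaustive search. The Python sketch below is only meant to make the definition tangible: it enumerates every sequence obtained by inserting the missing genes into the scaffold and keeps one with the most adjacencies. The function names, the string encoding of scaffolds (without the character "#"), and the brute-force strategy are assumptions of this illustration; this is not the approximation algorithm discussed in this section, and its running time is exponential.

```python
from collections import Counter


def pair_multiset(scaffold):
    """Multiset of unordered pairs of the scaffold after adding '#' to both ends."""
    padded = "#" + scaffold + "#"
    return Counter(tuple(sorted(pair)) for pair in zip(padded, padded[1:]))


def adjacency_count(A, B):
    """Size of a maximum matching between the pairs of A and the pairs of B."""
    return sum((pair_multiset(A) & pair_multiset(B)).values())


def fillings(I, X):
    """All distinct sequences obtained by inserting every gene of the list X into I."""
    results = {I}
    for gene in X:
        results = {s[:pos] + gene + s[pos:]
                   for s in results for pos in range(len(s) + 1)}
    return results


def best_filling(G, I):
    """Return a filled scaffold with a maximum number of adjacencies with G."""
    # One-sided setting: every gene of I must also be available in G.
    assert not (Counter(I) - Counter(G)), "c(I) - c(G) must be empty"
    X = list((Counter(G) - Counter(I)).elements())
    return max(fillings(I, X), key=lambda s: adjacency_count(s, G))


G = "abcabd"
filled = best_filling(G, "abd")
print(adjacency_count(filled, G))  # 7, i.e. every pair of the padded G is matched
```

Here the scaffold abd is a subsequence of G, so G itself is among the candidate fillings and all seven pairs can be turned into adjacencies; for a scaffold such as dba this is no longer guaranteed.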

We now list a few basic properties of this problem.

Lemma 1

Let G and I be the input of an instance of the One-sided SF-MNSA problem, and x be any gene which appears the same number of times in G and I. If x does not constitute any breakpoint in G (resp. I), then it also does not constitute any breakpoint in I (resp. G).

Proof

W.L.O.G., assume that x appears q times in I and G, respectively. Also, assume that there are \(q_{1}\) adjacencies in the form “xx”, and \(q_{2}\) adjacencies in the form “xy” (y≠x) in G. In G, each copy of x is involved in two adjacencies, one on its left and one on its right, but two neighboring x’s share the adjacency “xx”, so the total number of adjacencies containing x is \(2q-q_{1}\). Hence \(2q-q_{1}=q_{1}+q_{2}\), which implies \(2q-2q_{1}=q_{2}\).

In the scaffold I, there must be at least \(q_{1}\) “xx” adjacencies. As x appears only q times, x has 2q neighbors, of which at least \(2q_{1}\) are x’s. So x has at most \(2q-2q_{1}\) neighbors which are not x, which means that there are at most \(2q-2q_{1}\) (\(=q_{2}\)) pairs in the form “xy” (y≠x) in I. Since there are \(q_{2}\) “xy” (y≠x) adjacencies in G, there must be \(q_{2}\) “xy” (y≠x) adjacencies in I. Therefore, there are exactly \(q_{1}\) adjacencies in the form “xx”, and all the \(q_{2}\) pairs in the form “xy” (y≠x) are adjacencies in I, and none of them is a breakpoint. □

Lemma 2

Let G and I be the input of an instance of the One-sided SF-MNSA problem, let bp(I) and bp(G) be the multi-set of bp-genes in I and G, respectively. Then any gene in bp(G) appears in bp(I)∪X, and bp(I)⊆bp(G).

Proof

Assume to the contrary that there exists a gene x with \(x\in bp(G)\) but \(x\notin bp(I)\cup X\). Since \(x\notin X\), x appears the same number of times in G and I; moreover, since \(x\notin bp(I)\), all the pairs in I containing x are adjacencies. From Lemma 1, all the pairs involving x in G are adjacencies, contradicting the assumption that \(x\in bp(G)\). So any gene in bp(G) appears in bp(I)∪X. By a similar argument, we can prove bp(I)⊆bp(G). □

Each breakpoint contains two genes. From what we discussed in Lemma 2, every breakpoint in the complete sequence G belongs to one of the following three multi-sets, according to the affiliation of its two bp-genes.

\(\mathit{BP}_{1}(G)\): breakpoints with one bp-gene in X and the other bp-gene not in X.

\(\mathit{BP}_{2}(G)\): breakpoints with both of the bp-genes in X.

\(\mathit{BP}_{3}(G)\): breakpoints with both of the bp-genes not in X.

An example is shown in Fig. 9.2.

Fig. 9.2 Classification of the breakpoints
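The classification can also be sketched in a few lines of Python. This is only an illustration under two assumptions that are not spelled out in this section: breakpoints of G are taken to be the neighboring pairs of G left over after removing, as a multiset of unordered pairs, the pairs that also occur in I, and a bp-gene counts as "in X" when it occurs at least once in X. The function names are hypothetical.

```python
from collections import Counter

def pair_counts(seq):
    """Multiset of unordered neighboring pairs of a sequence."""
    return Counter(tuple(sorted(p)) for p in zip(seq, seq[1:]))

def classify_breakpoints(G, I):
    """Split the breakpoints of G (relative to I) into BP1, BP2 and BP3."""
    X = set(Counter(G) - Counter(I))                # genes missing from I
    breakpoints = pair_counts(G) - pair_counts(I)   # pairs of G left unmatched
    bp = {1: [], 2: [], 3: []}
    for u, v in breakpoints.elements():
        genes_in_x = (u in X) + (v in X)
        # one gene in X -> BP1, both -> BP2, none -> BP3
        bp[{1: 1, 2: 2, 0: 3}[genes_in_x]].append(u + v)
    return bp

bp = classify_breakpoints("#daebxceafceb1234#", "#dafcxb1324#")
print(bp[1], sorted(bp[2]), sorted(bp[3]))
```

Pairs are reported in a normalized order, so a breakpoint written “eb” in the text appears as `be` in the output.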

4.2 Approximation Algorithm for One-Sided SF-MNSA

In this subsection, we present a 4/3-approximation algorithm for the one-sided SF-MNSA problem. The goal is to obtain as many adjacencies as possible while inserting the genes of X into the scaffold I. No matter in what order the genes are inserted, they appear in groups in the final I′∈I+X, so we can consider that I′ is obtained by inserting strings (composed of genes of X) into I.

Obviously, inserting a string of length one (i.e., a single gene) will generate at most two adjacencies, and inserting a string of length m will generate at most m+1 adjacencies. Therefore, we will have two types of inserted strings.

  1. Type-1: a string of k missing genes \(x_{1},x_{2},\ldots,x_{k}\) is inserted in between \(y_{i}y_{i+1}\) in the scaffold I to obtain k+1 adjacencies (i.e., \(y_{i}x_{1}\), \(x_{1}x_{2}\), …, \(x_{k-1}x_{k}\), \(x_{k}y_{i+1}\)), where \(y_{i}y_{i+1}\) is a breakpoint.

    In this case, \(x_{1}x_{2}\cdots x_{k}\) is called a k-Type-1 string, \(y_{i}y_{i+1}\) is called a dock, and we also say that \(y_{i}y_{i+1}\) docks the corresponding k-Type-1 string \(x_{1}x_{2}\cdots x_{k}\).

  2. Type-2: a sequence of l missing genes \(z_{1},z_{2},\ldots,z_{l}\) is inserted in between \(y_{j}y_{j+1}\) in the scaffold I to obtain l adjacencies (i.e., \(y_{j}z_{1}\) or \(z_{l}y_{j+1}\), \(z_{1}z_{2}\), …, \(z_{l-1}z_{l}\)), where \(y_{j}y_{j+1}\) is a breakpoint; or a sequence of l missing genes \(z_{1},z_{2},\ldots,z_{l}\) is inserted in between \(y_{j}y_{j+1}\) in the scaffold I to obtain l+1 adjacencies (i.e., \(y_{j}z_{1}\), \(z_{1}z_{2}\), …, \(z_{l-1}z_{l}\), \(z_{l}y_{j+1}\)), where \(y_{j}y_{j+1}\) is an adjacency.

This is the basic observation for devising our algorithm. Most of our work is devoted to searching the Type-1 strings.
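The two types can be told apart by the net gain of an insertion: a Type-1 string of length k gains k+1 adjacencies, while a Type-2 string of length l gains l (in the second Type-2 case one existing adjacency is destroyed). The Python sketch below checks this by brute force. It assumes that common adjacencies are counted as the multiset intersection of unordered neighboring pairs; the function names are hypothetical.

```python
from collections import Counter

def pair_counts(seq):
    """Multiset of unordered neighboring pairs of a sequence."""
    return Counter(tuple(sorted(p)) for p in zip(seq, seq[1:]))

def adjacencies(G, I):
    """Number of common adjacencies (size of the multiset intersection)."""
    return sum((pair_counts(G) & pair_counts(I)).values())

def insertion_type(G, I, j, s):
    """Classify inserting string s between I[j] and I[j+1] by its net gain."""
    filled = I[:j + 1] + s + I[j + 1:]
    gain = adjacencies(G, filled) - adjacencies(G, I)
    if gain == len(s) + 1:
        return gain, "Type-1"
    if gain == len(s):
        return gain, "Type-2"
    return gain, "neither"

# Inserting b into the breakpoint "ac" creates the adjacencies "ab" and "bc".
print(insertion_type("#abc#", "#ac#", 1, "b"))
```

A gain of len(s)+1 is only possible when the chosen position is a breakpoint, so the dock condition of Type-1 does not need a separate check in this sketch.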

Searching the 1-Type-1 Strings

To identify the 1-Type-1 strings, we use a greedy method. For each gene \(x_{i}\) of X and each breakpoint \(y_{j}y_{j+1}\) of \(b_{I}(I,G)\), if we can obtain two adjacencies by inserting \(x_{i}\) in between \(y_{j}y_{j+1}\), then we insert \(x_{i}\) in between \(y_{j}y_{j+1}\).

figure a

Searching the 2-Type-1 Strings

To identify the 2-Type-1 strings, we again use a greedy method. For each pair of missing genes \(x_{i}x_{k}\), if we can obtain three adjacencies by inserting \(x_{i}x_{k}\) in between \(y_{j}y_{j+1}\), where \(y_{j}y_{j+1}\in b_{I}(I,G)\), then we insert \(x_{i}x_{k}\) in between \(y_{j}y_{j+1}\).

figure b
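Both greedy searches can be sketched with one brute-force routine. The Python code below is not the pseudocode of the figures, which is not reproduced here; it is a minimal illustration that tries every string of k missing genes at every position and keeps an insertion whenever the number of common adjacencies grows by k+1. Adjacencies are assumed to be counted as the multiset intersection of unordered neighboring pairs, and the name `greedy_type1` is hypothetical.

```python
from collections import Counter
from itertools import permutations

def pair_counts(seq):
    return Counter(tuple(sorted(p)) for p in zip(seq, seq[1:]))

def adjacencies(G, I):
    return sum((pair_counts(G) & pair_counts(I)).values())

def greedy_type1(G, I, X, k):
    """Greedily insert k-gene strings from multiset X that gain k+1 adjacencies."""
    X = Counter(X)
    found = True
    while found:
        found = False
        base = adjacencies(G, I)
        # All ordered selections of k genes from the remaining multiset.
        for s in sorted(set(permutations(X.elements(), k))):
            for j in range(len(I) - 1):
                filled = I[:j + 1] + "".join(s) + I[j + 1:]
                if adjacencies(G, filled) - base == k + 1:
                    I, found = filled, True
                    X -= Counter(s)        # the inserted genes are used up
                    break
            if found:
                break
    return I, X

G = "#abcd#"
I1, X1 = greedy_type1(G, "#ad#", "bc", 1)   # no single gene gains two adjacencies
I2, X2 = greedy_type1(G, I1, X1, 2)         # the pair "bc" gains three
print(I1, I2)
```

The routine recomputes all adjacencies for each trial, which is wasteful but keeps the sketch short; it runs in polynomial time for fixed k.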

Inserting the Remaining Genes

In this subsection, we present a polynomial-time algorithm guaranteeing that the number of adjacencies increases by at least the number of genes inserted. A general idea of this algorithm was mentioned in [44], with many details missing; we present the details here.

Given the complete sequence G and the scaffold I, as we discussed in Sect. 9.4.1, the breakpoints in G can be divided into three sets: \(\mathit{BP}_{1}(G)\), \(\mathit{BP}_{2}(G)\), and \(\mathit{BP}_{3}(G)\). In any case, the breakpoints in \(\mathit{BP}_{3}(G)\) cannot be converted into adjacencies, so we try to convert the breakpoints in \(\mathit{BP}_{1}(G)\) and \(\mathit{BP}_{2}(G)\) into adjacencies.

Lemma 3

If \(\mathit{BP}_{1}(G)\neq\emptyset\), then there exists a breakpoint in I such that inserting some gene of X into it increases the number of adjacencies by one.

Proof

Let \(t_{i}t_{i+1}\) be a breakpoint in G satisfying \(t_{i}t_{i+1}\in\mathit{BP}_{1}(G)\), \(t_{i}\in X\), and, from Lemma 2, \(t_{i+1}\in bp(I)\). Then, there exists a breakpoint \(t_{i+1}s_{j}\) or \(s_{k}t_{i+1}\) in I. Hence, if we insert \(t_{i}\) in between that breakpoint, we will obtain a new adjacency \(t_{i}t_{i+1}\) without affecting any other adjacency. □

Thus, it is trivial to obtain one more adjacency whenever \(\mathit{BP}_{1}(G)\neq\emptyset\).

Lemma 4

For any \(x\in X\cap c(I)\), if there is an “xx” breakpoint in G, then after inserting x in between some “xy” pair in I, the number of adjacencies increases by one.

Proof

If “xy” is a breakpoint, then after inserting an ‘x’ in between it, we obtain a new adjacency “xx”. If “xy” is an adjacency, then after inserting an ‘x’ in between it, we have “xxy”. The adjacency “xy” still exists, and we obtain a new adjacency “xx”. □

Lemma 5

If there is a breakpoint “xy” in \(\mathit{BP}_{2}(G)\) and a breakpoint “xz” (resp. “yz”) in I, then after inserting y (resp. x) in between “xz” (resp. “yz”) in I, the number of adjacencies increases by one.

Proof

From the definition of \(\mathit{BP}_{2}(G)\), we know that \(x,y\in X\). Since “xy” is a breakpoint in G and “xz” is a breakpoint in I, we obtain a new adjacency “xy” by inserting y in between “xz”, without affecting any other adjacency. A similar argument for inserting x in between “yz” also holds. □

Next, we show that the following case is polynomially solvable. This case satisfies the following conditions.

  1. \(\mathit{BP}_{1}(G)=\emptyset\);

  2. G does not contain a breakpoint of the form “xx” unless \(x\notin X\cap c(I)\);

  3. For any breakpoint of the form “xy” in \(\mathit{BP}_{2}(G)\), all the pairs in I involving x or y are adjacencies.

Let \(\mathit{BS}_{2}(G)\) be the set of bp-strings in G with all breakpoints belonging to \(\mathit{BP}_{2}(G)\).

Lemma 6

In the case satisfying (1), (2) and (3), the number of times a gene appears as an end-gene of some bp-string of \(\mathit{BS}_{2}(G)\) is even.

Proof

Let gene \(x=t_{i}\) be an end-gene of some bp-string \(t_{i}t_{i+1}\cdots t_{j}\) of \(\mathit{BS}_{2}(G)\). Since \(\mathit{BP}_{1}(G)=\emptyset\) and x will not be involved in any breakpoint of \(\mathit{BP}_{3}(G)\), \(t_{i-1}t_{i}\) must be an adjacency. Assume that x appears q times in G and \(q'\) (\(<q\)) times in I. As there is no breakpoint of the form “xx” in G and I, we can assume that there are \(q_{1}\) adjacencies of the form “xx” in G and I. Then, the total number of pairs (adjacencies and breakpoints) involving x in G is \(2(q-q_{1})\), of which \(2(q'-q_{1})\) are adjacencies. So the number of breakpoints involving x in G is \(2(q-q')\), which is even. An end-gene constitutes only one breakpoint and each mid-gene constitutes two breakpoints. Therefore, any gene appears at the end of some bp-string of \(\mathit{BS}_{2}(G)\) an even number of times. □

From Lemma 6, if we denote each bp-string of \(\mathit{BS}_{2}(G)\) by a vertex, and there is an edge between two vertices iff their corresponding bp-strings have a common end-gene, the resulting graph contains a cycle of distinct vertices. Traversing this cycle, concatenating the bp-strings corresponding to the vertices, and deleting one copy of each common end-gene, we eventually obtain a string composed of genes of X. The following lemma and corollary show that this string can be inserted into I entirely, generating no breakpoint at all.

Lemma 7

In the case satisfying (1), (2) and (3), for a gene x, let \(q_{1}\) be the number of times it appears as an end-gene, let \(q_{2}\) be the number of times it appears in some bp-string of \(\mathit{BS}_{2}(G)\) as a mid-gene, and let r be the number of times it appears in X. Then, we have \(r=q_{1}/2+q_{2}\).

Proof

Assume that x appears q times in G, \(q'\) (\(<q\)) times in I, and there are p adjacencies of the form “xx” in G and I. Then, the total number of pairs (adjacencies and breakpoints) involving x in G is \(2(q-p)\), of which \(2(q'-p)\) are adjacencies. So the number of breakpoints involving x in G is \(2(q-q')\). Each x of the \(q_{1}\) end-genes contributes to one breakpoint, and each x of the \(q_{2}\) mid-genes contributes to two breakpoints; thus, \(2(q-q')=q_{1}+2q_{2}\). Note that \(q-q'\) is exactly r, and following Lemma 6, \(q_{1}\) is even. Then, \(r=q_{1}/2+q_{2}\). □
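The counting identity of Lemma 7 (together with the parity claim of Lemma 6) can be checked mechanically on the bp-strings of the worked example given later in this section, where the bp-strings are aeb, cea, ceb and X={a,b,c,e,e,e}. The Python sketch below is only a sanity check under the assumption that each bp-string is given as a plain string of one-character genes; the function name is hypothetical.

```python
from collections import Counter

def check_lemma7(bp_strings, X):
    """Check that every gene of X has an even end-gene count q1 and r = q1/2 + q2."""
    ends, mids = Counter(), Counter()
    for s in bp_strings:
        ends.update([s[0], s[-1]])   # the two end-genes of the bp-string
        mids.update(s[1:-1])         # all mid-genes of the bp-string
    return all(
        ends[g] % 2 == 0 and r == ends[g] // 2 + mids[g]
        for g, r in Counter(X).items()
    )

print(check_lemma7(["aeb", "cea", "ceb"], "abceee"))
```

In this instance each of a, b and c is an end-gene twice and never a mid-gene, while e is a mid-gene three times, which matches their multiplicities 1, 1, 1 and 3 in X.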

We summarize the above ideas as the following algorithm, which guarantees that we obtain at least as many adjacencies as the number of missing genes inserted.

For two strings \(s_{1}\) and \(s_{2}\), if the right end-gene \(r(s_{1})\) of \(s_{1}\) is the same as the left end-gene \(\ell(s_{2})\) of \(s_{2}\), we use \(s_{1}\oplus s_{2}\) to represent the string obtained by first concatenating \(s_{1}\) with \(s_{2}\) and then deleting one copy of the shared end-gene \(r(s_{1})=\ell(s_{2})\). For example, if \(s_{1}=acbd\) and \(s_{2}=decb\), then \(s_{1}\oplus s_{2}=acbdecb\).
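The merge operation just defined is simple to state in code. The sketch below uses a hypothetical function name `merge` and plain Python strings; reading a bp-string backwards (ceb as bec) before merging is an assumption of this illustration, justified only by the fact that adjacencies are unordered.

```python
def merge(s1, s2):
    """Concatenate s1 and s2, keeping a single copy of their shared end-gene."""
    if s1[-1] != s2[0]:
        raise ValueError("right end-gene of s1 must equal left end-gene of s2")
    return s1 + s2[1:]

# The example strings from the text.
print(merge("acbd", "decb"))
# Chaining the bp-strings aeb, ceb (read backwards as bec) and cea.
print(merge(merge("aeb", "bec"), "cea"))
```

The second call produces aebecea, the string L used in the worked example that follows the algorithm.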

Theorem 7

The algorithm Insert-Whole-Strings(•) guarantees that the number of adjacencies increased is not smaller than the number of genes inserted.

Proof

At steps 2, 3 and 4 of the algorithm, one gene is inserted into I and each time one more adjacency is obtained. At each round of step 6, a string of length l is inserted in between an adjacency in I; we then obtain l+1 new adjacencies with one destroyed. So the number of adjacencies increased is not smaller than the number of genes inserted. □

figure c

We run the above algorithm on the following example.

$$\begin{aligned} &G=\#daebxceafceb1234\#,\qquad I=\#dafcxb1324\#,\qquad X=\{a,b,c,e,e,e\},\\ &\mathit{BP}_{1}(G)=\emptyset,\qquad \mathit{BP}_{2}(G)=\{ae,eb,ce, ea,ce,eb\},\qquad \mathit{BP}_{3}(G)=\{12,34\}, \end{aligned}$$

then the set of breakpoint strings is \(\mathit{BS}_{2}(G)=\{aeb,cea,ceb\}\). According to the algorithm, we have L=aebecea. Gene a in I is replaced with string L to obtain the sequence \(I^{*}=\#daebeceafcxb1324\#\). The number of adjacencies increases by 6 and no new breakpoint is generated.
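The numbers in this example can be re-checked with a short Python script. It assumes, as a counting convention not restated here, that common adjacencies are the multiset intersection of unordered neighboring pairs and that the breakpoints of the scaffold are its remaining pairs; the helper names are hypothetical.

```python
from collections import Counter

def pair_counts(seq):
    return Counter(tuple(sorted(p)) for p in zip(seq, seq[1:]))

def adjacencies(G, I):
    return sum((pair_counts(G) & pair_counts(I)).values())

def breakpoints_in(I, G):
    """Number of neighboring pairs of I that are not matched in G."""
    return sum((pair_counts(I) - pair_counts(G)).values())

G = "#daebxceafceb1234#"
I = "#dafcxb1324#"
L = "aebecea"
i = I.index("a")
filled = I[:i] + L + I[i + 1:]        # replace gene a by the string L
print(filled)
print(adjacencies(G, I), adjacencies(G, filled))
print(breakpoints_in(I, G), breakpoints_in(filled, G))
```

Under this convention the scaffold has 9 adjacencies before and 15 after the replacement, and its two breakpoints (13 and 24) are unchanged.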

4.3 Analysis of the Approximation Algorithm

In this subsection, we will prove that the approximation factor of our algorithm is 4/3. Firstly, we present a lower bound of the optimal solution.

A Lower Bound

Given an instance of One-sided SF-MNSA, let \(I^{*}\in I+X\) be the final scaffold in the optimal solution after inserting all genes of X into I. Compared to I, all genes belonging to X appear as substrings in \(I^{*}\). Let \(x_{1}x_{2}\cdots x_{l}\) be a string inserted in between \(y_{i}y_{i+1}\) in \(I^{*}\); then either \(y_{i}x_{1}\) or \(x_{l}y_{i+1}\) or both are adjacencies. Otherwise, we could delete this string from \(I^{*}\) (the number of adjacencies decreases by at most l−1), re-insert it following the algorithm Insert-Whole-Strings(•) (the number of adjacencies increases by at least l), and obtain one more adjacency. Thus, we have the following corollary of Theorem 7.

Corollary 1

Each substring in \(I^{*}\) composed of genes of X is either Type-1 or Type-2.

Now, we present a lower bound for the optimal number of adjacencies.

Lemma 8

Let OPT be the number of adjacencies between G and \(I^{*}\), \(k_{0}\) be the number of adjacencies between G and I, and \(k_{1}=|X|\). Let \(b_{i}\) be the number of i-Type-1 substrings and q be the maximum length of Type-1 substrings in the optimal solution between G and \(I^{*}\). Then

$$ \mathit{OPT}-k_{0}=k_{1}+b_{1}+b_{2}+ \cdots+b_{q}\leq\frac{4}{3}\biggl(k_{1}+ \frac{1}{2}b_{1}+\frac{1}{4}b_{2}\biggr) $$
(9.1)

Proof

Define C as the total number of genes in Type-2 substrings in \(I^{*}\). Since inserting an l-Type-1 string will generate l+1 more adjacencies, and inserting an l-Type-2 string will generate l more adjacencies, we have

$$\mathit{OPT} = k_{0}+ \sum_{i=1}^{q}(i+1) \times b_{i}+C. $$

By the definition of Type-1 and Type-2 substrings, we have

$$k_{1}=\sum_{i=1}^{q}(i\times b_{i})+C\geq b_{1}+2b_{2}+3(b_{3}+ b_4 + \cdots + b_{q}) +C. $$

Thus,

$$\sum_{i=3}^{q}b_{i}\leq (k_{1}-C-b_{1}-2b_{2})/3. $$

Hence, we have

$$\begin{aligned} \mathit{OPT} - k_{0} =& C+ \sum_{i=1}^{q}i \times b_{i} + b_{1}+b_{2}+\cdots +b_{q} \\ =& k_{1}+ b_{1}+b_{2}+\cdots +b_{q} \\ \leq& k_{1}+ b_{1}+b_{2}+(k_{1}-C-b_{1}-2b_{2})/3 \\ \leq&\frac{4}{3}\biggl(k_{1}+\frac{1}{2}b_{1}+ \frac{1}{4}b_{2}\biggr). \end{aligned}$$

 □

Lemma 8 shows that if the number of Type-1 substrings computed in the approximation algorithm is not smaller than \((2b_{1}+b_{2})/4\), then the approximation factor is 4/3.
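For readers who want the last step of the chain in the proof of Lemma 8 spelled out, the expansion below uses only the symbols defined there, together with the fact that \(C\geq 0\):

$$\begin{aligned} k_{1}+b_{1}+b_{2}+\frac{k_{1}-C-b_{1}-2b_{2}}{3} =& \frac{4}{3}k_{1}+\frac{2}{3}b_{1}+\frac{1}{3}b_{2}-\frac{C}{3} \\ =& \frac{4}{3}\biggl(k_{1}+\frac{1}{2}b_{1}+\frac{1}{4}b_{2}\biggr)-\frac{C}{3} \\ \leq& \frac{4}{3}\biggl(k_{1}+\frac{1}{2}b_{1}+\frac{1}{4}b_{2}\biggr). \end{aligned}$$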

Description of the Main Algorithm

There are three main steps in our algorithm. Firstly, we try to search the 1-Type-1 strings; secondly, we try to search the 2-Type-1 strings; finally, we insert the remaining genes in X, guaranteeing that on average we will obtain at least one adjacency for each inserted missing gene.

figure d

4.4 Proof of the Approximation Factor

In our algorithm, we try to insert as many Type-1 substrings as possible. But a Type-1 substring (say \(I_{s}\)) inserted by our algorithm may make other Type-1 substrings in some optimal solution infeasible; we then say that \(I_{s}\) destroys them. The following lemma bounds the number of Type-1 substrings that can be destroyed by a given Type-1 substring.

Lemma 9

An i-Type-1 substring can destroy at most i+1 Type-1 substrings in some optimal solution.

Proof

Assume that an i-Type-1 substring \(I_{s}\) is inserted in between some breakpoint \(y_{j}y_{j+1}\) in I. Then each of the genes in \(I_{s}\), if not used by \(I_{s}\), could form a distinct Type-1 substring in some optimal solution. Also, there may exist another Type-1 substring that could be inserted in between the breakpoint \(y_{j}y_{j+1}\) in the optimal solution. In total, at most i+1 Type-1 substrings in the optimal solution could be destroyed by \(I_{s}\). □

We have the following lemma regarding this greedy algorithm.

Lemma 10

Let \(b'_{1},b'_{2}\) be the number of type-1 1-substrings and 2-substrings inserted at Step 1 and Step 2 of our greedy algorithm, respectively. Then \(b'_{1}+b'_{2}\geq \frac{b_{1}}{2}+\frac{b_{2}}{4}\).

Proof

Let \(k'_{1},k'_{2}\) be the number of missing genes inserted at Step 1 and Step 2, respectively. (So \(b'_{1}=k'_{1}\) and \(b'_{2}=k'_{2}/2\).) First, by Lemma 9, each of the \(k'_{1}\) inserted missing genes can destroy at most two type-1 1-substrings in some optimal solution. Moreover, each of the \(k'_{1}\) inserted missing genes can destroy at most two type-1 2-substrings in some optimal solution; this is illustrated with an example after the counting identity below. Let \(b'_{10}\) be the number of missing genes inserted at Step 1 which destroy exactly one type-1 1-substring (and some type-1 m-substring, with m≥3) in some optimal solution. Let \(b'_{11}\) be the number of missing genes inserted at Step 1 which destroy exactly two type-1 1-substrings in some optimal solution. Let \(b'_{12}\) be the number of missing genes inserted at Step 1 which destroy one type-1 1-substring and one type-1 2-substring in some optimal solution. Let \(b'_{13}\) be the number of missing genes inserted at Step 1 which destroy exactly two type-1 2-substrings in some optimal solution. Obviously,

$$k'_1=b'_1=b'_{10}+b'_{11}+b'_{12}+b'_{13}. $$

Next, we show an example of a gene a, one of the \(b'_{13}\) inserted missing genes, that destroys two type-1 2-substrings in the optimal solution (i.e., counted into \(b_{2}\)). Let G=…αaβγabδαuvβ… and let I=ααβγδβa…. We need to insert a,b,u,v into I. Due to the greedy fashion of the algorithm, a is inserted between α,β in I to have αaβ (destroying the possibility of inserting uv at the same location). On the other hand, due to the insertion of a (instead of ab), ab cannot be inserted in between γ and δ. Therefore, we destroy the optimal adjacencies 〈αuvβ〉 and 〈γabδ〉 (with the corresponding two type-1 2-substrings: uv and ab).

Again, by Lemma 9, at Step 2, each of the inserted 2-type-1 substrings can destroy at most three 2-type-1 substrings in some optimal solution.

Now, putting all together,

$$b_1\leq b'_{10}+2b'_{11}+b'_{12}, $$

and

$$b_2\leq 3b'_2+b'_{12}+2b'_{13}. $$

Then

$$\begin{aligned} \frac{b_1}{2}+\frac{b_2}{4} \leq& \frac{b'_{10}+2b'_{11}+b'_{12}}{2}+\frac{3b'_2+b'_{12}+2b'_{13}}{4} \\ =& \biggl(\frac{b'_{10}}{2}+b'_{11}+\frac{3b'_{12}}{4}+ \frac{b'_{13}}{2}\biggr)+\frac{3b'_2}{4} \\ \leq&b'_1+b'_2 \end{aligned}$$

 □
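The final inequality in the proof of Lemma 10 holds term by term: every coefficient on the left is at most 1, and the counting identity \(b'_{1}=b'_{10}+b'_{11}+b'_{12}+b'_{13}\) from the proof gives

$$\frac{b'_{10}}{2}+b'_{11}+\frac{3b'_{12}}{4}+\frac{b'_{13}}{2}\leq b'_{10}+b'_{11}+b'_{12}+b'_{13}=b'_{1},\qquad \frac{3b'_{2}}{4}\leq b'_{2}. $$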

Theorem 8

There is a greedy algorithm which approximates One-sided SF-MNSA within a factor of 4/3.

Proof

Following the greedy algorithm, Theorem 7, Lemma 8, and Lemma 10, we have the approximation solution value APP, which satisfies the following inequalities:

$$\mathit{APP}-k_{0} \geq k_{1}+b_{1}'+b_{2}' \geq k_{1}+\frac{1}{2}b_{1}+\frac{1}{4}b_{2} \geq \frac{3}{4}(\mathit{OPT}-k_{0}). $$

So, we have \(\mathit{APP}\geq \frac{3}{4}\mathit{OPT}+ \frac{1}{4}k_{0}\geq\frac{3}{4}\mathit{OPT}\). Hence \(\frac{\mathit{OPT}}{\mathit{APP}} \leq \frac{4}{3}\), and the theorem is proven. □

In [51], a better factor-1.25 approximation was proposed. While the overall framework is similar, the details are quite different. The new approximation is achieved by a combination of maximum matching, local improvement and greedy search.

5 Concluding Remarks and Open Problems

The negative results on EBD and ENbS do not mean that we have absolutely no way to tackle these problems. For instance, in [1], with integer linear programming, very nice empirical results are obtained. Here, we try to present a different way to handle these problems formally.

In many biological problems, the optimal solution value \({\rm OPT}\) could be zero. (Besides EBD, in some minimum recombination haplotype reconstruction problems the optimal solution value could be zero.) As implied by Theorem 1, if computing such an optimal solution with zero solution value is NP-complete then the problem does not admit any polynomial-time approximation (unless P = NP). However, in reality one would be satisfied to obtain a solution with value one or two. For this reason, we can relax the traditional definition of approximation to a weak approximation. Given a minimization problem Π, let the optimal solution value of Π be \({\rm OPT}\). We say that a weak approximation algorithm \(\mathcal{W}\) provides a performance guarantee of α for Π if for every instance I of Π, the solution value returned by \(\mathcal{W}\) is at most \(\alpha\times({\rm OPT}+1)\).
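The difference between the two guarantees shows up exactly when the optimal value is zero, as the following small Python check illustrates. The function name and its parameters are hypothetical and serve only to restate the two inequalities for a minimization problem.

```python
def meets_guarantee(value, opt, alpha, weak=False):
    """Check value <= alpha*OPT (classical) or value <= alpha*(OPT+1) (weak)."""
    bound = alpha * (opt + 1) if weak else alpha * opt
    return value <= bound

# With OPT = 0, a solution of value 2 violates any classical factor-2 guarantee
# but satisfies the weak factor-2 guarantee.
print(meets_guarantee(2, 0, 2), meets_guarantee(2, 0, 2, weak=True))
```

For instances with a positive optimum the two notions differ only by the additive term α, which is why the relaxation matters mainly for problems whose optimum can be zero.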

In [21, 22, 24] we showed that EBD and the exemplar conserved interval distance problems are both hard to approximate even under the weak approximation model. But for the exemplar reversal distance problem, no such result is known yet.

For the exemplar common interval number problem [12], the only negative result is its NP-hardness. It would also be interesting to know whether it admits an efficient polynomial-time approximation. We conclude this paper with a list of open problems.

  1. For the One-sided Exemplar Breakpoint Distance problem, does there exist a factor-o(n) approximation? The only known negative result is the APX-hardness of the problem.

  2. For the exemplar common interval number problem, does there exist a good approximation?

  3. For the CMSR problem, does there exist a faster FPT algorithm and/or a smaller linear kernel?

  4. For the One-sided SF-MNSA problem, does there exist an FPT algorithm?