
1 Introduction

When comparing two genomes, one of the main goals is to determine the sequence of mutations that occurred during evolution and transformed one genome into the other. In comparative genomics, we estimate this sequence through genome rearrangements, which are evolutionary events (mutations) that affect large segments of the genome.

Two genomes \(G_1\) and \(G_2\) can be computationally represented as the sequence of labels assigned to their shared genes (or shared blocks of genes). Labels are usually integers. In addition, we associate a positive or negative sign with each number, reflecting the orientation of that gene (or block) inside the genomes. Assuming that the genomes do not contain duplicated genes, this representation results in a signed permutation when the orientation of the genes is known, and in an unsigned permutation otherwise. One of the genomes can be seen as the identity permutation, in which the elements are in ascending order, so problems dealing with genome rearrangements are usually treated as sorting problems, in which the goal is to transform a given permutation into the identity.

Two of the most studied genome rearrangements in the literature are the reversal, which inverts the order and the orientation of the genes inside a segment of the genome, and the transposition, which moves a segment of the genome to another position. The Sorting by Reversals problem has an exact polynomial-time algorithm for signed permutations [5], but it is NP-hard for unsigned permutations [4].

The Sorting by Transpositions problem is NP-hard [3]. When we allow the use of reversals and transpositions, and assuming that both events occur with the same frequency (unweighted approach), we have the Sorting by Reversals and Transpositions (SbRT) problem that is NP-hard on signed and unsigned permutations [7].

In the weighted approach, each type of event has an associated cost, and the goal is to find a sequence of rearrangement events that transforms one genome into another while minimizing the sum of the costs. Oliveira et al. [7] showed that the Sorting by Weighted Reversals and Transpositions (SbWRT) problem is NP-hard on signed and unsigned permutations when the ratio between the cost of a transposition and the cost of a reversal is less than or equal to 1.5. Oliveira et al. [8] developed a 1.5-approximation algorithm for SbWRT on signed permutations considering costs 2 and 3 for reversals and transpositions, respectively.

The problem with weighted approaches is that they do not guarantee that lower-cost rearrangements, i.e., those assumed to be the most frequent, will be the ones most frequently used by the algorithms. To overcome this issue, we propose and investigate the Sorting by Reversals and Transpositions with Proportion Restriction problem on signed and unsigned permutations. In this problem, we seek a sorting sequence with an additional constraint: the ratio between the number of reversals and the size of the sequence must be greater than or equal to a given parameter \(k \in [0,1]\). We provide an algorithm that guarantees an approximation for any value of k on signed and unsigned permutations. We also show an asymptotic approximation algorithm for the signed case with an improved approximation factor.

This manuscript is organized as follows. Section 2 provides definitions used throughout the paper. Section 3 presents an approximation algorithm for the signed and unsigned cases. Section 4 presents an asymptotic approximation algorithm for the signed case with an improved approximation factor. Section 5 concludes the paper.

2 Basic Definitions

This section formally presents the definitions used in the genome rearrangement problems. Given two genomes \(\mathcal {G}_1\) and \(\mathcal {G}_2\), each synteny block (common block of genes between the two genomes) is represented by an integer that also has a positive or negative sign to indicate its orientation, if known. Therefore, each genome is a permutation of integers. We assume that one of them is represented by the identity permutation \(\iota _n = ({+1}~{+2}~\ldots ~{+n})\) and the other is represented by a signed (or unsigned) permutation \(\pi = (\pi _1~\pi _2~\ldots ~\pi _n)\).

We define a rearrangement model \(\mathcal {M}\) as the set of rearrangement events allowed to compute the distance. Given a rearrangement model \(\mathcal {M}\) and a permutation \(\pi \), the rearrangement distance \(d(\pi )\) is the minimum number of rearrangements of \(\mathcal {M}\) that sorts \(\pi \) (i.e., that transforms \(\pi \) into \(\iota \)). The goal of Sorting by Genome Rearrangements problems is to find this distance and a sequence that achieves it.

In this work, we will assume that \(\mathcal {M}\) contains both reversals and transpositions. Let us formally define these events.

Definition 1

Given a signed permutation \(\pi = (\pi _1~\ldots ~\pi _n)\), a reversal \(\rho (i,j)\), with \(1 \le i \le j \le n\), transforms \(\pi \) into the permutation \(\pi \cdot \rho (i,j) = (\pi _1~\ldots ~\pi _{i-1}~\underline{{-\pi _j}~\ldots ~{-\pi _i}}~\pi _{j+1}~\ldots ~\pi _n)\).

Definition 2

Given an unsigned permutation \(\pi = (\pi _1~\ldots ~\pi _n)\), a reversal \(\rho (i,j)\), with \(1 \le i < j \le n\), transforms \(\pi \) into the permutation \(\pi \cdot \rho (i,j) = (\pi _1~\ldots ~\pi _{i-1}~\underline{{\pi _j}~\ldots ~{\pi _i}}~\pi _{j+1}~\ldots ~\pi _n)\).

Definition 3

Given a permutation \(\pi = (\pi _1~\ldots ~\pi _n)\), a transposition \(\tau (i,j,k)\), with \(1 \le i< j < k \le n + 1\), transforms \(\pi \) into the permutation \(\pi \cdot \tau (i,j, k) = (\pi _1~\ldots ~\pi _{i-1}~\underline{{\pi _j}~\ldots ~{\pi _{k-1}}}~\underline{\pi _i~\ldots ~\pi _{j-1}}~\pi _{k}~\ldots ~\pi _n)\). The effect of a transposition is the same on signed and unsigned permutations.
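Definitions 1 to 3 can be illustrated with a minimal Python sketch. The function names `reversal` and `transposition`, the `signed` flag, and the choice of plain Python lists with 1-indexed arguments are assumptions made for this illustration only.

```python
def reversal(perm, i, j, signed=True):
    """Apply rho(i, j) to perm; i and j are 1-indexed and inclusive."""
    # Definition 1 allows i == j (a single sign flip); Definition 2 needs i < j.
    order_ok = i <= j if signed else i < j
    if not (1 <= i and order_ok and j <= len(perm)):
        raise ValueError("invalid reversal indices")
    segment = perm[i - 1:j][::-1]          # invert the order of the segment
    if signed:
        segment = [-x for x in segment]    # and flip every sign
    return perm[:i - 1] + segment + perm[j:]


def transposition(perm, i, j, k):
    """Apply tau(i, j, k): exchange the adjacent segments [i, j-1] and [j, k-1]."""
    if not (1 <= i < j < k <= len(perm) + 1):
        raise ValueError("invalid transposition indices")
    return perm[:i - 1] + perm[j - 1:k - 1] + perm[i - 1:j - 1] + perm[k - 1:]
```

For instance, `reversal([1, -3, -2, 4], 2, 3)` and `transposition([3, 4, 1, 2], 1, 3, 5)` both yield the identity on four elements.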

The following definition helps us to formally define the problem of sorting by reversals and transpositions with a constraint on the number of reversals used in the sorting sequence.

Definition 4

Given a sequence of reversals and transpositions S, let |S| denote the number of events in S and let \(|S_{\rho }|\) denote the number of reversals in S.

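Sorting by Reversals and Transpositions with Proportion Restriction (SbRTwPR). Input: a permutation \(\pi \) and a parameter \(k \in [0,1]\). Goal: find a minimum-length sequence S of reversals and transpositions that sorts \(\pi \) such that \(\frac{|S_{\rho }|}{|S|} \ge k\). The length of such a sequence is denoted by \(d_{k}(\pi )\).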

Note that when \(k = 1\) the SbRTwPR problem becomes the Sorting by Reversals problem on signed [5] and unsigned [4] permutations. Moreover, when \(k = 0\) we have the Sorting by Reversals and Transpositions problem on signed [10] and unsigned [9] permutations.

Example 1 shows an optimal solution S for \(\pi =({-1}~{+4}~{-8}~{+3}~{+5}~{+2}~{-7}~{-6})\) considering the SbRT and the SbWRT problems (SbWRT using costs 2 for reversals and 3 for transpositions). Note that half of the operations in S are reversals and half are transpositions, even using a higher cost for transpositions.

Example 1

Example 2 shows an optimal solution \(S'\) for the same signed permutation \(\pi \) considering the SbRTwPR problem, adopting \(k = 0.6\) (i.e., at least 60% of the operations in \(S'\) must be reversals). Compared with Example 1, the sequence \(S'\) has only one more operation than S, while ensuring the minimum proportion of reversals and using both reversals and transpositions.

Example 2

In the following, we present breakpoints and the cycle graph, both widely used to obtain bounds for the distance and to develop algorithms.

2.1 Breakpoints

Given a permutation \(\pi = (\pi _1~\ldots ~\pi _n)\), we extend \(\pi \) by adding the elements \(\pi _0 = 0\) and \(\pi _{n+1} = n+1\), with these elements having positive signs when considering signed permutations. We observe that these elements are not affected by rearrangement events. From now on, we work on extended permutations.

Definition 5

For an unsigned permutation \(\pi \), a pair of elements \(\pi _i\) and \(\pi _{i+1}\), with \(0 \le i \le n\), is a breakpoint if \(|\pi _{i+1} - \pi _{i}| \ne 1\).

The number of breakpoints in a permutation \(\pi \) is denoted by \(b(\pi )\). Given an operation \(\gamma \), let \(\varDelta b(\pi , \gamma ) = b(\pi ) - b(\pi \cdot \gamma )\), that is, \(\varDelta b(\pi , \gamma )\) denotes the change in the number of breakpoints after applying \(\gamma \) to \(\pi \).

Remark 1

The identity permutation \(\iota \) is the only permutation with \(b(\pi ) = 0\).
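As a small illustration of Definition 5, the sketch below counts breakpoints of an unsigned permutation after extending it with 0 and n+1. The function name `breakpoints` is an assumption of this example.

```python
def breakpoints(perm):
    """Count b(pi) for an unsigned permutation given as a list of 1..n."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]       # extended permutation
    # A pair of neighbours is a breakpoint when they are not consecutive values.
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(b - a) != 1)
```

With this helper, the change \(\varDelta b(\pi , \gamma )\) is simply `breakpoints(pi) - breakpoints(pi_after)`, where `pi_after` is the permutation after the operation.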

2.2 Cycle Graph

For a signed permutation \(\pi \), we define the cycle graph \(G(\pi ) = (V, E)\), such that \(V = \{ +\pi _0, -\pi _1, +\pi _1, -\pi _2, +\pi _2, \ldots , -\pi _{n}, +\pi _n, -\pi _{n+1}\}\) and \(E = E_b \cup E_g\), where \(E_b = \{(-\pi _i, +\pi _{i-1}) \,|\, 1 \le i \le n+1\}\) and \(E_g = \{(+(i-1), -i) \,|\, 1 \le i \le n + 1\}\). We say that \(E_b\) is the set of black edges and \(E_g\) is the set of gray edges.

Note that each vertex is incident to two edges (a gray edge and a black edge) and, so, there exists a unique decomposition of edges in cycles. The size of a cycle \(C \in G(\pi )\) is the number of black edges in C. A cycle C is trivial if it has size 1. If C has size less than or equal to 3, then C is called short and, otherwise, C is called long. The identity permutation \(\iota _n\) is the only one with a cycle graph containing \(n+1\) cycles, which are all trivial.

The number of cycles in \(G(\pi )\) is denoted by \(c(\pi )\). Given an operation \(\gamma \), let \(\varDelta c(\pi , \gamma ) = c(\pi \cdot \gamma ) - c(\pi )\), that is, \(\varDelta c(\pi , \gamma )\) denotes the change in the number of cycles after applying \(\gamma \) to \(\pi \).

The cycle graph \(G(\pi )\) is drawn in a way to highlight characteristics of the permutation, as shown in Fig. 1. In this representation, we draw the vertices in a horizontal line, from left to right, following the order \(+\pi _0, -\pi _1, +\pi _1, \ldots , -\pi _n,\) \(+\pi _n, -\pi _{n+1}\). The black edges are horizontal lines and the gray edges are arcs.

For \(1 \le i \le n+1\), the black edge \((-\pi _i, +\pi _{i-1})\) is labeled as i. We represent a cycle C by the sequence of labels of its black edges following the order in which they are traversed, assuming that the first black edge is the one with the highest label (the rightmost black edge of C) and that it is traversed from right to left. Under this representation, if a black edge is traversed from left to right, we add a minus sign to its label (the first black edge is always positive since it is traversed from right to left by convention).

Fig. 1. Cycle graph for \(\pi = (+5~+2~+4~+3~+1~+6~-7)\). In this cycle graph, we have the cycles \(C_1=(5, 3, 4, 1)\), \(C_2 = (6, 2)\), and \(C_3 = (8, -7)\).
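The construction of the cycle graph and the cycle representation described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `cycles` and the encoding of a vertex as a pair `(sign, value)` are assumptions of this example, not part of the original formulation.

```python
def cycles(perm):
    """Return the cycles of G(pi) as tuples of signed black-edge labels."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]

    def plus(v):   # vertex +pi_i keeps the sign of pi_i
        return (1 if v >= 0 else -1, abs(v))

    def minus(v):  # vertex -pi_i has the opposite sign
        return (-1 if v >= 0 else 1, abs(v))

    # Black edge i joins +pi_{i-1} (left end) and -pi_i (right end).
    left = {i: plus(ext[i - 1]) for i in range(1, n + 2)}
    right = {i: minus(ext[i]) for i in range(1, n + 2)}
    where = {}     # vertex -> (black edge label, vertex is the right end?)
    for i in range(1, n + 2):
        where[left[i]] = (i, False)
        where[right[i]] = (i, True)

    def gray(vertex):  # gray edges join +(i-1) and -i
        sign, value = vertex
        return (-1, value + 1) if sign == 1 else (1, value - 1)

    seen, result = set(), []
    for start in range(n + 1, 0, -1):      # highest label starts each cycle
        if start in seen:
            continue
        cycle, edge, enter_right = [], start, True
        while True:
            seen.add(edge)
            # Right-to-left traversal gives a positive label, otherwise negative.
            cycle.append(edge if enter_right else -edge)
            exit_vertex = left[edge] if enter_right else right[edge]
            edge, enter_right = where[gray(exit_vertex)]
            if edge == start:
                break
        result.append(tuple(cycle))
    return result
```

Applied to the permutation of Fig. 1, `cycles([5, 2, 4, 3, 1, 6, -7])` returns the three cycles listed in the caption, and the identity on n elements yields n+1 trivial cycles.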

Two black edges of a cycle C are divergent if their labels have different signs, and convergent otherwise. A cycle C is divergent if at least one pair of black edges of C are divergent, and it is convergent otherwise.

We also classify convergent cycles as oriented and non-oriented. A cycle \(C = (c_1, c_2, \ldots , {} c_k)\) is non-oriented if \(c_i > c_{i+1}\), for all \(1 \le i < k\). Otherwise, we say that C is oriented.

Two cycles \(C = (c_1, c_2, \ldots , c_k)\) and \(D = (d_1, d_2, \ldots , d_k)\) are interleaving if either \(|c_1|> |d_1|> |c_2|> |d_2|> \ldots> |c_k| > |d_k|\) or \(|d_1|> |c_1|> |d_2|> |c_2|> \ldots> |d_k| > |c_k|\).

Let \(g_1\) be a gray edge adjacent to black edges with labels \(x_1\) and \(y_1\), such that \(|x_1| < |y_1|\), and let \(g_2\) be a gray edge adjacent to black edges with labels \(x_2\) and \(y_2\), such that \(|x_2| < |y_2|\). We say that two gray edges \(g_1\) and \(g_2\) intersect if \(|x_1|< |x_2| \le |y_1| < |y_2|\). Two cycles C and D intersect if an edge from C intersects with an edge from D.

An open gate is a gray edge from a cycle C that does not intersect with any other gray edge from C. An open gate \(g_1\) from C is closed if another gray edge (which is not from C) intersects with \(g_1\). All open gates of \(G(\pi )\) must be closed [8].

In the example of Fig. 1, the cycle \(C_1 = (5,3,4,1)\) is convergent and oriented, the cycle \(C_2 = (6,2)\) is convergent and non-oriented, and the cycle \(C_3 = (8, -7)\) is divergent. The gray edge from \(C_1\) adjacent to black edges 1 and 4 intersects with the gray edge from \(C_2\) adjacent to black edges 2 and 6, so the cycles \(C_1\) and \(C_2\) intersect.
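The classification of cycles by size and orientation can be checked mechanically. The sketch below assumes that a cycle is given as a tuple of signed black-edge labels in the representation of this section; the function name `classify` and the returned strings are choices made for this illustration.

```python
def classify(cycle):
    """Return (size class, orientation class) of a cycle given by its labels."""
    size = len(cycle)                      # number of black edges
    # Trivial cycles are also short by definition; they are reported separately.
    if size == 1:
        size_class = "trivial"
    elif size <= 3:
        size_class = "short"
    else:
        size_class = "long"

    if any(label < 0 for label in cycle):
        # The first label is positive, so a negative label means divergence.
        kind = "divergent"
    elif all(a > b for a, b in zip(cycle, cycle[1:])):
        kind = "convergent non-oriented"   # labels strictly decreasing
    else:
        kind = "convergent oriented"
    return size_class, kind
```

For the cycles of Fig. 1, this reproduces the classification stated above: `classify((5, 3, 4, 1))` is long, convergent, and oriented, while `classify((8, -7))` is short and divergent.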

3 Approximation Algorithms

In this section, we present approximation algorithms considering both unsigned and signed permutations.

3.1 Unsigned Case

Here we present an approximation algorithm with a factor of \(3-k\) based on breakpoints for SbRTwPR on unsigned permutations.

Lemma 1

(Kececioglu and Sankoff [6]). For any reversal \(\rho \), \(\varDelta b(\pi , \rho ) \le 2\).

Lemma 2

(Walter et al. [10]). For any transposition \(\tau \), \(\varDelta b(\pi , \tau ) \le 3\).

Lemma 3

Given an instance \((\pi , k)\) for SbRTwPR on unsigned permutations, and an optimal sequence of events S, the average number of breakpoints decreased by an operation in S is less than or equal to \(3-k\).

Proof

Since S is an optimal sequence for the instance \((\pi , k)\), at least |S|k operations in S are reversals. By Lemmas 1 and 2, a reversal can remove up to two breakpoints, while a transposition can remove up to three. Letting \(\phi b(S)\) denote the average number of breakpoints decreased by an operation in S, we have that

$$\phi b(S) \le \frac{(2 |S| k) + (3 |S| (1 - k))}{|S|} = 2k + 3(1 - k) = 3 - k.$$

   \(\square \)

Theorem 1

Given an instance \((\pi , k)\) for SbRTwPR on unsigned permutations, we have that \(d_{k}(\pi ) \ge \frac{b(\pi )}{3-k}\).

Proof

Since \(b(\pi )\) breakpoints must be removed in order to turn the permutation \(\pi \) into \(\iota \) and, by Lemma 3, up to \(3-k\) breakpoints are removed per operation on average, the theorem follows.   \(\square \)
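As a small worked instance of this bound (the permutation and the value of k are chosen here only for illustration), take the unsigned permutation \(\pi = (3~1~2)\) and \(k = \frac{1}{2}\). The extended permutation \((0~3~1~2~4)\) has breakpoints at the pairs (0, 3), (3, 1), and (2, 4), so \(b(\pi ) = 3\) and

$$d_{k}(\pi ) \ge \frac{b(\pi )}{3-k} = \frac{3}{3-\frac{1}{2}} = \frac{6}{5}.$$

Since \(d_{k}(\pi )\) is an integer, at least two operations are needed. The bound is tight in this case: the reversals \(\rho (1,3)\) and \(\rho (1,2)\) yield \((2~1~3)\) and then \((1~2~3)\), and a sequence made only of reversals satisfies \(\frac{|S_{\rho }|}{|S|} = 1 \ge k\).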

Theorem 2

(Kececioglu and Sankoff [6]). It is possible to turn an unsigned permutation \(\pi \) into \(\iota \) using at most \(b(\pi )\) reversals.

Theorem 3

SbRTwPR is approximable by a factor of \(3-k\) on unsigned permutations.

Proof

By Theorem 2, we can turn any unsigned permutation \(\pi \) into \(\iota \) using at most \(b(\pi )\) reversals. Since we use only reversals, the constraint \(\frac{|S_{\rho }|}{|S|} \ge k\) is not violated. By the lower bound shown in Theorem 1, the ratio between the size of this solution and the size of an optimal one is at most \(\frac{b(\pi )}{\frac{b(\pi )}{3-k}} = 3-k\).    \(\square \)

In order to avoid solutions for the problem consisting exclusively of reversals, we propose Algorithm 1. This algorithm guarantees the same approximation factor for the problem and tends to provide solutions in which the ratio between the number of reversals and the size of the sorting sequence is close to k.

Algorithm 1 (pseudocode figure, not reproduced in this text).

Note that a transposition \(\tau \) is only applied if two constraints are fulfilled: (i) \(\frac{|S_{\rho }|}{|S| + 1} \ge k\), which ensures that the sorting sequence complies with the main restriction of the problem, \(\frac{|S_{\rho }|}{|S|} \ge k\); and (ii) \(\varDelta b(\pi , \tau ) \ge 1\), which ensures that the sorting sequence contains at most \(b(\pi )\) operations, since the reversals of Theorem 2 also remove, on average, one or more breakpoints per operation. Since Algorithm 1 removes one or more breakpoints per iteration, it guarantees that the permutation \(\pi \) will be sorted. In addition, no more than \(b(\pi )\) operations will be used to sort \(\pi \), maintaining the approximation factor of \(3-k\). Since each operation (reversal or transposition) can be found in linear time and \(|S|~\le ~b(\pi )~\le ~{n + 1}\), the running time of Algorithm 1 is \(\mathcal {O}(n^2)\).
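The two acceptance constraints for a transposition can be written as a small predicate. This is only a sketch of conditions (i) and (ii) as described in the text, not a reproduction of Algorithm 1; the function name `transposition_allowed` and its parameters are hypothetical, and exact rational arithmetic is used to avoid rounding issues when comparing with k.

```python
from fractions import Fraction


def transposition_allowed(num_reversals, seq_len, k, delta_b):
    """Check constraints (i) and (ii) before appending a transposition.

    num_reversals: |S_rho| of the sequence built so far
    seq_len:       |S| of the sequence built so far
    k:             required proportion of reversals, in [0, 1]
    delta_b:       breakpoints removed by the candidate transposition
    """
    # (i) the proportion must still hold after the sequence grows by one.
    keeps_proportion = Fraction(num_reversals, seq_len + 1) >= k
    # (ii) the transposition must remove at least one breakpoint.
    removes_breakpoint = delta_b >= 1
    return keeps_proportion and removes_breakpoint
```

For example, with \(k = 0.6\), a sequence with three reversals out of four operations may receive a transposition (the ratio becomes 3/5), whereas a sequence with two reversals out of three may not (the ratio would drop to 2/4).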

3.2 Signed Case

Here we present an approximation algorithm with a factor of \(3-\frac{3k}{2}\) based on the cycle graph for SbRTwPR on signed permutations.

Lemma 4

(Hannenhalli and Pevzner [5]). For any reversal \(\rho \), \(\varDelta c(\pi , \rho ) \le 1\).

Lemma 5

(Bafna and Pevzner [1]). For any transposition \(\tau \), \(\varDelta c(\pi , \tau ) \le 2\).

Lemma 6

Given an instance \((\pi , k)\) for SbRTwPR on signed permutations, and an optimal sequence of events S, the average number of cycles increased by an operation in S is less than or equal to \(2-k\).

Proof

Since S is an optimal sequence for the instance \((\pi , k)\), at least |S|k operations in S are reversals. By Lemmas 4 and 5, a reversal creates at most one new cycle, while a transposition creates at most two new cycles. Letting \(\phi c(S)\) denote the average number of cycles increased by an operation in S, we have that:

$$\phi c(S) \le \frac{(1 |S| k) + (2 |S| (1 - k))}{|S|} = 1k + 2(1 - k) = 2 - k.$$

   \(\square \)

Theorem 4

Given an instance \((\pi , k)\) for SbRTwPR on signed permutations, we have that \(d_{k}(\pi ) \ge \frac{n + 1 - c(\pi )}{2-k}\).

Proof

Since \((n + 1) - c(\pi )\) new cycles must be created in order to turn the permutation \(\pi \) into \(\iota \) and, by Lemma 6, up to \(2-k\) new cycles are created per operation on average, the theorem follows.   \(\square \)
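As a worked instance (the value of k is chosen here only for illustration), consider the permutation of Fig. 1, \(\pi = (+5~+2~+4~+3~+1~+6~-7)\), for which \(n = 7\) and \(c(\pi ) = 3\), and take \(k = \frac{1}{2}\). Then \(n + 1 - c(\pi ) = 5\) new cycles must be created and

$$d_{k}(\pi ) \ge \frac{n + 1 - c(\pi )}{2-k} = \frac{5}{2 - \frac{1}{2}} = \frac{10}{3},$$

so, since \(d_{k}(\pi )\) is an integer, any valid sorting sequence for this instance has at least four operations.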

Theorem 5

Given a signed permutation \(\pi \), there exists a sequence of reversals S that transforms \(\pi \) into \(\iota \) such that the average number of cycles increased by any reversal in S is greater than or equal to 2/3.

Proof

If at any stage \(G(\pi )\) has a divergent cycle C, then there exists a reversal applied to C that increases the number of cycles by one unit [10]. Otherwise, \(G(\pi )\) has only convergent cycles, and one of the following is true [8]:

  • there exists a long oriented cycle (Fig. 2, Case 1);

  • there exists a short cycle C whose open gates are closed by another non-trivial cycle D (Fig. 2, Case 2);

  • there exists a long non-oriented cycle C whose open gates are closed by one or more non-trivial cycles (Fig. 2, Case 3).

If \(G(\pi )\) has an oriented long cycle C, then we can apply a reversal on its black edges in such a way that it turns C into a divergent cycle \(C'\). Since \(C'\) is long, we can apply at least two reversals on \(C'\) that increase the number of cycles by one unit each (Fig. 2, Case 1).

In the other two cases we can turn the cycle C into an oriented cycle \(C'\) by applying one reversal to a cycle D that closes an open gate from C. If \(C'\) is short, we can break it into two trivial cycles with a reversal, and this second reversal turns D into a divergent cycle \(D'\), which guarantees that we can apply a third reversal to \(D'\) that increases the number of cycles by one (Fig. 2, Case 2). If \(C'\) is long, then we can apply at least two reversals that increase the number of cycles by one unit each (Fig. 2, Case 3).

In the three cases above we applied three reversals that increased the number of cycles by two, and the theorem follows.   \(\square \)

Fig. 2. Operations applied in each case of Theorem 5.

Theorem 6

SbRTwPR is approximable by a factor of \(3-\frac{3k}{2}\) on signed permutations.

Proof

By Theorem 5, we can turn any signed permutation \(\pi \) into \(\iota \) using at most \(\frac{3(n + 1 -c(\pi ))}{2}\) reversals. Since we use only reversals, the constraint \(\frac{|S_{\rho }|}{|S|} \ge k\) is not violated. By the lower bound shown in Theorem 4, we have:

$$\frac{\frac{3(n + 1 -c(\pi ))}{2}}{\frac{n + 1 - c(\pi )}{2-k}} = 3-\frac{3k}{2}.$$

   \(\square \)

Note that in order to avoid a solution composed exclusively of reversals, the approach used in Algorithm 1 can be adapted to be applied in this case as well. In Sect. 4, we will present an asymptotic approximation algorithm with an improved approximation factor for the signed case.

4 Asymptotic Approximation for the Signed Case

In this section, we show an asymptotic approximation algorithm for SbRTwPR on signed permutations, for any \(k \in [0,1]\), with an approximation factor of \(\frac{2-k}{1-\frac{k}{3}}\).

Definition 6

Let \(\mathcal {A}_\rho \) be an algorithm that sorts a permutation using only signed reversals and guarantees an average of at least 2/3 new cycles per applied reversal (Theorem 5), and let \(\mathcal {A}_\rho (\pi )\) represent the sequence of reversals returned by the algorithm that sorts \(\pi \).

Now consider Algorithm 2.

Algorithm 2 (pseudocode figure, not reproduced in this text).

Lemma 7

Given a signed permutation \(\pi \), Algorithm 2 sorts \(\pi \) using at most \((n+1-c(\pi )) / (1-k/3) + 4\) operations.

Proof

Let \(S=(S_1, \ldots , S_{|S|})\) be the sorting sequence generated by the algorithm without considering the substitution of transpositions by reversals applied in line 14. Let \(S'\) be the subsequence of operations applied in the while loop of lines 2 to 10. Each operation in \(S'\) increases the number of cycles by at least one unit, and the operations in \(S \setminus S'\) (that is, the operations applied outside the while loop) increase the number of cycles by at least 2/3 on average. By the condition of line 2, we have that \(|S'| \ge (1-k) |S|\) and, therefore, the average increase in the number of cycles per operation in S is at least \( \frac{(1\,-\,k)|S| \,+\, k|S|2/3}{|S|} = 1 \,-\, k/3. \) Since these operations create at most \(n\,+\,1\,-\,c(\pi )\) new cycles in total, we have that \( |S| \le \frac{n\,+\,1\,-\,c(\pi )}{1\,-\,k/3}. \) The final sequence may contain up to four more operations, since the last two transpositions may be replaced by six reversals (only if necessary). Therefore, the size of this sequence is at most \(\frac{n\,+\,1\,-\,c(\pi )}{1\,-\,k/3} + 4\).   \(\square \)

Theorem 7

Algorithm 2 is a \(\frac{2\,-\,k}{1\,-\,k/3}\)-asymptotic approximation algorithm for SbRTwPR.

Proof

Since the algorithm only adds transpositions while the condition of line 2 is satisfied, and at most two transpositions are added to the sorting sequence in one iteration, we guarantee that \(\frac{|S_{\rho }|}{|S|} \ge k\) by replacing the last two transpositions by reversals. By Lemma 7 and Theorem 4, the sequence S returned by Algorithm 2 satisfies \( |S| \le \frac{n+1-c(\pi )}{1-k/3} + 4 \le \frac{2-k}{1-k/3} d_k(\pi ) + 4. \) Therefore, it is a \(\frac{2-k}{1-k/3}\)-asymptotic approximation algorithm for SbRTwPR.   \(\square \)
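To see how this factor relates to the factor \(3-\frac{3k}{2}\) of Theorem 6, note that \(3 - k \ge 2\) for every \(k \in [0,1]\), which gives

$$\frac{2-k}{1-\frac{k}{3}} = \frac{3(2-k)}{3-k} \le \frac{3(2-k)}{2} = 3-\frac{3k}{2}.$$

For \(k = 0\) the two factors are 2 and 3, respectively, and for \(k = 1\) both are equal to \(\frac{3}{2}\). The comparison concerns only the multiplicative factors, since the asymptotic guarantee also carries the additive term 4.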

5 Conclusion

We investigated the Sorting by Reversals and Transpositions with Proportion Restriction problem and presented an approximation algorithm with a factor of \(3-k\) for unsigned permutations, and an approximation and an asymptotic approximation algorithm with factors \(3-\frac{3k}{2}\) and \(\frac{2-k}{1-\frac{k}{3}}\) for signed permutations, respectively.

As future work, we intend to test the proposed algorithms and develop heuristics for the problems. Another interesting research line would be to investigate the complexity of the problems when \(0< k < 1\).