Sorting by Reversals, Transpositions, and Indels on Both Gene Order and Intergenic Sizes

Brito, Klairton Lima; Jean, Géraldine; Fertin, Guillaume; Oliveira, Andre Rodrigues; Dias, Ulisses; Dias, Zanoni

doi:10.1007/978-3-030-20242-2_3

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11490))

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

758 Accesses
1 Citations

Abstract

During the evolutionary process, the genome is affected by various genome rearrangements, which are events that modify large stretches of the genetic material. In the literature, several models were designed to estimate the number of events that occurred during the evolution, but these models represent genomes as a sequence of genes, overlooking the genetic material between consecutive genes. However, recent studies show that taking into account the genetic material present between consecutive genes can be more realistic. Reversal and transposition are genome rearrangements widely studied in the literature. A reversal inverts a segment of the genome while a transposition swaps the positions of two consecutive segments. Genomes also undergo non-conservative events (events that alter the amount of genetic material) such as insertion and deletion, which insert and remove genetic material from intergenic regions of the genome, respectively. We study problems considering both gene order and intergenic regions size. We investigate the reversal distance between two genomes in two scenarios: with and without non-conservative events. For both problems, we show that they belong to NP-hard problems class and we present a 4-approximation algorithm. We also study the reversal and transposition distance between two genomes (and the variation with non-conservative events) and we present a 6-approximation algorithm.

Access provided by Autonomous University of Puebla. Download conference paper PDF

An improved approximation algorithm for the reversal and transposition distance considering gene order and intergenic sizes

Article Open access 29 December 2021

A 3.5-Approximation Algorithm for Sorting by Intergenic Transpositions

Super Short Reversals on Both Gene Order and Intergenic Sizes

Keywords

1 Introduction

Genome rearrangements are events that insert, remove or change the position and/or orientation of large stretches of the genetic material. When we compare the genomes of two individuals, one of the main goals is to estimate the sequence of rearrangement events that transformed one genome into the other.

A model $\mathcal {M}$ determines the set of rearrangement events allowed to modify the genome. The size of the smallest sequence of rearrangement events in a model $\mathcal {M}$ capable of transforming a genome into another is called rearrangement distance. When we assume no duplicated gene, we can map every gene to a unique number to represent genomes as permutations of integer elements. Usually, rearrangement problems are treated as sorting problems and the goal is to turn any genome $\pi $ into a specific genome $\iota =(1,\dots ,n)$, which is called identity permutation. If the orientation of the genes is known, a positive or negative sign is assigned to each element. Otherwise, signs are omitted.

When the gene orientations are unknown, Caprara [6] proved that the Sorting Permutations by Reversals problem belongs to NP-hard problems class. The best algorithm has an approximation factor of 1.375 and was presented by Berman et al. [1]. Sorting Permutations by Transpositions problem also belongs to NP-hard problems class [4] and the best algorithm has an approximation factor of 1.375 [7]. When we consider a model allowing reversals and transpositions we have the Sorting Permutations by Reversals and Transpositions problem. The best algorithm for this problem has an approximation factor of $2.8334+\epsilon $, for any $\epsilon > 0$ [10] and its complexity is unknown.

The representation of a genome only as a gene sequence is a technique very useful, but all the information that is not present on the genes are discarded that implies in loss of information. In particular, the intergenic regions between consecutive genes are not taken into account by these representations. Recently, authors have suggested that incorporating this information may improve the distance estimate from an evolutionary point of view [2, 3]. Thus, it is justified to perform investigations considering the order of the genes and the size of the intergenic regions. Works considering gene order and intergenic sizes have already been presented. A model allowing Double-Cut and Join (DCJ) operation was presented together with the NP-hard proof and a 4/3-approximation algorithm [8]. The DCJ [11] is a rearrangement event that cuts the genome into two points and the extremities of the resulting segments are reassembled following specific criteria. A model allowing DCJs, insertions, and deletions also was investigated and an exact polynomial time algorithm was designed [5] when insertions and deletions act only on intergenic regions. Besides, Oliveira et al. [9] presented a model that allows the use of only super short reversals (reversals that affect at most two genes) also considering intergenic regions size.

In this paper, we consider that gene orientations are unknown. We investigate four problems to estimate the distance between genomes also taking into account intergenic regions, two of them allowing just conservative events of reversal and transposition: Sorting by Intergenic Reversals (SbIR) and Sorting by Intergenic Reversals and Transpositions (SbIRT). We also investigate two other problems that allow non-conservative events of insertion and deletion: Sorting by Intergenic Reversals, Insertions, and Deletions (SbIRID) and Sorting by Intergenic Reversals, Transpositions, Insertions, and Deletions (SbIRTID).

This manuscript is organized as follows. Section 2 provides definitions that are used throughout the paper. Section 3 presents the complexity proofs for SbIR and SbIRID problems and an approximation algorithm for each problem addressed in this manuscript. Section 4 concludes the paper.

2 Basic Definitions

Given a genome $\mathcal {G}$ as a sequence of n genes $g_1,\dots ,g_n$ and a sequence of $n+1$ intergenic regions $r_1,\dots ,r_{n+1}$, each gene is surrounded by two intergenic regions so that: $\mathcal {G} = (r_1, g_1, r_2,g_2, \ldots , r_n, g_n, r_{n+1})$. Our model assumes that (i) genes orientation is unknown, (ii) there are no duplicate genes, and (iii) the considered genomes share the same set of genes. In this way, we can assign each gene a unique value and map the gene sequence as a permutation $\varvec{\pi = (\pi _1 \; \pi _2 \; \ldots \; \pi _n)}$, $\pi _i \in \mathbb {N}$, $1 \le \pi _i \le n$, and $\pi _i \ne \pi _j$ for all $i \ne j$.

Rearrangement events can split intergenic regions, and each intergenic region has a well-defined amount of nucleotides. Thus, we represent it by the size (amount of nucleotides). The sequence of intergenic regions $\varvec{\breve{\pi }= (\breve{\pi }_1 \; \breve{\pi }_2 \; \ldots \; \breve{\pi }_{n+1})}$, $\breve{\pi }_i \in \mathbb {N}$, represents the respective intergenic region sizes, such that $\breve{\pi }_i$ is on the left side of $\pi _i$, whereas $\breve{\pi }_{i+1}$ is on the right side.

Since we represent genes as a permutation $\pi $, we can treat the problem as a sorting problem in which the target permutation is the identity $\iota $. This approach is widely used and it means that if we are able to transform ($\pi ,\breve{\pi }$) into ($\iota ,\breve{\iota }$) we can also transform ($\alpha ,\breve{\alpha }$) into ($\sigma ,\breve{\sigma }$). Therefore, an instance of our problem is composed by three elements $(\pi , \breve{\pi },\breve{\iota })$ and the rearrangement distance considering a model $\mathcal {M}$ is represented as $d_\mathcal {M}(\pi , \breve{\pi },\breve{\iota })$.

Definition 1

An intergenic reversal $\rho ^{(i, j)}_{(x, y)}$, with $1 \le i \le j \le n$, $0 \le x \le \breve{\pi }_i$, $0 \le y \le \breve{\pi }_{j+1}$, and $\{x,y\} \subset \mathbb {N}$, splits the intergenic regions $\breve{\pi }_i$ and $\breve{\pi }_{j+1}$ into pieces with sizes $(x, x^{\prime }=\breve{\pi }_i-x)$ and $(y, y^{\prime }=\breve{\pi }_{j+1}-y)$, respectively. The sequence $(x^{\prime },\pi _i,\dots ,\pi _j,y)$ is inverted and the pieces (x, y) and $(x^{\prime },y^{\prime })$ form new intergenic regions $\breve{\pi }_i$ and $\breve{\pi }_{j+1}$, respectively.

Definition 2

An intergenic transposition $\tau ^{(i,j,k)}_{(x,y,z)}$, with $1 \le i< j < k \le n+1$, $0 \le x \le \breve{\pi }_i$, $0 \le y \le \breve{\pi }_j$, $0 \le z \le \breve{\pi }_k$, and $\{x,y,z\} \subset \mathbb {N}$, splits the intergenic regions $\breve{\pi }_i$, $\breve{\pi }_{j}$, and $\breve{\pi }_k$ into pieces with sizes $(x, x^{\prime }=\breve{\pi }_i-x)$, $(y, y^{\prime }=\breve{\pi }_j-y)$, and $(z, z^{\prime }=\breve{\pi }_k-z)$, respectively. The sequences $(x^{\prime },\pi _i,\dots ,y)$ and $(y^{\prime },\pi _j,\dots ,z)$ swap positions without changing orientation, and pieces $(x,y^{\prime })$, $(z,x^{\prime })$, and $(y,z^{\prime })$ form new intergenic regions $\breve{\pi }_i$, $\breve{\pi }_{k+i-j}$, and $\breve{\pi }_{k}$, respectively.

Figures 1(a) and (b) show a generic example of intergenic reversal and intergenic transposition, respectively.

Definition 3

An intergenic insertion $\phi ^i_x$, such that $1 \le i \le (n+1)$, $x > 0$, and $x \in \mathbb {N}$ acts on the intergenic region $\breve{\pi }_i$ inserting an amount x of nucleotides.

Definition 4

An intergenic deletion $\psi ^i_x$, such that $1 \le i \le (n+1)$, $0 < x \le \breve{\pi }_i$, and $x \in \mathbb {N}$ acts on the intergenic region $\breve{\pi }_i$ removing an amount x of nucleotides.

From now on, we will refer to the operations simply as a reversal, transposition, insertion, and deletion. Since reversal and transposition do not insert or remove genetic material, when dealing solely with conservative events the sum of intergenic regions size in both genomes is the same.

Definition 5

Given a permutation $\pi $ with n elements, the extended permutation $\pi ^{\prime }$ is a permutation with two new elements $\pi ^{\prime }_0=0$ and $\pi ^{\prime }_{n+1}=n+1$.

From now on, we use the term permutation to denote the extended permutation. We say that an intergenic region is of (in)correct size if it occurs between two elements consecutive both in $\pi '$ and $\iota '$ and whose size is (not) the same in $\pi '$ and $\iota '$.

Definition 6

An intergenic breakpoint is a pair of elements $\pi ^{\prime }_i$ and $\pi ^{\prime }_{i+1}$ of the permutation $\pi ^{\prime }$, such that either they are not consecutive in the identity permutation $\iota ^{\prime }$, or they are consecutive and the intergenic region between them has an incorrect size.

An intergenic breakpoint represents a region that at some point has to be affected by some operation either to fix the gene order or the size of the intergenic region. Figure 2 shows examples of intergenic breakpoints.

Definition 7

The number of intergenic breakpoints is denoted as $ib(\pi ^{\prime },\breve{\pi },\breve{\iota })$.

Remark 1

The instance $(\iota ^{\prime }, \breve{\iota }, \breve{\iota })$ has a property that does not occur in any other instance: $ib(\iota ^{\prime }, \breve{\iota }, \breve{\iota }) = 0$.

Definition 8

The variation in the number of intergenic breakpoints after applying a sequence of operations S is denoted as $\varDelta _{ib}(\pi ^{\prime }, \breve{\pi }, \breve{\iota }, S)$, such that:

$$\varDelta _{ib}(\pi ^{\prime }, \breve{\pi }, \breve{\iota }, S) = ib((\pi ^{\prime }, \breve{\pi }, \breve{\iota })\cdot S) - ib(\pi ^{\prime }, \breve{\pi }, \breve{\iota }).$$

Definition 9

A pair (a, b) is a block in a permutation $\pi ^{\prime }$ if $|a-b|=1$, the elements a and b are consecutive in the permutation $\pi ^{\prime }$, and the intergenic region between them has a correct size.

Definition 10

Two intergenic breakpoints $(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$ and $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$, such that $i < j$, are connected if the following conditions are fulfilled:

i.
At least one of the pairs $(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$, $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$, $(\pi ^{\prime }_i,\pi ^{\prime }_{j})$, $(\pi ^{\prime }_i,\pi ^{\prime }_{j+1})$, $(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j})$, or $(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j+1})$ corresponds to two consecutive elements in the identity permutation $\iota ^{\prime }$ that do not form a block in the permutation $\pi ^{\prime }$.
ii.
$\breve{\pi }_{i+1} + \breve{\pi }_{j+1} \ge \breve{\iota }_k$, such that $\breve{\iota }_k$ is the size of the intergenic region between the pair of consecutive elements in the identity permutation $\iota ^{\prime }$.

Connected intergenic breakpoints represent regions that have the potential to remove at least one intergenic breakpoint by placing two consecutive elements and fixing the size of the intergenic region between them. For example, in Fig. 2, the intergenic breakpoints $(\pi ^{\prime }_3,\pi ^{\prime }_4)$ and $(\pi ^{\prime }_6,\pi ^{\prime }_7)$ are connected while the intergenic breakpoints $(\pi ^{\prime }_2,\pi ^{\prime }_3)$ and $(\pi ^{\prime }_3,\pi ^{\prime }_4)$ are not.

3 Results

We start this section by showing that problems SbIR and SbIRID belong to the problem class NP-hard. For this, we used a reduction of Sorting by Reversals problem, which does not consider intergenic regions. Then, we present lower bounds (Subsect. 3.1) and approximation algorithms (Subsect. 3.2) for each of the variations of the problems addressed in this work.

The Sorting by Reversals problem (SbR) has already been proven NP-hard [6]. An instance of this problem consists of a permutation $\delta $ and a natural number d. The goal is to determine if its possible to transform $\delta $ into $\iota $ applying at most d reversals.

Lemma 1

SbIR problem is NP-hard.

Proof

We can reduce all instances of SbR to instances of SbIR by setting $\pi =\delta $ and $\breve{\pi }= \breve{\iota }= (0 \; 0 \; ... \; 0)$. Note that it is possible to transform $\delta $ into $\iota $ applying at most d reversals if and only if $d_{SbIR}(\pi , \breve{\pi }, \breve{\iota }) \le d$. $\square $

Lemma 2

SbIRID problem is NP-hard.

Proof

We can reduce all instances of SbR to instances of SbIRID by setting $\pi =\delta $ and $\breve{\pi }= \breve{\iota }= (0 \; 0 \; ... \; 0)$. Note that for these instances of the SbIRID problem no insertion and deletion will be applied, otherwise we could get a smaller sequence of reversals just by ignoring the insertions and deletions. That way, it is possible to transform $\delta $ into $\iota $ applying at most d reversals if and only if $d_{SbIRID}(\pi , \breve{\pi }, \breve{\iota }) \le d$. $\square $

3.1 Lower Bounds

Following lemmas present lower bounds for each problem we consider.

Lemma 3

$\varDelta _{ib}(\pi ^{\prime }, \breve{\pi }, \breve{\iota },\rho ) \ge -2$ for any reversal $\rho $.

Proof

Suppose that $(\pi ^{\prime }_{i-1},\pi ^{\prime }_i)$ and $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$ are intergenic breakpoints. In this case, the best scenario after applying reversal $\rho ^{(i, j)}_{(x, y)}$ removes the intergenic breakpoints $(\pi ^{\prime }_{i-1},\pi ^{\prime }_i)$ and $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$, reducing the number of intergenic breakpoints by two. Since any reversal only affects the neighborhood of two pairs of genes and two intergenic regions it is impossible to remove more than two intergenic breakpoints. $\square $

Lemma 4

$\varDelta _{ib}(\pi ^{\prime }, \breve{\pi }, \breve{\iota },\tau ) \ge -3$ for any transposition $\tau $.

Proof

Suppose that $(\pi ^{\prime }_{i-1},\pi ^{\prime }_i)$, $(\pi ^{\prime }_{j-1},\pi ^{\prime }_j)$, and $(\pi ^{\prime }_{k-1},\pi ^{\prime }_k)$ are intergenic breakpoints. In this case, the best scenario is a transposition $\tau ^{(i,j,k)}_{(x,y,z)}$ that removes the intergenic breakpoints $(\pi ^{\prime }_{i-1},\pi ^{\prime }_i)$, $(\pi ^{\prime }_{j-1},\pi ^{\prime }_j)$, and $(\pi ^{\prime }_{k-1},\pi ^{\prime }_k)$, reducing the number of intergenic breakpoints by three. Since any transposition only affects the neighborhood of three pairs of genes and three intergenic regions it is impossible to remove more than three intergenic breakpoints. $\square $

Lemma 5

$\varDelta _{ib}(\pi ^{\prime }, \breve{\pi }, \breve{\iota }, \phi ) \ge -1$ for any insertion $\phi $.

Proof

As an insertion acts in just one intergenic region this means that the best scenario is to remove the intergenic breakpoint $(\pi ^{\prime }_{i-1},\pi ^{\prime }_i)$ after applying an insertion $\phi ^i_x$, reducing by one the number of intergenic breakpoints. $\square $

Lemma 6

$\varDelta _{ib} (\pi ^{\prime }, \breve{\pi }, \breve{\iota }, \psi ) \ge -1$ for any deletion $\psi $.

Proof

As a deletion acts in just one intergenic region this means that the best scenario is to remove the intergenic breakpoint $(\pi ^{\prime }_{i-1},\pi ^{\prime }_i)$ after applying a deletion $\psi ^i_x$, reducing by one the number of intergenic breakpoints. $\square $

Theorem 1

$d_{SbIR}(\pi , \breve{\pi }, \breve{\iota }) \ge \frac{ib(\pi ^{\prime },\breve{\pi }, \breve{\iota })}{2}$.

Proof

By the Remark 1, we know that $(\iota ^{\prime },\breve{\iota },\breve{\iota })$ is the only instance with no intergenic breakpoints. To achieve the identity permutation and to fix the intergenic region sizes we need to remove $ib(\pi ^{\prime }, \breve{\pi }, \breve{\iota })$ intergenic breakpoints. Also, by Lemma 3, a reversal removes at most two intergenic breakpoints and lemma follows. $\square $

Theorem 2

$d_{SbIRID}(\pi , \breve{\pi }, \breve{\iota }) \ge \frac{ib(\pi ^{\prime },\breve{\pi }, \breve{\iota })}{2}$.

Proof

Directly by Lemmas 5 and 6, and Theorem 1. $\square $

Theorem 3

$d_{SbIRT}(\pi , \breve{\pi }, \breve{\iota }) \ge \frac{ib(\pi ^{\prime },\breve{\pi }, \breve{\iota })}{3}$.

Proof

Directly by Remark 1 and Lemmas 3 and 4. $\square $

Theorem 4

$d_{SbIRTID}(\pi , \breve{\pi }, \breve{\iota }) \ge \frac{ib(\pi ^{\prime },\breve{\pi }, \breve{\iota })}{3}$.

Proof

Directly by Lemmas 5 and 6, and Theorem 3. $\square $

3.2 Approximation Algorithms

In this subsection, we will present four approximation algorithms. Initially, we will show an algorithm with approximation factor 4 for the SbIR and SbIRID problems. Then, we will present an algorithm with approximation factor 6 for the SbIRT and SbIRTID problems.

Lemma 7

Let $(\pi ,\breve{\pi },\breve{\iota })$ be an instance such that $\sum _{i=1}^{n+1}\breve{\pi }_i \ge \sum _{i=1}^{n+1}\breve{\iota }_i$ and the number of intergenic breakpoints is greater than one. It is always possible to find at least one pair of intergenic breakpoints that are connected.

Proof

Since $ib(\pi ^{\prime },\breve{\pi },\breve{\iota }) > 1$, we can find at least a pair of intergenic breakpoints. We have to show that at least one of those pairs will be connected. Suppose that exists an instance $(\pi ,\breve{\pi },\breve{\iota })$, such that $\sum _{i=1}^{n+1}\breve{\pi }_i \ge \sum _{i=1}^{n+1}\breve{\iota }_i$, $ib(\pi ^{\prime },\breve{\pi },\breve{\iota }) > 1$, and there is not a pair of intergenic breakpoints that are connected. The possibilities for not finding such a pair of intergenic breakpoints are:

For all pairs of intergenic breakpoints $(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$ and $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$ the elements $(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$, $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$, $(\pi ^{\prime }_i,\pi ^{\prime }_{j})$, $(\pi ^{\prime }_i,\pi ^{\prime }_{j+1})$, $(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j})$, and $(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j+1})$ are not consecutive in the identity permutation, but if it is true $\pi $ cannot be a permutation.
For all pairs of intergenic breakpoints $(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$ and $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$ we do not have enough intergenic material to remove any intergenic breakpoint $\breve{\pi }_{i+1} + \breve{\pi }_{j+1} < \breve{\iota }_k$, such that $\breve{\iota }_k$ is the size of the intergenic region between the pair of consecutive elements in the identity permutation. If it is true it implies that $\sum _{i=1}^{n+1}\breve{\pi }_i < \sum _{i=1}^{n+1}\breve{\iota }_i$ and that contradicts the initial assumption. $\square $

Lemma 8

Let $(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$ and $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$ be intergenic breakpoints that are connected. It is possible to remove at least one intergenic breakpoint after at most two reversals.

Proof

We analyze the possibilities to remove an intergenic breakpoint based on a pair of connected intergenic breakpoints.

i.
$(\pi ^{\prime }_i,\pi ^{\prime }_{j})$ or $(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j+1})$ are consecutive in the identity permutation: these cases are symmetric and we need to apply only one reversal $\rho ^{(i+1,j)}_{(x,y)}$ to place the element $\pi ^{\prime }_{j}$ on the right side of the element $\pi ^{\prime }_i$ or $\pi ^{\prime }_{i+1}$ on the left side of $\pi ^{\prime }_{j+1}$. As $\breve{\pi }_{i+1} + \breve{\pi }_{j+1} \ge \breve{\iota }_k$, then the x and y parameters can always be chosen properly to fill the intergenic region with the correct size between the consecutive elements generated (Fig. 3(a)).
ii.
$(\pi ^{\prime }_i,\pi ^{\prime }_{j+1})$: in this case we apply two consecutive reversals. In this scenario, we need an intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$, such that $k < i$ or $k > j$, to apply the sequence of reversals without creating new intergenic breakpoints. We will prove that exists such intergenic breakpoint by contradiction. Suppose that there is no intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$ such that $k < i$ or $k > j$. This means that the segments $(\pi ^{\prime }_0,\dots ,\pi ^{\prime }_i)$ and $(\pi ^{\prime }_{j+1},\dots ,\pi ^{\prime }_{n+1})$ are composed of consecutive elements with no intergenic breakpoint between them; also we know that $\pi ^{\prime }_i$ and $\pi ^{\prime }_{j+1}$ are consecutive elements, but if both statements are true it implies that there are no valid values for the elements $\pi _{i+1}$ and $\pi _j$ of the permutation $\pi $. If $k < i$ we apply a reversal $\rho ^{(k+1,i)}_{(0,\breve{\pi }_{i+1})}$ to obtain the case (i) (Fig. 3(b)). If $k > j$ we apply a reversal $\rho ^{(j+1,k)}_{(0,\breve{\pi }_{k+1})}$ to obtain the case (i) (Fig. 3(c)). Note that in both scenarios the intergenic regions sizes remains the same and the case (i) can be applied (Fig. 3(b)).
iii.
$(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j})$: In this case we apply two consecutive reversals. In this scenario, we need an intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$, such that $k > i$ and $k < j$, to apply the sequence of reversals without creating new intergenic breakpoints. We will prove that exists such intergenic breakpoint by contradiction. Suppose that there is no intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$ such that $k > i$ and $k < j$. This means that the segment $(\pi ^{\prime }_{i+1},\dots ,\pi ^{\prime }_j)$ is composed of consecutive elements with no intergenic breakpoint between them; also we know that $(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j})$ are consecutive elements, but if both statements are true implies that there are no valid values for the elements $\pi _{i+1}$ and $\pi _j$ of the permutation $\pi $. After identifying the intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$ we apply a reversal $\rho ^{(i+1,k)}_{(0,\breve{\pi }_{k+1})}$ (Fig. 3(d)) to obtain the case (i).
iv.
$(\pi ^{\prime }_{i},\pi ^{\prime }_{i+1})$ or $(\pi ^{\prime }_{j},\pi ^{\prime }_{j+1})$: these cases are symmetric and we need to apply two consecutive reversals. Initially, we apply a reversal $\rho ^{(i+1, j)}_{(0,\breve{\pi }_{j+1})}$ without changing the intergenic regions sizes, as result we obtain the case (i) (Fig. 3(e)). $\square $

Theorem 5

SbIR problem is 4-approximable.

Proof

While the permutation is not sorted and while the permutation has intergenic regions with incorrect size, it is always possible to remove at least one intergenic breakpoint after applying at most two reversals (Lemmas 7 and 8). In the worst case, it gives us a total of $2ib(\pi ^{\prime },\breve{\pi },\breve{\iota })$ reversals to transform $(\pi ,\breve{\pi },\breve{\iota })$ into $(\iota ,\breve{\iota },\breve{\iota })$. By the Theorem 1, we obtained the lower bound $ib(\pi ^{\prime },\breve{\pi },\breve{\iota })/2$ and the theorem follows. $\square $

Lemma 9

Let $(\pi ,\breve{\pi },\breve{\iota })$ be an instance of the Sorting by Intergenic Reversals, Insertions, and Deletions problem, such that $ib(\pi ^{\prime }, \breve{\pi }, \breve{\iota }) > 0$. It is always possible to find an insertion $\phi $ such that $\varDelta _{ib} (\pi ^{\prime }, \breve{\pi }, \breve{\iota }, \phi ) \le 0$.

Proof

Since $ib(\pi ^{\prime }, \breve{\pi }, \breve{\iota }) > 0$, then it exists at least one intergenic breakpoint that we can apply an insertion in this region. Therefore, in the worst case the amount of intergenic breakpoints remains the same. $\square $

Lemma 10

Let $(\pi ^{\prime },\breve{\pi },\breve{\iota })$ be an instance of the Sorting by Intergenic Reversals, Insertions, and Deletions problem, such that $ib(\pi ^{\prime }, \breve{\pi }, \breve{\iota }) = 1$ and $\sum _{i=1}^{n+1}\breve{\pi }_i > \sum _{i=1}^{n+1}\breve{\iota }_i$. It is always possible to find a deletion $\psi $ such that $\varDelta _{ib} (\pi ^{\prime }, \breve{\pi }, \breve{\iota }, \psi ) = -1$.

Proof

Since $ib(\pi ^{\prime }, \breve{\pi }, \breve{\iota }) = 1$, then we know that $\pi ^{\prime } = \iota ^{\prime }$, otherwise the number of intergenic breakpoints should be greater than one. Since all the elements of the permutation $\pi ^{\prime }$ are consecutive and $\sum _{i=1}^{n+1}\breve{\pi }_i > \sum _{i=1}^{n+1}\breve{\iota }_i$, there is an intergenic region $\breve{\pi }_k$, such as $\breve{\pi }_k > \breve{\iota }_k$. Thus the deletion $\psi ^{k}_{\breve{\iota }_k-\breve{\pi }_k}$ removes the intergenic breakpoint $(\pi ^{\prime }_{k-1},\pi ^{\prime }_k)$ and the lemma follows. $\square $

Theorem 6

SbIRID problem is 4-approximable.

Proof

We are going to divide the proof into three cases:

i.
$\sum _{i=1}^{n+1}\breve{\pi }_i > \sum _{i=1}^{n+1}\breve{\iota }_i$: Lemmas 7 and 8 remain valid as long as $ib(\pi ^{\prime },\breve{\pi },\breve{\iota }) > 1$, then we apply only one deletion to remove the last intergenic breakpoint (Lemma 10).
ii.
$\sum _{i=1}^{n+1}\breve{\pi }_i = \sum _{i=1}^{n+1}\breve{\iota }_i$: Lemmas 7 and 8 are sufficient to sort the permutation and to fix the sizes of the intergenic regions by applying only reversals.
iii.
$\sum _{i=1}^{n+1}\breve{\pi }_i < \sum _{i=1}^{n+1}\breve{\iota }_i$: Initially we apply an insertion to make $\sum _{i=1}^{n+1}\breve{\pi }_i = \sum _{i=1}^{n+1}\breve{\iota }_i$ (Lemma 9). Sequentially, Lemmas 7 and 8 guarantee that the permutation will be sorted and the sizes of the intergenic regions will be fixed applying only reversals. Note that it is not guaranteed that the initial insertion removes any intergenic breakpoint, but then only reversals are applied and the last reversal must remove two intergenic breakpoints. Considering the insertion and the last reversal, on average, we were able to remove one intergenic breakpoint after applying one operation.

Considering the three cases, in the worst scenario, it gives us a total of $2ib(\pi ^{\prime },\breve{\pi },\breve{\iota })$ operations to transform $(\pi ,\breve{\pi },\breve{\iota })$ into $(\iota ,\breve{\iota },\breve{\iota })$. By Theorem 2, we obtained the lower bound $ib(\pi ^{\prime },\breve{\pi },\breve{\iota })/2$ and the theorem follows. $\square $

Lemma 11

Let $(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$ and $(\pi ^{\prime }_j,\pi ^{\prime }_{j+1})$ be intergenic breakpoints that are connected. It is possible to remove at least one intergenic breakpoint after at most two operations of reversal or transposition.

Proof

Similar to Lemma 8, we will analyze the possibilities to remove an intergenic breakpoint based on a pair of connected intergenic breakpoints.

i.
$(\pi ^{\prime }_i,\pi ^{\prime }_{j})$ or $(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j+1})$: We need to apply only one reversal, which is exactly as the procedure shown in case (i) of Lemma 8.
ii.
$(\pi ^{\prime }_i,\pi ^{\prime }_{j+1})$: In this case we need to apply one transposition. As shown previously in case (ii) of Lemma 8, we know that it must exist another intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$ such that $k < i$ or $k > j$. If $k < i$ we apply a transposition $\tau ^{(k+1,i+1,j+1)}_{(x,y,z)}$ to place the element $\pi ^{\prime }_{i}$ on the left side of the element $\pi ^{\prime }_{j+1}$ (Fig. 4(a)). If $k > j$ we apply a transposition $\tau ^{(i+1,j+1,k+1)}_{(x,y,z)}$ to place the element $\pi ^{\prime }_{j+1}$ on the right side of the element $\pi ^{\prime }_{i}$ (Fig. 4(b)). In both scenarios, we have that $\breve{\pi }_{i+1} + \breve{\pi }_{j+1} \ge \breve{\iota }_k$, then the x, y, and z parameters always can be chosen properly to fill the intergenic region with the correct size between the consecutive elements generated.
iii.
$(\pi ^{\prime }_{i+1},\pi ^{\prime }_{j})$: In this case we need to apply one transposition. As shown previously in case (iii) of Lemma 8, we know that it must exist another intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$ such that $k > i$ and $k < j$. After identifying the intergenic breakpoint $(\pi ^{\prime }_k,\pi ^{\prime }_{k+1})$ we apply a transposition $\tau ^{(i+1,k+1,j+1)}_{(x,y,z)}$ to place the element $\pi ^{\prime }_{j}$ on the left side of the element $\pi ^{\prime }_{i+1}$ (Fig. 4(c)). Since $\breve{\pi }_{i+1} + \breve{\pi }_{j+1} \ge \breve{\iota }_k$, then the x, y and z parameters always can be chosen properly to fill the intergenic region with the correct size between the consecutive elements generated.
iv
$(\pi ^{\prime }_i,\pi ^{\prime }_{i+1})$ or $(\pi ^{\prime }_{j},\pi ^{\prime }_{j+1})$: We need to apply two reversals exactly as the procedure shown on Lemma 8 case (iv). $\square $

Theorem 7

SbIRT problem is 6-approximable.

Proof

While the permutation is not sorted and with all the intergenic regions with the correct size it is always possible to remove at least one intergenic breakpoint after applying at most two operations (Lemmas 7 and 11). In the worst case, it gives us a total of $2ib(\pi ^{\prime },\breve{\pi },\breve{\iota })$ operations to transform $(\pi ,\breve{\pi },\breve{\iota })$ into $(\iota ,\breve{\iota },\breve{\iota })$. By Theorem 3, we obtained the lower bound $ib(\pi ^{\prime },\breve{\pi },\breve{\iota })/3$ and the theorem follows. $\square $

Theorem 8

SbIRTID problem is 6-approximable.

Proof

Similar to Theorem 6, we are going to divide the analysis into three cases:

i.
$\sum _{i=1}^{n+1}\breve{\pi }_i > \sum _{i=1}^{n+1}\breve{\iota }_i$: Lemmas 7 and 11 remain valid as long as $ib(\pi ^{\prime },\breve{\pi },\breve{\iota }) > 1$, then we apply only one deletion to remove the last intergenic breakpoint (Lemma 10).
ii.
$\sum _{i=1}^{n+1}\breve{\pi }_i = \sum _{i=1}^{n+1}\breve{\iota }_i$: Lemmas 7 and 11 are sufficient to sort the permutation and to fix the sizes of the intergenic regions by applying only reversals and transpositions.
iii.
$\sum _{i=1}^{n+1}\breve{\pi }_i < \sum _{i=1}^{n+1}\breve{\iota }_i$: Initially we apply an insertion to make $\sum _{i=1}^{n+1}\breve{\pi }_i = \sum _{i=1}^{n+1}\breve{\iota }_i$ (Lemma 9). Sequentially, Lemmas 7 and 11 guarantee that the permutation will be sorted and the sizes of the intergenic regions will be fixed applying only reversals and transpositions. Note that it is not guaranteed that the initial insertion removes any intergenic breakpoint, but then only reversals and transpositions are applied and the last operation (reversal or transposition) must remove at least two intergenic breakpoints. Considering the insertion and the last operation, on average, we were able to remove one intergenic breakpoint after applying one operation.

Considering the three cases, in the worst scenario, it gives us a total of $2ib(\pi ^{\prime },\breve{\pi },\breve{\iota })$ operations to transform $(\pi ,\breve{\pi },\breve{\iota })$ into $(\iota ,\breve{\iota },\breve{\iota })$. By Theorem 4, we obtained the lower bound $ib(\pi ^{\prime },\breve{\pi },\breve{\iota })/3$, and the theorem follows. $\square $

4 Conclusion

We proved that the problems of Sorting by Intergenic Reversals and Sorting by Intergenic Reversals, Insertions, and Deletions belong to NP-hard problems class. Besides, we presented for both problems an algorithms with approximation factor 4. We also investigate the Sorting by Intergenic Reversals and Transpositions problem and the variation with non-conservative events of insertion and deletion. For both problems, we designed approximation algorithms of factor 6.

As future works, we intend to improve the approximation factors of the algorithms and develop cost functions that consider the likelihood of each operation.

References

Berman, P., Hannenhalli, S., Karpinski, M.: 1.375-approximation algorithm for sorting by reversals. In: Möhring, R., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 200–210. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45749-6_21
Chapter Google Scholar
Biller, P., Guéguen, L., Knibbe, C., Tannier, E.: Breaking good: accounting for fragility of genomic regions in rearrangement distance estimation. Genome Biol. Evol. 8(5), 1427–1439 (2016)
Article Google Scholar
Biller, P., Knibbe, C., Beslon, G., Tannier, E.: Comparative genomics on artificial life. In: Beckmann, A., Bienvenu, L., Jonoska, N. (eds.) CiE 2016. LNCS, vol. 9709, pp. 35–44. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40189-8_4
Chapter Google Scholar
Bulteau, L., Fertin, G., Rusu, I.: Sorting by transpositions is difficult. SIAM J. Comput. 26(3), 1148–1180 (2012)
MathSciNet MATH Google Scholar
Bulteau, L., Fertin, G., Tannier, E.: Genome rearrangements with indels in intergenes restrict the scenario space. BMC Bioinform. 17(14), 426 (2016)
Article Google Scholar
Caprara, A.: Sorting permutations by reversals and eulerian cycle decompositions. SIAM J. Discrete Math. 12(1), 91–110 (1999)
Article MathSciNet Google Scholar
Elias, I., Hartman, T.: A 1.375-approximation algorithm for sorting by transpositions. IEEE/ACM Trans. Comput. Biol. Bioinform. 3(4), 369–379 (2006)
Article Google Scholar
Fertin, G., Jean, G., Tannier, E.: Algorithms for computing the double cut and join distance on both gene order and intergenic sizes. Algorithms Mol. Biol. 12(1), 16 (2017)
Article Google Scholar
Oliveira, A.R., Jean, G., Fertin, G., Dias, U., Dias, Z.: Super short reversals on both gene order and intergenic sizes. In: Alves, R. (ed.) BSB 2018. LNCS, vol. 11228, pp. 14–25. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01722-4_2
Chapter Google Scholar
Rahman, A., Shatabda, S., Hasan, M.: An approximation algorithm for sorting by reversals and transpositions. J. Discrete Algorithms 6(3), 449–457 (2008)
Article MathSciNet Google Scholar
Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21(16), 3340–3346 (2005)
Article Google Scholar

Download references

Acknowledgments

This work was supported by the National Council for Scientific and Technological Development - CNPq (grants 400487/2016-0, 425340/ 2016-3, and 140466/2018-5), the São Paulo Research Foundation - FAPESP (grants 2013/08293-7, 2015/ 11937-9, 2017/12646-3, and 2017/16246-0), the Brazilian Federal Agency for the Support and Evaluation of Graduate Education - CAPES, and the CAPES/COFECUB program (grant 831/15).

Author information

Authors and Affiliations

Institute of Computing, University of Campinas, Campinas, Brazil
Klairton Lima Brito, Andre Rodrigues Oliveira & Zanoni Dias
LS2N, UMR CNRS 6004, University of Nantes, Nantes, France
Géraldine Jean & Guillaume Fertin
School of Technology, University of Campinas, Limeira, Brazil
Ulisses Dias

Authors

Klairton Lima Brito
View author publications
You can also search for this author in PubMed Google Scholar
Géraldine Jean
View author publications
You can also search for this author in PubMed Google Scholar
Guillaume Fertin
View author publications
You can also search for this author in PubMed Google Scholar
Andre Rodrigues Oliveira
View author publications
You can also search for this author in PubMed Google Scholar
Ulisses Dias
View author publications
You can also search for this author in PubMed Google Scholar
Zanoni Dias
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Klairton Lima Brito .

Editor information

Editors and Affiliations

Georgia State University, Atlanta, GA, USA
Zhipeng Cai
Georgia State University, Atlanta, GA, USA
Pavel Skums
Central South University, Changsha, China
Min Li

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Brito, K.L., Jean, G., Fertin, G., Oliveira, A.R., Dias, U., Dias, Z. (2019). Sorting by Reversals, Transpositions, and Indels on Both Gene Order and Intergenic Sizes. In: Cai, Z., Skums, P., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2019. Lecture Notes in Computer Science(), vol 11490. Springer, Cham. https://doi.org/10.1007/978-3-030-20242-2_3

Download citation

DOI: https://doi.org/10.1007/978-3-030-20242-2_3
Published: 09 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20241-5
Online ISBN: 978-3-030-20242-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Sorting by Reversals, Transpositions, and Indels on Both Gene Order and Intergenic Sizes

Abstract

Similar content being viewed by others

An improved approximation algorithm for the reversal and transposition distance considering gene order and intergenic sizes

A 3.5-Approximation Algorithm for Sorting by Intergenic Transpositions

Super Short Reversals on Both Gene Order and Intergenic Sizes

Keywords

1 Introduction

2 Basic Definitions

Definition 1

Definition 2

Definition 3

Definition 4

Definition 5

Definition 6

Definition 7

Remark 1

Definition 8

Definition 9

Definition 10

3 Results

Lemma 1

Proof

Lemma 2

Proof

3.1 Lower Bounds

Lemma 3

Proof

Lemma 4

Proof

Lemma 5

Proof

Lemma 6

Proof

Theorem 1

Proof

Theorem 2

Proof

Theorem 3

Proof

Theorem 4

Proof

3.2 Approximation Algorithms

Lemma 7

Proof

Lemma 8

Proof

Theorem 5

Proof

Lemma 9

Proof

Lemma 10

Proof

Theorem 6

Proof

Lemma 11

Proof

Theorem 7

Proof

Theorem 8

Proof

4 Conclusion

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation