Keywords

1 Introduction

String comparison is one of the most fundamental problems in many fields such as bioinformatics and data compression. In computer science, the difference between two strings is often measured by edit distance, the number of edit operations required to transform one string into the other. The most widely known definitions of edit distance include insertion, deletion, and/or substitution operations. However, the more general edit distance with moves problem studied in [10] allows an additional operation wherein an entire block of text is shifted within a string.

These shift operations, also known as rearrangements, are especially relevant in biology [8, 18]. String comparison can be performed on DNA or protein sequences to estimate how closely related different species are. In data compression, we may want to store many similar strings as a single string along with the edits required to recover all strings. These two applications even overlap naturally in the field of bioinformatics where extremely large datasets of biological sequences are common. For example, the challenge of pan-genome storage is to store many highly similar sequences from the same clade such as a bacterial species.

One way to capture just the “moves” operation is to solve the Minimum Common String Partition problem (MCSP) which seeks to partition two strings into minimum cardinality sets of substrings that are permutations of each other. While the MCSP problem has been heavily studied, the complementary Maximum Duo-Preservation String Mapping Problem (MPSM) is a relatively new and under-explored problem in this area.

1.1 Problem Description

The Maximum Duo-Preservation String Mapping Problem (MPSM) is defined as follows. We are given two strings \(A = a_1 a_2 \ldots a_n\) and \(B = b_1 b_2 \ldots b_n\) of length n such that B is a permutation of A. Let \(a_i\) and \(b_j\) be the \(i^{th}\) and \(j^{th}\) characters of their respective strings. A proper mapping \(\pi \) from A to B is a one-to-one mapping with \(a_i = b_{\pi (i)}\) for all \(i = 1, \ldots , n\). A duo is simply two consecutive characters from the same string. We say that a duo \((a_i, a_{i+1})\) is preserved if \(a_i\) is mapped some \(b_j\) and \(a_{i+1}\) is mapped to \(b_{j+1}\). The objective is to return a proper mapping from the letters of A to the letters of B which preserves the maximum number of duos. Note that the number of duos preserved in each string is identical and by convention we count the number of duos preserved in a single string rather than the sum over both strings. Let \(OPT_{MPSM}\) denote the number of duos preserved from a single string in an optimal solution to the MPSM problem. Figure 1 shows an example of an optimal mapping which preserves the maximum possible number of duos.

Fig. 1.
figure 1

Illustration of a mapping \(\pi \) from A to B that preserves 3 duos: bc, dd, and dc. A solution to the complementary MCSP problem on the same strings would be partitions \(P_A = a, bc, ddc, b, a\) and \(P_B = bc, a, ddc, a, b\) with \(|P_A| = |P_B| = 5\).

The complementary Minimum Common String Partition problem (MCSP) seeks to find partitions of the strings A and B where a partition \(P_A\) of A is defined as a set of substrings whose concatenation is A. The objective is to find minimum cardinality partitions \(P_A\) of A and \(P_B\) of B such that \(P_B\) is a permutation of \(P_A\). Let \(OPT_{MCSP}\) denote the cardinality of a partition in an optimal solution to this problem. We can see that

$$ OPT_{MCSP} = |P_A| = |P_B| = n - OPT_{MPSM} $$

The variants, k-MPSM and k-MCSP, add the restriction that each letter occurs at most k times in each string. For a given algorithm, let \(ALG_{MPSM}\) be number of duos preserved by the algorithm. The approximation ratio for that algorithm is defined as

$$ \frac{OPT_{MPSM}}{ALG_{MPSM}} $$

1.2 Related Work

The Maximum Duo-Preservation String Mapping Problem (MPSM) was introduced in [7] along with the related Constrained Maximum Induced Subgraph (CMIS) and Constrained Minimum Induced Subgraph (CNIS) problems. They used a linear programming and randomized rounding approach to approximate the k-CMIS problem which they show is a generalization of k-MPSM. This leads to a \(k^2\)-approximation for \(k \ge 3\) and a 2-approximation for \(k = 2\). This was improved by [4] to a 4-approximation independent of k as well as approximation ratios of 3 for \(k = 3\) and 8 / 5 for \(k = 2\). [4] also show that k-MPSM is APX-hard even for \(k = 2\), meaning no polynomial-time approximation scheme (PTAS) exists assuming \(P \ne NP\). The fixed-parameter tractability was studied in [1] and MPSM was shown to be fixed-parameter tractable when parameterized by the number of preserved duos.

The Minimum Common String Partition problem (MCSP) has been extensively studied from many angles including polynomial-time approximation [7, 9, 10, 14, 16, 17], fixed-parameter tractability [5, 6, 11, 15], and heuristics [2, 12, 13]. FPT algorithms have been parameterized by maximum number of times any character occurs, minimum block size, and the size of the optimal minimum partition. Heuristic approaches range from an ant colony optimization algorithm [12] to integer linear programming (ILP) based strategies [2, 13] which in some cases solve the problem optimally for strings up to 2, 000 characters in length.

The problem was shown to be NP-hard (thus implying MPSM is also NP-hard) and APX-hard even for 2-MCSP [14]. The current best approximations are an \(O(\log n \log ^{*} n)\)-approximation due to [10] for general MCSP and an O(k)-approximation for k-MCSP due to [17]. Applications to evolutionary distance and genome rearrangement can be found in [8, 18].

1.3 Our Contributions

We show a 13 / 4-approximation ratio for the general MPSM problem using a new combinatorial triplet matching approach. This improves the previous best approximation ratio of 4 for the general problem due to [4].

Theorem 1

For any two strings A and B such that B is a permutation of A, there is an algorithm which finds a proper mapping from A to B that preserves at least 4 / 13 of the duos that the optimal algorithm preserves.

2 Preliminaries

Let \(A = a_1 a_2 \ldots a_n\) and \(B = b_1 b_2 \ldots b_n\) be the two strings of length n with \(a_i\) and \(b_i\) being the \(i^{th}\) characters of their respective strings. A duo \(D^A_i = (a_i, a_{i+1})\) corresponds to the pair of consecutive characters \(a_i\) and \(a_{i+1}\) in the string. We use \(D^A = (D^A_1, \ldots , D^A_{n-1})\) and \(D^B = (D^B_1, \ldots , D^B_{n-1})\) to denote the sets of duos for A and B, respectively. We similarly define a triplet \(T^A_i = (a_i, a_{i+1}, a_{i+2})\) as a set of three consecutive characters \(a_i\), \(a_{i+1}\), and \(a_{i+2}\) in the string and sets of triplets \(T^A = (T^A_1, \ldots , T^A_{n-2})\) and \(T^B = (T^B_1, \ldots , T^B_{n-2})\) for strings A and B, respectively. Observe that the duos \(D^A_i\) and \(D^A_{i+1}\) correspond to the first two and last two characters, respectively, of the triplet \(T^A_i\). We refer to duos \(D^A_i\) and \(D^A_{i+1}\) as subsets of the triplet \(T^A_i\).

Important note: In the first step of our algorithm, we append a special character ‘&’ to the beginning and end of each string (indices 0 and \(n+1\)). We define this character to be not equal to any other character including itself (meaning& \(\ne \) &). This ensures that each duo can be a subset of exactly two triplets.

A proper mapping \(\pi \) from A to B is a one-to-one mapping from the letters of A to the letters of B with \(a_i = b_{\pi (i)}\) for all \(\forall \, i = 1, \ldots , n\). Recall that a duo \((a_i, a_{i+1})\) is preserved if and only if \(a_i\) is mapped to some \(b_j\) and \(a_{i+1}\) is mapped to \(b_{j+1}\). We call a pair of duos \((D^A_i, D^B_j)\) preservable if and only if \(a_i = b_j\) and \(a_{i+1} = b_{j+1}\).

For consistency, we define the concept of conflicting pairs of duos using the terminology of [4] with a small modification to accommodate our particular analysis. Two preservable pairs of duos \((D^A_i, D^B_j)\) and \((D^A_h, D^B_{\ell })\) are said to be conflicting if no proper mapping can preserve both of them. These conflicts can be of two types Type 1 and Type 2.

  • Type 1: Either \(i = h \wedge j \ne \ell \) or \(i \ne h \wedge j = \ell \).

  • Type 2: Either \(i = h + 1 \wedge j \ne \ell + 1\) or \(i \ne h + 1 \wedge j = \ell + 1\).

Exception: In our analysis, we also consider two pairs of consecutive preservable duos \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) and a third pair of duos \((D^A_h, D^B_{\ell })\) which conflicts with one or both of them, potentially creating conflicts of both Type 1 and Type 2. However, we classify such conflicts simply as Type 1 conflicts.

3 Triplet Matching Approach

In this section, we introduce and analyze the triplet matching algorithm.

3.1 The Triplet Matching Algorithm

We start by finding a weighted matching on triplets that upper bounds the optimal solution, translating that to a fractional matching on duos, and rounding the fractional solution to a mapping S that preserves a number of duos that is at least 4 / 13 the weight of the triplet matching.

Step 1: Construct a weighted bipartite graph \(\mathbf{G}_\mathbf{T}\) on the triplets.

We first append the special character ‘&’ to the beginning and end of each string as discussed in the preliminaries. Recall that& \(\ne \) &. This ensures that each duo can be a subset of exactly two triplets.

We then construct a weighted bipartite graph \(G_T = (T^A \cup T^B, E)\) with each partition being the set of triplets from a string. We add three types of edges to this graph: full edges, first-half edges, and last-half edges. For a given pair of triplets, \((T^A_i, T^B_j)\), from the different strings, we can add at most one type of edge. A full edge is added if \((a_i = b_j) \wedge (a_{i+1} = b_{j+1}) \wedge (a_{i+2} = b_{j+2})\). A first-half edge is added if \((a_i = b_j) \wedge (a_{i+1} = b_{j+1}) \wedge (a_{i+2} \ne b_{j+2})\). Similarly, a last-half edge is added if \((a_i \ne b_j) \wedge (a_{i+1} = b_{j+1}) \wedge (a_{i+2} = b_{j+2})\). The full edges have weight 1 and the half edges have weight 1 / 2.

In other words, if the triplets are a perfect match, the weight of the edge is 1. Otherwise, if only the first two or last two characters match, the weight is 1 / 2. Finally, if the previous conditions are not met, we do not add an edge between these triplets. Figure 2 illustrates this step.

Fig. 2.
figure 2

Step 1 of the algorithm. (1) shows the original strings. (2) shows the bipartite triplet graph \(G_T\).

Step 2: Find a maximum weight matching \(\mathbf{M}_\mathbf{T}\) on the triplets in \(\mathbf{G}_\mathbf{T}\).

We find a maximum weight matching \(M_T\) in the graph \(G_T\). We will prove later that the weight of this matching is a valid upper bound on the optimum solution to the MPSM problem. Figure 3 illustrates this step.

Fig. 3.
figure 3

Steps 2 and 3 of the algorithm. (3) shows the maximum weight matching found in Step 2. (4) shows the construction of the bipartite duo graph \(G_D\) in Step 3. Note that the double edges connecting the duos (bc), (ba), and (aa) will each be collapsed into single edges of weight 1.

Step 3: Transfer the matching to a weighted bipartite graph \(\mathbf{G}_\mathbf{D}\) on the duos.

We now construct a bipartite graph \(G_D = (D^A \cup D^B, E)\) on the duos using the edges of the matching \(M_T\) found on \(G_T\). For every edge \((T^A_i, T^B_j) \in M_T\), we add one or two edges to \(G_D\). Each edge added to \(G_D\) has weight 1 / 2. Since each duo from the original string is contained in two separate triplets, it can happen that we get two copies of the edge \((D^A_i, D^B_j)\). In this case, we simply merge them into a single edge with weight 1. Edges are added according to the following simple rules:

  • If \((T^A_i, T^B_j)\) is a full edge, we add the edges \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) to \(G_D\).

  • If \((T^A_i, T^B_j)\) is a first-half edge, we add the edge \((D^A_i, D^B_j)\) to \(G_D\).

  • If \((T^A_i, T^B_j)\) is a last-half edge, we add the edge \((D^A_{i+1}, D^B_{j+1})\) to \(G_D\).

Recall that \(T^A_i\) and \(D^A_i\) refer to the triplet and duo, respectively, starting at letter \(a_i\) in the string A and the duos \(D^A_i\) and \(D^A_{i+1}\) are both subsets of the triplet \(T^A_i\). If the triplet edge \((T^A_i, T^B_j)\) causes duo edges \((D^A_i, D^B_j)\) or \((D^A_{i+1}, D^B_{j+1})\), we say that the triplets support the duo edges. The extra ‘&’ characters are discarded in this step since by definition, they can’t be part of any pair of matched duos.

Step 4: Use \(\mathbf{G}_\mathbf{D}\) to find a mapping from string \(\mathbf{A}\) to string \(\mathbf{B}\).

In this step, we select a subset of the edges in \(G_D\) to be the duos preserved in our final mapping solution S. This step happens in three phases, each of which may include many iterations. Each iteration of a phase removes edges from \(G_D\) corresponding to one or two pairs of duos preserved as well as any conflicting edges. The first two phases each remove all instances of a particular structure from the graph while the third phase tries to preserve as many duos as possible from the remaining graph.

Phase 1. For each edge \((D^A_i, D^B_j) \in G_D\) with weight 1. We remove \((D^A_i, D^B_j)\) from \(G_D\) and map \(a_i\) and \(a_{i+1}\) to \(b_i\) and \(b_{i+1}\) in S. We also remove any conflicting edges from \(G_D\).

Phase 2. Define a pair of consecutive parallel edges to be edges \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) in \(G_D\) such that the triplet edge \((T^A_i, T^B_j)\) was chosen in \(M_T\). Starting at the beginning of string A, we choose the first pair of consecutive parallel edges \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) in \(G_D\). In other words, we find the smallest i such that \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) are a pair of consecutive parallel edges. We map \(a_i\), \(a_{i+1}\), and \(a_{i+2}\) to \(b_i\), \(b_{i+1}\), and \(b_{i+2}\) in S. We then remove the edges \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) from \(G_D\) as well as any conflicting edges. We continue this process until we reach the end of string A and no pairs of consecutive parallel edges remain in \(G_D\).

Phase 3. Starting at the beginning of string A, we add the duos of the first edge we encounter to S and remove any conflicting edges. We repeat this step until we reach the end of A and no edges remain in \(G_D\).

3.2 Proof of 13 / 4-approximation

We will first show that the weight of the maximum weight triplet matching \(M_T\) found in Step 2 (and by construction the total weight of \(G_D\)) is an upper bound on the maximum number of duos preserved. Then, we will show that the number of preserved duos added to S in each iteration of Step 4 is at least 4 / 13 of the total weight of edges removed from \(G_D\) in that iteration. Finally, we will show that at the end of Phase 3 of Step 4, no edges remain in \(G_D\).

Lemma 1

The weights of the maximum weight triplet matching \(M_T\) and the corresponding duo graph \(G_D\) are an upper bound on the maximum number of duos preserved.

Proof

We show that any proper mapping \(\pi \) from A to B which preserves \(\Delta \) duos implies a matching \(M_T\) of weight at least \(\Delta \) in the corresponding triplet graph \(G_T\).

For each preserved duo \((D^A_i, D^B_j)\) in \(\pi \), we add the triplet edges \((T^A_{i-1}, T^B_{j-1})\) and \((T^A_i, T^B_j)\) to \(M_T\) if they have not been added already. Note that in the construction of \(G_T\), \((D^A_i, D^B_j)\) was responsible for adding 1 / 2 to the weights of both \((T^A_{i-1}, T^B_{j-1})\) and \((T^A_i, T^B_j)\) for a total contribution of 1. Thus, if we can guarantee that both triplet edges are added to the matching for each preserved duo, that ensures \(M_T\) has weight at least \(\Delta \).

Assume for the sake of contradiction that we encounter some preserved duo \((D^A_i, D^B_j)\) and at least one of the triplet edges corresponding to \((D^A_i, D^B_j)\) cannot be added. WLOG assume the triplet edge which cannot be added is \((T^A_i, T^B_j)\) and it is blocked by some other edge \((T^A_i, T^B_{\ell })\), \(j \ne \ell \). The edge \((T^A_i, T^B_{\ell })\) must have been added by either the preserved duo \((D^A_i, D^B_{\ell })\) or \((D^A_{i+1}, D^B_{\ell + 1})\). However, both of those duos are in conflict with \((D^A_i, D^B_j)\) and therefore could not exist in the mapping \(\pi \), leading to a contradiction. It follows that both triplet edges are added to the matching for each preserved duo in \(\pi \).    \(\square \)

Lemma 2

The number of preserved duos added to S in each iteration of Phase 1 of Step 4 is at least 1 / 3 the total weight of edges removed from \(G_D\) in that iteration.

Proof

The worst case structure for this phase is illustrated in Fig. 4.

Fig. 4.
figure 4

Illustration of the worst case for Phase 1 of Step 4 in Lemma 2. The solid black lines correspond to the edge \((D^A_i, D^B_j)\) and its supporting triplets. The dashed lines correspond to conflicting edges of Type 2 that must be removed. The gray lines illustrate other edges that may or may not exist, but are not conflicting.

Suppose some edge \((D^A_i, D^B_j)\) has weight 1 in \(G_D\). Then both triplet edges \((T^A_{i-1}, T^B_{j-1})\) and \((T^A_i, T^B_j)\) containing \((D^A_i, D^B_j)\) must have been chosen in the matching \(M_T\). Therefore, there can be no conflicts of Type 1.

Note that the edge \((D^A_i, D^B_j)\) can have at most four conflicts of Type 2 arising from the neighboring duos \(D^A_{i-1}\), \(D^A_{i+1}\), \(D^B_{j-1}\), and \(D^B_{j+1}\). Each of these potential conflicts is symmetric. So WLOG, we focus on the conflict with \(D^A_{i-1}\) and show that there is at most one edge \((D^A_{i-1}, D^B_\ell )\), \(\ell \ne j-1\), and this edge can only have weight 1 / 2.

By construction, the edge \((D^A_{i-1}, D^B_\ell )\) can only be added by the triplet edges \((T^A_{i-2}, , T^B_{\ell -1})\) or \((T^A_{i-1}, , T^B_{\ell })\). However, the latter triplet edge \((T^A_{i-1}, , T^B_{\ell })\) could not exist in \(M_T\) since we assume \(M_T\) contains the edge \((T^A_{i-1}, T^B_{j-1})\) and \(M_T\) is a matching. Therefore, there is at most one such edge \((D^A_{i-1}, D^B_\ell )\) with weight 1 / 2 which must have come from a triplet edge \((T^A_{i-2}, T^B_{\ell -1})\) chosen in the matching \(M_T\).

Then the sum of weights of edges removed is at most the weight \((D^A_i, D^B_j)\) plus the weight of four Type 2 conflicting edges with weight 1 / 2 each:

$$ 1 + 4(1/2) = 3 $$

It follows that the ratio of the number of preserved duos added to S to weight of edges removed from \(G_D\) is at least 1 / 3.    \(\square \)

Note that after Phase 1 of Step 4, all remaining edges in \(G_D\) have weight 1 / 2 since the edges with weight 1 have been removed.

Lemma 3

The number of preserved duos added to S in each iteration of Phase 2 of Step 4 is at least 4 / 13 of the total weight of edges removed from \(G_D\) in that iteration.

Proof

The worst case structure for this phase is illustrated in Fig. 5.

Fig. 5.
figure 5

Illustration of the worst case for Phase 2 of Step 4 in Lemma 3. The solid lines correspond to the edges \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) and their supporting triplets. The dashed lines correspond to conflicting edges that must be removed. The pair of parallel dotted lines originating from \(T^A_{i-1}\) represent two edges that could not both exist. This is due to the assumption that \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) is the first pair of consecutive parallel edges in the string A.

Suppose we select edges \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) in Phase 2. We can upper bound the number of edges removed by identifying all triplets that could support conflicting duo edges and bounding the number of such edges they could have supported. Recall that a triplet supports a duo edge if it belongs to a triplet edge in \(M_T\) and thus caused the duo edge to be added to \(G_D\).

First, there are four triplets at distance two from i and j that could each support at most one conflicting edge. These are triplets \(T^A_{i-2}\), \(T^A_{i+2}\), \(T^B_{j-2}\), and \(T^B_{j+2}\). Second, there are four triplets at distance one. Three of these, \(T^A_{i+1}\), \(T^B_{j-1}\), and \(T^B_{j+1}\), can support two conflicting edges. However, the fourth triplet, \(T^A_{i-1}\), can support at most one conflicting edge since we chose the smallest i such that \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) are a pair of consecutive parallel edges.

In addition to these conflicting edges, we also remove the two edges \((D^A_i, D^B_j)\) and \((D^A_{i+1}, D^B_{j+1})\) leading to a total weight removed of

$$ 4(1/2) + 7(1/2) + 2(1/2) = 6.5 $$

It follows that the ratio of the number of preserved duos added to S to weight of edges removed from \(G_D\) is at least \(2/6.5 = 4/13\).    \(\square \)

Lemma 4

The number of preserved duos added to S in each iteration of Phase 3 of Step 4 is at least 1 / 3 of the total weight of edges removed from \(G_D\) in that iteration.

Proof

Suppose we select the duo edge \((D^A_i, D^B_j)\) in some iteration of Phase 3. We can upper bound the weight of edges deleted from \(G_D\) by counting the number of triplets which could have supported duo edges that conflict with \((D^A_i, D^B_j)\). Recall that a triplet supports a duo edge if it belongs to a triplet edge in \(M_T\) and thus caused the duo edge to be added to \(G_D\). Because we have removed all pairs of consecutive parallel edges in Phase 2, each triplet can support at most one duo edge remaining in \(G_D\).

There are eight triplets which could potentially support a conflicting duo edge: \(T^A_{i-2}\), \(T^A_{i-1}\), \(T^A_{i}\), \(T^A_{i+1}\), \(T^B_{j-2}\), \(T^B_{j-1}\), \(T^B_{j}\), \(T^B_{j+1}\). Notice that we have been selecting edges starting from the beginning of A and moving towards the end. Therefore any edge supported by the triplet \(T^A_{i-2}\) would have already been selected or removed prior to the current iteration. Further note that two of those triplets must support the currently selected edge. Therefore, we removed the selected duo edge \((D^A_i, D^B_j)\) and at most five other duo edges. Each of these edges has weight at most 1 / 2 since all edges of weight 1 were removed in Phase 1. Then the sum of weights of edges removed is at most

$$ 1/2 + 5(1/2) = 3 $$

It follows that the ratio of the number of preserved duos added to S to weight of edges removed from \(G_D\) is at least 1 / 3.    \(\square \)

Lemma 5

At the conclusion of Step 4, no edges remain in \(G_D\).

Proof

Phase 3 iterates through every remaining edge in \(G_D\), thus removing all of them.    \(\square \)

4 Conclusion and Future Directions

We have shown that a combinatorial triplet matching approach yields an improved approximation to the Maximum Duo Preservation String Mapping problem. Given the fact that triplet matching allows for an improvement over the ratio achieved by the duo matching approach in [4], a natural question is whether a 4-tuple matching could yield even better results. However, a direct extension of the work in this paper to a 4-tuple matching approach is not possible because the 4-tuple matching would not provide an upper bound on the MPSM. The issue with such an approach is that the first and last duos in a 4-tuple have no potential to be conflicting and likely should not be grouped together. On the bright side, we conjecture that the triplet matching approach can be pushed further to achieve a 3-approximation. The clear bottleneck in this paper arises from Phase 2 of Step 4, but we’re hopeful this obstacle can be avoided somehow.

Other interesting future directions would be to follow the lead of the work on the MCSP problem. This could include analyzing the performance of faster algorithms such as greedy algorithms or searching for heuristics that solve smaller instances of the problem near optimally. Further, since MPSM currently appears to be “easier” than MCSP, it could be fruitful to explore more applications for this problem in fields such as bioinformatics and data compression.