On the Use of Decision Diagrams for Finding Repetition-Free Longest Common Subsequences

Horn, Matthias; Djukanovic, Marko; Blum, Christian; Raidl, Günther R.

doi:10.1007/978-3-030-62867-3_11

Matthias Horn¹²,
Marko Djukanovic¹²,
Christian Blum¹³ &
…
Günther R. Raidl¹²

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12422))

Included in the following conference series:

International Conference on Optimization and Applications

448 Accesses

Abstract

We consider the repetition-free longest common subsequence (RFLCS) problem, where the goal is to find a longest sequence that appears as subsequence in two input strings and in which each character appears at most once. Our approach is to transform a RFLCS instance to an instance of the maximum independent set (MIS) problem which is subsequently solved by a mixed integer linear programming solver. To reduce the size of the underlying conflict graph of the MIS problem, a relaxed decision diagram is utilized. An experimental evaluation on two benchmark instance sets shows the advantages of the reduction of the conflict graphs in terms of shorter total computation times and the number of instances solved to proven optimality. A further advantage of the created relaxed decision diagrams is that heuristic solutions can be effectively derived. For some instances that could not be solved to proven optimality, new state-of-the-art results were obtained in this way.

This project is partially funded by the Doctoral Program “Vienna Graduate School on Computational Optimization”, Austrian Science Foundation (FWF) Project No. W1260-N35. Furthermore, this work was also supported by project CI-SUSTAIN funded by the Spanish Ministry of Science and Innovation (PID2019-104156GB-I00).

Access provided by Autonomous University of Puebla. Download conference paper PDF

Better Heuristic Algorithms for the Repetition Free LCS and Other Variants

A comprehensive comparison of metaheuristics for the repetition-free longest common subsequence problem

Article 06 April 2017

Construct, Merge, Solve and Adapt: Application to the Repetition-Free Longest Common Subsequence Problem

Keywords

1 Introduction

The longest common subsequence (LCS) problem asks for the longest string which is a subsequence of a set of input strings. A subsequence is a string that can be obtained from another string by possibly deleting characters. For instance a longest common subsequence of the two input strings ABCDBA and ACBDBA is ABDBA. The LCS problem has applications in bioinformatics, where strings often represent segments of RNA or DNA [13, 16, 18]. Other fields where the LCS problem appears are text editing [17], data compression, file comparison [2, 19], and the production of circuits in field programmable gate arrays [8]. If the number of input strings m is constant, the problem is solvable by dynamic programming (DP) in $O(n^m)$ time, where n is the length of the longest input string [13]. Otherwise, if the number of input strings is arbitrary, the problem is $\mathcal {NP}$-hard. An additional constraint which arises in the context of gene duplication in the domain of genome rearrangement and which we consider in this work is that each character may appear in a common subsequence (CS) at most once. This problem, first introduced by Adi et al. [1] and denoted as the repetition-free LCS (RFLCS) problem, is usually considered for two input strings and is even then APX-hard [1].

This work builds upon the work of Blum et al. [6], where instances of the RFLCS problem are transformed to instances of the maximum independent set (MIS) problem. Hereby, an independent set of the underlying conflict graph of the MIS problem corresponds to a repetition-free common subsequence (RFCS) of the RFLCS instance. To solve the MIS problem the integer linear programming (ILP) solver CPLEX is applied. The performance of the ILP solver depends to a large extent on the size of the conflict graph. Therefore, in [6] the size of the conflict graph is reduced by filtering redundant nodes based on lower and upper bounds. This boosts the range of instances that can be solved to optimality as well as the quality of heuristic solutions obtained for larger instances. In this way, numerous new state-of-the-art results were obtained.

Contributions. To reduce the size of the conflict graph even further we compile a relaxed multivalued decision diagram (MDD) for the RFLCS problem, yielding a performance improvement of the subsequently applied ILP solver. In the last decade, decision diagrams (DDs) have been recognized as a powerful tool for combinatorial optimization problems; see [3] for a comprehensive survey. In particular, relaxed DDs may provide compact representations of discrete relaxations. Besides allowing for new interference techniques in constraint programming and novel branching schemes, they may also provide tight dual bounds. In case of the RFLCS problem it is further possible to effectively derive heuristic solutions directly from the relaxed MDD. This has the advantage that if the ILP solver is not able to solve an instance to proven optimality within a given time limit then the compiled relaxed MDD may be able to provide a tighter upper bound as the ILP solver does and/or may be able to deliver a better heuristic solution.

After an overview of related work in Sect. 2 we give a formal problem definition in Sect. 3. The MIS problem and a corresponding ILP model are described in Sect. 4. Decision diagrams for the RFLCS are introduced in Sect. 5, and Sect. 6 describes the incremental refinement of relaxed MDDs. Section 7 provides experimental results, showing that the suggested approach yields to a performance improvement in terms of average computation times, number of instances solved to optimality, and average solution quality.

2 Related Work

As already mentioned, the current work builds upon the approach of Blum et al. [6]. Besides the RFLCS, Blum et al. also consider the longest arc-preserving common subsequence (LAPCS) problem [15], where additional dependencies among characters must be respected in a solution, as well as the longest common palindromic subsequence (LCPS) problem [11], where the resulting sequence must also be a palindrome. All these LCS variants where solved by transforming instances to instances of the MIS problem. Moreover, the equivalent maximum clique problem of the complement of the conflict graph is solved heuristically by the LSCC-BMS solver as well as exactly by the LMC solver. Both solvers are currently among the leading solvers for the maximum clique problem.

In the literature LCS related problems with additional constraints are well known for almost 40 years and research in that field is still active due the practical relevance and computational difficulties. Besides RFLCS and the already mentioned LAPCS and LCPS problems other considered variants are, for instance, the constrained longest common subsequence problem [20] or the generalized constrained longest common subsequence problem [10]. For further problem variants we refer to survey papers such as [7].

The RFLCS problem in particular was tackled by several heuristic approaches [1, 4, 9]. The so far best heuristic is a construct, merge, solve and adapt (CMSA) metaheuristic combined with beam search as proposed by Blum and Blesa [5]. The authors showed that this approach can outperform other heuristics as well as the CPLEX solver applied to an ILP model of the RFCLS problem.

3 Problem Definition

The RFLCS problem considers a set of two input strings $S=\{s_1,s_2\}$ over a finite alphabet $\varSigma $. The goal is to find the longest subsequence which is common for both input strings $s_1$ and $s_2$ such that there is no character which occurs more than once. The character at position i is denoted by s[i]. A matching $\textit{\textbf{m}}=(m_1,m_2)$ is a pair of positions s.t. $s_1[m_1] = s_2[m_2]$ and the corresponding character is denoted by $c(\textit{\textbf{m}})=s_1[m_1]$. Hence, the character $c(\textit{\textbf{m}})$ of a matching $\textit{\textbf{m}}$ is a possible candidate to appear in a common sequence (CS). A matching $\textit{\textbf{m}}$ dominates a matching $\textit{\textbf{n}}$, denoted as $\textit{\textbf{m}}\succeq \textit{\textbf{n}}$, if $m_1 \le n_1 \wedge m_2 \le n_2$, meaning that in a possible CS $c(\textit{\textbf{m}})$ may appear before $c(\textit{\textbf{n}})$. Therefore, a CS can be represented by a sequence of matchings $({m_1},{m_2},\ldots )$ s.t. $c({m_1}),c({m}_2),\ldots $ maps to the CS and each matching of the sequence dominates each subsequent matching of the sequence. This observation is important since relaxed MDDs for the RFLCS problem will encode such sequences of matchings. If for two matchings $\textit{\textbf{m}}$ and $\textit{\textbf{n}}$ neither $\textit{\textbf{m}}\succeq \textit{\textbf{n}}$ nor $\textit{\textbf{n}}\succeq \textit{\textbf{m}}$ holds then $c(\textit{\textbf{m}})$ and $c(\textit{\textbf{n}})$ cannot appear together in a CS which will be henceforth referred to as $\textit{\textbf{m}}$ and $\textit{\textbf{n}}$ are in conflict, denoted as $\textit{\textbf{n}}\curlyvee \textit{\textbf{m}}$. Figure 1a shows an example of a RFLCS instance with input strings $s_1=\texttt {ABCDBA}$ and $s_2=\texttt {ACBDBA}$ and an optimal solution of $\texttt {ACDB}$. In this example, matching $\textit{\textbf{m}}_1$ dominates matching $\textit{\textbf{m}}_2$ and $\textit{\textbf{m}}_3$ whereas matching $\textit{\textbf{m}}_2$ is in conflict with matching $\textit{\textbf{m}}_3$.

4 Integer Linear Program and Independent Set Model

An instance of the RFLCS problem can be solved by transforming it into an instance of the MIS problem. Thereby, each matching corresponds to a node of the underlying conflict graph of the MIS problem. An edge is added between two nodes if the corresponding matchings are in conflict or they refer to the same character; see Fig. 1b for an example. A solution of the MIS instance corresponds to a solution of the RFLCS instance and vice versa, since only matchings are selected that are not in conflict with each other and can therefore appear in the same CS and for each character there is at most one matching selected. The resulting CS can be derived from the set of selected matchings by a topological sort considering the domination relationship.

We solve the MIS instance by a corresponding ILP model. Let M be the set of all matchings of the RFLCS instance and thus nodes of the MIS instance. We use a binary decision variable $x_{\textit{\textbf{m}}}$ for each matching $\textit{\textbf{m}}\in M$ indicating whether the matching is selected (=1) for the solution or not (=0). The model is:

$$\begin{aligned} \text {ILP(} M\text { )}=\max \quad&\sum _{x_{\textit{\textbf{m}}}\in M} x_{\textit{\textbf{m}}}&\end{aligned}$$

(1a)

$$\begin{aligned} \text {s.t.} \quad \,\,&x_{\textit{\textbf{m}}} + x_{\textit{\textbf{n}}} \le 1&\textit{\textbf{m}},\textit{\textbf{n}}\in M : \textit{\textbf{m}}\curlyvee \textit{\textbf{n}} \end{aligned}$$

(1b)

$$\begin{aligned}&\sum _{x_{\textit{\textbf{m}}} \in M_a} x_{\textit{\textbf{m}}} \le 1&a \in \varSigma \end{aligned}$$

(1c)

$$\begin{aligned}&x_{\textit{\textbf{m}}} \in \{0,1\}&\textit{\textbf{m}} \in M \end{aligned}$$

(1d)

The number of selected matchings is maximized. Inequalities (1b) ensure that the CS constraints are satisfied, i.e., no conflicting matchings are selected together. The repetition-free (RF) constraint is realized by Inequalities (1c), where set $M_a=\{\textit{\textbf{m}}\in M\mid c(\textit{\textbf{m}})=a\}$ contains all matchings corresponding to the same character $a\in \varSigma $.

5 Decision Diagrams for the RFLCS Problem

We use a relaxed MDD to derive a reduced set of matchings $M'\subseteq M$ to subsequently solve the model ILP($M'$). Our approach compiles relaxed MDDs in an iterative way s.t. if set $M'$ is derived from a relaxed MDD then another relaxed MDD is compiled w.r.t. set $M'$ to possible derive an even smaller set $M''\subseteq M'$. This procedure is repeated until some termination criterion is fulfilled.

A MDD w.r.t. a set of matchings M is a directed acyclic multi-graph $\mathcal {D}_M = (V,A)$ with one root node $\mathbf {r}$. All nodes are partitioned into at most $|\varSigma |+1$ layers $L_1,\ldots ,L_{|\varSigma |+1}$, where $L_i$, $i>0$ contains only nodes that are reachable from $\mathbf {r}$ over exactly $i-1$ arcs and $L_1$ is a singleton containing only $\mathbf {r}$. An arc $\alpha =(u,v)\in A(\mathcal {D}_M)$ is always directed from a source node u in some layer $L_i$ to a target node v in a subsequent layer $L_{i+1}$. Each arc $\alpha $ is associated with a matching $\mathrm {mat}(\alpha )\in M$ that represents the assignment of character $c(\mathrm {mat}(\alpha ))\in \varSigma $ to the i-th position of a CS. For convenience we write $c(\alpha )$ for $c(\mathrm {mat}(\alpha ))$. Any directed path $\varphi =(\alpha _1,\alpha _2,\ldots )$ originating from $\mathbf {r}$ identifies a sequence of characters $(c(\alpha _1),c(\alpha _2),\ldots )$ and thus a (partial) solution. A node without any further outgoing arcs is a sink node. An exact MDD encodes precisely the set of all feasible solutions. Due to the NP-hardness of the RFLCS problem such exact MDDs tend to have exponential size.

Therefore we consider more compact relaxed MDDs which encode supersets of all feasible solutions. In such a relaxed MDD nodes of an exact MDD are superimposed (merged) so that at each layer a maximum allowed number of nodes, called width, is not exceeded. We do this merging in such a way that any path from the root still represents a CS, but RF constraints may be violated. The length of the longest path in such a relaxed MDD then represents an upper bound to the length of a RFLCS.

To compile a MDD, a DP formulation of the considered problem is usually the starting point [14]. Each node $u\in V(\mathcal {D}_M)$ is associated to a state of the DP formulation. For the RFLCS problem, the DP formulation is defined as follows. Consider for a matching $\textit{\textbf{m}}\in M$ the set $\mathrm {D}_{M}(\textit{\textbf{m}})=\{\textit{\textbf{m}}'\in M\setminus \{\textit{\textbf{m}}\}\mid \textit{\textbf{m}}\succeq \textit{\textbf{m}}'\}$ of possible successor matchings of $\textit{\textbf{m}}$ that may appear in the same CS after $\textit{\textbf{m}}$. Note that set $\mathrm {D}_{M}(\textit{\textbf{m}})$ can be efficiently pre-computed for each $\textit{\textbf{m}}\in M$. Then a state $(\textit{\textbf{m}}(u), P(u), S(u))$ associated to node u consists of

a matching $\textit{\textbf{m}}(u)$ whose successor matchings $\mathrm {D}_{M}(\textit{\textbf{m}}(u))$ represent the remaining matchings to consider further,
set $P(u)\subseteq \varSigma $ containing all letters that may still be appended to the CS,
set $S(u)\subseteq \varSigma $ containing all letters that appear on some paths from $\mathbf {r}$ to u.

The root state is $(\textit{\textbf{m}}_r,\varSigma ,\emptyset )$ with the artificial matching $\textit{\textbf{m}}_r=(-1,-1)$, and all characters may be appended to it. Note that $\mathrm {D}_{M}(\textit{\textbf{m}}_r) = M$. An arc $\alpha =(u,v)$ corresponds to a transition from state $(\textit{\textbf{m}}(u), P(u), S(u))$ to state $(\textit{\textbf{m}}(v), P(v), S(v))$ that is achieved by appending character $c(\alpha )$ to the CS w.r.t. to the remaining matchings $\mathrm {D}_{M}(\textit{\textbf{m}}(u))$. Instead of considering all matchings from $\mathrm {D}_{M}(\textit{\textbf{m}}(u))$ as possible outgoing transitions, we consider only matchings that can appear directly after $\textit{\textbf{m}}(u)$ in a longest CS, i.e., matchings from the subset $\mathrm {ND}(u)=\{\textit{\textbf{m}}'\in \mathrm {D}_{M}(\textit{\textbf{m}}(u))\mid \not \exists \textit{\textbf{m}}''\in \mathrm {D}_{M}(\textit{\textbf{m}}(u))\setminus \{\textit{\textbf{m}}'\} : c(\textit{\textbf{m}}'')\not \in S(u) \wedge \textit{\textbf{m}}''\succeq \textit{\textbf{m}}'\}$ which are not dominated by any other matching in $\mathrm {D}_{M}(\textit{\textbf{m}}(u))$. The transition function to obtain the successor state $(\textit{\textbf{m}}(v), P(v), S(v))$ by considering matching $\mathrm {mat}(\alpha )\in \mathrm {D}_{M}(u)$ is defined as

(2)

where $\hat{0}$ represents the infeasible state. Note that no node $\hat{0}$ is created in $\mathcal {D}_M$ and the respective arcs are also skipped.

Moreover, a state $(\textit{\textbf{m}}(u), P(u), S(u))$ may be replaced by a strengthened state $(\textit{\textbf{m}}(u), P'(u), S(u))$, where $P'(u) =\{a\in P(u)\mid \exists \varvec{m'}\in \mathrm {D}_{M}(\textit{\textbf{m}}(u)) : a = c(\textit{\textbf{m}}')\}\subset P(u)$ without excluding any feasible solutions.

So far, we considered exact MDDs. For relaxed MDDs we have to define a state merger which computes the state of merged nodes. To still encode all feasible CSs in the relaxed MDD, only nodes of the same layer and with the same associated matching are merged. Let U be a subset of nodes s.t. all nodes are associated to matching $\textit{\textbf{n}}$, i.e. $\forall u\in U: \textit{\textbf{m}}(u) = \textit{\textbf{n}}$, then an appropriate state merger is $\oplus (U) = \left( \textit{\textbf{n}},\, \bigcup _{u\in U} P(u),\, \bigcup _{u\in U} S(u)\right) $. Since we restrict the state merger to nodes with the same associated matching, the possibilities to reduce the size of the relaxed MDD are also limited. However, since |M| is at most the product $|s_1|\,|s_2|$ of the lengths of the two input strings $s_1$ and $s_2$, the size of each layer is still polynomially bounded by $O(|s_1|\,|s_2|)$.

Let $\mathrm {mat}(\mathcal {D}_M)=\{\textit{\textbf{m}}(u)\mid u\in V(\mathcal {D}_M)\setminus \{\mathbf {r}\}\}\subseteq M$ be the set of matchings derived from $\mathcal {D}_M$. To see that $\mathrm {mat}(\mathcal {D}_M)$ is indeed a feasible set of matchings to solve the model ILP($\mathrm {mat}(\mathcal {D}_M)$) from Sect. 4, remember that each path from $\mathbf {r}$ in $\mathcal {D}_M$ encodes a feasible CS. Hence, each such path can also be described as a sequence of matchings from M. In particular this is true for the matchings of a RFLCS, which must be therefore also contained in $\mathrm {mat}(\mathcal {D}_M)$.

Problem Specific Upper Bounds. To reduce $\mathrm {mat}(\mathcal {D}_{M})$ further we filter arcs and nodes based on sub-optimality. The idea is to compute for each node u an upper bound $Z^\mathrm {ub}(u)$ on the number of characters that can appear in a common subsequence after the character $c(\textit{\textbf{m}}(u))$ of matching $\textit{\textbf{m}}(u)$. Then we can prune each node u in the relaxed MDD where $Z^\mathrm {lp}(u) + Z^\mathrm {ub}(u) < lb$ holds, where lb is a known lower bound on the length of the RFLCS and $Z^\mathrm {lp}(u)$ is the length of the longest path from $\mathbf {r}$ to u. We compute the upper bound for each node u by

$$\begin{aligned} Z^\mathrm {ub}(u) = \min \{|P(u)|, \mathrm {UB}^{\mathrm {lcs}}(\textit{\textbf{m}}(u)), \max _{\alpha =(w,u)}\{Z^\mathrm {ub}(w)-1\}, Z^{\mathrm {lp}\uparrow }(u)\}. \end{aligned}$$

(3)

The first term takes the number of characters into account that can still be appended to the CS after matching $\textit{\textbf{m}}(u)$. The second term $\mathrm {UB}^{\mathrm {lcs}}(\textit{\textbf{m}}(u)) = \mathrm {LCS}(m_1, m_2)$ is based on DP and computes the length of the longest common subsequence from matching $\textit{\textbf{m}}(u)$ onward. Note that this bound can be obtained in constant time by using a data structure known as scoring matrix, which can be computed during preprocessing for two input strings in $O(|s_1|\,|s_2|)$ time [6]. The third term takes the upper bounds from the parent nodes of u into account. Finally, the last term corresponds to the length of the longest path from u to any sink node in the relaxed MDD. Note that this term is only available if the whole relaxed MDD is already compiled.

6 Incremental Refinement

Our approach to compile a relaxed MDD $\mathcal {D}_M$ w.r.t. matching set M for the RFLCS problem is based on the incremental refinement (IR) algorithm from Cire and van Hoeve [12] for sequencing problems. Since $\mathcal {D}_M$ considers the CS constraints exactly and only relaxes the RF constraint, paths in $\mathcal {D}_M$ originating from $\mathbf {r}$ will correspond to CSs where characters may appear more than once. We use the ideas from [12] to ensure at least for some characters that they occur at most once at each path for refining $\mathcal {D}_M$. Cire and van Hoeve [12] showed that the size of a given relaxed MDD will be at most doubled to establish this property for one more character.

The algorithm applies repeatedly two major steps—filtering and refinement—until some termination condition is fulfilled. Let $a^*_1,a^*_2,\ldots ,a^*_{|\varSigma |}$ be a ranking of the characters in $\varSigma $ s.t. $a^*_1$ is the most important character to appear at most once at each path in $\mathcal {D}_M$ to get a strong relaxation. The following refinement step is applied layer by layer starting with $L_1$: For each character $a^*=a^*_1,a^*_2,\ldots ,a^*_{|\varSigma |}$ we identify nodes u s.t. $a^*\in P(u)\cap S(u)$ and split them into two new nodes $u_1$ and $u_2$ where an incoming arc $\alpha =(v,u)$ is redirected to $u_1$ if $a^*\in P(u)\setminus \{c(\alpha )\}$ and to $u_2$ otherwise. All outgoing arcs are replicated for both nodes $u_1$ and $u_2$. We do this as long as the size of the layer is below a maximum width threshold W. For more details and a correctness proof in the context of sequencing problems see [12]. Due to the splitting of nodes the corresponding states may be changed and some of the outgoing arcs from the current layer to the next layer may become infeasible. Those arcs are filtered for each layer after the refinement step finishes. Algorithm 1 shows this at lines 12 and 13. The algorithm terminates if set $\mathrm {mat}(\mathcal {D}_M)$ could not be further reduced by the previously applied refinement/filtering round. The other main parts of the algorithm are:

Initial Relaxed MDD. The IR algorithm starts with an initial relaxed MDD. Usually, this initial relaxed MDD is a naive one of width one, i.e., a relaxed MDD with just a single node at each layer. However, in our case we want to respect the CS constraints and only superimpose states that correspond to the same matching. Therefore we compile the initial $\mathcal {D}_M$ layer-by-layer in a top-down approach. At each layer $L_i$, $i\ge 1$, we expand all nodes using the transition function (2), thus creating for each feasible transition a corresponding node in $L_{i+1}$ and adding the corresponding arc if the node is not sub-optimal according to Eq. 3. Then all nodes in $L_{i+1}$ with the same corresponding matching are merged. Since no feasible CS can be longer than the upper bound $Z^\mathrm {ub}(\mathbf {r})$, the compilation of $\mathcal {D}_M$ stops at the ($Z^\mathrm {ub}(\mathbf {r})+1$)-th layer.

Character Ranking for Refinement. To determine priorities for the characters we use some structural information obtained from the initial MDD. For this purpose let $\mathrm {All}^{\uparrow }(u)$ for each node $u\in V(\mathcal {D}_M)$ be the set of characters that appear on all paths from node u to a sink node. Note that set $\mathrm {All}^{\uparrow }(u)$ can be efficiently computed in a recursive way by a single bottom-up pass. If there exists a node v with an incoming arc $\alpha =(u,v)$ s.t. $c(\alpha )\in \mathrm {All}^{\uparrow }(v)$ holds, then each path originating from $\mathbf {r}$ and leading to any sink node will be infeasible if the path traverses $\alpha $ since character $c(\alpha )$ will appear more than once in a corresponding CS, i.e., the RF constraint will be violated. In [12] such arcs could be safely removed without also removing any feasible solution from the relaxed MDD. In our case this is not possible since solutions have arbitrary length and the path from $\mathbf {r}$ to v could still correspond to a complete feasible solution. However, we can use these violations to determine for which character it is most important to appear on all paths at most once to get a strong relaxation. Hence, we count for each character how often such a violation occurs in $\mathcal {D}_{M}$ and sort the characters according to non-increasing numbers of violations. Ties are resolved by preferring characters that appear in more matchings.

Filtering and Deriving New Primal Solutions. Lines 4–7 and 15–18 perform the following steps. First the function filter-bottom-up performs a single bottom up pass where for each node u the length of the longest path $Z^{\mathrm {lp}\uparrow }(u)$ from u to any sink node is computed and the upper bound $Z^\mathrm {ub}(u)$ is updated accordingly. If $Z^\mathrm {lp}(u) + Z^\mathrm {ub}(u) < lb$ then node u and all incident arcs are removed from $\mathcal {D}_M$.

After filtering we try to derive from $\mathcal {D}_M$ a new best heuristic solution. Since each path in $\mathcal {D}_M$ originating from $\mathbf {r}$ corresponds to a CS, we can derive a RFCS by removing duplicate letters. This is done in two steps. First, a bottom-up pass is performed where primal bounds are computed: For each node $u\in \mathcal {D}_M$ we recursively determine set $B^{\uparrow }(u) = B^{\uparrow }(v)\cup \{c(\alpha ')\}$ where outgoing arc $\alpha '=(u,v)$ maximizes . Ties are resolved by sticking at the first arc that maximizes the expression. If u has no outgoing arcs then $B^{\uparrow }(u)=\emptyset $. Note that $|B^{\uparrow }(\mathbf {r})|$ is a valid primal bound on the RFLCS problem, since only the union is taken to compute $B^{\uparrow }(.)$. To improve this bound further, the second step performs a top-down pass where set $B^{\downarrow }(v)$ is recursively computed for each node v using the information of the precisely computed set $B^{\uparrow }(v)$. Hence, $B^{\downarrow }(v)=B^{\downarrow }(u)\cup \{c(\alpha ')\}$ where incoming arc $\alpha '=(u,v)$ maximizes using $|B^{\downarrow }(v)\cup \{c(\alpha )\}|$ as tie breaking criterion. A sink node $v'$ that maximizes $|B^{\downarrow }(v')|$ then provides the strongest primal bound. A respective RFCS is derived by going from $v'$ backwards to $\mathbf {r}$, skipping any character that already occurred along the path.

If a new best heuristic solution could be obtained in this way then the filter-bottom-up step is repeated and we try again to obtain a new best heuristic solution.

Main Procedure: Algorithm 2 shows the main procedure to solve an instance of the RFLCS problem. As input the algorithm takes the set of input strings S, a possibly known lower bound on the RFLCS length or zero, and the maximum width threshold W for the relaxed MDDs. The original set of matchings M reduced by performing iteratively the following steps. The first step processes the input strings $s_1$ and $s_2$ by removing characters that have no associated matching in M and characters that appear immediately one after the other in the input strings. For example, if character $a\in \varSigma $ appears in an input string at both position i and $i+1$ then a can be removed from $i+1$ without removing any feasible solution. Furthermore, if the pattern abab with $a,b\in \varSigma $ has been discovered in one of the input strings then the last b can be removed from the input string due to the RF constraint. Next, M is reduced by removing matchings $\textit{\textbf{m}}\in M$ where the upper bound used in [6], denoted by $\mathrm {UB}^{\mathrm {BLUM}}(\textit{\textbf{m}})$ is lower than our currently best primal bound. This upper bound is based on the first two terms in Eq. (3), i.e., on the number of characters that can appear in a RFCS that contains $\textit{\textbf{m}}\in M$ and on the length of the LCS that contains $\textit{\textbf{m}}$. Note that the difference to Eq. (3) is that $\mathrm {UB}^{\mathrm {BLUM}}(\textit{\textbf{m}})$ is an upper bound on the length of a complete RFCS containing $\textit{\textbf{m}}$ wheres Eq. (3) describes an upper bound on the remaining part from $\textit{\textbf{m}}$ onward. With this reduced set M we compile a relaxed MDD $\mathcal {D}_M$. If the length of the hereby derived RFCS $s^{\mathrm {rfcs}}$ is equal to the longest path in $\mathcal {D}_M$ then $s^{\mathrm {rfcs}}$ is an optimal solution and the algorithm terminates. Otherwise, if due to the reduced set $\mathrm {mat}(\mathcal {D}_M)$ further characters can be removed from $s_1$ and $s_2$ then we repeat the procedure until no further characters can be removed. Note that since the size of the input strings are reduced at each iteration also $\mathrm {UB}^{\mathrm {BLUM}}(\textit{\textbf{m}})$ changes, which may further reduce set M. Finally the ILP model from Sect. 4 is solved for set M.

7 Experimental Results

To test and compare our approach we used two benchmark sets from [4]. The first set, Set1, consists of 1680 randomly generated instances. For each combination of the input string lengths $n\in \{32,64,128,256,512,1024,2048,4096\}$ and the alphabet sizes $|\varSigma |\in \{\frac{n}{8},\frac{n}{4},\frac{3n}{8},\frac{n}{2},\frac{5n}{8},\frac{3n}{4},\frac{7n}{8}\}$ there are 30 instances. The second set, Set2, consists of 30 randomly generated instances for each combination of the alphabet size $|\varSigma |\in \{4,8,16,32,64,128,256,512\}$ and the maximal repetition of each character, $\mathrm {reps}\in \{3,4,5,6,7,8\}$. This set has a total of 1440 instances.

The algorithms were implemented using GNU C++ 5.4.1. All tests were executed on a single core of an Intel Xeon E5649 with 2.53 GHz and 16 GB RAM. The ILP model from Sect. 4 was solved with CPLEX 12.7 with a CPU-time limit of 3600 s. For Algorithm 2, henceforth denoted as MDD+CPLEX, the maximum width threshold was set to $W=5000$. This value was determined in preliminary experiments s.t. set M could be reduced as much as possible and as many instances as possible can be solved to optimality within the memory limit of 16 GB. MDD+CPLEX is compared to the approach from Blum et al. [6], henceforth denoted as UB+CPLEX, where the ILP($M'$) model is solved with the reduced set of matchings $M'=\{\textit{\textbf{m}}\in M\mid \mathrm {UB}^{\mathrm {BLUM}}(\textit{\textbf{m}}) < lb\}$. Both approaches use the lengths of the currently best known solutions from the literature as initial lower bound lb. Note that the compiled relaxed MDDs from Algorithm 1 are not strictly limited to W since the initial relaxed MDD could already contain layers that contain more nodes than W. However, such layers are not further refined during the compilation.

The average reduction rate of matchings from M is shown in Fig. 2 for 12 instance classes by means of bar plots. The first bar corresponds always to the UB+CPLEX approach whereas the next three bars corresponds to the MDD+CPLEX approach with different values for $W\in \{1,1000,5000\}$. Note that $W=1$ means that only the initial relaxed MDD is compiled and no further refinement will take place. As expected, the obtained reduction rate increases with W. The boxplots in Fig. 3 report the average difference $\mathrm {red}_{\mathrm {MDD}} - \mathrm {red}_{\mathrm {UB}}$ between the average reduction rate $\mathrm {red}_{\mathrm {UB}}$ obtained from UB+CPLEX and $\mathrm {red}_{\mathrm {MDD}}$ obtained from MDD+CPLEX in percentage points aggregated over the ratio between n and $|\varSigma |$ in case of Set1 and over $|\varSigma |$ in case of Set2. On average the MDD+CPLEX approach is able to reduce the original set of matchings by more than 25.79$\%$ and 41.28$\%$ as UB+CPLEX does, for Set1 and Set2, respectively.

Detailed aggregated results are presented in Tables 1 and 2 where the first two columns show the instance characteristics and the third column shows the average length of the so far best known solution from the literature. Columns obj report for each tested approach the average length of the best obtained solutions. In case of MDD+CLPEX these solutions are either those obtained from the ILP model or the ones found during the compilation of the relaxed MDDs. In case of UB+CPLEX a “-” symbol indicates that CPLEX was not able at all to derive a primal solution within the time and memory limits. Average optimality gaps, shown in columns gap, are calculated by $100\% \times (\mathrm {ub} - \mathrm {obj})/\mathrm {ub}$ where ub is for each approach the best obtained upper bound. In case of MDD+CPLEX this is either the upper bound obtained from the ILP model or the length of the longest path from a compiled relaxed MDD. Columns $t_\mathrm {prep}$ list average preprocessing times in CPU seconds including the computation of the reduced set of matchings M (see Algorithm 2, Line 10). Columns $t_\mathrm {tot}$ list average total computation times in CPU seconds until the algorithm terminates including $t_\mathrm {prep}$ plus the time CPLEX needs and columns #opt report the total numbers of instances solved to optimality. In case of MDD+CPLEX the second number corresponds to the number of instances where optimality could already be proven by the compiled relaxed MDD at Line 9 in Algorithm 2. Hence, the number of times where it was not required to solve the ILP model at all. Average reduction rates of the original set of matchings are reported by columns red.

Regarding the number of instances solved to proven optimality, note that already UB+CPLEX was quite successful with a total of 90.26$\%$. More precisely, 1489 out of 1680 instances from Set1 and 1327 out of 1440 instances of Set2 could be solved to proven optimality by UB+CPLEX. Nevertheless, MDD+CPLEX is able to solve significantly more instances to proven optimality: 1541 instances from Set1 and 1381 instances of Set2, and thus a total of 93.65$\%$. Moreover, in 90.90$\%$ of all instances it was not necessary to solve the ILP model at all, since Algorithm 2 terminated early at Line 9. Hence, the obtained upper bound from the compiled relaxed MDD was equal to the length of the currently best found solution in these cases. Concerning the computation times, the UB+CPLEX approach was on average in only two cases faster then the MDD+CPLEX approach regarding benchmark set Set1 and only in one case regarding benchmark set Set2. Finally, the MDD+CPLEX approach is able to obtain in 135 cases better results than the currently best-known-solutions from the literature. For each considered problem class, MDD+CPLEX is able to provide on average better results than UB+CPLEX.

Table 1. Results on Set1 instances.

Full size table

Table 2. Results on Set2 instances.

Full size table

8 Conclusions

In this work we approached the RFLCS problem by transforming an instance into a maximum independent set (MIS) problem instance as this is done by Blum et al. [6]. The MIS problem is subsequently solved by the ILP solver CPLEX. Our major contribution is to heavily reduce the conflict graph of the MIS problem by means of relaxed MDDs. This has multiple advantages: (1) reducing the conflict graph leads to an improved performance of CPLEX s.t. more instances could be solved faster to proven optimality, (2) the compiled relaxed MDDs present a discrete relaxation of the RFLCS problem meaning that upper bounds can be additionally derived and (3) it is also possible to quickly derive heuristic solutions from the MDDs. In many cases it was not necessary anymore to solve the ILP for the MIS problem since the upper bound from the MDD corresponded to the length of the derived heuristic solution and thus optimality was already proven. Overall, for many benchmark instances new state-of-the-art results could be obtained.

In the literature there are works where relaxed decision diagrams are successfully embedded into a branch-and-bound algorithm s.t. branching is done over nodes in the relaxed decision diagram. Since relaxed MDDs provide also strong upper bonds for the RFLCS problem it may be promising future work to develop such a branch-and-bound algorithm for the RFLCS problem to solve even larger instances to optimality. Moreover, it seems promising to apply relaxed decision diagrams also on other LCS-related problems.

References

Adi, S.S., et al.: Repetition-free longest common subsequence. Discret. Appl. Math. 158(12), 1315–1324 (2010)
Article MathSciNet Google Scholar
Aho, A., Hopcroft, J., Ullman, J.: Data Structures and Algorithms. Addison-Wesley, Boston (1983)
MATH Google Scholar
Bergman, D., Cire, A.A., van Hoeve, W.J., Hooker, J.N.: Decision diagrams for optimization. In: OŚullivan, B., Wooldridge, M. (eds.) Artificial Intelligence: Foundations, Theory, and Algorithms. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42849-9
Blum, C., Blesa, M.J.: Construct, merge, solve and adapt: application to the repetition-free longest common subsequence problem. In: Chicano, F., Hu, B., García-Sánchez, P. (eds.) EvoCOP 2016. LNCS, vol. 9595, pp. 46–57. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30698-8_4
Chapter Google Scholar
Blum, C., Blesa, M.J.: A comprehensive comparison of metaheuristics for the repetition-free longest common subsequence problem. J. Heuristics 24(3), 551–579 (2017). https://doi.org/10.1007/s10732-017-9329-x
Article Google Scholar
Blum, C., et al.: Solving longest common subsequence problems via a transformation to the maximum clique problem. Technical report. AC-TR-20-003 (2020). Submitted to Computers & Operations Research
Google Scholar
Bonizzoni, P., Vedova, G.D., Dondi, R., Pirola, Y.: Variants of constrained longest common subsequence. Inf. Process. Lett. 110(20), 877–881 (2010)
Article MathSciNet Google Scholar
Brisk, P., Kaplan, A., Sarrafzadeh, M.: Area-efficient instruction set synthesis for reconfigurable system-on-chip designs. In: Proceedings of DAC 2004 - The 41st Annual Design Automation Conference. pp. 395–400. IEEE Press (2004)
Google Scholar
Castelli, M., Beretta, S., Vanneschi, L.: A hybrid genetic algorithm for the repetition free longest common subsequence problem. Oper. Res. Lett. 41(6), 644–649 (2013)
Article Google Scholar
Chen, Y.C., Chao, K.M.: On the generalized constrained longest common subsequence problems. J. Comb. Optim. 21(3), 383–392 (2011)
Article MathSciNet Google Scholar
Chowdhury, S.R., Hasan, M.M., Iqbal, S., Rahman, M.S.: Computing a longest common palindromic subsequence. Fundamenta Informaticae 129(4), 329–340 (2014)
Article MathSciNet Google Scholar
Cire, A.A., Hoeve, W.V.: Multivalued decision diagrams for sequencing problems. Oper. Res. 61(6), 1411–1428 (2013)
Article MathSciNet Google Scholar
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Book Google Scholar
Hooker, J.N.: Decision diagrams and dynamic programming. In: Gomes, C., Sellmann, M. (eds.) CPAIOR 2013. LNCS, vol. 7874, pp. 94–110. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38171-3_7
Chapter Google Scholar
Jiang, T., Lin, G.-H., Ma, B., Zhang, K.: The longest common subsequence problem for arc-annotated sequences. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 154–165. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45123-4_15
Chapter Google Scholar
Jiang, T., Lin, G., Ma, B., Zhang, K.: A general edit distance between RNA structures. J. Comput. Biol. 9(2), 371–388 (2002)
Article Google Scholar
Kruskal, J.B.: An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM Rev. 25(2), 201–237 (1983)
Article MathSciNet Google Scholar
Smith, T., Waterman, M.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
Article Google Scholar
Storer, J.A.: Data Compression: Methods and Theory. Computer Science Press, Inc., Rockville (1987)
Google Scholar
Tsai, Y.T.: The constrained longest common subsequence problem. Inf. Process. Lett. 88(4), 173–176 (2003)
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Institute of Logic and Computation, TU Wien, Vienna, Austria
Matthias Horn, Marko Djukanovic & Günther R. Raidl
Artificial Intelligence Research Institute (IIIA-CSIC), Campus UAB, Bellaterra, Spain
Christian Blum

Authors

Matthias Horn
View author publications
You can also search for this author in PubMed Google Scholar
Marko Djukanovic
View author publications
You can also search for this author in PubMed Google Scholar
Christian Blum
View author publications
You can also search for this author in PubMed Google Scholar
Günther R. Raidl
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Matthias Horn .

Editor information

Editors and Affiliations

Dorodnicyn Computing Centre, FRC CSC RAS, Moscow, Russia
Nicholas Olenev
Dorodnicyn Computing Centre, FRC CSC RAS, Moscow, Russia
Yuri Evtushenko
Krasovsky Institute of Mathematics and Mechanics, Ekaterinburg, Russia
Michael Khachay
Dorodnicyn Computing Centre, FRC CSC RAS, Moscow, Russia
Vlasta Malkova

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Horn, M., Djukanovic, M., Blum, C., Raidl, G.R. (2020). On the Use of Decision Diagrams for Finding Repetition-Free Longest Common Subsequences. In: Olenev, N., Evtushenko, Y., Khachay, M., Malkova, V. (eds) Optimization and Applications. OPTIMA 2020. Lecture Notes in Computer Science(), vol 12422. Springer, Cham. https://doi.org/10.1007/978-3-030-62867-3_11

Download citation

DOI: https://doi.org/10.1007/978-3-030-62867-3_11
Published: 04 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62866-6
Online ISBN: 978-3-030-62867-3
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)

Publish with us

Policies and ethics

On the Use of Decision Diagrams for Finding Repetition-Free Longest Common Subsequences

Abstract

Similar content being viewed by others

Better Heuristic Algorithms for the Repetition Free LCS and Other Variants

A comprehensive comparison of metaheuristics for the repetition-free longest common subsequence problem

Construct, Merge, Solve and Adapt: Application to the Repetition-Free Longest Common Subsequence Problem

Keywords

1 Introduction

2 Related Work

3 Problem Definition

4 Integer Linear Program and Independent Set Model

5 Decision Diagrams for the RFLCS Problem

6 Incremental Refinement

7 Experimental Results

8 Conclusions

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

On the Use of Decision Diagrams for Finding Repetition-Free Longest Common Subsequences

Abstract

Similar content being viewed by others

Better Heuristic Algorithms for the Repetition Free LCS and Other Variants

A comprehensive comparison of metaheuristics for the repetition-free longest common subsequence problem

Construct, Merge, Solve and Adapt: Application to the Repetition-Free Longest Common Subsequence Problem

Keywords

1 Introduction

2 Related Work

3 Problem Definition

4 Integer Linear Program and Independent Set Model

5 Decision Diagrams for the RFLCS Problem

6 Incremental Refinement

7 Experimental Results

8 Conclusions

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation