1 Introduction

A problem often encountered when applying exact solvers to combinatorial optimization problems is that they do not scale to problem instances of realistic size. For smaller problem instances, however, exact solvers are often surprisingly efficient. This is because the operations research community has invested a lot of time, effort and expertise into the development of exact solvers; prime examples are general-purpose mathematical programming solvers such as CPLEX and Gurobi. Therefore, recent research efforts have focused on ways in which efficient exact solvers can be used within heuristic frameworks even in the context of large problem instances. One of the most recent examples of these research efforts is an algorithm labelled Construct, Merge, Solve & Adapt (CMSA) [1, 2]. This algorithm works as follows. At each iteration, solutions to the tackled problem instance are generated in a probabilistic way. The solution components found in these solutions are then added to a sub-instance of the original problem instance. Subsequently, an exact solver such as, for example, CPLEX is used to solve the sub-instance to optimality. Moreover, the algorithm is equipped with a mechanism for deleting seemingly useless solution components from the sub-instance. This keeps the sub-instance at a moderate size, so that it can be solved to optimality rather quickly.

In this work we apply the CMSA algorithm to the so-called repetition-free longest common subsequence problem [3]. This problem, which is NP-hard, is a special case of the well-known longest common subsequence problem. The repetition-free longest common subsequence problem seems well-suited for being solved with CMSA, because the standard integer linear programming (ILP) model for the problem can only be solved to optimality in the context of rather small problem instances. Both the number of variables and the number of constraints in this model (which is outlined later in Sect. 2) grow quickly with the length of the input strings. The obtained results show that CMSA indeed achieves state-of-the-art results, especially in the context of large problem instances.

The remaining part of the paper is organized as follows. In Sect. 2 we provide a technical description of the repetition-free longest common subsequence problem, together with the standard ILP model for this problem. Next, in Sect. 3, the application of CMSA to the tackled problem is outlined. Sect. 4 provides an extensive experimental evaluation, and, finally, Sect. 5 offers a discussion and an outlook on future work.

2 Repetition-Free Longest Common Subsequence Problem

The longest common subsequence (LCS) problem is a string problem with numerous applications, for example, in computational biology [4–6]. A problem instance \((S,\varSigma )\) consists of a set \(S = \{s_1,s_2,\ldots ,s_n\}\) of n input strings over a finite alphabet \(\varSigma \). The goal is to find a longest possible subsequence of all strings in S. A string t is a subsequence of a string s if t can be produced from s by deleting zero or more characters. For example, dga can be produced from adagtta by deleting the first two occurrences of letter a and the two occurrences of letter t. Apart from applications in computational biology, the LCS problem finds applications, for example, in data compression and file comparison [7, 8]. Moreover, note that the LCS problem was shown to be NP-hard [9] for an arbitrary number n of input strings.
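The subsequence relation can be tested with a single left-to-right scan over s. The following C++ fragment is a minimal sketch of this check; the function name is ours, for illustration only:

```cpp
#include <string>

// Returns true iff t is a subsequence of s, i.e., t can be obtained
// from s by deleting zero or more characters (single left-to-right scan).
bool isSubsequence(const std::string& t, const std::string& s) {
    std::string::size_type pos = 0;
    for (char c : t) {
        // Advance in s until the next occurrence of c (or the end of s).
        while (pos < s.size() && s[pos] != c) ++pos;
        if (pos == s.size()) return false;  // c not found: t is not a subsequence
        ++pos;                              // match s[pos] with c and move on
    }
    return true;
}
// Example from the text: isSubsequence("dga", "adagtta") == true
```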

In this work we consider a restricted version of the LCS problem, the so-called repetition-free longest common subsequence (RFLCS) problem. Given exactly two input strings x and y over a finite alphabet \(\varSigma \), the goal is to find a longest common subsequence with the additional restriction that no letter may appear more than once. This problem was introduced in [3] as a comparison measure for two sequences of biological origin. In the same paper, the authors proposed three heuristics for solving this problem. Other algorithms from the literature for solving the RFLCS problem include a Beam-ACO approach [10] and an evolutionary algorithm [11]. Among these techniques, the Beam-ACO approach can be regarded as the current state-of-the-art method.

The RFLCS problem can be stated in terms of an integer linear program (ILP) in the following way. First, let us denote the length of x by \(l_x\) and the length of y by \(l_y\). Furthermore, the positions in x and y are numbered from 1 to \(l_x\) and from 1 to \(l_y\), respectively. The letter at position i of x is denoted by x[i], and the letter at position j of y is denoted by y[j]. The set Z of binary variables that is required for the ILP model is composed as follows. For each combination of \(i=1,\ldots ,l_x\) and \(j=1,\ldots ,l_y\) such that \(x[i] = y[j]\), set Z contains a binary variable \(z_{i,j}\). Moreover, we say that two variables \(z_{i,j}\) and \(z_{k,l}\) are in conflict if and only if either (\(i < k\) and \(j > l\)) or (\(i > k\) and \(j < l\)). Finally, for each letter \(a \in \varSigma \), set \(Z_a \subset Z\) contains all variables \(z_{i,j}\) such that \(x[i] = y[j] = a\). The RFLCS problem can then be rephrased as the problem of selecting a maximum number of pairwise non-conflicting variables from Z, provided that, among all variables representing a letter \(a \in \varSigma \), at most one variable is chosen. Given these notations, the ILP is stated as follows.

\[
\begin{aligned}
\max \quad & \textstyle \sum _{z_{i,j} \in Z} z_{i,j} & & (1)\\
\text {subject to} \quad & \textstyle \sum _{z_{i,j} \in Z_a} z_{i,j} \le 1 & \forall \, a \in \varSigma & (2)\\
& z_{i,j} + z_{k,l} \le 1 & \forall \, z_{i,j}, z_{k,l} \in Z \text { in conflict} & (3)\\
& z_{i,j} \in \{0,1\} & \forall \, z_{i,j} \in Z & (4)
\end{aligned}
\]

In this model, constraints (2) ensure that each letter from the alphabet is chosen at most once, and constraints (3) ensure that selected variables are not in conflict.
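To make the variable set and the conflict relation concrete, the following C++ sketch enumerates Z for two input strings and tests whether two variables are in conflict; the type and function names are ours, for illustration only:

```cpp
#include <string>
#include <vector>

// A binary variable z_{i,j} of the ILP model: positions i in x and j in y
// (1-based, as in the text) with x[i] == y[j].
struct Var { int i, j; char letter; };

// Enumerate the set Z: one variable per pair of positions with the same letter.
std::vector<Var> buildZ(const std::string& x, const std::string& y) {
    std::vector<Var> Z;
    for (int i = 1; i <= (int)x.size(); ++i)
        for (int j = 1; j <= (int)y.size(); ++j)
            if (x[i - 1] == y[j - 1])
                Z.push_back({i, j, x[i - 1]});
    return Z;
}

// Two variables are in conflict iff their position pairs cross,
// i.e., (i < k and j > l) or (i > k and j < l).
bool inConflict(const Var& a, const Var& b) {
    return (a.i < b.i && a.j > b.j) || (a.i > b.i && a.j < b.j);
}
```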

3 Application of CMSA to the RFLCS Problem

The application of CMSA to the RFLCS problem is pseudo-coded in Algorithm 1. Note that, in the context of this algorithm, both solutions to the problem and sub-instances are subsets of the complete set Z of variables. If a solution S contains a variable \(z_{i,j}\), this means that this variable must be given value one in order to produce the corresponding solution. First, the best-so-far solution \(S_{\mathrm {bsf}}\) is initialized to \(\textsc {null}\), and the restricted problem instance (\(Z_{\mathrm {sub}}\)) to the empty set. The main loop of CMSA is then executed until the CPU time limit is reached. It consists of the following actions. At each iteration, \(n_a\) solutions are probabilistically constructed in function ProbabilisticSolutionConstruction(Z) in line 6 of Algorithm 1. The variables contained in these solutions are added to \(Z_{\mathrm {sub}}\), and the age of each newly added variable \(z_{i,j}\) (\(\text {age}[z_{i,j}]\)) is set to 0. After the construction of the \(n_a\) solutions, an ILP solver is applied to find the best solution \(S^{\prime }_{\mathrm {opt}}\) in the sub-instance \(Z_{\mathrm {sub}}\) (see function ApplyILPSolver(\(Z_{\mathrm {sub}}\)) in line 12 of Algorithm 1). In case \(S^{\prime }_{\mathrm {opt}}\) is better than the current best-so-far solution \(S_{\mathrm {bsf}}\), it is stored as the new best-so-far solution (line 13). Next, sub-instance \(Z_{\mathrm {sub}}\) is adapted based on solution \(S^{\prime }_{\mathrm {opt}}\) and on the age values of the variables. This is done in function Adapt(\(Z_{\mathrm {sub}}\), \(S^{\prime }_{\mathrm {opt}}\), \(\text {age}_{\max }\)) in line 14 as follows. First, the age of each variable in \(Z_{\mathrm {sub}}\) is increased by one; subsequently, the age of each variable in \(S^{\prime }_{\mathrm {opt}} \subseteq Z_{\mathrm {sub}}\) is re-initialized to zero. Finally, those variables from \(Z_{\mathrm {sub}}\) whose age has reached the maximum component age (\(\text {age}_{\max }\)) are deleted from \(Z_{\mathrm {sub}}\). The motivation behind this aging mechanism is that variables which never appear in an optimal solution of \(Z_{\mathrm {sub}}\) should be removed after a while, because they merely slow down the ILP solver. On the other hand, variables which appear in optimal solutions seem to be useful and should therefore remain in \(Z_{\mathrm {sub}}\).

Algorithm 1. CMSA for the RFLCS problem

1: given: an RFLCS problem instance, values for parameters \(n_a\) and \(\text {age}_{\max }\)
2: \(S_{\mathrm {bsf}} := \textsc {null}\); \(Z_{\mathrm {sub}} := \emptyset \)
3: \(\text {age}[z_{i,j}] := 0\) for all \(z_{i,j} \in Z\)
4: while CPU time limit not reached do
5:  for \(k := 1,\ldots ,n_a\) do
6:   \(S :=\) ProbabilisticSolutionConstruction(\(Z\))
7:   for all \(z_{i,j} \in S\) such that \(z_{i,j} \notin Z_{\mathrm {sub}}\) do
8:    \(\text {age}[z_{i,j}] := 0\)
9:    \(Z_{\mathrm {sub}} := Z_{\mathrm {sub}} \cup \{z_{i,j}\}\)
10:   end for
11:  end for
12:  \(S^{\prime }_{\mathrm {opt}} :=\) ApplyILPSolver(\(Z_{\mathrm {sub}}\))
13:  if \(S^{\prime }_{\mathrm {opt}}\) is better than \(S_{\mathrm {bsf}}\) then \(S_{\mathrm {bsf}} := S^{\prime }_{\mathrm {opt}}\)
14:  Adapt(\(Z_{\mathrm {sub}}\), \(S^{\prime }_{\mathrm {opt}}\), \(\text {age}_{\max }\))
15: end while
16: output: \(S_{\mathrm {bsf}}\)
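For illustration, the following C++ fragment gives a compact, minimal sketch of this main loop. The helper functions constructSolution, solveSubInstance and isBetter, as well as the data layout, are our own assumptions for the sketch, not the authors' implementation:

```cpp
#include <chrono>
#include <iterator>
#include <map>
#include <set>

using Solution = std::set<int>;  // indices of selected variables z_{i,j} in Z

// Hypothetical helpers, assumed to exist elsewhere (not from the paper):
Solution constructSolution();                          // one probabilistic solution
Solution solveSubInstance(const std::map<int, int>&);  // ILP solver on Z_sub (limit t_max)
bool isBetter(const Solution&, const Solution&);       // compares solution lengths

Solution cmsa(int n_a, int age_max, double time_limit) {
    Solution best;             // S_bsf (an empty set plays the role of null)
    std::map<int, int> zsub;   // Z_sub together with the age of each variable
    const auto start = std::chrono::steady_clock::now();
    auto elapsed = [&] {
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
    };
    while (elapsed() < time_limit) {
        // Construct n_a solutions and merge their variables into Z_sub;
        // only variables new to Z_sub get age 0 (lines 5-11 of Algorithm 1).
        for (int k = 0; k < n_a; ++k)
            for (int v : constructSolution())
                zsub.emplace(v, 0);
        // Solve the sub-instance and possibly update the best-so-far
        // solution (lines 12-13).
        Solution opt = solveSubInstance(zsub);
        if (isBetter(opt, best)) best = opt;
        // Adapt (line 14): age all variables, reset the ages of variables in
        // opt, and delete variables whose age has reached age_max.
        for (auto it = zsub.begin(); it != zsub.end();) {
            it->second = opt.count(it->first) ? 0 : it->second + 1;
            it = (it->second >= age_max) ? zsub.erase(it) : std::next(it);
        }
    }
    return best;
}
```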

In the following we describe in detail the remaining component of the algorithm: the probabilistic construction of solutions in function ProbabilisticSolutionConstruction(Z). Such a solution construction starts with an empty solution \(S = \emptyset \), and the first step consists in generating the set of variables that serve as options to be added to S. More specifically, the initial set C contains, for each letter \(a \in \varSigma \), the variable \(z_{i,j} \in Z_a\) (if any) such that \(i \le k\) and \(j \le l\) for all \(z_{k,l} \in Z_a\). Moreover, each option \(z_{i,j} \in C\) is given a weight value \(w(z_{i,j}) := \frac{i}{l_x} + \frac{j}{l_y}\), which is a well-known greedy function for longest common subsequence problems. At each construction step, exactly one variable is chosen from C and added to S. For doing so, first, a value r is chosen uniformly at random from [0, 1]. In case \(r \le d_{\mathrm {rate}}\), where \(d_{\mathrm {rate}}\) is a parameter of the algorithm, the variable \(z_{i,j} \in C\) with the smallest weight value is deterministically chosen. Otherwise, a candidate list \(L \subseteq C\) of size \(\min \{l_{\mathrm {size}}, |C|\}\) containing the options with the lowest weight values is generated, and one variable \(z_{i,j} \in L\) is chosen uniformly at random and added to S. Note that \(l_{\mathrm {size}}\) is another parameter of the solution construction process. Finally, the set of options C for the next construction step is generated. This is done such that C only contains variables representing letters that are not yet represented by a variable in S. Moreover, where \(z_{i,j}\) denotes the last variable added to S, C contains for each non-represented letter \(a \in \varSigma \) the variable \(z_{r,s} \in Z_a\) with \(r > i\) and \(s > j\) (if any) that has the lowest weight value \(w(z_{r,s})\), calculated as \(w(z_{r,s}) := \frac{r-i}{l_x-i} + \frac{s-j}{l_y-j}\). The construction of a complete (valid) solution is finished when the set of options is empty.
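The following C++ fragment sketches one construction step under our own assumptions about the data layout; all names are ours, for illustration only:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

struct Var { int i, j; };  // positions of a variable z_{i,j} (1-based)

// One construction step. byLetter maps each letter that is not yet represented
// in S to its variable set Z_a; (i, j) are the positions of the last variable
// added to S (use i = j = 0 for the first step; see the note below).
const Var* chooseNext(const std::map<char, std::vector<Var>>& byLetter,
                      int i, int j, int lx, int ly,
                      double d_rate, std::size_t l_size) {
    std::vector<std::pair<double, const Var*>> C;  // option set with weights
    for (const auto& [letter, vars] : byLetter) {
        const Var* best = nullptr;
        double bestW = 1e9;
        for (const Var& z : vars) {
            if (z.i <= i || z.j <= j) continue;  // option must extend both strings
            double w = double(z.i - i) / (lx - i) + double(z.j - j) / (ly - j);
            if (w < bestW) { bestW = w; best = &z; }
        }
        if (best) C.push_back({bestW, best});    // at most one option per letter
    }
    if (C.empty()) return nullptr;               // no options left: construction done
    std::sort(C.begin(), C.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    double r = double(std::rand()) / RAND_MAX;
    if (r <= d_rate) return C.front().second;    // deterministic: smallest weight
    std::size_t k = std::min(l_size, C.size());  // candidate list L
    return C[std::rand() % k].second;            // uniform choice from L
}
```

Note that calling this routine with \(i = j = 0\) yields the initial option set, since in that case the weight expression reduces to \(\frac{r}{l_x} + \frac{s}{l_y}\), i.e., the two weight definitions in the text coincide.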

4 Experimental Evaluation

We implemented the proposed algorithm in ANSI C++ using GCC 4.7.3, without the use of any external libraries. The ILP models, both those of the original RFLCS instances and those of the sub-instances within CMSA, were solved with IBM ILOG CPLEX v12.1 in single-threaded mode. The experimental evaluation was performed on a cluster of PCs with Intel(R) Xeon(R) 5670 CPUs with 12 cores of 2933 MHz and at least 40 gigabytes of RAM. In the following we first describe the set of benchmark instances that we generated to test the CMSA algorithm. Then, we describe the tuning experiments that were performed in order to determine a proper setting for the parameters. Finally, an exhaustive experimental evaluation is presented.

4.1 Problem Instances

Two sets of problem instances were adopted from [10]. These sets were generated with the same procedure as described in [3]. The first set (henceforth called Set1) contains, for each combination of input sequence length \(n \in \{32, 64, 128, 256, 512\}\) and alphabet size \(|\varSigma | \in \{n/8, n/4, 3n/8, n/2, 5n/8, 3n/4, 7n/8\}\), exactly 10 problem instances. The second set of instances (henceforth called Set2) was generated on the basis of alphabet sizes \(|\varSigma | \in \{4, 8, 16, 32, 64\}\) and the maximal number of repetitions \(rep \in \{3, 4, 5, 6, 7, 8\}\) of each letter in each input string. For each combination of \(|\varSigma |\) and rep this instance set contains 10 randomly generated problem instances. In addition, we generated an extension of Set2 consisting of larger problem instances: for each combination of \(|\varSigma | \in \{128, 256\}\) and \(rep \in \{3, 4, 5, 6, 7, 8\}\), ten problem instances were generated. All results shown in the forthcoming sections are averages over the 10 problem instances of each type.

4.2 Tuning of CMSA

There are several parameters in CMSA for which well-working values must be found: the number of solution constructions per iteration (\(n_a\)), the maximum allowed age of solution components (\(\text {age}_{\max }\)), the determinism rate (\(d_{\mathrm {rate}}\)), the candidate list size (\(l_{\mathrm {size}}\)), and the maximum time in seconds allowed for CPLEX per application to a sub-instance (\(t_{\max }\)). The last parameter is necessary because, even when applied to reduced problem instances, CPLEX might still need too much computation time to solve such sub-instances to optimality. In any case, CPLEX always returns the best feasible solution found within the given computation time.

We decided to make use of the automatic configuration tool irace [12] for the tuning of the five parameters. In fact, irace was applied to tune CMSA separately for each alphabet size, which, according to initial experiments, has more influence on the behavior of the algorithm than the length of the input strings. In the context of Set1 we randomly generated two tuning instances for each combination of string length and alphabet size, whereas for Set2 (and its extension) we randomly generated two tuning instances for each combination of alphabet size and number of repetitions.

The tuning process for each alphabet size was given a budget of 200 runs of CMSA, where each run was given a computation time limit of \(l_x/10\) CPU seconds for instances of Set1 (remember that for instances of Set1 it holds that \(l_x = l_y\)) and \((|\varSigma | \cdot rep)/10\) CPU seconds for instances of Set2 and its extension. Finally, the following value ranges were chosen for the five parameters of CMSA (a sketch of a corresponding irace parameter file is shown after the list):

  • \(n_a \in \{10, 30, 50\}\)

  • \(\text {age}_{\max } \in \{1, 5, 10, \textit{inf}\}\), where inf means that solution components are never removed from \(Z_{\mathrm {sub}}\).

  • \(d_{\mathrm {rate}}\in \{0.0, 0.3, 0.5, 0.7, 0.9\}\), where a value of 0.0 means that the selection of the next variable to be added to the partial solution under construction is always done randomly from the candidate list, while a value of 0.9 means that solution constructions are nearly deterministic.

  • \(l_{\mathrm {size}}\in \{3,5,10\}\)

  • \(t_{\max } \in \{0.5, 1.0, 5.0\}\) (in seconds) for instances of Set1 and Set2, and \(t_{\max } \in \{1.0, 10.0, 100.0\}\) for the instances of the extension of Set2.
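For illustration, these domains could be declared for irace in a parameter file along the following lines; the command-line switch names are our own assumption, not taken from the paper:

```
# name     switch         type  domain
n_a        "--n_a "       c     (10, 30, 50)
age_max    "--age_max "   c     (1, 5, 10, inf)
d_rate     "--d_rate "    c     (0.0, 0.3, 0.5, 0.7, 0.9)
l_size     "--l_size "    c     (3, 5, 10)
t_max      "--t_max "     c     (0.5, 1.0, 5.0)
# for the extension of Set2, the t_max domain would instead be (1.0, 10.0, 100.0)
```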

The tuning runs with irace produced the configurations of CMSA shown in Table 1.

Table 1. Results of tuning CMSA with irace.

4.3 Experimental Results

Three algorithms were included in the comparison: Beam-ACO, which is currently a state-of-the-art method for the RFLCS problem [10]; CPLEX, which refers to the application of CPLEX to the complete problem instances; and CMSA, the algorithm proposed in this work. The results of Beam-ACO for the instances of Set1 and Set2 were taken from [10], where Beam-ACO was applied once to each problem instance with a computation time limit of 5 CPU seconds per run, a beam width of 10, and a determinism rate of 0.9. Note that the low computation time limit of 5 CPU seconds was adopted in [10] because Beam-ACO always produced its best results during the first seconds of a run. For the application to the larger problem instances that were generated as an extension of Set2, Beam-ACO was applied with the same parameter values for beam width and determinism rate, but with the same computation time limit as CMSA. In particular, CMSA was applied to each problem instance with a computation time limit of \(l_x/10\) CPU seconds for instances of Set1 (remember that for instances of Set1 it holds that \(l_x = l_y\)) and \((|\varSigma | \cdot rep)/10\) CPU seconds for instances of Set2 and its extension. The stand-alone application of CPLEX to each problem instance was given more computation time, namely 600 CPU seconds per run, regardless of the instance/alphabet size. Moreover, a memory limit of 2 GB was used for each application of CPLEX.

The numerical results are presented in Table 2 for Set1, Table 3 for Set2, and Table 4 for the extension of Set2. Each table row presents the results averaged over the 10 problem instances of the same type. The results of Beam-ACO and CMSA are provided in two columns each: the first one (with heading result) provides the result of the corresponding algorithm averaged over 10 problem instances, while the second column (with heading time (s)) provides the average computation time necessary for finding the corresponding solutions. The same information is given for CPLEX. However, in this case we also provide the average optimality gaps (in percent), that is, the average gaps between the upper bounds and the values of the best solutions found at the time a run is stopped.
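As a point of reference, for a maximization problem the optimality gap reported by a MIP solver is commonly computed as \(\frac{UB - f(S)}{f(S)} \cdot 100\,\%\), where UB denotes the best upper bound and \(f(S)\) the value of the best feasible solution at the end of the run. We assume this standard convention here; it is consistent with gaps exceeding \(100\,\%\) being reported for some instance types below.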

Table 2. Experimental results concerning the instances of Set1.
Table 3. Experimental results concerning the instances of Set2.
Table 4. Experimental results for larger problem instances.
Fig. 1. Graphical presentation of the sizes of the sub-instances, in percent of the sizes of the original problem instances.

The results allow us to make the following observations:

  • First of all, CPLEX is able to provide optimal solutions for all instances of 29 out of 35 instance types (that is, table rows) of Set1, and for all instances of 27 out of 30 instance types of Set2. This means, on the one hand, that the instances of these two benchmark sets are, in their majority, not very difficult to solve. On the other hand, there seems to be a kind of phase transition between instances that can be solved to optimality quite easily and instances that are difficult to solve. For three of the six instance types of Set1 which CPLEX cannot solve to optimality within the allocated CPU time, the allocated memory is not sufficient, and for two further instance types the average optimality gap is greater than \(100\,\%\).

  • Concerning Set1, both Beam-ACO and CMSA provide (near-)optimal solutions, and both outperform CPLEX once the average optimality gaps start to increase. However, no clear trend concerning the superiority of CMSA over Beam-ACO (or vice versa) is noticeable.

  • Concerning Set2, the performance of Beam-ACO and CMSA is comparable for instances of alphabet sizes \(|\varSigma | \in \{4, 8, 16\}\), both providing (near-)optimal solutions. However, starting from alphabet size \(|\varSigma | = 32\), CMSA outperforms Beam-ACO. This becomes even clearer in the case of the extension of Set2, which consists of larger problem instances. In the context of these instances, CMSA outperforms both CPLEX and Beam-ACO.

Finally, we studied the (average) size of the sub-instances that are generated (and maintained) within CMSA in comparison to the size of the original problem instances. These sub-instance sizes are shown graphically in Fig. 1a for the instances of Set1, and in Fig. 1b for the instances of Set2 and its extension. Note that these graphics show the sub-instance sizes averaged over all instances of the same alphabet size. In both cases, the x-axis ranges from small alphabet sizes (left) to large alphabet sizes (right). Interestingly, when the alphabet size is rather small, the tackled sub-instances in CMSA are rather large (up to \(\approx \)70 % of the size of the original problem instances). With growing alphabet size, the size of the tackled sub-instances decreases. This is most clearly visible for the instances of Set1; however, the same trend becomes apparent starting from alphabet size 32 for the instances of Set2. The reason for this trend is as follows. As CPLEX is very efficient for problem instances based on rather small alphabets, the parameter values of CMSA are chosen during the tuning process of irace such that the sub-instance sizes become quite large. In contrast, with growing alphabet size, the parameter values chosen during tuning lead to smaller sub-instances, simply because CPLEX is no longer efficient when applied to sub-instances that are not much smaller than the original problem instances.

5 Discussion and Future Work

CMSA is a new, general algorithm for combinatorial optimization which is based on a simple, but apparently successful, idea: the generation of sub-instances by merging the solution components found in randomly constructed solutions, and their subsequent solution by means of an exact solver. Moreover, the considered sub-instances change dynamically, because new solution components are added at each iteration and existing solution components are removed on the basis of indicators of their usefulness. In this work, the CMSA algorithm has been applied to the repetition-free longest common subsequence problem. The general picture of the results, in comparison to CPLEX, is similar to the one observed in earlier applications of CMSA to the minimum common string partition problem and a minimum weight arborescence problem in [1]. CMSA is generally competitive with CPLEX for small to medium size problem instances, whereas it outperforms CPLEX with growing instance size. In our opinion, this algorithm is quite appealing, especially for the following reasons:

  • CMSA can be applied to any problem for which a constructive heuristic and an exact solver are known.

  • In comparison to other metaheuristics, CMSA can generally be implemented with few lines of code.

  • When using an ILP solver for solving sub-instances, CMSA makes it possible to profit from the valuable operations research expertise that has gone into the development of the ILP solver, without the need to be an expert in operations research.

Finally, note that the idea behind CMSA is similar, in some sense, to the basic idea of large neighborhood search (LNS) [13]. However, while exact solvers in LNS are used to find the best solution in a large neighborhood of the current solution, which is generally obtained by partially destroying the current solution, exact solvers in the context of CMSA are applied to sub-instances of the original problem instances.

Concerning future work, we first plan to extend the conducted experimental study to even larger problem instances. Second, we intend to study the incorporation of potentially valuable knowledge about, for example, the reduced costs of variables, in order to develop a more sophisticated, and hopefully more effective, mechanism for the removal of variables from the considered sub-instances.