Introduction to String Problems

Among string selection and comparison problems, there is a class known as strings consensus problems, where a finite set of strings is given and one is interested in finding their consensus, i.e., a new string that agrees as much as possible with all the given strings. In other words, the objective is to determine a string, called the consensus, that represents in some way all the given strings. The ideas of representation and of being in consensus can be related to several different objectives, listed in the following:

  (i) the consensus is a new string whose total distance from all given strings is minimum (closest string problem);

  (ii) the consensus is a new string close to most of the given strings (close to most string problem);

  (iii) the consensus is a new string whose total distance from all given strings is maximum (farthest string problem);

  (iv) the consensus is a new string far from most of the given strings (far from most string problem).

Computational intractability of the general strings consensus problem was first proved by Frances and Litman [7] in 1997 and by Sim and Park [28] in 1999.

As a consequence of the linear coding of DNA and proteins, many molecular biology problems have been formulated as computational optimization problems involving strings and sequences. Biological applications of computing distance/proximity among strings come mainly in two varieties: some require that a region of similarity be discovered, while others use the reverse complement of the region, as in the design of probes or primers. In the following, some relevant biological applications are outlined.

Creating Diagnostic Probes for Bacterial Infections

Probes are strands of either DNA or RNA that have been modified (i.e., made either radioactive or fluorescent) so that their presence can be easily detected. One possible application of string problems arises in creating diagnostic probes for bacterial infections [14, 21]. In this scenario, given a set of DNA strings from a group of closely related pathogenic bacteria, the task is to find a substring that occurs (as closely as possible) in each of the bacterial strings without occurring in the host’s DNA. Probes are then designed to hybridize to these target strings, so that the detection of their presence indicates that at least one bacterial species is likely to be present in the host.

Primer Design

Primers are short strings of nucleotides designed such that they hybridize to a given DNA string or to all of a given set of DNA strings with the aim of providing a starting point for DNA strand synthesis by polymerase chain reaction (PCR). The hybridization of primers depends on several conditions, including some thermodynamic rules, but it is largely influenced by the number of mismatching positions among the given strings, and this number should be as small as possible [9, 10, 17].

Discovering Potential Drug Targets

Another biological application of string selection and comparison problems is related to discovering potential drug targets. Given a set of strings of orthologous genes from a group of closely related pathogens and a host (such as a human, crop, or livestock), the goal is to find a string fragment that is highly conserved in all or most of the pathogens’ strings but less conserved in the host. Information encoded by this fragment can then be used for novel antibiotic development or to create a drug that harms several pathogens with minimal effect on the host. All these applications reduce to the task of finding a pattern that, with some error, occurs in one set of strings (closest string problem) and/or does not occur in another set (farthest string problem). The far from most string problem can help to identify a string fragment that distinguishes the pathogens from the host, so that a drug can potentially be created that harms several, but not all, pathogens [9, 10, 17].

Motif Search

A motif is a string that occurs, approximately preserved, as a substring in several of the DNA strings of a given set. Approximately preserved means that the motif occurs with changes in at most t positions, for a fixed nonnegative integer t. The importance of a motif lies in its being a candidate for substrings of noncoding parts of the DNA string that have functions related to, e.g., gene expression [9, 10].

For most consensus problems, the Hamming distance is used instead of alternative measures such as the edit distance. The biological reasons justifying this choice are well described and motivated by Lanctot et al. in [14, 15, 16] and can be summarized by the claim that the “edit distance is more suitable to measure the amount of change that has happened, whereas the Hamming distance is more suitable to measure the effect of that change.”

The remainder of this chapter is organized as follows. The next section lists the notation and definitions used throughout the chapter. The following four sections are devoted to the closest string, the close to most string, the farthest string, and the far from most string problem, respectively. All these problems are mathematically formulated and their properties analyzed, and the most popular solution techniques for them are surveyed, along with the computational results reported in the literature. The penultimate section reports computational results which empirically demonstrate the efficiency of state-of-the-art algorithms. Concluding remarks and future directions are discussed in the last section.

Notation

Throughout this chapter, the following notation and definitions will be used:

  • An alphabet Σ = {c 1, c 2, …, c k} is a finite set of elements, called characters.

  • \(s^i = (s_1^i, s_2^i, \ldots , s_m^i)\) denotes a string of m characters (that is, of length m) over alphabet Σ, i.e., \(s^i_j\in \Sigma ,\ j=1,2,\ldots ,m\).

  • Given two strings \(s^i\) and \(s^l\) on Σ such that \(\vert s^i\vert = \vert s^l\vert\), \(d_H(s^i,s^l)\) denotes their Hamming distance and is given by

    $$\displaystyle \begin{aligned} d_H(s^i,s^l)=\displaystyle{\sum_{j=1}^{\vert s^i\vert} \Phi(s^i_j,s^l_j)}, \end{aligned} $$
    (1)

    where \(s_j^i\) and \(s_j^l\) denote the character at position j in string \(s^i\) and in string \(s^l\), respectively, and Φ :  Σ × Σ→{0, 1} is a predicate function such that

    $$\displaystyle \begin{aligned}\Phi(a,b)=\left\{ \begin{array}{ll} 0, & \mathrm{if} \, a=b; \\ 1, & \mathrm{otherwise.} \end{array} \right.\end{aligned}$$
  • For all consensus problems, each string s of length m over Σ is a valid solution.
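
As a concrete illustration of definition (1), the following minimal Python sketch (ours, purely illustrative) computes the Hamming distance between two equal-length strings:

```python
def hamming_distance(s_i, s_l):
    """Hamming distance (1): the number of positions at which two
    equal-length strings differ."""
    assert len(s_i) == len(s_l), "strings must have equal length"
    return sum(a != b for a, b in zip(s_i, s_l))

# Example: "ACGT" and "AGGA" differ at positions 2 and 4.
print(hamming_distance("ACGT", "AGGA"))  # 2
```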

The Closest String Problem (CSP)

Given a finite set of strings Ω on Σ, the problem is to find a center string \(s^* \in \Sigma^m\) whose maximum Hamming distance to the strings in Ω is minimal; in other words, \(s^*\) is a string for which the smallest value d exists such that

$$\displaystyle \begin{aligned} d_H(s^*,s^i)\leq d,\quad\forall\ s^i\in \Omega. \end{aligned}$$

The closest string problem can be formulated as an integer linear program (ILP). To this end, let \(\Sigma_k \subseteq \Sigma\) be the set of characters appearing at position k in any of the strings from Ω.

For each k = 1, 2, …, m and \(c \in \Sigma_k\), let us define the following binary variables:

$$\displaystyle \begin{aligned} x_{ck}=\left\{ \begin{array}{ll} 1, & \mathrm {if\ character}\ c\, \in\, \Sigma_k\, \mathrm {is\ used\ at\ position}\ {k}\ \mathrm {of\ the\ solution}; \\ 0, & \mathrm {otherwise.}\end{array} \right. \end{aligned}$$

Then, denoting by n the number of strings in Ω, the CSP admits the following ILP formulation:

$$\displaystyle \begin{aligned} \min\ d \end{aligned} $$
(2)

subject to

$$\displaystyle \begin{aligned} \sum_{c\in\Sigma_k} x_{ck}=1,\qquad k=1,2,\ldots,m, \end{aligned} $$
(3)

$$\displaystyle \begin{aligned} m-\displaystyle{\sum_{k=1}^{m} x_{s_k^i k}}\leq d,\qquad i=1,2,\ldots,n, \end{aligned} $$
(4)

$$\displaystyle \begin{aligned} d\in\mathbb{Z}_{\geq 0}, \end{aligned} $$
(5)

$$\displaystyle \begin{aligned} x_{ck}\in\{0,1\},\qquad k=1,2,\ldots,m,\ c\in\Sigma_k. \end{aligned} $$
(6)

Equalities (3) guarantee that exactly one character is selected for each position k ∈{1, 2, …, m}. Inequalities (4) impose that if a character in a string \(s^i\) is not in the solution defined by the x-variables, then that character contributes to increasing the Hamming distance from solution x to \(s^i\). Finally, (5) forces d to assume a nonnegative integer value, and (6) defines the decision variables as binary.
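
To make the formulation concrete, here is a hedged sketch of model (2), (3), (4), (5), and (6) using the open-source PuLP modeling library; the instance data and variable names are illustrative, not taken from the chapter.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

omega = ["ACCA", "ACGT", "CCGA"]  # hypothetical input strings
m, n = len(omega[0]), len(omega)
# Sigma_k: characters appearing at position k in any input string
sigma = [sorted({s[k] for s in omega}) for k in range(m)]

prob = LpProblem("closest_string", LpMinimize)
# (5) and (6) are encoded directly in the variable declarations
x = {(c, k): LpVariable(f"x_{c}_{k}", cat="Binary")
     for k in range(m) for c in sigma[k]}
d = LpVariable("d", lowBound=0, cat="Integer")

prob += d  # objective (2)
for k in range(m):  # equalities (3)
    prob += lpSum(x[c, k] for c in sigma[k]) == 1
for s in omega:  # inequalities (4)
    prob += m - lpSum(x[s[k], k] for k in range(m)) <= d

prob.solve()
center = "".join(c for k in range(m) for c in sigma[k]
                 if value(x[c, k]) > 0.5)
print(center, int(value(d)))
```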

This problem was first studied in the area of coding theory [27] and has been independently proved computationally intractable in [7, 15, 16].

In 2004, Meneses et al. [26] used a linear relaxation of the above-described mathematical model to design a branch and bound algorithm. At each iteration, the next node of the branching tree to be explored is the one with the smallest linear relaxation objective function value (the best-bound first strategy). Once the next node has been selected and an optimal fractional solution \(x^{\prime }_{ck}\in [0,1]\), k = 1, …, m, ∀ c ∈ Σk, has been obtained for the linear relaxation of the corresponding subproblem, the algorithm branches on the fractional variable \(x_{ck}\) with the maximum value of \(x^{\prime }_{ck}\). The bounding phase is very important in any branch and bound algorithm: the better the computed bound, the smaller the number of nodes that need to be explored. For the bounding phase, Meneses et al. computed an initial bound by selecting one of the given input strings and modifying it until a locally optimal solution is found, as sketched below. The authors empirically showed that their branch and bound algorithm is able to solve, in reasonable running times, small-size instances with 10–30 strings, each 300–800 characters long.
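
The following sketch illustrates the flavor of such an initial bound: starting from one input string, single-character changes are accepted while they reduce the maximum Hamming distance to Ω. This is our own minimal reading of the idea, not the exact procedure of [26].

```python
def max_distance(s, omega):
    """Maximum Hamming distance from s to the strings in omega."""
    return max(sum(a != b for a, b in zip(s, t)) for t in omega)

def initial_bound(omega, alphabet):
    s = list(omega[0])  # start from one of the input strings
    best = max_distance(s, omega)
    improved = True
    while improved:
        improved = False
        for k in range(len(s)):
            for c in alphabet:
                old, s[k] = s[k], c
                cand = max_distance(s, omega)
                if cand < best:
                    best, improved = cand, True
                else:
                    s[k] = old  # undo the non-improving change
    return "".join(s), best  # best is an upper bound on the optimal d
```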

Further exact methods proposed for the CSP are fixed-parameter algorithms [11, 20, 30] that are applicable only when the maximum Hamming distance among all pairs of strings is small. In fact, since these algorithms are designed to solve the decision version of the problem, it is necessary to apply them multiple times in order to find an optimal solution that minimizes the Hamming distance.

The first approximation algorithm for the CSP was proposed in [16] with a worst-case performance ratio of 2. It is a simple algorithm that constructs an approximate feasible solution in a pure random fashion. Starting from an empty solution, the algorithm selects at random the next element to be added to the solution under construction.

Better-performing approximation algorithms proposed in the literature are based on the linear programming relaxation of the previously described ILP model. The basic idea consists of solving the linear programming relaxation of the ILP model and using the result of the relaxed problem to find an approximate solution to the original problem. Following this line, [16] also proposed a \(\frac {4}{3}(1+\epsilon )\)-approximation algorithm (for any small 𝜖 > 0) that uses the randomized rounding technique to obtain an integer 0–1 solution from the continuous solution of the relaxed problem. Randomized rounding works by setting a Boolean variable x ∈{0, 1} to x = 1 with probability y, where y is the value of the continuous variable corresponding to x in the relaxation of the original integer programming problem. In 1999, Li et al. [17] used the rounding idea to design a polynomial-time approximation scheme (PTAS). A PTAS is a special type of approximation algorithm that, for each 𝜖 > 1, yields a performance guarantee of 𝜖 in polynomial time; thus, it can be viewed as a way of getting solutions with guaranteed performance for any desired threshold greater than one. The PTAS proposed in [17] is also based on randomized rounding, here refined to check results for a large (but polynomially bounded) number of subsets of indices. However, since a large number of iterations involve the solution of a linear relaxation of an ILP model, the algorithm becomes impractical for any instance with long strings. To efficiently deal with real-world scenarios and/or medium- to large-sized problem instances, several heuristic and metaheuristic algorithms have been proposed in the last few years.
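
As an illustration, the sketch below (our own, with hypothetical names) performs one randomized rounding pass: since constraints (3) force the fractional values at each position to sum to one, sampling a single character per position according to those values sets each \(x_{ck}\) to 1 with probability equal to its LP value while preserving feasibility.

```python
import random

def randomized_rounding(lp_values):
    """lp_values[k] maps each character in Sigma_k to its fractional LP
    value; by constraints (3) these values sum to 1 at every position."""
    solution = []
    for frac in lp_values:
        r, acc = random.random(), 0.0
        for c, y in frac.items():
            acc += y
            if r <= acc:
                break
        solution.append(c)  # c is the sampled character for this position
    return "".join(solution)
```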

In 2005, Liu et al. [18] designed a genetic algorithm and a simulated annealing algorithm, both in sequential and parallel versions. Genetic algorithms (GAs) are population-based metaheuristics that have been applied to find optimal or near-optimal solutions to combinatorial optimization problems [8, 12]. They implement the concept of survival of the fittest, making an analogy between a solution and an individual in a population. Each individual has a corresponding chromosome that encodes the solution. A chromosome consists of a string of genes. Each gene can take on a value, called an allele, from some alphabet. A chromosome has an associated fitness level which is correlated to the objective function value of the solution it encodes. Over a number of iterations, called generations, GAs evolve a population of chromosomes, usually by simulating the process of natural selection through mating and mutation. For the CSP, starting from an initial randomly generated population, at each generation t (0 ≤ t ≤ number of generations) of Liu et al.’s GA, a population P(t) of popsize strings of length m is evolved, and the fitness function to be maximized is defined as the difference \(m - d_{\max }\), where \(d_{\max }\) is the largest Hamming distance between an individual of the population P(t) and any string in Ω. The production of offspring in a GA is done through the process of mating or crossover, and Liu et al. used a multipoint crossover (MPX). In more detail, at a generic generation t, two parental individuals x and y in P(t) are randomly chosen with probability proportional to their fitness. Then, iteratively until the offspring are complete, x and y exchange the parts between two randomly picked points, so that each resulting offspring inherits one part from the first parent and the other part from the second parent. Afterward, a mutation of any individual in the current population P(t) is executed with some given probability: two positions of the individual are randomly chosen and their characters exchanged. A sketch of these operators follows.
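
A minimal sketch of the described operators is given below (Python, our own function names; the exact implementation details of Liu et al.’s GA may differ):

```python
import random

def fitness(ind, omega):
    """m - d_max: the larger, the closer ind is to all strings in omega."""
    return len(ind) - max(sum(a != b for a, b in zip(ind, s)) for s in omega)

def mpx_crossover(x, y):
    """Multipoint crossover: parents exchange the segment between two
    randomly picked cut points."""
    i, j = sorted(random.sample(range(len(x) + 1), 2))
    return x[:i] + y[i:j] + x[j:], y[:i] + x[i:j] + y[j:]

def swap_mutation(ind):
    """Exchange the characters at two randomly chosen positions."""
    i, j = random.sample(range(len(ind)), 2)
    s = list(ind)
    s[i], s[j] = s[j], s[i]
    return "".join(s)
```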

In their paper, Liu et al. also proposed a simulated annealing (SA) algorithm for the CSP. Originally proposed in [13], simulated annealing is commonly said to be the “oldest” among the metaheuristics and surely one of the first techniques with an explicit strategy to escape from local minima. Its fundamental idea is to allow moves toward solutions of worse objective function value than the current solution (uphill moves) in order to escape from local minima. The origin of simulated annealing and the choice of the acceptance criterion lie in the physical annealing process, which can be modeled by methods based on Monte Carlo techniques. One of the early Monte Carlo techniques for simulating the evolution of a solid in a heat bath to thermal equilibrium is due to [24], who in 1953 designed a method that iteratively (until a stopping criterion is met) generates a sequence of states of the solid in the following way. At a generic iteration k, given a current state i of the solid (i.e., a current solution x),

  • \(E_i\) is the energy of the solid in state i (the objective function value f(x));

  • a subsequent state j (solution \(\overline x\)) with energy \(E_j\) (objective function value \(f(\overline x)\)) is generated by applying a perturbation mechanism, such as the displacement of a single particle (\(\overline x\) is a solution “close” to x);

  • if \(E_j - E_i < 0\) (i.e., \(\overline x\) is a better quality solution), j (\(\overline x\)) is accepted; otherwise, j (\(\overline x\)) is accepted with probability given by

    $$\displaystyle \begin{aligned} \exp\left(-\frac{E_j-E_i}{k_BT_k}\right)\qquad\left[\exp\left(-\frac{f(\overline x)-f(x)}{k_B\cdot T_k}\right)\right], \end{aligned}$$

    where T k is the heat bath temperature and k B is the Boltzmann constant.

As the number of performed iterations increases, the current temperature \(T_k\) is decreased, resulting in a smaller probability of accepting non-improving solutions. For the CSP, Liu et al.’s SA sets the initial temperature \(T_0\) to \(\frac {m}{2}\). The current temperature \(T_k\) is reduced every 100 iterations according to a geometric cooling schedule, that is, \(T_{k+100} = \gamma \cdot T_k\), where γ = 0.9, and the algorithm stops once the current temperature drops to 0.001 or below. A sketch of this scheme is given below. Despite the interesting ideas proposed in [18], the experimental analysis involves only small instances with up to 40 strings of length 40.
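
The following sketch (ours) assembles these ingredients with the parameter values reported above (\(T_0 = m/2\), γ = 0.9 applied every 100 iterations, stop at T ≤ 0.001); as is common in implementations, the Boltzmann constant is dropped, and the neighborhood move is a placeholder.

```python
import math
import random

def simulated_annealing(x0, f, neighbor, m, gamma=0.9):
    """Minimize f starting from x0 with the reported cooling schedule."""
    x, t, it = x0, m / 2.0, 0           # initial temperature T0 = m/2
    while t > 0.001:                    # stopping criterion
        xn = neighbor(x)                # perturb the current solution
        delta = f(xn) - f(x)
        # accept improvements; accept worse moves with prob. exp(-delta/T)
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = xn
        it += 1
        if it % 100 == 0:
            t *= gamma                  # geometric cooling: T_{k+100} = 0.9 T_k
    return x
```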

For the special case of | Ω| = 3 and | Σ| = 2, [19] designed an exact approach called the distance first algorithm (DFA), whose basic idea is to make the Hamming distances \(d_H(s^*, s^i)\), i = 1, 2, 3, as close to each other as possible. The algorithm decreases the distance between \(s^*\) and the string farthest from the other two strings while increasing the distance between \(s^*\) and the string closest to the other two strings. For the general case, the authors proposed a polynomial-time heuristic resulting from a combination of local search strategies inspired by [26] and an approximation algorithm called the Largest Distance Decreasing Algorithm (LDDA), which is based on ideas similar to those of the DFA.

More recently, Tanaka [29] proposed a novel heuristic (TA) based on a Lagrangian relaxation of the ILP model that decomposes the problem into subproblems, each corresponding to one position of the strings. The proposed algorithm combines a Lagrangian multiplier adjustment procedure to obtain feasibility with a tabu search as a local improvement procedure. In [4], Della Croce and Salassa described three relaxation-based procedures. One procedure (RA) rounds the solution of the continuous relaxation, while the other two (BCPA and ECPA) fix a subset of the integer variables at their values in the continuous solution and let the solver run on the remaining (sub)problem. The authors also observed that all relaxation-based algorithms had been tested on rectangular instances, i.e., with n ≪ m, and that instances with n ≥ m are harder to solve due to the higher number of constraints imposed by the strings, which enlarges the portion of non-integer components in the continuous solution of the problem. In an attempt to overcome this drawback, Della Croce and Garraffa [3] designed a multistart relaxation-based algorithm (called the selective fixing algorithm) that, for a predetermined number of iterations, takes a feasible solution as input and iteratively selects variables to be fixed at their initial values until the number of free variables is small enough that the remaining subproblem can be efficiently solved to optimality by an ILP solver. The new solution found by the solver can then be used as the initial solution for the next iteration. The authors experimentally showed that their algorithm is much more robust than the state-of-the-art competitors and is able to solve a wider set of instances of different types, including those with n ≥ m.

The Close to Most String Problem (CTMSP)

The closest string problem can be seen as a special case of the so-called close to most string problem (CTMSP), which consists of determining a string close to most of the strings in the input set Ω. This can be formalized by saying that, given a threshold t, a string \(s^*\) must be found maximizing the value l such that

$$\displaystyle \begin{aligned} d_H(s^*,s^i)\leq t,\quad \forall\ s^i \in P \subseteq \Omega,\ \mathrm{with}\ \vert P \vert=l. \end{aligned}$$

By introducing, in addition to the \(x_{ck}\)-variables of the CSP model, a binary variable \(y_i\) (i = 1, 2, …, n) for each of the n input strings in Ω, the CTMSP can be stated as the following ILP:

$$\displaystyle \begin{aligned} \max\ \sum_{i=1}^{n} y_i \end{aligned} $$
(7)

subject to

$$\displaystyle \begin{aligned} \sum_{c\in\Sigma_k} x_{ck}=1,\qquad k=1,2,\ldots,m, \end{aligned} $$
(8)

$$\displaystyle \begin{aligned} m-\displaystyle{\sum_{k=1}^{m} x_{s_k^i k}}\leq t+(m-t)(1-y_i),\qquad i=1,2,\ldots,n, \end{aligned} $$
(9)

$$\displaystyle \begin{aligned} x_{ck}\in\{0,1\},\quad y_i\in\{0,1\},\qquad k=1,2,\ldots,m,\ c\in\Sigma_k,\ i=1,2,\ldots,n. \end{aligned} $$

Constraints (8) ensure that, for each position k of a possible solution, exactly one character from Σk is chosen. Constraints (9) ensure that \(y_i\) can be set to 1 only if the number of differences between \(s^i \in \Omega\) and the possible solution (as defined by the setting of the variables \(x_{ck}\)) is less than or equal to t. Remember that, in this context, \(s_k^i\) denotes the character at position k in \(s^i \in \Omega\).
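
A hedged PuLP sketch of this model follows; the big-M form of constraints (9) and all names are our reconstruction, so the constants may differ from the formulation in the literature.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

def ctmsp_model(omega, t):
    m, n = len(omega[0]), len(omega)
    sigma = [sorted({s[k] for s in omega}) for k in range(m)]
    prob = LpProblem("ctmsp", LpMaximize)
    x = {(c, k): LpVariable(f"x_{c}_{k}", cat="Binary")
         for k in range(m) for c in sigma[k]}
    y = [LpVariable(f"y_{i}", cat="Binary") for i in range(n)]
    prob += lpSum(y)  # objective (7): count strings within distance t
    for k in range(m):  # constraints (8)
        prob += lpSum(x[c, k] for c in sigma[k]) == 1
    for i, s in enumerate(omega):  # constraints (9), big-M deactivation
        prob += (m - lpSum(x[s[k], k] for k in range(m))
                 <= t + (m - t) * (1 - y[i]))
    return prob, x, y
```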

Despite its similarity to the CSP, the CTMSP has not been widely studied. In [2] it was proved that this problem has no polynomial-time approximation scheme (PTAS) unless NP has randomized polynomial-time algorithms.

The Farthest String Problem (FSP)

Given a finite set of strings Ω over alphabet Σ, a problem complementary to the CSP is that of finding a string \(s^* \in \Sigma^m\) that is as far as possible from the strings in Ω. This type of problem can be useful in situations such as finding a genetic string that cannot be associated with a given number of species.

Like the CSP, the farthest string problem can be formulated mathematically as an ILP in which both the decision variables and the constraints have an interpretation contrary to the one in the CSP. Using the same variables as in the CSP model, the FSP reads:

$$\displaystyle \begin{aligned} \max\ d \end{aligned} $$

subject to

$$\displaystyle \begin{aligned} \sum_{c\in\Sigma_k} x_{ck}=1,\qquad k=1,2,\ldots,m, \end{aligned} $$

$$\displaystyle \begin{aligned} m-\displaystyle{\sum_{k=1}^{m} x_{s_k^i k}}\geq d,\qquad i=1,2,\ldots,n, \end{aligned} $$

$$\displaystyle \begin{aligned} d\in\mathbb{Z}_{\geq 0},\quad x_{ck}\in\{0,1\},\qquad k=1,2,\ldots,m,\ c\in\Sigma_k. \end{aligned} $$

The computational intractability of the FSP was demonstrated in [16], where it was proved that the problem remains intractable even for the simplest case where the alphabet has only two characters.

Despite its inherent computational intractability, it has been shown in [16] that there is a PTAS for the FSP. The algorithm is based on the randomized rounding of the relaxed solution of the above-reported ILP and uses the randomized rounding technique together with probabilistic inequalities to determine the maximum possible error in the solution computed by the algorithm. Note that the mathematical formulation of the FSP is quite similar to the one used for the CSP, with changes only in the optimization objective and in the inequality sign of the constraint

$$\displaystyle \begin{aligned} m-\displaystyle{\sum_{j=1}^m x_{s_j^i j}}\geq d, \qquad i=1,2,\ldots,n. \end{aligned}$$

Thus, to solve the problem using an ILP formulation, one can use techniques similar to those employed for the CSP. In 2011 [31] and, more recently, in 2015 [32], Zörnig proposed a few integer programming models for some variants of the farthest string problem and the closest string problem. Their number of variables and constraints is substantially smaller than in state-of-the-art integer linear programming models, and the solution of the linear programming relaxation contains only a small proportion of non-integer values, which considerably simplifies both a subsequent rounding process and a branch and bound procedure.

The Far From Most String Problem (FFMSP)

A problem closely related to the farthest string problem is the far from most string problem (FFMSP). It consists of determining a string far from most of the strings in the input set Ω. This can be formalized by saying that, given a threshold t, a string \(s^*\) must be found maximizing the value l such that

$$\displaystyle \begin{aligned} d_H(s^*,s^i)\geq t,\quad \forall\ s^i \in P \subseteq \Omega,\ \mathrm{with}\ \vert P \vert=l. \end{aligned}$$

In [1], the FFMSP has been mathematically formulated as an ILP. In fact, by defining a Boolean variable \(x_{ck}\) for each position k (k = 1, 2, …, m) of a possible solution and each character c ∈ Σk, and a Boolean variable \(y_i\) (i = 1, 2, …, n) for each of the n input strings provided in set Ω, the FFMSP can be stated as the following ILP:

$$\displaystyle \begin{aligned} \max\ \sum_{i=1}^{n} y_i \end{aligned} $$
(15)

subject to

$$\displaystyle \begin{aligned} \sum_{c\in\Sigma_k} x_{ck}=1,\qquad k=1,2,\ldots,m, \end{aligned} $$
(16)

$$\displaystyle \begin{aligned} m-\displaystyle{\sum_{k=1}^{m} x_{s_k^i k}}\geq t\cdot y_i,\qquad i=1,2,\ldots,n, \end{aligned} $$
(17)

$$\displaystyle \begin{aligned} x_{ck}\in\{0,1\},\quad y_i\in\{0,1\},\qquad k=1,2,\ldots,m,\ c\in\Sigma_k,\ i=1,2,\ldots,n. \end{aligned} $$

Constraints (16) ensure that, for each position k of a possible solution, exactly one character from Σk is chosen. Constraints (17) ensure that \(y_i\) can be set to 1 only if the number of differences between \(s^i \in \Omega\) and the possible solution (as defined by the setting of the variables \(x_{ck}\)) is greater than or equal to t. Remember that, in this context, \(s_k^i\) denotes the character at position k in \(s^i \in \Omega\).

Despite the similarity, it can be shown [16] that the FFMSP is much harder to approximate than the FSP, owing to an approximation-preserving reduction to the FFMSP from the independent set problem, a classical and computationally intractable combinatorial optimization problem. In particular, [16] demonstrated that for strings over an alphabet Σ with | Σ|≥ 3, approximating the FFMSP within a polynomial factor is NP-hard.

The first attempt at designing heuristic methods to efficiently solve the FFMSP was made in [22, 23], which proposed a heuristic algorithm consisting of a simple greedy construction followed by an iterative improvement phase. Later, [6] designed a simple GRASP, recently improved in [25]. Mousavi et al. noticed that the search landscape of the FFMSP is characterized by many solutions having the same objective value; consequently, local search is likely to visit many suboptimal local maxima. To efficiently escape from these local maxima, they devised a new hybrid heuristic evaluation function and used it in conjunction with the objective function when evaluating neighbor solutions during the local search phase of the GRASP framework, whose general structure is sketched below.
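
For reference, the generic GRASP skeleton underlying these algorithms looks roughly as follows (our sketch; α, the candidate-list parameter, and all names are illustrative):

```python
def grasp(construct, local_search, evaluate, iterations=100, alpha=0.3):
    """Generic GRASP: greedy-randomized construction + local search.
    construct(alpha) builds a solution from a restricted candidate list;
    local_search may use a secondary evaluation function, as in [25],
    to discriminate among neighbors with equal objective value."""
    best, best_val = None, float("-inf")
    for _ in range(iterations):
        sol = local_search(construct(alpha))
        val = evaluate(sol)
        if val > best_val:
            best, best_val = sol, val
    return best
```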

Ferone et al. [5] designed the following pure and hybrid multistart iterative heuristics:

  • a pure GRASP, inspired by [6];

  • a GRASP that uses forward path-relinking for intensification;

  • a pure VNS;

  • a VNS that uses forward path-relinking for intensification;

  • a GRASP that uses VNS to implement the local search phase; and

  • a GRASP that uses VNS to implement the local search phase and forward path-relinking for intensification.

The algorithms were tested on several random instances, and the results showed that the hybrid GRASP with VNS and forward path-relinking always found much better quality solutions than the other algorithms, though with higher running times compared to the pure GRASP and the hybrid GRASP with forward path-relinking. The best objective function values were found by GRASP and its hybrids when the construction phase was more greedy than random, and the integration of forward path-relinking as an intensification procedure in the pure metaheuristics was beneficial in terms of solution quality. A further investigation, studying the empirical distributions of the random variable time-to-target-solution value, revealed that, given any fixed amount of computing time, GRASP with forward path-relinking has an empirically higher probability than all competitors of finding a target solution.

In [1], besides the first integer linear programming formulation for the FFMSP described above ((15), (16), and (17)), a hybrid ant colony optimization (ACO) approach was proposed. This hybrid approach consists of two phases: a first phase that applies ant colony optimization until convergence of the pheromone values is reached and a possible second phase in which hybridization with a mathematical programming solver takes place. Both the integer linear programming formulation and the hybrid ant colony algorithm have been compared to the best-performing hybrid GRASP with path-relinking, and computational results on a large set of randomly generated test instances indicate that the hybrid ACO is very competitive.

A Simple ILP-Based Heuristic

In the following, we present the results of a quite simple ILP-based heuristic which can be applied to all four problems described before: the closest string problem (CSP), the close to most string problem (CTMSP), the farthest string problem (FSP), and the far from most string problem (FFMSP). The heuristic is based on the ILP models of the four problems and works as follows. Given a fixed computation time limit (\(t_{limit}\)), at most half of this computation time is given to an ILP solver for tackling the mixed integer linear problem (MILP) obtained by relaxing the \(x_{ck}\)-variables involved in all four ILP models. The ILP solver returns the best solution found in the given computation time; note that this solution may, or may not, correspond to the optimal MILP solution. The fractional values of the \(x_{ck}\)-variables after termination of the solver are henceforth denoted by \(x^{\prime }_{ck}\). In the second phase of the heuristic, the corresponding ILP models are solved in the remaining computation time, with the following additional constraints:

$$\displaystyle \begin{aligned} x_{ck} = 1 \;\; \mbox{for } k = 1, \ldots, m, \; c \in \Sigma_k, \; x^{\prime}_{ck} = 1 \end{aligned} $$
(18)

In other words, whenever a variable \(x_{ck}\) in the best-found MILP solution has a value of 1, this value is fixed for the solution of the ILP model. A sketch of this two-phase scheme is given below.
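
The sketch is written against the PuLP/CBC interface for concreteness (the chapter's experiments use CPLEX; `build_model` and the solver-parameter names are our assumptions):

```python
from pulp import PULP_CBC_CMD, value

def ilp_heuristic(build_model, t_limit):
    """build_model(relax_x) must return (prob, x): the ILP (or its MILP
    relaxation in the x variables) and the dict of x variables."""
    # Phase 1: solve the MILP relaxation with half the time budget.
    prob, x = build_model(relax_x=True)
    prob.solve(PULP_CBC_CMD(timeLimit=t_limit / 2, msg=False))
    ones = [key for key, var in x.items() if value(var) >= 1 - 1e-6]
    # Phase 2: re-solve the original ILP with those variables fixed to 1,
    # i.e., the additional constraints (18).
    prob, x = build_model(relax_x=False)
    for key in ones:
        prob += x[key] == 1
    prob.solve(PULP_CBC_CMD(timeLimit=t_limit / 2, msg=False))
    return prob
```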

Experimental Evaluation

The heuristic described above is compared, in the case of all four problems, with the application of the ILP solver to the original ILP models. Moreover, a simple greedy algorithm was applied. This greedy algorithm generates a solution by selecting, for each position of the solution string, the letter which, in the case of the CSP and the CTMSP, has the highest number of appearances at this position in the set of input strings and which, in the case of the FSP and the FFMSP, has the least number of appearances at this position in the set of input strings (a sketch follows below). The greedy algorithm is henceforth denoted by Greedy. As (M)ILP solver, we used IBM ILOG CPLEX V12.1. The experimental results were obtained on a cluster of PCs with Intel(R) Xeon(R) 5160 CPUs with four cores at 3000 MHz and 4 GB of RAM; CPLEX was configured for single-threaded execution. Depending on the problem, we used two different computation time limits: \(t_{limit}\) = 200 s and \(t_{limit}\) = 3600 s per problem instance. The different applications of CPLEX and the ILP-based heuristic, respectively, are named accordingly: Cplex-200, Cplex-3600, Heuristic-200, and Heuristic-3600.
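
A minimal sketch of Greedy follows (the function name and the alphabet parameter are our own):

```python
from collections import Counter

def greedy(omega, alphabet, minimize_distance=True):
    """Per position, pick the most frequent character for the CSP/CTMSP
    (minimize_distance=True) and the least frequent one for the FSP/FFMSP."""
    m = len(omega[0])
    out = []
    for k in range(m):
        counts = Counter(s[k] for s in omega)
        freq = {c: counts.get(c, 0) for c in alphabet}
        pick = max if minimize_distance else min
        out.append(pick(freq, key=freq.get))
    return "".join(out)
```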

All the algorithms described above were applied, in the context of all four problems, to a set of benchmark instances originally introduced in [5] for the FFMSP. This set consists of random instances of different sizes. More specifically, the number of input strings (n) is in {100, 200}, and the length of the input strings (m) is in {300, 600, 800}. In all cases, the alphabet size is four, that is, | Σ| = 4. For each combination of n and m, the set consists of 100 random instances, which makes a total of 600 instances. Note that, in the context of the CTMSP and the FFMSP, a value for parameter t must be specified before running the algorithm(s). However, a sensible choice of t is not trivial. For example, in the context of the CTMSP, the lower the value of t, the easier it should be for CPLEX to solve the problem to optimality. In order to choose meaningful values for t, the following experiments were executed. CPLEX was applied to each problem instance (both for the CTMSP and the FFMSP) for each value of t ∈{0.05 m, 0.1 m, …, 0.95 m}. This was done with a time limit of 3600 s per run. The corresponding optimality gaps (averaged over 100 problem instances) and the number of instances (out of 100) solved to optimality are graphically presented in Fig. 1. The results reveal that, in the case of the CTMSP, the problem becomes difficult for approx. t ≤ 0.72 m, while, in the case of the FFMSP, it becomes difficult for approx. t ≥ 0.78 m. Therefore, the following values for t were chosen for the final experimental evaluation: t ∈{0.65 m, 0.7 m, 0.75 m} in the case of the CTMSP and t ∈{0.75 m, 0.8 m, 0.85 m} in the case of the FFMSP.

Fig. 1

Justification for the choice of parameter t in the context of the CTMSP and the FFMSP. (a) Average optimality gap (in percent) of CPLEX for the CTMSP. (b) Number of instances solved (out of 100) by CPLEX for the CTMSP. (c) Average optimality gap (in percent) of CPLEX for the FFMSP. (d) Number of instances solved (out of 100) by CPLEX for the FFMSP

The numerical results for the CSP are shown in Table 1. They are presented as averages over the 100 instances for each combination of n (the number of input strings) and m (the length of the input strings). For all three algorithms, we provide the values of the best-found solutions (averaged over 100 problem instances) and the computation time at which these solutions were found. In the case of Cplex-200, the average optimality gap is additionally provided. The results clearly show that Heuristic-200 outperforms both Greedy and Cplex-200.

Table 1 Numerical results for the CSP

Similar conclusions can be drawn in the case of the FSP, for which the results are presented in Table 2. Except for one case (n = 200, m = 300), Heuristic-200 outperforms both Greedy and Cplex-200.

Table 2 Numerical results for the FSP

In the case of the CTMSP, both the ILP-based heuristic and CPLEX were applied with computation time limits 200 and 3600 CPU seconds. Therefore, Table 3 contains results for Heuristic-200, Heuristic-3600, Cplex-200, and Cplex-3600. The following observations can be made:

  • Both CPLEX and the ILP-based heuristic greatly outperform Greedy.

  • When the problem is rather easy, that is, for a setting of t = 0.75 m, CPLEX usually has slight advantages over the ILP-based heuristic, especially for the instances with the larger number of input strings (n = 200).

  • With growing problem difficulty, the ILP-based heuristic starts to outperform CPLEX. In particular, for a setting of t = 0.65 m, the differences in the qualities of the obtained solutions between the ILP-based heuristic and CPLEX are significant.

Not surprisingly, the same observations can be made in the context of the FFMSP, for which the results are provided in Table 4. As the considered benchmark instances were originally used for the FFMSP, we are able to compare to current state-of-the-art results for this problem (see [1]). This comparison is graphically presented in Fig. 2 for all instances concerning the interesting cases t = 0.8 m and t = 0.85 m. The results show that, for t = 0.8 m, Heuristic-3600 is generally outperformed by the other state-of-the-art methods. However, note that when the instance size (in terms of the number of input strings and their length) grows, Heuristic-3600 starts to produce better results than the competitors. In the case of t = 0.85 m, which results in more difficult instances than t = 0.8 m, Heuristic-3600 clearly outperforms the current state-of-the-art methods .

Table 3 Numerical results for the CTMSP
Table 4 Numerical results for the FFMSP
Fig. 2

Graphical representation of the comparison between Heuristic-3600 and the current FFMSP state-of-the-art methods (GRASP+PR, ACO, and ACO+CPLEX) for t = 0.8 m (see (a)) and t = 0.85 m (see (b))

Conclusions

The goal of this chapter was to provide an overview of some string selection and comparison problems, with special emphasis on the optimization and operations research perspective. Besides mathematical models that can be used to find exact solutions only up to a certain instance size, there are many approximate techniques, proposed by researchers from several heterogeneous communities, which are more or less efficient depending on the input data and the type of information they make use of. Not surprisingly, there is generally no single best approach that wins in every respect. Therefore, we proposed a simple ILP-based heuristic that can be used for any of the four considered problems. We have shown that this heuristic outperforms both a general greedy algorithm and the application of an ILP solver (CPLEX) to the original ILP models. In the case of the far from most string problem, we were even able to show that this simple heuristic produces state-of-the-art results for instances which are intrinsically difficult to solve.
