1 Introduction

Construct, Merge, Solve & Adapt (CMSA) is a recently introduced hybrid metaheuristic (Blum et al., 2016; Blum, 2024). At each iteration, the algorithm works on a sub-instance \(C'\) of the considered optimization problem, which is empty when the algorithm starts. The first step of each iteration (the construct step) generates several feasible solutions to the original problem instance in a probabilistic way. Subsequently, the solution components found in these solutions are added to \(C'\) (the merge step) and an exact solver is applied to obtain a solution to sub-instance \(C'\) (the solve step). The last step (the adapt step) removes from \(C'\) those solution components that have become too old according to an aging mechanism. In general, the CMSA metaheuristic is applicable to any problem for which (1) valid solutions can be probabilistically generated and (2) an exact solver can be devised.

Since its introduction in 2016, CMSA has been successfully applied to a range of different combinatorial optimization problems. Some of the most recent applications include the ones to the maximum disjoint dominating sets problem (Rosati et al., 2024), the electric vehicle routing problem with time windows, simultaneous pickup and deliveries, and partial vehicle charging (Akbay et al., 2022), a bus driver scheduling problem with complex break constraints (Rosati et al., 2022), and test data generation in software product lines (Ferrer et al., 2021). Moreover, existing extensions of CMSA include, for example, Adapt-CMSA (Akbay et al., 2022), which is a variant that reduces the parameter sensitivity of the original CMSA observed in some applications.

The goal of the research presented in this paper is to leverage ideas from reinforcement learning (RL) to improve CMSA. We aim to improve two main aspects. The first concerns performance, and the second concerns obtaining a simpler algorithm by eliminating a problem-dependent component. The rest of the paper is organized as follows. The next two subsections explain the contribution of this paper and discuss related work from the literature, respectively. Next, the standard CMSA is described. Following that, the general structure of the new RL-CMSA variant is explained together with some particular implementations of the newly proposed learning mechanism. The fourth section provides the experimental study in which the proposed RL-CMSA implementations are compared to the standard CMSA in the context of the FFMS and MDS problems. Finally, the last section gives some conclusions and ideas for future work.

1.1 Contribution of this paper

RL (Sutton & Barto, 2018) is an area of Machine Learning (ML) concerned with the actions of an intelligent agent in a certain environment with the goal of maximizing its cumulative reward. At each step, the agent is presented with a set of available actions and must discover which ones result in the highest reward. Generally, actions may not only produce a reward but can also influence the available actions and rewards for the following steps. In this work, we introduce a new variant of CMSA, named RL-CMSA. This new algorithm variant makes use of RL to improve the construct step of CMSA. As mentioned above, during the construct step of CMSA a certain number of solutions to the problem at hand are constructed. This is usually done by employing a problem-specific solution construction mechanism together with a greedy function. Hereby, the solution construction mechanism determines for each construction step the set of available options, while a greedy function gives a static or dynamic greedy value to each option. Exactly one of the available options is probabilistically chosen at each step—based on the greedy function values—until a complete solution is obtained. This procedure resembles the general RL setting and motivates employing an agent for constructing solutions, which is exactly what is done in our novel RL-CMSA approach. In particular, solution constructions are performed by selecting solution components depending on associated quality measures, which are updated at each iteration through an RL strategy. For every selection of a solution component, the agent receives a reward depending on whether the selected solution component proves to be useful. Solution components’ usefulness is hereby measured by their possible inclusion in solutions to sub-instances at each iteration of RL-CMSA.

The proposed RL-CMSA approach is applied to two NP-hard combinatorial optimization problems: (1) the FFMS problem and (2) the MDS problem. In both cases, the experimental results show the superiority of the new RL-CMSA variant over standard CMSA. In addition to performing strongly in comparison to standard CMSA, RL-CMSA is more general and sometimes even easier to implement. This is because, while still requiring a solution construction mechanism, it does not need a greedy function for evaluating the set of options for extending the current partial solution at each construction step.

1.2 Related work

In recent years, the field of ML has gained considerable popularity as it has brought significant advancements in different domains. This has been enabled in part by the increase in computational power and the availability of large datasets. As a consequence, more attention has been given to ML in the field of combinatorial optimization.

On the one hand, various ML end-to-end combinatorial optimization solvers have been proposed, see e.g., (Bello et al., 2016; Kool et al., 2018; Kwon et al., 2020). While this is an interesting and promising research direction, these solvers have not yet been able to compete with traditional state-of-the-art techniques and are currently limited to small problem instances. On the other hand, an alternative line of research consists of leveraging ML to enhance classic combinatorial optimization techniques. Our work belongs to this research area and more particularly consists of improving a general metaheuristic using the RL paradigm (Sutton & Barto, 2018). One early application of RL within metaheuristics is found in (Gambardella & Dorigo, 1995), where the authors make use of an RL technique known as Q-learning within the ant system metaheuristic and apply the resulting algorithm to the Travelling Salesman Problem (TSP). Some more recent work in the direction of merging RL and metaheuristics consists of a method for learning the heuristic function of beam search found in (Huber & Raidl, 2021), which is applied to the Longest Common Subsequence (LCS) and Constrained Longest Common Subsequence (CLCS) problems. Furthermore, a Variable Neighbourhood Search (VNS) based on Q-learning was devised for a machine scheduling problem in (Alicastro et al., 2021). A last algorithm proposal, applied to the TSP, that we mention here concerns the use of RL for adapting the parameters of a Biased Random Key Genetic Algorithm (BRKGA) throughout its evolutionary process, found in (Chaves & Lorena, 2021).

Two examples of recent work with a stronger relation to our contribution can be found in (Almeida et al., 2020; Kalatzantonakis et al., 2023). Both works use the setting of a classic RL problem known as the multi-armed bandit problem for selecting operators in the context of metaheuristics. In the first case, the selection concerns mutation and crossover operators in the context of a multi-objective evolutionary algorithm applied to the multi-objective permutation flow shop problem. In the second case, RL is used for learning the selection of local search operators in VNS applied to the Capacitated Vehicle Routing Problem (CVRP). Our work also makes use of existing work on the multi-armed bandit problem in the context of CMSA, for selecting solution components during solution construction, instead of operators.

2 Standard CMSA

To apply CMSA to a combinatorial optimization problem, one first needs to define a set C of solution components. In this way, each valid solution to the considered problem can be represented as a subset of C. For the following description of CMSA, we assume a generic set \(C =\{c_1, c_2,\dots , c_n\}\). Moreover, note that for every valid solution S to the problem at hand, it holds that \(S\subseteq C\).

Algorithm 1 illustrates the structure of standard CMSA. First, sub-instance \(C'\) is initialized as empty and the best-so-far solution \(S_{\text {bsf}}\) is initialized as null. Afterward, the main loop of the algorithm starts, in which the construct, merge, solve, and adapt steps are sequentially performed until a given time limit is reached. These four steps can be described as follows:

  1. In the construct step, \(n_a\) solutions to the problem at hand are probabilistically constructed.

  2. The merge step consists of extending \(C'\) with those solution components \(c_i\) that appear in at least one of the \(n_a\) constructed solutions and for which it holds that \(c_i \notin C'\). Moreover, their ages are set to 0.

  3. The solve step uses an exact solver with a time limit given by parameter \(t_{\text {ILP}}\) to solve problem instance \(C'\), obtaining a solution \(S_{\text {opt}}\) to the problem at hand.

  4. Finally, the adapt step consists of increasing (by one) the age of solution components in \(C' \setminus S_{\text {opt}}\), resetting the age of solution components in \(S_{\text {opt}}\) to 0 and erasing those solution components of \(C'\) that have an age of at least \(age_{\text {max}}\), which is a parameter of the algorithm.

The construct and solve steps are the problem-dependent parts of the algorithm. Generally, a solution construction mechanism together with a greedy function tailored to the problem at hand is used in the first, and an exact method for the tackled problem is used in the second.

Algorithm 1. High-level pseudo-code of standard CMSA
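The four steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the callables `construct`, `solve_subinstance`, and `objective` are hypothetical placeholders for the problem-dependent parts, the objective is assumed to be maximized, and an iteration budget replaces the time limit so that the example stays deterministic.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, Optional

Component = Hashable
Solution = FrozenSet[Component]


def cmsa(
    construct: Callable[[random.Random], Solution],
    solve_subinstance: Callable[[FrozenSet[Component]], Solution],
    objective: Callable[[Solution], float],
    n_a: int,
    age_max: int,
    iterations: int,
    rng: random.Random,
) -> Optional[Solution]:
    """Sketch of the CMSA loop (maximization assumed)."""
    age: Dict[Component, int] = {}   # the keys form the sub-instance C'
    best: Optional[Solution] = None  # best-so-far solution S_bsf
    for _ in range(iterations):
        # Construct and merge: components not yet in C' enter with age 0.
        for _ in range(n_a):
            for c in construct(rng):
                age.setdefault(c, 0)
        # Solve: the (hypothetical) exact solver only sees components in C'.
        s_opt = solve_subinstance(frozenset(age))
        if best is None or objective(s_opt) > objective(best):
            best = s_opt
        # Adapt: reset the age of used components, increase the others,
        # and erase components whose age reached age_max.
        for c in list(age):
            age[c] = 0 if c in s_opt else age[c] + 1
            if age[c] >= age_max:
                del age[c]
    return best
```

In this sketch the sub-instance is stored as a dictionary from components to ages, so that merging and aging operate on the same structure.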

3 RL-CMSA

In contrast to standard CMSA, our new RL-CMSA approach keeps a quality measure \(q_i\) for every solution component \(c_i \in C\), henceforth called the q-values. The set (or vector) of all q-values will be denoted by \(\textbf{q}\). The probabilistic construction of solutions in RL-CMSA is performed depending on these q-values: solution components with higher values will have a higher chance to be selected. Moreover, in every iteration of CMSA, after the application of the exact solver, the q-values are updated. In particular, the values corresponding to those solution components in \(C'\) that form part of the solution \(S_{\text {opt}}\) found by the exact solver at the current iteration are increased. Conversely, the values of the solution components in \(C'\) that do not appear in \(S_{\text {opt}}\) are decreased.

Algorithm 2. High-level pseudo-code of RL-CMSA

Algorithm 2 illustrates the general structure of RL-CMSA. First, the stored sub-instance \(C'\) is initialized as empty, the best-so-far solution \(S_{\text {bsf}}\) is set to null, and all q-values are initialized to zero. Inside the main loop, the usual four CMSA steps are performed, followed by an extra fifth step, which we call the learn step. The four usual CMSA steps are left unchanged except for the construct step. As in standard CMSA, the construct step of RL-CMSA consists of probabilistically constructing \(n_a\) solutions. In RL-CMSA, however, this is done by using the q-values: solution components with higher values are chosen with higher probability. In the new learn step, the q-values are updated and a convergence measure can be calculated, leading to a restart of the learning procedure if deemed necessary. Such a restart consists of (1) setting the q-values back to zero and (2) emptying the sub-instance \(C'\) depending on a parameter (as explained below).

Different designs were considered for the new solution construction mechanism and the update of the q-values. These designs have in part been inspired by existing work on the multi-armed bandit problem, which is a classic RL problem (Kuleshov & Precup, 2014). Multi-armed bandit problems were introduced by Robbins in 1952 (Robbins, 1952). In their simplest form, they consist of a set of k probability distributions \(\{D_1,\dots , D_k\}\). The objective is to come up with a sampling strategy for an agent for whom the distributions are unknown. This agent iteratively samples from the distributions obtaining a reward in each iteration. In technical terms, the goal is to design a sampling strategy that maximizes the obtained sum of rewards. The distributions are generally interpreted as k arms in a slot machine and the agent is viewed as a gambler whose goal is to collect as much money as possible.

As one can notice, the RL strategy we implement into CMSA in this work results in a scenario closely resembling a multi-armed bandit problem. In the case of RL-CMSA, sampling consists of selecting one of the available solution components. However, a key difference is the following one: instead of assigning rewards after every sample, or after every solution construction, in RL-CMSA rewards are assigned after the solve step, depending on the result of the exact solver when solving the current sub-instance.

3.1 Update of the q-values

In the learn step, the q-values corresponding to solution components in \(C'\) are updated depending on whether they form part of the solution \(S_\text {opt}\) computed by the exact solver during the solve step. In this way, the quality of a solution component is not measured by a myopic measure of the quality of adding that solution component to a partial solution under construction. Neither is the quality related to the objective function value of the final solution to which a component was added during its construction. The quality of a component is rather measured in comparison to all other solution components in the sub-instance \(C'\) in the following way: the value \(q_i\) of a solution component \(c_i\) is increased if it forms part of \(S_\text {opt}\), and decreased otherwise. This is done by giving a reward \(R > 0\) in the first case, and \(-R\) in the latter. The following three designs for performing the q-value update were considered:

  1. The first design consists of simply summing the obtained rewards over time. At each iteration, once the reward \(r_i \in \{R, -R\}\) for a solution component \(c_i \in C'\) is determined, its q-value is updated as follows:

    $$\begin{aligned} q_i := q_i + r_i \end{aligned}$$
    (1)

  2. The second design is based on averaging the received rewards over time. For this purpose, a variable \(n_i\) stores the number of times the q-value \(q_i\) of a solution component \(c_i\) was updated since the start of the algorithm. At each iteration, once the reward \(r_i \in \{R, -R\}\) for a solution component \(c_i \in C'\) is determined, its q-value is updated as follows:

    $$\begin{aligned} q_i:= q_i+\frac{1}{n_i}(r_i - q_i). \end{aligned}$$
    (2)

  3. The last design generalizes the previous one by replacing \(1 / n_i\) with a constant step-size parameter \(\alpha > 0\). The corresponding update of the q-value \(q_i\) of a solution component \(c_i \in C'\) is, therefore, as follows:

    $$\begin{aligned} q_i:=q_i+\alpha (r_i - q_i). \end{aligned}$$
    (3)

The first design option from above might introduce a bias related to how often a solution component is selected. For example, with this method a solution component that receives a single reward R obtains the same q-value as one that was awarded rewards \(R, -R\) and R. This does not happen with the second design: because the q-values are set to the average of the rewards, the number of times a solution component has been selected does not produce a bias. The second and third designs are popular strategies for updating the q-values in multi-armed bandit problems (Sutton & Barto, 2018). Note that, due to having a constant step-size \(\alpha \), the third design gives more weight to recent rewards than to older ones. This could be beneficial in the context of RL-CMSA as the rewards given may change over time.
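To make the difference between the three update designs concrete, the following minimal sketch replays a reward sequence under each rule, starting from a q-value of zero. It assumes that \(n_i\) already counts the current update when Eq. (2) is applied, so that the first update sets \(q_i = r_i\); the helper `replay` and the default value of `alpha` are introduced here for illustration only.

```python
from typing import Iterable


def update_sum(q: float, r: float) -> float:
    """Design 1, Eq. (1): accumulate the rewards."""
    return q + r


def update_average(q: float, r: float, n: int) -> float:
    """Design 2, Eq. (2): running average; n includes the current update."""
    return q + (r - q) / n


def update_step(q: float, r: float, alpha: float) -> float:
    """Design 3, Eq. (3): constant step-size alpha."""
    return q + alpha * (r - q)


def replay(rewards: Iterable[float], rule: str, alpha: float = 0.5) -> float:
    """Apply one update rule to a sequence of rewards, starting from q = 0."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        if rule == "sum":
            q = update_sum(q, r)
        elif rule == "average":
            q = update_average(q, r, n)
        else:
            q = update_step(q, r, alpha)
    return q
```

With \(R = 1\), the summation rule yields the same q-value of 1 for the reward sequences `[1]` and `[1, -1, 1]`, whereas the averaging rule yields 1 and 1/3, respectively. Under the constant step-size rule, the sequences `[-1, 1, 1]` and `[1, 1, -1]` lead to different q-values because recent rewards weigh more.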

3.2 Solution construction in RL-CMSA

The new construct step uses the q-values for probabilistically generating solutions. The construction process begins with an empty solution \(S = \emptyset \). At each step, one of the solution components that can feasibly extend the current partial solution is added, until the solution is complete. In the following, we will denote by \(C_{\text {feas}} \subseteq C {\setminus } S\) the set of feasible solution components with respect to a partial solution S. Moreover, remember that \(q_i\) denotes the q-value associated with solution component \(c_i \in C\). We propose two different designs for selecting a solution component from \(C_{\text {feas}}\).

3.2.1 Softmax selection

The first proposed design uses a real parameter \(dr\in [0,1]\) called the determinism rate. At each step of the construction process, a solution component is selected in the following way:

  1. With a probability dr, a solution component is chosen uniformly at random among those from \(C_{\text {feas}}\) with the highest q-value.

  2. Otherwise, with a probability \(1-dr\), the selection is done in a roulette-based manner with the probability \(p_i\) of selecting solution component \(c_i \in C_{\text {feas}}\) given by

    $$\begin{aligned} p_i = \frac{e^{\beta q_i}}{\sum _{c_k \in C_{\text {feas}}} e^{\beta q_k}} \quad . \end{aligned}$$
    (4)

    Hereby, \(\beta \ge 0\) is a parameter that, together with dr, governs the balance between exploration and exploitation.
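The two-branch selection rule above can be sketched as follows. This is a minimal illustration in which the q-values are assumed to be stored in a dictionary keyed by solution component; the function names are introduced here for the example. The exponents are shifted by the largest feasible q-value, which leaves the probabilities of Eq. (4) unchanged but avoids numerical overflow.

```python
import math
import random
from typing import Dict, Hashable, List


def softmax_probs(q: Dict[Hashable, float], feasible: List[Hashable],
                  beta: float) -> List[float]:
    """Probabilities of Eq. (4) over the feasible components."""
    top = max(q[c] for c in feasible)
    # Subtracting `top` cancels in the ratio and keeps exp() bounded by 1.
    weights = [math.exp(beta * (q[c] - top)) for c in feasible]
    total = sum(weights)
    return [w / total for w in weights]


def select_component(q: Dict[Hashable, float], feasible: List[Hashable],
                     dr: float, beta: float, rng: random.Random) -> Hashable:
    """Pick one feasible component using the determinism rate dr."""
    if rng.random() < dr:
        # Deterministic branch: uniform choice among the top q-values.
        top = max(q[c] for c in feasible)
        return rng.choice([c for c in feasible if q[c] == top])
    # Roulette-wheel branch with softmax probabilities.
    probs = softmax_probs(q, feasible, beta)
    return rng.choices(feasible, weights=probs, k=1)[0]
```

For \(\beta = 0\) the roulette-wheel branch is uniform over \(C_{\text {feas}}\), while larger values of \(\beta \) concentrate the probability mass on components with high q-values.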

Note that this first selection design might lead to premature convergence of the algorithm. This is because the q-values of some solution components may become considerably larger than the rest, so that the same solution is constructed over and over, which further enlarges their q-values. To remedy this issue, we propose to measure the level of convergence as described below. This measurement is conducted once per iteration in the learn step after updating the q-values. In case high convergence is detected, the algorithm is reset by re-initializing the q-values to zero and emptying \(C'\) depending on a parameter. This mechanism depends on a convergence factor and a convergence factor limit. Whenever the convergence factor is greater than the convergence factor limit, the algorithm is re-initialized.

In the following, the calculation of the convergence factor is described. For every solution component \(c_i\) of the last constructed solution S, the probability \(z_i\) of preferring \(c_i\) to all solution components that do not form part of S is calculated. The convergence factor is then defined as the minimum of these values for all \(c_i \in S\). Note that with this definition, the closer the convergence factor is to value one, the closer the algorithm is to convergence. In this context, note that the probability of choosing a particular solution component depends on the values of parameters dr and \(\beta \). More specifically, the probability \(z_i\) of choosing solution component \(c_i\) can be written as follows:

$$\begin{aligned} z_i := dr \cdot \chi _i + (1-dr) \cdot \frac{e^{\beta q_i}}{\sum _{c_k \not \in S} e^{\beta q_k} + e^{\beta q_i}} \quad , \end{aligned}$$
(5)

where \(\chi _i\) is defined as:

$$\begin{aligned} \chi _i := {\left\{ \begin{array}{ll} \frac{1}{|\{c_k \in (C \setminus S) \cup \{c_i\} \mid q_k = q_i\}|} & \text {if } q_i = \max \{q_k \mid c_k \in (C \setminus S) \cup \{c_i\}\} \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

The expression for \(z_i\) is derived from how solution components are chosen in the construct step. With a probability dr a solution component is chosen uniformly at random from the solution components that achieve the highest q-value, and with a probability \(1-dr\) the selection is performed in a roulette-wheel-based manner with probabilities given by the softmax expression. After having calculated probabilities \(z_i\) for every solution component \(c_i \in S\), the convergence factor is computed as \(cf:= \min \{z_i \mid c_i \in S\}\).
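The computation of the convergence factor from Eqs. (5) and (6) can be sketched as follows. This is a minimal illustration, assuming that the dictionary `q` holds a q-value for every solution component in C and that `solution` is the last constructed solution S; both names are introduced here for the example. As in the softmax sketch, exponents are shifted by the maximum for numerical stability, which does not change the ratio.

```python
import math
from typing import Dict, Hashable, Set


def convergence_factor(q: Dict[Hashable, float], solution: Set[Hashable],
                       dr: float, beta: float) -> float:
    """cf = min over c_i in S of z_i, following Eqs. (5) and (6)."""
    outside = [q[c] for c in q if c not in solution]  # q-values of C \ S
    z_values = []
    for c in solution:
        pool = outside + [q[c]]  # q-values over (C \ S) united with {c_i}
        top = max(pool)
        # chi_i: share of the deterministic choice that goes to c_i.
        chi = 1.0 / pool.count(top) if q[c] == top else 0.0
        # Softmax probability of c_i against all components outside S.
        denom = sum(math.exp(beta * (v - top)) for v in pool)
        soft = math.exp(beta * (q[c] - top)) / denom
        z_values.append(dr * chi + (1.0 - dr) * soft)
    return min(z_values)
```

When all q-values are equal, every \(z_i\) equals \(1 / (|C \setminus S| + 1)\) regardless of dr, so the convergence factor is far from one. It approaches one only when every component of S clearly dominates all components outside S.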

Once the convergence factor cf is calculated, the algorithm checks whether it is greater than the convergence factor limit defined by parameter \(cf_{\text {limit}}\in [0,1]\). If this is the case, the algorithm is re-initialized:

  1. The q-values are re-initialized to zero.

  2. Sub-instance \(C'\) is emptied depending on a Boolean parameter \(b_{\text {reset}}\). If \(b_{\text {reset}} = \textit{true}\), \(C'\) is set to \(\emptyset \). Otherwise, if \(b_{\text {reset}} = \textit{false}\), \(C'\) is not modified.

Emptying \(C'\) when re-initializing the algorithm completely erases the previous information gathered by the RL agent. Conversely, if \(C'\) is not emptied, some of the so-far gathered information is kept.
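The re-initialization rule can be written as a small in-place helper. This is only a sketch: the function name `maybe_restart` and the representation of the q-values as a dictionary and of \(C'\) as a set are assumptions made for the example.

```python
from typing import Dict, Hashable, Set


def maybe_restart(q: Dict[Hashable, float], sub_instance: Set[Hashable],
                  cf: float, cf_limit: float, b_reset: bool) -> bool:
    """Re-initialize in place if cf exceeds cf_limit; report whether it did."""
    if cf <= cf_limit:
        return False
    for c in q:
        q[c] = 0.0            # all q-values go back to zero
    if b_reset:
        sub_instance.clear()  # C' is emptied only when b_reset is true
    return True
```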

3.2.2 Upper-confidence-bound (UCB) selection

As an alternative to Softmax selection, we consider UCB selection (Sutton & Barto, 2018). This method was designed for the multi-armed bandit problem and samples the distributions according to their potential to be optimal. We consider it as an alternative way of dealing with convergence, as this selection mechanism ensures sufficient exploration over time. As with Softmax selection, we combine it with the determinism rate in the following way:

  1. With a probability dr, a solution component is chosen uniformly at random among those with the highest q-value.

  2. Otherwise, with a probability \(1-dr\), UCB selection is employed, which selects uniformly at random among the solution components that maximize the following expression:

    $$\begin{aligned} q_i + \rho \cdot \sqrt{\frac{\log (n)}{n_i}} \end{aligned}$$
    (7)

    Hereby \(\rho >0\) is a parameter, n denotes the current iteration number, \(\log (n)\) denotes the natural logarithm of n, and \(n_i\) the number of times solution component \(c_i\) has been selected so far.

The square-root term in the UCB expression is a measure of the uncertainty in the estimate of \(q_i\). Each time a solution component is selected, its corresponding square-root term decreases, hence lowering the estimated uncertainty in its q-value. Conversely, if an iteration passes and a solution component is not chosen the square-root term increases.
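A minimal sketch of this selection rule is given below. The text above does not specify how components with \(n_i = 0\) are treated; the sketch assumes that they receive an infinite score, which is the usual convention for UCB in Sutton & Barto (2018). The function and argument names are introduced here for illustration.

```python
import math
import random
from typing import Dict, Hashable, List


def ucb_select(q: Dict[Hashable, float], counts: Dict[Hashable, int],
               feasible: List[Hashable], n: int, dr: float, rho: float,
               rng: random.Random) -> Hashable:
    """Pick one feasible component; n >= 1 is the current iteration number."""
    if rng.random() < dr:
        # Deterministic branch: rank by q-value only.
        scores = [q[c] for c in feasible]
    else:
        # UCB branch, Eq. (7). Assumption: never-selected components
        # (counts[c] == 0) get an infinite score and are tried first.
        scores = [
            q[c] + rho * math.sqrt(math.log(n) / counts[c])
            if counts[c] > 0 else math.inf
            for c in feasible
        ]
    top = max(scores)
    # Ties among maximizers are broken uniformly at random.
    return rng.choice([c for c, s in zip(feasible, scores) if s == top])
```

In this sketch a rarely selected component can win the UCB branch even with a lower q-value, because its square-root term is larger.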

Note that in case the q-values are unbounded, this method is not usable as the square-root term becomes useless once the q-values become large enough. This may happen with the first and third designs we proposed for the update of the q-values. For this reason, we will use this method together with the second design proposed above for the q-value update, which sets the q-values to the average of the rewards seen so far.

4 Experimental study

To experimentally evaluate RL-CMSA, we consider the following four algorithm variants which make use of the different designs for the update of the q-values and for the selection of solution components during solution construction. The four considered RL-CMSA variants can be described as follows:

  1. RL-CMSA-1: this variant is characterized by the first design for the update of the q-values (summation of the rewards), and the use of Softmax selection. The reward (R) is set to one.

  2. RL-CMSA-2: second design for the q-value update (average rewards) and Softmax selection. The value of the reward (R) is considered a parameter of the algorithm.

  3. RL-CMSA-3: the same as RL-CMSA-2, except that the third design (average rewards with a constant step-size) is used for the q-value update.

  4. RL-CMSA-4: second design for the q-value update, in combination with the UCB selection for solution construction as an alternative way of avoiding convergence. The reward (R) is set to one.

These four RL-CMSA variants will be compared to standard CMSA. In particular, this comparison will be conducted in the context of two combinatorial optimization problems from the literature: the FFMS and the MDS problems.

All the algorithms were run in single-threaded mode on a cluster of machines with 10-core Intel Xeon processors at 2.2 GHz and 8 GB of RAM. Moreover, they all employed the commercial solver CPLEX as the exact method used in the solve step.

4.1 Algorithm parameters

The first three RL-CMSA variants, which use Softmax selection, make use of parameters \(\beta \), \(cf_{\text {limit}}\), and \(b_{\text {reset}}\). Hereby, \(\beta \) is a parameter in the Softmax equation, \(cf_{\text {limit}}\) is the limit for the convergence factor, and the Boolean parameter \(b_{\text {reset}}\) indicates whether to empty \(C'\) in the case of re-initializing the q-values. In addition, the second and third RL-CMSA variants consider the reward R as a parameter. The third RL-CMSA variant also uses parameter \(\alpha \), which determines the value of the step-size parameter in the q-value update expression. The fourth RL-CMSA variant makes use of parameter \(\rho \) from the UCB selection expression. On top of these, all four algorithm variants utilize parameter dr, which determines the determinism rate used when selecting solution components.

In addition to the above-mentioned parameters, all algorithms also use the standard CMSA parameters \(t_{\text {ILP}}\), \(n_a\), and \(age_{\text {max}}\). These determine the time limit given to the exact solver in the solve step, the number of solutions constructed in the construct step, and the age limit used in the adapt step respectively. In addition, they all use the CPLEX parameters \(\text {cplex}_{\text {warmstart}}\), \(\text {cplex}_{\text {emphasis}}\) and \(\text {cplex}_{\text {abort}}\). These are Boolean parameters that modify the behavior of CPLEX. The first one controls whether the algorithm provides an initial solution to CPLEX. In case \(\text {cplex}_{\text {warmstart}}\) = true, the best-so-far solution will be provided to CPLEX for warm-starting the solving process. The second parameter balances between the speed of proving optimality and the speed of improving the best solution found during a CPLEX execution. If \(\text {cplex}_{\text {emphasis}}\) = true, then CPLEX uses its highest heuristic emphasis value. Otherwise, the default setting is used. Finally, the third parameter determines whether a CPLEX execution is stopped when a solution that improves the best-so-far solution is found. This can be beneficial due to CPLEX sometimes spending a lot of resources on bound computations important for proving optimality.

4.2 Application to the far from most string (FFMS) problem

The FFMS problem is a combinatorial optimization problem arising in bioinformatics and belongs to the family of sequence consensus problems. These problems find applications in different fields, such as molecular biology (Mousavi, 2010). The FFMS problem is known to be NP-hard, meaning that it cannot be solved in polynomial time unless \(\text {P}=\text {NP}\) (Lanctot et al., 2003). Given a set of equal-length input strings over an alphabet \(\Sigma \) and a threshold \(t>0\), the problem aims at finding a string of the same length that maximizes the number of input strings from which its Hamming distance is at least t. Hereby, given two strings s and \(s'\) of length m, their Hamming distance \(d_H(s,s')\) is defined as the number of positions where their corresponding characters differ. That is:

$$\begin{aligned} d_H(s,s') := \big |\{k\in \{1,\dots , m\} \ | \ s[k] \not = s'[k]\}\big | \end{aligned}$$
(8)

An instance of the FFMS problem is denoted by \((\mathcal {S}, \Sigma , t)\), where \(\mathcal {S}\) is a set of n input strings \(\{s_i\}_{i = 1}^n\) of length m over alphabet \(\Sigma \) and \(0 < t \le m\) is the threshold. Every string of length m over alphabet \(\Sigma \) is then a feasible solution to the problem. The goal is to find a feasible string s that maximizes the following objective function:

$$\begin{aligned} f_1(s) := \big |\{ s'\in \mathcal {S} \ | \ d_H(s, s') \ge t \}\big | \end{aligned}$$
(9)

In practice, our CMSA and RL-CMSA implementations make additional use of a secondary objective function to differentiate between two solutions having the same primary objective function value (Eq. 9). This secondary objective function can be stated as follows:

$$\begin{aligned} f_2(s) := \sum _{s'\in \mathcal {S} \,:\, d_H(s,s') \ge t} d_H(s,s') + \max _{s'\in \mathcal {S} \,:\, d_H(s,s') < t} d_H(s,s') \end{aligned}$$
(10)

As the reader may notice, a higher value of \(f_2(s)\) makes it less likely that a small change in s leads to a decrease in the primary objective function \(f_1(s)\). Therefore, a solution s is deemed better than a solution \(s'\) (written \(f(s) > f(s')\)) if and only if (1) \(f_1(s)>f_1(s')\) or (2) \((f_1(s) = f_1(s')\) and \(f_2(s) > f_2(s'))\). This lexicographic function f was proposed in (Blum & Pinacho-Davidson, 2023) to remedy the negative effect of large plateaus in the search space of the FFMS problem.
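The lexicographic comparison can be sketched by returning the pair \((f_1(s), f_2(s))\), since Python compares tuples lexicographically. This is a minimal illustration; the text does not state the value of the max term when no input string is closer than t, so the sketch assumes it to be zero in that case.

```python
from typing import List, Tuple


def hamming(s: str, other: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, other))


def ffms_objective(s: str, strings: List[str], t: int) -> Tuple[int, int]:
    """Return (f1, f2) for candidate s; larger tuples are better."""
    dists = [hamming(s, x) for x in strings]
    far = [d for d in dists if d >= t]   # strings counted by f1
    near = [d for d in dists if d < t]   # strings not yet far enough
    f1 = len(far)
    # Assumption: the max term is 0 when every input string is far from s.
    f2 = sum(far) + (max(near) if near else 0)
    return f1, f2
```

For the input strings `aaaa`, `aabb`, `bbbb` and \(t = 3\), the candidates `bbbb` and `abbb` both have \(f_1 = 1\), but `bbbb` is preferred because its secondary value (6) exceeds that of `abbb` (4).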

In the following, we explain the definition of solution components for the application of CMSA and the RL-CMSA variants. In addition, it will be described how sub-instances are solved by making use of CPLEX. Finally, we will introduce the method for generating solutions used in the construct step of standard CMSA.

4.2.1 Solution components

A natural way of defining the set C of solution components in the case of the FFMS problem is the following one. For every combination of a position \(k=1,\ldots ,m\) of a solution string and a character \(\texttt {a} \in \Sigma \), set C contains the corresponding solution component \(c_{k, \texttt {a}}\), that is

$$\begin{aligned} C := \{c_{k,\texttt {a}} \ | \ k=1,\dots , m, \text { and } \texttt {a}\in \Sigma \} \quad . \end{aligned}$$
(11)

Therefore, at every step \(j=1,\dots , m\) of the solution construction process the set of feasible solution components is \(C_{\text {feas}}:= \{c_{j, \texttt {a}} \ | \ \texttt {a} \in \Sigma \}\).

4.2.2 Probabilistic solution construction

In the following, we describe how solutions are generated in standard CMSA and the RL-CMSA variants. All algorithms make use of the same solution construction mechanism in which a letter for each position \(j\in \{1, \ldots , m\}\) is determined sequentially from \(j=1\) to \(j=m\). In other words, at the j-th construction step exactly one solution component from \(C_{\text {feas}} = \{c_{j, \texttt {a}} \ | \ \texttt {a} \in \Sigma \}\) is chosen. How this is done is different in CMSA and the RL-CMSA variants. Standard CMSA makes a probabilistic use of the following greedy function for this purpose. Given a position \(1\le j \le m\) and a character \(\texttt {a}\in \Sigma \), the corresponding frequency \(f_{j,\texttt {a}}\) is defined by:

$$\begin{aligned} f_{j,\texttt {a}}:= \frac{\big |\{s\in \mathcal {S} \ | \ s[j] = \texttt {a}\}\big |}{|\mathcal {S}|} \end{aligned}$$

For choosing a letter for position j the following is done:

  1. With a probability \(0 \le dr_{\textrm{CMSA}} \le 1\), the solution component (letter-position assignment) with the lowest frequency value is selected, breaking ties randomly.

  2. Otherwise, with probability \(1-dr_{\textrm{CMSA}}\), a solution component is chosen from \(C_{\text {feas}}\) utilizing letter probabilities proportional to the inverse of their frequencies. That is, the probability for choosing solution component \(c_{j, \texttt {a}}\) is set to:

    $$\begin{aligned} \frac{1/f_{j,\texttt {a}}}{\sum _{\texttt {b}\in \Sigma }1/f_{j,\texttt {b}}} \end{aligned}$$
    (12)

Hereby, \(dr_{\textrm{CMSA}}\in [0,1]\) is a parameter called the determinism rate of CMSA.
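The greedy construction of standard CMSA for the FFMS problem can be sketched as follows. This is a minimal illustration with 0-based positions. The text does not say how letters with frequency zero are handled in Eq. (12); the sketch adds a small constant `eps` to each frequency to avoid a division by zero, which is an assumption made here and gives such letters a very large weight.

```python
import random
from typing import Dict, List


def position_frequencies(strings: List[str], alphabet: str,
                         j: int) -> Dict[str, float]:
    """Frequency f_{j,a} of each letter a at (0-based) position j."""
    return {a: sum(s[j] == a for s in strings) / len(strings) for a in alphabet}


def choose_letter(strings: List[str], alphabet: str, j: int, dr_cmsa: float,
                  rng: random.Random, eps: float = 1e-9) -> str:
    """Choose the letter for position j as in standard CMSA."""
    freq = position_frequencies(strings, alphabet, j)
    if rng.random() < dr_cmsa:
        # Deterministic branch: least frequent letter, ties broken randomly.
        low = min(freq.values())
        return rng.choice([a for a in alphabet if freq[a] == low])
    # Probabilities proportional to inverse frequencies, Eq. (12).
    # Assumption: eps guards against letters that never occur at position j.
    weights = [1.0 / (freq[a] + eps) for a in alphabet]
    return rng.choices(list(alphabet), weights=weights, k=1)[0]


def construct_solution(strings: List[str], alphabet: str, dr_cmsa: float,
                       rng: random.Random) -> str:
    """Build a full string by fixing positions from left to right."""
    m = len(strings[0])
    return "".join(choose_letter(strings, alphabet, j, dr_cmsa, rng)
                   for j in range(m))
```

With `dr_cmsa = 1.0`, the construction always returns a string made of the least frequent letter at each position, which is the purely greedy solution.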

In contrast, the RL-CMSA variants choose a letter for each position j of a solution s by means of the q-values. In particular, in the case of the FFMS, we make use of a q-value \(q_{j, \texttt {a}}\) for every solution component \(c_{j, \texttt {a}} \in C\). At the j-th solution construction step, one of the solution components is chosen from \(C_{\text {feas}} = \{c_{j, \texttt {a}} \ | \ \texttt {a} \in \Sigma \}\) by Softmax, respectively UCB, selection.

4.2.3 Integer linear programming (ILP) model and sub-instance solving

In CMSA algorithms, sub-instances are, if possible, modeled in terms of ILP models which are then solved, at each iteration, using an ILP solver. As mentioned before, in this work we use the commercial solver CPLEX for this purpose. The standard ILP model of the FFMS problem uses two sets of binary variables. The first contains a variable \(x_{j, \texttt {a}}\) for every \(j=1,\dots , m\) and \(\texttt {a}\in \Sigma \), while the second contains a variable \(y_i\) for every input string \(s_i\), \(i= 1,\dots , n\).

$$\begin{aligned} \text {max }&\sum _{i = 1}^n y_i \end{aligned}$$
(13)
$$\begin{aligned} \text {subject to }&\sum _{\texttt {a}\in \Sigma } x_{j,\texttt {a}} = 1,&\text {for} \ j&=1, \dots , m \end{aligned}$$
(14)
$$\begin{aligned}&\sum _{j=1}^m x_{j,s_i[j]} \le m-t\cdot y_i,&\text {for} \ i&=1, \dots , n \nonumber \\&x_{j,\texttt {a}}, y_i\in \{0,1\} \end{aligned}$$
(15)

Notably, variable \(x_{j,\texttt {a}}\) takes value one if character \(\texttt {a}\) is chosen for position j of the solution string and takes value zero otherwise. Constraint (14) makes sure that exactly one character is chosen for each position. The left-hand side of constraint (15) counts the positions at which the solution string coincides with input string \(s_i\), so \(y_i\) can only take value 1 if at most \(m-t\) positions coincide. Hence, constraint (15) together with the maximization goal makes \(y_i\) take value 1 if and only if the Hamming distance between the solution string and input string \(s_i\) is at least t.

To solve a sub-instance \(C' \subseteq C\), for all \(c_{j, \texttt {a}} \in C {\setminus } C'\) the constraint \(x_{j, \texttt {a}} = 0\) is added to this ILP model. In other words, the values of those variables that correspond to solution components not forming part of sub-instance \(C'\) are fixed to zero.
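The effect of restricting the model to a sub-instance can be illustrated on a toy example. In the following sketch, a brute-force enumeration stands in for the ILP solver (it is not the CPLEX model), the threshold `t` is given as an absolute number of positions, and `allowed[j]` is a hypothetical representation of the letters whose solution components \(c_{j, \texttt {a}}\) belong to \(C'\) for position j.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def ffms_objective(solution, strings, t):
    """Number of input strings at Hamming distance >= t from `solution`."""
    return sum(hamming(solution, s) >= t for s in strings)

def solve_sub_instance(strings, t, allowed):
    """Exhaustively search the strings that use only allowed letters.

    Letters missing from allowed[j] play the role of variables x_{j,a}
    that are fixed to zero. Only practical for tiny examples.
    """
    best, best_value = None, -1
    for letters in product(*allowed):
        candidate = "".join(letters)
        value = ffms_objective(candidate, strings, t)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value

if __name__ == "__main__":
    toy_strings = ["AAA", "AAC", "CCC"]
    print(solve_sub_instance(toy_strings, 3, ["ACG"] * 3))  # full instance
    print(solve_sub_instance(toy_strings, 3, ["AC"] * 3))   # sub-instance
```

In this toy example the full instance admits a string that is far from all three inputs, whereas the sub-instance without the letter G does not. The quality of the sub-instance solution therefore depends on which components the construct step contributes to \(C'\).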

4.2.4 Experimental evaluation

For evaluating RL-CMSA in the context of the FFMS problem, we generated the following set of benchmark instances. Each instance is a collection of n strings of size m with characters from an alphabet \(\Sigma \) of size \(|\Sigma |\). Additionally, every instance has a threshold associated, denoted as t, indicated in terms of a proportion of m. The benchmark set contains 720 instances for every value of \(|\Sigma | \in \{4, 12, 20\}\). These are further divided into 30 instances for every combination of \(n \in \{100, 200, 300, 400\}\) and \(m \in \{100, 500, 1000\}\). Moreover, two threshold values depending on \(|\Sigma |\) are considered for all instances: (0.8m, 0.85m) for instances with \(|\Sigma | = 4\), (0.97m, 1.0m) for \(|\Sigma | = 12\), and (0.99m, 1.0m) for \(|\Sigma | = 20\).

Parameter tuning. In addition to the instances described above, the benchmark set contains 72 tuning instances, one for every combination of n, m, \(|\Sigma |\), and t. We tuned all five algorithms twice, once for all instances when considering the lower threshold t of each threshold pair, and once concerning the higher threshold. This was done because from earlier work it is known that the change of the threshold value changes the nature of the problem more than a change in n or m. However, in an attempt to reduce the number of tuning instances, instances with \(m=500\) were excluded. Hence, both tuning runs used 24 tuning instances. Moreover, a budget of 3000 algorithm runs was given to both tuning runs, and every algorithm execution was allowed a time limit of 600 CPU seconds, for both the tuning and evaluation runs.
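The instance counts stated above follow directly from the benchmark grid. The short check below recomputes them. The variable names are ours, and thresholds are stored as proportions of m.

```python
from itertools import product

N_VALUES = (100, 200, 300, 400)
M_VALUES = (100, 500, 1000)
# (lower, higher) threshold as a proportion of m, per alphabet size.
THRESHOLDS = {4: (0.80, 0.85), 12: (0.97, 1.00), 20: (0.99, 1.00)}
INSTANCES_PER_COMBINATION = 30

# Evaluation instances per alphabet size: 4 * 3 * 2 * 30 = 720.
evaluation_per_alphabet = (
    len(N_VALUES) * len(M_VALUES) * 2 * INSTANCES_PER_COMBINATION
)

# One tuning instance per combination of |Sigma|, n, m and t.
tuning_instances = [
    (sigma, n, m, t)
    for sigma, thresholds in THRESHOLDS.items()
    for n, m, t in product(N_VALUES, M_VALUES, thresholds)
]

# Tuning run 0 uses the lower thresholds, run 1 the higher ones;
# instances with m = 500 are excluded from both runs.
tuning_per_run = [
    sum(
        1
        for sigma, n, m, t in tuning_instances
        if m != 500 and t == THRESHOLDS[sigma][run]
    )
    for run in (0, 1)
]

if __name__ == "__main__":
    print(evaluation_per_alphabet, len(tuning_instances), tuning_per_run)
```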

Table 1 Parameter values obtained after tuning for the FFMS problem. Two tuning runs are performed for every algorithm: one for the lower thresholds (0.8m, 0.97m, 0.99m) and one for the higher thresholds (0.85m, 1.00m, 1.00m). A dash (–) denotes that the algorithm does not use the corresponding parameter
Table 2 Comparison of CMSA and the four proposed RL-CMSA variants for the FFMS problem instances with \(|\Sigma | = 4.\)
Table 3 Comparison of CMSA and the four proposed RL-CMSA variants for the FFMS problem instances with \(|\Sigma | = 12.\)
Table 4 Comparison of CMSA and the four proposed RL-CMSA variants for the FFMS problem instances with \(|\Sigma | = 20.\)

Table 1 provides the parameter values obtained after conducting the two tuning runs for every algorithm. There are two columns per algorithm, which contain the parameter values obtained for the lower (left column) and higher thresholds (right column) respectively.

Results. Each algorithm variant was applied exactly once to each problem instance, with a computation time limit of 600 CPU seconds. The results obtained by standard CMSA and the four different RL-CMSA versions are reported in Tables 2, 3, 4. For each combination of m, \(|\Sigma |\) and t, and each algorithm, we present the average length of the best solutions found and the average execution time that was needed for obtaining these best solutions. Columns \(\overline{|s|}\) and \(\overline{t}_{best}[s]\) contain the two respective values. The three tables offer the results for the instances with \(|\Sigma | = 4\), \(|\Sigma | = 12\) and \(|\Sigma | = 20\), respectively. There are 30 different benchmark instances for every combination of n, m, and t. Hence, the presented values are averages over these 30 instances. The obtained results show the following:

  • The four RL-CMSA implementations obtain, on average, better results than CMSA (see last table rows).

  • The only exceptions are the instances of alphabet size \(|\Sigma | = 20\), for which RL-CMSA-2 and RL-CMSA-3 obtain slightly worse average results than CMSA.

  • RL-CMSA-1 obtains the best results except for the instances with \(|\Sigma | = 20\). For these, RL-CMSA-4 is the algorithm that obtains the best average results.

  • Concerning computation time, the first three RL-CMSA variants seem to require a similar amount of time to the one required by the standard CMSA.

  • In contrast, RL-CMSA-4 finds its best solutions later, employing a larger part of the 600-second time limit.

Figure 1 illustrates the differences in computation time further. It contains five box plots for the two threshold groups, representing the time required by each algorithm variant. All algorithms use more time to find their best solutions for the low threshold instances in comparison to the high threshold ones. This is because the latter instances are much harder, which causes the algorithms sometimes to get stuck in local optima.

Figure 2 contains Critical Difference (CD) plots for the FFMS problem results, generated using the R package scmamp (Calvo & Santafé Rodrigo, 2016). Each plot shows the average rank of every algorithm on the x-axis, with a horizontal bar between algorithms denoting non-significant differences. The Friedman rank-sum test indicated, with high significance, that at least one algorithm performs differently from the rest. Thus, we employed Finner’s procedure (García et al., 2010) as the post-hoc method for pairwise comparison. The CD plots show the results obtained with this method, using a significance level of 0.05. The plot in Fig. 2a considers all instances together. We can observe that standard CMSA obtains the worst average rank and that the differences between CMSA and the RL-CMSA variants are statistically significant. Moreover, RL-CMSA-1 and RL-CMSA-4 are the best-performing algorithms, being better than the rest with statistical significance. Figures 2b and c show CD plots for the lower and higher threshold instances, respectively. For the lower threshold instances, CMSA also obtains the worst average rank and is the worst algorithm with statistical significance. In the case of the higher threshold instances, the differences between the algorithms in terms of average rank are much smaller. In this case, CMSA obtains the best average rank, but the differences with RL-CMSA-1 and RL-CMSA-3 are not statistically significant. Interestingly, the best-performing algorithm for the lower threshold instances, RL-CMSA-4, obtains the worst average rank for the higher threshold ones.
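The average ranks displayed in a CD plot can be computed as in the following sketch. It covers only the ranking step, with tied algorithms sharing the mean of their ranks. The Friedman test and Finner's post-hoc procedure, which were carried out with scmamp, are not reproduced. The function name and the toy values are purely illustrative and are not experimental results.

```python
def average_ranks(results, higher_is_better=True):
    """Average rank of every algorithm over all instances.

    `results` is a list with one dict {algorithm: objective value} per
    instance. Rank 1 is best; tied algorithms share the mean rank.
    """
    algorithms = list(results[0])
    totals = {a: 0.0 for a in algorithms}
    for row in results:
        ordered = sorted(algorithms, key=lambda a: row[a],
                         reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            # Extend j to the end of the block of tied values.
            j = i
            while j + 1 < len(ordered) and row[ordered[j + 1]] == row[ordered[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
            for a in ordered[i:j + 1]:
                totals[a] += shared_rank
            i = j + 1
    return {a: totals[a] / len(results) for a in algorithms}

if __name__ == "__main__":
    # Toy objective values of three placeholder algorithms on two instances.
    toy = [{"A": 10, "B": 12, "C": 12}, {"A": 8, "B": 9, "C": 7}]
    print(average_ranks(toy))
```

For a maximization problem such as FFMS, `higher_is_better=True` is the appropriate setting. For a minimization problem it would be set to `False`.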

Fig. 1 Time spent by the algorithms for obtaining their best solutions to the FFMS problem instances
Fig. 2 CD plots concerning the FFMS results
Fig. 3 Exploration plots for the FFMS problem. The x-axis represents the solution components and the y-axis the percentage of times each one was chosen

To better understand the behavioral differences between CMSA and the proposed RL-CMSA variants, we generated the graphics in Figs. 3 and 4. Each figure contains one plot for each of 12 exemplary FFMS problem instances with \(m=500\), one for every combination of \(n \in \{100,400\}\), \(|\Sigma | \in \{4,12,20\}\), and the two threshold values. In particular, the plots in Fig. 3 show the fraction of solution constructions (y-axis) for which each solution component (x-axis) was selected. Note that in all these plots the solution components are ordered from the most selected one (left) to the least selected one (right). The y-axis is plotted on a logarithmic scale to make the differences easier to see.

The most important aspect shown in these graphics is that CMSA—in comparison to the RL-CMSA variants—generally selects to a lesser extent the highly chosen solution components, while it generally shows a higher fraction of selection for less chosen solution components. This can simply be explained by the presence of RL in the RL-CMSA variants, which leads to higher exploitation of seemingly good solution components. However, there are also differences between the RL-CMSA variants. Algorithm variants RL-CMSA-3 and RL-CMSA-4, for example, generally show a lesser degree of exploration than the other RL-CMSA variants.

Figure 4 contains 12 plots, concerning the problem instances already considered in Fig. 3. These graphics show, for all five algorithm variants, the quality of the solutions constructed over time. Run time is represented on the x-axis in terms of seconds. The y-axis shows the objective function values of the constructed solutions. To improve the visualization, only the value of the best solution constructed at each iteration is used for plotting.

These graphics clearly show one of the benefits of implementing RL into CMSA. Due to employing learning, the RL-CMSA variants construct solutions of higher quality than standard CMSA. Observe, for example, that the quality of the solutions constructed by the RL-CMSA variants grows over time. Conversely, in the case of CMSA, the quality of the solutions constructed stays more or less constant as there is no form of learning involved. Notably, the solutions constructed by CMSA are of very low quality, as their objective function values are often close to 0.

In addition, these graphics show the differences in the behavior of the learning processes of the different RL-CMSA variants. In the context of RL-CMSA-1 and RL-CMSA-3, for example, important drops in solution quality can be noted over time. These correspond to algorithm restarts. Hereby, RL-CMSA-1 conducts most restarts, most notably for the high threshold instances for which it restarts multiple times. On the other hand, RL-CMSA-3 restarts once for every high threshold instance, except for the instance with \(n=400, |\Sigma | = 20\) and \(t = 1.00m\).

Fig. 4 Evolution of the objective function values of the constructed solutions over time for the FFMS problem

4.3 Application to the minimum dominating set (MDS) problem

The MDS problem is another well-known NP-hard combinatorial optimization problem from the literature. Given a graph, the MDS problem aims at finding a smallest subset of nodes such that every node of the graph is either part of this subset or has at least one neighbor in it. More formally, let \(G=(V,E)\) be an undirected graph. The MDS problem aims at finding a smallest \(\tilde{V} \subseteq V\) such that for every \(v\in V\) at least one of the following two conditions holds:

  1. \(v\in \tilde{V}\)

  2. \(v'\in \tilde{V}\) for some \(v'\in N(v)\)

Hereby, N(v) denotes the set of neighbors of v in G. That is, \(N(v):= \{v'\in V \ | \ (v',v)\in E \ \}\). A subset of nodes that fulfills the previous two conditions is called a dominating set of G. Hence, the MDS problem aims at finding a minimum dominating set, as the name of the problem suggests. The MDS problem has applications in different fields, such as in wireless sensor networks (Pino et al., 2018) and natural language processing (Shen & Li, 2010).
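The two conditions translate directly into a feasibility check. The sketch below assumes that the graph is given as a dictionary mapping each node to the set of its neighbors. This representation and the function name are our own choices for illustration.

```python
def is_dominating_set(adjacency, candidate):
    """Check whether `candidate` is a dominating set.

    Every node must be in the candidate set (condition 1) or have at
    least one neighbor in it (condition 2).
    """
    chosen = set(candidate)
    return all(v in chosen or bool(chosen & adjacency[v]) for v in adjacency)

if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3.
    path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    print(is_dominating_set(path, {1, 2}))  # both conditions hold everywhere
    print(is_dominating_set(path, {1}))     # node 3 is not dominated
```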

The following subsections introduce the definition of solution components, the way of solving sub-instances, and the way of constructing valid solutions.

4.3.1 Solution components

A natural way of defining the solution components in the case of the MDS problem consists of introducing a solution component for every node of the input graph. The set of solution components is then \(C = V\). Therefore, we henceforth employ the v-notation instead of the c-notation for solution components, that is, \(C:= \{v_1, \ldots , v_n\}\), where each solution component \(v_i\) is a node of the input graph G. At each step of the construction process of a solution \(S \subseteq C\), the set of available solution components consists of all the nodes except for those that are already covered by a node in S and have no uncovered neighbors.

4.3.2 Probabilistic solution construction

Both CMSA and the RL-CMSA variants utilize the following solution construction mechanism. It starts with an empty solution \(S:= \emptyset \). At each step of the process exactly one node (that is, a solution component) is added until a valid solution, that is, a dominating set, is obtained. Hereby, let \(C_{\text {feas}} \subseteq C\) denote, as before, the set of feasible solution components at the current step. In the case of the MDS problem, this is the set of nodes that can cover one or more nodes not already covered by the current partial solution S.

CMSA makes use of the following greedy function for choosing, at each construction step, a node from \(C_{\text {feas}}\). For the introduction of this greedy function, let \(N[v]:= N(v)\cup \{v\}\) denote the closed neighborhood of v and \(N[v \ | \ S]\subseteq N[v]\) denote the set of uncovered neighbors of v concerning partial solution S. For the choice of a node to be added to S, the following is done in CMSA:

  1. With a probability \(0 \le dr_{\textrm{CMSA}} \le 1\), a node \(v\in C_{\text {feas}}\) is chosen as follows:

    $$\begin{aligned} v := \mathop {\mathrm {arg\,max}}\limits _{v'\in C_{\text {feas}}} \big \{\big |N[v' \ | \ S]\big |\big \} \end{aligned}$$
    (16)

  2. Otherwise, with a probability \(1-dr_{\textrm{CMSA}}\), a number of \(\min \{l_{\textrm{CMSA}}^{\textrm{size}}, \big |C_{\text {feas}}\big |\}\) nodes from \(C_{\text {feas}}\) are stored in \(L \subseteq C_{\text {feas}}\) such that:

    $$\begin{aligned} \big |N[v \ | \ S]\big |\ge \big |N[v'\ | \ S]\big |\quad \text {for all} \ v\in L, v'\in C_{\text {feas}} \setminus L \end{aligned}$$
    (17)

    That is, \(L\) contains the feasible nodes that cover the largest numbers of uncovered nodes, in line with the greedy function in (16). A node \(v\in L\) is then chosen uniformly at random and added to S.

Hereby, \(dr_{\textrm{CMSA}}\) and \(l_{\textrm{CMSA}}^{\textrm{size}}\) are parameters of the CMSA algorithm.
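The construction procedure of standard CMSA can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments. It assumes that the candidate list \(L\) holds the feasible nodes with the largest coverage values, consistent with the greedy function in (16), and it breaks ties by node order because the text does not specify a tie-breaking rule. The adjacency-dictionary representation and all names are our own.

```python
import random

def construct_solution(adjacency, dr_cmsa, l_size, rng=random):
    """Build a dominating set node by node.

    adjacency: dict mapping each node to the set of its neighbors.
    dr_cmsa:   determinism rate in [0, 1].
    l_size:    maximum length of the candidate list L.
    """
    covered, solution = set(), set()
    while len(covered) < len(adjacency):
        # gain[v] = |N[v | S]|: uncovered nodes in the closed neighborhood.
        gain = {v: len(({v} | adjacency[v]) - covered) for v in adjacency}
        feasible = [v for v in adjacency if gain[v] > 0]
        feasible.sort(key=lambda v: gain[v], reverse=True)
        if rng.random() < dr_cmsa:
            # Greedy step (16); ties resolved by order (an assumption).
            node = feasible[0]
        else:
            # Candidate list of the best feasible nodes, uniform choice.
            node = rng.choice(feasible[:min(l_size, len(feasible))])
        solution.add(node)
        covered |= {node} | adjacency[node]
    return solution

if __name__ == "__main__":
    star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
    print(construct_solution(star, dr_cmsa=1.0, l_size=3))
```

Every added node covers at least one previously uncovered node, so the loop terminates with a valid dominating set.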

The RL-CMSA variants avoid using this greedy function. They make use of a set of q-values containing a value \(q_i\) for each node (solution component) \(v_i \in C\). The choice of a node at each construction step is made via Softmax or UCB selection, depending on the variant.

4.3.3 ILP model and sub-instance solving

Similarly to the FFMS problem, our algorithms utilize the commercial solver CPLEX in their solve step. The following ILP model for the MDS problem is employed by CPLEX.

$$\begin{aligned} \text {min }&\sum _{v_i\in V} x_i \end{aligned}$$
(18)
$$\begin{aligned} \text {subject to }&\sum _{v_j\in N(v_i)} x_j + x_i \ge 1,&\text {for} \ v_i\in V \nonumber \\&x_i\in \{ 0,1 \},&\text {for} \ v_i\in V \end{aligned}$$
(19)

As one can see, binary variable \(x_i\) takes value one if solution component \(v_i\in V\) forms part of the solution and value zero otherwise. Constraints (19) cause solutions to be dominating sets as, for every node, it is required that the node itself or at least one of its neighbors belongs to the solution. Finally, the minimization goal causes the size of the final solution to be minimum.

To solve a sub-instance \(C' \subseteq C\), for all \(v_j \in C {\setminus } C'\) the constraint \(x_j = 0\) is added to this ILP model. In other words, the values of those variables that correspond to solution components (nodes) not forming part of sub-instance \(C'\) are fixed to zero.
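On very small graphs, the effect of fixing variables to zero can be reproduced by exhaustive search. In the sketch below, enumeration over subsets of \(C'\) stands in for the ILP solver. The function name and graph representation are assumptions, and the approach is only practical for toy examples.

```python
from itertools import combinations

def solve_mds_sub_instance(adjacency, sub_instance):
    """Return a smallest dominating set that uses only nodes from C'.

    Nodes outside `sub_instance` correspond to variables x_j fixed to
    zero. Returns None if C' contains no dominating set.
    """
    nodes = sorted(sub_instance)
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            if all(v in chosen or chosen & adjacency[v] for v in adjacency):
                return chosen
    return None

if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3 - 4.
    path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    print(solve_mds_sub_instance(path, {0, 1, 2, 3, 4}))  # full instance
    print(solve_mds_sub_instance(path, {0, 2, 4}))        # restricted C'
```

On this path graph the full instance has a dominating set of size 2, whereas the sub-instance \(\{0, 2, 4\}\) requires all three of its nodes. In CMSA, every solution produced by the construct step is a dominating set whose nodes are merged into \(C'\), so the sub-instance always contains at least one valid solution.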

4.3.4 Experimental evaluation

For evaluating standard CMSA and the RL-CMSA variants for the MDS problem, we used a benchmark set consisting of graphs of different sizes and densities generated by using the following three graph models: Erdös-Rényi (Erdös & Rényi, 1959), Watts-Strogatz (Watts & Strogatz, 1998) and Barabási-Albert (Barabási & Albert, 1999). The first is one of the best-known random graph models for generating graphs using two parameters: the number of nodes and the probability of the existence of an edge between any pair of nodes. The second is used for generating small-world networks, which have a short average shortest path length between nodes and maintain a high level of local clustering. Finally, the third produces graphs with a majority of low-degree nodes and a few nodes of significantly higher degree.
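As an illustration of the first model, the following sketch generates an Erdös-Rényi graph from its two parameters using only the Python standard library. It is not the generator that produced the benchmark set. The function name and the adjacency-dictionary output are assumptions.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng=random):
    """Generate an undirected G(n, p) graph.

    Each of the n * (n - 1) / 2 possible edges is included
    independently with probability p.
    """
    adjacency = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency

if __name__ == "__main__":
    graph = erdos_renyi(500, 0.0103881, random.Random(42))
    n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
    print(len(graph), n_edges)
```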

We generated 30 graphs of every graph type and for every combination of \(|V| \in \{500,1000,1500,2000\}\) and four different graph densities. Densities are controlled by parameters p, k, and m for the three graph models, respectively. The four densities considered are \(p \in \{0.00416381, 0.0062414, 0.0103881, 0.020705\}\) and \(k,m \in \{2,3,5,10\}\). Henceforth, these will be called the 1st, 2nd, 3rd, and 4th density levels, respectively. The benchmark set therefore consists of 480 graphs for every model, totalling 1440 instances. Additionally, it contains one tuning instance for every graph type, density level, and size.

Parameter tuning. The five CMSA variants were tuned using the tuning instances of the lowest and highest density levels, that is, the instances concerning the 2nd and 3rd density levels were disregarded to speed up the procedure. This amounts to 24 tuning instances in total. As in the case of the FFMS problem, tuning was conducted using the R tool irace (López-Ibáñez et al., 2016) with a budget of 3000 experiments per tuning run. For both tuning and evaluation, every algorithm execution was given a time limit of 150, 300, 450, and 600 CPU seconds for instances of sizes \(|V| \in \{500,1000,1500, 2000\}\) respectively.

Table 5 presents the parameter values obtained after tuning, together with their allowed ranges. The only changes in comparison to the FFMS problem are the additional CMSA parameter \(l_{\textrm{CMSA}}^{\textrm{size}}\) and a smaller allowed range of values for \(t_{\text {ILP}}\). Parameter \(l_{\textrm{CMSA}}^{\textrm{size}}\) is used together with \(dr_{\textrm{CMSA}}\) in the standard CMSA solution construction procedure. The allowed range for parameter \(t_{\text {ILP}}\) was shortened because the ILP solver requires less time for solving MDS sub-instances than for solving FFMS sub-instances.

Table 5 Parameter values obtained after tuning for the MDS problem. Every algorithm is tuned exactly once. A dash (–) denotes that the algorithm does not use the corresponding parameter
Table 6 Comparison of CMSA with the four RL-CMSA variants for the MDS problem instances with \(|V| = 500\)
Table 7 Comparison of CMSA with the four RL-CMSA variants for the MDS problem instances with \(|V| = 1000\)
Table 8 Comparison of CMSA with the four RL-CMSA variants for the MDS problem instances with \(|V| = 1500\)
Table 9 Comparison of CMSA with the four RL-CMSA variants for the MDS problem instances with \(|V| = 2000\)

Results. Tables 6, 7, 8, 9 present the obtained results for the MDS problem instances. The same structure is used as in the case of the FFMS problem, just that each table in the case of the MDS problem presents the results for problem instances of a specific graph size. Each result is an average over the 30 problem instances for a specific combination of graph type, |V|, and density level. The results allow us to observe the following:

  • RL-CMSA-1 generally performs best, followed by the standard CMSA.

  • The differences between the algorithms grow with growing graph size.

  • RL-CMSA-1 improves over CMSA to a larger extent in the context of Erdös-Rényi and Watts-Strogatz graphs than for Barabási-Albert graphs.

  • Interestingly, algorithm variants RL-CMSA-2, RL-CMSA-3, and RL-CMSA-4 generally perform slightly worse than standard CMSA, except RL-CMSA-3 for smaller problem instances.

  • These results together with the ones of the FFMS problem show that RL-CMSA-1 seems to be the best RL-CMSA variant.

Figure 6 provides four box plot graphics, one for each value of |V|, showing the time taken by each algorithm to find the best solution in each run. These graphics show that the four RL-CMSA variants spend a similar amount of time finding the best solutions in their respective runs. Remember that for the MDS problem, the time limit was set to 150, 300, 450, and 600 CPU seconds for the four considered graph sizes, respectively. RL-CMSA-1 is the algorithm that uses the most time for instances with \(|V|=1500\) and \(|V|=2000\), which coincides with it being the best-performing algorithm for these large instances.

To check for statistical significance, the CD plots shown in Fig. 5 were produced. These were again generated using the R package scmamp utilizing the same statistical testing procedure as in the case of the FFMS problem. The CD plot in Fig. 5a considers all problem instances together, while the CD plots in Figs. 5b–d are restricted to problem instances of a specific graph type. It can be observed that RL-CMSA-1 is the algorithm that obtains the best average rank and that the difference with the rest of the algorithms is statistically significant. Studying the CD plots for instance subsets, it can be seen that for Barabási-Albert instances, the resulting average ranks are similar, with RL-CMSA-1 and RL-CMSA-2 being slightly better than standard CMSA, without statistical significance. However, for Erdös-Rényi and Watts-Strogatz instances, RL-CMSA-1 is the best-performing algorithm and the differences are, this time, statistically significant.

Fig. 5 CD plots concerning the MDS results
Fig. 6 Time spent by the algorithms for obtaining their best solutions to the MDS problem instances
Fig. 7 Exploration plots for the MDS problem. The x-axis represents the solution components and the y-axis the fraction of solution constructions in which each one was chosen

To gain a deeper understanding of the algorithm behavior, Figs. 7 and 8 show, as in the case of the FFMS problem, the exploration behavior of the algorithms and the evolution of the quality of the constructed solutions over time, respectively. For this purpose, one instance was considered for every graph type and size (|V|), totaling 12 instances. All of the selected instances are from the 3rd density level. As in the case of the FFMS problem, Fig. 7 plots for every algorithm and problem instance the fraction of solution constructions in which solution components were selected. The x-axis represents the solution components, ordered from the most to the least selected one for each algorithm.

The exploration plots show some differences to the FFMS case. First, the behavior of standard CMSA is considerably different from the one displayed for the FFMS problem. In particular, the selection frequency of rarely chosen solution components drops drastically (in steps) at some point. We conjecture that this is because of the use of the \(l_{\textrm{CMSA}}^{\textrm{size}}\) parameter that limits the number of selectable solution components (nodes) at each construction step. While its use is beneficial for the global performance of CMSA, it apparently reduces the exploration capability of the algorithm. In addition to this observation, we can also detect differences in the relative behavior of the RL-CMSA variants. While, in the context of the FFMS problem, RL-CMSA-1 and RL-CMSA-2 showed a higher exploration of rarely chosen solution components than RL-CMSA-3 and RL-CMSA-4, the opposite generally holds for the MDS problem. The exceptions are the smaller Barabási-Albert and Erdös-Rényi graphs.

Finally, Fig. 8 plots the quality (in terms of the objective function values) of the solution constructions performed over the run-time of the algorithms. Note that, in the case of the MDS problem, higher-quality solutions correspond to lower objective function values (which was the opposite for the FFMS problem).

Again, these graphics show the learning process of the RL-CMSA versions. Relative to the RL-CMSA versions, the solutions constructed by CMSA are better here than they were for the FFMS problem. For most instances, RL-CMSA-1 constructs the best quality solutions, which coincides with the fact that this is the best-performing algorithm for this problem. The other three RL-CMSA versions perform worse solution constructions in general, except for RL-CMSA-4, which is the best at constructing solutions for the Barabási-Albert instances.

A last interesting observation is that RL-CMSA-1 is the only algorithm that performs restarts for the MDS problem. As seen before, every restart produces a spike in the graphs showing the evolution of solution quality. This is because a restart causes the information gathered so far to be erased.

Fig. 8 Evolution of the objective function values of the constructed solutions over time for the MDS problem

5 Conclusions and future work

The use of ML techniques for supporting and improving metaheuristics is a successful current trend in the literature. Following this trend, this work has introduced a new version of the hybrid metaheuristic CMSA by adding an RL component for constructing solutions at each iteration. This new CMSA variant, called RL-CMSA, improves over the standard CMSA in two aspects. First of all, its application is not dependent on a tailored greedy function for evaluating solution components at each solution construction step. Therefore, RL-CMSA can be seen as a more general algorithm than standard CMSA, which often is also easier to implement. The goodness/usefulness of solution components is learned online in RL-CMSA by means of the q-value sampling and update. Moreover, RL-CMSA was shown to improve over standard CMSA in terms of empirical performance both in the context of the FFMS and the MDS problem. The main conclusion of this research is that equipping CMSA with the proposed simple learning mechanism is highly successful. Therefore, this new variant should be tested for problems in which CMSA excels as it could potentially perform even better. Specifically, the best-performing variant is RL-CMSA-1, which is statistically significantly better than the standard algorithm for both problems. As our experimental evaluation shows, this variant successfully learns to construct solutions in the construct step, learning to generate better solutions than the ones constructed by the greedy probabilistic method employed by CMSA for both problems.

We introduced RL-CMSA as a general framework, leaving room for alternative particular implementations of the q-value update and solution component sampling. While we proposed four different designs, we believe that an avenue for future work could consist of devising even better designs, further improving the performance obtained by our proposal. It would also be interesting to compare both standard CMSA and RL-CMSA variants for other combinatorial optimization problems, with the goal of obtaining further confirmation of the improvements brought by the RL component. Moreover, it would also be of great interest to explore if the proposed learning mechanism can harm performance for some problems. This might happen for so-called deceptive problems, for which the learning process might introduce a bias towards areas of the search space that do not contain the best solutions that can be found.

Finally, one limitation of the proposed learning mechanism is that the partial solution under construction is not taken into consideration for evaluating the goodness of a solution component. In general, the partial solution under construction plays a role in deciding whether a solution component is suitable for extension. Another path for future work could consist of extending the learning mechanism so that this contextual information is taken into account for deciding the quality of solution components. Doing this would change the employed RL process from being stateless to having a concept of state, which would be the partial solution under construction.