Introduction

When evaluating the performance of an optimization method, the most commonly reported statistics are running time (or execution time) and solution quality. These statistics are often sensitive to initial conditions and algorithm parameters, especially in randomized algorithms. To capture this variability when analyzing computational performance, it is common to report a sequence of run-times and the corresponding objective function values (or their averages and some measure of dispersion) for different problem instances, initial conditions, and parameter settings. In general, it is natural to consider a probability distribution of the time that an algorithm requires to obtain a solution of a given quality (e.g., an optimal solution or a solution whose value is within a factor α of optimal). The variability in run-times can often be attributed to randomized algorithmic steps, such as randomized branching rules and column selection in branch-and-price methods, arbitrarily resolved ties, stochastic initial solutions, and random local search moves in metaheuristics.

In some applications, the distribution of the algorithm’s run-time has a large spread, and there is a relatively high probability of observing a run-time that is far from the average [6]. Such peculiar distributions can be exploited to accelerate the search via a multi-start strategy: each run has a limited duration, and the algorithm is repeatedly restarted with different initial conditions or random seeds. Importantly, this acceleration opportunity is not just a mathematical curiosity; it is commonly encountered in real-life applications and provides significant acceleration for state-of-the-art optimization algorithms; see, e.g., [6, 12, 16, 18].

If the probability distribution of an algorithm’s run-time admits such acceleration, it is referred to as a restart distribution. In [18], the authors provide a formal definition of a restart distribution and establish a theoretical explanation of efficient restart strategies for enumeration algorithms. A search tree model is described in [2] to provide a formal proof of heavy-tailed behavior in imbalanced search trees.

Incorporating variability into deterministic methods may lead to faster algorithms. Randomizing certain decisions in a deterministic algorithm may also induce a restart distribution of run-times. This type of acceleration has been reported for various applications of randomized backtracking search. In [6], random steps are embedded in a complete search algorithm for the constraint satisfaction problem and the propositional satisfiability problem, and it is shown that the proposed randomization accelerates the deterministic methods.

Finally, multi-start is a popular framework in many metaheuristics (see, e.g., [3, 13, 14, 16]). However, the impact of restart strategies on computational performance is often underestimated. One of the most successful metaheuristics, tabu search, and its run-time distributions in the context of restart strategies are discussed in [19].

The following example of a restart distribution is based on a randomly generated instance of the directed maximum cut problem [20] with 400 vertices, where each possible edge between two vertices is included in the graph with probability \(\frac {1}{2}\) (the two directions are considered separately). This random instance is repeatedly solved by a simple version of the tabu search method [4], starting from different initial solutions and terminating when the algorithm finds a solution whose objective is better than a predefined threshold value. The tabu search implementation in this example has a fixed tabu tenure of ten iterations and does not include any advanced features, except for cycle detection [4].

The distribution of the run-times is presented in Fig. 1. Notably, the probability of large computational times is rather high; in other words, the run-time has a heavy-tailed distribution (the tails are not exponentially bounded). To accelerate the search, one can select a restart parameter value and repeatedly restart the algorithm whenever the number of iterations exceeds this predefined parameter. Figure 2 shows the average run-time of the tabu algorithm as a function of the number of iterations between restarts. It is clear from the plot that certain values of the restart parameter guarantee superior algorithmic performance when compared to the original algorithm, which does not employ any restart strategy. In this context, an interesting question arises: What is the best restart value for algorithm acceleration? The following discussion provides an overview of the research that addresses this important question.

Fig. 1 The distribution of run-time of a tabu search algorithm for an instance of the maximum cut problem based on 80,000 runs

Fig. 2 The average run-time of a tabu search algorithm for an instance of the maximum cut problem as a function of the restart parameter (the same instance as in Fig. 1)

Basic Restart Strategies

Unlike Monte Carlo algorithms, which are guaranteed to provide approximate answers or solutions within a fixed time, Las Vegas-type algorithms are defined as algorithms that always find “correct solutions,” while the required computational time is uncertain. To define the correctness of a solution, one can demand a proof of optimality or compare the solution objective to a predefined threshold, in which case a “correct solution” is any solution with an objective value better than the threshold.

In general, the run-time of an algorithm can be measured either in iterations or in computational time. A restart strategy associated with a given algorithm is typically defined (see, e.g., [10, 22]) as a sequence, S = {t1, t2, …}, that represents the allocated run-times between restarts. According to the strategy S, the first run continues for t1 time units (iterations) or until the correct answer is found, the second run lasts t2 time units or until the correct answer is found, and so on. The runs are independent of each other, as no information is passed between different computational runs. As briefly discussed in section “Introduction”, different restart strategies may lead to different average run-times.
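For concreteness, the following minimal Python sketch executes independent runs under an arbitrary restart schedule. The callable single_run is a hypothetical placeholder for one bounded run of the underlying algorithm from a fresh random start; it is not part of the original text.

```python
import random

def run_with_restarts(single_run, schedule, seed=0):
    """Execute independent runs under a restart strategy `schedule`,
    an iterable of time budgets t1, t2, ...  The hypothetical callable
    `single_run(budget, rng)` performs one run from a fresh random start
    and returns (success, time_used).  No information is carried between
    runs; the procedure stops at the first successful run."""
    rng = random.Random(seed)
    total_time = 0
    for budget in schedule:
        success, used = single_run(budget, rng)
        total_time += used
        if success:
            break
    return total_time
```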

In [10], the authors consider integer run-times that represent the total number of iterations until a desired outcome is achieved. They consider a wide domain of restart strategies including strategies with random restart intervals and those that enable suspension of algorithm execution in favor of another run with an option to continue its execution later. It is shown that there always exists a uniform restart strategy (i.e., t1 = t2 = … = t) that is optimal. It is easy to see that the same result holds if a run-time is determined using continuous time units instead of iterations – for example, consider rounding continuous times to the nearest integral values (number of seconds) and redefining the run-time distribution accordingly.

To find an optimal restart period t, a good estimate of the underlying run-time distribution is required, which might be problematic in practice. If this information is not available, the universal restart strategy introduced in [10] can provide a decent performance guarantee. In this strategy, all run lengths are powers of 2, starting with a run of length \(2^0=1\) (the sequence may be rescaled). Given the duration of the current run, if the number of completed runs of that length is even, then the next run is twice as long as the current one; otherwise, the next run has length one. These rules define the following numerical sequence:

$$\displaystyle \begin{aligned} S^{\mathrm{univ}}=\{1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, 1, \ldots \} \end{aligned}$$
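Under the stated doubling rule, the universal sequence can be generated with a few lines of Python. The sketch below is purely illustrative; it simply tracks how many runs of each length have been completed.

```python
def universal_restart_sequence(n):
    """Generate the first n terms of the universal restart sequence
    using the doubling rule described above."""
    seq = []
    counts = {}       # how many runs of each length have been completed
    length = 1        # the first run has length 2**0 = 1
    for _ in range(n):
        seq.append(length)
        counts[length] = counts.get(length, 0) + 1
        # double after an even number of runs of this length; otherwise
        # start over with a run of length one
        length = 2 * length if counts[length] % 2 == 0 else 1
    return seq

print(universal_restart_sequence(16))
# [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, 1]
```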

The performance of \(S^{\mathrm{univ}}\) is within a logarithmic factor of the optimal restart strategy, without any prior knowledge of the run-time distribution [10]. Formally, if \(T^{\mathrm{univ}}\) denotes the expected run-time of the algorithm that uses the universal restart strategy and \(T^{\mathrm{opt}}\) denotes the expected run-time under the optimal restart strategy, then

$$\displaystyle \begin{aligned} T^{\mathrm{univ}}\leq 192 \cdot T^{\mathrm{opt}} ( \log_2(T^{\mathrm{opt}})+5). \end{aligned}$$

Unfortunately, the gap between the run-times of the universal and optimal strategies may be large in practice, which limits the applicability of this important theoretical result.

Restart strategies can be naturally extended to parallel optimization [9], where copies of the same algorithm are executed in parallel. Note that due to randomness, the search trajectories are in general different. As in the serial setting, a restart strategy defines maximum run-times for each copy of the algorithm, i.e.,

$$\displaystyle \begin{aligned} S=\left\{ \begin{array}{c} t^1_1,t^1_2,t^1_3,\ldots\\ t^2_1,t^2_2,t^2_3,\ldots\\ \ldots \\ t^n_1,t^n_2,t^n_3,\ldots\\ \end{array}\right. \end{aligned} $$
(1)

In a uniform parallel restart strategy, each copy of the algorithm has the same restart schedule (i.e., \(t^i_j=t^*\)). This strategy is no longer optimal, as it was in the serial case, but its performance is provably within a constant factor of optimality. An example of a nonuniform optimal strategy can be found in [9].

Restart Distribution and Optimal Restart Strategies

Next, we overview some formal characterizations of run-time distributions that can be exploited to gain computational acceleration. The discussion in this and the following sections is mostly based on the results from [22].

Consider a randomized algorithm A for solving a problem instance P. The execution time of A when solving P is a random variable, ξ. Based on the algorithm A, one can define its uniform restart version, AR, which follows the same algorithmic steps as A, but is restarted after R iterations or time units if the correct solution is not found. Similarly to A, algorithm AR terminates as soon as the correct solution is obtained. The positive parameter R is referred to as a restart period (or a restart parameter) of AR.

The number of restarts before the first successful run of AR is a geometric random variable, and the expected run-time, T(R), of AR can be expressed as

$$\displaystyle \begin{aligned} T(R)=R \cdot \frac{1-Pr\{\xi\leq R\}}{Pr\{\xi\leq R\}}+E[\xi | \xi \leq R] \end{aligned} $$
(2)

The first term in (2) is the expected number of restarts multiplied by the duration of each run, while the second term is the average duration of the run that produces the final solution.

Definition 1

A probability distribution of a random variable ξ, Pr{ξ ≤ x}, is called a restart distribution if there exists R > 0, such that

$$\displaystyle \begin{aligned} R \cdot \frac{1-Pr\{\xi\leq R\}}{Pr\{\xi\leq R\}}+E[\xi | \xi \leq R] < E[\xi ] \end{aligned} $$
(3)

From (3), if the run-time of an algorithm follows a restart distribution, then the algorithm has a restart version that outperforms the algorithm without restarts in terms of average run-time. Clearly, the properties of the run-time distribution depend on both the optimization problem and the algorithm itself. If, for example, an optimization algorithm already uses an optimal restart policy, then its run-time no longer follows a restart distribution.

A log-normal distribution is an example of a restart distribution. For example, consider three log-normal random variables with the same location parameter, μ = 1, and scale parameters σ equal to 2, 2.5, and 1.5. The expected values of these random variables are 20.08, 61.86, and 8.37, respectively. If these variables represented run-times of some algorithms, we would be able to achieve acceleration by restarting after R units of run-time. Figure 3 illustrates this opportunity by showing the expected run-time given by (2) as a function of the restart period R. Restarting allows us to achieve a significant acceleration for these run-time distributions, and the larger variability in run-times provides a better opportunity for acceleration by the restart mechanism.
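As a rough illustration of (2), the following Python sketch (assuming NumPy and SciPy are available) evaluates T(R) numerically for the log-normal example with μ = 1 and σ = 2 and locates a restart period that beats the no-restart expected run-time. The function name and the search grid are arbitrary choices.

```python
import numpy as np
from scipy import stats, integrate

def expected_runtime_with_restarts(dist, R):
    """T(R) from Eq. (2): expected run-time of the uniform-restart version
    of an algorithm whose single-run time follows `dist` (a frozen
    scipy.stats distribution), restarted every R time units."""
    F_R = dist.cdf(R)
    if F_R <= 0.0:
        return np.inf
    # E[xi | xi <= R] = (1 / F(R)) * integral_0^R x f(x) dx
    partial_mean, _ = integrate.quad(lambda x: x * dist.pdf(x), 0, R)
    return R * (1.0 - F_R) / F_R + partial_mean / F_R

# log-normal run-times with mu = 1 and sigma = 2 (mean is about 20.1)
dist = stats.lognorm(s=2.0, scale=np.exp(1.0))
grid = np.linspace(0.5, 50.0, 200)
T = [expected_runtime_with_restarts(dist, R) for R in grid]
best_R = grid[int(np.argmin(T))]
print(dist.mean(), best_R, min(T))  # restarting reduces the expected run-time
```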

Fig. 3 The average run-time as a function of the restart parameter, R, for log-normally distributed run-times

Optimality conditions can be derived for run-time cumulative distribution functions (CDFs) that are continuous and twice differentiable (see [22] for more details). If this is not the case, a continuous and differentiable approximation of the original CDF can be used instead.

Proposition 1

Let \(R^*\) be an optimal restart period for \(A_R\). Then the expected run-time of \(A_{R^*}\) is

$$\displaystyle \begin{aligned} T({R^*}) =\frac{ 1- Pr\{ \xi \leq R^* \} } { \left.\frac{d Pr\{ \xi \leq R \}}{dR}\right|{}_{R=R^*} } \end{aligned} $$
(4)

The reciprocal of the expression on the right-hand side of (4) is known as the hazard rate function, which is an important concept in reliability engineering [17]. The hazard rate function, or failure rate, can be used to describe general properties of restart distributions.

Proposition 2

A hazard rate function of a restart distribution is nonincreasing on some interval containing an optimal restart period.

This property allows us to rule out distributions that have increasing hazard rate functions [7]. The previous proposition shows, for example, that the following distributions are not restart distributions:

  • Weibull distributions with the shape parameter k > 1,

  • Gamma distribution with the shape parameter k > 1,

  • Uniform distribution,

  • Normal distribution.
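These claims can be checked numerically. The short sketch below (assuming SciPy) evaluates the hazard rate h(x) = f(x)/(1 − F(x)) for a Weibull distribution with shape k = 2 and for the log-normal distribution used earlier; the evaluation points are arbitrary.

```python
import numpy as np
from scipy import stats

def hazard(dist, x):
    """Hazard (failure) rate h(x) = f(x) / (1 - F(x))."""
    return dist.pdf(x) / dist.sf(x)

x = np.linspace(0.5, 3.0, 6)
weibull = stats.weibull_min(c=2.0)                  # shape k > 1
lognormal = stats.lognorm(s=2.0, scale=np.exp(1))   # mu = 1, sigma = 2

print(np.round(hazard(weibull, x), 3))    # increasing: not a restart distribution
print(np.round(hazard(lognormal, x), 3))  # decreasing here: restarts can pay off
```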

Single Algorithm Portfolios of Restart Algorithms

Algorithmic acceleration can also be achieved by combining algorithms into portfolios that are executed concurrently in a distributed manner. For example, a single algorithm portfolio consists of different copies of the same algorithm deployed on different processors. This simple parallelism can be extremely efficient in practical applications that involve randomized algorithms, and there are numerous reports of superlinear speedup for this type of parallelization (e.g., see [15]). By superlinear speedup, we mean that the algorithm utilizes up to n processors and is, on average, more than n times faster than the algorithm utilizing one processor.

The superlinear speedup can be related to the concept of restart distributions. Consider a single restart algorithm portfolio: a parallel algorithm \(A^n_R\) that consists of n independent copies of AR running in parallel (with no communication), where, as previously, AR is the restart version of A with restart parameter R. The algorithm \(A^n_R\) halts whenever one of the n copies finds a correct solution. Let the random variable \(\xi ^n_{\mathrm {min}}\) denote the run-time of \(A^n_R\), and let Tn(R) denote the expected run-time of \(A^n_R\).

The expected run-time of the single algorithm portfolio with a uniform restart strategy is given by

$$\displaystyle \begin{aligned} T_n(R)=R \cdot \frac{(1-Pr\{\xi\leq R\})^n}{1-(1-Pr\{\xi\leq R\})^n}+E[\xi^n_{\mathrm{min}} | \xi^n_{\mathrm{min}} \leq R] \end{aligned} $$
(5)
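A numerical sketch of (5), analogous to the serial case and again assuming SciPy, is given below. It also hints at how the minimizing restart period changes with the number of copies n (cf. Proposition 5 discussed later).

```python
import numpy as np
from scipy import stats, integrate

def parallel_expected_runtime(dist, R, n):
    """T_n(R) from Eq. (5): n independent copies of the restart algorithm,
    all with restart period R; the portfolio stops at the first success."""
    F_R = dist.cdf(R)
    q = (1.0 - F_R) ** n               # all n copies fail within one period
    if q >= 1.0:
        return np.inf

    def pdf_min(x):
        # density of xi_min, the minimum of n i.i.d. single-run times
        return n * dist.pdf(x) * (1.0 - dist.cdf(x)) ** (n - 1)

    partial_mean, _ = integrate.quad(lambda x: x * pdf_min(x), 0, R)
    return R * q / (1.0 - q) + partial_mean / (1.0 - q)

dist = stats.lognorm(s=2.0, scale=np.exp(1.0))   # same log-normal as above
grid = np.linspace(0.5, 50.0, 200)
for n in (1, 4, 16):
    T = [parallel_expected_runtime(dist, R, n) for R in grid]
    print(n, round(grid[int(np.argmin(T))], 2), round(min(T), 2))
    # the minimizing restart period tends to grow with n
```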

An optimality condition similar to the serial case can be derived for parallel restart algorithms (see [22]).

Proposition 3

Let \(R_n^*\) be the optimal restart period for \(A^n_R\); then the expected running time of \(A^n_R\) is

$$\displaystyle \begin{aligned} T_n(R_n^*) = \frac{(1-Pr\{\xi\leq R^*_n\})}{n \cdot \left.\frac{d Pr\{ \xi \leq R \}}{dR}\right|{}_{R=R^*_n} } {} \end{aligned} $$
(6)

As mentioned earlier, the log-normal distribution is a restart distribution. We use this distribution to illustrate some of the ideas, since the exact values of the conditional expectations in (5) can be calculated without resorting to statistical sampling. For example, we can use log-normal run-times to illustrate the superlinear speedup.

Let ξ be a random variable with a log-normal distribution with parameters σ = 2 and μ = 1. We define the parallel speedup as the ratio between the run-time of a serial algorithm and the run-time of its parallel version. First, we consider the speedup for the algorithm that does not apply any restart strategy. The data presented in Fig. 4 clearly indicate a superlinear average speedup relative to the run-time of the serial algorithm.
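A rough Monte Carlo sketch of this experiment (sample size and seed are arbitrary) estimates the average speedup of n no-restart copies over a single no-restart copy for these log-normal run-times.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, samples = 1.0, 2.0, 200_000

serial = rng.lognormal(mu, sigma, size=samples)   # run-times of one copy
for n in (2, 4, 8, 16):
    # run-time of n parallel copies = minimum over n independent run-times
    parallel = rng.lognormal(mu, sigma, size=(samples, n)).min(axis=1)
    speedup = serial.mean() / parallel.mean()
    print(n, round(speedup, 1))   # speedup grows faster than n (superlinear)
```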

Fig. 4 Speedup obtained by comparing the serial no-restart version with the parallel no-restart version [22]; ξ has a log-normal distribution with μ = 1 and σ = 2

The following proposition from [22] states that if the superlinear parallel speedup is achieved by a single algorithm portfolio, then the underlying distribution of the run-time is a restart distribution.

Proposition 4

If \(R^*\) and \(R^*_n\) are optimal restart periods for \(A_R\) and \(A^n_R\), respectively, then \(\frac {T(R^*)}{T_n(R^*_n)}\leq n\).

Simply speaking, Proposition 4 indicates that the superlinear parallel speedup can be attributed to the inefficiencies of the serial algorithm, and these inefficiencies can be alleviated by adopting an appropriate restart strategy.

Consider now the situation when both serial and parallel algorithms implement optimal restart strategies. It is important to note that the optimal values in parallel and serial cases are not necessarily the same. The following proposition relates the number of processors to the value of the optimal restart parameter.

Proposition 5

If the hazard rate function of the run-time distribution is unimodal and \(\frac {T(R^*)}{T_n(R^*_n)}< n\), then \(R^*<R^*_n\).

Intuitively, it might seem that the best serial algorithm is the best candidate for parallelization. However, Proposition 5 shows that the optimal restart parameter for the serial algorithm is not necessarily optimal when considering the performance of the corresponding single algorithm portfolio. In other words, good average performance in the serial setting can lead to suboptimal parallel performance.

Suppose that the run-time of a serial algorithm follows the same log-normal distribution as in the previous example. The optimality conditions (4) and (6) yield the corresponding optimal restart parameters. Figure 5 illustrates the computational speedup achieved by the parallel algorithms with optimal restart parameters and a varying number of parallel processors, compared to the serial algorithm (for the log-normal run-times). The restart parameters of the serial and parallel algorithms are different: in fact, the optimal restart parameter is monotonically increasing as a function of the number of processors. The speedup is almost linear when the number of processors is below 100. For larger numbers of processors, the speedup quickly becomes sublinear, and the single algorithm portfolio parallelization becomes less effective.

Fig. 5 Speedup obtained by comparing the optimal serial restart version with the optimal parallel restart version [22]; ξ has a log-normal distribution with μ = 1 and σ = 2

Another question that is important for practical applications is: what is the value of knowing the optimal restart parameter? Figure 6 shows the value of implementing the optimal restart strategy as a function of the number of processors. As the number of processors increases, the benefit of knowing the exact value of the optimal restart parameter steadily decreases. In some sense, the single algorithm portfolio framework takes advantage of restart-distribution properties without restarting the algorithm. The performance can be improved by applying an effective restart strategy explicitly; however, the value of such an improvement quickly diminishes as the number of processors increases.

Fig. 6 Comparison of the parallel algorithms with and without implementation of the optimal restart strategy [22]; ξ has a log-normal distribution with μ = 1 and σ = 2

The optimal value of the restart parameter is typically unknown in realistic problem settings. Furthermore, as discussed above, the benefit of knowing the optimal restart parameter decreases as the number of processors grows. These observations suggest that the main effort should be concentrated on developing and identifying algorithms whose run-times follow restart distributions, rather than on methods that find optimal restart values. The acceleration potential of run-times that follow restart distributions can be exploited automatically via portfolio parallelization, whereas errors in estimating the optimal restart sequence can degrade performance relative to adopting no restart strategy at all.

We illustrate this idea by examining different tenure parameters of tabu search. Again, we consider a random instance of the maximum directed cut problem with 400 vertices. The computational experiment is based on the simple tabu search mentioned in the earlier examples, run with different values of the tabu tenure. Table 1 shows the results for two tabu algorithms with different tenure parameters. The algorithm with tenure parameter 30 is significantly faster than the algorithm with tenure parameter 10 when deployed on a single processor. However, the situation changes when a parallel implementation is considered: the parallel algorithm with tenure 10 is much faster than its serial version. This example (see [19]) shows that if one uses the average serial run-time as the main criterion for choosing an algorithm for parallel implementation, better parameter choices (or even algorithmic approaches) can be dropped in favor of suboptimal ones.

Table 1 Average number of iterations for different numbers of processors

Figure 7 highlights the value of knowing the optimal restart parameter. The tabu algorithm with tenure 10 without restarts quickly converges to the performance of the same tabu algorithm that uses an optimal restart strategy.

Fig. 7 Computational results for the MAXDICUT problem using tabu search in a single algorithm portfolio [19]

Mixed Algorithm Portfolios of Restart Algorithms

Instead of focusing on a single algorithm portfolio, one may consider a combination of different algorithms. The diversity of mixed algorithms can improve the overall performance relative to a single algorithm portfolio. Suppose that we have a set of available randomized algorithms and want to select a subset of them to include in an algorithm portfolio of a given size. If the selected algorithms are simply different copies of the same algorithm, we call such a selection a single algorithm portfolio; otherwise, we refer to it as a mixed algorithm portfolio.

There are a number of examples in the literature of mixed algorithm portfolios that outperform single algorithm portfolios. In particular, an algorithm portfolio approach for constraint satisfaction and mixed integer programming is presented in [5]. The authors show that a mixed algorithm portfolio can outperform a single algorithm portfolio and discuss the intuition behind such situations. An efficient algorithm portfolio approach using backtracking search for graph-coloring problems is considered in [8]. Extensive computational experiments with restart strategies and algorithm portfolios for benchmark instances of a network design problem are investigated in [1].

The mathematical model of mixed portfolios of restart algorithms can be outlined as follows. Consider a set of m algorithms A1, …, Am with restart parameters R1, …, Rm and random run-times ξ1, …, ξm, respectively. Additionally, there are N parallel processors available, and we need to select N algorithms to deploy, one on each processor; the same algorithm can be deployed on multiple processors. We assume that the run-time of each algorithm is an integer multiple of its restart parameter: even if an algorithm finds a solution at the beginning of a run, the actual run-time is rounded up to the next restart period.

Using this setup, there are m single algorithm portfolios that can be formed. Let Ts denote the average run-time of the best single algorithm portfolio, and let pi denote the probability that a single run of Ai fails to find the target solution within its restart period Ri. The number of restart periods until a portfolio of N copies of Ai succeeds then follows a geometric distribution, and

$$\displaystyle \begin{aligned} T_s(A_1,\ldots,A_m,N)=\min\left\{R_1\frac{1}{1- p_1^N}, \ldots, R_m\frac{1}{1- p_m^N}\right\}. \end{aligned}$$

Let Tm(A1, n1, …, Am, nm) denote the expected run-time of the mixed algorithm portfolio, which consists of n1 copies of A1, n2 copies of A2, and so on, with \(\sum _{i=1}^m n_i=N\). The mixed algorithm portfolio terminates as soon as one of the algorithms finds the target solution (e.g., a solution with an objective value below a certain threshold).

To identify the computational benefit that can be achieved by mixing randomized algorithms with different properties, we define the speedup ratio S as

$$\displaystyle \begin{aligned} S=\frac{T_s(A_1,\ldots,A_m,n_1+\ldots+n_m)}{T_m(A_1,n_1,\ldots,A_m,n_m)}. \end{aligned} $$
(7)

Unlike in the serial case, the authors of [9] demonstrate that the best uniform restart strategy repeated on every processor is not necessarily optimal; however, its performance is within a constant factor of the optimal strategy. Furthermore, in [21] and the subsequent work in [11], it is shown that the speedup ratio S of any mixed algorithm portfolio satisfies

$$\displaystyle \begin{aligned} S\leq \frac{1}{1-e^{-1}}\approx 1.58. \end{aligned} $$
(8)

Therefore, if one has full knowledge of the run-time distributions, the best mixed algorithm portfolio is less than two times faster than the best single algorithm portfolio. However, a mixed algorithm portfolio can be viewed as a strategy that reduces the risks associated with a nonoptimal algorithm selection for single algorithm portfolios. Recall that, by the definition in (7), the value of S compares the performance of a mixed algorithm portfolio against the best possible single algorithm portfolio.

Moreover, it is interesting to note that, according to the theoretical approach described above (see the derivation of (8) in [11, 21]), the best performance of a mixed algorithm portfolio is achieved when it consists of N algorithms that could each also be used to form the best single algorithm portfolio; that is, the expected performance of these algorithms in the single algorithm portfolio setting is the same. However, for the mixed algorithm portfolio, one should select N − 1 algorithms with a relatively short restart parameter and exactly one algorithm with a long restart parameter. Since all of these algorithms have the same performance in the single portfolio setting, each of the former N − 1 algorithms is relatively unreliable within its short restart period (these algorithms have to be restarted a large number of times), while the remaining N-th algorithm, despite its large restart parameter, is very reliable (i.e., it finds the target solution within a single restart period with probability close to one). Clearly, the existence of such algorithms is not guaranteed. Nevertheless, the above result provides an intuitive characterization of effective mixed algorithm portfolios: if there exists a trade-off between the restart parameter value and the algorithm reliability, then it can be exploited within a mixed algorithm portfolio.
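The following sketch illustrates this trade-off with invented numbers: a portfolio mixing several short, unreliable runs with one long, reliable run is compared against the best single algorithm portfolio built from the same candidates. The restart periods, failure probabilities, and copy counts are hypothetical and chosen only for illustration.

```python
import numpy as np

def mixed_portfolio_runtime(algos, horizon=10_000):
    """Expected run-time of a mixed portfolio. `algos` is a list of
    (R, p, n) triples: integer restart period R, probability p that a
    single run fails within R, and n deployed copies.  Since run-times
    are rounded up to multiples of R, the portfolio run-time T is integer
    valued and E[T] = sum_{t >= 0} P(T > t)."""
    total = 0.0
    for t in range(horizon):
        surv = np.prod([p ** (n * (t // R)) for R, p, n in algos])
        if surv < 1e-12:
            break
        total += surv
    return total

def best_single_portfolio_runtime(algos, N):
    """T_s: the best portfolio built from N copies of one algorithm."""
    return min(R / (1.0 - p ** N) for R, p, _ in algos)

# hypothetical trade-off: seven short, unreliable runs plus one long, reliable run
algos = [(1, 0.98, 7), (7, 0.05, 1)]     # (R, failure probability p, copies)
N = sum(n for _, _, n in algos)
T_m = mixed_portfolio_runtime(algos)
T_s = best_single_portfolio_runtime(algos, N)
print(T_s, T_m, T_s / T_m)               # S > 1 here, and provably S <= ~1.58
```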

Conclusions

In the discussion above, we only consider a setting with a single problem instance. However, the same framework can be easily extended to settings with multiple problem instances. For example, optimization models for daily locomotive scheduling remain constant, i.e., one needs to find optimal routes over the existing railroad network for a given demand, while the particular demand patterns can vary substantially from day to day, producing different problem instances. We can form a training set by selecting problem instances from historical demands and use them to construct optimal restart strategies and/or algorithm portfolios. This is also a typical approach in computational experiments: researchers often test their techniques on a small set of instances, tune the algorithms, and then conduct the final experiment on a larger set of problems.

Cross-References