1 Introduction

It is often cited as a strength of evolutionary algorithms (EAs) that, by setting their parameters right, they can be adjusted to the particular problem to be solved. However, it is equally well known that optimizing the parameters is time-consuming and requires considerable expert knowledge.

The theoretical research in this field (see, e.g., [2, 13, 20, 26]) has contributed to this challenge via mathematical runtime analyses for general parameter values, which make it possible to understand the influence of a parameter on the performance and to derive optimal parameter values. Examples include (i) the works of Jansen et al. [21] as well as Doerr and Künnemann [11], which determine the runtime of the \((1 + \lambda )\) EA on OneMax for general values of \(\lambda \) and from this conclude that a linear speed-up with regard to the number of iterations exists only for \(\lambda = O\big (\frac{\log (n) \log \log (n)}{\log \log \log (n)}\big )\), (ii) Witt’s analysis [33] of the runtime of the \((\mu + 1)\) EA for general values of \(\mu \) on the LeadingOnes benchmark, which in particular shows that for \(\mu = O(\frac{n}{\log n})\) a larger parent population does not lead to an asymptotic slow-down of the algorithm, or (iii) the results of Lehre [22, 23] and many follow-up works, which for many non-elitist algorithms determine asymptotically precise thresholds for the selection pressure that separate a highly inefficient regime from one with polynomial runtimes.

Concerning the mutation rate p of the standard bit mutation operator for bit strings of length n, which is our main object of interest in this work, a large number of classic results suggest that a value of \(p=\frac{1}{n}\) or close by is a good choice. We note that a mutation rate of \(p=\frac{1}{n}\) means that on average a single bit is flipped. The recommendation \(p=\frac{1}{n}\) can already be found in [6, 25]. Rigorously proven results show, among other things, that only \(p = \Theta (\frac{1}{n})\) can give an \(O(n \log n)\) runtime of the \((1 + 1)\) EA on OneMax [16], that the asymptotically optimal mutation rate for the \((1 + 1)\) EA on LeadingOnes is approximately \(p=\frac{1.59}{n}\), that \(p = (1 \pm o(1)) \frac{1}{n}\) is the asymptotically best mutation rate of the \((1 + 1)\) EA for all pseudo-Boolean linear functions [34], that only a mutation rate below \(\frac{c}{n}\), where c is a specific constant, guarantees a polynomial runtime of the \((1 + 1)\) EA on all monotonic functions [10, 24], and that \((1 \pm o(1)) \frac{1}{n}\) is the optimal mutation rate for the \((1 + \lambda )\) EA on OneMax when \(\lambda \) is small [18].

In the light of this previous state of the art, it came as a surprise when Doerr et al. [12] determined the runtime of the \((1 + 1)\) EA on jump functions for general mutation rates and observed that here much higher mutation rates were optimal. The jump function \(\textsc {Jump} _{nk}\) (we deviate here from the notation of [12]) is a function defined on bit strings of length n which is mostly identical to the easy OneMax function, but which has a valley of low fitness of Hamming width \(k-1\) around the global optimum. Consequently, elitist algorithms can leave this local optimum only by flipping k specific bits (and [14] suggests that non-elitist algorithms cannot do better). As shown in [12], for this multimodal benchmark function the insights gained previously on unimodal functions like OneMax, linear functions, or LeadingOnes do not apply. The optimal mutation rate for \(\textsc {Jump} _{nk}\) was found to be \((1\pm o(1)) \frac{k}{n}\). Deviating from this optimal rate by a small constant factor leads to a runtime increase by a factor of \(e^{\Omega (k)}\). Hence, the choice of the mutation rate for this problem is truly delicate.

To overcome this difficulty, the use of a random mutation rate chosen according to a heavy-tailed distribution, more specifically, a power-law distribution with exponent \(\beta >1\), was suggested. This mutation operator, called fast mutation in agreement with previous uses of heavy-tailed distributions in continuous evolutionary computation [30, 35, 36], samples a random number \(\alpha \in [1\cdots \lfloor \frac{n}{2} \rfloor ]\) with probability proportional to \(\alpha ^{-\beta }\) and then flips each bit independently with rate \(\frac{\alpha }{n}\). Each application of this operator samples the value of \(\alpha \) independently.
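To make the operator concrete, the following minimal Python sketch implements fast mutation as described above; the function names are ours, and the default \(\beta = 1.5\) follows the experimental recommendation of [12].

```python
import random

def sample_power_law(beta, u):
    """Sample alpha in {1, ..., u} with Pr[alpha = i] proportional to i**(-beta)."""
    weights = [i ** (-beta) for i in range(1, u + 1)]
    return random.choices(range(1, u + 1), weights=weights)[0]

def fast_mutation(x, beta=1.5):
    """Draw alpha from the truncated power law, then flip each bit of x
    independently with rate alpha/n (i.e., standard bit mutation with rate alpha/n)."""
    n = len(x)
    alpha = sample_power_law(beta, n // 2)
    return [1 - bit if random.random() < alpha / n else bit for bit in x]
```

Note that a fresh \(\alpha \) is drawn in every call, matching the independent sampling described above.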

The main result in [12] is that the \((1 + 1)\) EA with this mutation operator optimizes \(\textsc {Jump} _{nk}\) in a time that is only a factor of \(O(k^{\beta -0.5})\) larger than the time resulting from standard bit mutation with the optimal rate. Given that missing the optimal rate (which is only accessible when knowing k) by a small constant factor already incurs a runtime increase by a factor of \(e^{\Omega (k)}\), the \(O(k^{\beta -0.5})\) price for having a one-size-fits-all mutation operator appears to be a good investment. From the asymptotic point of view, \(\beta \) should be taken arbitrarily close to 1, but the experiments conducted in [12] suggested that \(\beta = 1.5\) is a good choice. Both theory and experiments showed that the choice of \(\beta \) is not overly critical. For this reason, it is fair to call fast mutation a parameterless operator.

Since the fast mutation operator is nothing other than a random mixture of the standard bit mutation operators with rates \(\frac{\alpha }{n}, \alpha = 1, \ldots , \lfloor \frac{n}{2} \rfloor \), it is not surprising that the resulting runtime is higher than the one obtained from the best of these individual operators. Rather, it is surprising that by simply averaging over the available options, one comes relatively close to the optimum, and this in a scenario where for static rates a small deviation from the optimum leads to a significantly increased runtime.

In this work, we observe an even more surprising strength of the fast mutation operator. We investigate how the \((1+(\lambda ,\lambda ))\) genetic algorithm (\((1 + (\lambda , \lambda ))\) GA), first proposed by Doerr et al. [9], performs with the fast mutation operator. The \((1 + (\lambda , \lambda ))\) GA is an evolutionary algorithm that creates \(\lambda \) offspring from a single parent individual with an unusually high mutation rate (independently, apart from the fact that they all have the same Hamming distance from the parent), selects the best of these, and creates another \(\lambda \) individuals via a biased crossover between this mutation winner and the original parent. The best of these crossover offspring is taken as the new parent individual if it is at least as good as the previous parent (see Sect. 2 for more details).

This combination of a high mutation rate and crossover with the parent as repair mechanism allows the algorithm to explore the search space more efficiently when the parameters are chosen suitably. Both from informal considerations and from existing runtime results, the right parameterization seems to be that the mutation rate is \(p = \frac{\lambda }{n}\) and the crossover bias, that is, the rate with which the crossover offspring takes bits from the mutation winner, is \(c = \frac{1}{\lambda }\). The informal argument for this is that each bit is flipped in the mutation phase with probability \(p = \frac{\lambda }{n}\) and this flip is kept in the crossover with probability \(c = \frac{1}{\lambda }\), so that (ignoring the intermediate selection step) a single application of mutation and crossover generates a bit string distributed as if generated via standard bit mutation with rate \(pc = \frac{1}{n}\).

With a number of runtime analyses [4, 7, 8, 9] supporting this choice, we fix this relation of the three parameters in the remainder of this work. Since the mutation rate is the starting point of our research, we can alternatively first choose a mutation rate of type \(p = \frac{\alpha }{n}\) and then set \(\lambda = pn\) and \(c = \frac{1}{pn}\).

The right choice of the mutation rate is non-trivial. The good news from [9] is that any rate p with \(p = \omega (\frac{1}{n})\) and \(p = o(\frac{\log n}{n})\) leads to a runtime of \(o(n \log n)\) on OneMax, that is, asymptotically faster than the performance of classic evolutionary algorithms. The optimal mutation rate of

$$\begin{aligned} p = \Theta \left( \frac{1}{n} \sqrt{\frac{\log (n) \log \log (n)}{\log \log \log (n)}}\,\right) , \end{aligned}$$

however, is non-trivial to find [8]. It yields an expected runtime on OneMax of

$$\begin{aligned} E[T] = \Theta \left( n \sqrt{\frac{\log (n) \log \log \log (n)}{\log \log (n)}}\,\right) . \end{aligned}$$

Our main research goal in this work is understanding how the \((1 + (\lambda , \lambda ))\) GA performs when the fast mutation operator is used instead of standard bit mutation with a fixed mutation rate p. With the previously suggested relations between mutation rate, offspring number, and crossover bias, this means that first a number \(\alpha \) is sampled from a power-law distribution, then \(\lambda = \alpha \) offspring are generated via flipping \(\ell \) bits chosen uniformly at random, where \(\ell \sim {{\,\mathrm{Bin}\,}}(n, \frac{\alpha }{n})\), and finally \(\lambda \) times a biased crossover with bias \(c = \frac{1}{\alpha }\) between parent and mutation winner is performed. We call this modified algorithm the fast \((1 + (\lambda , \lambda ))\) GA.

Our main result is that the use of the fast mutation operator in the \((1 + (\lambda , \lambda ))\) GA not only relieves us from finding a good mutation rate, but surprisingly even yields a runtime that is faster than the runtime of the \((1 + (\lambda , \lambda ))\) GA with any fixed mutation rate: if the power-law exponent \(\beta \) satisfies \(2< \beta < 3\), then the fast \((1 + (\lambda , \lambda ))\) GA has an expected runtime of O(n) on OneMax.

We note that a linear runtime of the \((1 + (\lambda , \lambda ))\) GA on OneMax was obtained earlier with a self-adjusting choice of the mutation rate based on the one-fifth rule [8]. While this worked well on OneMax, experimental [17] and theoretical [7] studies on satisfiable MAX-3SAT instances showed that this approach carries the risk that the population size \(\lambda \) increases rapidly, because the problem structure may simply not allow a one-fifth success rate, regardless of how large \(\lambda \) is. Since this behavior increases the time complexity of each iteration, it leads to a significant performance loss. Such problems, naturally, cannot arise with the static behavior of the fast mutation operator.

Via an empirical study, we show that the fast mutation operator, without any modification, indeed also performs well on the satisfiable MAX-3SAT instances for which the one-fifth rule variant of the \((1 + (\lambda , \lambda ))\) GA did not perform well in [7] (unless enriched with a suitable cap on \(\lambda \)). However, our study also shows that on OneMax itself, the self-adjusting \((1 + (\lambda , \lambda ))\) GA is faster than the fast \((1 + (\lambda , \lambda ))\) GA by a constant factor. Since the runtime loss from a degenerate behavior of the one-fifth rule version of the \((1 + (\lambda , \lambda ))\) GA can be large (due to the population size of order n), we draw from these results the recommendation to use the more robust fast \((1 + (\lambda , \lambda ))\) GA on a novel problem rather than the self-adjusting \((1 + (\lambda , \lambda ))\) GA.

2 Notation and Problem Statement

The \((1 + (\lambda , \lambda ))\) GA, first presented in [9], has the following working principles. It stores one current individual x, which is initialized with a random bit string. Each iteration of the \((1 + (\lambda , \lambda ))\) GA consists of two phases, the mutation phase and the crossover phase. In the mutation phase the algorithm first chooses the mutation strength \(\ell \) following the binomial distribution with parameters n and p, where p is usually called the mutation rate. It then creates \(\lambda \) mutants by copying the current individual x and flipping exactly \(\ell \) bits chosen uniformly at random, independently for each mutant. After that, the mutant with the best fitness is chosen as the winner \(x'\) of the mutation phase (all ties are broken uniformly at random). In the crossover phase the algorithm performs \(\lambda \) crossovers between x and \(x'\), taking each bit from \(x'\) with probability c and from x otherwise. The probability c is called the crossover bias. The best crossover offspring y (all ties are again broken uniformly at random) is compared with the current individual x. If y is not worse, then it replaces x. The main hope behind this algorithm is that with a high mutation rate, the mutation winner \(x'\) contains some beneficial solution elements, and that the crossover with the parent acts as a repair mechanism that removes the destructions caused by the high mutation rate.

As discussed in the introduction, the standard parameter setting proposed in [9] uses the mutation rate \(p = \frac{\lambda }{n}\) and the crossover bias \(c = \frac{1}{\lambda }\). However, there is no strong recommendation on how to choose \(\lambda \). For the static choice, [9] suggests using \(\lambda = \omega (1)\) and \(\lambda = o(\log (n))\) in order to have an \(o(n\log n)\) runtime on OneMax, but this runtime is still super-linear. It was also shown in [9] that choosing a fitness-dependent \(\lambda = \sqrt{\frac{n}{n - f(x)}}\) gives a linear runtime on OneMax. In [8] it was shown that controlling \(\lambda \) according to the one-fifth rule also gives a \(\Theta (n)\) runtime on OneMax.

In this paper we propose to choose \(\lambda \) in each iteration from some heavy-tailed distribution. More precisely, the probability that we choose \(\lambda = i\) is

$$\begin{aligned} \Pr [\lambda = i] = {\left\{ \begin{array}{ll} C_{\beta , u} i^{-\beta }, &{}\text {if}\quad i \in [1\cdots u], \\ 0, &{}\text {otherwise,}\\ \end{array}\right. } \end{aligned}$$

where \(\beta \in {{\mathbb {R}}}\) is the power-law exponent of the distribution (which is always considered as a constant), \(u \in {{\mathbb {N}}}\) is an upper bound on the choice of \(\lambda \) (and may depend on n), and \(C_{\beta , u} {:}{=}(\sum _{i = 1}^u i^{-\beta })^{-1}\) is the normalization coefficient. All our runtime results on OneMax will hold for the classic choice \(u = \lfloor n/2 \rfloor \). We introduce this additional parameter because the MAX-3SAT analyses in [7] showed that sometimes a stricter upper bound on \(\lambda \) is necessary. For that reason, it is also interesting to see in the OneMax analyses how small an upper bound on \(\lambda \) can be taken so that a linear runtime is still obtained.

The detailed pseudocode of the fast \((1 + (\lambda , \lambda ))\) GA is shown in Algorithm 1. Our main result will be that this simple way of choosing \(\lambda \) gives us a linear runtime for all \(\beta \in (2, 3)\) and \(u \ge \ln ^{\frac{1}{3 - \beta }}(n)\).

[Algorithm 1: pseudocode of the fast \((1 + (\lambda , \lambda ))\) GA; figure not reproduced here]
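Since the pseudocode figure is not reproduced here, the following Python sketch captures the structure of the fast \((1 + (\lambda , \lambda ))\) GA on OneMax as described above. It is a minimal illustration under our own naming conventions; in particular, ties are broken by first occurrence rather than uniformly at random, and the termination test assumes the all-ones optimum of OneMax.

```python
import random

def one_max(x):
    return sum(x)

def sample_power_law(beta, u):
    """Pr[lambda = i] = C_{beta,u} * i**(-beta) for i in {1, ..., u}."""
    weights = [i ** (-beta) for i in range(1, u + 1)]
    return random.choices(range(1, u + 1), weights=weights)[0]

def fast_ga(n, beta=2.5, u=None, f=one_max):
    """Run the fast (1 + (lambda, lambda)) GA until the all-ones optimum of
    OneMax is found; returns the number of fitness evaluations."""
    u = u if u is not None else n // 2
    x = [random.randint(0, 1) for _ in range(n)]
    fx, evaluations = f(x), 0
    while fx < n:
        lam = sample_power_law(beta, u)
        p, c = lam / n, 1.0 / lam
        ell = sum(random.random() < p for _ in range(n))  # ell ~ Bin(n, p)
        # Mutation phase: lam mutants, each flipping exactly ell random bits.
        best_mutant, best_fm = None, None
        for _ in range(lam):
            mutant = list(x)
            for i in random.sample(range(n), ell):
                mutant[i] = 1 - mutant[i]
            fm = f(mutant)
            evaluations += 1
            if best_fm is None or fm > best_fm:
                best_mutant, best_fm = mutant, fm
        # Crossover phase: lam biased crossovers between x and the mutation winner.
        best_y, best_fy = None, None
        for _ in range(lam):
            y = [m if random.random() < c else xi for xi, m in zip(x, best_mutant)]
            fy = f(y)
            evaluations += 1
            if best_fy is None or fy > best_fy:
                best_y, best_fy = y, fy
        if best_fy >= fx:  # elitist selection: accept the winner if not worse
            x, fx = best_y, best_fy
    return evaluations
```

A call such as `fast_ga(1000)` should, on average, return a number of evaluations linear in n, in line with the O(n) bound proven below.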

2.1 Useful Tools

In this section we collect some classic results which are used in our proofs. First, to be able to make the transition between the number of iterations and the number of fitness evaluations, we use Wald’s equation [32].

Lemma 1

(Wald’s equation) Let \((X_t)_{t \in {{\mathbb {N}}}}\) be a sequence of real-valued random variables and let T be a positive integer random variable. Assume further that all of the following conditions hold.

  1. All \(X_t\) have the same finite expectation.

  2. For all \(t \in {{\mathbb {N}}}\) we have \(E[X_t \mathbb {1}_{\{T \ge t\}}] = E[X_t] \Pr [T \ge t]\).

  3. \(\sum _{t = 1}^{+\infty } E[|X_t| \mathbb {1}_{\{T \ge t\}}] < \infty \).

  4. E[T] is finite.

Then we have

$$\begin{aligned} E\left[ \sum _{t = 1}^{T} X_t\right] = E[T]E[X_1]. \end{aligned}$$
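As an illustration (our addition, not part of the original analysis), the following small simulation checks Wald’s equation in a toy setting where the \(X_t\) are i.i.d. die rolls and T is a geometric stopping time determined by independent coin flips, so all four conditions clearly hold.

```python
import random

def wald_demo(runs=100_000):
    """Empirically check E[sum_{t=1}^T X_t] = E[T] * E[X_1] for i.i.d. X_t
    (uniform die rolls, E[X_t] = 3.5) and T ~ Geometric(1/2), E[T] = 2."""
    total_sum = total_iters = 0
    for _ in range(runs):
        while True:
            total_iters += 1
            total_sum += random.randint(1, 6)
            if random.random() < 0.5:  # stop after each roll with probability 1/2
                break
    # both printed values are close to 2 * 3.5 = 7
    print(total_sum / runs, (total_iters / runs) * 3.5)

wald_demo()
```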

We use the following inequality to estimate the probability that at least one of \(\lambda \) Bernoulli trials succeeds.

Lemma 2

For all \(p \in [0, 1]\) and all \(\lambda > 0\) we have

$$\begin{aligned} 1 - (1 - p)^\lambda \ge \frac{\lambda p}{1 + \lambda p}. \end{aligned}$$

Proof

By [29, Lemma 8] (or (1.4.19) in [15]) we have \((1 - p)^\lambda \le \frac{1}{1 + \lambda p}\). Hence,

$$\begin{aligned} 1 - \left( 1 - p\right) ^\lambda&\ge 1 - \frac{1}{1 + \lambda p} = \frac{\lambda p}{1 + \lambda p}. \end{aligned}$$

\(\square \)
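The inequality is also easy to check numerically; a small (admittedly redundant) Python verification over a grid of parameter values:

```python
def check_lemma_2(p, lam):
    lhs = 1 - (1 - p) ** lam
    rhs = lam * p / (1 + lam * p)
    assert lhs >= rhs, (p, lam, lhs, rhs)

# spot-check the inequality of Lemma 2 on a grid of parameters
for p in (0.0, 0.001, 0.01, 0.1, 0.5, 1.0):
    for lam in (1, 2, 10, 100, 1000):
        check_lemma_2(p, lam)
```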

We frequently use the following bounds on the partial sums of the generalized harmonic series.

Lemma 3

For all \(s \in {{\mathbb {R}}}\) such that \(s \ge 1\) and for all \(\alpha \ne 1\) we have \(\sum _{i = 1}^{\lceil s \rceil } i^{-\alpha } \ge \frac{s^{1 - \alpha } - 1}{1 - \alpha }\). For \(\alpha = 1\) we have \(\sum _{i = 1}^{\lceil s \rceil } i^{-\alpha } \ge \ln (s)\).

Proof

We estimate the sum for \(\alpha \ne 1\) through the corresponding integral (this estimate is illustrated in Figs. 1 and 2).

$$\begin{aligned} \sum _{i = 1}^{\lceil s \rceil } i^{-\alpha } \ge \int _{1}^s x^{-\alpha } dx = \frac{s^{1 - \alpha } - 1}{1 - \alpha }. \end{aligned}$$

The case for \(\alpha = 1\) is a well-known bound on the partial sum of the harmonic series.

\(\square \)

Fig. 1 Illustration of the inequality \(\sum _{i = 1}^{\lceil s \rceil } i^{-\alpha }\ge \int _1^s x^{-\alpha } dx\) for the case \(\alpha \ge 0\). In this example we have \(\alpha = 1\), \(s = 3.5\) and \(\lceil s \rceil = 4\). The red area equals the sum. The blue area (fully under the red one, thus shown purple) equals the integral (Color figure online)

Fig. 2 Illustration of the inequality \(\sum _{i = 1}^{\lceil s \rceil } i^{-\alpha }\ge \int _1^s x^{-\alpha } dx\) for the case \(\alpha < 0\). In this example we have \(\alpha = -1.5\), \(s = 3.5\) and \(\lceil s \rceil = 4\). The red area equals the sum. The blue area equals the integral and also the green area, which lies fully under the red one (Color figure online)

Lemma 4

For all \(u \in {{\mathbb {N}}}\) we have

  • \(\sum _{i = 1}^{u} i^{-\alpha } \le u^{1 - \alpha }\frac{2 - \alpha }{1 - \alpha }\), if \(\alpha < 0\),

  • \(\sum _{i = 1}^{u} i^{-\alpha } \le \frac{u^{1 - \alpha }}{1 - \alpha }\), if \(\alpha \in [0, 1)\),

  • \(\sum _{i = 1}^{u} i^{-\alpha } \le \frac{\alpha }{\alpha - 1}\), if \(\alpha > 1\),

  • \(\sum _{i = 1}^{u} i^{-\alpha } \le \ln (u) + 1\), if \(\alpha = 1\).

Proof of Lemma 4

By analogy with Lemma 3 we estimate the sum through a corresponding integral. If \(\alpha < 0\) we have

$$\begin{aligned} \sum _{i = 1}^{u} i^{-\alpha }&\le \int _{1}^{u} x^{-\alpha } dx + u^{-\alpha } \le \frac{u^{1 - \alpha } - 1}{1 - \alpha } + u^{-\alpha } \le u^{1 - \alpha }\frac{2 - \alpha }{1 - \alpha }. \\ \end{aligned}$$

If \(\alpha \ge 0\) we have

$$\begin{aligned} \sum _{i = 1}^{u} i^{-\alpha }&\le 1 + \int _{2}^{u + 1} (x - 1)^{-\alpha } dx \le 1 + \frac{u^{1 - \alpha } - 1}{1 - \alpha } \\ \end{aligned}$$

If \(\alpha \in [0, 1)\), then we have

$$\begin{aligned} \sum _{i = 1}^{u} i^{-\alpha }&\le \frac{u^{1 - \alpha } - 1 + 1 - \alpha }{1 - \alpha } \le \frac{u^{1 - \alpha }}{1 - \alpha }. \\ \end{aligned}$$

If \(\alpha > 1\), we have

$$\begin{aligned} \sum _{i = 1}^{u} i^{-\alpha }&\le 1 + \frac{1}{\alpha - 1} \le \frac{\alpha }{\alpha - 1}. \\ \end{aligned}$$

The case for \(\alpha = 1\) is a well-known bound on the partial sum of the harmonic series. \(\square \)
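As a sanity check (our addition), the bounds of Lemmas 3 and 4 can be verified numerically, for instance as follows:

```python
import math

def partial_sum(alpha, u):
    return sum(i ** (-alpha) for i in range(1, u + 1))

u = 1000
for alpha in (-1.5, 0.5, 1.0, 2.5):
    s = partial_sum(alpha, u)
    # upper bounds from Lemma 4, following its case distinction on alpha
    if alpha < 0:
        upper = u ** (1 - alpha) * (2 - alpha) / (1 - alpha)
    elif alpha < 1:
        upper = u ** (1 - alpha) / (1 - alpha)
    elif alpha == 1:
        upper = math.log(u) + 1
    else:
        upper = alpha / (alpha - 1)
    # lower bound from Lemma 3 (taking s = u, so that ceil(s) = u)
    lower = math.log(u) if alpha == 1 else (u ** (1 - alpha) - 1) / (1 - alpha)
    assert lower <= s <= upper, (alpha, lower, s, upper)
```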

3 Runtime Analysis

In this section we prove upper and lower bounds on the runtime of the fast \((1 + (\lambda , \lambda ))\) GA on OneMax.

3.1 Upper Bound

Our aim in this subsection is to prove an upper bound on the number of fitness evaluations taken until the fast \((1 + (\lambda , \lambda ))\) GA finds the optimum of the OneMax benchmark. Since it is technically easier, we first regard the number of iterations until the optimum is found. For algorithms with fixed population sizes, such a bound would immediately imply a bound on the number of fitness evaluations (namely by multiplying the number of iterations with the fixed number of fitness evaluations per iteration). For the fast \((1 + (\lambda , \lambda ))\) GA, which uses a newly sampled value of \(\lambda \) in each iteration, things are not that easy, but Wald’s equation (Lemma 1) allows us to argue that multiplying with the expected number of fitness evaluations per iteration gives the right result.

Before proceeding with proofs, we now state two theorems that together constitute the main result of this subsection. We start by showing that for reasonable parameter values, the optimum is found in a linear number of iterations.

Theorem 5

If \(\beta \in (1, 3)\) and \(u \ge \ln ^\frac{1}{3 - \beta }(n)\), then the expected number of iterations until the fast \((1 + (\lambda , \lambda ))\) GA finds the optimum of the OneMax function is O(n).

When \(\beta > 2\), the expected number of fitness evaluations per iteration is \(\Theta (1)\) (see Lemma 9). With this observation and Wald’s equation, we obtain the following estimate for the runtime.

Theorem 6

If \(\beta \in (2, 3)\) and \(u \ge \ln ^\frac{1}{3 - \beta }(n)\), then the expected number of fitness evaluations until the fast \((1 + (\lambda , \lambda ))\) GA finds the optimum of the OneMax function is O(n).

We start with the proof of Theorem 5. For the readers’ convenience we split the proof into Lemmas 7 and 8. The first lemma is essentially an interpretation of Lemma 7 in [9].

Lemma 7

If \(\lambda \le \sqrt{\frac{n}{d(x)}}\), where d(x) is the Hamming distance between the current individual x and the optimum, then the probability \(p_{d(x)}(\lambda )\) of increasing the fitness in one iteration is at least

$$\begin{aligned} C\frac{d(x)\lambda ^2}{n}, \end{aligned}$$

where \(C > 0\) is an absolute constant. If \(\lambda > \sqrt{\frac{n}{d(x)}}\), then this probability is at least C.

Proof

By [9, Lemma 7], the probability \(p_{d(x)}(\lambda )\) of true progress (that is, of finding in one iteration an individual strictly better than the current individual x) is at least

$$\begin{aligned} C'\left( 1 - \left( 1 - \frac{d(x)}{n}\right) ^\frac{\lambda ^2}{2}\right) , \end{aligned}$$

where \(C' > 0\) is an absolute constant. By Lemma 2 we have

$$\begin{aligned} p_{d(x)}(\lambda )&\ge C'\left( 1 - \left( 1 - \frac{d(x)}{n}\right) ^\frac{\lambda ^2}{2}\right) \ge C'\frac{\frac{d(x)\lambda ^2}{2n}}{1 + \frac{d(x)\lambda ^2}{2n}}. \end{aligned}$$

If \(\lambda \le \sqrt{\frac{n}{d(x)}}\), then we have \(p_{d(x)}(\lambda ) \ge C' \frac{d(x)\lambda ^2}{3n}.\) Note that \(C {:}{=}\frac{C'}{3}\) is an absolute constant, as is \(C'\). If \(\lambda > \sqrt{\frac{n}{d(x)}}\), then \(p_{d(x)}(\lambda ) \ge \frac{C'}{3} = C.\) \(\square \)

Lemma 8

Let \(\beta \in (1, 3)\). Then the probability \(p_{d(x)}\) of having progress in one iteration given that the current distance to the optimum is d(x) is at least

$$\begin{aligned} C(\beta )\frac{d(x)U^{3 - \beta }}{n}, \end{aligned}$$

where \(U = \min \left\{ u, \sqrt{\frac{n}{d(x)}}\right\} \) and \(C(\beta )\) is some constant independent of n.

Proof

Note that since u is an integer, we have \(u \ge \lceil U \rceil \). Hence, by Lemma 7 we have

$$\begin{aligned} p_{d(x)} = \sum _{\lambda = 1}^u C_{\beta , u} \lambda ^{-\beta } p_{d(x)}(\lambda ) \ge C_{\beta , u} C \sum _{\lambda = 1}^{\lceil U \rceil } \frac{d(x)\lambda ^{2 - \beta }}{n} = C_{\beta , u} C \frac{d(x)}{n} \sum _{\lambda = 1}^{\lceil U \rceil } \lambda ^{2 - \beta }. \end{aligned}$$

If \(U \ge 2\), then by Lemma 3 we have

$$\begin{aligned} \sum _{\lambda = 1}^{\lceil U \rceil } \lambda ^{2 - \beta }&\ge \frac{U^{3 - \beta } - 1}{3 - \beta } \ge \frac{1 - 2^{\beta - 3}}{3 - \beta }U^{3 - \beta } \ge \frac{3}{8}U^{3 - \beta }. \end{aligned}$$

Otherwise, when \(U < 2\) we have

$$\begin{aligned} \sum _{\lambda = 1}^{\lceil U \rceil } \lambda ^{2 - \beta }&\ge 1 = U^{\beta - 3} U^{3 - \beta } \ge 2^{\beta - 3}U^{3 - \beta } \ge \frac{1}{4}U^{3 - \beta }. \end{aligned}$$

Finally, we estimate

$$\begin{aligned} p_{d(x)} \ge C_{\beta , u} C \frac{d(x)}{n} \sum _{\lambda = 1}^{\lceil U \rceil } \lambda ^{2 - \beta } \ge C_{\beta , u} C \frac{1}{4}\frac{d(x)}{n}U^{3 - \beta } = C(\beta ) \frac{d(x)U^{3 - \beta } }{n} \end{aligned}$$

with \(C(\beta ) {:}{=}\frac{1}{4}C_{\beta , u} C\). Since C is an absolute constant by Lemma 7 and since, by Lemma 4, \(C_{\beta , u}\) is at least \(\frac{\beta - 1}{\beta }\), which is a constant independent of u, \(C(\beta )\) is also bounded from below by a constant independent of u. \(\square \)

In order to give a complete picture, we also computed the values of \(p_{d(x)}\) for a wider range of the parameters u and \(\beta \). The results are shown in Table 1 and their proofs are included in the “Appendix”.

Table 1 The probability \(p_{d(x)}\) to increase fitness in one iteration for various values of parameters \(\beta \in {{\mathbb {R}}}\) and \(u \in {{\mathbb {N}}}\)

We are now ready to prove Theorem 5.

Proof of Theorem 5

We bound the expected runtime \(T_I\) (in terms of iterations) from above by the sum of the expected times needed to leave each fitness level. By Lemma 8 we have

$$\begin{aligned} E[T_I] \le \sum _{d = 1}^n \frac{1}{p_{d}} \le \frac{1}{C(\beta )} \left( \sum _{d = 1}^{\lfloor n/u^2 \rfloor } \frac{n}{du^{3 - \beta }} + \sum _{d = \lfloor n/u^2 \rfloor + 1}^n \sqrt{\frac{n}{d}}^{\beta - 1}\right) . \end{aligned}$$

By Lemma 4 we estimate the first sum

$$\begin{aligned} \sum _{d = 1}^{\lfloor n/u^2 \rfloor } \frac{n}{du^{3 - \beta }}&\le \frac{n\left( \ln \left( \frac{n}{u^2}\right) + 1\right) }{u^{3 - \beta }} \le \frac{n(\ln (n) + 1)}{\ln (n)} = n(1 + o(1)), \end{aligned}$$

where in the last inequality we used the assumption \(u \ge \ln ^{\frac{1}{3 - \beta }}(n)\). By Lemma 4 we estimate the second sum as follows.

$$\begin{aligned}&\sum _{d = \lfloor n/u^2 \rfloor + 1}^n \sqrt{\frac{n}{d}}^{\beta - 1} \le \sum _{d = 1}^n \sqrt{\frac{n}{d}}^{\beta - 1} \\&\quad \le n^\frac{\beta - 1}{2} \sum _{d = 1}^n d^{-\frac{\beta - 1}{2}} \le n^\frac{\beta - 1}{2} \frac{n^\frac{3 - \beta }{2}}{(3 - \beta )/2} = O(n). \end{aligned}$$

Therefore, we have

$$\begin{aligned} E[T_I]&\le \frac{1}{C(\beta )}\left( O(n) + O(n)\right) = O(n).\square \end{aligned}$$

Before we prove Theorem 6 we first estimate \(E[\lambda ]\), which is half the expected cost of one iteration.

Lemma 9

If \(\lambda \) is sampled from the heavy-tailed distribution with parameter \(\beta \) and upper limit u, then its expected value is

  • \(E[\lambda ] = \Theta (1)\), if \(\beta > 2\),

  • \(E[\lambda ] = \Theta (\log (u))\), if \(\beta = 2\),

  • \(E[\lambda ] = \Theta (u^{2 - \beta })\), if \(\beta \in (1, 2)\),

  • \(E[\lambda ] = \Theta (\frac{u}{\log (u)})\), if \(\beta = 1\), and

  • \(E[\lambda ] = \Theta (u)\), if \(\beta < 1\),

where the asymptotic notation is for \(u \rightarrow +\infty \).

Proof

First recall that \(C_{\beta , u} = (\sum _{i = 1}^u i^{-\beta })^{-1}\). By Lemmas 3 and 4 we have

  • If \(\beta < 1\), then \(C_{\beta , u} = \Theta (u^{\beta - 1})\),

  • If \(\beta = 1\), then \(C_{\beta , u} = \Theta (1/\ln (u))\), and

  • If \(\beta > 1\), then \(C_{\beta , u} = \Theta (1)\).

We compute

$$\begin{aligned} E[\lambda ] = \sum _{i = 1}^u i\Pr [\lambda = i] = C_{\beta , u} \sum _{i = 1}^u i^{1 - \beta }. \end{aligned}$$

If \(\beta > 2\), then by Lemma 4 we have

$$\begin{aligned} C_{\beta , u} \le E[\lambda ] \le C_{\beta , u}\frac{\beta - 1}{\beta - 2}. \end{aligned}$$

Hence, \(E[\lambda ] = \Theta (1)\).

If \(\beta = 2\), then \(\sum _{i = 1}^u i^{1 - \beta }\) is a partial sum of the harmonic series, thus it is \(\Theta (\log (u))\). If \(\beta < 2\), then by Lemmas 3 and 4 we have

$$\begin{aligned} C_{\beta , u} \frac{u^{2 - \beta } - 1}{2 - \beta } \le E[\lambda ] \le C_{\beta , u} \frac{u^{2 - \beta }}{2 - \beta }. \end{aligned}$$

Therefore, \(E[\lambda ] = C_{\beta , u} \Theta (u^{2 - \beta })\). Together with the estimates of \(C_{\beta , u}\) this proves the lemma for \(\beta < 2\). \(\square \)
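To illustrate Lemma 9 (our addition), one can compute \(E[\lambda ]\) exactly for moderate u and observe the predicted asymptotics:

```python
def expected_lambda(beta, u):
    """E[lambda] under the truncated power law with exponent beta and cap u."""
    norm = sum(i ** (-beta) for i in range(1, u + 1))
    return sum(i ** (1 - beta) for i in range(1, u + 1)) / norm

for u in (10**2, 10**4, 10**6):
    # beta = 2.5: values converge to a constant, i.e., Theta(1);
    # beta = 1.5: values grow like u**0.5, i.e., Theta(u**(2 - beta))
    print(u, expected_lambda(2.5, u), expected_lambda(1.5, u))
```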

We are now in a position to prove Theorem 6.

Proof of Theorem 6

Let \(\{\lambda _t\}_{t\in {{\mathbb {N}}}}\) be a sequence of random variables, each following the power-law distribution with parameters \(\beta \) and u. We can assume that for all \(t \in {{\mathbb {N}}}\) the fast \((1 + (\lambda , \lambda ))\) GA chooses \(\lambda {:}{=}\lambda _t\) in iteration t. Since the cost of one iteration is \(2\lambda \) fitness evaluations (\(\lambda \) for the mutation phase and \(\lambda \) for the crossover phase), the total number of fitness evaluations \(T_F\) has the same distribution as \( \sum _{t = 1}^{T_I} 2\lambda _t. \) We aim at showing that the sequence \((\lambda _t)_{t \in {{\mathbb {N}}}}\) and \(T_I\) allow us to apply Wald’s equation (Lemma 1). We show that conditions (1)–(4) of this lemma are satisfied.

  1. All \(\lambda _t\) have the same expectation, which is finite by Lemma 9.

  2. The event \(T_I \ge t\) is independent of the outcome of \(\lambda _t\), which implies that for all \(i \in [1\cdots u]\) we have \(\Pr [T_I \ge t \mid \lambda _t = i] = \Pr [T_I \ge t]\). Therefore, we have

     $$\begin{aligned} E[\lambda _t\mathbb {1}_{\{T_I \ge t\}}]&= \sum _{i = 1}^{u} i \Pr [\lambda _t = i] \Pr [T_I \ge t \mid \lambda _t = i] \\&= \Pr [T_I \ge t] \sum _{i = 1}^{u} i \Pr [\lambda _t = i] = \Pr [T_I \ge t] E[\lambda _t]. \end{aligned}$$

  3. By the previous condition we have

     $$\begin{aligned} \sum _{t = 1}^{+\infty } E[|\lambda _t| \cdot \mathbb {1}_{\{T_I \ge t\}}]&= \sum _{t = 1}^{+\infty } \Pr [T_I \ge t] E[\lambda _t] = E[\lambda ] E[T_I], \end{aligned}$$

     since for all \(t \in {{\mathbb {N}}}\) we have \(E[\lambda _t] = E[\lambda ]\). By Theorem 5 and Lemma 9, both \(E[\lambda ]\) and \(E[T_I]\) are finite, hence their product is finite as well.

  4. By Theorem 5, \(E[T_I]\) is finite.

Thus, by Wald’s equation we have

$$\begin{aligned} E[T_F] = E[T_I]E[2 \lambda _t]. \end{aligned}$$

By Theorem 5 and Lemma 9 we conclude

$$\begin{aligned} E[T_F] = O(n) \cdot \Theta (1) = O(n).\square \end{aligned}$$

Although we are mostly interested in \(\beta \in (2, 3)\) and a reasonably high upper limit u, a reader might find it interesting to see the upper bounds on the runtimes yielded by different parameter values. For this reason, we show the estimates for \(E[T_I]\) and \(E[T_F]\) for a wider range of parameter values in Table 2; their proofs are included in the “Appendix”.

Table 2 Upper bounds on the expected number of iterations and expected number of fitness evaluations for different values of \(\beta \) and u

In the proofs of Theorems 5 and 6 we aimed only at asymptotic upper bounds and disregarded the leading constants in order not to reduce the readability of the paper. However, to complete the picture, we estimate (without proof) the leading constant delivered by our arguments.

Recall that \(C(\beta ) = \frac{1}{12}C_{\beta , u} C'\). From the proof of Lemma 7 in [9] one can show that the constant \(C'\) used in Lemma 7 is at least \(\frac{1}{e}(1 - \exp (-\exp (-\frac{3}{2}))) \approx 0.0735\). For any upper bound \(u = \omega (1)\) we also have \(C_{\beta , u} \ge \frac{\beta - 1}{\beta }\). Hence, we obtain the following upper bound on the leading constant.

$$\begin{aligned} \frac{1}{C(\beta )}\left( 1 + \frac{2}{3 -\beta }\right)&\le \frac{12\beta (5 - \beta )}{(3 - \beta )(\beta - 1)C'} \approx 164\frac{\beta (5 - \beta )}{(3 - \beta )(\beta - 1)}. \end{aligned}$$

Taking into account the leading constant hidden in Lemma 9, which is \(\frac{\beta - 1}{\beta - 2}\) if \(\beta > 2\), we estimate the upper bound on the leading constant for \(E[T_F]\) delivered by Theorem 6 as

$$\begin{aligned} 328 \cdot \frac{\beta (5 - \beta )}{(3 - \beta )(\beta - 2)}. \end{aligned}$$
(1)
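For example, for \(\beta = 2.5\) this bound evaluates to \(328 \cdot \frac{2.5 \cdot 2.5}{0.5 \cdot 0.5} = 8200\). As the experiments in Sect. 4 indicate, this estimate is far too pessimistic.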

3.2 Lower Bound

In this section we prove the tightness of our upper bounds by showing a lower bound of \(\Omega (n)\) fitness evaluations for the runtime of the fast \((1 + (\lambda , \lambda ))\) GA on OneMax. This is a special case of a deeper result [31], which showed the same lower bound for all comparison-based algorithms (which the \((1 + (\lambda , \lambda ))\) GA is). For the readers’ convenience, we give an elementary proof as well.

Theorem 10

The expected runtime of the fast \((1 + (\lambda , \lambda ))\) GA with parameter \(\beta \in {{\mathbb {R}}}\) and any upper limit \(u \in {{\mathbb {N}}}\) on the OneMax function is at least \(\Omega (\frac{n}{E[\lambda ]})\) iterations, where \(E[\lambda ]\) is estimated as in Lemma 9, and \(\Omega (n)\) fitness evaluations.

Proof

The progress in one iteration cannot be greater than the number \(\ell \) of bits which we flip in each mutant, since we cannot obtain more than \(\ell \) new one-bits in the winner \(x'\) of the mutation phase. Therefore, after we have sampled \(\lambda \), the expected progress is

$$\begin{aligned} E[f(y) - f(x) \mid \lambda ] \le E[\ell \mid \lambda ] = \lambda . \end{aligned}$$

The expected progress in one iteration thus is

$$\begin{aligned} E[f(y) - f(x)] = \sum _{i = 1}^u \Pr [\lambda = i] E[f(y) - f(x) \mid \lambda = i] \le E[\lambda ]. \end{aligned}$$

Let \(x_0\) be the initial individual. Since it is chosen uniformly at random, its expected fitness is \(E[f(x_0)] = \frac{n}{2}\). Hence, by the additive drift theorem [19] the expectation of the number of iterations \(T_I\) before the algorithm finds the optimum is at least

$$\begin{aligned} E[T_I] \ge \frac{n - E[f(x_0)]}{E[\lambda ]} = \frac{n}{2E[\lambda ]}. \end{aligned}$$

Now we can use Wald’s equation as we did in the proof of Theorem 6. We obtain

$$\begin{aligned} E[T_F] = E[T_I]E[2\lambda ] \ge \frac{n}{2 E[\lambda ]} \cdot 2E[\lambda ] = n. \end{aligned}$$

\(\square \)

4 Experiments

Our theoretical findings show that the fast \((1 + (\lambda , \lambda ))\) GA with the natural choice \(\beta \in (2,3)\) has a linear runtime on OneMax, which matches the performance of the self-adjusting \((1 + (\lambda , \lambda ))\) GA. Due to their asymptotic nature, our results cannot indicate which of the two linear-time algorithms is faster, how the fast \((1 + (\lambda , \lambda ))\) GA compares with other algorithms on reasonable problem sizes, and how its performance depends on \(\beta \in (2,3)\). For the latter, the only estimate we have from theory, Eq. (1), provides a very large upper bound on the constant factor; being an upper bound without a matching lower bound, it gives little guidance on the best choice of \(\beta \) within (2, 3). To answer these questions, but also to investigate the performance on a slightly less artificial problem, we performed a series of experiments.

As algorithms, we regarded randomized local search (RLS) and the \((1 + 1)\) EA with standard bit mutation, as well as the self-adjusting \((1 + (\lambda , \lambda ))\) GA, which controls \(\lambda \) (and thus \(p = \lambda / n\) and \(c = 1/\lambda \)) via the one-fifth success rule [8].

We have also considered the version of the \((1 + (\lambda , \lambda ))\) GA with the one-fifth rule with an upper limit of \(2 \ln (n+1)\) on the value of \(\lambda \), introduced in [7], since it showed a much better performance on the MAX-3SAT problem than the version without this upper limit. For the same reason, we also consider the fast \((1 + (\lambda , \lambda ))\) GA with the same upper limit of \(2 \ln (n+1)\) on the value of \(\lambda \), which is imposed by setting the distribution parameter u to \(2 \ln (n+1)\). To investigate the effect of varying u further, we also conduct a series of experiments with a fixed problem size n and different values of u.

For the fast \((1 + (\lambda , \lambda ))\) GA, we used the values of \(\beta \in \{2.1, 2.3, 2.5, 2.7, 2.9\}\) unless noted otherwise. In all the adaptive versions of the \((1 + (\lambda , \lambda ))\) GA, the initial value of \(\lambda \) is set to 1.

The source code used to perform these experiments is part of a larger project dedicated to the \((1 + (\lambda , \lambda ))\) GA, available on GitHub and as supplementary material for this paper.

4.1 Implementation Details and Their Discussion

In all runs we used slightly modified versions of the algorithms to avoid counting obviously unnecessary fitness evaluations. The particular changes are as follows.

  • In the \((1 + 1)\) EA, if standard bit mutation flips zero bits, then we resample the offspring until it is different from the parent. This is equivalent to not counting the fitness evaluation of the offspring identical to the parent.

  • In all versions of the \((1 + (\lambda , \lambda ))\) GA, we resample \(\ell \) until \(\ell \ne 0\) (a minimal sketch of this resampling is given after this list). This is equivalent to not counting the fitness evaluations in iterations with \(\ell = 0\), because in such iterations all offspring are identical to the parent. In the crossover phase, samples taking all bits from the parent x are repeated (without evaluating the fitness of the copy of the parent) and samples taking all bits from the mutation winner \(x'\) are not evaluated (that is, they do not count towards the number of fitness evaluations). Additionally, \(x'\) also participates in the selection of the best among x and the crossover results \(y^{(i)}\). When there is a tie, the crossover winner has a higher priority than \(x'\).
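The resampling of \(\ell \) mentioned in the list can be implemented by simple rejection, as in the following sketch (the function name is ours):

```python
import random

def sample_positive_binomial(n, p):
    """Sample ell ~ Bin(n, p) conditioned on ell != 0 by rejection sampling,
    mirroring the practice-aware resampling described in the list above."""
    while True:
        ell = sum(random.random() < p for _ in range(n))
        if ell > 0:
            return ell
```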

We consider these natural modifications instead of the original algorithms in this section, since we are confident that anyone implementing these algorithms for solving practical problems would do the same. For a practitioner it does not make sense to waste fitness evaluations on individuals which are identical to their parents, while in theoretical works these are often counted since constant factors are frequently ignored. We note that similar modifications of algorithms were called practice-aware in [27]. We also note that there are many more ways to tune the runtime of the \((1 + (\lambda , \lambda ))\) GA in a practical application, see, e.g., [17]. In contrast to the modifications described above, for these it is not clear to what extent they are useful in general or only for particular problems. For this reason, we did not consider them in this work.

Clearly, our theoretical results from Sect. 3 apply to these mildly modified algorithms. For the upper bounds it is enough to note that by resampling identical individuals and by having \(x'\) participate in the selection, the probability of progress in one iteration only increases. Thus, repeating the arguments from Theorem 5 we obtain the same upper bound on the expected number of iterations. Since our implementation does not affect the choice of \(\lambda \), its expected value \(E[\lambda ]\) stays the same. The cost of one iteration is at most \(2\lambda \) (but can be smaller). Thus, by Wald’s equation we obtain the same upper bound on the expected number of fitness evaluations as in Theorem 6. For the lower bound we use the same arguments as in Theorem 10, with the only change that since we cannot have \(\ell = 0\), we now have

$$\begin{aligned} E[\ell \mid \lambda ] = \frac{\lambda }{1 - \left( 1 - \frac{\lambda }{n}\right) ^n} \le \frac{\lambda }{1 - \frac{1}{e}}, \end{aligned}$$

which still gives us a lower bound of \(\Omega (n)\) fitness evaluations.

4.2 Experimental Setup

The experiments were performed on the OneMax function and on random satisfiable instances of the MAX-3SAT problem, that is, the problem of maximizing the number of satisfied clauses in a Boolean formula represented in conjunctive normal form. The second problem was chosen for two reasons. First, it is a more practical problem than OneMax, second, there are already theoretical and empirical results for the \((1 + (\lambda , \lambda ))\) GA on this function (see [7]). For this problem on n variables, the number of clauses was chosen to be \(4 n \ln n\). An all-ones bit string is assumed to be a planted optimal solution; this is without loss of generality, as all considered algorithms are unbiased. For each clause, three participating variables and their signs (i.e., whether it is negated or not) are sampled uniformly and independently until this clause is satisfied by the planted solution (that is, not all three variables are negated). Note that these are easy instances of the MAX-3SAT problem, so the presented results on this problem should not be considered as if the proposed algorithms are competitive in solving this problem in general. However, these instances have a lower fitness-distance correlation, which makes them harder in particular for the \((1 + (\lambda , \lambda ))\) GA.

To speed up the experiments, we used the incremental fitness evaluation technique, which is more commonly seen in gray-box optimization and in problem-aware solvers. We note that this only led to a faster implementation of the algorithms, not to a different algorithm behavior. In particular, the numbers of iterations and of fitness evaluations performed are not affected. We modified the implementation as follows.

During mutation we do not copy the parent individual, but instead directly generate the bit indices which are different in the parent and the offspring (the “patch”). Following that, we evaluate the fitness of the offspring based on the fitness of the parent and the patch. For RLS and the \((1+1)\) EA, if the new fitness is at least as good as the one of the parent, we apply the patch to the parent, turning it into the offspring. For the \((1 + (\lambda , \lambda ))\) GA, we select the best patch out of all the mutants’ patches (based on their fitness values). The subsequent applications of crossover translate to subsamplings of that patch, so that fitness evaluation is again based on the parent’s fitness.

For OneMax, evaluation of the offspring’s fitness based on the parent’s fitness and the patch is rather straightforward: only the bits at the affected indices are checked. This results in an expected O(1) amount of work per each iteration of both RLS and the \((1+1)\) EA, and in the \(\Theta (\lambda ^2)\) amount of work for the \((1 + (\lambda , \lambda ))\) GA, which still helps much because \(\lambda \) is typically much smaller than n.
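For illustration (our addition, with hypothetical helper names), a patch-based incremental OneMax evaluation can look as follows; only the flipped indices are inspected, giving O(|patch|) instead of O(n) work per offspring:

```python
import random

def mutation_patch(n, ell):
    """The set of indices at which the offspring differs from the parent."""
    return random.sample(range(n), ell)

def incremental_one_max(x, fx, patch):
    """OneMax fitness of the offspring, given the parent's fitness fx and the
    patch: each flipped one-bit loses 1, each flipped zero-bit gains 1."""
    return fx + sum(1 - 2 * x[i] for i in patch)

def apply_patch(x, patch):
    """Turn the parent into the offspring in place (called only on acceptance)."""
    for i in patch:
        x[i] = 1 - x[i]

# usage sketch
x = [random.randint(0, 1) for _ in range(100)]
fx = sum(x)
patch = mutation_patch(len(x), 3)
fy = incremental_one_max(x, fx, patch)
if fy >= fx:
    apply_patch(x, patch)
    fx = fy
```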

For MAX-3SAT, the incremental evaluation is more difficult as it involves some preprocessing on the side of the fitness function. It amounts to constructing lists of clauses affected by the changed bits and to evaluating the satisfaction status of these clauses before and after the change. For the logarithmic density of clauses (that is, the ratio of the number of clauses to the number of variables) employed in this paper, this amounts to \(\Theta (\log n)\) expected work per iteration of RLS and the \((1+1)\) EA, and to \(\Theta (\lambda ^2\log n)\) expected work for the \((1 + (\lambda , \lambda ))\) GA, which is still faster than direct evaluation, but less efficient than what is possible for OneMax.
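Again as an illustration under the same caveats, one possible implementation of this preprocessing and delta evaluation for the planted MAX-3SAT instances is sketched below:

```python
import random
from collections import defaultdict

def make_instance(n, clauses):
    """Random planted MAX-3SAT instance whose optimum is the all-ones string."""
    formula = []
    while len(formula) < clauses:
        vars_ = random.sample(range(n), 3)
        signs = [random.random() < 0.5 for _ in range(3)]  # True = negated
        if not all(signs):  # clause must be satisfied by the all-ones assignment
            formula.append(list(zip(vars_, signs)))
    return formula

def clause_index(formula):
    """Preprocessing: for each variable, the list of clauses containing it."""
    index = defaultdict(list)
    for c, clause in enumerate(formula):
        for v, _ in clause:
            index[v].append(c)
    return index

def satisfied(clause, x):
    return any(x[v] != negated for v, negated in clause)

def incremental_sat_fitness(formula, index, x, fx, patch):
    """Fitness after flipping the bits in patch, touching only affected clauses."""
    affected = {c for v in patch for c in index[v]}
    before = sum(satisfied(formula[c], x) for c in affected)
    for v in patch:
        x[v] = 1 - x[v]
    after = sum(satisfied(formula[c], x) for c in affected)
    for v in patch:
        x[v] = 1 - x[v]  # undo the flips; the caller decides whether to keep them
    return fx - before + after
```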

We also note that the particular structure of all the considered algorithms makes it possible to optimize the memory requirements: the memory used by RLS and the \((1+1)\) EA is \(\Theta (n)\) words, resulting from storing a single bit vector, whereas the \((1 + (\lambda , \lambda ))\) GA uses \(\Theta (n + \lambda )\) words in expectation, as only the best patches for each of the phases need to be stored.

In our experiments we chose the problem sizes n to be powers of two, so that the asymptotic behavior of the algorithms is easier to investigate visually. For OneMax, we limit the problem size to \(2^{22}\), and for MAX-3SAT, the upper limit is \(2^{16}\). These sizes were dictated by the affordable computation times. We did not reach the size of \(2^{20}\) on MAX-3SAT as in [7], because the incremental fitness evaluations have a weaker impact with fast mutation. Indeed, whenever \(\lambda \) is sampled from a heavy-tailed distribution, the distribution of \(\lambda ^2\), and hence of the wall-clock running time, has an even heavier tail, so occasional high values of \(\lambda \) result in very expensive iterations. For each algorithm, each problem setting, and each problem size, 100 independent runs were performed. For the MAX-3SAT problem, a new random instance was created for each run.

Our runtime results are shown in Figs. 3, 4, 5 and 6. In Figs. 3, 4 and 5 the x-axis indicates the problem size in a logarithmic scale, and the y-axis indicates the ratio of the runtime to the problem size. In this visualization a linear runtime results in a horizontal plot, and any runtime in \(\Theta (n \log n)\) gives a linearly increasing plot.

4.3 Runtimes on OneMax

Fig. 3 Mean runtimes and their standard deviations of different algorithms on the OneMax benchmark problem. By \(\lambda \in [1\cdots u]\) we denote the self-adjusting parameter choice via the one-fifth rule in the interval \([1\cdots u]\). The indicated interval for each value X is \([E[X] - \sigma (X), E[X] + \sigma (X)]\), where \(\sigma (X)\) is the standard deviation of X. We write \({{\,\mathrm{lnp}\,}}x := \ln (x+1)\). By pow(x) we denote the power-law distribution with parameters \(u = n\) and \(\beta = x\)

In Fig. 3 we show the results of the runs on the OneMax function. If we do not consider \(\beta = 2.1\), which turns out to be too small (and therefore gives a too large expected value of \(\lambda \)), then all versions of the fast \((1 + (\lambda , \lambda ))\) GA start outperforming the \((1 + 1)\) EA already at problem size \(n = 2^{10}\) and then outperform RLS at \(n = 2^{20}\) or earlier. Recalling the discussion after the proof of Theorem 6, we note that our estimate of the leading constant in the runtime was overly pessimistic, otherwise we would have no chance to outperform RLS on these problem sizes.

The one-fifth rule shows a much better performance and yields a runtime of the \((1 + (\lambda , \lambda ))\) GA which is very close to linear already from \(n = 2^{10}\) on, for both linear and logarithmic upper bounds on \(\lambda \). The plots for the heavy-tailed choice of \(\lambda \) do not look horizontal yet, but they show a clear tendency to flatten at larger problem sizes. The runtimes for all \(\beta \) except \(\beta = 2.1\) are quite well concentrated, as are the runtimes of the \((1 + (\lambda , \lambda ))\) GA with the one-fifth rule, in contrast to the runtimes of the \((1 + 1)\) EA and RLS. We have no results for \(\beta = 2.1\) for problem sizes \(n \ge 2^{21}\) and for \(\beta = 2.3\) for \(n \ge 2^{22}\), since these runs were too expensive (in terms of computational resources) and most likely not too insightful.

Figure 3 also features the runtime plot of an asymptotically optimal static choice for \(\lambda \). It has been proven in [8] that the theoretically asymptotically optimal static choice is \(\lambda = \Theta (\sqrt{\frac{\ln (n)\ln \ln (n)}{\ln \ln \ln (n)}})\). By using \({{\,\mathrm{lnp}\,}}(n) := \ln (n+1)\) instead to avoid issues with logarithms of too small values, and by fitting the outer constant factor using auxiliary experiments with fixed \(\lambda \in [2\cdots 12]\), we have found that \(\lambda = 2\sqrt{\frac{{{\,\mathrm{lnp}\,}}(n){{\,\mathrm{lnp}\,}}{{\,\mathrm{lnp}\,}}(n)}{{{\,\mathrm{lnp}\,}}{{\,\mathrm{lnp}\,}}{{\,\mathrm{lnp}\,}}(n)}}\) approximates the optimal choices quite well, so we have used the version of the \((1 + (\lambda , \lambda ))\) GA with this choice in our plots. We also see that with the choice of \(\beta = 2.5\) the fast \((1 + (\lambda , \lambda ))\) GA outperforms the statically optimal parameter choice at problem sizes \(n \ge 2^{20}\).

Fig. 4 Mean runtimes and their standard deviations of different algorithms on MAX-3SAT instances with \(4n\ln (n)\) clauses. By \(\lambda \in [1\cdots u]\) we denote the self-adjusting parameter choice via the one-fifth rule in the interval \([1\cdots u]\). The indicated interval for each value X is \([E[X] - \sigma (X), E[X] + \sigma (X)]\), where \(\sigma (X)\) is the standard deviation of X. By pow(x) we denote the power-law distribution with parameters \(u = n\) and \(\beta = x\)

4.4 Runtimes on MAX-3SAT

Figure 4 shows the results of the experiments on the MAX-3SAT problem. As previously shown in [7], large values of \(\lambda \) can be harmful here. For this reason, the \((1 + (\lambda , \lambda ))\) GA with the unbounded one-fifth rule is outperformed even by the simple \((1 + 1)\) EA. The authors of [7] proposed to cap the value of \(\lambda \) at \(2\ln (n + 1)\), which greatly improved the performance, up to the point that RLS was outperformed on this problem.

As we see in Fig. 4, the fast \((1 + (\lambda , \lambda ))\) GA is quite efficient even without an upper limit on \(\lambda \). Except for the case \(\beta = 2.1\), we managed to outperform the \((1 + 1)\) EA and the self-adjusting \((1 + (\lambda , \lambda ))\) GA without an upper limit on \(\lambda \). Nevertheless, RLS and the self-adjusting \((1 + (\lambda , \lambda ))\) GA with a logarithmic cap on \(\lambda \) remained faster.

The runtimes of all algorithms appear super-linear in the plots.

Fig. 5 Mean runtimes and their standard deviations of different algorithms on MAX-3SAT instances with \(4n\ln (n)\) clauses with logarithmically capped population sizes. By \(\lambda \in [1\cdots u]\) we denote the self-adjusting parameter choice via the one-fifth rule in the interval \([1\cdots u]\). The indicated interval for each value X is \([E[X] - \sigma (X), E[X] + \sigma (X)]\), where \(\sigma (X)\) is the standard deviation of X

4.5 Effects of Capping for MAX-3SAT

Since large values of \(\lambda \) are apparently not helpful when optimizing these MAX-3SAT instances (due to the weaker fitness-distance correlation), we conducted some experiments with the fast \((1 + (\lambda , \lambda ))\) GA choosing \(\lambda \) from a power-law distribution on a smaller range \([1\cdots u]\) of values. Based on previous experience, we started with an upper limit of \(u=2 \ln (n+1)\). These results are presented in Fig. 5.

Using this upper limit reduced the computational burden associated with heavy-tailed distributions and allowed us to regard problem sizes up to \(2^{19}\). The upper limit also led to a better performance in terms of fitness evaluations. When comparing Figs. 4 and 5 around the problem size \(n=2^{16}\), we see that for \(\beta \in \{2.1, 2.3\}\) a significant speed-up was obtained, whereas for \(2.5 \le \beta \le 2.9\) the differences of the corresponding mean running times are negligible. This is not surprising, given that for smaller values of \(\beta \) the inefficient high values of \(\lambda \) are sampled more often. Interestingly, in combination with the upper limit, small values of \(\beta \) gave the best performance. This suggests that it is important to use moderately large values of \(\lambda \) often and that only too large values lead to efficiency losses.

Fig. 6 Mean runtimes and their standard deviations of different algorithms on MAX-3SAT instances with \(4n\ln (n)\) clauses for different capping values. The problem size is \(n=2^{16}\). The indicated interval for each value X is \([E[X] - \sigma (X), E[X] + \sigma (X)]\), where \(\sigma (X)\) is the standard deviation of X

To investigate the effect of the particular choice of the upper limit u on the running time for various values of \(\beta \), we performed additional experiments where the problem size was fixed to \(n=2^{16}\) but the upper limit was varied. Figure 6 presents these results, where u was taken from the set \(\{ 2^2, 2^3, \ldots , 2^{13}\}\). Note that high values of u again prevented us from choosing a higher problem size. For reference, we also plot the performance of the \((1+1)\) EA on the same problem size.

The plots in Fig. 6 indicate that for \(2.1 \le \beta \le 2.5\) the dependency on the upper limit has a clear optimum: too small values of u prevent the \((1 + (\lambda , \lambda ))\) GA from choosing the more efficient mid-size values of \(\lambda \), while too high values of u lead to sampling too large values of \(\lambda \) too often, which have little chance of making progress and at the same time are very costly. It can be seen, however, that already for \(\beta = 2.5\) the subsequent increase of the running time is not very pronounced. For higher values of \(\beta \) the behavior tends to be monotone, up to random deviations of the mean running time. This indicates that the sensitivity to the upper limit of the distribution is not large even in practice.

4.6 Summary of Experimental Results

Summing up, from the results of the experiments we conclude the following three points.

  • The fast \((1 + (\lambda , \lambda ))\) GA performs generally well, often beating the classic mutation-based algorithms. On OneMax, the self-adjusting \((1 + (\lambda , \lambda ))\) GA, both without and with an upper limit of \(u = 2 \ln (n+1)\), is superior; on MAX-3SAT, only the version with the upper limit and RLS are superior.

  • The fast \((1 + (\lambda , \lambda ))\) GA can easily be used as a parameterless algorithm, and this is what we suggest. We note that the \((1 + (\lambda , \lambda ))\) GA with the asymptotically optimal static parameter setting could not beat the fast \((1 + (\lambda , \lambda ))\) GA on OneMax. The self-adjusting \((1 + (\lambda , \lambda ))\) GA without an upper limit was superior on OneMax, but significantly inferior on MAX-3SAT. The version with upper limit \(u = 2 \ln (n+1)\) was superior on both OneMax and MAX-3SAT. We still do not want to advertise this approach, since such limits are clearly problem-specific and non-trivial to find. The logarithmic limit for MAX-3SAT is based on a substantial mathematical analysis [7] of these particular MAX-3SAT instances. For other problems, such a limit may be detrimental, e.g., it may make it hard to leave a local optimum with a large basin of attraction.

  • The choice of \(\beta \) does not play a big role as long as it is not too close to the borders of the interval (2, 3). Taking \(\beta \) between 2.5 and 2.7 might be a good general recommendation.

5 Conclusion

In this first runtime analysis of a crossover-based algorithm using the fast mutation operator, we observed that the fast mutation operator not only can relieve the algorithm designer from the task of choosing a suitable mutation rate, but it can also lead to runtimes asymptotically better than any static choice of the mutation rate.

Different from previous works, where any power-law exponent greater than one could be used, our work requires that \(\beta \) is between 2 and 3. We note, however, that power-law distributions are often used with exponents in the open interval (2, 3), and this for good reason: in this regime, we have a heavy tail (as opposed to \(\beta > 3\)), but we still have a constant expectation (as opposed to \(\beta < 2\)). Since the complexity of a single iteration is \(\Theta (\lambda )\), having a constant expectation \(E[\lambda ]\) is very natural.

On the technical side, our work shows that algorithms with a heavy-tailed number of offspring can be much easier to analyze than those with a self-adjusting number of offspring (such as the self-adjusting \((1 + (\lambda , \lambda ))\) GA [8]), since Wald’s equation allows one to estimate the expected runtime as the product of the expected number of iterations and the expected number of offspring generated in one iteration.

The natural question arising from this work is for which other algorithms and problems such a speed-up can be obtained. Natural candidates are other crossover-based algorithms, or algorithms where dynamic parameter choices can yield a speed-up over static choices. We note that after this research was conducted, it was found that the \((1 + (\lambda , \lambda ))\) GA with two of its parameters chosen independently from heavy-tailed distributions has a good performance on jump functions [3]. This performance is slightly inferior to the one with optimal static parameters [5]; however, these were non-trivial to find, as they deviate significantly from the previous recommendations.