1 Introduction

The runtime analysis of randomized search heuristics has made significant progress over the past two decades. A broad variety of randomized search heuristics, for example evolutionary algorithms (EAs), ant colony optimization and randomized local search have been considered for specific artificial functions, and also for combinatorial optimization problems. The analysis of evolutionary algorithms on simple pseudo-Boolean functions has been tremendously fruitful and led to a wide range of results and techniques. In most cases, runtime analysis is performed in an asymptotic fashion with respect to the expected optimization time (see [1, 15, 20] for an overview).

In the past couple of years there has been a surge of interest in non-asymptotic analyses of EAs, i. e. in determining the runtime exactly, up to lower order terms. For example, bounds on the expected runtime of the well-known (1+1) EA on the classical OneMax function have enjoyed a series of improvements. The bound \(\varTheta (n\ln n)\) on the expected runtime has been known for a long time and goes back to Droste et al. [9]. The first bound that is tight up to lower order terms, \((1\pm o(1))en\ln n\), was given by Witt [24] for the (1+1) EA on general linear pseudo-Boolean functions with positive weights. Recently, the expected runtime of the (1+1) EA on OneMax has been stated exactly, up to subconstant terms [14].

For a long time, the optimal mutation rate has not been of particular interest. By a rule of thumb the mutation rate was set to 1 / n in many applications, i. e. every bit is flipped independently with probability 1 / n. For the (1+1) EA on linear functions, 1 / n is known to be the asymptotically optimal mutation rate [9, 24]. Böttcher et al. [3] showed that the optimal mutation rate for the (1+1) EA on the LeadingOnes problem is in fact \({\approx }1.59/n\), way above the common practice of using 1 / n. Doerr and Goldberg [8] showed for the (1+1) EA that the asymptotic bound of \(\varTheta (n\ln n)\) holds for a mutation rate of c / n for the optimization of linear pseudo-Boolean functions, where c is an arbitrary constant. Chicano et al. [5] showed that for the (1+1) EA on OneMax and small problem sizes n, the optimal mutation rate can be up to 50% higher than the asymptotically optimal 1 / n. Badkobeh et al. [4] showed for the (1+\(\lambda \)) EA on OneMax that the optimal mutation rate increases with \(\lambda \) in an adaptive setting.

The exact relationship between the static mutation rate and the runtime of the (1+1) EA was revealed by Witt in the aforementioned work [24] by showing that the runtime is \((1\pm o(1))\frac{e^c}{c}n\ln n\) for general linear pseudo-Boolean functions which is asymptotically tight up to lower order terms. Here, the leading constant \(\frac{e^c}{c}\) is minimized for \(c=1\), thus justifying the unwritten rule of setting the mutation rate to 1 / n in many applications. This result was refined and extended to populations by Gießen and Witt [12], who gave the asymptotically tight bound

$$\begin{aligned} (1\pm o(1))\left( \frac{e^{c}}{c}\cdot \frac{n\ln n}{\lambda } +\frac{1}{2}\cdot \frac{n\ln \ln \lambda }{\ln \lambda }\right) \end{aligned}$$

for the number of generations needed by the (1+\(\lambda \)) EA on OneMax with mutation rate c / n.

The impact of the mutation rate on the lower order term remains unclear though. Since the lower order term is only known asymptotically, small problem sizes might benefit from a mutation rate that deviates from 1 / n, as seen in [5].

The goal of this paper is to find optimal mutation rates for the (1+\(\lambda \)) EA on OneMax for constant \(\lambda \). To this end, a new drift theorem is given that provides a way of bounding the runtime if the exact drift values are known. We then give an exact formula for the drift by analyzing the distribution of the difference of two binomially distributed random variables and show how to efficiently approximate it by means of the Poisson distribution, such that the multiplicative error of the approximation is \((1\pm O(1/n))\). By applying our new drift theorem with the approximated drift values, we are able to give approximate optimal values of c in a computationally efficient way avoiding the need for empirical investigations.

The paper is structured as follows. In Sect. 2 we state the (1+\(\lambda \)) EA and the drift theorems used throughout the paper. In particular, a new drift theorem for lower bounds is given. In Sect. 3 we present our main result that the lower bounds from the new drift theorem and the upper bounds on the expected runtime are only apart by a lower order term, provided the exact drift is known. In Sect. 4 we give an exact closed-form expression for the drift. Moreover, we show how to approximate it with only small asymptotic error in a computationally efficient way. Section 5 deals with the practical implications of our theoretical results. We show that by combining our main result with the approximated drift we can approximate the expected runtime with only small relative error. In Sect. 5.1 we exploit our findings by determining approximate optimal mutation rates for various settings of n and \(\lambda \). Finally, Sect. 5.2 demonstrates by an empirical analysis of the expected runtime that our approximately optimal mutation rates (whose error provably affects only a lower order term of the expected runtime) reflect the actual runtimes very well.

2 Preliminaries

2.1 Algorithm

We consider the (1+\(\lambda \)) EA on pseudo-Boolean functions \(f :\{0,1\}^n \rightarrow \mathbb {R}\) in the minimization version, defined as Algorithm 1. The case of \(c=1\) in the mutation probability was considered in [10, 11, 16]. The classical (1+1) EA [1] is a special case of the (1+\(\lambda \)) EA for \(\lambda =1\) and \(c=1\). Throughout the paper, c and \(\lambda \) are for simplicity assumed to be constant, i. e., they may not depend on n. With minor effort, our results can be generalized to larger values, e. g., to \(c=O(\ln n)\) and \(\lambda =O(\ln n)\), at the expense of additional logarithmic factors in the error bounds.

Algorithm 1: The (1+\(\lambda \)) EA with mutation probability c / n for minimizing \(f:\{0,1\}^n\rightarrow \mathbb {R}\)
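For illustration, a minimal Python sketch of the algorithm, specialized to the minimization of OneMax (uniform initialization, standard bit mutation with probability c / n, elitist selection among the parent and the \(\lambda \) offspring), could look as follows; all identifiers are purely illustrative and not taken from the paper.

```python
import random

def one_max(x):
    # number of one-bits; the unique optimum of the minimization version is the all-zero string
    return sum(x)

def one_plus_lambda_ea(n, lam, c, max_generations=10**7):
    """Sketch of the (1+lambda) EA minimizing OneMax with mutation probability c/n.

    Returns the number of generations until the all-zero string is found
    (capped by max_generations, which is only a safeguard for this sketch).
    """
    p = c / n
    parent = [random.randint(0, 1) for _ in range(n)]           # uniform initialization
    fitness = one_max(parent)
    generations = 0
    while fitness > 0 and generations < max_generations:
        best, best_f = None, None
        for _ in range(lam):
            # standard bit mutation: flip each bit independently with probability c/n
            child = [1 - b if random.random() < p else b for b in parent]
            f = one_max(child)
            if best_f is None or f < best_f:
                best, best_f = child, f
        if best_f <= fitness:                                    # elitist selection
            parent, fitness = best, best_f
        generations += 1
    return generations
```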

The runtime of the (1+\(\lambda \)) EA is defined to be the smallest \(t\in \mathbb {N}\) such that an individual of minimum f-value is found. These individuals are called optima. This notion of runtime coincides with the number of iterations (also called generations) until an optimum is found. Since in each generation \(\lambda \) offspring have to be evaluated, the number of function evaluations, which is another classical cost measure, is by a factor of \(\lambda \) larger than the runtime as defined here. However, assuming a massively parallel architecture that allows for parallel evaluation of the offspring, counting the number of generations seems a valid cost measure. In particular, a speed-up on the function \(\textsc {OneMax} (x_1,\ldots ,x_n):=x_1+\cdots +x_n\) by increasing \(\lambda \) can only be observed in terms of the number of generations.

In the rest of the paper we will focus on the minimization of the classical function \(\textsc {OneMax} (x_1,\ldots ,x_n):=x_1+\cdots +x_n\), which attains its unique optimum, i. e. minimum, in the all-zero-bitstring.

2.2 Drift Theorems

Our analyses use variable drift analysis, a state-of-the-art technique for the analysis of expected optimization times. We first state a well-known theorem for upper bounds (see [17, 19, 21]); we use a general version that was proposed in [18].

Theorem 1

(Variable Drift, Upper Bound; [18]) Let \((X_t)_{t\ge 0}\) be a stochastic process adapted to a filtration \(\mathcal {F}_t\) over some state space \(S\subseteq \{0\}\cup [x_{\min },x_{\max }]\), where \(x_{\min }>0\). Let \(h(x):[x_{\min },x_{\max }]\rightarrow \mathbb {R}^+\) be a monotone increasing function such that 1 / h(x) is integrable on \([x_{\min },x_{\max }]\) and \(E(X_t-X_{t+1} \mid \mathcal {F}_t) \ge h(X_t)\) if \(X_t\ge x_{\min }\). Then it holds for the first hitting time \(T:=\min \{t\mid X_t=0\}\) that

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) } \le \frac{x_{\min }}{h(x_{\min })} + \int \limits _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x. \end{aligned}$$
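To illustrate the theorem, consider the (1+1) EA minimizing OneMax with mutation probability 1 / n: at distance x from the optimum, the drift is at least \(x\cdot (1/n)(1-1/n)^{n-1}\ge x/(en)\), so Theorem 1 with \(x_{\min }=1\) and \(h(x)=x/(en)\) yields

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) } \le en + \int \limits _{1}^{X_0} \frac{en}{x} \,\mathrm {d}x = en(1+\ln X_0) \le en\ln n + en, \end{aligned}$$

which recovers, up to lower order terms, the \(en\ln n\) bound mentioned in the introduction.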

To prove lower bounds on the hitting time by variable drift, we need additional assumptions. The first variable drift theorem for lower bounds that we are aware of goes back to [7]. It requires that the process does not make large steps towards the optimum; more precisely, from state x it may only go to states \(y\ge c(x)\) for some function c(x). This deterministic requirement on the progress is weakened to a stochastic one in the following theorem. Moreover, the condition in item (v) is weaker than in [7]. Instead of demanding \(E(X_t-X_{t+1} \mid \mathcal {F}_t) \le h(\xi (X_t))\), we allow for h-functions that are step functions. This can be useful for discrete state spaces. Somewhat simplifying, if the state space is \(\mathbb {N}\), the drift at point i equals i and \(\xi (i)=i-1\), we can use \(h(x):=\lceil x\rceil \) in the new variant, while the original variant would require a continuous function such as \(h(x)=x+1\). As \(\int 1/\lceil z\rceil \, \mathrm {d}z \ge \int 1/(z+1) \,\mathrm {d}z\), the step function gives an improved lower bound.

Theorem 2

(Variable Drift, Lower Bound) Let \((X_t)_{t\ge 0}\) be a stochastic process adapted to a filtration \(\mathcal {F}_t\) over some state space \(S\subseteq \{0\}\cup [x_{\min },x_{\max }]\), where \(x_{\min }>0\). Suppose there exist

(i) two functions \(\xi ,h:[x_{\min },x_{\max }]\rightarrow \mathbb {R}^+\) such that h(x) is monotone increasing and 1 / h(x) integrable on the interval \([x_{\min },x_{\max }]\),

(ii) \(\beta >0\),

and, using

$$\begin{aligned} g(x) := \frac{x_{\min }}{h(x_{\min })} + \int \limits _{x_{\min }}^x \frac{1}{h(z)} \,\mathrm {d}z, \end{aligned}$$

suppose it holds for all \(t\ge 0\) that

(iii) \(X_{t+1} \le X_t\),

(iv) \({{\mathrm{Pr}}}(X_{t+1} < \xi (X_t)) \le \frac{1}{\beta g(X_t)}\) for \(X_t\ge x_{\min }\),

(v) \(E(X_t-X_{t+1} \mid \mathcal {F}_t) \le \lim _{\delta \downarrow 0} h(\xi (X_t)+\delta )\) for \(X_t\ge x_{\min }\).

Then it holds for the first hitting time \(T:=\min \{t\mid X_t=0\}\) that

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) } \ge \frac{\beta }{1+\beta }g(X_0). \end{aligned}$$

Corollary to the above: If \(\xi \) is invertible and differentiable on the domain, with inverse function \(\xi ^{-1}\) and derivative \(\xi '\), and (iv) and (v) are replaced by the conditions

(iv’) \({{\mathrm{Pr}}}(X_{t+1} < \xi (X_t)) \le \frac{1}{\beta \left( \frac{x_{\min }}{h(\xi ^{-1}(x_{\min }))} + \int _{\xi ^{-1}(x_{\min })}^{\xi ^{-1}(X_t)} \frac{\xi '(x)}{h(x)} \,\mathrm {d}x\right) }\) for \(X_t\ge x_{\min }\),

(v’) \(E(X_t-X_{t+1} \mid \mathcal {F}_t) \le \lim _{\delta \downarrow 0} h(X_t+\delta )\) for \(X_t\ge x_{\min }\),

then the bound on \(\mathord {E}\mathord {\left( T\mid X_0\right) }\) is equivalent to

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) } \ge \frac{\beta }{1+\beta }\left( \frac{x_{\min }}{h(\xi ^{-1}(x_{\min }))} + \int \limits _{\xi ^{-1}(x_{\min })}^{\xi ^{-1}(X_0)} \frac{\xi '(x)}{h(x)} \,\mathrm {d}x\right) . \end{aligned}$$

Proof

We use the potential function \(g(x)=\frac{x_{\min }}{h(x_{\min })} + \int _{x_{\min }}^x \frac{1}{h(z)} \,\mathrm {d}z\) to apply additive drift analysis w. r. t. the drift \(E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t)\). Note that the first bound on \(\mathord {E}\mathord {\left( T\mid X_0\right) }\) given in the theorem follows if we can establish the bound on the drift

$$\begin{aligned} \mathord {E}\mathord {\left( g(X_t)-g(X_{t+1})\mid \mathcal {F}_t\right) } \le 1+\frac{1}{\beta } \end{aligned}$$

for \(X_t\ge x_{\min }\) since \(g(X_0)=\left( \frac{x_{\min }}{h(x_{\min })} + \int _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x\right) \). The second bound on \(\mathord {E}\mathord {\left( T\mid X_0\right) }\) is equivalent to the first one as can be shown by a simple substitution.

We are left with the bound on the drift. By the law of total probability,

$$\begin{aligned}&E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t) \\&= E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t; X_{t+1}\!\ge \!\xi (X_t)) {{\mathrm{Pr}}}(X_{t+1}\!\ge \!\xi (X_t)) \\&\, + E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t; X_{t+1}\!<\!\xi (X_t)) {{\mathrm{Pr}}}(X_{t+1}\!<\!\xi (X_t)) \\&\le E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t; X_{t+1}\!\ge \!\xi (X_t)) \\&\qquad \qquad \qquad \qquad \;+\; g(X_t) \cdot {{\mathrm{Pr}}}(X_{t+1}< \xi (X_t))\\&\le E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t; X_{t+1} \ge \xi (X_t)) \\&\qquad \qquad \qquad \qquad \;+\; g(X_t) \cdot \frac{1}{\beta g(X_t)}\\&= E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t; X_{t+1} \ge \xi (X_t)) \;+\; \frac{1}{\beta }, \end{aligned}$$

where the first inequality estimated the drift in the case \(X_{t+1}<\xi (X_t)\) by \(g(X_t)\) and the second one used the assumption (iv).

We proceed by bounding

$$\begin{aligned}&E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t;X_{t+1}\ge \xi (X_t)) \\&\quad \le E\left( \int _{X_{t+1}}^{X_t} \frac{1}{h(x)}\,\mathrm {d}x \mid \mathcal {F}_t;X_{t+1}\ge \xi (X_t)\right) \\&\quad \le \frac{E(X_t-X_{t+1}\mid \mathcal {F}_t;X_{t+1}\ge \xi (X_t))}{\lim _{\delta \downarrow 0} h(\xi (X_t)+\delta )}\\&\quad \le \frac{E(X_t-X_{t+1}\mid \mathcal {F}_t)}{\lim _{\delta \downarrow 0} h(\xi (X_t)+\delta )}, \end{aligned}$$

where the first inequality follows by expanding g, the second one used that \(X_t\ge X_{t+1}\ge \xi (X_t)\) (which uses (iii)) along with the fact that h is monotone increasing (i. e., non-decreasing) (from (i)), and the third one the definition of conditional probability. Using (v), the last bound simplifies to

$$\begin{aligned} E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t; X_{t+1}\ge \xi (X_t)) \le \frac{\lim _{\delta \downarrow 0} h(\xi (X_t)+\delta )}{\lim _{\delta \downarrow 0} h(\xi (X_t)+\delta )} = 1, \end{aligned}$$

so that

$$\begin{aligned} E(g(X_t)-g(X_{t+1})\mid \mathcal {F}_t) \le 1+\frac{1}{\beta } \end{aligned}$$

as desired. \(\square \)

3 Bringing Together Lower Bounds and Upper Bounds from Variable Drift

We are interested in how far the lower bound from Theorem 2 and the upper bound from Theorem 1 are apart when an exact expression on the drift \(\mathord {E}\mathord {\left( X_t-X_{t+1}\right) }\) is known. Intuitively, this depends to a great extent on how sharply the progress is concentrated around its expected value. If large jumps towards the optimum are sufficiently likely, then the lower bound seems to reflect the truth better. For example, consider the following artificial process: the fitness function is OneMax and the algorithm is the (1+1) EA with the following modified mutation step: with probability 1 / n the optimum is created, and with probability \(1-1/n\) the usual standard-bit mutation operator with mutation probability 1 / n is used. At distance i from the optimum, the drift towards the optimum is \(\varTheta (i/n)\), resulting in an upper bound \(O(n\log n)\) on the expected runtime according to Theorem 1. However, it is easy to see that the actual expected runtime is O(n). This is a consequence of the possibly large jumps directly into the optimum, which happen in every step with probability 1 / n. Note that such steps are very unlikely with the unmodified mutation operator unless the algorithm is very close to the optimum anyway.

We now concentrate on the (1+\(\lambda \)) EA on OneMax, assuming constant c in the mutation probability and constant \(\lambda \). Then the following theorem shows that, using the right drift function, upper and lower bounds on the expected runtime are only apart by a term of lower order, which is indeed considerably lower than the expected runtime on OneMax.

Theorem 3

Consider the (1+\(\lambda \)) EA on OneMax, choosing c and \(\lambda \) as constants. Denoting by \(X_t\) the number of ones in the search point at time \(t\ge 0\), the drift \(\varDelta (i):=\mathord {E}\mathord {\left( X_t-X_{t+1}\mid X_t=i\right) }\) is well defined (independently of t) for \(i\in \{1,\ldots ,n\}\). Then it holds for the expected runtime that

$$\begin{aligned} (1-O(n^{-1/3}(\ln n)))\cdot I(X_0) \;\le \; \mathord {E}\mathord {\left( T\mid X_0\right) } \;\le \; I(X_0), \end{aligned}$$

where \(I(X_0)=\sum _{i=1}^{X_0} 1/\varDelta (i)\).

For its proof, we need a helper lemma, which is concerned with the monotonicity of the drift.

Lemma 4

Let \(\varDelta (i)\) be defined as in Theorem 3. Then \(\varDelta (i+1)\ge \varDelta (i)\) for any \(i\ge 0\).

Proof

We first prove the result for \(\lambda =1\) and show then how to extend it to \(\lambda >1\).

Consider a search point x(i) with i one-bits at time t and let \(x'(i)\) be its random offspring (before selection). We represent

$$\begin{aligned} \varDelta (i) = \sum _{j\ge 1} {{\mathrm{Pr}}}(|x(i)| - |x'(i)| \ge j) = \sum _{j=0}^{i-1} {{\mathrm{Pr}}}(|x'(i)|\le j), \end{aligned}$$

where \(|z|\) denotes the number of one-bits in z. If we can prove for all \(j\le i\) the inequality

$$\begin{aligned} {{\mathrm{Pr}}}(|x'(i+1)|\le j+1) \ge {{\mathrm{Pr}}}(|x'(i)|\le j) \end{aligned}$$
(1)

the lemma follows. The inequality will be shown by a coupling argument; formally, we map mutations \(x'(i)\) of x(i) to mutations \(x'(i+1)\) of \(x(i+1)\) and show that the mapped mutations are at least as likely. By the symmetry of the mutation operator, we can assume that \(x(i+1)\) dominates x(i) bitwise, i. e., every one-bit of x(i) is also a one-bit of \(x(i+1)\). Let \(i^*\) be the bit that is 1 in \(x(i+1)\) but 0 in x(i). We first consider all mutations \(x'(i)\) of x(i) with at most \(j\le i\) one-bits in which bit \(i^*\) is not flipped. These mutations map one-to-one to mutations \(x'(i+1)\) with at most \(j+1\) one-bits by flipping the same bits in \(x(i+1)\).

The mutations of x(i) with at most \(j\le i\) one-bits in which bit \(i^*\) is flipped to 1 must also flip some one-bit k to 0. We map them to mutations of \(x(i+1)\) by flipping all bits in the same way, except that neither \(i^*\) nor k is flipped. Such a mutation has one further one-bit. Since we use mutation probability \(c/n=\varTheta (1/n)\), the probability of not flipping \(i^*\) and k is by a factor of

$$\begin{aligned} \left( \frac{1-c/n}{c/n}\right) ^2 = \varTheta (n^2) \end{aligned}$$

larger than the probability of flipping both of them. Note that the mapping is not bijective in this case as up to \(n-1\) different mutations (one for each value of k) of x(i) are mapped to the same mutation of \(x(i+1)\). Still, by a union bound, the probability of generating the considered mutation of \(x(i+1)\) is by a factor of \(\varTheta (n)\) larger, proving (1).

If \(\lambda >1\), we represent the drift as a sum of tail probabilities in the same way as above, except that the best of the \(\lambda \) offspring of x(i) is considered instead of a single offspring \(x'(i)\). By independence of the offspring creation, the probability that the best offspring improves by at least j equals \(1-({{\mathrm{Pr}}}(|x(i)| - |x'(i)| < j))^\lambda \). Using (1), this is at most \(1-({{\mathrm{Pr}}}(|x(i+1)| - |x'(i+1)| < j))^\lambda \), which completes the proof. \(\square \)

Proof of Theorem 3

To prove the upper bound, we use the variable drift theorem for upper bounds (Theorem 1), using \(x_{\min }=1\), \(x_{\max }=n\), and \(h(x)=\varDelta (\lceil x\rceil )\) for \(x\in [x_{\min },x_{\max }]\). Note that the theorem allows such a discontinuous h-function. Hence, writing \(I_0:=I(X_0)\), we obtain

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) }&\;\le \; \frac{1}{\varDelta (1)} + \int \limits _{x_{\min }}^{X_0}\frac{\mathrm {d}x}{\varDelta (\lceil x\rceil )} \;= \; \frac{1}{\varDelta (1)} + \lim _{\ell \downarrow x_{\min }} \int \limits _{\ell }^{X_0}\frac{\mathrm {d}x}{\varDelta (\lceil x\rceil )} \\&\;=\; \sum _{i=1}^{X_0} \frac{1}{\varDelta (i)} \;=\; I_0. \end{aligned}$$

For the lower bound, we define

$$\begin{aligned} \xi (x):={\left\{ \begin{array}{ll} \lceil x\rceil -1 &{} \text { if }x\le n^{1/3}, \\ \lceil x\rceil -\lceil \ln (n)\rceil &{} \text { otherwise}. \end{array}\right. } \end{aligned}$$

To analyze the probability that \(X_{t+1}<\xi (X_t)\), given some \(X_t>0\), we also consider the two cases. If \(X_t\le n^{1/3}\), then it is necessary to flip at least two one-bits to zero to obtain \(X_{t+1}<\xi (X_t)=X_t-1\). This probability is at most \( (n^{1/3} \frac{c}{n})^2 = O(n^{-4/3})\), and this asymptotic bound even holds if the best of \(\lambda =O(1)\) offspring is considered. The probability of flipping at least \(\ln n\) bits in at least one of \(\lambda =O(1)\) offspring is clearly smaller, more precisely \(n^{-\varOmega (\ln n)}\). Hence, \({{\mathrm{Pr}}}(X_{t+1}<\xi (X_t))\le O(n^{-4/3})\). Furthermore, we get \(h(x)\ge (1-c/n)^{n-1} cx/n \ge e^{-2c}cx/n\) by considering the expected number of one-bits flipped in an arbitrary offspring, conditioned on the event that no zero-bit flips. Hence,

$$\begin{aligned} \left( \frac{x_{\min }}{h(x_{\min })} + \int \limits _{x_{\min }}^{X_t} \frac{1}{h(x)} \,\mathrm {d}x\right) \le O(n) + \int \limits _{1}^n \frac{n e^{2c}}{c\lceil x\rceil }\,\mathrm {d}x , \end{aligned}$$

which is \(O(n\ln n)\) as c is constant. This establishes Condition (iv) of Theorem 2 for \(\beta =c' n^{1/3}/(\ln n)\), where \(c'\) is a sufficiently small constant. We also note that a union bound over \(\lambda =O(1)\) offspring implies \(h(x_{\min })=O(\lambda c/n)=O(1/n)\), hence

$$\begin{aligned} I_0 \ge \frac{x_{\min }}{h(x_{\min })} = \varOmega (n), \end{aligned}$$

which we will use later.

Condition (iii) of Theorem 2 holds trivially due to the selection mechanism of the (1+\(\lambda \)) EA. To verify the remaining conditions, we set \(h(x):=\varDelta (\lceil x\rceil )\) for \(x\le n^{1/3}\) and \(h(x):=\varDelta (\lceil x\rceil + \lceil \ln (n)\rceil )\) otherwise. Note that \(\lim _{\delta \downarrow 0} h(\xi (x)+\delta ) = \varDelta (x)\) if \(x\le n^{1/3}\). Otherwise,

$$\begin{aligned}&E(X_t-X_{t+1} \mid \mathcal {F}_t;X_t=x) = \varDelta (\lceil x\rceil ) \le \lim _{\delta \downarrow 0} \varDelta (\lceil x+\delta \rceil )\\&\quad = \lim _{\delta \downarrow 0} h(\lceil x\rceil -\lceil \ln (n)\rceil +\delta ) = \lim _{\delta \downarrow 0} h(\xi (\lceil x\rceil )+\delta ) \end{aligned}$$

since by Lemma 4 \(\varDelta (i)\) is monotone increasing in its argument. This establishes (v). Also (i) is satisfied by definition of h. Altogether, the drift theorem yields

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) }&\ge \frac{\beta }{1+\beta }\left( \frac{1}{\varDelta (1)} + \int \limits _{1}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x\right) \\&= \frac{\beta }{1+\beta }\left( \sum _{i=1}^{n^{1/3}} \frac{1}{\varDelta (i)} + \int \limits _{n^{1/3}}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x\right) \\&\ge \frac{\beta }{1+\beta } \left( \sum _{i=1}^{n^{1/3}} \frac{1}{\varDelta (i)} + \sum _{n^{1/3}+\lceil \ln n\rceil }^{X_0+\lceil \ln n\rceil } \frac{1}{\varDelta (i)}\right) , \end{aligned}$$

where the equality used that \(h(x):=\varDelta (\lceil x\rceil )\) for \(x\le n^{1/3}\) and the inequality used the definition of h in the other case along with an index transformation.

We see that the term in parentheses is at least

$$\begin{aligned} \sum _{i=1}^{X_0}\frac{1}{\varDelta (i)} - \sum _{i=1}^{\lceil \ln n\rceil } \frac{1}{\varDelta (n^{1/3}+i)} \ge I_0 - \lceil \ln n\rceil \frac{1}{\varDelta (n^{1/3})}, \end{aligned}$$

by the monotonicity of \(\varDelta \). We also know that \(\varDelta (n^{1/3}) \ge n^{1/3} \frac{c}{n} (1-c/n)^{n-1} = \varOmega (n^{-2/3})\) for any constant choice of c and any \(\lambda \ge 1\). Hence

$$\begin{aligned} \lceil \ln n\rceil \frac{1}{\varDelta (n^{1/3})} = O( n^{2/3}(\ln n)) = O(n^{-1/3}(\ln n)I_0), \end{aligned}$$

using \(I_0=\varOmega (n)\). This altogether proves the desired inequality

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) } \ge (1-O(n^{-1/3}(\ln n))) I_0. \end{aligned}$$

\(\square \)

4 Approximating the Drift

In the following we will give an exact closed-form formula for the drift at fitness k of the (1+\(\lambda \)) EA with mutation rate c / n for constant c and constant \(\lambda \). Let \(X_t\) denote the fitness at time t. Then an offspring created by mutation has fitness \(X_t + Z\), where Z is the difference of two binomially distributed random variables, namely the number of 0-bits flipped by the mutation minus the number of 1-bits flipped by the mutation. If more 1-bits than 0-bits are flipped, i. e. if Z is negative, the offspring is closer to the optimum. Hence, we are interested in the exact distribution of the random variable Z.

To this end, we will make use of the ordinary hypergeometric function, also known as the Gaussian hypergeometric function \({}_2F_1\). For each \(a\in \mathbb {Z}\) and \(b\in \mathbb {N}\), let \(a^{(b)}=a(a+1)(a+2)\cdots (a+b-1)\) denote the rising factorial (with \(a^{(0)}:=1\)), which is sometimes called the Pochhammer symbol. The hypergeometric function \({}_2F_1\) is defined for \(x\in \mathbb {R}\) as the following series:

$$\begin{aligned} {}_2F_1(a,b;c;x) = \sum _{k=0}^{\infty } \frac{a^{(k)}\,b^{(k)}}{c^{(k)}}\cdot \frac{x^k}{k!}. \end{aligned}$$

Note that the infinite series terminates if either \(a<0\) or \(b<0\). We refer the reader to [2] for more information.

Let \(X\sim {{\mathrm{Bin}}}(n_1,p_1)\), \(Y\sim {{\mathrm{Bin}}}(n_2,p_2)\) and let \(Z:=X-Y\). In the following, we will denote the distribution of Z as \(Z\sim {{\mathrm{BinDiff}}}(n_1,n_2,p_1,p_2)\).

Theorem 5

Let \(Z\sim {{\mathrm{BinDiff}}}(n_1,n_2,p_1,p_2)\). Then, for all \(z\in \mathbb {N}\):

$$\begin{aligned} {{\mathrm{Pr}}}(Z=z)=\left( {\begin{array}{c}n_1\\ z\end{array}}\right) p_1^{z}(1-p_1)^{n_1-z}(1-p_2)^{n_2}\cdot {}_2F_1(z-n_1,-n_2;z+1;q), \end{aligned}$$

where

$$\begin{aligned} q=\frac{p_1p_2}{(1-p_1)(1-p_2)}. \end{aligned}$$

Proof

Let X, Y and Z be defined as above. We have for all \(z\in \mathbb {Z}\)

$$\begin{aligned} {{\mathrm{Pr}}}(Z=z)&={{\mathrm{Pr}}}(X-Y=z)\nonumber \\&=\sum _{\begin{array}{c} a,b\in \mathbb {Z}\\ a-b=z \end{array}}{{\mathrm{Pr}}}(X=a){{\mathrm{Pr}}}(Y=b). \end{aligned}$$
(2)

Furthermore, let \(q=p_1p_2/((1-p_1)(1-p_2))\) and \(\alpha = p_1^z(1-p_1)^{n_1-z}(1-p_2)^{n_2}\). Consider the case \(z\ge 0\). Rewriting the sum from above we get

$$\begin{aligned}&{{\mathrm{Pr}}}(Z=z)=\sum _{k=0}^{n_1}{{\mathrm{Pr}}}(X=k+z){{\mathrm{Pr}}}(Y=k)\\&=\sum _{k=0}^{n_1}\left( {\begin{array}{c}n_1\\ k+z\end{array}}\right) p_1^{k+z} (1-p_1)^{n_1-k-z}\left( {\begin{array}{c}n_2\\ k\end{array}}\right) p_2^k(1-p_2)^{n_2-k}\\&=\alpha \sum _{k=0}^{n_1} \frac{n_1!}{(k+z)!(n_1-k-z)!} \cdot \frac{n_2!}{k!(n_2-k)!} \cdot q^k\\&=\alpha \sum _{k=0}^{n_1} \frac{(n_1-z-k+1)^{(z+k)}}{z!(z+1)^{(k)}} \cdot \frac{(n_2-k+1)^{(k)}}{k!} \cdot q^k\\&=\alpha \sum _{k=0}^{n_1} \frac{(-1)^{2k}(-(n_1-z))^{(k)}(n_1-z+1)^{(z)}(-n_2)^{(k)}}{z!(z+1)^{(k)}k!} \cdot q^k. \end{aligned}$$

Using the fact that \((n_1-z+1)^{(z)}/z! = \left( {\begin{array}{c}n_1\\ z\end{array}}\right) \) we get

$$\begin{aligned} {{\mathrm{Pr}}}(Z=z)&=\alpha \left( {\begin{array}{c}n_1\\ z\end{array}}\right) \sum _{k=0}^{n_1} \frac{(-(n_1-z))^{(k)}(-n_2)^{(k)}}{(z+1)^{(k)}}\cdot \frac{q^k}{k!}\\&=\alpha \left( {\begin{array}{c}n_1\\ z\end{array}}\right) \cdot {}_2F_1(z-n_1,-n_2;z+1;q). \end{aligned}$$
The case \(z<0\) can be shown analogously. \(\square \)
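The closed form can be checked numerically against the convolution (2). The following Python sketch does so for the case \(z\ge 0\); it assumes that SciPy's hyp2f1 and binom are available and is not the implementation used for the results in this paper.

```python
from math import comb
from scipy.special import hyp2f1
from scipy.stats import binom

def bindiff_pmf_closed_form(z, n1, n2, p1, p2):
    # Pr(Z = z) for Z = X - Y with X ~ Bin(n1, p1), Y ~ Bin(n2, p2) via Theorem 5 (case z >= 0)
    q = (p1 * p2) / ((1 - p1) * (1 - p2))
    alpha = p1**z * (1 - p1)**(n1 - z) * (1 - p2)**n2
    return alpha * comb(n1, z) * hyp2f1(z - n1, -n2, z + 1, q)

def bindiff_pmf_convolution(z, n1, n2, p1, p2):
    # Pr(Z = z) directly from the convolution in Eq. (2)
    return sum(binom.pmf(j + z, n1, p1) * binom.pmf(j, n2, p2) for j in range(n2 + 1))

for z in range(5):
    print(z,
          bindiff_pmf_closed_form(z, 30, 20, 0.05, 0.03),
          bindiff_pmf_convolution(z, 30, 20, 0.05, 0.03))
```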

In the previous theorem we have seen that the discrete probability mass function of a \({{\mathrm{BinDiff}}}\)-distributed random variable can be expressed in terms of hypergeometric functions. The exact terms can also be written as Jacobi polynomials, a class of orthogonal polynomials. This relationship to orthogonal polynomials has been elaborated in connection with the probability distribution of fitness values of a bit string undergoing uniform bit-flip mutation in [5]. In that work, another class of orthogonal polynomials, namely Krawtchouk polynomials, was investigated.

In the following we will show that the drift can be computed efficiently. For this purpose, we will assume that the computation of \({}_2F_1\) counts as a single arithmetic operation using some implementation of \({}_2F_1\), similar to counting other special functions like \(\exp \).

Corollary 6

Consider the (1+\(\lambda \)) EA, choosing c and \(\lambda \) constant on OneMax. The drift \(\varDelta (k)\) at fitness k can be computed exactly with O(k) arithmetic operations.

Proof

Consider a run of the algorithm as stated above and assume that the OneMax-value is k. For all \(i\in [\lambda ]\) let \(X_i\sim {{\mathrm{Bin}}}(n-k,c/n)\) and \(Y_i\sim {{\mathrm{Bin}}}(k,c/n)\), then \(Z_i:=X_i-Y_i\) is the random variable that denotes the change in fitness of the i-th offspring individual, i. e. \(\min \{k,\, k+\min \{Z_i|i\in [\lambda ]\}\}\) is the fitness of the new parent due to elitist selection. Note that the image of each \(Z_i\) is \(\{z\in \mathbb {Z}|-k\le z\le n-k\}\).

Using Theorem 5 we can write the cumulative distribution function of \(Z_i\) as \(F_{Z_i}(x)=\sum _{y=-k}^x {{\mathrm{Pr}}}(Z_i=y)\). Let \(Z^*\) be defined as the minimum of \(Z_1,\ldots ,Z_\lambda \), i. e. \(Z^*\) is the minimum order statistic. It is known that \(F_{Z^*}(z) = 1-(1-F_{Z_1}(z))^\lambda \), using \(Z_1\) as an arbitrary representative since all \(Z_i\) are independent and identically distributed. Hence, we get that

$$\begin{aligned} {{\mathrm{Pr}}}(Z^*=z)=(1-F_{Z_1}(z-1))^\lambda -(1-F_{Z_1}(z))^\lambda . \end{aligned}$$

Consider the drift at fitness k which is defined as

$$\begin{aligned} \varDelta (k):=\mathord {E}\mathord {\left( X_t-X_{t+1}\mid X_t=k\right) }. \end{aligned}$$

We get

$$\begin{aligned} \varDelta (k)&=-\sum _{z=-k}^{-1} z{{\mathrm{Pr}}}(Z^*=z)\\&=-\sum _{z=-k}^{-1} z\left( (1-F_{Z_1}(z-1))^\lambda -(1-F_{Z_1}(z))^\lambda \right) \\&=-\sum _{z=-k}^{-1} z\Bigg (\bigg (1-\sum _{y=-k}^{z-1} {{\mathrm{Pr}}}(Z_1=y)\bigg )^\lambda \\&\phantom {=}\phantom {-\sum _{z=-k}^{-1} z\Bigg (} -\bigg (1-\sum _{y=-k}^{z} {{\mathrm{Pr}}}(Z_1=y)\bigg )^\lambda \Bigg ). \end{aligned}$$

The last term involves two nested sums. However, we only need to evaluate \({{\mathrm{Pr}}}(Z_1=y)\) for \(y=-k,\ldots ,-1\) once, saving all values in a list, in order to compute \(F_{Z_1}(z)\) for \(z=-k,\ldots ,-1\), which requires a single traversal over that list. Computing the actual drift then only consists of O(k) arithmetic operations, since we precomputed the needed values of \(F_{Z_1}\). Hence, the computation of \(\varDelta (k)\) only needs O(k) arithmetic operations and space O(k). \(\square \)
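For concreteness, the computation just described can be sketched in Python as follows; the sketch assumes SciPy's hyp2f1, handles negative arguments via the symmetry \({{\mathrm{Pr}}}(X-Y=z)={{\mathrm{Pr}}}(Y-X=-z)\), and uses identifiers of our own choosing.

```python
from math import comb
from scipy.special import hyp2f1

def bindiff_pmf(z, n1, n2, p1, p2):
    # Pr(X - Y = z) for X ~ Bin(n1, p1), Y ~ Bin(n2, p2) via the closed form of Theorem 5;
    # negative z is reduced to the non-negative case by swapping the roles of X and Y
    if z < 0:
        return bindiff_pmf(-z, n2, n1, p2, p1)
    q = (p1 * p2) / ((1 - p1) * (1 - p2))
    alpha = p1**z * (1 - p1)**(n1 - z) * (1 - p2)**n2
    return alpha * comb(n1, z) * hyp2f1(z - n1, -n2, z + 1, q)

def exact_drift(k, n, lam, c):
    # drift Delta(k) of the (1+lambda) EA with mutation probability c/n at OneMax-value k (Corollary 6)
    p = c / n
    # Pr(Z_1 = y) for y = -k, ..., -1; X counts flipped 0-bits, Y counts flipped 1-bits
    pmf = [bindiff_pmf(y, n - k, k, p, p) for y in range(-k, 0)]
    cdf, s = [], 0.0
    for v in pmf:                                   # single traversal yields F_{Z_1}(z), z = -k, ..., -1
        s += v
        cdf.append(s)
    drift = 0.0
    for idx, z in enumerate(range(-k, 0)):
        f_below = cdf[idx - 1] if idx > 0 else 0.0  # F_{Z_1}(z - 1)
        drift += -z * ((1.0 - f_below)**lam - (1.0 - cdf[idx])**lam)
    return drift

print(exact_drift(50, 100, 10, 1.0))
```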

As we have seen in Corollary 6, we need O(k) arithmetic computations of the probability mass function of a \({{\mathrm{BinDiff}}}\)-distributed random variable (see Theorem 5). The computation of the hypergeometric function \({}_2F_1\) proves to be the bottleneck. For example, on an Intel i7-4770S CPU, a single computation of the value of \({}_2F_1(-750, -249; 2; 4.016\cdot 10^{-6})\), which is a representative invocation of the function in this context, takes around 0.02 seconds using the internal function hypergeom from Matlab R2015b. Different parameter settings yield similar computation times. In order to compute the expected runtime of the (1+\(\lambda \)) EA, we need to compute a linear amount of drift values, which can pose a problem for large values of n due to the computational effort on the hypergeometric function. Hence, we are interested in approximating the drift with less computational effort.

The idea is to approximate the \({{\mathrm{BinDiff}}}\)-distribution. It is well-known that a \({{\mathrm{Poi}}}(np)\)-distribution yields a good approximation for a \({{\mathrm{Bin}}}(n,p)\)-distribution for large n and small p. To this end, we will approximate a \({{\mathrm{BinDiff}}}\)-distribution with the distribution of the difference of the corresponding Poisson approximations. A similar approach to the approximation of the point-wise drift has been pursued by Doerr et al. [6]. The distribution of the difference of two independent Poisson-distributed random variables is known in the literature as the Skellam distribution. We will state the definition according to [22].

Definition 7

Let \(X\sim {{\mathrm{Poi}}}(\mu _1)\) and \(Y\sim {{\mathrm{Poi}}}(\mu _2)\). The probability mass function of \(Z:=X-Y\) is given by

$$\begin{aligned} {{\mathrm{Pr}}}(Z=k)=e^{-(\mu _1+\mu _2)}\left( \frac{\mu _1}{\mu _2}\right) ^{\frac{k}{2}} I_k(2\sqrt{\mu _1\mu _2}), \end{aligned}$$

where \(I_k\) is the modified Bessel function of the first kind, defined by

$$\begin{aligned} I_\nu (z)=\left( \frac{z}{2}\right) ^\nu \sum _{i=0}^\infty \frac{(z/2) ^{2i}}{i!(\nu +i)!}, \end{aligned}$$

for \(\nu \in \mathbb {N}_0\) and \(z\in \mathbb {R}\); for negative integer orders one uses \(I_{-\nu }=I_{\nu }\). In the following we will denote the distribution of Z by \(Z\sim {{\mathrm{Skellam}}}(\mu _1,\mu _2)\).

For more information about the modified Bessel function of the first kind, we refer the reader to [2].

Instead of the hypergeometric function in the probability mass function of a \({{\mathrm{BinDiff}}}\)-distributed random variable, a \({{\mathrm{Skellam}}}\)-distributed random variable involves the modified Bessel function of the first kind, which is another special function. However, computing, for example, \(I_{250}(1.73)\) takes \({<}0.0001\) seconds on an Intel i7-4770S CPU using the internal function besseli from Matlab R2015b. The computation times are similar for different parameters. This is a big improvement over the time needed for the computation of the hypergeometric function. Computationally, the use of the Skellam-distribution instead of the \({{\mathrm{BinDiff}}}\)-distribution is therefore justified. In the following we will show that the error from the approximation is small as well. Again, we will assume that the computation of \(I_\nu \) counts as a single arithmetic operation.
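As a quick plausibility check of this replacement, one can compare the two probability mass functions pointwise for representative parameters; the following sketch assumes SciPy's binom and skellam distributions.

```python
import numpy as np
from scipy.stats import binom, skellam

n, k, c = 1000, 250, 1.0
p = c / n
zs = np.arange(-4, 5)
# exact BinDiff(n-k, k, c/n, c/n) probabilities via the convolution (2)
exact = np.array([sum(binom.pmf(j + z, n - k, p) * binom.pmf(j, k, p) for j in range(k + 1))
                  for z in zs])
# Skellam((n-k)c/n, kc/n) approximation
approx = skellam.pmf(zs, (n - k) * p, k * p)
print(np.max(np.abs(approx / exact - 1.0)))   # maximum relative error over the displayed range
```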

Theorem 8

Consider the (1+\(\lambda \)) EA with mutation probability c / n and population size \(\lambda \) on OneMax where c and \(\lambda \) are constant. Then, the drift \(\varDelta (k)\) can be approximated up to a factor of \(1\pm O(1/n)\) in O(k) arithmetic operations.

Proof

We will first bound the relative error of the approximation of a \({{\mathrm{BinDiff}}}\)-distributed random variable by a \({{\mathrm{Skellam}}}\)-distributed random variable. Let \(X\sim {{\mathrm{Bin}}}(n-k,c/n)\) and \(Y\sim {{\mathrm{Bin}}}(k,c/n)\), then \(Z:=X-Y\sim {{\mathrm{BinDiff}}}(n-k, k, c/n, c/n)\). Let \(\tilde{X}\sim {{\mathrm{Poi}}}((n-k)c/n)\) and \(\tilde{Y}\sim {{\mathrm{Poi}}}(kc/n)\) be the corresponding Poisson-approximated random variables. Then, \(\tilde{Z}:=\tilde{X}-\tilde{Y}\sim {{\mathrm{Skellam}}}((n-k)c/n, kc/n)\). The relative error of the Poisson-approximation with respect to X is

$$\begin{aligned} \left| \frac{{{\mathrm{Pr}}}(\tilde{X}=m)}{{{\mathrm{Pr}}}(X=m)}-1\right| \le \frac{(e^{(n-k)c/n}-1)\frac{c}{n}}{m+1}. \end{aligned}$$

Bounding the exponent by c and omitting the constants, we have \({{\mathrm{Pr}}}(\tilde{X}=m)={{\mathrm{Pr}}}(X=m)(1\pm O(1/((m+1)n)))\) and \({{\mathrm{Pr}}}(\tilde{Y}=m)={{\mathrm{Pr}}}(Y=m)(1\pm O(1/((m+1)n)))\) accordingly (see [23] for details about the relative error bounds).

One can obtain different bounds for the relative error, depending on the actual arithmetic computation of \({{\mathrm{Pr}}}(\tilde{Z}=z)\). For example, we have that

$$\begin{aligned} 1-F_{\tilde{Z}}(z) =1-\sum _{\ell =-k}^{z}{{\mathrm{Pr}}}(\tilde{Z}=\ell ) =\sum _{\ell =z+1}^{n-k}{{\mathrm{Pr}}}(\tilde{Z}=\ell ). \end{aligned}$$

For small z, the second term involves the computation of fewer probabilities than the third term. This can lead to different results on the error bound as a consequence of error propagation. However, due to the triangle inequality we have that the absolute error

$$\begin{aligned} |{{\mathrm{Pr}}}(Z=z)-{{\mathrm{Pr}}}(\tilde{Z}=z)| \le |{{\mathrm{Pr}}}(Z=z)-q|+|q-{{\mathrm{Pr}}}(\tilde{Z}=z)|, \end{aligned}$$

and the last term is 0 for any arithmetic computation q of \({{\mathrm{Pr}}}(\tilde{Z}=z)\). Thus, we can compute \({{\mathrm{Pr}}}(\tilde{Z}=z)\) in the same way as \({{\mathrm{Pr}}}(Z=z)\) in Eq. 2 in order to derive a bound on the relative error of \(\tilde{Z}\) with respect to Z, instead of using the exact probability mass function. We have for \(z\ge 0\) that

$$\begin{aligned} {{\mathrm{Pr}}}(\tilde{Z}=z)&= \sum _{j=0}^{n-z}\Bigg ({{\mathrm{Pr}}}(X=j+z){{\mathrm{Pr}}}(Y=j)\\&\phantom {=\sum _{j=0}^{n-z}} \cdot \left( 1\pm O\left( \frac{1}{(j+z+1)n}\right) \right) \left( 1\pm O\left( \frac{1}{(j+1)n}\right) \right) \Bigg )\\&=(1\pm O(1/n))\sum _{j=0}^{n-z}{{\mathrm{Pr}}}(X=j+z){{\mathrm{Pr}}}(Y=j)\\&=(1\pm O(1/n)){{\mathrm{Pr}}}(Z=z). \end{aligned}$$

The case \(z<0\) is analogous and we obtain in total

$$\begin{aligned} {{\mathrm{Pr}}}(\tilde{Z}=z)=(1\pm O(1/n)){{\mathrm{Pr}}}(Z=z), \end{aligned}$$

for all \(z\in \{-k,-k+1,\ldots ,n-k\}\).

We can now compute the relative error of the approximated drift \(\tilde{\varDelta }\). By replacing \(Z_1\) with \(\tilde{Z}\) in the computation of the exact drift from Corollary 6 we obtain

$$\begin{aligned} \tilde{\varDelta }(k)&=-\sum _{z=-k}^{-1} z\left( \Bigg (1-\sum _{y=-k}^{z-1} {{\mathrm{Pr}}}(\tilde{Z}=y)\Bigg ) ^\lambda \right. \\&\phantom {=-\sum _{z=-k}^{-1}}\left. -\Bigg (1-\sum _{y=-k}^{z} {{\mathrm{Pr}}}(\tilde{Z}=y)\Bigg )^\lambda \right) \\&=-(1\pm O(1/n))^\lambda \sum _{z=-k}^{-1} z\left( \Bigg (\sum _{y=z}^{n-k} {{\mathrm{Pr}}}(Z=y)\Bigg )^\lambda \right. \\&\phantom {=-(1\pm O(1/n))^\lambda \sum _{z=-k}^{-1} z}\left. -\left( \sum _{y=z+1}^{n-k} {{\mathrm{Pr}}}(Z=y)\right) ^\lambda \right) \\&=(1\pm O(1/n))\varDelta (k). \end{aligned}$$

Note that in the last step we exploited the fact that \(\lambda \) is a constant.

Since we can use the same arithmetic computation as in Corollary 6, the number of arithmetic operations in order to compute the approximated drift is O(k) as well. \(\square \)

5 Computation of Mutation Rates

In the previous section we showed how to approximate the drift with only a small relative error. Using the main result from Sect. 3 (Theorem 3), we can now approximate the expected runtime.

Corollary 9

Consider the (1+\(\lambda \)) EA on OneMax with mutation probability c / n, where c and \(\lambda \) are constant. Using the approximation from Sect. 4 the runtime of the (1+\(\lambda \)) EA can be approximated up to a multiplicative error of \(1\pm O(n^{-1/3}(\ln n))\).

Proof

Let \(X_t\) denote the OneMax-value at time \(t>0\) and let furthermore \(\varDelta (i):=\mathord {E}\mathord {\left( X_t-X_{t+1}\mid X_t=i\right) }\) be the drift at fitness value i.

Theorem 3 states that

$$\begin{aligned} (1-O(n^{-1/3}(\ln n)))\cdot I(X_0)\; \le \;\mathord {E}\mathord {\left( T\mid X_0\right) } \;\le \; I(X_0), \end{aligned}$$

where \(I(X_0)=\sum _{i=1}^{X_0} 1/\varDelta (i)\).

Plugging in the approximate drift values \(\tilde{\varDelta }(i)\) as described in Sect. 4, and weakening the upper bound accordingly, we obtain

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) }= (1\pm O(n^{-1/3}(\ln n))) \sum _{i=1}^{X_0} 1/\tilde{\varDelta }(i). \end{aligned}$$

Using a Chernoff bound we can see that the probability that the algorithm initializes in the interval \([n/2-n^{2/3}, n/2+n^{2/3}]\) is at least \(1-2e^{-(2/3)n^{1/3}}\). The expected time to advance from a fitness of \(\lceil n/2 + n^{2/3}\rceil \) down to \(\lfloor n/2\rfloor \) is \(\sum _{\lfloor n/2\rfloor }^{\lceil n/2 + n^{2/3}\rceil } 1/\tilde{\varDelta }(i) = O\bigl (n^{2/3}/\varDelta (\lfloor n/2\rfloor )\bigr )\), since \(\tilde{\varDelta }(i)=(1\pm O(1/n))\varDelta (i)\) and \(\varDelta \) is monotone increasing by Lemma 4. It holds that \(\varDelta (\lfloor n/2\rfloor ) \ge e^{-2c}c\lfloor n/2\rfloor /n=\varOmega (1)\), and thus we get \(\sum _{\lfloor n/2\rfloor }^{\lceil n/2 + n^{2/3}\rceil } 1/\tilde{\varDelta }(i) = O(n^{2/3})\). Similarly, we can show that \(\sum _{\lfloor n/2-n^{2/3}\rfloor }^{\lceil n/2\rceil } 1/\tilde{\varDelta }(i) = O(n^{2/3})\). Furthermore, we know that the expected runtime is \(\varTheta (n\ln n)\), independently of the actual initialization in the given interval. Therefore, the relative error incurred by starting at most \(n^{2/3}\) away from n / 2 is of order \(O(n^{2/3}/E(T)) = O(n^{-1/3})\), which is covered by the asymptotic error factor.

Conditioning on the event that the algorithm initializes in \([n/2-n^{2/3}, n/2+n^{2/3}]\) yields in total:

$$\begin{aligned} \mathord {E}\mathord {\left( T\right) }=(1\pm O(n^{-1/3}(\ln n)))\sum _{i=1}^{\lceil n/2\rceil } 1/\tilde{\varDelta }(i), \end{aligned}$$
(3)

where the error introduced by the Chernoff bound is absorbed by the asymptotic factor. Hence, we can approximate the runtime of the (1+\(\lambda \)) EA with only small asymptotic error. \(\square \)

5.1 Approximating the Runtime

We implemented the sum on the right-hand side of Eq. 3 in Matlab R2015b and computed the approximate expected runtimes for \(c=1.00, 1.01,\ldots , 2.0\), \(\lambda =1,\ldots ,10\) and problem sizes \(n=100,200,500,1000,2000,5000\). The approximated optimal values of c are given in Table 1.
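The computation behind these values is easy to reproduce in outline. The following Python sketch (the paper's numbers were obtained with a Matlab implementation; the grid and parameters below are merely illustrative) evaluates the right-hand side of Eq. 3 with the Skellam-approximated drift and picks the grid value of c with the smallest approximate expected runtime.

```python
import numpy as np
from scipy.stats import skellam

def approx_drift(k, n, lam, c):
    # Skellam-approximated drift of the (1+lambda) EA at OneMax-value k (cf. Theorem 8)
    mu1, mu2 = (n - k) * c / n, k * c / n
    cdf = skellam.pmf(np.arange(-k, 0), mu1, mu2).cumsum()   # F(z) for z = -k, ..., -1
    below = np.concatenate(([0.0], cdf[:-1]))                # F(z - 1)
    zs = np.arange(-k, 0)
    return float(np.sum(-zs * ((1.0 - below)**lam - (1.0 - cdf)**lam)))

def approx_runtime(n, lam, c):
    # right-hand side of Eq. 3: sum of 1/drift for i = 1, ..., ceil(n/2)
    return sum(1.0 / approx_drift(i, n, lam, c) for i in range(1, -(-n // 2) + 1))

n, lam = 100, 10
cs = np.arange(1.0, 2.0001, 0.01)
runtimes = [approx_runtime(n, lam, c) for c in cs]
print("approximately optimal c for n=%d, lambda=%d: %.2f" % (n, lam, cs[int(np.argmin(runtimes))]))
```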

As expected, we can see that for fixed \(\lambda \), the approximated mutation rate parameter c is decreasing with n, approaching 1 as predicted by the tight analysis in [12]. Furthermore, the approximate optimal c for \(n=100\) and \(\lambda =1\) is 1.19, which is only slightly higher than the exact value of 1.17 given in [5], hence justifying that the approximation does not suffer from large constants in the lower order term that might corrupt the approximation for small values of n.

Table 1 Approximate optimal c values

Interestingly, for fixed n the approximate optimal mutation rate grows with \(\lambda \). However, this behaviour cannot be explained by the bound from [12], which means that the reason for this behaviour is hidden in the lower order term. Intuitively, a larger population compensates for the more explorative character of a higher mutation rate by reducing the chance that none of the offspring makes progress at all.

In order to evaluate the benefit of setting the mutation rate to the approximate optimal value instead of using the mutation rate 1 / n, we provide the corresponding table of the ratios of the approximated expected runtimes for the optimal mutation rate and 1 / n. As can be seen in Table 2, using the optimal mutation rate compared to 1 / n does not improve the corresponding expected runtime by more than 2% in the considered ranges of n and \(\lambda \). This means that a mutation rate of 1 / n is a sane choice for the parameter ranges that we have examined.

Table 2 Ratios of the expected runtimes for the approximate optimal mutation rate and the standard mutation rate 1 / n

Fig. 1 Approximate expected runtime of the (1+10) EA using the Skellam-approximation for \(n=100,200,500,1000,2000,5000\). The minimal c is marked for each n

For fixed \(\lambda \), the ratios increase with n, as expected. Interestingly, for fixed n, the ratios decrease with \(\lambda \), which means that using the optimal mutation rate has a greater influence on larger population sizes. Another observation from Fig. 1 is that for fixed n and \(\lambda \) the expected runtime exhibits a certain robustness in the sense that it does not change by much for values of c in the considered range.

5.2 Comparison with Empirical Results

As shown in the beginning of this section, the results presented in Sect. 5.1 approximate the expected runtime with only small error. However, this error is stated asymptotically, and the computed approximations of the expected runtimes might in principle differ greatly from the actual expected runtimes due to large constants hidden in the asymptotic term, especially for smaller values of n. As it turns out, this is not the case: our approximations are surprisingly precise, even for smaller values of n.

We performed experiments in order to compare the empirical optimal values of the mutation parameter c to the approximate optimal values of c. For this purpose we implemented the (1+\(\lambda \)) EA in C using the GNU Scientific Library (GSL) for the generation of pseudo-random numbers and for its implementations of the hypergeometric function and the modified Bessel function of the first kind.

The results are shown in Fig. 2. The plot displays the empirical optimal values for the mutation parameter c for \(n\in \{100,200,\ldots ,1000,2000\}\) for the (1+1) EA, (1+50) EA and (1+100) EA, as dots. The empirical optimal values of c are determined by taking the empirical minimum of the corresponding runs of the (1+\(\lambda \)) EA for each pair of n and \(\lambda \) for \(c\in \{0.50,0.52,\ldots ,3.0\}\), averaged over 50,000 runs each. The approximated optimal value for c using the Skellam-approximation as described in the beginning of this section is displayed by a dashed line for each setting of n and \(\lambda \). Additionally, for illustration, the approximated optimal value for c using the exact formula for the \({{\mathrm{BinDiff}}}\)-distribution (see Corollary 6) is displayed by a continuous line for each setting of n and \(\lambda \). The approximated values are determined by numerical optimization for both versions. We can see that for each \(\lambda \) both approximations seem to reflect the empirical optimal values very well over the whole range of n, producing only a small absolute error. Note that the empirical optimal values are subject to some amount of variation, despite the high sample size of 50,000 for each data point. This is due to the high variance in the distribution of the optimization time, which can also be observed in Fig. 3 where the interquartile range of the experiments is illustrated. Moreover, the Skellam-approximation rapidly approaches the \({{\mathrm{BinDiff}}}\)-approximation which justifies its use even further.

Due to the fast computation of our approximation we added values for \(n=5000\) and \(n=10{,}000\) in Fig. 3 to further illustrate the asymptotic behaviour of the optimal values for c.

Fig. 2 Empirical optimal and approximate optimal mutation parameter c for the (1+\(\lambda \)) EA for \(n=100,200,\ldots ,1000, 2000\) and \(\lambda =1, 50, 100\). The empirical optimal values for c are marked by dots. The approximate optimal values for c using the exact formula for the \({{\mathrm{BinDiff}}}\)-distribution are marked by the continuous line. The approximate optimal values for c using the Skellam-approximation for the \({{\mathrm{BinDiff}}}\)-distribution are marked by the dashed line

Fig. 3 Empirical optimal and approximate optimal mutation parameter c for the (1+\(\lambda \)) EA for \(n=100,200,\dots ,1000, 2000, 5000, 10{,}000\) and \(\lambda =1, 50, 100\). The approximate optimal values for c using the Skellam-approximation for the \({{\mathrm{BinDiff}}}\)-distribution are marked by the dashed line

Fig. 4 Empirical expected runtime of the (1+\(\lambda \)) EA for \(n=100\) (left) and \(n=2000\) (right) and \(\lambda =1,2,5,10,20,50\) (top to bottom in each plot) over 50,000 runs for each setting of \(n,\lambda \) and \(c=0.5,0.52,0.54,\ldots ,3.0\). The empirical minimal c is marked by a dot for each n. The area between the first and third quartile is shaded for each \(\lambda \). The computed approximations of the minimal c are denoted by the dashed line

As already mentioned in Sect. 5.1 the approximate optimal mutation rate grows with \(\lambda \) for fixed n, which is due to the lower order term in the expected runtime. To further illustrate the impact of the lower order term we displayed the empirical expected runtimes for several values of \(\lambda \) and fixed n and marked the empirical and approximate optimal values for the mutation parameter c. The results are shown in Fig. 4. Both plots display the number of generations needed by the (1+\(\lambda \)) EA to optimize OneMax for \(n=100\) (left plot) and \(n=2000\) (right plot) for various settings of \(\lambda \) and c. The empirical runtimes are displayed in each plot for \(c=0.5,0.52,0.54,\ldots ,3.0\) and \(\lambda =1,2,5,10,20,50\), where each data point is averaged over 50,000 runs. For illustration of the high variance, the area between the first and third quartile is shaded. One can easily observe that for higher values of \(\lambda \) both the empirical and the approximate optimal values for c increase, for both values of n. Note that higher values of \(\lambda \) lead to lower runtimes in terms of generations.

6 Conclusions

We have presented an improved variable drift theorem that weakens the requirement that the process may not make large steps towards the optimum to a stochastic one. We used this theorem to show that upper and lower bounds on the expected runtime of the (1+\(\lambda \)) EA with mutation probability c / n obtained from variable drift theorems are at most a small lower order term apart if the exact drift is known and c and \(\lambda \) are constant. This reduces the analysis of the expected optimization time to finding an exact expression for the drift.

Furthermore, we gave an exact closed-form expression for the drift and presented a method for approximating it very efficiently with small error. By applying the new drift theorem and the approximation for the drift, we were able to approximate optimal mutation rates for the (1+\(\lambda \)) EA for various parameter settings of c and \(\lambda \) and also for moderate sizes of n and verified experimentally that these approximations reflect empirical results very precisely.

Our results remove the need for costly experiments to optimize the mutation rate. Even for moderate n and not too small \(\lambda \) it turns out that mutation rates up to 10% larger than the asymptotically optimal rate of 1 / n minimize the expected runtime. However, the benefit of setting the mutation rate to the optimal value of c instead of using mutation rate 1 / n is small with respect to the actual expected runtime.