1 Introduction

The runtime analysis of evolutionary algorithms (EAs) is a research area that emerged from the analysis of classical randomized algorithms, where the aim is to prove rigorous statements on the expected runtime and approximation quality of the algorithm depending on the problem size. Since the late 1990s, a number of results on the runtime of simple and moderately complex evolutionary algorithms as well as of other nature-inspired algorithms have emerged [1, 13, 18]. Both simple benchmark functions such as OneMax and more complex combinatorial optimization problems were considered. The vast majority of these results are asymptotic, i. e., use O-notation. While such a result in O-notation tells us how the runtime in the worst case scales with the problem size, it does not allow a direct comparison of different algorithms or parameter settings in a specific algorithm. For instance, the O-expression hides an implicit constant factor, which might be astronomically large. Hence, an \(O(n^3)\)-algorithm might indeed be more efficient for practical problem sizes than another \(O(n\ln n)\)-algorithm. This incomparability even persists if the upper bounds are supplemented by asymptotic lower bounds, e. g., \(\varOmega (n^3)\). Moreover, if a change of a parameter value such as the mutation probability changes the runtime only by a constant factor, this will not become visible in the bound. As a consequence, the optimal setting of the parameter, which minimizes the runtime, is hard to determine.

In recent years, there has been increasing interest in analyses of EAs that are non-asymptotic in the leading term and tight up to lower order terms. Such analyses are more exact than purely asymptotic ones and therefore typically harder to derive. For instance, while the expected runtime of the simple (1 \(+\) 1) EA on OneMax had been known to be \(\varTheta (n\ln n)\) since the early days of the research area, the first tight lower bound of the kind \((1-o(1))en\ln n\) was not proven until 2010 [5, 21]. For the more general case of linear functions, a long series of research results was published (e. g., [8, 12]) until Witt [23] finally proved that the expected runtime of the (1 \(+\) 1) EA equals \((1\pm o(1))en\ln n\) for any linear function with non-zero weights.

Results like Witt’s reveal the leading term in the representation of the expected runtime as a polynomial of n exactly and show that the underlying leading coefficient is in fact small (here \(e=2.71\ldots \)). This could not be read off from the classical \(\varTheta (n\ln n)\) result. However, there is greater potential in such a non-asymptotic analysis. It had for a long time been a rule of thumb to set the mutation probability of the (1 \(+\) 1) EA to 1 / n, i. e., to flip each bit independently with probability 1 / n; however, there was limited theoretical evidence for this. Doerr and Goldberg [7] were the first to prove that the \(\varTheta (n\ln n)\) bound for the (1 \(+\) 1) EA on linear functions also holds if the mutation probability is changed to c / n for an arbitrary positive constant c. Hence, changing the mutation probability to, say, 1 / (10n) or 10 / n does not change the asymptotic runtime behavior. Witt’s study [23] proves the more general, tight (up to lower order terms) bound \((1\pm o(1))\frac{e^c}{c}n\ln n\), which exhibits an interesting dependency on the factor \(c>0\) from the mutation probability. Since the factor \(\frac{e^c}{c}\) is minimized for \(c=1\), this proves that the most often recommended mutation probability of 1 / n is optimal for all linear functions.
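The optimality of \(c=1\) can be seen by elementary calculus:

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}c}\,\frac{e^c}{c} = \frac{e^c(c-1)}{c^2}, \end{aligned}$$

which is negative for \(0<c<1\) and positive for \(c>1\), so \(\frac{e^c}{c}\) attains its unique minimum value e at \(c=1\).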

However, it is by no means clear that 1 / n is the best choice under all circumstances. For instance, Böttcher, Doerr and Neumann [2] determined the expected optimization time of the (1 \(+\) 1) EA depending on the mutation probability p exactly for the LeadingOnes function. It turned out that the standard choice \(p=1/n\) is not optimal here, but a value of roughly 1.59 / n minimizes the expected time. A similar result is provided by Sudholt’s analysis [20] of a simple crossover-based EA on OneMax, where the optimal mutation probability turns out to be 1.618 / n. Note that in both cases a more aggressive mutation than the standard choice is beneficial.

Our study continues this line of research on tight analyses; however, in addition to the mutation probability, we consider a second parameter, namely the offspring population size of the underlying evolutionary algorithm. More precisely, we analyze the (\(1+\lambda \)) EA with mutation probability c / n, where c is an arbitrary positive constant, on the classical OneMax function. Our aim is to describe the expected runtime as a function of both the parameter \(\lambda \) (the population size) and the parameter c, in a non-asymptotically tight (up to lower order terms) manner. One aim is to determine the best parameter setting of c, depending on \(\lambda \). We also pick up the so far quite limited line of work on the (\(1+\lambda \)) EA [9, 10, 14], which has been restricted to the standard mutation probability 1 / n, and where OneMax was analyzed before. From this line of work, the bound \(\varTheta (\frac{n\ln n}{\lambda }+n\frac{\ln \ln \lambda }{\ln \lambda })\) on the expected number of generations has been known; the computational effort in terms of the number of function evaluations is by a factor of \(\lambda \) bigger. A remarkable insight drawn from this bound is that the number of generations enjoys a linear speed-up with respect to \(\lambda \) as long as \(\lambda =o(\ln n\ln \ln n/\ln \ln \ln n)\). If \(\lambda \) is above this threshold (the so-called cut-off point), the second term from the time bound becomes relevant and the linear speed-up ceases to exist.

Note that in the following we will use the term population size synonymously with offspring population size. The concept of parent populations has been studied previously from a theoretical perspective [22]. However, to the best of our knowledge, no asymptotically tight analyses have been performed for parent populations.

Another insight already known from previous work is that \(\lambda =1\) is, at least asymptotically, the best choice to minimize the expected number of f-evaluations; for the standard mutation probability, this has actually been proven without asymptotics [14]: increasing \(\lambda \) never decreases the expected number of f-evaluations. Therefore, we are primarily interested in optimizing the factor c in the mutation probability c / n depending on \(\lambda \), assuming a sufficiently large parallel architecture that allows the \(\lambda \) offspring evaluations to be done in parallel. Note also that our work and the work we relate it to assume a static choice of \(\lambda \), fixed throughout the whole run of the (\(1+\lambda \)) EA. Using adaptive offspring population sizes depending on the current fitness value, an asymptotically different runtime behavior may occur [3]. Also, a static choice of c is assumed, and dynamic schedules for the mutation probability, as studied for a (1 \(+\) 1) EA in [2], are not within the scope of this research.

In this paper, we prove that the expected runtime (i. e., number of generations) of the (\(1+\lambda \)) EA with mutation probability c / n on OneMax equals

$$\begin{aligned} (1\pm o(1)) \left( \frac{e^{c}}{c} \cdot \frac{n\ln n}{\lambda } + \frac{1}{2}\cdot \frac{n\ln \ln \lambda }{\ln \lambda }\right) , \end{aligned}$$

which greatly generalizes the previous \(\varTheta (\frac{n\ln n}{\lambda }+n\frac{\ln \ln \lambda }{\ln \lambda })\) bound. Hence, as long as \(\lambda \) is below the cut-off point, more precisely, if \(\lambda =o(\ln n\ln \ln n/\ln \ln \ln n)\), the leading term of the expected runtime is the same as for linear functions with the (1 \(+\) 1) EA, and setting \(c=1\) minimizes the expected runtime. However, when \(\lambda \) is above the cut-off, more precisely, if \(\lambda =\omega (\ln n\ln \ln n/\ln \ln \ln n)\), such that the second term becomes the leading term, one may choose c as an arbitrarily small or large constant without changing the expected runtime, up to lower order terms. So somewhat counter-intuitively, mutations are also allowed to occur less frequently than in a (1 \(+\) 1) EA. Effects on the lower order terms of the expected runtime are not captured by this result, though. These exact results follow from a careful study of order statistics of the binomial distribution and are obtained by variable drift theorems for upper and lower bounds.

Altogether, our study gives advice on the choice of the mutation probability and also determines the cut-off point for speed-up for different mutation probabilities. To the best of our knowledge, this represents the first tight (up to lower order terms) runtime analysis of a population-based EA, and it also seems to be novel in developing such a tight expression for the runtime depending on two parameters. We remark that our analyses build on the insights by Doerr and Künnemann [10], who also give asymptotic results holding for arbitrary linear functions.

This paper is structured as follows. Section 2 introduces the framework, and presents drift theorems as well as properties of order statistics used for the analysis. Section 3 proves our main result and discusses its implications. Supplemental experimental studies are presented in Sect. 4. We finish with some conclusions.

2 Preliminaries

2.1 Algorithm

We consider the (\(1+\lambda \)) EA for the minimization of pseudo-boolean functions \(f:\{0,1\}^n\rightarrow \mathbb {R}\), defined as Algorithm 1. The case of \(c=1\) in the mutation probability was considered in [9, 10, 14]. If both \(\lambda =1\) and \(c=1\), the algorithm simplifies to the classical (1 \(+\) 1) EA [1]. Throughout the paper, c is assumed to be constant, i. e., it may not depend on n.

Algorithm 1 The (\(1+\lambda \)) EA with mutation probability c / n for the minimization of \(f:\{0,1\}^n\rightarrow \mathbb {R}\)
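Since the pseudocode of Algorithm 1 is not reproduced here, the following minimal Python sketch (with function and variable names of our own choosing) illustrates the algorithm as described in the text: \(\lambda \) offspring are created from the current search point by standard bit mutation with probability c / n, and the best offspring replaces the parent if it is at least as good.

```python
import random

def one_max(x):
    # OneMax value; the aim in this paper is minimization, i.e., reaching value 0.
    return sum(x)

def one_plus_lambda_ea(n, lam, c, f=one_max, max_gens=10**7):
    """Minimal sketch of the (1+lambda) EA with mutation probability c/n.
    Returns the number of generations until an f-value of 0 is reached."""
    p = c / n
    parent = [random.randint(0, 1) for _ in range(n)]  # uniform initialization
    for t in range(1, max_gens + 1):
        # Create lambda offspring by flipping each bit independently with prob. p.
        offspring = [[1 - b if random.random() < p else b for b in parent]
                     for _ in range(lam)]
        # The "winner" is an offspring of minimum OneMax-value (ties arbitrary);
        # it replaces the parent if it is at least as good.
        winner = min(offspring, key=f)
        if f(winner) <= f(parent):
            parent = winner
        if f(parent) == 0:
            return t
    return max_gens

if __name__ == "__main__":
    random.seed(0)
    print(one_plus_lambda_ea(n=100, lam=5, c=1.0))
```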

The runtime, also called the optimization time, of the (\(1+\lambda \)) EA is the smallest t such that an individual of minimum f-value has been found. Note that t corresponds to a number of iterations (also called generations), where each generation creates \(\lambda \) offspring. Since each of these offspring has to be evaluated, the number of function evaluations, which is a classical cost measure, is larger by a factor of \(\lambda \) than the runtime as defined here. However, assuming a massively parallel architecture that allows for parallel evaluation of the offspring, counting the number of generations also seems to be a valid cost measure. In particular, a speed-up on the function \(\textsc {OneMax} (x_1,\ldots ,x_n):=x_1+\cdots +x_n\) by increasing \(\lambda \) can only be observed in terms of the number of generations. As proven by Jansen, De Jong and Wegener [14], on OneMax the number of function evaluations performed by the (1 \(+\) 1) EA is stochastically at most the number performed by any (\(1+\lambda \)) EA with \(\lambda >1\) and \(c=1\). For arbitrary c and even more general population-based algorithms, a corresponding statement is proven in [23]. Note that for reasons of symmetry, it makes no difference whether OneMax is minimized (as in the present paper) or maximized (as in several previous research papers).

Throughout the paper, all O-notation (mostly of the kind “o(1)”) will be with respect to the problem size n.

2.2 Drift Theorems

Our results are obtained by variable drift analysis, which is also used in the asymptotic analysis of the (\(1+\lambda \)) EA on OneMax and other linear functions [10]. The first theorems stating upper bounds on the hitting time using variable drift go back to [15, 17]. These theorems were subsequently generalized in [19] and [16]. We use the latter version.

Theorem 1

(variable drift, upper bound; [16]) Let \((X_t)_{t\ge 0}\) be a stochastic process adapted to a filtration \(\mathcal {F}_t\) over some state space \(S\subseteq \{0\}\cup [x_{\min },x_{\max }]\), where \(x_{\min }>0\). Let \(h(x):[x_{\min },x_{\max }]\rightarrow \mathbb {R}^+\) be a monotone increasing function such that 1 / h(x) is integrable on \([x_{\min },x_{\max }]\) and \(E(X_t-X_{t+1} \mid \mathcal {F}_t) \ge h(X_t)\) if \(X_t\ge x_{\min }\). Then it holds w. r. t. the first hitting time \(T:=\min \{t\mid X_t=0\}\) that (assuming the following expectation to exist)

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) } \le \frac{x_{\min }}{h(x_{\min })} + \int _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x. \end{aligned}$$
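To illustrate how the bound of Theorem 1 is applied, the following sketch evaluates \(x_{\min }/h(x_{\min }) + \int _{x_{\min }}^{X_0} 1/h(x)\,\mathrm {d}x\) numerically for the hypothetical drift function \(h(x)=\delta x\) (multiplicative drift with factor \(\delta \)), for which the bound evaluates to \((1+\ln X_0)/\delta \); all function names and parameter values below are ours.

```python
import math

def integrate(f, a, b, steps=100_000):
    # Simple midpoint rule; sufficient precision for this illustration.
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

def variable_drift_upper_bound(h, x_min, x0):
    """Bound of Theorem 1: x_min/h(x_min) + integral of 1/h(x) from x_min to x0."""
    return x_min / h(x_min) + integrate(lambda x: 1.0 / h(x), x_min, x0)

if __name__ == "__main__":
    delta, x0 = 0.01, 1000.0
    bound = variable_drift_upper_bound(lambda x: delta * x, 1.0, x0)
    print(bound, (1 + math.log(x0)) / delta)  # both roughly 790.8
```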

To prove lower bounds on the hitting time by variable drift, we need additional assumptions like the ones in the following theorem, a special case of which was first proposed in [6].

Theorem 2

(variable drift, lower bound; [16]) Let \((X_t)_{t\ge 0}\) be a stochastic process adapted to a filtration \(\mathcal {F}_t\) over some state space \(S\subseteq \{0\}\cup [x_{\min },x_{\max }]\), where \(x_{\min }>0\). Suppose there exist two functions \(\xi ,h:[x_{\min },x_{\max }]\rightarrow \mathbb {R}^+\) such that h(x) is monotone increasing and 1 / h(x) is integrable on \([x_{\min },x_{\max }]\), and for all \(t\ge 0\),

(i) \(X_{t+1} \le X_t\),

(ii) \(X_{t+1} \ge \xi (X_t)\) for \(X_t\ge x_{\min }\),

(iii) \(E(X_t-X_{t+1} \mid \mathcal {F}_t) \le h(\xi (X_t))\) for \(X_t\ge x_{\min }\).

Then it holds for the first hitting time \(T:=\min \{t\mid X_t=0\}\) that

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) } \ge \frac{x_{\min }}{h(x_{\min })} + \int _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x. \end{aligned}$$

2.3 Tight Bounds on Order Statistics

When the (\(1+\lambda \)) EA optimizes OneMax, it samples \(\lambda \) offspring and takes the so-called winner individual from the offspring having minimum OneMax-value, breaking ties arbitrarily. This problem is strongly related to analyzing how many one-bits are flipped in the offspring that flips most one-bits (however, this is not necessarily the winner individual). The number of flipping one-bits per offspring follows a binomial distribution, which is why the following results on order statistics of the binomial distribution are crucial. We do not claim these statements on order statistics to be fundamentally new; however, we did not find them in the literature in this form. In the following we use the notation \(X\sim Y\) for two random variables X and Y to denote that X follows the same distribution as Y.

We start with an auxiliary lemma that will be useful in the course of this section. It is a well-known inequality, often used in the analysis of evolutionary algorithms. For completeness, we give a formal proof.

Lemma 3

Let \(X\sim {{\mathrm{Bin}}}(n,p)\). Then, \(\mathord {{{\mathrm{Pr}}}}\mathord {\left( X\ge k\right) }\le \left( {\begin{array}{c}n\\ k\end{array}}\right) p^k\) for all \(k\ge 0\).

Proof

By definition, the probability of the event \(X\ge k\) is

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X\ge k\right) } = \sum _{j=0}^{n-k} \left( {\begin{array}{c}n\\ k+j\end{array}}\right) p^{k+j} (1-p)^{n-k-j}. \end{aligned}$$

Using

$$\begin{aligned} \left( {\begin{array}{c}n\\ k+j\end{array}}\right) = \frac{n!(n-k)!}{k!(k+1)\cdots (k+j)(n-k-j)!(n-k)!} \le \left( {\begin{array}{c}n\\ k\end{array}}\right) \left( {\begin{array}{c}n-k\\ j\end{array}}\right) , \end{aligned}$$

which follows from \(k\ge 0\), we get

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X\ge k\right) }&\le \sum _{j=0}^{n-k} \left( {\begin{array}{c}n\\ k\end{array}}\right) \left( {\begin{array}{c}n-k\\ j\end{array}}\right) p^{k+j} (1-p)^{n-k-j} \\&= \left( {\begin{array}{c}n\\ k\end{array}}\right) p^k \sum _{j=0}^{n-k} \left( {\begin{array}{c}n-k\\ j\end{array}}\right) p^{j} (1-p)^{n-k-j} = \left( {\begin{array}{c}n\\ k\end{array}}\right) p^k, \end{aligned}$$

where the last step uses that the \({{\mathrm{Bin}}}(n-k,p)\) probabilities sum up to 1. \(\square \)

We can now state our main result on order statistics.

Lemma 4

Let \(X_i\), where \(i\in \{1,\ldots ,\lambda \}\), be independent random variables, identically distributed as \(X_i\sim {{\mathrm{Bin}}}(k,c/n)\) for some \(k\le n\) and constant \(c>0\). Let \(X^*:=\max \{X_1,\ldots ,X_\lambda \}\) be the maximum order statistic of the \(X_i\). Then

1. If \(\lambda ck/n=o(1)\) then \(\mathord {E}\mathord {\left( X^*\right) }=(1-o(1))\lambda \mathord {E}\mathord {\left( X_1\right) } = (1-o(1))\lambda ck/n\).

2. If \(\lambda ck/n\ge \alpha \) for \(\alpha >0\) then \(\mathord {E}\mathord {\left( X^*\right) }\ge \alpha /(1+\alpha )\).

3. If \(\lambda =\omega (1)\) and \(k = n/(\ln ^\alpha \lambda )\), where \(\alpha \ge 0\) and \(\alpha =O(1)\), then \(\mathord {E}\mathord {\left( X^*\right) }=\frac{(1\pm o(1))}{1+\alpha } \frac{\ln \lambda }{\ln \ln \lambda }\).

Proof

We start with the first item. Note that for \(j\ge 0\),

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*=j\right) } \ge \left( {\begin{array}{c}\lambda \\ 1\end{array}}\right) \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1=j\right) } \left( \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1<j\right) }\right) ^{\lambda -1}, \end{aligned}$$

since the maximum equals j if exactly one of the variables takes this value and the remaining ones are smaller; to denote a single one of the identically distributed random variables, we always refer to \(X_1\).

The last bound is

$$\begin{aligned} \lambda \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1=j\right) }\left( 1-\mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1\ge j\right) }\right) ^{\lambda -1} \ge \lambda \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1=j\right) } \left( 1-\lambda \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1\ge j\right) }\right) \end{aligned}$$

according to Bernoulli’s inequality. If \(j\ge 1\) then

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1\ge j\right) } \le \mathord {E}\mathord {\left( X_1\right) }/j \le \mathord {E}\mathord {\left( X_1\right) } \end{aligned}$$

by Markov’s inequality. Altogether,

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*=j\right) }&\ge \lambda \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1=j\right) }(1-\lambda \mathord {E}\mathord {\left( X_1\right) })\\&= (1-o(1)) \lambda \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1=j\right) } \end{aligned}$$

by our assumption that \(\lambda \mathord {E}\mathord {\left( X_1\right) }=\lambda ck/n=o(1)\). The item now follows since

$$\begin{aligned} \mathord {E}\mathord {\left( X^*\right) }&=\sum _{j=1}^n j \cdot \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*=j\right) }\\&\ge \sum _{j=1}^n j(1-o(1))\lambda \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1=j\right) }\\&=(1-o(1)) \lambda \mathord {E}\mathord {\left( X_1\right) } \end{aligned}$$

along with the trivial upper bound \(\mathord {E}\mathord {\left( X^*\right) }\le \lambda \mathord {E}\mathord {\left( X_1\right) }\).

To show the second item, we note that

$$\begin{aligned} \mathord {E}\mathord {\left( X^*\right) }&=\sum _{\ell \ge 1} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge \ell \right) }\\&\ge \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge 1\right) }\\&= 1-\left( \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1 = 0\right) }\right) ^\lambda , \end{aligned}$$

since the maximum is at least 1 if it does not happen that all \(\lambda \) random variables evaluate to 0. Hence, using \(e^x\ge 1+x\) for \(x\in \mathbb {R}\) and \(e^{-x}\le 1/(1+x)\) for \(x>-1\), we have

$$\begin{aligned} \mathord {E}\mathord {\left( X^*\right) }&\ge 1-\left( \left( 1-\frac{c}{n}\right) ^k\right) ^\lambda \\&\ge 1-\left( \left( 1-\frac{c}{n}\right) ^{\frac{\alpha n}{c\lambda }}\right) ^\lambda \\&\ge 1-\left( e^{-\frac{\alpha }{\lambda }}\right) ^\lambda \\&= 1-e^{-\alpha }\\&\ge 1-\left( \frac{1}{1+\alpha }\right) \\&= \frac{\alpha }{1+\alpha }. \end{aligned}$$

To prove the third item, we will first show \(\mathord {E}\mathord {\left( X^*\right) }\le (1+o(1))\frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }\) and then \(\mathord {E}\mathord {\left( X^*\right) }\ge (1-o(1))\frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }\). We use that

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }&=1-\left( \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1<j\right) }\right) ^{\lambda }\\&= 1- (1-\mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1\ge j\right) })^\lambda . \end{aligned}$$

Furthermore, by using Lemma 3 we get

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1\ge j\right) }&\le \left( {\begin{array}{c}n/(\ln ^\alpha \lambda )\\ j\end{array}}\right) \left( \frac{c}{n}\right) ^{j}\\&\le \left( \frac{c}{\ln ^\alpha \lambda }\right) ^j\frac{1}{j!}\\&\le \left( \frac{ce}{j\ln ^\alpha \lambda }\right) ^j , \end{aligned}$$

where the last inequality uses \(j!\ge (j/e)^j\). Plugging this in our expression for \(\mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }\), we have

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }\le 1-\left( 1-\left( \frac{ce}{j\ln ^\alpha \lambda }\right) ^j\right) ^\lambda . \end{aligned}$$

The aim now is to inspect the last bound for \(j=\frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }+\beta \), where \(\beta \ge \delta (\lambda ):=4\frac{\ln \lambda (\ln \ln \ln \lambda +1+\ln c+\ln (1+\alpha ))}{(\ln \ln \lambda )^2}\) (the choice of \(\delta =\delta (\lambda )\) is to some extent arbitrary). In the following, we will implicitly assume that \(\beta \ge 0\), which holds for large enough \(\lambda \). We will estimate the inner term \((ce/(j\ln ^\alpha \lambda ))^j=e^{-j(\ln (j/(ce))+\alpha \ln \ln \lambda )}\) from above. We start with the first term in the last exponent and compute

$$\begin{aligned}&j \ln (j/(ce))\\&\quad = \left( \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }+\beta \right) \ln \left( \frac{\ln \lambda }{(1+\alpha )ce\ln \ln \lambda }+\frac{\beta }{ce}\right) \\&\quad \ge \left( \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }+\beta \right) (\ln \ln \lambda -\ln \ln \ln \lambda - \ln (ce(1+\alpha ))) \\&\quad \ge \frac{\ln \lambda }{1+\alpha } + \frac{\beta (\ln \ln \lambda )^2/2 - \ln \lambda (\ln \ln \ln \lambda +\ln (ce(1+\alpha )))}{\ln \ln \lambda }, \end{aligned}$$

where the last inequality assumes \(\ln \ln \lambda - \ln \ln \ln \lambda -\ln (ce(1+\alpha )) \ge (\ln \ln \lambda ) /2\), which holds for large enough \(\lambda \), and uses \(1+\alpha \ge 1\).

By definition of \(\delta \), we have \(\beta (\ln \ln \lambda )^2/2 - \ln \lambda (\ln \ln \ln \lambda +1+\ln c+ \ln (1+\alpha )) \ge \beta (\ln \ln \lambda )^2/4\), and therefore

$$\begin{aligned} j\ln (j/(ce)) \ge \frac{\ln \lambda }{1+\alpha } + \frac{\beta }{4} (\ln \ln \lambda ). \end{aligned}$$

Therefore, the whole exponent is bounded from below according to

$$\begin{aligned} j (\ln (j/(ce)) +(\alpha \ln \ln \lambda ))&\ge \frac{\ln \lambda }{1+\alpha } + \frac{\beta }{4} (\ln \ln \lambda ) \\&\quad +\alpha (\ln \ln \lambda )\left( \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }+\beta \right) \\&\ge \ln \lambda + \frac{\beta }{4} (\ln \ln \lambda ). \end{aligned}$$

Plugging this in our expression for \(\mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }\), we have for \(\beta ':=\beta /4\)

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }&\le 1-\left( 1-e^{-\ln \lambda - \beta '(\ln \ln \lambda )}\right) ^\lambda \\&= 1-\left( 1-\frac{1}{\lambda }e^{-\beta '(\ln \ln \lambda )}\right) ^{\frac{\lambda e^{\beta ' (\ln \ln \lambda )}}{2}\cdot \frac{2}{e^{\beta '(\ln \ln \lambda )}}}\\&\le 1-e^{-2e^{-\beta '(\ln \ln \lambda )}} \\&\le 1-(1-2e^{-\beta '(\ln \ln \lambda )}) = 2e^{-\beta '(\ln \ln \lambda )}, \end{aligned}$$

using first \((1-1/x)^{x/2}> e^{-1}\) for \(x\ge 2\) and then \(e^{-x}\ge 1-x\) for \(x\in \mathbb {R}\).

Now,

$$\begin{aligned} \mathord {E}\mathord {\left( X^*\right) }&= \sum _{j\ge 1} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) } \\&\le \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda } + \delta + \sum _{j \ge \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda } + \delta } \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }. \end{aligned}$$

The last sum is now split in sub-sums consisting of \(\delta \) terms each, where still \(\delta = 4\frac{\ln \lambda (\ln \ln \ln \lambda +1+\ln c+\ln (1+\alpha ))}{(\ln \ln \lambda )^2} \). We get

$$\begin{aligned} \sum _{j \ge \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda } + \delta }&\mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) } \\&\le \sum _{i=1}^\infty \delta \cdot \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda } + i \delta \right) }\\&\le \sum _{i=1}^\infty 2\delta e^{-i(\delta /4) (\ln \ln \lambda )} = o(1) \end{aligned}$$

according to our bound on \(\mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }\) and the geometric series. Altogether,

$$\begin{aligned} \mathord {E}\mathord {\left( X^*\right) } \le \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda } + \delta + o(1) = \frac{1+o(1)}{1+\alpha } \frac{\ln \lambda }{\ln \ln \lambda }, \end{aligned}$$

where the last step used that \(\delta = O((\ln \lambda \ln \ln \ln \lambda )/(\ln \ln \lambda )^2) = o(\ln \lambda /(\ln \ln \lambda ))\) for \(\alpha =O(1)\). This proves the upper bound on \(\mathord {E}\mathord {\left( X^*\right) }\) from the third item.

We are left with the lower bound. Note that for \(j\ge 1\),

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1\ge j\right) }&\ge \left( {\begin{array}{c}n/\ln ^\alpha \lambda \\ j\end{array}}\right) \left( \frac{c}{n}\right) ^j \left( 1-\frac{c}{n}\right) ^{n-j}\\&\ge \left( \frac{n}{j\ln ^\alpha \lambda }\right) ^j \left( \frac{c}{n}\right) ^{j} e^{-2c}\\&=e^{-2c}\left( \frac{c}{j\ln ^\alpha \lambda }\right) ^j, \end{aligned}$$

where the second inequality used that \((1-c/n)^{n-j} \ge (1-c/n)^{n} = (1-c/n)^{(n/c) c} \ge e^{-2c}\) for n large enough since \((1-1/x)^{x} \ge e^{-2}\) for \(x\ge 2\). Thus, we get

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*<j\right) }\le \left( 1-e^{-2c}\left( \frac{c}{j\ln ^\alpha \lambda } \right) ^j\right) ^\lambda . \end{aligned}$$

Similarly to the upper bound, we consider the last bound for \(j=\frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }-\beta \), where \(\beta =2\frac{\ln \lambda \ln (c(1+\alpha )\ln \ln \lambda )}{(\ln \ln \lambda )^2}\). We work with the representation \((c/(j\ln ^\alpha \lambda ))^j=e^{-j(\ln (j/c)+\alpha \ln \ln \lambda )}\). To bound the last exponent, we note that

$$\begin{aligned} j\ln (j/c)\le \frac{\ln \lambda }{1+\alpha }-\beta \ln \ln \lambda \, \end{aligned}$$

for \(\lambda \) large enough. Hence, \(j\ln (j/c)+j\alpha \ln \ln \lambda \le \ln \lambda -\beta \ln \ln \lambda \). We have in total

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*<j\right) }&\le \left( 1-e^{-2c}\left( \frac{c}{j\ln ^\alpha \lambda }\right) ^j\right) ^\lambda \\&\le \left( 1-e^{-\ln \lambda -2c+\beta \ln \ln \lambda }\right) ^\lambda \\&\le \left( 1-\frac{1}{\lambda e^{2c-\beta \ln \ln \lambda }}\right) ^{\lambda e^{2c-\beta \ln \ln \lambda } \cdot \frac{1}{e^{2c-\beta \ln \ln \lambda }}}\\&\le e^{-e^{-2c+\beta \ln \ln \lambda }}=o(1), \end{aligned}$$

where the second inequality used our bound on \(j(\ln (j/c)+\alpha \ln \ln \lambda )\). By the law of total probability, we have that for all j

$$\begin{aligned} \mathord {E}\mathord {\left( X^*\right) }&\ge \mathord {{{\mathrm{Pr}}}}\mathord {\left( X^*\ge j\right) }\mathord {E}\mathord {\left( X^*\mid X^*\ge j\right) }\\&\ge j(1-\mathord {{{\mathrm{Pr}}}}\mathord {\left( X^* < j\right) }). \end{aligned}$$

Plugging in \(j=\frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }-\beta \) immediately gives the desired bound

$$\begin{aligned} E(X^*)&\ge \left( \frac{\ln \lambda }{(1+\alpha )\ln \ln \lambda }-\beta \right) (1-o(1))\\&=\frac{1-o(1)}{1+\alpha }\frac{\ln \lambda }{\ln \ln \lambda }. \end{aligned}$$

\(\square \)
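The following small simulation (with parameters chosen ad hoc by us) estimates \(\mathord {E}\mathord {\left( X^*\right) }\) empirically for the setting of the third statement of Lemma 4; since the statement is asymptotic in \(\lambda \), the agreement with the leading term \(\frac{1}{1+\alpha }\frac{\ln \lambda }{\ln \ln \lambda }\) is only rough for moderate \(\lambda \).

```python
import math
import numpy as np

def empirical_max_order_statistic(n, lam, c, alpha, samples=1000, seed=1):
    """Monte Carlo estimate of E(X*), where X* is the maximum of lambda
    independent Bin(k, c/n) variables with k = n/(ln lambda)^alpha."""
    rng = np.random.default_rng(seed)
    k = int(n / math.log(lam) ** alpha)
    draws = rng.binomial(k, c / n, size=(samples, lam))
    return draws.max(axis=1).mean()

if __name__ == "__main__":
    n, lam, c, alpha = 10**6, 10**4, 1.0, 0.0
    est = empirical_max_order_statistic(n, lam, c, alpha)
    lead = math.log(lam) / ((1 + alpha) * math.log(math.log(lam)))
    print(f"empirical E(X*) = {est:.2f}, leading term = {lead:.2f}")
```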

The parameter k used in Lemma 4 above will correspond to the number of one-bits in the current individual (recall that the aim is to create an individual with all zeros since we are minimizing). Hence, the progress can be as big as \(X^*\), the maximum number of flipping one-bits in the \(\lambda \) trials creating offspring. However, it is not guaranteed that the offspring flipping most one-bits is accepted. For instance, it could flip even more zero-bits. Therefore, the following lemma will be used to estimate the probability of accepting the individual related to the number \(X^*\).

Lemma 5

Assume that an individual with k one-bits is mutated by flipping each bit independently with probability c / n, where \(c>0\) is a constant. The probability that no zero-bit flips in this mutation is at least

$$\begin{aligned} (1-o(1)) e^{-c+\frac{kc}{n}}. \end{aligned}$$

Moreover, the probability that the mutation creates an individual with at most k one-bits (i. e., at most the same OneMax-value) is at most

$$\begin{aligned} (1+o(1)) e^{-c+\frac{k(c+c^2)}{n}}. \end{aligned}$$

Proof

We start with the first claim. The probability of not flipping a zero-bit equals

$$\begin{aligned} \left( 1-\frac{c}{n}\right) ^{n-k} = \left( 1-\frac{c}{n}\right) ^{n} \left( 1-\frac{c}{n}\right) ^{-k} \ge (1-o(1)) e^{-c} e^{kc/n}, \end{aligned}$$

where we have used that c is a constant.

To create an individual with at most k one-bits, it is necessary that at least as many one-bits as zero-bits flip. By Lemma 3, the probability of flipping at least j one-bits is bounded from above by \(\left( {\begin{array}{c}k\\ j\end{array}}\right) \left( \frac{c}{n}\right) ^j\), and the probability of flipping exactly j zero-bits equals \(\left( {\begin{array}{c}n-k\\ j\end{array}}\right) \left( \frac{c}{n}\right) ^j\left( 1-\frac{c}{n}\right) ^{n-k-j}\). Hence, the probability of obtaining at most k one-bits is at most

$$\begin{aligned}&\sum _{j=0}^{\max \{k,n-k\}} \left( {\begin{array}{c}k\\ j\end{array}}\right) \left( {\begin{array}{c}n-k\\ j\end{array}}\right) \left( \frac{c}{n}\right) ^{2j} \left( 1-\frac{c}{n}\right) ^{n-k-j} \\&\quad \le \left( 1-\frac{c}{n}\right) ^{n-k} \left( \sum _{j=0}^{\ln n} \left( {\begin{array}{c}k\\ j\end{array}}\right) \left( {\begin{array}{c}n-k\\ j\end{array}}\right) \left( \frac{c}{n}\right) ^{2j} \left( 1-\frac{c}{n}\right) ^{-\ln n}\right. \\&\qquad \left. +\,e^{-\varOmega (\ln n\ln \ln n)}\right) , \end{aligned}$$

where we used that the probability of flipping at least \(\ln n\) bits is at most

$$\begin{aligned} \left( {\begin{array}{c}n\\ \ln n\end{array}}\right) \left( \frac{c}{n}\right) ^{\ln n} \le \frac{c^{\ln n}}{(\ln n)!} = e^{-\varOmega (\ln n\ln \ln n)} \end{aligned}$$

for any mutation probability c / n. Using \(\left( {\begin{array}{c}a\\ b\end{array}}\right) \le \frac{a^b}{b!}\), \((1-c/n)^{-\ln (n)} = 1+o(1)\) and \(\left( 1-\frac{c}{n}\right) ^{n-k}\le e^{-c+kc/n}\) and subsuming lower order terms into a o(1)-term, our bound is at most

$$\begin{aligned}&(1+o(1)) e^{-c+kc/n} \left( \sum _{j=0}^{\ln n} \frac{k^j (n-k)^j}{j!\cdot j!} \left( \frac{c^2}{n^2}\right) ^{j}\right) \\&\quad \le (1+o(1)) e^{-c+kc/n} \left( \sum _{j=0}^{\infty } \frac{(c^2(k/n)((n-k)/n))^j}{j!}\right) \\&\quad = (1+o(1)) e^{-c+kc/n} e^{c^2 \frac{k}{n}\frac{n-k}{n}} \le (1+o(1)) e^{-c+\frac{k}{n}(c+c^2)}, \end{aligned}$$

where the last inequality used \((n-k)/n\le 1\). \(\square \)
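The first bound of Lemma 5 is easy to check empirically; the following sketch (with parameter values of our own choosing) estimates the probability that a mutation of an individual with k one-bits flips none of the \(n-k\) zero-bits and compares it with \(e^{-c+kc/n}\).

```python
import math
import numpy as np

def prob_no_zero_bit_flips(n, k, c, samples=200_000, seed=2):
    """Monte Carlo estimate of the probability that standard bit mutation
    with probability c/n flips none of the n-k zero-bits."""
    rng = np.random.default_rng(seed)
    flipped_zero_bits = rng.binomial(n - k, c / n, size=samples)
    return float(np.mean(flipped_zero_bits == 0))

if __name__ == "__main__":
    n, k, c = 1000, 100, 1.0
    print(prob_no_zero_bit_flips(n, k, c), math.exp(-c + k * c / n))
    # both roughly 0.407
```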

Finally, for a similar purpose as described before Lemma 5, we will need the following lemma to correct the progress made by the individual flipping the most one-bits by the number of flipping zero-bits. The statement is given in a general framework without appealing to a particular distribution; the notation \(\succ \) denotes stochastic dominance.

Lemma 6

Let \(X_1,\ldots ,X_\lambda \) and \(Y_1,\ldots ,Y_\lambda \) be two sequences of independent, identically distributed random variables. Let \(Z_i:=X_i-Y_i\) for \(i\in \{1,\ldots ,\lambda \}\) and \(Z^*:=\max \{Z_i\mid i\in \{1,\ldots ,\lambda \}\}\). Then

$$\begin{aligned} Z^*\succ \max \{X_i\mid i\in \{1,\ldots ,\lambda \}\} - Y_1, \end{aligned}$$

where the index 1 without loss of generality refers to any of the \(Y_i\).

Proof

We show the lemma only for \(\lambda =2\); the general case follows in the same way. We abbreviate \(\tilde{Z}^*:=\max \{X_i\mid i\in \{1,\ldots ,\lambda \}\} - Y_1\) and note that we have to show

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( Z^*\ge x\right) } \ge \mathord {{{\mathrm{Pr}}}}\mathord {\left( \tilde{Z}^*\ge x\right) } \end{aligned}$$

for all \(x\in \mathbb {R}\). By definition and using independence

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( Z^*\ge x\right) } = 1-\left( \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1-Y_1<x\right) } \cdot \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_2-Y_2<x\right) }\right) \end{aligned}$$

and, accounting for the dependence introduced by the shared \(Y_1\),

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( \tilde{Z}^*\ge x\right) }&= 1-\big (\mathord {{{\mathrm{Pr}}}}\mathord {\left( X_1-Y_1<x\right) }\\&\qquad \times \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_2-Y_1<x \mid X_1-Y_1<x\right) }\big ). \end{aligned}$$

The lemma follows if we can show that

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_2-Y_2<x\right) } \le \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_2-Y_1<x \mid X_1-Y_1<x\right) }. \end{aligned}$$

To see this, note that \(Y_1\sim Y_2\) without any conditioning. The condition is \(Y_1>X_1-x\), so given this lower bound the probability of also \(Y_1>X_2-x\) happening only becomes larger. \(\square \)

3 Analysis of (\(1+\lambda \)) EA on OneMax

The following theorem is our main result.

Theorem 7

Suppose \(\lambda \le n^r\) for some natural number r and let T denote the optimization time (i. e., number of generations) of the (\(1+\lambda \)) EA with mutation probability c / n, where \(c>0\) is a constant, on OneMax. Then

$$\begin{aligned} \mathord {E}\mathord {\left( T\right) } = (1 \pm o(1)) \left( \frac{e^{c}}{c} \cdot \frac{n\ln n}{\lambda } + \frac{1}{2}\cdot \frac{n\ln \ln \lambda }{\ln \lambda }\right) . \end{aligned}$$

Before we prove the theorem, we discuss its implications. If we choose \(\lambda =o(\ln n\ln \ln n/\ln \ln \ln n)\), the first term is asymptotically largest and we observe the linear speed-up by a factor of \(\lambda \). Since \(\frac{e^c}{c}\) is minimized for \(c=1\), the optimal mutation probability (minimizing the expected optimization time) is 1 / n. Mutation probabilities that are larger or smaller than 1 / n by a constant factor are not optimal.

If \(\lambda =\omega (\ln n\ln \ln n/\ln \ln \ln n)\), i. e., is above the cut-off point for speed-up, the second term from the bound becomes the leading term and it turns out that choosing c arbitrarily small or large (but constant) is asymptotically no worse than the standard choice \(c=1\). A large population makes the algorithm more robust with respect to the choice of the mutation probability.
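The following sketch (a purely illustrative evaluation with parameter values of our own choosing; lower order terms are ignored) makes this concrete by evaluating the leading term of Theorem 7 for several combinations of c and \(\lambda \): below the cut-off the factor \(e^c/c\) clearly rewards \(c=1\), whereas far above the cut-off the second term dominates and the choice of c matters much less.

```python
import math

def leading_term(n, lam, c):
    """Leading term of E(T) from Theorem 7:
    (e^c/c) * n ln n / lambda + (1/2) * n lnln(lambda) / ln(lambda)."""
    first = (math.exp(c) / c) * n * math.log(n) / lam
    # The second term is only meaningful once lnln(lambda) > 0, i.e., lambda > e.
    second = 0.5 * n * math.log(math.log(lam)) / math.log(lam) if lam > math.e else 0.0
    return first + second

if __name__ == "__main__":
    n = 10**6
    for lam in (2, 16, 1024):
        for c in (0.5, 1.0, 2.0):
            print(f"lambda={lam:5d} c={c:3.1f} leading term ~ {leading_term(n, lam, c):.3e}")
```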

Proof of Theorem 7

Let \(X_t\) denote the number of one-bits of the current search point of the (\(1+\lambda \)) EA at generation t and \(\varDelta _t:=X_t-X_{t+1}\). Recall that the aim is to reach an \(X_t\)-value of 0.

Upper Bound: In order to show the upper bound we will work out a lower bound on the drift. To this end, in some regions we only take into account steps of the process where progress towards the optimum comes from flipping one-bits only, i. e., for some regions we condition the process on the event that no zero-bit flips in the offspring with the largest number of flipping one-bits. According to Lemma 5, this probability is at least \((1-o(1))e^{-c+cX_t/n}\ge (1-o(1))e^{-c}\). Conditioning on this event, which concerns the mutation of the zero-bits and is independent of the mutation of the one-bits, we can estimate the drift for different regimes w. r. t. \(X_t\).

We first consider the case \(\lambda =\omega (1)\). We have that \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) }\ge h(X_t)\), where

$$\begin{aligned} h(X_t):={\left\{ \begin{array}{ll} (1-o(1)) \frac{\ln \lambda }{\ln \ln \lambda } &{}\text {if }X_t\ge \frac{n}{(\ln \lambda )^{\frac{1}{\ln \ln \ln \lambda }}}\\ (1/2-o(1)) e^{-c}\frac{\ln \lambda }{\ln \ln \lambda } &{}\text {if }X_t\ge \frac{n}{\ln \lambda }\\ (1-o(1)) {e^{-c}} \min \{c,1\}/2 &{}\text {if }X_t\ge \frac{n}{\lambda }\\ (1-o(1)) e^{-c}\frac{c}{\sqrt{\ln n}} &{}\text {if }X_t\ge \frac{n}{\lambda \sqrt{\ln n}}\\ (1-o(1))ce^{-c}\lambda X_t/n &{}\text {if }X_t<\frac{n}{\lambda \sqrt{\ln n}}. \end{array}\right. } \end{aligned}$$

The bound on the drift for the first regime can be obtained from \(\mathord {E}\mathord {\left( \varDelta _t \mid X_t\right) }\ge (1-o(1))\mathord {E}\mathord {\left( X^*\right) }\), where \(X^*\) denotes the maximum number of flipping ones over \(\lambda \) offspring at generation t, which can be shown as follows. Let \(Y_i\), where \(1\le i\le \lambda \), denote the progress of the i-th offspring of a parent of fitness \(k=X_t\) in one step if this progress is positive; otherwise \(Y_i=0\) since only offspring of non-negative progress can be selected. Then, \(Y_i\sim \max \{0,Y_i^1-Y_i^0\}\), where \(Y_i^0\sim {{\mathrm{Bin}}}(n-k,c/n)\) and \(Y_i^1\sim {{\mathrm{Bin}}}(k,c/n)\) denote the random number of zeros and ones flipped by mutation, respectively. Clearly, \(Y_i \succ Y_i^1-Y_i^0\) and \(\max \{Y_i\,\mid \,1\le i\le \lambda \}\succ \max \{Y_i^1-Y_i^0\,\mid \,1\le i\le \lambda \}\). From Lemma 6 it follows that

$$\begin{aligned} \varDelta _t = \max \{Y_i\,\mid \,1\le i\le \lambda \} \succ \max \{Y_i^1\,\mid \,1\le i\le \lambda \}-Y_1^0, \end{aligned}$$

where the index 1 is without loss of generality and refers to an arbitrary one of the \(\lambda \) offspring. Since \(\max \{Y_i^1\,\mid \,1\le i\le \lambda \} = X^*\), by linearity of expectation and \(\mathord {E}\mathord {\left( Y_1^0\right) }\le c\), this yields \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) }\ge \mathord {E}\mathord {\left( X^*\right) }-c=(1-o(1))\mathord {E}\mathord {\left( X^*\right) }\) if \(\mathord {E}\mathord {\left( X^*\right) }=\omega (1)\). Applying the third statement of Lemma 4 with \(\alpha =(\ln \ln \ln \lambda )^{-1}\), recalling that \(\lambda =\omega (1)\), provides us with the desired bound. Note that this argument is specific to the first regime. For all the other regimes, we instead condition on the event that the offspring flipping the most one-bits does not flip any zero-bits, introducing the aforementioned factor of \((1-o(1))e^{-c}\) for each bound.

The second bound on the drift is obtained by applying the third statement of Lemma 4 with \(\alpha =1\).

The third bound on the drift stems from the second statement of Lemma 4 with \(\alpha =\min \{c,1\}\).

The fourth regime’s bound on the drift stems from the second statement of Lemma 4 with \(\alpha =c/\sqrt{\ln n}\). Finally, the last bound stems from the first statement of Lemma 4. Note also that h is monotone increasing w. r. t.  \(X_t\) (choosing the smallest \(1-o(1)\) function from the five regimes and using \(\lambda =\omega (1)\)) and that its reciprocal is integrable on \(\mathbb {R}^+\).

We have established bounds on the drift. Since the (\(1+\lambda \)) EA clearly optimizes OneMax in expected finite time (as each generation has a positive probability of creating the optimum), we can apply Theorem 1 with minimal distance \(x_{\min }=1\) and get

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) }\le \frac{1}{h(1)}+\int _{1}^{X_0}\frac{1}{h(x)}\,\mathrm {d}x. \end{aligned}$$

The first term is

$$\begin{aligned} \frac{1}{h(1)}=\frac{1}{(1-o(1))\lambda ce^{-c} \frac{1}{n}} =(1+o(1))\frac{e^cn}{\lambda c}, \end{aligned}$$

which is a lower order term in the total expected optimization time as we will see in the following.

We will now examine each regime’s contribution to the expected optimization time by splitting up the integral at the corresponding bounds. Due to our definition of h the first four regimes’ contribution can be easily computed since they do not depend on \(X_t\).

For the first regime, we first determine \(X_0\). Using a Chernoff bound, the probability that the initial individual is created within \([n/2-n^{2/3}, n/2+n^{2/3}]\) is at least \(1-2e^{-\frac{2n^{1/3}}{3}}\), so we will condition on this in the following, introducing a small error probability absorbed by the o(1)-term of the theorem.

The first regime contributes a term of at most

$$\begin{aligned} \int _{\frac{n}{(\ln \lambda )^{\frac{1}{\ln \ln \ln \lambda }}}}^{\frac{n}{2}+n^{\frac{2}{3}}}&\frac{\mathrm {d}x}{(1-o(1))\frac{\ln \lambda }{\ln \ln \lambda }}\\&\le (1+o(1))\frac{\left( \frac{n}{2}+n^{\frac{2}{3}}\right) \ln \ln \lambda }{\ln \lambda }\\&\le (1+o(1))\frac{1}{2}\frac{n\ln \ln \lambda }{\ln \lambda }. \end{aligned}$$

The second regime contributes a term of at most

$$\begin{aligned} \int _{\frac{n}{\ln \lambda }}^{\frac{n}{(\ln \lambda )^{\frac{1}{\ln \ln \ln \lambda }}}}&\frac{\mathrm {d}x}{(\frac{1}{2}-o(1))e^{-c}\frac{\ln \lambda }{\ln \ln \lambda }}\\&\le (2+o(1))\frac{e^cn\ln \ln \lambda }{(\ln \lambda )^{1+\frac{1}{\ln \ln \ln \lambda }}}. \end{aligned}$$

This term is clearly asymptotically smaller than the first, due to the denominator’s exponent.

The third regime contributes a term of at most

$$\begin{aligned} \int _{\frac{n}{\lambda }}^{\frac{n}{\ln \lambda }} \frac{\mathrm {d}x}{(1-o(1)) e^{-c} \min \{c/2,1/2\}} \le \frac{\left( 1+o(1)\right) 2e^{c} n}{\min \{c,1\}\ln \lambda }, \end{aligned}$$

and is thus asymptotically smaller than the first term as well.

The fourth regime contributes a term of at most

$$\begin{aligned} \int _{\frac{n}{\lambda \sqrt{\ln n}}}^{\frac{n}{\lambda }}\frac{\mathrm {d}x}{(1-o(1))e^{-c}\frac{c}{\sqrt{\ln n}}} \le (1+o(1))\frac{e^{c}n\sqrt{\ln n}}{c\lambda }, \end{aligned}$$

which is dominated by the following term.

The fifth regime contributes a term of at most

$$\begin{aligned} \int _{1}^{\frac{n}{\lambda \sqrt{\ln n}}}\frac{e^cn}{(1-o(1))\lambda cx}\,\mathrm {d}x&\le (1+o(1))\frac{e^c n\ln n}{c\lambda }. \end{aligned}$$

Summing up the individual contributions we end up with

$$\begin{aligned} \mathord {E}\mathord {\left( T\right) }\le (1+o(1))\left( \frac{e^{c}}{c}\cdot \frac{n\ln n}{\lambda }+\frac{1}{2}\cdot \frac{n\ln \ln \lambda }{\ln \lambda }\right) , \end{aligned}$$

since the unconditional error introduced by the conditioning on the initialization is of lower order. Note that this holds for the case of \(\lambda =\omega (1)\).

In the case of \(\lambda =O(1)\), the first four regimes can be subsumed into one; using the same arguments as in the previous case we have \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) }\ge \hat{h}(X_t)\), where

$$\begin{aligned} \hat{h}(X_t):= {\left\{ \begin{array}{ll} (1-o(1)) e^{-c}\frac{c}{\sqrt{\ln n}} &{}\text { if }X_t\ge \frac{n}{\lambda \sqrt{\ln n}}\\ (1-o(1))ce^{-c}\lambda X_t/n &{}\text { if }X_t<\frac{n}{\lambda \sqrt{\ln n}}. \end{array}\right. } \end{aligned}$$

The first regime contributes a term of at most

$$\begin{aligned} \int _{\frac{n}{\lambda \sqrt{\ln n}}}^{\frac{n}{2}+n^\frac{2}{3}} \frac{\mathrm {d}x}{(1-o(1))e^{-c}\frac{c}{\sqrt{\ln n}}} \le \left( \frac{1}{2}+o(1)\right) \frac{e^cn\sqrt{\ln n}}{c}, \end{aligned}$$

which is clearly dominated by the second regime's contribution; the latter is the same as in the previous case by reusing the same arguments. Hence, for \(\lambda =O(1)\), we have \(\mathord {E}\mathord {\left( T\right) }\le (1+o(1))\left( (e^{c}/c)\cdot \frac{n\ln n}{\lambda }\right) \).

Combining both cases, i. e. for all \(\lambda \le n^r\), we end up with the desired bound of

$$\begin{aligned} \mathord {E}\mathord {\left( T\right) }\le (1+o(1))\left( \frac{e^{c}}{c}\cdot \frac{n\ln n}{\lambda }+\frac{1}{2}\cdot \frac{n\ln \ln \lambda }{\ln \lambda }\right) . \end{aligned}$$

Lower Bound: Using Theorem 2, we will now show the matching lower bound. Theorem 2 requires a monotone process and a function \(\xi \) bounding the progress towards the optimum. We define \(\xi :\mathbb {R}^+\rightarrow \mathbb {R}^+,x\mapsto x-\log _2 x-1\).

Note that \(\mathord {{{\mathrm{Pr}}}}\mathord {\left( X_{t+1} < \xi (X_t)\right) }\le \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_{t+1}\le \xi (X_t)\right) }\). Due to our assumption that \(\lambda \le n^r\) and using Lemma 3, we have

$$\begin{aligned} \mathord {{{\mathrm{Pr}}}}\mathord {\left( X_{t+1}\le \xi (X_t)\right) }&\le \lambda \left( {\begin{array}{c}X_t\\ \lceil \log _2 X_t+1\rceil \end{array}}\right) \left( \frac{c}{n}\right) ^{\log _2(X_t)+1}\\&\le n^r\left( \frac{ecX_t}{n\lceil \log _2(X_t)+1\rceil }\right) ^{\log _2(X_t)+1}. \end{aligned}$$

Setting \(x_{\min }:=2^r\), the last term assumes its maximum at \(X_t=x_{\min }\) within \([x_{\min },\ldots ,n]\) for sufficiently large n and takes the value \(r'/n^{r+1}\), where \(r'=((ecx_{\min })/(r+1))^{r+1}\). We condition on the event that \(X_{t+1}\ge \xi (X_t)\) holds for the at most \(O(n\ln n)\) generations we consider; by a union bound this introduces an error of o(1). Note also that this condition only decreases the drift.

We will now distinguish between the cases \(\lambda <\ln ^2 n\) and \(\lambda \ge \ln ^2 n\). In the latter case, \(\lambda \) is above the cut-off point. Then the bound on the drift will be relatively simple and use only a single expression, as detailed later. We first thoroughly analyze the case of \(\lambda < \ln ^2 n\), where we have to be more careful as this includes the cut-off point. We claim that \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) }\le h^*(X_t)\), where

$$\begin{aligned} h^*(X_t):={\left\{ \begin{array}{ll} (1+o(1))\frac{\ln \lambda }{\ln \ln \lambda } &{}\text { if }X_t\ge \frac{n}{\lambda \sqrt{\ln n}}\\ (1+o(1))ce^{-c}\lambda X_t/n &{}\text { if }X_t<\frac{n}{\lambda \sqrt{\ln n}}. \end{array}\right. } \end{aligned}$$

To show that \(h^*(X_t)\) indeed is a bound on the drift, we look into the two regions. In the region \(X_t\ge \frac{n}{\lambda \sqrt{\ln n}}\), we simply estimate the drift from above by the number of flipping one-bits. Using the third statement of Lemma 4 with \(\alpha =0\) and observing that the number of flipping one-bits is maximized at \(X_t=n\), we obtain the claimed bound \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) } \le (1+o(1))\frac{\ln \lambda }{\ln \ln \lambda }\).

In the region \(X_t<\frac{n}{\lambda \sqrt{\ln n}}\), the function \(h^*(X_t)\) contains the factor \(e^{-c}\), which has to be explained carefully. Here we will distinguish between two cases. Let F be the event that at least one of the \(\lambda \) offspring flips at least two one-bits. By the law of total probability

$$\begin{aligned} \mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) } = \mathord {E}\mathord {\left( \varDelta _t\mid X_t; F\right) } \cdot {{\mathrm{Pr}}}(F) + \mathord {E}\mathord {\left( \varDelta _t\mid X_t; \overline{F}\right) } \cdot {{\mathrm{Pr}}}(\overline{F}). \end{aligned}$$

Using our assumption on \(X_t\) and taking a union bound over \(\lambda \) offspring, we get \({{\mathrm{Pr}}}(F)\le \lambda \left( {\begin{array}{c}X_t\\ 2\end{array}}\right) (c/n)^2 \le \lambda X_t (n/(\lambda \sqrt{\ln n})) (c/n)^2 = O(X_t/(n\sqrt{\ln n}))\). Furthermore, we have \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t; F\right) } \le 2 + \lambda X_t c/n= 2 + o(1)\). This holds since after 2 bits have been flipped, the remaining bits flip with probability at most c / n each. The total expected number of one-bits flipped in addition to the two bits, counted over all offspring, is at most \(\lambda X_t c/n\). Hence,

$$\begin{aligned} \mathord {E}\mathord {\left( \varDelta _t\mid X_t; F\right) }\cdot {{\mathrm{Pr}}}(F) = (2+o(1)) O(X_t/(n\sqrt{\ln n})) = O(X_t/(n\sqrt{\ln n})). \end{aligned}$$

We now bound the drift on the event \(\overline{F}\). Let \(Y_i\), where \(1\le i\le \lambda \), denote the progress (decrease of OneMax-value) of the i-th offspring of a parent of fitness \(X_t\) in one step, maximized with 0 since only offspring of non-negative progress can be selected. Then, \(Y_i\sim \max \{0,Y_i^1-Y_i^0\}\), where \(Y_i^0\sim {{\mathrm{Bin}}}(n-X_t,c/n)\) and \(Y_i^1\sim {{\mathrm{Bin}}}(X_t,c/n)\). On \(\overline{F}\), we have \(Y_i=0\) for those individuals where \(Y_i^0>0\). Let \(Z_i\) be the event that offspring i does not flip any zero-bits, i. e., \(Y^0_i = 0\). Hence, by the law of total probability, \(Y_i=Y_i^1 \cdot {{\mathrm{Pr}}}(Z_i)\), and therefore,

$$\begin{aligned}&\mathord {E}\mathord {\left( \varDelta _t\mid X_t; \overline{F}\right) } \\&\quad \le {{\mathrm{Pr}}}(Z_i) \mathord {E}\mathord {\left( \max _{i=1,\ldots ,\lambda } Y_i^1 \mid X_t;\overline{F}\right) }. \end{aligned}$$

Now,

$$\begin{aligned}{{\mathrm{Pr}}}(Z_i) = \left( 1-\frac{c}{n}\right) ^{n-X_t} \le e^{-(c/n)(n-o(n))} = (1+o(1)) e^{-c}\end{aligned}$$

by our assumption on \(X_t\), and

$$\begin{aligned} \mathord {E}\mathord {\left( \max _{i=1,\ldots ,\lambda } Y_i^1 \mid X_t;\overline{F}\right) } \le \lambda X_t c/n,\end{aligned}$$

using the first statement of Lemma 4 with \(k=X_t\) and noting that \(\overline{F}\) only reduces the number of flipping one-bits. Taking everything together, we obtain

$$\begin{aligned} \mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) } = (1+o(1)) e^{-c} \lambda X_t c/n + O(X_t/(n\sqrt{\ln n})) = (1+o(1)) \lambda c e^{-c} X_t /n \end{aligned}$$

in the region \(X_t<\frac{n}{\lambda \sqrt{\ln n}}\). This completes the proof of the claim \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) }\le h^*(X_t)\).

Note that \(h^*\) is increasing. Note also that \(\xi \) is strictly increasing on \([1/\ln 2, \infty )\), hence \(\xi ^{-1}\) exists on the corresponding domain and it is increasing as well. More precisely, \(\xi ^{-1}:[(1+\ln \ln 2)/\ln 2-1, \infty )\rightarrow [1/\ln 2, \infty )\). We can define \(\tilde{h}:=h^*\circ \xi ^{-1}\). By applying Theorem 2 we obtain

$$\begin{aligned} \mathord {E}\mathord {\left( T\mid X_0\right) }&\ge \frac{x_{\min }}{\tilde{h}(x_{\min })} + \int _{x_{\min }}^{X_0} \frac{1}{\tilde{h}(u)} \,\mathrm {d}u\\&=\frac{x_{\min }}{h^*(\xi ^{-1}(x_{\min }))} +\int _{\xi ^{-1}(x_{\min })}^{\xi ^{-1}(X_0)} \frac{\left( 1-\frac{1}{x}\right) }{h^*(x)} \,\mathrm {d}x. \end{aligned}$$

The last equality is due to integration by substitution and due to \((\xi ^{-1})'=(\xi ')^{-1}\) which holds for \(\xi \) on \([1/\ln 2, \infty )\). Hereinafter, we abbreviate \(h^{**}(x):=h^*(x)/\left( 1-\frac{1}{x}\right) \).

Due to the definition of \(x_{\min }\) the first term is

$$\begin{aligned} \frac{2^r}{\tilde{h}(2^r)}=\frac{2^r}{(1+o(1))\lambda ce^{-c} \frac{2\cdot 2^r}{n}} =(1-o(1))\frac{e^cn}{2\lambda c}, \end{aligned}$$

where we used \(x\le 2(x-1)-\log _2(x)=\xi (2x)\) for \(x\ge 4\) and thus \(\xi ^{-1}(x)\le 2x\) for \(x\ge 4\) due to the monotonicity of \(\xi ^{-1}\). This is a lower order term as we will see in the following.

Regarding the integral, we again split it into the regimes on which we bounded the drift, in order to analyze these individually. We get

$$\begin{aligned} \int _{\xi ^{-1}(x_{\min })}^{\xi ^{-1}(X_0)} \frac{\mathrm {d}x}{h^{**}(x)}&=\int _{\xi ^{-1}(x_{\min })}^{\frac{n}{\lambda \sqrt{\ln n}}} \frac{\mathrm {d}x}{h^{**}(x)} +\int _{\frac{n}{\lambda \sqrt{\ln n}}}^{\xi ^{-1}(X_0)} \frac{\mathrm {d}x}{h^{**}(x)}. \end{aligned}$$

We start with the first integral. We have

$$\begin{aligned} \int _{\xi ^{-1}(x_{\min })}^{\frac{n}{\lambda \sqrt{\ln n}}} \frac{1}{h^{**}(x)} \,\mathrm {d}x&\ge \int _{2x_{\min }}^{\frac{n}{\lambda \sqrt{\ln n}}} \frac{n\left( 1-\frac{1}{x}\right) }{(1+o(1))ce^{-c}\lambda x}\,\mathrm {d}x\\&=\frac{e^cn}{(1+o(1))c\lambda }\int _{2x_{\min }}^{\frac{n}{\lambda \sqrt{\ln n}}} \left( \frac{1}{x}-\frac{1}{x^2}\right) \,\mathrm {d}x\\&=\frac{e^{c}n}{(1+o(1))c\lambda }\bigg (\ln n-\ln \lambda -\frac{\ln \ln n}{2}\\&\quad -\ln (2x_{\min })+\frac{\lambda \sqrt{\ln n}}{n}-\frac{1}{2x_{\min }}\bigg )\\&\ge (1-o(1))\frac{e^{c}n\ln n}{c\lambda } \end{aligned}$$

since \(\lambda <\ln ^2 n\).

Similarly to the proof of the upper bound, a Chernoff bound shows that the algorithm initializes with high probability within \([n/2-n^{2/3},n/2+n^{2/3}]\), i. e., in particular \(X_0 \ge n/2-n^{2/3}\). In the following we condition on this event, introducing only a small error absorbed by the o(1)-term.

Now, for the second integral it holds that

$$\begin{aligned} \int _{\frac{n}{\lambda \sqrt{\ln n}}}^{\xi ^{-1}(X_0)}\frac{1}{h^{**}(x)} \,\mathrm {d}x&\ge \int _{\frac{n}{\lambda \sqrt{\ln n}}}^{X_0} \frac{\ln \ln \lambda \left( 1-\frac{1}{x}\right) }{(1+o(1))\ln \lambda }\,\mathrm {d}x\\&\ge \frac{\ln \ln \lambda }{(1+o(1))\ln \lambda }\left( \frac{n}{2}-n^{\frac{2}{3}}-\frac{n}{\lambda \sqrt{\ln n}}\right) \\&\quad -\frac{\ln \ln \lambda }{(1+o(1))\ln \lambda } \left( \ln \left( \frac{\frac{n}{2}-n^{\frac{2}{3}}}{\frac{n}{\lambda \sqrt{\ln n}}}\right) \right) \\&\ge \frac{\ln \ln \lambda }{(1+o(1))\ln \lambda }\left( \frac{(1-o(1))n}{2}\right) \\&\ge (1-o(1))\frac{1}{2}\frac{n\ln \ln \lambda }{\ln \lambda }, \end{aligned}$$

where the first inequality follows from \(\xi ^{-1}(x)\ge x\) and the third inequality is due to the subtracted summand being a lower order term.

If \(\lambda \ge \ln ^2 n\), we have that \(\mathord {E}\mathord {\left( \varDelta _t\mid X_t\right) }\le \tilde{h}^*(X_t)\), where

$$\begin{aligned} \tilde{h}^*(X_t):=(1+o(1)) \frac{\ln \lambda }{\ln \ln \lambda } \end{aligned}$$

for all \(X_t\ge 0\). Reusing the analysis of the contribution of the first region in \(h^*(X_t)\) above, we get the lower bound

$$\begin{aligned} (1-o(1))\frac{1}{2}\frac{n\ln \ln \lambda }{\ln \lambda } \end{aligned}$$

in this case, which is the leading term in the bound from the theorem since \(\lambda \) is above the cut-off point.

Now we can give a bound on the expected runtime for all \(\lambda \le n^r\). Summing up the individual contributions, we end up with

$$\begin{aligned} \mathord {E}\mathord {\left( T\right) }\ge (1-o(1))\left( \frac{e^{c}}{c}\cdot \frac{n\ln n}{\lambda }+\frac{1}{2}\cdot \frac{n\ln \ln \lambda }{\ln \lambda }\right) \,, \end{aligned}$$

since the unconditional error introduced by the conditioning is of lower order. \(\square \)

4 Experiments

While our analysis of the (\(1+\lambda \)) EA is tight up to lower order terms, it is still asymptotic in nature. Since the bound given in Theorem 7 does not capture the effect of the lower order terms on the expected runtime, some phenomena might be obscured for practically relevant values of n.

Hence, we performed experiments in order to illustrate the effect of c on the runtime for small and moderate problem sizes. We implemented the (\(1+\lambda \)) EA in C using the GNU Scientific Library for the generation of pseudo-random numbers.

Fig. 1 Empirical number of generations needed to optimize OneMax, averaged over 50000 runs of the \((1+5)\) EA for \(n\in \{5,10,\ldots ,95\}\) (left) and \(n\in \{100,150,200,300,400,500,1000,2000,5000\}\) (right)

Fig. 2 Number of generations needed to optimize OneMax as a box plot over 50000 runs for each value of c from 1.0 to 1.5 (step size 0.02) for \(n = 1000\) (lower boxes) and \(n=5000\) (upper boxes). The outliers are not displayed

The results are displayed in Fig. 1. We used an offspring population size of \(\lambda = 5\) for all experiments. Both diagrams show the number of generations that the \((1+5)\) EA with mutation probability c / n needed to optimize OneMax as a function of the mutation parameter c. The optimization times are averaged over 50000 independent runs for problem size n, where \(n\in \{5,10,\ldots ,95\}\) in the left diagram (resp. \(n\in \{100,150,200,300,400,500,1000,2000,5000\}\) in the right diagram). The parameter c takes values in the interval [1.0, 2.0] (resp. [1.0, 1.5] in the right diagram) with step size 0.02. For each n, the value for c that minimizes the empirical optimization time is marked.
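A compact Python re-implementation sketch of this setup (scaled down in the number of runs; the original experiments used C with the GNU Scientific Library and 50000 runs per data point) could look as follows; all numerical choices below are ours.

```python
import numpy as np

def avg_generations(n, lam, c, runs=200, seed=3):
    """Average number of generations of the (1+lambda) EA with mutation
    probability c/n on OneMax (minimization) over independent runs."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(runs):
        parent = rng.integers(0, 2, size=n)
        fitness = parent.sum()
        gens = 0
        while fitness > 0:
            gens += 1
            # lam offspring: flip each bit independently with probability c/n.
            masks = rng.random((lam, n)) < c / n
            children = np.bitwise_xor(parent, masks)
            values = children.sum(axis=1)
            best = values.argmin()
            if values[best] <= fitness:
                parent, fitness = children[best], values[best]
        total += gens
    return total / runs

if __name__ == "__main__":
    for c in (1.0, 1.2, 1.4, 1.6):
        print(f"c = {c:.1f}: avg generations over 200 runs = {avg_generations(100, 5, c):.1f}")
```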

The left diagram indicates that the optimal value for c varies around 1.3 for small problem sizes. This variation is due to the variance of the runs and the fact that for the considered values of n, similar optimization times are attained for a broad interval of c values, as can be seen in the plot. For example, for \(n=100\) the observed standard deviation of the optimization time is 67.62 for \(c=1\) (mean 231.54) and 162.97 for \(c=1.5\) (mean 412.98). For \(n=5000\) the observed standard deviation of the optimization time is 3482.51 for \(c=1\) (mean 22126.00) and 8555.14 for \(c=1.5\) (mean 45416.49). To illustrate the high variances observed, the optimization times for \(n=1000\) and \(n=5000\) are displayed as a box plot in Fig. 2.

It is not unexpected that the optimal value of c is higher than 1 for small values of n. This behaviour has already been observed for the (1 \(+\) 1) EA on OneMax for small problem sizes [4]. Furthermore, we can see that the slopes around the optimal value of c get steeper for higher values of n. This can be explained by the factor \(e^c/c\) in the leading term of the expected runtime, which grows exponentially in c and whose influence becomes more pronounced for larger values of n.

The empirically optimal c-values are small for all problem sizes; even for \(n=5000\) the empirically optimal value for c is still 1.1, which is only slightly smaller than the roughly 1.3 observed for small n and subject to variance as well. However, the plot indicates that the optimal value of c approaches 1 for larger problem sizes, as predicted by our bound in Theorem 7.

5 Conclusions

We have presented the first tight runtime analysis of a population-based EA, depending on population size and mutation probability. More precisely, we have analyzed the well-known (\(1+\lambda \)) EA with mutation probability c / n, where \(c>0\) is a constant, on the classical OneMax function. Our results show that 1 / n is the only optimal mutation probability if \(\lambda =o(\ln n\ln \ln n/\ln \ln \ln n)\), which is the cut-off point for linear speed-up. However, if \(\lambda \) is above this cut-off point then the standard mutation probability 1 / n is no longer the only optimal choice and the algorithm is more robust with respect to this choice. In fact, the expected number of generations is then independent of c (up to lower order terms), irrespective of whether c is less than or greater than 1.

Our results shed light on the interplay of population size and mutation probability in evolutionary algorithms. At the same time, we have extended our reservoir of methods for such analyses. We are optimistic that our study paves the way for tight analyses of more complex population-based EAs on more complex function classes as well.