1 Introduction

Evolutionary algorithms (EAs) are general-purpose problem solvers which can be successfully applied to a wide variety of problems with little effort. They are popular where no tailored solutions exist, for example because the structure of the problem is inaccessible (given as a black box) or very complicated. In particular, EAs are popular in settings involving uncertainty, such as noisy fitness (quality) evaluations; see [1] for a survey of examples in combinatorial optimization and [11] for an excellent survey also discussing different sources of uncertainty.

We are interested in formally analyzing the performance of EAs in settings where the fitness function is probabilistic, i.e., a given search point can have different fitness values for different fitness evaluations. One way to deal with such uncertainty is to replace fitness evaluations with an average of a (large) sample of fitness evaluations and then proceed as if there were no noise. In this paper we are interested in a different approach where we accept the noise and try to analyze how much noise can be overcome by EAs without further modifications (note that this research can also be used to decide how much resampling is necessary for successful optimization). This was first done in [5], where a noisy variant of the well-known OneMax test function was analyzed for the simplest EA, the (\(1+1\)) EA. In essence, it was shown that the (\(1+1\)) EA can deal with small noise levels, but not with medium noise levels. Recently, a sequence of papers discussed ant colony optimization for path finding problems in the presence of uncertainty [2, 6, 20]; see also [8, 9] for early work in this area.

For this paper we are exclusively concerned with optimization problems defined on bit strings of fixed length n. In this domain we have the two well-known (static) test functions OneMax and LeadingOnes, defined as follows. For each bit string \(x \in \{0,1\}^n\) we let \(\textsc {OneMax} (x)\) be the number of 1s in x and \(\textsc {LeadingOnes} (x)\) be the number of consecutive 1s counting from the left until the first occurrence of a 0. The performance of various randomized search heuristics on these two static problems is known in detail.
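As a concrete illustration, the two static test functions can be written down directly (a minimal sketch; the function names are ours):

```python
def one_max(x):
    """Number of 1s in the bit string x (a sequence of 0s and 1s)."""
    return sum(x)

def leading_ones(x):
    """Number of consecutive 1s from the left until the first 0."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count
```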

We modify these test functions by adding noise. We distinguish between two general noise models: prior noise and posterior noise. In the first model we assume that the noise comes from not evaluating the search point in question, but a noisy variant; an example of this is the noise model used in [5], where, with probability p, a bit of the search point was flipped before evaluation. In the second model, noise is added to the fitness value of a search point after evaluation. For example, one can add a value drawn from a centered normal distribution (or add a value drawn from any other chosen distribution; we call such noise additive posterior noise). Posterior noise is essentially the model used in [2, 6, 8, 20].
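The two noise models can be sketched as follows (a minimal sketch; the helper names are ours; the prior model shown is the one-bit-flip noise of [5], the posterior model is additive noise from an arbitrary distribution):

```python
import random

def prior_noise_eval(f, x, p):
    """Prior noise as in [5]: with probability p, flip one uniformly
    chosen bit of a copy of x, then evaluate f on the noisy variant."""
    y = list(x)
    if random.random() < p:
        i = random.randrange(len(y))
        y[i] = 1 - y[i]
    return f(y)

def posterior_noise_eval(f, x, draw_noise):
    """Additive posterior noise: add a freshly drawn noise value to f(x),
    e.g. draw_noise = lambda: random.gauss(0, sigma)."""
    return f(x) + draw_noise()
```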

In each case we consider the noise to be independent for different elements of the search space and for reevaluations (note that we assume that all algorithms reevaluate each search point under consideration in every iteration).

In this paper we expand on the work done in [5] in three ways. First, we make the results applicable to many different noise models; second, we analyze the LeadingOnes function in noisy settings; third, we show how the use of populations can make the EA much more robust towards noise.

Regarding the generalization, we reprove the results from [5] as a corollary to more general theorems which can be applied in many different settings. The proofs of these more general theorems rely heavily on drift theory, a modern tool which facilitates the formal analysis of randomized search heuristics significantly. Note that this tool was not available for [5]. Another tool suitable for the analysis of populations was recently introduced in [14] in the context of non-elitism, i.e. just as in the setting with noise, good solutions can get lost.

Regarding the LeadingOnes test function, we give the first formal analysis of this test function in a noisy setting.

Regarding the use of populations, the paper [18] gives a nice overview of several different aspects where populations (and the use of crossover operators) are beneficial for optimization of static fitness functions. In contrast to this, we show that populations can also be highly beneficial in the context of stochastic fitness functions, as they make much higher noise levels tractable.

1.1 Detailed Contribution

The only algorithm we consider is the (\(\mu +\lambda \)) EA, for different values of \(\mu \) and \(\lambda \) (see Sect. 2 for a detailed description). We consider the (\(1+1\)) EA as an EA that does not rely on populations; this was the algorithm considered in [5]. Even when we discuss EAs with populations, we only consider cases with \(\mu =1\) or \(\lambda = 1\), for simplicity.

We consider optimization successful in the stochastic setting as soon as the algorithm has evaluated the best static solution (in both our cases the all-1s bit string); note that the best static solution is, in all of our models, also the solution with best expected fitness. Whenever we consider the “run time” of an algorithm, this is understood as the expected number of iterations (or generations) of the EA. In particular, population-based EAs have a higher number of fitness evaluations than iterations (also due to reevaluation of old search points). Note, however, that these numbers differ only by a factor of \(\mu + \lambda \). Therefore, it makes sense to take the number of iterations as a measure of the run time and the number of fitness evaluations can be directly inferred. In fact, when fitness evaluations can be done in parallel (due to sufficient hardware), the number of iterations is directly a measure for the run time.

In Sect. 3 we consider OneMax. In particular, we give a general theorem for deriving upper bounds in different noise settings (Theorem 5) and a general theorem for deriving lower bounds (Theorem 6). As a result we completely reprove the theorems from [5] (Corollary 7). These results concern prior noise where, with probability p, a (uniformly chosen) bit is flipped. As a further corollary, we show that the (\(1+1\)) EA can optimize efficiently in the presence of additive posterior noise with variance of \(O(\log n/n)\), but not, for example, in the presence of additive noise from an exponential distribution with parameter 1 (Corollary 10). We consider additive posterior noise taken from a normally distributed random variable in Corollary 11; here we can give the precise threshold below which the (\(1+1\)) EA is successful. This list of corollaries can easily be extended, for example to cover the case of prior noise based on mutation (Corollary 8) or the noise model of “partial evaluation” considered in [4] (Corollary 9).

In Sect. 3.2 we show that populations can make an EA much more robust towards noise. For example, a population of linear size is large enough to allow arbitrary values of p in the setting of prior bit-flip noise. Furthermore, for constant p in the setting of prior bit-flip noise (a setting far outside the abilities of the (\(1+1\)) EA for efficient optimization), already a logarithmic population size suffices for efficient optimization. Similarly, we get robustness of small populations for posterior noise models, for example for exponentially distributed noise with constant parameter.

In Sect. 4, we give our results for LeadingOnes. We show that the (\(1+1\)) EA optimizes successfully in the presence of small noise, but we also give an example of higher noise levels where optimization is unsuccessful. Here again populations are helpful, even of logarithmic size. As expected, the LeadingOnes function is much more sensitive to prior noise than the OneMax function.

For the specific noise model taken from [5], we give an overview of our results in Table 1.

We conclude the paper with a discussion in Sect. 5. Note that this paper is an extended and corrected version of [7]; we added several further corollaries and expanded on many proofs. Since the publication of the conference version of this paper, another work addressing the optimization of stochastic versions of OneMax and LeadingOnes was published [4] where the focus was on non-elitist algorithms optimizing with noise models based on certain kinds of unavailability of data (see Corollary 9 for one of the noise models). Furthermore, several papers regarding optimization with stochastic fitness functions were presented at the FOGA’15 conference.

Table 1 For different algorithms and different target problems, the range of noise parameter p is given for which optimization is successful in polynomial time (in the model of prior noise from [5], with probability p flipping a single bit)

2 Mathematical Preliminaries

In this paper we consider the (\(\mu +\lambda \)) EA, an algorithm which bases its progress on mutation (see Algorithm 1 for a detailed description). We consider only the mutation operator which flips each bit independently with probability \(1/n\). Ties in the selection of fitter individuals are broken so that individuals from the offspring population are preferred (this allows the (\(1+1\)) EA to cross plateaus and is consistent with the definition of, for example, [5]); further ties are broken uniformly at random.

[Algorithm 1: the (\(\mu +\lambda \)) EA]
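The scheme can be sketched as follows (a minimal sketch, not the paper's exact Algorithm 1; the names are ours; as a simplification, remaining ties are resolved by the stable sort order rather than uniformly at random):

```python
import random

def mu_plus_lambda_ea(f, n, mu, lam, max_iters, rng=random):
    """Sketch of the (mu+lambda) EA. Each iteration creates lam offspring by
    standard bit mutation (each bit flips independently with probability 1/n)
    and keeps the mu apparently fittest individuals, evaluating every search
    point under consideration anew. Listing offspring first and using a
    stable sort prefers offspring on fitness ties."""
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for _ in range(max_iters):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(pop)
            child = [bit ^ (rng.random() < 1 / n) for bit in parent]
            offspring.append(child)
        candidates = offspring + pop            # offspring first
        candidates.sort(key=f, reverse=True)    # one (noisy) evaluation each
        pop = candidates[:mu]
        if any(all(bit == 1 for bit in x) for x in pop):
            break                               # best static solution found
    return pop
```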

Note that all references to the “run time” or the “number of steps” of an algorithm always concern the expected first hitting time of the optimum, as mentioned above.

2.1 Drift Theorems

We will use a variety of drift theorems to derive the theorems of this paper. Drift, in this context, describes the expected change of the best-so-far solution within one iteration with respect to some potential. In later proofs we will define potential functions on best-so-far solutions and prove bounds on the drift; these bounds then translate to expected run times with the use of the drift theorems from this section.

The literature contains a large number of drift theorems; our selection is not representative, but merely includes those theorems needed for this paper.

The simplest drift theorem concerns additive drift.

Theorem 1

(Additive Drift [10]) Let \((D^{(t)})_{t \ge 0}\) be random variables describing a Markov process over a finite state space \(S \subseteq \mathbb {R}^+_0\). Let T be the random variable that denotes the earliest point in time \(t \ge 0\) such that \(D^{(t)} = 0\). If there exists \(c > 0\) such that

$$\begin{aligned} E\left( D^{(t)} - D^{(t+1)} \vert T > t\right) \ge c, \end{aligned}$$

then

$$\begin{aligned} E(T | D^{(0)}) \le \frac{D^{(0)}}{c}. \end{aligned}$$
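As a toy illustration of Theorem 1 (made-up parameters; the names are ours): consider a walk on \(\{0,\dots ,20\}\) that steps down with probability 3/4 and up with probability 1/4, so the drift toward 0 is at least \(c = 1/2\) in every state; Theorem 1 then predicts an expected hitting time of at most \(20/c = 40\).

```python
import random

def hitting_time(start, c, rng):
    """One run of a capped random walk on {0, ..., start}: step -1 with
    probability (1+c)/2 and +1 with probability (1-c)/2, so the drift
    toward 0 is at least c in every state. Returns the hitting time of 0."""
    d, t = start, 0
    while d > 0:
        d += -1 if rng.random() < (1 + c) / 2 else 1
        d = min(d, start)
        t += 1
    return t

rng = random.Random(0)
runs = [hitting_time(20, 0.5, rng) for _ in range(300)]
avg = sum(runs) / len(runs)  # Theorem 1 predicts avg <= 20 / 0.5 = 40
```

Averaged over many runs, the empirical hitting time indeed stays near the predicted bound of 40.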

Another useful tool is the Variable Drift Theorem given in [13, Theorem 4.6] (independently developed in [15, Section 8]); this drift theorem is applicable when the drift is not uniform across the search space; frequently one can find a uniform lower bound and use the additive drift theorem, but using the variable drift theorem will typically give much better bounds. The version of the Variable Drift Theorem that we use is due to [19], which removes the restriction of h being differentiable.

Theorem 2

(Variable Drift [19]) Let \((D^{(t)})_{t \ge 0}\) be random variables describing a Markov process over a finite state space \(S \subseteq \mathbb {R}_0^+\) and let \(x_{\mathrm {min}}:= \min \{x \in S \; | \; x > 0\}\). Furthermore, let T be the random variable that denotes the first point in time \(t \in \mathbb {N}\) for which \(D^{(t)} = 0\). Suppose that there exists a monotone increasing function \(h: \mathbb {R}^+ \rightarrow \mathbb {R}^+\) such that \(1/h\) is integrable and

$$\begin{aligned} E\left( D^{(t)} - D^{(t+1)}\; | \; D^{(t)}\right) \ge h(D^{(t)}) \end{aligned}$$

holds for all \(t < T\). Then,

$$\begin{aligned} E(T \; | \; D^{(0)}) \le \frac{x_{\mathrm {min}}}{h(x_{\mathrm {min}})} + \int _{x_{\mathrm {min}}}^{D^{(0)}} \frac{1}{h(x)} dx. \end{aligned}$$

Since we will make frequent use of it in the following sections, we will also give the version of the Multiplicative Drift Theorem for upper bounds, due to [3], which is implied by the previous Variable Drift Theorem.

Theorem 3

(Multiplicative Drift [3]) Let \((D^{(t)})_{t \ge 0}\) be random variables describing a Markov process over a finite state space \(S \subseteq \mathbb {R}^+_0\) and let \(x_{\mathrm {min}}:= \min \{x \in S \; | \; x > 0\}\). Let T be the random variable that denotes the earliest point in time \(t \ge 0\) such that \(D^{(t)} = 0\). If there exists \(\delta > 0\) such that for all \(x\in S\) with \(P(D^{(t)}=x)>0\) we have

$$\begin{aligned} E\left( D^{(t)}-D^{(t+1)} | D^{(t)} = x\right) \ge \delta x, \end{aligned}$$

then for all \(x'\in S\) with \(P(D^{(0)}=x')>0\),

$$\begin{aligned} E\left( T|D^{(0)}=x'\right) \le \frac{1+\log \left( \frac{x'}{x_{\mathrm {min}}}\right) }{\delta }. \end{aligned}$$
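Indeed, Theorem 3 follows from Theorem 2 by choosing \(h(x) = \delta x\), which is monotone increasing with \(1/h\) integrable:

$$\begin{aligned} E\left( T|D^{(0)}=x'\right) \le \frac{x_{\mathrm {min}}}{\delta x_{\mathrm {min}}} + \int _{x_{\mathrm {min}}}^{x'} \frac{1}{\delta x} dx = \frac{1+\log \left( \frac{x'}{x_{\mathrm {min}}}\right) }{\delta }. \end{aligned}$$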

Finally, in order to derive lower bounds on the run time of EAs, we use the Negative Drift Theorem.

Theorem 4

(Negative Drift [16, 17]) Let \((D^{(t)})_{t \ge 0}\) be real-valued random variables describing a stochastic process over some state space. Suppose there is an interval \([a,b] \subseteq \mathbb {R}\), two constants \(\delta , \varepsilon > 0\) and, possibly depending on \(\ell = b-a\), a function \(r(\ell )\) satisfying \(1 \le r(\ell ) = o(\ell / \log \ell )\) such that, for all \(t \ge 0\), the following conditions hold.

  1. \(E(D^{(t+1)} - D^{(t)} \mid a < D^{(t)} < b) \ge \varepsilon \);

  2. For all \(j \ge 0\), \(P(|D^{(t+1)} - D^{(t)}| \ge j \mid a < D^{(t)}) \le \frac{r(\ell )}{(1+\delta )^{j}}\).

Then there is a constant c such that, for \(T = \min \{t \ge 0: D^{(t)} \le a\}\) and conditional on \(D^{(0)}\ge b\), we have

$$\begin{aligned} P(T \le 2^{c \ell / r(\ell )}) = 2^{-\varOmega (\ell / r(\ell ))}. \end{aligned}$$

3 OneMax

In this section we present our results regarding OneMax. We fix a (stochastic) OneMax function f according to one of our models. For each of the stochastic models we consider (see Corollaries 7 through 11), it is easy to verify that there is a sequence of independent random variables \((X_k)_{k \le n}\) such that the following holds.

  • For each evaluation of f on a bit string with exactly k 1s, the return value is drawn at random \(\sim X_k\) (recall that all evaluations of fitness functions are independent).

  • \(\forall 0 < j < k < n: P(X_k < X_{k+1}) \le P(X_j < X_{k+1})\); intuitively, the larger the true difference in OneMax-value, the more likely this is reflected in a random OneMax evaluation. This simplifies some conditions.

Note that these two properties capture two important properties of the OneMax function: symmetry in the positions (only the number of 1s decides on the fitness, not the position), and monotonicity (the more 1s a bit string has, the higher its fitness; we need this comparison-based version, given the comparison-based definition of the (\(1+1\)) EA).

We start by giving an upper and a lower bound for the (\(1+1\)) EA in Sect. 3.1. In Sect. 3.2 we give upper bounds for EAs with parent populations and in Sect. 3.3 for EAs with offspring populations, showing that populations are efficient for the stochastic versions of OneMax we consider.

3.1 The (\(1+1\)) EA on OneMax

Our first theorem gives an upper bound for the (\(1+1\)) EA on OneMax, generalizing a theorem from [5]. The condition given by Eq. (1) intuitively says that more 1s look better with a probability which is close to 1, getting closer the more 1s there are. This increase in reliability of the evaluation is needed to counterbalance the increased likelihood of finding worse search points the closer the best search point is to the optimum.

Theorem 5

Suppose there is a positive constant \(c \le 1/15\) such that

$$\begin{aligned} \forall k < n: P(X_k < X_{k+1}) \ge 1 - c\frac{n-k}{n}. \end{aligned}$$
(1)

Then the (\(1+1\)) EA optimizes f in \(\varTheta (n \log n)\) steps. Furthermore, if Eq. (1) holds for all \(k < n-\ell \) for some \(\ell \) with \(2 < \ell \le n/2\), and we have

$$\begin{aligned} \forall k < n: P(X_k < X_{k+1}) \ge 1 - \frac{\ell }{n}, \end{aligned}$$

then the (\(1+1\)) EA optimizes f in expectation in \(O(n \log n) + n2^{O(\ell )}\) iterations.

Proof

Let \(p_k= c(n-k)/n\). Let \((D^{(t)})_{t \ge 0}\) be the Markov process describing the number of 1s of the (\(1+1\)) EA on f. We show that there is a positive multiplicative drift driving the number of 1s up to n, more precisely we will show that there is a constant z such that

$$\begin{aligned} \forall t: E\left( D^{(t+1)} - D^{(t)} \mid D^{(t)}, D^{(t)} < n\right) \ge \frac{z}{n} \; \left( n-D^{(t)}\right) . \end{aligned}$$

Let t be given and let \(k := D^{(t)}\) be at most \(n-1\).

Let \(E_0\) be the event that the new search point has at least one more 1 (i.e. \(D^{(t+1)} \ge k+1\)) and the comparison of old and new search point indicates correctly that the new search point is better. We have that \(P(E_0) \ge (1 - p_k)(n-k)/(en)\), as the probability for flipping exactly one 0-bit is at least \((n-k)/(en)\), and the probability of accepting the new (better) search point is at least \(1-p_k\).

Let \(E_1\) be the event that the new search point has fewer 1s than the current search point and that this new search point is nonetheless accepted (i.e. \(D^{(t+1)} \le k-1\)). We have that \(P(E_1) \le p_{k-1}k/n\), as the probability of creating a worse search point is at most \(k/n\) (one of the 1-bits has to flip) and the probability of incorrectly accepting a worse search point is at most \(p_{k-1}\). First, we show that the expected number of 1s conditional on \(E_1\) is at least \(k-2\). To see this, note that this expectation only decreases if we condition additionally on the event \(E_2\) that no 0 flips to a 1. Let A be the set of indices where the current search point has a 1; without loss of generality, assume \(0 \in A\). Let, for all \(i \in A\), \(F_i\) be the event that bit i flipped. Furthermore, let Y be the random variable describing the number of flipping 1s. Note that \(E_1 \subseteq \bigcup _{i \in A} F_i\). Now we see

$$\begin{aligned} E(Y \mid E_1, E_2) \le \sum _{i \in A} E(Y \mid F_i, E_1, E_2)P(F_i) \le \frac{|A|}{n} E(Y \mid F_0, E_1, E_2) \le 2, \end{aligned}$$

as Y conditioned on any \(F_i\) is 1 (for the ith bit flipping) plus an expected \(1/n\) for every other 1-bit. This shows \(E(D^{(t+1)} \mid E_1) \ge k-2\) as desired.

Taking all parts together, we have

$$\begin{aligned} E(D^{(t+1)} - D^{(t)} \mid D^{(t)}, D^{(t)} < n)\ge & {} P(E_0) - 2P(E_1)\\\ge & {} \left( 1 - p_k\right) \frac{n-k}{en} - 2p_{k-1}\frac{k}{n}\\\ge & {} \frac{n-k}{en} - p_k\left( 2 + \frac{n-k}{en}\right) - 2c/n\\\ge & {} \frac{n-k}{en} - 3p_k- 2c/n\\\ge & {} \frac{n-k}{n}(1/e - 5c). \end{aligned}$$

Using \(c \le 1/15\) we see that \(1/e - 5c\) is a positive constant. We apply the Variable Drift Theorem (see Theorem 2) to the process \((n-D^{(t)})_t\) describing the number of 0s at iteration t, with \(x_{\mathrm {min}}= 1\) and h such that, for all \(x > 0\), \(h(x) = ax/n\) for a positive constant \(a < 1/e-5c\). As the initial number of 0s is at most n, we get an expected run time of \(O(n \log n)\).
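Spelled out, the application of Theorem 2 to \((n-D^{(t)})_t\) with \(h(x) = ax/n\) and \(x_{\mathrm {min}} = 1\) gives

$$\begin{aligned} E(T) \le \frac{1}{a/n} + \int _{1}^{n} \frac{n}{ax} dx = \frac{n}{a}\left( 1 + \ln n\right) = O(n \log n). \end{aligned}$$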

Regarding the “furthermore” clause, we do not have sufficient drift when we use the number of 1s as the potential function. Thus, we change our potential function in a way which could also be used to show an \(\exp (O(n))\) bound for optimizing the needle function with the (\(1+1\)) EA. Intuitively, this drift function takes care of the plateau of the last \(\ell \) OneMax values.

We define a helper function a so that, for all \(i \ge 1\),

$$\begin{aligned} a(i) = {\left\{ \begin{array}{ll} 2\frac{i!}{\ell !} (10e\ell )^{\ell -i}, &{}\quad \text{ if } i \le \ell ;\\ 1, &{}\quad \text{ otherwise. } \end{array}\right. } \end{aligned}$$

We now define the potential in terms of a as follows. A given search point with exactly z 0s has potential

$$\begin{aligned} g(z) = \sum _{i=1}^z a(i). \end{aligned}$$

This potential function makes it particularly easy to compute the differences of the potentials of similar search points (it is a sum of successive a(i)). We let \((C^{(t)})_t\) be the sequence of random variables describing the number of 0s in the current bit string after t iterations. Let \(E_0\) and \(E_1\) be as above. For \(z \ge \ell \) we see that we have a drift of \(\varOmega (z/n)\) from the computations above. Suppose now \(z < \ell \). Later we will need that, for all \(i < \ell \),

$$\begin{aligned} \frac{a(i)}{a(i+1)} = \frac{10e\ell }{i+1} \end{aligned}$$

and, thus,

$$\begin{aligned} a(i) + 2a(i+1) \le 2a(i). \end{aligned}$$
(2)

We now estimate how much the potential can fall conditional on sampling and accepting a worse search point. We use \(E(C^{(t+1)} \mid C^{(t)} = z, E_1) \le z+2\) as before and make the worst-case assumption that we lose the complete potential from g(z) to \(g(\ell )\). This gives

$$\begin{aligned} E\left( g(C^{(t)}) - g(C^{(t+1)}) \mid C^{(t)} = z, E_1\right)\ge & {} g(z) - g(\ell ) - 2\\= & {} - 2 - \sum _{i=z+1}^\ell a(i)\\\ge & {} -2a(z+1). \end{aligned}$$

The last step can be seen with a straightforward induction employing Eq. (2).
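A quick numerical sanity check of the helper function a and of Eq. (2) (a sketch; the transcription of a is ours), here for \(\ell = 6\):

```python
import math

def a(i, ell):
    """Helper weight a(i) from the proof of Theorem 5 (our transcription)."""
    if i <= ell:
        return 2 * math.factorial(i) / math.factorial(ell) * (10 * math.e * ell) ** (ell - i)
    return 1.0

ell = 6
# a(i)/a(i+1) = 10*e*ell/(i+1) >= 10*e > 2, so 2*a(i+1) <= a(i), i.e. Eq. (2).
ratios = [a(i, ell) / a(i + 1, ell) for i in range(1, ell)]
```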

Conditional on making progress at all (more precisely, conditional on \(E_0\)), we gain in potential by at least

$$\begin{aligned} g(z)-g(z-1) = a(z). \end{aligned}$$

The probability of accepting a better individual is, by assumption, at least \(1 - \ell /n\). Thus, taking all parts together, we have

$$\begin{aligned}&E\left( g(C^{(t)}) - g(C^{(t+1)}) \mid C^{(t)} = z\right) \\\ge & {} P(E_0)a(z) - P(E_1)2a(z+1)\\\ge & {} \left( 1 - \frac{\ell }{n} \right) \frac{z}{en}a(z) - \frac{\ell }{n} \frac{n-z}{n}2a(z+1). \end{aligned}$$

Simple arithmetic and the conditions we have on \(\ell \) and z now show that this term is at least \(b/n\) for some constant b. It is easy to see that \(g(\ell ) = \exp (O(\ell ))\). We now use the Variable Drift Theorem (see Theorem 2) on the process \((g(C^{(t)}))_t\). We let h be monotone and integrable such that, for all z with \(0 < z \le \ell \), we have \(h(g(z)) = b/n\), and, for \(z > \ell \), \(h(g(z)) = az/n\) with the constant a as above. Thus we see that the total expected optimization time is at most

$$\begin{aligned} O(n \; g(\ell ) + n \log (n)), \end{aligned}$$

where the first term accounts for potential values up to \(g(\ell )\) and the second term accounts for higher values, where we have the higher drift of \(\varOmega (z/n)\).\(\square \)

Now we come to our second theorem, a lower bound for the (\(1+1\)) EA on OneMax.

Theorem 6

Suppose there is \(\ell \le n/4\) and a constant \(c \ge 16\) such that

$$\begin{aligned} \forall k, n - \ell \le k < n: P(X_k < X_{k+1}) \le 1 - c\frac{n-k}{n}. \end{aligned}$$

Then the (\(1+1\)) EA optimizes f in \(2^{\varOmega (\ell )}\) iterations with high probability.

Proof

We want to show that there is a constant negative drift on the number of ones in the interval between \(n - \ell \) and n; however, we will count iterations of the process only when an actual change in the number of 1s occurs (i.e., we condition on the change), as otherwise the drift would be too small (this will only yield a smaller bound than when counting all other steps as well). See also [19] regarding an explicit negative drift theorem in the presence of self-loops. Let \(k \ge n- \ell \) be the number of 1s of the current search point.

Let \(p_k= c(n-k)/n\). With a probability of at least \(p_{k-1}k /(en)\) we decrease the number of 1s by one (this is the chance of flipping exactly one bit, from 1 to 0, and then accepting the result). This decrease of at least 1 times the probability of doing so is the negative part of the drift.

Countering the negative part is a positive part. Let \(E_0\) be the event that the new search point has at least one more 1 and the comparison of old and new search point indicates correctly that the new search point is better. Clearly, the expected number of 1s conditional on \(E_0\) is at most \(k+2\) and \(P(E_0) \le (n-k)/n\). Thus, the expected increase in 1s times the probability of making such an increase is at most \(2(n-k)/n\), the positive part of the drift.

Since we do not count steps without change, it suffices to show that the negative part of the drift is at least twice as large in magnitude as the positive part of the drift. This would yield a negative drift of at least 1/3 (this uses that we have a minimal step width of 1). Dividing the lower bound for the negative part of the drift by the upper bound for the positive part of the drift, we get a ratio of

$$\begin{aligned} \frac{p_{k-1}k}{en} \frac{n}{2(n-k)} \ge \frac{c k}{2en}. \end{aligned}$$

From \(k \ge 3n/4\) and \(c \ge 16\) we get the desired lower bound of 2 on the ratio. As the (\(1+1\)) EA makes long jumps with sufficiently small probability, an application of the Negative Drift Theorem (Theorem 4) concludes the proof.\(\square \)

Our two theorems can be used for easy corollaries, showing the optimization time of the (\(1+1\)) EA given different noise models. We first consider the noise model given in [5].

Corollary 7

([5]) Suppose prior noise which, with probability p, flips a bit chosen uniformly at random. Then we have that the (\(1+1\)) EA optimizes OneMax in an expected number of iterations of

$$\begin{aligned} {\left\{ \begin{array}{ll} \varTheta (n \log n), &{}\quad \text{ if } p = O(1/n);\\ \text{ polynomial }, &{}\quad \text{ if } p = O(\log (n)/n);\\ \text{ superpolynomial }, &{}\quad \text{ if } p = \omega (\log (n)/n) \cap 1 - \omega (\log (n)/n). \end{array}\right. } \end{aligned}$$

Proof

Let c be given and suppose \(p \le c /n\); let \(k < n - 2c\). We estimate \(P(X_k < X_{k+1})\) by observing that the event \(X_k \ge X_{k+1}\) requires the individual with k 1s to be evaluated to \(k+1\) or the other to k. Both have a probability of at most p. By the union bound, we get either of the two events with probability at most \(2p \le 2c/n \le (n-k)/n\). Thus, we get the desired bound from Theorem 5 using the “furthermore” clause with \(\ell = 2c\).

Let now c be given and suppose \(p \le c \log n / n\). We get the bound \(P(X_k \ge X_{k+1}) = O(\log n /n)\) just as in the last paragraph; this is sufficient up to a distance of \(O(\log n)\) from the optimum, which gives a polynomial bound using the “furthermore” clause from Theorem 5 again.

Suppose \(p = \omega (\log n/n) \cap 1 - \omega (\log n/n)\). Then, for all \(k \ge 3n/4\), we estimate \(P(X_k = X_{k+1})\) as follows. That \(X_k\) evaluates to k has a probability of at least \(1-p\); that \(X_{k+1}\) evaluates to k has a probability of at least \(p (k+1)/n \ge 3p/4\). Since these two events are independent, we get

$$\begin{aligned} P(X_k = X_{k+1}) \ge (1-p)(3p/4). \end{aligned}$$

Using the bounds on p, Theorem 6 now gives a superpolynomial run time.\(\square \)
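The union-bound estimate from the first paragraph of the proof can be checked empirically (a sketch with made-up parameters; the names are ours). For \(n = 100\), \(k = 50\) and \(p = 0.05\), the estimated misranking probability stays below the bound \(2p = 0.1\):

```python
import random

def noisy_one_max(x, p, rng):
    """One-bit prior noise from [5]: with probability p flip one
    uniformly chosen bit, then count the 1s."""
    y = list(x)
    if rng.random() < p:
        i = rng.randrange(len(y))
        y[i] = 1 - y[i]
    return sum(y)

def estimate_misrank(n, k, p, trials, rng):
    """Monte Carlo estimate of P(X_k >= X_{k+1}): the probability that a
    point with k ones appears at least as good as one with k+1 ones."""
    x = [1] * k + [0] * (n - k)
    y = [1] * (k + 1) + [0] * (n - k - 1)
    bad = sum(noisy_one_max(x, p, rng) >= noisy_one_max(y, p, rng)
              for _ in range(trials))
    return bad / trials

rng = random.Random(1)
est = estimate_misrank(100, 50, 0.05, 20000, rng)  # union bound: 2p = 0.1
```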

Note that, for the missing case of \(p = 1 - O(\log (n)/n)\) in Corollary 7 we can apply neither Theorem 5 nor Theorem 6 as the relevant constant c has a value of roughly 1 (for an upper bound we need \(c \le 1/15\), for a lower bound we need \(c \ge 16\)).

Just as easily we can make conclusions about different noise models. In the next corollary we consider noise as given by flipping each bit with a certain probability.

Corollary 8

Suppose prior noise which flips each bit independently with probability p. Then we have that the (\(1+1\)) EA optimizes OneMax in an expected number of iterations of

$$\begin{aligned} {\left\{ \begin{array}{ll} \varTheta (n \log n), &{}\quad \text{ if } p = O(1/n^2);\\ \text{ polynomial }, &{}\quad \text{ if } p = O(\log n/n^2);\\ \text{ superpolynomial }, &{}\quad \text{ if } p = \omega (\log n/n^2). \end{array}\right. } \end{aligned}$$

Proof

Suppose first \(p \le c/n^2\) for some constant c and let \(k < n - 2c\). We estimate \(P(X_k < X_{k+1})\) by observing that the event \(X_k \ge X_{k+1}\) requires the individual with k 1s to be evaluated to at least \(k+1\) or the other to at most k. Both have a probability of at most np by the union bound. Again by the union bound, we get either of the two events with probability at most \(2np \le 2c/n \le (n-k)/n\). Thus, we get the desired bound from Theorem 5.

Let now c be given and suppose \(p \le c \log n / n^2\). We get the bound \(P(X_k \ge X_{k+1}) = O(\log n /n)\) just as in the last paragraph; this is sufficient up to a distance of \(O(\log n)\) from the optimum, which gives a polynomial bound using the “furthermore” clause from Theorem 5.

We show the superpolynomial run time bound by distinguishing three cases for the growth rate of p. For all \(k \ge 3n/4\), we estimate \(P(X_k \ge X_{k+1})\). We will show a bound of

$$\begin{aligned} P(X_k \ge X_{k+1}) = \omega (\log (n)/n), \end{aligned}$$

so that Theorem 6 gives a superpolynomial run time with \(\ell = \omega (\log (n))\). Fix two individuals x and y with k and \(k+1\) 1s respectively.

Suppose first \(p = \omega (\log n/n^2) \cap O(1/n)\). That x evaluates to k has a probability of at least \((1-p)^n = \varOmega (1)\) (no bit flips); that y evaluates to at most k has a probability of at least \(p (k+1)/e \ge 3pn/(4e)\). Since these two events are independent, we get

$$\begin{aligned} P(X_k \ge X_{k+1}) \ge \varOmega (1)(3pn/(4e)) = \omega (\log (n)/n). \end{aligned}$$

Suppose now \(p = \omega (1/n) \cap O(\log (n)/n)\). Fix \(\log (n)\) positions of x at which there is exactly one 0; fix \(\log (n)\) positions of y where there are only 1s. The probability that, after the noise is applied, x has at most as many 1s as y on the positions outside of the fixed ones is at least 1/2 (due to symmetry, they have just as many 1s at these positions). The probability that, of all the fixed positions, exactly one in y flips is at least \(\log (n)p(1 - p)^{2\log (n)-1}\), as there are \(\log (n)\) choices for the position in y. We can bound this probability, using Bernoulli’s Inequality, as

$$\begin{aligned} \log (n)p(1 - p)^{2\log (n)-1}\ge & {} \log (n)\omega (1/n) \left( 1- p(2\log (n)-1)\right) \\\ge & {} \omega (\log (n)/n) \left( 1- O\left( \frac{\log (n)^2}{n}\right) \right) \\= & {} \omega (\log (n)/n). \end{aligned}$$

Together with the probability of 1/2 for x having at most as many 1s on the other bits, this shows the desired bound.

Finally, suppose \(p = \omega (\log (n)/n)\). Align the bit positions so that there is only one bit position where x and y differ. We first focus on the bit positions where x and y are the same. After applying the noise, the probability that y has at most as many 1s (within these bit positions) as x, is at least 1/2. The probability that, at the position where x and y differ, y has its bit flipped, is p. Thus, the probability of y evaluating to at most as much as x is at least p / 2, which gives again the desired bound.\(\square \)

Next we consider the noise model “partial evaluation” given in [4]. Note that the case of p constant of the next corollary was shown by [4].

Corollary 9

Suppose prior noise which sets each bit independently with probability p to 0. Then we have that the (\(1+1\)) EA optimizes OneMax in an expected number of iterations of

$$\begin{aligned} {\left\{ \begin{array}{ll} \varTheta (n \log n), &{}\quad \text{ if } p = O(1/n^2);\\ \text{ polynomial }, &{}\quad \text{ if } p = O(\log n/n^2);\\ \text{ superpolynomial }, &{}\quad \text{ if } p = \omega (\log n/n^2). \end{array}\right. } \end{aligned}$$

We omit the proof, since it is basically the same as for the previous corollary.

Regarding posterior noise we give the following two corollaries.

Corollary 10

Suppose posterior noise, sampling from some distribution D with variance \(\sigma ^2\). Then we have that the (\(1+1\)) EA optimizes OneMax in expected polynomial time if \(\sigma ^2 = O(\log n/n)\). On the other hand, if, for example, D is exponentially distributed with parameter 1, then the (\(1+1\)) EA optimizes OneMax only in superpolynomial time.

Proof

We have \(X_k \sim k + D\) and \(X_{k+1} \sim k+1 + D\); let \(D'\) be the difference of two independent copies of D. We have

$$\begin{aligned} P(X_k < X_{k+1}) = P(0 < X_{k+1} - X_k) = P(-1 < D'). \end{aligned}$$

The variance of the difference of two random variables both with variance \(\sigma ^2\) is \(2 \sigma ^2\). Now we apply Chebyshev’s Inequality to see that

$$\begin{aligned} P(|D'| \ge 1) \le 2\sigma ^2. \end{aligned}$$

Thus, for \(\sigma ^2 = O(\log n/n)\), we get polynomial run time using Theorem 5.

In the case where D is an exponential distribution, we have a constant probability that \(X_k \ge k+2\) and \(X_{k+1} \le k+2\), which leads to the claimed result using Theorem 6.\(\square \)
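For a Gaussian D the Chebyshev bound above can be compared with the exact tail: if \(D \sim \mathcal {N}(0,\sigma ^2)\), then \(D' \sim \mathcal {N}(0,2\sigma ^2)\) and \(P(|D'| \ge 1) = \mathrm {erfc}(1/(2\sigma ))\). A quick numerical comparison (our own sanity check, using Python's math.erfc):

```python
import math

def chebyshev_bound(sigma2):
    # Chebyshev's Inequality: P(|D'| >= 1) <= Var(D') = 2 * sigma^2
    return 2 * sigma2

def exact_gaussian_tail(sigma2):
    # For D ~ N(0, sigma^2), the difference D' of two independent
    # copies satisfies D' ~ N(0, 2*sigma^2), hence
    # P(|D'| >= 1) = erfc(1 / (2*sigma)).
    sigma = math.sqrt(sigma2)
    return math.erfc(1 / (2 * sigma))

for s2 in (0.01, 0.1, 1.0):
    assert exact_gaussian_tail(s2) <= chebyshev_bound(s2)
```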

Corollary 11

Suppose posterior noise, sampling from a Gaussian distribution \(D \sim \mathcal {N}(0,\sigma ^2)\) with variance \(\sigma ^2\). Then we have that the (\(1+1\)) EA optimizes OneMax in expected polynomial time if \(\sigma ^2 \le 1/(4\log (n))\). On the other hand, if \(\sigma ^2 \ge c/(4\log (n))\) for any \(c > 1\), then the (\(1+1\)) EA optimizes OneMax in superpolynomial time only.

Proof

Let \(D'\) be the difference of two independent copies of D as in the proof of Corollary 10. We have that \(D' \sim \mathcal {N}(0, 2\sigma ^2)\); thus

$$\begin{aligned} P\left( -1 < D' \right) = \frac{1}{2}\left( 1 + \mathrm {erf}\left( \frac{1}{2\sigma }\right) \right) = 1- \frac{1}{2}\mathrm {erfc}\left( \frac{1}{2\sigma }\right) . \end{aligned}$$

We will use the following standard estimates of the complementary error function [21], somewhat simplified by slightly weakening the bounds. For all \(x > 0\),

$$\begin{aligned} \frac{\exp (-x^2)}{2x+2} \le \mathrm {erfc}(x) \le \frac{2\exp (-x^2)}{x} \end{aligned}$$
(3)
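These sandwich bounds are easy to verify numerically (a sanity check of Eq. (3), not part of the proof):

```python
import math

# Check the sandwich bounds of Eq. (3):
#   exp(-x^2)/(2x+2) <= erfc(x) <= 2*exp(-x^2)/x   for x > 0
for x in (0.25, 0.5, 1.0, 2.0, 3.0):
    lower = math.exp(-x * x) / (2 * x + 2)
    upper = 2 * math.exp(-x * x) / x
    assert lower <= math.erfc(x) <= upper
```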

We now first consider the positive part, so suppose \(\sigma ^2 \le 1/(4\log (n))\). Using the upper bound from Eq. (3) and the fact that \(\mathrm {erfc}\) is monotonically decreasing, we get

$$\begin{aligned} \frac{1}{2}\mathrm {erfc}\left( \frac{1}{2\sigma }\right) \le \frac{1}{2}\mathrm {erfc}\left( \sqrt{\log n}\right) \le \frac{\exp (-\log n)}{\sqrt{\log n}} \le \frac{1}{n}. \end{aligned}$$

Thus, we get polynomial run time using Theorem 5.

Regarding the negative part, let \(c > 1\) and suppose \(\sigma ^2 \ge c/(4\log (n))\). Using the lower bound from Eq. (3) and again the fact that \(\mathrm {erfc}\) is monotonically decreasing, we get, asymptotically for any \(c'\) with \(1/c < c' < 1\),

$$\begin{aligned} \frac{1}{2}\mathrm {erfc}\left( \frac{1}{2\sigma }\right) \ge \frac{1}{2}\mathrm {erfc}\left( \sqrt{\log n/c}\right) \ge \frac{\exp (-\log n/c)}{4\sqrt{\log n/c} + 4} \ge \frac{1}{n^{c'}}. \end{aligned}$$

This leads to the claimed result using Theorem 6.\(\square \)

3.2 Parent Populations: The (\(\mu +1\)) EA on OneMax

In this section we give upper bounds for EAs using parent populations, i.e. we consider the (\(\mu +1\)) EA. From [22] we know that the (\(\mu +1\)) EA needs \(\varTheta (\mu n + n\log n)\) iterations to optimize the static version of OneMax. We conjecture a similar run time for sufficiently benevolent noisy versions but, for simplicity of both the proof and the requirements on the noise model, we only give the slightly weaker bound of \(O(\mu n\log n)\). To bound the negative drift, we also require that the noise has only a very small range.
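For concreteness, the (\(\mu +1\)) EA can be sketched in Python as follows (our own minimal rendition; we assume that every individual is re-evaluated in each iteration, matching the fresh noisy comparisons analyzed below, and we stop once the true optimum is in the population):

```python
import random

def mu_plus_one_ea(n, mu, noisy_f, max_iters=100_000, rng=random):
    """Minimal (mu+1) EA sketch for bit strings of length n.

    noisy_f(x) returns a (possibly noisy) fitness of x; all mu+1
    individuals are evaluated afresh in each iteration and one
    with the lowest observed value is discarded, ties broken
    uniformly at random. Returns the final population.
    """
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for _ in range(max_iters):
        if any(sum(x) == n for x in pop):  # true optimum reached
            break
        parent = rng.choice(pop)
        # standard bit mutation: flip each bit with probability 1/n
        child = [b ^ (rng.random() < 1 / n) for b in parent]
        pop.append(child)
        # fresh noisy evaluation of all mu+1 individuals; drop one
        # with the lowest observed value (random tie breaking)
        vals = [(noisy_f(x), rng.random()) for x in pop]
        worst = min(range(len(pop)), key=lambda i: vals[i])
        pop.pop(worst)
    return pop
```

With noisy_f = sum one obtains the noiseless (\(\mu +1\)) EA on OneMax; any noisy evaluation function can be plugged in instead.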

Theorem 12

Let \(\mu \) be given and suppose, for each \(k \le n\), \(X_k \in [k-1,k+1]\). For each \(k < n\), let \(A_k\) be the event that, when drawing \(\mu \) independent copies of \(X_k\) and one copy of \(X_{k+1}\) and then sorting, breaking ties uniformly at random, the value of \(X_{k+1}\) does not come out least. If there is a positive constant \(c \le 1/15\) such that

$$\begin{aligned} \forall k, n/4 < k < n: P(A_k) \ge 1 - c\frac{n-k}{n\mu }, \end{aligned}$$
(4)

then the (\(\mu +1\)) EA optimizes f in an expected number of \(O(\mu n \log n)\) iterations.

Note that the requirement of Eq. (4) might seem to get more restrictive with growing \(\mu \); however, it only gets linearly more restrictive in the fraction on the right-hand side, while the impact of growing \(\mu \) on the probability of the event \(A_k\) is typically much stronger.

Proof

Let \(p_k= c(n-k)/(n\mu )\).

We show that there is a positive drift on the number of ones in the current best search point. Let k be the number of 1s of the current best search point.

Let \(E_0\) be the event that the new search point has at least one more 1 than the best one and this new search point is not removed in the selection step. We have \(P(E_0) \ge (1 - p_k)(n-k)/(e\mu n)\).

Let \(E_1\) be the event that the new search point has fewer 1s than the current best search point, the best search point is unique, and that this unique search point is discarded (if the best search point is not unique, \(E_1\) is the empty event). From the bound on the range of the noise we get that the best search point with k 1s can only be discarded if all other search points have at least \(k-2\) 1s. Thus, the number of 1s conditional on \(E_1\) (if \(E_1 \ne \emptyset \)) is at least \(k-2\). We have \(P(E_1) \le p_{k-1}\). Thus, the expected increase in the number of 1s is at least

$$\begin{aligned} P(E_0) - 2P(E_1)\ge & {} \left( 1 - p_k\right) \frac{n-k}{e\mu n} - 2p_{k-1}\\= & {} \frac{n-k}{e \mu n} - p_k\left( 2 + \frac{n-k}{e \mu n} \right) - 2c/(n\mu )\\\ge & {} \frac{n-k}{e \mu n} - 3p_k- 2c/(n\mu ). \end{aligned}$$

Using the choice of \(c\le 1/15\), we see that we have sufficient multiplicative drift, and our claim follows from the Multiplicative Drift Theorem (see Theorem 3).\(\square \)

From these theorems we can again derive many corollaries regarding concrete noise models. These include corollaries implying an exponential speedup from populations of logarithmic size, when compared with the performance of the (\(1+1\)) EA.

Corollary 13

Suppose prior noise which, with probability p, flips a bit uniformly at random. Let \(\mu \ge 12\log (15n)/p\). Then we have that the (\(\mu +1\)) EA optimizes OneMax in an expected number of \(O(\mu n \log n)\) iterations. In particular, for \(p=1/2\), we have that a population size of \(\mu = 24\log (15n)\) suffices for an expected optimization time of \(O(\mu n \log n)\).

Proof

In order to apply Theorem 12 we consider \(\mu \) independent copies of \(X_k\) (called low individuals) and one copy of \(X_{k+1}\) (called high individual).

Suppose first \(p \le 1/(32n)\). Using Chernoff bounds, we can assume that at least \(\mu /2\) low individuals evaluate to k. Now the high individual can only come out last in the sorting described in Theorem 12 if it evaluates down and loses the tie breaking against at least \(\mu /2\) low individuals; this happens with a probability of at most

$$\begin{aligned} \frac{p}{\mu /2} \le \frac{1}{16 n\mu }. \end{aligned}$$

Thus, Theorem 12 concludes the claim for this case.

Suppose now \(p > 1/(32n)\). For \(k \ge n/4\), the probability that none of the low individuals is evaluated to \(k-1\) is at most

$$\begin{aligned} \left( 1-\frac{p}{4}\right) ^{\mu } \end{aligned}$$

(as each individual is evaluated noisily with probability p and, using \(k \ge n/4\), with probability at least 1/4 the noise has a negative effect on the fitness).

For \(\mu \) large enough (recall that \(\mu \) increases with n), we have \(p\mu /6 \ge \log (\mu )\), as can be seen by the following case distinction.

$$\begin{aligned} \mu \le n^2: \;\;\;&p\mu /6 \ge 2\log (15n) \ge 2 \log (n) \ge \log (\mu );\\ \mu > n^2: \;\;\;&p\mu /6 \ge \mu /(6\cdot 32n) \ge \sqrt{\mu }/198 \ge \log (\mu ). \end{aligned}$$

Let \(q = 1-p/4\). We have

$$\begin{aligned} \mu= & {} 4\left( \mu /12 + \mu /6\right) \\\ge & {} 4\left( \frac{\log (15n)}{p} + \frac{\log (\mu )}{p}\right) \\= & {} 4\frac{\log (15n\mu )}{p}. \end{aligned}$$

We use the inequality \(x \le -\log (1-x)\) for \(x = p/4\) and get

$$\begin{aligned} \mu \ge \frac{4\log (15n\mu )}{-4 \log (q)} = - \frac{\log (15n\mu )}{\log (q)}. \end{aligned}$$

Taking all inequalities together we get that the probability that none of \(\mu \) individuals with k 1s is evaluated to \(k-1\) is at most

$$\begin{aligned} \left( 1-\frac{p}{4}\right) ^{\mu } = q^\mu \le \frac{1}{15n\mu }. \end{aligned}$$

Theorem 12 gives the desired result.\(\square \)
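The final estimate can be sanity-checked numerically for concrete parameters (our own check, not part of the proof):

```python
import math

# For p = 1/2 and mu = ceil(12*log(15n)/p), verify the final bound
# (1 - p/4)^mu <= 1/(15*n*mu) from the proof of Corollary 13.
p = 0.5
for n in (10, 100, 1000):
    mu = math.ceil(12 * math.log(15 * n) / p)
    assert (1 - p / 4) ** mu <= 1 / (15 * n * mu)
```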

Corollary 13 requires larger \(\mu \) for smaller p, although small noise (small values of p) should make optimization easier. The following case illustrates why small noise can nonetheless be problematic for our proof approach. Assume we have \(\mu = \varTheta (\log n)\) and \(p = \varTheta (1/\sqrt{n})\). Then, with constant probability, all low individuals are truthfully evaluated. Furthermore, \(X_{k+1} = k\) has a probability of \(\varTheta (1/\sqrt{n})\), assuming a large value of k. This shows that there is a chance of \(\varTheta (1/\sqrt{n})\) to remove the best individual from the population, making Theorem 12 inapplicable. The assumptions of Theorem 12 are very strong; even if they do not hold, efficient optimization might still be possible, but showing this will likely require much more refined techniques (probably considering the diversity of the population).

3.3 Offspring Populations: The (\(1+\lambda \)) EA on OneMax

In this section we give upper bounds for EAs using offspring populations, i.e. we consider the (\(1+\lambda \)) EA. From [12, Theorem 4] we know that the (\(1+\lambda \)) EA needs \(O(n\log n + n\lambda )\) fitness evaluations to optimize the static version of OneMax.
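As before, the algorithm itself can be sketched compactly (our own minimal Python rendition; parent and offspring are evaluated afresh in every iteration, and ties are broken in favor of an offspring):

```python
import random

def one_plus_lambda_ea(n, lam, noisy_f, max_iters=100_000, rng=random):
    """Minimal (1+lambda) EA sketch for bit strings of length n.

    Each iteration creates lam offspring by standard bit mutation
    (flip each bit independently with probability 1/n), evaluates
    offspring and parent afresh via noisy_f, and keeps an
    individual with the best observed value, preferring an
    offspring on ties.
    """
    x = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(max_iters):
        if sum(x) == n:  # stop once the true optimum is reached
            break
        offspring = [[b ^ (rng.random() < 1 / n) for b in x]
                     for _ in range(lam)]
        scored = [(noisy_f(y), y) for y in offspring]
        best_val, best = max(scored, key=lambda t: t[0])
        if best_val >= noisy_f(x):  # fresh noisy evaluation of the parent
            x = best
    return x
```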

Theorem 14

Let \(\lambda \ge 24\log n\) and, for each \(k < n\), let \(Y_k\) denote the maximum over \(\lambda \) observed values of \(X_k\) (belonging to inferior individuals) and let \(Z_k\) denote the maximum over at least \(\lambda /6\) observed values of \(X_k\) (belonging to better individuals). Writing \(\ell = n-k\), suppose there is \(q < 1\) such that

$$\begin{aligned} \forall k < n: P(Y_{k} < X_{k+1})\ge q, \end{aligned}$$
(5)

and

$$\begin{aligned} \forall k < n: P(Y_{k-1} < Z_{k}) \ge 1-\frac{q}{5}\frac{\ell \lambda }{en+\ell \lambda }. \end{aligned}$$
(6)

Then the (\(1+\lambda \)) EA optimizes f in an expected number of \(O((n\log n /\lambda + n)/q)\) iterations and needs \(O((n\log n + n\lambda )/q)\) fitness evaluations in expectation.

Proof

We show that there is a positive drift on the number of 1-bits. Let k be the number of 1-bits of the current search point and \(\ell \) the number of 0-bits, such that \(k+\ell =n\). Let \(p_k=(q/5)\ell \lambda /(en+\ell \lambda )\).

Let \(E_0\) be the event that at least one offspring is improved by at least 1 and is correctly accepted. Let \(E_1\) be the event that at least one offspring has less 1s than the current search point, and that it is still accepted. We have that \(P(E_0)\) can be bounded below by

$$\begin{aligned} q\left( 1-\left( 1-\frac{\ell }{en}\right) ^\lambda \right) \ge q\left( 1-e^{-\frac{\ell \lambda }{en}}\right) \ge q\frac{\ell \lambda }{en+\ell \lambda }. \end{aligned}$$

In order to estimate \(P(E_1)\) let \(E_2\) be the event that more than \(\lambda /6\) copied offspring (offspring in which mutation did not flip any bit) are created. We have

$$\begin{aligned} P(E_1)= P(E_1|E_2)P(E_2) +P(E_1|\overline{E_2})P(\overline{E_2}). \end{aligned}$$

Equation (6) gives us \(P(E_1|E_2)\le p_k\). By using a Chernoff bound we further get \(P(\overline{E_2})\le e^{-\lambda /24}\). Bounding the other probabilities by 1 we obtain \(P(E_1)\le p_k+e^{-\lambda /24}\). Conditioning on \(E_1\), the expected number of ones is at least \(k-2\). Furthermore, due to our definition of \(p_k\) and \(\ell =n-k\) we can show the following inequality:

$$\begin{aligned} p_k=\frac{q}{5}\frac{\ell \lambda }{en+\ell \lambda } =\frac{\frac{q}{5}}{1+\frac{en}{(n-k)\lambda }} \ge \frac{\frac{q}{5}}{1+\frac{en}{\lambda }} \ge \frac{\frac{q}{5}}{n\frac{(1+e)}{\lambda }} =\frac{q\lambda }{5n(1+e)}, \end{aligned}$$

where the inequalities follow from the trivial bounds \(k\le n-1\) and \(n\ge 1\), respectively. Since \(\lambda \ge 24\log n\), the last term is bounded below by 1 / n. Altogether, the expected increase in the number of 1s is

$$\begin{aligned} P(E_0)-2P(E_1)&\ge q\frac{\ell \lambda }{en+\ell \lambda }-2\left( p_k+e^{-\frac{\lambda }{24}}\right) \\&\ge q\frac{\ell \lambda }{en+\ell \lambda }-2\left( p_k+\frac{1}{n}\right) \\&\ge q\frac{\ell \lambda }{en+\ell \lambda }-4p_k\\&=\left( q-\frac{4q}{5}\right) \frac{\ell \lambda }{en+\ell \lambda }, \end{aligned}$$

where the third inequality follows from \(p_k\ge 1/n\), as shown above. Since the last term is positive, we can apply Theorem 3, which yields an upper bound of \(O((n\log n/\lambda +n)/q)\) for the expected number of iterations. Since \(\lambda \) fitness evaluations are performed in each iteration the expected number of fitness evaluations until an optimum is found is therefore \(O((n\log n + n\lambda )/q)\).\(\square \)

Again, the previous theorem allows us to easily derive corollaries in the context of noise. In the following we present two corollaries showing that offspring populations can handle much higher noise levels than the (\(1+1\)) EA. We first consider the noise model given in [5].

Corollary 15

Suppose prior noise which, with probability \(p > 0\), flips a bit uniformly at random. Then, for \(\lambda \ge \max \{12/p,24\} n\log n\), the (\(1+\lambda \)) EA optimizes OneMax in time \(O((n^2\log n/\lambda +n^2)/p)\), i.e. the (\(1+\lambda \)) EA needs \(O((n^2\log n + n^2\lambda )/p)\) fitness evaluations until an optimum is found.

Proof

Let \(c=\max \{12/p,24\}\). In order to apply Theorem 14 we need to check Eqs. (5) and (6). Let \(q=p/n\).

We show Eq. (5) by considering the event that the good individual evaluates better than it is. Note that then, any noise affecting the worse individuals does not harm the comparison since misevaluations can only differ from the true fitness by 1. We get

$$\begin{aligned} P(Y_{k} < X_{k+1})\ge 1-\left( 1-p+\frac{pk}{n}\right) \ge \frac{p}{n} = q. \end{aligned}$$

Reusing our above considerations regarding the probability that at least one of \(\lambda /6\) individuals improves, and due to our choice of \(\lambda \ge cn\log n\) we have

$$\begin{aligned} P(Y_{k-1} < Z_{k})&\ge 1-\left( 1-p+\frac{pk}{n}\right) ^{\lambda /6}\\&\ge 1-\frac{1}{e^{\frac{p\lambda }{6n}}}\\&\ge 1-\frac{1}{n^2}\\&\ge 1-\frac{\frac{q}{5}}{1+\frac{en}{(n-k)\lambda }}, \end{aligned}$$

thus fulfilling Eq. (6). The claim now follows from Theorem 14.\(\square \)
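The key estimate \(e^{-p\lambda /(6n)} \le n^{-2}\) for \(\lambda \ge \max \{12/p,24\}\,n\log n\) can be checked numerically (our own sanity check, not part of the proof):

```python
import math

# Numeric check of the key estimate in the proof of Corollary 15:
# for lam >= max(12/p, 24) * n * log n we need
#   exp(-p*lam/(6n)) <= 1/n^2.
for p in (0.1, 0.5, 1.0):
    for n in (10, 100):
        lam = math.ceil(max(12 / p, 24) * n * math.log(n))
        assert math.exp(-p * lam / (6 * n)) <= 1 / n ** 2
```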

The last corollary shows that offspring populations permit the optimization process to cope with a much higher level of noise than the (\(1+1\)) EA (see Corollary 7).

For the posterior noise model, we can show an even greater improvement over the (\(1+1\)) EA, similar to the result regarding parent populations.

Corollary 16

Let any non-positive additive posterior noise be given which has a non-zero probability p of evaluating to \(> -1\). Then we have that for \(\lambda \ge \max \{10e,-6\log (n/p)/\log (1-p)\}\) (implying \(\lambda = \varOmega (\log (n/p)/p)\)), the (\(1+\lambda \)) EA optimizes OneMax in time \(O((n\log n/\lambda +n)/p)\).

Proof

Let D be the posterior noise such that \(D\le 0\) and \(P(D > -1) = p\). Furthermore, let

$$\begin{aligned} \lambda \ge \max \left\{ 10e,\frac{-6\log \left( \frac{n}{p}\right) }{\log (1-p)}\right\} . \end{aligned}$$

Note that since \(D\le 0\) we have that \(Y_k\le k\) which gives us

$$\begin{aligned} P(Y_k\ge X_{k+1})\le P(k\ge X_{k+1}) =P(D\le -1)=1-p, \end{aligned}$$

which yields \(P(Y_k < X_{k+1})\ge p\), fulfilling the first condition of Theorem 14 with \(q=p\).

In order to show Eq. (6) we use the same bound as before. Note that individuals cannot be misevaluated better than they are, since we only consider non-positive, additive posterior noise. Hence, \(Y_{k-1}\ge Z_k\) implies that all of the at least \(\lambda /6\) individuals at fitness k evaluate to \(\le k-1\), i. e. \(P(Y_{k-1}\ge Z_k)\le (1-p)^{\frac{\lambda }{6}}\). Furthermore, due to our choice of \(\lambda \), we have \((1-p)^{\lambda /6}\le pn^{-1}\). Thus, using \(\lambda \ge 10e\) from the above definition, we have

$$\begin{aligned} P(Y_{k-1} < Z_k) \ge 1-\frac{p}{n} \ge 1-\frac{\frac{p}{5}\lambda }{2en} \ge 1-\frac{\frac{p}{5}}{1+\frac{en}{(n-k)\lambda }}, \end{aligned}$$

fulfilling the second condition, and the claim follows from Theorem 14.\(\square \)

The last corollary applies, for example, to additive posterior noise taken from \(-{\text {Exp}}(1)\) (where \({\text {Exp}}(\beta )\) denotes the exponential distribution with parameter \(\beta \)).

4 LeadingOnes

We follow up on the section about \(\textsc {OneMax} \) with results for \(\textsc {LeadingOnes} \). For this purpose we now fix a stochastic LeadingOnes function f according to one of our models. For each k, we let \(x_k^{\mathrm {opt}}\) be the bit string which has only 1s, except for position \(k+1\); let \(x_k^{\mathrm {pes}}\) be the bit string with k leading ones and otherwise only 0s. In a sense, \(x_k^{\mathrm {opt}}\) is optimal for a bit string with a leading ones value of k, while \(x_k^{\mathrm {pes}}\) is pessimal. We let \((X_k^\mathrm {opt})_{k \le n}\) and \((X_k^\mathrm {pes})_{k \le n}\) be two sequences of independent random variables such that, for all \(k \le n\), \(X_k^\mathrm {opt}\sim f(x_k^\mathrm {opt})\) and \(X_k^\mathrm {pes}\sim f(x_k^\mathrm {pes})\). We will assume the following about f.

  • For each evaluation of f on a bit string with the leftmost zero at position \(k+1\), the return value is drawn according to a distribution which is in between \(X_k^\mathrm {pes}\) and \(X_k^\mathrm {opt}\) with respect to stochastic dominance.

  • \(\forall j \le k < n: P(X_j^\mathrm {opt}< X_{k+1}^\mathrm {opt}) \ge P(X_k^\mathrm {opt}< X_{k+1}^\mathrm {opt})\).

  • \(\forall j \le k < n: P(X_j^\mathrm {pes}< X_{k+1}^\mathrm {pes}) \ge P(X_k^\mathrm {pes}< X_{k+1}^\mathrm {pes})\).
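The two extreme search points \(x_k^{\mathrm {opt}}\) and \(x_k^{\mathrm {pes}}\) are easy to write down explicitly (a small Python helper with our own naming):

```python
def leading_ones(x):
    """Number of leading 1s of the bit string x (list of 0/1)."""
    k = 0
    for b in x:
        if b == 0:
            break
        k += 1
    return k

def x_opt(k, n):
    """All 1s except a single 0 at position k+1 (1-based)."""
    return [1] * k + [0] + [1] * (n - k - 1)

def x_pes(k, n):
    """k leading 1s followed by only 0s."""
    return [1] * k + [0] * (n - k)
```

Both strings have a leading ones value of exactly k; they differ in the remaining bits, which is what the noise can act on.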

We show that, despite the more drastic consequences of noise, we still find sufficient conditions for efficient optimization similar to the ones we have already seen in Sect. 3.

We begin by giving upper and lower bounds for the (\(1+1\)) EA in Sect. 4.1. In Sects. 4.2 and 4.3 we show the effectiveness of parent and offspring populations, respectively, for the stochastic LeadingOnes problem by giving upper bounds.

4.1 The (\(1+1\)) EA on LeadingOnes

Similar to the analysis of OneMax subject to noise, we start by giving an upper bound for the (\(1+1\)) EA. As we see in the next theorem, the condition for LeadingOnes is stricter than in the case of OneMax: the probability of a correct comparison is required to be \(1- O(1/n^2)\) close to the optimum (see Eq. (7)).

Theorem 17

Suppose there is a positive constant \(c<1/(6e)\) such that

$$\begin{aligned} \forall k < n: P(X_k^\mathrm {opt}< X_{k+1}^\mathrm {pes}) \ge 1 - \frac{c}{kn} \end{aligned}$$
(7)

Then the (\(1+1\)) EA optimizes f in \(O(n^2)\) iterations in expectation.

Proof

Let \(p_k= c/(kn)\). We show that there is a positive drift on the number of leading 1-bits. Let k be the length of the prefix of the current search point consisting of 1s.

We stick to our previous notation and denote by \(E_0\) the event that the new search point has a longer prefix consisting of 1s and the comparison of old and new search point indicates correctly that the new search point is better. Let \(E_1\) be the event that the new search point has a smaller number of leading ones than the current search point, and that it is accepted. Conditioning on \(E_1\) we can trivially bound the expected number of leading ones below by 0. We have \(P(E_0)\ge (1-p_k)/(en)\) and \(P(E_1)\le p_{k-1}(1-e^{-1})\le p_{k-1}\). Therefore, the expected increase in the number of leading 1s is

$$\begin{aligned} P(E_0) - kP(E_1)&\ge \frac{(1-p_k)}{en}-kp_{k-1}\\&=\frac{1-\frac{c}{kn}}{en}-\frac{ck}{(k-1)n}\\&\ge \frac{1}{en}-\frac{c}{ekn^2}-\frac{2c}{n}\\&\ge \frac{1}{en}-\frac{c}{n}\left( \frac{1}{e}+2\right) \\&\ge \frac{1}{en}-\frac{3c}{n}. \end{aligned}$$

Due to our choice of \(c<1/(6e)\) we have a positive additive drift of 1 / (2en) leading to an upper bound of \(O(n^2)\) for the expected run time of the algorithm by applying the Additive Drift Theorem (see Theorem 1).\(\square \)

It is only to be expected that noise disrupts the optimization of LeadingOnes immensely. Consequently, our following corollaries to Theorem 17 are rather weak with respect to the noise allowed (basically, the algorithm will only experience constantly many incorrect decisions during optimization, in expectation). First, we consider prior noise where one bit is flipped uniformly at random before evaluation.

Corollary 18

Suppose prior noise which, with probability p, flips a bit uniformly at random. Then we have that in expectation the (\(1+1\)) EA optimizes LeadingOnes in \(O(n^2)\) iterations if \(p\le 1/(6en^2)\).

Proof

Let \(k < n\). We estimate \(P(X_k^\mathrm {opt}< X_{k+1}^\mathrm {pes})\) by observing that the event \(X_k^\mathrm {opt}\ge X_{k+1}^\mathrm {pes}\) requires the individual with k 1s to be evaluated to \(\ge k+1\) or the other to \(\le k\). The first option has a probability of at most p / n, the second of at most pk / n. Hence, \(P(X_k^\mathrm {opt}< X_{k+1}^\mathrm {pes}) \ge 1- (k+1)/(6en^3) \ge 1-1/(6en^2)\) and the desired bound follows from Theorem 17.\(\square \)

Regarding posterior noise we give the following corollary.

Corollary 19

Suppose posterior noise, sampling from some distribution D with variance \(\sigma ^2\). Then we have that the (\(1+1\)) EA optimizes LeadingOnes in \(O(n^2)\) iterations in expectation if \(\sigma ^2 \le 1/(12en^2)\).

Proof

Note that, in this case, \(X_k^\mathrm {opt}\sim X_k^\mathrm {pes}\), for all \(k \le n\). With the same argument as in Corollary 10 we have \(P(X_k^\mathrm {opt}< X_{k+1}^\mathrm {pes}) \ge 1-2\sigma ^2\). Thus, for \(\sigma ^2 \le 1/(12en^2)\) the claim follows from Theorem 17.\(\square \)

Next we give a lower bound for the (\(1+1\)) EA for the prior noise model. We will not give a general lower bound that holds for both of our models because it is very easy for the (\(1+1\)) EA to detect an inferior noisy offspring by selection if LeadingOnes is subjected to posterior noise.

Theorem 20

Suppose prior noise which, with probability 1/2, flips a bit uniformly at random. Then we have that the (\(1+1\)) EA optimizes LeadingOnes in \(2^{\varOmega (n)}\) iterations in expectation.

Proof

It is sufficient to show that there is a constant negative drift on the number of ones in the interval between 99n / 100 and n. Let \(k \ge 99n/100\) be the number of 1s of the current search point. Note that the number of 1s k is a trivial upper bound for the number of leading ones.

Let \(E_0\) be the event that the new search point has at least one more 1 than the current search point and the comparison of old and new search point indicates that the new search point is to be accepted. The expected number of 1s conditional on \(E_0\) is at most \(k+2\) and we have a trivial bound of \(P(E_0)\le (n-k)/n \le 1/100\).

Let \(E_1\) be the event that the new search point differs from the old by flipping exactly one 1 in the right half of positions, and that it is accepted. We want to estimate \(P(E_1)\). There are at least 49n / 100 1s in the right half of positions, so the probability of flipping exactly one of them and no other is at least 49 / (100e). In order to estimate the probability of accepting such an offspring, we consider two cases. First, assume that the parent has a leading ones value of at least n / 2. Then the probability of the noisy evaluation evaluating the parent to a value \(<n/2\) is at least 1/4 (by choosing to flip a bit in the left half in the evaluation), while evaluating the offspring to its true value \(\ge n/2\) has a probability of at least 1/2. In total we have \(P(E_1) \ge 49/(800e)\) in this case.

Second, assume that the parent has a leading ones value of \(<n/2\). Then both parent and offspring have the same leading ones value; with probability 1/4 they both evaluate to their true value, which favors the offspring. Thus, in this case, we get \(P(E_1) \ge 49/(400e)\). Overall we thus have \(P(E_1) \ge 49/(800e) > 1/50\).

Thus, the total drift away from the optimum satisfies

$$\begin{aligned} P(E_1)-2P(E_0) \ge 49/(800e) - 2/100 \end{aligned}$$

which gives us a constant negative drift. Since long jumps are sufficiently small (due to our choice of using the number of 1s as potential), we can apply Theorem 4 which yields our result.\(\square \)

4.2 Parent Populations: The (\(\mu +1\)) EA on LeadingOnes

In this section we give upper bounds for EAs using parent populations, i.e. we consider the (\(\mu +1\)) EA. As for the (\(\mu +1\)) EA on OneMax, we again only give the weaker bound of \(O(\mu n^2)\) for the sake of simplicity instead of \(\varTheta (\mu n\log n +n^2)\) which is the bound for the static version of LeadingOnes ([22]). Note that the proofs for the upper bounds for the (\(1+1\)) EA carry over to the case of the (\(\mu +1\)) EA.

Theorem 21

Let \(\mu \) be given and, for each \(k < n\), let \(Y_k\) denote the minimum over \(\mu \) observed values of \(X_k^\mathrm {opt}\). If there is a positive constant \(c<1/(6e)\) such that

$$\begin{aligned} \forall k: n/4 < k < n \Rightarrow P(Y_k < X_{k+1}^\mathrm {pes}) \ge 1-\frac{c}{\mu k n}, \end{aligned}$$
(8)

then the (\(\mu +1\)) EA optimizes f in \(O(\mu n^2)\) iterations in expectation.

Note that, due to the dependence of \(Y_k\) on \(\mu \), Eq. (8) typically gets less restrictive with growing \(\mu \), just as in Theorem 12.

Proof

Let \(p_k= c/(\mu kn)\). We show that there is a positive drift on the number of leading 1-bits of a current best individual. Let k be the length of the prefix of 1s of a current best individual. Let \(E_0\) be the event that a best individual is improved by at least 1 by mutation and the improved offspring is accepted. Let \(E_1\) be the event that the new individual has a shorter prefix of 1s and that the unique best individual is dropped from the parent population. We have that \(P(E_0)\ge (1-p_k)/(e\mu n)\), assuming pessimistically that there is only one best individual. On the other hand, \(P(E_1)\le p_{k-1}\).

Since, conditioned on \(E_1\), the expected number of leading ones of a best individual is trivially at least 0 we can bound the expected increase in the number of leading 1s of a best individual below by

$$\begin{aligned} P(E_0) - kP(E_1)&\ge (1-p_k)\frac{1}{e\mu n}-kp_{k-1}\\&=\frac{1}{e\mu n}-\frac{c}{e\mu ^2 kn^2}-\frac{2c}{\mu n}\\&\ge \frac{1}{e\mu n}-\frac{c}{\mu n}\left( \frac{1}{e}+2\right) \\&\ge \frac{1}{e\mu n}-\frac{3c}{\mu n} \end{aligned}$$

and the last term can be bounded below by \(1/(2e\mu n)\) due to our choice of \(c<1/(6e)\). Applying Theorem 1 yields an upper bound of \(2e\mu n^2\) for the expected number of iterations until the optimum is found. Taking the cost of initialization into account we have an expected run time of \(\mu + 2e\mu n^2\), proving our claim.\(\square \)

Theorem 21 is not strong enough to derive a good upper bound for the prior noise model where a bit flip is performed with certain probability. Regarding posterior noise, we can still derive the following corollary.

Corollary 22

Let any non-negative additive posterior noise be given which has a non-zero constant probability of evaluating to \(<1\). Then there is a constant c such that, for \(n/(8e)\ge \mu \ge c\log n\), the (\(\mu +1\)) EA optimizes LeadingOnes in time \(O(\mu n^2)\).

Proof

Let D be the posterior noise and \(p = P(D < 1)\) a non-zero constant. Let \(c = -3/\log (1-p)\) and \(\lfloor n/(8e)\rfloor \ge \mu \ge \lceil c\log n\rceil \). We have

$$\begin{aligned} P(Y_k < X_{k+1}^\mathrm {pes}) \ge 1-P(D \ge 1)^\mu \ge 1-(1-p)^{-3\log n/\log (1-p)}, \end{aligned}$$

and the last term equals \(1-n^{-3}\). Using \(\mu \le n/(8e)\), we have

$$\begin{aligned} 1-\frac{1}{n^3} \ge 1-\frac{1/(8e)}{\mu n^2} \ge 1-\frac{1/(8e)}{\mu k n}, \end{aligned}$$

for all \(1\le k\le n\). Since \(1/(8e) < 1/(6e)\) holds, we can apply Theorem 21, which yields the result.\(\square \)
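The identity underlying the choice \(c = -3/\log (1-p)\) can be checked numerically (our own sanity check): \((1-p)^{c\log n} = n^{-3}\).

```python
import math

# Check the identity used in Corollary 22: with c = -3/log(1-p),
# mu = c*log n copies suffice because (1-p)^(c*log n) = n^(-3).
for p in (0.25, 0.5, 0.9):
    c = -3 / math.log(1 - p)
    for n in (10, 100):
        val = (1 - p) ** (c * math.log(n))
        assert abs(val - n ** -3) < 1e-12
```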

4.3 Offspring Populations: The (\(1+\lambda \)) EA on LeadingOnes

In this section we consider the (\(1+\lambda \)) EA. The next theorem gives conditions for efficient optimization of stochastic LeadingOnes problems. Note that Theorem 14 establishes a similar result for OneMax. The bound we give is the same as is shown for static LeadingOnes in [12].

Theorem 23

Let \(\lambda \ge 72\log n\) and, for each \(k < n\), let \(Y_k\) denote the maximum over \(\lambda \) observed values of \(X_k^\mathrm {opt}\) (belonging to inferior individuals) and let \(Z_k\) denote the maximum over at least \(\lambda /6\) observed values of \(X_k^\mathrm {pes}\) (belonging to better individuals). Suppose there is \(q < 1\) with \(q\ge 6e/n\) such that

$$\begin{aligned} \forall k < n: P(Y_{k} < X_{k+1}^\mathrm {pes})\ge q, \end{aligned}$$
(9)

and

$$\begin{aligned} \forall k < n: P(Y_{k-1} < Z_{k}) \ge 1-\frac{\frac{q}{3}}{k(1+\frac{en}{\lambda })}. \end{aligned}$$
(10)

Then the (\(1+\lambda \)) EA optimizes f in expectation in \(O((n+n^2/\lambda )/q)\) iterations and needs \(O((n^2+\lambda n)/q)\) fitness evaluations.

The proof of Theorem 23 is essentially identical to the proof of Theorem 14. We therefore omit it here.

We give the following corollary showing that even small offspring populations can lead to an improvement over the (\(1+1\)) EA.

Corollary 24

Let \(p\le c/n\) for some constant \(c<1/(12e+2)\). Suppose prior noise which, with probability p, flips a bit uniformly at random. Then, for \(\lambda \ge 72\log n\) with \(\lambda = o(n)\), the (\(1+\lambda \)) EA optimizes LeadingOnes in expectation in \(O(n+n^2/\lambda )\) iterations, i.e. with \(O(\lambda n+n^2)\) fitness evaluations.

Proof

In order to apply Theorem 23 we need to show Eqs. (9) and (10). We can show the former considering the event that none of \(\lambda \) inferior individuals is evaluated better than it is and that the good individual is evaluated to its true value. Let \(k<n\), then we get

$$\begin{aligned} P(Y_{k} < X_{k+1}^\mathrm {pes})\ge \left( 1-\frac{p}{n}\right) ^\lambda \left( 1-p\right) \ge \left( 1-p\right) ^2\ge 1-2c. \end{aligned}$$

Thus, Eq. (9) holds for \(q = 1-2c\), since c is a constant.

Regarding Eq. (10) we have that the probability that at least one of at least \(\lambda /6\) good individuals is not evaluated worse is \(1-(pk/n)^{\lambda /6}\). Reusing our above considerations regarding the probability that inferior individuals do not improve, we have

$$\begin{aligned} P(Y_{k-1} < Z_{k})&\ge \left( 1-\frac{p}{n}\right) ^\lambda \left( 1-\left( \frac{pk}{n}\right) ^\frac{\lambda }{6}\right) \\&\ge \left( 1-\frac{c\lambda }{n^2}\right) \left( 1-\left( \frac{ck}{n^2}\right) ^\frac{\lambda }{6}\right) \\&\ge \left( 1-\frac{c\lambda }{n^2}\right) ^2\\&\ge 1-\frac{2c\lambda }{n^2}. \end{aligned}$$

Due to our choice of \(c<1/(12e+2)\) we have \(2c < (1-2c)/6e\), i. e. \(2c<q/(6e)\). Hence, we can further bound the last term and get

$$\begin{aligned} 1-\frac{2c\lambda }{n^2} > 1-\frac{\frac{q}{6e}}{\frac{n^2}{\lambda }} = 1-\frac{\frac{q}{3}}{2\frac{en^2}{\lambda }} \ge 1-\frac{\frac{q}{3}}{n+\frac{en^2}{\lambda }} \ge 1-\frac{\frac{q}{3}}{k+\frac{ekn}{\lambda }}. \end{aligned}$$

Thus, Eq. (10) holds and the claim follows from Theorem 23.\(\square \)

We follow up by showing an even greater improvement over the (\(1+1\)) EA for the posterior noise model.

Corollary 25

Let any non-positive, additive posterior noise be given which has a non-zero constant probability p of evaluating to \(> -1\). Then we have that for \(en\ge \lambda \ge \max \{6e, -12\log (n/p)/\log (1-p)\}\) (implying \(\lambda =\varOmega (\log (n/p)/p)\cap O(n)\)) the (\(1+\lambda \)) EA optimizes LeadingOnes in expectation in \(O(n+n^2/\lambda )\) iterations.

Proof

Let D be the posterior noise, such that \(D\le 0\) and \(P(D > -1) = p\). Furthermore, let

$$\begin{aligned} en\ge \lambda \ge \max \left\{ 6e, \frac{-12\log \left( \frac{n}{p}\right) }{\log (1-p)}\right\} . \end{aligned}$$
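This choice of \(\lambda \) can be sanity-checked numerically. The following sketch uses the illustrative values \(n=100\) and \(p=1-e^{-1}\) (the value arising for \(-\mathrm {Exp}(1)\) noise); neither is prescribed by the corollary. It confirms the tail bound \((1-p)^{\lambda /6}\le pn^{-2}\) that the proof relies on:

```python
import math

# Illustrative values (not from the paper): problem size n and noise parameter p.
n = 100
p = 1 - math.exp(-1)             # P(D > -1) for D ~ -Exp(1)

# Smallest integer lambda satisfying the corollary's lower bound.
lam = math.ceil(max(6 * math.e, -12 * math.log(n / p) / math.log(1 - p)))
assert lam <= math.e * n         # the corollary also requires lambda <= e*n

# The choice of lambda guarantees (1-p)^(lam/6) <= (p/n)^2 <= p/n^2.
assert (1 - p) ** (lam / 6) <= (p / n) ** 2 <= p / n**2
print(f"lam = {lam}, tail bound = {(1 - p) ** (lam / 6):.2e}")
```

Indeed, \(\lambda \ge -12\log (n/p)/\log (1-p)\) yields \((1-p)^{\lambda /6}\le (p/n)^2\), which is at most \(p/n^2\) since \(p\le 1\).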

Note that since \(D\le 0\) we have that \(Y_k\le k\) which gives us

$$\begin{aligned} P(Y_k\ge X_{k+1}) \le P(k\ge X_{k+1}) =P(D\le -1)=1-p, \end{aligned}$$

which yields \(P(Y_k < X_{k+1})\ge p\), fulfilling the first condition of Theorem 23 with \(q=p\).

In order to check Eq. (10) we use the same bound as before. Note that individuals cannot be misevaluated better than they are, since we only consider non-positive, additive posterior noise. Hence, \(Y_{k-1}\ge Z_k\) implies that all of the at least \(\lambda /6\) individuals at fitness k need to evaluate to \(\le k-1\), i.e., \(P(Y_{k-1}\ge Z_k)\le (1-p)^{\frac{\lambda }{6}}\). Due to our choice of \(\lambda \), we have \((1-p)^{\lambda /6}\le pn^{-2}\). We can estimate

$$\begin{aligned} P(Y_{k-1} < Z_k) \ge 1-\frac{p}{n^2} \ge 1-\frac{\frac{p}{3}\lambda }{2en^2} \ge 1-\frac{\frac{p}{3}}{n\left( 1+\frac{en}{\lambda }\right) } \ge 1-\frac{\frac{p}{3}}{k(1+\frac{en}{\lambda })}, \end{aligned}$$

where the second inequality follows from \(\lambda \ge 6e\) and the third inequality stems from our assumption of \(\lambda \le en\). Thus, the second condition is fulfilled, and the claim follows from Theorem 23.\(\square \)

Just as for Corollary 12 the last corollary applies, for example, to additive posterior noise taken from \(-{\text {Exp(1)}}\).
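To make the setting of Corollary 25 concrete, the following minimal sketch (our own illustrative implementation, not code from the paper) runs a (\(1+\lambda \)) EA on LeadingOnes where every fitness evaluation is perturbed by additive \(-\mathrm {Exp}(1)\) noise; all individuals, including the parent, are reevaluated in each iteration. For \(n=20\) and \(p=1-e^{-1}\) the corollary's condition asks for \(\lambda \ge \max \{6e,-12\log (n/p)/\log (1-p)\}\approx 41.5\) and \(\lambda \le en\approx 54.4\), so we take \(\lambda =42\):

```python
import random

def leading_ones(x):
    """True LeadingOnes value: number of leading 1-bits."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def noisy_eval(x, rng):
    """LeadingOnes perturbed by additive -Exp(1) posterior noise."""
    return leading_ones(x) - rng.expovariate(1.0)

def one_plus_lambda_ea(n, lam, iterations, rng):
    """(1+lambda) EA with standard bit mutation (rate 1/n); every
    individual, including the parent, is (re)evaluated with noise."""
    parent = [rng.randint(0, 1) for _ in range(n)]
    best_true = leading_ones(parent)
    for _ in range(iterations):
        pop = [parent]
        for _ in range(lam):
            child = [b ^ (rng.random() < 1 / n) for b in parent]
            pop.append(child)
        # Select the individual with the best *noisy* fitness.
        parent = max(pop, key=lambda x: noisy_eval(x, rng))
        best_true = max(best_true, leading_ones(parent))
        if best_true == n:
            break
    return parent, best_true

rng = random.Random(1)  # fixed seed for reproducibility
x, best = one_plus_lambda_ea(n=20, lam=42, iterations=5000, rng=rng)
print("best true LeadingOnes reached:", best)
```

Despite selection acting only on noisy values, the offspring population hedges against bad evaluations of the current best, and the run typically reaches the optimum well within the \(O(n+n^2/\lambda )\) iteration bound.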

5 Conclusion

In this paper we consider the optimization of noisy versions of OneMax and LeadingOnes. The summary of the results is that populations are necessary for successful optimization at any substantial noise level. The surprising result is that even very small populations (of size logarithmic in the problem size) already lead to very high robustness to noise (see Corollaries 13 and 24).

From the formal analysis we see the reason for this robustness: in a (parent) population of size \(\mu \), while the best individual might look bad in a given iteration, there will typically be (objectively) worse individuals which also look worse. This holds as long as there are enough individuals in the parent population to make sure that one of them evaluates worse than the (objectively) best individual, and as long as the noise is never too disruptive. For example, if a non-best individual evaluates worse than the best individual with constant probability, a logarithmic number of non-best individuals is large enough to give very high confidence that such a bad individual is dropped. This observation probably extends to the analysis of the (\(\mu +\lambda \)) EA with \(\mu >1\) and \(\lambda >1\); here the same approach of computing the drift could lead to further results.
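The quantitative core of this argument can be illustrated with a short computation (illustrative numbers, not taken from the paper): if each non-best individual independently evaluates worse than the best one with constant probability p, then the probability that none of \(\mu \) such individuals does is \((1-p)^{\mu }\), which is already polynomially small for \(\mu =c\log n\):

```python
import math

p = 0.5                               # illustrative constant probability
for n in (10**2, 10**4, 10**6):
    mu = math.ceil(3 * math.log(n))   # logarithmic population size
    fail = (1 - p) ** mu              # event: no individual evaluates worse
    print(f"n={n:>9}, mu={mu:>3}, failure probability={fail:.2e}")
```

In general \((1-p)^{c\log n}=n^{c\log (1-p)}\), so the exponent of the polynomial can be tuned via the constant c.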

As for offspring populations, in the (\(1+\lambda \)) EA the current individual is cloned multiple times and thus hedges against bad evaluations (as long as good evaluations are sufficiently likely). This does not extend to the analysis of the (\(\mu +\lambda \)) EA with \(\mu >1\) and \(\lambda >1\) in a straightforward way, as clones might be made from sub-optimal individuals.

All these analyses would probably carry over easily to different mutation probabilities: changing them by a constant factor should only change some constants of the theorems. An entirely different question is about what happens if not all old search points are reevaluated, as for example in [20]. For high noise values, this will likely disrupt optimization heavily.

While we obtain very accurate statements about the capabilities of the (\(1+1\)) EA in stochastic environments (including upper and lower bounds), there is still a lot to learn about the capabilities of population-based EAs. On the one hand we do not give any lower bounds; on the other hand we believe that the positive results can be extended. We believe that such extensions will require more refined techniques: all our bounds are based on worst-case assumptions about the distribution of the search points (for example, in Theorem 12 we assume that we have only one good individual and that all other individuals are worse by exactly 1). We think that the results for the (\(1+\lambda \)) EA might be easier to extend than the results for the (\(\mu +1\)) EA, as the former has a state that is much easier to control from one iteration to the next (only the best individual survives).

Finally, it would be very interesting to see analyses of other algorithms on stochastic fitness functions. Especially interesting would be an analysis of crossover, or any kind of estimation of distribution algorithm, including ant colony optimization algorithms.