1 Introduction

The use of a random mechanism to solve a problem in a deterministic setup is common not only in mathematics but far beyond any formal framework. Random decisions were already employed in ancient times, when the procedure of drawing lots was widespread; even political events, such as the election of governing officers in Athens, were randomized. Nowadays, elements of randomization are often exploited in sports competitions to equalize the chances of the participants, and the importance of random mixing in medical and biological experiments is beyond doubt.

Probably one of the first applications of the stochastic approach in mathematics is the theory of mixed strategies in zero-sum games by John von Neumann, where the role of randomization is to keep the player's strategy secret from the competitor. At approximately the same time, Fisher proposed applying mixed strategies in experiment design, where their role was different. A real breakthrough was the invention of the Monte Carlo methods by Ulam, Metropolis, von Neumann, and Teller [43], after which the ideas of random sampling became very popular in modeling and numerical analysis. Randomization methods have thus found numerous applications in various fields of research; to survey all of them does not seem realistic. In this chapter we restrict ourselves to some problems related to estimation, robustness, and continuous optimization. The typical question to be analyzed is as follows: Given a deterministic problem (say, unconstrained smooth optimization), how can randomization ideas be exploited for its solution, and are randomized methods true competitors to deterministic ones? We will see that the situation differs across the various domains of interest.

The role of Roberto Tempo in the progress of this approach cannot be overestimated. His research since 2000 was mostly dedicated to randomization methods in control, robustness, and optimization; see the monograph [77]. In the present chapter, we continue this line of research, but we also address directions which have little intersection with [77] as well as with other monographs and surveys on randomization [14, 28, 29].

Due to the wide spectrum of problems under consideration, we are forced to keep the presentation really brief; the references do not pretend to be complete. However, we have tried to emphasize the pioneering works and surveying publications.

2 Uncertainty and Robustness

Mathematical models for systems and control are often unsatisfactory due to the incompleteness of the parameter data. For instance, the ideas of off-line optimal control can only be applied to real systems if all the parameters, exogenous perturbations, state equations, etc., are known precisely. Moreover, feedback control also requires detailed information which is not available in most cases. For example, to drive a car with four-wheel control, the controller should be aware of the total weight, the location of the center of gravity, weather conditions, and highway properties, as well as many other data which may not be known. In that respect, even such a relatively simple real-life system can be considered a complex one; in such circumstances, control under uncertainty is a highly important issue.

In this section we consider parametric uncertainty; other types of uncertainty can be treated within more general models of robustness.

There are numerous tools to check robustness under parametric uncertainty; below we focus on randomized methods. This line of research goes back to the pioneering papers by Stengel and Ray [74]. Within this approach, the uncertain parameters are assumed to have random rather than deterministic nature; for instance, they are assumed to be uniformly distributed over the respective intervals of uncertainty. Next, an acceptable tolerance \(\varepsilon \), say \(\varepsilon = 0.01\), is specified, and a check is performed whether the resulting random family (of polynomials, matrices, transfer functions) is stable with probability no less than \((1-\varepsilon )\); see [77] for a comprehensive exposition of this randomized approach to robustness.
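As a minimal illustration of this procedure, the following sketch (in Python; the stability test `is_stable` is a hypothetical placeholder for whichever concrete criterion applies to the family at hand) estimates the probability of stability by uniform sampling over a box of interval uncertainties, with the sample size taken from the standard additive Chernoff bound \(N \ge \ln (2/\delta )/(2\varepsilon ^2)\):

```python
import numpy as np

def estimate_stability_probability(is_stable, bounds, eps=0.01,
                                   delta=1e-3, rng=None):
    """Estimate Prob{family stable} for q uniform on a box (a sketch).

    is_stable : callable q -> bool, any concrete stability test (placeholder)
    bounds    : list of (low, high) uncertainty intervals for each parameter
    The additive Chernoff bound N >= ln(2/delta) / (2 eps^2) guarantees the
    estimate is within eps of the true probability w.p. at least 1 - delta.
    """
    rng = rng or np.random.default_rng()
    n_samples = int(np.ceil(np.log(2.0 / delta) / (2.0 * eps**2)))
    lows, highs = np.array(bounds, dtype=float).T
    hits = sum(bool(is_stable(rng.uniform(lows, highs)))
               for _ in range(n_samples))
    return hits / n_samples

# Toy example: s^3 + q2 s^2 + q1 s + q0 is Hurwitz iff q_i > 0 and q2 q1 > q0.
p_hat = estimate_stability_probability(
    lambda q: q[0] > 0 and q[1] > 0 and q[2] > 0 and q[2] * q[1] > q[0],
    bounds=[(0.5, 1.5)] * 3)
```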

In many of the NP-hard robustness problems, such a reformulation often leads to exact or approximate solutions. Moreover, the randomized approach has several attractive properties even in situations where a deterministic solution is available. Indeed, the deterministic statements of robustness problems are minimax; hence, the answer is dictated by the “worst” element in the family, whereas these critical values of the uncertain parameters are rather unlikely to occur. Therefore, by neglecting a small risk of violation of the desired property (say, stability), the admissible domains of variation of the parameters may be considerably extended. This effect is known as the probabilistic enhancement of robustness margins; it is particularly tangible for a large number of parameters. Another attractive property of the randomized approach is its low computational complexity, which grows only slowly with the number of uncertain parameters.

We illustrate some of these concepts and effects.

2.1 Volume of Violation and Approximate Feasibility

We consider robustness problems for systems described in terms of a design vector \(x \in X \subseteq {\mathbb R}^n\) and a real uncertain parameter vector \(q \in Q \subset {\mathbb R}^\ell \), where Q is a box. For such systems, the objective is to select \(x \in X\) such that a given continuous performance specification

$$\begin{aligned} f(x,q) \le 0 \end{aligned}$$
(1)

is satisfied for all \(q \in Q\). When such a design vector x exists, the triple (f, X, Q) is said to be robustly feasible.

In a number of situations, robust feasibility of \(f(x,q)\le 0\) is guaranteed if and only if \(f(x,q^i) \le 0\) for each of the vertices \(q^i\) of the \(\ell \)-dimensional box Q; we then use the term vertexization. A typical example of a vertexization is the quadratic stability problem for the system with state space matrix \(A(q) = A_0 + \sum _{i=1}^\ell A_iq_i\), where the \(A_i\in {\mathbb R}^{n\times n}\) are fixed and known, and the uncertain parameter vector is \(q \in Q\). The goal is to find a symmetric candidate Lyapunov matrix \(P=P(x)\), with entries \(x_i \in {\mathbb R}\) viewed as the design variables, such that \(P(x)\succ 0\) and the linear matrix inequality (LMI) \(A^\top (q)P(x) + P(x)A(q) \prec 0\) holds for all \(q \in Q\) (throughout the text, \(\succ \) and \(\prec \) denote positive and negative definiteness of a symmetric matrix, respectively). Hence, with

$$ f(x,q) = \lambda _{\max }\Bigl (A^\top \!(q)P(x) + P(x)A(q)\Bigr ), $$

this strict feasibility design problem in x is reducible to the vertices \(q^i\) of Q. That is, the satisfaction of the Lyapunov inequality above for all \(q \in Q\) is equivalent to \(A^\top \!(q^i)P(x)+P(x)A(q^i)\prec 0\) for \(i=1,2,\ldots ,N\). However, since \(N = 2^\ell \), the computational task can easily get out of hand. For example, with five states and ten uncertain parameters, stacking the \(2^{10}=1024\) vertex inequalities of size \(5\times 5\) yields an LMI of size greater than \(5000 \times 5000\).
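For illustration only, a brute-force vertex check might look as follows (a sketch; the matrices \(A_i\), the candidate P, and the bounds are assumed given). It enumerates all \(2^\ell \) vertices, which is exactly the exponential cost discussed above:

```python
import itertools
import numpy as np

def check_vertices(A_list, P, q_bounds):
    """Verify A(q)^T P + P A(q) < 0 at all 2^l vertices of the box Q (sketch).

    A_list   : [A0, A1, ..., Al] with A(q) = A0 + sum_i q_i A_i (given data)
    P        : symmetric positive-definite candidate Lyapunov matrix
    q_bounds : list of (low, high) intervals for the q_i
    """
    A0, rest = A_list[0], A_list[1:]
    for vertex in itertools.product(*q_bounds):   # all 2^l corners
        A = A0 + sum(qi * Ai for qi, Ai in zip(vertex, rest))
        # largest eigenvalue of the symmetric matrix A^T P + P A
        if np.linalg.eigvalsh(A.T @ P + P @ A).max() >= 0:
            return False, vertex                  # inequality violated here
    return True, None
```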

As an alternative to the computational burden associated with vertexization, it is often possible to introduce an overbounding function in such a way as to enable convex programming in order to test for robust feasibility; also, see Sect. 4 for a different approach to solving the feasibility problem for LMIs. Note also that a reduction to checking the vertices is rather an exception and is considered here for illustrative purposes, while the overbounding techniques may be applied to much broader classes of systems.

Specifically, given x, introduce the associated violation set

$$ Q_{bad}(x) \doteq \{q \in Q:\; f(x,q) > 0\} $$

and estimate from above its volume. Equivalently, assuming that the uncertainty vector q is random, uniformly distributed over Q, we estimate from above the probability of violation for the performance specification.
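Before turning to the overbounding constructions of [4, 5] discussed next, note that this probability of violation can always be estimated empirically by plain Monte Carlo; a minimal sketch (with a generic, user-supplied performance function f) is:

```python
import numpy as np

def violation_fraction(f, x, q_bounds, n_samples=100_000, rng=None):
    """Monte Carlo estimate of Vol(Q_bad(x)) / Vol(Q) for a box Q (sketch).

    f : user-supplied performance specification; f(x, q) <= 0 means "spec met".
    """
    rng = rng or np.random.default_rng()
    lows, highs = np.array(q_bounds, dtype=float).T
    qs = rng.uniform(lows, highs, size=(n_samples, len(q_bounds)))
    return float(np.mean([f(x, q) > 0 for q in qs]))
```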

In [4, 5], a computationally modest method for finding such overbounding functions is proposed and numerical examples are presented.

More delicate constructions are also described in [4, 5], where the notion of approximate feasibility is introduced. Namely, the triple (f, X, Q) is said to be approximately feasible if the following condition holds: Given any \(\varepsilon > 0\), there exists some \(x^\varepsilon \in X\) such that

$$ \mathbf{Vol}\Bigl (\{q \in Q: \; f(x^\varepsilon ,q) > 0\}\Bigr ) < \varepsilon , $$

where \(\mathbf{Vol}(\cdot )\) stands for the volume of a set. For such \(\varepsilon \), \(x^\varepsilon \) is called an \(\varepsilon \)-approximate solver. So, instead of guaranteeing satisfaction of \(f(x,q) \le 0\) for all \(q \in Q\), we seek solution vectors x with associated violation set having volume less than any arbitrarily small prespecified level \(\varepsilon > 0\).

We present a formal result on approximate feasibility in general terms; the details can be found in [4, 5]. First, we consider functions f(x, q) that are homogenizable in x and use their homogenized versions, denoted by \(f^+(x_0,x,q)\). In [4, 5] this requirement was shown to be not very restrictive, covering quite a large class of functions. Next, the notion of an approximate feasibility indicator (AFI) is introduced; in a sense, it is a convex generalization of the classical indicator function. For instance, a “natural” type of AFI is the exponential one, \(\phi (\zeta ) = e^\zeta \).

In the theorem to follow, the approximate feasibility indicator \(\phi (\zeta )\) is used with argument \(\zeta = f^+(x_0,x,q)\) in the determination of approximate feasibility.

Theorem 1

([4, 5]) Given the continuous homogenizable performance specification function f(x, q), \(X ={\mathbb R}^n\), and an approximate feasibility indicator \(\phi (\cdot )\), define

$$ \Phi (x_0,x) \doteq \int \limits _Q\phi (f^+(x_0,x,q))\mathrm{d}q $$

and

$$ \Phi ^*\doteq \inf \limits _{x_0>0,x}\Phi (x_0,x). $$

Then the following holds:

(i) \(\Phi ^* = 0\) implies approximate feasibility of (f, X, Q);

(ii) For any \(x_0>0\) and \(x\in {\mathbb R}^n\),

$$ \mathbf{Vol}\left( Q_{bad}\Bigl (\frac{x}{x_0}\Bigr )\right) \le \Phi (x_0,x). $$

A similar idea of overbounding was presented in [6]. Multivariable polynomials f(x) with parameter vector x restricted to a hypercube \(X\subset {\mathbb R}^n\) were considered, and the objective was to check the robust positivity of f(x), i.e., to determine whether \(f(x) > 0\) for all \(x \in X\). Again, instead of solving the original NP-hard problem, the authors proposed a straightforward procedure for computing an upper bound on the volume of violation via a respective dilation integral that depends on the degree k of a certain auxiliary polynomial, followed by a convex minimization in one scalar parameter. By increasing the degree k, the authors obtain a sequence of upper bounds \(\varepsilon _k\) which are shown to be “sharp” in the sense that they converge to zero whenever the positivity requirement is satisfied. Notably, this dilation integral method applies to a general polynomial dependence on the variables.

2.2 Probabilistic Predictor

In the discussion above, the stochastic nature of the uncertain parameters was somewhat hidden; we just evaluated the bad portion of the uncertainty box. Assume now that the originally deterministic parameters are randomly distributed over the given uncertainty set Q. Then it seems natural to sample the uncertainty set Q and arrive at conclusions on the probability of robustness. In the control-related literature, these ideas were first formulated in [74]; also see [3].

Along with the numerous advantages of this approach, it also suffers from serious drawbacks. First, it is usually desirable to have closed-form estimates of the robustness margin, rather than to rely on the results of simulations; moreover, in practical applications, such a simulation is often not possible. Next, the sample size that guarantees high reliability of the result may happen to be rather large [76, 77]; hence, simulations may be very time-consuming. On top of that, sampling in accordance with one or another distribution over a given set may be highly nontrivial [30, 57]. Finally, the results of Monte Carlo simulation heavily depend on the probabilistic distribution adopted and may lead to overly optimistic estimates of the robustness margin; the correct choice of the distribution is a nontrivial problem [2].

In this section, assuming that the uncertain parameter vector q is uniformly distributed over \(\gamma Q\), where \(Q\subset {\mathbb R}^{\ell }\) is the uncertainty set and \(\gamma \in {\mathbb R}\) is a scaling factor, we characterize the probability of stability of a system and evaluate the probabilistic stability margin

$$ \gamma _{\max } \;:=\; \sup \bigl \{ \gamma :\; \mathsf{Prob}\{\text{the system is stable for } q\in \gamma Q\} \,\ge \, p \bigr \}, \quad p\in (0,\,1), $$

where \(\mathsf{Prob}(\cdot )\) denotes the probability of an event.

Without getting deep into the details, we describe the idea of the probabilistic approach to robustness as applied to polynomial families.

Since the early 1990s, numerous graphical tests for robust stability have proved themselves efficient; these are based on the famous zero exclusion principle, which is formulated next. Consider the family of polynomials p(s, q) which depend on the vector q of uncertain parameters confined to the connected set \(Q\subset {\mathbb R}^\ell \). For a fixed \(s=j\omega \), the set

$$ {V}(\omega ) \;\doteq \; \{p(j\omega ,q):\; q\in Q\} $$

is referred to as the value set of the family p(s, q); it is the 2D image of Q under the mapping \(p(j\omega ,\cdot )\). Let the polynomial \(p(s,q^0)\) be stable for some \(q^0\in Q\); then, for robust stability, the following condition is necessary and sufficient:

$$\begin{aligned} 0\; \not \in \; {V}(\omega ) \quad \text{for all } \omega \in [0,\,\infty ). \end{aligned}$$
(2)

To exploit this result, one has to efficiently construct the set \({V}(\omega )\) and check condition (2). This is doable in a number of simple cases; however, for a more involved dependence of p(s, q) on q, this approach cannot be applied, since no closed-form description of the boundary of the value set is available, and checking condition (2) is complicated by the nonconvexity of \({V}(\omega )\).

Taking the probabilistic point of view and letting q be random, uniformly distributed over Q, we consider the two-dimensional random variable

$$ z_\omega \;=\; [\mathrm{Re}\, p(j\omega ,q);\;\;\mathrm{Im}\, p(j\omega , q)] $$

and construct its confidence domain

$$ {V}_{1-\varepsilon }(\omega ):\quad \mathsf{Prob}\bigl \{ z_\omega \in {V}_{1-\varepsilon }(\omega ) \bigr \} \;\ge \; 1-\varepsilon , \qquad \varepsilon >0 \text{ small}. $$

This set is referred to as a \(100(1-\varepsilon )\%\) probabilistic predictor of the value set \({V}(\omega )\). Condition (2) now has to be checked for the predictor rather than for the value set, thus evaluating the probability of stability of the uncertain polynomial family.

Often, the construction of the predictor can be accomplished by using the central limiting behavior of the random vector \(z_\omega \). Indeed, if p(s, q) depends affinely on q, say \(p(s,q)=p_0(s)+\sum _{i=1}^\ell q_ip_i(s)\), and the \(q_i\)s are mutually independent, the random vector \(z_\omega \) is represented by a sum of independent random vectors; if the number \(\ell \) of the parameters is large enough, then, under general assumptions on the \(p_i(s)\), it is well described by the two-dimensional Gaussian random vector with mean \(\overline{z}_\omega =\mathbf{\mathsf E} z_\omega \) and covariance matrix \(S=\mathsf{Cov}\,z_\omega \). Therefore, \({V}(\omega )\) may be approximated by the confidence ellipse

$$ \mathcal{E}_\nu (\omega ) \;\doteq \; \bigl \{ z\in {\mathbb R}^2\!:\; (z-\overline{z}_\omega )^\top S^{-1}(z-\overline{z}_\omega ) \,\le \, \nu ^2 \bigr \}, $$

where \(\nu \) specifies the confidence level. In other words, if \(\mathsf{p}_\nu \) is the associated confidence probability, then for a given \(\omega \) we have

$$ \mathsf{Prob}\bigl \{ p(j\omega ,q)\in \mathcal{E}_\nu (\omega ) \bigr \} \;\approx \; \mathsf{p}_\nu \;=\; 1 - e^{-\nu ^2/2}. $$
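A sketch of an empirical version of this construction is given below: the mean and covariance of \(z_\omega \) are estimated from samples of q, and the empirical coverage of the ellipse \(\mathcal{E}_\nu (\omega )\) is compared with \(\mathsf{p}_\nu \). The polynomial evaluator and the sampler of Q are placeholders to be supplied by the user:

```python
import numpy as np

def predictor_coverage(p_eval, sample_q, omega, nu=3.0, n=10_000, rng=None):
    """Empirical check of the CLT-based predictor E_nu(omega) (a sketch).

    p_eval   : callable (omega, q) -> complex, evaluates p(j*omega, q)
    sample_q : callable (rng, n) -> (n, l) array of uniform samples on Q
    Returns (empirical coverage of the ellipse, theoretical p_nu).
    """
    rng = rng or np.random.default_rng()
    vals = np.array([p_eval(omega, q) for q in sample_q(rng, n)])
    z = np.column_stack([vals.real, vals.imag])        # samples of z_omega
    z_bar, S = z.mean(axis=0), np.cov(z.T)             # mean and covariance
    d2 = np.einsum('ij,jk,ik->i', z - z_bar, np.linalg.inv(S), z - z_bar)
    return float(np.mean(d2 <= nu**2)), 1.0 - np.exp(-nu**2 / 2.0)
```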

In a number of situations, it is possible to obtain a precise nonasymptotic distribution of the random vector \(z_\omega \) and, respectively, a precise description of the probabilistic predictor.

We illustrate these ideas via the problem of robust stability of uncertain delay systems, i.e., those described by uncertain quasipolynomials; see [58]. In this case, the generic value set has a very complicated geometry; application of the zero exclusion principle is hardly possible, and we lean on the probabilistic approach.

Consider the delay system specified by the characteristic quasipolynomial

$$\begin{aligned} h(s,a,\tau ) \;=\; a_0 + a_1s + s^2 + 2se^{-\tau _1s} + e^{-\tau _2s}, \end{aligned}$$
(3)
$$ |a_0| \le \gamma , \quad |a_1| \le \gamma , \quad |1-\tau _1| \le \gamma , \quad |2-\tau _2| \le \gamma . $$

Here, both the coefficients and the delays are subject to interval uncertainty. The nominal system \(h(s)=s^2+2se^{-s}+e^{-2s}\) is stable, \(\max _k \mathrm{Re}\, s_k = -0.3181\), where \(s_k\) are the roots of the quasipolynomial h(s) (these roots are the values of the Lambert function W, defined by \(W(x)e^{W(x)}=x\), at the point \(x=-1\)). For this system, the exact value of the radius of robustness cannot be found, but the estimate \(0.01< \gamma _{\max } < 0.05\) is known from the literature. For the confidence level \(\nu = 3\), the probabilistic approach gives \(\gamma _\nu = 0.0275\), which fits the deterministic estimate well.

Fig. 1 a The plot of \(h(j\omega ,q)\) and confidence ellipses \(\mathcal{E}_\nu (\omega )\), \(\nu =3\), for system (3). b Probabilistic predictor of the value set for \(\omega =1.3113\)

To illustrate, for a set of frequencies in \(0\le \omega \le 2\), Fig. 1a depicts the confidence ellipses \(\mathcal{E}_\nu (\omega )\), \(\nu =3\), for the uncertainty range \(\gamma =0.0275\). Also presented are the frequency responses \(h(j\omega ,q)\) for a number of sampled values of the uncertainty \(q=(a_0,\,a_1,\,\delta \!\tau _1,\,\delta \!\tau _2)\) in the box \(|q_i|\le \gamma \). The curves are seen to remain inside the “corridor” defined by the confidence ellipses. Figure 1b depicts the confidence ellipse \(\mathcal{E}_\nu (\omega )\) for a “typical” \(\omega =1.3113\) together with sampled points \(h(j\omega ,q)\); the predictor is seen to approximate the value set nicely.

Probabilistic robustness techniques can be effectively exploited for robust control design [12, 39, 53, 54, 61, 77, 78].

2.3 Probabilistic Enhancement of Robustness Margins

It is important to note that, even for values of \(\mathsf{p}_\nu =\mathsf{p}\) close to unity, the ellipse \(\mathcal{E}_\nu (\omega )\) is often considerably smaller than the value set \({V}(\omega )\). Let us make use of the probabilistic counterpart of the zero exclusion principle (the origin does not belong to \(\mathcal{E}_\nu (\omega )\) for all \(\omega \)) and evaluate the probabilistic stability margin defined as

$$ \gamma _\mathsf{p} \;\doteq \; \sup \bigl \{\gamma :\; 0\notin \mathcal{E}_\nu (\omega ) \text{ for } \text{ all } \omega \in [0,\,\infty ) \bigr \}. $$

It then usually happens that \(\gamma _\mathsf{p}\gg \gamma _{\max }\), where \(\gamma _{\max }\) is the deterministic stability margin. Hence, the uncertainty range may be considerably enlarged at the expense of neglecting low-probability events. This phenomenon is referred to as probabilistic enhancement of classical robustness margins [40]. Moreover, in accordance with the central limit theorem, this enlargement gets bigger as the number of uncertainties grows, and it is this case which is most problematic for deterministic methods. At the same time, the computational burden of probabilistic methods does not depend on the dimension of the vector of uncertain parameters. Indeed, putting the precise description of the value set aside, we make use of an approximation of it, which is defined by the two-dimensional covariance matrix.

We illustrate the use of the probabilistic approach to the assessment of such an enhancement via the case of matrix uncertainty. Specifically, let us consider the uncertain matrix family

$$\begin{aligned} A \;=\; A_0+\Delta , \qquad \Delta \in \gamma Q, \end{aligned}$$
(4)

where \(A_0\in {\mathbb R}^{n\times n}\) is a known, Hurwitz stable matrix and \(\Delta \) is its bounded perturbation confined to the ball in the Frobenius norm \(\gamma Q=\{\Delta \in {\mathbb R}^{n\times n}:\,\Vert \Delta \Vert _F\le \gamma \}\); the goal is to estimate the robust stability margin of \(A_0\). To this end, we provide an approximate description of the pseudospectrum of A (4), the set of the eigenvalues of A for all admissible values of the uncertainty \(\Delta \).

For a generic case of simple complex eigenvalues \(\lambda =\lambda (A_0)\in {\mathbb C}\), the perturbed eigenvalue \(\lambda (A_0+\Delta )\) is well described by the linear approximation

$$ \tilde{\lambda }\;=\; \lambda + Rq, \quad R\in {\mathbb R}^{2\times \ell },\quad \ell =n^2, $$

provided that \(\gamma \) is small enough. Here, \(q\in {\mathbb R}^{\ell }\) is the vectorization of \(\Delta \), and the matrix R is defined by the left and right eigenvectors of \(\lambda \).

It can be shown that, as q sweeps the ball \(\gamma Q\), the 2D-vector \([\mathrm{Re\,}\tilde{\lambda }, \,\mathrm{Im}\tilde{\lambda }]\) sweeps the ellipse

$$ \mathcal{E} := \bigl \{ x\in {\mathbb R}^2:\; \bigl (S^{-1}(x-\lambda ),\,x-\lambda \bigr )\le \gamma ^2 \bigr \}, \qquad S := RR^\top . $$

Now, assuming that the uncertainty q is random, uniformly distributed over the ball \(\gamma Q\), and specifying a confidence probability \(\mathsf{p}\), we make use of Lemma 2 (see Sect. 5.1) to shape an ellipsoidal probabilistic predictor \(\mathcal{E}_\mathsf{p}\) of the ellipse \(\mathcal{E}\).

A schematic illustration of the ideas above is given next. For a \(6\times 6\) stable matrix having \(\ell =36\) uncertain entries, quite an accurate upper bound \(\overline{\gamma }= 0.3947\) of the stability margin can be found.

Fig. 2 The pseudospectrum of \(A_0\), its linear approximation, and the probabilistic predictor

Let us specify \(\mathsf{p}=0.99\); then the constructions above yield \(\hat{\gamma }_p = 0.7352\) as an estimate of the probabilistic margin. In other words, the uncertainty radius is almost doubled, at the expense of admitting a \(1\%\) probability of instability. To confirm these conclusions, we performed straightforward Monte Carlo simulations for \(\gamma =\hat{\gamma }_p\), which resulted in the sampled probability of stability \(p^{}_{MC}=0.9989\) (from a sample of 40,000 points q). Figure 2 depicts the linear approximation of the pseudospectrum of A (larger ellipses) and its ellipsoidal probabilistic predictors (smaller ellipses, the rightmost of them touching the imaginary axis), along with sampled values of the pseudospectrum.

Other examples relate to the probability that a polynomial with coefficients in a cube is stable [46] and to the generation of random stable polynomials [69].

3 Randomization in Estimation

The usual assumptions on the noise in linear regression problems are that it is a sequence of independent zero-mean random variables (vectors). In practical situations, however, these assumptions are often violated, which may strongly affect the performance of standard estimators. Therefore, it is important to examine the possibility of estimating the regression parameters under minimal assumptions on the noise. It may appear surprising that the regression parameters can be consistently estimated in the case of biased, correlated, and even nonrandom noise; this can be done, under certain conditions, when the inputs (regressors) are random. We consider a linear regression model

$$\begin{aligned} y_n=x_n^\top \theta +\xi _n \end{aligned}$$
(5)

with the parameter vector \(\theta \in {\mathbb R}^N\) to be estimated from the observations \(y_n, x_n\), \(n = 1,2,\dots \). It is assumed that the inputs \(x_n\) are zero-mean random vectors independent of the noise \(\xi _n\). This assumption ensures “good” properties of estimators under extremely mild restrictions on the noise. The idea of using random inputs to eliminate bias was put forward by Fisher [22] as the randomization principle in the design of experiments. Besides settings of design type, where the regressors are randomized by the experimenter, random inputs arise in many applications of identification, filtering, recognition, etc. Having these applications in mind, we use the terms “inputs,” “outputs,” etc., rather than those traditional to regression analysis (say, “regressors”).

We follow the results in [25], see also [27]. Let us formulate the rigorous assumptions on the data for the regression problem (5).

(A) the inputs \(x_n\) are represented by a sequence of independent, identically distributed random vectors with symmetric distribution function, zero mean value \(\mathsf{E}x_n=0\), positive-definite covariance matrix \(\mathsf{E}x_nx_n^\top = B\succ 0\), and a finite fourth moment \(\mathsf{E}\Vert x_n\Vert ^4 < \infty \); moreover, \(x_n\) is independent of \(\{\xi _0, \xi _1, \dots , \xi _n\}\).

(B) the noise \(\xi _n\) is mean-square bounded: \(\mathsf{E}|\xi _n|^2 \le \sigma ^2\).

Theorem 2

Under the assumptions above, the least squares estimate \(\theta _n\) of the true parameter \(\theta \) is mean-square consistent, and the rate of convergence is given by

$$\begin{aligned} \mathsf{E}(\theta _n-\theta )(\theta _n-\theta )^\top =\frac{\sigma ^2}{n}B^{-1}+o\Bigl (\frac{1}{n}\Bigr ). \end{aligned}$$
(6)

If the inputs are deterministic and \(B=\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}x_i x_i^\top \), one can obtain a similar estimate for the least squares algorithm under the standard assumption that the noise is zero-mean, \(\mathsf{E}\xi _n=0\). The principal contribution of Theorem 2 is the removal of this restrictive assumption.
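A quick numerical illustration of Theorem 2 (a sketch with synthetic data): the noise below has a constant bias of 5, yet the least squares estimate still converges to the true \(\theta \), precisely because the inputs are zero-mean and independent of the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 200_000, 3
theta = np.array([1.0, -2.0, 0.5])            # true parameter

x = rng.standard_normal((n, dim))             # zero-mean random inputs (A)
xi = 5.0 + rng.standard_normal(n)             # noise with bias, E xi = 5 (B)
y = x @ theta + xi

theta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)
# theta_ls is close to theta despite the bias; the mean-square error
# decays as O(1/n), in agreement with (6).
```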

A result similar to Theorem 2 holds true for the Polyak–Ruppert online averaging algorithm [64]:

$$\begin{aligned} \theta _n=\theta _{n-1}+\gamma _n x_n(y_n-\theta _{n-1}^\top x_n) \end{aligned}$$
(7)
$$\begin{aligned} \hat{\theta }_n=(1-n^{-1}) \hat{\theta }_{n-1}+n^{-1}\theta _n, \end{aligned}$$
(8)

where

$$\begin{aligned} \gamma _n/\gamma _{n+1}=1+o(\gamma _n); \end{aligned}$$
(9)

for instance, \(\gamma _n=1/n^r\) for some \(0<r<1\). It is proved in [25] that estimate (6) remains true under assumptions (A), (B) for noise with nonzero mean.

The fruitful idea of randomizing the inputs is exploited in numerous problems of identification, control, and optimization in the monographs [28, 29]. These results confirm the general conclusion: Randomization enables a considerable relaxation of the standard assumptions on the noise. In Sect. 5, we focus on such approaches to stochastic optimization problems.

4 Feasibility

The problem of solving convex inequalities (also known as the convex feasibility problem) is one of the basic problems of numerical analysis. It arises in numerous applications, including statistics, parameter estimation, pattern recognition, image restoration, tomography, and many others; see, e.g., the monographs and surveys [7, 15, 17] and references therein. Particular cases of the problem relate to solving linear inequalities and to finding a common point of convex sets. The specific feature of some applications is a huge number of inequalities to be solved, while the dimensionality of the variables is moderate; see, e.g., the examples of applied problems below. Under these circumstances, many known numerical methods are inappropriate. For instance, finding the most violated inequality may be a hard task; dual methods also cannot be applied due to the large number of dual variables.

In this survey we mainly follow the paper [56] and focus on simple iterative methods which are applicable to the case of a very large (and even infinite) number of inequalities. They are based on projection-like algorithms, which originated in the works [1, 31, 36, 44]. There are many versions of such algorithms; they can be either parallel or non-parallel (row-action); in the latter case the order of projections is usually chosen as cyclic or most-violated-first, see [7, 15, 17]. All these methods are well suited for a finite (and not too large) number of constraints. The novelty of the method under consideration is its random nature, which makes it possible to treat large- and infinite-dimensional cases. Although the idea of exploiting stochastic algorithms for optimization problems with continua of constraints has been known in the literature [34, 51, 80], it led to much more complicated calculations than the proposed method. Another feature of the method is its finite termination property: under the strong feasibility assumption, a solution is found after a finite number of steps with probability one. A version of a projection method for linear inequalities with this property was first proposed by V.A. Yakubovich [81]. Below we survey the main results from [56]. Related contributions can be found in [13, 61].

Consider the general convex feasibility problem: find a point x in the set

$$\begin{aligned} C=\{x\in X:\;f(x,q)\le 0\quad \forall \, q \in Q \}. \end{aligned}$$
(10)

Here \(X\subset {\mathbb R}^{n}\) is a convex closed set, f(x, q) is convex in x for all \(q \in Q\), where Q is an arbitrary set of indices (finite or infinite). Note that this formulation is similar to the robust feasibility problem (1) considered above. However, instead of finding its approximate solution or evaluating the volume of violation, we aim at finding a solution satisfying all the inequalities, but using randomized methods.

Particular cases of problem (10) are:

1. Finite number of inequalities: \(Q =\{1,\dots ,m\}\).

2. Semi-infinite problem: \(Q =[0,T]\subset {\mathbb R}^{1}\).

3. Finding a common point of convex sets: \(f(x,q)=\mathrm{dist}(x,C_q)=\min _{y\in C_q}\Vert x-y\Vert \), where the sets \(C_q\subset {\mathbb R}^{n}\), \(q \in Q\), are closed and convex, and \(C=\cap _{q \in Q}C_q\). Here, \(\Vert x\Vert \) denotes the Euclidean norm of a vector.

4. Linear inequalities: \(f(x,q)=a(q)^\top x-b(q).\)

We assume that a subgradient \(\partial _{x}f(x,q)\) is available at any point \(x\in X\) for all \(q \in Q\) (an arbitrary subgradient may be taken if the subdifferential is not a singleton).

The algorithm has the following structure. At the kth iteration, we randomly generate \(q_{k}\in Q\); we assume that the \(q_k\)’s are independent and identically distributed (i.i.d.) samples from some probability distribution \(p_{q}\) on Q. Two key assumptions are adopted.

Assumption 1 (strong feasibility). The set C is nonempty and contains an interior point

$$ \exists \, x^{*}\in C:\;\Vert x-x^{*}\Vert \le r \;\Longrightarrow \; x\in C. $$

Here, \(r>0\) is a constant which is assumed to be known.

Assumption 2 (distinguishability of feasible and infeasible points). For \(x\in X\setminus C\), the probability of generating a violated inequality is not vanishing:

$$ \mathsf{Prob}\{f(x,q )>0\}>0. $$

This is the only assumption on the probability distribution \(p_q\). For instance, if Q is a finite set and each element in Q is generated with nonzero probability, then Assumption 2 holds. The feasibility algorithm is then formulated as follows:

Algorithm 1: Given an initial point \(x_{0}\in X\), proceed as follows:

$$\begin{aligned} x_{k+1}= & {} \pi ^{}_X \bigl ( x_k-\lambda _k\partial _x f(x_k,q_k) \bigr ), \end{aligned}$$
(11)
$$\begin{aligned} \lambda _{k}= & {} \left\{ \begin{array}{cl} \dfrac{f(x_k,q_k)+r\Vert \partial _x f(x_k, q_k)\Vert }{\Vert \partial _x f(x_k, q_k)\Vert ^{2}} &{} \text {if}~~f(x_k,q_k)>0;\\ &{} \\ 0 &{} \text {otherwise.} \end{array} \right. \end{aligned}$$
(12)

Here, \(\pi ^{}_X\) is the projection operator onto X; that is, \(\Vert x-\pi ^{}_{X}(x)\Vert =\mathrm{dist}(x,X)\). Hence, at every step, a subgradient is calculated for just one inequality, randomly chosen among all the inequalities in Q. Note that the value of r (the radius of a ball inside the feasible set) is used in the algorithm; its modification for unknown r will be presented later. To explain the choice of the step-size \(\lambda _k\) in the algorithm, we consider two particular cases.

1. Linear inequalities: \(f(x,q)=a(q)^\top x-b(q)\), \(X={\mathbb R}^n\).

    Then we have \(\partial _x f(x_k,q_k)=a_k\), where \(f(x_k,q_k) = a_k^\top x_k-b_k\) and \(a_k = a(q_k)\), \(b_k = b(q_k)\), so that the algorithm takes the form

    $$ x_{k+1} = x_{k}-\frac{(a_{k}^\top x_{k}-b_{k})_+ + r\Vert a_{k}\Vert }{\Vert a_{k}\Vert ^{2}} a_{k} $$

for \((a_{k}^\top x_{k}-b_{k})_{+}\ne 0\); otherwise \(x_{k+1}=x_{k}\); here, \(c_{+}=\max \{0,c\}\). For \(r=0\), the method coincides with the Agmon–Motzkin–Schoenberg projection method for solving linear inequalities [1, 44].

2. Common point of convex sets: \(f(x,q) = \mathrm{dist}(x,C_q)\), \(C=\cap _{q\in Q}C_q\), \(X = {\mathbb R}^n\).

    Then we have \(\partial _x f(x_k,q_k) = \bigl (x_k - \pi ^{}_k(x_k)\bigr )/\rho _{k}\), where \(\pi ^{}_k\) denotes the projection onto the set \(C_k = C^{}_{q^{}_k}\) and \(\rho _k = \Vert x_k - \pi ^{}_k(x_k)\Vert \). The algorithm takes the form

$$ x_{k+1} = \pi ^{}_k(x_k) + \frac{r}{\rho _{k}}\bigl (\pi ^{}_k(x_{k})-x_{k}\bigr ), $$

    provided that \(x_{k}\notin C_{k}\); otherwise \(x_{k+1} = x_k\). We conclude that, for \(r=0\), each iteration of the algorithm is the same as for the projection method for finding the intersection of convex sets [7, 31].

Having this in mind, the rule for selecting the step-size \(\lambda _k\) has a very natural explanation. Denote by \(y_{k+1}\) the point generated via the same formula as \(x_{k+1}\) but with \(r=0\), and assume \(X={\mathbb R}^{n}\). Then, for the case of linear inequalities, \(y_{k+1}\) is the projection of \(x_k\) onto the half-space \(\left\{ x:\, a_{k}^\top x-b_{k}\le 0\right\} \). Similarly, if we deal with finding a common point of convex sets, \(y_{k+1}\) is the projection of \(x_{k}\) onto the set \(C_{k}\). It is easy to show that \(\Vert x_{k+1}-y_{k+1}\Vert =r\). Thus the step in the algorithm is an (additively) over-relaxed projection; we perform an extra step (of length r) inside the current feasible set.
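For concreteness, a minimal sketch of Algorithm 1 in the linear case \(f(x,q)=a_q^\top x-b_q\), \(X={\mathbb R}^n\), might read as follows (the data A, b and the radius r are assumed given; the loop bound is a practical safeguard, since finite termination is guaranteed only with probability one):

```python
import numpy as np

def random_projection_feasibility(A, b, r, max_iter=100_000, rng=None):
    """Algorithm 1 for f(x, q) = a_q^T x - b_q on X = R^n (a sketch).

    At each step a random row is drawn; if its inequality is violated, x is
    projected onto the corresponding half-space and moved an extra distance
    r inside it (the additive over-relaxation).
    """
    rng = rng or np.random.default_rng()
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(max_iter):
        i = rng.integers(m)                      # random index q_k
        viol = A[i] @ x - b[i]                   # f(x_k, q_k)
        if viol > 0:                             # step (11)-(12)
            x = x - (viol + r * np.linalg.norm(A[i])) / (A[i] @ A[i]) * A[i]
    return x
```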

The idea of additive over-relaxation is due to V.A. Yakubovich, who applied such a method to linear inequalities [81]. In the papers mentioned above, the inequalities were sorted out either in cyclic order or by taking the most violated one, in contrast with the random order in the proposed algorithm.

Now we formulate the main result on the convergence of the algorithm.

Theorem 3

Under Assumptions 1, 2, Algorithm 1 finds a feasible point in a finite number of iterations with probability one, i.e., with probability one there exists N such that \(x_{N}\in C\) and \(x_{k}=x_{N}\) for all \(k\ge N\).

We now illustrate how the general algorithm can be adapted to two particular important cases.

1. Linear Matrix Inequalities are one of the most powerful tools for model formulation in various fields of systems and control; see [10]. There exist well-developed techniques for solving such inequalities as well as optimization problems subject to such inequalities (semidefinite programming, SDP). However, in a number of applications (for instance, in robust stabilization and control), the number of LMIs is extremely large or even infinite, and such problems are beyond the applicability of the standard LMI tools. Let us cast these problems in the framework of the approach proposed above.

The space \({\mathbb S}_m\) of \(m\times m\) symmetric real matrices, equipped with the scalar product \(\langle A,B\rangle = \mathrm{tr}\,AB\) and the Frobenius norm, is a Hilbert space (\(\mathrm{tr}(\cdot )\) denotes the trace of a matrix). Then we can define the projection \(A_{+}\) of a matrix A onto the cone of positive semidefinite matrices. This projection can be found in explicit form. Indeed, if \(A=RDR^\top \), \(R^{-1}=R^\top \), is the eigenvector–eigenvalue decomposition of A and \(D = \mathrm{diag\,}(d_{1},\dots ,d_{m})\), then

$$\begin{aligned} A_{+} = RD_{+}R^\top , \end{aligned}$$
(13)

where \(D_{+} = \mathrm{diag\,}(d_{1}^{+},\dots ,d_{m}^{+})\) and \(d_{i}^{+}=\max \{0,d_{i}\}.\)
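In code, this projection is a few lines of linear algebra; a sketch:

```python
import numpy as np

def psd_projection(A):
    """Projection A_+ of a symmetric matrix onto the PSD cone, as in (13):
    zero out the negative eigenvalues and reassemble."""
    d, R = np.linalg.eigh(A)                 # A = R diag(d) R^T
    return (R * np.maximum(d, 0.0)) @ R.T    # R diag(d_+) R^T
```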

A linear matrix inequality is an expression of the form

$$ A(x) = A_{0}+\sum _{i=1}^{n}x_{i}A_{i} \preccurlyeq 0, $$

where \(A_{i}\in {\mathbb S}_{m}\), \(i=0,1,\dots ,n\), are given matrices and \(x=(x_{1},\dots ,x_{n})\in {\mathbb R}^{n}\) is the vector variable. Another form of LMI was mentioned in Sect. 2; it is reducible to the canonical form above.

The general system of LMIs can be written as

$$\begin{aligned} A(x,q) = A_{0}(q) + \sum _{i=1}^{n}x_{i}A_{i}(q) \preccurlyeq 0\qquad \forall \, q \in Q. \end{aligned}$$
(14)

Here, Q is the set of indices which can be finite or infinite. The problem under consideration is to find an \(x\in {\mathbb R}^{n}\) which satisfies LMIs (14). Our first goal is to convert these LMIs into a system of convex inequalities. For this purpose, introduce the scalar function

$$\begin{aligned} f(x,q)=\Vert A_{+}(x,q)\Vert \end{aligned}$$
(15)

where A(x, q) is given by (14) and \(A_{+}\) is defined in (13).

Lemma 1

The matrix inequalities (14) are equivalent to the scalar inequalities

$$ f(x,q )\le 0\qquad \forall \, q \in Q. $$

The function f(x, q) is convex in x, and its subgradient is given by

$$ \partial _{x}f(x,q)=\frac{1}{f(x,q)}\left( \begin{array}{c} \mathrm{tr\,}A_{1}(q)A_{+}(x,q) \\ \vdots \\ \mathrm{tr\,}A_{n}(q)A_{+}(x,q) \end{array} \right) $$

if \(f(x,q)>0\); otherwise \(\partial _{x}f(x,q)=0\).

Hence, solving linear matrix inequalities can be converted into solving a convex feasibility problem.

2. Solving linear equations. This case has some peculiarities: the solution set is either a single point or an affine subspace, so it never contains an interior point, and Algorithm 1 with \(r>0\) does not converge. However, it can be applied with \(r=0\); with a deterministic cyclic choice of the alternating directions it is precisely the Kaczmarz algorithm [36]. Its randomized version with equal probabilities for all equations was proposed in [56]; it converges at a linear rate. More recently, Strohmer and Vershynin [75] studied this method with the probability of choosing the equation \((a_i,x)=b_i\) proportional to \(\Vert a_i\Vert ^2\). They proved that the rate of convergence depends on the condition number of the matrix A, but not on the number of equations. This result stimulated further research in [15, 16, 20, 26, 41].
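A sketch of the randomized Kaczmarz iteration with the row probabilities of [75] (proportional to \(\Vert a_i\Vert ^2\)) is given below; the system data are assumed given:

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iter=10_000, rng=None):
    """Randomized Kaczmarz for Ax = b (a sketch): project onto a random
    hyperplane a_i^T x = b_i, row i drawn with prob ||a_i||^2 / ||A||_F^2."""
    rng = rng or np.random.default_rng()
    row_norms2 = np.einsum('ij,ij->i', A, A)       # squared row norms
    probs = row_norms2 / row_norms2.sum()
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        i = rng.choice(len(b), p=probs)
        x += (b[i] - A[i] @ x) / row_norms2[i] * A[i]   # orthogonal projection
    return x
```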

5 Optimization

After the invention of the Monte Carlo (MC) paradigm by N. Metropolis and S. Ulam in the late 1940s [43], it has become extremely popular in numerous application areas such as physics, biology, economics, and the social sciences. As far as mathematics is concerned, Monte Carlo methods proved exceptionally efficient in the simulation of various probability distributions, numerical integration, estimation of the mean values of parameters, etc. [37, 67, 77]. A more recent version of the approach, Markov Chain Monte Carlo, is often referred to as the MCMC revolution [23]. The salient feature of the MC approach is that it is “often” dimension-free, in the sense that, given N samples, the accuracy of the result does not depend on the dimension of the problem.

On the other hand, applications of the MC paradigm in the area of optimization are not that successful. In this regard, problems of global optimization deserve special attention. As explained in [82] (see the beginning of Chapter 1.2), “In global optimization, randomness can appear in several ways. The main three are: (i) the evaluations of the objective function are corrupted by random errors; (ii) the points \(x_i\) are chosen on the base of random rules, and (iii) the assumptions about the objective function are probabilistic.” Only case (ii) is pertinent to the exposition of this paper. Monte Carlo is the simplest, brute-force example of randomness-based methods (in [82] it is referred to as “Pure Random Search”). With this method, one samples points uniformly in the feasible domain, computes the values of the objective function, and picks the record value as the output.

Of course, there are dozens of more sophisticated stochastic methods such as multistart, simulated annealing, genetic algorithms, evolutionary algorithms, etc.; e.g., see [24, 35, 52, 70, 72, 82] for an incomplete list of relevant references. However, most of these methods are heuristic in nature; often, they lack rigorous justification, and their computational efficiency is questionable. Moreover, there exist pessimistic results on the “insolvability of global optimization problems.” This phenomenon was first observed as early as in the monograph [47] by A. Nemirovskii and D. Yudin, both in the deterministic and stochastic optimization setups (see the Theorem in Section 1.6 of [47]). Specifically, the authors of [47] considered the minimax approach to the minimization of the class of Lipschitz functions and proved that, no matter what the optimization method is, it is possible to construct a problem which will require an exponential (in the dimension) number of function evaluations. The “same” number of samples is required for the simplest MC method. Similar results can be found in [48], Theorem 1.1.2, where the construction of “bad” problems is exhibited. Below we present another example of such problems (with very simple objective functions, close to linear ones) which are very hard to optimize. Concluding this brief survey, we see that no advanced method of global optimization can outperform Monte Carlo when optimizing “bad” functions.

This explains our interest in the MC approach as applied to the optimization setup. In spite of the pessimistic results above, one might believe that, if Monte Carlo is applied to a “good” optimization problem (e.g., a convex one), the results would not be so disastrous. Our goal in this section is to dispel these optimistic expectations. We examine the “best” optimization problems (the minimization of a linear function on a ball) and estimate the accuracy of the Monte Carlo method. Unfortunately, the dependence on the dimension remains exponential, and the practical solution of even these simplest problems via such an approach is impossible in high dimensions.

The second part of the section is devoted to randomized algorithms for convex optimization. The efficiency of such an approach has been discovered only recently; it has become clear that advanced randomized coordinate descent and similar approaches for distributed optimization are strong competitors to the deterministic versions of the methods.

5.1 Direct Monte Carlo in Optimization

In this subsection we show that the straightforward use of Monte Carlo in optimization, both global and convex, is highly inefficient in problems of high dimension. The material is based on the results in [60].

Global optimization: A pessimistic example. We first present a simple example showing the failure of stochastic global optimization methods in high-dimensional spaces. This example is constructed along the lines suggested in [47] (also see [48], Theorem 1.1.2) and is closely related to one of the central problems discussed below, the minimization of a linear function over a ball in \({\mathbb R}^n\).

Consider an unknown vector \(c\in {\mathbb R}^n\), \(\Vert c\Vert =1\), and the function

$$ f(x)=\min \Bigl \{99-c^\top x, \,\bigl (c^\top x-99\bigr )/398\Bigr \} $$

to be minimized over the Euclidean ball \(Q\subset {\mathbb R}^n\) of radius \(r=100\) and centered at the origin. Obviously, the function has one local minimum \(x_1=-100c\), with the function value \(f_1=-0.5\), and one global minimum \(x^*=100c\), with the function value \(f^*=-1\). The objective function is Lipschitz with Lipschitz constant equal to 1, and \(\max f(x) - \min f(x)=1\).

Any standard (not problem-oriented) version of stochastic global search (such as multistart, simulated annealing, etc.) will miss the domain of attraction of the global minimum with probability \(1-V^1/V^0\), where \(V^0\) is the volume of the ball Q, and \(V^1\) is the volume of the set \(C = \{x\in Q:c^\top x\ge 99\}\). In other words, the probability of success is equal to

$$ \mathsf{Prob}= \frac{V^1}{V^0} = \frac{1}{2}I\Bigl (\dfrac{2rh-h^2}{r^2};\dfrac{n+1}{2},\dfrac{1}{2}\Bigr ), $$

where I(x; a, b) is the regularized incomplete beta function with parameters a and b, and h is the height of the spherical cap C; in this example, \(h=1\). This probability quickly goes to zero as the dimension of the problem grows; say, for \(n=15\), it is of the order of \(10^{-15}\). Hence, any “advanced” method of global optimization will find the minimum with relative error no less than \(50\%\); moreover, such methods are clearly seen to be no better than straightforward Monte Carlo sampling. The same is true if our goal is to estimate the minimal value \(f^*\) of the function (rather than the minimum point \(x^*\)). Various methods based on ordered statistics of sample values (see Section 2.3 in [82]) fail to reach the set C with high probability, so the prediction will be close to \(f_1=-0.5\) instead of \(f^*=-1\).
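The cap probability above is easy to evaluate numerically; a sketch using the regularized incomplete beta function from scipy (note the argument order: \(I(x;a,b)\) is betainc(a, b, x)):

```python
from scipy.special import betainc

def cap_probability(n, r=100.0, h=1.0):
    """Relative volume of the spherical cap C, i.e., the success probability
    0.5 * I((2 r h - h^2) / r^2; (n + 1) / 2, 1 / 2)."""
    return 0.5 * betainc((n + 1) / 2.0, 0.5, (2.0 * r * h - h**2) / r**2)

print(cap_probability(15))    # on the order of 1e-15 for n = 15
```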

Scalar convex optimization: Pessimistic results. Let Q denote the unit Euclidean ball in \({\mathbb R}^n\) and let \(\bigl \{\xi ^{(1)},\dots ,\xi ^{(N)}\bigr \}\) be a multisample of size N from the uniform distribution \(\xi \sim {\mathscr {U}}(Q)\).

Given the scalar-valued linear function

$$\begin{aligned} g(x) = c^\top x, \quad c\in \mathbb {R}^n, \end{aligned}$$
(16)

defined on Q, estimate its maximum value from the multisample.

More specifically, let \(\eta ^*\) be the true maximum of g(x) on Q and let

$$\begin{aligned} \eta = \max \{g^{(1)}, \dots , g^{(N)}\}, \qquad g^{(i)}= g(\xi ^{(i)}),\quad i=1,\dots ,N, \end{aligned}$$
(17)

be the empirical maximum; we say that \(\eta \) approximates \(\eta ^*\) with accuracy at least \(\delta \) if

$$ \frac{\eta ^* - \eta }{\eta ^*} \,\le \, \delta . $$

Then the problem is: Given a probability level \(p\in ]0,\, 1[\) and accuracy \(\delta \in ]0,\,1[\), determine the minimal length \(N_{\min }\) of the multisample such that, with probability at least p, the accuracy of approximation is at least \(\delta \) (i.e., with high probability, the empirical maximum nicely evaluates the true one).

The results presented below are based on the following fact established in [59]; it relates to the probability distribution of a specific quadratic function of the random vector uniformly distributed on the Euclidean ball.

Lemma 2

([59]) Let the random vector \(\xi \in \mathbb {R}^n\) be uniformly distributed on the unit Euclidean ball \(Q\subset \mathbb {R}^n\). Assume that a matrix \(A\in \mathbb {R}^{m\times n}\) has rank \(m\le n\). Then the random variable

$$ \rho = \Bigl ( (AA^\top )^{-1}A\xi ,\,A\xi \Bigr ) $$

has the beta distribution \({\mathscr {B}}(\frac{m}{2},\,\frac{n-m}{2}+1)\) with probability density function

$$\begin{aligned} f_\rho (x) =\left\{ \begin{array}{cl} \displaystyle {\frac{\Gamma (\frac{n}{2}+1)}{\Gamma (\frac{m}{2})\Gamma (\frac{n-m}{2}+1)}\, x^{\frac{m}{2}-1}(1-x)^{\frac{n-m}{2}}} &{} \text{ for } 0 \le x \le 1, \\ \displaystyle {~~~~0} &{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(18)

where \(\Gamma (\cdot )\) is the Euler gamma function.

Alternatively, the numerical coefficient in (18) writes

$$ \frac{\Gamma (\frac{n}{2}+1)}{\Gamma (\frac{m}{2})\Gamma (\frac{n-m}{2}+1)} = 1/B\Bigl (\frac{m}{2},\frac{n-m}{2}+1\Bigr ), $$

where \(B(\cdot ,\cdot )\) is the beta function.

We consider the scalar case (16) and first discuss a qualitative result that follows immediately from Lemma 2. Without loss of generality, let \(c = (1,\, 0,\,\dots ,\, 0)^\top \), so that the function \(g(x)=x_1\) takes its values on the segment \([-1,\, 1]\), and the true maximum of g(x) on Q is equal to 1 (respectively, \(-1\) for the minimum) and is attained at \(x = c\) (respectively, \(x=-c\)). Let us consider the random variable

$$ \rho = g^2(\xi ), $$

which is the squared first component \(\xi _1\) of \(\xi \). By Lemma 2 with \(m=1\) (i.e., \(A = c^\top \)), for the probability density function (pdf) of \(\rho \) we have

$$ f_\rho (x) \,=\, \frac{\Gamma (\frac{n}{2}+1)}{\Gamma (\frac{1}{2})\Gamma (\frac{n+1}{2})}x^{-\frac{1}{2}}(1-x)^{\frac{n-1}{2}} \,:=\, \beta _n\, x^{-\frac{1}{2}}(1-x)^{\frac{n-1}{2}}. $$

Straightforward analysis of this function shows that, as the dimension grows, the mass of the distribution concentrates closer to the origin, meaning that the random variable (r.v.) \(\rho \) is likely to take values far from the maximum, which equals unity.

We next state the following rigorous result [60].

Theorem 4

Let \(\xi \) be a random vector uniformly distributed over the unit Euclidean ball \(Q\subset {\mathbb R}^n\) and let \(g(x)=x_1\), \(x\in Q\). Given \(p\in ]0,\,1[\) and \(\delta \in ]0,\,1[\), the minimal sample size \(N_{\min }\) that guarantees, with probability at least p, for the empirical maximum of g(x) to be at least a \(\delta \)-accurate estimate of the true maximum, is given by

$$\begin{aligned} N_{\min } = \frac{\mathrm{ln}(1-p)}{\mathrm{ln}\Bigl [\frac{1}{2}+\frac{1}{2}I\bigl ((1-\delta )^2; \frac{1}{2},\frac{n+1}{2}\bigr )\Bigr ]}\,, \end{aligned}$$
(19)

where I(x; a, b) is the regularized incomplete beta function with parameters a and b.

Strictly speaking, the correct notation is \(N_{\min } = \lceil \cdot \rceil \), i.e., rounding up to the next integer; we omit it, but it is implied everywhere in the sequel.

Numerical values of the function I(x; a, b) can be computed using the Matlab routine betainc. For example, with the modest values \(n=10\), \(\delta =0.05\), and \(p=0.95\), formula (19) gives \(N_{\min }\approx 8.9\cdot 10^6\), and this quantity grows quickly as the dimension n increases.
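A sketch reproducing this computation in Python (scipy's betainc plays the role of the Matlab routine):

```python
import numpy as np
from scipy.special import betainc

def n_min(n, delta, p):
    """Minimal sample size (19); betainc(a, b, x) is the regularized
    incomplete beta function I(x; a, b)."""
    inner = 0.5 + 0.5 * betainc(0.5, (n + 1) / 2.0, (1.0 - delta)**2)
    return int(np.ceil(np.log(1.0 - p) / np.log(inner)))

print(n_min(10, 0.05, 0.95))   # approx 8.9e6, as quoted in the text
```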

Since we are interested in small values of \(\delta \), i.e., in x close to unity, a “closed-form” lower bound for \(N_{\min }\) can be computed as stated below.

Corollary 1

In the conditions of Theorem 4

$$ N_{\min } > N_\mathrm{appr} = \frac{\mathrm{ln}(1-p)}{\mathrm{ln}\Bigl [ 1- \tfrac{\beta _n}{n+1}\tfrac{1}{1-\delta }\bigl (2\delta - \delta ^2 \bigr )^{(n+1)/2} \Bigr ]}\,, $$

where \(\beta _n = \frac{\Gamma (\frac{n}{2}+1)}{\Gamma (\frac{1}{2})\Gamma (\frac{n+1}{2})} = 1/B(\tfrac{1}{2},\tfrac{n+1}{2})\) .

A further simplification of the lower bound can be obtained:

$$ N_\mathrm{appr} > \widetilde{N}_\mathrm{appr} = -\frac{\mathrm{ln}(1-p)}{\sqrt{2\pi (n+1)}\tfrac{1}{1-\delta }\bigl (2\delta - \delta ^2 \bigr )^{(n+1)/2}}\,. $$

The lower bounds obtained above are quite accurate; for instance, with \(n=10\), \(\delta =0.05\), and \(p=0.95\), we have \(N_{\min }\approx 8.8694\cdot 10^6\), while \(N_\mathrm{appr} \approx 8.7972\cdot 10^6\) and \(\widetilde{N}_\mathrm{appr} = 8.5998\cdot 10^6\).

The moral of this subsection is that, for high dimensions, a straightforward use of Monte Carlo sampling cannot be considered as a tool for finding extreme values of a function, even in the convex case.

5.2 Randomized Methods

On the other hand, exploiting randomized methods in different forms can be highly efficient; in many cases they are strong competitors of deterministic algorithms.

Unconstrained minimization. We start with random search methods for unconstrained minimization

$$ \min f(x), \quad x\in {\mathbb R}^n. $$

The first publications probably date back to the 1960s [42, 65]. The idea was to choose a random direction at the current point and make a step resulting in a decrease of the objective function. Rigorous results on the convergence of some random search algorithms were obtained in [19]. Nevertheless, practical experiments with such methods were mostly disappointing, and they did not attract much attention (excluding global optimization, see above). For convex problems the situation has changed recently, as the dimension of the problems under consideration has become very large (n of the order of \(10^6\)) and distributed optimization problems have arisen (\(f(x)=\sum _{i=1}^{N}f_i(x_i)\), \(x=(x_1,\dots , x_N)\), with N large). We first survey some results in this direction.

The basic algorithm of random search can be written as

$$\begin{aligned} x_{k+1}=x_k - \gamma _k \frac{\hat{f}(x_k+\mu _k u_k)-\hat{f}(x_k)}{\mu _k}u_k, \end{aligned}$$
(20)

where \(x_k\) is the k-th approximation to the solution \(x^*\), \(u_k\) is a random vector, \(\gamma _k, \mu _k\) are step-sizes, and \(\hat{f}(x_k)\) is a measured value of \(f(x_k)\): either \(\hat{f}(x_k)=f(x_k)\) (deterministic setup) or \(\hat{f}(x_k)=f(x_k)+\xi _k\), with \(\xi _k\) a random noise (stochastic optimization). Algorithm (20) requires one evaluation of the objective function per iteration; its symmetric version

$$\begin{aligned} x_{k+1}=x_k - \gamma _k \frac{\hat{f}(x_k+\mu _k u_k)-\hat{f}(x_k-\mu _k u_k)}{2\mu _k}u_k, \end{aligned}$$
(21)

uses two evaluations per iteration. The strategy for choosing the step-sizes depends on the smoothness of f(x) and on the presence of errors \(\xi _k\) in the function evaluations. The following result is an adaptation of more general theorems in [62, 63] for \(C^2\) functions.

Theorem 5

Consider the problem of unconstrained minimization of f(x), where f(x) is strongly convex and twice differentiable, with gradient satisfying the Lipschitz condition. Suppose the \(u_k\) are i.i.d. random vectors uniformly distributed in the cube \(\Vert u\Vert _{\infty }\le 1\). The noises \(\xi _k\) are independent of \(u_1,\dots , u_k\) and have bounded second moment, \(\mathsf{E}|\xi _k|^2\le \sigma ^2\). The step-sizes satisfy \(\gamma _k=a/k\) and \(\mu _k=\mu /k^{1/4}\), with a large enough. Then the iterates \(x_k\) of algorithms (20), (21) converge to the minimum point \(x^*\) in the mean-square sense and

$$ \mathsf{E}\Vert x_k-x^*\Vert ^2=O(1/\sqrt{k}). $$

It is worth mentioning that randomization of the directions \(u_k\) makes it possible to remove the assumption \(\mathsf{E}\xi _k=0\), which is standard in stochastic optimization methods [38]; a similar effect for estimation was exhibited in Theorem 2. Compared with the classical Kiefer–Wolfowitz (KW) method, algorithms (20), (21) are less laborious: they require just one or two function evaluations per iteration versus n or 2n in the KW method. On the other hand, the asymptotic rate of convergence is the same, \(O(1/\sqrt{k})\). More details on the convergence, various forms, and computational experience with such algorithms can be found in the publications of J. Spall (e.g., [73]), who calls the algorithms SPSA (Simultaneous Perturbation Stochastic Approximation). The pioneering research on these algorithms is due to Yu. Ermoliev [21] and H. Kushner [38].
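A minimal sketch of the symmetric scheme (21) with the step-sizes of Theorem 5 is given below; the noisy oracle, including its constant measurement bias, is a synthetic example, and the constants a, \(\mu \) are not tuned:

```python
import numpy as np

def two_point_random_search(f_noisy, x0, a=3.0, mu=0.1, n_iter=50_000,
                            rng=None):
    """Symmetric random search (21): gamma_k = a / k, mu_k = mu / k**0.25."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iter + 1):
        u = rng.uniform(-1.0, 1.0, size=x.shape)    # direction in the cube
        gamma, m = a / k, mu / k**0.25
        g = (f_noisy(x + m * u) - f_noisy(x - m * u)) / (2.0 * m)
        x -= gamma * g * u
    return x

# Synthetic example: the nonzero-mean measurement noise (bias +1.0) is
# tolerated, since a constant bias cancels in the two-point difference.
rng = np.random.default_rng(1)
f = lambda x: 0.5 * np.sum(x**2) + 1.0 + rng.standard_normal()
x_hat = two_point_random_search(f, x0=np.ones(5), rng=rng)
```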

Now we focus on the purely deterministic version of the problem, where measurements of the objective function contain no errors: \(\hat{f}(x_k)=f(x_k)\). As mentioned above, interest in such methods has grown enormously since very high-dimensional problems became topical due to applications such as machine learning and neural networks. This interest was triggered by Yu. Nesterov's paper [49]. Roughly speaking, the approach of [49] is as follows. It is assumed that the Lipschitz constants \(L_i\) of the partial derivatives \(\partial f/\partial x_i\) are known (they are easily estimated for quadratic functions). Then, at the kth iteration, an index \(\alpha \) is chosen with probability proportional to \(L_\alpha \), and the new iterate is obtained by changing coordinate \(\alpha \) with step-size \((1/L_{\alpha })\,\partial f/\partial x_{\alpha }\). Yu. Nesterov provides sharp estimates of the rate of convergence and also presents an accelerated version of the algorithm. These theoretical results, supported by intensive numerical experiments on huge-scale problems, confirm the advantages of random coordinate descent. This line of research has found numerous applications in distributed optimization [9, 45, 66]. The titles of many publications (e.g., the recent [33]) confirm the advantages of randomized algorithms.
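As an illustration, here is a sketch of such a random coordinate descent for the convex quadratic \(f(x)=\frac{1}{2}x^\top Qx-b^\top x\), where the coordinate Lipschitz constants are simply \(L_i=Q_{ii}\) (the problem data are assumed given):

```python
import numpy as np

def random_coordinate_descent(Q, b, n_iter=10_000, rng=None):
    """Random coordinate descent for f(x) = 0.5 x^T Q x - b^T x (a sketch):
    coordinate alpha is drawn with probability proportional to L_alpha,
    then x_alpha is updated with step (1 / L_alpha) * df/dx_alpha."""
    rng = rng or np.random.default_rng()
    L = np.diag(Q).copy()                  # coordinate Lipschitz constants
    probs = L / L.sum()
    x = np.zeros(len(b))
    for _ in range(n_iter):
        alpha = rng.choice(len(b), p=probs)
        x[alpha] -= (Q[alpha] @ x - b[alpha]) / L[alpha]
    return x
```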

Randomization techniques are also helpful for the minimization of nonsmooth convex functions, when the only data available are the values of the function f(x) at an arbitrary point. The idea of the following algorithm is due to A. Gupal [32]; also see [55], Section 6.5.2. In contrast with algorithm (21), we generate a random point \(\tilde{x}_k\) in the neighborhood of the current iterate \(x_k\) and then make a step similar to (21) from this point. Thus the algorithm is written as

$$\begin{aligned} x_{k+1}=x_k - \gamma _k \frac{f(\tilde{x}_k+\mu _k u_k)-f(\tilde{x}_k-\mu _k u_k)}{2\mu _k}u_k, \end{aligned}$$
(22)
$$\begin{aligned} \tilde{x}_k=x_k+\alpha _k h_k \end{aligned}$$
(23)

where \(u_k, h_k\) are independent random vectors uniformly distributed in the cube \(\Vert u\Vert _{\infty }\le 1\), while \(\alpha _k, \gamma _k, \mu _k\) are scalar step-sizes. It can be seen that the randomization step with \(h_k\) is equivalent to smoothing the original function; thus an algorithm similar to (21) is applied to the smoothed function. By adjusting the parameters \(\alpha _k\), \(\gamma _k\), \(\mu _k\), we arrive at the following convergence result.

Theorem 6

Let f(x) be convex, and let a unique minimum point \(x^*\) exist. Let the step-sizes satisfy the conditions

$$ \sum _{k=1}^{\infty }\gamma _k=\infty ,\quad \sum _{k=1}^{\infty }\gamma _k^2<\infty ,\quad \gamma _k/\alpha _k\rightarrow 0,\quad \mu _k/\alpha _k\rightarrow 0,\quad \alpha _k\rightarrow 0,\quad |\alpha _k-\alpha _{k+1}|/\gamma _k \rightarrow 0. $$

Then \(x_k\rightarrow x^*\) with probability one.

This result guarantees convergence of the algorithm to the minimum point. However, it provides neither effective strategies for choosing the parameters nor estimates of the rate of convergence. These problems are investigated in depth in [50]. The authors apply a Gaussian smoothing technique (i.e., the vectors \(u_k\) are Gaussian) and present randomized methods for various classes of functions (smooth and nonsmooth) and different situations (gradient or gradient-free oracles). The versions of the algorithms with the best rate of convergence are indicated.
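For illustration, here is a minimal sketch of iterations (22), (23) on a nonsmooth convex test function; the function \(f(x)=\Vert x\Vert _1\) and the particular step-size sequences (one admissible choice among many satisfying the conditions of Theorem 6) are hypothetical ingredients added for the example.

```python
import numpy as np

def gupal_step(f, x, alpha, gamma, mu, rng):
    """One iteration of (22), (23): a smoothing point x~_k = x_k + alpha_k h_k,
    then a random-direction difference step taken from x~_k."""
    h = rng.uniform(-1.0, 1.0, size=x.shape)    # smoothing direction h_k
    u = rng.uniform(-1.0, 1.0, size=x.shape)    # difference direction u_k
    xt = x + alpha * h                          # (23)
    g = (f(xt + mu * u) - f(xt - mu * u)) / (2.0 * mu)
    return x - gamma * g * u                    # (22)

rng = np.random.default_rng(2)
f = lambda x: np.abs(x).sum()                   # nonsmooth convex, x* = 0
x = 2.0 * np.ones(5)
for k in range(1, 100001):
    # One admissible choice of step-sizes for Theorem 6:
    # gamma_k = 1/k, alpha_k = k^(-1/4), mu_k = k^(-1/2).
    x = gupal_step(f, x, k ** -0.25, 1.0 / k, k ** -0.5, rng)
print(np.abs(x).max())                          # should be close to 0
```

One can check directly that these sequences satisfy all six conditions of the theorem; for instance, \(\gamma _k/\alpha _k=k^{-3/4}\rightarrow 0\) and \(|\alpha _k-\alpha _{k+1}|/\gamma _k=O(k^{-1/4})\rightarrow 0\).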

To conclude, we recall that there exist no zero-order deterministic methods for the minimization of nondifferentiable convex functions, so randomized methods provide the only option.

Constrained minimization. There are various problem formulations related to randomized methods for optimization in the presence of constraints.

One of them is closely related to the feasibility problem (10), but now we seek a feasible point that minimizes an objective function:

$$\begin{aligned} \min (c,x)\quad f(x,q)\le 0\quad \forall \, q \in Q. \end{aligned}$$
(24)

Here we have taken the objective function to be linear without loss of generality. The constraint functions \(f(x,q)\) are convex in the variable \(x\in {\mathbb R}^n\) for all values of the parameter q. Numerous examples of constraints of this form were discussed in Sect. 4. Such problems are closely related to robust optimization, see [8] and Sect. 2. A randomized approach to the problem consists in a random choice of N parameters \(q_1, \dots , q_N\) from the set Q and solving the convex optimization problem with a finite number of constraints

$$\begin{aligned} \min _{x\in C} (c,x)\quad f(x,q_i )\le 0\quad i=1, \dots , N. \end{aligned}$$
(25)

We suppose that this problem can be solved with high accuracy (e.g., if the \(f(x,q)\) are linear in x, then (25) is an LP), and denote the solution by \(x_N\). Such an approach was proposed in [11]; the authors answer the following question: how many samples N need to be drawn in order to guarantee that the resulting randomized solution violates only a small portion of the constraints? They assume that there is some probability measure on Q, which defines the probability of constraint violation V(x) for an arbitrary x. The main result in [11] states:

Theorem 7

\(\mathsf{E\,} V(x_N)\le \dfrac{n}{N+1}\,. \)

Of course, this result says nothing about the accuracy of the randomized solution (i.e., how close \(x_N\) is to the true solution \(x^*\) or how small \((c,x_N - x^*)\) is). However, it provides much useful information. Some related results can be found in Sect. 2 above.
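To make the scenario approach concrete, the following sketch solves an instance of (25) and then estimates the violation probability empirically; the choice of Q (the unit sphere), the linear constraint function \(f(x,q)=(q,x)-1\), and the box bounds are hypothetical ingredients added for the example.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, N = 5, 500
c = np.ones(n)

# Sample q_1, ..., q_N uniformly from Q (here: the unit sphere).
Q_samples = rng.standard_normal((N, n))
Q_samples /= np.linalg.norm(Q_samples, axis=1, keepdims=True)

# Problem (25): min (c, x)  s.t.  (q_i, x) <= 1, i = 1..N  -- an LP here.
# Box bounds keep the LP bounded for small N (an extra assumption).
res = linprog(c, A_ub=Q_samples, b_ub=np.ones(N), bounds=[(-10, 10)] * n)
x_N = res.x

# Empirical violation probability V(x_N) on fresh samples; by Theorem 7
# its expectation is at most n/(N + 1), roughly 0.01 here.
Q_test = rng.standard_normal((100000, n))
Q_test /= np.linalg.norm(Q_test, axis=1, keepdims=True)
print((Q_test @ x_N > 1.0).mean())
```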

Another type of constrained optimization problems reads as

$$\begin{aligned} \min (c,x), \quad x\in Q, \end{aligned}$$
(26)

where \(Q\subset {\mathbb R}^n\) is a closed bounded set (convex or nonconvex) for which the problem above is hard to solve explicitly and projection onto Q is unavailable. Then a possible option is to sample random points in Q and take the best point, i.e., the one with the minimal value of the objective function. This is exactly the “direct Monte Carlo” approach that we considered in Sect. 2 and found to be inefficient. However, another approach, based on cutting plane ideas, might be more promising. We assume that a so-called boundary oracle is available, that is, for \(x\in Q\) and \(y\in {\mathbb R}^n\), the quantities

$$ \underline{\lambda }=\max \{ \lambda \ge 0:\; x-\lambda y\in Q\}, \qquad \overline{\lambda }=\max \{ \lambda \ge 0:\; x+\lambda y\in Q\}, $$

can be found efficiently. Numerous examples of sets with known boundary oracles are given in [30, 68, 71]. Then, starting from some known \(x_0\in Q\), we sample in Q using the technique described below.

Hit-and-Run algorithm (HR). For \(x_k\in Q\), take a direction vector y uniformly distributed on the unit sphere; the oracle returns \(\underline{x}_k = x_k-\underline{\lambda } y\) and \(\overline{x}_k = x_k+\overline{\lambda } y\). Then draw \(x_{k+1}\) uniformly distributed on the segment \([\underline{x}_k,\, \overline{x}_k]\), and repeat. Schematically, this algorithm is illustrated in Fig. 3.

Fig. 3 The idea of the HR algorithm
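A minimal sketch of one HR step is given below; the Euclidean unit ball \(Q=\{x:\Vert x\Vert \le 1\}\) and its boundary oracle (which amounts to solving a quadratic equation in \(\lambda \)) are hypothetical choices made for the example.

```python
import numpy as np

def hit_and_run_step(x, boundary_oracle, rng):
    """One HR step: a direction y uniform on the unit sphere; the boundary
    oracle returns the extreme steps; the next point is uniform on the chord."""
    y = rng.standard_normal(x.shape)
    y /= np.linalg.norm(y)
    lam_minus, lam_plus = boundary_oracle(x, y)
    return x + rng.uniform(-lam_minus, lam_plus) * y

def ball_oracle(x, y):
    """Boundary oracle for the unit ball: the largest lambda >= 0 with
    x -/+ lambda*y in Q, found from ||x + t*y||^2 = 1 (y is a unit vector)."""
    b, s = x @ y, x @ x - 1.0          # s <= 0 for feasible x
    d = np.sqrt(b * b - s)
    return b + d, -b + d               # (lambda_minus, lambda_plus)

rng = np.random.default_rng(5)
x = np.zeros(3)
for _ in range(1000):                  # the law of x_k approaches
    x = hit_and_run_step(x, ball_oracle, rng)   # the uniform law on Q
```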

This technique was proposed in [71, 79]; under mild assumptions on Q, the distribution of the random point \(x_k\) was proved to approach the uniform distribution on Q. Instead of the “direct Monte Carlo,” we now apply a randomized cutting plane algorithm following the ideas of [18, 57].

A cutting plane algorithm. Start with \(X_0=Q\). For \(X_k\), generate 3N points \(x_i\), \(\underline{x}_i\), \(\overline{x}_i\), \(i=1, \dots , N\), by the HR algorithm, and find \(f_k = \min (c,x)\), where the minimum is taken over these 3N points. Then pass to the new set \(X_{k+1}=X_k\bigcap \{x:\; (c,x)\le f_k\}\) with the initial point \(x_0=\arg \min (c,x)\), the minimum again being taken over the same 3N points; a sketch of this loop is given below.
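The following sketch assembles this loop for the same hypothetical set Q (the unit ball), folding the current cut \((c,x)\le f_k\) into the boundary oracle so that each \(X_k\) itself admits HR sampling; all data and constants are illustrative.

```python
import numpy as np

def ball_oracle(x, y):                 # as in the HR sketch above
    b, s = x @ y, x @ x - 1.0
    d = np.sqrt(b * b - s)
    return b + d, -b + d

def cut_oracle(x, y, c, level):
    """Boundary oracle for X_k = {unit ball} intersected with (c, x) <= level."""
    lm, lp = ball_oracle(x, y)
    cy = c @ y
    slack = level - c @ x              # nonnegative inside the cut
    if cy > 0:
        lp = min(lp, slack / cy)       # +y direction hits the cut plane
    elif cy < 0:
        lm = min(lm, -slack / cy)      # -y direction hits the cut plane
    return lm, lp

rng = np.random.default_rng(4)
n, N = 5, 100
c = np.ones(n) / np.sqrt(n)            # min of (c, x) over Q is -1
x, level = np.zeros(n), np.inf

for outer in range(20):
    pts = []
    for _ in range(N):                 # 3N points per outer iteration
        y = rng.standard_normal(n)
        y /= np.linalg.norm(y)
        lm, lp = cut_oracle(x, y, c, level)
        pts += [x - lm * y, x + lp * y]          # the two boundary points
        x = x + rng.uniform(-lm, lp) * y         # HR step inside X_k
        pts.append(x)
    vals = [c @ p for p in pts]
    level = min(vals)                  # new cut level f_k
    x = pts[int(np.argmin(vals))]      # restart HR from the best point
print(level)                           # should slowly approach -1
```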

Rigorous results on the rate of convergence of such an algorithm are lacking. For its idealized analog (with the points x “truly” uniformly distributed in \(X_k\)), results on convergence can be found in [18, 57]. Moreover, the algorithm presented above includes the boundary points \(\underline{x}_i\), \(\overline{x}_i\); this essentially improves the convergence, since the minimum in the original problem (26) is attained at a boundary point. Numerical experiments in [18, 57] confirm good convergence provided the set Q is not too “flat.”

6 Conclusions

We have covered in this chapter several topics (robustness, estimation, control, feasibility, constrained and unconstrained optimization) where the ideas of randomization can be applied and, moreover, can provide better results than deterministic methods. We have seen that the situation with regard to the effectiveness of randomized methods is not completely clarified; e.g., some straightforward attempts to apply Monte Carlo to optimization do not work in high dimensions. On the other hand, the only approach to the minimization of nonsmooth convex functions with a zero-order oracle (i.e., when only function values are available) is based on randomization. We hope that this survey will stimulate further interest in this exciting field of research.