
1 Introduction

There are three main roots to the present work. The first is a quite general formulation of a stochastic optimization algorithm given in [SW81], studied from a different perspective in [SP99] and [PS00], then corrected and slightly generalized in [Esq06], with further developments and extensions in [dC12]. The subject of stochastic optimization – in the perspective adopted in this work – became stabilized with the books [Spa03, Zab03] and the work [Spa04]. Interest in the development of stochastic optimization methods continued, for instance in [RS03]. We also refer to a very effective and general approach to a substantial variety of stochastic optimization problems, known as the cross-entropy method, proposed in a unified way in [RK04], further explained in [dBKMR05], with convergence proved in [CJK07], and extended in [RK08].

The second root originated in [SB98]; it is detailed in Sect. 4 for the reader’s convenience and may be broadly described as a form of conditioning of the results of any algorithm for global optimization: conditioning in the sense that the algorithm must gather enough information in order to produce significant results.

This leads to the third root, namely the formalization of the concept of information conveyed by a random variable as the \(\sigma \)-algebra generated by this random variable. This formalization admits many extensions and uses (see [Vid18] for a recent and thorough account). We may first refer to the introduction of a convergence definition for \(\sigma \)-algebras – the so-called strong convergence, related to the conditional expectation – by Neveu in [Nev65] (or the French version [Nev64]), with a very deep study in [Kud74] and further developments in [Pic98] and [Art01]. Then [Boy71] introduced a different convergence – the Hausdorff convergence – with an important observation in [Nev72] and further analysis in [Rog74] and [VZ93]. In the study of convergence of \(\sigma \)-algebras (or fields) there were many noticeable advances – useful in our perspective – by Cotter in [Cot86] and [Cot87], extended in [ALR03], detailed in [Bar04], with further extensions in [Kom08].

In the perspective of further developments, we mention [Wan01] and [Yin99], two works concerning the rates of convergence of stochastic algorithms, which allow the determination of adequate and effective stopping rules, and also [dC11] – and references therein – for a method to obtain confidence intervals for stochastic optima.

2 Some Random Search Algorithms

We will now develop the following general idea: a convergent stochastic search algorithm for global optimization of a real valued function f defined on a domain \(\mathcal {D}\) may be seen simply as a sequence of random variables \(\mathbbm {Y}=(Y_n)_{n\ge 1}\) such that the sequence \((f(Y_n))_{n\ge 1}\) converges (almost surely or in probability) to a random variable which gives a good estimate of \(\min _{x \in \mathcal {D}} f(x)\). This sequence of random variables gives information about f on \(\mathcal {D}\). A natural question is how to compare quantitatively the information brought by two different algorithms.

We now describe three algorithms which we will discuss in the following. Important issues for discussion are the convergence of the algorithm and, in case of convergence, the rate of convergence of the algorithm. Let \((\varOmega , \mathcal {F}, \mathbbm {P})\) be a complete probability space.

2.1 The Pure Random Search Algorithm

For the general problem of minimizing \( f:\mathcal {D} \subseteq \mathbbm {R}^n \mapsto \mathbbm {R}\) over \(\mathcal {D}\), a bounded Borel set of \(\mathbbm {R}^n\), we consider the following natural algorithm.

  • S.1 Select a point \(x_1\) at random in \(\mathcal {D}\). Do \(y_1:=x_1\).

  • S.2 Choose a point \(x_2\) at random in \(\mathcal {D}\). Do: \(y_2:=x_2\) if \(f(x_2) \le f(y_1)\); otherwise \(y_2:=y_1\).

  • S.3 Repeat S.2.

To this algorithm corresponds the probabilistic translation given in the following.

Sp.1:

Let \(X_1, X_2, \dots , X_n, \dots \) be independent random variables with common distribution over \(\mathcal {D}\) verifying furthermore, with \(\mathcal {B}(\mathcal {D})\) the Borel \(\sigma \)-algebra of \(\mathcal {D}\):

$$\begin{aligned} \forall B \in \mathcal {B}(\mathcal {D}) \; \; \; \lambda (B)>0 \Rightarrow \mathbbm {P}[X_1 \in B]>0 . \end{aligned}$$
(1)
Sp.2:

\(Y_1:=X_1\)

Sp.3:

\(Y_{n+1}:=X_{n+1}\) if \(f(X_{n+1}) \le f(Y_n)\); otherwise \(Y_{n+1}:=Y_n\).

Having no prior information on the location of the minimum set, and for random variables having a common distribution on a bounded Borel set, a natural choice for the distribution of the random variables \(X_j\) is the uniform distribution. A non-uniform distribution places more mass on some particular subdomain, which may entail a loss of efficiency if the minimizer set is not contained in the more heavily charged subdomain.
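The recursion Sp.1–Sp.3 can be sketched in a few lines of code. The following is a minimal illustration on a one-dimensional interval; the objective function and the domain bounds are illustrative choices, not taken from the text.

```python
import random

def pure_random_search(f, low, high, n_iter, seed=None):
    """Sketch of Sp.1-Sp.3 on D = [low, high]: X_1, X_2, ... are i.i.d.
    uniform on D and Y_{n+1} := X_{n+1} if f(X_{n+1}) <= f(Y_n),
    otherwise Y_{n+1} := Y_n (the kept-best update)."""
    rng = random.Random(seed)
    y = rng.uniform(low, high)        # Sp.2: Y_1 := X_1
    for _ in range(n_iter - 1):       # Sp.3: iterate the kept-best update
        x = rng.uniform(low, high)
        if f(x) <= f(y):
            y = x
    return y

# Illustrative objective with minimizer x = 2 on D = [0, 5].
best = pure_random_search(lambda x: (x - 2.0) ** 2, 0.0, 5.0, 20000, seed=0)
```

With enough draws, the kept-best point lands arbitrarily close to the minimizer, in accordance with the convergence results of Sect. 3.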

Remark 1

(The laws of the random variables of the pure random search algorithm). We observe that for \(n>1\) there is an expression that allows us to describe the law of \(Y_n\). Let D be in the Borel \(\sigma \)-algebra of \(\mathcal {D}\) and suppose that the random variables \((X_n)_{n \ge 1}\) are uniformly distributed in \(\mathcal {D}\). We have the following disjoint union:

$$\begin{aligned} \begin{aligned}&\{Y_n \in D\}\\&= \bigcup _{k=1}^n \left( \{X_k \in D \} \cap \bigcap _{1 \le j<k} \{ f(X_k) \le f(X_j )\} \cap \bigcap _{k< j \le n} \{ f(X_k) < f(X_j) \} \right) , \end{aligned} \end{aligned}$$

which entails, representing by \( \lambda \) the Lebesgue measure over \(\mathcal {D}\), that (see the Appendix, page 17, for the complete deduction):

$$\begin{aligned} \begin{aligned}&\mathbbm {P}[Y_n \in D] \\&= \sum _{k=1}^n \left( \frac{1}{\lambda (\mathcal {D})^n} \int _{D} \lambda (f^{-1}([f(x_k), + \infty [ ))^{k-1} \lambda (f^{-1}(]f(x_k), + \infty [ ))^{n-k} d\lambda (x_k) \right) , \end{aligned} \end{aligned}$$
(2)

by the Fubini theorem and by the fact that \((X_n)_{n \ge 1}\) is a sequence of independent uniformly distributed random variables on \(\mathcal {D}\). If we suppose furthermore that for every \(x \in \mathcal {D}\) we have \(\lambda (f^{-1}(\{f(x)\})) =0\), then:

$$ \mathbbm {P}[Y_n \in D] =\frac{n}{\lambda (\mathcal {D})^n} \int _{D} \lambda (f^{-1}([f(x), + \infty [ ))^{n-1} d\lambda (x) , $$

which gives us the density of \(Y_n\) with respect to the Lebesgue measure.
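The density formula can be checked empirically in a case where everything is explicit. Assuming, for illustration, \(f(x)=x\) on \(\mathcal {D}=[0,1]\), the density becomes \(n(1-x)^{n-1}\), whose integral over \([0,t]\) is \(1-(1-t)^n\), and \(Y_n\) is simply the minimum of n i.i.d. uniforms; a Monte Carlo comparison:

```python
import random

def sample_y_n(n, rng):
    """One draw of Y_n for pure random search with f(x) = x on D = [0, 1]:
    the kept-best point is the minimum of n i.i.d. uniforms."""
    return min(rng.random() for _ in range(n))

def cdf_y_n(t, n):
    """P[Y_n <= t] obtained by integrating the density
    n * lambda(f^{-1}([f(x), +oo[))^{n-1} = n * (1 - x)^{n-1} on [0, t],
    which gives 1 - (1 - t)^n."""
    return 1.0 - (1.0 - t) ** n

rng = random.Random(1)
n, trials, t = 5, 200_000, 0.2
empirical = sum(sample_y_n(n, rng) <= t for _ in range(trials)) / trials
exact = cdf_y_n(t, n)   # 1 - 0.8**5 = 0.67232
```

The empirical frequency agrees with the closed-form law to within Monte Carlo error, as expected.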

2.2 The Random Search Algorithm on (Nearly) Unbounded Domains

In the context of simple random search we may ask: what is the natural substitute for the uniform distribution on an unbounded domain? A variant of the algorithm we now present was introduced in [Esq06] with the purpose of performing global optimization in unbounded domains. For bounded but large domains one may consider an algorithm using, for instance, a Gaussian distribution.

  • S.1 Select a point x at random in \(\mathcal {D}\subset \mathbbm {R}\). Do \(z:=x\).

  • S.2 Choose a point x at random in \(\mathcal {D}\). Choose a point y with distribution \(\mathcal {N}(x,\sigma )\) where, for instance, \(\sigma :=\text{ diam }(\mathcal {D})/10\). Do: \(z:=y\) if \(f(y) \le f(z)\).

  • S.3 Repeat S.2.

The probabilistic recursive translation of this algorithm is the following.

pS.1:

Let \(X_1, X_2, \dots , X_n, \dots \) be independent random variables with common uniform distribution over \(\mathcal {D}\).

pS.2:

\(Z_1:=X_1\)

pS.3:

Let \(Y_1, Y_2, \dots , Y_n, \dots \) be a sequence of independent random variables such that \(Y_{n} \frown \mathcal {N}(X_n, \sigma )\).

pS.4:

\(Z_{n+1}:=Y_n\) if \(f(Y_n) \le f(Z_n)\); otherwise \(Z_{n+1}:=Z_n\).
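A sketch of this Gaussian-proposal variant follows. The kept-best update in pS.4 is our reading of the scheme (keep the better of the current point and the Gaussian proposal); the objective and bounds are illustrative.

```python
import random

def gaussian_random_search(f, low, high, n_iter, seed=None):
    """Sketch of the Sect. 2.2 variant on D = [low, high]: X_n uniform
    on D, Y_n ~ N(X_n, sigma) with sigma = diam(D) / 10, and Z_n keeps
    the best evaluated proposal.  The update rule is an assumption
    (kept-best), not quoted from the original algorithm."""
    rng = random.Random(seed)
    sigma = (high - low) / 10.0
    z = rng.uniform(low, high)           # pS.2: Z_1 := X_1
    for _ in range(n_iter - 1):
        x = rng.uniform(low, high)       # pS.1: uniform anchor point
        y = rng.gauss(x, sigma)          # pS.3: Gaussian proposal around x
        if f(y) <= f(z):                 # pS.4: keep the better point
            z = y
    return z

# Illustrative objective with minimizer x = -1 on D = [-5, 5].
best = gaussian_random_search(lambda x: (x + 1.0) ** 2, -5.0, 5.0,
                              20000, seed=0)
```

Note that the Gaussian proposals may fall outside the nominal bounds, which is precisely the point of the variant for (nearly) unbounded domains.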

2.3 The Zig-Zag Algorithm

The zig-zag algorithm was introduced in [MPB99] (see also [PM10] for other references and a convergence proof) in order to optimize a quadratic function in two sets of multidimensional variables. The main idea of this algorithm may be simply described. In the first step we optimize in one of the sets of variables, leaving the variables of the other set unchanged. In the second step, the first set of variables remains fixed at the optimum value determined in the first step and an optimization is performed in the second set of variables. In the third step, it is the second set of variables that remains fixed at the optimum determined in the second step while an optimization is executed in the first set of variables. For the general case, the convergence and – if applicable – the rate of convergence were open problems, as far as we know.

One of the possibilities opened by this algorithm is to perform the optimization in sets of strictly smaller linear dimension than the dimension of \(\mathcal {D}\). Suppose that \(\mathcal {D} \subseteq \mathbbm {R}^2\) is bounded.

  • S.1 Select a point x at random in \(\mathcal {D}\). Do \(z:=x\).

  • S.2 (Optimization along a lower dimensional subset of the domain)

    S.2.1:

    Choose a point y at random in \(\mathcal {D}\).

    S.2.2:

    Choose, at random, points \(\lambda _1, \dots , \lambda _N \in \mathbbm {R}\) such that \(\lambda _j z+ (1-\lambda _j)y \in \mathcal {D}\) and define x to be such that \(f(x) =\min _{1 \le j \le N} f(\lambda _j z+ (1-\lambda _j)y)\). Do: \(z:=x\) if \(f(x) \le f(z)\).

  • S.3 Repeat S.2

For this algorithm, one probabilistic recursive translation may be the following.

pS.1:

Let \(Y_1, Y_2, \dots , Y_n, \dots \) be a sequence of independent random variables with common uniform distribution over \(\mathcal {D}\).

pS.2:

\(Z_1:=Y_1\)

pS.3:

For each \(n \ge 1\), let \(\lambda ^n_1, \dots , \lambda ^n_N\) be independent random variables, independent across n, with uniform distribution in an interval [a, b] such that:

$$ \forall \lambda \in [a,b] \; \; \forall x,y \in \mathcal {D} \; \; \; \lambda x + (1-\lambda ) y \in \mathcal {D} \;, $$

which is possible as \(\mathcal {D}\) is bounded.

pS.4:

Define the random variable \(X_n^{j_0}\) such that:

$$ f(X_n^{j_0})=\min _{1\le j \le N} f(\lambda ^n_j Z_n + (1-\lambda ^n_j ) Y_{n+1} ) $$
pS.5:

\(Z_{n+1}:=X_n^{j_0}\) if \(f(X_n^{j_0}) \le f(Z_n)\); otherwise \(Z_{n+1}:=Z_n\).

The main idea of the zig-zag algorithm may, of course, be exploited in several other ways.
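One such exploitation of the idea can be sketched as a random line-search on a convex domain. The sketch below takes \(\mathcal {D}=[0,1]^2\) with \([a,b]=[0,1]\) (so the segment points stay in \(\mathcal {D}\)) and a kept-best update; it is an illustration of the scheme, not the original [MPB99] algorithm.

```python
import random

def zigzag(f, n_outer, n_line, seed=None):
    """Zig-zag sketch on D = [0, 1]^2: at each outer step draw Y uniformly
    in D, sample n_line points on the segment lam*Z + (1-lam)*Y with
    lam uniform in [0, 1], and keep the best evaluated point.
    An illustrative sketch under stated assumptions."""
    rng = random.Random(seed)
    z = (rng.random(), rng.random())          # pS.2: Z_1 uniform in D
    for _ in range(n_outer):
        y = (rng.random(), rng.random())      # pS.1: fresh uniform point
        for _ in range(n_line):               # pS.3/pS.4: line search
            lam = rng.random()
            x = (lam * z[0] + (1 - lam) * y[0],
                 lam * z[1] + (1 - lam) * y[1])
            if f(x) <= f(z):                  # pS.5: keep the best point
                z = x
    return z

# Illustrative objective with minimizer (0.5, 0.25).
f = lambda p: (p[0] - 0.5) ** 2 + (p[1] - 0.25) ** 2
best = zigzag(f, 2000, 10, seed=0)
```

The line search explores a one-dimensional subset of the domain at each step, as intended, while the outer loop renews the direction.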

3 The Solis and Wets Approach to Random Algorithms

We present this approach following [Esq06] – which follows the context and formalism of [SW81] – and observe that it may be applied to the algorithms described above.

3.1 The Convergence Results

We introduce some definitions which are necessary for the presentation of the convergence results. Let \(f:\mathcal {D} \subset \mathbbm {R}^n \longrightarrow \mathbbm {R}\) be a measurable function defined on a domain \(\mathcal {D}\) that can be unbounded. Let \((\varOmega , \mathcal {F}, \mathbbm {P})\) be a complete probability space. In order to deal with discontinuous or unbounded functions, we need the following notion.

Definition 1

(Essential infimum of f in \(\mathcal {D}\)).

$$\begin{aligned} \alpha :=\inf \{ t \in \mathbbm {R}: \lambda (\{ x \in \mathcal {D}: f(x) <t\})>0 \} \end{aligned}$$
(3)

with \(\lambda \) being the Lebesgue measure on \( \mathbbm {R}^n\).
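The essential infimum can differ from the pointwise infimum when the small values of f are attained only on a Lebesgue-null set; a quick Monte Carlo illustration (the function below is an illustrative choice):

```python
import random

def min_over_samples(f, sampler, n=100_000, seed=0):
    """Running minimum of f over uniform draws.  Values attained only on
    Lebesgue-null sets are almost surely never sampled, so this estimates
    the essential infimum alpha of Definition 1, not the pointwise
    infimum.  Monte Carlo proxy, for illustration only."""
    rng = random.Random(seed)
    return min(f(sampler(rng)) for _ in range(n))

# f = 1 on [0, 1] except f(0) = 0: inf f = 0, yet {f < t} is
# Lebesgue-null for every t <= 1, so the essential infimum is alpha = 1.
f = lambda x: 0.0 if x == 0.0 else 1.0
alpha_hat = min_over_samples(f, lambda rng: rng.random())
```

Every random search driven by an absolutely continuous law targets \(\alpha \), which is why Definition 1, rather than the pointwise infimum, appears in the convergence statements.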

The formulation of general hypotheses on the function f needed to obtain the convergence of the algorithm requires the definition, for \(\epsilon >0\) and \(M<0\), of the following sets.

Definition 2

(Level set of f of height \(\epsilon \) over \(\alpha \) ).

$$\begin{aligned} E_{\alpha +\epsilon , M}:={\left\{ \begin{array}{ll} \{ x \in \mathcal {D}: f(x)<\alpha +\epsilon \} &{} \text{ if } \alpha \in \mathbbm {R} \\ \{ x \in \mathcal {D}: f(x) <M \} &{} \text{ if } \alpha =-\infty \end{array}\right. } \end{aligned}$$
(4)

The general form of the algorithm may be decomposed into its nuclear part – a function \(\psi \) verifying a monotonicity condition – and the iterative procedure.

Definition 3

(The algorithm).

  • A function \(\psi : \mathcal {D} \times \mathbbm {R}^n \mapsto \mathcal {D} \subset \mathbbm {R}^n\) such that the following hypothesis [H1] is verified.

    $$\begin{aligned}{}[H1]: {\left\{ \begin{array}{ll} \forall t,x \;\; f(\psi (t,x)) \le f(t) &{} \\ \forall x \in \mathcal {D} \;\; f(\psi (t,x)) \le f(x) \end{array}\right. } \end{aligned}$$
    (5)
  • A sequence of random variables given by:

    $$\begin{aligned} {\left\{ \begin{array}{ll} Y_1=X_1 &{} \\ Y_{n+1}=\psi (Y_n,X_n) &{} \text{ for } n\ge 1 \end{array}\right. } \end{aligned}$$
    (6)

    where \(X_n \frown \mathbbm {P}_n\) satisfies the hypothesis in Formula (1), \(\mathbbm {P}_n\) being a probability measure – the law of \(X_n\) – that may depend on \(\mathbbm {P}_1, \dots , \mathbbm {P}_{n-1}\) in the case of adaptive random search.

Remark 2

(Examples of stochastic algorithms for global optimization). The pure random search algorithm in Sect. 2.1, the random search on nearly unbounded domains in Sect. 2.2 and the zig-zag algorithm in Sect. 2.3 may be considered as instances of the Solis and Wets approach. As presented, the function \(\psi (t,x):=x\) if \(f(x) \le f(t)\) and \(\psi (t,x):=t\) otherwise describes the algorithms and verifies the hypothesis H1 in Formula (5).
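The generic scheme of Definition 3 can be sketched directly: a core map psi satisfying [H1] and the recursion of Formula (6). The keep-the-better-point psi below is the natural choice covering the three algorithms above; the objective and sampling law are illustrative.

```python
import random

def psi(f, t, x):
    """Core map of Definition 3, keep-the-better-point version:
    psi(t, x) = x if f(x) <= f(t), else t.  Both lines of [H1] hold:
    f(psi(t, x)) <= f(t) and, for x in D, f(psi(t, x)) <= f(x)."""
    return x if f(x) <= f(t) else t

def solis_wets(f, sample, n_iter, seed=None):
    """Generic scheme of Formula (6): Y_1 = X_1 and
    Y_{n+1} = psi(Y_n, X_n), with X_n drawn by `sample`
    (the laws P_n may be adaptive)."""
    rng = random.Random(seed)
    y = sample(rng)
    for _ in range(n_iter - 1):
        y = psi(f, y, sample(rng))
    return y

f = lambda x: abs(x - 3.0)    # illustrative objective, minimizer x = 3
best = solis_wets(f, lambda rng: rng.uniform(0.0, 10.0), 20000, seed=0)
```

Swapping `sample` for a different (possibly adaptive) law changes the algorithm without touching the scheme, which is the point of the decomposition.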

The following result ensures the convergence of the algorithm under very general hypotheses.

Theorem 1

(A Solis and Wets’ type theorem for random search algorithm convergence). Suppose that f is measurable and bounded from below. Let \(\alpha \) be the essential infimum of f in \(\mathcal {D}\).

\(H2(\epsilon )\):

For pure random search this hypothesis is defined for every \(\epsilon >0\) as:

$$ \lim _{k \rightarrow + \infty } \prod _{1\le j \le k} \mathbbm {P}[X_j \in E^{c}_{\alpha + \epsilon , M}]= \lim _{k \rightarrow + \infty } \prod _{1\le j \le k} \mathbbm {P}_j [ E^{c}_{\alpha + \epsilon , M}]= 0 \;. $$
\(H^{\prime }2(\epsilon )\):

For adaptive search this hypothesis is defined for every \(\epsilon >0\) as:

$$\begin{aligned} \lim _{k \rightarrow + \infty } \inf _{1\le j \le k} \mathbbm {P}[X_j \in E^{c}_{\alpha + \epsilon , M}]= \lim _{k \rightarrow + \infty } \inf _{1\le j \le k} \mathbbm {P}_j[ E^{c}_{\alpha + \epsilon , M}]=0 \;. \end{aligned}$$
(7)

If for some \(\epsilon >0 \) hypothesis \(H2(\epsilon )\) (in the case of pure random search) or \(H^{\prime }2(\epsilon )\) (in the case of adaptive search) is verified, then:

$$\begin{aligned} \lim _{n \rightarrow + \infty } \mathbbm {P}[Y_n \in E_{\alpha + \epsilon , M}]=1 \;. \end{aligned}$$
(8)

If for every \(\epsilon >0 \) hypothesis \(H2(\epsilon )\) (in the case of pure random search) or \(H^{\prime }2(\epsilon )\) (in the case of adaptive search) is verified, then the sequence \((f(Y_n))_{n\ge 1}\) converges almost surely to a random variable \(\text {Min}_{f,\mathbbm {Y}}\) such that \(\mathbbm {P}[\text {Min}_{f,\mathbbm {Y}} \le \alpha ]=1\).

Proof

A first fundamental observation is that if \(Y_n \in E_{\alpha +\epsilon , M}\) or \(X_n \in E_{\alpha +\epsilon , M}\) then, by hypothesis H1, we have \(Y_{n+1} \in E_{\alpha +\epsilon , M}\) and, since \((f(Y_n))_{n\ge 1}\) is non-increasing, \(Y_{n+k} \in E_{\alpha +\epsilon , M}\) for every \(k\ge 1\). As a consequence, for \(k>1\):

$$ \{ Y_k \in E^{c}_{\alpha + \epsilon , M} \} \subseteq \{ Y_1, \dots , Y_{k-1} \in E^{c}_{\alpha + \epsilon , M} \} \cap \{ X_1, \dots , X_{k-1} \in E^{c}_{\alpha + \epsilon , M} \}. $$

since otherwise we would contradict our first observation. Now, it is clear that:

$$\begin{aligned} \begin{aligned} \mathbbm {P}[Y_k \in E^{c}_{\alpha + \epsilon , M} ]&\le \mathbbm {P}\left[ \bigcap _{1 \le j \le k-1} \{ Y_j \in E^{c}_{\alpha + \epsilon , M} \} \cap \{ X_j \in E^{c}_{\alpha + \epsilon , M} \} \right] \\&\le \mathbbm {P}\left[ \bigcap _{1 \le j \le k-1} \{ X_j \in E^{c}_{\alpha + \epsilon , M} \} \right] . \end{aligned} \end{aligned}$$
(9)

In the pure random search scenario, \((X_n)_{n \ge 1}\) is a sequence of i.i.d. random variables and so:

$$\begin{aligned} \mathbbm {P}\left[ \bigcap _{1 \le j \le k-1} \{ X_j \in E^{c}_{\alpha + \epsilon , M} \} \right] = \prod _{1 \le j \le k-1} \mathbbm {P}\left[ X_j \in E^{c}_{\alpha + \epsilon , M} \right] = \mathbbm {P}\left[ X_1 \in E^{c}_{\alpha + \epsilon , M} \right] ^{k-1}, \end{aligned}$$
(10)

implying that

$$ 1\ge \mathbbm {P}[Y_k \in E_{\alpha + \epsilon , M} ]=1-\mathbbm {P}[Y_k \in E^{c}_{\alpha + \epsilon , M} ] \ge 1-\mathbbm {P}\left[ X_1 \in E^{c}_{\alpha + \epsilon , M} \right] ^{k-1}. $$

Now, by the hypothesis in Formula (1), we have \(\mathbbm {P}\left[ X_1 \in E^{c}_{\alpha + \epsilon , M} \right] <1\), and the conclusion in Formula (8) of the theorem follows. In the alternative scenario of adaptive random search we reach the same conclusion, now based on the estimate:

$$ \mathbbm {P}\left[ \bigcap _{1 \le j \le k-1} \{ X_j \in E^{c}_{\alpha + \epsilon , M} \} \right] \le \inf _{1 \le j \le k-1} \mathbbm {P}\left[ X_j \in E^{c}_{\alpha + \epsilon , M} \right] , $$

instead of the estimate in Formula (10) used in the pure random search case. For the second conclusion, observe that the sequence \((f(Y_n))_{n \ge 1}\), being almost surely non-increasing – as a consequence of hypothesis H1 – and bounded below, is almost surely convergent to a random variable that we denote by \(\text {Min}_{f,\mathbbm {Y}}\). Now, observing that for all \(\epsilon >0\):

$$ \lim _{k \rightarrow + \infty } \mathbbm {P}[Y_k \in E_{\alpha + \epsilon , M} ]= \lim _{k \rightarrow + \infty } \mathbbm {P}[ f(Y_k) <\alpha + \epsilon ]=1, $$

in either pure random search or adaptive search, the conclusion follows by a standard argument (see Corollary 1. and Lemma 1. in [Esq06, p. 844]).

Remark 3

Having in mind a characterization of the speed of convergence of the algorithm, it may be useful to observe that the following condition \(H^{\prime \prime }2(\epsilon )\) also entails the conclusion of the theorem, although it is more stringent than \(H^{\prime }2(\epsilon )\).

$$\begin{aligned} \lim _{k \rightarrow + \infty } \max _{1\le j \le k} \mathbbm {P}[X_j \in E^{c}_{\alpha + \epsilon , M}]= \lim _{k \rightarrow + \infty } \max _{1\le j \le k} \mathbbm {P}_j[ E^{c}_{\alpha + \epsilon , M}]=0 . \end{aligned}$$
(11)

In order to improve Theorem 1 some additional hypotheses are needed. For instance, if the minimizer is not unique then the sequence \((Y_n)_{n \ge 1}\) may not converge. First, let us observe that if the minimizer of f is unique and f is continuous, then the essential infimum of f coincides with the minimum of f.

Proposition 1

Let f be continuous, admitting a unique minimizer \(z \in \mathcal {D}\), that is, such that \(f(z)=\min _{x \in \mathcal {D}}f(x)\). Then \(\alpha =\min _{x \in \mathcal {D}}f(x)=:m\).

Proof

Let \(\epsilon >0\) be given. There exists \(x_\epsilon \in \mathcal {D}\) such that \(m=f(z)< f(x_\epsilon ) < m+\epsilon \): take \(x_\epsilon \ne z\) close enough to z, using the continuity of f at z and the uniqueness of the minimizer. By continuity there is an open neighborhood V of \(x_\epsilon \) such that for all \(x \in V\) we still have \(m< f(x) < m+\epsilon \). As a consequence:

$$ \lambda (\{ x : f(x) < m + \epsilon \} ) \ge \lambda (V) >0, $$

and so \(\alpha \le m + \epsilon \); as \(\epsilon \) is arbitrary, we have \(\alpha \le m\). Consider again a given \(\epsilon >0\). By the definition of \(\alpha \) there exists \(t_\epsilon \) with \( \alpha< t_\epsilon < \alpha +\epsilon \) such that:

$$ \lambda \left( \{ x \in \mathcal {D}: f(x) < t_\epsilon \} \right) >0, $$

and \( m = \min _{x \in \mathcal {D}}f(x) < \alpha +\epsilon \). As \(\epsilon \) is arbitrary we have \(m \le \alpha \) and finally the conclusion stated.

Theorem 2

Assume the notations and the set of hypotheses of Theorem 1, namely that for every \(\epsilon >0 \) hypothesis \(H2(\epsilon )\) (in the case of pure random search) or \(H^{\prime }2(\epsilon )\) (in the case of adaptive search) is verified. Suppose, furthermore, that f is continuous and admits a unique minimizer \(z \in \mathcal {D}\). Then, almost surely, \( \lim _{n \rightarrow + \infty } f(Y_n)=f(z) \). If, furthermore, \(\mathcal {D}\) is compact then \( \lim _{n \rightarrow + \infty } Y_n=z. \)

Proof

Let us first show that the sequence \((f(Y_n))_{n \ge 1}\) converges in probability to f(z). Consider \(\epsilon >0\). As f(z) is now the essential infimum of f in \(\mathcal {D}\) we have that:

$$\begin{aligned} \mid f(Y_n) -f(z) \mid \ge \epsilon \Leftrightarrow {\left\{ \begin{array}{ll} f(Y_n) \le f(z) - \epsilon &{} \text{ impossible } \\ f(Y_n) \ge f(z) + \epsilon , &{} \end{array}\right. } \end{aligned}$$

the possible case meaning that \(Y_n \in E^c_{f(z)+\epsilon }\). Now, by an argument similar to the one used in the proof of Theorem 1, we have \(X_1, \dots , X_{n-1} \in E^c_{f(z)+\epsilon }\) and so, under each of the alternative hypotheses:

$$\begin{aligned} \mathbbm {P}\left[ \mid f(Y_n) -f(z) \mid \ge \epsilon \right] \le {\left\{ \begin{array}{ll} \mathbbm {P} \left[ X_1 \in E^c_{f(z)+\epsilon } \right] ^{n-1} &{} \text{ under } H2(\epsilon )\\ \inf _{1 \le j \le n-1} \mathbbm {P} \left[ X_j \in E^c_{f(z)+\epsilon } \right] &{} \text{ under } H^{\prime }2(\epsilon ) , \end{array}\right. } \end{aligned}$$
(12)

thus ensuring that \(\lim _{n \rightarrow + \infty } \mathbbm {P}[ \mid f(Y_n) -f(z) \mid \ge \epsilon ] =0\). If \(H2(\epsilon )\) (or \(H^{\prime }2(\epsilon )\)) is verified for all \(\epsilon >0\), the convergence in probability follows immediately. Finally, by a standard argument, the almost sure convergence of the sequence \((f(Y_n))_{n \ge 1}\) follows because this sequence is non-increasing and convergent in probability. Let us now suppose that \(\mathcal {D}\) is compact and that the sequence \((Y_n)_{n \ge 1}\) does not converge to z almost surely. Then for every \(\omega \) in a set \(\varOmega ^{\prime } \subset \varOmega \) of positive probability:

$$\begin{aligned} \exists \epsilon>0 \; \; \forall n \in \mathbbm {N} \; \; \exists N_n>n \; \; \; \mid Y_{N_n}(\omega ) -z \mid > \epsilon . \end{aligned}$$
(13)

Now for all \(\omega \in \varOmega ^{\prime }\) the sequence \((Y_n(\omega ))_{n \ge 1}\) is a sequence of points in the compact set \(\mathcal {D}\) and, by the Bolzano–Weierstrass theorem, there is a convergent subsequence \((Y_{n_{k}}(\omega ))_{k \ge 1}\) of \((Y_n(\omega ))_{n \ge 1}\). This subsequence must converge to z: if the limit were y then, by the continuity of f, the sequence \((f(Y_{n_{k}}(\omega )))_{k \ge 1}\) would converge to f(y); since \((f(Y_n))_{n \ge 1}\) converges almost surely to f(z), we have \(f(y)=f(z)\) and, as z is the unique minimizer of f in \(\mathcal {D}\), \(y=z\). Finally, observe that the subsequence \((Y_{n_{k}}(\omega ))_{k \ge 1}\) also verifies the condition expressed in Formula (13) for k large enough, which yields the desired contradiction.

3.2 A Preliminary Observation on the Rate of Convergence

Results on the rate of convergence may be used to determine a stopping criterion for the algorithm. As a proxy for the speed of convergence of the algorithms in the context of the proofs of Theorems 1 and 2 – namely, for instance, Formula (12) – we may consider the quantity \(\mathbbm {P}[X_j \in E^c_{\alpha +\epsilon ,M}]\) for various choices of distributions. In the case of pure random search we obviously have:

$$ \mathbbm {P}\left[ X_j \in E^c_{\alpha +\epsilon ,M} \right] =\frac{\lambda (E^c_{\alpha +\epsilon ,M})}{\lambda (\mathcal {D})} \;. $$
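Combined with the geometric bound of Formula (10), this volume ratio yields a simple stopping rule for pure random search: with \(p=\lambda (E^c_{\alpha +\epsilon ,M})/\lambda (\mathcal {D})\), one needs \(p^{k-1}\le \delta \) iterations to reach the level set with probability at least \(1-\delta \). A sketch (the numeric values are illustrative):

```python
import math

def prs_iterations(p, delta):
    """Smallest k with p**(k-1) <= delta: by the geometric bound of
    Formula (10), after k pure random search iterations
    P[Y_k not in E_{alpha+eps,M}] <= p**(k-1) <= delta.
    Requires 0 < p < 1 and 0 < delta < 1."""
    return 1 + math.ceil(math.log(delta) / math.log(p))

# If the eps-level set fills 1% of D (p = 0.99), reaching it with
# probability at least 0.999 takes on the order of 700 draws.
k = prs_iterations(0.99, 0.001)
```

The logarithmic dependence on \(\delta \) and the \(1/\log (1/p)\) blow-up as the level set shrinks make explicit why pure random search degrades quickly on small target sets.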

In the case of random search on a nearly unbounded domain (with the notations of Sect. 2.2), since:

$$ \mathbbm {P} \left[ Y_j \in E^c_{\alpha +\epsilon ,M} \mid X_j=x \right] =\int _{E^c_{\alpha +\epsilon ,M} } \frac{e^{-\frac{(x-u)^2}{2 \sigma ^2}}}{\sqrt{2 \pi \sigma }} du , $$

it follows, taking expectations over \(X_j\) (uniform on \(\mathcal {D}\)), that:

$$ \mathbbm {P}\left[ Y_j \in E^c_{\alpha +\epsilon ,M} \right] = \mathbbm {E} \left[ \int _{E^c_{\alpha +\epsilon ,M} } \frac{e^{-\frac{(X_j-u)^2}{2 \sigma ^2}}}{\sqrt{2 \pi \sigma }} du \right] = \int _{\mathcal {D}} \int _{E^c_{\alpha +\epsilon ,M} } \frac{e^{-\frac{(x-u)^2}{2 \sigma ^2}}}{\sqrt{2 \pi \sigma }} du \frac{dx}{\lambda (\mathcal {D})} $$

where the integral on the right doesn’t seem easily estimable, in general. Suppose for simplification that \(\mathcal {D}=[-A,+A]\) and that \(E^c_{\alpha +\epsilon ,M} \subseteq [-a,+a]\) where \(0< a \ll 1 \ll A\). Then, by Fubini theorem,

$$\begin{aligned} \begin{aligned} \int _{\mathcal {D}} \int _{E^c_{\alpha +\epsilon ,M} } \frac{e^{-\frac{(x-u)^2}{2 \sigma ^2}}}{\sqrt{2 \pi \sigma }} du \frac{dx}{\lambda (\mathcal {D})}&\approx \int _{-\infty }^{+ \infty } \int _{E^c_{\alpha +\epsilon ,M} } \frac{e^{-\frac{(x-u)^2}{2 \sigma ^2}}}{\sqrt{2 \pi \sigma }} du \frac{dx}{\lambda (\mathcal {D})} \\&= \frac{\lambda (E^c_{\alpha +\epsilon ,M})}{\lambda (\mathcal {D})}, \end{aligned} \end{aligned}$$

allowing the conclusion that also \(\mathbbm {P}[Y_j \in E^c_{\alpha +\epsilon ,M}] \approx \lambda (E^c_{\alpha +\epsilon ,M})/\lambda (\mathcal {D})\) thus showing that the two algorithms, in the special situation assumed for simplification, are comparable in a first approximation.
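The approximation can be checked numerically in the one-dimensional setting above. The sketch below takes the illustrative values \(A=10\), \(a=0.1\), \(\sigma =\text{diam}(\mathcal {D})/10=2\), interprets \(\mathcal {N}(x,\sigma )\) with \(\sigma \) as the standard deviation, and integrates the exact Gaussian probability of hitting \([-a,a]\) by the midpoint rule:

```python
import math

def phi(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def p_hit(A, a, sigma, steps=20_000):
    """P[Y_j in E^c] for D = [-A, A], E^c = [-a, a], X_j uniform on D and
    Y_j ~ N(X_j, sigma): midpoint-rule value of
    (1 / 2A) * int_{-A}^{A} [Phi((a-x)/sigma) - Phi((-a-x)/sigma)] dx,
    a numerical sketch of the double integral in the text."""
    h = 2.0 * A / steps
    total = 0.0
    for i in range(steps):
        x = -A + (i + 0.5) * h
        total += phi((a - x) / sigma) - phi((-a - x) / sigma)
    return total * h / (2.0 * A)

# With a << 1 << A, the value should be close to
# lambda(E^c) / lambda(D) = 2a / 2A = 0.01.
p = p_hit(A=10.0, a=0.1, sigma=2.0)
ratio = p / 0.01   # near 1 if the approximation holds
```

The computed ratio is indeed very close to 1, confirming the first-approximation comparability of the two algorithms in this special situation.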

4 On the Information Content of a Stochastic Algorithm

It is natural to conceive that, in order for an algorithm to achieve global stochastic optimization of a function over a domain, the algorithm has to collect complete – in some sense – information on the function over the domain. In [SB98] there are some very striking precise results on this idea. Let us detail Stephens and Baritompa’s result. Consider a random algorithm described by a sequence of random variables \(X^f_1, \dots , X^f_n, \dots \) for some function f on a domain \(\mathcal {D}\). The closure \(\overline{\mathbf {X}^f}\) of the set \(\{ X^f_1, \dots , X^f_n, \dots \}\) is a random set in \(\mathcal {D}\).

Theorem 3

(Global optimization requires global information). For any \(r \,\in \, ]0,1[\), the following are equivalent:

  1.

    The probability that the algorithm locates the global minimizers of f as points of \(\overline{\mathbf {X}^f}\) is greater than or equal to r, for any f in a sufficiently rich class of functions.

  2.

    The probability that \(x \in \overline{\mathbf {X}^f}\) is greater than or equal to r, for any \(x \in \mathcal {D}\) and f in a sufficiently rich class of functions.

That is, roughly speaking, an algorithm works on any rich class of functions if and only if we have \(\mathbbm {P}[\overline{\mathbf {X}^f}=\mathcal {D}]=1\). In the case of deterministic search the result is as expected, namely that the algorithm sees – in an intuitive yet precise sense – the global optimum for a class of functions in a domain if and only if the closure of the set of finite testing sequences, for any function, is dense in the domain. The extension of this result to the stochastic case gives the necessary and sufficient condition, in Theorem 3, that the lower bound of the probability of a stochastic algorithm seeing the global optimum is the same as the lower bound of the probability of an arbitrary point of the domain belonging to the closure of the (random) set of finite testing sequences.

Having in mind the study of the limitations of an effective global optimization stochastic algorithm we address the problem of studying the information content of an algorithm. We recall that – as in Theorem 1 – a random algorithm may be identified with a sequence of random variables. The flow of information gained through a sequential observation of the sequence of random variables is usually described by the natural filtration associated with the sequence. In order to compare, in the information sense, two sequences of random variables we need to compare the associated natural filtrations.

In Theorem 5 below, by resorting to a naturally defined notion of the information content of a stochastic algorithm, we obtain the result that two convergent algorithms have the same information content if the information generated by their respective minimizing random variables is the whole available information in the probability space. So, the connection between the function and the stochastic set-up used to generate stochastic algorithms for its global optimization – namely, the probability space and the probability laws of the algorithm – deserves to be further investigated.

In the following Sect. 4.1 we briefly recall results from [Cot86, Cot87, ALR03, Kud74, Bar04] on the set of complete sub \(\sigma \)-algebras of \(\mathcal {F}\) as a topological metric space.

4.1 The Cotter Metric Space of the Complete \(\sigma \)-algebras

Recall that all random variables are defined on a complete probability space \((\varOmega , \mathcal {F}, \mathbbm {P})\). We now consider \(\mathfrak {F}^{\star }\), the set of all \(\sigma \)-algebras \(\mathcal {G} \subseteq \mathcal {F}\) which are complete with respect to \(\mathbbm {P}\).

Remark 4

We may define an equivalence relation \(\mathcal {R}\) on \(\mathfrak {F}^{\star }\) by considering an equivalence relation \(\thicksim \) for sets in \(\mathcal {F}\) defined for all \( G, H \in \mathcal {F}\) by:

$$\begin{aligned} G \thicksim H\Leftrightarrow \mathbbm {P}[(G \setminus H) \cup (H \setminus G) ] =0. \end{aligned}$$
(14)

Thus the quotient class \(\mathfrak {F}:= \mathfrak {F}^{\star } / \mathcal {R} \) is the class of all sub-\(\sigma \)-algebras of \( \mathcal {F}\), with elements identified up to sets of probability zero.

Strong convergence in \(L^1( \varOmega , \mathcal {F}, \mathbbm {P}) \) – and also in \(L^p( \varOmega , \mathcal {F}, \mathbbm {P}) \) for \(p\in [1, + \infty [\) – of a sequence \((\mathcal {G}_n)_{n \ge 1}\) of \(\sigma \)-algebras to \(\mathcal {G}_{\infty }\) was introduced by Neveu (see [Nev64, pp. 117–118]) with the condition that:

$$\begin{aligned} \forall X \in L^1( \varOmega , \mathcal {F}, \mathbbm {P})\;\; \lim _{n \rightarrow + \infty } \left\Vert \mathbbm {E}[X \mid \mathcal {G}_n] -\mathbbm {E}[X \mid \mathcal {G}_{\infty } ] \right\Vert _{L^1( \varOmega , \mathcal {F}, \mathbbm {P})}=0, \end{aligned}$$
(15)

noticing that for the sequence \((\mathcal {G}_n)_{n \ge 1}\) to converge it suffices that for all \(F \in \mathcal {F}\) the sequence \((\mathbbm {E}[\mathbbm {1}_F \mid \mathcal {G}_n])_{n \ge 1}\) converges in probability. Cotter later showed that this notion of convergence defines a topology which is metrizable (see [Cot87]). The Cotter distance \(d_c\) is defined on \(\mathfrak {F} \times \mathfrak {F}\) by:

$$\begin{aligned} \begin{aligned} d_c(\mathcal {H},\mathcal {G})&=\sum _{i=1}^{+\infty } \frac{1}{2^i} \min \left( \mathbbm {E} \left[ \left| \mathbbm {E}[X_i \mid \mathcal {H}] -\mathbbm {E}[X_i \mid \mathcal {G}] \right| \right] , 1 \right) \\&= \sum _{i=1}^{+\infty } \frac{1}{2^i} \min \left( \left\Vert \mathbbm {E}[X_i \mid \mathcal {H}] -\mathbbm {E}[X_i \mid \mathcal {G}] \right\Vert _1, 1 \right) . \end{aligned} \end{aligned}$$
(16)

with \(\mathcal {H},\mathcal {G}\in \mathfrak {F}\), \(\left\Vert X \right\Vert _1\) the \(L^1( \varOmega , \mathcal {F}, \mathbbm {P}) \) norm of X, and \((X_i)_{i \in \mathbbm {N}}\) a denumerable dense set in \(L^1( \varOmega , \mathcal {F}, \mathbbm {P}) \). The space \((\mathfrak {F}, d_c)\) is a complete metric space.
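On a finite probability space the distance in Formula (16) becomes fully computable, which makes for a concrete illustration. The sketch below uses four equiprobable atoms, \(\sigma \)-algebras given by partitions, and a fixed finite family of indicator test variables in place of a dense sequence in \(L^1\) (so it is a truncated, illustrative version of \(d_c\), not the full metric):

```python
OMEGA = [0, 1, 2, 3]          # four equiprobable atoms, P({w}) = 1/4
P = 0.25

def cond_exp(x, partition):
    """E[X | G] for G generated by a partition of OMEGA: on each block,
    X is replaced by its block average (uniform probability)."""
    out = [0.0] * len(OMEGA)
    for block in partition:
        avg = sum(x[w] for w in block) / len(block)
        for w in block:
            out[w] = avg
    return out

def cotter_dist(p1, p2, tests):
    """Truncated version of Formula (16): sum over a fixed finite family
    of test random variables, weighted by 1/2^i, of the capped L^1 norm
    of the difference of conditional expectations."""
    d = 0.0
    for i, x in enumerate(tests, start=1):
        e1, e2 = cond_exp(x, p1), cond_exp(x, p2)
        l1 = sum(P * abs(u - v) for u, v in zip(e1, e2))
        d += min(l1, 1.0) / 2 ** i
    return d

# Nested sigma-algebras: G1 trivial, G2 coarse, G3 generated by the atoms.
G1 = [[0, 1, 2, 3]]
G2 = [[0, 1], [2, 3]]
G3 = [[0], [1], [2], [3]]
tests = [[1.0 if w == k else 0.0 for w in OMEGA] for k in OMEGA]
d23, d13 = cotter_dist(G2, G3, tests), cotter_dist(G1, G3, tests)
```

On this example the computed values satisfy \(d_c(\mathcal {G}_2,\mathcal {G}_3) \le 2\, d_c(\mathcal {G}_1,\mathcal {G}_3)\), consistent with the inequality quoted below as Proposition 2.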

We will need a consequence of the definition of the Cotter distance (see Corollary III.35, in [Bar04, p. 36]) that we quote for the reader’s convenience.

Proposition 2

Consider \(\mathcal {G}_1 \subset \mathcal {G}_2 \subset \mathcal {G}_3\) in \(\mathfrak {F}\). Then we have that:

$$ d_c\left( \mathcal {G}_2 , \mathcal {G}_3\right) \le 2 d_c\left( \mathcal {G}_1 , \mathcal {G}_3\right) . $$

We will also need a remarkable result of Cotter (see Corollary 2.2 and Corollary 2.4 in [Cot87, p. 42]) that we formulate next.

Theorem 4

Let \(\mathcal {L}_\mathrm{{P}}\) be the metric space of the real valued random variables defined on the probability space \((\varOmega , \mathcal {F}, \mathbbm {P})\) with the metric of convergence in probability. Let \(\sigma : \mathcal {L}_\mathrm{{P}} \mapsto \mathfrak {F}\) be the map that to each random variable X associates \(\sigma (X)=\{ X^{-1}(B): B \in \mathcal {B}(\mathbbm {R})\}\), the \(\sigma \)-algebra generated by X. Then, considering the metric space \(( \mathfrak {F}, d_c)\) with \(d_c\) the Cotter distance defined in Formula (16), we have that \(\sigma \) is continuous at \(X \in \mathcal {L}_\mathrm{{P}}\) if and only if \(\sigma (X)=\mathcal {F}\).

This result on the continuity of the map \(\sigma \) between metric spaces \(\mathcal {L}_\mathrm{{P}}\) and \(( \mathfrak {F}, d_c)\) will be applied to convergent sequences.
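A standard example (ours, not taken from [Cot87]) shows how continuity fails when \(\sigma (X) \ne \mathcal {F}\). Let \(X\) be a non-degenerate random variable and set \(X_n := X/n\). Then \(X_n \rightarrow 0\) in probability, while for all \(n \ge 1\),

$$ \sigma (X_n) = \sigma (X) \ne \{ \emptyset , \varOmega \} = \sigma (0) , $$

so the sequence \(\left( \sigma (X_n) \right) _{n \ge 1}\) does not converge in \(( \mathfrak {F}, d_c)\) to \(\sigma (0)\): the map \(\sigma \) is discontinuous at the constant \(0\), in agreement with Theorem 4, since \(\sigma (0)= \{ \emptyset , \varOmega \} \ne \mathcal {F}\) whenever \(\mathcal {F}\) is non-trivial.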

4.2 The Information Content of a Random Algorithm

Let \(\mathbbm {Y}=(Y_n)_{n \ge 1}\) be a stochastic algorithm for the minimization of \(f\) on a domain \(\mathcal {D}\). Following Theorem 1, we may define what it means for such an algorithm to converge for the minimization problem of \(f\) on the domain \(\mathcal {D}\).

Definition 4

Let \(\alpha \) be the essential infimum of f on \(\mathcal {D}\) defined in Formula (3). Following Theorem 1, the algorithm \(\mathbbm {Y}\) converges on \(\mathcal {D}\) if the sequence \( (f(Y_n))_{n \ge 1}\) converges almost surely to a random variable \(\text {Min}_{f,\mathbbm {Y}}\) such that:

$$ \mathbbm {P} \left[ \text {Min}_{f,\mathbbm {Y}} \le \alpha \right] = 1. $$

Given a random algorithm \(\mathbbm {Y}=(Y_n)_{n \ge 1}\), we now define the flow of information associated with this algorithm.

Definition 5

The flow of information associated to the algorithm \(\mathbbm {Y}=(Y_n)_{n \ge 1}\) for the global minimization of the function f is given by the natural filtration of \((f(Y_n))_{n \ge 1}\), which is the increasing sequence of \(\sigma \)-algebras defined by:

$$ \mathcal {F}_n^{\mathbbm {Y}}:= \sigma \left( f(Y_1), \dots ,f(Y_n) \right) . $$

The terminal \(\sigma \)-algebra associated to this algorithm, \( \mathcal {F}_{\infty }^{\mathbbm {Y}}\), is naturally defined as (in the two usual notations):

$$ \mathcal {F}_{\infty }^{\mathbbm {Y}}:= \sigma \left( \bigcup _{n=1}^{+ \infty } \mathcal {F}_n^{\mathbbm {Y}} \right) = \bigvee _{n=1}^{+ \infty } \mathcal {F}_n^{\mathbbm {Y}} . $$
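As a concrete finite illustration (a hypothetical toy setting, not from the text): if the \(n\)-th observation \(f(Y_n)\) reveals the \(n\)-th of three coin tosses, the atoms of \(\mathcal {F}_n^{\mathbbm {Y}}\) form a partition of \(\varOmega \) that refines as \(n\) grows, reaching the terminal \(\sigma \)-algebra at \(n=3\). A minimal Python sketch:

```python
from itertools import product

# Toy setting: Omega = all outcomes of 3 coin tosses; the n-th observation
# reveals the n-th toss (hypothetical names, for illustration only).
OMEGA = list(product([0, 1], repeat=3))

def generated_partition(observations):
    """Atoms of sigma(f(Y_1), ..., f(Y_n)): group outcomes by observed values."""
    cells = {}
    for w in OMEGA:
        key = tuple(obs(w) for obs in observations)
        cells.setdefault(key, []).append(w)
    return sorted(cells.values())

toss = [lambda w, n=n: w[n] for n in range(3)]

# Number of atoms of F_0, F_1, F_2, F_3: the filtration is increasing.
sizes = [len(generated_partition(toss[:n])) for n in range(4)]
print(sizes)  # [1, 2, 4, 8]
```

Here \(\mathcal {F}_3^{\mathbbm {Y}}\), with its 8 singleton atoms, already equals the terminal \(\sigma \)-algebra \(\mathcal {F}_{\infty }^{\mathbbm {Y}}\).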

As an immediate result we have that the filtration converges in the Cotter distance to the terminal \(\sigma \)-algebra.

Proposition 3

For every stochastic algorithm \(\mathbbm {Y}=(Y_n)_{n \ge 1}\),

$$ \lim _{n \rightarrow + \infty } d_c \left( \mathcal {F}_n^{\mathbbm {Y}}, \mathcal {F}_{\infty }^{\mathbbm {Y}} \right) =0. $$

Proof

Let us first observe that, by Proposition 2.2 of Cotter (see again [Cot86]), any increasing sequence of \(\sigma \)-algebras converges in the Cotter distance. In fact, by a standard argument we have that:

$$ \bigcap _{n=1}^{+ \infty } \bigvee _{m=n}^{+ \infty } \mathcal {F}_m^{\mathbbm {Y}} = \mathcal {F}_{\infty }^{\mathbbm {Y}} = \bigvee _{n=1}^{+ \infty } \bigcap _{m=n}^{+ \infty } \mathcal {F}_m^{\mathbbm {Y}} $$

and by the quoted result this suffices to ensure that the filtration associated with the algorithm converges. Now, it is a well-known fact (see [Bil95, p. 470]) that, with the definitions above, we have almost surely:

$$\begin{aligned} \forall Z \in L^1( \varOmega , \mathcal {F}, \mathbbm {P}) \; \; \lim _{n \rightarrow + \infty } \mathbbm {E}\left[ Z \mid \mathcal {F}_n^{\mathbbm {Y}} \right] = \mathbbm {E}\left[ Z \mid \mathcal {F}_{\infty }^{\mathbbm {Y}} \right] \end{aligned}$$
(17)

As the sequence \((\mathbbm {E}\left[ Z \mid \mathcal {F}_n^{\mathbbm {Y}} \right] )_{n \ge 1}\) is uniformly integrable, (17) implies that

$$ \forall Z \in L^1( \varOmega , \mathcal {F}, \mathbbm {P}) \; \; \lim _{n \rightarrow + \infty } \left\Vert \mathbbm {E}\left[ Z \mid \mathcal {F}_n^{\mathbbm {Y}} \right] -\mathbbm {E}\left[ Z \mid \mathcal {F}_{\infty }^{\mathbbm {Y}} \right] \right\Vert _{L^1( \varOmega , \mathcal {F}, \mathbbm {P}) }=0, $$

and this is precisely the definition of convergence given by Formula (15).

We now compare the information content of two stochastic algorithms by comparing their information induced filtrations.

Definition 6

Two algorithms \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\) are informationally asymptotically equivalent (IAE) if and only if:

$$ \lim _{n \rightarrow + \infty } d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^1}, \mathcal {F}_{n}^{\mathbbm {Y}^2} \right) =0 . $$

As an easy first observation we have that two algorithms are informationally asymptotically equivalent if and only if the Cotter distance of their terminal \(\sigma \)-algebras is zero, that is:

Proposition 4

Let \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\) be two algorithms. Then:

$$\begin{aligned} \mathbbm {Y}^1 \text{ IAE } \mathbbm {Y}^2 \Leftrightarrow d_c \left( \mathcal {F}_{\infty }^{\mathbbm {Y}^1}, \mathcal {F}_{\infty }^{\mathbbm {Y}^2} \right) =0 . \end{aligned}$$
(18)

Proof

If the two algorithms are informationally asymptotically equivalent, then the condition on the terminal \(\sigma \)-algebras is verified as an immediate consequence of Proposition 3. In fact,

$$ d_c \left( \mathcal {F}_{\infty }^{\mathbbm {Y}^1}, \mathcal {F}_{\infty }^{\mathbbm {Y}^2} \right) \le d_c \left( \mathcal {F}_{\infty }^{\mathbbm {Y}^1}, \mathcal {F}_{n}^{\mathbbm {Y}^1} \right) + d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^1}, \mathcal {F}_{n}^{\mathbbm {Y}^2} \right) + d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^2}, \mathcal {F}_{\infty }^{\mathbbm {Y}^2} \right) , $$

and, letting \(n \rightarrow + \infty \), the first and third terms on the right-hand side tend to zero by Proposition 3, while the middle term tends to zero by hypothesis.

Now, for the converse suppose that \(d_c \left( \mathcal {F}_{\infty }^{\mathbbm {Y}^1}, \mathcal {F}_{\infty }^{\mathbbm {Y}^2} \right) =0\) and that the algorithms are not IAE. Then, for some \(\epsilon >0\) there exists an increasing integer sequence \((n^{\epsilon }_k)_{k \in \mathbbm {N}}\) such that

$$ \forall k \in \mathbbm {N}, \; \; d_c \left( \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^1}, \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^2} \right) \ge \epsilon . $$

We then have that, for all \(k \ge 1\),

$$\begin{aligned} \begin{aligned} \epsilon&\le d_c \left( \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^1}, \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^2} \right) \le d_c \left( \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^1}, \mathcal {F}_{\infty }^{\mathbbm {Y}^1} \right) + d_c \left( \mathcal {F}_{\infty }^{\mathbbm {Y}^1}, \mathcal {F}_{\infty }^{\mathbbm {Y}^2} \right) + d_c \left( \mathcal {F}_{\infty }^{\mathbbm {Y}^2}, \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^2} \right) \\&= d_c \left( \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^1}, \mathcal {F}_{\infty }^{\mathbbm {Y}^1} \right) + d_c \left( \mathcal {F}_{\infty }^{\mathbbm {Y}^2}, \mathcal {F}_{n^{\epsilon }_k}^{\mathbbm {Y}^2} \right) , \end{aligned} \end{aligned}$$

and, by Proposition 3, the right-hand side tends to zero as \(k \rightarrow + \infty \), which contradicts \(\epsilon > 0\).

Our purpose now is to illustrate the intuitive idea that a convergent algorithm for minimizing a function must recover all the available information about the function. For the first result we require that the algorithms exhaust all the available information in the probability space. We will show next that, if the two algorithms \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\) both converge and satisfy,

$$\begin{aligned} \sigma \left( \text {Min}_{f,\mathbbm {Y}^1} \right) = \mathcal {F} = \sigma \left( \text {Min}_{f,\mathbbm {Y}^2} \right) , \end{aligned}$$
(19)

then these algorithms, \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\), are informationally asymptotically equivalent.

Theorem 5

With the notations of Definition 4, let \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\) be two algorithms that converge. We have that:

$$ \sigma \left( \text {Min}_{f,\mathbbm {Y}^1} \right) = \mathcal {F} = \sigma \left( \text {Min}_{f,\mathbbm {Y}^2} \right) \Rightarrow \mathbbm {Y}^1 \text{ IAE } \mathbbm {Y}^2 . $$

Proof

The proof is a consequence of the continuity, formulated in Cotter’s result quoted in Theorem 4, of the map that sends each random variable to the \(\sigma \)-algebra it generates. We have that the sequences,

$$ \left( \sigma \left( f(Y^1_n) \right) \right) _{n\ge 1}, \left( \sigma \left( f(Y^2_n)\right) \right) _{n\ge 1} \;, $$

both converge in the Cotter distance to \(\mathcal {F}\): indeed, \(f(Y^i_n)\) converges almost surely – hence in probability – to \(\text {Min}_{f,\mathbbm {Y}^i}\), and by the hypothesis of Formula (19) and Theorem 4 the map \(\sigma \) is continuous at these limits. Now, by definition, as we have that for all \(n \ge 1\),

$$ \sigma \left( f(Y^1_n) \right) \subset \mathcal {F}_{n}^{\mathbbm {Y}^1} \subset \mathcal {F} \;,\; \sigma \left( f(Y^2_n) \right) \subset \mathcal {F}_{n}^{\mathbbm {Y}^2} \subset \mathcal {F}, $$

by Proposition 2, we have:

$$ d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^1},\mathcal {F} \right) \le 2 d_c \left( \sigma \left( f(Y^1_n) \right) ,\mathcal {F} \right) \;,\; d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^2},\mathcal {F} \right) \le 2 d_c \left( \sigma \left( f(Y^2_n) \right) ,\mathcal {F} \right) $$

and so we also have that the sequences,

$$ \left( \mathcal {F}_{n}^{\mathbbm {Y}^1} \right) _{n\ge 1} , \left( \mathcal {F}_{n}^{\mathbbm {Y}^2} \right) _{n\ge 1} \;, $$

converge in the Cotter distance to \(\mathcal {F}\). Finally, as we have:

$$ d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^1}, \mathcal {F}_{n}^{\mathbbm {Y}^2} \right) \le d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^1},\mathcal {F} \right) + d_c \left( \mathcal {F}, \mathcal {F}_{n}^{\mathbbm {Y}^2} \right) , $$

we conclude that \(\lim _{n \rightarrow + \infty } d_c \left( \mathcal {F}_{n}^{\mathbbm {Y}^1}, \mathcal {F}_{n}^{\mathbbm {Y}^2} \right) =0\), that is, \(\mathbbm {Y}^1 \text{ IAE } \mathbbm {Y}^2\), which is the conclusion of Theorem 5.

Remark 5

If the condition in Formula (19) – essential in Theorem 5 – is not verified, then, by Cotter’s theorem quoted in Theorem 4, the map \(\sigma \) is not continuous at \(\text {Min}_{f,\mathbbm {Y}^1}\) or at \( \text {Min}_{f,\mathbbm {Y}^2} \), and so it is in general not true that the sequences \(\left( \sigma \left( f(Y^1_n) \right) \right) _{n\ge 1}\) and \( \left( \sigma \left( f(Y^2_n) \right) \right) _{n\ge 1}\) converge. As a consequence, despite the fact that, by Proposition 3, the sequences \(\left( \mathcal {F}_{n}^{\mathbbm {Y}^1} \right) _{n\ge 1} \) and \(\left( \mathcal {F}_{n}^{\mathbbm {Y}^2} \right) _{n\ge 1}\) both converge – to \(\mathcal {F}_{\infty }^{\mathbbm {Y}^1}\) and \( \mathcal {F}_{\infty }^{\mathbbm {Y}^2}\), respectively – we cannot ensure that the condition given by Formula (18) in Proposition 4 is verified, and so we cannot conclude that the two algorithms are IAE.

If, moreover, the algorithms are informationally asymptotically equivalent and their associated limit minimum functions take values in a denumerable set, then these limit functions coincide almost surely, saying, essentially, that two IAE convergent algorithms carry the same information content with respect to the minimization of the function.

Theorem 6

With the notations of Definition 4, let \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\) be two algorithms that converge. Let us suppose that the set \( \text {Min}_{f,\mathbbm {Y}^1}(\varOmega ) \cup \text {Min}_{f,\mathbbm {Y}^2} (\varOmega )\) is denumerable. We then have that:

$$\begin{aligned} \mathbbm {Y}^1 \text{ IAE } \mathbbm {Y}^2 \Rightarrow \text {Min}_{f,\mathbbm {Y}^1} = \text {Min}_{f,\mathbbm {Y}^2} \text { a. s.} \end{aligned}$$
(20)

Proof

The announced result is a consequence of Proposition 4. In fact, if \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\) are IAE then this means that:

$$ \mathcal {F}_{\infty }^{\mathbbm {Y}^1} \thicksim \mathcal {F}_{\infty }^{\mathbbm {Y}^2}, $$

and so by (14), for every B in the Borel \(\sigma \)-algebra of the reals \(\mathcal {B}(\mathbbm {R})\),

$$\begin{aligned} \mathbbm {P} \left[ \text {Min}_{f,\mathbbm {Y}^1} ^{-1}(B) \setminus \, \text {Min}_{f,\mathbbm {Y}^2} ^{-1}(B) \right] = 0= \mathbbm {P} \left[ \text {Min}_{f,\mathbbm {Y}^2} ^{-1}(B) \setminus \, \text {Min}_{f,\mathbbm {Y}^1} ^{-1}(B) \right] \end{aligned}$$
(21)

Now, consider \(B=\{ x\} \in \mathcal {B}(\mathbbm {R})\). Formula (21) implies that:

$$ \mathbbm {P} \left[ \left\{ \omega \in \varOmega \mid \text {Min}_{f,\mathbbm {Y}^1} (\omega ) \ne x \right\} \cup \left\{ \omega \in \varOmega \mid \text {Min}_{f,\mathbbm {Y}^2} (\omega ) =x \right\} \right] =1, $$

and also

$$ \mathbbm {P} \left[ \left\{ \omega \in \varOmega \mid \text {Min}_{f,\mathbbm {Y}^2} (\omega ) \ne x \right\} \cup \left\{ \omega \in \varOmega \mid \text {Min}_{f,\mathbbm {Y}^1} (\omega ) =x \right\} \right] =1. $$

Now, by considering the intersection

$$ \left( \left\{ \text {Min}_{f,\mathbbm {Y}^1} \ne x \right\} \cup \left\{ \text {Min}_{f,\mathbbm {Y}^2} =x \right\} \right) \cap \left( \left\{ \text {Min}_{f,\mathbbm {Y}^2} \ne x \right\} \cup \left\{ \text {Min}_{f,\mathbbm {Y}^1} =x \right\} \right) , $$

which is a set of probability one, we get, by expanding, that for every \(x \in \mathbbm {R}\):

$$ \mathbbm {P} \left[ \left\{ \text {Min}_{f,\mathbbm {Y}^1} \ne x \wedge \text {Min}_{f,\mathbbm {Y}^2} \ne x \right\} \cup \left\{ \text {Min}_{f,\mathbbm {Y}^1} = x \wedge \text {Min}_{f,\mathbbm {Y}^2} = x \right\} \right] =1. $$

And so by considering the denumerable set \(\text {Im}= \text {Min}_{f,\mathbbm {Y}^1}(\varOmega ) \cup \text {Min}_{f,\mathbbm {Y}^2} (\varOmega )\), as

$$\begin{aligned} \begin{aligned}&\left\{ \text {Min}_{f,\mathbbm {Y}^1} \ne \text {Min}_{f,\mathbbm {Y}^2} \right\} \\&\subseteq \bigcup _{x \in \text {Im} } \left\{ \text {Min}_{f,\mathbbm {Y}^1} = x \wedge \text {Min}_{f,\mathbbm {Y}^2} \ne x \right\} \cup \left\{ \text {Min}_{f,\mathbbm {Y}^1} \ne x \wedge \text {Min}_{f,\mathbbm {Y}^2} = x \right\} \\&= \bigcup _{x \in \text {Im} } \left( \left\{ \text {Min}_{f,\mathbbm {Y}^1} \ne x \wedge \text {Min}_{f,\mathbbm {Y}^2} \ne x \right\} \cup \left\{ \text {Min}_{f,\mathbbm {Y}^1} = x \wedge \text {Min}_{f,\mathbbm {Y}^2} = x \right\} \right) ^c \end{aligned} \end{aligned}$$

we have, by countable subadditivity – each set in the union being the complement of a set of probability one – that \(\mathbbm {P} \left[ \text {Min}_{f,\mathbbm {Y}^1} \ne \text {Min}_{f,\mathbbm {Y}^2} \right] =0\), as wanted.
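The Boolean identity used in the last display – with \(A = \left\{ \text {Min}_{f,\mathbbm {Y}^1} = x \right\} \) and \(B = \left\{ \text {Min}_{f,\mathbbm {Y}^2} = x \right\} \) – can be checked pointwise over the four truth assignments; a minimal verification in Python:

```python
# Pointwise check of the set identity
#   (A \ B) union (B \ A)  ==  complement of ((A^c inter B^c) union (A inter B)),
# where a = "omega in A" and b = "omega in B".
for a in (False, True):
    for b in (False, True):
        lhs = (a and not b) or (b and not a)
        rhs = not ((not a and not b) or (a and b))
        assert lhs == rhs
print("identity holds pointwise")
```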

The particular case of a unique minimizer of a continuous function on a compact domain deserves special mention, as a case where two IAE algorithms lead, almost surely, to the same minimum value and the same minimizer.

Proposition 5

With the notations of Definition 4, let \(\mathbbm {Y}^1\) and \(\mathbbm {Y}^2\) be two algorithms that converge. Suppose, furthermore, that \(f\) is continuous, that \(f\) admits a unique minimizer \(z\), and that \(\mathcal {D}\) is compact. Then we have that:

$$\begin{aligned} \mathbbm {Y}^1 \text{ IAE } \mathbbm {Y}^2 \Rightarrow \left\{ \begin{aligned}&\lim _{n\rightarrow + \infty }f(Y^1_n) =f(z) = \lim _{n\rightarrow + \infty }f(Y^2_n) \text { a. s.} \\&\lim _{n\rightarrow + \infty }Y^1_n =z= \lim _{n\rightarrow + \infty }Y^2_n \text { a. s.} \end{aligned} \right. . \end{aligned}$$

Proof

As we have \( \lim _{n \rightarrow + \infty } f(Y^1_n)=f(z) = \lim _{n \rightarrow + \infty } f(Y^2_n)\) and \( \lim _{n \rightarrow + \infty } Y^1_n=z=\lim _{n \rightarrow + \infty } Y^2_n\), by Theorem 2 and Proposition 1, we also have that \(\text {Min}_{f,\mathbbm {Y}^1} =f(z) = \text {Min}_{f,\mathbbm {Y}^2}\) almost surely and so, by Theorem 6, we have the announced result.

Remark 6

Let us observe that, under the hypotheses stated in Proposition 5, that is, if we have almost surely,

$$ \text {Min}_{f,\mathbbm {Y}^1} = \text {Min}_{f,\mathbbm {Y}^2} =f(z) , $$

then, by modifying \(\text {Min}_{f,\mathbbm {Y}^1} \) and \( \text {Min}_{f,\mathbbm {Y}^2}\) on sets of probability zero we would have that:

$$ \sigma \left( \text {Min}_{f,\mathbbm {Y}^1} \right) = \{ \emptyset , \varOmega \}= \sigma \left( \text {Min}_{f,\mathbbm {Y}^2} \right) . $$

By Remark 4, in general, under the hypotheses of Proposition 5, the two \(\sigma \)-algebras \(\sigma \left( \text {Min}_{f,\mathbbm {Y}^1} \right) \) and \(\sigma \left( \text {Min}_{f,\mathbbm {Y}^2} \right) \) are equal to \(\{ \emptyset , \varOmega \}\) in \(\mathfrak {F}:= \mathfrak {F}^{\star } / \thicksim \) – the class of all sub-\(\sigma \)-algebras of \( \mathcal {F}\) identified up to sets of probability zero – and so the condition in Formula (19) – which is essential in Theorem 5 – may be verified only for deterministic algorithms (as in that case all random variables are constant).