
33.1 Introduction

The present paper is based on the talk given by the second author on May 21, 2013, at the Seventh International Workshop on Simulation in Rimini. Some pieces of research announced in that talk have since been published [19, 21, 22]. Here we give a general overview, references to the latest published results, and a number of specific topics that have not been published elsewhere.

Let \((\varOmega,\mathcal{F},\mu )\) be a measure space, whose strictly positive probability densities form the algebraically open convex set \(\mathcal{P}_{>}\). An open statistical model \((\mathcal{M},\theta,B)\) is a parametrized subset of \(\mathcal{P}_{>}\), that is, \(\mathcal{M}\subset \mathcal{P}_{>}\) and \(\theta: \mathcal{M}\rightarrow B\), where θ is a one-to-one mapping onto an open subset of a Banach space B. We assume in the following that Ω is endowed with a distance and \(\mathcal{F}\) is its Borel σ-algebra.

If \(f: \varOmega \rightarrow \mathbb{R}\) is a bounded continuous function, the mapping \(\mathcal{M}\ni p\mapsto \mathbb{E}_{p}\left [f\right ]\) is a Stochastic Relaxation (SR) of f. The strict inequality \(\mathbb{E}_{p}\left [f\right ] <\sup _{\omega \in \varOmega }f(\omega )\) holds for all \(p \in \mathcal{M}\), unless f is constant. However, \(\sup _{p\in \mathcal{M}}\mathbb{E}_{p}\left [f\right ] =\sup _{\omega \in \varOmega }f(\omega )\) if there exists a probability measure ν in the weak closure of \(\mathcal{M}\cdot \mu\) whose support is contained in the set of maximizing points of f, that is to say

$$\displaystyle{ \nu \left \{\omega \in \varOmega: f(\omega ) =\sup _{\omega \in \varOmega }f(\omega )\right \} = 1,\quad \mbox{ or}\quad \int f\ d\nu =\sup _{\omega \in \varOmega }f(\omega ). }$$

Such a ν belongs to the border of \(\mathcal{M}\cdot \mu\). For a discussion of the border issue for finite Ω, see [14]. Other relaxation methods have been considered, e.g., [4, 25].
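
As a toy illustration of both facts (ours, not from the paper), take a finite Ω and the exponential curve \(p_{\beta } \propto \mathrm{ e}^{\beta f}\cdot \mu\), along which \(\mathbb{E}_{p_{\beta }}\left [f\right ]\) stays strictly below \(\sup f\) but approaches it as the densities concentrate on the maximizers:

```python
# Toy stochastic relaxation (our example): Omega = {0,1,2,3}, p_beta ~ exp(beta*f).
import numpy as np

f = np.array([0.0, 1.0, 3.0, 2.0])      # max f = 3, attained at omega = 2

for beta in [0.0, 1.0, 5.0, 25.0]:
    p = np.exp(beta * f)
    p /= p.sum()                        # strictly positive density on Omega
    print(beta, p @ f)                  # E_p[f] < 3, increasing towards 3
```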

An SR optimization method is an algorithm producing a sequence \(p_{n} \in \mathcal{M}\), \(n \in \mathbb{N}\), which is expected to converge to the probability measure ν, so that \(\lim _{n\rightarrow \infty }\mathbb{E}_{p_{n}}\left [f\right ] =\sup _{\omega \in \varOmega }f(\omega )\). Such algorithms are best studied in the framework of Information Geometry (IG), that is, the differential geometry of statistical models. See [3] for a general treatment of IG and [4, 6, 13, 16–19] for applications to SR. All the quoted literature refers to the case where the model Banach space of the statistical manifold, i.e., the parameter space, is finite-dimensional, \(B = \mathbb{R}^{d}\). An infinite-dimensional version of IG has been developed, see [22] for a recent presentation together with new results, and the references therein for a detailed bibliography. The nonparametric version is unavoidable in applications to evolution equations in Physics [21], and it is useful even when the sample space is finite [15].

33.2 Stochastic Relaxation on an Exponential Family

We recall some basic facts on exponential families, see [8].

  1. The exponential family \(q_{\theta } =\exp \left (\sum _{j=1}^{d}\theta _{j}T_{j} -\psi (\theta )\right ) \cdot p\), \(\mathbb{E}_{p}\left [T_{j}\right ] = 0\), is a statistical model \(\mathcal{M} = \left \{q_{\theta }\right \}\) with parametrization \(q_{\theta }\mapsto \theta \in \mathbb{R}^{d}\).

  2. \(\psi (\theta ) =\log \left (\mathbb{E}_{p}\left [\mathrm{e}^{\theta \cdot T}\right ]\right )\), \(\theta \in \mathbb{R}^{d}\), is convex and lower semi-continuous.

  3. ψ is analytic on the (nonempty) interior \(\mathcal{U}\) of its proper domain.

  4. \(\nabla \psi (\theta ) = \mathbb{E}_{\theta }\left [T\right ]\), \(T = (T_{1},\ldots,T_{d})\).

  5. \(\mathop{Hess}\nolimits \psi (\theta ) =\mathop{ Var}\nolimits _{\theta }\left (T\right )\).

  6. \(\mathcal{U}\ni \theta \mapsto \nabla \psi (\theta ) =\eta \in \mathcal{N}\) is one-to-one, analytic, and monotone; \(\mathcal{N}\) is the interior of the marginal polytope, i.e., of the convex hull of \(\left \{T(\omega ): \omega \in \varOmega \right \}\).

  7. The gradient of the SR of f is

    $$\displaystyle{ \nabla (\theta \mapsto \mathbb{E}_{\theta }\left [f\right ]) = (\mathop{Cov}\nolimits _{\theta }\left (f,T_{1}\right ),\ldots,\mathop{Cov}\nolimits _{\theta }\left (f,T_{d}\right )), }$$

    which suggests taking the least squares approximation of f on \(\mathop{Span}\nolimits \left (T_{1},\ldots,T_{d}\right )\) as the direction of steepest ascent, see [18] and the sketch after this list.

  8. The representation of the gradient in the scalar product induced by the Fisher information at θ is called the natural gradient, see [2, 3, 15].
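
Items 5, 7, and 8 translate directly into Monte Carlo estimators. The sketch below is ours (function and variable names are hypothetical): it estimates \(\nabla \mathbb{E}_{\theta }\left [f\right ]\) by the empirical covariance of f with T, and the natural gradient by preconditioning with the empirical Fisher matrix \(\mathop{Var}\nolimits _{\theta }(T)\).

```python
# Our sketch: empirical vanilla and natural gradient of theta -> E_theta[f],
# from a sample of q_theta. Assumes f_samples: (N,), T_samples: (N, d).
import numpy as np

def sr_gradients(f_samples, T_samples, ridge=1e-9):
    Tc = T_samples - T_samples.mean(axis=0)        # centered sufficient statistics
    fc = f_samples - f_samples.mean()              # centered objective values
    grad = Tc.T @ fc / len(fc)                     # Cov_theta(f, T_j), item 7
    fisher = Tc.T @ Tc / len(fc)                   # Var_theta(T) = Hess psi, item 5
    fisher += ridge * np.eye(Tc.shape[1])          # guard against singular estimates
    nat_grad = np.linalg.solve(fisher, grad)       # item 8; also the least squares
    return grad, nat_grad                          # coefficients of f on Span(T)
```

Note that the natural gradient coincides with the least squares coefficients of f on \(\mathop{Span}\nolimits \left (T_{1},\ldots,T_{d}\right )\), which is the remark of item 7.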

Different methods can be employed to generate a maximizing sequence of densities \(p_{n}\) in a statistical model \(\mathcal{M}\). A first example is given by Estimation of Distribution Algorithms (EDAs) [12], a large family of iterative algorithms in which the parameters of a density are estimated after sampling and selection, in order to favor samples with larger values of f, see Example 1. Another approach is to evaluate the gradient of \(\mathbb{E}_{p}\left [f\right ]\) and follow the direction of the natural gradient over \(\mathcal{M}\), as illustrated in Example 2.

Example 1 (EDA from [19]).

An Estimation of Distribution Algorithm is an SR optimization algorithm based on sampling, selection, and estimation, see [12].
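
A minimal EDA loop, in our own illustrative form for the independence model on \(\left \{+1,-1\right \}^{n}\) (a sketch, not the pseudocode of [12] or [19]; all names are ours):

```python
# Our sketch of an EDA: sample, select the best fraction, re-estimate.
import numpy as np

rng = np.random.default_rng(0)

def eda(f, n, pop=200, frac=0.3, iters=50):
    eta = np.full(n, 0.5)                          # P(x_i = +1), independence model
    for _ in range(iters):
        x = np.where(rng.random((pop, n)) < eta, 1.0, -1.0)
        vals = np.apply_along_axis(f, 1, x)
        best = x[np.argsort(vals)[-int(frac * pop):]]   # selection
        eta = 0.5 * (1.0 + best.mean(axis=0))      # estimation on the selected sample
        eta = eta.clip(0.05, 0.95)                 # keep the density strictly positive
    return eta

print(eda(lambda x: x.sum(), n=10))                # approaches the maximizer (1,...,1)
```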

Example 2 (SNGD from [19]).

Stochastic Natural Gradient Descent [18] is an SR algorithm that requires the estimation of the gradient.
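
Again for illustration only (our code, a simplification rather than the algorithm of [18]), a stochastic natural gradient ascent for the independence model with sufficient statistics \(T_{i}(x) = x_{i}\):

```python
# Our sketch of SNGD: estimate Cov(f, T), precondition by the empirical
# Fisher matrix Var(T), and move theta in the natural gradient direction.
import numpy as np

rng = np.random.default_rng(0)

def sngd(f, n, pop=500, lr=0.5, iters=100):
    theta = np.zeros(n)
    for _ in range(iters):
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * theta))    # P(x_i = +1) under q_theta
        x = np.where(rng.random((pop, n)) < p_plus, 1.0, -1.0)
        vals = np.apply_along_axis(f, 1, x)
        Tc = x - x.mean(axis=0)
        grad = Tc.T @ (vals - vals.mean()) / pop       # Cov_theta(f, T)
        fisher = Tc.T @ Tc / pop + 1e-6 * np.eye(n)    # Var_theta(T), regularized
        theta += lr * np.linalg.solve(fisher, grad)    # natural gradient step
    return theta

print(sngd(lambda x: x.sum(), n=10))                   # theta_i keep increasing
```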

Finally, other algorithms are based on Bregman divergence. Example 3 illustrates the connection with the exponential family.

Example 3 (Binomial B(n, p)).

On the finite sample space \(\varOmega = \left \{0,\ldots,n\right \}\) with \(\mu (x) = \binom{n}{x}\), consider the exponential family \(p(x;\theta ) =\exp \left (\theta x - n\log \left (1 +\mathrm{ e}^{\theta }\right )\right )\). With respect to the expectation parameter \(\eta = n\mathrm{e}^{\theta }/(1 +\mathrm{ e}^{\theta }) \in ]0,n[\) we have \(p(x;\eta ) = (\eta /n)^{x}(1 -\eta /n)^{n-x}\), which is the standard presentation of the binomial density.

The standard presentation is defined also at η = 0 and η = n, where the exponential formula is not. In fact, the conjugate \(\psi _{{\ast}}(\eta )\) of \(\psi (\theta ) = n\log \left (1 +\mathrm{ e}^{\theta }\right )\) is

$$\displaystyle{ \psi _{{\ast}}(\eta ) = \left \{\begin{array}{@{}l@{\quad }l@{}} +\infty \quad &\mbox{ if $\eta < 0$ or $\eta > n$,} \\ 0 \quad &\mbox{ if $\eta = 0,n$,} \\ \eta \log \left ( \frac{\eta }{n-\eta }\right ) - n\log \left ( \frac{n} {n-\eta }\right )\quad &\mbox{ if $0 <\eta < n$.} \end{array} \right. }$$
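
For completeness, the middle case is the one-line Legendre computation (our addition): with \(\eta =\nabla \psi (\theta ) = n\mathrm{e}^{\theta }/(1 +\mathrm{ e}^{\theta })\), i.e., \(\theta =\log \left (\eta /(n-\eta )\right )\) and \(1 +\mathrm{ e}^{\theta } = n/(n-\eta )\),

$$\displaystyle{ \psi _{{\ast}}(\eta ) =\theta \eta -\psi (\theta ) =\eta \log \left ( \frac{\eta }{n-\eta }\right ) - n\log \left ( \frac{n} {n-\eta }\right ),\quad 0 <\eta < n, }$$

and differentiation gives \(\psi _{{\ast}}^{{\prime}}(\eta ) =\log \left (\eta /(n-\eta )\right ) =\theta (\eta )\), which is used in the display below.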

We have

$$\displaystyle\begin{array}{rcl} \log p(x;\eta )& =& \log \left ( \frac{\eta } {n-\eta }\right )(x-\eta ) +\psi _{{\ast}}(\eta ),\quad \eta \in ]0,n[ {}\\ & =& \psi _{{\ast}}^{{\prime}}(\eta )(x-\eta ) +\psi _{ {\ast}}(\eta )\leqslant \psi _{{\ast}}(x). {}\\ \end{array}$$

For \(x\neq 0,n\), the sign of \(\psi _{{\ast}}^{{\prime}}(\eta )(x-\eta )\) is eventually negative as \(\eta \rightarrow 0\) or \(\eta \rightarrow n\), hence

$$\displaystyle{ \lim _{\eta \rightarrow 0,n}\log p(x;\eta ) =\lim _{\eta \rightarrow 0,n}\left (\psi _{{\ast}}^{{\prime}}(\eta )(x-\eta ) +\psi _{ {\ast}}(\eta )\right ) = -\infty. }$$

If x = 0 or x = n, the sign of \(\psi _{{\ast}}^{{\prime}}(\eta )(0-\eta )\), respectively \(\psi _{{\ast}}^{{\prime}}(\eta )(n-\eta )\), is eventually positive as \(\eta \rightarrow 0\), respectively \(\eta \rightarrow n\), so that the limit of \(\log p(x;\eta )\) is bounded above by \(0 =\psi _{{\ast}}(x)\).

The argument above is in fact general. It was observed in [5] that the Bregman divergence \(D_{\psi _{{\ast}}}(x\Vert \eta ) =\psi _{{\ast}}(x) -\psi _{{\ast}}(\eta ) -\psi _{{\ast}}^{{\prime}}(\eta )(x-\eta )\geqslant 0\) provides an interesting form of the density, namely \(p(x;\eta ) =\mathrm{ e}^{-D_{\psi _{{\ast}}}(x\Vert \eta )}\mathrm{e}^{\psi _{{\ast}}(x)} \propto \mathrm{ e}^{-D_{\psi _{{\ast}}}(x\Vert \eta )}\).
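
The Bregman form is easy to check numerically; the following sketch (ours) compares it with the standard binomial probability function:

```python
# Our check of p(x; eta) = exp(-D_{psi*}(x||eta)) * exp(psi*(x)) against the
# binomial pmf mu(x) * p(x; eta) = binom(n, x) (eta/n)^x (1 - eta/n)^(n-x).
import numpy as np
from math import comb, log

n = 10

def psi_star(e):
    return 0.0 if e in (0, n) else e * log(e / (n - e)) - n * log(n / (n - e))

def bregman(x, e):                                  # D_{psi*}(x || eta) >= 0
    return psi_star(x) - psi_star(e) - log(e / (n - e)) * (x - e)

eta = 3.7
for x in range(n + 1):
    pmf = comb(n, x) * (eta / n) ** x * (1 - eta / n) ** (n - x)
    brg = comb(n, x) * np.exp(-bregman(x, eta) + psi_star(x))
    assert abs(pmf - brg) < 1e-10
print("Bregman form of the binomial density verified")
```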

33.3 Exponential Manifold

The set of positive probability densities \(\mathcal{P}_{>}\) is a convex subset of \(L^{1}(\mu )\). Given \(p \in \mathcal{P}_{>}\), every \(q \in \mathcal{P}_{>}\) can be written as \(q =\mathrm{ e}^{v} \cdot p\), where \(v =\log \left (\frac{q} {p}\right )\). Below we summarize, together with a few new details, results from [21, 22] and the references therein, and from the unpublished [24].

Definition 33.1 (Orlicz Φ-Space [11], [20, Chapter II], [23]).

Define \(\varPhi (y) =\cosh y - 1\). The Orlicz Φ-space \(L^{\varPhi }(p)\) is the vector space of all random variables u such that \(\mathbb{E}_{p}\left [\varPhi (\alpha u)\right ]\) is finite for some α > 0. Equivalently, it is the set of all random variables u whose Laplace transform under \(p\cdot \mu\), \(t\mapsto \hat{u}_{p}(t) = \mathbb{E}_{p}\left [\mathrm{e}^{tu}\right ]\), is finite in a neighborhood of 0. We denote by \(M^{\varPhi }(p) \subset L^{\varPhi }(p)\) the vector space of random variables whose Laplace transform is finite everywhere.

Proposition 33.1 (Properties of the Φ-Space).

  1. The set \(S_{\leqslant 1} = \left \{u \in L^{\varPhi }(p): \mathbb{E}_{p}\left [\varPhi (u)\right ]\leqslant 1\right \}\) is the closed unit ball of the complete norm

    $$\displaystyle{ \left \Vert u\right \Vert _{p} =\inf \left \{\rho > 0: \mathbb{E}_{p}\left [\varPhi \left (\frac{u} {\rho } \right )\right ]\leqslant 1\right \} }$$

    on the Φ-space. For all a⩾1 the continuous injections \(L^{\infty }(\mu )\hookrightarrow L^{\varPhi }(p)\hookrightarrow L^{a}(p)\) hold.

  2. \(\left \Vert u\right \Vert _{p} = 1\) if either \(\mathbb{E}_{p}\left [\varPhi (u)\right ] = 1\), or \(\mathbb{E}_{p}\left [\varPhi (u)\right ] < 1\) and \(\mathbb{E}_{p}\left [\varPhi \left (\frac{u} {\rho } \right )\right ] = \infty \) for all ρ > 1. If \(\left \Vert u\right \Vert _{p} > 1\), then \(\left \Vert u\right \Vert _{p}\leqslant \mathbb{E}_{p}\left [\varPhi (u)\right ]\). In particular, \(\lim _{\left \Vert u\right \Vert _{p}\rightarrow \infty }\mathbb{E}_{p}\left [\varPhi \left (u\right )\right ] = \infty \).

  3. \(M^{\varPhi }(p)\) is a closed and separable subspace of \(L^{\varPhi }(p)\).

  4. \(L^{\varPhi }(p) = L^{\varPhi }(q)\) as Banach spaces if, and only if, \(\theta \mapsto \int p^{1-\theta }q^{\theta }\ d\mu\) is finite on a neighborhood of [0,1].

Proof.

  1. See [11], [20, Chapter II], [23].

  2. The function \(\mathbb{R}_{\geqslant 0} \ni t\mapsto \hat{u}(t) = \mathbb{E}_{p}\left [\varPhi (tu)\right ]\) is increasing, convex, and lower semi-continuous. If \(\hat{u}(t_{+})\) is finite for some \(t_{+} > 1\), we are in the first case and \(\hat{u}(1) = 1\); otherwise, \(\hat{u}(1)\leqslant 1\). If \(\left \Vert u\right \Vert _{p} > a > 1\), then \(\left \Vert \frac{a} {\left \Vert u\right \Vert _{p}}u\right \Vert _{p} > 1\), hence

    $$\displaystyle{ 1 < \mathbb{E}_{p}\left [\varPhi \left ( \frac{a} {\left \Vert u\right \Vert _{p}}u\right )\right ]\leqslant \frac{a} {\left \Vert u\right \Vert _{p}}\mathbb{E}_{p}\left [\varPhi \left (u\right )\right ], }$$

    and therefore \(\left \Vert u\right \Vert _{p} < a\,\mathbb{E}_{p}\left [\varPhi \left (u\right )\right ]\) for all \(a \in \,]1,\left \Vert u\right \Vert _{p}[\); letting \(a \rightarrow 1\) gives \(\left \Vert u\right \Vert _{p}\leqslant \mathbb{E}_{p}\left [\varPhi (u)\right ]\).

  3. See [11], [20, Chapter II], [23].

  4. See [9, 24].
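
In a concrete case the norm of Proposition 33.1(1) can be computed by bisection on \(\rho \mapsto \mathbb{E}_{p}\left [\varPhi (u/\rho )\right ]\). The sketch below (our code) uses the standard normal p and u(x) = bx, for which \(\mathbb{E}_{p}\left [\varPhi (u/\rho )\right ] =\mathrm{ e}^{b^{2}/(2\rho ^{2})} - 1\) in closed form, so that \(\left \Vert u\right \Vert _{p} = \left \vert b\right \vert /\sqrt{2\log 2}\):

```python
# Our sketch: Luxemburg norm by bisection; mean_phi(rho) = E_p[Phi(u/rho)]
# is decreasing in rho, and the norm is where it crosses the level 1.
import numpy as np

def luxemburg_norm(mean_phi, lo=1e-6, hi=1e6, tol=1e-10):
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mean_phi(mid) > 1.0:
            lo = mid                       # E_p[Phi(u/mid)] > 1: norm exceeds mid
        else:
            hi = mid
    return hi

b = 2.0
norm = luxemburg_norm(lambda rho: np.expm1(0.5 * (b / rho) ** 2))
print(norm, abs(b) / np.sqrt(2 * np.log(2)))       # the two values agree
```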

Example 4 (Boolean State Space).

In the case of a finite state space, the moment generating function is finite everywhere, but its computation can be challenging. We discuss in particular the Boolean case \(\varOmega = \left \{+1,-1\right \}^{n}\) with counting reference measure μ and uniform density \(p(x) = 2^{-n}\), x ∈ Ω. In this case there is a huge literature from statistical physics, e.g., [10, Ch. VII]. A generic real function on Ω—called pseudo-Boolean [7] in the combinatorial optimization literature—has the form \(u(x) =\sum _{\alpha \in L}\hat{u}(\alpha )x^{\alpha }\), with \(L = \left \{0,1\right \}^{n}\), \(x^{\alpha } =\prod _{ i=1}^{n}x_{i}^{\alpha _{i}}\), \(\hat{u}(\alpha ) = 2^{-n}\sum _{x\in \varOmega }u(x)x^{\alpha }\).

As \(\mathrm{e}^{ax} =\cosh (a) +\sinh (a)x\) when \(x^{2} = 1\), i.e., \(x = \pm 1\), we have

$$\displaystyle\begin{array}{rcl} \mathrm{e}^{tu(x)}& = & \exp \left (\sum _{\alpha \in \mathop{Supp}\nolimits \hat{u}}t\hat{u}(\alpha )x^{\alpha }\right ) =\prod _{\alpha \in \mathop{Supp}\nolimits \hat{u}}\mathrm{e}^{t\hat{u}(\alpha )x^{\alpha }} {}\\ & = & \prod _{\alpha \in \mathop{Supp}\nolimits \hat{u}}\left (\cosh (t\hat{u}(\alpha )) +\sinh (t\hat{u}(\alpha ))x^{\alpha }\right ) {}\\ & =&\sum _{B\subset \mathop{Supp}\nolimits \hat{u}}\prod _{\alpha \in B^{c}}\cosh (t\hat{u}(\alpha ))\prod _{\alpha \in B}\sinh (t\hat{u}(\alpha ))x^{\sum _{\alpha \in B}\alpha }. {}\\ \end{array}$$

The moment generating function of u under the uniform density p is

$$\displaystyle{ t\mapsto \sum _{B\in \mathcal{B}(\hat{u})}\prod _{\alpha \in B^{c}}\cosh (t\hat{u}(\alpha ))\prod _{\alpha \in B}\sinh (t\hat{u}(\alpha )), }$$

where \(\mathcal{B}(\hat{u})\) are those \(B \subset \mathop{ Supp}\nolimits \hat{u}\) such that \(\sum _{\alpha \in B}\alpha = 0\mod 2\). We have

$$\displaystyle{ \mathbb{E}_{p}\left [\varPhi (tu)\right ] =\sum _{B\in \mathcal{B}_{0}(\hat{u})}\prod _{\alpha \in B^{c}}\cosh (t\hat{u}(\alpha ))\prod _{\alpha \in B}\sinh (t\hat{u}(\alpha )) - 1, }$$

where \(\mathcal{B}_{0}(\hat{u})\) are those \(B \subset \mathop{ Supp}\nolimits \hat{u}\) such that \(\sum _{\alpha \in B}\alpha = 0\mod 2\) and, in addition, \(\#B = 0\mod 2\): the terms with an odd number of \(\sinh\) factors cancel between \(\mathrm{e}^{tu}\) and \(\mathrm{e}^{-tu}\) in \(\varPhi (tu) = \frac{1}{2}\left (\mathrm{e}^{tu} +\mathrm{ e}^{-tu}\right ) - 1\).

If S is the \(\left \{1,\ldots,n\right \} \times \mathop{ Supp}\nolimits \hat{u}\) matrix with elements \(\alpha _{i}\), we want to solve the system \(Sb = 0\mod 2\) to find all elements of \(\mathcal{B}\); we add the equation \(\sum b = 0\mod 2\) to find \(\mathcal{B}_{0}\). The simplest example is \(u(x) =\sum _{ i=1}^{n}c_{i}x_{i}\), for which \(\mathcal{B} = \mathcal{B}_{0} = \left \{\emptyset \right \}\) and the moment generating function reduces to \(\prod _{i=1}^{n}\cosh (tc_{i})\).
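
For small supports both parity conditions can be checked by direct enumeration. The following sketch (our code, with arbitrary illustrative coefficients) computes the moment generating function from the formula above and compares it with a brute-force average over \(\left \{+1,-1\right \}^{n}\):

```python
# Our sketch: enumerate B in Supp(u-hat) with S b = 0 mod 2 and compare the
# resulting moment generating function with a brute-force sum over Omega.
import itertools
import numpy as np

n = 4
support = [(1, 1, 0, 0), (0, 1, 1, 0), (1, 0, 1, 0)]        # multi-indices alpha
coeff = dict(zip(support, [0.7, -0.3, 0.5]))                # u-hat(alpha), ours

def mgf_parity(t):
    total = 0.0
    for r in range(len(support) + 1):
        for B in itertools.combinations(support, r):
            if all(sum(a[i] for a in B) % 2 == 0 for i in range(n)):  # S b = 0 mod 2
                total += np.prod([np.sinh(t * coeff[a]) if a in B
                                  else np.cosh(t * coeff[a]) for a in support])
    return total

def mgf_brute(t):
    xs = itertools.product([-1, 1], repeat=n)
    return np.mean([np.exp(t * sum(c * np.prod([x[i] for i in range(n) if a[i]])
                                   for a, c in coeff.items())) for x in xs])

print(mgf_parity(1.3), mgf_brute(1.3))                      # the two values agree
```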

Example 5 (The Sphere is Not Smooth in General).

We look for the moment generating function of the density

$$\displaystyle{ p(x) \propto (a + x)^{-\frac{3} {2} }\mathrm{e}^{-x},\quad x > 0, }$$

where a is a positive constant. From the incomplete gamma integral

$$\displaystyle{ \varGamma \left (-\frac{1} {2},x\right ) =\int _{ x}^{\infty }s^{-\frac{1} {2} -1}\mathrm{e}^{-s}\ ds,\quad x > 0, }$$

we have for θ, a > 0,

$$\displaystyle{ \frac{d} {dx}\varGamma \left (-\frac{1} {2},\theta (a + x)\right ) = -\theta ^{-\frac{1} {2} }\mathrm{e}^{-\theta a}(a + x)^{-\frac{3} {2} }\mathrm{e}^{-\theta x}. }$$

We have, for \(\theta \in \mathbb{R}\),

$$\displaystyle{ C(\theta,a) =\int _{ 0}^{\infty }(a+x)^{-\frac{3} {2} }\mathrm{e}^{-\theta x}\ dx = \left \{\begin{array}{@{}l@{\quad }l@{}} \sqrt{\theta }\,\mathrm{e}^{\theta a}\varGamma \left (-\frac{1} {2},\theta a\right )\quad &\mbox{ if $\theta > 0$,} \\ 2a^{-\frac{1} {2} } \quad &\mbox{ if $\theta = 0$,} \\ +\infty \quad &\mbox{ if $\theta < 0$,} \end{array} \right. }$$

or, equivalently, \(C(\theta,a) = 2a^{-\frac{1} {2} } - 2\sqrt{\pi \theta }\,\mathrm{e}^{\theta a}R_{1/2,1}(\theta a)\) if \(\theta \geqslant 0\), \(+\infty \) otherwise, where \(R_{1/2,1}\) is the survival function of the Gamma distribution with shape 1∕2 and scale 1.
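
The expressions for C(θ, a) can be cross-checked numerically. In the sketch below (ours), \(R_{1/2,1}\) is scipy.special.gammaincc(0.5, ·), the regularized upper incomplete gamma function:

```python
# Our check of C(theta, a): numerical quadrature against the closed form
# 2*a**(-1/2) - 2*sqrt(pi*theta)*exp(theta*a)*R_{1/2,1}(theta*a).
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaincc

a = 1.5

def C_quad(theta):
    return quad(lambda x: (a + x) ** -1.5 * np.exp(-theta * x), 0, np.inf)[0]

def C_formula(theta):
    if theta < 0:
        return np.inf
    return 2 / np.sqrt(a) - 2 * np.sqrt(np.pi * theta) \
        * np.exp(theta * a) * gammaincc(0.5, theta * a)

for theta in [0.0, 0.5, 1.0, 2.0]:
    print(theta, C_quad(theta), C_formula(theta))   # the columns agree
```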

The density p is obtained with θ = 1,

$$\displaystyle{ p(x) = C(1,a)^{-1}(a + x)^{-\frac{3} {2} }\mathrm{e}^{-x} = \frac{(a + x)^{-\frac{3} {2} }\mathrm{e}^{-x}} {\mathrm{e}^{a}\varGamma \left (-\frac{1} {2},a\right )},\quad x > 0, }$$

and, for the random variable u(x) = x, the function

$$\displaystyle\begin{array}{rcl} \alpha \mapsto \mathbb{E}_{p}\left [\varPhi (\alpha u)\right ]& =& \frac{1} {\mathrm{e}^{a}\varGamma \left (-\frac{1} {2},a\right )}\int _{0}^{\infty }(a + x)^{-\frac{3} {2} } \frac{\mathrm{e}^{-(1-\alpha )x} +\mathrm{ e}^{-(1+\alpha )x}} {2} \ dx - 1 {}\\ & =& \frac{C(1-\alpha,a) + C(1+\alpha,a)} {2C(1,a)} - 1 {}\\ \end{array}$$

is convex and lower semi-continuous on \(\mathbb{R}\), finite for α ∈ [−1, 1] and infinite otherwise, hence not steep. Its value at α = 1 is

$$\displaystyle\begin{array}{rcl} \mathbb{E}_{p}\left [\varPhi (u)\right ]& =& \frac{1} {\mathrm{e}^{a}\varGamma \left (-\frac{1} {2},a\right )}\int _{0}^{\infty }(a + x)^{-\frac{3} {2} } \frac{1 +\mathrm{ e}^{-2x}} {2} \ dx - 1 {}\\ & =& \frac{C(0,a) + C(2,a)} {2C(1,a)} - 1. {}\\ \end{array}$$

Example 6 (Normal Density).

Let \(p(x) = (2\pi )^{-1/2}\mathrm{e}^{-(1/2)x^{2} }\). Consider a generic quadratic polynomial \(u(x) = a + bx + \frac{1} {2}cx^{2}\). We have for tc ≠ 1

$$\displaystyle{ t\left (a+bx+ \frac{1} {2}cx^{2}\right )-\frac{1} {2}x^{2} = -\frac{1 - tc} {2} \left (x - \frac{tb} {1 - tc}\right )^{2} + \frac{1} {2}\, \frac{t^{2}b^{2} + 2ta(1 - tc)} {1 - tc}, }$$

hence

$$\displaystyle{ \mathbb{E}_{p}\left [\mathrm{e}^{tu}\right ] = \left \{\begin{array}{@{}l@{\quad }l@{}} +\infty \quad &\mbox{ if $tc\geqslant 1$,} \\ (1 - tc)^{-\frac{1} {2} }\exp \left (\dfrac{1} {2}\, \dfrac{t^{2}b^{2} + 2ta(1 - tc)} {1 - tc} \right )\quad &\mbox{ if $tc < 1$}. \end{array} \right. }$$

If, and only if, − 1 < c < 1, we have

$$\displaystyle\begin{array}{rcl} \mathbb{E}_{p}\left [\varPhi (u)\right ]& =& \frac{1} {2}(1 - c)^{-\frac{1} {2} }\exp \left (\dfrac{1} {2}\, \dfrac{b^{2} + 2a(1 - c)} {1 - c} \right ) {}\\ & & +\frac{1} {2}(1 + c)^{-\frac{1} {2} }\exp \left (\dfrac{1} {2}\, \dfrac{b^{2} - 2a(1 + c)} {1 + c} \right ) - 1. {}\\ \end{array}$$
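
The closed form is easy to validate against numerical integration; a sketch (ours):

```python
# Our check of E_p[Phi(u)] for the quadratic u under the standard normal p.
import numpy as np
from scipy.integrate import quad

a, b, c = 0.3, 0.7, 0.4                            # requires -1 < c < 1

u = lambda x: a + b * x + 0.5 * c * x ** 2
dens = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
numeric = quad(lambda x: (np.cosh(u(x)) - 1.0) * dens(x), -np.inf, np.inf)[0]

closed = (0.5 / np.sqrt(1 - c) * np.exp(0.5 * (b**2 + 2 * a * (1 - c)) / (1 - c))
          + 0.5 / np.sqrt(1 + c) * np.exp(0.5 * (b**2 - 2 * a * (1 + c)) / (1 + c))
          - 1.0)

print(numeric, closed)                             # the two values agree
```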

33.4 Vector Bundles

Vector bundles are constructed as sets of couples (p,v), where \(p \in \mathcal{P}_{>}\) and v belongs to some space of random variables such that \(\mathbb{E}_{p}\left [v\right ] = 0\). The tangent bundle is obtained when the vector space is \(L_{0}^{\varPhi }(p)\). The Hilbert bundle is defined as \(H\mathcal{P}_{>} = \left \{(p,v): p \in \mathcal{P}_{>},v \in L_{0}^{2}(p)\right \}\). We refer to [21] and [15], where charts and affine connections on the Hilbert bundle are derived from the isometric transport

$$\displaystyle{ L_{0}^{2}(p) \ni u\mapsto \sqrt{\frac{p} {q}}u -\left (1 + \mathbb{E}_{q}\left [\sqrt{\frac{p} {q}}\,\,\right ]\right )^{-1}\left (1 + \sqrt{\frac{p} {q}}\,\,\right )\mathbb{E}_{q}\left [\sqrt{\frac{p} {q}}u\right ] \in L_{0}^{2}(q). }$$

In turn, an isometric transport \(U_{p}^{q}: L_{0}^{2}(p) \rightarrow L_{0}^{2}(q)\) can be used to compute the derivative of a vector field in the Hilbert bundle, for example the derivative of the gradient of a relaxed function.
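
On a finite sample space the transport is a small linear map, and its two defining properties, centering under q and isometry, can be verified directly; the following sketch (our code) does so for random positive densities:

```python
# Our check of the isometric transport L^2_0(p) -> L^2_0(q) displayed above.
import numpy as np

rng = np.random.default_rng(1)
m = 6
p = rng.random(m); p /= p.sum()                    # strictly positive densities
q = rng.random(m); q /= q.sum()

E = lambda dens, v: np.sum(dens * v)               # expectation on the finite space

def transport(u):
    r = np.sqrt(p / q)
    return r * u - (1 + r) * E(q, r * u) / (1 + E(q, r))

u = rng.standard_normal(m); u -= E(p, u)           # u, v in L^2_0(p)
v = rng.standard_normal(m); v -= E(p, v)
Uu, Uv = transport(u), transport(v)

print(E(q, Uu), E(q, Uv))                          # approx 0: centered under q
print(E(p, u * v), E(q, Uu * Uv))                  # equal: the transport is isometric
```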

The resulting second-order structure is instrumental in computing the Hessian of the SR function, i.e., the derivative of its natural gradient. This makes it possible to design second-order approximation methods, as suggested in [1] for general Riemannian manifolds and applied to SR in [15]. A second-order structure is also used to define the curvature of a statistical manifold and, possibly, to compute its geodesics, see [6] for applications to optimization.