
33.1 Introduction

The present paper is based on the talk given by the second author on May 21, 2013, at the Seventh International Workshop on Simulation in Rimini. Some pieces of research announced in that talk have since been published [19, 21, 22]. Here we give a general overview, references to the latest published results, and a number of specific topics that have not been published elsewhere.

Let \((\varOmega,\mathcal{F},\mu )\) be a measure space, whose strictly positive probability densities form the algebraically open convex set \(\mathcal{P}_{>}\). An open statistical model \((\mathcal{M},\theta,B)\) is a parametrized subset of \(\mathcal{P}_{>}\), that is, \(\mathcal{M}\subset \mathcal{P}_{>}\) and \(\theta: \mathcal{M}\rightarrow B\), where θ is a one-to-one mapping onto an open subset of a Banach space B. We assume in the following that Ω is endowed with a distance and \(\mathcal{F}\) is its Borel σ-algebra.

If \(f: \varOmega \rightarrow \mathbb{R}\) is a bounded continuous function, the mapping \(\mathcal{M}\ni p\mapsto \mathbb{E}_{p}\left [f\right ]\) is a Stochastic Relaxation (SR) of f. The strict inequality \(\mathbb{E}_{p}\left [f\right ] <\sup _{\omega \in \varOmega }f(\omega )\) holds for all \(p \in \mathcal{M}\), unless f is constant. However, \(\sup _{p\in \mathcal{M}}\mathbb{E}_{p}\left [f\right ] =\sup _{\omega \in \varOmega }f(\omega )\) if there exists a probability measure ν in the weak closure of \(\mathcal{M}\cdot \mu\) whose support is contained in the set of maximizing points of f, that is to say

$$\displaystyle{ \nu \left \{\omega \in \varOmega: f(\omega ) =\sup _{\omega \in \varOmega }f(\omega )\right \} = 1,\quad \mbox{ or}\quad \int f\ d\nu =\sup _{\omega \in \varOmega }f(\omega ). }$$

Such a ν belongs to the border of \(\mathcal{M}\cdot \mu\). For a discussion of the border issue for finite Ω, see [14]. Other relaxation methods have been considered, e.g., [4, 25].
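
As a toy illustration of both facts (ours, not from the paper), take a finite Ω and the exponential curve \(p_{\beta } \propto \mathrm{ e}^{\beta f}\cdot \mu\), along which \(\mathbb{E}_{p_{\beta }}\left [f\right ]\) stays strictly below \(\sup f\) but approaches it as the densities concentrate on the maximizers:

```python
# Toy stochastic relaxation (our example): Omega = {0,1,2,3}, p_beta ~ exp(beta*f).
import numpy as np

f = np.array([0.0, 1.0, 3.0, 2.0])      # max f = 3, attained at omega = 2

for beta in [0.0, 1.0, 5.0, 25.0]:
    p = np.exp(beta * f)
    p /= p.sum()                        # strictly positive density on Omega
    print(beta, p @ f)                  # E_p[f] < 3, increasing towards 3
```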

An SR optimization method is an algorithm producing a sequence \(p_{n} \in \mathcal{M}\), \(n \in \mathbb{N}\), which is expected to converge to the probability measure ν, so that \(\lim _{n\rightarrow \infty }\mathbb{E}_{p_{n}}\left [f\right ] =\sup _{\omega \in \varOmega }f(\omega )\). Such algorithms are best studied in the framework of Information Geometry (IG), that is, the differential geometry of statistical models. See [3] for a general treatment of IG and [4, 6, 13, 16–19] for applications to SR. All the quoted literature refers to the case where the model Banach space of the statistical manifold, i.e., the parameter space, is finite-dimensional, \(B = \mathbb{R}^{d}\). An infinite-dimensional version of IG has been developed, see [22] for a recent presentation together with new results, and the references therein for a detailed bibliography. The nonparametric version is unavoidable in applications to evolution equations in Physics [21], and it is useful even when the sample space is finite [15].

33.2 Stochastic Relaxation on an Exponential Family

We recall some basic facts on exponential families, see [8].

  1. The exponential family \(q_{\theta } =\exp \left (\sum _{j=1}^{d}\theta _{j}T_{j} -\psi (\theta )\right ) \cdot p\), \(\mathbb{E}_{p}\left [T_{j}\right ] = 0\), is a statistical model \(\mathcal{M} = \left \{q_{\theta }\right \}\) with parametrization \(q_{\theta }\mapsto \theta \in \mathbb{R}^{d}\).

  2. \(\psi (\theta ) =\log \left (\mathbb{E}_{p}\left [\mathrm{e}^{\theta \cdot T}\right ]\right )\), \(\theta \in \mathbb{R}^{d}\), is convex and lower semi-continuous.

  3. ψ is analytic on the (nonempty) interior \(\mathcal{U}\) of its proper domain.

  4. \(\nabla \psi (\theta ) = \mathbb{E}_{\theta }\left [T\right ]\), \(T = (T_{1},\ldots,T_{d})\).

  5. \(\mathop{Hess}\nolimits \psi (\theta ) =\mathop{ Var}\nolimits _{\theta }\left (T\right )\).

  6. \(\mathcal{U}\ni \theta \mapsto \nabla \psi (\theta ) =\eta \in \mathcal{N}\) is one-to-one, analytic, and monotone; \(\mathcal{N}\) is the interior of the marginal polytope, i.e., of the convex hull of \(\left \{T(\omega ): \omega \in \varOmega \right \}\).

  7. The gradient of the SR of f is

    $$\displaystyle{ \nabla (\theta \mapsto \mathbb{E}_{\theta }\left [f\right ]) = (\mathop{Cov}\nolimits _{\theta }\left (f,T_{1}\right ),\ldots,\mathop{Cov}\nolimits _{\theta }\left (f,T_{d}\right )), }$$

    which suggests taking the least squares approximation of f on \(\mathop{Span}\nolimits \left (T_{1},\ldots,T_{d}\right )\) as the direction of steepest ascent, see [18] and the sketch after this list.

  8. The representation of the gradient in the scalar product induced by the Fisher information at θ is called the natural gradient, see [2, 3, 15].
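
Items 5, 7, and 8 translate directly into Monte Carlo estimators. The sketch below is ours (function and variable names are hypothetical): it estimates \(\nabla \mathbb{E}_{\theta }\left [f\right ]\) by the empirical covariance of f with T, and the natural gradient by preconditioning with the empirical Fisher matrix \(\mathop{Var}\nolimits _{\theta }(T)\).

```python
# Our sketch: empirical vanilla and natural gradient of theta -> E_theta[f],
# from a sample of q_theta. Assumes f_samples: (N,), T_samples: (N, d).
import numpy as np

def sr_gradients(f_samples, T_samples, ridge=1e-9):
    Tc = T_samples - T_samples.mean(axis=0)        # centered sufficient statistics
    fc = f_samples - f_samples.mean()              # centered objective values
    grad = Tc.T @ fc / len(fc)                     # Cov_theta(f, T_j), item 7
    fisher = Tc.T @ Tc / len(fc)                   # Var_theta(T) = Hess psi, item 5
    fisher += ridge * np.eye(Tc.shape[1])          # guard against singular estimates
    nat_grad = np.linalg.solve(fisher, grad)       # item 8; also the least squares
    return grad, nat_grad                          # coefficients of f on Span(T)
```

Note that the natural gradient coincides with the least squares coefficients of f on \(\mathop{Span}\nolimits \left (T_{1},\ldots,T_{d}\right )\), which is the remark of item 7.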

Different methods can be employed to generate a maximizing sequence of densities \(p_{n}\) in a statistical model \(\mathcal{M}\). A first example is given by Estimation of Distribution Algorithms (EDAs) [12], a large family of iterative algorithms in which the parameters of a density are estimated after sampling and selection, in order to favor samples with larger values of f, see Example 1. Another approach is to evaluate the gradient of \(\mathbb{E}_{p}\left [f\right ]\) and follow the direction of the natural gradient over \(\mathcal{M}\), as illustrated in Example 2.

Example 1 (EDA from [19]).

An Estimation of Distribution Algorithm is an SR optimization algorithm based on sampling, selection, and estimation, see [12].
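
A minimal EDA loop, in our own illustrative form for the independence model on \(\left \{+1,-1\right \}^{n}\) (a sketch, not the pseudocode of [12] or [19]; all names are ours):

```python
# Our sketch of an EDA: sample, select the best fraction, re-estimate.
import numpy as np

rng = np.random.default_rng(0)

def eda(f, n, pop=200, frac=0.3, iters=50):
    eta = np.full(n, 0.5)                          # P(x_i = +1), independence model
    for _ in range(iters):
        x = np.where(rng.random((pop, n)) < eta, 1.0, -1.0)
        vals = np.apply_along_axis(f, 1, x)
        best = x[np.argsort(vals)[-int(frac * pop):]]   # selection
        eta = 0.5 * (1.0 + best.mean(axis=0))      # estimation on the selected sample
        eta = eta.clip(0.05, 0.95)                 # keep the density strictly positive
    return eta

print(eda(lambda x: x.sum(), n=10))                # approaches the maximizer (1,...,1)
```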

Example 2 (SNGD from [19]).

Stochastic Natural Gradient Descent [18] is an SR algorithm that requires the estimation of the gradient.
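
Again for illustration only (our code, a simplification rather than the algorithm of [18]), a stochastic natural gradient ascent for the independence model with sufficient statistics \(T_{i}(x) = x_{i}\):

```python
# Our sketch of SNGD: estimate Cov(f, T), precondition by the empirical
# Fisher matrix Var(T), and move theta in the natural gradient direction.
import numpy as np

rng = np.random.default_rng(0)

def sngd(f, n, pop=500, lr=0.5, iters=100):
    theta = np.zeros(n)
    for _ in range(iters):
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * theta))    # P(x_i = +1) under q_theta
        x = np.where(rng.random((pop, n)) < p_plus, 1.0, -1.0)
        vals = np.apply_along_axis(f, 1, x)
        Tc = x - x.mean(axis=0)
        grad = Tc.T @ (vals - vals.mean()) / pop       # Cov_theta(f, T)
        fisher = Tc.T @ Tc / pop + 1e-6 * np.eye(n)    # Var_theta(T), regularized
        theta += lr * np.linalg.solve(fisher, grad)    # natural gradient step
    return theta

print(sngd(lambda x: x.sum(), n=10))                   # theta_i keep increasing
```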

Finally, other algorithms are based on Bregman divergence. Example 3 illustrates the connection with the exponential family.

Example 3 (Binomial B(n, p)).

On the finite sample space \(\varOmega = \left \{0,\ldots,n\right \}\) with \(\mu (x) = \binom{n}{x}\), consider the exponential family \(p(x;\theta ) =\exp \left (\theta x - n\log \left (1 +\mathrm{ e}^{\theta }\right )\right )\). With respect to the expectation parameter \(\eta = n\mathrm{e}^{\theta }/(1 +\mathrm{ e}^{\theta }) \in ]0,n[\) we have \(p(x;\eta ) = (\eta /n)^{x}(1 -\eta /n)^{n-x}\), which is the standard presentation of the binomial density.

The standard presentation is defined also at η = 0 and η = n, where the exponential formula is not. In fact, the conjugate \(\psi _{{\ast}}(\eta )\) of \(\psi (\theta ) = n\log \left (1 +\mathrm{ e}^{\theta }\right )\) is

$$\displaystyle{ \psi _{{\ast}}(\eta ) = \left \{\begin{array}{@{}l@{\quad }l@{}} +\infty \quad &\mbox{ if $\eta < 0$ or $\eta > n$,} \\ 0 \quad &\mbox{ if $\eta = 0,n$,} \\ \eta \log \left ( \frac{\eta }{n-\eta }\right ) - n\log \left ( \frac{n} {n-\eta }\right )\quad &\mbox{ if $0 <\eta < n$.} \end{array} \right. }$$
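
For completeness, the middle case is the one-line Legendre computation (our addition): with \(\eta =\nabla \psi (\theta ) = n\mathrm{e}^{\theta }/(1 +\mathrm{ e}^{\theta })\), i.e., \(\theta =\log \left (\eta /(n-\eta )\right )\) and \(1 +\mathrm{ e}^{\theta } = n/(n-\eta )\),

$$\displaystyle{ \psi _{{\ast}}(\eta ) =\theta \eta -\psi (\theta ) =\eta \log \left ( \frac{\eta }{n-\eta }\right ) - n\log \left ( \frac{n} {n-\eta }\right ),\quad 0 <\eta < n, }$$

and differentiation gives \(\psi _{{\ast}}^{{\prime}}(\eta ) =\log \left (\eta /(n-\eta )\right ) =\theta (\eta )\), which is used in the display below.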

We have

$$\displaystyle\begin{array}{rcl} \log p(x;\eta )& =& \log \left ( \frac{\eta } {n-\eta }\right )(x-\eta ) +\psi _{{\ast}}(\eta ),\quad \eta \in ]0,n[ {}\\ & =& \psi _{{\ast}}^{{\prime}}(\eta )(x-\eta ) +\psi _{ {\ast}}(\eta )\leqslant \psi _{{\ast}}(x). {}\\ \end{array}$$

For \(x\neq 0,n\), the sign of \(\psi _{{\ast}}^{{\prime}}(\eta )(x-\eta )\) is eventually negative as \(\eta \rightarrow 0\) or \(\eta \rightarrow n\), hence

$$\displaystyle{ \lim _{\eta \rightarrow 0,n}\log p(x;\eta ) =\lim _{\eta \rightarrow 0,n}\left (\psi _{{\ast}}^{{\prime}}(\eta )(x-\eta ) +\psi _{ {\ast}}(\eta )\right ) = -\infty. }$$

If x = 0 or x = n, the sign of \(\psi _{{\ast}}^{{\prime}}(\eta )(0-\eta )\), respectively \(\psi _{{\ast}}^{{\prime}}(\eta )(n-\eta )\), is eventually positive as \(\eta \rightarrow 0\), respectively \(\eta \rightarrow n\), so that the limit of \(\log p(x;\eta )\) is bounded above by \(0 =\psi _{{\ast}}(x)\).

The argument above is in fact general. It was observed in [5] that the Bregman divergence \(D_{\psi _{{\ast}}}(x\Vert \eta ) =\psi _{{\ast}}(x) -\psi _{{\ast}}(\eta ) -\psi _{{\ast}}^{{\prime}}(\eta )(x-\eta )\geqslant 0\) provides an interesting form of the density, namely \(p(x;\eta ) =\mathrm{ e}^{-D_{\psi _{{\ast}}}(x\Vert \eta )}\mathrm{e}^{\psi _{{\ast}}(x)} \propto \mathrm{ e}^{-D_{\psi _{{\ast}}}(x\Vert \eta )}\).
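
The Bregman form is easy to check numerically; the following sketch (ours) compares it with the standard binomial probability function:

```python
# Our check of p(x; eta) = exp(-D_{psi*}(x||eta)) * exp(psi*(x)) against the
# binomial pmf mu(x) * p(x; eta) = binom(n, x) (eta/n)^x (1 - eta/n)^(n-x).
import numpy as np
from math import comb, log

n = 10

def psi_star(e):
    return 0.0 if e in (0, n) else e * log(e / (n - e)) - n * log(n / (n - e))

def bregman(x, e):                                  # D_{psi*}(x || eta) >= 0
    return psi_star(x) - psi_star(e) - log(e / (n - e)) * (x - e)

eta = 3.7
for x in range(n + 1):
    pmf = comb(n, x) * (eta / n) ** x * (1 - eta / n) ** (n - x)
    brg = comb(n, x) * np.exp(-bregman(x, eta) + psi_star(x))
    assert abs(pmf - brg) < 1e-10
print("Bregman form of the binomial density verified")
```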

33.3 Exponential Manifold

The set of positive probability densities \(\mathcal{P}_{>}\) is a convex subset of \(L^{1}(\mu )\). Given \(p \in \mathcal{P}_{>}\), every \(q \in \mathcal{P}_{>}\) can be written as \(q =\mathrm{ e}^{v} \cdot p\), where \(v =\log \left (\frac{q} {p}\right )\). Below we summarize, together with a few new details, results from [21, 22] and the references therein, and from the unpublished [24].

Definition 33.1 (Orlicz Φ-Space [11], [20, Chapter II], [23]).

Define \(\varPhi (y) =\cosh y - 1\). The Orlicz Φ-space \(L^{\varPhi }(p)\) is the vector space of all random variables u such that \(\mathbb{E}_{p}\left [\varPhi (\alpha u)\right ]\) is finite for some α > 0. Equivalently, it is the set of all random variables u whose Laplace transform under \(p\cdot \mu\), \(t\mapsto \hat{u}_{p}(t) = \mathbb{E}_{p}\left [\mathrm{e}^{tu}\right ]\), is finite in a neighborhood of 0. We denote by \(M^{\varPhi }(p) \subset L^{\varPhi }(p)\) the vector space of random variables whose Laplace transform is finite everywhere.

Proposition 33.1 (Properties of the Φ-Space).

  1. The set \(S_{\leqslant 1} = \left \{u \in L^{\varPhi }(p): \mathbb{E}_{p}\left [\varPhi (u)\right ]\leqslant 1\right \}\) is the closed unit ball of the complete norm

    $$\displaystyle{ \left \Vert u\right \Vert _{p} =\inf \left \{\rho > 0: \mathbb{E}_{p}\left [\varPhi \left (\frac{u} {\rho } \right )\right ]\leqslant 1\right \} }$$

    on the Φ-space. For all a⩾1 the continuous injections \(L^{\infty }(\mu )\hookrightarrow L^{\varPhi }(p)\hookrightarrow L^{a}(p)\) hold.

  2. \(\left \Vert u\right \Vert _{p} = 1\) if either \(\mathbb{E}_{p}\left [\varPhi (u)\right ] = 1\), or \(\mathbb{E}_{p}\left [\varPhi (u)\right ] < 1\) and \(\mathbb{E}_{p}\left [\varPhi \left (\frac{u} {\rho } \right )\right ] = \infty \) for all ρ > 1. If \(\left \Vert u\right \Vert _{p} > 1\), then \(\left \Vert u\right \Vert _{p}\leqslant \mathbb{E}_{p}\left [\varPhi (u)\right ]\). In particular, \(\lim _{\left \Vert u\right \Vert _{p}\rightarrow \infty }\mathbb{E}_{p}\left [\varPhi \left (u\right )\right ] = \infty \).

  3. \(M^{\varPhi }(p)\) is a closed and separable subspace of \(L^{\varPhi }(p)\).

  4. \(L^{\varPhi }(p) = L^{\varPhi }(q)\) as Banach spaces if, and only if, \(\theta \mapsto \int p^{1-\theta }q^{\theta }\ d\mu\) is finite on a neighborhood of [0,1].

Proof.

  1. See [11], [20, Chapter II], [23].

  2. The function \(\mathbb{R}_{\geqslant 0} \ni t\mapsto \hat{u}(t) = \mathbb{E}_{p}\left [\varPhi (tu)\right ]\) is increasing, convex, and lower semi-continuous. If \(\hat{u}(t_{+})\) is finite for some \(t_{+} > 1\), we are in the first case and \(\hat{u}(1) = 1\); otherwise, \(\hat{u}(1)\leqslant 1\). If \(\left \Vert u\right \Vert _{p} > a > 1\), then \(\left \Vert \frac{a} {\left \Vert u\right \Vert _{p}}u\right \Vert _{p} > 1\), hence

    $$\displaystyle{ 1 < \mathbb{E}_{p}\left [\varPhi \left ( \frac{a} {\left \Vert u\right \Vert _{p}}u\right )\right ]\leqslant \frac{a} {\left \Vert u\right \Vert _{p}}\mathbb{E}_{p}\left [\varPhi \left (u\right )\right ], }$$

    and therefore \(\left \Vert u\right \Vert _{p} < a\,\mathbb{E}_{p}\left [\varPhi \left (u\right )\right ]\) for all \(a \in \,]1,\left \Vert u\right \Vert _{p}[\); letting \(a \rightarrow 1\) gives \(\left \Vert u\right \Vert _{p}\leqslant \mathbb{E}_{p}\left [\varPhi (u)\right ]\).

  3. See [11], [20, Chapter II], [23].

  4. See [9, 24].
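
In a concrete case the norm of Proposition 33.1(1) can be computed by bisection on \(\rho \mapsto \mathbb{E}_{p}\left [\varPhi (u/\rho )\right ]\). The sketch below (our code) uses the standard normal p and u(x) = bx, for which \(\mathbb{E}_{p}\left [\varPhi (u/\rho )\right ] =\mathrm{ e}^{b^{2}/(2\rho ^{2})} - 1\) in closed form, so that \(\left \Vert u\right \Vert _{p} = \left \vert b\right \vert /\sqrt{2\log 2}\):

```python
# Our sketch: Luxemburg norm by bisection; mean_phi(rho) = E_p[Phi(u/rho)]
# is decreasing in rho, and the norm is where it crosses the level 1.
import numpy as np

def luxemburg_norm(mean_phi, lo=1e-6, hi=1e6, tol=1e-10):
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mean_phi(mid) > 1.0:
            lo = mid                       # E_p[Phi(u/mid)] > 1: norm exceeds mid
        else:
            hi = mid
    return hi

b = 2.0
norm = luxemburg_norm(lambda rho: np.expm1(0.5 * (b / rho) ** 2))
print(norm, abs(b) / np.sqrt(2 * np.log(2)))       # the two values agree
```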

Example 4 (Boolean State Space).

In the case of a finite state space, the moment generating function is finite everywhere, but its computation can be challenging. We discuss in particular the Boolean case \(\varOmega = \left \{+1,-1\right \}^{n}\) with counting reference measure μ and uniform density \(p(x) = 2^{-n}\), x ∈ Ω. In this case there is a huge literature from statistical physics, e.g., [10, Ch. VII]. A generic real function on Ω—called pseudo-Boolean [7] in the combinatorial optimization literature—has the form \(u(x) =\sum _{\alpha \in L}\hat{u}(\alpha )x^{\alpha }\), with \(L = \left \{0,1\right \}^{n}\), \(x^{\alpha } =\prod _{ i=1}^{n}x_{i}^{\alpha _{i}}\), \(\hat{u}(\alpha ) = 2^{-n}\sum _{x\in \varOmega }u(x)x^{\alpha }\).

As \(\mathrm{e}^{ax} =\cosh (a) +\sinh (a)x\) when \(x^{2} = 1\), i.e., \(x = \pm 1\), we have

$$\displaystyle\begin{array}{rcl} \mathrm{e}^{tu(x)}& = & \exp \left (\sum _{\alpha \in \mathop{Supp}\nolimits \hat{u}}t\hat{u}(\alpha )x^{\alpha }\right ) =\prod _{\alpha \in \mathop{Supp}\nolimits \hat{u}}\mathrm{e}^{t\hat{u}(\alpha )x^{\alpha }} {}\\ & = & \prod _{\alpha \in \mathop{Supp}\nolimits \hat{u}}\left (\cosh (t\hat{u}(\alpha )) +\sinh (t\hat{u}(\alpha ))x^{\alpha }\right ) {}\\ & =&\sum _{B\subset \mathop{Supp}\nolimits \hat{u}}\prod _{\alpha \in B^{c}}\cosh (t\hat{u}(\alpha ))\prod _{\alpha \in B}\sinh (t\hat{u}(\alpha ))x^{\sum _{\alpha \in B}\alpha }. {}\\ \end{array}$$

The moment generating function of u under the uniform density p is

$$\displaystyle{ t\mapsto \sum _{B\in \mathcal{B}(\hat{u})}\prod _{\alpha \in B^{c}}\cosh (t\hat{u}(\alpha ))\prod _{\alpha \in B}\sinh (t\hat{u}(\alpha )), }$$

where \(\mathcal{B}(\hat{u})\) are those \(B \subset \mathop{ Supp}\nolimits \hat{u}\) such that \(\sum _{\alpha \in B}\alpha = 0\mod 2\). We have

$$\displaystyle{ \mathbb{E}_{p}\left [\varPhi (tu)\right ] =\sum _{B\in \mathcal{B}_{0}(\hat{u})}\prod _{\alpha \in B^{c}}\cosh (t\hat{u}(\alpha ))\prod _{\alpha \in B}\sinh (t\hat{u}(\alpha )) - 1, }$$

where \(\mathcal{B}_{0}(\hat{u})\) are those \(B \subset \mathop{ Supp}\nolimits \hat{u}\) such that \(\sum _{\alpha \in B}\alpha = 0\mod 2\) and, in addition, \(\#B = 0\mod 2\): the terms with an odd number of \(\sinh\) factors cancel between \(\mathrm{e}^{tu}\) and \(\mathrm{e}^{-tu}\) in \(\varPhi (tu) = \frac{1}{2}\left (\mathrm{e}^{tu} +\mathrm{ e}^{-tu}\right ) - 1\).

If S is the \(\left \{1,\ldots,n\right \} \times \mathop{ Supp}\nolimits \hat{u}\) matrix with elements \(\alpha _{i}\), we want to solve the system \(Sb = 0\mod 2\) to find all elements of \(\mathcal{B}\); we add the equation \(\sum b = 0\mod 2\) to find \(\mathcal{B}_{0}\). The simplest example is \(u(x) =\sum _{ i=1}^{n}c_{i}x_{i}\), for which \(\mathcal{B} = \mathcal{B}_{0} = \left \{\emptyset \right \}\) and the moment generating function reduces to \(\prod _{i=1}^{n}\cosh (tc_{i})\).
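
For small supports both parity conditions can be checked by direct enumeration. The following sketch (our code, with arbitrary illustrative coefficients) computes the moment generating function from the formula above and compares it with a brute-force average over \(\left \{+1,-1\right \}^{n}\):

```python
# Our sketch: enumerate B in Supp(u-hat) with S b = 0 mod 2 and compare the
# resulting moment generating function with a brute-force sum over Omega.
import itertools
import numpy as np

n = 4
support = [(1, 1, 0, 0), (0, 1, 1, 0), (1, 0, 1, 0)]        # multi-indices alpha
coeff = dict(zip(support, [0.7, -0.3, 0.5]))                # u-hat(alpha), ours

def mgf_parity(t):
    total = 0.0
    for r in range(len(support) + 1):
        for B in itertools.combinations(support, r):
            if all(sum(a[i] for a in B) % 2 == 0 for i in range(n)):  # S b = 0 mod 2
                total += np.prod([np.sinh(t * coeff[a]) if a in B
                                  else np.cosh(t * coeff[a]) for a in support])
    return total

def mgf_brute(t):
    xs = itertools.product([-1, 1], repeat=n)
    return np.mean([np.exp(t * sum(c * np.prod([x[i] for i in range(n) if a[i]])
                                   for a, c in coeff.items())) for x in xs])

print(mgf_parity(1.3), mgf_brute(1.3))                      # the two values agree
```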

Example 5 (The Sphere is Not Smooth in General).

We look for the moment generating function of the density

$$\displaystyle{ p(x) \propto (a + x)^{-\frac{3} {2} }\mathrm{e}^{-x},\quad x > 0, }$$

where a is a positive constant. From the incomplete gamma integral

$$\displaystyle{ \varGamma \left (-\frac{1} {2},x\right ) =\int _{ x}^{\infty }s^{-\frac{1} {2} -1}\mathrm{e}^{-s}\ ds,\quad x > 0, }$$

we have for θ, a > 0,

$$\displaystyle{ \frac{d} {dx}\varGamma \left (-\frac{1} {2},\theta (a + x)\right ) = -\theta ^{-\frac{1} {2} }\mathrm{e}^{-\theta a}(a + x)^{-\frac{3} {2} }\mathrm{e}^{-\theta x}. }$$

We have, for \(\theta \in \mathbb{R}\),

$$\displaystyle{ C(\theta,a) =\int _{ 0}^{\infty }(a+x)^{-\frac{3} {2} }\mathrm{e}^{-\theta x}\ dx = \left \{\begin{array}{@{}l@{\quad }l@{}} \sqrt{\theta }\,\mathrm{e}^{\theta a}\varGamma \left (-\frac{1} {2},\theta a\right )\quad &\mbox{ if $\theta > 0$,} \\ 2a^{-\frac{1} {2} } \quad &\mbox{ if $\theta = 0$,} \\ +\infty \quad &\mbox{ if $\theta < 0$,} \end{array} \right. }$$

or, equivalently, \(C(\theta,a) = 2a^{-\frac{1} {2} } - 2\sqrt{\pi \theta }\,\mathrm{e}^{\theta a}R_{1/2,1}(\theta a)\) if \(\theta \geqslant 0\), \(+\infty \) otherwise, where \(R_{1/2,1}\) is the survival function of the Gamma distribution with shape 1∕2 and scale 1.
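
The expressions for C(θ, a) can be cross-checked numerically. In the sketch below (ours), \(R_{1/2,1}\) is scipy.special.gammaincc(0.5, ·), the regularized upper incomplete gamma function:

```python
# Our check of C(theta, a): numerical quadrature against the closed form
# 2*a**(-1/2) - 2*sqrt(pi*theta)*exp(theta*a)*R_{1/2,1}(theta*a).
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaincc

a = 1.5

def C_quad(theta):
    return quad(lambda x: (a + x) ** -1.5 * np.exp(-theta * x), 0, np.inf)[0]

def C_formula(theta):
    if theta < 0:
        return np.inf
    return 2 / np.sqrt(a) - 2 * np.sqrt(np.pi * theta) \
        * np.exp(theta * a) * gammaincc(0.5, theta * a)

for theta in [0.0, 0.5, 1.0, 2.0]:
    print(theta, C_quad(theta), C_formula(theta))   # the columns agree
```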

The density p is obtained with θ = 1,

$$\displaystyle{ p(x) = C(1,a)^{-1}(a + x)^{-\frac{3} {2} }\mathrm{e}^{-x} = \frac{(a + x)^{-\frac{3} {2} }\mathrm{e}^{-x}} {\mathrm{e}^{a}\varGamma \left (-\frac{1} {2},a\right )},\quad x > 0, }$$

and, for the random variable u(x) = x, the function

$$\displaystyle\begin{array}{rcl} \alpha \mapsto \mathbb{E}_{p}\left [\varPhi (\alpha u)\right ]& =& \frac{1} {\mathrm{e}^{a}\varGamma \left (-\frac{1} {2},a\right )}\int _{0}^{\infty }(a + x)^{-\frac{3} {2} } \frac{\mathrm{e}^{-(1-\alpha )x} +\mathrm{ e}^{-(1+\alpha )x}} {2} \ dx - 1 {}\\ & =& \frac{C(1-\alpha,a) + C(1+\alpha,a)} {2C(1,a)} - 1 {}\\ \end{array}$$

is convex and lower semi-continuous on \(\mathbb{R}\), finite for α ∈ [−1, 1] and infinite otherwise, hence not steep. Its value at α = 1 is

$$\displaystyle\begin{array}{rcl} \mathbb{E}_{p}\left [\varPhi (u)\right ]& =& \frac{1} {\mathrm{e}^{a}\varGamma \left (-\frac{1} {2},a\right )}\int _{0}^{\infty }(a + x)^{-\frac{3} {2} } \frac{1 +\mathrm{ e}^{-2x}} {2} \ dx - 1 {}\\ & =& \frac{C(0,a) + C(2,a)} {2C(1,a)} - 1. {}\\ \end{array}$$

Example 6 (Normal Density).

Let \(p(x) = (2\pi )^{-1/2}\mathrm{e}^{-(1/2)x^{2} }\). Consider a generic quadratic polynomial \(u(x) = a + bx + \frac{1} {2}cx^{2}\). We have for tc ≠ 1

$$\displaystyle{ t\left (a+bx+ \frac{1} {2}cx^{2}\right )-\frac{1} {2}x^{2} = -\frac{1 - tc} {2} \left (x - \frac{tb} {1 - tc}\right )^{2} + \frac{1} {2}\, \frac{t^{2}b^{2} + 2ta(1 - tc)} {1 - tc}, }$$

hence

$$\displaystyle{ \mathbb{E}_{p}\left [\mathrm{e}^{tu}\right ] = \left \{\begin{array}{@{}l@{\quad }l@{}} +\infty \quad &\mbox{ if $tc\geqslant 1$,} \\ (1 - tc)^{-\frac{1} {2} }\exp \left (\dfrac{1} {2}\, \dfrac{t^{2}b^{2} + 2ta(1 - tc)} {1 - tc} \right )\quad &\mbox{ if $tc < 1$}. \end{array} \right. }$$

If, and only if, − 1 < c < 1, we have

$$\displaystyle\begin{array}{rcl} \mathbb{E}_{p}\left [\varPhi (u)\right ]& =& \frac{1} {2}(1 - c)^{-\frac{1} {2} }\exp \left (\dfrac{1} {2}\, \dfrac{b^{2} + 2a(1 - c)} {1 - c} \right ) {}\\ & & +\frac{1} {2}(1 + c)^{-\frac{1} {2} }\exp \left (\dfrac{1} {2}\, \dfrac{b^{2} - 2a(1 + c)} {1 + c} \right ) - 1. {}\\ \end{array}$$
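
The closed form is easy to validate against numerical integration; a sketch (ours):

```python
# Our check of E_p[Phi(u)] for the quadratic u under the standard normal p.
import numpy as np
from scipy.integrate import quad

a, b, c = 0.3, 0.7, 0.4                            # requires -1 < c < 1

u = lambda x: a + b * x + 0.5 * c * x ** 2
dens = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
numeric = quad(lambda x: (np.cosh(u(x)) - 1.0) * dens(x), -np.inf, np.inf)[0]

closed = (0.5 / np.sqrt(1 - c) * np.exp(0.5 * (b**2 + 2 * a * (1 - c)) / (1 - c))
          + 0.5 / np.sqrt(1 + c) * np.exp(0.5 * (b**2 - 2 * a * (1 + c)) / (1 + c))
          - 1.0)

print(numeric, closed)                             # the two values agree
```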

33.4 Vector Bundles

Vector bundles are constructed as sets of couples (p,v), where \(p \in \mathcal{P}_{>}\) and v belongs to some space of random variables such that \(\mathbb{E}_{p}\left [v\right ] = 0\). The tangent bundle is obtained when the vector space is \(L_{0}^{\varPhi }(p)\). The Hilbert bundle is defined as \(H\mathcal{P}_{>} = \left \{(p,v): p \in \mathcal{P}_{>},v \in L_{0}^{2}(p)\right \}\). We refer to [21] and [15], where charts and affine connections on the Hilbert bundle are derived from the isometric transport

$$\displaystyle{ L_{0}^{2}(p) \ni u\mapsto \sqrt{\frac{p} {q}}u -\left (1 + \mathbb{E}_{q}\left [\sqrt{\frac{p} {q}}\,\,\right ]\right )^{-1}\left (1 + \sqrt{\frac{p} {q}}\,\,\right )\mathbb{E}_{q}\left [\sqrt{\frac{p} {q}}u\right ] \in L_{0}^{2}(q). }$$

In turn, an isometric transport \(U_{p}^{q}: L_{0}^{2}(p) \rightarrow L_{0}^{2}(q)\) can be used to compute the derivative of a vector field in the Hilbert bundle, for example the derivative of the gradient of a relaxed function.
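
On a finite sample space the transport is a small linear map, and its two defining properties, centering under q and isometry, can be verified directly; the following sketch (our code) does so for random positive densities:

```python
# Our check of the isometric transport L^2_0(p) -> L^2_0(q) displayed above.
import numpy as np

rng = np.random.default_rng(1)
m = 6
p = rng.random(m); p /= p.sum()                    # strictly positive densities
q = rng.random(m); q /= q.sum()

E = lambda dens, v: np.sum(dens * v)               # expectation on the finite space

def transport(u):
    r = np.sqrt(p / q)
    return r * u - (1 + r) * E(q, r * u) / (1 + E(q, r))

u = rng.standard_normal(m); u -= E(p, u)           # u, v in L^2_0(p)
v = rng.standard_normal(m); v -= E(p, v)
Uu, Uv = transport(u), transport(v)

print(E(q, Uu), E(q, Uv))                          # approx 0: centered under q
print(E(p, u * v), E(q, Uu * Uv))                  # equal: the transport is isometric
```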

The resulting second-order structure is instrumental in computing the Hessian of the SR function, i.e., the derivative of its natural gradient. This makes it possible to design second-order approximation methods, as suggested in [1] for general Riemannian manifolds and applied to SR in [15]. A second-order structure is also used to define the curvature of a statistical manifold and, possibly, to compute its geodesics, see [6] for applications to optimization.