Contrast functions, also called divergence functions, are distance-like quantities which measure the asymmetric “proximity” of two probability density functions on a statistical manifold or statistical model \(\mathcal{S}\). A contrast function, D(p | | q), for density functions \(p,q \in \mathcal{S}\), is a smooth, non-negative function that vanishes for p = q. Eguchi [38, 39, 41] has shown that a contrast function D induces a Riemannian metric by its second order derivatives, and a pair of dual connections by its third order derivatives.

This chapter introduces contrast functionals on statistical manifolds, which are natural extensions of Kullback–Leibler relative entropy from statistical models, and analyzes their corresponding geometric structures and how these interact with the dualistic structure of a statistical manifold. The chapter also investigates the geometry generated by a contrast functional on the space of probability distributions of a statistical model and provides examples of contrast functions.

It has been shown in Chap. 4 that the Kullback–Leibler relative entropy is positive and non-degenerate, that its first variation along the diagonal ξ 0 = ξ vanishes, and that its Hessian along the diagonal defines the Fisher metric.

The contrast functions mimic the aforementioned properties of the Kullback–Leibler relative entropy. The only difference in the new context is that there are no density functions and no formula of expectation type can be used here.

We overcome this flaw by defining the contrast functions abstractly in two stages: (i) on an open set of \(\mathbb{R}^{k}\); (ii) on a smooth manifold \(\mathcal{S}\).

1 Contrast Functions on \(\mathbb{R}^{k}\)

Consider an open set \(\mathbb{E}\) in \(\mathbb{R}^{k}\), and let \(\xi _{1},\xi _{2} \in \mathbb{E}\). A contrast function on \(\mathbb{E}\) is a smooth function \(D(\,\cdot \,\|\,\cdot \,): \mathbb{E} \times \mathbb{E} \rightarrow \mathbb{R}\) satisfying the following properties:

  1. (i)

    positive: \(D(\xi _{1}\vert \vert \xi _{2}) \geq 0\), \(\forall \xi _{1},\xi _{2} \in \mathbb{E};\)

  2. (ii)

    non-degenerate: \(D(\xi _{1}\vert \vert \xi _{2}) = 0\Longleftrightarrow\xi _{1} =\xi _{2};\)

  3. (iii)

    the first variation along the diagonal \(\{\xi _{1} =\xi _{2}\}\) vanishes:

    $$\displaystyle{\partial _{\xi _{1}^{i}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} = \partial _{\xi _{2}^{i}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} = 0;}$$
  4. (iv)

    the Hessian along the diagonal \(\{\xi _{1} =\xi _{2}\}\)

    $$\displaystyle{g_{ij}(\xi _{1}) = \partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{2}=\xi _{1}}}$$

    is strictly positive definite and smooth with respect to ξ 1.

A few comments on the notation are in order. Even though the function D(ξ 1 | | ξ 2) is not a distance (symmetry and the triangle inequality are not satisfied), it is a useful distance-like measure of the separation between two points ξ 1, ξ 2. This separation is denoted by the symbol | | .

Another observation worth making is that part (iii) of the definition is redundant; it is a consequence of parts (i) and (ii), as follows:

$$\displaystyle\begin{array}{rcl} \lim _{\epsilon \searrow 0}\frac{D(\xi _{1} +\epsilon \vert \vert \xi _{1}) - D(\xi _{1}\vert \vert \xi _{1})} {\epsilon } =\lim _{\epsilon \searrow 0}\frac{D(\xi _{1} +\epsilon \vert \vert \xi _{1})} {\epsilon } \geq 0& & {}\\ \lim _{\epsilon \nearrow 0}\frac{D(\xi _{1} +\epsilon \vert \vert \xi _{1}) - D(\xi _{1}\vert \vert \xi _{1})} {\epsilon } =\lim _{\epsilon \nearrow 0}\frac{D(\xi _{1} +\epsilon \vert \vert \xi _{1})} {\epsilon } \leq 0,& & {}\\ \end{array}$$

Since D is smooth, the two one-sided limits coincide, which implies that the limit is equal to 0. We assumed \(\xi _{1} \in \mathbb{R}\) for notational simplicity, but the result holds in multiple dimensions.

We note two facts, which are direct consequences of the definition:

  1. (1)

    The point ξ 0 is a global minimum of the map ξ → D(ξ 0 ||ξ).

  2. (2)

    The quadratic approximation of a contrast function is given by

    $$\displaystyle{ D(\xi _{1}\vert \vert \xi _{2}) = \frac{1} {2}\sum _{i,j}g_{ij}(\xi _{1})(\xi _{1}^{i} -\xi _{ 2}^{i})(\xi _{ 1}^{j} -\xi _{ 2}^{j}) + o(\|\xi _{ 1} -\xi _{2}\|^{2}) }$$
    (11.1.1)

    when ξ 2 −ξ 1 → 0.

Hence, for any two sufficiently close vectors \(\xi _{1},\xi _{2} \in \mathbb{E}\), the contrast function is approximated by half the squared length of their difference, measured in the inner product induced by the matrix g ij

$$\displaystyle{D(\xi _{1}\vert \vert \xi _{2}) \approx \frac{1} {2}\langle \xi _{1} -\xi _{2},\xi _{1} -\xi _{2}\rangle _{g} = \frac{1} {2}\|\xi _{1} -\xi _{2}\|_{g}^{2}.}$$
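The quality of this quadratic approximation can be checked numerically. The following is only a sketch under stated assumptions: it uses the one-dimensional function \(D(\xi _{1}\vert \vert \xi _{2}) =\xi _{2}/\xi _{1} -\ln (\xi _{2}/\xi _{1}) - 1\) on (0, ∞) as an illustrative contrast function, with hand-computed Hessian \(g(\xi _{1}) = 1/\xi _{1}^{2}\); the function names are hypothetical.

```python
import math

def contrast(xi1, xi2):
    # Illustrative contrast function (an assumption of this sketch):
    # D(xi1 || xi2) = xi2/xi1 - ln(xi2/xi1) - 1 on (0, inf).
    r = xi2 / xi1
    return r - math.log(r) - 1.0

def metric(xi1):
    # Hessian in the second argument on the diagonal, computed by hand.
    return 1.0 / xi1**2

def quadratic(xi1, xi2):
    # Right-hand side of the approximation: (1/2) g(xi1) (xi1 - xi2)^2.
    return 0.5 * metric(xi1) * (xi1 - xi2) ** 2

xi1 = 2.0
# Ratio of the contrast function to its quadratic approximation.
ratios = [contrast(xi1, xi1 + eps) / quadratic(xi1, xi1 + eps)
          for eps in (1e-1, 1e-2, 1e-3)]
print(ratios)  # the ratios approach 1 as eps shrinks
```

The ratio tends to 1 as the two points approach each other, which is what the o-term in (11.1.1) expresses.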

In the following we show how a contrast function can be induced by a strictly convex function.

Proposition 11.1.1

Let \(\varphi: \mathbb{E} \rightarrow \mathbb{R}\) be a strictly convex function. Then

$$\displaystyle\begin{array}{rcl} D(\xi _{0}\vert \vert \xi )& =& \varphi (\xi ) -\varphi (\xi _{0}) -\sum _{j}\partial _{j}\varphi (\xi _{0})(\xi ^{j} -\xi _{ 0}^{j}) \\ & =& \varphi (\xi ) -\varphi (\xi _{0}) -\langle \partial \varphi (\xi _{0}),\xi -\xi _{0}\rangle {}\end{array}$$
(11.1.2)

is a contrast function on \(\mathbb{E}\) .

Proof:

  1. (i)

    Positivity: since the graph of the strictly convex function \(\varphi\) is above the tangent plane at each point, we have

    $$\displaystyle{ \varphi (\xi ) \geq \varphi (\xi _{0}) +\sum _{j}\partial _{j}\varphi (\xi _{0})(\xi ^{j} -\xi _{ 0}^{j}). }$$
    (11.1.3)

    This implies D(ξ 0 | | ξ) ≥ 0.

  2. (ii)

    Non-degenerate: Since the equality in (11.1.3) occurs only for ξ = ξ 0, it follows that D(ξ 0 | | ξ) = 0 implies ξ = ξ 0.

  3. (iii)

    Differentiating with respect to ξ i yields

    $$\displaystyle\begin{array}{rcl} \partial _{\xi _{i}}D(\xi _{0}\vert \vert \xi )& =& \partial _{\xi _{i}}\varphi (\xi ) - \partial _{\xi _{i}}\varphi (\xi _{0}), {}\\ \end{array}$$

    and hence \(\partial _{\xi _{i}}D(\xi _{0}\vert \vert \xi )_{\vert \xi =\xi _{0}} = 0\).

  4. (iv)

    Since the function \(\varphi\) is strictly convex, and

    $$\displaystyle{ \partial _{\xi _{i}}\partial _{\xi _{j}}D(\xi _{0}\vert \vert \xi ) = \partial _{\xi _{i}}\partial _{\xi _{j}}\varphi (\xi ) }$$
    (11.1.4)

    it follows that \(\partial _{\xi _{i}}\partial _{\xi _{j}}D(\xi _{0}\vert \vert \xi )\) is strictly positive definite. Hence D(ξ 0 | | ξ) satisfies the properties of a contrast function.

 ■ 

We shall discuss in the following a few particular cases.

Example 11.1.2 (Exponential Model)

Consider the convex function \(\varphi (\xi ) = -\ln \xi\), with ξ > 0. The induced contrast function is given by

$$\displaystyle{D(\xi _{0}\vert \vert \xi ) = \frac{\xi } {\xi _{0}} -\ln \frac{\xi } {\xi _{0}} - 1,}$$

which is exactly the Kullback–Leibler relative entropy for the exponential distribution. It is worth noting that the convex function \(\varphi (\xi ) =\xi -\ln \xi\) induces the same contrast function. Hence, there is no one-to-one correspondence between convex functions and contrast functions.
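The computation in this example can be reproduced with a few lines of code. The sketch below implements formula (11.1.2) in one dimension and compares the two convex functions mentioned above with the displayed closed form; the helper names and the sample points are hypothetical, and the derivatives are supplied by hand.

```python
import math

def bregman(phi, dphi, xi0, xi):
    # Formula (11.1.2) in one dimension:
    # D(xi0 || xi) = phi(xi) - phi(xi0) - phi'(xi0) * (xi - xi0).
    return phi(xi) - phi(xi0) - dphi(xi0) * (xi - xi0)

# phi_a(xi) = -ln(xi) and phi_b(xi) = xi - ln(xi), with their derivatives.
phi_a, dphi_a = (lambda x: -math.log(x)), (lambda x: -1.0 / x)
phi_b, dphi_b = (lambda x: x - math.log(x)), (lambda x: 1.0 - 1.0 / x)

def closed_form(xi0, xi):
    # The expression displayed in Example 11.1.2.
    return xi / xi0 - math.log(xi / xi0) - 1.0

points = [(0.5, 2.0), (1.0, 1.0), (3.0, 0.7)]
values = [(bregman(phi_a, dphi_a, a, b),
           bregman(phi_b, dphi_b, a, b),
           closed_form(a, b)) for a, b in points]
for triple in values:
    print(triple)  # the three entries agree up to rounding
```

The two potentials differ by the affine term ξ, and an affine term cancels in (11.1.2), which is why both induce the same contrast function.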

Example 11.1.3

The convex function \(\varphi (\xi ) =\xi ^{2}-\ln \xi\), with ξ > 0, induces the contrast function

$$\displaystyle{D(\xi _{0}\vert \vert \xi ) = (\xi -\xi _{0})^{2} + \frac{\xi } {\xi _{0}} -\ln \frac{\xi } {\xi _{0}} - 1.}$$

Example 11.1.4

If we consider \(\varphi (\xi ) =\xi ^{2}\), with ξ > 0, the induced contrast function is

$$\displaystyle{D(\xi _{0}\vert \vert \xi ) = (\xi -\xi _{0})^{2}.}$$

Not all contrast functions are induced by strictly convex functions. For instance, one can show that

$$\displaystyle{D(\xi _{0}\vert \vert \xi ) = \frac{(\xi -\xi _{0})^{2}} {\xi _{0}\xi ^{2}} }$$

is a contrast function on \((0,\infty )^{2}\), which cannot be written in the form of formula (11.1.2). We note that this contrast function is related to the problem of the minimum chi-squared estimator, as described in Kass and Vos [49], p. 244. There are many other contrast functions that are not of the form (11.1.2), for instance most f-divergences, see Sect. 12.2. It can be shown that a contrast function derived from a strictly convex function by formula (11.1.2) is a dually flat contrast function.
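One way to see that the function \((\xi -\xi _{0})^{2}/(\xi _{0}\xi ^{2})\) is not of the form (11.1.2) uses relation (11.1.4): for a contrast function induced by \(\varphi\), the second derivative in ξ equals \(\varphi ''(\xi )\) and therefore cannot depend on ξ 0. The sketch below checks by finite differences that this fails here. The step size and the function names are assumptions of the illustration.

```python
def contrast(xi0, xi):
    # D(xi0 || xi) = (xi - xi0)^2 / (xi0 * xi^2), for xi0, xi > 0.
    return (xi - xi0) ** 2 / (xi0 * xi**2)

def second_derivative_in_xi(xi0, xi, step=1e-4):
    # Central finite difference for d^2/dxi^2 of D(xi0 || xi).
    return (contrast(xi0, xi + step) - 2.0 * contrast(xi0, xi)
            + contrast(xi0, xi - step)) / step**2

# Same evaluation point xi = 1, two different base points xi0.
# Analytically the second derivative is -4/xi^3 + 6*xi0/xi^4.
at_xi0_1 = second_derivative_in_xi(1.0, 1.0)  # close to 2
at_xi0_2 = second_derivative_in_xi(2.0, 1.0)  # close to 8
print(at_xi0_1, at_xi0_2)
```

Since the two values differ, no single function \(\varphi\) can reproduce this contrast function through (11.1.2).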

It is worth noting that the definition of the contrast function adopted by Kass and Vos [49], p.240, is slightly modified, replacing condition (iv) by the following condition:

(iv′):

the matrix

$$\displaystyle{g_{ij}(\xi _{1}) = \partial _{\xi _{1}^{i}}\partial _{\xi _{1}^{j}}D(\xi _{1}\vert \vert \xi _{2})}$$

is positive definite and a smooth function of ξ 1 alone.

The contrast function given by formula (11.1.2) is sometimes called Bregman divergence, see Bregman [20], and it is widely used in convex optimization, see Bauschke [14], Bauschke and Combettes [16], and Bauschke et al. [15].

The term “contrast function” has been defined slightly differently by other authors, and under different names (divergence, yoke, etc.); see Eguchi [40], Rao [72] and Barndorff-Nielsen [11].

2 Contrast Functions on a Manifold

Let \(\mathcal{S}\) be a smooth manifold. A contrast function on \(\mathcal{S}\) is a smooth mapping \(D_{\mathcal{S}}(\,\cdot \,\|\,\cdot \,): \mathcal{S}\times \mathcal{S}\rightarrow \mathbb{R}\), such that any parametrization \(\phi: \mathbb{E} \rightarrow \mathcal{S}\) makes

$$\displaystyle{D(\xi _{1}\vert \vert \xi _{2}) = D_{\mathcal{S}}\big(\phi (\xi _{1})\vert \vert \phi (\xi _{2})\big)}$$

a contrast function on \(\mathbb{E}\). This definition was given for the first time in Amari [5].

We note the local character of a contrast function on a manifold. If \(p_{1},p_{2} \in \mathcal{S}\) belong to the same coordinate chart, there are \(\xi _{1},\xi _{2} \in \mathbb{E}\) such that ϕ(ξ i ) = p i , and then we have \(D(\xi _{1}\vert \vert \xi _{2}) = D_{\mathcal{S}}\big(p_{1}\vert \vert p_{2}\big)\). Since there might be no coordinate chart that contains both points p 1, p 2, the contrast function \(D_{\mathcal{S}}(\,\cdot \,\|\,\cdot \,)\) makes sense only locally. In general, there might be no globally defined contrast function on a manifold \(\mathcal{S}\).

The invariance of the contrast function with respect to charts is given in the following result.

Theorem 11.2.1

Consider two local parametrizations \(\phi: \mathbb{E}_{\xi } \rightarrow U\), \(\varphi: \mathbb{E}_{\eta } \rightarrow V\) on the manifold \(\mathcal{S}\) . If

$$\displaystyle{D(\xi _{1}\vert \vert \xi _{2}) = D_{\mathcal{S}}\big(\phi (\xi _{1})\vert \vert \phi (\xi _{2})\big)}$$

is a contrast function on the parameter space \(\mathbb{E}_{\xi }\) , then

$$\displaystyle{D(\eta _{1}\vert \vert \eta _{2}) = D_{\mathcal{S}}\big(\varphi (\eta _{1})\vert \vert \varphi (\eta _{2})\big)}$$

is also a contrast function on the parameter space \(\mathbb{E}_{\eta }\) .

Proof:

For any two points \(p_{1},p_{2} \in U \cap V \subset \mathcal{S}\) denote \(p_{1} =\phi (\xi _{1}) =\varphi (\eta _{1})\), \(p_{2} =\phi (\xi _{2}) =\varphi (\eta _{2})\). Let \(\psi: \mathbb{E}_{\xi } \rightarrow \mathbb{E}_{\eta }\), ψ(ξ) = η be the change of parametrization map, which is invertible as a composition of invertible maps \(\psi =\varphi ^{-1}\circ \phi\), see Fig. 11.1.

  1. (i)

    The positivity follows obviously from

    $$\displaystyle{D(\eta _{1}\vert \vert \eta _{2}) = D_{\mathcal{S}}\big(p_{1}\vert \vert p_{2}\big) = D(\xi _{1}\vert \vert \xi _{2}) \geq 0.}$$
  2. (ii)

    To check the non-degeneracy we note that D(η 1 | | η 2) = 0 implies D(ξ 1 | | ξ 2) = 0, and hence ξ 1 = ξ 2, or \(\psi ^{-1}(\eta _{1}) =\psi ^{-1}(\eta _{2})\). Since ψ −1 is one-to-one, we obtain η 1 = η 2.

  3. (iii)

    The fact that the first variation along the diagonal {η 1 = η 2} vanishes is a consequence of (i) and (ii).

    Figure 11.1: The parameterizations ϕ and \(\varphi\) on a manifold \(\mathcal{S}\)

  4. (iv)

    We first investigate how g ij changes when the parameter ξ is changed into η

    $$\displaystyle\begin{array}{rcl} g_{ij}(\xi )& =& g(\partial _{\xi ^{i}},\partial _{\xi ^{j}}) = g\Big(\frac{\partial \eta ^{r}} {\partial \xi ^{i}} \partial _{\eta ^{r}}, \frac{\partial \eta ^{k}} {\partial \xi ^{j}}\partial _{\eta ^{k}}\Big) {}\\ & =& \frac{\partial \eta ^{r}} {\partial \xi ^{i}} \frac{\partial \eta ^{k}} {\partial \xi ^{j}}g(\partial _{\eta ^{r}},\partial _{\eta ^{k}}) = \frac{\partial \eta ^{r}} {\partial \xi ^{i}} \frac{\partial \eta ^{k}} {\partial \xi ^{j}}\bar{g}_{rk}(\eta ), {}\\ \end{array}$$

    and hence

    $$\displaystyle{ g_{ij}(\xi ) = \frac{\partial \eta ^{r}} {\partial \xi ^{i}} \frac{\partial \eta ^{k}} {\partial \xi ^{j}}\bar{g}_{rk}(\eta ). }$$
    (11.2.5)

    Consider the points p 1 and p 2 infinitesimally close. Then writing the quadratic approximation formula (11.1.1) in differential form for D(ξ 1 | | ξ 2) and D(η 1 | | η 2) and combining with (11.2.5) and the chain rule yields

    $$\displaystyle\begin{array}{rcl} D(\xi _{1}\vert \vert \xi _{2})& =& \frac{1} {2}\sum _{i,j}g_{ij}(\xi _{1})d\xi ^{i}d\xi ^{j} \\ & =& \frac{1} {2}\sum _{i,j}\sum _{r,k}\bar{g}_{rk}(\eta _{1})\frac{\partial \eta ^{r}} {\partial \xi ^{i}} \frac{\partial \eta ^{k}} {\partial \xi ^{j}}d\xi ^{i}d\xi ^{j} {}\end{array}$$
    (11.2.6)
    $$\displaystyle\begin{array}{rcl} D(\eta _{1}\vert \vert \eta _{2})& =& \frac{1} {2}\sum _{r,k}h_{rk}(\eta _{1})d\eta ^{r}d\eta ^{k} \\ & =& \frac{1} {2}\sum _{i,j}\sum _{r,k}h_{rk}(\eta _{1})\frac{\partial \eta ^{r}} {\partial \xi ^{i}} \frac{\partial \eta ^{k}} {\partial \xi ^{j}}d\xi ^{i}d\xi ^{j}. {}\end{array}$$
    (11.2.7)

    Comparing (11.2.6) and (11.2.7) yields \(\bar{g}_{rk}(\eta ) = h_{rk}(\eta )\). Since \(\bar{g}_{rk}(\eta )\) is strictly positive definite, so is h rk (η). Hence D(η 1 | | η 2) verifies all the conditions of a contrast function.

 ■ 

Corollary 11.2.2

The diagonal part of the Hessians

$$\displaystyle{g_{ij}(\xi _{1}) = \partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{2}=\xi _{1}}}$$
$$\displaystyle{h_{ij}(\eta _{1}) = \partial _{\eta _{2}^{i}}\partial _{\eta _{2}^{j}}D(\eta _{1}\vert \vert \eta _{2})_{\vert \eta _{2}=\eta _{1}}}$$

are related by the following relation

$$\displaystyle{ g_{ij}(\xi _{1}) = \frac{\partial \eta ^{r}} {\partial \xi ^{i}} \frac{\partial \eta ^{k}} {\partial \xi ^{j}}h_{rk}(\eta _{1}). }$$
(11.2.8)
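Relation (11.2.8) can be illustrated in one dimension. The sketch below is not taken from the text: it assumes the contrast function \(D(\xi _{1}\vert \vert \xi _{2}) =\xi _{2}/\xi _{1} -\ln (\xi _{2}/\xi _{1}) - 1\) on (0, ∞) and the change of parameter η = ln ξ, and it computes both diagonal Hessians by finite differences.

```python
import math

def second_diag(D, x, step=1e-4):
    # Second derivative of D(x || y) in y, evaluated on the diagonal y = x.
    return (D(x, x + step) - 2.0 * D(x, x) + D(x, x - step)) / step**2

def D_xi(xi1, xi2):
    # Illustrative contrast function in the coordinate xi > 0 (assumption).
    r = xi2 / xi1
    return r - math.log(r) - 1.0

def D_eta(eta1, eta2):
    # The same function expressed in the coordinate eta = ln(xi).
    return D_xi(math.exp(eta1), math.exp(eta2))

xi = 2.0
eta = math.log(xi)
g = second_diag(D_xi, xi)          # close to 1/xi^2 = 0.25
h_metric = second_diag(D_eta, eta)  # close to 1
jac = 1.0 / xi                     # d(eta)/d(xi) at xi
lhs, rhs = g, jac * jac * h_metric
print(lhs, rhs)  # both sides of (11.2.8) agree up to discretization error
```

In this example the metric is constant in the η coordinate, and the factor \((\partial \eta /\partial \xi )^{2}\) accounts for the whole ξ-dependence of g.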

3 Induced Riemannian Metric

One of the useful consequences of the invariance property given by Theorem 11.2.1 is that a contrast function provides a unique Riemannian metric on the manifold \(\mathcal{S}\). This metric is the inner product \(g_{p}: T_{p}\mathcal{S}\times T_{p}\mathcal{S}\rightarrow \mathbb{R}\) defined in a particular chart as

$$\displaystyle{ g_{p}(\partial _{i},\partial _{j}) = \partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{2}=\xi _{1}}, }$$
(11.3.9)

for any coordinate vector fields ∂ i , ∂ j on \(\mathcal{S}\) about p.

In the following we shall develop two formulas equivalent to (11.3.9). Consider the notation ρ(ξ 1, ξ 2) = D(ξ 1 | | ξ 2). By (iii) we have

$$\displaystyle\begin{array}{rcl} \partial _{\xi _{1}^{i}}\rho (\xi _{1},\xi _{2})_{\vert \xi _{1}=\xi _{2}=\xi }& =& \partial _{\xi _{1}^{i}}\rho (\xi,\xi ) = 0 {}\\ \partial _{\xi _{2}^{i}}\rho (\xi _{1},\xi _{2})_{\vert \xi _{1}=\xi _{2}=\xi }& =& \partial _{\xi _{2}^{i}}\rho (\xi,\xi ) = 0. {}\\ \end{array}$$

Denote \(\partial _{j} = \frac{\partial } {\partial \xi ^{j}}\). Differentiating the function \(\varphi (\xi ) = \partial _{\xi _{1}^{i}}\rho (\xi,\xi )\) with respect to ∂ j we get

$$\displaystyle{0 = \partial _{j}\varphi (\xi ) = \partial _{\xi _{1}^{j}}\partial _{\xi _{1}^{i}}\rho (\xi,\xi ) + \partial _{\xi _{2}^{j}}\partial _{\xi _{1}^{i}}\rho (\xi,\xi ),}$$

which implies

$$\displaystyle{ \partial _{\xi _{1}^{j}}\partial _{\xi _{1}^{i}}\rho (\xi,\xi ) = -\partial _{\xi _{2}^{j}}\partial _{\xi _{1}^{i}}\rho (\xi,\xi ). }$$
(11.3.10)

Differentiating the function \(\phi (\xi ) = \partial _{\xi _{2}^{i}}\rho (\xi,\xi )\) with respect to ∂ j we obtain

$$\displaystyle{0 = \partial _{j}\phi (\xi ) = \partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi ) + \partial _{\xi _{2}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi ),}$$

which implies

$$\displaystyle{ \partial _{\xi _{2}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi ) = -\partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi ). }$$
(11.3.11)

Assuming ρ(⋅ , ⋅ ) smooth enough, the partial derivatives commute and using (11.3.10) and (11.3.11) we arrive at the following equivalent local formulas for the induced Riemannian metric:

$$\displaystyle{ g_{ij}(\xi ) = \partial _{\xi _{1}^{i}}\partial _{\xi _{1}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{2}=\xi _{1}} }$$
(11.3.12)
$$\displaystyle{ = \partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{2}=\xi _{1}} }$$
(11.3.13)
$$\displaystyle{ = -\partial _{\xi _{1}^{i}}\partial _{\xi _{2}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{2}=\xi _{1}} }$$
(11.3.14)
$$\displaystyle{ = -\partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{i}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{2}=\xi _{1}}. }$$
(11.3.15)
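The equivalence of the formulas (11.3.12)–(11.3.15) can be observed numerically in one dimension. The sketch below assumes the illustrative function \(\rho (x,y) = y/x -\ln (y/x) - 1\) and uses nested central differences; the helper `partial` and the step size are hypothetical choices.

```python
import math

def partial(f, axis, step=1e-3):
    # Central finite difference of f(x, y) in argument `axis` (0 or 1).
    def df(x, y):
        if axis == 0:
            return (f(x + step, y) - f(x - step, y)) / (2 * step)
        return (f(x, y + step) - f(x, y - step)) / (2 * step)
    return df

def rho(x, y):
    # Illustrative contrast function rho(x, y) = D(x || y) (assumption).
    r = y / x
    return r - math.log(r) - 1.0

xi = 2.0
d11 = partial(partial(rho, 0), 0)(xi, xi)  # both derivatives in the first slot
d22 = partial(partial(rho, 1), 1)(xi, xi)  # both derivatives in the second slot
d12 = partial(partial(rho, 1), 0)(xi, xi)  # mixed derivative
d21 = partial(partial(rho, 0), 1)(xi, xi)  # mixed derivative, other order
print(d11, d22, -d12, -d21)  # all four are close to 1/xi^2 = 0.25
```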

Another relation which will be useful in a later section is obtained by differentiating with respect to \(\partial _{k}(= \frac{\partial } {\partial \xi ^{k}})\) in relation (11.3.11) and applying the chain rule

$$\displaystyle\begin{array}{rcl} \partial _{k}\partial _{\xi _{2}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi )& =& -\partial _{k}\partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi )\Longleftrightarrow \\ \partial _{\xi _{1}^{k}}\partial _{\xi _{2}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi ) + \partial _{\xi _{2}^{k}}\partial _{\xi _{2}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi )& =& -\partial _{\xi _{1}^{k}}\partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi ) \\ & & -\partial _{\xi _{2}^{k}}\partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{i}}\rho (\xi,\xi ).{}\end{array}$$
(11.3.16)

The following notation is adopted for the representation of a vector field X on \(\mathcal{S}\) with respect to two local coordinate systems \((\xi _{1}^{i})\) and \((\xi _{2}^{i})\)

$$\displaystyle{X_{(\xi _{1})} =\sum _{i}X^{i}(\xi _{ 1})\partial _{\xi _{1}^{i}},\quad X_{(\xi _{2})} =\sum _{i}X^{i}(\xi _{ 2})\partial _{\xi _{2}^{i}}.}$$

We note that for any vector field X we have

$$\displaystyle\begin{array}{rcl} X_{(\xi _{1})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} = X_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}& =& 0. {}\\ \end{array}$$

Next we provide the global definition of the induced Riemannian metric.

Proposition 11.3.1

The inner product of two vector fields is given by the following equivalent formulas

$$\displaystyle\begin{array}{rcl} g(X,Y )& =& X_{(\xi _{1})}Y _{(\xi _{1})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} {}\\ & =& X_{(\xi _{2})}Y _{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} {}\\ & =& -X_{(\xi _{1})}Y _{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} {}\\ & =& -X_{(\xi _{2})}Y _{(\xi _{1})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}. {}\\ \end{array}$$

Proof:

The proof follows from the bilinearity of g and an application of relations (11.3.12)–(11.3.15). For instance, the first relation can be shown as

$$\displaystyle\begin{array}{rcl} g(X,Y )& =& \sum _{i,j}X^{i}Y ^{j}g(\partial _{ i},\partial _{j}) {}\\ & =& \sum _{i,j}X^{i}Y ^{j}\partial _{\xi _{ 1}^{i}}\partial _{\xi _{1}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} {}\\ & =& X_{(\xi _{1})}Y _{(\xi _{1})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}. {}\\ \end{array}$$

 ■ 

4 Dual Contrast Function

If D is a contrast function on \(\mathbb{R}^{k}\), then the associated dual contrast function is defined by

$$\displaystyle{D^{{\ast}}(\xi _{ 1}\vert \vert \xi _{2}) = D(\xi _{2}\vert \vert \xi _{1}).}$$

The fact that \(D^{{\ast}}\) satisfies properties (i)–(iv) from the definition of a contrast function follows from the fact that D satisfies the same properties; for property (iv) one also uses the equality of (11.3.12) and (11.3.13). Similarly, we can define the dual contrast function on a manifold by

$$\displaystyle{D_{\mathcal{S}}^{{\ast}}(p\vert \vert q) = D_{ \mathcal{S}}(q\vert \vert p),\qquad \forall p,q \in \mathcal{S}.}$$

It is worth noting that the contrast functions D and \(D^{{\ast}}\) induce the same Riemannian metric on the manifold \(\mathcal{S}\). However, the connections induced by D and \(D^{{\ast}}\) play a central role in the geometry of contrast functions, as we shall see in the next couple of sections.
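A small numerical illustration of this remark is sketched below. It assumes the asymmetric contrast function \((y - x)^{2}/(xy^{2})\) on (0, ∞) from Sect. 1; the function names and the step size are hypothetical.

```python
def contrast(x, y):
    # Asymmetric illustration: D(x || y) = (y - x)^2 / (x * y^2), x, y > 0.
    return (y - x) ** 2 / (x * y**2)

def dual(x, y):
    # Dual contrast function: D*(x || y) = D(y || x).
    return contrast(y, x)

def metric(D, x, step=1e-4):
    # Second derivative in the second argument, evaluated on the diagonal.
    return (D(x, x + step) - 2.0 * D(x, x) + D(x, x - step)) / step**2

x0 = 2.0
g_primal = metric(contrast, x0)  # analytically 2 / x0^3 = 0.25
g_dual = metric(dual, x0)        # the same value
asymmetry = (contrast(1.0, 2.0), dual(1.0, 2.0))
print(g_primal, g_dual, asymmetry)
```

The two functions differ away from the diagonal, yet their Hessians on the diagonal coincide.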

5 Induced Primal Connection

Let g be the Riemannian metric on \(\mathcal{S}\) induced by the contrast function \(D_{\mathcal{S}}\). Consider the operator ∇(D) given by

$$\displaystyle{ g(\nabla _{X}^{(D)}Y,Z) = -X_{ (\xi _{1})}Y _{(\xi _{1})}Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}, }$$
(11.5.17)

for any vector fields X, Y, Z defined on the overlap of the chart neighborhoods associated with the coordinate systems \((\xi _{1}^{i})\) and \((\xi _{2}^{i})\). We shall check that ∇(D) satisfies the properties of a connection. The \(\mathbb{R}\)-bilinearity is obvious. Let \(f \in \mathcal{F}(\mathcal{S})\) be an arbitrary smooth function. Then

$$\displaystyle\begin{array}{rcl} g(\nabla _{fX}^{(D)}Y,Z)& =& -fX_{ (\xi _{1})}Y _{(\xi _{1})}Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} = g(f\nabla _{X}^{(D)}Y,Z), {}\\ \end{array}$$

and dropping the Z-argument implies \(\nabla _{fX}^{(D)}Y = f\nabla _{X}^{(D)}Y\). Next we check the Leibniz rule in the second argument

$$\displaystyle\begin{array}{rcl} g(\nabla _{X}^{(D)}fY,Z)& =& -X_{ (\xi _{1})}(fY _{(\xi _{1})})Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} {}\\ & =& -fX_{(\xi _{1})}Y _{(\xi _{1})}Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} {}\\ & & -X_{(\xi _{1})}(f)\;Y _{(\xi _{1})}Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} {}\\ & =& fg(\nabla _{X}^{(D)}Y,Z) + X_{ (\xi _{1})}(f)g(Y,Z) {}\\ & =& g(f\nabla _{X}^{(D)}Y + X(f)Y,Z), {}\\ \end{array}$$

so \(\nabla _{X}^{(D)}(fY ) = f\nabla _{X}^{(D)}Y + X(f)Y\).

Writing formula (11.5.17) in local coordinates we obtain the components of the linear connection ∇(D) as in the following

$$\displaystyle{ \varGamma _{ij,k}^{(D)} = g(\nabla _{ \partial _{i}}^{(D)}\partial _{ j},\partial _{k}) = -\partial _{\xi _{1}^{i}}\partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{k}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}. }$$
(11.5.18)

The commutativity of the partial derivatives implies \(\varGamma _{ij,k}^{(D)} =\varGamma _{ji,k}^{(D)}\), and hence the connection ∇(D) has zero torsion. We can arrive at the same result in the following equivalent way. Starting from the global definition of the connection and Riemannian metric we write

$$\displaystyle\begin{array}{rcl} g(\nabla _{X}^{(D)}Y -\nabla _{ Y }^{(D)}X,Z)& =& -X_{ (\xi _{1})}Y _{(\xi _{1})}Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\xi _{1}=\xi _{2}} {}\\ & & +Y _{(\xi _{1})}X_{(\xi _{1})}Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\xi _{1}=\xi _{2}} {}\\ & =& -[X,Y ]_{(\xi _{1})}Z_{(\xi _{2})}D(\xi _{1}\vert \vert \xi _{2})_{\xi _{1}=\xi _{2}} {}\\ & =& g([X,Y ],Z). {}\\ \end{array}$$

Dropping the Z-argument implies \(\nabla _{X}^{(D)}Y -\nabla _{Y }^{(D)}X = [X,Y ]\), i.e., the torsion of connection ∇(D) is zero.

6 Induced Dual Connection

The dual connection \(\nabla ^{(D^{{\ast}}) }\) is the connection induced by the dual contrast function \(D^{{\ast}}\), i.e., it is given by

$$\displaystyle\begin{array}{rcl} g(\nabla _{X}^{(D^{{\ast}}) }Y,Z)& =& -X_{(\xi _{2})}Y _{(\xi _{2})}Z_{(\xi _{1})}D^{{\ast}}(\xi _{ 2}\vert \vert \xi _{1})_{\vert \xi _{1}=\xi _{2}} {}\\ & =& -X_{(\xi _{2})}Y _{(\xi _{2})}Z_{(\xi _{1})}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}, {}\\ \end{array}$$

for any vector fields X, Y, Z. This can be written locally as

$$\displaystyle{\varGamma _{ij,k}^{(D^{{\ast}}) } = g(\nabla _{\partial _{i}}^{(D^{{\ast}}) }\partial _{j},\partial _{k}) = -\partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}\partial _{\xi _{1}^{k}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}.}$$

Theorem 11.6.1

The connections ∇ (D) and \(\nabla ^{(D^{{\ast}}) }\) are torsion-less dual connections.

Proof:

The fact that the torsions vanish follows from the symmetry in the first two indices of the connection components \(\varGamma _{ij,k}^{(D)} =\varGamma _{ji,k}^{(D)}\) and \(\varGamma _{ij,k}^{(D^{{\ast}}) } =\varGamma _{ ji,k}^{(D^{{\ast}}) }\). The duality relation will be shown in local coordinates. Differentiating with respect to \(\partial _{k} = \partial _{\xi ^{k}}\) in relation \(g_{ij}(\xi ) = -\partial _{\xi _{1}^{i}}\partial _{\xi _{2}^{j}}D(\xi \vert \vert \xi )\) we obtain

$$\displaystyle\begin{array}{rcl} \partial _{k}g_{ij}& =& -\partial _{\xi _{1}^{k}}\partial _{\xi _{1}^{i}}\partial _{\xi _{2}^{j}}D(\xi \vert \vert \xi ) {}\\ & & -\partial _{\xi _{2}^{k}}\partial _{\xi _{1}^{i}}\partial _{\xi _{2}^{j}}D(\xi \vert \vert \xi ) {}\\ & =& \varGamma _{ki,j}^{(D)} +\varGamma _{ kj,i}^{(D^{{\ast}}) }, {}\\ \end{array}$$

which is equivalent to the duality of ∇(D) and \(\nabla ^{(D^{{\ast}}) }\). ■ 

Therefore, a contrast function D on a manifold \(\mathcal{S}\) induces a statistical structure \((g,\nabla ^{(D)},\nabla ^{(D^{{\ast}}) })\). Hence, \((\mathcal{S},g,\nabla ^{(D)},\nabla ^{(D^{{\ast}}) })\) becomes the statistical manifold induced by the contrast function D.
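In one dimension the duality relation shown in the proof of Theorem 11.6.1 reads \(g' =\varGamma ^{(D)} +\varGamma ^{(D^{{\ast}}) }\). The sketch below checks it by nested finite differences for the contrast function \((y - x)^{2}/(xy^{2})\) on (0, ∞), chosen here only as an illustration; the helper `partial`, the step sizes and the evaluation point are assumptions.

```python
def partial(f, axis, step=1e-2):
    # Central finite difference of f(x, y) in argument `axis` (0 or 1).
    def df(x, y):
        if axis == 0:
            return (f(x + step, y) - f(x - step, y)) / (2 * step)
        return (f(x, y + step) - f(x, y - step)) / (2 * step)
    return df

def D(x, y):
    # Illustrative contrast function D(x || y) = (y - x)^2 / (x * y^2).
    return (y - x) ** 2 / (x * y**2)

def metric(x):
    # g(x): second derivative in the second argument on the diagonal.
    return partial(partial(D, 1), 1)(x, x)

def gamma(x):
    # Primal component, cf. (11.5.18): -d_x d_x d_y D on the diagonal.
    return -partial(partial(partial(D, 1), 0), 0)(x, x)

def gamma_dual(x):
    # Dual component: -d_y d_y d_x D on the diagonal.
    return -partial(partial(partial(D, 0), 1), 1)(x, x)

x0, outer = 2.0, 1e-3
dg = (metric(x0 + outer) - metric(x0 - outer)) / (2 * outer)
total = gamma(x0) + gamma_dual(x0)
print(dg, total)  # both are close to -6 / x0^4 = -0.375
```

For this particular function the primal component vanishes on the diagonal, so the whole derivative of the metric is carried by the dual component.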

Proposition 11.6.2

The Levi–Civita connection of the Riemannian space \((\mathcal{S},g)\) is given by

$$\displaystyle{\nabla ^{(0)} = \frac{1} {2}\big(\nabla ^{(D)} + \nabla ^{(D^{{\ast}}) }\big).}$$

Proof:

Since ∇(D) and \(\nabla ^{(D^{{\ast}}) }\) have zero torsion, the same applies to ∇(0). Using the duality relation we show that ∇(0) is a metrical connection

$$\displaystyle\begin{array}{rcl} Xg(Y,Z)& =& \frac{1} {2}Xg(Y,Z) + \frac{1} {2}Xg(Y,Z) {}\\ & =& \frac{1} {2}\Big\{g(\nabla _{X}^{(D)}Y,Z) + g(Y,\nabla _{ X}^{(D^{{\ast}}) }Z)\Big\} {}\\ & & +\frac{1} {2}\Big\{g(\nabla _{X}^{(D^{{\ast}}) }Y,Z) + g(Y,\nabla _{X}^{(D)}Z)\Big\} {}\\ & =& g\Big(\frac{\nabla _{X}^{(D)}Y + \nabla _{X}^{(D^{{\ast}}) }Y } {2},Z\Big) + g\Big(Y, \frac{\nabla _{X}^{(D)}Z + \nabla _{X}^{(D^{{\ast}}) }Z} {2} \Big) {}\\ & =& g(\nabla _{X}^{(0)}Y,Z) + g(Y,\nabla _{ X}^{(0)}Z). {}\\ \end{array}$$

 ■ 

7 Skewness Tensor

Besides a Riemannian metric g and a pair of dual connections ∇(D), \(\nabla ^{(D^{{\ast}}) }\), a contrast function D also induces the skewness tensor by

$$\displaystyle\begin{array}{rcl} C^{(D)}(X,Y,Z)& =& g\big(\nabla _{ X}^{(D^{{\ast}}) }Y -\nabla _{X}^{(D)}Y,Z\big) {}\\ & =& \Big(X_{(\xi _{1})}Y _{(\xi _{1})}Z_{(\xi _{2})} - X_{(\xi _{2})}Y _{(\xi _{2})}Z_{(\xi _{1})}\Big)D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}. {}\\ \end{array}$$

In local coordinates this becomes

$$\displaystyle\begin{array}{rcl} C_{ijk}^{(D)}& =& \varGamma _{ ij,k}^{(D^{{\ast}}) } -\varGamma _{ij,k}^{(D)} {}\\ & =& \partial _{\xi _{1}^{i}}\partial _{\xi _{1}^{j}}\partial _{\xi _{2}^{k}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}} - \partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}\partial _{\xi _{1}^{k}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}}. {}\\ \end{array}$$

By virtue of identities (11.3.12)–(11.3.15), the tensor \(C_{ijk}^{(D)}\) becomes completely symmetric.

8 Third Order Approximation of \(\boldsymbol{D(p\vert \vert \,\cdot )}\)

This section will present the third order approximation of a contrast function \(D_{\mathcal{S}}\) on a manifold \(\mathcal{S}\). Let \(p,q \in \mathcal{S}\) be two points in the same chart with coordinates \(\xi _{1} =\phi ^{-1}(p)\) and \(\xi _{2} =\phi ^{-1}(q)\). Denote \(\varDelta \xi ^{i} =\xi _{ 2}^{i} -\xi _{1}^{i}\). The third order approximation of \(D_{\mathcal{S}}(p\vert \vert \,\cdot )\) about p is given by

$$\displaystyle\begin{array}{rcl} D_{\mathcal{S}}(p\vert \vert q)& =& D_{\mathcal{S}}(p\vert \vert p) + \partial _{\xi _{2}^{i}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}=\xi }\,\varDelta \xi ^{i} {}\\ & & +\frac{1} {2}\partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}=\xi }\,\varDelta \xi ^{i}\varDelta \xi ^{j} {}\\ & & +\frac{1} {6}\partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}\partial _{\xi _{2}^{k}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}=\xi }\,\varDelta \xi ^{i}\varDelta \xi ^{j}\varDelta \xi ^{k} + o(\|\varDelta \xi \|^{3}), {}\\ \end{array}$$

where \(o(\|\varDelta \xi \|^{3})\) is a term which converges to 0 faster than \(\|\varDelta \xi \|^{3}\) does, as p → q. Since, by the definition of a contrast function, the first two terms are zero, we obtain

$$\displaystyle\begin{array}{rcl} D_{\mathcal{S}}(p\vert \vert q)& =& \frac{1} {2}g_{ij}(\xi _{1})\varDelta \xi ^{i}\varDelta \xi ^{j} + \frac{1} {6}h_{ijk}(\xi _{1})\varDelta \xi ^{i}\varDelta \xi ^{j}\varDelta \xi ^{k} + o(\|\varDelta \xi \|^{3}), {}\\ \end{array}$$

where g ij is the induced Riemannian metric. It suffices to compute the coefficients

$$\displaystyle{h_{ijk}(\xi _{1}) = \partial _{\xi _{2}^{i}}\partial _{\xi _{2}^{j}}\partial _{\xi _{2}^{k}}D(\xi _{1}\vert \vert \xi _{2})_{\vert \xi _{1}=\xi _{2}=\xi }.}$$

Writing relation (11.3.16) in terms of the induced connections components, see formula (11.5.18), we have

$$\displaystyle\begin{array}{rcl} -\varGamma _{ij,k}^{{\ast}} + h_{ ijk}& =& \varGamma _{jk,i} +\varGamma _{ ik,j}^{{\ast}} {}\\ \end{array}$$

from where

$$\displaystyle\begin{array}{rcl} h_{ijk}& =& \varGamma _{ij,k}^{{\ast}} +\varGamma _{ jk,i} +\varGamma _{ ik,j}^{{\ast}} {}\\ & =& \partial _{j}g_{ik} +\varGamma _{ ik,j}^{{\ast}} {}\\ & =& \partial _{k}g_{ij} +\varGamma _{ ij,k}^{{\ast}}. {}\\ \end{array}$$

The last two identities follow from formula (8.1.2). A similar argument can be used to show also the relation

$$\displaystyle{h_{ijk} = \partial _{i}g_{kj} +\varGamma _{ jk,i}^{{\ast}}.}$$

These relations imply the total symmetry of h ijk

$$\displaystyle{h_{ijk} = h_{ikj} = h_{kji} = h_{jik}.}$$

It is worth mentioning that if D(⋅  | | ⋅ ) induces a dually flat statistical manifold (i.e., \(\varGamma =\varGamma ^{{\ast}} = 0\)), then h ijk  = 0.
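The gain obtained from the cubic term can be seen in a one-dimensional experiment. The sketch below assumes the illustrative contrast function \(D(\xi _{1}\vert \vert \xi _{2}) =\xi _{2}/\xi _{1} -\ln (\xi _{2}/\xi _{1}) - 1\), for which the second and third derivatives in ξ 2 on the diagonal are \(1/\xi _{1}^{2}\) and \(-2/\xi _{1}^{3}\) (computed by hand); all names are hypothetical.

```python
import math

def contrast(xi1, xi2):
    # Illustrative contrast function on (0, inf) (assumption of this sketch).
    r = xi2 / xi1
    return r - math.log(r) - 1.0

def g(xi1):
    # Second derivative in xi2 on the diagonal.
    return 1.0 / xi1**2

def h3(xi1):
    # Third derivative in xi2 on the diagonal.
    return -2.0 / xi1**3

def approx(xi1, delta, order):
    # Quadratic approximation, plus the cubic term when order >= 3.
    value = 0.5 * g(xi1) * delta**2
    if order >= 3:
        value += h3(xi1) * delta**3 / 6.0
    return value

xi1 = 2.0
errors = []
for delta in (0.2, 0.1):
    exact = contrast(xi1, xi1 + delta)
    errors.append((abs(exact - approx(xi1, delta, 2)),
                   abs(exact - approx(xi1, delta, 3))))
print(errors)  # (second order error, third order error) for each delta
```

Halving Δξ reduces the third order error by a factor close to 16, consistent with a remainder of fourth order in this smooth example.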

We have seen that any contrast function induces a dualistic structure \((g^{(D)},\nabla ^{(D)},\nabla ^{(D^{{\ast}}) })\) on \(\mathcal{S}\). Next we consider the converse implication, which states that any triple (g, ∇, ∇ ∗), which consists of a metric and two dual torsion-free connections, is induced from a divergence. The divergence can be given locally by

$$\displaystyle{ D(p\vert \vert q) = \frac{1} {2}g_{ij}(p)\varDelta \xi ^{i}\varDelta \xi ^{j} + \frac{1} {6}h_{ijk}(p)\varDelta \xi ^{i}\varDelta \xi ^{j}\varDelta \xi ^{k}, }$$
(11.8.19)

where \(\varDelta \xi ^{i} =\xi ^{i}(q) -\xi ^{i}(p)\) and \(h_{ijk} = \partial _{i}g_{kj} +\varGamma _{ jk,i}^{{\ast}}\). The existence of a globally defined contrast function is proved in Matumoto [56].

However, the contrast function is not unique. An alternative construction for (11.8.19) is

$$\displaystyle{D(p\vert \vert q) = \frac{1} {2}g_{ij}(p)\varDelta \xi ^{i}\varDelta \xi ^{j} -\frac{1} {6}h_{ijk}^{{\ast}}(p)\varDelta \xi ^{i}\varDelta \xi ^{j}\varDelta \xi ^{k},}$$

where \(h_{ijk}^{{\ast}} = \partial _{i}g_{jk} +\varGamma _{ jk,i}^{{\ast}}\).

9 Hessian Geometry

Assume now that there is a local coordinate chart with respect to which the contrast function \(D_{\mathcal{S}}\) is induced locally by a convex function \(\varphi\) via formula (11.1.2). We remark that a local system of coordinates in which the contrast function is induced by a convex function need not always exist. However, when it does, the contrast function defines a dually flat statistical manifold structure, as we shall see next. This type of contrast function is sometimes called Bregman divergence, see Bregman [20], and it is widely used in convex optimization, see Bauschke [14, 16, 15]. For a generalization of this contrast function to an α-family, see Zhang [86].

Using (11.1.4) we obtain that the metric is given by the Hessian of the strictly convex potential function \(\varphi\)

$$\displaystyle{ g_{ij}(\xi ) = \partial _{\xi ^{i}}\partial _{\xi ^{j}}\varphi (\xi ). }$$
(11.9.20)
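
The relation between a Bregman-type contrast function and the Hessian metric can be checked numerically. The sketch below is illustrative only. Formula (11.1.2) is not reproduced in this section, so the form \(D(\xi _{0}\vert \vert \xi ) =\varphi (\xi ) -\varphi (\xi _{0}) - \partial _{i}\varphi (\xi _{0})(\xi ^{i} -\xi _{ 0}^{i})\) and its argument order are assumptions, and the potential \(\varphi (\xi ) =\sum _{i}(\xi ^{i}\ln \xi ^{i} -\xi ^{i})\) together with all function names are hypothetical choices. The code estimates the Hessian of the divergence on the diagonal by central differences, to be compared with \(\partial _{\xi ^{i}}\partial _{\xi ^{j}}\varphi\).

```python
import math

def phi(xi):
    # Assumed strictly convex potential on the positive orthant (illustrative choice).
    return sum(x * math.log(x) - x for x in xi)

def grad_phi(xi):
    # Gradient of the assumed potential: d/dx (x ln x - x) = ln x.
    return [math.log(x) for x in xi]

def bregman(xi0, xi):
    # Assumed Bregman form: phi(xi) - phi(xi0) - <grad phi(xi0), xi - xi0>.
    g0 = grad_phi(xi0)
    linear = sum(g * (x - x0) for g, x, x0 in zip(g0, xi, xi0))
    return phi(xi) - phi(xi0) - linear

def metric_from_divergence(xi0, h=1e-3):
    # Central-difference Hessian of xi -> D(xi0 || xi), evaluated at xi = xi0.
    n = len(xi0)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                xi = list(xi0)
                xi[i] += si * h
                xi[j] += sj * h
                return bregman(xi0, xi)
            g[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return g

if __name__ == "__main__":
    xi0 = [0.5, 2.0]
    for row in metric_from_divergence(xi0):
        print(row)
```

For this potential the Hessian is diagonal with entries \(1/\xi ^{i}\), so at \(\xi _{0} = (0.5,2.0)\) the printed matrix should be close to the diagonal matrix with entries 2 and 0.5, up to finite-difference error.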

A straightforward computation shows that the components of the induced dual connections \(\nabla ^{(D)}\) and \(\nabla ^{(D^{{\ast}}) }\) are given by

$$\displaystyle{ \varGamma _{ij,k}^{(D)}(\xi ) = 0,\qquad \varGamma _{ ij,k}^{(D^{{\ast}}) }(\xi ) = \partial _{\xi ^{i}}\partial _{\xi ^{j}}\partial _{\xi ^{k}}\varphi (\xi ). }$$
(11.9.21)

A further computation shows that the Riemann curvature tensors are \(R = R^{{\ast}} = 0\), i.e., the connections are dually flat.

It is worth noting that there are topological obstructions to the existence of dually flat structures. Ay and Tuschmann [10] proved that if \((\mathcal{S},g,\nabla,\nabla ^{{\ast}})\) is dually flat and \(\mathcal{S}\) is compact, then the fundamental group \(\pi _{1}(\mathcal{S})\) must be infinite.

The skewness tensor is given by the third order derivatives as

$$\displaystyle{C_{ijk}^{(D)} = \partial _{\xi ^{ i}}\partial _{\xi ^{j}}\partial _{\xi ^{k}}\varphi (\xi ).}$$
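
Since the skewness tensor consists of third order derivatives of the potential, it is totally symmetric and equals the derivative of the Hessian metric. The following sketch illustrates this numerically under stated assumptions: the potential \(\varphi (\xi ) =\ln \big(1 +\sum _{i}e^{\xi ^{i}}\big)\) is a hypothetical example not taken from the text, and the function names are invented for illustration. The code differentiates the analytic Hessian by central differences to approximate \(\partial _{\xi ^{k}}g_{ij}\).

```python
import math

def probs(xi):
    # p_i = exp(xi_i) / (1 + sum_j exp(xi_j)); gradient of the assumed potential.
    z = 1.0 + sum(math.exp(x) for x in xi)
    return [math.exp(x) / z for x in xi]

def hessian_phi(xi):
    # Analytic Hessian of the assumed potential: g_ij = delta_ij p_i - p_i p_j.
    p = probs(xi)
    n = len(xi)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def skewness(xi, h=1e-5):
    # C[i][j][k] approximates d g_ij / d xi_k by central differences.
    n = len(xi)
    C = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(n):
        plus = list(xi)
        minus = list(xi)
        plus[k] += h
        minus[k] -= h
        gp, gm = hessian_phi(plus), hessian_phi(minus)
        for i in range(n):
            for j in range(n):
                C[i][j][k] = (gp[i][j] - gm[i][j]) / (2 * h)
    return C

if __name__ == "__main__":
    C = skewness([0.3, -0.7])
    # Total symmetry: entries with permuted indices should agree.
    print(C[0][0][1], C[0][1][0], C[1][0][0])
```

The three printed entries should agree up to finite-difference error, reflecting the symmetry of the third order derivatives in all three indices.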

This geometry is commonly referred to in the literature as Hessian geometry. Some authors have considered conditions weaker than strict convexity for the potential function \(\varphi\), see Shima [74] and Shima and Yagi [75]. For more details on Hessian metrics, the reader is referred to Bercu [17] and Corcodel [29].

10 Problems

  1. 11.1.

    Let γ: (a, b) → (M, g) be a regular curve, i.e., \(\dot{\gamma }\not =0\). Define

    $$\displaystyle{D(s\vert \vert t) =\int _{ s}^{t}(t - u)\vert \dot{\gamma }(u)\vert _{ g}^{2}\,du.}$$

    Show that D( ⋅  | | ⋅ ) is a contrast function on (a, b).

  2. 11.2.

    Let \(\mathcal{S}\) be a statistical model and consider two distributions \(p_{0},p_{1} \in \mathcal{S}\). Define the following curves in \(\mathcal{S}\)

    $$\displaystyle{p_{t}^{(m)} = (1 - t)p_{ 0} + tp_{1},\quad p_{t}^{(e)} = C_{ t}p_{0}^{1-t}p_{ 1}^{t},\quad 0 \leq t \leq 1,}$$

    where \(C_{t}\) is a normalization function. Denote by \(g^{(m)}(t)\) and \(g^{(e)}(t)\) the Fisher metrics along the aforementioned curves. Let

    $$\displaystyle{D^{(m)}(p_{ 1}\vert \vert p_{0}) =\int _{ 0}^{1}(1 - s)g^{(m)}(s)\,ds,}$$
    $$\displaystyle{D^{(e)}(p_{ 1}\vert \vert p_{0}) =\int _{ 0}^{1}(1 - s)g^{(e)}(s)\,ds.}$$
    1. (a)

      Prove that \(D^{(m)}(\,\cdot \,\|\,\cdot \,)\) and \(D^{(e)}(\,\cdot \,\|\,\cdot \,)\) are contrast functions on \(\mathcal{S}\).

    2. (b)

      What is the relationship between \(D^{(m)}(\,\cdot \,\|\,\cdot \,)\) and \(D^{(e)}(\,\cdot \,\|\,\cdot \,)\)?

  3. 11.3.

    Let \((M,g,\nabla,\nabla ^{{\ast}})\) be a dually flat statistical manifold and \((x^{i})\) and \((\zeta _{i})\) a pair of dual coordinate systems associated with potentials \(\varphi\) and ψ (i.e., \(x^{i} = \partial _{\zeta _{i}}\varphi (\zeta )\), \(\zeta _{j} = \partial _{x^{j}}\psi (x)\)). Define \(D: M \times M \rightarrow \mathbb{R}\) as

    $$\displaystyle{D(p\vert \vert q) =\psi \big (x(p)\big) +\varphi \big (\zeta (q)\big) - x^{i}(p)\zeta _{ i}(q).}$$
    1. (a)

      Prove that \(D(\,\cdot \,\|\,\cdot \,)\) is a contrast function (called the canonical divergence of \((M,g,\nabla,\nabla ^{{\ast}})\)).

    2. (b)

      Find the dual contrast function \(D^{{\ast}}(\,\cdot \,\|\,\cdot \,)\).

    3. (c)

      Show that for any p, q, r ∈ M the following relation holds

      $$\displaystyle{ D(p\vert \vert q) + D(q\vert \vert r) = D(p\vert \vert r) -\big (x^{i}(q) - x^{i}(p)\big)\big(\zeta _{ i}(r) -\zeta _{i}(q)\big). }$$
    4. (d)

      Let θ be the angle made at q by the ∇-geodesic joining p and q, \(\gamma _{pq}\), and the \(\nabla ^{{\ast}}\)-geodesic joining q and r, \(\gamma _{qr}^{{\ast}}\). Show that

      $$\displaystyle{D(p\vert \vert q) + D(q\vert \vert r) = D(p\vert \vert r) -\|\dot{\gamma }_{pq}\| \cdot \|\dot{\gamma }_{qr}^{{\ast}}\|\cos (\pi -\theta ).}$$
    5. (e)

      If \(\theta = \frac{\pi } {2}\) show the following Pythagorean relation:

      $$\displaystyle{D(p\vert \vert r) = D(p\vert \vert q) + D(q\vert \vert r).}$$
    6. (f)

      Find the skewness tensor associated with D( ⋅  | | ⋅ ).

  4. 11.4.

    Consider the Euclidean space \((M,g) = (\mathbb{R}^{n},\delta _{ij})\), with \(\nabla =\nabla ^{{\ast}}\) given by \(\nabla _{U}V = U(V ^{j})e_{j}\), for any \(U,V \in \mathcal{X}(M)\).

    1. (a)

      Show that the Euclidean coordinate system is self-dual, i.e., \(x^{i} =\zeta _{i}\).

    2. (b)

      Show that in this case the potential functions are given by

      $$\displaystyle{\psi (x) = \frac{1} {2}\sum _{i}(x^{i})^{2},\qquad \varphi (\zeta ) = \frac{1} {2}\sum _{i}(\zeta _{i})^{2}.}$$

    3. (c)

      Prove that the canonical divergence is given by \(D(p\vert \vert q) = \frac{1} {2}d_{E}^{2}(p,q)\), where \(d_{E}(p,q)\) denotes the Euclidean distance between p and q.

  5. 11.5.

    How many of the previous requirements still hold on a Riemannian manifold (M, g, ∇) with a flat Levi–Civita connection ∇?

  6. 11.6.

    Let \((M,g,\nabla,\nabla ^{{\ast}})\) be a dually flat statistical manifold, and denote by \(D(\,\cdot \,\|\,\cdot \,)\) the associated canonical divergence. Consider the D-sphere centered at p ∈ M of radius ρ, defined by

    $$\displaystyle{S^{(D)} =\{ q \in M;D(p\vert \vert q) =\rho \}.}$$

    Show that every ∇-geodesic starting at the center p intersects \(S^{(D)}\) orthogonally.

  7. 11.7.

    Consider the exponential family \(p(x;\xi ) = e^{C(x)+\xi ^{i}F_{ i}(x)-\psi (\xi )}\), \(x \in \mathcal{X}\), with \(\{F_{i}(x)\}\) linearly independent on \(\mathcal{X}\). Define \(\eta _{j} = E_{\xi }[F_{j}]\), 1 ≤ j ≤ n.

    1. (a)

      Show that \(\eta _{j} = \partial _{j}\psi (\xi )\).

    2. (b)

      Prove that \((\xi ^{i})\) and \((\eta _{j})\) are dual systems of coordinates.

    3. (c)

      Verify that \((\xi ^{i})\) is a 1-affine coordinate system and \((\eta _{j})\) is a (−1)-affine coordinate system.

    4. (d)

      Let \(\varphi (\eta )\) be the potential associated with ξ, i.e., \(\xi ^{j} = \partial _{\eta _{j}}\varphi (\eta )\). Show that \(\varphi (\eta ) = E_{\xi }[\ln p_{\xi }(x) - C(x)]\).

    5. (e)

      Let H(p) be the entropy of distribution p. Validate the relation

      $$\displaystyle{H(p_{\xi }) = -\varphi (\eta ) - E_{\xi }[C(x)].}$$
    6. (f)

      Let \(\widehat{\eta _{j}} = F_{j}(x)\). Show that \(\hat{\eta }\) is an unbiased estimator for η, and that the covariance matrix provides the Fisher metric, i.e., \(V _{\eta }(\hat{\eta }) = g_{ij}\).

    7. (g)

      Find the contrast function given by the canonical divergence associated with the dual system of coordinates \((\xi ^{i})\), \((\eta _{i})\). What is its relationship with the Kullback–Leibler relative entropy?

  8. 11.8.

    Consider the statistical model given by the Poisson distribution \(p(x;\xi ) = e^{-\xi }\frac{\xi ^{x}} {x!}\), \(x \in \{ 0,1,2,\ldots \}\), ξ > 0. Consider η = ξ and θ = ln ξ.

    1. (a)

      Prove that η and θ are dual coordinates.

    2. (b)

      Find the canonical divergence associated with the above dual coordinates.

  9. 11.9.

    Consider the statistical model given by the normal family

    $$\displaystyle{p(x;\xi ) = \frac{1} {\sqrt{2\pi }\sigma }e^{-\frac{(x-\mu )^{2}} {2\sigma ^{2}} },\;\mu \in \mathbb{R},\sigma > 0.}$$

    Show that \((\theta ^{i})\) and \((\eta _{i})\) are dual systems of coordinates, where

    $$\displaystyle{\eta _{1} =\mu,\qquad \eta _{2} =\mu ^{2} +\sigma ^{2}}$$
    $$\displaystyle{ \frac{\theta ^{1}} {2\theta ^{2}} = -\mu,\qquad \frac{(\theta ^{1})^{2} - 2\theta ^{2}} {4(\theta ^{2})^{2}} =\mu ^{2} +\sigma ^{2}.}$$
  10. 11.10.

    Consider the statistical model given by the exponential distribution \(p(x;\xi ) =\xi e^{-\xi x}\), x ≥ 0, ξ > 0.

    1. (a)

      Find a pair of dual coordinates on the above statistical model.

    2. (b)

      Find the potentials ψ and \(\varphi\) associated with the dual coordinates obtained at (a).

    3. (c)

      Deduce the expression for the Fisher metric.

    4. (d)

      Find the canonical divergence associated with the dual coordinates obtained at (a).