1 Introduction

A statistical manifold [1, 2, 7] is an abstract manifold \(\mathbb M\) equipped with a Riemannian metric g and an Amari–Chentsov tensor T. If the manifold is a smooth differentiable manifold then it can be realized [8] as a statistical model.

Most studies of statistical models are based on the widely used logarithmic embedding of probability density functions. Here, more general embeddings are considered. Recent work [11, 12, 23] unifies the formalism of rho-tau embeddings [19] with statistical models belonging to deformed exponential families [10]. The present exposition continues this investigation.

The notion of a statistical manifold has been generalized in the non-parametric setting [14, 15, 20, 21] to include Banach manifolds. The corresponding terminology is used here, although up to now only a few papers have combined non-parametric manifolds with deformed exponential families [9, 13, 16, 18].

The rho-tau divergence is discussed in the next section. Eguchi [4, 5] proved under rather general conditions that, given a differentiable manifold, a divergence function defines a metric tensor and a pair of connections. These are derived in Sect. 4 and Sect. 6, respectively. Parametric statistical models are discussed in Sect. 7, which treats Hessian geometry, and in Sect. 8, which deals with deformed exponential families.

2 The Statistical Manifold

The points of a given statistical manifold \(\mathbb M\) are assumed to be random variables over some measure space \((\mathscr {X}, \mu )\). A random variable X is defined as any measurable real function. The expectation, if it exists, is denoted \(\mathbb E_\mu X\). Throughout the text it is assumed that the manifold is differentiable and that for each X in \(\mathbb M\) the tangent plane \(T_X\mathbb M\) is well-defined.

The derivative of a random variable is again a random variable. Therefore one can expect that the tangent vectors at a point X of \(\mathbb M\) are random variables with vanishing expectation value. Let us assume that these tangent vectors can be used as a local chart in the vicinity of the point X and that they belong to some Banach space \(\mathscr {B}\). Then \(\mathbb M\) is a Banach manifold, provided a number of technical conditions are satisfied.

In the simplest case the manifold \(\mathbb M\) consists of all strictly positive probability distributions on a discrete set \(\mathscr {X}\). These probability distributions can be considered as positive-valued random variables with expectation equal to 1. The space \(\mathscr {B}\) of all random variables is a Banach space for instance for the \(L^1\) norm. The manifold \(\mathbb M\) is a Banach manifold. Our approach here is the same as that adopted in [21], where random variables are called \(\chi \)-functions, and functions of random variables are called \(\chi \)-functionals.

In the more general situation the choice of an appropriate norm for the tangent vectors is not so simple. See the work of Pistone et al. [14,15,16].

3 Rho-Tau Divergence

Given a strictly convex differentiable function h and a pair of real-valued random variables P and Q the Bregman divergence [3] is given by

$$\begin{aligned} \mathscr {D}(P,Q)= & {} \mathbb E_\mu \left[ h(P)-h(Q)-(P-Q)h'(Q)\right] , \end{aligned}$$
(1)

where \(h'\) denotes the derivative of h. A generalization involving two strictly increasing real functions \(\rho (u)\) and \(\tau (u)\) is proposed in [19]. For the sake of completeness the definition is repeated here. Throughout the text these functions \(\rho \) and \(\tau \) are assumed to be at least once, sometimes twice differentiable.

There exists a strictly convex function f with the property that \(f'\circ \rho =\tau \). It is given by

$$\begin{aligned} f(u)=\int ^{\rho ^{-1}(u)} \tau (v)\mathrm{d}\rho (v). \end{aligned}$$
(2)

The convex conjugate function \(f^*\) is therefore given by

$$\begin{aligned} f^*(u)=\int ^{\tau ^{-1}(u)}\rho (v)\mathrm{d}\tau (v), \end{aligned}$$
(3)

provided the lower boundary of the integrals is chosen appropriately.
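As a concrete check of (2) and (3), one may pick the illustrative gauge \(\rho =\text{ id }\) and \(\tau =\log \) (an assumption for this sketch, not a choice mandated by the text); the integrals then give \(f(u)=u\log u-u\) and \(f^*(w)=e^w\) up to constants. The following sketch verifies \(f'\circ \rho =\tau \) and the conjugacy identity \(f\circ \rho +f^*\circ \tau =\rho \tau \) numerically.

```python
import math

# Illustrative gauge (an assumption, not fixed by the text): rho = id, tau = log.
# With suitable lower boundaries in (2) and (3): f(u) = u*log(u) - u, f*(w) = exp(w).
rho = lambda u: u
tau = math.log
f = lambda u: u * math.log(u) - u
f_star = math.exp

h = 1e-6
# Check f' o rho = tau by a central difference.
err_deriv = max(abs((f(u + h) - f(u - h)) / (2 * h) - tau(rho(u)))
                for u in (0.3, 1.0, 2.5))
# Check the partial-integration identity f o rho + f* o tau = rho * tau.
err_conj = max(abs(f(rho(u)) + f_star(tau(u)) - rho(u) * tau(u))
               for u in (0.3, 1.0, 2.5))
print(err_deriv, err_conj)
```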

The original definition [19] of the rho-tau divergence can be written as

$$\begin{aligned} \mathscr {D}_{\rho ,\tau }(P,Q)= & {} \mathbb E_\mu \left[ f(\rho (P))+f^*(\tau (Q))-\rho (P)\tau (Q)\right] \end{aligned}$$
(4)

which may be \(+\infty .\) The reformulation given below simplifies the proof of some of its properties.

Definition 1

Let two strictly increasing differentiable functions \(\rho \) and \(\tau \) be given, defined on a common open interval D in \(\mathbb R\). The rho-tau divergence of two random variables P and Q with values in D is given by

$$\begin{aligned} \mathscr {D}_{\rho ,\tau }(P,Q)=\mathbb E_\mu \left( \int _Q^P\left[ \tau (v)-\tau (Q)\right] \mathrm{d}\rho (v)\right) . \end{aligned}$$
(5)

This definition is equivalent to (4). To see this, split (5) into two parts. Use (2) to write the former contribution as \(\mathbb E_\mu f\circ \rho (P)-\mathbb E_\mu f\circ \rho (Q)\) and the latter as \(-\mathbb E_\mu \tau (Q)[\rho (P)-\rho (Q)]\). Use partial integration to prove that \(f\circ \rho +f^*\circ \tau =\rho \tau \). This definition also generalizes (1). To see this take \(f=h\), \(\rho =\text{ id }\), and \(\tau =h'\).

Note that the integral in (5) is a Stieltjes integral, which is well-defined because \(\rho \) and \(\tau \) are strictly increasing functions. The result is non-negative. Hence, the \(\mu \)-expectation is either convergent or it diverges to \(+\infty \).
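For a scalar illustration of Definition 1, again assume (hypothetically) \(\rho =\text{ id }\) and \(\tau =\log \); the Stieltjes integral in (5) then has the closed form \(p\log (p/q)-p+q\), the pointwise generalized Kullback–Leibler term. A midpoint-rule sketch:

```python
import math

# Hypothetical gauge for illustration: rho = id (so d rho(v) = dv), tau = log.
tau = math.log

def stieltjes_term(p, q, n=20_000):
    """Midpoint-rule evaluation of the integral in (5) for scalar p, q."""
    total, step = 0.0, (p - q) / n
    for k in range(n):
        v = q + (k + 0.5) * step
        total += (tau(v) - tau(q)) * step   # d rho(v) = dv for rho = id
    return total

p, q = 2.0, 0.5
closed = p * math.log(p / q) - p + q        # closed form for this gauge
err = abs(stieltjes_term(p, q) - closed)
print(err)
```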

Let P and Q be two random variables with joint probability distribution \(p(\zeta ,\eta )\). Then (5) can be written as

$$\begin{aligned} \mathscr {D}_{\rho ,\tau }(P,Q)= & {} \int p(\zeta ,\eta )\mathrm{d}\zeta \mathrm{d}\eta \, \left( \int _\eta ^\zeta \left[ \tau (v)-\tau (\eta )\right] \mathrm{d}\rho (v)\right) \\\le & {} \int p(\zeta ,\eta )\mathrm{d}\zeta \mathrm{d}\eta \,|\tau (\zeta )-\tau (\eta )|\,|\rho (\zeta )-\rho (\eta )|\\\le & {} \left\{ \mathbb E_\mu |\tau (P)-\tau (Q)|^2\mathbb E_\mu |\rho (P)-\rho (Q)|^2\right\} ^{1/2}. \end{aligned}$$
(6)

To obtain the latter the Cauchy–Schwarz inequality is used.

Theorem 1

\(\mathscr {D}_{\rho ,\tau }(P,Q)\ge 0\) with equality if \(P=Q\). If \(\mu \) is faithful, i.e. \(\mathbb E_\mu P=0\) implies \(P=0\) for any non-negative P, then \(\mathscr {D}_{\rho ,\tau }(P,Q)=0\) implies \(P=Q\).

Proof

From (5) it is immediately clear that \(\mathscr {D}_{\rho ,\tau }(P,Q)\ge 0\) and \(\mathscr {D}_{\rho ,\tau }(P,P)=0\). Assume now that \(\mathscr {D}_{\rho ,\tau }(P,Q)=0\). By assumption this implies that

$$\begin{aligned} \int _Q^P\left[ \tau (v)-\tau (Q)\right] \mathrm{d}\rho (v)=0\quad \mu \text{-almost } \text{ everywhere }. \nonumber \end{aligned}$$

However, because \(\tau \) and \(\rho \) are strictly increasing the integral is strictly positive unless \(P=Q\), \(\mu \)-almost everywhere.    \(\square \)

It can be easily verified that the rho-tau divergence satisfies the following generalized Pythagorean equality for any three points P, Q, R:

$$ \mathscr {D}_{\rho ,\tau }(P,Q) + \mathscr {D}_{\rho ,\tau }(Q,R) - \mathscr {D}_{\rho ,\tau }(P,R) = \mathbb E_\mu \left\{ [\rho (P) - \rho (Q)] [ \tau (R) - \tau (Q)] \right\} . $$
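This relation can be verified numerically. The sketch below assumes a discrete space with counting measure and the illustrative choice \(\rho =\text{ id }\), \(\tau =\log \), \(f(u)=u\log u-u\):

```python
import math

# Assumed illustrative setting: counting measure, rho = id, tau = log,
# f(u) = u*log(u) - u with convex conjugate f*(w) = exp(w).
rho, tau = (lambda u: u), math.log
f = lambda u: u * math.log(u) - u
f_star = math.exp

def D(P, Q):
    # Rho-tau divergence in the form (4).
    return sum(f(rho(p)) + f_star(tau(q)) - rho(p) * tau(q) for p, q in zip(P, Q))

P, Q, R = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], [0.25, 0.25, 0.5]
lhs = D(P, Q) + D(Q, R) - D(P, R)
rhs = sum((rho(p) - rho(q)) * (tau(r) - tau(q)) for p, q, r in zip(P, Q, R))
gap = abs(lhs - rhs)
print(gap)
```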

The general expression for the rho-tau entropy is

$$\begin{aligned} S_{\rho ,\tau }(P)=-\mathbb E_\mu f(\rho (P))+ \text{ constant } =-\mathbb E_\mu \int ^P\tau (u)\mathrm{d}\rho (u). \end{aligned}$$
(7)

See for instance Section 2.6 of [23]. The function f is a strictly convex function which, given \(\rho \), can still be chosen arbitrarily and then determines \(\tau \). The following identity holds

$$\begin{aligned} \mathscr {D}_{\rho ,\tau }(P,Q)= -S_{\rho ,\tau }(P)+S_{\rho ,\tau }(Q)-\mathbb E_\mu \left[ \rho (P)-\rho (Q)\right] \tau (Q). \end{aligned}$$
(8)

In [12, 23], we also discuss rho-tau cross-entropy, as well as the notion of “dual entropy” arising out of rho-tau embedding.

Rho-tau divergence \(\mathscr {D}_{\rho ,\tau }(P,Q)\) is a special form of the more general divergence function \(\mathscr {D}^{(\alpha )}_{f, \rho }(P,Q)\) arising out of convex analysis, see [19, 20]:

$$\begin{aligned} \mathscr {D}^{(\alpha )}_{f, \rho }(P,Q) = \frac{4}{1-\alpha ^{2}}\, \mathbb E_\mu \left\{ \frac{1-\alpha }{2} f(\rho (P)) + \frac{1+\alpha }{2} f(\rho (Q)) - f \left( \frac{1-\alpha }{2} \rho (P)+\frac{1+\alpha }{2} \rho (Q) \right) \right\} . \end{aligned}$$
(9)

Clearly

$$\begin{aligned} \lim _{\alpha \rightarrow 1} \mathscr {D}^{(\alpha )}_{f, \rho }(P,Q)= & {} \mathscr {D}_{\rho ,\tau }(P,Q) = \mathscr {D}_{\tau ,\rho }(Q,P); \\ \lim _{\alpha \rightarrow -1} \mathscr {D}^{(\alpha )}_{f, \rho }(P,Q)= & {} \mathscr {D}_{\rho ,\tau }(Q,P) = \mathscr {D}_{\tau ,\rho }(P,Q); \end{aligned}$$

with \(f^\prime \circ \rho = \tau \) (and equivalently \((f^*)^\prime \circ \tau = \rho \), with \(f^*\) denoting the convex conjugate of f). Though in \(\mathscr {D}^{(\alpha )}_{f, \rho }(P,Q)\) the two free functions are f (a strictly convex function) and \(\rho \) (a strictly increasing function), as reflected in its subscripts, this differs only notationally from specifying the pair \(\rho , \tau \). This is because, among \(f, f^*, \rho , \tau \), fixing any two functions (one of which must be either \(\rho \) or \(\tau \)) determines the remaining two. See [19, 22].
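The limit \(\alpha \rightarrow 1\) can be checked numerically. The sketch below assumes \(\rho =\text{ id }\) and \(f(u)=u\log u-u\) (so \(\tau =\log \)), for which \(\mathscr D_{\rho ,\tau }\) is the generalized Kullback–Leibler divergence:

```python
import math

# Assumed illustrative choice: rho = id, f(u) = u*log(u) - u, hence tau = log.
f = lambda u: u * math.log(u) - u

def D_alpha(P, Q, alpha):
    # Divergence (9) under the counting measure.
    c = 4.0 / (1.0 - alpha ** 2)
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return c * sum(a * f(p) + b * f(q) - f(a * p + b * q) for p, q in zip(P, Q))

def D_rho_tau(P, Q):
    # The rho-tau divergence for this choice: generalized Kullback-Leibler.
    return sum(p * math.log(p / q) - p + q for p, q in zip(P, Q))

P, Q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
gap = abs(D_alpha(P, Q, 1 - 1e-5) - D_rho_tau(P, Q))
print(gap)
```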

4 Tangent Vectors

The rho-tau divergence introduced above can be used to fix a Riemannian metric on the tangent planes of the statistical manifold \(\mathbb M\).

In the standard situation of the Fisher–Rao metric the point P is a probability density function \(p^\theta \), parametrized by \(\theta \in \mathbb R^n\). A short calculation gives

$$\begin{aligned} \partial _j\mathbb E_\mu p^\theta Y=\big \langle \partial _j \log p^\theta ,Y\big \rangle _\theta , \end{aligned}$$
(10)

with \(\langle X,Y\rangle _\theta =\mathbb E_\mu p^\theta XY\), and where \(\partial _j\) is an abbreviation for \(\partial /\partial \theta ^j\). The metric tensor is then given by

$$\begin{aligned} g_{ij}(\theta ) = \langle \partial _i \log p^\theta ,\partial _j \log p^\theta \rangle _\theta . \nonumber \end{aligned}$$

The score variables \(\partial _j \log p^\theta \) have vanishing expectation and span the tangent plane at the point \(p^\theta \).
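For a minimal parametric illustration (a hypothetical Bernoulli family, not discussed in the text), one can check that the score has vanishing expectation and that \(\langle \cdot ,\cdot \rangle _\theta \) reproduces the Fisher information \(1/\theta (1-\theta )\):

```python
# Hypothetical example: Bernoulli family p^theta on {0, 1} with p(1) = theta.
theta = 0.3
p = {1: theta, 0: 1.0 - theta}
# Score variable: derivative of log p^theta w.r.t. theta at each outcome.
score = {1: 1.0 / theta, 0: -1.0 / (1.0 - theta)}

mean_score = sum(p[x] * score[x] for x in p)          # should vanish
g = sum(p[x] * score[x] ** 2 for x in p)              # <score, score>_theta
fisher = 1.0 / (theta * (1.0 - theta))                # closed form
print(mean_score, g, fisher)
```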

These expressions are now generalized. Fix P in \(\mathbb M\). Make the assumption that there exists some open neighborhood U of P in \(\mathbb M\) and a one-to-one correspondence \(\chi _P\) between elements Q of U and tangent vectors \(X=\chi _P(Q)\) of \(T_P\mathbb M\), satisfying \(\chi _P(P)=0\). This map \(\chi _P\) is used as a local chart centered at the point P. The directional derivative \(d_X\) is then defined as

$$d_X P := \lim _{\varepsilon \rightarrow 0} \frac{\chi _P^{-1}(\varepsilon X)-\chi _P^{-1}(0)}{\varepsilon },$$

and is assumed to exist for all \(X\in T_P\mathbb M\). Here, we leave the topology unspecified.

Now we take one of the two increasing functions \(\rho \) and \(\tau \), say \(\rho \), to define a two-point correlation function \(\mathbb E_\mu \rho (P)Y\), and the other function, \(\tau \), to act as a deformed logarithm replacing the logarithm which appears in the definition of the standard scores. The expression analogous to (10) now involves derivatives of \(\mathbb E_\mu \rho (P) Y\) and of \(\tau (P)\). It becomes

$$\begin{aligned} d_X\mathbb E_\mu \rho (P) Y=\big \langle d_X\tau (P),Y\big \rangle _P, \end{aligned}$$
(11)

with

$$\begin{aligned} \left\langle X,Y\right\rangle _P=\mathbb E_\mu \frac{\rho '(P)}{\tau '(P)}XY. \nonumber \end{aligned}$$

This relation should hold for any P in \(\mathbb M\) and X in \(T_P\mathbb M\), and for any random variable Y. The metric tensor \(g_{XY} \equiv g(X,Y)\) becomes

$$\begin{aligned} g_{XY}(P)= & {} \big \langle d_X\tau (P),d_Y\tau (P)\big \rangle _P\\= & {} \mathbb E_\mu \rho '(P)\tau '(P)d_X Pd_Y P. \end{aligned}$$
(12)

This metric tensor is related to the divergence function introduced in the previous section by

$$\begin{aligned} d^{\tiny P}_Yd^{\tiny Q}_X \mathscr {D}_{\rho ,\tau }(P,Q) \bigg |_{P=Q} =-g_{XY}(P), \nonumber \end{aligned}$$

where \(d^{\tiny P}\) is the derivative acting only on P and \(d^{\tiny Q}\) acts only on Q. See [21] for the derivation of the metric tensor in the form of (12) for the non-parametric setting.
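The relation between the metric and the divergence can be checked by finite differences in the scalar case, where it reduces to \(-\partial _P\partial _Q\mathscr D_{\rho ,\tau }(P,Q)\big |_{P=Q}=\rho '(P)\tau '(P)\), the kernel appearing in (12). A sketch, again assuming the illustrative gauge \(\rho =\text{ id }\), \(\tau =\log \):

```python
import math

# Assumed gauge: rho = id, tau = log, f(u) = u*log(u) - u, f*(w) = exp(w).
rho, tau = (lambda u: u), math.log
f = lambda u: u * math.log(u) - u
f_star = math.exp

def D(p, q):
    # Scalar version of the divergence (4).
    return f(rho(p)) + f_star(tau(q)) - rho(p) * tau(q)

p0, h = 0.7, 1e-4
# Central mixed second difference approximating d_P d_Q D at P = Q = p0.
mixed = (D(p0 + h, p0 + h) - D(p0 + h, p0 - h)
         - D(p0 - h, p0 + h) + D(p0 - h, p0 - h)) / (4 * h * h)
kernel = 1.0 / p0          # rho'(p0) * tau'(p0) for this gauge
err = abs(-mixed - kernel)
print(err)
```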

In the case of a model \(p^\theta \) which belongs to the exponential family the tangent plane can be identified with the coordinate space. The chart becomes \(\chi _{p^\theta }(p^\zeta )=\zeta -\theta \) so that

$$d_\zeta p^\theta := \lim _{\varepsilon \rightarrow 0}\frac{1}{\varepsilon }\left( p^{\theta +\varepsilon (\zeta -\theta )}-p^\theta \right) .$$

If \((\zeta -\theta )_i=\delta _{i,j}\) then \(d_\zeta p^\theta =\partial _jp^\theta \) follows and (11) reduces to (10).

5 Gauge Freedom

From (12) it is clear that the metric tensor depends only on the product \(\rho '\tau '\) and not on \(\rho \) and \(\tau \) separately. This implies that once the metric tensor is fixed there remains one function to be chosen freely, either the embedding \(\rho \) or the deformed logarithm \(\tau \), keeping \(\rho '\tau '\) fixed. This is what we call the gauge freedom of the rho-tau formalism.

The notion of gauge freedom is common in physics, where it marks the introduction of additional degrees of freedom which do not modify the model but control some aspects of its appearance. Here, the Riemannian metric of the manifold is considered to be an essential feature, while the different geometries, such as the Riemannian geometry or Amari's dually flat geometries, are attributes which give a further characterization.

It has long been known that distinct choices of the divergence function can lead to the same metric tensor. The present formalism offers the opportunity to profit from this freedom. Quantities such as the divergence function, the entropy, and the alpha-family of connections depend on the specific choice of both \(\rho \) and \(\tau \). This is illustrated further on. Some examples are found in Table 1.

Table 1 Examples of \(\rho ,\tau \) combinations

The simplest choice to fix the gauge is \(\rho =\text{ id }\). Several classes generalizing Bregman divergences found in the literature, e.g. [6, 10], belong to this case. The phi-divergence of [10] is obtained by choosing \(\tau \) equal to the deformed logarithm \(\log _\phi \) (see Sect. 8), the derivative of which is \(1/\phi \). This implies \(\rho '\tau '=1/\phi \), which is also the condition for the deformed metric tensor of [10] to be conformally equivalent with (12). The U-divergence of [6] is obtained by taking \(\tau \) equal to the inverse function of \(U'\). These were discussed in detail in [11, 12, 23].

Also of interest is the gauge defined by \(\rho (u)=1/\tau '(u)\). Let \(\log _\rho \) be the corresponding deformed logarithm (see (21) below). It satisfies \(\log _\rho (u)=\tau (u)-\tau (1)\). Hence, the entropy becomes

$$\begin{aligned} S_{\rho ,\tau }(P)=-\mathbb E_\mu \rho (P)\tau (P)+\mathbb E_\mu P+ \text{ constant }. \nonumber \end{aligned}$$

The divergence becomes

$$\begin{aligned} \mathscr {D}_{\rho ,\tau }(P,Q)= & {} \mathbb E_\mu \rho (P)\left[ \log _\rho (P)-\log _\rho (Q)\right] -\mathbb E_\mu \left[ P-Q\right] . \nonumber \end{aligned}$$

This expression is an obvious generalization of the Kullback–Leibler divergence.
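In this gauge, the further hypothetical choice \(\tau =\log \) gives \(\rho =\text{ id }\) and \(\log _\rho =\log \), so that for normalized P and Q the expression reduces to the ordinary Kullback–Leibler divergence:

```python
import math

# Assumed concrete case of the rho = 1/tau' gauge: tau = log, hence rho = id
# and log_rho = log. P and Q are normalized, so E[P - Q] = 0.
P, Q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]

div = (sum(p * (math.log(p) - math.log(q)) for p, q in zip(P, Q))
       - (sum(P) - sum(Q)))
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))
gap = abs(div - kl)
print(gap)
```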

6 Induced Geometry

A divergence function not only fixes a metric tensor by taking two derivatives, it also fixes a pair of torsion-free connections by taking an extra derivative w.r.t. the first argument [4, 5]. In particular, the rho-tau divergence (5) determines an alpha-family of connections [11, 19, 21].

A covariant derivative \(\nabla _Z\) with respect to a vector field Z is defined by

$$\begin{aligned} \langle \nabla _Zd_X\tau (P),d_Y\tau (P)\rangle _P = -d^{\tiny P}_Z d^{\tiny P}_Yd^{\tiny Q}_X \mathscr {D}_{\rho ,\tau }(P,Q) \bigg |_{Q=P}. \nonumber \end{aligned}$$

A short calculation of the righthand side, with \(\mathscr {D}_{\rho ,\tau }\) defined by (4), gives

$$\begin{aligned} \langle \nabla _Zd_X\tau (P),d_Y\tau (P)\rangle _P= & {} \mathbb E_\mu \left[ d_X\tau (P)\right] \,d_Zd_Y\rho (P). \nonumber \end{aligned}$$

Let \(\nabla ^{(1)}_Z=\nabla _Z\) and let \(\nabla _Z^{(-1)}\) be the operator obtained by interchanging \(\rho \) and \(\tau \). This is

$$\begin{aligned} \langle \nabla _Z^{(-1)}d_X\tau (P),d_Y\tau (P)\rangle _P= & {} \mathbb E_\mu \left[ d_X\rho (P)\right] \,d_Zd_Y\tau (P)\\= & {} \langle d_X\tau (P) ,d_Zd_Y\tau (P)\rangle _P. \end{aligned}$$
(13)

This shows that \(\nabla _Z^{(-1)}\) is the adjoint of \(d_Z\) with respect to g. In addition one has

$$\begin{aligned} \langle \nabla _Z^{(1)}d_X\tau (P),d_Y\tau (P)\rangle _P +\langle d_X\tau (P),\nabla _Z^{(-1)}d_Y\tau (P)\rangle _P =d_Z\, g_{XY}(P). \end{aligned}$$
(14)

The latter expression shows that the connections \(\nabla ^{(1)}\) and \(\nabla ^{(-1)}\) are dual to each other with respect to g. The alpha-family of connections is then obtained by linear interpolation with \(\alpha \in [-1,1]\)

$$\begin{aligned} \nabla ^{(\alpha )}_Z=\frac{1+\alpha }{2}\nabla ^{(1)}_Z+\frac{1-\alpha }{2}\nabla ^{(-1)}_Z , \end{aligned}$$
(15)

such that the covariant derivatives \(\nabla ^{(\alpha )}\) and \(\nabla ^{(-\alpha )}\) are mutually dual. In particular, \(\nabla ^{(0)}\) is self-dual and therefore coincides with the Levi-Civita connection. The family of \(\alpha \)-connections (15) is induced by the divergence function \(\mathscr {D}^{(\alpha )}_{f, \rho }(P,Q)\) given by (9), with corresponding \(\alpha \)-values. Furthermore, upon switching \(\rho \leftrightarrow \tau \) in the divergence function, the designation of 1-connection vs (\(-1\))-connection also switches.

From (13) it is clear that the covariant derivative \(\nabla _Z^{(-1)}\) vanishes on the tangent plane when

$$\begin{aligned} \langle d_X\tau (P) ,d_Zd_Y\tau (P)\rangle _P=0 \quad \text{ for } \text{ all } X,Y\in T_P\mathbb M. \end{aligned}$$
(16)

If this holds for all P in \(\mathbb M\) then the \(\nabla ^{(-1)}\)-geometry is flat. This implies that the dual geometry \(\nabla ^{(1)}\) is also flat; see Theorem 3.3 of [1]. The interpretation of (16) is that all second derivatives \(d_Zd_Y\tau (P)\) are orthogonal to the tangent plane.

7 Parametric Models

The previous sections deal with the geometry of arbitrary manifolds consisting of random variables, without caring whether they possess special properties. Now parametric models with a Hessian metric g are considered.

From here on the random variables of the manifold \(\mathbb M\) are probability distribution functions \(p^\theta \), labeled with coordinates \(\theta \) belonging to some open convex subset U of \(\mathbb R^n\). The manifold is assumed to be differentiable. In particular, the \(\theta ^i\) are covariant coordinates and the assumption holds that the derivatives \(\partial _ip^\theta \equiv \partial p^\theta /\partial \theta ^i\) form a basis for the tangent plane \(T_\theta \mathbb M\equiv T_{p^\theta }\mathbb M\). The simplifications induced by this setting are that the tangent planes are finite-dimensional and that the dual coordinates belong again to \(\mathbb R^n\). For general Banach manifolds both properties need not hold. The assumptions imply that the metric tensor

$$\begin{aligned} g_{ij}(\theta )=\langle \partial _i\tau (p^\theta ),\partial _j\tau (p^\theta )\rangle _\theta \nonumber \end{aligned}$$

is a strictly positive-definite matrix.

The metric g of the manifold \(\mathbb M\) is said to be Hessian if there exists a strictly convex function \(\Phi (\theta )\) with the property that \(g_{ij}(\theta )=\partial _i\partial _j\Phi (\theta )\). See for instance [17]. Let \(\Psi (\eta )\) denote the convex dual of \(\Phi (\theta )\). This is

$$\begin{aligned} \Psi (\eta )=\sup _{\theta }\{\langle \eta ,\theta \rangle -\Phi (\theta ):\,\theta \in U\}. \nonumber \end{aligned}$$

Let \(U^*\) denote the subset of \(\mathbb R^n\) of \(\eta \) for which the maximum is reached at some \(\theta \) in U. This \(\theta \) is unique and defines a bijection \(\theta \mapsto \eta \) between U and \(U^*\). These \(\eta \) are dual coordinates for the manifold \(\mathbb M\). Conversely [11], if there exist coordinates \(\eta _i\) for which \(g_{ij}(\theta )=\partial _j\eta _i\) then the rho-tau metric tensor g is Hessian.
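A one-parameter sketch of these Legendre relations, with the hypothetical potential \(\Phi (\theta )=\log (1+e^\theta )\) (the Bernoulli log-partition): the dual coordinate is \(\eta =\Phi '(\theta )\), the metric is \(g=\Phi ''(\theta )=\partial \eta /\partial \theta \), and the supremum defining \(\Psi \) is attained at \(\theta \) itself.

```python
import math

# Hypothetical Hessian potential: Phi(theta) = log(1 + exp(theta)).
Phi = lambda t: math.log(1.0 + math.exp(t))
eta = lambda t: math.exp(t) / (1.0 + math.exp(t))     # eta = Phi'(theta)

t0, h = 0.4, 1e-4
g = (Phi(t0 + h) - 2 * Phi(t0) + Phi(t0 - h)) / h ** 2   # Hessian of Phi
deta = (eta(t0 + h) - eta(t0 - h)) / (2 * h)              # d eta / d theta
err1 = abs(g - deta)

# Psi(eta) = sup_theta {eta*theta - Phi(theta)}; the closed form is the
# negative binary entropy eta*log(eta) + (1 - eta)*log(1 - eta).
e0 = eta(t0)
Psi_closed = e0 * math.log(e0) + (1 - e0) * math.log(1 - e0)
Psi_sup = max(e0 * t - Phi(t) for t in (t0 + k * 1e-3 for k in range(-2000, 2001)))
err2 = abs(Psi_closed - Psi_sup)
print(err1, err2)
```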

The condition (16) for \(\nabla ^{(-1)}\) to vanish can now be written as

$$\begin{aligned} \langle \partial _i\tau (p^\theta ),\partial _k\partial _j\tau (p^\theta )\rangle _\theta =0, \quad \text{ for } \text{ all } \theta \in U \text{ and } \text{ for } \text{ all } i,j,k. \end{aligned}$$
(17)

Theorem 2

Assume that the \(\theta ^i\) are affine coordinates such that \(\nabla ^{(-1)}=0\). Then

  1. the metric tensor g is Hessian;

  2. the \(\eta _i\) are affine coordinates for the \(\nabla ^{(1)}\)-geometry.

Proof

(1) The metric tensor (12) becomes

$$ g_{ij}(p^\theta )=\langle \partial _i\tau (p^\theta ),\partial _j\tau (p^\theta )\rangle _\theta =\mathbb E_\mu \left( \partial _i\tau (p^\theta )\right) \partial _j\rho (p^\theta ) =\mathbb E_\mu \left( \partial _j\tau (p^\theta )\right) \partial _i\rho (p^\theta ).$$

This implies

$$ \partial _kg_{ij}(p^\theta )=\mathbb E_\mu \left( \partial _k\partial _i\tau (p^\theta )\right) \partial _j\rho (p^\theta ) +\mathbb E_\mu \left( \partial _i\tau (p^\theta )\right) \partial _k\partial _j\rho (p^\theta ),$$

but also

$$ \partial _kg_{ij}(p^\theta )= \mathbb E_\mu \left( \partial _k\partial _j\tau (p^\theta )\right) \partial _i\rho (p^\theta ) +\mathbb E_\mu \left( \partial _j\tau (p^\theta )\right) \partial _k\partial _i\rho (p^\theta ).$$

These equations simplify by means of (17). The result is

$$ \partial _kg_{ij}(p^\theta )= \mathbb E_\mu \left( \partial _i\tau (p^\theta )\right) \partial _k\partial _j\rho (p^\theta ) =\mathbb E_\mu \left( \partial _j\tau (p^\theta )\right) \partial _k\partial _i\rho (p^\theta ).$$

This implies that \(\partial _k g_{ij}(\theta )=\partial _i g_{kj}(\theta )\). Hence there exist functions \(\eta _j(\theta )\) such that \(g_{ij}(\theta )=\partial _i\eta _j(\theta )\). As remarked above, it is proved in [11] that this suffices to conclude that the metric g is Hessian.

(2) Let us show that

$$\begin{aligned} \eta (t)=(1-t)\eta ^{(1)}+t\eta ^{(2)} \end{aligned}$$
(18)

is a solution of the Euler-Lagrange equations

$$\begin{aligned} \frac{\mathrm{d}^2\,}{\mathrm{d}t^2}\theta ^i +\Gamma ^i_{km} \left( \frac{\mathrm{d}\,}{\mathrm{d}t}\theta ^k\right) \left( \frac{\mathrm{d}\,}{\mathrm{d}t}\theta ^m\right) =0. \end{aligned}$$
(19)

Here, the \(\Gamma ^i_{km}\) are the coefficients of the connection \(\Gamma ^{(1)}\) induced by the \(\nabla ^{(1)}\)-geometry. They follow from

$$\begin{aligned} \Gamma _{ij,k}=\partial _ig_{jk}(\theta ). \end{aligned}$$
(20)

One has

$$\begin{aligned} \frac{\mathrm{d}\,}{\mathrm{d}t}\theta ^i =\frac{\partial \theta ^i}{\partial \eta _j}\frac{\mathrm{d}\eta _j}{\mathrm{d}t} =g^{ij}(\theta )\left[ \eta ^{(2)}_j-\eta ^{(1)}_j\right] \nonumber \end{aligned}$$

and

$$\begin{aligned} \frac{\mathrm{d}^2\,}{\mathrm{d}t^2}\theta ^i= & {} \frac{\mathrm{d}\,}{\mathrm{d}t}g^{ij}(\theta )\left[ \eta ^{(2)}_j-\eta ^{(1)}_j\right] \\= & {} \left[ \partial _kg^{ij}(\theta )\right] \frac{\mathrm{d}\theta ^k}{\mathrm{d}t}\left[ \eta ^{(2)}_j-\eta ^{(1)}_j\right] \\= & {} \left[ \partial _kg^{ij}(\theta )\right] g^{kl}(\theta ) \left[ \eta ^{(2)}_l-\eta ^{(1)}_l\right] \left[ \eta ^{(2)}_j-\eta ^{(1)}_j\right] \\= & {} \left[ \partial _kg^{ij}(\theta )\right] g_{jm}(\theta ) \left( \frac{\mathrm{d}\,}{\mathrm{d}t}\theta ^k\right) \left( \frac{\mathrm{d}\,}{\mathrm{d}t}\theta ^m\right) . \nonumber \end{aligned}$$

The l.h.s. of (19) becomes

$$\begin{aligned} \text{ l.h.s. } = \left\{ \left[ \partial _kg^{ij}(\theta )\right] g_{jm}(\theta ) +\Gamma ^i_{km} \right\} \left( \frac{\mathrm{d}\,}{\mathrm{d}t}\theta ^k\right) \left( \frac{\mathrm{d}\,}{\mathrm{d}t}\theta ^m\right) . \nonumber \end{aligned}$$

This vanishes because (20) implies

$$\begin{aligned} \Gamma ^i_{km}=-\left[ \partial _kg^{ij}(\theta )\right] g_{jm}(\theta ). \nonumber \end{aligned}$$

   \(\square \)

It is important to realize that the discussion in this section is generic for parametric models, without assuming particular parametric families.

8 The Deformed Exponential Family

A repeated measurement of n independent random variables \(F_1,\ldots ,F_n\) results in a joint probability distribution \(\pi (\zeta _1,\ldots ,\zeta _n)\), which describes the probability that the measured values equal \(\zeta \). More generally, the model can be taken to be a deformed exponential family, obtained by using a deformed exponential function \(\exp _\phi \). Following [10], a deformed logarithm \(\log _\phi \) is defined by

$$\begin{aligned} \log _\phi (u)=\int _1^u\mathrm{d}v\,\frac{1}{\phi (v)}, \end{aligned}$$
(21)

where \(\phi (v)\) is strictly positive and integrable on the open interval \((0,+\infty )\). The deformed exponential function \(\exp _\phi (u)\) is the inverse function of \(\log _\phi (u)\). It is defined on the range \(\mathscr {R}\) of \(\log _\phi (u)\), but is extended with the value 0 for u below \(\mathscr {R}\) and with the value \(+\infty \) for u above \(\mathscr {R}\).
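Definition (21) can be illustrated with the hypothetical Tsallis-type deformation \(\phi (v)=v^q\), for which \(\log _\phi (u)=(u^{1-q}-1)/(1-q)\) and \(\exp _\phi (u)=[1+(1-q)u]^{1/(1-q)}\) on the appropriate domain:

```python
# Hypothetical deformation phi(v) = v**q (Tsallis type), q != 1.
q = 0.5

def log_phi(u, n=20_000):
    # Midpoint-rule evaluation of the integral (21).
    total, step = 0.0, (u - 1.0) / n
    for k in range(n):
        v = 1.0 + (k + 0.5) * step
        total += step / v ** q
    return total

def exp_phi(u):
    # Inverse of log_phi, available in closed form for this phi.
    return (1.0 + (1.0 - q) * u) ** (1.0 / (1.0 - q))

u = 3.0
closed = (u ** (1 - q) - 1) / (1 - q)      # closed-form q-logarithm
err1 = abs(log_phi(u) - closed)
err2 = abs(exp_phi(closed) - u)            # exp_phi inverts log_phi
print(err1, err2)
```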

The expression for the probability density function then becomes

$$\begin{aligned} p^\theta (x)=\exp _\phi \left( \sum _{k=1}^n\theta ^kF_k(x)-\alpha (\theta )\right) . \end{aligned}$$
(22)

The function \(\alpha (\theta )\) serves to normalize \(p^\theta \) and is assumed to exist within the open convex domain \(U\subset \mathbb R^n\) in which the model is defined. One can show [10] that it is a convex function. However, in general it does not coincide with the potential \(\Phi (\theta )\) of the previous section. The explanation is that escort probabilities come into play. Indeed, from

$$\begin{aligned} 0=\partial _i\mathbb E_\mu p^\theta =\mathbb E_\mu \phi (p^\theta )\left[ F_i-\partial _i\alpha \right] \nonumber \end{aligned}$$

it follows that

$$\begin{aligned} \partial _i\alpha =\tilde{\mathbb E}_\theta F_i, \nonumber \end{aligned}$$

with the escort expectation \(\tilde{\mathbb E}_\theta \) defined by

$$\begin{aligned} \tilde{\mathbb E}_\theta Y=\frac{\mathbb E_\mu \phi (p^\theta )Y}{\mathbb E_\mu \phi (p^\theta )}. \nonumber \end{aligned}$$

Only in the non-deformed case, when \(\phi (u)=u\), the escort \(\tilde{\mathbb E}_\theta \) coincides with the model expectation \(\mathbb E_\theta \). Then the dual coordinates \(\eta _i\) satisfy \(\eta _i=\mathbb E_\theta F_i=\partial _i\alpha (\theta )\).
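The escort expectation can be sketched for a discrete distribution with the hypothetical deformation \(\phi (u)=u^q\); for \(\phi (u)=u\) it reduces to the model expectation:

```python
# Hypothetical deformation phi(u) = u**q on a three-point space,
# counting measure; p plays the role of p^theta and F of a variable F_i.
q = 0.7
phi = lambda u: u ** q
p = [0.2, 0.5, 0.3]
F = [1.0, 2.0, 5.0]

escort = sum(phi(pi) * fi for pi, fi in zip(p, F)) / sum(phi(pi) for pi in p)
model = sum(pi * fi for pi, fi in zip(p, F))

# Undeformed case phi(u) = u: escort and model expectations coincide.
escort1 = sum(pi * fi for pi, fi in zip(p, F)) / sum(p)
gap = abs(escort1 - model)
print(escort, model, gap)
```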

In general, the rho-tau metric tensor g of the deformed exponential model is not Hessian. We have the following theorem (see [12]).

Theorem 3

With respect to the (deformed) \(\phi \)-exponential family \(p^\theta \) obeying (22), the rho-tau metric tensor g is

  (a) conformal to Hessian if

    $$ \rho ^\prime \tau ^\prime \phi = \phi ^\prime ;$$

  (b) Hessian if

    $$ \rho '\tau '\phi =\text{ id }. $$

In case (a), the rho-tau metric tensor is conformally equivalent with the metric tensor obtained by taking the Hessian of the normalization function \(\alpha \); in case (b) the potential \(\Phi (\theta )\) is constructed in [10]. However, a gauge freedom still remains. The question is then whether one can choose \(\rho \) and \(\tau \) so that condition (16) for the dually flat geometry is satisfied. A sufficient condition is that \(\rho =\text{ id }\) and \(\tau =\log _\phi \). This is the rho-affine gauge. In this gauge both the \(\theta ^i\) and the \(\eta _i\) coordinates are affine and the model has a dually flat structure.