3.1 Introduction

In information geometry, an exponential family is a useful statistical model, applied in various fields of the statistical sciences (cf. [1]). For example, the set of Gaussian distributions is an exponential family. It is known that an exponential family can be naturally regarded as a Hessian manifold [28], which is also called a dually flat space [1] or a flat statistical manifold [12]. A pair of dually flat affine connections plays an essential role in the geometric theory of statistical inference. In addition, a Hessian manifold has an asymmetric squared-distance-like function, called the canonical divergence. On an exponential family, the canonical divergence coincides with the Kullback-Leibler divergence, or relative entropy. (See Sect. 3.3.)

A deformed exponential family is a generalization of exponential families, introduced in anomalous statistical physics [22]. (See also [23, 32] and [33].) A deformed exponential family naturally carries two kinds of dualistic Hessian structures, and such geometric structures have been studied independently in machine learning theory [21], statistical physics [3, 26], etc. For example, a \(q\)-exponential family is a typical example of deformed exponential families. One of the Hessian structures on a \(q\)-exponential family is related to the geometry of \(\beta \)-divergences (or density power divergences [5]). The other Hessian structure is related to the geometry of \(\alpha \)-divergences. (In the \(q\)-exponential case, these geometries are studied in [18].) In addition, conformal structures of statistical manifolds play important roles in the geometry of deformed exponential families.

In this paper, we summarize such Hessian structures and conformal structures on deformed exponential families. We then construct a generalized relative entropy from the viewpoint of estimating functions. As an application, we consider a generalization of the independence of random variables, and elucidate the geometry of the maximum \(q\)-likelihood estimator. This paper is based on the proceedings paper [19].

3.2 Preliminaries

In this paper, we assume that all objects are smooth and that a manifold \(M\) is an open domain in \({{\varvec{R}}}^n\).

Let \((M,h)\) be a semi-Riemannian manifold, that is, \(h\) is assumed to be nondegenerate but not necessarily positive definite (e.g. the Lorentzian metric in relativity). Let \(\nabla \) be an affine connection on \(M\). We define the dual connection \(\nabla ^*\) of \(\nabla \) with respect to \(h\) by

$$ Xh(Y,Z) = h(\nabla _XY,Z) + h(Y,\nabla ^*_XZ), $$

where \(X,Y\) and \(Z\) are arbitrary vector fields on \(M\). It is easy to check that \((\nabla ^*)^* = \nabla \).

For an affine connection \(\nabla \), we define the curvature tensor field \(R\) and the torsion tensor field \(T\) by

$$\begin{aligned} R(X,Y)Z&:= \nabla _X\nabla _YZ - \nabla _Y\nabla _XZ - \nabla _{[X,Y]}Z, \\ T(X,Y)&:= \nabla _XY - \nabla _YX - [X,Y], \end{aligned}$$

where \([X,Y] := XY -YX\). We say that \(\nabla \) is curvature-free if \(R\) vanishes everywhere on \(M\), and torsion-free if \(T\) vanishes everywhere.

For a pair of dual affine connections, the following proposition holds (cf. [16]).

Proposition 1

Consider the conditions below:

  1.

    \(\nabla \) is torsion-free.

  2.

    \(\nabla ^*\) is torsion-free.

  3.

    \(\nabla ^{(0)} = (\nabla + \nabla ^*)/2\) is the Levi-Civita connection with respect to \(h\).

  4.

    \(\nabla h\) is totally symmetric, where \(\nabla h\) is the \((0,3)\)-tensor field defined by

    $$ (\nabla _X h)(Y,Z) := Xh(Y,Z) - h(\nabla _XY,Z) - h(Y,\nabla _XZ). $$

If any two of the above conditions hold, then the remaining conditions also hold.

From now on, we assume that an affine connection \(\nabla \) is torsion-free.

We say that an affine connection \(\nabla \) is flat if \(\nabla \) is curvature-free. For a flat affine connection \(\nabla \), there locally exists a coordinate system \(\{\theta ^i\}\) on \(M\) such that the connection coefficients \(\{\varGamma _{ij}^{\nabla \ k}\} \ (i,j,k = 1,\ldots ,n)\) of \(\nabla \) vanish on the coordinate neighbourhood. We call such a coordinate system \(\{\theta ^i\}\) an affine coordinate system.

Let \((M,h)\) be a semi-Riemannian manifold, and let \(\nabla \) be a flat affine connection on \(M\). We say that the pair \((\nabla ,h)\) is a Hessian structure on \(M\) if there exists a function \(\psi \), at least locally, such that \(h = \nabla d\psi \) [28]. In coordinate form, the following formula holds:

$$ h_{ij}(p(\theta )) = \frac{\partial ^2}{\partial \theta ^i\partial \theta ^j} \psi (p(\theta )), $$

where \(p\) is an arbitrary point in \(M\) and \(\{\theta ^i\}\) is a \(\nabla \)-affine coordinate system around \(p\). Under the same assumption, we call the triplet \((M,\nabla ,h)\) a Hessian manifold. For a Hessian manifold \((M,\nabla ,h)\), we define a totally symmetric \((0,3)\)-tensor field \(C\) by \(C := \nabla h\). We call \(C\) the cubic form for \((M,\nabla ,h)\).

For a semi-Riemannian manifold \((M,h)\) with a torsion-free affine connection \(\nabla \), the triplet \((M,\nabla ,h)\) is said to be a statistical manifold if \(\nabla h\) is totally symmetric [12]. Originally, the triplet \((M,g,C)\) is called a statistical manifold [14], where \((M,g)\) is a Riemannian manifold and \(C\) is a totally symmetric \((0,3)\)-tensor field on \(M\). From Proposition 1, these definitions are essentially equivalent. In fact, for a semi-Riemannian manifold \((M,h)\) with a totally symmetric \((0,3)\)-tensor field \(C\), we can define mutually dual torsion-free affine connections \(\nabla \) and \(\nabla ^*\) by

$$\begin{aligned} h(\nabla _XY,Z)&:= h(\nabla ^{(0)}_XY,Z) - \frac{1}{2}C(X,Y,Z), \end{aligned}$$
(3.1)
$$\begin{aligned} h(\nabla ^*_XY,Z)&:= h(\nabla ^{(0)}_XY,Z) + \frac{1}{2}C(X,Y,Z), \end{aligned}$$
(3.2)

where \(\nabla ^{(0)}\) is the Levi-Civita connection with respect to \(h\). In this case, \(\nabla h\) and \(\nabla ^*h\) are totally symmetric. Hence \((M,\nabla ,h)\) and \((M,\nabla ^*,h)\) are statistical manifolds.

A triplet \((M,\nabla ,h)\) is a flat statistical manifold if and only if it is a Hessian manifold (cf. [28]). Suppose that \(R\) and \(R^*\) are curvature tensors of \(\nabla \) and \(\nabla ^*\), respectively. Then we have

$$ h(R(X,Y)Z,V) = -h(Z,R^*(X,Y)V). $$

Hence the condition that the triplet \((M,\nabla ,h)\) is a Hessian manifold is equivalent to that the quadruplet \((M,h,\nabla ,\nabla ^*)\) is a dually flat space [1].

For a Hessian manifold \((M,\nabla ,h)\), we suppose that \(\{\theta ^i\}\) is a \(\nabla \)-affine coordinate system on \(M\). Then there exists a \(\nabla ^*\)-affine coordinate system \(\{\eta _i\}\) such that

$$ h\left( \frac{\partial }{\partial \theta ^i}, \frac{\partial }{\partial \eta _j} \right) = \delta ^i_j. $$

We call \(\{\eta _i\}\) the dual coordinate system of \(\{\theta ^i\}\) with respect to \(h\).

Proposition 2

Let \((M,\nabla ,h)\) be a Hessian manifold. Suppose that \(\{\theta ^i\}\) is a \(\nabla \)-affine coordinate system, and \(\{\eta _i\}\) is the dual coordinate system of \(\{\theta ^i\}\). Then there exist functions \(\psi \) and \(\phi \) on \(M\) such that

$$\begin{aligned}&\frac{\partial \psi }{\partial \theta ^i} = \eta _i, \quad \frac{\partial \phi }{\partial \eta _i} = \theta ^i, \quad \psi (p) + \phi (p) - \sum _{i=1}^{n}\theta ^i(p)\eta _i(p) = 0, \quad (p \in M), \\&h_{ij} = \frac{\partial ^2 \psi }{\partial \theta ^i \partial \theta ^j}, \quad h^{ij} = \frac{\partial ^2 \phi }{\partial \eta _i \partial \eta _j}, \nonumber \end{aligned}$$
(3.3)

where \((h_{ij})\) is the component matrix of a semi-Riemannian metric \(h\) with respect to \(\{\theta ^i\}\), and \((h^{ij})\) is the inverse matrix of \((h_{ij})\). Moreover,

$$\begin{aligned} C_{ijk} = \frac{\partial ^3 \psi }{\partial \theta ^i\partial \theta ^j\partial \theta ^k} \end{aligned}$$
(3.4)

is the cubic form of \((M,\nabla ,h)\).

For a proof, see [1] and [28]. The functions \(\psi \) and \(\phi \) are called the \(\theta \)-potential and the \(\eta \)-potential, respectively. From the above proposition, the Hessians of the \(\theta \)-potential and the \(\eta \)-potential coincide with the semi-Riemannian metric \(h\):

$$\begin{aligned} \frac{\partial \eta _i}{\partial \theta ^j} = \frac{\partial ^2\psi }{\partial \theta ^i\partial \theta ^j} = h_{ij}, \qquad \frac{\partial \theta ^i}{\partial \eta _j} = \frac{\partial ^2\phi }{\partial \eta _i\partial \eta _j} = h^{ij}. \end{aligned}$$
(3.5)

In addition, we can recover the original flat connection \(\nabla \) and its dual \(\nabla ^*\) from the potential function \(\psi \). From Eq. (3.4), we obtain the cubic form of the Hessian manifold \((M,\nabla ,h)\), and then the two affine connections \(\nabla \) and \(\nabla ^*\) by Eqs. (3.1), (3.2) and (3.4).

Under the same assumptions as in Proposition 2, we define a function \(D\) on \(M \times M\) by

$$ D(p,r) := \psi (p) + \phi (r) - \sum _{i=1}^{n}\theta ^i(p)\eta _i(r), \quad (p, r \in M). $$

We call \(D\) the canonical divergence of \((M,\nabla ,h)\). The definition is independent of the choice of affine coordinate system. The canonical divergence is an asymmetric squared-distance-like function on \(M\). In particular, the canonical divergence \(D\) is non-negative if the metric \(h\) is positive definite. However, since we only assumed that \(h\) is a semi-Riemannian metric, \(D\) can take negative values. (cf. [12] and [15].)
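
To make the definition concrete, the following minimal sketch (in Python; the one-dimensional potential \(\psi (\theta ) = \theta ^4/4 + \theta ^2/2\) is an arbitrary strictly convex choice for illustration, not taken from the text) computes \(D(p,r)\) from the potentials and checks that it agrees with the equivalent Bregman form \(\psi (\theta _p) - \psi (\theta _r) - \psi '(\theta _r)(\theta _p - \theta _r)\), and that \(D(p,p) = 0\).

```python
# A minimal sketch with a hypothetical 1-d potential: psi is strictly convex,
# eta = psi'(theta), and phi = theta*eta - psi is the Legendre transform.
psi  = lambda t: t**4 / 4 + t**2 / 2
dpsi = lambda t: t**3 + t                  # eta-coordinate of the point theta = t

def canonical_divergence(tp, tr):
    """D(p,r) = psi(p) + phi(r) - theta(p) * eta(r), in theta-coordinates."""
    eta_r = dpsi(tr)
    phi_r = tr * eta_r - psi(tr)           # eta-potential via the Legendre relation
    return psi(tp) + phi_r - tp * eta_r

def bregman(tp, tr):
    """Equivalent Bregman form of the canonical divergence."""
    return psi(tp) - psi(tr) - dpsi(tr) * (tp - tr)

tp, tr = 0.7, -0.3
print(canonical_divergence(tp, tr), bregman(tp, tr))   # identical values
print(canonical_divergence(tp, tp))                    # D(p,p) = 0
```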

We remark that the canonical divergence induces the original Hessian manifold \((M,\nabla ,h)\) by Eguchi’s relation [7]. Suppose that \(D\) is a function on \(M \times M\). We define a function on \(M\) by the following formula:

$$ D[X_1,\ldots ,X_i|Y_1,\ldots ,Y_j](p) := (X_1)_p \cdots (X_i)_p (Y_1)_r \cdots (Y_j)_r D(p,r)|_{p=r}, $$

where \(X_1,\ldots ,X_i\) and \(Y_1,\ldots ,Y_j\) are vector fields on \(M\). We say that \(D\) is a contrast function on \(M \times M\) if

$$\begin{aligned} 1.&\ D[\ |\ ](p) = D(p,p) = 0, \nonumber \\ 2.&\ D[X|\ ](p) = D[\ |X](p) = 0, \nonumber \\ 3.&\ h(X,Y) := -D[X|Y] \\&\quad \text{ is a semi-Riemannian metric on } M. \nonumber \end{aligned}$$
(3.6)

For a contrast function \(D\) on \(M \times M\), we define a pair of affine connections by

$$\begin{aligned} h(\nabla _XY,Z)&= -D[XY|Z], \\ h(Y,\nabla ^*_XZ)&= -D[Y|XZ]. \end{aligned}$$

By differentiating Eq. (3.6), one can check that the two affine connections \(\nabla \) and \(\nabla ^*\) are mutually dual with respect to \(h\). We can also check that \(\nabla \) and \(\nabla ^*\) are torsion-free, and that \(\nabla h\) and \(\nabla ^* h\) are totally symmetric. Hence the triplets \((M,\nabla ,h)\) and \((M,\nabla ^*,h)\) are statistical manifolds. We call \((M,\nabla ,h)\) the statistical manifold induced from the contrast function \(D\). If \((M,\nabla ,h)\) is a Hessian manifold, we say that \((M,\nabla ,h)\) is the Hessian manifold induced from \(D\).

Proposition 3

Suppose that \(D\) is the canonical divergence on a Hessian manifold \((M,\nabla ,h)\). Then \(D\) is a contrast function on \(M\times M\) which induces the original Hessian manifold \((M,\nabla ,h)\).

Proof

From the definition and Eq. (3.3), we have \(D[\ |\ ] = 0\) and \(D[X|\ ] = D[\ |X] =0\). Let \(\{\theta ^i\}\) be a \(\nabla \)-affine coordinate system and \(\{\eta _i\}\) the dual affine coordinate system of \(\{\theta ^i\}\). Set \(\partial _i = \partial /\partial \theta ^i\). From Eqs. (3.3) and (3.5), we have

$$\begin{aligned} D[\partial _i|\partial _j](p)&= (\partial _i)_p(\partial _j)_rD(p,r)|_{p=r} \ = \ (\partial _j)_r\left\{ \eta _i(p) - \eta _i(r)\right\} |_{p=r} \\&= -(\partial _j)_r\eta _i(r)|_{p=r} \ = \ -h_{ij}(p). \end{aligned}$$

This implies that the canonical divergence \(D\) is a contrast function on \(M \times M\). Induced affine connections are given by

$$\begin{aligned} \varGamma _{ij,k}&= -D[\partial _i\partial _j|\partial _k] \ = \ -(\partial _i)_p(\partial _k)_r \left\{ \eta _j(p) - \eta _j(r)\right\} |_{p=r} \\&= (\partial _i)_p(\partial _k)_r\eta _j(r)|_{p=r} \ = \ 0, \\ \varGamma ^*_{ik,j}&= -D[\partial _j|\partial _i\partial _k] \ = \ -(\partial _i)_r(\partial _k)_r \left\{ \eta _j(p) - \eta _j(r)\right\} |_{p=r} \\&= (\partial _i)_r(\partial _k)_r\eta _j(r)|_{p=r} \ = \ (\partial _i)_r(\partial _k)_r(\partial _j)_r\psi (r)|_{p=r} \\&= C_{ikj}, \end{aligned}$$

where \(\varGamma _{ij,k}\) and \(\varGamma ^*_{ik,j}\) are the Christoffel symbols of the first kind of \(\nabla \) and \(\nabla ^*\), respectively. From Eqs. (3.1) and (3.2), since \(h\) is nondegenerate, the induced affine connections coincide with the original ones of \((M,\nabla ,h)\). \(\square \)

At the end of this section, we review generalized conformal equivalence for statistical manifolds. Fix a number \(\alpha \in {{\varvec{R}}}\). We say that two statistical manifolds \((M,\nabla ,h)\) and \((M,\bar{\nabla },\bar{h})\) are \(\alpha {\text {-}}{\textit{conformally equivalent}}\) if there exists a function \(\varphi \) on \(M\) such that

$$\begin{aligned} \bar{h}(X,Y)&= e^{\varphi }h(X,Y), \\ \bar{\nabla }_X Y&= \nabla _XY - \frac{1 + \alpha }{2}h(X,Y)\mathrm {grad}_h\varphi +\frac{1 - \alpha }{2}\left\{ d\varphi (Y) \, X + d\varphi (X)\, Y\right\} , \end{aligned}$$

where \(\mathrm {grad}_h\varphi \) is the gradient vector field of \(\varphi \) with respect to \(h\), that is,

$$ h(\mathrm {grad}_h\varphi ,X) := X\varphi . $$

(The vector field \(\mathrm {grad}_h\varphi \) is often called the natural gradient of \(\varphi \) in neuroscience, etc.) We say that a statistical manifold \((M,\nabla ,h)\) is \(\alpha \)-conformally flat if it is locally \(\alpha \)-conformally equivalent to some Hessian manifold [12].

Suppose that \(D\) and \(\bar{D}\) are contrast functions on \(M \times M\). We say that \(D\) and \(\bar{D}\) are \(\alpha \) -conformally equivalent if there exists a function \(\varphi \) on \(M\) such that

$$ \bar{D}(p,r) = \exp \left[ \frac{1+\alpha }{2}\varphi (p)\right] \exp \left[ \frac{1-\alpha }{2}\varphi (r)\right] D(p,r). $$

In this case, the statistical manifolds \((M,\nabla ,h)\) and \((M,\bar{\nabla },\bar{h})\) induced from \(D\) and \(\bar{D}\), respectively, are \(\alpha \)-conformally equivalent.

Historically, conformal equivalence of statistical manifolds was introduced in the asymptotic theory of sequential estimation [27]. (See also [11].) It was later generalized in affine differential geometry (e.g. [10, 12, 13] and [17]). As we will see in Sects. 3.5 and 3.6, conformal structures on a deformed exponential family play important roles. (See also [2, 20, 24] and [25].)

3.3 Statistical Models

Let \((\varOmega , \mathcal{F}, P)\) be a probability space, that is, \(\varOmega \) is a sample space, \(\mathcal{F}\) is a completely additive class (a \(\sigma \)-algebra) on \(\varOmega \), and \(P\) is a probability measure on \(\varOmega \). Let \(\varXi \) be an open subset in \({{\varvec{R}}}^n\). We say that \(S\) is a statistical model if \(S\) is a set of probability density functions on \(\varOmega \) with parameter \(\xi = {}^t(\xi ^1,\ldots ,\xi ^n) \in \varXi \) such that

$$ S := \left\{ p(x;\xi ) \, \left| \, \int \limits _{\varOmega }p(x;\xi )dx=1, \ p(x;\xi )>0, \ \xi \in \varXi \subset {{\varvec{R}}}^n \right. \right\} \!. $$

Under suitable conditions, \(S\) can be regarded as a manifold with local coordinate system \(\{\xi ^i\}\) [1]. In particular, we assume that differentiation and integration can be interchanged. Hence the following equation holds:

$$ \int \limits _{\varOmega }\left( \frac{\partial }{\partial \xi ^i}p(x;\xi )\right) dx = \frac{\partial }{\partial \xi ^i} \int \limits _{\varOmega }p(x;\xi )dx = \frac{\partial }{\partial \xi ^i} 1 = 0. $$

For a statistical model \(S\), we define the Fisher information matrix  \(g^F(\xi ) = (g_{ij}^F(\xi ))\) by

$$\begin{aligned} g_{ij}^F(\xi )&:= \int \limits _{\varOmega } \left( \frac{\partial }{\partial \xi ^i}\log p(x;\xi )\right) \left( \frac{\partial }{\partial \xi ^j}\log p(x;\xi )\right) p(x;\xi ) \, dx \\&= E_{p}[\partial _il_{\xi }\partial _jl_{\xi }], \nonumber \end{aligned}$$
(3.7)

where \(\partial _i = \partial /\partial \xi ^i\), \(l_{\xi } = l(x;\xi ) = \log p(x;\xi )\), and \(E_{p}[f]\) is the expectation of \(f(x)\) with respect to \(p(x;\xi )\). The Fisher information matrix \(g^F\) is positive semi-definite in general. Assuming that \(g^F\) is positive definite and that all of its components are finite, \(g^F\) can be regarded as a Riemannian metric on \(S\). We call \(g^F\) the Fisher metric on \(S\). The Fisher metric \(g^F\) has the following representations:

$$\begin{aligned} g_{ij}^F(\xi )&= \int \limits _{\varOmega } \left( \frac{\partial }{\partial \xi ^i}p(x;\xi )\right) \left( \frac{\partial }{\partial \xi ^j}\log p(x;\xi )\right) \, dx \end{aligned}$$
(3.8)
$$\begin{aligned}&= \int \limits _{\varOmega }\frac{1}{p(x;\xi )} \left( \frac{\partial }{\partial \xi ^i}p(x;\xi )\right) \left( \frac{\partial }{\partial \xi ^j}p(x;\xi )\right) \, dx. \end{aligned}$$
(3.9)
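
As a quick numerical illustration, the following sketch (in Python; the Bernoulli model on \(\varOmega = \{0,1\}\) and the value of \(\xi \) are illustrative choices, not taken from the text) checks that the three representations (3.7)-(3.9) of the Fisher metric agree with the closed form \(1/(\xi (1-\xi ))\).

```python
# Bernoulli model p(x; xi) = xi^x (1 - xi)^(1 - x) on Omega = {0, 1}.
xi = 0.3
omega = [0, 1]
p  = lambda x, t: t**x * (1 - t)**(1 - x)
eps = 1e-6
dp = lambda x, t: (p(x, t + eps) - p(x, t - eps)) / (2 * eps)   # d p / d xi
dl = lambda x, t: dp(x, t) / p(x, t)                            # score function

g_37 = sum(dl(x, xi)**2 * p(x, xi) for x in omega)   # E_p[(score)^2], Eq. (3.7)
g_38 = sum(dp(x, xi) * dl(x, xi) for x in omega)     # Eq. (3.8)
g_39 = sum(dp(x, xi)**2 / p(x, xi) for x in omega)   # Eq. (3.9)
print(g_37, g_38, g_39, 1 / (xi * (1 - xi)))         # all approximately 4.7619
```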

Next, let us define an affine connection on \(S\). For a fixed \(\alpha \in {{\varvec{R}}}\), an \(\alpha \) -connection \(\nabla ^{(\alpha )}\) on \(S\) is defined by

$$\begin{aligned} \varGamma _{ij,k}^{(\alpha )}(\xi )&:= E_{p} \left[ \left( \partial _i\partial _jl_{\xi } + \frac{1-\alpha }{2} \partial _il_{\xi }\partial _jl_{\xi }\right) (\partial _kl_{\xi }) \right] , \end{aligned}$$

where \(\varGamma _{ij,k}^{(\alpha )}\) is the Christoffel symbol of the first kind of \(\nabla ^{(\alpha )}\).

We remark that \(\nabla ^{(0)}\) is the Levi-Civita connection with respect to the Fisher metric \(g^F\). The connection \(\nabla ^{(e)} := \nabla ^{(1)}\) is called the exponential connection and \(\nabla ^{(m)} := \nabla ^{(-1)}\) is called the mixture connection. The two connections \(\nabla ^{(e)}\) and \(\nabla ^{(m)}\) are expressed as follows:

$$\begin{aligned} \varGamma ^{(e)}_{ij,k}&= E_p[(\partial _i\partial _jl_{\xi })(\partial _kl_{\xi })] \ = \ \int \limits _{\varOmega }\partial _i\partial _j\log p(x;\xi ) \partial _kp(x;\xi )dx, \end{aligned}$$
(3.10)
$$\begin{aligned} \varGamma ^{(m)}_{ij,k}&= E_p[(\partial _i\partial _jl_{\xi }+\partial _il_{\xi }\partial _jl_{\xi }) (\partial _kl_{\xi })] \ = \ \int \limits _{\varOmega }\partial _i\partial _jp(x;\xi ) \partial _k\log p(x;\xi )dx. \end{aligned}$$
(3.11)

We can check that the \(\alpha \)-connection \(\nabla ^{(\alpha )}\) is torsion-free and that \(\nabla ^{(\alpha )}g^F\) is totally symmetric. These imply that \((S, \nabla ^{(\alpha )},g^F)\) forms a statistical manifold. In addition, it is known that the Fisher metric \(g^F\) and the \(\alpha \)-connection \(\nabla ^{(\alpha )}\) are independent of the choice of dominating measure on \(\varOmega \). Hence we call the triplet \((S, \nabla ^{(\alpha )},g^F)\) an invariant statistical manifold. The cubic form \(C^F\) of the invariant statistical manifold \((S, \nabla ^{(e)},g^F)\) is given by

$$ C_{ijk}^F = \varGamma _{ij,k}^{(m)} - \varGamma _{ij,k}^{(e)}. $$

A statistical model \(S_e\) is said to be an exponential family if

$$ S_e := \left\{ p(x;\theta ) \ \left| \ p(x;\theta )= \exp \left[ \sum _{i=1}^n\theta ^iF_i(x) -\psi (\theta )\right] , \ \theta \in \varTheta \subset {{\varvec{R}}}^n \right. \right\} , $$

under a choice of suitable dominating measure, where \(F_1(x),\ldots ,F_n(x)\) are functions on the sample space \(\varOmega \), \(\theta = (\theta ^1,\ldots ,\theta ^n)\) is a parameter, and \(\psi (\theta )\) is a function of \(\theta \) for normalization. The following result is well-known in information geometry [1].

Theorem 1

(cf. [1]) For an exponential family \(S_e\), the following hold:

  1.

    \((S_e,\nabla ^{(e)}, g^F)\) and \((S_e, \nabla ^{(m)}, g^F)\) are mutually dual Hessian manifolds, that is, \((S_e,g^F,\nabla ^{(e)}, \nabla ^{(m)})\) is a dually flat space.

  2.

    \(\{\theta ^i\}\) is a \(\nabla ^{(e)}\)-affine coordinate system on \(S_e\).

  3.

    For the Hessian structure \((\nabla ^{(e)}, g^F)\) on \(S_e\), \(\psi (\theta )\) is the potential of \(g^F\) and \(C^F\) with respect to \(\{\theta ^i\}\):

    $$\begin{aligned} g_{ij}^F(\theta )&= \partial _i\partial _j\psi (\theta ), \quad (\partial _i = \partial /\partial \theta ^i), \\ C_{ijk}^F(\theta )&= \partial _i\partial _j\partial _k\psi (\theta ). \end{aligned}$$
  4.

    Set the expectation of \(F_i(x)\) by \(\eta _i := E_{p}[F_i(x)]\). Then \(\{\eta _i\}\) is the dual affine coordinate system of \(\{\theta ^i\}\) with respect to \(g^F\).

  5.

    Set \(\phi (\eta ) := E_{p}[\log p(x;\theta )]\). Then \(\phi (\eta )\) is the potential of \(g^F\) with respect to \(\{\eta _i\}\).

Since \((S_e, \nabla ^{(e)}, g^F)\) is a Hessian manifold, the formulas in Proposition 2 hold.

For a statistical model \(S\), we define the Kullback-Leibler divergence (or relative entropy) by

$$\begin{aligned} D_{KL}(p,r)&:= \int \limits _{\varOmega }p(x)\log \frac{p(x)}{r(x)}dx \\&= E_p[\log p(x) - \log r(x)], \quad (p(x), r(x) \in S). \end{aligned}$$

The Kullback-Leibler divergence \(D_{KL}\) on an exponential family \(S_e\) coincides with the canonical divergence \(D\) on \((S_e,\nabla ^{(m)},g^F)\).
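
The coincidence can be checked numerically. In the sketch below (Python with NumPy; the Bernoulli family with \(F(x) = x\), \(\psi (\theta ) = \log (1+e^{\theta })\), \(\eta = \partial \psi /\partial \theta \) and \(\phi (\eta ) = E_p[\log p]\) is an illustrative choice), the canonical divergence \(\phi (\eta _p) + \psi (\theta _r) - \eta _p\theta _r\) of \((S_e,\nabla ^{(m)},g^F)\) equals \(D_{KL}(p,r)\).

```python
import numpy as np

sigmoid = lambda t: 1 / (1 + np.exp(-t))
psi = lambda t: np.log1p(np.exp(t))                     # theta-potential
phi = lambda e: e * np.log(e) + (1 - e) * np.log1p(-e)  # eta-potential E_p[log p]

def kl(e_p, e_r):
    """Kullback-Leibler divergence between Bernoulli(e_p) and Bernoulli(e_r)."""
    return e_p * np.log(e_p / e_r) + (1 - e_p) * np.log((1 - e_p) / (1 - e_r))

th_p, th_r = 0.8, -0.5
e_p, e_r = sigmoid(th_p), sigmoid(th_r)                 # eta = d psi / d theta

# canonical divergence of (S_e, nabla^(m), g^F) in mixed coordinates
D = phi(e_p) + psi(th_r) - e_p * th_r
print(D, kl(e_p, e_r))                                  # the two values agree
```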

We define an \({{\varvec{R}}}^n\)-valued function \(s(x;\xi ) = (s^1(x;\xi ),\ldots ,s^n(x;\xi ))^T\) by

$$ s^i(x;\xi ) := \frac{\partial }{\partial \xi ^i} \log p(x;\xi ). $$

We call \(s(x;\xi )\) the score function of \(p(x;\xi )\) with respect to \(\xi \). In information geometry, \(s^i(x;\xi )\) is called the \(e{\text {-}}{\textit{(exponential) representation}}\) of \(\partial /\partial \xi ^i\), and \(\partial /\partial \xi ^ip(x;\xi )\) is called the \(m{\text {-}}{\textit{(mixture) representation}}\). The duality of \(e\)- and \(m\)-representations is important. In fact, Eq. (3.8) implies that the Fisher metric \(g^F\) is nothing but an \(L^2\) inner product of \(e\)- and \(m\)-representations.

The Kullback-Leibler divergence can be constructed as follows. We define the cross entropy \(d_{\textit{KL}}(p,r)\) by

$$ d_{\textit{KL}}(p,r) := - E_{p}[\log r(x)]. $$

A cross entropy \(d_{KL}(p,r)\) gives a bias of information \(-\log r(x)\) with respect to \(p(x)\). A cross entropy is also called a yoke on \(S\) [4]. Intuitively, a yoke measures a dissimilarity of two probability density functions on \(S\). We should also note that the cross entropy is obtained by taking the expectation with respect to \(p(x)\) of the integrated score function at \(r(x)\). Then we have the Kullback-Leibler divergence by

$$\begin{aligned} D_{\textit{KL}}(p,r)&= - d_{\textit{KL}}(p,p) + d_{\textit{KL}}(p,r) \\&= E_p[\log p(x) - \log r(x)]. \end{aligned}$$

The Kullback-Leibler divergence \(D_{\textit{KL}}\) is a normalized yoke on \(S\), which satisfies \(D_{\textit{KL}}(p,p) = 0\). This argument suggests how to construct divergence functions. Once a function like the cross entropy is defined, we can construct divergence functions in the same way.

3.4 The Deformed Exponential Family

In this section, we review the deformed exponential family. For more details, see [3, 22, 23] and [26]. The geometry of deformed exponential families is related to so-called \(U\)-geometry [21].

Let \(\chi \) be a strictly increasing function from \((0,\infty )\) to \((0,\infty )\). We define a deformed logarithm function (or a \(\chi {\text {-}}{\textit{logarithm}}\;{\textit{function}}\)) by

$$\begin{aligned} \log _{\chi } (s)&:= \int \limits _1^s \frac{1}{\chi (t)} \, dt. \end{aligned}$$

We remark that \(\log _{\chi }(s)\) is strictly increasing and satisfies \(\log _{\chi }(1) = 0\). The domain and the target of \(\log _{\chi }(s)\) depend on the function \(\chi (t)\). Set \(U= \{s \in (0,\infty )\, | \, |\log _{\chi }(s)| < \infty \}\) and \(V = \{\log _{\chi }(s) \, | \, s \in U\}\). Then \(\log _{\chi }(s)\) is a function from \(U\) to \(V\). We also remark that the deformed logarithm is usually called the \(\phi \)-logarithm [23]. However, we use \(\phi \) as the dual potential on a Hessian manifold.

A deformed exponential function (or a \(\chi {\text {-}}{\textit{exponential function}}\)) is defined by the inverse of the deformed logarithm function \(\log _{\chi }(s)\):

$$\begin{aligned} {\exp }_{\chi } (t)&:= 1 + \int \limits _0^t \lambda (s)ds, \end{aligned}$$

where \(\lambda (s)\) is defined by the relation \(\lambda (\log _{\chi }(s)) := \chi (s)\).

When \(\chi (s)\) is a power function \(\chi (s) = s^q, (q>0, q \ne 1)\), the deformed logarithm and the deformed exponential are given by

$$\begin{aligned}&\log _q(s) := \frac{s^{1-q}-1}{1-q}, \qquad \qquad \qquad \qquad \qquad \,(s >0), \\&{\exp }_q(t) := (1+(1-q)t)^{\frac{1}{1-q}}, \qquad \qquad \qquad (1+(1-q)t >0). \end{aligned}$$

The function \(\log _q(s)\) is called the \(q{\text {-}}{\textit{logarithm}}\) and \({\exp }_q(t)\) the \(q{\text {-}}{\textit{exponential}}\). Taking the limit \(q \rightarrow 1\), the standard logarithm and the standard exponential are recovered, respectively.
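
A direct implementation of this pair of functions is a useful sanity check. The sketch below (Python with NumPy; the parameter values are arbitrary) verifies that \(\log _q\) and \(\exp _q\) are mutually inverse and that \(q \rightarrow 1\) recovers the standard logarithm.

```python
import numpy as np

def log_q(s, q):
    """q-logarithm (s^(1-q) - 1)/(1-q) for s > 0; q = 1 gives the usual log."""
    return np.log(s) if np.isclose(q, 1.0) else (s**(1 - q) - 1) / (1 - q)

def exp_q(t, q):
    """q-exponential (1 + (1-q) t)^(1/(1-q)), defined for 1 + (1-q) t > 0."""
    return np.exp(t) if np.isclose(q, 1.0) else (1 + (1 - q) * t)**(1 / (1 - q))

q, s = 1.5, 2.3
print(exp_q(log_q(s, q), q))         # recovers s: the maps are mutually inverse
print(log_q(s, 1.001), np.log(s))    # q -> 1 approaches the standard logarithm
```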

A statistical model \(S_{\chi }\) is said to be a deformed exponential family (or a \(\chi {\text {-}} {\textit{exponential family}}\)) if

$$ S_{\chi } := \left\{ p(x;\theta ) \left| p(x;\theta ) = \exp _{\chi }\left[ \sum _{i=1}^n\theta ^iF_i(x) -\psi (\theta ) \right] , \ \theta \in \varTheta \subset {{\varvec{R}}}^n \right. \right\} , $$

under a choice of suitable dominating measure, where \(F_1(x),\ldots ,F_n(x)\) are functions on the sample space \(\varOmega \), \(\theta = \{\theta ^1,\ldots ,\theta ^n\}\) is a parameter, and \(\psi (\theta )\) is the function of \(\theta \) for normalization. We assume that \(S_{\chi }\) is a statistical model in the sense of [1]. That is, \(p(x;\theta )\) has support entirely on \(\varOmega \), there exists a one-to-one correspondence between the parameter \(\theta \) and the probability distribution \(p(x; \theta )\), and differentiation and integration are interchangeable. In addition, the functions \(\{F_i(x)\}, \psi (\theta )\) and the parameters \(\{\theta ^i\}\) must satisfy the anti-exponential condition. For example, in the \(q\)-exponential case, these functions satisfy

$$ \sum _{i=1}^n\theta ^iF_i(x) -\psi (\theta ) < \frac{1}{q-1}. $$

Then we can regard \(S_{\chi }\) as a manifold with local coordinate system \(\{\theta ^i\}\). We also assume that the function \(\psi \) is strictly convex, since we consider Hessian metrics on \(S_{\chi }\) later. A deformed exponential family has several different definitions. See [30] and [34], for example.

For a deformed exponential probability density \(p(x;\theta ) \in S_{\chi }\), we define the escort distribution \(P_{\chi }(x;\theta )\) of \(p(x;\theta )\) by

$$ P_{\chi }(x;\theta ) := \frac{1}{Z_{\chi }(\theta )}{\chi }\{p(x;\theta )\}, $$

where \(Z_{\chi }(\theta )\) is the normalization defined by

$$ Z_{\chi }(\theta ) := \int \limits _{\varOmega }{\chi }\{p(x;\theta )\}dx. $$

The \(\chi {\text {-}}{\textit{expectation}}\, E_{\chi ,p}[f]\) of \(f(x)\) with respect to \(P_{\chi }(x;\theta )\) is defined by

$$\begin{aligned} E_{\chi ,p}[f]&:= \int \limits _{\varOmega }f(x)P_{\chi }(x;\theta )\,dx \ = \ \frac{1}{Z_{\chi }(\theta )}\int \limits _{\varOmega } f(x)\chi \{p(x;\theta )\}dx. \end{aligned}$$

When \(\chi \) is a power function \(\chi (s) = s^q, (q>0, q\ne 1)\), we denote the escort distribution of \(p(x;\theta )\) by \(P_q(x;\theta )\), and the \(\chi \)-expectation with respect to \(p(x;\theta )\) by \(E_{q,p}[*]\).
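
In the discrete case, the escort distribution and the \(\chi \)-expectation are straightforward to compute. The sketch below (Python with NumPy; \(\chi (s) = s^q\), and the three-point density and the values of \(f\) are illustrative) implements \(P_q\) and \(E_{q,p}[f]\); note that \(q = 1\) recovers the ordinary expectation.

```python
import numpy as np

def escort(p, q):
    """Escort distribution P_q of a discrete density p, with normalization Z_q."""
    w = p**q
    return w / w.sum(), w.sum()

def q_expectation(f, p, q):
    """chi-expectation E_{q,p}[f], taken with respect to the escort of p."""
    P, _ = escort(p, q)
    return (f * P).sum()

p = np.array([0.5, 0.3, 0.2])           # a density on a three-point sample space
f = np.array([1.0, 2.0, 3.0])           # values of a random variable
for q in (0.5, 1.0, 2.0):
    print(q, q_expectation(f, p, q))    # q = 1 gives E_p[f] = 1.7
```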

Example 1

(discrete distributions [3]) The set of discrete distributions \(S_n\) is a deformed exponential family for an arbitrary \(\chi \). Suppose that \(\varOmega \) is a finite set: \(\varOmega = \{x_0, x_1,\ldots ,x_{n}\}\). Then the statistical model \(S_n\) is given by

$$\begin{aligned} S_n&:= \left\{ p(x;\eta ) \ \left| \ \eta _i > 0, \ p(x;\eta ) = \sum _{i=0}^{n}\eta _i\delta _i(x), \ \sum _{i=0}^{n}\eta _i = 1\right. \right\} , \end{aligned}$$

where \(\eta _{0}:=1-\sum _{i=1}^n\eta _i\) and

$$ \delta _i(x):= \left\{ \begin{array}{ll} 1 \quad &{} (x = x_i), \\ 0 &{} (x \ne x_i). \end{array}\right. $$

Set \(\theta ^i = \log _{\chi }p(x_i) - \log _{\chi }p(x_0) = \log _{\chi }\eta _i - \log _{\chi }\eta _0, \ F_i(x) = \delta _i(x)\) and \(\psi (\theta ) = -\log _{\chi }\eta _{0}\). Then the \(\chi \)-logarithm of \(p(x) \in S_n\) is written as

$$\begin{aligned} \log _{\chi }p(x)&= \sum _{i=1}^{n}\left( \log _{\chi }\eta _i-\log _{\chi }\eta _{0}\right) \delta _i(x) + \log _{\chi }(\eta _{0}) \\&= \sum _{i=1}^n \theta ^iF_i(x) - \psi (\theta ). \end{aligned}$$

This implies that \(S_n\) is a deformed exponential family.
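
This decomposition can be verified numerically. The sketch below (Python with NumPy; the choice \(q = 1.5\) and the probabilities \(\eta _i\) are illustrative) checks that \(\log _q p(x_i) = \sum _j \theta ^jF_j(x_i) - \psi (\theta )\) with \(\theta ^i = \log _q\eta _i - \log _q\eta _0\), \(F_i = \delta _i\) and \(\psi = -\log _q\eta _0\).

```python
import numpy as np

q = 1.5
log_q = lambda s: (s**(1 - q) - 1) / (1 - q)

eta = np.array([0.5, 0.3, 0.2])            # (eta_0, eta_1, eta_2), summing to 1
theta = log_q(eta[1:]) - log_q(eta[0])     # natural parameters theta^1, theta^2
psi = -log_q(eta[0])                       # normalization

for i in range(3):
    F = np.array([1.0 if j + 1 == i else 0.0 for j in range(2)])   # F_j(x_i)
    print(log_q(eta[i]), theta @ F - psi)  # the two columns agree
```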

Example 2

(\(q\)-normal distributions [20]) A \(q\)-normal distribution is the probability distribution defined by the following formula:

$$\begin{aligned} p_q(x;\mu ,\sigma )&:= \frac{1}{Z_q(\sigma )}\left[ 1-\frac{1-q}{3-q}\frac{(x-\mu )^2}{\sigma ^2} \right] ^{\frac{1}{1-q}}_+, \end{aligned}$$

where \([*]_+ := \max \{0,*\}\), \(\{\mu ,\sigma \}\) are parameters with \(-\infty<\mu <\infty \) and \(0<\sigma <\infty \), and \(Z_q(\sigma )\) is the normalization defined by

$$\begin{aligned} Z_q(\sigma ) := \left\{ \begin{array}{ll} \dfrac{\sqrt{3-q}}{\sqrt{1-q}} \mathrm{B} \left( \dfrac{2-q}{1-q},\dfrac{1}{2}\right) \sigma , &{} \;\;(-\infty < q < 1), \\ \dfrac{\sqrt{3-q}}{\sqrt{q-1}} \mathrm{B} \left( \dfrac{3-q}{2(q-1)},\dfrac{1}{2}\right) \sigma , &{} \;\;(1 \le q < 3). \end{array} \right. \end{aligned}$$

Here, \(\mathrm{B} \left( *,*\right) \) is the beta function. We restrict ourselves to the case \(q \ge 1\). Then the probability distribution \(p_q(x;\mu , \sigma )\) has its support entirely on \({{\varvec{R}}}\), and the set of \(q\)-normal distributions \(S_q\) is a statistical model. Set

$$\begin{aligned} \theta ^1&:= \frac{2}{3-q}\{Z_q(\sigma )\}^{q-1}\frac{\mu }{\sigma ^2}, \quad \theta ^2 \ := \ -\frac{1}{3-q}\{Z_q(\sigma )\}^{q-1}\frac{1}{\sigma ^2}, \\ \psi (\theta )&:= - \frac{(\theta ^1)^2}{4\theta ^2} - \frac{\{Z_q(\sigma )\}^{q-1}-1}{1-q}, \end{aligned}$$

then we have

$$\begin{aligned} \log _qp_q(x; \theta )&= \frac{1}{1-q}(\{p_q(x; \theta )\}^{1-q}-1) \\&= \frac{1}{1-q}\left\{ \frac{1}{\{Z_q(\sigma )\}^{1-q}} \left( 1-\frac{1-q}{3-q}\frac{(x-\mu )^2}{\sigma ^2}\right) -1\right\} \\&= \frac{2\mu \{Z_q(\sigma )\}^{q-1}}{(3-q)\sigma ^2}x - \frac{\{Z_q(\sigma )\}^{q-1}}{(3-q)\sigma ^2}x^2 \\&\quad - \frac{\{Z_q(\sigma )\}^{q-1}}{3-q}\cdot \frac{\mu ^2}{\sigma ^2} + \frac{\{Z_q(\sigma )\}^{q-1}-1}{1-q} \\&= \theta ^1x + \theta ^2x^2 - \psi (\theta ). \end{aligned}$$

This implies that the set of \(q\)-normal distributions \(S_q\) is a \(q\)-exponential family. For a \(q\)-normal distribution \(p_q(x;\mu ,\sigma )\), the \(q\)-expectation \(\mu _q\) and the \(q\)-variance \(\sigma ^2_q\) are given by

$$\begin{aligned} \mu _q&= E_{q,p}[x] \ = \ \mu , \\ \sigma ^2_q&= E_{q,p}\left[ (x-\mu )^2\right] \ = \ \sigma ^2. \end{aligned}$$

We remark that a \(q\)-normal distribution is nothing but a three-parameter version of Student's \(t\)-distribution when \(q>1\). In fact, if \(q=1\), then the \(q\)-normal distribution is the normal distribution. If \(q=2\), then the distribution is the Cauchy distribution. We also remark that mathematical properties of \(q\)-normal distributions have been obtained by several authors. See [29, 31], for example.
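
The normalization and the \(q\)-moments above can be verified numerically. The sketch below (Python with NumPy; the parameter values and the integration grid are illustrative, and integrals are approximated by Riemann sums on a wide grid because the tails are heavy for \(q>1\)) checks that \(p_q\) integrates to one and that the escort moments return \(\mu \) and \(\sigma ^2\).

```python
import numpy as np
from math import gamma, sqrt

def q_normal(x, mu, sigma, q):
    """q-normal density for 1 < q < 3, with the closed-form normalization Z_q."""
    a = (3 - q) / (2 * (q - 1))
    B = gamma(a) * gamma(0.5) / gamma(a + 0.5)      # beta function B(a, 1/2)
    Z = sqrt((3 - q) / (q - 1)) * B * sigma
    base = 1 - (1 - q) / (3 - q) * (x - mu)**2 / sigma**2
    return np.maximum(base, 0.0)**(1 / (1 - q)) / Z

q, mu, sigma = 1.5, 1.0, 2.0
x, dx = np.linspace(-400, 400, 800001, retstep=True)
p = q_normal(x, mu, sigma, q)
P = p**q / ((p**q).sum() * dx)                      # escort distribution P_q

print((p * dx).sum())                # ~ 1       (normalization)
print((x * P * dx).sum())            # ~ mu      (q-expectation)
print(((x - mu)**2 * P * dx).sum())  # ~ sigma^2 (q-variance)
```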

3.5 Geometry of Deformed Exponential Families Derived from the Standard Expectation

In this section, we consider the geometry of deformed exponential families obtained by generalizing the \(e\)-representation with the deformed logarithm function. For more details, see [21, 26].

Let \(S_{\chi }\) be a deformed exponential family. We define an \({{\varvec{R}}}^n\)-valued function \(s^{\chi }(x;\theta ) = \left( (s^{\chi })^1(x;\theta ),\ldots ,(s^{\chi })^n(x;\theta ) \right) ^T\) by

$$\begin{aligned} (s^{\chi })^i(x;\theta ) := \frac{\partial }{\partial \theta ^i}\log _{\chi }p(x;\theta ), \quad (i=1,\ldots ,n). \end{aligned}$$
(3.12)

We call \(s^{\chi }(x;\theta )\) the \(\chi {\text {-}}{\textit{score function}}\) of \(p(x;\theta )\). Using the \(\chi \)-score function, we define a \((0,2)\)-tensor field \(g^M\) on \(S_{\chi }\) by

$$\begin{aligned} g_{ij}^M(\theta )&:= \int \limits _{\varOmega } \partial _ip(x;\theta )\partial _j \log _{\chi } p(x;\theta ) \, dx, \quad \left( \partial _i = \frac{\partial }{\partial \theta ^i}\right) . \end{aligned}$$
(3.13)

Lemma 1

The tensor field \(g^M\) on \(S_{\chi }\) is positive semi-definite.

Proof

From the definitions of \(g^M\) and \(\log _{\chi }\), the tensor field \(g^M\) is written as

$$\begin{aligned} g^M_{ij}(\theta ) = \int \limits _{\varOmega }\chi (p(x;\theta ))\left( F_i(x) - \partial _i\psi (\theta )\right) \left( F_j(x) - \partial _j\psi (\theta )\right) dx. \end{aligned}$$
(3.14)

Since \(\chi \) takes values in \((0,\infty )\), \(g^M\) is positive semi-definite. \(\square \)

From now on, we assume that \(g^M\) is positive definite, so that \(g^M\) is a Riemannian metric on \(S_{\chi }\). This assumption is the same as in the case of the Fisher metric. The Riemannian metric \(g^M\) is a generalization of the Fisher metric in terms of the representation (3.8).

We can consider other types of generalizations of the Fisher metric as follows:

$$\begin{aligned} g^E_{ij}(\theta )&:= \int \limits _{\varOmega } \left( {\partial _i}\log _{\chi }p(x;\theta )\right) \left( {\partial _j}\log _{\chi }p(x;\theta )\right) P_{\chi }(x;\theta )dx \\&= E_{\chi ,p} [\partial _i l_{\chi }(\theta ) \partial _j l_{\chi }(\theta )], \\ g^N_{ij}(\theta )&:= \int \limits _{\varOmega }\frac{1}{P_{\chi }(x;\theta )} \left( {\partial _i}p(x;\theta )\right) \left( {\partial _j}p(x;\theta )\right) dx, \end{aligned}$$

where \(l_{\chi }(\theta ) = \log _{\chi }p(x;\theta )\). Obviously, \(g^E\) and \(g^N\) are generalizations of the Fisher metric with respect to the representations (3.7) and (3.9), respectively.

Proposition 4

Let \(S_{\chi }\) be a deformed exponential family. Then Riemannian metrics \(g^E, g^M\) and \(g^N\) are mutually conformally equivalent. In particular, the following formulas hold:

$$ Z_{\chi }(\theta )g^E(\theta ) = g^M(\theta ) = \frac{1}{Z_{\chi }(\theta )}g^N(\theta ), $$

where \(Z_{\chi }(\theta )\) is the normalization of the escort distribution \(P_{\chi }(x;\theta )\).

Proof

For a deformed exponential family \(S_{\chi }\), the differentials of probability density functions are given as follows:

$$\begin{aligned} \frac{\partial }{\partial \theta ^i}p(x;\theta )&= \chi (p(x;\theta ))\left( F_i(x) - \frac{\partial }{\partial \theta ^i}\psi (\theta )\right) , \\ \frac{\partial }{\partial \theta ^i}\log _{\chi }p(x;\theta )&= F_i(x) - \frac{\partial }{\partial \theta ^i}\psi (\theta ). \end{aligned}$$

From the above formula and the definitions of Riemannian metrics \(g^E\) and \(g^N\), we have

$$\begin{aligned} g^E_{ij}(\theta )&= \frac{1}{Z_{\chi }(\theta )} \int \limits _{\varOmega }\chi (p(x;\theta ))\left( F_i(x) - \partial _i\psi (\theta )\right) \left( F_j(x) - \partial _j\psi (\theta )\right) dx, \\ g^N_{ij}(\theta )&= Z_{\chi }(\theta ) \int \limits _{\varOmega }\chi (p(x;\theta ))\left( F_i(x) - \partial _i\psi (\theta )\right) \left( F_j(x) - \partial _j\psi (\theta )\right) dx. \end{aligned}$$

These equations and Eq. (3.14) imply that Riemannian metrics \(g^E, g^M\) and \(g^N\) are mutually conformally equivalent. \(\square \)
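
Proposition 4 can be confirmed numerically on a discrete family. In the sketch below (Python with NumPy; the three-point \(q\)-exponential family with \(q = 1.5\) and the chosen \(\theta \) are illustrative, the normalization \(\eta _0\) is found by bisection, and derivatives are taken by central differences), the three matrices \(Z_{\chi }\,g^E\), \(g^M\) and \(g^N/Z_{\chi }\) agree.

```python
import numpy as np

q = 1.5
log_q = lambda s: (s**(1 - q) - 1) / (1 - q)
exp_q = lambda t: np.maximum(1 + (1 - q) * t, 0.0)**(1 / (1 - q))

def density(theta):
    """Three-point q-exponential family: p = exp_q(theta^i F_i - psi), where
    the normalization is fixed by solving for eta_0 = p(x_0) with bisection."""
    lo, hi = 1e-12, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        total = mid + exp_q(theta + log_q(mid)).sum()
        lo, hi = (mid, hi) if total < 1 else (lo, mid)
    eta0 = 0.5 * (lo + hi)
    return np.concatenate(([eta0], exp_q(theta + log_q(eta0))))

theta, eps = np.array([0.4, -0.2]), 1e-5
p0 = density(theta)
dp = np.array([(density(theta + eps * np.eye(2)[i])
                - density(theta - eps * np.eye(2)[i])) / (2 * eps)
               for i in range(2)])      # dp/dtheta^i by central differences
dlq = dp / p0**q                        # d(log_q p)/dtheta^i = p^(-q) dp
Z = (p0**q).sum()                       # normalization of the escort
P = p0**q / Z                           # escort distribution

gM = dp @ dlq.T                         # Eq. (3.13)
gE = (dlq * P) @ dlq.T
gN = (dp / P) @ dp.T
print(Z * gE, gM, gN / Z, sep="\n")     # the three matrices agree
```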

Among these three generalizations of the Fisher metric, \(g^M\) is especially associated with a Hessian structure on \(S_{\chi }\), as we will see below. Although the meaning of \(g^E\) is unknown, \(g^N\) gives a kind of Cramér-Rao lower bound in statistical inference. (See [22, 23].)

By differentiating Eq. (3.13), we can define mutually dual affine connections \(\nabla ^{M(e)}\) and \(\nabla ^{M(m)}\) on \(S_{\chi }\) by

$$\begin{aligned} \varGamma _{ij,k}^{M(e)}(\theta )&:= \int \limits _{\varOmega }\partial _kp(x;\theta ) \partial _i\partial _j\log _{\chi } p(x;\theta )dx, \\ \varGamma _{ij,k}^{M(m)}(\theta )&:= \int \limits _{\varOmega }\partial _i\partial _jp(x;\theta ) \partial _k\log _{\chi } p(x;\theta )dx. \end{aligned}$$

From the definitions of the deformed exponential family and the deformed logarithm function, \(\varGamma _{ij,k}^{M(e)}\) vanishes identically. Hence the connection \(\nabla ^{M(e)}\) is flat, and \((\nabla ^{M(e)}, g^M)\) is a Hessian structure on \(S_{\chi }\). Denote by \(C^M\) the cubic form of \((S_{\chi }, \nabla ^{M(e)}, g^M)\), that is,

$$ C^M_{ijk} = \varGamma ^{M(m)}_{ij,k} - \varGamma ^{M(e)}_{ij,k} = \varGamma ^{M(m)}_{ij,k}. $$

For \(t>0\), set a function \(V_{\chi }(t)\) by

$$ V_{\chi }(t) := \int \limits _1^t \log _{\chi }(s)\, ds. $$

We assume that \(V_{\chi }(0) = \lim _{t \rightarrow +0} V_{\chi }(t)\) is finite. Then the generalized entropy functional \(I_{\chi }\) and the generalized Massieu potential \(\varPsi \) are defined by

$$\begin{aligned} I_{\chi }(p_{\theta })&:= - \int \limits _{\varOmega }\left\{ V_{\chi }(p(x;\theta )) + (p(x;\theta )-1)V_{\chi }(0)\right\} dx, \\ \varPsi (\theta )&:= \int \limits _{\varOmega }p(x;\theta )\log _{\chi }p(x;\theta )dx + I_{\chi }(p_{\theta }) + \psi (\theta ), \end{aligned}$$

respectively, where \(\psi \) is the normalization of the deformed exponential family.

Theorem 2

(cf. [21, 26]) For a deformed exponential family \(S_{\chi }\), the following hold:

  1.

    \((S_{\chi },\nabla ^{M(e)}, g^M)\) and \((S_{\chi }, \nabla ^{M(m)}, g^M)\) are mutually dual Hessian manifolds, that is, \((S_{\chi },g^M,\nabla ^{M(e)}, \nabla ^{M(m)})\) is a dually flat space.

  2.

    \(\{\theta ^i\}\) is a \(\nabla ^{M(e)}\)-affine coordinate system on \(S_{\chi }\).

  3.

    \(\varPsi (\theta )\) is the potential of \(g^M\) and \(C^M\) with respect to \(\{\theta ^i\}\), that is,

    $$\begin{aligned} g_{ij}^M(\theta )&= \partial _i\partial _j\varPsi (\theta ), \\ C_{ijk}^M(\theta )&= \partial _i\partial _j\partial _k\varPsi (\theta ). \end{aligned}$$
  4.

    Set the expectation of \(F_i(x)\) by \(\eta _i := E_{p}[F_i(x)]\). Then \(\{\eta _i\}\) is a \(\nabla ^{M(m)}\)-affine coordinate system on \(S_{\chi }\) and the dual of \(\{\theta ^i\}\) with respect to \(g^M\).

  5.

    Set \(\varPhi (\eta ) := -I_{\chi }(p_{\theta })\). Then \(\varPhi (\eta )\) is the potential of \(g^M\) with respect to \(\{\eta _i\}\).

Let us construct a divergence function which induces the Hessian manifold \((S_{\chi }, \nabla ^{M(e)}, g^M)\). We define the bias corrected \(\chi {\text {-}}{\textit{score function}}\)  \(u_p^{\chi }(x;\theta )\) of \(p(x;\theta )\) by

$$ (u_p^{\chi })^i(x;\theta ) := \frac{\partial }{\partial \theta ^i}\log _{\chi }p(x;\theta ) - E_{p}\left[ \frac{\partial }{\partial \theta ^i}\log _{\chi }p(x;\theta )\right] . $$

Set a function \(U_{\chi }(s)\) by

$$ U_{\chi }(s) := \int \limits _0^s\exp _{\chi }(t)\, dt. $$

Then we have

$$\begin{aligned} V_{\chi }(s)&= s\log _{\chi }(s) - \int \limits _1^st \left( \frac{d}{dt}\log _{\chi }(t)\right) dt \\&= s\log _{\chi }(s) - \int \limits _0^{\log _{\chi }(s)}\exp _{\chi }(u)du \\&= s\log _{\chi }(s) - U_{\chi }(\log _{\chi }(s)). \end{aligned}$$

Since \(\partial /\partial \theta ^iV_{\chi }(p(x;\theta )) = (\partial /\partial \theta ^ip(x;\theta ))\log _{\chi }p(x;\theta )\), we have

$$ p(x;\theta )\left( \frac{\partial }{\partial \theta ^i}\log _{\chi }p(x;\theta )\right) = \frac{\partial }{\partial \theta ^i}U_{\chi }(\log _{\chi }p(x;\theta )). $$

Hence, by integrating the bias corrected \(\chi \)-score function at \(r(x;\theta ) \in S_{\chi }\) with respect to \(\theta \), and by taking the standard expectation with respect to \(p(x;\theta )\), we define a \(\chi {\text {-}}{\textit{cross entropy of Bregman type}}\) by

$$ d^M_{\chi }(p,r) = - \int \limits _{\varOmega }p(x)\log _{\chi }r(x)dx + \int \limits _{\varOmega }U_{\chi }(\log _{\chi }r(x))dx. $$

Then we obtain the \(\chi {\text {-}}{\textit{divergence}}\) (or \(U{\text {-}}{\textit{divergence}}\)) by

$$\begin{aligned} D_{\chi }(p,r)&= - d^M_{\chi }(p,p) + d^M_{\chi }(p,r) \\&= \int \limits _{\varOmega }\left\{ U_{\chi }(\log _{\chi }r(x)) - U_{\chi }(\log _{\chi }p(x)) \right. \\&\qquad \qquad \left. -p(x)(\log _{\chi }r(x) - \log _{\chi }p(x))\right\} dx. \end{aligned}$$

In the \(q\)-exponential case, the bias corrected \(q{\text {-}}{\textit{score function}}\) is given by

$$\begin{aligned} u_q^i(x;\theta )&= \frac{\partial }{\partial \theta ^i}\log _qp(x;\theta ) - E_{p} \left[ \frac{\partial }{\partial \theta ^i}\log _qp(x;\theta )\right] \\&= \frac{\partial }{\partial \theta ^i}\left\{ \frac{1}{1-q}p(x;\theta )^{1-q} - \frac{1}{2-q}\int \limits _{\varOmega }p(x;\theta )^{2-q}dx\right\} \\&= p(x;\theta )^{1-q}s^i(x;\theta ) - E_{p}[p(x;\theta )^{1-q}s^i(x;\theta )]. \end{aligned}$$

This score function is nothing but a weighted score function in robust statistics. The \(\chi \)-divergence constructed from the bias corrected \(q\)-score function coincides with the \(\beta {\text {-}}{\textit{divergence}}\, (\beta = 1-q)\):

$$\begin{aligned} D_{1-q}(p,r)&= -d_{1-q}(p,p) + d_{1-q}(p,r) \\&= \frac{1}{(1-q)(2-q)}\int \limits _{\varOmega }p(x)^{2-q}dx \\&\quad - \frac{1}{1-q}\int \limits _{\varOmega }p(x)r(x)^{1-q}dx + \frac{1}{2-q}\int \limits _{\varOmega }r(x)^{2-q}dx. \end{aligned}$$
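
The sketch below (Python with NumPy; the two Gaussian densities, the grid, and \(q = 1.5\) are illustrative, with integrals approximated by Riemann sums) implements this \(\beta \)-divergence and checks that \(D_{1-q}(p,r) > 0\) for \(p \ne r\) while \(D_{1-q}(p,p) = 0\).

```python
import numpy as np

def beta_divergence(p, r, q, dx):
    """chi-divergence for chi(s) = s^q, i.e. the beta-divergence with
    beta = 1 - q, for densities sampled on a grid of spacing dx."""
    term1 = (p**(2 - q)).sum() / ((1 - q) * (2 - q))
    term2 = -(p * r**(1 - q)).sum() / (1 - q)
    term3 = (r**(2 - q)).sum() / (2 - q)
    return (term1 + term2 + term3) * dx

x, dx = np.linspace(-20, 20, 40001, retstep=True)
gauss = lambda m, s: np.exp(-(x - m)**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
p, r = gauss(0.0, 1.0), gauss(1.0, 1.5)

q = 1.5
print(beta_divergence(p, r, q, dx))   # positive for p != r
print(beta_divergence(p, p, q, dx))   # 0 (up to rounding)
```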

3.6 Geometry of Deformed Exponential Families Derived from the \(\chi \)-Expectation

Since \(S_{\chi }\) is linearizable by the deformed logarithm function, we can naturally define geometric structures from the potential function \(\psi \).

A \(\chi {\text {-}}{\textit{Fisher metric}}\, g^{\chi }\) and a \(\chi {\text {-}}{\textit{cubic form}}\, C^{\chi }\) are defined by

$$\begin{aligned} g_{ij}^{\chi }(\theta )&:= \partial _i\partial _j\psi (\theta ), \\ C_{ijk}^{\chi }(\theta )&:= \partial _i\partial _j\partial _k\psi (\theta ), \end{aligned}$$

respectively [3]. In the \(q\)-exponential case, we denote the \(\chi \)-Fisher metric by \(g^q\), and the \(\chi \)-cubic form by \(C^q\). We call \(g^q\) and \(C^q\) a \(q{\text {-}}{\textit{Fisher metric}}\) and a \(q\)-cubic form, respectively.

Let \(\nabla ^{\chi (0)}\) be the Levi-Civita connection with respect to the \(\chi \)-Fisher metric \(g^{\chi }\). Then a \(\chi {\text {-}}{\textit{exponential connection}} \nabla ^{\chi (e)}\) and a \(\chi {\text {-}}{\textit{mixture connection}} \nabla ^{\chi (m)}\) are defined by

$$\begin{aligned} g^{\chi }(\nabla ^{\chi (e)}_XY,Z)&:= g^{\chi }(\nabla ^{\chi (0)}_XY,Z) - \frac{1}{2}C^{\chi }(X,Y,Z), \\ g^{\chi }(\nabla ^{\chi (m)}_XY,Z)&:= g^{\chi }(\nabla ^{\chi (0)}_XY,Z) + \frac{1}{2}C^{\chi }(X,Y,Z), \end{aligned}$$

respectively. The following theorem is given in [3].

Theorem 3

(cf. [3]) For a deformed exponential family \(S_{\chi }\), the following hold:

  1.

    \((S_{\chi },\nabla ^{\chi (e)}, g^{\chi })\) and \((S_{\chi }, \nabla ^{\chi (m)}, g^{\chi })\) are mutually dual Hessian manifolds, that is, \((S_{\chi },g^{\chi },\nabla ^{\chi (e)}, \nabla ^{\chi (m)})\) is a dually flat space.

  2.

    \(\{\theta ^i\}\) is a \(\nabla ^{\chi (e)}\)-affine coordinate system on \(S_{\chi }\).

  3.

    \(\psi (\theta )\) is the potential of \(g^{\chi }\) and \(C^{\chi }\) with respect to \(\{\theta ^i\}\).

  4.

    Set the \(\chi \)-expectation of \(F_i(x)\) by \(\eta _i := E_{\chi ,p}[F_i(x)]\). Then \(\{\eta _i\}\) is a \(\nabla ^{\chi (m)}\)-affine coordinate system on \(S_{\chi }\) and the dual of \(\{\theta ^i\}\) with respect to \(g^{\chi }\).

  5.

    Set \(\phi (\eta ) := E_{\chi , p}[\log _{\chi } p(x;\theta )]\). Then \(\phi (\eta )\) is the potential of \(g^{\chi }\) with respect to \(\{\eta _i\}\).

Proof

Statements 1, 2 and 3 are easily obtained from the definitions of \(\chi \)-Fisher metric and \(\chi \)-cubic form. From Eq. (3.3) and \(\eta _i = E_{\chi ,p}[F_i(x)]\), Statements 4 and 5 follow from the fact that

$$ E_{\chi , p}[\log _{\chi }p(x;\theta )] = E_{\chi , p}\left[ \sum _{i=1}^n\theta ^iF_i(x) - \psi (\theta )\right] = \sum _{i=1}^n\theta ^i\eta _i - \psi (\theta ). \quad \square $$

Suppose that \(s^{\chi }(x;\theta )\) is the \(\chi \)-score function defined by (3.12). The \(\chi \)-score function is unbiased with respect to the \(\chi \)-expectation, that is, \(E_{\chi ,p}[(s^{\chi })^i(x;\theta )] =0\). Hence we regard \(s^{\chi }(x;\theta )\) as a generalization of unbiased estimating functions.

By integrating a \(\chi \)-score function, we define the \(\chi {\text {-}}{\textit{cross entropy}}\) by

$$\begin{aligned} d^{\chi }(p,r)&:= - E_{\chi , p}[\log _{\chi }r(x)] \\&= - \int \limits _{\varOmega }P_{\chi }(x)\log _{\chi }r(x)dx. \end{aligned}$$

Then we obtain the generalized relative entropy \(D^{\chi }(p,r)\) by

$$\begin{aligned} D^{\chi }(p,r)&:= -d^{\chi }(p,p) + d^{\chi }(p,r) \nonumber \\&= E_{\chi ,p}[\log _{\chi }p(x) - \log _{\chi }r(x)]. \end{aligned}$$
(3.15)

The generalized relative entropy \(D^{\chi }(p,r)\) coincides with the canonical divergence \(D(r,p)\) for \((S_{\chi },\nabla ^{\chi (e)},g^{\chi })\). In fact, from (3.15), we can check that

$$\begin{aligned} D^{\chi }(p(\theta ),p(\theta '))&= E_{\chi ,p}\left[ \left( \sum _{i=1}^n\theta ^iF_i(x) - \psi (\theta )\right) - \left( \sum _{i=1}^n(\theta ')^iF_i(x) - \psi (\theta ')\right) \right] \\&= \psi (\theta ') + \sum _{i=1}^n\theta ^i\eta _i - \psi (\theta ) - \sum _{i=1}^n(\theta ')^i\eta _i \ = \ D(p(\theta '),p(\theta )). \end{aligned}$$

Let us consider the \(q\)-exponential case. We assume that a \(q\)-exponential family \(S_q\) admits an invariant statistical manifold structure \((S_q,\nabla ^{(\alpha )}, g^F)\).

Theorem 4

([20]) For a \(q\)-exponential family \(S_q\), the invariant statistical manifold \((S_q, \nabla ^{(2q-1)}, g^F)\) and the Hessian manifold \((S_q,\nabla ^{q(e)},g^q)\) are \(1\)-conformally equivalent. In this case, the invariant statistical manifold \((S_q, \nabla ^{(2q-1)}, g^F)\) is \(1\)-conformally flat.

Divergence functions for \((S_q,\nabla ^{q(e)},g^q)\) and \((S_q, \nabla ^{(2q-1)}, g^F)\) are given as follows. The \(\alpha {\text {-}}{\textit{divergence}}\, D^{(\alpha )}(p,r)\) with \(\alpha = 1-2q\) is defined by

$$ D^{(1-2q)}(p,r) := \frac{1}{q(1-q)}\left\{ 1- \int \limits _{\varOmega }p(x)^{q}r(x)^{1-q}dx\right\} . $$

On the other hand, the normalized Tsallis relative entropy \(D_q^T(p,r)\) is defined by

$$\begin{aligned} D_q^T(p,r)&:= \int \limits _{\varOmega }P_q(x) \left( \log _q p(x) - \log _q r(x)\right) dx \\&= E_{q,p}[\log _q p(x) - \log _q r(x)]. \end{aligned}$$

We remark that the invariant statistical manifold \((S_q, \nabla ^{(1-2q)}, g^F)\) is induced from the \(\alpha \)-divergence with \(\alpha = 1-2q\), and that the Hessian manifold \((S_q,\nabla ^{q(e)},g^q)\) is induced from the dual of the normalized Tsallis relative entropy. In fact, for a \(q\)-exponential family \(S_q\), divergence functions have the following relations:

$$\begin{aligned} D(r,p)&= D_q^T(p,r) \\&= \int \limits _{\varOmega }\frac{p(x)^q}{Z_q(p)} \left( \log _q p(x) - \log _q r(x)\right) dx \\&= \frac{1}{Z_q(p)}\int \limits _{\varOmega }\left( \frac{p(x)-p(x)^q}{1-q} - \frac{p(x)^qr(x)^{1-q} - p(x)^q}{1-q}\right) dx \\&= \frac{1}{(1-q)Z_q(p)}\left\{ 1- \int \limits _{\varOmega }p(x)^{q}r(x)^{1-q}dx\right\} \\&= \frac{q}{Z_q(p)}D^{(1-2q)}(p,r), \end{aligned}$$

where \(D\) is the canonical divergence of the Hessian manifold \((S_q,\nabla ^{q(e)},g^q)\).
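
These equalities can be checked numerically. The sketch below (Python with NumPy; the two three-point densities and \(q = 1.5\) are illustrative) computes the normalized Tsallis relative entropy directly from its definition and compares it with \(\bigl (q/Z_q(p)\bigr )D^{(1-2q)}(p,r)\).

```python
import numpy as np

q = 1.5
log_q = lambda s: (s**(1 - q) - 1) / (1 - q)

p = np.array([0.5, 0.3, 0.2])    # two discrete densities on three points
r = np.array([0.2, 0.5, 0.3])

Zq = (p**q).sum()                # normalization Z_q(p) of the escort of p
Pq = p**q / Zq                   # escort distribution P_q
D_T = (Pq * (log_q(p) - log_q(r))).sum()          # normalized Tsallis entropy
D_a = (1 - (p**q * r**(1 - q)).sum()) / (q * (1 - q))   # alpha = 1 - 2q
print(D_T, q / Zq * D_a)         # the two values agree
```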

3.7 Maximum \(q\)-Likelihood Estimators

In this section, we generalize the maximum likelihood method from the viewpoint of generalized independence. To avoid complicated arguments, we restrict ourselves to the \(q\)-exponential case. However, the method can be generalized to the \(\chi \)-exponential case (cf. [8, 9]).

Let \(X\) and \(Y\) be random variables which follow probability distributions \(p_1(x)\) and \(p_2(y)\), respectively. We say that the two random variables \(X\) and \(Y\) are independent if the joint probability \(p(x,y)\) decomposes into the product of the marginal distributions \(p_1(x)\) and \(p_2(y)\):

$$ p(x,y) = p_1(x)p_2(y). $$

When \(p_1(x) > 0\) and \(p_2(y) >0\), the independence can be expressed with the exponential and logarithm functions as

$$ p(x,y) = \exp \left[ \log p_1(x) + \log p_2(y)\right] . $$

We generalize the notion of independence using the \(q\)-exponential and \(q\)-logarithm. Suppose that \(x>0, y>0\) and \(x^{1-q} + y^{1-q} -1>0 \ (q>0)\). We say that \(x \otimes _q y\) is a \(q{\text {-}}{\textit{product}}\) [6] of \(x\) and \(y\) if

$$\begin{aligned} x \otimes _q y&:= \left[ x^{1-q} + y^{1-q} -1\right] ^{\frac{1}{1-q}} \\&= \exp _q \left[ \log _q x + \log _q y \right] . \end{aligned}$$

In this case, the following law of exponents holds:

$$\begin{aligned} \exp _q x \otimes _q \exp _q y&= \exp _q(x+y), \end{aligned}$$

in other words,

$$\begin{aligned} \log _q (x \otimes _q y)&= \log _q x + \log _q y. \end{aligned}$$
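
A direct check of this law of exponents (Python with NumPy; the values of \(q\), \(s\) and \(t\) are illustrative and chosen so that all \(q\)-exponentials are defined):

```python
import numpy as np

def exp_q(t, q):
    return (1 + (1 - q) * t)**(1 / (1 - q))

def q_product(x, y, q):
    """q-product [x^(1-q) + y^(1-q) - 1]^(1/(1-q)), defined when the base > 0."""
    return (x**(1 - q) + y**(1 - q) - 1)**(1 / (1 - q))

q, s, t = 1.5, 0.3, 0.4
print(q_product(exp_q(s, q), exp_q(t, q), q))   # law of exponents: this line ...
print(exp_q(s + t, q))                          # ... and this line agree
```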

Let \(X_i\) be a random variable on \(\mathcal {X}_i\) which follows \(p_i(x) \ \) (\(i=1, 2,\ldots ,N)\). We say that \(X_1, X_2,\ldots ,X_N\) are \(q{\text {-}}{\textit{independent with}}\, m\)-normalization (mixture normalization) if

$$\begin{aligned} p(x_1, x_2,\ldots ,x_N)&= \dfrac{p_1(x_1) \otimes _q p_2(x_2) \otimes _q \cdots \otimes _q p_N(x_N)}{Z_{p_1, p_2, \cdots , p_N}}, \end{aligned}$$

where \(p(x_1, x_2,\ldots ,x_N)\) is the joint probability density of \(X_1, X_2,\ldots ,X_N\) and \(Z_{p_1, p_2, \cdots , p_N}\) is the normalization of \(p_1(x_1) \otimes _q p_2(x_2) \otimes _q \cdots \otimes _q p_N(x_N)\) defined by

$$\begin{aligned} Z_{p_1,p_2, \cdots , p_N}&:= \int \cdots \int \limits _{\mathcal {X}_1 \cdots \mathcal {X}_N}p_1(x_1) \otimes _q p_2(x_2) \otimes _q \cdots \otimes _q p_N(x_N) dx_1\cdots dx_N. \end{aligned}$$

Let \(S_q=\{p(x;\xi )\, |\, \xi \in \varXi \}\) be a \(q\)-exponential family, and let \(\{x_1,\ldots ,x_N\}\) be \(N\) observations from \(p(x;\xi ) \in S_q\). We define a \(q{\text {-}}{\textit{likelihood function}}\, L_q(\xi )\) by

$$\begin{aligned}&L_q(\xi ) = p(x_1;\xi ) \otimes _q p(x_2;\xi ) \otimes _q \cdots \otimes _q p(x_N;\xi ). \end{aligned}$$

Equivalently, a \(q{\text {-}}{\textit{log-likelihood function}}\) is given by

$$\begin{aligned}&\log _q L_q(\xi ) = \sum _{i=1}^N\log _q p(x_i;\xi ). \end{aligned}$$

In the limit \(q \rightarrow 1\), \(L_q\) recovers the standard likelihood function on \(\varXi \).

The maximum \(q{\text {-}}{\textit{likelihood estimator}}\,\,\hat{\xi }\) is the maximizer of the \(q\)-likelihood function, defined by

$$ \hat{\xi } := \mathop {\mathrm {argmax}}_{\xi \in \varXi }L_q(\xi ) \quad \left( = \mathop {\mathrm {argmax}}_{\xi \in \varXi }\log _qL_q(\xi ) \right) . $$

Let us consider the geometry of maximum \(q\)-likelihood estimators. Let \(S_q\) be a \(q\)-exponential family. Suppose that \(\{x_1,\ldots ,x_N\}\) are \(N\) observations generated from \(p(x;\theta ) \in S_q\).

The \(q\)-log-likelihood function is calculated as

$$\begin{aligned} \log _qL_q(\theta )&= \sum _{j=1}^N\log _qp(x_j;\theta ) \ = \ \sum _{j=1}^N\left\{ \sum _{i=1}^n\theta ^iF_i(x_j) - \psi (\theta ) \right\} \\&= \sum _{i=1}^n\theta ^i\sum _{j=1}^NF_i(x_j) - N\psi (\theta ). \end{aligned}$$

The \(q\)-log-likelihood equation is

$$ \partial _i\log _qL_q(\theta ) = \sum _{j=1}^NF_i(x_j) - N\partial _i\psi (\theta ) = 0. $$

Thus, the maximum \(q\)-likelihood estimator for \(\eta \) is given by

$$ \hat{\eta }_i = \frac{1}{N}\sum _{j=1}^NF_i(x_j). $$
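
The estimator can be illustrated numerically. In the sketch below (Python with NumPy; the two-point \(q\)-exponential family of Example 1 with \(q = 1.5\) and a synthetic sample are illustrative choices), maximizing the \(q\)-log-likelihood over a grid recovers the empirical mean of \(F\) in the dual coordinate, that is, in the escort probability at the estimate, as the formula above asserts.

```python
import numpy as np

q = 1.5
log_q = lambda s: (s**(1 - q) - 1) / (1 - q)

# Two-point q-exponential family (Example 1 with n = 1): F(x) = delta_1(x),
# theta = log_q(eta_1) - log_q(eta_0), psi = -log_q(eta_0), eta_0 = 1 - eta_1.
rng = np.random.default_rng(0)
xs = rng.random(10000) < 0.3                # N samples with P(X = x_1) = 0.3
k, N = xs.sum(), xs.size

def q_log_likelihood(eta1):
    theta = log_q(eta1) - log_q(1 - eta1)
    psi = -log_q(1 - eta1)
    return theta * k - N * psi              # sum_j log_q p(x_j; theta)

grid = np.linspace(0.001, 0.999, 99901)
eta1 = grid[np.argmax(q_log_likelihood(grid))]   # estimated p(x_1)
eta_hat = eta1**q / (eta1**q + (1 - eta1)**q)    # dual coordinate (escort prob.)
print(eta_hat, k / N)                            # agree, up to grid resolution
```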

On the other hand, the canonical divergence for \((S_q, \nabla ^{q(e)}, g^q)\) can be calculated as

$$\begin{aligned} D_q^T(p(\hat{\eta }),p(\theta ))&= D(p(\theta ),p(\hat{\eta })) \\&= \psi (\theta ) + \phi (\hat{\eta }) - \sum _{i=1}^n\theta ^i\hat{\eta }_i \\&= \phi (\hat{\eta }) - \frac{1}{N}\log _qL_q(\theta ). \end{aligned}$$

This implies that the \(q\)-likelihood attains the maximum if and only if the normalized Tsallis relative entropy attains the minimum.

Let \(M\) be a curved \(q{\text {-}}{\textit{exponential family}}\) in \(S_q\), that is, \(M\) is a submanifold of \(S_q\) which is a statistical model itself. Suppose that \(\{x_1,\ldots ,x_N\}\) are \(N\) observations generated from \(p(x;u) = p(x;\theta (u)) \in M\). The above argument implies that the maximum \(q\)-likelihood estimator for \(M\) is given by the orthogonal projection of the data with respect to the normalized Tsallis relative entropy.

We remark that the maximum \(q\)-likelihood estimator can be generalized by \(U\)-geometry. (See [8, 9] by Fujimoto and Murata.) However, their approach and ours are slightly different. They applied the \(\chi \)-divergence (\(U\)-divergence) projection for parameter estimation, whereas we applied the generalized relative entropy. As discussed in this paper, the Hessian structures induced from these divergences are different.

3.8 Conclusion

In this paper, we considered two Hessian structures from the viewpoints of the standard expectation and the \(\chi \)-expectation. Though the former and the latter are known as \(U\)-geometry ([21, 26]) and \(\chi \)-geometry ([3]), respectively, they turn out to be different Hessian structures on the same deformed exponential family when compared with each other.

We note that, from the viewpoint of estimating functions, the former is the geometry of bias-corrected \(\chi \)-score functions with the standard expectation, whereas the latter is the geometry of unbiased \(\chi \)-score functions with the \(\chi \)-expectation.

As an application to statistics, we considered a generalization of the maximum likelihood method for \(q\)-exponential families. We used the normalized Tsallis relative entropy for the orthogonal projection, whereas the previous results used \(\chi \)-divergences of Bregman type.