1 Introduction

In information geometry, a central role is played by a statistical manifold, which is a Riemannian manifold with a pair of dual torsion-free affine connections. This geometrical structure is induced from an asymmetric (squared) distance-like smooth function called a contrast function by taking its second and third derivatives [1, 2]. The Kullback-Leibler divergence on a regular parametric statistical model is a typical example of a contrast function, and its induced geometrical objects are the Fisher metric and the exponential and mixture connections. The structure determined by these objects plays an important role in the geometry of statistical inference, as is widely known [3, 4].

A statistical manifold admitting torsion (SMAT) is a Riemannian manifold with a pair of dual affine connections, where only one of them must be torsion-free and the other need not be. This geometrical structure naturally appears in a quantum statistical model (i.e. a set of density matrices representing quantum states) [3] and the notion of SMAT was originally introduced to study such a geometrical structure from a mathematical point of view [5]. A pre-contrast function was subsequently introduced as a generalization of the first derivative of a contrast function and it was shown that a pre-contrast function induces a SMAT by taking its first and second derivatives [6].

Henmi and Matsuzoe [7] showed that a SMAT also appears in “classical” statistics through an estimating function. More precisely, an estimating function naturally defines a pre-contrast function on a parametric statistical model and a SMAT is induced from it.

This paper summarizes such previous results and provides some new insights for this geometrical structure. That is, we show that the canonical pre-contrast function can be defined on a partially flat space, which is a SMAT where only one of its dual connections is flat, and discuss a generalized projection theorem in a partially flat space. This theorem relates orthogonal projection of the geodesic with respect to the flat connection to the canonical pre-contrast function.

2 Statistical Manifolds and Contrast Functions

In this paper, we assume that all geometrical objects on differentiable manifolds are smooth and restrict our attention to Riemannian manifolds, although most of the concepts can be defined for semi-Riemannian manifolds.

Let \((M,g)\) be a Riemannian manifold and \(\nabla \) be an affine connection on M. The dual connection \(\nabla ^{*}\) of \(\nabla \) with respect to g is defined by

$$\begin{aligned} Xg(Y,Z) = g(\nabla _{X}Y, Z) + g(Y, \nabla ^{*}_{X}Z) \ \ (\forall X,\forall Y,\forall Z \in \mathcal{X}(M)) \end{aligned}$$

where \(\mathcal{X}(M)\) is the set of all vector fields on M.

For an affine connection \(\nabla \) on M, its curvature tensor field R and torsion tensor field T are defined by the following equations as usual:

$$\begin{aligned} R(X,Y)Z:= & {} \nabla _{X}\nabla _{Y}Z - \nabla _{Y}\nabla _{X}Z - \nabla _{[X,Y]}Z \ \ (\forall X,\forall Y,\forall Z \in \mathcal{X}(M)), \\ T(X,Y):= & {} \nabla _{X}Y - \nabla _{Y}X - [X,Y] \ \ (\forall X,\forall Y \in \mathcal{X}(M)). \end{aligned}$$

An affine connection \(\nabla \) is said to be torsion-free if \(T=0\). Note that for a torsion-free affine connection \(\nabla \), \(\nabla ^{*}=\nabla \) implies that \(\nabla \) is the Levi-Civita connection with respect to g. Let \(R^{*}\) and \(T^{*}\) be the curvature and torsion tensor fields of \(\nabla ^{*}\), respectively. It is easy to see that \(R=0\) always implies \(R^{*}=0\), but \(T=0\) does not necessarily imply \(T^{*}=0\).

Let \(\nabla \) be a torsion-free affine connection on a Riemannian manifold \((M,g)\). Following [8], we say that \((M,g,\nabla )\) is a statistical manifold if and only if \(\nabla g\) is a symmetric (0, 3)-tensor field, that is

$$\begin{aligned} (\nabla _{X}g)(Y,Z) = (\nabla _{Y}g)(X,Z) \ \ (\forall X,\forall Y,\forall Z \in \mathcal{X}(M)). \end{aligned}$$
(1)

This condition is equivalent to \(T^{*}=0\) under the condition that \(\nabla \) is torsion-free. If \((M,g,\nabla )\) is a statistical manifold, so is \((M,g,\nabla ^{*})\) and it is called the dual statistical manifold of \((M,g,\nabla )\). Since \(\nabla \) and \(\nabla ^{*}\) are both torsion-free for a statistical manifold \((M,g,\nabla )\), \(R=0\) implies that \(\nabla \) and \(\nabla ^{*}\) are both flat. In this case, \((M,g,\nabla ,\nabla ^{*})\) is called a dually flat space.

Let \(\phi \) be a real-valued function on the direct product \(M \times M\) of a manifold M and \(X_1,...,X_i,Y_1,...,Y_j\) be vector fields on M. The functions \(\phi [X_1,...,X_i|Y_1,...,Y_j]\), \(\phi [X_1,...,X_i| \ ]\) and \(\phi [ \ |Y_1,...,Y_j]\) on M are defined by the equations

$$\begin{aligned} \phi [X_1,\ldots ,X_i | Y_1,\ldots ,Y_j](r):= & {} (X_1)_p \cdots (X_i)_p(Y_1)_q \cdots (Y_j)_q\phi (p,q)|_{p=r,q=r}, \end{aligned}$$
(2)
$$\begin{aligned} \phi [X_1,\ldots ,X_i | \ ](r):= & {} (X_1)_p \cdots (X_i)_p\phi (p,r)|_{p=r}, \end{aligned}$$
(3)
$$\begin{aligned} \phi [ \ |Y_1,\ldots ,Y_j](r):= & {} (Y_1)_q \cdots (Y_j)_q\phi (r,q)|_{q=r} \end{aligned}$$
(4)

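for any \(r \in M\), respectively [1]. Using these notations, a contrast function \(\phi \) is defined to be a real-valued function which satisfies the following conditions on M [1, 2]:

(a) \(\phi [ \ | \ ](r) = \phi (r,r) = 0 \ \ (\forall r \in M)\),

(b) \(\phi [X| \ ] = \phi [ \ |X] = 0 \ \ (\forall X \in \mathcal{X}(M))\),

(c) \(g(X,Y) := -\phi [X|Y] \ \ (\forall X, \forall Y \in \mathcal{X}(M))\) is a Riemannian metric on M.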

Note that these conditions imply that in some neighborhood of the diagonal set \(\{(r,r) | r \in M\}\) in \(M \times M\),

$$\begin{aligned} \phi (p,q) \ge 0, \ \ \phi (p,q) = 0 \Longleftrightarrow p=q. \end{aligned}$$

Although a contrast function is not necessarily symmetric, this inequality means that a contrast function measures some discrepancy between two points on M (at least locally). For a given contrast function \(\phi \), the two affine connections \(\nabla \) and \(\nabla ^{*}\) are defined by

$$\begin{aligned} g(\nabla _{X}Y,Z) = -\phi [XY|Z], \ g(Y,\nabla ^{*}_{X}Z) = -\phi [Y|XZ] \ \ (\forall X, \forall Y, \forall Z \in \mathcal{X}(M)). \end{aligned}$$

In this case, \(\nabla \) and \(\nabla ^{*}\) are both torsion-free and dual to each other with respect to g, which means that both of \((M,g,\nabla )\) and \((M,g,\nabla ^{*})\) are statistical manifolds. In particular, \((M,g,\nabla )\) is called the statistical manifold induced from the contrast function \(\phi \).

Now we briefly mention a typical example of contrast functions. Let \(S=\{p({\varvec{x}};{\varvec{\theta }}) \ | \ {\varvec{\theta }}=(\theta ^1,...,\theta ^d) \in \varTheta \subset {\varvec{R}}^d\}\) be a regular parametric statistical model, which is a set of probability density functions with respect to a dominating measure \(\nu \) on a sample space \(\mathcal{X}\). Each element is indexed by a parameter (vector) \({\varvec{\theta }}\) in an open subset \(\varTheta \) of \({\varvec{R}}^d\) and the set S satisfies some regularity conditions, under which S can be seen as a differentiable manifold. The Kullback-Leibler divergence of the two density functions \(p_1({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_1)\) and \(p_2({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_2)\) in S is defined to be

$$\begin{aligned} \phi _{KL}(p_1,p_2) := \int _\mathcal{X}p_2({\varvec{x}})\log \frac{p_2({\varvec{x}})}{p_1({\varvec{x}})}\nu (d{\varvec{x}}). \end{aligned}$$

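It is easy to see that the Kullback-Leibler divergence satisfies the conditions (a), (b) and (c), and so it is a contrast function on S. Its induced Riemannian metric and dual connections are the Fisher metric and the exponential and mixture connections, respectively, and are given as follows:

$$\begin{aligned} g_{jk}({\varvec{\theta }}):= & {} g(\partial _j,\partial _k) = E_{{\varvec{\theta }}}\{s^{j}({\varvec{x}};{\varvec{\theta }})s^{k}({\varvec{x}};{\varvec{\theta }})\}, \\ \varGamma _{ij,k}({\varvec{\theta }}):= & {} g(\nabla _{\partial _i}\partial _j,\partial _k) = E_{{\varvec{\theta }}}[\{\partial _i s^{j}({\varvec{x}};{\varvec{\theta }})\}s^{k}({\varvec{x}};{\varvec{\theta }})], \\ \varGamma ^{*}_{ik,j}({\varvec{\theta }}):= & {} g(\partial _j,\nabla ^{*}_{\partial _i}\partial _k) = \int _\mathcal{X}s^{j}({\varvec{x}};{\varvec{\theta }})\partial _i\partial _k p({\varvec{x}};{\varvec{\theta }})\nu (d{\varvec{x}}), \end{aligned}$$

where \(E_{{\varvec{\theta }}}\) indicates that the expectation is taken with respect to \(p({\varvec{x}};{\varvec{\theta }})\), \(\partial _i=\frac{\partial }{\partial \theta ^i}\) and \(s^{i}({\varvec{x}};{\varvec{\theta }})=\partial _{i}\log p({\varvec{x}};{\varvec{\theta }}) \ (i=1,\ldots ,d)\). As is widely known, this geometrical structure plays the most fundamental and important role in the differential geometry of statistical inference [3, 4].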
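As a numerical illustration of how the metric and the dual connections arise from derivatives of a contrast function, the following sketch applies the formulas \(g = -\phi [X|Y]\), \(g(\nabla _{X}Y,Z) = -\phi [XY|Z]\) and \(g(Y,\nabla ^{*}_{X}Z) = -\phi [Y|XZ]\) to the Kullback-Leibler divergence by finite differences. The Bernoulli model with success probability \(\theta \), the step sizes and all function names are assumptions made only for this illustration; they do not appear in the text above.

```python
import math

def kl(theta1, theta2):
    """phi_KL(p1, p2) = sum_x p2(x) log(p2(x)/p1(x)) for two Bernoulli models."""
    p1 = (1.0 - theta1, theta1)
    p2 = (1.0 - theta2, theta2)
    return sum(b * math.log(b / a) for a, b in zip(p1, p2))

def induced_metric(theta, h=1e-4):
    """g = -phi[d|d]: one derivative in each argument, evaluated on the diagonal."""
    mixed = (kl(theta + h, theta + h) - kl(theta + h, theta - h)
             - kl(theta - h, theta + h) + kl(theta - h, theta - h)) / (4.0 * h * h)
    return -mixed

def induced_gamma(theta, h=1e-3):
    """Gamma = -phi[dd|d]: two derivatives in the first argument, one in the second."""
    def d2_first(t2):
        return (kl(theta + h, t2) - 2.0 * kl(theta, t2) + kl(theta - h, t2)) / (h * h)
    return -(d2_first(theta + h) - d2_first(theta - h)) / (2.0 * h)

def induced_gamma_star(theta, h=1e-3):
    """Gamma* = -phi[d|dd]: one derivative in the first argument, two in the second."""
    def d1_first(t2):
        return (kl(theta + h, t2) - kl(theta - h, t2)) / (2.0 * h)
    return -(d1_first(theta + h) - 2.0 * d1_first(theta) + d1_first(theta - h)) / (h * h)

theta = 0.3
fisher = 1.0 / (theta * (1.0 - theta))                           # Fisher information
gamma_e = (2.0 * theta - 1.0) / (theta * (1.0 - theta)) ** 2     # E[(ds) s] in closed form
print(induced_metric(theta), fisher)
print(induced_gamma(theta), gamma_e)
print(induced_gamma_star(theta))   # close to zero: p is linear in theta
```

In this parametrization the mixture coefficient vanishes because the density is linear in \(\theta \), and the exponential coefficient then equals \(\partial g\), which is the coordinate form of the duality between \(\nabla \) and \(\nabla ^{*}\).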

3 Statistical Manifolds Admitting Torsion and Pre-contrast Functions

A statistical manifold admitting torsion is an abstract notion for the geometrical structure where only one of the dual connections is allowed to have torsion, which naturally appears in a quantum statistical model [3]. The definition is obtained by generalizing (1) in the definition of a statistical manifold as follows [5].

Let \((M,g)\) be a Riemannian manifold and \(\nabla \) be an affine connection on M. We say that \((M,g,\nabla )\) is a statistical manifold admitting torsion (SMAT for short) if and only if

$$\begin{aligned} (\nabla _{X}g)(Y,Z) - (\nabla _{Y}g)(X,Z) = -g(T(X,Y),Z) \ \ (\forall X,\forall Y,\forall Z \in \mathcal{X}(M)). \end{aligned}$$
(5)

This condition is equivalent to \(T^{*}=0\) in the case where \(\nabla \) possibly has torsion. Note that the condition (5) reduces to (1) if \(\nabla \) is torsion-free and that \((M,g,\nabla ^{*})\) is not necessarily a statistical manifold although \(\nabla ^{*}\) is torsion-free. It should also be noted that \((M,g,\nabla ^{*})\) is a SMAT whenever a torsion-free affine connection \(\nabla \) is given on a Riemannian manifold \((M,g)\).

For a SMAT \((M,g,\nabla )\), \(R=0\) does not necessarily imply that \(\nabla \) is flat, but it implies that \(\nabla ^{*}\) is flat since \(R^{*}=0\) and \(T^{*}=0\). In this case, we call \((M,g,\nabla ,\nabla ^{*})\) a partially flat space.

Let \(\rho \) be a real-valued function on the direct product \(TM \times M\) of a manifold M and its tangent bundle TM, and \(X_1,...,X_i,Y_1,...,Y_j,Z\) be vector fields on M. The function \(\rho [X_1,...,X_iZ|Y_1,...,Y_j]\) is defined by

$$\begin{aligned} \rho [X_1,\ldots ,X_iZ | Y_1,\ldots ,Y_j](r) := (X_1)_p \cdots (X_i)_p(Y_1)_q \cdots (Y_j)_q \rho (Z_p,q)|_{p=r,q=r} \end{aligned}$$

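for any \(r \in M\). Note that the role of Z is different from that of the vector fields in the notation of (2). The functions \(\rho [X_1,...,X_iZ| \ ]\) and \(\rho [ \ |Y_1,...,Y_j]\) are also defined in a similar way to (3) and (4).

We say that \(\rho \) is a pre-contrast function on M if and only if the following conditions are satisfied [6, 7]:

(a) \(\rho (f_1X_1+f_2X_2,q) = f_1\rho (X_1,q)+f_2\rho (X_2,q) \ \ (\forall f_i \in C^{\infty }(M), \ \forall X_i \in \mathcal{X}(M), \ \forall q \in M)\),

(b) \(\rho [X| \ ] = 0 \ \ (\forall X \in \mathcal{X}(M))\), i.e. \(\rho (X_p,p)=0 \ (\forall p \in M)\),

(c) \(g(X,Y) := -\rho [X|Y] \ \ (\forall X, \forall Y \in \mathcal{X}(M))\) is a Riemannian metric on M.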

Note that for any contrast function \(\phi \), the function \(\rho _{\phi }\) which is defined by \(\rho _{\phi }(X_p,q) := X_p\phi (p,q) \ \ (\forall p, \forall q \in M, \ \forall X_p \in T_p(M))\) is a pre-contrast function on M. The notion of pre-contrast function is obtained by taking the fundamental properties of the first derivative of a contrast function as axioms. For a given pre-contrast function, two affine connections \(\nabla \) and \(\nabla ^{*}\) are defined by the following equations in the same way as a contrast function:

$$\begin{aligned} g(\nabla _{X}Y,Z) = -\rho [XY|Z], \ g(Y,\nabla ^{*}_{X}Z) = -\rho [Y|XZ] \ \ (\forall X, \forall Y, \forall Z \in \mathcal{X}(M)). \end{aligned}$$

In this case, \(\nabla \) and \(\nabla ^{*}\) are dual to each other with respect to g and \(\nabla ^{*}\) is torsion-free. However, the affine connection \(\nabla \) possibly has torsion. This means that \((M,g,\nabla )\) is a SMAT and it is called the SMAT induced from the pre-contrast function \(\rho \).

4 Generalized Projection Theorem in Partially Flat Spaces

In a dually flat space \((M,g,\nabla ,\nabla ^{*})\), it is well known that the canonical contrast functions (called the \(\nabla \)- and \(\nabla ^{*}\)-divergences) are naturally defined, and the Pythagorean theorem and the projection theorem are stated in terms of the \(\nabla \)- and \(\nabla ^{*}\)-geodesics and the canonical contrast functions [3, 4]. In a partially flat space \((M,g,\nabla ,\nabla ^{*})\), where \(R=R^{*}=0\) and \(T^{*}=0\), a pre-contrast function which seems to be canonical can be defined, and a projection theorem holds for this “canonical” pre-contrast function and the \(\nabla ^{*}\)-geodesic.

Proposition 1

(Canonical Pre-contrast Functions). Let \((M,g,\nabla ,\nabla ^{*})\) be a partially flat space (i.e. \((M,g,\nabla )\) is a SMAT with \(R=R^{*}=0\) and \(T^{*}=0\)) and \((U,\eta _i)\) be an affine coordinate neighborhood with respect to \(\nabla ^{*}\) in M. The function \(\rho \) on \(TU \times U\) defined by the following equation is a pre-contrast function on U which induces the SMAT \((U,g,\nabla )\):

$$\begin{aligned} \rho (Z_p, q) := - g_p(Z_p, \dot{\gamma }^{*}(0)) \ \ (\forall p, \forall q \in U, \forall Z_p \in T_p(U)), \end{aligned}$$
(6)

where \(\gamma ^{*}:[0,1] \rightarrow U\) is the \(\nabla ^{*}\)-geodesic such that \(\gamma ^{*}(0)=p, \gamma ^{*}(1)=q\) and \(\dot{\gamma }^{*}(0)\) is the tangent vector of \(\gamma ^{*}\) at p.

Proof

For the function \(\rho \) defined as (6), the condition (a) in the definition of pre-contrast functions follows from the bilinearity of the inner product \(g_p\). The condition (b) immediately follows from \(\dot{\gamma }^{*}(0)=0\) when \(p=q\). By calculating the derivatives of \(\rho \) with the affine coordinate system \((\eta _i)\), it can be shown that the condition (c) holds and that the induced Riemannian metric and dual affine connections coincide with the original g, \(\nabla \) and \(\nabla ^{*}\).    \(\square \)

In particular, if \((M,g,\nabla ,\nabla ^{*})\) is a dually flat space, the pre-contrast function \(\rho \) defined in (6) coincides with the directional derivative of the \(\nabla ^{*}\)-divergence \(\phi ^{*}(\cdot , q)\) with respect to \(Z_p\) (cf. [9, 10]). Hence, the definition (6) seems to be a natural one and we call the function \(\rho \) in (6) the canonical pre-contrast function in a partially flat space \((U,g,\nabla ,\nabla ^{*})\).

From the definition of the canonical pre-contrast function, we can immediately obtain the following corollary.

Corollary 1

(Generalized Projection Theorem). Let U be an affine coordinate neighborhood and \(\rho \) be the canonical pre-contrast function defined in Proposition 1. For any submanifold N in U, the following conditions are equivalent:

(a) The \(\nabla ^{*}\)-geodesic connecting \(q \in U\) and \(p \in N\) is perpendicular to N at p.

(b) The point \(p \in N\) satisfies \(\rho (Z_p,q)=0\) for any \(Z_p \in T_p(N)\).

In the case where \((U,g,\nabla ,\nabla ^{*})\) is a dually flat space, the projection theorem states that the minimum of the \(\nabla ^{*}\)-divergence \(\phi ^{*}(\cdot , q):N \rightarrow {\varvec{R}}\) is attained at the point \(p \in N\) where the \(\nabla ^{*}\)-geodesic starting at q is perpendicular to N. This immediately follows from the generalized projection theorem, since the directional derivative of \(\phi ^{*}(\cdot , q)\) is the canonical pre-contrast function.
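The statement can be checked numerically in the simplest special case. The sketch below assumes \(U={\varvec{R}}^2\) with a constant metric matrix and \(\nabla =\nabla ^{*}\) equal to the flat Levi-Civita connection, so that the \(\nabla ^{*}\)-geodesic from p to q is the straight line \(p+t(q-p)\) and \(\dot{\gamma }^{*}(0)=q-p\). This is a torsion-free, dually flat example chosen only for illustration; the metric, the line N and the function names are hypothetical.

```python
def rho(z, p, q, g):
    """Canonical pre-contrast value -g_p(Z_p, gamma*'(0)) for a straight-line geodesic."""
    v = [qi - pi for pi, qi in zip(p, q)]            # initial velocity q - p
    n = len(z)
    return -sum(z[i] * g[i][j] * v[j] for i in range(n) for j in range(n))

def divergence(p, q, g):
    """Half the squared g-distance; its derivative in p along Z is rho(Z_p, q)."""
    w = [pi - qi for pi, qi in zip(p, q)]
    n = len(w)
    return 0.5 * sum(w[i] * g[i][j] * w[j] for i in range(n) for j in range(n))

def project_onto_line(a, d, q, g):
    """Point p of N = {a + t d} at which rho(d, q) = 0."""
    n = len(a)
    w = [q[i] - a[i] for i in range(n)]
    num = sum(d[i] * g[i][j] * w[j] for i in range(n) for j in range(n))
    den = sum(d[i] * g[i][j] * d[j] for i in range(n) for j in range(n))
    t = num / den
    return [a[i] + t * d[i] for i in range(n)]

g = [[2.0, 1.0], [1.0, 3.0]]        # constant positive definite metric (assumed)
a, d = [0.0, 0.0], [1.0, 1.0]       # N is the line through a with direction d
q = [3.0, -1.0]
p = project_onto_line(a, d, q, g)
print(p, rho(d, p, q, g))           # rho vanishes on the tangent direction of N at p
print(divergence(p, q, g))          # no other point of N gives a smaller value
```

Here condition (b) of the corollary is solved directly for the line parameter, and the resulting point is also the minimizer of the divergence restricted to N, as the projection theorem for dually flat spaces predicts.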

5 Statistical Manifolds Admitting Torsion Induced from Estimating Functions

As we mentioned in the Introduction, a SMAT naturally appears through estimating functions in a “classical” statistical model as well as in a quantum statistical model. In this section, we briefly explain how a SMAT is induced on S from an estimating function. See [7] for more details including a concrete example.

Let \(S=\{p({\varvec{x}};{\varvec{\theta }}) \ | \ {\varvec{\theta }}=(\theta ^1,...,\theta ^d) \in \varTheta \subset {\varvec{R}}^d\}\) be a regular parametric statistical model. An estimating function on S, which we consider here, is an \({\varvec{R}}^{d}\)-valued function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) satisfying the following conditions:

$$\begin{aligned} E_{{\varvec{\theta }}}\{{\varvec{u}}({\varvec{x}},{\varvec{\theta }})\} = {\varvec{0}}, \ \ E_{{\varvec{\theta }}}\{\Vert {\varvec{u}}({\varvec{x}},{\varvec{\theta }})\Vert ^2\} < \infty , \ \ \det \left[ E_{{\varvec{\theta }}}\left\{ \frac{\partial {\varvec{u}}}{\partial {\varvec{\theta }}}({\varvec{x}},{\varvec{\theta }})\right\} \right] \ne 0 \ \ (\forall {\varvec{\theta }} \in \varTheta ). \end{aligned}$$

The first condition is called the unbiasedness of estimating functions, which is important to ensure the consistency of the estimator obtained from an estimating function. Let \({\varvec{X}}_1,\ldots ,{\varvec{X}}_n\) be a random sample from an unknown probability distribution \(p({\varvec{x}};{\varvec{\theta }}_0)\) in S. The estimator \(\hat{{\varvec{\theta }}}\) for \({\varvec{\theta }}_0\), which is obtained as a solution to the estimating equation \(\sum _{i=1}^{n}{\varvec{u}}({\varvec{X}}_i,{\varvec{\theta }}) = {\varvec{0}}\), is called an M-estimator. The M-estimator \(\hat{{\varvec{\theta }}}\) has the consistency \(\hat{{\varvec{\theta }}} \rightarrow {\varvec{\theta }}_0\) (in probability as \(n \rightarrow \infty \)) and the asymptotic normality \(\sqrt{n}(\hat{{\varvec{\theta }}} - {\varvec{\theta }}_0) \rightarrow N({\varvec{0}},\mathrm{Avar}(\hat{{\varvec{\theta }}}))\) (in distribution as \(n \rightarrow \infty \)) under some additional regularity conditions [11], where \(\mathrm{Avar}(\hat{{\varvec{\theta }}})\) is the asymptotic variance-covariance matrix of \(\hat{{\varvec{\theta }}}\) and is given by \(\mathrm{Avar}(\hat{{\varvec{\theta }}}) = \{A({\varvec{\theta }}_0)\}^{-1}B({\varvec{\theta }}_0) \{A({\varvec{\theta }}_0)\}^{-T} \) with \(A({\varvec{\theta }}) := E_{{\varvec{\theta }}}\{(\partial {\varvec{u}}/\partial {\varvec{\theta }})({\varvec{x}},{\varvec{\theta }})\}\) and \(B({\varvec{\theta }}) := E_{{\varvec{\theta }}}\{{\varvec{u}}({\varvec{x}},{\varvec{\theta }}){\varvec{u}}({\varvec{x}},{\varvec{\theta }})^T\}\).
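To make the M-estimator and its asymptotic variance concrete, the following sketch uses a one-parameter example that is not taken from the text: the exponential model with mean \(\theta \) and the estimating function \(u(x,\theta )=x^2-2\theta ^2\), which is unbiased because \(E_{\theta }(x^2)=2\theta ^2\). The quantities A and B are taken to be the expected derivative of u and the second moment of u, as in the usual sandwich formula; the sample size and seed are arbitrary.

```python
import math
import random

def m_estimate(sample):
    """Root of sum_i u(X_i, theta) = 0 for the assumed u(x, theta) = x**2 - 2*theta**2."""
    return math.sqrt(sum(x * x for x in sample) / (2.0 * len(sample)))

def sandwich_variance(theta):
    """A^{-1} B A^{-T} in the scalar case, using the moments E[x^k] = k! theta^k."""
    a = -4.0 * theta                  # A = E[du/dtheta]
    b = 20.0 * theta ** 4             # B = E[u^2] = E[x^4] - 4 theta^2 E[x^2] + 4 theta^4
    return b / (a * a)

rng = random.Random(0)                # fixed seed, for reproducibility only
theta0 = 2.0
sample = [rng.expovariate(1.0 / theta0) for _ in range(20000)]
theta_hat = m_estimate(sample)
print(theta_hat)                      # close to theta0 by consistency
print(sandwich_variance(theta0))      # 1.25 * theta0**2, larger than the inverse Fisher information theta0**2
```

The estimating equation has a closed-form root here, which keeps the sketch short; in general a numerical root finder would be needed.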

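In order to induce the structure of SMAT on S from an estimating function, we consider the notion of standardization of estimating functions. For an estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\), its standardization (or standardized estimating function) is defined by

$$\begin{aligned} {\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }}) := E_{{\varvec{\theta }}}\{{\varvec{s}}({\varvec{x}},{\varvec{\theta }}){\varvec{u}}({\varvec{x}},{\varvec{\theta }})^T\}\left[ E_{{\varvec{\theta }}}\{{\varvec{u}}({\varvec{x}},{\varvec{\theta }}){\varvec{u}}({\varvec{x}},{\varvec{\theta }})^T\}\right] ^{-1}{\varvec{u}}({\varvec{x}},{\varvec{\theta }}), \end{aligned}$$

where \({\varvec{s}}({\varvec{x}},{\varvec{\theta }})=(\partial /\partial {\varvec{\theta }})\log p({\varvec{x}};{\varvec{\theta }})\) is the score function [12]. Geometrically, the ith component of the standardized estimating function \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) is the orthogonal projection of the ith component of the score function \({\varvec{s}}({\varvec{x}},{\varvec{\theta }})\) onto the linear space spanned by all components of the estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) in the Hilbert space

$$\begin{aligned} \mathcal{H}_{{\varvec{\theta }}} := \{a({\varvec{x}}) \ | \ E_{{\varvec{\theta }}}\{a({\varvec{x}})\}=0, \ E_{{\varvec{\theta }}}\{a({\varvec{x}})^2\}<\infty \} \end{aligned}$$

with the inner product \(\langle a({\varvec{x}}),b({\varvec{x}})\rangle _{{\varvec{\theta }}} := E_{{\varvec{\theta }}}\{a({\varvec{x}})b({\varvec{x}})\}\). In terms of the standardization, the asymptotic variance-covariance matrix can be rewritten as \(\mathrm{Avar}(\hat{{\varvec{\theta }}}) = \{G({\varvec{\theta }}_0)\}^{-1}\), where \(G({\varvec{\theta }}) := E_{{\varvec{\theta }}}\{{\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }}){\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})^T\}\). The matrix \(G({\varvec{\theta }})\) is called the Godambe information matrix [13], which is a generalization of the Fisher information matrix.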
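The projection interpretation of the standardization can be illustrated in the scalar case, where the standardized estimating function is the multiple \(c(\theta )u\) of u with \(c(\theta )=E_{\theta }(su)/E_{\theta }(u^2)\) and the Godambe information is \(E_{\theta }(u_{*}^2)\). The sketch below again assumes the exponential model with mean \(\theta \) and the hypothetical estimating function \(u(x,\theta )=x^2-2\theta ^2\), and approximates expectations with a simple trapezoidal rule; the grid size and truncation point are arbitrary choices.

```python
import math

def expectation(f, theta, n=20000, upper=60.0):
    """Trapezoidal approximation of E_theta[f(x)] under the exponential density with mean theta."""
    h = upper * theta / n
    total = 0.0
    for k in range(n + 1):
        x = k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * f(x) * math.exp(-x / theta) / theta
    return total * h

def score(x, theta):
    """s(x, theta) = d/dtheta log p(x; theta) for p(x; theta) = exp(-x/theta)/theta."""
    return (x - theta) / theta ** 2

def u(x, theta):
    """Assumed unbiased estimating function (E[x^2] = 2 theta^2)."""
    return x * x - 2.0 * theta ** 2

def standardization_coefficient(theta):
    """c(theta) such that u_* = c * u is the projection of the score onto span{u}."""
    e_su = expectation(lambda x: score(x, theta) * u(x, theta), theta)
    e_uu = expectation(lambda x: u(x, theta) ** 2, theta)
    return e_su / e_uu

def godambe_information(theta):
    """G(theta) = E[u_*^2] in the scalar case."""
    c = standardization_coefficient(theta)
    return expectation(lambda x: (c * u(x, theta)) ** 2, theta)

theta = 2.0
print(godambe_information(theta))     # approximately 0.8 / theta**2
print(1.0 / theta ** 2)               # Fisher information, an upper bound for G
```

For this example the Godambe information is strictly smaller than the Fisher information, reflecting that the projection of the score onto the span of u loses part of the score.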

As we have seen in Sect. 2, the Kullback-Leibler divergence \(\phi _{KL}\) is a contrast function on S. Hence, the first derivative of \(\phi _{KL}\) is a pre-contrast function on S and given by

$$\begin{aligned} \rho _{KL}((\partial _j)_{p_1},p_2) := (\partial _j)_{p_1}\phi _{KL}(p_1,p_2) = - \int _\mathcal{X}s^{j}({\varvec{x}},{\varvec{\theta }}_1)p({\varvec{x}};{\varvec{\theta }}_2)\nu (d{\varvec{x}}) \end{aligned}$$

for any two probability distributions \(p_1({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_1)\), \(p_2({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_2)\) in S and \(j=1,\ldots ,d\). This observation leads to the following proposition.

Proposition 2

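(Pre-contrast Functions from Estimating Functions). For an estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) on the parametric model S, a pre-contrast function \(\rho _{{\varvec{u}}}\) is defined by

$$\begin{aligned} \rho _{{\varvec{u}}}((\partial _j)_{p_1},p_2) := - \int _\mathcal{X}u_{*}^{j}({\varvec{x}},{\varvec{\theta }}_1)p({\varvec{x}};{\varvec{\theta }}_2)\nu (d{\varvec{x}}) \end{aligned}$$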

for any two probability distributions \(p_1({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_1)\), \(p_2({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_2)\) in S and \(j=1,\ldots ,d\), where \(u_{*}^j({\varvec{x}},{\varvec{\theta }})\) is the jth component of the standardization \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) of \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\).

The use of the standardization \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) instead of \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) ensures that the definition of the function does not depend on the choice of coordinate system (parameter) of S. In fact, for a coordinate transformation (parameter transformation) \({\varvec{\eta }}=\varPhi ({\varvec{\theta }})\), the estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) is changed into \({\varvec{v}}({\varvec{x}},{\varvec{\eta }})={\varvec{u}}({\varvec{x}},\varPhi ^{-1}({\varvec{\eta }}))\) and we have \({\varvec{v}}_{*}({\varvec{x}},{\varvec{\eta }})=\left( \partial {\varvec{\theta }}/\partial {\varvec{\eta }}\right) ^T {\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\). The proof of Proposition 2 is straightforward. In particular, the condition (b) in the definition of pre-contrast function follows from the unbiasedness of the (standardized) estimating function. The Riemannian metric g, dual connections \(\nabla \) and \(\nabla ^{*}\) induced from the pre-contrast function are given as follows:

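$$\begin{aligned} g_{jk}({\varvec{\theta }}):= & {} g(\partial _j,\partial _k) = E_{{\varvec{\theta }}}\{u_{*}^{j}({\varvec{x}},{\varvec{\theta }})u_{*}^{k}({\varvec{x}},{\varvec{\theta }})\} = G({\varvec{\theta }})_{jk}, \\ \varGamma _{ij,k}({\varvec{\theta }}):= & {} g(\nabla _{\partial _i}\partial _j,\partial _k) = E_{{\varvec{\theta }}}[\{\partial _i u_{*}^{j}({\varvec{x}},{\varvec{\theta }})\}s^{k}({\varvec{x}},{\varvec{\theta }})], \\ \varGamma ^{*}_{ik,j}({\varvec{\theta }}):= & {} g(\partial _j,\nabla ^{*}_{\partial _i}\partial _k) = \int _\mathcal{X}u_{*}^{j}({\varvec{x}},{\varvec{\theta }})\partial _i\partial _k p({\varvec{x}};{\varvec{\theta }})\nu (d{\varvec{x}}), \end{aligned}$$

where \(G({\varvec{\theta }})_{jk}\) is the (j, k) component of the Godambe information matrix \(G({\varvec{\theta }})\). Note that \(\nabla ^{*}\) is always torsion-free since \(\varGamma ^{*}_{ik,j}=\varGamma ^{*}_{ki,j}\), whereas \(\nabla \) is not necessarily torsion-free unless \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) is integrable with respect to \({\varvec{\theta }}\).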

Henmi and Matsuzoe [7] discussed the quasi-score function in [14], which is a well-known example of a non-integrable estimating function. They showed that one of the induced affine connections actually has torsion and the other connection is flat, that is, a partially flat space is induced. The pre-contrast function defined from the estimating function coincides with the canonical pre-contrast function and the generalized projection theorem can be applied. However, its statistical meaning has not been clarified yet. Although the SMAT induced from an estimating function is expected to be related to statistical inference based on that estimating function, clarifying this relationship is left for future work.