
1 Introduction

A statistical manifold is a Riemannian manifold with a pair of dual torsion-free affine connections and it plays a central role in information geometry. This geometrical structure is induced from an asymmetric (squared) distance-like smooth function called a contrast function by taking its second and third derivatives [1, 2]. The Kullback–Leibler divergence on a regular parametric statistical model is a typical example of contrast functions and its induced geometrical objects are the Fisher metric, the exponential and mixture connections. The geometrical structure determined by these objects plays an important role in the geometry of statistical inference, as is widely known [3, 4].

A statistical manifold admitting torsion (SMAT) is a Riemannian manifold with a pair of dual affine connections, where only one of them must be torsion-free but the other is not necessarily so. This geometrical structure naturally appears in a quantum statistical model (i.e. a set of density matrices representing quantum states) [3] and the notion of SMAT was originally introduced to study such a geometrical structure from a mathematical point of view [5]. A pre-contrast function was subsequently introduced as a generalization of the first derivative of a contrast function, and it was shown that a pre-contrast function induces a SMAT by taking its first and second derivatives [6].

In statistics, an estimating function is a function defined on a direct product of parameter and sample spaces, and it is used to obtain an estimator by solving its corresponding estimating equation. Henmi and Matsuzoe [7] showed that a SMAT also appears in “classical” statistics through an estimating function. More precisely, an estimating function naturally defines a pre-contrast function on a parametric statistical model and a SMAT is induced from it.

This paper summarizes these previous results, focusing on a SMAT in which one of the dual connections is flat. We call this geometrical structure a partially flat space. Although such a space differs from a dually flat space in general, since one of the dual connections in a SMAT may have torsion, some similar properties hold. For example, a canonical pre-contrast function can be naturally defined on a partially flat space, as an analog of the canonical contrast function (or canonical divergence) in a dually flat space. In addition, a generalized projection theorem holds with respect to the canonical pre-contrast function, which can be seen as a generalization of the projection theorem in a dually flat space. This paper is an extended version of the conference proceedings [8]. We also consider a concrete statistical problem as an example of statistical manifolds admitting torsion induced from estimating functions and discuss some future problems, neither of which was included in [8].

2 Statistical Manifolds and Contrast Functions

Throughout this paper, we assume that all geometrical objects on differentiable manifolds are smooth and restrict our attention to Riemannian manifolds, although most of the concepts can be defined for semi-Riemannian manifolds.

Let \((M,g)\) be a Riemannian manifold and \(\nabla \) be an affine connection on M. The dual connection \(\nabla ^{*}\) of \(\nabla \) with respect to g is defined by

$$\begin{aligned} Xg(Y,Z) = g(\nabla _{X}Y, Z) + g(Y, \nabla ^{*}_{X}Z) \ \ \ (\forall X,\forall Y,\forall Z \in {\mathscr {X}}(M)), \end{aligned}$$

where \({\mathscr {X}}(M)\) is the set of all vector fields on M.

For an affine connection \(\nabla \) on M, its curvature tensor field R and torsion tensor field T are defined by the following equations as usual:

$$\begin{aligned} R(X,Y)Z:= & {} \nabla _{X}\nabla _{Y}Z - \nabla _{Y}\nabla _{X}Z - \nabla _{[X,Y]}Z, \\ T(X,Y):= & {} \nabla _{X}Y - \nabla _{Y}X - [X,Y] \end{aligned}$$

(\(\forall X,\forall Y,\forall Z \in {\mathscr {X}}(M)\)). It is said that an affine connection \(\nabla \) is torsion-free if \(T=0\). Note that for a torsion-free affine connection \(\nabla \), \(\nabla ^{*}=\nabla \) implies that \(\nabla \) is the Levi-Civita connection with respect to g. Let \(R^{*}\) and \(T^{*}\) be the curvature and torsion tensor fields of \(\nabla ^{*}\), respectively. It is easy to see that \(R=0\) always implies \(R^{*}=0\), but \(T=0\) does not necessarily imply \(T^{*}=0\).

Let \(\nabla \) be a torsion-free affine connection on a Riemannian manifold \((M,g)\). Following [9], we say that \((M,g,\nabla )\) is a statistical manifold if and only if \(\nabla g\) is a symmetric (0, 3)-tensor field, that is,

$$\begin{aligned} (\nabla _{X}g)(Y,Z) = (\nabla _{Y}g)(X,Z) \ \ \ (\forall X,\forall Y,\forall Z \in {\mathscr {X}}(M)). \end{aligned}$$
(1)

This condition is equivalent to \(T^{*}=0\) under the condition that \(\nabla \) is torsion-free. If \((M,g,\nabla )\) is a statistical manifold, so is \((M,g,\nabla ^{*})\) and it is called the dual statistical manifold of \((M,g,\nabla )\). Since \(\nabla \) and \(\nabla ^{*}\) are both torsion-free for a statistical manifold \((M,g,\nabla )\), \(R=0\) implies that \(\nabla \) and \(\nabla ^{*}\) are both flat. In this case, \((M,g,\nabla ,\nabla ^{*})\) is called a dually flat space [3].

Let \(\phi \) be a real-valued function on the direct product \(M \times M\) of a manifold M and \(X_1, \ldots ,X_i,Y_1, \ldots ,Y_j\) be vector fields on M. The functions \(\phi [X_1, \ldots ,X_i|Y_1, \ldots , Y_j]\), \(\phi [X_1, \ldots ,X_i| \ ]\) and \(\phi [ \ |Y_1, \ldots ,Y_j]\) on M are defined by the equations

$$\begin{aligned} \phi [X_1,\ldots ,X_i | Y_1,\ldots ,Y_j](r):= & {} (X_1)_p \cdots (X_i)_p(Y_1)_q \cdots (Y_j)_q\phi (p,q)|_{p=r,q=r}, \nonumber \\ \end{aligned}$$
(2)
$$\begin{aligned} \phi [X_1,\ldots ,X_i | \ ](r):= & {} (X_1)_p \cdots (X_i)_p\phi (p,r)|_{p=r}, \end{aligned}$$
(3)
$$\begin{aligned} \phi [ \ |Y_1,\ldots ,Y_j](r):= & {} (Y_1)_q \cdots (Y_j)_q\phi (r,q)|_{q=r} \end{aligned}$$
(4)

for any \(r \in M\), respectively [1]. Using these notations, a contrast function \(\phi \) on M is defined to be a real-valued function on \(M \times M\) which satisfies the following conditions [1, 2]:

$$\begin{aligned}&(a)~ \phi (p,p) = 0 \ \ \ (\forall p \in M), \\&(b)~ \phi [X| \ ] = \phi [ \ |X] = 0 \ \ \ (\forall X \in {\mathscr {X}}(M)), \\&(c)~ g(X,Y) := -\phi [X|Y] \ \ (\forall X, \forall Y \in {\mathscr {X}}(M))~\text {is a Riemannian metric on}~M. \end{aligned}$$

Note that these conditions imply that

$$\begin{aligned} \phi (p,q) \ge 0, \ \ \phi (p,q) = 0 \Longleftrightarrow p=q \end{aligned}$$

in some neighborhood of the diagonal set \(\{(r,r) | r \in M\}\) in \(M \times M\). Although a contrast function is not necessarily symmetric, this property means that a contrast function measures some discrepancy between two points on M (at least locally). For a given contrast function \(\phi \), the two affine connections \(\nabla \) and \(\nabla ^{*}\) are defined by

$$\begin{aligned} g(\nabla _{X}Y,Z) = -\phi [XY|Z], \ \ g(Y,\nabla ^{*}_{X}Z) = -\phi [Y|XZ] \end{aligned}$$

(\(\forall X, \forall Y, \forall Z \in {\mathscr {X}}(M)\)). In this case, \(\nabla \) and \(\nabla ^{*}\) are both torsion-free and dual to each other with respect to g. This means that both of \((M,g,\nabla )\) and \((M,g,\nabla ^{*})\) are statistical manifolds. In particular, \((M,g,\nabla )\) is called the statistical manifold induced from the contrast function \(\phi \).

A typical example of contrast functions is the Kullback–Leibler divergence on a statistical model. Let \(S=\{p({\varvec{x}};{\varvec{\theta }}) \ | \ {\varvec{\theta }}=(\theta ^1, \ldots ,\theta ^d) \in \varTheta \subset {\varvec{R}}^d\}\) be a regular parametric statistical model, which is a set of probability density functions with respect to a dominating measure \(\nu \) on a sample space \(\varOmega \). Each element is indexed by a parameter (vector) \({\varvec{\theta }}\) in an open subset \(\varTheta \) of \({\varvec{R}}^d\) and the set S satisfies some regularity conditions, under which S can be seen as a differentiable manifold. The Kullback–Leibler divergence of the two density functions \(p_1({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_1)\) and \(p_2({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_2)\) in S is defined to be

$$\begin{aligned} \phi _{KL}(p_1,p_2) := \int _{\varOmega }p_2({\varvec{x}})\log \frac{p_2({\varvec{x}})}{p_1({\varvec{x}})}\nu (d{\varvec{x}}). \end{aligned}$$

It is easy to see that the Kullback–Leibler divergence satisfies the conditions (a), (b) and (c), and so it is a contrast function on S. Its induced Riemannian metric and dual connections are the Fisher metric \(g^F\), the exponential connection \(\nabla ^{(e)}\) and the mixture connection \(\nabla ^{(m)}\), respectively. They are given as follows:

$$\begin{aligned}&g_{jk}^F({\varvec{\theta }}) := g^F(\partial _j,\partial _k) = E_{{\varvec{\theta }}}\{s^j({\varvec{x}},{\varvec{\theta }})s^k({\varvec{x}},{\varvec{\theta }})\}, \\&\left\{ \begin{array}{l} \varGamma _{ij,k}^{(e)}({\varvec{\theta }}) := g^F(\nabla _{\partial _i}^{(e)}\partial _j,\partial _k) = E_{{\varvec{\theta }}}[\{\partial _is^j({\varvec{x}},{\varvec{\theta }})\}s^k({\varvec{x}},{\varvec{\theta }})] \\ \varGamma _{ik,j}^{(m)}({\varvec{\theta }}) := g^F(\partial _j,\nabla _{\partial _i}^{(m)}\partial _k) = \int _{\varOmega }s^j({\varvec{x}},{\varvec{\theta }})\partial _i\partial _kp({\varvec{x}};{\varvec{\theta }}) \nu (d{\varvec{x}}) \end{array} \right. , \end{aligned}$$

where \(E_{{\varvec{\theta }}}\) indicates that the expectation is taken with respect to \(p({\varvec{x}};{\varvec{\theta }})\), \(\partial _i=\frac{\partial }{\partial \theta ^i}\) and \(s^{i}({\varvec{x}},{\varvec{\theta }})=\partial _{i}\log p({\varvec{x}};{\varvec{\theta }}) \ (i=1,\ldots ,d)\). As is widely known, this geometrical structure plays the most fundamental and important role in the differential geometry of statistical inference [3, 4].
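As a minimal numerical sketch of these formulas (under our own choice of the Bernoulli model; this example is not taken from the paper), one can check by finite differences that the induced metric \(-\phi _{KL}[X|Y]\) recovers the Fisher information \(1/\{\theta (1-\theta )\}\):

```python
# Numerical sketch: for the Bernoulli model p(x; theta) = theta^x (1-theta)^(1-x),
# the metric induced by the KL contrast function, g = -phi[X|Y], should equal
# the Fisher information 1/(theta(1-theta)).
import math

def phi_kl(t1, t2):
    # phi_KL(p1, p2) = E_{p2}[log(p2/p1)] for Bernoulli(t1), Bernoulli(t2)
    return t2 * math.log(t2 / t1) + (1 - t2) * math.log((1 - t2) / (1 - t1))

def induced_metric(theta, h=1e-4):
    # g(theta) = -d^2 phi / (dt1 dt2) evaluated on the diagonal t1 = t2 = theta,
    # approximated by a central finite difference for the mixed partial
    mixed = (phi_kl(theta + h, theta + h) - phi_kl(theta + h, theta - h)
             - phi_kl(theta - h, theta + h) + phi_kl(theta - h, theta - h)) / (4 * h * h)
    return -mixed

theta = 0.3
print(induced_metric(theta))        # ~ 4.7619
print(1 / (theta * (1 - theta)))    # Fisher information: 4.7619...
```

The same finite-difference scheme applied to third derivatives would recover the connection coefficients \(\varGamma ^{(e)}\) and \(\varGamma ^{(m)}\).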

3 Statistical Manifolds Admitting Torsion and Pre-contrast Functions

A statistical manifold admitting torsion is an abstract notion for the geometrical structure in which only one of the dual connections is allowed to have torsion; it naturally appears in a quantum statistical model [3]. The definition is obtained by generalizing (1) in the definition of a statistical manifold as follows [5].

Let \((M,g)\) be a Riemannian manifold and \(\nabla \) be an affine connection on M. We say that \((M,g,\nabla )\) is a statistical manifold admitting torsion (SMAT for short) if and only if

$$\begin{aligned} (\nabla _{X}g)(Y,Z) - (\nabla _{Y}g)(X,Z) = -g(T(X,Y),Z) \ \ \ (\forall X,\forall Y,\forall Z \in {\mathscr {X}}(M)). \end{aligned}$$
(5)

This condition is equivalent to \(T^{*}=0\) in the case where \(\nabla \) possibly has torsion, and it reduces to (1) if \(\nabla \) is torsion-free. Note that \((M,g,\nabla ^{*})\) is not necessarily a statistical manifold although \(\nabla ^{*}\) is torsion-free. It should also be noted that \((M,g,\nabla ^{*})\) is a SMAT whenever a torsion-free affine connection \(\nabla \) is given on a Riemannian manifold \((M,g)\).

For a SMAT \((M,g,\nabla )\), \(R=0\) does not necessarily imply that \(\nabla \) is flat, but it implies that \(\nabla ^{*}\) is flat since \(R^{*}=0\) and \(T^{*}=0\). In this case, we call \((M,g,\nabla ,\nabla ^{*})\) a partially flat space.

Let \(\rho \) be a real-valued function on the direct product \(TM \times M\) of the tangent bundle TM of a manifold M and M itself, and \(X_1, \ldots ,X_i,Y_1, \ldots ,Y_j,Z\) be vector fields on M. The function \(\rho [X_1, \ldots ,X_iZ|Y_1, \ldots ,Y_j]\) on M is defined by

$$\begin{aligned} \rho [X_1,\ldots ,X_iZ | Y_1,\ldots ,Y_j](r) := (X_1)_p \cdots (X_i)_p(Y_1)_q \cdots (Y_j)_q \rho (Z_p,q)|_{p=r,q=r} \end{aligned}$$

for any \(r \in M\). Note that the role of Z is different from those of the vector fields in the notation of (2). The functions \(\rho [X_1, \ldots ,X_iZ| \ ]\) and \(\rho [ \ |Y_1, \ldots ,Y_j]\) are also defined in a similar way to (3) and (4).

We say that \(\rho \) is a pre-contrast function on M if and only if the following conditions are satisfied [6, 7]:

$$\begin{aligned} (a)~&\rho (f_1X_1+f_2X_2,q) = f_1\rho (X_1,q) + f_2\rho (X_2,q) \\&(\forall f_1, \forall f_2 \in C^{\infty }(M), \ \forall X_1, \forall X_2 \in {\mathscr {X}}(M), \ \forall q \in M). \\ (b)~&\rho [X| \ ] = 0 \ \ (\forall X \in {\mathscr {X}}(M)) \ \ \ \left( { i.e.} \ \rho (X_p,p)=0 \ \ (\forall p \in M)\right) . \\ (c)~&g(X,Y) := -\rho [X|Y] \ \ (\forall X, \forall Y \in {\mathscr {X}}(M))~\text {is a Riemannian metric on}~M. \end{aligned}$$

Note that for any contrast function \(\phi \) on M, the function \(\rho _{\phi }\) which is defined by

$$\begin{aligned} \rho _{\phi }(X_p,q) := X_p\phi (p,q) \ \ \ (\forall p, \forall q \in M, \ \forall X_p \in T_p(M)) \end{aligned}$$

is a pre-contrast function on M. The notion of pre-contrast function is obtained by taking the fundamental properties of the first derivative of a contrast function as axioms. For a given pre-contrast function \(\rho \), two affine connections \(\nabla \) and \(\nabla ^{*}\) are defined by

$$\begin{aligned} g(\nabla _{X}Y,Z) = -\rho [XY|Z], \ \ g(Y,\nabla ^{*}_{X}Z) = -\rho [Y|XZ] \end{aligned}$$

(\(\forall X, \forall Y, \forall Z \in {\mathscr {X}}(M)\)) in the same way as for a contrast function. In this case, \(\nabla \) and \(\nabla ^{*}\) are dual to each other with respect to g and \(\nabla ^{*}\) is torsion-free. However, the affine connection \(\nabla \) possibly has torsion. This means that \((M,g,\nabla )\) is a SMAT and it is called the SMAT induced from the pre-contrast function \(\rho \).

4 Canonical Pre-contrast Functions in Partially Flat Spaces

In a dually flat space \((M,g,\nabla ,\nabla ^{*})\), it is well-known that the canonical contrast functions (called the \(\nabla \)- and \(\nabla ^{*}\)-divergences) are naturally defined, and the Pythagorean theorem and the projection theorem are stated in terms of the \(\nabla \)- and \(\nabla ^{*}\)-geodesics and the canonical contrast functions [3, 4]. In a partially flat space \((M,g,\nabla ,\nabla ^{*})\), where \(R=R^{*}=0\) and \(T^{*}=0\), it is possible to define a pre-contrast function which can be seen as canonical, and a projection theorem holds with respect to the “canonical” pre-contrast function and the \(\nabla ^{*}\)-geodesic.

Proposition 1

(Canonical Pre-contrast Functions) Let \((M,g,\nabla ,\nabla ^{*})\) be a partially flat space (i.e. \((M,g,\nabla )\) is a SMAT with \(R=R^{*}=0\) and \(T^{*}=0\)) and \((U,\eta _i)\) be an affine coordinate neighborhood with respect to \(\nabla ^{*}\) in M. The function \(\rho \) on \(TU \times U\) defined by

$$\begin{aligned} \rho (Z_p, q) := - g_p(Z_p, \dot{\gamma }^{*}(0)) \ \ (\forall p, \forall q \in U, \forall Z_p \in T_p(U)), \end{aligned}$$
(6)

is a pre-contrast function on U, where \(\gamma ^{*}:[0,1] \rightarrow U\) is the \(\nabla ^{*}\)-geodesic such that \(\gamma ^{*}(0)=p, \gamma ^{*}(1)=q\) and \(\dot{\gamma }^{*}(0)\) is the tangent vector of \(\gamma ^{*}\) at p. Furthermore, the pre-contrast function \(\rho \) induces the original Riemannian metric g and the dual connections \(\nabla \) and \(\nabla ^{*}\) on U.

Proof

For the function \(\rho \) defined as (6), the condition (a) in the definition of pre-contrast functions follows from the bilinearity of the inner product \(g_p\). The condition (b) immediately follows from \(\dot{\gamma }^{*}(0)=0\) when \(p=q\). By calculating the derivatives of \(\rho \) with the affine coordinate system \((\eta _i)\), it can be shown that the condition (c) holds and that the induced Riemannian metric and dual affine connections coincide with the original g, \(\nabla \) and \(\nabla ^{*}\). \(\square \)

In particular, if \((U,g,\nabla ,\nabla ^{*})\) is a dually flat space, the pre-contrast function \(\rho \) defined in (6) coincides with the directional derivative \(Z_p\phi ^{*}(\cdot ,q)\) of the \(\nabla ^{*}\)-divergence \(\phi ^{*}(\cdot ,q)\) with respect to \(Z_p\) (cf. [10, 11]). Hence, the definition (6) seems to be a natural one, and we call the function \(\rho \) in (6) the canonical pre-contrast function in a partially flat space \((U,g,\nabla ,\nabla ^{*})\).

From the definition of the canonical pre-contrast function, we can immediately obtain the following theorem.

Corollary 1

(Generalized Projection Theorem) Let \((U,\eta _i)\) be an affine coordinate neighborhood in a partially flat space \((M,g,\nabla ,\nabla ^{*})\) and \(\rho \) be the canonical pre-contrast function on U. For any submanifold N in U, the following conditions are equivalent:

$$\begin{aligned}&(i)~\text {The}~\nabla ^{*}-\text {geodesic starting at}~q~\in U~\text {is perpendicular to}~N \text {at}~p~\in ~N. \\&(ii) \ \rho (Z_p, q) = 0 \ \text {for any}~Z_p~\text {in}~T_p(N). \end{aligned}$$

If \((U,g,\nabla ,\nabla ^{*})\) is a dually flat space, this theorem reduces to the projection theorem with respect to the \(\nabla ^{*}\)-divergence \(\phi ^{*}\), since \(\rho (Z_p, q)=Z_p\phi ^{*}(p,q)\). In this sense, it can be seen as a generalized version of the projection theorem in dually flat spaces, and this is also one of the reasons why we consider the pre-contrast function \(\rho \) defined in (6) as canonical.
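A minimal numerical sketch of Corollary 1, under the simplifying assumption of the self-dual Euclidean case \(M = {\varvec{R}}^2\) with \(\nabla = \nabla ^{*}\) flat (so \(\nabla ^{*}\)-geodesics are straight segments and the canonical pre-contrast function reduces to \(\rho (Z_p,q) = -\langle Z_p, q-p\rangle \)); the submanifold below is our own illustrative choice:

```python
# Sketch of the generalized projection theorem in the self-dual Euclidean
# case (an illustrative assumption, not the general partially flat setting):
# the nabla*-geodesic from p to q is the straight segment, so
# rho(Z_p, q) = -<Z_p, q - p>.  For the line N = {(t, 2t)}, rho vanishes in
# the tangent direction exactly at the orthogonal projection of q onto N.
import numpy as np

def rho(z, p, q):
    # canonical pre-contrast function: minus the inner product of Z_p with
    # the initial velocity (q - p) of the geodesic from p to q
    return -np.dot(z, q - p)

q = np.array([3.0, 1.0])
tangent = np.array([1.0, 2.0])               # tangent direction of N
t_star = np.dot(q, tangent) / np.dot(tangent, tangent)
p_star = t_star * tangent                    # orthogonal projection of q onto N

print(rho(tangent, p_star, q))               # ~ 0: condition (ii) holds at the foot
print(rho(tangent, 0.5 * tangent, q))        # nonzero away from the projection
```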

5 Statistical Manifolds Admitting Torsion Induced from Estimating Functions

As we mentioned in the Introduction, a SMAT naturally appears through an estimating function in a “classical” statistical model as well as in a quantum statistical model. In this section, we briefly explain how a SMAT is induced on a parametric statistical model from an estimating function. See [7] for more details.

Let \(S=\{p({\varvec{x}};{\varvec{\theta }}) \ | \ {\varvec{\theta }}=(\theta ^1, \ldots ,\theta ^d) \in \varTheta \subset {\varvec{R}}^d\}\) be a regular parametric statistical model. An estimating function on S, which we consider here, is an \({\varvec{R}}^{d}\)-valued function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) satisfying the following conditions:

$$\begin{aligned} E_{{\varvec{\theta }}}\{{\varvec{u}}({\varvec{x}},{\varvec{\theta }})\} = {\varvec{0}}, \ \ E_{{\varvec{\theta }}}\{\Vert {\varvec{u}}({\varvec{x}},{\varvec{\theta }})\Vert ^2\} < \infty , \ \ \det \left[ E_{{\varvec{\theta }}}\left\{ \frac{\partial {\varvec{u}}}{\partial {\varvec{\theta }}} ({\varvec{x}},{\varvec{\theta }})\right\} \right] \ne 0 \ \ (\forall {\varvec{\theta }} \in \varTheta ). \end{aligned}$$

The first condition is called the unbiasedness of estimating functions, which is important to ensure the consistency of the estimator obtained from an estimating function. Let \(X_1,\ldots ,X_n\) be a random sample from an unknown probability distribution \(p({\varvec{x}};{\varvec{\theta }}_0)\) in S. The estimator \(\hat{{\varvec{\theta }}}\) for \({\varvec{\theta }}_0\) is called an M-estimator if it is obtained as a solution to the estimating equation

$$\begin{aligned} \sum _{i=1}^{n}{\varvec{u}}({\varvec{X}}_i,{\varvec{\theta }}) = {\varvec{0}}. \end{aligned}$$
(7)

The M-estimator \(\hat{{\varvec{\theta }}}\) has the consistency

$$\begin{aligned} \hat{{\varvec{\theta }}} \longrightarrow {\varvec{\theta }}_0 \ \ (\text {in probability}) \end{aligned}$$

as \(n \rightarrow \infty \) and the asymptotic normality

$$\begin{aligned} \sqrt{n}(\hat{{\varvec{\theta }}} - {\varvec{\theta }}_0) \longrightarrow N\left( {\varvec{0}},\mathrm{Avar}\left( \hat{{\varvec{\theta }}}\right) \right) \ \ (\text {in distribution}) \end{aligned}$$

as \(n \rightarrow \infty \) under some additional regularity conditions [12], which are also assumed in the following discussion. The matrix \(\mathrm{Avar}(\hat{{\varvec{\theta }}})\) is the asymptotic variance-covariance matrix of \(\hat{{\varvec{\theta }}}\) and is given by

$$\begin{aligned} \mathrm{Avar}(\hat{{\varvec{\theta }}}) = \{A({\varvec{\theta }}_0)\}^{-1}B({\varvec{\theta }}_0) \{A({\varvec{\theta }}_0)\}^{-T} , \end{aligned}$$
(8)

where \(A({\varvec{\theta }}):=E_{{\varvec{\theta }}}\left\{ (\partial {\varvec{u}}/\partial {\varvec{\theta }})({\varvec{x}},{\varvec{\theta }})\right\} \), \(B({\varvec{\theta }}):=E_{{\varvec{\theta }}}\left\{ {\varvec{u}}({\varvec{x}},{\varvec{\theta }}){\varvec{u}}({\varvec{x}},{\varvec{\theta }})^T\right\} \) and \(-T\) means transposing an inverse matrix (or inverting a transposed matrix).
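The following sketch illustrates M-estimation and the sandwich formula (8) in the scalar case; the bounded estimating function \(u(x,\theta )=\tanh (x-\theta )\) is a hypothetical choice for illustration, not one discussed in the paper:

```python
# Minimal M-estimation sketch.  u(x, theta) = tanh(x - theta) is a
# hypothetical bounded estimating function; it is unbiased at the true
# location of a symmetric distribution.  We solve (7) by bisection and
# estimate Avar by the empirical sandwich formula (8).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=2000)   # sample with theta_0 = 1

def u(x, theta):
    return np.tanh(x - theta)

# Solve sum_i u(X_i, theta) = 0: the sum is decreasing in theta, so bisect.
lo, hi = -10.0, 10.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if u(x, mid).sum() > 0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)

# Sandwich estimate: A = E[du/dtheta], B = E[u^2]  (empirical versions)
A = np.mean(-1.0 / np.cosh(x - theta_hat) ** 2)   # d/dtheta of tanh(x - theta)
B = np.mean(u(x, theta_hat) ** 2)
avar = B / A ** 2                                  # scalar case of (8)
print(theta_hat, avar)
```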

In order to induce the structure of SMAT on S from an estimating function, we consider the notion of standardization of estimating functions. For an estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\), its standardization (or standardized estimating function) is defined by

$$\begin{aligned} {\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }}) := E_{{\varvec{\theta }}}\left\{ {\varvec{s}}({\varvec{x}},{\varvec{\theta }}){\varvec{u}}({\varvec{x}},{\varvec{\theta }})^T\right\} \left[ E_{{\varvec{\theta }}}\left\{ {\varvec{u}}({\varvec{x}},{\varvec{\theta }}){\varvec{u}} ({\varvec{x}},{\varvec{\theta }})^T\right\} \right] ^{-1}{\varvec{u}}({\varvec{x}},{\varvec{\theta }}), \end{aligned}$$

where \({\varvec{s}}({\varvec{x}},{\varvec{\theta }})=(\partial /\partial {\varvec{\theta }})\log p({\varvec{x}};{\varvec{\theta }})\) is the score function for \({\varvec{\theta }}\) [13]. Geometrically, the i-th component of the standardized estimating function \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) is the orthogonal projection of the i-th component of the score function \({\varvec{s}}({\varvec{x}},{\varvec{\theta }})\) onto the linear space spanned by all components of the estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) in the Hilbert space

$$\begin{aligned} {\mathscr {H}}_{{\varvec{\theta }}} := \{a({\varvec{x}}) \ | \ E_{{\varvec{\theta }}}\{a({\varvec{x}})\} = 0, \ E_{{\varvec{\theta }}}\{a({\varvec{x}})^2\} < \infty \} \end{aligned}$$

with the inner product \(\langle a({\varvec{x}}), b({\varvec{x}})\rangle _{{\varvec{\theta }}}:=E_{{\varvec{\theta }}}\{a({\varvec{x}})b({\varvec{x}})\} \ (\forall a({\varvec{x}}), \forall b({\varvec{x}}) \in {\mathscr {H}}_{{\varvec{\theta }}})\). The standardization \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) of \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) does not change the estimator since the estimating equation obtained from \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) is equivalent to the original estimating equation (7). In terms of the standardization, the asymptotic variance-covariance matrix (8) can be rewritten as

$$\begin{aligned} \mathrm{Avar}(\hat{{\varvec{\theta }}}) = \{G({\varvec{\theta }}_0)\}^{-1}, \end{aligned}$$

where \(G({\varvec{\theta }}):=E_{{\varvec{\theta }}}\left\{ {\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }}){\varvec{u}}_{*} ({\varvec{x}},{\varvec{\theta }})^T\right\} \). The matrix \(G({\varvec{\theta }})\) is called the Godambe information matrix [14], which can be seen as a generalization of the Fisher information matrix.
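As a numerical sanity check (again with the hypothetical scalar estimating function \(u = \tanh (x-\theta )\) under the standard normal model at \(\theta = 0\); the grid-based expectations are our own illustrative device), the inverse Godambe information agrees with the sandwich form (8):

```python
# Check that {A}^{-1} B {A}^{-T} = G^{-1} in a scalar example.  Expectations
# under N(0,1) are approximated by a Riemann sum on a fine grid.
import numpy as np

x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
w = np.exp(-x * x / 2) / np.sqrt(2 * np.pi)    # standard normal density

def expect(f):
    return float(np.sum(f * w) * dx)

u = np.tanh(x)                  # hypothetical estimating function at theta = 0
s = x                           # score of the N(theta, 1) model at theta = 0
du = -1.0 / np.cosh(x) ** 2     # du/dtheta at theta = 0

A = expect(du)                  # A(theta) = E[du/dtheta]
B = expect(u * u)               # B(theta) = E[u u^T]
C = expect(s * u)               # E[s u^T]: coefficient of the standardization
G = C ** 2 / B                  # Godambe information E[u_*^2], u_* = C B^{-1} u

print(C, -A)                    # these agree (a Bartlett-type identity)
print(1 / G, B / A ** 2)        # Avar: inverse Godambe info vs. sandwich (8)
```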

As we have seen in Sect. 2, the Kullback–Leibler divergence \(\phi _{KL}\) is a contrast function on S. Hence, the first derivative of \(\phi _{KL}\) is a pre-contrast function on S and is given by

$$\begin{aligned} \rho _{KL}((\partial _j)_{p_1},p_2) := (\partial _j)_{p_1}\phi _{KL}(p_1,p_2) = - \int _{\varOmega }s^{j}({\varvec{x}},{\varvec{\theta }}_1)p({\varvec{x}};{\varvec{\theta }}_2)\nu (d{\varvec{x}}) \end{aligned}$$

for any two probability distributions \(p_1({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_1)\), \(p_2({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_2)\) in S and \(j=1,\ldots ,d\). This observation leads to the following proposition [7].

Proposition 2

(Pre-contrast Functions from Estimating Functions) For an estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) on the parametric model S, a pre-contrast function \(\rho _{{\varvec{u}}}:TS \times S \rightarrow {\varvec{R}}\) is defined by

$$\begin{aligned} \rho _{{\varvec{u}}}((\partial _j)_{p_1},p_2) := - \int _{\varOmega }u_{*}^j({\varvec{x}},{\varvec{\theta }}_1)p({\varvec{x}};{\varvec{\theta }}_2)\nu (d{\varvec{x}}) \end{aligned}$$
(9)

for any two probability distributions \(p_1({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_1)\), \(p_2({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_2)\) in S and \(j=1,\ldots ,d\), where \(u_{*}^j({\varvec{x}},{\varvec{\theta }})\) is the j-th component of the standardization \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) of \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\).

The use of the standardization \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) instead of \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) ensures that the definition of the function \(\rho _{{\varvec{u}}}\) does not depend on the choice of coordinate system (parameter) of S. In fact, for a coordinate transformation (parameter transformation) \({\varvec{\eta }}=\varPhi ({\varvec{\theta }})\), the estimating function \({\varvec{u}}({\varvec{x}},{\varvec{\theta }})\) is changed into \({\varvec{v}}({\varvec{x}},{\varvec{\eta }})={\varvec{u}}({\varvec{x}},\varPhi ^{-1}({\varvec{\eta }}))\) and we have

$$\begin{aligned} {\varvec{v}}_{*}({\varvec{x}},{\varvec{\eta }}) = \left( \frac{\partial {\varvec{\theta }}}{\partial {\varvec{\eta }}}\right) ^T {\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }}). \end{aligned}$$

This is the same as the transformation rule of coordinate bases on a tangent space of a manifold. The set of all components of the standardized estimating function \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) can be seen as a representation of the coordinate basis \(\{(\partial _1)_p,\ldots ,(\partial _d)_p\}\) on the tangent space \(T_p(S)\) of S, where \(p({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }})\).

The proof of Proposition 2 is straightforward. In particular, the condition (b) in the definition of pre-contrast functions follows from the unbiasedness of the (standardized) estimating function. The Riemannian metric g and the dual connections \(\nabla \) and \(\nabla ^{*}\) induced from the pre-contrast function \(\rho _{{{\varvec{u}}}}\) are given as follows:

$$\begin{aligned}&g_{jk}({\varvec{\theta }}) := g(\partial _j,\partial _k) = E_{{\varvec{\theta }}}\{u_{*}^j({\varvec{x}},{\varvec{\theta }}) u_{*}^k({\varvec{x}},{\varvec{\theta }})\} = G({\varvec{\theta }})_{jk}, \\&\left\{ \begin{array}{l} \varGamma _{ij,k}({\varvec{\theta }}) := g(\nabla _{\partial _i}\partial _j,\partial _k) = E_{{\varvec{\theta }}}[\{\partial _iu_{*}^j({\varvec{x}},{\varvec{\theta }})\}s^k({\varvec{x}},{\varvec{\theta }})] \\ \varGamma ^{*}_{ik,j}({\varvec{\theta }}) := g(\partial _j,\nabla ^{*}_{\partial _i}\partial _k) = \int _{\varOmega }u_{*}^j({\varvec{x}},{\varvec{\theta }})\partial _i\partial _kp({\varvec{x}};{\varvec{\theta }}) \nu (d{\varvec{x}}) \end{array} \right. , \end{aligned}$$

where \(G({\varvec{\theta }})_{jk}\) is the (jk) component of the Godambe information matrix \(G({\varvec{\theta }})\). Note that \(\nabla ^{*}\) is always torsion-free since \(\varGamma ^{*}_{ik,j}=\varGamma ^{*}_{ki,j}\), whereas \(\nabla \) is not necessarily torsion-free unless \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) is integrable with respect to \({\varvec{\theta }}\) (i.e. there exists a function \(\psi ({\varvec{x}},{\varvec{\theta }})\) satisfying \(\partial _j\psi ({\varvec{x}},{\varvec{\theta }})=u_{*}^j({\varvec{x}},{\varvec{\theta }}) \ (j=1,\ldots ,d)\)).

If \({\varvec{u}}_{*}({\varvec{x}},{\varvec{\theta }})\) is integrable, so that \(\nabla \) is torsion-free, it is possible to construct a contrast function on S from which the pre-contrast function \(\rho _{{\varvec{u}}}\) in (9) is obtained by taking its first derivative, as follows:

$$\begin{aligned} \phi _{{\varvec{u}}}(p_1,p_2) = \int _{\varOmega }\left\{ \psi ({\varvec{x}},{\varvec{\theta }}_2) - \psi ({\varvec{x}},{\varvec{\theta }}_1) \right\} p({\varvec{x}};{\varvec{\theta }}_2)\nu (d{\varvec{x}}), \end{aligned}$$

where \(\partial _j\psi ({\varvec{x}},{\varvec{\theta }})=u_{*}^j({\varvec{x}},{\varvec{\theta }}) \ (j=1,\ldots ,d)\) and \(p_l({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_l) \ (l=1,2)\).

6 Example

In this section, we consider the estimation problem of voter transition probabilities described in [15] as an example of statistical manifolds admitting torsion (SMAT) induced from estimating functions.

Suppose that two successive elections were carried out in N constituencies and that two political parties C and L contended in each election. The table below summarizes the numbers of voters in the n-th constituency for the respective elections. It is assumed that we can observe only the marginal totals \(m_{1n}, m_{2n}, X_n\) and \(m_n-X_n\), where \(X_n\) is a random variable and the others are treated as fixed constants. Let \(\theta ^1\) and \(\theta ^2\) be the probabilities that a voter who voted for party C and for party L in Election 1, respectively, votes for C in Election 2. These are the parameters of interest here. The random variables \(X_{1n}\) and \(X_{2n}\) in Table 1 are assumed to independently follow the binomial distributions \(B(m_{1n},\theta ^1)\) and \(B(m_{2n},\theta ^2)\), respectively.

Table 1 Votes cast in the n-th constituency (\(n=1,\ldots ,N\))

In the n-th constituency, the probability function of the observation \(X_n=X_{1n}+X_{2n}\) is given by

$$\begin{aligned} p_n(x_n;{\varvec{\theta }}) = \sum _{x_{1n}=0}^{m_{1n}}\left( \begin{array}{c} m_{1n} \\ x_{1n} \end{array} \right) \left( \begin{array}{c} m_{2n} \\ x_n - x_{1n} \end{array} \right) \left( \theta ^1\right) ^{x_{1n}}\left( 1 - \theta ^1\right) ^{m_{1n}-x_{1n}} \left( \theta ^2\right) ^{x_n-x_{1n}}\left( 1 - \theta ^2\right) ^{m_{2n}-x_n+x_{1n}}, \end{aligned}$$

where \({\varvec{\theta }}=\left( \theta ^1,\theta ^2\right) \). The statistical model S in this problem consists of all possible probability functions of the observed data \({\varvec{X}}=(X_1,\ldots ,X_N)\) as follows:

$$\begin{aligned} S = \left\{ p({\varvec{x}};{\varvec{\theta }}) \left| \ {\varvec{\theta }}=\left( \theta ^1,\theta ^2\right) \in (0,1) \times (0,1)\right. \right\} , \end{aligned}$$

where \(p({\varvec{x}};{\varvec{\theta }})=\prod _{n=1}^{N}p_n(x_n;{\varvec{\theta }}) \ \left( {\varvec{x}}=(x_1,\ldots ,x_N)\right) \) since \(X_1,\ldots ,X_N\) are independent.

Although the maximum likelihood estimation of \({\varvec{\theta }}\) is possible based on the likelihood function \(L({\varvec{\theta }})=p({\varvec{X}};{\varvec{\theta }})\), it is somewhat complicated since \(X_{1n}\) and \(X_{2n}\) are not observed separately in each constituency. An alternative approach for estimating \({\varvec{\theta }}\) is to use the quasi-score function \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})=(q^1({\varvec{x}},{\varvec{\theta }}),q^2({\varvec{x}},{\varvec{\theta }}))^T\) [15] as an estimating function, where

$$\begin{aligned} q^1({\varvec{x}},{\varvec{\theta }}) = \sum _{n=1}^{N}\frac{m_{1n}\{x_n - \mu _n({\varvec{\theta }})\}}{V_n({\varvec{\theta }})}, \ q^2({\varvec{x}},{\varvec{\theta }}) = \sum _{n=1}^{N}\frac{m_{2n}\{x_n - \mu _n({\varvec{\theta }})\}}{V_n({\varvec{\theta }})}. \end{aligned}$$

Here, \(\mu _n({\varvec{\theta }})\) and \(V_n({\varvec{\theta }})\) are the mean and variance of \(X_n\), respectively, i.e.

$$\begin{aligned}&\mu _n({\varvec{\theta }}) = E(X_n) = m_{1n}\theta ^1 + m_{2n}\theta ^2 \\&V_n({\varvec{\theta }}) = V(X_n) = m_{1n}\theta ^1\left( 1-\theta ^1\right) + m_{2n}\theta ^2\left( 1-\theta ^2\right) . \nonumber \end{aligned}$$
(10)
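A simulation sketch of this estimation procedure (the constituency sizes, the sample size and the Newton-type solver below are illustrative assumptions, not taken from [15]):

```python
# Simulate X_n = X_1n + X_2n for N constituencies and solve the quasi-score
# equation q(x, theta) = 0 by Newton iteration with a finite-difference
# Jacobian.  All sizes and the solver are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
N = 200
m1 = rng.integers(100, 500, size=N)
m2 = rng.integers(100, 500, size=N)
theta0 = np.array([0.7, 0.3])

x = rng.binomial(m1, theta0[0]) + rng.binomial(m2, theta0[1])

def quasi_score(theta):
    mu = m1 * theta[0] + m2 * theta[1]                               # (10)
    v = m1 * theta[0] * (1 - theta[0]) + m2 * theta[1] * (1 - theta[1])
    r = (x - mu) / v
    return np.array([np.sum(m1 * r), np.sum(m2 * r)])

theta = np.array([0.5, 0.5])
for _ in range(50):
    q = quasi_score(theta)
    h = 1e-6
    J = np.column_stack([(quasi_score(theta + h * e) - q) / h
                         for e in np.eye(2)])                        # Jacobian
    theta = np.clip(theta - np.linalg.solve(J, q), 0.01, 0.99)

print(theta)   # close to the true (0.7, 0.3)
```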

In this example, the random variables \(X_1,\ldots ,X_N\) in the observed data are independent but not identically distributed. However, it is possible to apply the results in Sect. 5 by regarding the whole left-hand side of (7) as a single estimating function and modifying the results accordingly. Note that the estimating function \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})\) is already standardized, since the i-th component \(q^i({\varvec{x}},{\varvec{\theta }})\) of \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})\) is obtained by the orthogonal projection of the i-th component of the score function \({\varvec{s}}({\varvec{x}},{\varvec{\theta }})\) for \({\varvec{\theta }}\) onto the linear space spanned by \(\{x_1-\mu _1({\varvec{\theta }}),\ldots ,x_N-\mu _N({\varvec{\theta }})\}\). In fact, the orthogonal projection is calculated as follows:

$$\begin{aligned}&E_{{\varvec{\theta }}}\left\{ {\varvec{s}}({\varvec{x}},{\varvec{\theta }})({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }}))^T\right\} \left[ E_{{\varvec{\theta }}}\left\{ ({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }}))({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }}))^T\right\} \right] ^{-1} ({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }})) \\&\quad = -E_{{\varvec{\theta }}}\left\{ \frac{\partial }{\partial {\varvec{\theta }}^T}({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }}))\right\} \left[ E_{{\varvec{\theta }}}\left\{ ({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }}))({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }}))^T\right\} \right] ^{-1} ({\varvec{x}} - {\varvec{\mu }}({\varvec{\theta }})) \\&\quad = \left( \begin{array}{ccc} m_{11} & \cdots & m_{1N} \\ m_{21} & \cdots & m_{2N} \end{array} \right) \left( \begin{array}{ccc} V_1({\varvec{\theta }}) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & V_N({\varvec{\theta }}) \end{array} \right) ^{-1} \left( \begin{array}{c} x_1 - \mu _1({\varvec{\theta }}) \\ \vdots \\ x_N - \mu _N({\varvec{\theta }}) \end{array} \right) = \left( \begin{array}{c} q^1({\varvec{x}},{\varvec{\theta }}) \\ q^2({\varvec{x}},{\varvec{\theta }}) \end{array} \right) , \end{aligned}$$

where \({\varvec{x}}=(x_1,\ldots ,x_N)^T\) and \({\varvec{\mu }}({\varvec{\theta }})=(\mu _1({\varvec{\theta }}),\ldots ,\mu _N({\varvec{\theta }}))^T\). In addition, the estimating function \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})\) is not integrable with respect to \({\varvec{\theta }}\) since \(\partial q^1/\partial \theta ^2 \ne \partial q^2/\partial \theta ^1\). From Proposition 2 and the fact that \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})\) itself is a standardized estimating function, we immediately obtain the pre-contrast function \(\rho _q:TS \times S \rightarrow {\varvec{R}}\) defined by \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})\), where

$$\begin{aligned} \rho _q((\partial _i)_{p_1},p_2) = -\sum _{{\varvec{x}}}q^i({\varvec{x}},{\varvec{\theta }}_1)p({\varvec{x}};{\varvec{\theta }}_2) = \sum _{n=1}^{N}\frac{m_{in}\{\mu _n({\varvec{\theta }}_1) - \mu _n({\varvec{\theta }}_2)\}}{V_n({\varvec{\theta }}_1)} \end{aligned}$$

with \(p_l({\varvec{x}})=p({\varvec{x}};{\varvec{\theta }}_l) \in S \ (l=1,2)\). The pre-contrast function \(\rho _q\) induces a statistical manifold admitting torsion as follows.
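The non-integrability of \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})\) noted above can be checked numerically by central finite differences. The sketch below uses the same hypothetical counts \(m_{in}\) and observations \(x_n\) as before; none of these numbers come from the paper.

```python
import numpy as np

# Finite-difference check that the quasi-score is not integrable in theta:
# the mixed partials d q^1 / d theta^2 and d q^2 / d theta^1 disagree,
# so no scalar potential (quasi-likelihood) exists for q.
m = np.array([[120.0, 80.0, 150.0],   # m_{1n}
              [60.0, 140.0, 90.0]])   # m_{2n}
x = np.array([95.0, 110.0, 130.0])    # illustrative observations

def q(theta):
    mu = m[0] * theta[0] + m[1] * theta[1]
    V = m[0] * theta[0] * (1 - theta[0]) + m[1] * theta[1] * (1 - theta[1])
    r = (x - mu) / V
    return np.array([np.sum(m[0] * r), np.sum(m[1] * r)])

theta, h = np.array([0.5, 0.4]), 1e-6
dq1_d2 = (q(theta + [0, h])[0] - q(theta - [0, h])[0]) / (2 * h)
dq2_d1 = (q(theta + [h, 0])[1] - q(theta - [h, 0])[1]) / (2 * h)
```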

Riemannian metric g:

$$\begin{aligned} g_{ij}({\varvec{\theta }}) = \sum _{{\varvec{x}}}q^i({\varvec{x}},{\varvec{\theta }})q^j({\varvec{x}},{\varvec{\theta }})p({\varvec{x}};{\varvec{\theta }}) = \sum _{n=1}^{N}\frac{1}{V_n({\varvec{\theta }})}m_{in}m_{jn}. \end{aligned}$$
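As a sketch under the same hypothetical counts, the closed form \(g_{ij}=\sum _n m_{in}m_{jn}/V_n\) can be compared with a Monte Carlo estimate of \(E_{{\varvec{\theta }}}\{q^iq^j\}\), sampling \(X_n\) as the sum of its two binomial components.

```python
import numpy as np

# Compare the closed form g_ij = sum_n m_in m_jn / V_n with a Monte Carlo
# estimate of E_theta[q^i q^j]; counts are illustrative placeholders.
m = np.array([[120, 80, 150],    # m_{1n}
              [60, 140, 90]])    # m_{2n}
theta = np.array([0.5, 0.4])

mu = m[0] * theta[0] + m[1] * theta[1]
V = m[0] * theta[0] * (1 - theta[0]) + m[1] * theta[1] * (1 - theta[1])
g_closed = (m / V) @ m.T                     # entries sum_n m_in m_jn / V_n

rng = np.random.default_rng(0)
S = 200_000                                  # Monte Carlo sample size
# X_n = X_{1n} + X_{2n} with X_{in} ~ Binomial(m_{in}, theta^i).
x = (rng.binomial(m[0], theta[0], size=(S, 3))
     + rng.binomial(m[1], theta[1], size=(S, 3)))
qs = ((x - mu) / V) @ m.T                    # each row holds (q^1, q^2)
g_mc = qs.T @ qs / S                         # estimates E_theta[q^i q^j]
```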

Dual affine connections \(\nabla ^{*}\) and \(\nabla \):

$$\begin{aligned} \varGamma _{ij,k}^{*}({\varvec{\theta }}) &= \sum _{{\varvec{x}}}\{\partial _i\partial _jp({\varvec{x}};{\varvec{\theta }})\} q^k({\varvec{x}},{\varvec{\theta }}) \\ &= \sum _{n=1}^{N}\frac{m_{kn}}{V_n({\varvec{\theta }})} \left[ \sum _{{\varvec{x}}}x_n\{\partial _i\partial _jp({\varvec{x}};{\varvec{\theta }})\} - \mu _n({\varvec{\theta }})\sum _{{\varvec{x}}}\{\partial _i\partial _jp({\varvec{x}};{\varvec{\theta }})\} \right] \\ &= \sum _{n=1}^{N}\frac{m_{kn}}{V_n({\varvec{\theta }})} \left[ \partial _i\partial _j\sum _{{\varvec{x}}}x_np({\varvec{x}};{\varvec{\theta }}) - \mu _n({\varvec{\theta }})\partial _i\partial _j\sum _{{\varvec{x}}}p({\varvec{x}};{\varvec{\theta }}) \right] \\ &= \sum _{n=1}^{N}\frac{m_{kn}}{V_n({\varvec{\theta }})}\partial _i\partial _j\mu _n({\varvec{\theta }}) = 0 \ \ (\text {from (10)}) \\ \varGamma _{ij,k}({\varvec{\theta }}) &= \varGamma _{ij,k}^{*}({\varvec{\theta }}) - \partial _ig_{jk}({\varvec{\theta }}) \ \ (\text {from the duality between}~\nabla ~\text {and}~\nabla ^{*}) \\ &= \sum _{n=1}^{N}\frac{1-2\theta ^i}{V_n({\varvec{\theta }})^2}m_{in}m_{jn}m_{kn}. \end{aligned}$$

In this example, the statistical model S is \(\nabla ^{*}\)-flat, since the connection coefficients of \(\nabla ^{*}\) with respect to the parameter \({\varvec{\theta }}\) vanish identically. Furthermore, this shows that \({\varvec{\theta }}\) provides an affine coordinate system for \(\nabla ^{*}\). The curvature tensor of \(\nabla \) also vanishes, because the curvature tensor of \(\nabla ^{*}\) vanishes and \(\nabla \) is dual to \(\nabla ^{*}\); nevertheless, the statistical model S is not \(\nabla \)-flat, because \(\nabla \) is not torsion-free. This torsion comes from the non-integrability of the estimating function \({\varvec{q}}({\varvec{x}},{\varvec{\theta }})\). Hence, this geometrical structure provides an example of the partially flat spaces discussed in Sect. 4.
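Under the same hypothetical counts as before, one can verify numerically that \(\varGamma _{ij,k}\) is not symmetric in \(i\) and \(j\) (while it is symmetric in \(j\) and \(k\)), i.e. that \(\nabla \) carries torsion whenever \(\theta ^1 \ne \theta ^2\).

```python
import numpy as np

# Numerical sketch (illustrative counts) of the connection coefficients:
# Gamma*_{ij,k} = 0 identically, while Gamma_{ij,k} fails to be symmetric
# in (i, j), so nabla has torsion even though its curvature vanishes.
m = np.array([[120.0, 80.0, 150.0],   # m_{1n}
              [60.0, 140.0, 90.0]])   # m_{2n}
theta = np.array([0.5, 0.4])

V = m[0] * theta[0] * (1 - theta[0]) + m[1] * theta[1] * (1 - theta[1])

def Gamma(i, j, k):
    """Gamma_{ij,k} = sum_n (1 - 2 theta^i) m_in m_jn m_kn / V_n^2 (0-based i, j, k)."""
    return float(np.sum((1 - 2 * theta[i]) / V**2 * m[i] * m[j] * m[k]))

# A torsion component T_{ij,k} = Gamma_{ij,k} - Gamma_{ji,k} with i != j.
torsion = Gamma(0, 1, 0) - Gamma(1, 0, 0)
```

A nonzero `torsion` value reflects exactly the failure of \(\nabla \)-flatness discussed above.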

7 Future Problems

In this paper, we have summarized existing results on statistical manifolds admitting torsion, focusing especially on partially flat spaces. Although some results beyond the standard theory of information geometry have been obtained, including a generalized projection theorem in partially flat spaces and statistical manifolds admitting torsion induced from estimating functions in statistics, many essential problems remain unsolved. We conclude this paper by discussing some of them.

(1) The canonical pre-contrast function and the generalized projection theorem in a partially flat space \((M,g,\nabla ,\nabla ^{*})\) are described only in terms of the flat connection \(\nabla ^{*}\). In this sense, they can be regarded as a concept and a theorem for the Riemannian manifold \((M,g)\) with the flat connection \(\nabla ^{*}\). What is the role of the affine connection \(\nabla \) in the partially flat space \((M,g,\nabla ,\nabla ^{*})\), especially when \(\nabla \) is not torsion-free?

(2) The canonical pre-contrast function is defined in terms of the Riemannian metric g and the \(\nabla ^{*}\)-geodesics in a partially flat space \((U,g,\nabla ,\nabla ^{*})\), without using the affine coordinate system \((\eta _i)\) on U. Hence, this function can be defined on a general statistical manifold admitting torsion \((M,g,\nabla )\) as long as the relevant \(\nabla ^{*}\)-geodesic exists uniquely. Under what condition is this function a pre-contrast function that induces the original Riemannian metric g and the dual affine connections \(\nabla \) and \(\nabla ^{*}\)? What properties does the (canonical) pre-contrast function have in this case? These problems are closely related to the work of [10, 11], which attempts to define a canonical divergence (canonical contrast function) on a general statistical manifold beyond dually flat spaces.

(3) Our definition of a pre-contrast function from an estimating function is obtained by replacing the score function, which appears in the first derivative of the Kullback–Leibler divergence, with a standardized estimating function. However, this is not the only way to obtain a pre-contrast function from an estimating function. For example, if we consider the \(\beta \)-divergence [16] (or density power divergence [17]) as a contrast function, its first derivative is also a pre-contrast function and takes the same form as (9) in Proposition 2, yet the estimating function appearing in it is not standardized. Although standardization seems natural, further consideration is necessary on how a pre-contrast function should be defined from a given estimating function.

(4) For the example considered in Sect. 6, we can show that the pre-contrast function \(\rho _{{\varvec{q}}}\) coincides with the canonical pre-contrast function in the partially flat space \((S,g,\nabla ,\nabla ^{*})\), so the generalized projection theorem (Corollary 1 in Sect. 4) can be applied. However, its statistical meaning has not been clarified yet. Although the SMAT induced from an estimating function is expected to be related to statistical inference based on that estimating function, clarifying this relationship is left as a future problem.