Abstract
The standard model of information geometry, given by the Fisher–Rao metric and the Amari–Chentsov tensor, reflects an embedding of probability densities via the \(\log \)-transform. The present paper studies parametrized statistical models and the induced geometry using arbitrary embedding functions, comparing single-function approaches (Eguchi’s U-embedding and Naudts’ deformed-log or phi-embedding) and a two-function embedding approach (Zhang’s conjugate rho-tau embedding). In terms of geometry, the rho-tau embedding of a parametric statistical model defines both a Riemannian metric, called the “rho-tau metric”, and an alpha-family of rho-tau connections, with the former controlled by a single function and the latter by both embedding functions \(\rho \) and \(\tau \) in general. We identify conditions under which the rho-tau metric becomes Hessian and hence the \(\pm 1\) rho-tau connections are dually flat. For any choice of rho and tau there exist models, belonging to the phi-deformed exponential family, for which the rho-tau metric is Hessian. In other cases the rho–tau metric may be only conformally equivalent to a Hessian metric. Finally, we show a formulation of the maximum entropy framework which yields the phi-exponential family as the solution.
1 Introduction
In classical information geometry [1, 3] the Fisher–Rao metric, as a Riemannian metric on the manifold of parametric probability models, is accompanied by a family of \(\alpha \)-connections \(\varGamma ^{(\alpha )}\) with a dualistic structure such that \(\varGamma ^{(\alpha )}\) and \(\varGamma ^{(-\alpha )}\) jointly preserve the metric. This so-called “\(\alpha \)-geometry” is induced by the family of \(\alpha \)-divergence functions, which includes the Kullback–Leibler divergence as a special case (\(\alpha = \pm 1\)). Furthermore, when the statistical model belongs to the exponential family, the connections \(\varGamma ^{(\pm 1)}\) are dually flat.
Zhang [20, 23] carefully delineated the different roles played by the interpolation parameter \(\alpha \) in information geometry:
1. it parametrizes the divergence function (as in \(\alpha \)-divergence);
2. it parametrizes the monotone embedding of probabilities (as in \(\alpha \)-embedding);
3. it parametrizes the convex combination of connections (as in \(\alpha \)-connection).
A thorough understanding of the subtleties of these various roles of \(\alpha \) in \(\alpha \)-geometry leads not only to the class of two-parameter \((\alpha , \beta )\)-divergences (generalizing \(\alpha \)-divergence in different ways), which nevertheless induce the parametric family of \(\alpha \beta \)-connections with the product \(\alpha \cdot \beta \) as a single parameter [20], but also to the more profound notion of reference-representation biduality uniquely embodied in information geometry [20, 22].
There has been considerable interest in generalizing the “standard model” and the corresponding exponential (and its dual, mixture) family of probability functions. By generalizing, we mean that the dualistic \(\alpha \)-geometry is still preserved while one relaxes from the restrictive exponential (or mixture) family. The generalizations are often achieved in the context of various monotone embedding functions, from \(\alpha \)-embedding (power function) to arbitrary deformed exponential embedding functions, such as phi-embedding [11] and U-embedding [6]. Zhang [20, 22, 23] uses two arbitrary functions, referred to as conjugate rho–tau embedding. Our paper surveys these approaches and their links, with the goal of providing a unifying account in generalizing Amari’s \(\alpha \)-geometry with its characteristic biduality (reference duality and representation duality [21]). A particular outcome is the demonstration of the dually flat nature of \(\varGamma ^{(\pm 1)}\), despite considerable relaxation both in terms of deforming the exponential family and of the canonical divergence function.
1.1 The standard model
1.1.1 Fisher–Rao metric and \(\alpha \)-connections
Let a measure space \(({{\mathcal {X}}},\mathrm{d}x)\) be given, and let \({\mathcal {M}}\) denote the space of probability density functions defined on the sample space \({{\mathcal {X}}}\). A parametric family of density functions, \(p^\theta \equiv p(\cdot | \theta )\), called a parametric statistical model, is the association \(\theta \mapsto p(\cdot |\theta )\) of a point \(\theta = [\theta ^1, \ldots , \theta ^n]\) in a connected open subset \({\mathcal {D}}\) of \({\mathbb {R}}^n\) with a probability density function \(p^\theta \) in \({\mathcal {M}}\). The elements of the parametric statistical model form a Riemannian manifold \({\mathbb {M}}\). For simplicity we assume that a single chart \(p^\theta \mapsto \theta \) covers all of \({\mathbb {M}}\).
The Fisher–Rao metric and the \(\alpha \)-connections are given by
$$\begin{aligned} g_{ij}(\theta ) = \int _{{\mathcal {X}}} p(x|\theta )\, \partial _i \log p(x|\theta )\, \partial _j \log p(x|\theta )\, \mathrm{d}x \end{aligned}$$(1)
and
$$\begin{aligned} \varGamma ^{(\alpha )}_{ij,k}(\theta ) = \int _{{\mathcal {X}}} p(x|\theta ) \left[ \partial _i \partial _j \log p(x|\theta ) + \frac{1-\alpha }{2}\, \partial _i \log p(x|\theta )\, \partial _j \log p(x|\theta ) \right] \partial _k \log p(x|\theta )\, \mathrm{d}x. \end{aligned}$$(2)
The \(\alpha \)-connections satisfy the dualistic relation
$$\begin{aligned} \left( \varGamma ^{(\alpha )}\right) ^{*} = \varGamma ^{(-\alpha )}. \end{aligned}$$(3)
Here \(*\) denotes the conjugate (dual) connection. The pair of conjugate connections preserves the dual pairing of vectors in the tangent space with co-vectors in the cotangent space when the tangent and cotangent spaces are mapped to each other by the Riemannian metric. Any Riemannian manifold with its metric, g, and conjugate connections, \(\varGamma , \varGamma ^{*}\), given in the form of Eqs. (1)–(3), is called a statistical manifold (in the narrower sense) and is denoted as \(\{ {\mathbb {M}}, g, \varGamma ^{(\pm \alpha )}\}\). In the broader sense, a statistical manifold \(\{ {\mathbb {M}}, g, \varGamma , \varGamma ^{*} \}\) is a differentiable manifold equipped with a Riemannian metric g and a pair of torsion-free conjugate connections \(\varGamma \equiv \varGamma ^{(1)}, \varGamma ^{*} \equiv \varGamma ^{(-1)}\) which jointly preserve the metric g, without necessarily requiring g and \(\varGamma , \varGamma ^{*}\) to take the forms of Eqs. (1)–(3).
1.1.2 Exponential and mixture families
An exponential family of probability density functions is defined as
$$\begin{aligned} p(x|\theta ) = \exp \left( \sum _i \theta ^i F_i(x) - \varPhi (\theta ) \right) , \end{aligned}$$(4)
where \(\theta \) is its canonical parameter and \(F_i(x) \, (i = 1,\ldots , n)\) is a set of linearly independent functions with the same support in \({{\mathcal {X}}}\), and the cumulant generating function (“potential function”) \(\varPhi (\theta )\) is:
$$\begin{aligned} \varPhi (\theta ) = \log \int _{{\mathcal {X}}} \exp \left( \sum _i \theta ^i F_i(x) \right) \mathrm{d}x. \end{aligned}$$(5)
Substitution of (4) into (1) and (2) results in the Fisher metric
$$\begin{aligned} g_{ij}(\theta ) = \int _{{\mathcal {X}}} p(x|\theta )\left( F_i(x) - \partial _i \varPhi \right) \left( F_j(x) - \partial _j \varPhi \right) \mathrm{d}x, \end{aligned}$$
which can be written as
$$\begin{aligned} g_{ij}(\theta ) = \frac{\partial ^2 \varPhi }{\partial \theta ^i \partial \theta ^j}. \end{aligned}$$
The \(\alpha \)-connections can be written as
$$\begin{aligned} \varGamma ^{(\alpha )}_{ij,k}(\theta ) = \frac{1-\alpha }{2}\, \frac{\partial ^3 \varPhi }{\partial \theta ^i \partial \theta ^j \partial \theta ^k}. \end{aligned}$$
The \(\alpha \)-connections for the exponential family are dually flat when \(\alpha = \pm 1\). In particular, all components of \(\varGamma ^{(1)}_{ij,k}\) vanish on the manifold of an exponential family.
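The identification of the Fisher metric with the Hessian of \(\varPhi \) can be checked numerically on the simplest case; the following sketch uses the Bernoulli family, with helper names that are ours, not the paper’s.

```python
import math

# Bernoulli as a one-parameter exponential family:
# p(x|theta) = exp(theta*x - Phi(theta)), x in {0, 1}, Phi(theta) = log(1 + e^theta)
Phi = lambda t: math.log(1.0 + math.exp(t))

def fisher_info(t):
    # E[(x - Phi'(t))^2], with Phi'(t) = E[x] = sigmoid(t)
    s = math.exp(t) / (1.0 + math.exp(t))
    return (1 - s) ** 2 * s + s ** 2 * (1 - s)

def hessian_Phi(t, h=1e-4):
    # central finite difference approximation of Phi''(t)
    return (Phi(t + h) - 2 * Phi(t) + Phi(t - h)) / (h * h)
```

Here `fisher_info` evaluates the expectation in (1) directly, while `hessian_Phi` differentiates the potential; the two agree to the accuracy of the finite difference.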
On the other hand, the mixture family
$$\begin{aligned} p(x|\theta ) = \sum _i \theta ^i F_i(x), \end{aligned}$$
when viewed as a manifold charted by its mixture parameter \(\theta \), with the constraints \(\sum _i \theta ^{i} = 1\) and \(\int _{{\mathcal {X}}} F_i (x)\, \mathrm{d}x = 1\), turns out to have identically vanishing \(\varGamma ^{(-1)}_{ij,k}\).
The connections, \(\varGamma ^{(1)}\) and \(\varGamma ^{(-1)}\), are also called the exponential and mixture connections, or the e- and m-connection, respectively.
1.2 \(\alpha \)-Embedding function
Amari [1, 3] considered a one-parameter family of denormalized probability density functions \(p^{(\alpha )}(\cdot |\theta )\) defined by \(p^{(\alpha )}(x|\theta )=l^{(\alpha )}(p(x|\theta ))\). The \(\alpha \)-embedding function \(l^{(\alpha )}: {\mathbb {R}}^+ \rightarrow {\mathbb {R}}\) is defined as
$$\begin{aligned} l^{(\alpha )}(u) = {\left\{ \begin{array}{ll} \log u, &{} \alpha = 1, \\ \dfrac{2}{1-\alpha }\, u^{(1-\alpha )/2}, &{} \alpha \ne 1. \end{array}\right. } \end{aligned}$$
Under \(\alpha \)-embedding, the denormalized density functions form the so-called \(\alpha \)-affine manifold, see [3], p. 46. It is remarkable that the Fisher–Rao metric and the \(\alpha \)-connections, under such \(\alpha \)-representation, have the following expressions:
$$\begin{aligned} g_{ij}(\theta )= & {} \int _{{\mathcal {X}}} \left[ \partial _i l^{(\alpha )}(p(x|\theta ))\right] \left[ \partial _j l^{(-\alpha )}(p(x|\theta ))\right] \mathrm{d}x, \\ \varGamma ^{(\alpha )}_{ij,k}(\theta )= & {} \int _{{\mathcal {X}}} \left[ \partial _i \partial _j l^{(\alpha )}(p(x|\theta ))\right] \left[ \partial _k l^{(-\alpha )}(p(x|\theta ))\right] \mathrm{d}x. \end{aligned}$$
Clearly, for any given \(\alpha \) value, the components of \(\varGamma ^{(\alpha )}\) are all identically zero on the (unnormalized) \(\alpha \)-affine manifold, by virtue of the definition (8) of the \(\alpha \)-family. Hence, the \(\pm \alpha \)-connections are dually flat.
1.3 A plethora of probability embeddings
There have been various attempts at generalizing the standard exponential model of normalized probability density functions. The efforts considered here center on “deforming” the exponential function into other functional forms, where the embedding functions of the \(\alpha \)-family are treated as deformed logarithm functions, whose inverse functions are then deformed exponentials. Better-known examples are the q-exponential functions [17] and the \(\kappa \)-functions [7]. More general deformations were introduced in [11]. The corresponding deformed exponential families coincide with the models of U-statistics [6]. From the point of view of [20, 22, 23] these models involve generalized embeddings which fit under a universal framework of conjugate rho-tau embeddings.
(i) q-logarithmic embedding. Tsallis [17] investigates the equilibrium distributions of statistical physics, obtained by maximization of the Boltzmann–Gibbs–Shannon entropy under constraints. He replaces the entropy function by a q-dependent entropy, \(q \in {\mathbb {R}}\). This results in a deformed version of statistical physics. The q-logarithmic/exponential functions were introduced in [18]:
$$\begin{aligned} \log _q(u)=\frac{1}{1-q}\left( u^{1-q}-1\right) ,\qquad \exp _q(u)=\left[ 1+(1-q)u\right] ^{1/(1-q)}, \qquad q\not =1. \end{aligned}$$
Note that q-embedding and \(\alpha \)-embedding functions are different: \(\log _q(u)=l^{(\alpha )}(u)-2/(1-\alpha )\) with \(\alpha =2q-1\). Like \(\alpha \)-embedding, q-embedding reduces to the standard logarithm as q tends to 1.
(ii) \(\kappa \)-logarithmic embedding. An alternative to the q-deformed exponential model for statistical physics is Kaniadakis’ \(\kappa \)-model [7], where
$$\begin{aligned} \log _\kappa (u)= & {} \frac{1}{2\kappa }\left( u^{\kappa }- u^{-\kappa } \right) , \qquad \exp _\kappa (u) =\left( \kappa u + \sqrt{1+\kappa ^2 u^2} \right) ^\frac{1}{\kappa }, \qquad \kappa \not =0. \end{aligned}$$
In the limit \(\kappa \rightarrow 0\) one recovers the standard exponential/logarithm.
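Both deformed pairs are readily verified numerically as mutual inverses that recover the natural logarithm in the appropriate limits; a small sketch (function names are ours):

```python
import math

def log_q(u, q):
    # Tsallis q-logarithm: (u^(1-q) - 1) / (1 - q), for q != 1
    return (u ** (1 - q) - 1) / (1 - q)

def exp_q(u, q):
    # q-exponential, inverse of log_q on its range
    return (1 + (1 - q) * u) ** (1 / (1 - q))

def log_kappa(u, kappa):
    # Kaniadakis kappa-logarithm: (u^kappa - u^(-kappa)) / (2*kappa), kappa != 0
    return (u ** kappa - u ** (-kappa)) / (2 * kappa)

def exp_kappa(u, kappa):
    # kappa-exponential, inverse of log_kappa
    return (kappa * u + math.sqrt(1 + kappa ** 2 * u ** 2)) ** (1 / kappa)
```

For instance, `exp_q(log_q(u, q), q)` returns `u`, and both deformed logarithms approach `math.log(u)` as \(q \rightarrow 1\) and \(\kappa \rightarrow 0\).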
The \(\phi \)-, U- and \((\rho , \tau )\)-embeddings are monotone embeddings which rely on one or two free functions. So instead of using a one-parameter family of functions which includes the logarithm/exponential function for a particular parameter value, arbitrary functions are used in place of the logarithm/exponential function. These embeddings are the main focus of this paper. The phi-model [11], U-model [6], and rho-tau model [20] were independently conceived around 2004 under different motivations.
1.4 Goals, organization, and notations
Our goal is to provide a unified theory of monotone embedding which generalizes that of logarithmic embedding and the classic \(\alpha \)-geometry. Specifically, we revisit the divergence function, cross-entropy and entropy determined by the rho-tau embedding [20] and their induced \(\alpha \)-geometry with respect to the phi-deformed exponential families [11], and show a unification of the “deformed” approach of [11] and the conjugate embedding approach of [20]. It is also shown that the independently proposed U-embedding [6] is identical to the phi-embedding in terms of divergence and entropy functions, both being subsumed by the rho–tau embedding. The duality in the rho-tau entropy is shown to be important in the formulation of the generalized maximum entropy principle, the solution of which is the phi-exponential family.
In Sect. 2, we first review the deformed logarithm, \(\log _\phi \), and the deformed exponential, \(\exp _\phi \). Then we point out that \(\log _\phi \) and \(\exp _\phi \) are nothing but an arbitrary pair of mutually inverse monotone functions, and can be represented as derivatives of a pair of conjugate convex functions \(f, f^*\). The deformed divergence \(D_\phi (p,q)\) is then precisely the Bregman divergence \(D_f(p,q)\) associated with f. The construction of deformed entropy and cross-entropy is reviewed, as well as their construction starting from the U-embedding. Then, we review the rho-tau embedding, which provides two independently chosen embedding functions. We explicitly identify its entropy and cross-entropy. Theorem 1 shows that the divergence function and entropy function of the rho-tau embedding reduce as a special case to those given by the phi-embedding and U-embedding, while the rho-tau cross-entropy reduces as another special case to the U cross-entropy.
In Sect. 3 we explore the freedom of choosing two functions \(\rho \) and \(\tau \), such that they lead to the same weighting function associated with the Riemannian metric. We call it the gauge freedom. Two prominent gauges, plus their duals, are studied. They lead to the entropy and cross-entropy functions given by phi/U-embedding and to those given by Tsallis.
In Sect. 4, we study the Riemannian metric induced from the rho-tau divergence (and equivalently, the rho-tau cross-entropy), as well as from the entropy and dual entropy functions. Emphasis is put on the gauge freedom which is left once the metric is fixed. The metric tensor absorbs only one of the two degrees of freedom offered by the independent choice of the two strictly increasing functions rho and tau. We then provide a characterization of conditions under which the rho–tau metric is Hessian. The rho–tau connections are also investigated.
In Sect. 5, we study deformed exponential families of probability models and the Riemannian geometry they induce. We show that each phi-exponential family is associated with two special rho–tau metrics: (i) a Hessian one related to its entropy and (ii) a non-Hessian one that is conformally equivalent to the Hessian of a normalization function. We show how these models are related to the maximum entropy principle.
In the final section we provide a summary and discussion.
Throughout the paper it is assumed that two strictly increasing differentiable functions \(\rho \) and \(\tau \) are given. The rho–tau divergence induces a metric tensor g on finite-dimensional manifolds of probability distributions and turns them into Riemannian manifolds. Here we assume regularity conditions such that the relevant integrals all exist; cf. [10]. A preliminary version of this report appeared in [15, 24].
2 Divergence, entropy, and cross-entropy
2.1 “Deforming” exponential and logarithmic functions
Naudts [11, 13] defines the phi-deformed logarithm
$$\begin{aligned} \log _\phi (u) = \int _1^u \frac{\mathrm{d}v}{\phi (v)}. \end{aligned}$$
Here, \(\phi (v)\) is a strictly positive function such that \(1/\phi (v)\) is integrable. In the context of discrete probabilities it suffices that it is strictly positive on the open interval (0, 1), possibly vanishing at the end points. In the case of a probability density function it is assumed to be strictly positive on the interval \((0,+\infty )\). Note that by construction one has \(\log _\phi (1)=0\). The inverse of the phi-logarithm is denoted \(\exp _\phi (u)\), and called phi-exponential function:
The phi-exponential has an integral expression
$$\begin{aligned} \exp _\phi (u) = 1 + \int _0^u \psi (v)\, \mathrm{d}v, \end{aligned}$$
where the function \(\psi (u)\) is given by
$$\begin{aligned} \psi (u) = \phi \left( \exp _\phi (u)\right) . \end{aligned}$$
In terms of \(\phi , \psi \), we have the following relations:
$$\begin{aligned} \left( \log _\phi \right) ^\prime (u) = \frac{1}{\phi (u)}, \qquad \left( \exp _\phi \right) ^\prime (u) = \psi (u) = \phi \left( \exp _\phi (u)\right) , \qquad \phi (u) = \psi \left( \log _\phi (u)\right) . \end{aligned}$$
We want to stress that all four functions, \(\phi , \psi , \log _\phi , \exp _\phi \), arise out of choosing one positive-valued function \(\phi \).
As examples, \(\phi (u) = u\) gives rise to the classic natural logarithm and exponential. The choice \(\phi (u)=u^q\), \(q\not =1\), reproduces the q-deformed logarithm and exponential, as introduced by Tsallis [18] and mentioned in the introduction. Taking \(\phi (u)=u/(1+u)\) leads to (see [10, 16]) \(\log _\phi (u)=u-1+\log (u)\). Taking \(\phi (u)=u(1+\epsilon u)\) leads to (see [25])
$$\begin{aligned} \log _\phi (u) = \log \frac{(1+\epsilon )u}{1+\epsilon u}. \end{aligned}$$
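These closed forms can be double-checked by numerically integrating \(1/\phi \); the midpoint-rule helper below is ours, and the \(\epsilon \)-case closed form \(\log \left( (1+\epsilon )u/(1+\epsilon u)\right) \) follows by partial fractions:

```python
import math

def log_phi(u, phi, n=100_000):
    # log_phi(u) = integral of 1/phi(v) from 1 to u, midpoint rule
    h = (u - 1.0) / n
    return sum(h / phi(1.0 + (i + 0.5) * h) for i in range(n))

eps = 0.5
closed_forms = {
    # label: (phi, claimed closed form of log_phi)
    "natural": (lambda v: v,            lambda u: math.log(u)),
    "u/(1+u)": (lambda v: v / (1 + v),  lambda u: u - 1 + math.log(u)),
    "u(1+eu)": (lambda v: v * (1 + eps * v),
                lambda u: math.log((1 + eps) * u / (1 + eps * u))),
}
```

Each numerical integral matches the corresponding closed form to the accuracy of the quadrature.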
2.2 Deformed entropy and deformed divergence functions
The phi-entropy of the probability distribution p is defined by (see [11, 13])
By partial integration one obtains an equivalent expression
For the standard logarithm one has \(\phi (u)=u\). Then the above expression coincides with the well-known entropy of Boltzmann–Gibbs–Shannon
The phi-divergence of two probability functions p and q is defined by
An equivalent expression is
Now let us express these quantities in terms of a strictly convex function f, satisfying \(f^\prime (u) = \log _\phi (u)\). We have:
One can readily recognize that \(D_\phi (p,q)\) is nothing but the Bregman divergence, whereas the function f itself determines the deformed entropy \(S_\phi (p)\). Note that \(p\mapsto S_\phi (p)\) is strictly concave while the map \(p \mapsto D_\phi (p,q)\) is strictly convex.
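As an illustration of this identification (our choice of f, corresponding to \(\phi (u)=u\) so that \(f^\prime = \log \)): the Bregman divergence of f is non-negative, vanishes only on the diagonal, and reduces to the Kullback–Leibler divergence on normalized distributions.

```python
import math

def bregman(p, q, f, fprime):
    # D_f(p, q) = sum_x [ f(p) - f(q) - (p - q) f'(q) ]
    return sum(f(a) - f(b) - (a - b) * fprime(b) for a, b in zip(p, q))

# phi(u) = u gives f'(u) = log_phi(u) = log(u), hence f(u) = u*log(u) - u
f = lambda u: u * math.log(u) - u

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
```

The terms \(-p+q\) cancel under normalization, leaving exactly the Kullback–Leibler divergence.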
2.3 U-embedding, U entropy, and U cross-entropy
Eguchi [6] introduces the U-divergence, which is essentially the Bregman divergence under a strictly convex function U coupled with an embedding using \(\psi _{{\mathrm{U}}}\equiv (U^{\prime })^{-1}\). The U cross-entropy \(C_U(p,q)\) is defined as:
whereas the U entropy \(H_U\) is defined as \(H_U(p) = C_U(p,p)\). The U-divergence is
Note that the U-embedding only has one arbitrarily chosen function, as does the phi-embedding. In fact, it was noted in [14] that the U-divergence and the phi-divergence of the previous section map onto each other when the derivative \(U'\) of U is considered as a deformed exponential function.
2.4 Conjugate rho-tau embedding
In contrast with the “single function” embedding of the phi-model and the U-model, Zhang’s [20] rho–tau framework uses two arbitrarily and independently chosen monotone functions (see also [23]). He starts with the observation that a pair of mutually inverse functions occurs naturally in the context of convex duality. Indeed, if f is strictly convex and \(f^*\) is its convex dual then the derivatives \(f^{\prime }\) and \((f^*)^\prime \) are inverse functions of each other:
Here the definition of the convex dual \(f^*\) of f is:
For u in the range of \(f'\) it is given by
Take the derivative of this expression to find \((f^*)^\prime \circ f^\prime (u) = u\). By convex duality it then follows that also \(f^{\prime }\circ (f^*)^\prime (u) = u\). Take an additional derivative to obtain
$$\begin{aligned} (f^*)^{\prime \prime }\left( f^\prime (u)\right) \, f^{\prime \prime }(u) = 1. \end{aligned}$$
This identity will be used further on.
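These conjugacy relations are easy to check on a concrete pair; a small numerical sketch with an illustrative choice of f (ours, not tied to the paper):

```python
import math

# conjugate pair: f(u) = u*log(u) - u, with convex dual f*(v) = exp(v)
fprime = math.log            # f'
fstar_prime = math.exp       # (f*)'
f2 = lambda u: 1.0 / u       # f''(u)
fstar2 = math.exp            # (f*)''(v)
```

The assertions below check \((f^*)^\prime \circ f^\prime = \mathrm{id}\), \(f^\prime \circ (f^*)^\prime = \mathrm{id}\), and the second-derivative relation \((f^*)^{\prime \prime }(f^\prime (u))\, f^{\prime \prime }(u) = 1\).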
Consider now a pair \((\rho (\cdot ),\tau (\cdot ))\) of strictly increasing functions. Then there exists a strictly convex function \(f(\cdot )\) satisfying \(f'(u)=\tau \circ \rho ^{-1}(u)\). This is because the family of strictly increasing functions forms a group, with function composition as the group operation, an observation made in [20, 23]. In terms of the conjugate function \(f^{*}\), the relation is \((f^{*})^{\prime }(u) = \rho \circ \tau ^{-1}(u)\). The derivatives of f(u) and of its conjugate \(f^*(u)\) have the property that
Among the triple \((f,\rho , \tau )\), given any two functions, the third is specified. When we arbitrarily choose two strictly increasing functions \(\rho \) and \(\tau \) as embedding functions, then they are automatically linked by a pair of conjugated convex functions \(f, f^*\). On the other hand, we may also independently choose to specify \((\rho , f), (\rho , f^*), (\tau , f),\) or \((\tau , f^*)\), with the others being fixed. Therefore, rho-tau embedding is a mechanism with two independently chosen functions. This differs from both the phi-embedding and the U-embedding. The following identities will be useful:
The \((\rho , \tau )\)-embedding mechanism can have another equivalent representation. Denote \(f \circ \rho = F, f^* \circ \tau = G\). We seek to use F, G as independently chosen functions from which \(\rho \) and \(\tau \) are derived. From
and
we obtain
or
Thus, we obtain that
and similarly
So this gives \(\rho , \tau \) in terms of F, G.
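The steps above can be summarized as follows (a reconstruction under the stated definitions \(F = f \circ \rho \), \(G = f^* \circ \tau \), using the Fenchel equality to fix the integration constant):
$$\begin{aligned} F^\prime&= f^\prime (\rho )\, \rho ^\prime = \tau \rho ^\prime , \qquad G^\prime = (f^*)^\prime (\tau )\, \tau ^\prime = \rho \tau ^\prime ,\\ F + G&= f(\rho ) + f^*(\tau ) = \rho \tau ,\\ \frac{\rho ^\prime }{\rho }&= \frac{\tau \rho ^\prime }{\rho \tau } = \frac{F^\prime }{F+G} \;\Longrightarrow \; \rho = \exp \int \frac{F^\prime }{F+G}\, \mathrm{d}u, \qquad \tau = \exp \int \frac{G^\prime }{F+G}\, \mathrm{d}u. \end{aligned}$$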
2.5 Divergence of the rho-tau embedding
Zhang [20] introduces the rho-tau divergence (see Proposition 6 of [20])
$$\begin{aligned} D_{\rho ,\tau }(p, q) = \int _{{\mathcal {X}}} \left[ f(\rho (p)) + f^*(\tau (q)) - \rho (p)\, \tau (q) \right] \mathrm{d}x, \end{aligned}$$(23)
where f is a strictly convex function satisfying \(f'(\rho (u))=\tau (u)\).
Proposition 1
Expression (23) can be written as
$$\begin{aligned} D_{\rho ,\tau }(p, q) = \int _{{\mathcal {X}}} \left[ f(\rho (p)) - f(\rho (q)) - \tau (q)\left( \rho (p) - \rho (q)\right) \right] \mathrm{d}x. \end{aligned}$$(24)
In particular this implies that \(D_{\rho ,\tau }(p, q)\ge 0\), with equality if and only if \(p=q\), reflecting the following identity:
$$\begin{aligned} f(u) + f^*\left( f^\prime (u)\right) = u\, f^\prime (u). \end{aligned}$$(25)
The “reference-representation biduality” [20, 22, 23] reveals itself as
$$\begin{aligned} D_{\rho ,\tau }(p, q) = D_{\tau ,\rho }(q, p). \end{aligned}$$
It can be easily verified that the rho-tau divergence satisfies the following generalized Pythagorean equality for any three probability functions p, q, r:
$$\begin{aligned} D_{\rho ,\tau }(p, q) + D_{\rho ,\tau }(q, r) - D_{\rho ,\tau }(p, r) = \int _{{\mathcal {X}}} \left[ \rho (p) - \rho (q)\right] \left[ \tau (r) - \tau (q)\right] \mathrm{d}x. \end{aligned}$$
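For the record, here is the computation, assuming the Fenchel–Young form \(D_{\rho ,\tau }(p,q)=\int [f(\rho (p))+f^*(\tau (q))-\rho (p)\tau (q)]\,\mathrm{d}x\); the f- and \(f^*\)-terms in the arguments p and r cancel pairwise, and the Fenchel equality \(f(\rho (q))+f^*(\tau (q))=\rho (q)\tau (q)\) handles the middle argument q:
$$\begin{aligned}&D_{\rho ,\tau }(p,q) + D_{\rho ,\tau }(q,r) - D_{\rho ,\tau }(p,r) \\&\quad = \int _{{\mathcal {X}}} \left[ f(\rho (q)) + f^*(\tau (q)) - \rho (p)\tau (q) - \rho (q)\tau (r) + \rho (p)\tau (r) \right] \mathrm{d}x \\&\quad = \int _{{\mathcal {X}}} \left[ \rho (q)\tau (q) - \rho (p)\tau (q) - \rho (q)\tau (r) + \rho (p)\tau (r) \right] \mathrm{d}x \\&\quad = \int _{{\mathcal {X}}} \left[ \rho (p)-\rho (q)\right] \left[ \tau (r)-\tau (q)\right] \mathrm{d}x . \end{aligned}$$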
2.6 Entropy and cross-entropy of the rho–tau embedding
It is now natural to give the following definition of the rho-tau entropy
$$\begin{aligned} S_{\rho ,\tau }(p) = - \int _{{\mathcal {X}}} f(\rho (p))\, \mathrm{d}x, \end{aligned}$$(26)
where f(u) is a strictly convex function satisfying \(f'(u)=\tau \circ \rho ^{-1}(u)\). This can be written as
Note that the rho–tau entropy \(S_{\rho ,\tau }(p)\) is concave in \(\rho (p)\), but not necessarily in p. This has consequences further on. We likewise define the rho–tau cross-entropy
$$\begin{aligned} C_{\rho ,\tau }(p, q) = - \int _{{\mathcal {X}}} \rho (p)\, \tau (q)\, \mathrm{d}x. \end{aligned}$$(28)
It satisfies \(C_{\rho ,\tau }(p,q) = C_{\tau ,\rho }(q,p)\).
The rho–tau divergence can then be given by
Note that, unlike the standard case, in general \(S_{\rho ,\tau }(q) \ne C_{\rho ,\tau }(q,q)\). This is because
So unless \(f(u) = cu\) for constant c, \(f^*\) would not vanish. In fact, denote
Then \(S^*_{\rho ,\tau }(p) = S_{\tau ,\rho }(p)\), and
which is, after integrating \(\int _{{\mathcal {X}}} \mathrm{d}x\), a re-write of (25). Therefore,
Because rho-tau cross-entropy does not degenerate to rho-tau entropy in general:
we can also define the modified cross-entropy:
$$\begin{aligned} \overline{C}_{\rho ,\tau }(p, q) = C_{\rho ,\tau }(p, q) - S^{*}_{\rho ,\tau }(q). \end{aligned}$$(33)
The main properties of this modified version of cross-entropy \(\overline{C}_{\rho ,\tau }(p, q)\) are
1. \(\overline{C}_{\rho ,\tau }(p, p)=S_{\rho , \tau }(p)\). Indeed, (32) implies that
$$\begin{aligned} \overline{C}_{\rho ,\tau }(p, p)= & {} C_{\rho , \tau }(p,p) - S^*_{\rho , \tau }(p)\\= & {} S_{\rho , \tau }(p)-D_{\rho ,\tau }(p,p)\\= & {} S_{\rho , \tau }(p). \end{aligned}$$
2. From (32), the previous result, and the definition (33) it follows that
$$\begin{aligned} D_{\rho ,\tau }(p,q)=\overline{C}_{\rho ,\tau }(p, q)-\overline{C}_{\rho ,\tau }(p, p). \end{aligned}$$(34)
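Both properties can be sketched numerically in the logarithmic gauge \(\rho = \mathrm{id}\), \(\tau = \log \) (so \(f(u)=u\log u - u\), \(f^*(v)=e^v\)), using the entropy and cross-entropy forms these choices suggest; all names below are ours:

```python
import math

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

S     = lambda p:    -sum(a * math.log(a) - a for a in p)         # -int f(rho(p))
Sstar = lambda p:    -sum(a for a in p)                           # -int f*(tau(p))
C     = lambda p, q: -sum(a * math.log(b) for a, b in zip(p, q))  # -int rho(p) tau(q)
D     = lambda p, q:  C(p, q) - S(p) - Sstar(q)                   # rho-tau divergence
Cbar  = lambda p, q:  C(p, q) - Sstar(q)                          # modified cross-entropy
```

With these choices the divergence reduces to the Kullback–Leibler divergence on normalized distributions, as expected.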
2.7 Rho–tau divergence from convex \(D^{(\alpha )}_{f, \rho }(p,q)\)-divergence
Refs. [20,21,22,23] studied the following general divergence function \(D^{(\alpha )}_{f, \rho }(p,q)\) from the perspective of convex analysis (with \(\alpha \in {\mathbb {R}}\))
$$\begin{aligned} D^{(\alpha )}_{f, \rho }(p,q) = \frac{4}{1-\alpha ^2} \int _{{\mathcal {X}}} \left[ \frac{1-\alpha }{2}\, f(\rho (p)) + \frac{1+\alpha }{2}\, f(\rho (q)) - f\!\left( \frac{1-\alpha }{2}\, \rho (p) + \frac{1+\alpha }{2}\, \rho (q) \right) \right] \mathrm{d}x. \end{aligned}$$
Clearly, the role of \(\alpha \) is to effect an exchange of the positions of p and q:
$$\begin{aligned} D^{(-\alpha )}_{f, \rho }(p,q) = D^{(\alpha )}_{f, \rho }(q,p). \end{aligned}$$
The rho–tau divergence \(D_{\rho ,\tau }(p,q)\) arises as a special form of the above convex \(D^{(\alpha )}_{f, \rho }\)-divergence function:
$$\begin{aligned} D_{\rho ,\tau }(p,q) = \lim _{\alpha \rightarrow 1} D^{(\alpha )}_{f, \rho }(p,q), \end{aligned}$$
with \(f^\prime \circ \rho = \tau \) (and equivalent \((f^*)^\prime \circ \tau = \rho \), with \(f^*\) denoting convex conjugate of f).
Though in \(D^{(\alpha )}_{f, \rho }(p,q)\) the two free functions are f (a strictly convex function) and \(\rho \) (a strictly monotone increasing function), as reflected in its subscripts, there is only a notational difference from the \((\rho , \tau )\) specification of the two free functions. This is because for \(f, f^*, \rho , \tau \), a choice of any two functions (one of which would have to be either \(\rho \) or \(\tau \)) specifies the remaining two; see footnote 1 and Sect. 2.4.
3 Fixing the gauge
Zhang’s conjugate rho–tau embedding involves two freely chosen functions. However, the induced Riemannian metric, called the rho-tau metric tensor (to be introduced in the next section), depends on a single function \(\psi \) which is the following combination of the functions \(\rho \) and \(\tau \)
$$\begin{aligned} \frac{1}{\psi (u)} = \rho ^\prime (u)\, \tau ^\prime (u). \end{aligned}$$(36)
Many choices of \(\rho \) and \(\tau \) give rise to the same function \(\psi \). We call this the “gauge freedom”. In the physics literature gauge theories cope with redundant degrees of freedom, either by fixing the gauge or by introduction of equivalence classes. In the present theory, the symmetry between the functions \(\rho \) and \(\tau \) implies that their exchange leads to equivalent theories. The simplest way to deal with gauge freedom is by breaking the symmetry, in the present context by assigning a different role to \(\rho \), respectively \(\tau \). For instance, \(\rho \) can be used for embedding the probability distribution while \(\tau \) is then used to fix the entropy or score variables of the corresponding statistical theory. Two specific types of gauges are now considered in more detail: Type I gauge where either \(\rho \) or \(\tau \) is identity, and Type II gauge where \(\rho \) and \(\tau \) are linked through the deformed logarithm/exponential transformation.
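The gauge freedom can be illustrated concretely: the two (rho, tau) pairs below are different, yet share the same weighting function \(\psi (u) = 1/(\rho ^\prime (u)\tau ^\prime (u)) = u\) and hence the same metric (an illustrative sketch; the derivatives are hard-coded by hand):

```python
# each pair is (rho', tau'); both yield psi(u) = 1/(rho'(u) * tau'(u)) = u
pair_a = (lambda u: 1.0,        lambda u: 1.0 / u)    # rho = id,        tau = log
pair_b = (lambda u: u ** -0.5,  lambda u: u ** -0.5)  # rho = tau = 2*sqrt(u)
```

The entropy and cross-entropy of the two pairs differ, but any quantity built from \(\psi \) alone, such as the metric tensor, does not.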
3.1 Rho-id gauge (\(\rho =\text{ id }, \tau =\log _\psi \))
This gauge is characterized by \(\rho =\text{ id }\), the identity function, and \(\tau =\log _{\psi }\). In this case, \(\rho ^\prime = 1\) and \(\tau ^\prime = 1/\psi \), satisfying (36).
Compare expression (27) of the rho–tau entropy with that of the phi-deformed entropy as given by (13). They coincide up to an additive constant when the choices \(\rho =\text{ id }\) and \(\tau (u)=\log _\phi (u)\) are made. This means that the function \(\psi \), defined by (36), can be identified with the function \(\phi \) of the phi-deformation formalism. With these choices one has
In the notations of Eguchi this becomes
where \(\psi _{{\mathrm{U}}}\) is the inverse function of \(U'\), see Sect. 2.3.
Divergence Expression (24) of \(D_{\rho , \tau }(p,q)\) reduces to the phi-divergence \(D_\phi (p,q)\), as given by (14), and to the U-divergence (19). Phi-divergence and U-divergence coincide when \(U^\prime = \exp _\phi \), as noted in [14].
Entropy As mentioned earlier, in the present gauge the rho–tau entropy coincides with the phi-deformed entropy (12). The two free functions appearing in (26) might suggest that the rho–tau entropy is more general than the phi-deformed entropy. However, it is only their composition which matters, so any rho–tau entropy is also a phi-entropy for a well-chosen function \(\phi \).
The situation with the U-embedding is the same, because U entropy is identical with phi-entropy:
Cross-entropy In the present gauge the rho–tau cross-entropy (28) becomes
The cross-entropy \(C_U(p,q)\) introduced by Eguchi (see (18)) contains an additional term
Since in the present gauge \(f^* = U\) with \((f^*)^\prime = U'\), this additional term is nothing but the negative of dual entropy \(S_{\rho , \tau }^*\)
Therefore,
which satisfies \(C_U(p,p) = H_U(p)\).
3.2 Tau-id gauge (\(\tau =\text{ id }, \rho =\log _{\psi }\))
This gauge is characterized by \(\tau =\text{ id }\) and \(\rho =\log _{\psi }\); one checks directly that it satisfies (36). It will be needed a number of times in what follows. Because of the rho–tau duality much of the previous section can be repeated with obvious modifications.
The rho-id and tau-id gauges are collectively called Type I gauges.
3.3 Constant entropy gauge (\(\rho = \log _\tau , \log \tau = \log _\psi \))
This gauge is characterized by \(f \circ \rho = \text{ id }\), or \(f = \rho ^{-1}\).
From (26) it then follows that the rho–tau entropy \(S_{\rho ,\tau }(p)\) is a constant independent of the probability distribution p:
$$\begin{aligned} S_{\rho ,\tau }(p) = -\int _{{\mathcal {X}}} f(\rho (p))\, \mathrm{d}x = -\int _{{\mathcal {X}}} p\, \mathrm{d}x = -1. \end{aligned}$$
In this case \(\rho '\tau =1\), so \(\rho ^\prime = 1/ \tau = (\log _\tau )^\prime \), and hence
From \(\rho ^\prime \tau ^\prime = 1/\psi \), we obtain \((\log \tau )^\prime = 1/\psi = (\log _\psi )^\prime \). Therefore
Cross-entropy Integration of \(\rho '=(\log _\tau )'\) gives \(\rho =\log _\tau +\, \text{ constant }\). This constant may be omitted. Using \(\rho =\log _\tau \) one can write
Divergence and dual entropy The rho–tau divergence (23) takes on the simplified form
which is reminiscent of (34).
3.4 Constant-\(S^*\) gauge (\(\tau = \log _\rho , \log \rho = \log _\psi \))
Dual to the constant-S gauge, this gauge is characterized by \(f^* \circ \tau = \text{ id }\), or \(f^* = \tau ^{-1}\). This implies \(\rho \tau '=1\), so \(\tau ^\prime = 1/\rho \). Therefore, this gauge is the same as taking \(\tau = \log _\rho \).
Because of the rho–tau duality the conclusions of the previous gauge can be adapted. In particular, \((\log \rho )'=1/\psi \).
Entropy and cross-entropy In the present gauge the rho-tau cross-entropy (28) becomes
Because in this gauge \(S^*(p)\) is a constant independent of p, the same expression holds for the modified cross-entropy \(\overline{C}_{\rho ,\tau }(p,q)\). Therefore the rho–tau entropy can be written as
Divergence In this case,
Because \(S^* = \text{ constant }\), this also gives
in agreement with (34). Now write (39) as
Expressions (38) and (40) look very similar to the standard expressions for the Boltzmann–Gibbs–Shannon entropy and the Kullback–Leibler divergence, respectively.
The constant-S or constant-\(S^*\) gauge is called a Type II gauge.
4 Riemannian geometry under rho–tau embedding
We now investigate the Riemannian geometry related to the rho-tau embedding, expecting a full generalization of Amari’s \(\alpha \)-geometry as reviewed in Sect. 1. Throughout this section we consider a parametrized statistical model \(\theta \mapsto p^\theta \).
4.1 The metric tensor
The rho–tau divergence \(D_{\rho ,\tau }(p,q)\) can be used (see [20, 22, 23]) to define a metric tensor \(g(\theta )\) by
$$\begin{aligned} g_{ij}(\theta ) = - \left. \partial _i\, \partial ^\prime _{j}\, D_{\rho ,\tau }\left( p^\theta , p^{\theta ^\prime }\right) \right| _{\theta ^\prime = \theta }, \end{aligned}$$(41)
with \(\partial _i = \partial /\partial \theta ^i\) and \(\partial '_{j} = \partial /\partial \theta ^{\prime j}\). One has
$$\begin{aligned} g_{ij}(\theta ) = \int _{{\mathcal {X}}} \left[ \partial _i \rho (p^\theta )\right] \left[ \partial _j \tau (p^\theta )\right] \mathrm{d}x \end{aligned}$$(42)
and also
A short calculation gives
Because \(\tau =f'\circ \rho \), the rho–tau metric \(g(\theta )\) also takes the form
$$\begin{aligned} g_{ij}(\theta ) = \int _{{\mathcal {X}}} f^{\prime \prime }(\rho (p^\theta )) \left[ \partial _i \rho (p^\theta )\right] \left[ \partial _j \rho (p^\theta )\right] \mathrm{d}x. \end{aligned}$$
This shows that the matrix \(g_{ij}(\theta )\) is symmetric. Moreover, it is positive-definite, because the derivatives \(\rho '\) and \(f''\) are strictly positive and the matrix with entries \( \left[ \partial _j \rho (p^\theta (x))\right] \, \left[ \partial _i \rho (p^\theta (x))\right] \), when pre- and post-multiplied with any vector, gives rise to a non-negative real number. Finally, \(g(\theta )\) is covariant, so g is indeed a metric tensor on the Riemannian manifold formed by the model \(p^\theta \). From (42) it follows that it is invariant under the exchange of \(\rho \) and \(\tau \).
4.2 Freedom of choice of the rho–tau metric
Remarkably, (22) can be used to write (43) as
Hence, the gauge freedom in choosing the metric \(g_{ij}\) under the rho embedding, obtained by choosing an arbitrary function f, also exists under the tau embedding.
Write the rho–tau metric \(g_{ij}\) as
$$\begin{aligned} g_{ij}(\theta ) = \int _{{\mathcal {X}}} \frac{1}{\psi (p^\theta )} \left[ \partial _i p^\theta \right] \left[ \partial _j p^\theta \right] \mathrm{d}x, \end{aligned}$$
where \(\psi =1/(\rho '\tau ')\) is the function as in (36). So despite the two independent choices of embedding functions \(\rho \) and \(\tau \), the metric tensor \(g_{ij}\) is determined by the one function \(\psi \) only.
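As a sanity check of this form of the metric, the following sketch evaluates it for a Bernoulli model with \(\psi (u)=u\) (the logarithmic gauge); the parametrization and helper names are ours. The result is the classical Fisher information \(1/(\theta (1-\theta ))\).

```python
def rho_tau_metric(theta, inv_psi):
    # g(theta) = sum_x (1/psi(p)) (dp/dtheta)^2 for the Bernoulli model
    # p(1|theta) = theta, p(0|theta) = 1 - theta (one-parameter sketch)
    states = [(theta, 1.0), (1.0 - theta, -1.0)]   # (p(x), dp/dtheta)
    return sum(dp * dp * inv_psi(p) for p, dp in states)
```

Here `inv_psi` is \(1/\psi \); passing `lambda u: 1.0 / u` selects \(\psi (u)=u\).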
There is another way of looking at the functional freedom in the \(g_{ij}\) metric tensor. Taking a look at (43) reveals that we can choose to specify the function f given any embedding function \(\rho \). So specifying \(\psi \) or specifying f achieves the same purpose.
Although the metric tensor \(g_{ij}\) is invariant under changes of rho and tau which leave \(\psi \) unchanged, other quantities such as the entropy, cross-entropy and divergence function are not. This gauge freedom, which is left once the function \(\psi \) and hence the metric tensor is fixed, explains why specific choices of \(\rho \) and \(\tau \) simplify the relation between rho–tau expressions and expressions found in the literature.
4.3 Tangent vectors
Let us now introduce the plane tangent to the rho embedding of the statistical model \(p^\theta \). A similar construction can be done for the tau embedding.
From the form of rho–tau metric
we introduce a bilinear form \(\langle \cdot ,\cdot \rangle \) defined on pairs of random variables u(x), v(x)
Introduce the notation \(X^\theta (x)=\tau (p^\theta (x))\), so
For any random variable u, it holds that
Because of this relation one says that, by definition, \(\partial _j X^\theta \) is tangent to the rho representation \(\rho (p^\theta )\) of the model \(p^\theta \).
Next decompose \(X^\theta \) into a component \(Y^\theta \) in the tangent plane
plus a component \(X^\theta -Y^\theta \) orthogonal to the tangent plane, i.e., satisfying
A short calculation gives
where \(g^{ij}(\theta )\) is the matrix inverse of \(g_{ij}(\theta )\). On the other hand, from
we have
Hence, the orthogonal projection of \(X(\theta )\) onto the tangent plane equals
A special case, of interest later on, occurs when \(y^i(\theta )=\theta ^i\) so that
and
Written out explicitly in terms of \(\rho \) and \(\tau \), condition (47) is
Its importance lies in the possibility of using the entropy \(S_{\rho ,\tau }\) as a potential function generating the coordinates \(\theta _i=\sum _j g_{ij}(\theta )\theta ^j\).
We point out that the above analysis yields identical conclusions if we adopt \(X^\theta (x)=\rho (p^\theta (x))\) and
4.4 Difference between rho–tau metric and entropic metric
Starting from the rho–tau entropy \(S_{\rho , \tau }\) of the parametric family \(p^\theta \)
we take the second derivative to obtain
Likewise, define
using the dual entropy function \(S_{\rho ,\tau }^*(p^\theta )\). Both \(h_{ij}\) and its dual \(h^{*}_{ij}\) are symmetric in i, j. When positive-definite, \(h(\theta )\) can also serve as a metric tensor, as is sometimes done in the physics literature. We may call it the “entropic metric”.
Recall that the rho–tau metric (44) is induced by the rho–tau divergence (14) by differentiating twice, see (41). Though the entropic metric \(h(\theta )\) (induced from the rho–tau entropy) differs in general from the rho–tau metric \(g(\theta )\) (induced from the rho–tau divergence or, equivalently, the rho–tau cross-entropy), the first-order derivatives of \(C_{\rho ,\tau }\), when evaluated at \(p=p^\theta \), are equal to those of \(S_{\rho ,\tau }(p^\theta )\) and of \(S^{*}_{\rho ,\tau }(p^\theta )\)
They reflect, respectively, the vanishing of \(\partial _{i} D_{\rho , \tau }(p^\theta ,p^{\theta '})\) and of \(\partial ^\prime _{i} D_{\rho , \tau }(p^\theta ,p^{\theta '})\) at \(p^\theta =p^{\theta '}\).
Making use of (42), one obtains, respectively
where \(A_{ij}(\theta )\) and \(B_{ij}(\theta )\) are functions symmetric in i, j, given by
When they are non-zero, they measure the difference between the rho–tau metric \(g(\theta )\), induced from the cross-entropy C, and \(h(\theta )\) or \(h^{*}(\theta )\), induced from the entropy S or the dual entropy \(S^*\). From (53) or (54) it can be seen that if either \(A_{ij}\) or \(B_{ij}\) can be written as the Hessian of a function, then so can \(g_{ij}\): the rho–tau metric becomes Hessian.
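A sketch consistent with (53) and (54), under the sign convention \(h_{ij}=-\partial _i\partial _jS_{\rho ,\tau }(p^\theta )\) (our convention here): differentiating \(\partial _iS_{\rho ,\tau }(p^\theta ) = -\int _{{\mathcal {X}}}\mathrm{d}x\,\tau (p^\theta )\partial _i\rho (p^\theta )\) once more gives
$$\begin{aligned} g_{ij}(\theta )= h_{ij}(\theta ) + A_{ij}(\theta ), \qquad A_{ij}(\theta ) = -\int _{{\mathcal {X}}}\mathrm{d}x\,\tau (p^\theta (x))\partial _i\partial _j\rho (p^\theta (x)), \end{aligned}$$
and dually
$$\begin{aligned} g_{ij}(\theta )= h^{*}_{ij}(\theta ) + B_{ij}(\theta ), \qquad B_{ij}(\theta ) = -\int _{{\mathcal {X}}}\mathrm{d}x\,\rho (p^\theta (x))\partial _i\partial _j\tau (p^\theta (x)), \end{aligned}$$
so that \(A_{ij}\) being a Hessian is precisely condition (55), and \(B_{ij}\) being a Hessian is precisely condition (56).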
4.5 Hessian geometry
We now consider the conditions under which the rho-tau metric g becomes Hessian.
Theorem 1
(Conditions for the rho–tau metric to be Hessian) Let a \(C^\infty \)-manifold of probability distributions \(p^\theta \) be given. For fixed strictly increasing functions \(\rho \) and \(\tau \), let the metric tensor \(g(\theta )\) be given by (42). Then the following statements are equivalent:
-
1.
g is Hessian, i.e., there exists \(\varPhi (\theta )\) such that
$$\begin{aligned} g_{ij}(\theta )= & {} \partial _i\partial _j\varPhi (\theta ). \end{aligned}$$ -
2.
There exists a function \(U(\theta )\) such that
$$\begin{aligned} \partial _i\partial _j U(\theta ) = - \int _{{\mathcal {X}}}\mathrm{d}x\,\tau (p^\theta (x))\partial _i\partial _j\rho (p^\theta (x)) . \end{aligned}$$(55) -
3.
There exists a function \(V(\theta )\) such that
$$\begin{aligned} \partial _i\partial _j V(\theta ) = - \int _{{\mathcal {X}}}\mathrm{d}x\,\rho (p^\theta (x))\partial _i\partial _j\tau (p^\theta (x)) . \end{aligned}$$(56)
Proof
(i) \(\longleftrightarrow \) (ii) By the identity (53), if there exists \(\varPhi (\theta )\) representing \(g_{ij}\) as its second derivatives, then we can choose the function U as \(U = \varPhi + S\); so (i) implies (ii). Conversely, when the integral term can be represented as the second derivative of a function \(U(\theta )\), we can choose \(\varPhi = U - S\), which satisfies (53); this yields (i) from (ii).
(i) \(\longleftrightarrow \) (iii) The proof is similar to the previous paragraph, except that we invoke (54). \(\square \)
The case when g is Hessian is very special, because of the existence of various bi-orthogonal coordinates. From
there are three “potential functions”: \(\varPhi \) which generates g, S which generates h, and U which measures the discrepancy between g and h. Because of the \(\rho \longleftrightarrow \tau \) duality there are two additional potentials \(S^*\) and V. Each of these potential functions can define conjugate coordinates with respect to \(\theta \). In particular, one defines
They are linked via
We call \(\eta _i\) the dual coordinates of the \(\theta ^i\). The meaning of \(\xi _i\) and \(\zeta _i\) will be explained in Sect. 5.1.
This multitude of potentials is well-known in thermodynamics, where they are interpreted in the context of the theory of ensembles. See for instance [4].
4.6 Rho–tau connections and dually flat geometry
In Hessian geometry there exists a pair of dually flat connections. In the case of the conjugate rho–tau embedding of a parametric model \(p^\theta \), Zhang introduced the following connections [20]
where \(\varGamma ^{(\alpha )}_{ij,k}\equiv (\varGamma ^{(\alpha )})_{ij}^lg_{lk}\). One readily verifies
This shows that, by definition, \(\varGamma ^{(-\alpha )}\) is the dual connection of \(\varGamma ^{(\alpha )}\). In particular, \(\varGamma ^{(0)}\) is self-dual and therefore coincides with the Levi-Civita connection. The family of \(\alpha \)-connections (58) is induced by the divergence function \(D^{(\alpha )}_{f, \rho }(p,q)\) given by (35), with corresponding \(\alpha \)-values. Furthermore, upon switching \(\rho \longleftrightarrow \tau \) in the divergence function, the designation of 1-connection versus (-1)-connection also switches.
The coefficients of the connection \(\varGamma ^{(-1)}\) vanish identically if
This condition can be written as
It expresses that the second derivatives \(\partial _i\partial _j X^\theta \) are orthogonal to the tangent plane of the statistical manifold. If satisfied, then the dual of \(\varGamma ^{(-1)}\) satisfies
Likewise, the coefficients of the connection \(\varGamma ^{(1)}\) vanish identically if
In the case of a \(\phi \)-deformed exponential family (see the next section) condition (60) is satisfied in the \(\rho =\text{ id }\) gauge while (63) is satisfied in the \(\tau =\text{ id }\) gauge.
Proposition 2
With respect to conditions (60) and (63),
-
1.
When (60) holds, the coordinates \(\theta ^i\) are affine coordinates for \(\varGamma ^{(-1)}\); the dual coordinates \(\eta _i\) are affine coordinates for \(\varGamma ^{(1)}\);
-
2.
When (63) holds, the coordinates \(\theta ^i\) are affine coordinates for \(\varGamma ^{(1)}\); the dual coordinates \(\eta _i\) are affine coordinates for \(\varGamma ^{(-1)}\);
-
3.
In either case above, \(g(\theta )\) is Hessian.
Proof
Recall that when \(\varGamma =0\) under a coordinate system \(\theta \), then \(\theta ^i\)’s are affine coordinates—the geodesics are straight lines:
The geodesics of the dual connection \(\varGamma ^*\) satisfy the Euler-Lagrange equations
Its solution is such that the dual coordinates \(\eta \) are affine coordinates:
For Statement 1, apply the above knowledge, taking \(\varGamma = \varGamma ^{(-1)}\) and \(\varGamma ^* = \varGamma ^{(1)}\). For Statement 2, take \(\varGamma = \varGamma ^{(1)}\) and \(\varGamma ^* = \varGamma ^{(-1)}\).
To prove Statement 3 observe that
So the vanishing of either term, i.e., either (60) or (63) holding, makes \(\partial _k g_{ij}(\theta )\) symmetric in j, k or in i, k, respectively. In conjunction with the symmetry of \(g_{ij}\) in i, j, this implies that \(\partial _k g_{ij}(\theta )\) is totally symmetric under exchange of any two of the three indices i, j, k. Hence there exist \(\eta _i\) for which \(g_{ij}(\theta )=\partial _j\eta _i=\partial _i\eta _j\). The symmetry of g now implies that it equals the Hessian of a potential \(\varPhi \). \(\square \)
5 Deformed exponential models
In the previous section we showed how the rho–tau geometry fully generalizes Amari's \(\alpha \)-geometry. The approach considered was largely based on generalizing entropy, cross-entropy and divergence functions, and the geometry induced by those quantities. Here we consider the generalization of the exponential family to the deformed-exponential (phi-exponential) family, and show how it gives rise to Hessian geometry. In this way, the \(\alpha \)-geometry is fully generalized with the conjugate rho–tau embedding in terms of (i) entropy, cross-entropy and divergence functions; (ii) Riemannian metric and affine connections; and (iii) parametric probability families.
5.1 Phi-exponential model
Fix an arbitrary monotone function \(\phi \) along with real random variables \(F_1, F_2, \cdots , F_n\). These functions determine a model \(\theta \rightarrow p^\theta \) belonging to the phi-exponential family by the relation (see [11,12,13])
provided that one can prove that the normalization function \(\alpha (\theta )\) exists. Normalization of \(p^\theta \) leads to
where the so-called escort expectation, denoted \(\tilde{\mathbb {E}}_\theta \),
is with respect to the escort family of probability distributions \(\tilde{p}^\theta \) defined by
Here the integral
is assumed to converge. Using the properties of the deformed exponential function one obtains
For later convenience, we also derive the first derivative of the \(z(\theta )\) function
and the second derivatives of the \(\alpha \) function
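These computations can be sketched as follows (a recap in the present notation, using the property \(\frac{\mathrm{d}}{\mathrm{d}u}\exp _\phi (u)=\phi (\exp _\phi (u))\) of the deformed exponential):
$$\begin{aligned} \partial _i p^\theta&= \phi (p^\theta )\left( F_i - \partial _i\alpha (\theta )\right) , \\ \partial _i z(\theta )&= \int _{{\mathcal {X}}}\mathrm{d}x\,\phi '(p^\theta )\,\partial _i p^\theta = \int _{{\mathcal {X}}}\mathrm{d}x\,\phi '(p^\theta )\phi (p^\theta )\left( F_i - \partial _i\alpha \right) , \\ \partial _i\partial _j\alpha (\theta )&= \frac{1}{z(\theta )}\int _{{\mathcal {X}}}\mathrm{d}x\,\phi '(p^\theta )\phi (p^\theta )\left( F_i - \partial _i\alpha \right) \left( F_j - \partial _j\alpha \right) , \end{aligned}$$
where the last line uses \(\partial _i\alpha = \tilde{\mathbb {E}}_\theta F_i\).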
Proposition 3
Denote
Then, there exists a function \(\varPhi \) such that
Proof
We compute
which is symmetric in i, j. Therefore, there exists a function \(\varPhi \) such that \(\eta _i = \partial _i \varPhi \), and that
\(\square \)
We remark that with respect to any deformed-exponential model \(p^\theta (x)\), we have two sets of coordinates dual with respect to \(\theta \):
-
1.
\(\eta _i = {\mathbb {E}}_\theta F_i\), which is given by \(\eta _i = \partial _i \varPhi \) for some function \(\varPhi (\theta )\);
-
2.
\(\zeta _i = \tilde{\mathbb {E}}_\theta F_i\), which is given by \(\zeta _i = \partial _i \alpha \) for the \(\alpha \) function associated with the deformed-exponential \(\log _\phi \).
In the literature, the \(\eta _i\) are called the expectation coordinates while the \(\zeta _i\) are called the escort (expectation) coordinates.
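As a quick numerical illustration of these two coordinate systems (a sketch: the finite sample space, the random variables \(F_1, F_2\), the value \(q=1/2\) and the point \(\theta \) below are our illustrative choices, not taken from the paper), one can realize the Tsallis case \(\phi (u)=u^q\), \(\exp _\phi (u)=[1+(1-q)u]_+^{1/(1-q)}\) on a finite sample space and verify by finite differences that \(\partial _i\alpha =\tilde{\mathbb {E}}_\theta F_i\) and that \(\partial _j\eta _i\) is symmetric in i, j, as asserted in Proposition 3:

```python
# Sketch: Tsallis case phi(u) = u^q of the phi-exponential model (65) on a
# finite sample space.  Sample space, F_1, F_2, q and theta are illustrative
# choices, not taken from the paper.
import math

q = 0.5
n = 8
F = [[math.sin(x) for x in range(n)],        # F_1
     [math.cos(0.7 * x) for x in range(n)]]  # F_2

def exp_phi(u):
    # deformed exponential: d/du exp_phi(u) = phi(exp_phi(u)) for phi(u) = u^q
    base = 1.0 + (1.0 - q) * u
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def alpha(theta):
    # normalization function: solve sum_x p^theta(x) = 1 for alpha by bisection
    def total(a):
        return sum(exp_phi(theta[0] * F[0][i] + theta[1] * F[1][i] - a)
                   for i in range(n))
    lo, hi = -50.0, 50.0          # total(lo) > 1 > total(hi)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

def probabilities(theta):
    a = alpha(theta)
    return [exp_phi(theta[0] * F[0][i] + theta[1] * F[1][i] - a)
            for i in range(n)]

def escort_expect(p, k):
    # escort expectation of F_k: weights phi(p)/z with z = sum_x phi(p(x))
    w = [pi ** q for pi in p]
    return sum(wi * Fk for wi, Fk in zip(w, F[k])) / sum(w)

def eta(theta, k):
    # expectation coordinate eta_k = E_theta F_k
    return sum(pi * Fk for pi, Fk in zip(probabilities(theta), F[k]))

theta, h = [0.3, -0.2], 1e-5
p0 = probabilities(theta)
# (i) normalization identity: d alpha / d theta^1 = escort expectation of F_1
da1 = (alpha([theta[0] + h, theta[1]]) - alpha([theta[0] - h, theta[1]])) / (2 * h)
# (ii) symmetry of d eta_i / d theta^j (Proposition 3)
d1_eta2 = (eta([theta[0] + h, theta[1]], 1) - eta([theta[0] - h, theta[1]], 1)) / (2 * h)
d2_eta1 = (eta([theta[0], theta[1] + h], 0) - eta([theta[0], theta[1] - h], 0)) / (2 * h)
```

The symmetry check (ii) is exactly what guarantees the existence of a potential \(\varPhi \) with \(\eta _i=\partial _i\varPhi \).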
Simple calculations show that the first and second derivatives of the \(\eta \) coordinates with respect to \(\theta \) can be expressed as the second and third derivatives of \(\varPhi \):
Now, let us consider the rho–tau metric (42) or (44) applied to the \(\phi \)-exponential model (65),
Below, we consider two subcases, \(\psi = \phi \) and \(\psi = \phi /\phi ^\prime \), both resulting in interesting geometries for the \(\phi \)-exponential family. Since \(\psi \) is controlled by the two embedding functions \(\rho \) and \(\tau \), for simplicity we choose \(\tau = \log _\phi \), which leaves only the \(\rho \) function to be specified.
5.2 The case of \(\psi = \phi \)
Upon choosing \(\psi = \phi \), the expression of the rho-tau metric g in (71) takes the form of the right-hand side of (70). Therefore,
Theorem 2
With the choice of weighting function \(\psi =\phi \), the rho–tau metric tensor (71) of the phi-exponential family obeying (65) is Hessian.
In this case, the rho–tau metric coincides with the Hessian metric \(g^\varPhi \) defined as second derivative of the potential \(\varPhi \) given by (70)
Meanwhile, the rho–tau metric tensor (71) also takes the form
as originally derived in [11]. This expression implies that the rho–tau metric tensor in this case is conformally equivalent to the metric tensor \(\tilde{g}\) derived from the escort expectation of the random variables \(F_i\):
For later convenience, we refer to \(\tilde{g}\) as given by (73) as the “escort metric”.
Note that when \(\psi =\phi \) and \(\tau = \log _\phi \), then \(\rho = \text{ id }\). That is, we have adopted the rho-id gauge, i.e., the Type I gauge. In this case, \(S_{\rho , \tau }\) reduces to the phi-entropy \(S_\phi \).
Proposition 4
Under the rho-id gauge, the Hessian potential \(\varPhi \) of the \(\phi \)-exponential model
-
1.
is given by
$$\begin{aligned} \varPhi (\theta )= & {} S_\phi (p^\theta )+ \sum _k \theta ^k{\mathbb {E}}_\theta F_k ; \end{aligned}$$(74) -
2.
equals the convex dual \(S^{{\mathrm{cd}}}_\phi (\theta )\) of the \(\phi \)-entropy \(S_\phi \);
where \(S_\phi (p^\theta ) = -\int _{{\mathcal {X}}} f(p^{\theta }) \mathrm{d}x\), \(f^\prime = \log _\phi \).
Proof
To prove Statement 1, note that from the definitions (12) and (65) it follows that
Therefore,
The convex function \(\varPhi \) defined by (74) is hence the potential function generating the Hessian metric \(g_{ij}\).
To prove Statement 2, that is, the potential \(\varPhi \) can be seen as the convex dual \(S^{{\mathrm{cd}}}\) of the phi-entropy \(S_\phi \), recall the definition of convex duality
From (15) it follows that, for any probability distribution p,
with equality if and only if \(p=p^\theta \). This implies
with equality if and only if \(p=p^\theta \). One concludes that
From Statement 1 then follows \(\varPhi =S^{{\mathrm{cd}}}\). \(\square \)
It is important to note that the duality between S and \(S^*\) is not convex duality, \(S^{{\mathrm{cd}}}\ne S^*\), but rather a duality arising from \(\rho \longleftrightarrow \tau \).
Under the rho-id gauge, we have \(S_{\rho , \tau } = S_\phi \) so that
In this case (i.e., rho-id gauge for phi-exponential family)
so that
That selecting the rho-id gauge renders the rho–tau metric of the \(\phi \)-exponential family Hessian can also be seen via (see Theorem 1)
So we can take \(V(\theta ) = \alpha (\theta )\). The convex potential \(\varPhi \) admits an equivalent expression
where \((f^*)^\prime = \exp _\phi \). In this case,
with
The cross-entropies for the deformed-exponential model (under the rho-id gauge) are:
Finally, the Pythagorean Theorem 3.8 of [3] can be easily generalized to the \(\phi \)-exponential models. Let \(t\in {\mathbb {R}}\mapsto p_t\) be a differentiable map, defined on a neighborhood of \(t=0\), taking values in the manifold of the \(p^\theta \). A random variable P is said to be tangent to \(p_t\) at \(t=0\) in the rho-embedding if
for any random variable u, with the inner product \(\langle \cdot ,\cdot \rangle _\theta \) defined by (45) with \(p^\theta =p_{t=0}\).
Theorem 3
Let \(p^\theta \) obey (65). Let \(t\in {\mathbb {R}}\mapsto p_t\) and \(s\in {\mathbb {R}}\mapsto r_s\) be two differentiable maps with values in the manifold of the \(p^\theta \). Let P and R be the corresponding tangent vectors at \(s=t=0\) and assume they are orthogonal in the sense that \(\langle P,R\rangle _\theta =0\). Assume \(t\in {\mathbb {R}}\mapsto p_t\) is a geodesic for \(\varGamma ^{(-1)}\) and \(s\in {\mathbb {R}}\mapsto r_s\) is a geodesic for \(\varGamma ^{(1)}\). Assume the two geodesics intersect at \(s=t=0\) in a common point \(p_0=r_0\equiv q\). If \(\psi =\phi \) then the following Pythagorean relation holds
Proof
Let the \(\theta \)-coordinates of \(p_t\) be denoted \(\theta _t = (1-t)\theta _{0}+t\theta _{1}\) and the \(\eta \)-coordinates of \(r_s\) be denoted \(\eta _s = (1-s)\eta _{0}+s\eta _{1}\). A short calculation gives
Orthogonality of P and R yields
This is used in the following calculation. From (75) follows
As shown above the summation term of the r.h.s. of this expression vanishes. Hence the desired result follows. \(\square \)
5.3 The case of \(\psi = \phi /\phi ^\prime \)
The phi-deformed exponential family, considered in the previous section, generalizes the q-exponential model historically introduced by Tsallis [17]. The second [5] and third [19] versions of the Tsallis formalism can be characterized by the observation that the roles of the expectations \({\mathbb {E}}_\theta \) and the escort expectations \(\tilde{\mathbb {E}}_\theta \) are exchanged.
For convenience the model discussed below is called the Tsallis model. Consider a phi-deformed exponential family \(p^\theta \) defined by (65). Assume now that the \(\alpha \) function is strictly convex. Then its Hessian can be used to define a metric \(g^\alpha (\theta )\)
For convenience, let us call this the Tsallis metric. Below, we show that this metric is conformally equivalent to the rho-tau metric (44), the latter being non-Hessian upon choosing \(\psi = \phi / \phi ^\prime \).
Theorem 4
For a phi-deformed exponential family \(p^\theta \) of the form (65), when the weighting function \(\psi \) of the rho–tau metric in the form of (71) satisfies
then the rho–tau metric g, while itself non-Hessian, is conformally equivalent to a Hessian metric \(g^\alpha \) (Tsallis metric).
Proof
From the expression of \(\partial _i \partial _j \alpha \) given by (68), we see that if we set
in the rho–tau metric as given by (71), then we have
This says that the rho–tau metric g in this case is conformally related to the Tsallis metric \(g^\alpha \), which is the Hessian of \(\alpha (\theta )\) function. \(\square \)
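The conformal relation \(g = z(\theta )\,g^\alpha \) of Theorem 4 can be checked numerically in the Tsallis case \(\phi (u)=u^q\) (again a sketch; sample space, \(F_1, F_2\), q and \(\theta \) are our illustrative choices). Here g is computed directly from the rho–tau metric with \(\psi =\phi /\phi '\), i.e., with integrand weight \(\phi (p)\phi '(p)\), and the Hessian of \(\alpha \) by central finite differences:

```python
# Sketch: verify g = z(theta) * Hessian(alpha) (Theorem 4) for the Tsallis
# case phi(u) = u^q with psi = phi/phi'.  Sample space, F_1, F_2, q, theta
# are illustrative choices, not taken from the paper.
import math

q = 0.5
n = 8
F = [[math.sin(x) for x in range(n)],
     [math.cos(0.7 * x) for x in range(n)]]

def exp_phi(u):
    # deformed exponential for phi(u) = u^q
    base = 1.0 + (1.0 - q) * u
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def alpha(theta):
    # normalization function, solved by bisection (total is decreasing in a)
    def total(a):
        return sum(exp_phi(theta[0] * F[0][i] + theta[1] * F[1][i] - a)
                   for i in range(n))
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

theta = [0.3, -0.2]
a = alpha(theta)
p = [exp_phi(theta[0] * F[0][i] + theta[1] * F[1][i] - a) for i in range(n)]
w = [pi ** q for pi in p]                 # phi(p)
z = sum(w)
# d_k alpha equals the escort expectation of F_k
da = [sum(wi * Fk for wi, Fk in zip(w, F[k])) / z for k in (0, 1)]

def g(i, j):
    # rho-tau metric with psi = phi/phi':
    # integrand phi(p) phi'(p) (F_i - d_i alpha)(F_j - d_j alpha)
    return sum(q * pk ** (2 * q - 1) * (F[i][k] - da[i]) * (F[j][k] - da[j])
               for k, pk in enumerate(p))

def hess_alpha(i, j):
    # central finite-difference Hessian of alpha (valid for i == j as well)
    h = 1e-3
    def shifted(si, sj):
        t = list(theta)
        t[i] += si * h
        t[j] += sj * h
        return alpha(t)
    return (shifted(1, 1) - shifted(1, -1) - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
```

For \(q=1/2\) the weight \(\phi \phi '=q\,p^{2q-1}\) happens to be constant in p, but the check does not rely on this coincidence.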
Note that with this choice of weighting function \(\psi = \phi /\phi ^\prime \) and embedding function \(\tau = \log _\phi \), we have \(\rho = f^{-1}\), so that \(f(\rho (p)) = p\). This corresponds to the constant-S gauge. Therefore, in the Tsallis model we can either choose \(\rho = \phi , \tau =\log _\phi \) (which is what we assumed in the above calculations) or, dually, \(\rho = \log _\phi , \tau = \phi \). Under these Type II gauges, the rho–tau metric \(g_{ij}\) is non-Hessian while the Tsallis metric \(g^\alpha _{ij}(\theta )\), if it exists, is always Hessian by construction.
It is known that the Tsallis metric \(g^\alpha _{ij}\) induces a dually flat structure associated with the escort expectations \(\tilde{\mathbb {E}}_\theta \), see [2, 8]. The dual coordinates with respect to the \(\alpha \) function are \(\zeta _i = \tilde{\mathbb {E}}_\theta F_i\). The dual potential of \(\alpha (\theta )\) is \(\tilde{\mathbb {E}}_\theta \log _\phi (p^\theta )\), as shown below:
This expression is nothing but \(S^*(p^\theta )\), apart from the conformal factor \(z(\theta )\). This shows the duality (modulo a conformal factor) between \(\alpha \) and \(S^*\) under the Type II gauge, just as \(\varPhi \) and \(S_\phi \) are convex duals under the Type I gauge.
5.4 Maximum entropy models
The derivation of the phi-exponential family by means of the maximum entropy method is found in [11]. The treatment here is further generalized so as to cover the approach of [5] as well.
Consider the problem of maximizing the rho–tau entropy \(S_{\rho ,\tau }(p)\) under the constraint that for a finite number of random variables \(F_1, \cdots , F_n\) the functions
have some given values. Here, \(\sigma \) is a given strictly increasing twice differentiable function. We are interested in two specific cases. In the rho-id gauge (\(\rho =\text{ id }\)) the given values are the expectation values of the variables \(F_k\). Then \(\sigma =\text{ id }\). In the case of the Tsallis model the given values are the unnormalized escort expectations of the variables \(F_k\). This requires \(\sigma =\phi \).
Introduce now Lagrange multipliers \(\theta ^k\). Because the maximizing probability distribution must be normalized, an extra multiplier \(\alpha \) is needed. The Lagrange function can then be chosen as
Stationarity implies that the optimizing probability distribution \(p=p^\theta \) must satisfy an expression of the form
Two cases exist in which the resulting model belongs to a deformed exponential family. First take \(\sigma =\text{ id }\). Then (80) becomes
This can be written as (65) with \(\phi \) such that \(\tau \rho '=\log _\phi \). In the rho-id gauge this condition is satisfied with \(\rho =\text{ id }\) and \(\tau =\log _\phi \).
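The variational step in the case \(\sigma =\text{ id }\) can be sketched as follows (the multiplier signs are our convention): using \(\delta S_{\rho ,\tau }/\delta p = -\tau (p)\rho '(p)\), stationarity of
$$\begin{aligned} {\mathcal {L}}(p) = S_{\rho ,\tau }(p) + \sum _k\theta ^k\int _{{\mathcal {X}}}\mathrm{d}x\, p(x)F_k(x) - \alpha \int _{{\mathcal {X}}}\mathrm{d}x\, p(x) \end{aligned}$$
gives \(\tau (p)\rho '(p) = \sum _k\theta ^kF_k - \alpha \); when \(\tau \rho '=\log _\phi \) this is exactly the phi-exponential family (65).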
The other case occurs when \(\tau \rho '\) is proportional to \(\sigma '\), say \(\tau \rho '=\sigma '\). Then (80) becomes
This is of the form (65) provided \(\phi \) is such that
for some constant B. The result is
with
In the Tsallis context \(\sigma (u)=\phi (u)=u^q\) for some \(q\not =1\). Condition (81) is then satisfied with \(B=q/(1-q)\). The choice \(\sigma =\phi \) works because if \(\sigma \) is a power law then \(\sigma \sigma ''/(\sigma ')^2\) is constant.
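For \(\sigma (u)=u^q\) the constancy claim can be checked in one line:
$$\begin{aligned} \frac{\sigma (u)\sigma ''(u)}{(\sigma '(u))^2} = \frac{u^q\cdot q(q-1)u^{q-2}}{q^2u^{2q-2}} = \frac{q-1}{q}, \end{aligned}$$
a constant independent of u, which is what allows condition (81) to hold with a constant B.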
6 Summary and discussions
Classic information geometry (Amari's \(\alpha \)-geometry) contains three inter-related parts: (i) the Fisher–Rao metric g with the family of \(\alpha \)-connections; (ii) the divergence functions inducing g and the \(\alpha \)-connections; (iii) the exponential family corresponding to the dually flat \(\alpha = 1\) connection. Over the years, various aspects of classic information geometry were generalized by relaxing the logarithm/exponential embedding functions, predominantly in the deformed-exponential approach of Naudts [11], the U model of Eguchi [6], and the conjugate rho–tau embedding of Zhang [20]. In this paper, these approaches are synthesized to give a full generalization of classical information geometry with arbitrary monotone embeddings.
The main thesis of our paper is that the divergence function \(D_{\rho , \tau }\) constructed from the \((\rho , \tau )\)-embedding subsumes both the phi-divergence \(D_\phi \) constructed from the deformed-log embedding and the U-divergence constructed from the U-embedding. This is achieved by adopting the rho-id gauge (or, dually, the tau-id gauge). A highlight of our analysis is that the rho–tau divergence \(D_{\rho , \tau }\) provides a clear distinction between entropy and cross-entropy as two distinct quantities, without requiring the latter to degenerate to the former.
On the other hand, fixing the gauge \(f^* = \tau ^{-1}\) (the constant-\(S^*\) gauge) renders the rho–tau cross-entropy equal to the U cross-entropy, in which case the dual entropy is constant. In this case, \(\tau \longleftrightarrow \rho \) is akin to the \(\log _\phi \longleftrightarrow \phi \) transformation encountered in studying the phi-exponential family.
With respect to the induced geometry, we first showed that the rho–tau metric tensor \(g(\theta )\) depends on a single function \(\psi \) defined by \(\psi (u)=1/(\rho '(u)\tau '(u))\). Theorem 1 gives equivalent conditions for the rho–tau metric to be Hessian. If the probability model is phi-exponential with \(\phi =\psi \), then the rho–tau metric is Hessian; the potential function is the convex conjugate of the rho–tau entropy. In general, however, the rho–tau metric is not Hessian. A non-Hessian special case is obtained by choosing \(\psi =\phi /\phi '\) for the phi-exponential family; the resulting metric is conformally equivalent to the metric given by the second derivative of the normalization function \(\alpha (\theta )\).
In our generalization of Amari’s \(\alpha \)-geometry, there is a variety of (semi-) Riemannian metrics:
-
(i)
rho–tau metric, induced from (\(\rho , \tau \))-divergence or (\(\rho , \tau \)) cross-entropy; it contains one free function \(\psi \) given by \(\psi \rho '\tau '=1\);
-
(ii)
entropic metric, induced from the (\(\rho , \tau \))-entropy—it is a Hessian metric.
When the probability model \(p^\theta \) belongs to the \(\phi \)-exponential family, with \(\alpha (\theta )\) the normalization function, there always exists another potential function \(\varPhi \), which is generally different from \(\alpha \) (unless \(\phi = \text{ id }\), the case of the ordinary exponential family). Assuming convexity, both \(\alpha \) and \(\varPhi \) can be used to induce dual (expectation) coordinates, respectively \(\zeta \) and \(\eta \), with respect to the natural coordinates \(\theta \) indexing the \(\phi \)-exponential family. The rho–tau metric g of the \(\phi \)-exponential family, being dependent on the weighting function \(\psi \), may or may not be Hessian. After fixing one embedding function \(\tau \) to be \(\log _\phi \), it turns out that
-
(i)
\(g = g^\varPhi \) upon choosing \(\psi = \phi \) (which forces \(\rho = \text{ id }\), and hence adopting the Type I gauge). That is, the rho–tau metric g coincides with the Hessian metric \(g^\varPhi \) as induced from \(\varPhi \); it is conformally equivalent to the (non-Hessian) escort metric (associated with the escort expectation);
-
(ii)
\(g = z(\theta ) g^\alpha \) upon choosing \(\psi = \phi / \phi ^\prime \) (which forces \(\rho = \phi \), and hence adopting the Type II or constant-S gauge). That is, the rho–tau metric g, though not Hessian, becomes conformally equivalent to the Tsallis metric \(g^\alpha \), a Hessian metric induced from the normalization function \(\alpha \).
Therefore, one should carefully distinguish the various metrics: the rho–tau metric (which may or may not be Hessian), the entropic metric (which is always Hessian), and, in the case of the phi-exponential family, the Tsallis metric (always Hessian) and the escort metric (generally non-Hessian).
Note that conformal equivalence for the case of \(\psi = \phi \) was previously studied, e.g., in [2, 8]. For the case of \(\psi = \phi / \phi ^\prime \), an anonymous reviewer brought to our attention that a recent report [9] derived identical conclusions using a different approach, in which the weighting function arises as the second derivative of \(\exp _\phi \):
Interestingly, the first derivative of \(\exp _\phi \) corresponds to the \(\psi = \phi \) selection. Future research will elucidate whether the agreement between the “sequential derivative” approach of [9] and our rho–tau embedding approach to specifying the weighting function of the Riemannian metric is merely a coincidence or reflects a deeper cause.
Our current analysis clarifies various phenomena that emerge as a result of adopting general embedding functions—these phenomena have been largely obscured in the “standard model” due to its use of the standard logarithm/exponential function:
-
1.
In general, the divergence function (as a two-variable function) is the difference of cross-entropy (as a two-variable function) and a pair of dual entropies (as one-variable functions); one can always define a “modified” cross-entropy to absorb one of the entropies (as in the U cross-entropy case);
-
2.
In general, the deformed-exponential family always has two potentials, \(\varPhi \) and \(\alpha \), which are not equal unless there is no deformation. Therefore, there are always two expectation coordinates (standard expectation and escort expectation) with respect to the same natural parameter of the deformed-exponential family. This is regardless of the rho-tau metric of the Riemannian manifold (which is induced from the divergence function);
-
3.
When the rho–tau metric is Hessian (i.e., under Type I gauge), there are actually multiple potentials, including \(\varPhi \) and phi-entropy, as well as \(\alpha \) and dual entropy;
-
4.
When the rho–tau metric is conformally equivalent to a Hessian metric (i.e., under the Type II gauge), \(\alpha \) and \(S^*\) form a convex dual pair (up to a conformal scaling factor);
-
5.
The U model and the phi-model are identical models under different notations.
The conjugate rho–tau embedding mechanism and phi-exponential model together provide the necessary ingredients for generalizing the \(\alpha \)-geometry while preserving its elegant geometric structure. This greatly expands the reach of information geometric analysis to a much larger applied setting. In particular, the principle of Maximum Entropy inference can be generalized to the case of a generalized linear model. Future research will show how this generalized formulation of maxent duality and calculations may lead to practical impact in statistics, information science, and machine learning.
References
Amari, S.: Differential-geometric methods in statistics. Lecture notes in statistics, vol. 28. Springer, Berlin (1985)
Amari, S., Ohara, A., Matsuzoe, H.: Geometry of deformed exponential families: invariant, dually-flat and conformal geometries. Physica A Stat. Mech. Appl. 391(18), 4308–4319 (2012)
Amari, S., Nagaoka, H.: Methods of Information Geometry. Translations of mathematical monographs 191 (Am. Math. Soc., 2000; Oxford University Press, 2000); Originally in Japanese (Iwanami Shoten, Tokyo, 1993)
Callen, H.B.: Thermodynamics and an Introduction to Thermostatistics, 2nd edn. Wiley, New York (1985)
Curado, E.M.F., Tsallis, C.: Generalized statistical mechanics: connection with thermodynamics. J. Phys. A24, L69 (1991); Corrigenda 24, 3187 (1991) and 25, 1019 (1992)
Eguchi, S.: Information geometry and statistical pattern recognition. Sugaku Expositions (Amer. Math. Soc.) 19, 197–216 (2006). (originally Sūgaku 56 (2004) 380 in Japanese)
Kaniadakis, G.: Non-linear kinetics underlying generalized statistics. Physica A Stat. Mech. Appl. 296, 405–425 (2001)
Matsuzoe, H.: Hessian structures on deformed exponential families and their conformal structures. Diff. Geom. Appl. 35, 323–333 (2014)
Matsuzoe, H., Scarfone, A.M., Wada, T.: A sequential structure of statistical manifolds on deformed exponential family. LNCS 10589, 223–230 (2017)
Montrucchio, L., Pistone, G.: Deformed exponential bundle: the linear growth case. In: Nielsen, F., Barbaresco, F. (eds.) Geometric science of information, GSI 2017 LNCS proceedings, pp. 239–246. Springer, Berlin (2017)
Naudts, J.: Estimators, escort probabilities, and phi-exponential families in statistical physics. J. Ineq. Pure Appl. Math. 5, 102 (2004)
Naudts, J.: Generalised exponential families and associated entropy functions. Entropy 10, 131–149 (2008)
Naudts, J.: Generalised Thermostatistics. Springer, Berlin (2011)
Naudts, J., Anthonis, B.: The exponential family in abstract information theory. In: Nielsen, F., Barbaresco, F. (eds.) GSI 2013 LNCS Proceedings, pp. 265–272. Springer, Berlin (2013)
Naudts, J., Zhang, J.: Information geometry under monotone embedding. Part II: geometry. In: Nielsen, F., Barbaresco, F. (eds.) GSI 2017 Proceedings. LNCS, pp. 215–222. Springer, Berlin (2017)
Newton, N.J.: Information geometric nonlinear filtering. Inf. Dim. Anal., Quantum Prob. Rel. Topics 18, 1550014 (2015)
Tsallis, C.: Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 52, 479–487 (1988)
Tsallis, C.: What are the numbers that experiments provide? Quim. Nova 17, 468 (1994)
Tsallis, C., Mendes, R.S., Plastino, A.R.: The role of constraints within generalized nonextensive statistics. Physica A 261, 543–554 (1998)
Zhang, J.: Divergence function, duality, and convex analysis. Neural Comput. 16, 159–195 (2004)
Zhang, J.: Referential duality and representational duality on statistical manifolds. Proceedings of the Second International Symposium on Information Geometry and Its Applications, Tokyo, pp. 58-67 (2005)
Zhang, J.: Nonparametric information geometry: from divergence function to referential-representational biduality on statistical manifolds. Entropy 15, 5384–5418 (2013)
Zhang, J.: On monotone embedding in information geometry. Entropy 17, 4485–4499 (2015)
Zhang, J., Naudts, J.: Information geometry under monotone embedding. Part I: divergence functions. In: Nielsen, F., Barbaresco, F. (eds.) GSI 2017 Proceedings LNCS, pp. 205–214. Springer, Berlin (2017)
Zhou, J.: Information theory and statistical mechanics revisited. arXiv:1604.08739
Acknowledgements
The project is supported in part by DARPA/ARO Grant W911NF-16-1-0383 (“Information Geometry: Geometrization of Science of Information”, PI: Zhang). Both authors contributed equally to this paper.
Naudts, J., Zhang, J. Rho–tau embedding and gauge freedom in information geometry. Info. Geo. 1, 79–115 (2018). https://doi.org/10.1007/s41884-018-0004-6