1 Introduction

Counting discrete events seems to be one of the simplest ways of collecting data, but compositional bias when directly comparing such counts in varying contexts can lead intuition astray. Often, the lack of a common scale in samples taken from different environments or experimental conditions makes direct comparisons between counts meaningless. We need to gauge by internal references before we can make external comparisons. Compositional data analysis (CoDA, e.g. [1]) uses scale-free methods on data occurring in the form of percentages, and its log-ratio methodology [2] has been applied to relative counts as well. While the sample spaces [3] of both data types are certainly not the same, the underlying problem is identical: direct comparisons across samples can have paradoxical effects due to the lack of a common scale [4]. We have recently proposed to make use of information geometry [5] to analyse compositional data [6]. The information-geometric approach is even more natural for relative count data, and simple count distributions like the categorical or multinomial have served as examples to illustrate basic concepts in information geometry. Here we aim to demonstrate the usefulness of information-geometric concepts for the analysis of count data that are compositional in a well-defined sense.

Let us quickly sketch the main idea of this contribution. Consider a vector of counts \((n_i)_{i=1}^D\) that were produced by some process with unknown independent count probabilities \(q_i\). It is well known that the empirical estimator for such multinomial probabilities

$$\begin{aligned} \hat{q_i}=\frac{n_i}{\sum _{k=1}^Dn_k} \end{aligned}$$
(1)

(although it is the one that maximizes the likelihood of the data) can be much improved upon when the denominator is not large compared with D. In this case, a better alternative is the convex combination

$$\begin{aligned} \hat{q_i}^\textrm{sh}=\uplambda \frac{1}{D}+(1-\uplambda )\hat{q_i} \end{aligned}$$
(2)

of the estimator with the equidistribution, for an optimized value of the parameter \(0\le \uplambda \le 1\). This is an example of what is known as shrinkage of \({\hat{q}}_i\) toward the target 1/D. The reason why this works can be understood from a Bayesian perspective. The shrinkage estimator (2), instead of maximizing the likelihood of the data, maximizes the posterior probability of a suitable parameter of the multinomial (assuming a simple conjugate prior). Optimization of \(\uplambda \) corresponds to adjusting the weight that the prior will have compared with the weight that will be assumed for the data. But why is \({\hat{q}}_i^\textrm{sh}\) a good approximation of \(q_i\)? It turns out that maximizing the posterior probability corresponds to minimizing the divergence of \({\hat{q}}_i^\textrm{sh}\) from \(q_i\).
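
To make this concrete, here is a minimal numerical sketch of the estimators (1) and (2); the function names are ours, and the shrinkage weight is left as a free parameter (its optimization is discussed in Sect. 2.8):

```python
import numpy as np

def empirical_estimator(counts):
    """Maximum-likelihood estimate q_i = n_i / n, Eq. (1)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def shrinkage_estimator(counts, lam):
    """Convex combination of the equidistribution 1/D and the
    empirical estimate, Eq. (2), with shrinkage weight 0 <= lam <= 1."""
    q_hat = empirical_estimator(counts)
    D = q_hat.size
    return lam / D + (1.0 - lam) * q_hat

# Example: a sparse count vector whose total is small compared with D
counts = np.array([5, 0, 1, 0, 0, 2, 0, 0, 0, 0])
print(empirical_estimator(counts))
print(shrinkage_estimator(counts, lam=0.3))  # flattened toward 1/D
```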

As the parameters (and estimators) we are dealing with are probabilities themselves, they can be understood as points in a finite simplex (which happens to be the CoDA sample space). From an information-geometric point of view, the shrinkage estimator is optimized along the mixture geodesic (or m-geodesic) between the equidistribution and the observed point \(({\hat{q}}_i)_{i=1}^D\) (see the blue line in Fig. 1). Geodesics provide intuition, e.g., a generalized Pythagorean theorem makes use of them. Unlike in Euclidean geometry, however, we need two types of geodesics for Pythagoras to work. The natural counterparts to m-geodesics are the exponential geodesics (or e-geodesics). These are convex combinations of points in exponential coordinates, which are dual to the mixture coordinates (via the Legendre duality that underlies information geometry). Let us now consider the e-geodesic between the two points in question (see the orange curve in Fig. 1).

Fig. 1 Exponential (curved orange line) and mixture (blue straight line) geodesics between the equidistribution (1/3, 1/3, 1/3) and an observed point \((n_1/n,n_2/n,n_3/n)\) in the 3-part simplex

It turns out that the e-geodesic corresponds to an alternative parametrization of the posterior probability, where the prior and likelihood contribute via weighted geometric means. A point on the e-geodesic is just another estimator of the posterior mean that uses this alternative parametrization. When back-transforming exponential coordinates to the original parameter, this geodesic can be written as

$$\begin{aligned} {\hat{q}}_i^\textrm{es}=\frac{{\hat{q}}_i^\beta }{\sum _{k=1}^D{\hat{q}}_k^\beta }, \end{aligned}$$
(3)

with \(0\le \beta \le 1\). This kind of exponential scaling is well known in statistical physics, where \(\beta \) is the inverse temperature. It is also used when Box–Cox transforming data to reduce skew or to replace logarithms by approximate expressions when zeros are involved. In the CoDA context, \(\beta \) can be used to mediate between \(\chi \)-squared distance and Aitchison distance and thus makes a connection between log-ratio analysis and Correspondence Analysis (CA) [7]. The latter can handle zeros while the former needs to impute them.

For finding the optimal value of the shrinkage parameter \(\uplambda \), a simple analytic solution for minimization of the mean squared error (MSE) with respect to the true parameter can be found [8, 9]. To use the same strategy for the \(\beta \)-parameter of the e-geodesic, we propose to use an MSE on the tangent space. This is just the expected Aitchison distance between the estimator and the true parameter. We derive an analytic solution that approximates an optimal \(\beta \) based on the Delta method (i.e., via Taylor expansion). This is computationally inexpensive and can, e.g., be used as a data preprocessing step for dimension-reduction techniques like CA. Simulations show that this approach holds promise for data with many essential zeros. We discuss the exponential shrinkage estimator as an additional tool that avoids the pseudocounts of current procedures in contexts where zero imputation may be inappropriate. On a theoretical level, this contribution aims to unify power transformations with shrinkage under the same conceptual framework.

Section 2 presents essentially review material, with the first two paragraphs dedicated to some very general statistical motivation. We then introduce the information-geometric formulation of the multinomial likelihood and posterior and make some methodological excursions of a more technical nature in Sections 2.6 and 2.8. In these sections, we reformulate known minimizations of relative entropy and of expected quadratic loss in the form of propositions that will serve us in the subsequent application. Section 3 is then dedicated to the application of the material presented. It includes the definition of an alternative shrinkage estimator and its optimization along the exponential geodesic as well as a benchmark of it using simulations. All the proofs and some of the more lengthy algebraic derivations are deferred to the Appendix.

2 Preliminaries

2.1 Sequencing data are relative

Let us first discuss the practical relevance of relative counts for contemporary biomedical data. While it is usually acknowledged that data produced by DNA sequencing instruments are relative [10], a number of arguments for the current dominance of absolute approaches have been put forward. We will discuss one of these arguments here: The constraint on the counts does not hold strictly, i.e., it is itself a fluctuating quantity [11].

When we count the times \(n_j\) a specific event j occurs within a fixed time interval, then under very general assumptions (i.e., independence of events from previous occurrences, fixed average rate of occurrence, no simultaneous occurrences), the resulting counts will be distributed according to a Poisson distribution:

$$\begin{aligned} p_P(n_j\mid \uplambda _j)=\frac{\uplambda _{j}^{n_{j}}}{n_{j}!}e^{-\uplambda _{j}}. \end{aligned}$$
(4)

Here, \(\uplambda _j\) denotes the average occurrence rateFootnote 1 of an event j. When considering D such events now, and assuming they don’t influence each other, we can write the overall probability of the D-dimensional vector of counts \({\varvec{n}}\) simply as a product of D such distributions.

Consider now a modification of this scenario where we observe these D events taking place but instead of fixing a time interval, we will simply stop counting after we have observed n events. The resulting distribution is a multinomial

$$\begin{aligned} p_n({\varvec{n}}\mid {\varvec{q}})=\frac{n!}{\prod _{j=1}^Dn_j!}\prod _{j=1}^Dq_j^{n_j}, \end{aligned}$$
(5)

where \({\varvec{q}}=(q_j)_{j=1}^D\) is the vector of individual event probabilities.Footnote 2 The multinomial encodes a constraint on \({\varvec{n}}\) that leads to a mutual dependence between the parts. In this sense, it models a composition of counts.

To see the connection between these two scenarios, let us come back to the independent Poisson distribution. It can be written as

$$\begin{aligned} p_\textrm{P}({\varvec{n}}\mid \varvec{\uplambda })= & {} \prod _{j=1}^D\frac{\uplambda _j^{n_j}}{n_j!}e^{-\uplambda _j} \nonumber \\= & {} \frac{\uplambda ^n}{n!}e^{-\uplambda }\frac{n!}{\prod _{j=1}^Dn_j!}\prod _{j=1}^D\left( \frac{\uplambda _j}{\uplambda }\right) ^{n_j} =p_\textrm{P}(n\mid \uplambda )~p_n({\varvec{n}}\mid {\varvec{q}}). \end{aligned}$$
(6)

Here \(\uplambda \) denotes the sum over the components of \(\varvec{\uplambda }\), and \({\varvec{q}}=\varvec{\uplambda }/\uplambda \). We see that the independent Poisson distributions factorize into a univariate Poisson of n with parameter \(\uplambda \) as well as a multinomial distribution \(p_n\) that has n and \({\varvec{q}}\) as parameters. This well-known relationship between the Poisson and the multinomial is interesting when discussing the argument against compositionality above. First we note that a variation in the constraining variable n can only be used for a correct estimate of the rate parameters \(\uplambda _j\) of the D Poisson processes if the overall rate \(\uplambda \) is exactly their sum. Modelling by a multinomial can thus be perfectly justified for a stochastic n whose rate \(\gamma \) is of no interest to the analyst because it is decoupled from the \(\varvec{\uplambda }\), in the sense that \(\gamma \ne \uplambda \). For sequencing data, the constraint on n is imposed by the capacity of the sequencing instrument while the variation in n can be caused by other aspects of the protocol (e.g., the subsequent read mapping). The practical effects of the constraint are well documented [13, 14] and aren’t invalidated by the stochastic nature of n.
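
The factorization (6) is easy to verify numerically. The following sketch, assuming scipy's standard Poisson and multinomial probability mass functions and an arbitrary choice of rates and counts, compares both sides:

```python
import numpy as np
from scipy.stats import poisson, multinomial

rates = np.array([2.0, 5.5, 0.7, 1.3])      # lambda_j
counts = np.array([1, 6, 0, 2])             # an arbitrary observation n_j
n, lam = counts.sum(), rates.sum()

# Left-hand side: product of independent Poisson probabilities
lhs = np.prod(poisson.pmf(counts, rates))

# Right-hand side: Poisson probability of the total times a multinomial
rhs = poisson.pmf(n, lam) * multinomial.pmf(counts, n=n, p=rates / lam)

print(lhs, rhs)   # the two numbers agree up to floating-point error
```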

For an application of the multinomial to single-cell sequencing data, see [15]. A pragmatic approach is taken in [12], where it is acknowledged that the \(q_j\), not the \(\uplambda _j\), should be the modelling objective, but (for practical reasons) their modelling is done by an independent Poisson that is reparametrized as \(p_\textrm{P}({\varvec{n}}\mid \uplambda ,{\varvec{q}})\). The Poisson can serve as an approximation whenever there are no dominant parts for which \(q_j\) becomes too large. The modelling gets complicated again as soon as co-variation of parts across samples is taken into account.

2.2 Variation across samples, Bayes

According to the Bayesian paradigm, probabilities are subjective in the sense that they quantify degrees of knowledge [16]. This quantification involves both data and model parameters, and both can be arguments to probability functions. While we assume a fixed parameter when considering a single sample \({\varvec{n}}\), it makes sense to let the parameter vary according to some distribution when considering many samples that were obtained under different conditions. This is typically the case when we have a data matrix where counts for D variables (or compositional parts) indexed by the columns are collected in N samples indexed by the rows.

As an example, consider the special case of the multinomial \(p_n\). Our choice of the prior \(\pi \) quantifying the probability of the parameter \({\varvec{q}}\) will determine the functional form of the joint distribution and thus affect our ability to capture the variability across samples:

$$\begin{aligned} p_n({\varvec{n}},{\varvec{q}})=p_n({\varvec{n}}\mid {\varvec{q}})\pi ({\varvec{q}}). \end{aligned}$$
(7)

Integrating the joint probabilityFootnote 3 over the parameter \({\varvec{q}}\) would leave us again with \({\varvec{n}}\) as the only argument. The resulting marginal distribution will depend on the hyperparameters of the prior (which we left out in the formula above).Footnote 4 If we divide (7) by it, we renormalize and obtain the posterior probability of the parameter \({\varvec{q}}\), giving us Bayes’ theorem.

An excellent choice for \(\pi \) would be a \(D-1\)-dimensional multivariate normal of the log-ratios \(\log (q_i/q_D)\). This allows for a compositional modelling of the second-order interactions between parts that captures the over-dispersion often observed in real-world data [17, 18]. While this logistic-normal multinomial model has no analytic solution, Markov-Chain Monte Carlo can be used, like in a recent application to differential association networks in microbiome data [19]. Note that the interest is now in the hyperparameters of the prior, especially in the covariance matrix of the log-ratios of \({\varvec{q}}\).

A less realistic but more tractable solution is obtained when simply choosing the conjugate prior to the multinomial, i.e., the Dirichlet distribution. While we will later describe it in more detail, let us here point out that this model implies that all interaction between parts comes from the constraint that counts have to add to n. It is thus the model with the greatest degree of independence that can be achieved for compositions [2].

2.3 Dual coordinates for count distributions

We have recently proposed to treat compositional data with the methods of information geometry [6]. The fact that the geometric structure of the discrete probability simplex can be exploited for the analysis of compositional data has been observed before, e.g. [20]. Compositions \({\varvec{q}}\) can be described as categorical distributions that live on a finite dimensional openFootnote 5 simplex

$$\begin{aligned} {\mathcal {S}}^D=\left\{ (q_1,\dots ,q_D)^T\in {\mathbb {R}}^D:q_i>0,i=1,\dots ,D,\sum _i^Dq_i=1\right\} . \end{aligned}$$
(8)

The finite version of information geometry already contains all its important concepts but often provides a more intuitive approach, see [5, 21]. For a comprehensive treatment of the finite case, see Chapter 2 of [22]. We now show a concrete example of an application to CoDA that slightly extends our framework in [6] to deal with relative count data.

To briefly recapitulate, we start from the two natural coordinate systems used in information geometry: the expectation parameters \(\varvec{\eta }\) (whose components carry lower indices) and the exponential parameters \(\varvec{\theta }\) (with upper indices). Consider again the case where the occurrence of D discrete events is encoded by a random variable \(R=r\in \{1,\dots ,D\}\) with occurrence probabilities \({\varvec{q}}\). The \(D-1\)-dimensional vector of expectation parameters \(\varvec{\eta }\) consists simply of those probabilities that can vary freely (while all of them have to sum to 1). The probability of an event in terms of \(\varvec{\eta }\) can then be written as

$$\begin{aligned} p(r\mid \varvec{\eta }) = \left\{ \begin{array}{cl} \eta _r &{} \text{ if }\quad r \le D-1, \\ 1 - \sum _{i = 1}^{D-1} \eta _i &{} \text{ if } \quad r = D, \end{array} \right. \qquad r = 1,\dots ,D. \end{aligned}$$
(9)

Alternatively, this distribution can be parametrized using what is known as the alr-transformation in CoDA:

$$\begin{aligned} \theta ^j=\log \frac{q_j}{q_D},~~~j=1,\dots ,D-1. \end{aligned}$$
(10)

Note that we are not (as often done in CoDA) log-ratio transforming the data themselves, but their underlying parameters \({\varvec{q}}\). With this, we can write our distribution in the form

$$\begin{aligned} p(r\mid \varvec{\theta }) = \textrm{exp}\left( \sum _{k=1}^{D-1}\theta ^k {\mathbbm {1}}_k(r) -\psi (\varvec{\theta })\right) , \qquad r = 1,\dots ,D, \end{aligned}$$
(11)

where \({\mathbbm {1}}_k(r) = 1\) if \(r = k\), and \({\mathbbm {1}}_k(r) = 0\) otherwise. The function \(\psi \) ensures normalization and is known as the free energy. It is given by

$$\begin{aligned} \psi (\varvec{\theta })=\log \left( 1+\sum _{i=1}^{D-1}e^{\theta ^i}\right) =-\log q_D. \end{aligned}$$
(12)

How do we get from a single outcome r to the multinomial counts \({\varvec{n}}\)? Let us first consider n outcomes \({\varvec{r}}=(r_1,\dots ,r_n)\). Their probability is simply the product over (11):

$$\begin{aligned} p({\varvec{r}}\mid n,\varvec{\theta })= & {} \prod _{i=1}^np(r_i\mid \varvec{\theta })\nonumber \\= & {} \exp \sum _{i=1}^{n}\left( \sum _{k=1}^{D-1}\theta ^k{\mathbbm {1}}_k(r_i) -\psi (\varvec{\theta })\right) ,\nonumber \\= & {} \exp \left( \sum _{k=1}^{D-1}\theta ^kn_k({\varvec{r}})-n\psi (\varvec{\theta })\right) , \end{aligned}$$
(13)

where \(n_k({\varvec{r}}):=\sum _{i=1}^{n}{\mathbbm {1}}_k(r_i)\). This latter expression encodes the D components of our relative counts \({\varvec{n}}\). To obtain their probability of occurrence, we note that many outcomes \({\varvec{r}}\) lead to the same outcomes of counts. Counting these leads to a factor given by the multinomial coefficient:

$$\begin{aligned} p_0({\varvec{n}}\mid n)=\frac{n!}{n_1!\dots n_D!}={n\atopwithdelims ()n_1\dots n_D}. \end{aligned}$$
(14)

With this base measure, we can finally write our multinomial (5) in the form of an exponential family

$$\begin{aligned} p_n({\varvec{n}}\mid \varvec{\theta })=p_0({\varvec{n}}\mid n)~\textrm{exp}\left( \sum _{k=1}^{D-1}\theta ^k n_k -n\psi (\varvec{\theta })\right) . \end{aligned}$$
(15)

We see that the exponential coordinates remain the same regardless of the number of observations. It is often convenient to drop the base measure and, changing the random variable, resort to the expression (13). Also, as we can see from (15), to obtain the multi-event versions of \(\varvec{\eta }\) and \(\psi (\varvec{\theta })\), we just need to multiply by n. Due to the Legendre duality of the natural coordinates, we can obtain the multi-event expectation coordinates by taking partial derivatives

$$\begin{aligned} n\eta _j=\frac{\partial }{\partial \theta ^j}n\psi (\varvec{\theta })=\mathbbm {E}_{p_n}(n_j)=nq_j,\qquad j=1,\dots ,D-1. \end{aligned}$$
(16)

Finally, the potential that is dual to the multi-event free energy \(n\psi (\varvec{\theta })\), i.e., the negative Shannon entropy of (13), is given by \(n\phi (\varvec{\eta })\), where

$$\begin{aligned} \phi (\varvec{\eta })=\sum _{k=1}^{D-1}\eta _k\log \eta _k+\left( 1-\sum _{k=1}^{D-1}\eta _k\right) \log \left( 1-\sum _{k=1}^{D-1}\eta _k\right) . \end{aligned}$$
(17)
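
A short numerical sketch of these dual coordinates may be helpful: it computes \(\varvec{\theta }\) via (10) and the free energy (12), recovers \({\varvec{q}}\) by the inverse map, and checks the duality relation (16) by numerical differentiation (all function names are ours):

```python
import numpy as np

def theta_from_q(q):
    """Exponential (alr) coordinates, Eq. (10)."""
    return np.log(q[:-1] / q[-1])

def psi(theta):
    """Free energy, Eq. (12)."""
    return np.log1p(np.exp(theta).sum())

def q_from_theta(theta):
    """Inverse map: q_j = exp(theta_j - psi) for j < D, q_D = exp(-psi)."""
    full = np.append(theta, 0.0)
    return np.exp(full - psi(theta))

q = np.array([0.5, 0.3, 0.2])
theta = theta_from_q(q)
print(q_from_theta(theta))              # recovers q
print(psi(theta), -np.log(q[-1]))       # free energy equals -log q_D

# Legendre duality (16): d psi / d theta_j = eta_j = q_j
eps = 1e-6
grad = np.array([(psi(theta + eps * e) - psi(theta - eps * e)) / (2 * eps)
                 for e in np.eye(len(theta))])
print(grad, q[:-1])                      # numerically equal
```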

2.4 Parameter divergence from observed points

In the previous section, we have derived expressions for probabilities of data given some model parameters. These parameters happen to be compositions, and as such they can be depicted as points in a simplex. When we normalize a sample of count data by its total, we can also represent it as a so-called observed point [5] in the simplex:

$$\begin{aligned} \hat{{\varvec{q}}}=\left( \frac{n_1}{n},\dots ,\frac{n_D}{n}\right) ^T. \end{aligned}$$
(18)

This is the empirical estimate of the parameter \({\varvec{q}}\). The empirical estimate is also known as the type of a sequence \({\varvec{r}}\) of independent random variables. Our dual coordinates associated with the observed point are

$$\begin{aligned} \hat{\varvec{\theta }}= & {} \left( \log \frac{n_1}{n_D},\dots ,\log \frac{n_{D-1}}{n_D}\right) ^T,\end{aligned}$$
(19)
$$\begin{aligned} n\hat{\varvec{\eta }}= & {} \left( n_1,\dots ,n_{D-1}\right) ^T. \end{aligned}$$
(20)

One of the fundamental results of the method of types (e.g., [23]) is an equality relating the true distribution to the observed point:

$$\begin{aligned} p({\varvec{r}}\mid n,\varvec{\theta }) =\exp \left( n\phi (\hat{\varvec{\eta }})-n D_\phi (\hat{{\varvec{q}}}\mid \mid {\varvec{q}})\right) , \end{aligned}$$
(21)

where

$$\begin{aligned} D_\phi (\hat{{\varvec{q}}}\mid \mid {\varvec{q}})=\sum _{j=1}^{D}\frac{n_j}{n}\log \frac{n_j}{n q_j} \end{aligned}$$
(22)

is the relative entropy, or Kullback–Leibler (KL) divergence, between the empirical and the true parameter compositions. The expression (21) can be easily derived by simple algebraic rearrangement of (13) using the expressions for \(\phi \) and \(D_\phi \). With (21), it is clear that we can write the multi-event version of our divergence as

$$\begin{aligned} nD_\phi (\hat{{\varvec{q}}}\mid \mid {\varvec{q}})=n\phi (\hat{\varvec{\eta }})-\log p({\varvec{r}}\mid n,\varvec{\theta }). \end{aligned}$$
(23)

As the first term does not depend on \(\varvec{\theta }\), this shows why taking the maximum of the likelihood \(p({\varvec{r}}\mid n,\varvec{\theta })\) over \(\varvec{\theta }\) is equivalent to minimizing the KL-divergence between the estimated and the true parameter composition.
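
The identity (23) is easy to verify numerically. The following sketch evaluates the multi-event log-likelihood (13), the potential (17) at the observed point (written in terms of the full composition), and the KL-divergence (22) for an arbitrary choice of counts and parameters (function names are ours):

```python
import numpy as np

def phi(q):
    """Negative Shannon entropy, Eq. (17), in terms of the full composition."""
    return np.sum(q * np.log(q))

def kl(p, q):
    """KL-divergence D_phi(p || q), Eq. (22)."""
    return np.sum(p * np.log(p / q))

q = np.array([0.5, 0.3, 0.2])          # true parameter composition
counts = np.array([3, 4, 1])           # observed counts
n = counts.sum()
q_hat = counts / n                      # observed point, Eq. (18)

log_lik = np.sum(counts * np.log(q))    # log p(r | n, theta), Eq. (13)
print(n * kl(q_hat, q), n * phi(q_hat) - log_lik)   # both sides of Eq. (23)
```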

More general relationships of this kind can be derived from a fundamental information-geometric equality that is due to the Legendre duality between \(\psi \) and \(\phi \):

$$\begin{aligned} D_\phi (\hat{{\varvec{q}}}\mid \mid {\varvec{q}})=\phi (\hat{\varvec{\eta }})+\psi (\varvec{\theta })-\varvec{\theta }^T\hat{\varvec{\eta }}. \end{aligned}$$
(24)

Minimizing a dissimilarity between distributions can be understood as a projection. Here we project the observed point onto the manifold of distributions parametrized by \(\varvec{\theta }\). In information geometry, this minimization of the KL-divergence is known under the name of m-projection, see [5]. In Sect. 2.6, we will show a result that is more general than (23) in the sense that it does not only hold for the likelihood but also for prior and posterior probability.

2.5 Posterior probability of the parameter

For the Bayesian estimation of a parameter we have to construct a posterior distribution of the parameter that also takes into account its prior distribution \(\pi \), which itself can depend on a vector of hyperparameters \(\varvec{\alpha }\). For a review of Bayesian inference for categorical data see [25]. The posterior probability density of the parameter in terms of the exponential parameter \(\varvec{\theta }\) is

$$\begin{aligned} p(\varvec{\theta }\mid {\varvec{r}},n,\varvec{\alpha })=\frac{p({\varvec{r}}\mid n,\varvec{\theta })\pi (\varvec{\theta }\mid \varvec{\alpha })}{\int d\varvec{\theta }^\prime p({\varvec{r}}\mid n,\varvec{\theta }^\prime )\pi (\varvec{\theta }^\prime \mid \varvec{\alpha })}. \end{aligned}$$
(25)

Instead of maximizing the likelihood over \(\varvec{\theta }\), we can now maximize the posterior to obtain the best parameter estimate.Footnote 6 Inserting (13), the posterior (25) evaluates to

$$\begin{aligned} p(\varvec{\theta }\mid {\varvec{r}},n,\varvec{\alpha })=\pi (\varvec{\theta }\mid \varvec{\alpha })\exp \left( \sum _{k=1}^{D-1}\theta ^k n_k({\varvec{r}}) -n\psi (\varvec{\theta })-\log p({\varvec{r}}\mid \varvec{\alpha })\right) , \end{aligned}$$
(26)

where \(p({\varvec{r}}\mid \varvec{\alpha })\) is the normalizing integral in the denominator of (25). Seeing this as an exponential family, we note that the parameter and the random variables have exchanged their roles. The prior can be written as a new base measure now, while the new free energy is given by \(\log p({\varvec{r}}\mid \varvec{\alpha })\).Footnote 7

A prior that has the same functional form as the resulting posterior is called a conjugate prior. Using a conjugate prior makes closed-form solutions of the posterior possible. The general form of the conjugate prior for an exponential family is well known [24], but it is instructive to obtain it as follows. We copy the functional form of (26) and obtain a D-parameter conjugate prior as

$$\begin{aligned} \pi (\varvec{\theta }\mid \varvec{\alpha })=\pi _0(\varvec{\theta })\exp \left( \sum _{k=1}^{D-1}\theta ^kf_k(\varvec{\alpha })-\left[ \sum _{k=1}^Df_k(\varvec{\alpha })\right] \psi (\varvec{\theta })-\chi (\varvec{\alpha })\right) , \end{aligned}$$
(27)

where \(\pi _0\) is a base measure, \(f_k\) is a sufficient statistic of the k-th hyperparameter, and \(\chi \) the normalization. With this, the posterior (26) becomes

$$\begin{aligned}{} & {} p(\varvec{\theta }\mid {\varvec{r}},n,\varvec{\alpha })=\pi _0(\varvec{\theta }) \nonumber \\{} & {} \times \exp \left( \sum _{k=1}^{D-1}\theta ^k\left( n_k({\varvec{r}})+f_k(\varvec{\alpha })\right) -\left[ n+\sum _{k=1}^Df_k(\varvec{\alpha })\right] \psi (\varvec{\theta })-\chi (\varvec{\alpha })-\log {p({\varvec{r}}\mid \varvec{\alpha })}\right) .\nonumber \\ \end{aligned}$$
(28)

In our categorical case it is well known [25] that the conjugate prior is a Dirichlet distribution with parameters \(\varvec{\alpha }\). The expressions involved evaluate to

$$\begin{aligned} f_k(\varvec{\alpha })= & {} \alpha _k, \end{aligned}$$
(29)
$$\begin{aligned} \pi _0(\varvec{\theta })= & {} 1, \end{aligned}$$
(30)
$$\begin{aligned} \chi (\varvec{\alpha })= & {} \log B(\varvec{\alpha }), \end{aligned}$$
(31)
$$\begin{aligned} p({\varvec{r}}\mid \varvec{\alpha })= & {} \frac{B\left( (n_k({\varvec{r}})+\alpha _k)_{k=1}^D\right) }{B(\varvec{\alpha })}, \end{aligned}$$
(32)

where B denotes the multivariate beta function. (For clarity, we give a short derivation for \(p({\varvec{r}}\mid \varvec{\alpha })\) in the Appendix.) With these expressions, the posterior simplifies to

$$\begin{aligned}{} & {} p(\varvec{\theta }\mid {\varvec{r}},n,\varvec{\alpha })\nonumber \\{} & {} \quad = \exp \left( \sum _{k=1}^{D-1}\theta ^k\left( n_k({\varvec{r}})+\alpha _k\right) -\left[ n+\sum _{k=1}^D\alpha _k\right] \psi (\varvec{\theta })-\log B\left( {\varvec{n}}({\varvec{r}})+\varvec{\alpha }\right) \right) .\nonumber \\ \end{aligned}$$
(33)

We can see here the widely used result that the posterior is obtained from the likelihood by simply adding the conjugate prior parameters as pseudocounts to the respective event counts and then renormalizing.
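
As a minimal numerical illustration of this pseudocount update (the Dirichlet prior parameters `alpha` below are a hypothetical choice, and the posterior mean anticipates the shrinkage estimator of Sect. 2.6):

```python
import numpy as np

counts = np.array([7, 0, 2, 1])       # observed event counts n_k
alpha = np.full(counts.size, 0.5)     # hypothetical Dirichlet prior parameters

# Conjugacy: the posterior is a Dirichlet with the prior parameters
# added to the counts as pseudocounts
alpha_post = counts + alpha

# Posterior mean of q (the shrinkage estimator discussed in Sect. 2.6)
q_post_mean = alpha_post / alpha_post.sum()
print(alpha_post, q_post_mean)
```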

2.6 Parameter divergence from general estimators

The similarity between the likelihood and our expression for the posterior suggests that we can maximize the posterior similarly to the likelihood by minimizing a certain KL-divergence. Indeed, the following proposition shows that maximizing prior, likelihood, or posterior always corresponds to a minimization of KL-divergence between a suitable estimator and \({\varvec{q}}\):

Proposition 1

Let \({\varvec{q}}\) be a parameter of probabilities with exponential coordinates \(\varvec{\theta }\) via \(p(r\mid \varvec{\theta })\) with free energy \(\psi (\varvec{\theta })\) as defined in (10)–(12). Further, let the function \(f:{\mathcal {S}}^D\times {\mathbb {R}}_+\times {\mathbb {R}}^{D-1}\rightarrow {\mathbb {R}}_+\) be given by

$$\begin{aligned} f(\tilde{{\varvec{q}}},{\tilde{n}},\varvec{\theta })=Z({\tilde{n}},\tilde{{\varvec{q}}})~\textrm{exp}\left\{ {\tilde{n}}\left( \varvec{\theta }^T\tilde{\varvec{\eta }}-\psi (\varvec{\theta })\right) \right\} , \end{aligned}$$

where \(\tilde{{\varvec{q}}}\) is an estimator of \({\varvec{q}}\) with expectation coordinates \(\tilde{\varvec{\eta }}\), \({\tilde{n}}\) denotes a positive real, and Z a positive function. We then have

$$\begin{aligned} {\tilde{n}}D_\phi (\tilde{{\varvec{q}}}\mid \mid {\varvec{q}})={\tilde{n}}\phi (\tilde{\varvec{\eta }})+\log Z({\tilde{n}},\tilde{{\varvec{q}}})-\log f(\tilde{{\varvec{q}}},{\tilde{n}},\varvec{\theta }), \end{aligned}$$

with \(\phi \) the Lagrange dual to \(\psi \) as defined in (17) and \(D_\phi \) the KL-divergence.

The proof makes use of (24) and otherwise consists in a simple rearrangement of terms (see Appendix).

Corollary 1

Maximization of \(\log f(\tilde{{\varvec{q}}},{\tilde{n}},\varvec{\theta })\) as a function of \(\varvec{\theta }\) minimizes \(D_\phi (\tilde{{\varvec{q}}}\mid \mid {\varvec{q}})\) as a function of \({\varvec{q}}\).

This is clear because the other (data-dependent) terms do not depend on the parameter.

Example 1

Shrinkage estimator:

We use as our estimator \(\tilde{{\varvec{q}}}\) the expected value of \({\varvec{q}}\) under the posterior (33), the so-called shrinkage estimator \(\hat{{\varvec{q}}}^\textrm{sh}\)

$$\begin{aligned} \tilde{{\varvec{q}}}=\hat{{\varvec{q}}}^\textrm{sh}:={\mathbb {E}}_{\varvec{\theta }}({\varvec{q}}\mid {\varvec{r}},n,\varvec{\alpha })=\frac{{\varvec{n}}+\varvec{\alpha }}{n+\sum _{k=1}^D\alpha _k}, \end{aligned}$$
(34)

and set \({\tilde{n}}={\hat{n}}:=n+\sum _{k=1}^D\alpha _k\). This allows us to reparametrize the posterior in the required form

$$\begin{aligned} p(\varvec{\theta }\mid \hat{{\varvec{q}}}^\textrm{sh},{\hat{n}})=\exp \left( {\hat{n}}\left[ \sum _{k=1}^{D-1}\theta ^k{\hat{q}}_k^\textrm{sh}-\psi (\varvec{\theta })\right] -\log B\left( {\hat{n}}\hat{{\varvec{q}}}^\textrm{sh}\right) \right) , \end{aligned}$$
(35)

and thus \(f(\tilde{{\varvec{q}}},{\tilde{n}},\varvec{\theta })=p(\varvec{\theta }\mid {\varvec{r}},n,\varvec{\alpha })\) and \(Z({\tilde{n}},\tilde{{\varvec{q}}})=1/B({\hat{n}}\hat{{\varvec{q}}}^\textrm{sh})\). With this, the proposition gives

$$\begin{aligned} {\hat{n}}D_\phi (\hat{{\varvec{q}}}^\textrm{sh}\mid \mid {\varvec{q}})={\hat{n}}\phi (\hat{\varvec{\eta }}^\textrm{sh})-\log B({\hat{n}}\hat{{\varvec{q}}}^\textrm{sh})-\log p(\varvec{\theta }\mid \hat{{\varvec{q}}}^\textrm{sh},{\hat{n}}). \end{aligned}$$
(36)

Thus finding the \(\varvec{\theta }\) that maximizes the posterior is equivalent to minimizing the KL-divergence between the shrinkage estimator and the true parameter \({\varvec{q}}\).

Example 2

Empirical estimator:

The empirical estimator of the multinomial distribution is a straightforward application: \(\tilde{{\varvec{q}}}=\hat{{\varvec{q}}}:={\varvec{n}}/n\), \({\tilde{n}}=n\), and \(f(\tilde{{\varvec{q}}},{\tilde{n}},\varvec{\theta })= p_n({\varvec{n}}\mid \varvec{\theta })\) as given by (15), so \(Z({\tilde{n}},\tilde{{\varvec{q}}})\) is the multinomial coefficient \(p_0({\varvec{n}}\mid n)\). The proposition gives (23) with an additional subtraction of the \(\log p_0\) term.

Clearly, another example consists in maximizing the prior probability of \(\varvec{\theta }\) to minimize the divergence between \(\varvec{\alpha }/\sum _k\alpha _k\) and \({\varvec{q}}\). In Sect. 3 we will define another version of the shrinkage estimator, which will provide us with yet another application of the proposition. Note that \(f(\tilde{{\varvec{q}}},{\tilde{n}},\varvec{\theta })\) has the general form of a conjugate prior of an exponential family, so Proposition 1 holds for exponential families in general. A more general treatment than the one presented here can be found in [26].

2.7 Decision-theoretic risk

Decision theory (e.g., [27]) provides a foundational framework for statistics. While it is closely linked with Bayesian analysis, it can also be formulated from a frequentist point of view. In any case, it implies the construction of a loss function that incorporates statistical knowledge in order to quantify the risk of a wrong decision. Such a loss function L has the “true state of nature” and an action (based on some knowledge) as its arguments. Perhaps the most important example for these arguments would be the true parameter \({\varvec{q}}\) of a distribution and some estimator \(\hat{{\varvec{q}}}\), where the latter would be identified with the action based on it. Given some loss \(L({\varvec{q}},\hat{{\varvec{q}}})\), the risk we incur when basing our decision on the estimator is then some expected value

$$\begin{aligned} R(\hat{{\varvec{q}}})=\mathbbm {E}L({\varvec{q}},\hat{{\varvec{q}}}). \end{aligned}$$
(37)

Bayesian and frequentist schools disagree on the type of expectation that should be taken here. While for the Bayesian the expectation is taken with respect to the posterior probabilityFootnote 8 of the parameter \({\varvec{q}}\), the frequentist averages over all instances of the random variables (which follow a distribution parametrized by \({\varvec{q}}\))Footnote 9. As a consequence, the risk remains a function of \({\varvec{q}}\). A frequentist then calls an estimator \(\hat{{\varvec{q}}}_1\) R-better than \(\hat{{\varvec{q}}}_2\) when \(R_{{\varvec{q}}}(\hat{{\varvec{q}}}_1)\le R_{{\varvec{q}}}(\hat{{\varvec{q}}}_2)\) for all \({\varvec{q}}\), with strict inequality for some of them. An estimator is called inadmissible if there exists an R-better estimator.

Often, for pragmatic reasons, a quadratic loss leading to a mean squared error (MSE) risk function is assumed. Besides its simplicity, one benefit is that for unbiased estimators, the (frequentist) risk is simply the variance of the estimator:

$$\begin{aligned} R_{{\varvec{q}}}(\hat{{\varvec{q}}})=\mathbbm {E}\left[ (\hat{{\varvec{q}}}-{\varvec{q}})^2\right] =\sum _{j=1}^{D}\left[ \textrm{var}({\hat{q}}_j-q_j)+\mathbbm {E}^2({\hat{q}}_j-q_j)\right] =\sum _{j=1}^{D}\textrm{var}({\hat{q}}_j).\nonumber \\ \end{aligned}$$
(38)

Here, the bias-variance decomposition of the MSE was used, and the last equality follows from the facts that \(q_j\) is not stochastic and that the bias \(\mathbbm {E}\left[ \hat{{\varvec{q}}}-{\varvec{q}}\right] \) vanishes. Note that we do not have to know the true value of \({\varvec{q}}\) to evaluate this risk because, in practice, the variance of the estimator is evaluated using its empirical estimate. As an example, for the empirical estimator (18), the variance components would be estimated by \({\hat{q}}_j(1-{\hat{q}}_j)/(n-1)\).

2.8 James–Stein shrinkage and regularization

The empirical estimator \(\hat{{\varvec{q}}}\) is (unlike the empirical estimator of the multivariate normal mean) known to be admissible under quadratic loss [28], so there is no "Stein effect" [29] for the multinomial. While the Bayesian estimator (34) isn’t uniformly better than the empirical estimator for all parameter values,Footnote 10 its flattening of the data can result in much smaller mean squared error than with the empirical estimate. This will be made plausible in the following. Let us rewrite (34) as a convex combination

$$\begin{aligned} \hat{{\varvec{q}}}^\textrm{sh}=\uplambda \varvec{\tau }+(1-\uplambda )\hat{{\varvec{q}}} \end{aligned}$$
(39)

between the target distribution \(\varvec{\tau }\) and the empirical estimator \(\hat{{\varvec{q}}}\). That this is equivalent to (34) can be seen when defining

$$\begin{aligned} \uplambda:= & {} \frac{\sum _{k=1}^D\alpha _k}{n+\sum _{k=1}^D\alpha _k}, \end{aligned}$$
(40)
$$\begin{aligned} \tau _j:= & {} \frac{\alpha _j}{\sum _{k=1}^D\alpha _k},\qquad j=1,\dots ,D. \end{aligned}$$
(41)

\(\hat{{\varvec{q}}}^\textrm{sh}\) is called a James-Stein type [30] shrinkage estimator of \({\varvec{q}}\), see also [31] as well as the discussion in [9]. Choosing the maximum-entropy target, i.e., the equidistribution \(\tau _j=1/D\) for all \(j=1,\dots ,D\), the target term can be understood as a regularization of the empirical estimator.

Remember that \(\hat{{\varvec{q}}}^\textrm{sh}\) is the posterior expected value of \({\varvec{q}}\). The fact that the posterior expected value of a random variable is a linear function of its empirical estimate is equivalent to the use of a conjugate prior. This is a result that holds for exponential families in general [24].

This linearity is helpful for evaluating the accuracy of the shrinkage estimator, again using the expected quadratic loss as a risk function. We shall give a result that is slightly more general than necessary for this estimator because we will again need it in Sect. 3:

Proposition 2

Let \(f_j\), \(j=1,\dots ,D\) be the components of a function \(f:{\mathcal {S}}^D\rightarrow {\mathbb {R}}^D\) acting on a vector of probabilities. Let \(\varvec{\tau }\) be a D-dimensional probability parameter and \(\hat{{\varvec{q}}}\) the multinomial empirical estimator. Then, for \(0\le \uplambda \le 1\), the convexly combined estimator \(f(\tilde{{\varvec{q}}})\) of \(f({\varvec{q}})\) given by its components

$$\begin{aligned} f_j(\tilde{{\varvec{q}}}):=\uplambda f_j(\varvec{\tau })+(1-\uplambda )f_j(\hat{{\varvec{q}}}),\qquad j=1,\dots ,D \end{aligned}$$

(i) has a quadratic risk with respect to \(f({\varvec{q}})\) given by

$$\begin{aligned} R_{{\varvec{q}}}(\tilde{{\varvec{q}}})=(1-\uplambda )^2\sum _{j=1}^D\textrm{var}\big (f_j(\hat{{\varvec{q}}})\big )+\sum _{j=1}^D\bigg [{\mathbb {E}}f_j(\hat{{\varvec{q}}})-f_j({\varvec{q}})-\uplambda \big ({\mathbb {E}}f_j(\hat{{\varvec{q}}})-f_j(\varvec{\tau })\big )\bigg ]^2. \end{aligned}$$

(ii) The minimum risk is attained for

$$\begin{aligned} \uplambda ^*=\frac{\sum _{j=1}^D\bigg [\textrm{var}\big (f_j(\hat{{\varvec{q}}})\big )+\big ({\mathbb {E}}f_j(\hat{{\varvec{q}}})-f_j({\varvec{q}})\big )\big ({\mathbb {E}}f_j(\hat{{\varvec{q}}})-f_j(\varvec{\tau })\big )\bigg ]}{\sum _{j=1}^D{\mathbb {E}}\big [f_j(\hat{{\varvec{q}}})-f_j(\varvec{\tau })\big ]^2}. \end{aligned}$$

The proof is provided in the Appendix. This is a slight modification of the lemma shown in [8], see also the derivation in [32] and the application to the multinomial in [9]. To apply the proposition to \(\hat{{\varvec{q}}}^\textrm{sh}\), we observe that \(f_j\) simply corresponds to taking the j-th component and simplifications occur because the bias of \(\hat{{\varvec{q}}}\) vanishes: \({\mathbb {E}}f_j(\hat{{\varvec{q}}})-f_j({\varvec{q}})={\mathbb {E}}{\hat{q}}_j-q_j=0\). We obtain

$$\begin{aligned} R_{{\varvec{q}}}(\hat{{\varvec{q}}}^\textrm{sh})=(1-\uplambda )^2\sum _{j=1}^{D}\textrm{var}({\hat{q}}_j)+\uplambda ^2\sum _{j=1}^{D}\mathbbm {E}^2({\hat{q}}_j-\tau _j), \end{aligned}$$
(42)

with minimum risk at

$$\begin{aligned} \uplambda ^*=\frac{\sum _{j=1}^{D}\textrm{var}({\hat{q}}_j)}{\sum _{j=1}^{D}\mathbbm {E}\left[ ({\hat{q}}_j-\tau _j)^2\right] }. \end{aligned}$$
(43)

We can see that the risk function is a weighted average over the risk of the empirical estimator and an additional term that penalizes the expected difference from the target. Tuning the size of \(\uplambda \), we can trade off the bias of the target against the variance of the empirical estimate to obtain a smaller risk than (38). Estimators based on small-sample data will generalize better to new data when flattening the data to a well-specified extent using an uninformative, maximum-entropy model. The amount of flattening depends on the data at hand and is optimized via the weight \(\uplambda \) of the target. Note that the relationships (40) and (41) imply that this is similar to an empirical Bayes procedure where we tune the size of the pseudocounts \(\alpha _j\) and thereby adjust the a priori sample size \(\sum \alpha _k=n\uplambda /(1-\uplambda )\). To evaluate (43), the empirical estimates for variance and expectation are used in practice.
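
In practice, (43) is evaluated with the plug-in estimates just mentioned. A minimal sketch for the uniform target (the function name and the clipping of \(\uplambda \) to the admissible interval are our choices) could look as follows:

```python
import numpy as np

def lambda_star(counts, target=None):
    """Plug-in estimate of the optimal shrinkage weight, Eq. (43)."""
    counts = np.asarray(counts, dtype=float)
    n, D = counts.sum(), counts.size
    q_hat = counts / n
    tau = np.full(D, 1.0 / D) if target is None else np.asarray(target)
    var_hat = q_hat * (1.0 - q_hat) / (n - 1.0)     # empirical variance of q_hat_j
    lam = var_hat.sum() / np.sum((q_hat - tau) ** 2)
    return min(1.0, lam)                            # clip to the admissible range

counts = np.array([10, 0, 3, 0, 1, 0, 0, 2, 0, 0])
lam = lambda_star(counts)
q_sh = lam / counts.size + (1 - lam) * counts / counts.sum()  # Eq. (39), uniform target
print(lam, q_sh)
```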

2.9 Power-transformed compositions and their Euclidean distance in ordination

Power transformations [33] have traditionally been applied to data in order to fulfill certain distributional assumptions. For instance, a suitable power transformation can reduce skew so data appear approximately normal. In the case of Poisson counts, where variance equals the mean, the square root transformation is a common choice to “stabilize” the variance (i.e., make it approximately constant independently of the mean). More generally, power transformations can appear through the link functions of generalized linear models [35] and then enable a fit of the data to a true underlying distribution.

Methods for dimension reduction and data visualization (a.k.a. ordination) such as Principal Component Analysis (PCA) often use some version of Euclidean distance between multivariate samples:

$$\begin{aligned} d^2(\hat{{\varvec{q}}}_1,\hat{{\varvec{q}}}_2)=\sum _{j=1}^D\omega _j\left( {\hat{q}}_{1j}-{\hat{q}}_{2j}\right) ^2, \end{aligned}$$
(44)

where the \(\omega _j\) are suitable weights. Here, for the data, we used the empirical parameter estimates of the count distribution \(\hat{{\varvec{q}}}\) instead of the counts \({\varvec{n}}\) themselves. In the case of relative counts, where the total of each sample is not of direct interest, this seems a good idea because we want to visualize the “shape” of the data without their “size” [34]. There are two main ordination methods that are relational in the sense that they visualize shape only [35], Correspondence Analysis (CA) and log-ratio analysis (LRA). CA uses a weighting scheme that involves row and column totals of the data matrix. In this way, it takes the data size into account indirectly to reflect the precision of the shape estimates. LRA, in contrast, is a PCA of data that are log-transformed and double-centred. Here, relationships between parts remain invariant under taking subsets of the data,Footnote 11 and it is better suited for true compositions. It was shown [7] that via the following limit of the Box–Cox family [36] of power transformations

$$\begin{aligned} \lim _{\beta \rightarrow 0}\frac{x^\beta -1}{\beta }=\log (x), \end{aligned}$$
(45)

CA on power-transformed data converges to LRA. CA and LRA are thus special cases of a more general family of ordination methods. To make this more precise in the case of unweighted LRA, consider the following transformation of our empirical estimates:

$$\begin{aligned} f_\beta (\hat{{\varvec{q}}})=\left( \frac{{\hat{q}}_1^\beta }{\sum _{k=1}^D{\hat{q}}_k^\beta },\dots ,\frac{{\hat{q}}_D^\beta }{\sum _{k=1}^D{\hat{q}}_k^\beta }\right) ^T. \end{aligned}$$
(46)

When now using uniform weights \(\omega _j=D^2\), the limit

$$\begin{aligned} \lim _{\beta \rightarrow 0}\frac{1}{\beta ^2}d^2\left( f_\beta (\hat{{\varvec{q}}}_1),f_\beta (\hat{{\varvec{q}}}_2)\right) \end{aligned}$$
(47)

is the squared Aitchison distance

$$\begin{aligned} d^2_A(\hat{{\varvec{q}}}_1,\hat{{\varvec{q}}}_2)=\frac{1}{D}\sum _{i=1}^D\sum _{j<i}\left( \log \frac{{\hat{q}}_{1i}}{{\hat{q}}_{1j}}-\log \frac{{\hat{q}}_{2i}}{{\hat{q}}_{2j}}\right) ^2 \end{aligned}$$
(48)

(see [6] for a proof). Aitchison (or log-ratio) distance is the metric underlying LRA. Using the transformation \(f_\beta \) before evaluating Euclidean distance induces a parametrized class of distance measures that include the ones used in CA (\(\beta =1\)) and LRA (\(\beta =0\)) as special cases.Footnote 12 When using finite, “small enough” values of the power parameter \(\beta \), the subcompositional coherence of LRA remains approximately satisfied while there is no need for zero imputation (as CA does not involve logarithms). One can obtain an optimal value of the power parameter in the sense that it maximizes the Procrustes correlation between the log-ratio transformed data (using zero imputation) and the coordinates from the power-transformed CA (keeping the zeros) [37].
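
The convergence of (47) to the Aitchison distance (48) can be checked numerically. The sketch below assumes strictly positive compositions, uses the uniform weights \(\omega _j=D^2\), and evaluates the Aitchison distance in its equivalent clr form (function names are ours):

```python
import numpy as np

def f_beta(q, beta):
    """Power transformation, Eq. (46)."""
    p = q ** beta
    return p / p.sum()

def aitchison2(q1, q2):
    """Squared Aitchison distance via clr coordinates (equivalent to Eq. (48))."""
    clr = lambda q: np.log(q) - np.log(q).mean()
    return np.sum((clr(q1) - clr(q2)) ** 2)

q1 = np.array([0.6, 0.3, 0.1])
q2 = np.array([0.2, 0.5, 0.3])
D = q1.size

for beta in [1.0, 0.1, 0.01, 0.001]:
    d2 = D ** 2 * np.sum((f_beta(q1, beta) - f_beta(q2, beta)) ** 2) / beta ** 2
    print(beta, d2)            # approaches the Aitchison value as beta -> 0
print("limit:", aitchison2(q1, q2))
```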

3 Exponential shrinkage

In this section we want to define and test an estimator based on the power transformation (46). The justification of this estimator comes from a formal analogy with \(\hat{{\varvec{q}}}^\textrm{sh}\). This analogy is more apparent when introducing the generalized notions of addition (a.k.a. perturbation) and scalar multiplication (a.k.a. powering) that equip the simplex with a linear structure. For \({\varvec{q}}, {\varvec{p}}\in {\mathcal {S}}^D\), and some \(\beta \in {\mathbb {R}}\), they are defined as the vectors

$$\begin{aligned} {\varvec{q}}\oplus {\varvec{p}}:= & {} {\mathcal {C}}(q_1p_1,\dots ,q_Dp_D)^T, \end{aligned}$$
(49)
$$\begin{aligned} \beta \odot {\varvec{q}}:= & {} {\mathcal {C}}(q_1^\beta ,\dots ,q_D^\beta )^T, \end{aligned}$$
(50)

where \({\mathcal {C}}\) denotes the closure operation \({\mathcal {C}}{\varvec{q}}:={\varvec{q}}/\sum _iq_i\). An inverse perturbation is given by \({\varvec{p}}\ominus {\varvec{q}}:={\varvec{p}}\oplus (-1)\odot {\varvec{q}}\).

3.1 Power transformed compositions as convex combinations, dual geodesics

The shrinkage estimator (39) is a weighted mean of the target and the observed point. This convex combination is an example for what is known as a mixture geodesic (or m-geodesic) in information geometry. Consider now a similar structure using the operations of perturbation and powering introduced above:

$$\begin{aligned} \tilde{{\varvec{q}}}=\uplambda \odot \varvec{\tau }\oplus (1-\uplambda )\odot \hat{{\varvec{q}}}. \end{aligned}$$
(51)

This describes a so-called exponential geodesic (or e-geodesic).Footnote 13 Usually [5], both types of geodesics are written in terms of their dual coordinates:

$$\begin{aligned} \varvec{\eta }(\uplambda )= & {} \uplambda \varvec{\eta }_{\varvec{\tau }}+(1-\uplambda )\varvec{\eta }_{\hat{{\varvec{q}}}}, \end{aligned}$$
(52)
$$\begin{aligned} \varvec{\theta }(\uplambda )= & {} \uplambda \varvec{\theta }_{\varvec{\tau }}+(1-\uplambda )\varvec{\theta }_{\hat{{\varvec{q}}}}, \end{aligned}$$
(53)

where we used subscripts to indicate at which points the coordinates are evaluated. Coming back to the power-transformation (46), we can easily see that it is described by the exponential geodesic between the observed point and the uniform target: Evaluating the exponential coordinates at \(f_\beta (\hat{{\varvec{q}}})\), we have

$$\begin{aligned} \varvec{\theta }_{f_\beta (\hat{{\varvec{q}}})}=\left( \log \frac{{\hat{q}}_1^\beta }{{\hat{q}}_D^\beta },\dots ,\log \frac{{\hat{q}}_{D-1}^\beta }{{\hat{q}}_D^\beta }\right) ^T=\beta \varvec{\theta }_{\hat{{\varvec{q}}}}. \end{aligned}$$
(54)

We also notice that for \(\varvec{\tau }=(1/D)_{i=1}^D\), \(\varvec{\theta }_{\varvec{\tau }}\) vanishes. Setting \(\beta =1-\uplambda \), we immediately obtain (53). When evaluating (53) for a general target, we can use the form (51) to obtain a generalized power transformation in terms of the original parameters:

$$\begin{aligned} \hat{{\varvec{q}}}^\textrm{es}:= \left( \frac{\tau _1^{1-\beta } {\hat{q}}_1^{\beta }}{\sum _{k=1}^{D}\tau _k^{1-\beta } {\hat{q}}_k^{\beta }},\dots ,\frac{\tau _D^{1-\beta } {\hat{q}}_D^{\beta }}{\sum _{k=1}^{D}\tau _k^{1-\beta } {\hat{q}}_k^{\beta }}\right) ^T. \end{aligned}$$
(55)

Comparing \(\hat{{\varvec{q}}}^\textrm{es}\) with the shrinkage estimator (34), we see that instead of a weighted arithmetic mean between the target and the empirical estimator, here we evaluate a weighted geometric mean between them.
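
A minimal sketch of the operations (49)–(50) and the estimator (55) follows; it also checks that (55) coincides with the convex combination (51) along the e-geodesic (function names and the numerical values are ours):

```python
import numpy as np

def closure(x):
    return x / x.sum()

def perturb(p, q):        # Eq. (49)
    return closure(p * q)

def power(beta, q):       # Eq. (50)
    return closure(q ** beta)

def q_es(q_hat, tau, beta):
    """Exponential shrinkage estimator, Eq. (55): a normalized weighted geometric mean."""
    w = tau ** (1 - beta) * q_hat ** beta
    return w / w.sum()

q_hat = np.array([0.7, 0.2, 0.1])
tau = np.full(3, 1 / 3)
beta = 0.6
lam = 1 - beta

print(q_es(q_hat, tau, beta))
print(perturb(power(lam, tau), power(1 - lam, q_hat)))   # Eq. (51), the same point
```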

3.2 Another reparametrization of the posterior

Since the generalized power transformation (55) can be described as a convex combination in exponential coordinates, it shares a structural similarity with the shrinkage estimator (34), which is obtained from a convex combination of expectation (a.k.a. mixture) coordinates. To make this a shrinkage problem, however, we need the resulting quantity \(\hat{{\varvec{q}}}^\textrm{es}\) to be interpreted as an estimator. Here we argue that \(\hat{{\varvec{q}}}^\textrm{es}\) is simply a reparametrization of \(\hat{{\varvec{q}}}^\textrm{sh}\) similar to (39). There, we went from \({\mathcal {C}}({\varvec{n}}+\varvec{\alpha })\) to an expression involving \(\uplambda \), \(\varvec{\tau }\), and \(\hat{{\varvec{q}}}\). We also showed a simple reparametrization of the posterior of \(\varvec{\theta }\) in terms of \(\hat{{\varvec{q}}}^\textrm{sh}\) together with the posterior sample size \({\hat{n}}\), see (35). Such alternative ways of writing posterior and posterior expectation can be obtained using \(\hat{{\varvec{q}}}^\textrm{es}\) as well, as we will show in the following.

As we have seen in the previous section, an alternative parameter \(\beta \) can be used to define a geometric mean between target and observed point. Defining \({\tilde{n}}:=\sum _{k=1}^{D}\tau _k^{1-\beta } n_k^{\beta }\), in the expression for the posterior (35) we can simply replace \({\hat{n}}\hat{{\varvec{q}}}^\textrm{sh}\) by new Dirichlet parameters \({\tilde{n}}\hat{{\varvec{q}}}^\textrm{es}\) to obtain the following expression of the posterior:

$$\begin{aligned} p(\varvec{\theta }\mid \hat{{\varvec{q}}}^\textrm{es},{\tilde{n}})=\exp \left( {\tilde{n}}\left[ \sum _{k=1}^{D-1}\theta ^k{\hat{q}}^\textrm{es}_k-\psi (\varvec{\theta })\right] -\log B\left( {\tilde{n}}\hat{{\varvec{q}}}^\textrm{es}\right) \right) . \end{aligned}$$
(56)

This provides us with another example for Proposition 1. Maximizing the posterior thus corresponds to a minimization of the KL-divergence between \(\hat{{\varvec{q}}}^\textrm{es}\) and the true parameter. Furthermore, the derivation of (32) given in the Appendix also shows that \(B({\tilde{n}}\hat{{\varvec{q}}}^\textrm{es})\) normalizes (56).Footnote 14 Note that this also implies that the posterior expectation of \({\varvec{q}}\) can be written equally well as either the shrinkage estimator \(\hat{{\varvec{q}}}^\textrm{sh}\) or as the exponential shrinkage estimator \(\hat{{\varvec{q}}}^\textrm{es}\). This means that the exponential shrinkage estimator is nothing but the reparametrized posterior expectation of \({\varvec{q}}\).

3.3 Quadratic risk on the tangent space

To evaluate the accuracy of the exponential shrinkage estimator, we would like a simple risk function like the MSE. We saw previously that with this risk function, an analytic estimate of the optimal prior weight was essentially possible because of the linearity of the shrinkage estimator. However, a generalized notion of linearity is now needed: While m-geodesics are straight lines in the simplex, e-geodesics are straight lines in its tangent space

$$\begin{aligned} {\mathcal {T}}^D=\left\{ {\varvec{v}}\in {\mathbb {R}}^D:\sum _{i=1}^Dv_i=0\right\} . \end{aligned}$$
(57)

A mapping from the simplex to \({\mathcal {T}}^D\) (a.k.a. clr plane in CoDA) is known as the clr transformation

$$\begin{aligned} \textrm{clr}({\varvec{q}})=\left( \log \frac{q_1}{g({\varvec{q}})},\dots ,\log \frac{q_D}{g({\varvec{q}})}\right) ^T, \end{aligned}$$
(58)

where g denotes the geometric mean \(g({\varvec{q}}) = \left( \prod _{i = 1}^D q_i\right) ^{1/D}\). This mapping is fundamental in both information geometry and CoDA. The constraint that the clr components sum to zero means that the points on an exponential geodesic retain their normalization on the simplex.
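
The following sketch implements the clr map (58), checks the zero-sum constraint (57), and illustrates that a point on the e-geodesic maps to a convex combination of the clr images of its endpoints (a uniform target and arbitrary numbers are assumed):

```python
import numpy as np

def clr(q):
    """clr transformation, Eq. (58)."""
    lq = np.log(q)
    return lq - lq.mean()

q_hat = np.array([0.7, 0.2, 0.1])
tau = np.full(3, 1 / 3)
lam = 0.4

print(clr(q_hat).sum())                          # zero up to rounding, Eq. (57)

# e-geodesic point: weighted geometric mean followed by closure
g = tau ** lam * q_hat ** (1 - lam)
q_geo = g / g.sum()

# In the tangent space the geodesic is a straight line:
print(clr(q_geo))
print(lam * clr(tau) + (1 - lam) * clr(q_hat))   # the same vector
```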

With this, a quadratic loss function in analogy to the one on the simplex can be obtained by first mapping the compositions in question to the tangent space and then using squared Euclidean distance again (see Fig. 2).

Fig. 2 a The shrinkage estimator \(\hat{{\varvec{q}}}^\textrm{sh}\) (in red) obtained by an addition of scaled vectors (in blue) ending in the unit simplex (shown in black). The m-geodesic connecting \(\varvec{\tau }\) and \(\hat{{\varvec{q}}}\) is shown as a thin blue line. b The exponential shrinkage estimator \(\hat{{\varvec{q}}}^\textrm{es}\) (in red) obtained by vector addition in the tangent space. The e-geodesic is shown as a curved orange line in the simplex and a straight orange line in the tangent space

Let us first define the loss function on the tangent space for the empirical estimator:

$$\begin{aligned} L_A({\varvec{q}},\hat{{\varvec{q}}})=\sum _{j=1}^{D}\left( \textrm{clr}_j(\hat{{\varvec{q}}})-\textrm{clr}_j({\varvec{q}})\right) ^2. \end{aligned}$$
(59)

This is the (squared) Aitchison distance, i.e., an alternative expression of (48). Via the mapping of the simplex to \({\mathcal {T}}^D\), the expression \(\textrm{clr}(\hat{{\varvec{q}}})-\textrm{clr}({\varvec{q}})\) can be interpreted as a difference vector between compositions [6]. One can write this in the form of a perturbation with the notation \(\hat{{\varvec{q}}}\ominus {\varvec{q}}\), which makes the analogy with (38) even more compelling. The “exponential” analogue to the MSE of Sect. 2.7 is the risk function associated with the squared Aitchison loss, i.e., the expectation

$$\begin{aligned} \tilde{R}_{{\varvec{q}}}(\hat{{\varvec{q}}})=\mathbbm {E}L_A({\varvec{q}},\hat{{\varvec{q}}})=\sum _{j=1}^{D}\left[ \textrm{var}\left( \textrm{clr}_j(\hat{{\varvec{q}}})\right) +\mathbbm {E}^2\left( \textrm{clr}_j(\hat{{\varvec{q}}})-\textrm{clr}_j({\varvec{q}})\right) \right] . \end{aligned}$$
(60)

Unfortunately, in this case the bias term does not vanish for the empirical estimator, and we shall need an approximation to evaluate it.

3.4 Optimization along the exponential geodesic

We can now use our modified risk function on the exponential shrinkage estimator, in analogy to (42), to minimize it with respect to \(\uplambda =1-\beta \). Using Proposition 2 with \(f_j(\cdot )=\textrm{clr}_j(\cdot )\), for the MSE of \(\textrm{clr}(\hat{{\varvec{q}}}^\textrm{es})\) we obtain

$$\begin{aligned} {R}_{{\varvec{q}}}(\hat{{\varvec{q}}}^\textrm{es})= & {} (1-\uplambda )^2\sum _{j=1}^{D}\textrm{var}\left( \textrm{clr}_j(\hat{{\varvec{q}}})\right) \nonumber \\{} & {} \quad +\sum _{j=1}^{D}\bigg [\uplambda \mathbbm {E}\left( \textrm{clr}_j(\varvec{\tau })-\textrm{clr}_j(\hat{{\varvec{q}}})\right) +\mathbbm {E}\textrm{clr}_j(\hat{{\varvec{q}}})-\textrm{clr}_j({\varvec{q}})\bigg ]^2. \end{aligned}$$
(61)

A solution for the minimum can be found at

$$\begin{aligned} \uplambda _\textrm{min}=\frac{\sum _{j=1}^{D}\bigg [\textrm{var}\left( \textrm{clr}_j(\hat{{\varvec{q}}})\right) -\mathbbm {E}\left( \textrm{clr}_j(\varvec{\tau })-\textrm{clr}_j(\hat{{\varvec{q}}})\right) \left( \mathbbm {E}\textrm{clr}_j(\hat{{\varvec{q}}})-\textrm{clr}_j({\varvec{q}})\right) \bigg ]}{\sum _{j=1}^{D}\mathbbm {E}\bigg [\left( \textrm{clr}_j(\varvec{\tau })-\textrm{clr}_j(\hat{{\varvec{q}}})\right) ^2\bigg ]}\nonumber \\ \end{aligned}$$
(62)

Again, this can be evaluated in practice by replacing \({\varvec{q}}\) by the best estimator available. To estimate the variance and the expectation terms of the clr-transformed empirical estimator, we resort to Taylor expansion. While the expressions become a bit more unwieldy compared with the ones on the m-geodesic, we can still evaluate them explicitly. For the mean we get

$$\begin{aligned} \mathbbm {E}\textrm{clr}_j(\hat{{\varvec{q}}})\approx E_j:=\textrm{clr}_j({\varvec{q}})-\frac{1-q_j}{2q_jn}+\frac{1}{2D}\sum _{k=1}^D\frac{1-q_k}{q_kn}, \end{aligned}$$
(63)

and for the variance (where this approximation is known as the Delta method)

$$\begin{aligned} \textrm{var}\left( \textrm{clr}_j(\hat{{\varvec{q}}})\right)\approx & {} \nonumber \\ \quad V_j:= & {} \left( 1-\frac{2}{D}\right) \frac{1-q_j}{q_jn}+\frac{1}{D^2}\sum _{k=1}^D\frac{1-q_k}{q_kn}-\frac{1}{n} \left( 3-\frac{7}{D}+\frac{4}{D^2}\right) \nonumber \\ \end{aligned}$$
(64)

(see Appendix for a derivation). In the case of the maximum-entropy target, the clr\(_j(\varvec{\tau })\) terms in (62) vanish, and an estimator of the optimal power can be obtained by

$$\begin{aligned} \beta ^*=1-\frac{\sum _{k=1}^D\left[ V_k-E_k(E_k-\textrm{clr}_k({\varvec{q}}))\right] }{\sum _{k=1}^D\left[ V_k+E_k^2\right] }. \end{aligned}$$
(65)
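
A direct transcription of (63)–(65) into code might look as follows; in practice, the unknown \({\varvec{q}}\) is replaced by the best available estimate (the simulations below use the shrinkage estimator \(\hat{{\varvec{q}}}^\textrm{sh}\); here a simple pseudocount plug-in, which is our own choice, stands in for it):

```python
import numpy as np

def clr(q):
    lq = np.log(q)
    return lq - lq.mean()

def beta_star(q, n):
    """Approximate optimal power, Eqs. (63)-(65), for a strictly positive
    (plug-in) parameter q and sample size n, with the uniform target."""
    D = q.size
    c = (1.0 - q) / (q * n)
    E = clr(q) - c / 2.0 + c.sum() / (2.0 * D)                 # Eq. (63)
    V = (1 - 2.0 / D) * c + c.sum() / D**2 \
        - (3 - 7.0 / D + 4.0 / D**2) / n                       # Eq. (64)
    num = np.sum(V - E * (E - clr(q)))
    den = np.sum(V + E**2)
    return 1.0 - num / den                                     # Eq. (65)

counts = np.array([12, 5, 2, 1, 1, 1])
n = counts.sum()
q_plugin = (counts + 0.5) / (n + 0.5 * counts.size)            # simple plug-in for q
print(beta_star(q_plugin, n))
```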

3.5 Performance on simulated data

We can now test how well we can infer true frequencies from simulated data using the exponential shrinkage estimator. For this, we use the equidistribution as the target and optimize the \(\beta \) parameter as described before.

Fig. 3 Mean squared error (MSE) of the empirical estimator (green), the shrinkage estimator (blue), and the exponential shrinkage estimator (orange). Data are sampled from multinomial distributions with increasing sparsity. Boxplots in each row show the MSEs of 500 samples from the multinomial whose histogram is shown in the first column. Sample size increases from left to right (\(n=D/5,D,5D\)), while sparsity increases from top down. As \(D=100\), the vertical axis in the histograms can be read as a percentage. Note that the vertical boxplot axes change their range between columns

This should not be understood as an attempt at a comprehensive benchmark but rather as a proof of concept. We test performance on multinomial counts only. The three different multinomial distributions (\(D=100\)) shown in Fig. 3 were obtained by sampling from Dirichlet distributions with three different choices for the hyperparameters. These were chosen to obtain multinomial parameters that are far from equidistributed and have an increasing number of essential zeros. As a measure of performance, we chose MSE as in [9]. Besides being simple and intuitive, MSE has the advantage that zeros are not problematic as there are no logarithms involved. Both zeros obtained from undersampling (i.e., count zeros) and those that occur because parameters are truly (or almost) zero (so-called essential zeros) have the effect that the observed point \(\hat{{\varvec{q}}}\) falls on the boundary of the simplex. This is not a problem for the shrinkage estimator, as m-geodesics can go from the centre to the boundary. However, e-geodesics are only defined inside the simplex, and we have to redefine the observed point as its projection to the nonzero parts, with a subsequent change in the dimension D. In any case, it is only the nonzero parts that can be modified by the exponential shrinkage estimator. As an approximation of the true parameter in the expressions (63) and (64), we use the shrinkage estimator \(\hat{{\varvec{q}}}^\textrm{sh}\). The exponential shrinkage estimator is optimized over the nonzero parts only. The results show that exponential shrinkage outperforms the empirical estimator but cannot compete with the shrinkage estimator if the data are severely undersampled (first column in Fig. 3). There is a sweet spot of performance when many essential zeros are present and the data are sampled at reasonable depth (middle column). In this case, the exponential shrinkage estimator can outperform the shrinkage estimator. Clearly, it is “already correct” for the unobserved values, while the shrinkage estimator imputes them. Further increasing sample size essentially equalizes the performance of all estimators (right column). Note that the presence of zeros in the multinomial parameters effectively increases the sample size as the same counts are now distributed over fewer parts. The two factors studied in Fig. 3, sample size and sparsity, are thus not independent of each other in their effects.
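
For orientation, a stripped-down version of such a simulation could be set up as follows (this is our own sketch, not the benchmark code; it compares only the empirical and the standard shrinkage estimator, and the exponential shrinkage estimator would be added analogously over the nonzero parts):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, reps = 100, 100, 500

q_true = rng.dirichlet(np.full(D, 0.05))          # sparse multinomial parameters

def mse(est, truth):
    return np.sum((est - truth) ** 2)

errors = {"empirical": [], "shrinkage": []}
for _ in range(reps):
    counts = rng.multinomial(n, q_true)
    q_hat = counts / n
    # optimal shrinkage weight, Eq. (43), with uniform target and empirical plug-ins
    var_hat = q_hat * (1 - q_hat) / (n - 1)
    lam = min(1.0, var_hat.sum() / np.sum((q_hat - 1.0 / D) ** 2))
    q_sh = lam / D + (1 - lam) * q_hat
    errors["empirical"].append(mse(q_hat, q_true))
    errors["shrinkage"].append(mse(q_sh, q_true))

for name, vals in errors.items():
    print(name, np.mean(vals))
```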

3.6 Discussion

We have shown that power transformations of relative count data can be understood as a shrinkage problem. An analytic solution for the optimal power for given data can be obtained in a way that is analogous to what was proposed for finding an optimal flattening constant. We find the underlying information-geometric structure intriguing: Both types of geodesics between the empirical estimate and the maximum-entropy estimate give rise to their own shrinkage problem. But we think that there are also practical implications for data analysis. In the context of compositional data visualization, power transformations have been proposed as an approximation to log-ratio transformations, which require zero imputation. Correspondence Analysis (CA), one of the best methods for visualizing two-way tables containing counts, can be made more suitable for relative count data when applying such a transformation. It then approximates log-ratio analysis (LRA), whose visualization appeals more to our Euclidean intuition but whose zero-imputed data may be suboptimal or even impossible for very sparse data sets. For side-by-side visualizations of geochemical and single-cell data using both methods, see [37]. While CA is a visualization of the stretched out (weighted) simplex, LRA is a PCA on its tangent space (the clr plane). When using the hybrid approach of CA with power-transformed counts, currently a uniform power parameter is applied to an entire data matrix that could contain rows with heterogeneous sample sizes. As we have seen, in terms of an optimal approximation to the underlying parameters in each row, this would work best if samples follow the same distribution and the sample sizes are not too different. On the other hand, we could argue that, from a modelling perspective, it would be better to find the best power for each row in the data matrix separately. While the deformation with respect to LRA would now be heterogeneous among samples, the fit with the underlying population parameters would be better. The shrinkage approach is of course applicable beyond data visualization, and we think that applying it as a kind of data normalization holds some promise for very sparse data sets as occurring in microbiome analysis or single-cell genomics. Not all of these zeros are essential zeros, but many of them may be caused by truly small occurrence probabilities. If so, the commonly applied log transform with a uniform pseudocount would almost certainly be less suitable than a data-driven power transformation as proposed here. While this approach may still appear overly simplistic, given today’s highly complex data acquisition protocols where effects of statistical and engineering decisions are hard to disentangle, simple approaches often perform as well as highly complex ones [38].