1 Introduction

Bayes' theorem encodes the measurement uncertainty in the probability assignments to the possible values of a measurand. It also applies to encoding the lack of certainty about the data model. If the data explanations are questionable and different models compete, their posterior probabilities provide the framework to assess the model uncertainties. The posterior probability of a model is proportional to the data evidence (or marginal likelihood). Since the evidence must remain unchanged under reparameterisations of the model, the prior distributions of different parameterisations must be proper and comply with the change-of-variable rule.

When no objective prior information is available, subjective inferences provide the required compliance with the transformation of the prior distributions under parameter changes. Alternatives, discussed by [16] and known as Jeffreys’ priors, are distributions obtained as functionals of the data model that encode the symmetries of the link between the measurand and the data. Others, developed by [5] and [3], maximise the mutual information between the data and the measurand, which measures how much we learn about the measurand from the data.

If improper, these distributions can lead to inconsistencies. For example, the Jeffreys’ uniform prior over the reals for the mean of Gaussian data leads to an inconsistent inference of their mean power, see [4, 5, 12, 13, 28].

We consider the simultaneous inferences of the individual means of multivariate Gaussian data, of the squares of the individual means, and of the average of the squared means. Supposing the data are the measured values of a discrete-time signal, we will refer to the squared means as powers. The uniform prior for the means delivers information about their squares. To avoid introducing this information into the problem and to take a finite power into account, we investigate a novel solution which extends the approaches of [1, 8, 9, 24, 25].

Our investigations were prompted by related problems in the estimation of the power of signals and of the modulus of complex quantities from measurement results affected by additive uncorrelated Gaussian errors, which have been discussed at length in [2, 6, 14, 15, 19, 32].

The manuscript is organised as follows. Section 2 states the problem, identifies its origin, and outlines a solution. Section 3 gives an overview of hierarchical modelling and model averaging. In Sect. 4, we apply them to derive consistent inferences of the data power. Numerical examples are given in Sect. 5.

All the integrations were carried out in terms of standard mathematical functions with the aid of Mathematica [30]. The relevant notebook is given as supplementary material. To read and interact with it, download the Wolfram Player free of charge [31].

2 Stein paradox

2.1 Problem statement

In the simplest form of the Stein paradox, \(x_i \sim N(x_i|\mu _i,\sigma =1)\) are observations of independent normal variables with unknown means \(\mu _i\) and known variance \(\sigma ^2\), for \(i = 1, 2,..., m\). With a somewhat inconsistent use of notation, we will use the same symbol to indicate both the random variables and the labels of their possible values. To keep the algebra simple, without loss of generality, we use units where \(\sigma =1\).

For example, \(x_i\) are samples of a \(\mu (t)\) signal affected by additive white Gaussian noise, where the uncertainty of each datum is known, as in [6] and [2]. Suppose that, in addition to the measurands \(\mu _i = \mu (t_i)\), we are also interested in the signal power, that is, in \(\theta ^2=|{\varvec{\mu }}|^2/m\), where \({\varvec{\mu }}= \{\mu _1, \mu _2,...\, \mu _m\}\) [32].

The Jeffreys prior of every instantaneous signal \(\mu _i\) is the uniform distribution over the reals, \(U_\infty (\mu _i)\propto \) const., resulting in independent normal posteriors, \(\mu _i \sim N(\mu _i|x_i,\sigma =1)\). By changing variable in the posterior distributions, \(\mu _i^2\) are independent non-central \(\chi ^2_1\) variables having one degree of freedom, non-centrality parameter \(x_i^2\), mean \(1 + x_i^2\), and variance \(2(1+2 x_i^2)\). Similarly, \(m\theta ^2\) is a non-central \(\chi ^2_m\) variable having m degrees of freedom and non-centrality parameter \(\sum _i x_i^2\). Therefore, it follows that the posterior mean and variance of \(\theta ^2\) are

$$\begin{aligned} \mathbb {E}(\theta ^2|\textbf{x},\mathscr {M}_\infty ) = 1 + \overline{x^2}, \end{aligned}$$
(1a)

where \(\textbf{x}=\{x_1, x_2,\, ...\, x_m\}\), \(\overline{x^2} = |\textbf{x}|^2/m\), and \(\mathscr {M}_\infty \) is the data model assuming the uniform prior, and

$$\begin{aligned} \textrm{Var}(\theta ^2|\textbf{x},\mathscr {M}_\infty ) = \frac{2}{m} \left( 1 + 2\overline{x^2} \right) . \end{aligned}$$
(1b)

From a frequentist viewpoint, since \(m\overline{x^2}\) is a noncentral \(\chi ^2_m\) variable having m degrees of freedom and non-centrality parameter \(m\theta ^2\), \(\overline{x^2}\) is a biased estimator of \(\theta ^2\). In fact,

$$\begin{aligned} \mathbb {E}(\overline{x^2}|{\varvec{\mu }}) = 1 + \theta ^2 \end{aligned}$$
(2a)

and

$$\begin{aligned} \textrm{Var}(\overline{x^2}|{\varvec{\mu }}) = \frac{2}{m} \left( 1 + 2\theta ^2 \right) . \end{aligned}$$
(2b)

As m tends to infinity, provided that \(\theta ^2\) and, consequently, \(\overline{x^2}\) converge, an untenable situation occurs: (1a) and (1b) jointly predict that \(\theta ^2\) is certainly \(\overline{x^2} + 1\), but, at the same time, (2a) and (2b) jointly predict that \(\theta ^2\) is certainly \(\overline{x^2}-1\).
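The inconsistency can be illustrated numerically. The following Python sketch (the signal and all numerical values are illustrative, not taken from the paper) simulates Gaussian data with unit variance and compares the posterior summary (1a)-(1b), obtained under the uniform prior, with the frequentist summary (2a)-(2b): as m grows, both concentrate, but around values of \(\theta ^2\) that differ by 2.

```python
# Minimal sketch of the inconsistency of Sect. 2.1 (illustrative signal and values).
import numpy as np

rng = np.random.default_rng(1)

for m in (10, 100, 10_000):
    mu = np.sin(np.linspace(0.0, 6.0, m))        # hypothetical signal, theta^2 ~ 0.5
    x = mu + rng.standard_normal(m)              # data with sigma = 1
    x2bar = np.mean(x**2)

    post_mean = x2bar + 1                        # (1a), uniform prior
    post_sd = np.sqrt(2 * (1 + 2 * x2bar) / m)   # (1b)
    freq_est = x2bar - 1                         # bias-corrected estimate, from (2a)
    freq_sd = np.sqrt(2 * (1 + 2 * (x2bar - 1)) / m)  # (2b) with theta^2 ~ x2bar - 1

    print(f"m={m:6d}  E(theta^2|x)={post_mean:6.3f}+/-{post_sd:.3f}  "
          f"x2bar-1={freq_est:6.3f}+/-{freq_sd:.3f}")
```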

2.2 Paradox explanation

The use of \(U_\infty (\mu _i)\propto \) const. can be justified by showing that the \(\mu _i \sim N(\mu _i|x_i,\sigma =1)\) posterior is a suitable limit obtained from proper priors having increasingly large variance [3]. The difference between the Bayesian and frequentist certainties is due to the information encoded in \(U_\infty (\mu _i)\), which makes the data explanations underlying the Bayesian and frequentist analyses different.

As most of its mass is at infinity, \(U_\infty (\mu _i)\) encodes that \(|\mu _i|\) is greater than any positive number. This information is irrelevant to the inference of \(\mu _i\), but not to that of \(\mu _i^2\). It is worth noting that, if really \(\mu _i \sim U_\infty (\mu _i)\), then \(\overline{x^2}\) and \(\theta ^2\) must diverge and the inconsistency disappears. Therefore, the problem originates in the conflict between the information encapsulated in \(U_\infty (\mu _i)\) and the data, see also [11]. Also, if we are not happy with the difference between (1a) and (2a), it means that we believe that \(\theta ^2\) and, consequently, \(\mu _i^2\) are finite.

2.3 Proposed solution

To remove the inconsistency between (1a) and (2a), we must take the assumption that \(|\mu _i|\) is bounded into account and use a proper prior. Therefore, extending [1], we set \(\mu _i|a_i,b_i \sim N(\mu _i|a_i,b_i)\), where the mean \(a_i\) and standard deviation \(b_i\) are hyper-parameters and the \(b_i \rightarrow \infty \) limit recovers the uniform distribution. The normal distribution has been chosen because it has the minimum relative entropy with respect to the uniform one for a fixed variance [26, 27].

The simplest way to encode \(|\mu _i|<\infty \) is to set \(a_i=a\) and \(b_i=b\) for all the samples, a condition that is sufficient, though not necessary, for convergence. If the samples are indistinguishable, i.e., the data labelling is unknown, assigning the same mean and variance is reasonable.

Hence, we assume the prior

$$\begin{aligned} {\varvec{\mu }}|a,b \sim \pi ({\varvec{\mu }}|a,b) = \prod _{i=1}^m N(\mu _i|a,b). \end{aligned}$$
(3)

According to the Bayesian viewpoint, firstly, the \({\varvec{\mu }}\) measurands are sampled from the \(\pi ({\varvec{\mu }}|a,b)\) prior, then the \(\textbf{x}\) data are sampled from the \(N(\textbf{x}|{\varvec{\mu }},\sigma =1)\) distribution. Therefore, different priors identify different models, and we can let the data select the most likely.

Since the prior (3) encodes the belief that all the measurands (e.g., the instantaneous signals \(\mu _i\)) have the same mean, it induces a shrinkage of the means’ inferences towards the sample mean. Our prior choice is not new. However, previous investigations, for instance, [4, 12, 24], set \(\mu _i|b \sim N(\mu _i|0,b)\), which encodes the stronger belief that all the measurands are expected to have a zero mean.

3 Outline of the Bayesian model selection

To determine the most likely prior, we let the data choose. Let \(N(\textbf{x}|{\varvec{\mu }},\sigma =1)\) be the multivariate distribution of the \(\textbf{x}\) data. Hence, the hierarchical models competing to explain the data are

$$\begin{aligned} \mathscr {M}_{ab} =\{N(\textbf{x}|{\varvec{\mu }},\sigma =1): {\varvec{\mu }}\sim \pi ({\varvec{\mu }}|a,b)\}, \end{aligned}$$
(4)

where \(\pi ({\varvec{\mu }}|a,b)\) is the prior distribution (3) and a and b index the models. The posterior distribution of the measurands is

$$\begin{aligned} \Pi ({\varvec{\mu }}|\textbf{x},a,b) = \frac{N(\textbf{x}|{\varvec{\mu }},\sigma =1)\pi ({\varvec{\mu }}|a,b)}{Z(\textbf{x}|a,b)}. \end{aligned}$$
(5)

The marginal likelihood or evidence,

$$\begin{aligned} Z(\textbf{x}|a,b) = \int _{\mathbb {R}^m}\!\! N(\textbf{x}|{\varvec{\mu }},\sigma =1)\pi ({\varvec{\mu }}|a,b)\, \textrm{d}{\varvec{\mu }}, \end{aligned}$$
(6)

is the sampling distribution of \(\textbf{x}\) given the model indexed by a and b, no matter what the values of \({\varvec{\mu }}\) – or of any other model parameterisation – might be.

The marginal likelihood can be used to compare the competing models by their probability \(Q(a,b|\textbf{x})\) as provided by the data,

$$\begin{aligned} Q(a,b|\textbf{x}) = \frac{Z(\textbf{x}|a,b)\varpi (a,b)}{\displaystyle \int _0^\infty \!\int _{-\infty }^{+\infty }\!\! Z(\textbf{x}|a',b')\varpi (a',b')\,\textrm{d}a'\, \textrm{d}b'}, \end{aligned}$$
(7)

where \(\varpi (a,b)\) is the prior probability density of a and b and the integration is carried out over its support.

The model-averaged posterior of \({\varvec{\mu }}\) is

$$\begin{aligned} P({\varvec{\mu }}|\textbf{x}) = \int _0^\infty \!\!\int _{-\infty }^{+\infty }\!\! \Pi ({\varvec{\mu }}|\textbf{x},a,b) Q(a,b|\textbf{x})\, \textrm{d}a\, \textrm{d}b, \end{aligned}$$
(8)

which can also be obtained by marginalising (5) over the hyper-parameters. The model uncertainty can be embedded in the expected values of the measurands by

$$\begin{aligned} \mathbb {E}({\varvec{\mu }}|\textbf{x})= & {} \int _{\mathbb {R}^m}\!\! {\varvec{\mu }}P({\varvec{\mu }}|\textbf{x})\, \textrm{d}{\varvec{\mu }}\nonumber \\= & {} \int _0^\infty \!\!\int _{-\infty }^{+\infty }\!\! \mathbb {E}({\varvec{\mu }}|\textbf{x},a,b) Q(a,b|\textbf{x})\, \textrm{d}a\, \textrm{d}b, \end{aligned}$$
(9)

where

$$\begin{aligned} \mathbb {E}({\varvec{\mu }}|\textbf{x},a,b) = \int _{\mathbb {R}^m}\!\! {\varvec{\mu }}\, \Pi ({\varvec{\mu }}|\textbf{x},a,b)\, \textrm{d}{\varvec{\mu }}. \end{aligned}$$
(10)

4 Application to the Stein paradox

4.1 Posteriors of the instantaneous signals

By application of (5) and (6), the prior (3) results in a posteriori independent \(\mu _i\), each having marginal likelihood

$$\begin{aligned} Z(x_i|a,b) = \frac{ 1 }{\sqrt{2\pi (1+b^2)}} \exp \left[ -\frac{(x_i-a)^2}{2(1+b^2)} \right] \end{aligned}$$
(11)

and normal posterior

$$\begin{aligned} \Pi (\mu _i|x_i,a,b) = N(\mu _i|\overline{\mu _i}, \sigma _\mu ), \end{aligned}$$
(12)

where

$$\begin{aligned} \overline{\mu _i} = \mathbb {E}(\mu _i|x_i,a,b) = \frac{a+b^2x_i}{1+b^2} \end{aligned}$$
(13a)

and

$$\begin{aligned} \sigma _\mu ^2 = \textrm{Var}(\mu _i|x_i,a,b) = \frac{b^2}{1+b^2} \end{aligned}$$
(13b)

are the posterior mean and variance of \(\mu _i\). It is worth noting that if \(b\rightarrow \infty \) then \(\overline{\mu _i} = x_i\) and \(\sigma _\mu ^2 = 1\). The relevant integrations are given in the supplementary material.
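As a check of (13a) and (13b), the short Python sketch below (with illustrative values of \(x_i\), a, and b) computes the posterior of a single \(\mu _i\) by brute-force normalisation of \(N(x_i|\mu _i,1)\,N(\mu _i|a,b)\) on a grid and compares its mean and variance with the closed forms.

```python
# Numerical check of the posterior mean (13a) and variance (13b); illustrative inputs.
import numpy as np
from scipy.stats import norm

x_i, a, b = 1.7, 0.3, 2.0                       # hypothetical datum and hyper-parameters

mu = np.linspace(-15, 15, 200_001)
dmu = mu[1] - mu[0]
w = norm.pdf(x_i, loc=mu, scale=1.0) * norm.pdf(mu, loc=a, scale=b)
w /= w.sum() * dmu                              # normalised posterior density on the grid

mean_num = np.sum(mu * w) * dmu
var_num = np.sum((mu - mean_num)**2 * w) * dmu

mean_an = (a + b**2 * x_i) / (1 + b**2)         # (13a)
var_an = b**2 / (1 + b**2)                      # (13b)
print(mean_num, mean_an)                        # ~1.42 in both cases
print(var_num, var_an)                          # ~0.8 in both cases
```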

4.2 Hyper-prior

We assign prior probabilities to a and b to continue the analysis. The sampling distribution of \(\textbf{x}\) given a and b is (see the supplementary material)

$$\begin{aligned} Z(\textbf{x}|a,b) = \prod _{i=1}^m Z(x_i|a,b) = \frac{\exp \left\{ -\displaystyle \frac{m\big [s_x^2+(\overline{x}-a)^2\big ]}{2(1+b^2)} \right\} }{\sqrt{(2\pi )^m(1+b^2)^m}},\nonumber \\ \end{aligned}$$
(14)

where \(\overline{x}=\sum _{i=1}^m x_i/m\) and \(s_x^2=\sum _{i=1}^m (x_i-\overline{x})^2/m\) are the sample mean and (biased) sample variance, respectively.

Since each of the models (4) is uncertain, they must allow for comparisons offering evidence for, or against, their explaining the data. This requires proper prior distributions of the different parameterisations, complying with the change-of-variable rule. In the absence of measurable information, the normalised Jeffreys’ hyper-prior (see the supplementary material),

$$\begin{aligned} \varpi (a,b) = \frac{b}{V_a\sqrt{(1+b^2)^3}}, \end{aligned}$$
(15)

where \(b>0\) and \(V_a\) is the length of the domain of a, does the job. It is worth noting that (15) preserves the convergence of \(\theta ^2\) when \(m\rightarrow \infty \).

Fig. 1 Posterior probability densities \(Q(a,b|\overline{x},s_x^2)\) of the model \(\mathscr {M}_{ab}\) given the \(\{x_1,x_2,...\, x_m\}\) normal data. Top: \(m=1\). Bottom: \(m=20\) and \(s_x = 2\)

4.3 Model probabilities

By application of (7) with (14) and (15), the probability density of a and b explaining the data is (see the supplementary material)

$$\begin{aligned} Q(a,b|\overline{x},s_x^2)= & {} \frac{ \sqrt{2m}\,u_x^m b }{ \sqrt{\pi (1+b^2)^{m+3}}\, \Gamma (m/2,0,u_x^2) } \nonumber \\{} & {} \times \exp \left\{ -\frac{m\big [ s_x^2+(\overline{x}-a)^2 \big ]}{2(1+b^2)} \right\} , \end{aligned}$$
(16)

where \(u_x^2=ms_x^2/2\) and \(\Gamma (a,z_1,z_2)\) is the generalized incomplete gamma function [20]. Since \(V_a\) cancels in the calculation of \(Q(a,b|\overline{x},s_x^2)\) and, provided it is large enough, (7) can be approximated by extending the integration over a to the whole real line, there is no need to introduce additional undefined parameters.

Figure 1 shows (16) when \(m=1\) (top) and \(m=20\) and \(s_x = 2\) (bottom). It is worth noting that \(Q(a,b|\overline{x},s_x^2)\) depends only on the sample mean and biased variance; its mode is \(a=\overline{x}\) and, as \(m \rightarrow \infty \), \(b^2=s_x^2-1\) (see the supplementary material).
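The density (16) is straightforward to evaluate numerically; since \(\Gamma (a,0,z)\) is the lower incomplete gamma function, SciPy provides it as gammainc(a, z) * gamma(a). The sketch below (illustrative values \(m=20\), \(\overline{x}=0\), \(s_x=2\)) locates the mode of \(Q(a,b|\overline{x},s_x^2)\) on a grid; it falls at \(a=\overline{x}\) and, for large m, near \(b^2 \approx s_x^2-1\).

```python
# Evaluate Q(a, b | xbar, s2) of (16) and locate its mode (illustrative inputs).
import numpy as np
from scipy.special import gammainc, gamma

def Q(a, b, xbar, s2, m):
    u2 = m * s2 / 2
    g = gammainc(m / 2, u2) * gamma(m / 2)      # generalised incomplete gamma(m/2, 0, u2)
    norm_ = np.sqrt(2 * m / np.pi) * u2**(m / 2) / (g * (1 + b**2)**((m + 3) / 2))
    return norm_ * b * np.exp(-m * (s2 + (xbar - a)**2) / (2 * (1 + b**2)))

m, xbar, s2 = 20, 0.0, 4.0                      # hypothetical sample summaries
a = np.linspace(-2, 2, 401)[:, None]
b = np.linspace(0.01, 5, 500)[None, :]
q = Q(a, b, xbar, s2, m)

i, j = np.unravel_index(np.argmax(q), q.shape)
print("mode: a =", float(a[i, 0]), " b =", float(b[0, j]))   # a ~ xbar, b ~ 1.6
print("large-m prediction: b =", np.sqrt(s2 - 1))            # ~ 1.73
```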

4.4 Expectations of the instantaneous signals

4.4.1 \(m=1\) case

By application of (7) with (11) and (15), the probability density reduces to

$$\begin{aligned} Q(a,b|x_1) = \frac{b}{\sqrt{2\pi }(1+b^2)^2} \exp \left[ - \frac{(x_1-a)^2}{2(1+b^2)} \right] , \end{aligned}$$
(17)

which can also be obtained as the \(s_x^2 \rightarrow 0\) limit of (16) evaluated for \(m=1\) (see the supplementary material). The most supported model is indexed by \(a=x_1\) and \(b=1/\sqrt{3}\), whereas the \(b\rightarrow \infty \) model – corresponding to the uniform prior – is excluded.

According to (8), after averaging the \(\mu _1\) posterior (12) over the model probability (17),

$$\begin{aligned} \mu _1|x_1 \sim P(\mu _1|x_1)= & {} \int _0^\infty \int _{-\infty }^{+\infty } N(\mu _1|\overline{\mu _1}, \sigma _\mu ) Q(a,b|x_1)\, \textrm{d}a\, \textrm{d}b \nonumber \\= & {} N(\mu _1|x_1,\sigma =1), \end{aligned}$$
(18)

i.e., the distribution of \(\mu _1\) is a normal distribution having mean \(x_1\) and unit variance, and the distribution of \(\mu _1^2\) is a non-central \(\chi _1^2\) distribution having one degree of freedom and non-centrality parameter \(x_1^2\) (see the supplementary material). These are important and non-trivial results. They demonstrate that the hierarchical models (4) are consistent with the uniform prior (see Sect. 2.1) and that there is no hyper-prior effect on the posterior distributions of \(\mu _1\) and \(\mu _1^2\), as occurs in [4].
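The result (18) can also be checked by direct numerical integration of (12) against (17). A minimal Python sketch, with a hypothetical datum \(x_1\) and the double quadrature taken over the hyper-parameters (agreement is up to quadrature error):

```python
# Numerical check of (18): averaging the posterior (12) over (17) returns N(mu1 | x1, 1).
import numpy as np
from scipy.integrate import dblquad
from scipy.stats import norm

x1 = 0.8                                        # hypothetical datum

def integrand(a, b, mu1):
    mean = (a + b**2 * x1) / (1 + b**2)         # (13a)
    sd = np.sqrt(b**2 / (1 + b**2))             # (13b)
    q = b / (np.sqrt(2 * np.pi) * (1 + b**2)**2) \
        * np.exp(-(x1 - a)**2 / (2 * (1 + b**2)))   # (17)
    return norm.pdf(mu1, loc=mean, scale=sd) * q

for mu1 in (-1.0, 0.8, 2.5):
    p, _ = dblquad(integrand, 0, np.inf, -np.inf, np.inf, args=(mu1,))
    print(mu1, p, norm.pdf(mu1, loc=x1, scale=1.0))   # the two densities agree
```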

4.4.2 \(m \ge 2\) case

By substitution of (13a) and (16) into (9), the model-averaged value of \(\mathbb {E}(\mu _i|x_i,a,b)\) is (see the supplementary material)

$$\begin{aligned} \mathbb {E}(\mu _i|x_i,\overline{x},s_x^2) = \overline{x} + (1-R) (x_i-\overline{x}), \end{aligned}$$
(19a)

where

$$\begin{aligned} R = \frac{\Gamma (m/2+1,0,u_x^2) }{\Gamma (m/2,0,u_x^2)u_x^2}, \end{aligned}$$
(19b)

\(u_x^2=ms_x^2/2\) and \(\Gamma (a,z_1,z_2)\) is the generalised incomplete gamma function [20]. Given the belief encoded in the priors, this inference minimises the (Bayesian) quadratic risk. It belongs to the estimator class considered in [21], but it does not comply with the condition required to dominate (from a frequentist perspective) the James–Stein estimator. In this regard, we note that this paper is not about the dominance over the James–Stein estimators, but about highlighting and encoding the belief \(\theta ^2<\infty \).

If \(m=1\), then \(s_x = 0\) and \(\overline{x}=x_i\). In this case, the last term of (19a) vanishes and the mean is the observed value (see the \(m=1\) case). Also, since its value is irrelevant, we set \(R(m=1)\) conventionally to zero (incidentally, the \(s_x\rightarrow 0\) limit is 1/3). As shown in Fig. 2, the mean (19a) lies between \(x_i\) and the sample mean \(\overline{x}\). This behaviour follows from the mild assumption, encapsulated in the most supported data model, that the \(\mu _i\) share a common mean (see Sect. 2.3).
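A short Python sketch of (19a)-(19b); the data array is illustrative, and \(\Gamma (a,0,z)\) is again obtained from SciPy's regularised incomplete gamma as gammainc(a, z) * gamma(a):

```python
# Model-averaged posterior means (19a) with shrinkage factor R of (19b); illustrative data.
import numpy as np
from scipy.special import gammainc, gamma

def shrinkage_R(m, s2):
    u2 = m * s2 / 2
    num = gammainc(m / 2 + 1, u2) * gamma(m / 2 + 1)   # Gamma(m/2+1, 0, u2)
    den = gammainc(m / 2, u2) * gamma(m / 2)           # Gamma(m/2,   0, u2)
    return num / (den * u2)

x = np.array([1.2, -0.4, 0.9, 2.3, 0.1])               # hypothetical data, sigma = 1
m, xbar = x.size, x.mean()
s2 = np.mean((x - xbar)**2)                            # biased sample variance

R = shrinkage_R(m, s2)
mu_hat = xbar + (1 - R) * (x - xbar)                   # (19a)
print("R =", R)
print("E(mu_i | data):", mu_hat)                       # pulled towards xbar
```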

As shown in the supplementary material, when \(s_x^2 \ll 1\), (19a) is approximated by

$$\begin{aligned}{} & {} \mathbb {E}(\mu _i|x_i,\overline{x},s_x^2\ll 1) \approx \overline{x}\nonumber \\{} & {} \quad + \left( 1-\frac{m}{m+2-m^2s_x^2/2}\right) (x_i-\overline{x}), \end{aligned}$$
(20a)

which, as the sample size increases and \(s_x \rightarrow 0\), tends to \(\overline{x}\). When \(s_x^2 \gg 1\), (19a) is approximated by

$$\begin{aligned} \mathbb {E}(\mu _i|x_i,\overline{x},s_x^2\gg 1) \approx \overline{x} + \left( 1-\frac{1}{s_x^2}\right) (x_i-\overline{x})\qquad \end{aligned}$$
(20b)

and supports the \(x_i\) datum. For many observations, we obtain

$$\begin{aligned} \lim _{m\rightarrow \infty } \mathbb {E}(\mu _i|x_i,\overline{x},s_x^2) = \overline{x} + \left( 1-\frac{1}{s_x^2}\right) (x_i-\overline{x}).\qquad \end{aligned}$$
(20c)

As shown in the supplementary material, when \(m\rightarrow \infty \), it is certain that \(s_x^2 \ge 1\). These asymptotic expressions are consistent with the expectation that a sample variance larger than the data variance (which was set to one) supports a varying signal and a smaller one the opposite.

Fig. 2 Shrinking factor of the model-averaged posterior mean of \(\mu _i\) vs the sample standard deviation. Nine cases are considered, \(m=\) 1 (top line), 2, 4, 6, 8, 10, 12, 16, 20 (bottom line). The dashed line is the James–Stein estimate (21), when \(m=5\). The red line is the \(m\rightarrow \infty \) limit of both the model-averaged mean and James–Stein estimate; \(s_x <1\) is meaningless in this case, see the supplementary material

4.5 James–Stein estimate

Empirical Bayes methods set the a and b hyper-parameters in (13a) to specific values, see [1], instead of integrating them out. For instance, if in (13a) and following [8] we set a to its posterior mode \(\overline{x}\) and \(1+b^2\) to \(ms_x^2/(m-3)\), then \(\overline{\mu _i}\) reduces to the (positive) James–Stein estimate given in [9, 10],

$$\begin{aligned} \mu _i^\textrm{JS} = \left\{ \begin{array}{cc} \overline{x} + \left( 1-\frac{m-3}{ms_x^2}\right) (x_i-\overline{x}) &{} \textrm{if}\; s_x^2 \ge \frac{m-3}{m} \\ \overline{x} &{} \textrm{if}\; s_x^2 \le \frac{m-3}{m} \\ \end{array} \right. ,\qquad \end{aligned}$$
(21)

the first line of which is derived in the supplementary material and shown in Fig. 2 for \(m=5\). The replacement in the second line avoids pulling the estimate away from the \([x_i,\overline{x}]\) interval, see [10]. The reason for the \(1+b^2=ms_x^2/(m-3)\) choice resides in the fact that \((m-3)/(ms_x^2)\) is an unbiased estimator of \(1/(1+b^2)\), see [8].

In Sect. 4.3 we have shown that, as \(m\rightarrow \infty \), the \(b^2\) mode is \(s_x^2-1\), see (16). By using this value in (13a), we obtain (see the supplementary material) the same asymptotic limit of (21).
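For comparison, a compact Python sketch of the positive-part James–Stein estimate (21); the data array is illustrative:

```python
# Positive-part James-Stein estimate (21); illustrative data with sigma = 1.
import numpy as np

def james_stein(x):
    m, xbar = x.size, x.mean()
    s2 = np.mean((x - xbar)**2)                 # biased sample variance
    if s2 <= (m - 3) / m:                       # second line of (21)
        return np.full(m, xbar)
    return xbar + (1 - (m - 3) / (m * s2)) * (x - xbar)

x = np.array([1.2, -0.4, 0.9, 2.3, 0.1])        # hypothetical measured values
print(james_stein(x))
```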

4.6 Expectations of the instantaneous powers

From (12), it follows that the normalized powers \((\mu _i/\sigma _\mu )^2\) are independent non-central \(\chi _1^2\) variables having one degree of freedom and non-centrality parameter \(\lambda _i=(\overline{\mu _i}/\sigma _\mu )^2\), where \(\overline{\mu _i}\) and \(\sigma _\mu ^2\) are given by (13a) and (13b), respectively. Hence,

$$\begin{aligned} \frac{\mu _i^2}{\sigma _\mu ^2} \big | x_i,a,b \sim \chi _1^2(\mu _i^2 / \sigma _\mu ^2 | \lambda _i). \end{aligned}$$
(22)

By taking the mean of the non-central \(\chi _1^2\) distribution and the \(\mu _i/\sigma _\mu \) normalisation into account, the posterior mean of the \(\mu _i^2\) power is

$$\begin{aligned} \mathbb {E}(\mu _i^2|x_i,a,b)= & {} \sigma _\mu ^2 + \overline{\mu _i}^2 \nonumber \\= & {} \frac{a^2 + b^2 (1 + 2 a x_i) + b^4 (1 + x_i^2)}{(1+b^2)^2}\qquad \end{aligned}$$
(23)

and, by application of (9) to (23), its model-averaged value is

$$\begin{aligned} \mathbb {E}(\mu _i^2|x_i,\overline{x},s_x^2) = x_i^2 + 1 + S, \end{aligned}$$
(24a)

where (see the supplementary material)

$$\begin{aligned} S= & {} \frac{1}{ \Gamma (m/2,0,u_x^2)u_x^4 } \bigg \{ \Gamma (2+m/2,0,u_x^2)(\overline{x}-x_i)^2 \nonumber \\{} & {} + \Gamma (1+m/2,0,u_x^2)[1/m - 1 + 2x_i(\overline{x} - x_i)]u_x^2 \bigg \},\nonumber \\ \end{aligned}$$
(24b)

\(u_x^2=ms_x^2/2\), and \(\Gamma (a,z_1,z_2)\) is the generalised incomplete gamma function [20].
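Evaluating (24a)-(24b) in Python follows the same pattern as above; a minimal sketch with illustrative data:

```python
# Model-averaged posterior means of mu_i^2, eqs. (24a)-(24b); illustrative data.
import numpy as np
from scipy.special import gammainc, gamma

def G(a, z):                                    # generalised incomplete gamma(a, 0, z)
    return gammainc(a, z) * gamma(a)

x = np.array([1.2, -0.4, 0.9, 2.3, 0.1])        # hypothetical data, sigma = 1
m, xbar = x.size, x.mean()
s2 = np.mean((x - xbar)**2)
u2 = m * s2 / 2

S = (G(2 + m / 2, u2) * (xbar - x)**2
     + G(1 + m / 2, u2) * (1 / m - 1 + 2 * x * (xbar - x)) * u2) / (G(m / 2, u2) * u2**2)
E_mu2 = x**2 + 1 + S                            # (24a)
print(E_mu2)
```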

Fig. 3 Offset of the model-averaged posterior mean of \(\mu _i^2\) vs the sample standard deviation, when \(x_i=\overline{x}\). Nine cases are considered, \(m=\) 1 (top line), 2, 4, 6, 8, 10, 12, 16, 20 (bottom line). The red line is the \(m\rightarrow \infty \) limit; \(s_x <1\) is meaningless in this case, see the supplementary material

The asymptotic means of the \(\mu _i^2\) powers are derived in the supplementary material. When \(s_x^2 \ll 1\) the data support equal \(\mu _i\),

$$\begin{aligned} \mathbb {E}(\mu _i^2|x_i,\overline{x},s_x^2\ll 1) \approx \overline{x}^2 + \frac{6-m^2 s_x^2}{2(m+2)-m^2 s_x^2},\qquad \end{aligned}$$
(25a)

where we used \(x_i \rightarrow \overline{x}\), and the mean shrinks to \(\overline{x}^2\). If \(m=1\), then \(s_x=0\) and the mean is \(x_1^2+1\). A large sample variance supports different \(\mu _i\) and

$$\begin{aligned} \mathbb {E}(\mu _i^2|x_i,\overline{x},s_x^2\gg 1) \approx x_i^2 + 1 - \frac{m-2m(\overline{x}-x_i)x_i-1}{m s_x^2}.\nonumber \\ \end{aligned}$$
(25b)

Finally, when m tends to infinity,

$$\begin{aligned}{} & {} \lim _{m\rightarrow \infty } \mathbb {E}(\mu _i^2|x_i,\overline{x},s_x^2) = x_i^2\nonumber \\{} & {} \quad + 1 + \frac{(\overline{x}-x_i)^2}{s_x^4} - \frac{1-2(\overline{x}-x_i)x_i}{s_x^2}, \end{aligned}$$
(25c)

where \(s_x^2 \ge 1\), see the supplementary material. Since \(x_i\), \(\overline{x}\), and \(s_x^2\) are not independent, a general graphical display of (24a) is impossible. To make it feasible, in Fig. 3, we considered the \(x_i=\overline{x}\) case.

4.7 Expectation of the mean power

Let us turn our attention to the mean power \(\theta ^2=|{\varvec{\mu }}|^2/m\). From (12), \(m\theta ^2/\sigma _\mu ^2\) is a non-central \(\chi _m^2\) variable having m degrees of freedom and non-centrality parameter \(\lambda =\sum (\overline{\mu _i}^2/\sigma _\mu ^2)\), where \(\overline{\mu _i}\) and \(\sigma _\mu ^2\) are given by (13a) and (13b), respectively. Hence, from (23), the expectation and variance of the mean power are (see the supplementary material)

$$\begin{aligned}{} & {} \mathbb {E}(\theta ^2|\overline{x},s_x^2,a,b) = \frac{(m+\lambda )\sigma ^2_\mu }{m} \nonumber \\{} & {} \quad = \frac{ a^2 + b^2(1+2a\overline{x}) + b^4(1+s_x^2+\overline{x}^2) }{(1+b^2)^2}\nonumber \\ \end{aligned}$$
(26a)

and

$$\begin{aligned}{} & {} \mathrm{{Var}}(\theta ^2|\overline{x},s_x^2,a,b) = \frac{2(m+2\lambda )\sigma ^4_\mu }{m^2} \nonumber \\{} & {} = \frac{ 2b^2\big [2a^2 + b^2(1+4a\overline{x}) + b^4(1+2s_x^2+2\overline{x}^2)\big ] }{m(1+b^2)^3},\nonumber \\ \end{aligned}$$
(26b)

Averaging \( \mathbb {E}(\theta ^2|\overline{x},s_x^2,a,b)\) over the models via (9), we obtain (see the supplementary material)

$$\begin{aligned} \mathbb {E}(\theta ^2|\overline{x^2},s_x^2) = \overline{x^2} - 1 + T, \end{aligned}$$
(27a)

where \(\overline{x^2} = |\textbf{x}|^2/m\) is the sample mean power,

$$\begin{aligned} T = \frac{3m\Gamma (m/2,0,u_x^2) + 2(2u_x^2-3) u_x^m \textrm{e}^{-u_x^2}}{ 2mu_x^2 \Gamma (m/2,0,u_x^2) },\qquad \end{aligned}$$
(27b)

\(u_x^2=ms_x^2/2\), and \(\Gamma (a,z_1,z_2)\) is the generalised incomplete gamma function [20]. This inference, which minimises the (Bayesian) quadratic risk, belongs to estimator classes previously considered by [7, 17, 18, 23, 24, 25].
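Likewise, the model-averaged mean power (27a)-(27b) can be computed as in the sketch below (illustrative data):

```python
# Model-averaged expectation of the mean power, eqs. (27a)-(27b); illustrative data.
import numpy as np
from scipy.special import gammainc, gamma

x = np.array([1.2, -0.4, 0.9, 2.3, 0.1])        # hypothetical data, sigma = 1
m = x.size
x2bar = np.mean(x**2)
s2 = np.mean((x - x.mean())**2)
u2 = m * s2 / 2

g = gammainc(m / 2, u2) * gamma(m / 2)          # generalised incomplete gamma(m/2, 0, u2)
T = (3 * m * g + 2 * (2 * u2 - 3) * u2**(m / 2) * np.exp(-u2)) / (2 * m * u2 * g)
print("E(theta^2 | data) =", x2bar - 1 + T)     # (27a)
print("frequentist x2bar - 1 =", x2bar - 1)
```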

Fig. 4 Offset of the model-averaged expectation of \(\theta ^2\) vs the sample standard deviation. Six cases were considered, \(m=\) 1 (top line), 2, 6, 12, 24, 48 (bottom line). The red line is the \(m\rightarrow \infty \) limit; \(s_x <1\) is meaningless in this case, see the supplementary material

As shown in Fig. 4, when \(s_x^2 \ll 1\) the data support \(\mu _i =\) const. and

$$\begin{aligned}{} & {} \mathbb {E}(\theta ^2|\overline{x^2},s_x^2\ll 1) \approx \overline{x^2} - 1\nonumber \\{} & {} + \frac{m+5}{m+2} - \frac{m(14+6m+m^2)s_x^2}{(m+2)^2(m+4)}. \end{aligned}$$
(28a)

If \(m=1\), then \(s_x=0\) and the power of this datum is again \(x_1^2+1\). When \(s_x^2 \gg 1\), the data support a varying signal and

$$\begin{aligned} \mathbb {E}(\theta ^2|\overline{x^2},s_x^2\gg 1) \approx \overline{x^2} - 1 + \frac{3}{ms_x^2}. \end{aligned}$$
(28b)

Finally, it is non-obvious and remarkable that, as \(m\rightarrow \infty \), the expectation of the mean power converges to the frequentist estimate \(\overline{x^2}-1\), see (2a). In fact,

$$\begin{aligned} \mathbb {E}(\theta ^2|\overline{x^2},s_x^2, m\gg 1) \approx \overline{x^2}-1 + \frac{3}{ms_x^2}, \end{aligned}$$
(28c)

where \(s_x^2 \ge 1\). These asymptotic expressions are derived in the supplementary material.

Fig. 5 Differences of the measured values of the Newtonian constant of gravitation G, the Planck constant h, and the Boltzmann constant k given in [22] and [29] from their weighted means, \(G_0\), \(h_0\), and \(k_0\), respectively

5 Application examples

According to Bayes' theorem, the posterior probability of a model is proportional to the marginal likelihood of its parameters based on the data. However, if the parameter prior density is improper, the marginal likelihood cannot be determined. In fact, when a probability density is non-integrable, it is defined only up to an arbitrary scale factor, which means that the marginal likelihood depends on the chosen value of this factor. This is the case of Jeffreys' uniform prior over the reals for the mean of Gaussian data.

The problem is avoided by the prior (3), which has been proved to produce sound posteriors for the data mean and power while avoiding inconsistencies and which, contrary to the uniform one, is proper and encodes a finite measurand value.

To give examples, we considered the measured values of the Newtonian constant of gravitation G, the Planck constant h, and the Boltzmann constant k given in [22] and [29]. These measured values have been used by the CODATA Task Group on Fundamental Physical Constants to determine mutually consistent values for use in science and technology [29]. Their differences from the weighted mean are shown in Fig. 5.

These examples have been selected to represent the cases where a visual inspection of the data suggests disagreement (G values), agreement (k values), or an uncertain judgement (h values). Where the data are mutually inconsistent, most probably they reflect systematic errors. Still, it is possible that they point to unknown subtleties – perhaps the constant's value depends on how it is measured.

The objective of a Bayesian equal-mean test is to quantify these qualitative judgments by assigning them probabilities. Therefore, we compare the hypothesis \(H_0\) that the measured values are sampled from Gaussian distributions having the same mean against the hypothesis \(H_1\) that they are sampled from Gaussian distributions whose means might differ. Assuming the same 50% prior probability for the two data models, their posterior probabilities are

$$\begin{aligned} \mathrm{{Prob}}(H_n|\textbf{x}) = \frac{Z(\textbf{x}|H_n)}{Z(\textbf{x}|H_0)+Z(\textbf{x}|H_1)}, \end{aligned}$$
(29)

where \(Z(\textbf{x}|H_n)\) is the marginal likelihood of the n-th model parameters.

Calculating the \(Z(\textbf{x}|H_1)\) marginal likelihood in the simplest way, by resting on the previous results, requires equal and unit variances of the input data. The unequal-variance case makes the algebra cumbersome without adding anything conceptually new. To comply with the unit-variance constraint, we consider the normalised differences \((x_i - x_0)/u_i\) of the measured values from their weighted mean \(x_0\), where \(u_i^2\) is the variance of the i-th datum. However, these scaled data only have the same mean if it equals \(x_0\). Therefore, we must restrict \(H_0\) to this case and, to take the \(x_0\) variance, \(\sigma _0^2\), into account, increase the data variances to \(\sigma _i^2 = u_i^2 + \sigma _0^2\).

5.1 \(H_0\) hypothesis

Let us consider the normalised differences \((x_i - x_0)/\sigma _i\) of the measured values from their weighted mean \(x_0\), where \(\sigma _i^2 = u_i^2 + \sigma _0^2\) is the sum of the variances of the i-th datum and of the mean, \(u_i^2\) and \(\sigma ^2_0\), respectively. If each normalised difference is independently sampled from the same Gaussian distribution having zero mean and unit variance, their joint distribution is

$$\begin{aligned} L(\textbf{x}|H_0) = \prod _{i=1}^{m} N(x_i|\mu =0,\sigma =1). \end{aligned}$$
(30)

Since the distribution (30) is free of parameters, the marginal likelihood coincides with it. Hence,

$$\begin{aligned} Z(\textbf{x}|H_0) = L(\textbf{x}|H_0) = \frac{\exp \left( -\chi ^2\big /2\right) \exp \left[ -m\overline{x}^2\big /2 \right] }{\sqrt{(2\pi )^m}},\nonumber \\ \end{aligned}$$
(31)

where \(\overline{x}\) is the arithmetic mean of the normalised data and \(\chi ^2\) is the sum of the squared residuals.

5.2 \(H_1\) hypothesis

Conversely, if the measured values are independently sampled from Gaussian distributions having (or not having) different means and standard deviations, the likelihood of the scaled data \(x_i \rightarrow (x_i - x_0)\big /u_i\) is

$$\begin{aligned} L(\textbf{x}|{\varvec{\mu }},\sigma =1) = \prod _{i=1}^m \frac{\exp \left[ -(x_i-\mu _i)^2\big /2 \right] }{\sqrt{2\pi }}, \end{aligned}$$
(32)

where \(\mu _i\) is the scaled mean. By using the prior (3),

$$\begin{aligned} \pi ({\varvec{\mu }}|a,b) = \prod _{i=1}^m \frac{\exp \big [-(\mu _i-a)^2\big /(2b^2)\big ]}{\sqrt{2\pi }\,b}, \end{aligned}$$
(33)

the marginal likelihood \(Z(\textbf{x}|a,b)\) of the scaled data is given by (14).

To determine the most probable model in the family (4), we look for the values of the hyper-parameters a and b maximising their posterior density, \(Q(a,b|\textbf{x})\), which is given by (16). They are \(a_0=\overline{x}\) and \(b_0 = \textrm{argmax}\big [ Q(a=\overline{x},b|\textbf{x}) \big ]\), which must be found numerically. Hence,

$$\begin{aligned} Z(\textbf{x}|H_1) = \frac{\exp \left\{ -\displaystyle \frac{m s_x^2}{2(1+b_0^2)} \right\} }{\sqrt{(2\pi )^m(1+b_0^2)^m}}. \end{aligned}$$
(34)
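The whole test condenses into a few lines of Python. The sketch below uses hypothetical measured values and standard uncertainties (not the G, h, or k data of Sect. 5.3); it normalises the data as described above, computes (31) and (34), with \(b_0\) found numerically, and evaluates (29).

```python
# Bayesian equal-mean test, eqs. (29), (31), (34); hypothetical data (value, uncertainty).
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([10.02, 9.97, 10.10, 9.93])        # hypothetical measured values
u = np.array([0.04, 0.05, 0.03, 0.06])          # hypothetical standard uncertainties

w = 1 / u**2
x0 = np.sum(w * x) / np.sum(w)                  # weighted mean
s0_2 = 1 / np.sum(w)                            # its variance
m = x.size

# H0: normalised differences with inflated variances u_i^2 + sigma_0^2, eq. (31)
z = (x - x0) / np.sqrt(u**2 + s0_2)
Z0 = np.exp(-0.5 * np.sum(z**2)) / (2 * np.pi)**(m / 2)

# H1: data scaled by u_i; find b0 maximising Q(a = ybar, b | data), then eq. (34)
y = (x - x0) / u
ybar, s2 = y.mean(), np.mean((y - y.mean())**2)

def neg_log_q(b):                               # -log of (16) at a = ybar, up to constants
    v = 1 + b**2
    return -(np.log(b) - (m + 3) / 2 * np.log(v) - m * s2 / (2 * v))

b0 = minimize_scalar(neg_log_q, bounds=(1e-6, 1e3), method="bounded").x
Z1 = np.exp(-m * s2 / (2 * (1 + b0**2))) / ((2 * np.pi)**(m / 2) * (1 + b0**2)**(m / 2))

p0 = Z0 / (Z0 + Z1)
print(f"b0 = {b0:.3f}  Prob(H0|x) = {p0:.3f}  Prob(H1|x) = {1 - p0:.3f}")
```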

5.3 Results

The calculations relevant to this analysis are available in the supplementary material. The results are summarised in Table 1. The posterior probabilities confirm our expectations regarding the measured G and k values and resolve the uncertainty for the h values.

The probabilities of the \(H_0\) models are smaller than expected, which may be because we assumed not only a common mean but also that it equals the weighted mean of the measured values. In addition, \(H_1\) does not exclude that the data are sampled from distributions with the same mean. It is worth noting that, had we assumed a uniform prior for the data means, the Bayesian test of equal means would have been impossible.

Table 1 Posterior probabilities of the \(H_0\) and \(H_1\) data models

6 Conclusion

Given measurement results affected by additive uncorrelated Gaussian errors, we investigated the Bayesian inferences of the data means, of the squares of the individual means, and of the average of the squared means. The result is a new way to cope with the inconsistency originating from the use of a uniform prior, which occurs because the uniform prior – contrary to what was intended and to the belief that the data power is finite – encodes that the data power is infinite.

To minimise the difference (expressed by the Kullback–Leibler divergence) from the uniform distribution, we encoded the measurands' indistinguishability and the belief in finite measurand values in a normal prior hyper-parameterised by its mean and variance. Averaging over the unknown hyper-parameters, or letting the data choose the most supported ones, removes the shortcomings of the uniform distribution.

In the case of a single datum, the inferred measurand is not biased to the smallest value, as occurs in [4], but it is the measurement result itself. With more than one datum, we derived a James–Stein estimate of every single measurand consistent with the stated belief. This result was obtained without the use of empirical methods as in [8]. We showed that, as the sample size grows, the inference of the mean power is consistent and converges to the frequentist estimate.

After proving that it produces sound posteriors for the data mean and power while avoiding inconsistencies, we applied the hyper-parameterised normal prior to determining whether the measured values of the Newtonian constant of gravitation came from populations with the same mean or not. We repeated the test using the results of the measurements of the Planck and Boltzmann constants. If we had used an improper prior for the data mean, this Bayesian test would have been impossible.