As we saw in Chap. 2, the frequentist paradigm is well suited for risk evaluations, but is less useful for estimator construction. It turns out that the Bayesian approach is complementary, as it is well suited for the construction of possibly optimal estimators. In this chapter we take a Bayesian view of minimax shrinkage estimation. In Sect. 3.1 we derive a general sufficient condition for minimaxity of Bayes and generalized Bayes estimators in the known variance case, and we illustrate the theory with numerous examples. In Sect. 3.2 we extend these results to the case where the variance is unknown. Section 3.3 considers the case of a known covariance matrix under a general quadratic loss. The admissibility of Bayes estimators is discussed in Sect. 3.4. Interesting connections to MAP estimation, penalized likelihood methods, and shrinkage estimation are developed in Sect. 3.5. The fascinating connections between Stein estimation and estimation of a predictive density under Kullback-Leibler divergence are outlined in Sect. 3.6.

3.1 Bayes Minimax Estimators

In this section, we derive a general sufficient condition, due to Stein (1973, 1981), for minimaxity of Bayes and generalized Bayes estimators when \(X \sim \mathcal {N}_{p}(\theta , \sigma ^{2} I_{p})\), with known σ², and the loss function is ‖δ − θ‖². The condition depends only on the marginal distribution and states that a generalized Bayes estimator is minimax if the square root of the marginal distribution is superharmonic. Alternative (stronger) sufficient conditions are that the prior distribution or the marginal distribution is superharmonic. We establish these results in Sect. 3.1.1 and apply them in Sects. 3.1.2 and 3.1.3 to obtain classes of prior distributions which lead to minimax (generalized and proper) Bayes estimators. Section 3.1.4 is devoted to minimax multiple shrinkage estimators.

Throughout this section, let \(X \sim \mathcal {N}_{p}(\theta , \sigma ^{2} I_{p})\) (with σ² known) and let the loss be L(θ, δ) = ‖δ − θ‖². Let θ have the (generalized) prior distribution π and let the marginal density, m(x), of X be

$$\displaystyle \begin{aligned} m(x) = K \int_{\mathbb{R}^p} e^{- \frac{\| x - \theta \|{}^{2}}{2 \; \sigma^{2}}} \,d \pi(\theta). \end{aligned} $$
(3.1)

Recall from Sect. 1.4 that the Bayes estimator corresponding to π(θ) is given by

$$\displaystyle \begin{aligned} \delta_{\pi}(X) = X + \sigma^{2} \frac{\nabla m(X)}{m(X)}. \end{aligned} $$
(3.2)

Since the constant K in (3.1) plays no role in (3.2), we will typically take it to be equal to 1 for simplicity. It may happen that an estimator has the form (3.2) where m(X) does not correspond to a true marginal distribution. In this case we will refer to such an estimator as a pseudo-Bayes estimator, provided x ↦ ∇m(x)∕m(x) is weakly differentiable. Recall that, if δ π(X) is generalized Bayes, x ↦ m(x) is a positive analytic function and so x ↦ ∇m(x)∕m(x) is automatically weakly differentiable.
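The representation (3.2) is easy to check numerically. The following Python sketch (an illustration only; the conjugate prior N_p(0, τ² I_p), the constants, and the function name are arbitrary choices) approximates the marginal by a Monte Carlo mixture over draws from the prior, for which X + σ²∇m(X)∕m(X) has a closed form, and compares the result with the exact posterior mean τ²X∕(σ² + τ²) of the conjugate case.

    import numpy as np

    rng = np.random.default_rng(0)
    p, sigma2, tau2 = 3, 1.0, 2.0                 # dimension, sampling variance, prior variance
    theta_draws = rng.normal(scale=np.sqrt(tau2), size=(200_000, p))   # theta ~ N_p(0, tau2 I_p)

    def delta_pi(x):
        # m is approximated by the mixture (1/N) sum_i N(x; theta_i, sigma2 I_p); for this
        # mixture, sigma2 * grad m(x) / m(x) = sum_i w_i (theta_i - x), with weights
        # w_i proportional to exp(-||x - theta_i||^2 / (2 sigma2)).
        logw = -np.sum((x - theta_draws) ** 2, axis=1) / (2 * sigma2)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        return x + w @ (theta_draws - x)          # X + sigma2 grad m(X) / m(X), cf. (3.2)

    x = np.array([2.0, -1.0, 0.5])
    print(delta_pi(x))                            # Monte Carlo approximation of the Bayes estimator
    print(tau2 / (sigma2 + tau2) * x)             # exact posterior mean for this conjugate prior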

3.1.1 A Sufficient Condition for Minimaxity of (Proper, Generalized, and Pseudo) Bayes Estimators

Stein (1973, 1981) gave the following sufficient condition for a generalized Bayes estimator to be minimax . This condition relies on the superharmonicity of the square root of the marginal. Recall from Corollary A.2 in Appendix A.8.3 that a function f from \(\mathbb {R}^p\) into \(\mathbb {R}\) which is twice weakly differentiable and lower semicontinuous is superharmonic if and only if, for almost every \(x \in \mathbb {R}^p\), we have Δf(x) ≤ 0, where Δf is the weak Laplacian of f. Note that, if the function f is analytic, the last inequality holds for any \(x \in \mathbb {R}^p\).

Theorem 3.1

Under the model of this section, an estimator of the form (3.2) has finite risk if E_θ[‖∇m(X)∕m(X)‖²] < ∞ and is minimax provided \(x \mapsto \sqrt {m(x)}\) is superharmonic (i.e., \(\varDelta \sqrt {m(x)} \le 0\) , for any \(x \in \mathbb {R}^p\) ).

Proof

First, note that, as noticed in Example 1.1, the marginal m is a positive analytic function, and so is \(\sqrt {m}\).

Using Corollary 2.1 and the fact that δ π(X) = X + σ² g(X) with g(X) = ∇m(X)∕m(X), the estimator δ π(X) has finite risk if E_θ[‖∇m(X)∕m(X)‖²] < ∞. Also, it is minimax provided, for almost any \(x \in \mathbb {R}^p\),

$$\displaystyle \begin{aligned} {\mathcal D}(x) = \frac{\| \nabla m(x) \|{}^{2}}{m^{2}(x)} + 2 \, {\mathrm{div}} \frac{\nabla m(x)}{m(x)} \leq 0 \, . \end{aligned}$$

Now, for any \(x \in \mathbb {R}^p\),

$$\displaystyle \begin{aligned} {\mathcal D}(x) = \frac{\| \nabla m(x) \|{}^{2}}{m^{2}(x)} + 2 \, \frac{m(x) \, \varDelta m(x) - \|\nabla m(x) \|{}^{2}}{m^{2}(x)} \end{aligned}$$

where

$$\displaystyle \begin{aligned} \varDelta m(x) = \sum^{p}_{i=1} \frac{\partial^{2}}{\partial x^{2}_{i}} m(x) \end{aligned}$$

is the Laplacian of m(x). Hence, by straightforward calculation,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathcal D}(x) &\displaystyle =&\displaystyle \frac{2 \, m(x) \, \varDelta m(x) - \| \nabla m(x) \|{}^{2}}{m^{2}(x)} \\ &\displaystyle =&\displaystyle 4 \, \frac{\varDelta \sqrt{m(x)}}{\sqrt{m(x)}} \, . \end{array} \end{aligned} $$
(3.3)

Therefore \({\mathcal D}(x) \leq 0\) since \(x \mapsto \sqrt {m(x)}\) is superharmonic. □

It is convenient to assemble the following results for the case of spherically symmetric marginals. The proof is straightforward and left to the reader.

Corollary 3.1

Assume the prior density π(θ) is spherically symmetric around 0 (i.e., π(θ) = π(‖θ‖²)). Then

  1. (1)

    the marginal density m of X is spherically symmetric around 0 (i.e., m(x) = m(‖x‖²), for any \(x \in \mathbb {R}^p\) );

  2. (2)

    the Bayes estimator equals

    $$\displaystyle \begin{aligned}\delta_\pi(X) = X + 2 \, \sigma^2\, \frac{m^\prime (\Vert X\Vert^2)}{m(\Vert X\Vert^2)} \, X \end{aligned}$$

    and has the form of a Baranchik estimator (2.19) with

    $$\displaystyle \begin{aligned}a\, r(t) = - 2\, \frac{m^\prime (t)}{m(t)}\, t \qquad \forall t \geq 0 \, ; \end{aligned}$$
  3. (3)

    the unbiased estimator of the risk difference between δ π(X) and X is given by

    $$\displaystyle \begin{aligned}{\mathcal D}(X) = 4 \, \sigma^4 \, \left\{ p\, \frac{m^\prime (\|X\|{}^2)}{m(\|X\|{}^2)} + 2 \, \|X\|{}^2\, \frac{m^{\prime \prime}(\|X\|{}^2)}{m(\|X\|{}^2)} - \|X\|{}^2 \left(\frac{m^\prime(\|X\|{}^2)}{m(\|X\|{}^2)}\right)^2\right\} . \end{aligned}$$

While, in Theorem 3.1, minimaxity of δ π(X) follows from the superharmonicity of \(\sqrt {m(X)}\), it is worth noting that, in the setting of Corollary 3.1, it can be obtained from the concavity of \(t \mapsto m^{1/2}(t^{2/(2-p)})\).

The following corollary is often useful. It shows that \(\sqrt {m(X)}\) is superharmonic if m(X) is superharmonic, which in turn follows if the prior density π(θ) is superharmonic.

Corollary 3.2

  1. (1)

    A finite risk (generalized, proper, or pseudo) Bayes estimator of the form (3.2) is minimax provided the marginal m is superharmonic (i.e. Δm(x) ≤ 0, for any \(x \in \mathbb {R}^p\) ).

  2. (2)

    If the prior distribution has a density, π, which is superharmonic, then a finite risk generalized or proper Bayes estimator of the form (3.2) is minimax.

Proof

Part (1) follows from the first equality in (3.3), which shows that superharmonicity of m implies superharmonicity of \(\sqrt {m}\). Indeed, the superharmonicity of m implies the superharmonicity of any nondecreasing concave function of m.

Part (2) follows since, for any \(x\in \mathbb {R}^p\),

$$\displaystyle \begin{aligned} \begin{array}{rcl} \varDelta_{x} m(x) &\displaystyle = &\displaystyle \varDelta_{x} \int_{\mathbb{R}^p} \exp \left( - \frac{1}{2 \, \sigma^{2}} \| x - \theta \|{}^{2} \right) \pi(\theta) \, d \theta \\ {} &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} \varDelta_{x} \exp \left( - \frac{1}{2 \, \sigma^{2}} \| x - \theta \|{}^{2} \right) \pi(\theta) \, d \theta \\ {} &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} \varDelta_{\theta} \exp \left( - \frac{1}{2 \, \sigma^{2}} \| x - \theta \|{}^{2} \right) \pi(\theta) \, d \theta \\ {} &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} \exp \left( - \frac{1}{2 \, \sigma^{2}} \| x - \theta \|{}^{2} \right) \varDelta_{\theta} \pi(\theta) \, d \theta \end{array} \end{aligned} $$

where the second equality follows from exponential family properties and the last equality is Green’s formula (see also Sect. A.9). More generally, any mixture of superharmonic functions is superharmonic (Sect. A.8). □

Note that the condition of finiteness of risk is superfluous for proper Bayes estimators since the Bayes risk is bounded above by p σ 2, and Fubini’s theorem assures that the risk function is finite a.e. (π). Continuity of the risk function implies finiteness for all θ in the convex hull of the support of π (see Berger (1985a) and Lehmann and Casella (1998) for more discussion on finiteness and continuity of risk).

As an example of a pseudo-Bayes estimator, consider m(X) of the form

$$\displaystyle \begin{aligned} m(X) = \frac{1}{(\| X \|{}^{2})^{b}} \, . \end{aligned}$$

The case b = 0 corresponds to m(X) = 1 which is the marginal corresponding to the “uniform” generalized prior distribution π(θ) ≡ 1, which in turn corresponds to the generalized Bayes estimator δ 0(X) = X. If b > 0, m(X) is unbounded in a neighborhood of 0 and consequently is not analytic. Thus, m(X) cannot be a true marginal (for any generalized prior). However,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \nabla m(X) = \frac{-2 \, b}{(\| X \|{}^{2})^{b+1}} \, X \end{array} \end{aligned} $$

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\nabla m(X)}{m(X)} = \frac{-2 \, b}{\| X \|{}^{2}} \, X, \end{array} \end{aligned} $$

which is weakly differentiable if p ≥ 3 (see Sect. 2.3). Hence, for p ≥ 3, the James-Stein estimator

$$\displaystyle \begin{aligned} \delta^{JS}_{2b}(X) = \left(1 - \frac{2 \, b \, \sigma^{2}}{\| X \|{}^{2}} \right) X \end{aligned}$$

is a pseudo-Bayes estimator. Also, a simple calculation gives

$$\displaystyle \begin{aligned} \varDelta m(X) = \frac{(-2 \, b) [p-2 \, (b+1)]}{(\| X \|{}^{2})^{b+1}}. \end{aligned}$$

It follows that m(X) is superharmonic for 0 ≤ b ≤ (p − 2)∕2 and similarly that \(\sqrt {m(X)}\) is superharmonic for 0 ≤ b ≤ p − 2. An application of Theorem 3.1 gives minimaxity for 0 ≤ b ≤ p − 2 which agrees with Theorem 2.2 (with a = 2b), while an application of Corollary 3.2 establishes minimaxity for only half of the interval, i.e. 0 ≤ b ≤ (p − 2)∕2. Thus, while useful, the corollary is considerably weaker than the theorem.
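These ranges are easy to check numerically. The sketch below (an illustration with the arbitrary choice p = 6) computes a finite-difference Laplacian of m(x) = (‖x‖²)^{−b} and of its square root at a generic point, and compares the former with the closed form Δm(x) displayed above.

    import numpy as np

    def laplacian(f, x, h=1e-3):
        # central finite-difference approximation of the Laplacian of f at x
        p = len(x)
        total = 0.0
        for i in range(p):
            e = np.zeros(p)
            e[i] = h
            total += (f(x + e) - 2.0 * f(x) + f(x - e)) / h ** 2
        return total

    p = 6                                          # so (p - 2)/2 = 2 and p - 2 = 4
    x0 = np.full(p, 1.0)                           # a generic point away from the origin
    for b in (1.0, 3.0, 5.0):
        m = lambda x, b=b: np.sum(x ** 2) ** (-b)
        root_m = lambda x, b=b: np.sum(x ** 2) ** (-b / 2)
        closed_form = -2 * b * (p - 2 * (b + 1)) * np.sum(x0 ** 2) ** (-(b + 1))
        # Delta m <= 0 only for b <= (p - 2)/2, while Delta sqrt(m) <= 0 for b <= p - 2
        print(b, laplacian(m, x0), closed_form, laplacian(root_m, x0))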

Another interesting aspect of this example relates to the existence of proper Bayes minimax estimators for p ≥ 5. Considering the behavior of m(x) for ∥x∥≥ R for some positive R, note that

$$\displaystyle \begin{aligned} \int_{\| x \| \geq R} \,m(x)\, dx = \int_{\| x \| \geq R} \frac{1}{(\| x \|{}^{2})^{b}} \,dx \propto \int^{\infty}_{R} \frac{r^{p-1}}{r^{2 \, b}} \,dr = \int^{\infty}_{R} r^{p-2 \, b-1} \,dr \end{aligned}$$

and that this integral is finite if and only if p − 2 b < 0. Thus, integrability of m(x) for ∥x∥≥ R and minimaxity of the (James-Stein) pseudo-Bayes estimator corresponding to m(X) are possible if and only if p∕2 < b ≤ p − 2, which implies p ≥ 5.

It is also interesting to note that superharmonicity of m(X) (i.e. 0 ≤ b ≤ (p − 2)∕2) is incompatible with integrability of m(x) on ∥x∥≥ R (i.e. b > p∕2). This is illustrative of a general fact that a generalized Bayes minimax estimator corresponding to a superharmonic marginal cannot be proper Bayes (see Theorem 3.2).

3.1.2 Construction of (Proper and Generalized) Minimax Bayes Estimators

Corollary 3.2 provides a method of constructing pseudo-Bayes minimax estimators. In this section, we concentrate on the construction of proper and generalized Bayes minimax estimators. The results in this section are primarily from Fourdrinier et al. (1998). Although Corollary 3.2 is helpful in constructing minimax estimators, it cannot be used to develop proper Bayes minimax estimators, as indicated in the example at the end of the previous section. The following result establishes that a superharmonic marginal (and consequently a superharmonic prior density) cannot lead to a proper Bayes estimator.

Theorem 3.2

Let m be a superharmonic marginal density corresponding to a prior π. Then π is not a probability measure.

Proof

Assume π is a probability measure. Then it follows that m is an integrable, strictly positive, and bounded function in C^∞ (the space of functions which have derivatives of all orders). Recall from Example 1.1 of Sect. 1.4 that the posterior risk is given, for any \(x \in \mathbb {R}^p\), by

$$\displaystyle \begin{aligned} p \, \sigma^{2} + \sigma^{4} \, \frac{m(x) \, \varDelta m(x) - \| \nabla m(x) \|{}^{2}}{m^{2}(x)}. \end{aligned}$$

Hence, the Bayes risk is

$$\displaystyle \begin{aligned} r(\pi) = E^m \left[p \sigma^{2} + \sigma^{4} \frac{m(X) \varDelta m(X) - \| \nabla m(X) \|{}^{2}}{m^{2}(X)} \right] , \end{aligned}$$

where E m is the expectation with respect to the marginal density m. Also, denoting by E π the expectation with respect to the prior π, we may use the unbiased estimate of risk to express r(π) as

$$\displaystyle \begin{aligned} \begin{array}{ll} r(\pi) & = E^\pi \left[ E_\theta \left[ p \, \sigma^{2} + \sigma^{4} \, \frac{2 \, m(X) \varDelta m(X) - \| \nabla m(X) \|{}^{2}}{m^{2}(X)} \right] \right] \\ \\ & = E^m \left[ p \, \sigma^{2} + \sigma^{4} \, \frac{2 m(X) \varDelta m(X) - \| \nabla m(X) \|{}^{2}}{m^{2}(X)} \right] , \end{array} \end{aligned}$$

since the unbiased estimate of risk does not depend on θ, by definition. Hence, by taking the difference,

$$\displaystyle \begin{aligned} E^m \left[ \frac{\varDelta m(X)}{m(X)} \right] = 0 \, . \end{aligned}$$

Now, since the marginal m is superharmonic (Δm(x) ≤ 0 for any \(x \in \mathbb {R}^p\)), strictly positive and in C^∞, it follows that Δm ≡ 0. Finally, the strict positivity and harmonicity of m implies that m ≡ C where C is a positive constant (see Doob 1984), and hence, that \(\int _{\mathbb {R}^{p}}\,m(x)\, dx = \infty \), which contradicts the integrability of m. □

We now turn to the construction of Bayes minimax estimators. Consider prior densities of the form

$$\displaystyle \begin{aligned} \pi(\theta) = k \int^{\infty}_{0} \exp \left(- \frac{\| \theta \|{}^{2}}{2 \, \sigma^{2} \, v}\right) v^{-p / 2} \, h(v) \, dv \end{aligned} $$
(3.4)

for some constant k and some nonnegative function h on \(\mathbb {R}^{+}\) such that the integral exists, i.e. π(θ) is a variance mixture of normal distributions . It follows from Fubini’s theorem that, for any \(x \in \mathbb {R}^p\),

$$\displaystyle \begin{aligned} m(x) = \int^{\infty}_{0} m_{v}(x) \, h(v) \, dv \end{aligned}$$

where

$$\displaystyle \begin{aligned} m_{v}(x) = k \, \exp \left(- \frac{\| x \|{}^{2}}{2 \, \sigma^{2} \, (1+v)}\right) (1+v)^{- p / 2} \, . \end{aligned}$$

Lebesgue’s dominated convergence theorem ensures that we may differentiate under the integral sign and so

$$\displaystyle \begin{aligned} \nabla m(x) = \int^{\infty}_{0} \nabla m_{v}(x) \, h(v) \, dv \end{aligned} $$
(3.5)

and

$$\displaystyle \begin{aligned} \varDelta m(x) = \int^{\infty}_{0} \varDelta m_{v} (x) \, h(v) \, dv \end{aligned} $$
(3.6)

where

$$\displaystyle \begin{aligned} \nabla m_{v} (x) = - \frac{k}{\sigma^{2}} \, \exp \left(- \frac{\| x \|{}^{2}}{2 \, \sigma^{2} \, (1+v)}\right) (1+v)^{- p / 2 - 1} \, x \end{aligned}$$

and

$$\displaystyle \begin{aligned} \varDelta m_{v}(x) = - \frac{k}{\sigma^{2}} \left[ p - \frac{\| x \|{}^{2}}{\sigma^{2} (1+v)} \right] \exp \left(- \frac{\| x \|{}^{2}}{2 \, \sigma^{2} \, (1+v)}\right) (1+v)^{- p / 2 - 1}. \end{aligned}$$

Then the following integral

$$\displaystyle \begin{aligned} I_{j}(y) = \int^{\infty}_{0} \exp(-y/(1+v)) \, (1+v)^{-j} \, h(v) \, dv \end{aligned}$$

exists for j ≥ p∕2. Hence, with y = ‖x‖²∕(2σ²), we have

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} m(x) &\displaystyle =&\displaystyle k \, I_{p/2} (y)\\ \nabla m(x) &\displaystyle =&\displaystyle - \frac{k}{\sigma^{2}} \, I_{p/2+1} (y) \, x \\ \varDelta m(x) &\displaystyle =&\displaystyle - \frac{k}{\sigma^{2}} \left[ p \, I_{p/2+1} (y) - 2 \, y \, I_{p/2+2}(y) \right] \\ \| \nabla m(x) \|{}^{2} &\displaystyle =&\displaystyle 2 \, \frac{k^{2}}{\sigma^{2}} \, y \, I^{2}_{\frac{p}{2}+1}(y). \end{array} \end{aligned} $$
(3.7)

Note that

$$\displaystyle \begin{aligned} \frac{\Vert\nabla m(x)\Vert^2}{m^2(x)} = \frac {2}{\sigma^2} \, \frac{I_{p/2+1}^2(y)}{I_{\frac{p}{2}}^2(y)} \, y \le \frac{2 \, y}{\sigma^2} = \frac{\Vert x\Vert^2}{\sigma^4} \end{aligned}$$

since I_{j+1}(y) ≤ I_j(y). Hence,

$$\displaystyle \begin{aligned} E_\theta\bigg[\frac{\Vert\nabla m(X)\Vert^2}{m^2(X)}\bigg] \le E_\theta \bigg[ \frac{\Vert X\Vert^2}{\sigma^4}\bigg] < \infty \, , \end{aligned}$$

which, according to Theorem 3.1, guarantees the finiteness of the risk of the Bayes estimator δ π(X) in (3.2). Furthermore, the unbiased estimator of risk difference (3.3) can be expressed as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} {\mathcal D}(X) &\displaystyle = - \frac{2}{\sigma^{2}} \left[ p \, I_{p/2+1}(y) - 2 \, y \, I_{p/2+2}(y) \right]/ I_{p/2}(y) \\ &\displaystyle \qquad - \frac{2}{\sigma^{2}} \left[ y \, I^{2}_{p/2+1}(y)/I^{2}_{p/2}(y) \right]\\ &\displaystyle = \frac{2 \, I_{p/2 +1}(y)}{\sigma^{2} \, I_{p/2} (y)} \left[ \frac{2 \, y \, I_{p/2+2}(y)}{I_{p/2+1}(y)} - p - \frac{y \, I_{p/2+1}(y)}{I_{p/2}(y)} \right]. \end{array} \end{aligned} $$
(3.8)

Then the following intermediate result follows immediately from (3.8) and Theorem 3.1 since finiteness of risk has been guaranteed above.

Lemma 3.1

The generalized Bayes estimator corresponding to the prior density (3.4) is minimax provided

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{2 \, I_{p/2+2}(y)}{I_{p/2+1}(y)} - \frac{I_{p/2+1}(y)}{I_{p/2}(y)} \leq \frac{p}{y} \, . \end{array} \end{aligned} $$
(3.9)

The next theorem gives sufficient conditions on the mixing density h(⋅) so that the resulting generalized Bayes estimator is minimax.

Theorem 3.3

Let h be a positive differentiable function such that the function − (v + 1) h′(v)∕h(v) can be written as l_1(v) + l_2(v), where l_1(v) ≤ A and is nondecreasing, while 0 ≤ l_2(v) ≤ B, with A + 2 B ≤ (p − 2)∕2. Assume also that lim_{v→∞} h(v)∕(v + 1)^{p∕2−1} = 0 and that \(\int ^{\infty }_{0} \exp (-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv < \infty \) . Then the generalized Bayes estimator (3.2) for the prior density (3.4) corresponding to the mixing density h is minimax. Furthermore, if h is integrable, the resulting estimator is also proper Bayes.

Proof

Via integration by parts , we first find an alternative expression for

$$\displaystyle \begin{aligned} I_{k}(y) = \int^{\infty}_{0} \exp(-y/(1+v)) \, (1+v)^{-k} \, h(v) \, dv. \end{aligned}$$

Letting u = (1 + v)^{−k+2} h(v) and \(dw = (1 + v)^{-2} \, \exp (-y/(1+v)) \, dv\), so that du = [(−k + 2) (1 + v)^{−k+1} h(v) + (1 + v)^{−k+2} h′(v)] dv and \(w = \exp (-y/(1+v)) / y\), we have, for k ≥ p∕2 + 1,

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} I_{k}(y) &\displaystyle =&\displaystyle \frac{(1+v)^{-k+2} \exp(-y/(1+v)) \, h(v)}{y} {\Big |}^{\infty}_{0} \\ &\displaystyle &\displaystyle + \frac{k-2}{y} \int^{\infty}_{0} \exp\left(- \frac{y}{1+v}\right) (1+v)^{-k+1} \, h(v)\, dv \\ &\displaystyle &\displaystyle - \frac{1}{y} \int^{\infty}_{0} \exp\left(- \frac{y}{1+v}\right) (1+v)^{-k+2} \, h^{\prime}(v) \,dv \\ &\displaystyle =&\displaystyle - \frac{e^{-y} \, h(0)}{y} + \frac{k-2}{y} \, I_{k-1}(y) \\ &\displaystyle &\displaystyle - \frac{1}{y} \int^{\infty}_{0} \exp\left(- \frac{y}{1+v}\right) (1+v)^{-k+2} \, h^{\prime}(v) \,dv \, . \end{array} \end{aligned} $$
(3.10)

Applying (3.10) to both numerators in the left-hand side of (3.9) we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \frac{2}{I_{p/2+1}(y)} \left[ \frac{- e^{-y} \, h(0)}{y} + \frac{p}{2 \, y} \, I_{p/2+1}(y) - \frac{1}{y} \int^{\infty}_{0} \exp\left( - \frac{y}{1+v} \right) (1+v)^{-p/2} \, h^{\prime}(v) \,dv \right] \\ &\displaystyle &\displaystyle \; \; - \frac{1}{I_{p/2}(y)} \left[ \frac{- e^{-y} \, h(0)}{y} + \frac{p-2}{2 \, y} \, I_{p/2}(y) - \frac{1}{y} \int^{\infty}_{0} \exp\left( - \frac{y}{1+v} \right) (1+v)^{-p/2+1} \, h^{\prime}(v) \,dv \right] \\ &\displaystyle &\displaystyle \leq \frac{p+2}{2 \, y} - \frac{2 \int^{\infty}_{0} \exp\left(- \frac{y}{1+v}\right) (1+v)^{-p/2+2} \, h^{\prime}(v) \,dv} {y \, I_{p/2+1} (y)} \\ &\displaystyle &\displaystyle + \frac{\int^{\infty}_{0} \exp\left(- \frac{y}{1+v}\right) (1+v)^{-p/2+1} \, h^{\prime}(v) \,dv} {y \, I_{p/2} (y)} \end{array} \end{aligned} $$

since I p∕2+1(y) < I p∕2(y). Then it follows from Lemma 3.1 that δ π(X) is minimax provided, for any y ≥ 0,

$$\displaystyle \begin{aligned} J_p^y \leq p - \frac{p+2}{2} = \frac{p-2}{2} \, , \end{aligned}$$

where

$$\displaystyle \begin{aligned} J_p^y = - 2 \, E^{y}_{p/2+1} \left[(V+1) \frac{h^{\prime}(V)}{h(V)}\right] + E^{y}_{p/2} \left[(V+1) \frac{h^{\prime}(V)}{h(V)}\right] \end{aligned}$$

and where \(E^{y}_{k} [f(V)]\) is the expectation of f(V ) with respect to the random variable V  with density \(g^{y}_{k}(v) = \exp (-y/(1+v)) \, (1+v)^{-k} \, h(v) / I_{k}(y)\). Now upon setting − (v + 1) h (v)∕h(v) = l 1(v) + l 2(v) and noting that \(g^{y}_{k}(v)\) has monotone decreasing likelihood ratio in k, for fixed y, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} J_p^y &\displaystyle =&\displaystyle 2 \, E^{y}_{p/2+1} \left[l_{1}(V) + l_{2}(V)\right] - E^{y}_{p/2} \left[l_{1}(V) + l_{2}(V)\right] \\ {} &\displaystyle \leq&\displaystyle 2 \, E^{y}_{p/2+1} \left[l_{1}(V)\right] - E^{y}_{p/2} \left[l_{1}(V)\right] + 2 \, E^{y}_{p/2+1} \left[l_{2}(V)\right] \end{array} \end{aligned} $$

since l 2 ≥ 0. Also

$$\displaystyle \begin{aligned} \begin{array}{rcl} E^{y}_{p/2+1} \left[l_{1}(V)\right] \leq E^{y}_{p/2} \left[l_{1}(V)\right] \end{array} \end{aligned} $$

since l 1 is nondecreasing. Then

$$\displaystyle \begin{aligned} \begin{array}{rcl} J_p^y \leq E^{y}_{p/2} \left[l_{1}(V)\right] + 2 \, E^{y}_{p/2+1} \left[l_{2}(V)\right] \leq A + 2 \, B \leq \frac{p-2}{2}. \end{array} \end{aligned} $$

since l 1 ≤ A and l 2 ≤ B and by the assumptions on A and B. The result follows. □

The following corollary allows the construction of mixing distributions so that the conditions of the theorem are met and the resulting (generalized or proper) Bayes estimators are minimax.

Corollary 3.3

Let ψ = ψ 1 + ψ 2 be a continuous function such that ψ 1 ≤ C and is nondecreasing, while 0 ≤ ψ 2 ≤ D, and where C ≤−2D. Define, for v > 0, \(h(v) = \exp \left [ - \frac {1}{2} \int ^{v}_{v_{0}} \frac {2 \, \psi (u) + p - 2}{u + 1} \,du \right ]\) where v 0 ≥ 0. Assume also that limv h(v)∕(1 + v)p∕2−1 = 0 and that \(I_{p/2}(y) = \int ^{\infty }_{0} \exp (-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv < \infty \).

Then the Bayes estimator corresponding to the mixing density h is minimax. Furthermore if h is integrable the estimator is proper Bayes.

Proof

A simple calculation shows that

$$\displaystyle \begin{aligned} -(v + 1) \, \frac{h^{\prime}(v)}{h(v)} = \psi_{1}(v) + \psi_{2}(v) + \frac{p-2}{2} \, . \end{aligned}$$

Setting l_1(v) = ψ_1(v) + (p − 2)∕2 and l_2(v) = ψ_2(v), the result follows from Theorem 3.3 with A = (p − 2)∕2 + C and B = D. □

Note that finiteness of I_{p∕2}(y) in Corollary 3.3 is assured if we strengthen the limit condition to lim_{v→∞} h(v)∕(1 + v)^{p∕2−1−𝜖} = 0 for some 𝜖 > 0, since this implies that h(v)∕(1 + v)^{p∕2} ≤ M∕(1 + v)^{1+𝜖} for some M > 0 and any v > 0. Thus

$$\displaystyle \begin{aligned} \begin{array}{rcl} I_{p/2}(y) = \int^{\infty}_{0} \exp(-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv &\displaystyle \leq&\displaystyle \int^{\infty}_{0} (1+v)^{-p/2} \, h(v) \, dv \\ &\displaystyle \leq&\displaystyle \int^{\infty}_{0} \frac{M}{(1+v)^{1+\epsilon}}\, dv \\ &\displaystyle <&\displaystyle \infty \, . \end{array} \end{aligned} $$

3.1.3 Examples

An interesting and useful class of examples results from the choice

$$\displaystyle \begin{aligned} \psi(v) = \alpha + \beta/v + \gamma/v^{2} \end{aligned} $$
(3.11)

for some \((\alpha , \beta , \gamma ) \in \mathbb {R}^3\). A simple calculation shows

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} h(v) &\displaystyle =&\displaystyle \exp \left[ - \int^{v}_{v_{0}} \frac{\alpha + \beta/u + \gamma/u^{2} + (p-2)/2}{u+1} \,du \right] \\ &\displaystyle \propto&\displaystyle (v + 1)^{\beta - \alpha - \gamma - \frac{p-2}{2}} v^{\gamma - \beta} \exp \left(\frac{\gamma}{v}\right). \end{array} \end{aligned} $$
(3.12)
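As a quick numerical check of (3.12), the sketch below (with arbitrary illustrative values α = −0.5, β = −1, γ = −2, p = 6 and v_0 = 1) evaluates h(v) directly from the integral definition in Corollary 3.3 and verifies that it agrees with the closed form (3.12) up to a constant.

    import numpy as np
    from scipy.integrate import quad

    p, alpha, beta, gamma, v0 = 6, -0.5, -1.0, -2.0, 1.0
    psi = lambda u: alpha + beta / u + gamma / u ** 2            # as in (3.11)

    def h_from_corollary(v):
        # h(v) = exp( -(1/2) int_{v0}^{v} (2 psi(u) + p - 2) / (u + 1) du ), cf. Corollary 3.3
        val, _ = quad(lambda u: (2 * psi(u) + p - 2) / (u + 1), v0, v)
        return np.exp(-0.5 * val)

    def h_closed_form(v):
        # (3.12), up to the normalizing constant
        return (v + 1) ** (beta - alpha - gamma - (p - 2) / 2) * v ** (gamma - beta) * np.exp(gamma / v)

    ratios = [h_from_corollary(v) / h_closed_form(v) for v in (0.5, 1.0, 2.0, 5.0, 10.0)]
    print(ratios)      # constant in v, confirming (3.12) up to normalization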

Example 3.1 (The Strawderman 1971 prior)

Suppose α ≤ 0 and β = γ = 0 so that h(v) ∝ (v + 1)^{−α−(p−2)∕2}. Let ψ_1(v) = ψ(v) ≡ α and ψ_2(v) ≡ 0 so that C = D = 0. Then the minimaxity conditions of Corollary 3.3 require lim_{v→∞} h(v)∕(1 + v)^{p∕2−1} = lim_{v→∞} (v + 1)^{−α−(p−2)} = 0 and this is satisfied if α > 2 − p. Also

$$\displaystyle \begin{aligned} \begin{array}{rcl} I_{p/2}(y) &\displaystyle =&\displaystyle \int^{\infty}_{0} \exp(-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv \\ &\displaystyle \propto&\displaystyle \int^{\infty}_{0} \exp(-y/(1+v)) \, (1+v)^{-\alpha-p+1} \, dv \\ &\displaystyle \leq&\displaystyle \int^{\infty}_{0} (1+v)^{-\alpha-p+1} \, dv \\ &\displaystyle <&\displaystyle \infty \end{array} \end{aligned} $$

if α > 2 − p as above. Hence in this case the corresponding generalized Bayes estimator is minimax if 2 − p < α ≤ 0 (which requires p ≥ 3).

Furthermore it is proper Bayes minimax if \( \int ^{\infty }_{0} (1+v)^{- \alpha -(p-2)/2}\, dv < \infty \) which is equivalent to 2 − p∕2 < α ≤ 0. This latter condition requires p ≥ 5 and demonstrates the existence of proper Bayes minimax estimators for p ≥ 5. We will see below that this is the class of priors studied in Strawderman (1971) under the alternative parametrization λ = 1∕(1 + v).
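For this prior the quantities in (3.7)-(3.9) reduce to one-dimensional integrals, so the conclusions can be checked numerically. The sketch below (with the arbitrary illustrative choices p = 6, σ² = 1 and α = −1) computes I_j(y) for h(v) = (1 + v)^{−α−(p−2)∕2}, the shrinkage factor 1 − I_{p∕2+1}(y)∕I_{p∕2}(y) multiplying X in the Bayes estimator, and the unbiased risk difference (3.8), which is indeed nonpositive on the grid.

    import numpy as np
    from scipy.integrate import quad

    p, sigma2, alpha = 6, 1.0, -1.0                     # 2 - p < alpha <= 0
    h = lambda v: (1 + v) ** (-alpha - (p - 2) / 2)     # Strawderman-type mixing density

    def I(j, y):
        val, _ = quad(lambda v: np.exp(-y / (1 + v)) * (1 + v) ** (-j) * h(v), 0, np.inf)
        return val

    for y in (0.5, 2.0, 10.0, 50.0):                    # y = ||x||^2 / (2 sigma^2)
        shrink = 1 - I(p / 2 + 1, y) / I(p / 2, y)      # the Bayes estimate is shrink * x
        D = (2 * I(p / 2 + 1, y) / (sigma2 * I(p / 2, y))) * (
            2 * y * I(p / 2 + 2, y) / I(p / 2 + 1, y) - p - y * I(p / 2 + 1, y) / I(p / 2, y))
        print(y, round(shrink, 4), round(D, 4))         # D <= 0, as required by Theorem 3.1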

Example 3.2

Consider ψ(v) given by (3.11) with α ≤ 0, β ≤ 0 and γ ≤ 0. Here we take ψ_1(v) = ψ(v), ψ_2(v) = 0, and C = D = 0. The minimaxity conditions of Corollary 3.3 require

$$\displaystyle \begin{aligned} \lim_{v \rightarrow \infty} h(v) / (1+v)^{p/2 - 1} = \lim_{v \rightarrow \infty}(v+1)^{\beta - \alpha - \gamma - p +2} v^{\gamma - \beta} \exp(\gamma / v) = 0. \end{aligned}$$

This implies 2 − p < α ≤ 0. The finiteness condition on

$$\displaystyle \begin{aligned} \begin{array}{rcl} I_{p/2}(y) &\displaystyle =&\displaystyle \int^{\infty}_{0} \exp(-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv \\ &\displaystyle \propto&\displaystyle \int^{\infty}_{0} e^{- \frac{y}{1+v}} (v+1)^{\beta - \alpha - \gamma - p+1} v^{\gamma - \beta} \exp(\gamma / v) \,dv \end{array} \end{aligned} $$

also requires 2 − p < α ≤ 0. Therefore, minimaxity is ensured as soon as 2 − p < α ≤ 0.

Furthermore, the minimax estimator will be proper Bayes if

$$\displaystyle \begin{aligned}\int^{\infty}_{0} \,h(v) \, dv \propto \int^{\infty}_{0} (1+v)^{\beta - \alpha - \gamma - (p-2)/2} \, v^{\gamma - \beta} \, \exp(\gamma / v) \, dv < \infty. \end{aligned}$$

This holds if \(2 - \frac {p}{2} < \alpha \leq 0\) as in Example 3.1.

Example 3.3

Suppose α ≤ 0, β > 0, and γ < 0 and take

$$\displaystyle \begin{aligned} \psi_{1}(v) = \alpha + \gamma \left( \frac{1}{v} + \frac{\beta}{2 \, \gamma} \right)^{2} I_{[0, - 2 \gamma/\beta]}(v) \quad \mathrm{and} \quad \psi_{2}(v) = \psi(v) - \psi_{1}(v) \, , \end{aligned}$$

for C = α and D = −β²∕(4γ).

Note first that ψ_1(v) is monotone nondecreasing and bounded above by α; also, 0 ≤ ψ_2(v) ≤ −β²∕(4γ). Therefore, we require C = α ≤ −2D = β²∕(2γ). The conditions lim_{v→∞} h(v)∕(1 + v)^{p∕2−1} = 0 and \(\int ^{\infty }_{0} \exp (-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv < \infty \) are, as in Example 3.2, satisfied when 2 − p < α ≤ 0.

Thus, δ_π(X) is minimax for 2 − p < α ≤ β²∕(2γ) < 0. The condition for integrability of h is, as in Example 3.2, 2 − p∕2 < α, so that the estimator is proper Bayes minimax for \(2 - \frac {p}{2} < \alpha \leq \beta ^{2}/ 2 \gamma < 0\).

In this example, ψ(v) is not monotone but is increasing on [0, −2γ∕β) and decreasing thereafter. This typically corresponds to a non-monotone r(‖X‖²) in the Baranchik-type representation of δ_π(X).

For simplicity, in the following examples, we assume σ 2 = 1.

Example 3.4 (Student-t priors)

In this example we take ψ(v) as in Examples 3.2 and 3.3 with the specific choices α = (m − p + 4)∕2 ≤ 0, β = (m (1 − φ) + 2)∕2, and γ = −m φ∕2 ≤ 0, where m ≥ 1. In this case \(h(v) = C \, v^{- (m+2) / 2} \exp (- m \, \varphi / 2 \, v)\), an inverse gamma density. Hence, as is well known, π(θ) is a multivariate-t distribution with m degrees of freedom and scale parameter φ if m is an integer (see e.g. Muirhead 1982, p.33 or Robert 1994, p.174). If σ² ≠ 1, the scale parameter of the t-distribution is φ σ².

For various different values of m and φ, either the conditions of Example 3.2 or the conditions of Example 3.3 apply. Both examples require α = (m − p + 4)∕2 ≤ 0, or equivalently 1 ≤ m ≤ p − 4 (so that p ≥ 5), and γ = −m φ∕2 ≤ 0.

Example 3.2 requires β = (m (1 − φ) + 2)∕2 ≤ 0, or equivalently, φ ≥ (m + 2)∕m. The condition for minimaxity, 2 − p < α ≤ 0, is satisfied since it is equivalent to m > −p. Furthermore the condition for proper Bayes minimaxity, \(2 - \frac {p}{2} < \alpha \le 0\), is satisfied as well since it reduces to m > 0. Hence, if φ ≥ (m + 2)∕m, the scaled p-variate t prior distribution leads to a proper Bayes minimax estimator for p ≥ 5 and m ≤ p − 4.

On the other hand, when φ < (m + 2)∕m, or equivalently, β > 0, the conditions of Example 3.3 are applicable. Considering the proper Bayes case only, the condition for minimaxity of the Bayes estimator is

$$\displaystyle \begin{aligned} 2 - \frac {p}{2} < \alpha = \frac{m-p+4}{2} \le \frac {\beta^2}{2\gamma} = - \frac{1}{4} \, \frac {\big(m \, (1- \varphi) + 2\big)^2}{m \, \varphi}. \end{aligned}$$

The first inequality is satisfied by the fact that m > 0. The second inequality can be satisfied only for certain φ since, when φ goes to 0, the last expression tends to −∞. A straightforward calculation shows that the second inequality can hold only if

$$\displaystyle \begin{aligned} \varphi \ge \frac{p-2}{m}\left[1 - \sqrt{1 - \left(\frac{m+2}{p-2}\right)^2}\right] > 0 \, . \end{aligned}$$

In particular, if φ = 1 (the standard multivariate t), the condition becomes \(2 - p/2 < \frac {m - p + 4}{2} \leq - \frac {1}{m}\). As m ≥ 1 this is equivalent to m + 2∕m ≤ p − 4, which requires p ≥ 7 for m = 1 or 2, and p ≥ m + 5 for m ≥ 3.
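The lower bound on φ can be verified against the defining inequality. The sketch below (with the arbitrary illustrative values p = 9 and m = 2, so that m ≤ p − 4) compares, over a grid of φ with β > 0, the sign of α − β²∕(2γ) = (m − p + 4)∕2 + (m(1 − φ) + 2)²∕(4mφ) with the threshold displayed above; the two criteria agree on the grid.

    import numpy as np

    p, m = 9, 2
    threshold = (p - 2) / m * (1 - np.sqrt(1 - ((m + 2) / (p - 2)) ** 2))

    def alpha_minus_bound(phi):
        # alpha - beta^2/(2 gamma) with alpha = (m - p + 4)/2, beta = (m(1 - phi) + 2)/2,
        # gamma = -m phi / 2; the second inequality of Example 3.3 requires this to be <= 0
        return (m - p + 4) / 2 + (m * (1 - phi) + 2) ** 2 / (4 * m * phi)

    for phi in np.linspace(0.1, 1.9, 19):               # phi < (m + 2)/m, so beta > 0
        print(round(phi, 1), alpha_minus_bound(phi) <= 0, phi >= threshold)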

An alternative approach to the results of this section can be made using the techniques of Sect. 2.4.2 applied to Baranchik-type estimators of the form \(\left (1 - a \, r (\| X \|{ }^{2}) / \|X \|{ }^{2} \right ) X\). Indeed any spherically symmetric prior distribution will lead to an estimator of the form ϕ(∥X2)X. More to the point, for prior distributions of the form studied in this section, the r(⋅) function is closely connected to the function v↦ − (v + 1)h (v)∕h(v). To see this, note that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \delta_{\pi}(X) &\displaystyle =&\displaystyle X + \sigma^{2} \frac{\nabla m(X)}{m(X)}\\ &\displaystyle =&\displaystyle \left( 1 - \frac{I_{p/2 +1}(y)}{I_{p/2}(y)} \right) X \qquad \mbox{from (3.2) with } y = \| X \|{}^{2} / 2 \sigma^{2} \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} =\left( 1 - \frac{1}{y} \left(\frac{p-2}{2} - \frac{ \int^{\infty}_{0} e^{- \frac{y}{1+v}}(1 + v)^{- p/2} [(v + 1) h^{\prime}(v)/h(v)] \, h(v) \,dv + e^{-y} h(0)} {I_{p/2}(y)} \right)\right) X\end{aligned} $$
$$\displaystyle \begin{aligned} = \left(1 - \frac{2 \sigma^{2}}{\| X \|{}^{2}} \left(\frac{p-2}{2} + E^{y}_{p/2} \left[ - \frac{(V + 1) h^{\prime}(V)}{h(V)} \right] - \frac{e^{- \frac{\| X \|{}^{2}}{2 \sigma^{2}}}h(0)}{I_{p/2} (\frac{\| X \|{}^{2}}{2 \sigma^{2}})} \right) \right) X \, ,\end{aligned} $$

where \(E^{y}_{k}(f)\) is as in the proof of Theorem 3.3, the second to last equality following from (3.10).

Hence, the Bayes estimator is of Baranchik form with

$$\displaystyle \begin{aligned}a r (\| X \|{}^{2}) = 2 \left(\frac{p-2}{2} + E^{\frac{\| X \|{}^{2}}{2 \sigma^{2}}}_{p/2} \left[ - \frac{(V + 1) h^{\prime}(V)}{h(V)} \right] - \frac{e^{- \frac{\| X \|{}^{2}}{2 \sigma^{2}}}h(0)} {I_{p/2}(\frac{\| X \|{}^{2}}{2 \sigma^{2}})} \right). \end{aligned}$$

Recall, as in the proof of Theorem 3.3, that the density \(g^{y}_{k}(v)\) has a monotone decreasing likelihood ratio in k, but notice also that it has a monotone increasing likelihood ratio (actually as an exponential family) in y.

Hence, if \(- \frac {(v+1) h^{\prime }(v)}{h(v)}\) is nondecreasing, it follows that r is nondecreasing since e^y I_{p∕2}(y) is also nondecreasing. Then the following corollary is immediate from Theorem 3.3.

Corollary 3.4

Suppose the prior is of the form (3.4) where − (v + 1) h (v)∕h(v) is nondecreasing and bounded above by A > 0. Then, the generalized Bayes estimator is minimax provided \(A \le \frac {p-2}{2}\).

Proof

As noted, r(⋅) is nondecreasing and is bounded above by p − 2 + 2A ≤ 2(p − 2). □

Corollary 3.4 yields an alternative proof of the minimaxity of the generalized Bayes estimator in Example 3.1.

Finally, as indicated earlier in this section, an alternative parametrization has often been used in minimaxity proofs for the mixture of normal priors, namely \(\lambda = \frac {1}{1+v}\), or equivalently, \(v = \frac {1 - \lambda }{\lambda }\).

Perhaps the easiest way to proceed is to reconsider the prior distribution as a hierarchical prior as discussed in Sect. 1.7. Here the distribution of \(\theta \mid v \sim {\mathcal N}_p(0, v \sigma ^{2} I_p)\) and the unconditional density of v is the mixing density h(v). The conditional distribution of θ given X and v is \(\mathcal {N}_p(\frac {v}{1+v} X, \frac {v}{1+v} \sigma ^{2} I_p)\). The Bayes estimator is

$$\displaystyle \begin{aligned} \begin{array}{ll} \delta_{\pi}(X) & = E(\theta \mid X)\\ {} & = E[E(\theta \mid X, V) \mid X]\\ {} & = E[\frac{V}{1+V} X \mid X]\\ {} & = (1 - E[\frac{1}{1+V} \mid X])X\\ {} & = (1 - E[\lambda \mid X]) X. \end{array} \end{aligned}$$

Note also that the Bayes estimator for the first stage prior

$$\displaystyle \begin{aligned} \theta \mid \lambda \sim {\mathcal N} (0, \frac{1 - \lambda}{\lambda} \sigma^{2} I) \end{aligned} $$
(3.13)

is (1 − λ)X. Therefore, in terms of the λ parametrization, one may think of E[λX] as the posterior mean of the shrinkage factor and of the (mixing) distribution on λ as the distribution of the shrinkage factor.

In particular, for the prior distribution of Example 3.1 where the mixing density on v is h(v) = C (1 + v)^{−α−(p−2)∕2}, the corresponding mixing density on λ is given by \(g(\lambda ) = C \lambda ^{ \alpha + \frac {p-2}{2} - 2} = C \lambda ^{\beta }\) with β = α + p∕2 − 3. The resulting prior is proper Bayes minimax if 2 − p∕2 < α ≤ 0 or, equivalently, − 1 < β ≤ p∕2 − 3 (and p ≥ 5). Note that, if p ≥ 6, β = 0 satisfies the conditions and consequently the mixing prior g(λ) ≡ 1 on 0 ≤ λ ≤ 1, i.e. the uniform prior on the shrinkage factor λ, gives a proper Bayes minimax estimator. This class of priors is often referred to as the Strawderman priors.
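In the λ parametrization the posterior mean of the shrinkage factor is a one-dimensional integral: since (3.13) and X ∣ θ ∼ N_p(θ, σ² I_p) give X ∣ λ ∼ N_p(0, (σ²∕λ) I_p), one has E[λ ∣ X] = ∫_0^1 λ^{p∕2+1} e^{−λ‖X‖²∕2σ²} g(λ) dλ ∕ ∫_0^1 λ^{p∕2} e^{−λ‖X‖²∕2σ²} g(λ) dλ. The sketch below (with the arbitrary illustrative choices p = 6 and σ² = 1) evaluates this for the uniform mixing prior g ≡ 1 and compares the resulting shrinkage factor 1 − E[λ ∣ X] with the positive-part James-Stein factor.

    import numpy as np
    from scipy.integrate import quad

    p, sigma2 = 6, 1.0
    g = lambda lam: 1.0                                   # uniform prior on the shrinkage factor

    def posterior_mean_lambda(norm_x2):
        y = norm_x2 / (2 * sigma2)
        num, _ = quad(lambda lam: lam ** (p / 2 + 1) * np.exp(-lam * y) * g(lam), 0, 1)
        den, _ = quad(lambda lam: lam ** (p / 2) * np.exp(-lam * y) * g(lam), 0, 1)
        return num / den                                  # E[lambda | X]

    for norm_x2 in (2.0, 6.0, 20.0, 100.0):
        bayes_factor = 1 - posterior_mean_lambda(norm_x2)            # multiplies X in the Bayes estimator
        js_factor = max(0.0, 1 - (p - 2) * sigma2 / norm_x2)         # positive-part James-Stein factor
        print(norm_x2, round(bayes_factor, 3), round(js_factor, 3))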

To formalize the above discussion further we present a version of Theorem 3.3 in terms of the mixing distribution on λ. The proof follows from Theorem 3.3 and the change of variable λ = 1∕(1 + v).

Corollary 3.5

Let θ have the hierarchical prior \(\theta \mid \lambda \sim {\mathcal N}_{p}(0, (\{1 - \lambda \} / \lambda ) \, \sigma ^{2} \, I_p)\) where λ ∼ g(λ) for 0 ≤ λ ≤ 1. Assume that lim_{λ→0} g(λ) λ^{p∕2+1} = 0 and that \(\int ^{1}_{0} e^{- \lambda } \lambda ^{p/2} g(\lambda ) d \lambda < \infty \) . Suppose λ g′(λ)∕g(λ) can be decomposed as \(l^{*}_{1}(\lambda ) + l^{*}_{2}(\lambda )\) where \(l^{*}_{1}(\lambda )\) is monotone nonincreasing and \(l^{*}_{1}(\lambda ) \leq A^{*}\) , \(0 \leq l^{*}_{2}(\lambda ) \leq B^{*}\) with A^* + 2B^* ≤ p∕2 − 3.

Then the generalized Bayes estimator is minimax. Furthermore, if \(\int ^{1}_{0}g(\lambda ) d \lambda < \infty \) , the estimator is also proper Bayes.

Example 3.5 (Beta priors)

Suppose the prior g(λ) on λ is a Beta(a, b) distribution, i.e. g(λ) ∝ λ^{a−1} (1 − λ)^{b−1}. Note that the Strawderman (1971) prior is of this form if b = 1. An easy calculation shows \(\frac {\lambda g^{\prime }(\lambda )}{g(\lambda )} = a - 1 - (b - 1) \frac {\lambda }{1 - \lambda }\). Letting \(l^{*}_{1}(\lambda ) = \frac {\lambda g^{\prime }(\lambda )}{g(\lambda )}\) and \(l^{*}_{2}(\lambda ) \equiv 0\), we see that the resulting proper Bayes estimator is minimax for 0 < a ≤ p∕2 − 2 and b ≥ 1.

It is clear that our proof fails for 0 < b < 1 since in this case λ g′(λ)∕g(λ) is not bounded from above (and is also monotone increasing). Maruyama (1998) shows, using a different proof technique involving properties of confluent hypergeometric functions, that the generalized Bayes estimator is minimax (in our notation) for − p∕2 < a ≤ p∕2 − 2 and b ≥ (p + 2a + 2)∕(3p∕2 + a). This bound on b is in (0, 1) for a < p∕2 − 2. Hence, certain Beta distributions with 0 < b < 1 also give proper Bayes minimax estimators. The generalized Bayes minimax estimators of Alam (1973) are also in Maruyama’s class.

3.1.4 Multiple Shrinkage Estimators

In this subsection, we consider a class of estimators that adaptively choose a point (or subspace) toward which to shrink. George (1986a,b) originated work in this area and the results in this section are largely due to him. The basic fact upon which the results rely is that a mixture of superharmonic functions is superharmonic (see the discussion in the Appendix); that is, if m_α(x) is superharmonic for each α, then \(\int m_{\alpha }(x) \, d G(\alpha )\) is superharmonic if G(⋅) is a positive measure such that \(\int m_{\alpha }(x) \, d G(\alpha ) < \infty \). Using this property, we have the following result from Corollary 3.2.

Theorem 3.4

Let m_α(x) be a family of twice weakly differentiable nonnegative superharmonic functions and G(α) a positive measure such that \(m(x) = \int m_{\alpha }(x) \, d G(\alpha )\) < ∞, for all \(x \in \mathbb {R}^{p}\).

Then the (generalized, proper, or pseudo) Bayes estimator

$$\displaystyle \begin{aligned} \delta(X) = X + \sigma^{2} \frac{\nabla m(X)}{m(X)} \end{aligned}$$

is minimax provided E_θ[‖∇m(X)‖²∕m²(X)] < ∞.

The following corollary for finite mixtures is useful.

Corollary 3.6

Suppose that m i(x) is superharmonic and \(E [\| \nabla m_{i}(X) \|{ }^{2} / m^{2}_{i}(X)] < \infty \) for i = 1, …, n. Then, if \(m(x) = \sum ^{n}_{i=1} m_{i}(x)\) , the (generalized, proper, or pseudo) Bayes estimator

$$\displaystyle \begin{aligned} \begin{array}{ll} \delta (X) & = X + \sigma^{2} \frac{\nabla m(X)}{m(X)}\\ & = \sum^{n}_{i=1} (X + \sigma^{2} \frac{\nabla m_{i}(X)}{m_{i}(X)}) W_{i}(X) \end{array} \end{aligned}$$

where \(W_{i}(X) = m_{i}(X) / \sum ^{n}_{j=1} m_{j}(X) \), so that \(0 < W_{i}(X) < 1\) and \(\sum ^{n}_{i=1} W_{i}(X) = 1\), is minimax. (Note that \(E_{\theta }[ \| \nabla m(X) \|{ }^{2} / m^{2}(X)] < \sum ^{n}_{i=1} E_{\theta } [\| \nabla m_{i}(X) \|{ }^{2} / m^{2}_{i}(X)] < \infty \).)

Example 3.6

  1. (1)

    Multiple shrinkage James-Stein estimator. Suppose we have several possible points X_1, X_2, …, X_n toward which to shrink. Recall that m_i(x) = (1∕‖x − X_i‖²)^{(p−2)∕2} is superharmonic if p ≥ 3 and the corresponding pseudo-Bayes estimator is \(\delta _{i}(X) = X_{i} + \left ( 1 - (p-2) \, \sigma ^{2} / \| X - X_{i} \|{ }^{2} \right ) (X - X_{i})\). Hence, if \(m(x) = \sum ^{n}_{i=1} m_{i}(x)\), the resulting minimax pseudo-Bayes estimator is given by

    $$\displaystyle \begin{aligned} \delta (X) = \sum^{n}_{i=1} \left[ X_{i} + (1 - \frac{(p-2) \sigma^{2}}{\| X - X_{i} \|{}^{2}}) (X - X_{i}) \right] W_{i}(X) \end{aligned}$$

    where \(W_{i}(X) \propto \left ( 1 / \| X - X_{i} \|{ }^{2} \right )^{(p-2) / 2}\) and \(\sum ^{n}_{i=1} W_{i}(X) = 1\). Note that W_i(X) is large when X is close to X_i, so the estimator adaptively shrinks toward X_i (a numerical sketch is given at the end of this subsection).

  2. (2)

    Multiple shrinkage positive-part James-Stein estimators. Another possible choice for the m i(x) (leading to a positive-part James Stein estimator) is

    $$\displaystyle \begin{aligned} m_{i}(x) = \left\{ \begin{array}{ll} C \; \exp \left(- {\frac{\| x - X_{i} \|{}^{2}}{2 \, \sigma^{2}}} \right) & {\mathrm{if}} \; \| x - X_{i} \|{}^{2} < (p-2) \, \sigma^{2}\\ \left(\frac{1}{\| x - X_{i} \|{}^{2}}\right)^{(p-2)/2} & {\mathrm{if}} \; \| x - X_{i} \|{}^{2} \geq (p-2) \, \sigma^{2} \end{array} \right. \end{aligned}$$

    where \(C = \left ( 1 / (p-2) \, \sigma ^{2} \right )^{(p-2) / 2} e^{(p-2) / 2}\) so that m i(x) is continuous. This gives

    $$\displaystyle \begin{aligned} \delta_{i}(X) = X_{i} + \left( 1 - \frac{(p-2) \sigma^{2}}{\| X - X_{i} \|{}^{2}} \right)_{+} (X - X_{i}) \end{aligned}$$

    since

    $$\displaystyle \begin{aligned} \frac{\nabla m_{i}(X)}{m_{i}(X)} = \left\{ \begin{array}{l} - \frac{X - X_{i}}{\sigma^{2}} \; \;\; {\mathrm{if}} \; \| X - X_{i} \|{}^{2} < (p-2) \sigma^{2},\\ - \frac{(p-2)}{\| X - X_{i} \|{}^{2}} \, (X - X_{i}) \; \;\; {\mathrm{otherwise.}} \end{array} \right. \end{aligned}$$

    The adaptive combination is again minimax by the corollary and inherits the usual advantages of the positive-part estimator over the James-Stein estimator.

    Note that a smooth alternative to the above is \(m_{i}(x) = \left (\frac {1}{b + \| x - X_{i} \|{ }^{2}} \right )^{\frac {p-2}{2}}\) for some b > 0.

In each of the above examples we may replace (p − 2)∕2 in the exponent by a∕2 where 0 ≤ a ≤ p − 2 (and where ‖x − X_i‖² < (p − 2) σ² is replaced by ‖x − X_i‖² < a σ² for the positive-part estimator). The choice of p − 2 as an upper bound for a ensures superharmonicity of m_i(x). A choice of a in the range p − 2 < a ≤ 2 (p − 2) also seems quite natural since \(\sqrt {m_{i}(x)}\) is superharmonic (but m_i(x) is not) for a in this range, so that each δ_i(X) is minimax. Unfortunately, minimaxity of \(\delta (X) = \sum _{i=1}^n W_{i}(X) \delta _{i}(X)\) does not follow from Corollary 3.6 for p − 2 < a ≤ 2 (p − 2) since it need not be true that \(\sqrt {\sum ^{n}_{i=1} m_{i}(x)}\) is superharmonic even though \(\sqrt {m_{i}(x)}\) is superharmonic for each i.

  1. (3)

    A generalized Bayes multiple shrinkage estimator . If π i(θ) is superharmonic then \(\pi (\theta ) = \sum ^{n}_{i=1}\pi _{i}(\theta )\) is also superharmonic as is \(m(x) = \sum ^{n}_{i=1} m_{i}(x)\).

For example, \(\pi _{i}(\theta ) = \left ( 1 / (b + \| \theta - X_{i} \|{ }^{2}) \right )^{a/2}\), for b ≥ 0 and 0 ≤ a ≤ p − 2, is a suitable prior. Interestingly, according to a heuristic of Brown (1971), m(x) in this case should behave for large ‖x‖² as \(\sum ^{n}_{i=1} 1 / \left ( b + \| x - X_{i} \|{ }^{2} \right )^{a/2}\), the “smooth” version of the adaptive positive-part multiple shrinkage pseudo-marginal in part (2) of this example.

By obvious modifications of the above, multiple shrinkage estimators may be constructed that shrink adaptively toward subspaces. Further examples can be found in George (1986a,b), Ki and Tsui (1990) and Wither (1991).
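The following sketch (referred to in part (1) of the example above; the shrinkage targets and the observation are arbitrary illustrative choices) implements the multiple shrinkage James-Stein estimator, forming the pseudo-marginals m_i(x) = (1∕‖x − X_i‖²)^{(p−2)∕2}, the weights W_i(X), and the adaptive combination δ(X).

    import numpy as np

    def multiple_shrinkage_js(x, targets, sigma2=1.0):
        # multiple shrinkage James-Stein estimator of Example 3.6 (1)
        p = len(x)
        d2 = np.array([np.sum((x - t) ** 2) for t in targets])          # ||x - X_i||^2
        m = d2 ** (-(p - 2) / 2)                                         # pseudo-marginals m_i(x)
        w = m / m.sum()                                                  # weights W_i(x)
        deltas = np.array([t + (1 - (p - 2) * sigma2 / di) * (x - t)
                           for t, di in zip(targets, d2)])               # component estimators delta_i(x)
        return w @ deltas, w

    p = 6
    targets = [np.zeros(p), np.full(p, 3.0)]        # two candidate points toward which to shrink
    x = np.full(p, 2.5)                             # observation, much closer to the second target
    estimate, weights = multiple_shrinkage_js(x, targets)
    print(weights)                                  # nearly all weight on the nearby target
    print(estimate)                                 # the estimate is pulled adaptively toward it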

3.2 Bayes Estimators in the Unknown Variance Case

3.2.1 A Class of Proper Bayes Minimax Admissible Estimators

In this subsection, we give a class of hierarchical Bayes minimax estimators for the model

$$\displaystyle \begin{aligned} X \sim {\mathcal N}_p(\theta,\sigma^2 \, I_p) \quad S \sim \sigma^2 \, \chi_k^2 \, , \end{aligned} $$
(3.14)

where S is independent of X, under scale invariant squared error loss

$$\displaystyle \begin{aligned} L(\theta,\delta(X, S)) = \frac{\|\delta(X, S) - \theta\|{}^2}{\sigma^2} \, . \end{aligned} $$
(3.15)

We reparameterize σ² as 1∕η and consider the following hierarchically structured prior on the unknown parameters (θ, η), which is reminiscent of the hierarchical version of the Strawderman prior in (3.13):

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \theta | \lambda, \eta &\displaystyle \sim&\displaystyle {\mathcal N}_p \left( 0, \frac{1}{\eta} \, \frac{1 - \lambda}{\lambda} \, I_p \right) \\ \eta &\displaystyle \sim&\displaystyle Gamma\left( \frac{b}{2} , \frac{c}{2} \right) \\ \lambda &\displaystyle \sim&\displaystyle (1 + a) \, \lambda^a, \quad 0 < \lambda < 1 \, . \end{array} \end{aligned} $$
(3.16)

Lemma 3.2

For the model (3.14) and loss (3.15) , the (generalized or proper) Bayes estimator of θ is given by

$$\displaystyle \begin{aligned} \delta(X,S) = \left( 1 - \frac{S}{\|X\|{}^2} \, r(\|X\|{}^2,S) \right) X \end{aligned} $$
(3.17)

where

$$\displaystyle \begin{aligned} r(\|X\|{}^2,S) = \frac{\|X\|{}^2}{\|X\|{}^2 + c} \frac{\int_0^{(\|X\|{}^2 + c) / S} u^{A+1} \left( \frac{1}{u + 1} \right)^{B+1} du} {\int_0^{(\|X\|{}^2 + c) / S} u^A \left( \frac{1}{u + 1} \right)^{B+1} du} \end{aligned} $$
(3.18)

where

$$\displaystyle \begin{aligned} A = \frac{p + a + b}{2} \quad \mathit{\text{and}} \quad B = \frac{p + k + b - 2}{2} \end{aligned} $$
(3.19)

provided A > −1, A − B < 0, and c > 0.

Proof

Under the loss in (3.15) the Bayes estimator for the model in (3.16) is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \delta(X,S) = \frac{E[\theta \, \eta | X,S]}{E[\eta | X,S]} \, . \end{array} \end{aligned} $$
(3.20)

Expressing the expectation in the numerator of (3.20) gives

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} E[\theta \, \eta | X,S] &\displaystyle =&\displaystyle \int_0^\infty \int_0^1 \int_{\mathbb{R}^p} \theta \, \eta^{p/2+1} \left( \frac{\lambda \, \eta}{1 - \lambda} \right)^{p/2} \\ &\displaystyle &\displaystyle \hspace{2cm} \times \exp \left( - \frac{\eta}{2} \left[ \|x - \theta\|{}^2 + \frac{\lambda}{1 - \lambda} \, \|\theta\|{}^2 \right] \right) \eta^{(k+b-2)/2} \\ &\displaystyle &\displaystyle \times \lambda^{(b+a)/2} \, \exp\left( - \frac{\eta}{2} \, (S + \lambda \, c) \right) d\theta \, d\eta \, d\lambda \\ &\displaystyle =&\displaystyle \int_0^\infty \int_0^1 (1 - \lambda) \lambda^{A} \eta^{B} \exp\left( - \frac{\eta}{2} \, (S + \lambda(\|x\|{}^2 + c)) \right) d\eta \, d\lambda \end{array} \end{aligned} $$
(3.21)

upon integrating with respect to θ and evaluating with the constants in (3.19). Similarly, for the denominator in (3.20)

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} E[\eta | X,S] &\displaystyle =&\displaystyle \int_0^\infty \int_0^1 \int_{\mathbb{R}^p} \eta^{p/2+1} \left( \frac{\lambda \, \eta}{1 - \lambda} \right)^{p/2} \\ &\displaystyle &\displaystyle \hspace{2cm} \times \exp \left( - \frac{\eta}{2} \left[ \|x - \theta\|{}^2 + \frac{\lambda}{1 - \lambda} \, \|\theta\|{}^2 \right] \right) \eta^{(k+b-2)/2} \\ &\displaystyle &\displaystyle \times \lambda^{(b+a)/2} \, \exp\left( - \frac{\eta}{2} \, (S + \lambda \, c) \right) d\theta \, d\eta \, d\lambda \\ &\displaystyle =&\displaystyle \int_0^\infty \int_0^1 \eta^{B} \lambda^{A} \exp\left( - \frac{\eta}{2} \, (S + \lambda(\|x\|{}^2 + c)) \right) d\eta d\lambda. \end{array} \end{aligned} $$
(3.22)

Therefore from (3.21) and (3.22) the Bayes estimator in (3.20) has the form

$$\displaystyle \begin{aligned} \delta(X,S) = \left( 1 - \frac{S}{\|X\|{}^2} \, r(\|X\|{}^2,S) \right) X \end{aligned}$$

where

$$\displaystyle \begin{aligned} \begin{array}{rcl} r(\|X\|{}^2,S) &\displaystyle =&\displaystyle \frac{\|X\|{}^2}{S} \, \frac {\int_0^\infty \int_0^1 \eta^{B} \, \lambda^{A+1} \, \exp \left( - \frac{\eta \, S}{2} \, \left(1 + \lambda \, \frac{\|x\|{}^2 + c}{S} \right) \right) d\eta \, d\lambda } {\int_0^\infty \int_0^1 \eta^{B} \, \lambda^{A} \, \exp \left( - \frac{\eta \, S}{2} \, \left(1 + \lambda \, \frac{\|x\|{}^2 + c}{S} \right) \right) d\eta \, d\lambda } \\ {} &\displaystyle =&\displaystyle \frac{\|X\|{}^2}{\|X\|{}^2 + c} \, \frac {\int_0^{(\|X\|{}^2 + c) / S} \int_0^\infty \eta^{B} \, u^{A+1} \, \exp \left( - \frac{\eta \, S}{2} \, (1 + u) \right) d\eta \, du } {\int_0^{(\|X\|{}^2 + c) / S} \int_0^\infty \eta^{B} \, u^{A} \, \exp \left( - \frac{\eta \, S}{2} \, (1 + u) \right) d\eta \, du } \\ {} &\displaystyle =&\displaystyle \frac{\|X\|{}^2}{\|X\|{}^2 + c} \, \frac {\int_0^{(\|X\|{}^2 + c) / S} u^{A+1} \, \left( \frac{1}{u + 1} \right)^{B+1} du } {\int_0^{(\|X\|{}^2 + c) / S} u^{A} \, \left( \frac{1}{u + 1} \right)^{B+1} du } \, , \end{array} \end{aligned} $$

where the change of variable u = λ (‖X‖² + c)∕S is made in the next-to-last step. □

The properties of r(‖X‖², S) in Lemma 3.2 are given in the following result.

Lemma 3.3

The function r(‖X‖², S) given in (3.18) satisfies the following properties:

  1. (i)

    r(‖X‖², S) is nondecreasing in ‖X‖² for fixed S;

  2. (ii)

    r(‖X‖², S) is nonincreasing in S for fixed ‖X‖²; and

  3. (iii)

    0 ≤ r(‖X‖², S) ≤ (A + 1)∕(B − A − 1) = (p + a + b + 2)∕(k − a − 4)

provided the conditions of Lemma 3.2 hold.

Proof

Note first that \(\int _0^t u \, f(u) \, du / \int _0^t f(u) \, du\) is nondecreasing in t for any integrable nonnegative function f(⋅). Hence Part (i) follows since r(‖X‖², S) is the product of two nonnegative nondecreasing functions ‖X‖²∕(‖X‖² + c) and \(\int _0^{(\|X\|{ }^2 + c) / S} u \, f(u) \, du / \) \(\int _0^{(\|X\|{ }^2 + c) / S} f(u) \, du\) for f(u) = u^A (1 + u)^{−(B+1)}.

Part (ii) follows from a similar reasoning since the first term is constant in S and (‖X‖² + c)∕S is decreasing in S.

To show Part (iii) note that, by Parts (i) and (ii),

$$\displaystyle \begin{aligned} \begin{array}{rcl} 0 \leq r(\|X\|{}^2,S) &\displaystyle \leq&\displaystyle \lim_{ \substack{ \|X\|{}^2 \to \infty \\ S \to 0 } } r(\|X\|{}^2,S) \\ &\displaystyle \leq&\displaystyle \frac {\int_0^\infty u^{A+1} \left( \frac{1}{u + 1} \right)^{B+1} du } {\int_0^\infty u^{A} \left( \frac{1}{u + 1} \right)^{B+1} du } \\ &\displaystyle =&\displaystyle \frac {\int_0^1 \lambda^{B-A-2} \, (1 - \lambda)^{A+1} } {\int_0^1 \lambda^{B-A-1} \, (1 - \lambda)^{A} } \\ &\displaystyle =&\displaystyle \frac{A + 1}{B - A - 1}\\ &\displaystyle =&\displaystyle \frac{p + a + b + 2}{k - a - 4} \, , \end{array} \end{aligned} $$

evaluating the beta functions and substituting the values of A and B given in (3.19). □
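The behavior described in Lemma 3.3 is easy to observe numerically. The sketch below (with the arbitrary illustrative values p = 8, k = 10, a = −1, b = 1 and c = 1, which satisfy the conditions of Lemma 3.2) evaluates r(‖X‖², S) from (3.18) on a small grid and prints the bound (A + 1)∕(B − A − 1).

    import numpy as np
    from scipy.integrate import quad

    p, k, a, b, c = 8, 10, -1.0, 1.0, 1.0
    A, B = (p + a + b) / 2, (p + k + b - 2) / 2          # constants of (3.19)

    def r(norm_x2, s):
        upper = (norm_x2 + c) / s
        num, _ = quad(lambda u: u ** (A + 1) * (1 + u) ** (-(B + 1)), 0, upper)
        den, _ = quad(lambda u: u ** A * (1 + u) ** (-(B + 1)), 0, upper)
        return norm_x2 / (norm_x2 + c) * num / den       # formula (3.18)

    print("bound:", (A + 1) / (B - A - 1))               # = (p + a + b + 2)/(k - a - 4)
    for s in (1.0, 5.0):
        # nondecreasing in ||X||^2 for fixed S, smaller for larger S, and below the bound
        print([round(r(t, s), 3) for t in (1.0, 5.0, 20.0, 100.0)])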

We also need the following straightforward generalization of Corollary 2.6. The proof is left to the reader.

Corollary 3.7

Under model (3.14) and loss (3.15) an estimator of the form

$$\displaystyle \begin{aligned} \delta(X,S) = \left( 1 - \frac{S}{\|X\|{}^2} \, r(\|X\|{}^2,S) \right) X \end{aligned}$$

is minimax provided

  1. (i)

    r(‖X‖², S) is nondecreasing in ‖X‖² for fixed S;

  2. (ii)

    r(‖X‖², S) is nonincreasing in S for fixed ‖X‖²; and

  3. (iii)

    0 ≤ r(‖X‖², S) ≤ 2 (p − 2)∕(k + 2).

Combining Lemmas 3.2 and 3.3 and Corollary 3.7 gives the following result.

Theorem 3.5

For the model (3.14) , loss (3.15) and hierarchical prior (3.16) , the generalized or proper Bayes estimator in Lemma 3.2 is minimax provided

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{p + a + b + 2}{k - a - 4} \leq \frac{2 \, (p - 2)}{k + 2} \, . \end{array} \end{aligned} $$
(3.23)

Furthermore, if p ≥ 5, there exist values of a > −2 and b > 0 which satisfy (3.23) , i.e. such that the estimator is proper Bayes, minimax and admissible.

Proof

The first part is immediate. To see the second part, note that it suffices, if a = −2 + 𝜖 and b = δ, for 𝜖, δ > 0, that

$$\displaystyle \begin{aligned} \frac{p}{k - 2} < \frac{p + \epsilon + \delta}{k - 2 - \epsilon} \leq \frac{2 \, (p - 2)}{k + 2} \end{aligned}$$

or equivalently \(p > 4 \, \frac {k - 2}{k - 6}\). Hence, for p ≥ 5 and k sufficiently large, namely k > 2 (3 p − 4)∕(p − 4), there are values of a and b such that the priors are proper. □

Note that there exist values of a and b satisfying (3.23) and the assumptions of Lemma 3.2 whenever p ≥ 3.
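Checking (3.23) for particular (p, k) is elementary; the small scan below (with the arbitrary illustrative choice p = 6 and k = 20 > 2(3p − 4)∕(p − 4) = 14) lists a few pairs (a, b) with a > −2 and b > 0 satisfying the condition.

    import numpy as np

    def satisfies_323(p, k, a, b):
        # condition (3.23)
        return (p + a + b + 2) / (k - a - 4) <= 2 * (p - 2) / (k + 2)

    p, k = 6, 20
    pairs = [(round(a, 2), round(b, 2))
             for a in np.arange(-1.95, 0.0, 0.05)
             for b in np.arange(0.05, 2.0, 0.05)
             if satisfies_323(p, k, a, b)]
    print(pairs[:5])        # e.g. values of a near -2 with small b > 0 satisfy (3.23)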

Strawderman (1973) gave the first examples of generalized and proper Bayes minimax estimators in the unknown variance setting. Zinodiny et al. (2011) also give classes of generalized and proper Bayes minimax estimators along somewhat similar lines as the above. The major difference is that the prior distribution on η (= 1∕σ²) in the above development is also hierarchical, as it also depends on λ.

3.2.2 The Construction of a Class of Generalized Bayes Minimax Estimators

In this subsection we extend the generalized Bayes results of Sect. 3.1.2, using the ideas in Maruyama and Strawderman (2005) and Wells and Zhou (2008), to consider point estimation of the mean of a multivariate normal distribution when the variance is unknown. Specifically, we assume the model (3.14) and the scale invariant squared error loss function (3.15).

In order to derive the (formal) Bayes estimator we reparameterize the model in (3.14) by replacing σ by η^{−1}. The model then becomes

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle X {\sim} {\mathcal N}_p(\theta,\eta^{-2}I_p),\quad S \sim s^{k / 2-1} \, \eta^k \exp(- s \, \eta^2 / 2), \\ &\displaystyle &\displaystyle \theta {\sim} {\mathcal N}_p(0, \nu \, \eta^{-2}I_p),\quad \nu {\sim} h(\nu), \quad \eta \sim \eta^{d} \, , \eta > 0 \, , \end{array} \end{aligned} $$
(3.24)

for some constant d. Under this model, the prior for θ is a scale mixture of normal distributions. Note that the above class of priors cannot be proper due to the impropriety of the distribution of η. However, as a consequence of the form of this model, the resulting generalized Bayes estimator is of the Baranchik form (3.17), with r(‖X‖², S) = r(F), where F = ‖X‖²∕S.

We develop sufficient conditions on k, p, and h(ν) such that the generalized Bayes estimators with respect to the class of priors in (3.24) are minimax under the invariant loss function in (3.15). Maruyama and Strawderman (2005) and Wells and Zhou (2008) were able to obtain such sufficient conditions by applying the bounds and monotonicity results of Baranchik (1970), Efron and Morris (1976), and Fourdrinier et al. (1998).

Before we derive the formula for the generalized Bayes estimator under the model (3.24), we impose three regularity conditions on the parameters of priors. These conditions are easily satisfied by many hierarchical priors. These three conditions are assumed throughout this section.

C1::

A > 1 where \(A =\frac {d + k + p + 3}{2}\);

C2::

\(\; \;\; {\int _0^1 \lambda ^{\frac {p}{2}-2}h\left (\frac {1-\lambda }{\lambda }\right )} \,d \lambda < \infty \); and

C3::

\(\; \; \;\lim _{\nu \rightarrow \infty }\frac {h(\nu )}{(1+\nu )^{p / 2-1}} = 0\).

Now, as in Sect. 3.1, we will first find the form of the Bayes estimator and then show that it satisfies some sufficient conditions for minimaxity. We start with the following lemma that corresponds to (3.2) in the known variance case and (3.18) in the previous subsection.

Lemma 3.4

Under the model in (3.24) , the generalized Bayes estimator can be written as

$$\displaystyle \begin{aligned} \delta(X,S) = X - R(F) \, X = X - \frac{r(F)}{F} \, X, \end{aligned} $$
(3.25)

where F = ‖X‖²∕S,

$$\displaystyle \begin{aligned} R(F) = \frac{ \int_0^1 \lambda^{p / 2 - 1} \, (1 + \lambda \, F)^{-A} \, h\left(\frac{1 - \lambda}{\lambda}\right) \,d \lambda } { \int_0^1 \lambda^{p / 2 - 2} \, (1 + \lambda \, F)^{-A} \, h\left(\frac{1 - \lambda}{\lambda}\right) \,d \lambda }, \end{aligned} $$
(3.26)

and

$$\displaystyle \begin{aligned} r(F) = F \, R(F) \, . \end{aligned} $$
(3.27)

Proof

Under the loss function (3.15), the generalized Bayes estimator for the model (3.24) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \delta(X,S) &\displaystyle =&\displaystyle \frac{E(\frac{\theta}{\sigma^2}|X,S)}{E(\frac{1}{\sigma^2}|X,S)} \\ &\displaystyle =&\displaystyle \frac{\int_0^\infty {h(\nu) \int_0^\infty [(\eta^2)^{A-\frac{1}{2}}e^{-\frac{1}{2}\eta^2 S } \int_{\mathbb{R}^p} (\frac{1}{2\pi \nu \eta^{-2}})^{\frac{p}{2}}\theta e^{-\frac{1}{2} \eta^2(\frac{||\theta||{}^2}{\nu} + ||X-\theta||{}^2)}d\theta]d\eta}\,d \nu}{\int_0^\infty {h(\nu) \int_0^\infty [(\eta^2)^{A-\frac{1}{2}}e^{-\frac{1}{2}\eta^2 S }\int_{\mathbb{R}^p} (\frac{1}{2\pi \nu \eta^{-2}})^{\frac{p}{2}} e^{-\frac{1}{2} \eta^2(\frac{||\theta||{}^2}{\nu}+||X-\theta||{}^2)}d\theta]d\eta}d \nu} \\ &\displaystyle =&\displaystyle \left(1-\frac{\int_0^\infty [(\frac{1}{1+\nu}) h(\nu) (\frac{1}{1+\nu})^{\frac{p}{2}} \int_0^\infty (\eta^2)^{A-\frac{1}{2}} e^{-\frac{1}{2} \eta^2(S+\frac{ ||X||{}^2}{1+\nu})}\,d\eta]\,d\nu}{\int_0^\infty [h(\nu) (\frac{1}{1+\nu})^{\frac{p}{2}} \int_0^\infty (\eta^2)^{A-\frac{1}{2}} e^{-\frac{1}{2} \eta^2(S+\frac{ ||X||{}^2}{1+\nu})}\, d\eta]\,d\nu}\right)\,X \\ &\displaystyle =&\displaystyle \left(1-\frac{\int_0^\infty (\frac{1}{1+\nu})h(\nu)(\frac{1}{1+\nu})^{\frac{p}{2}} (1+\frac{F}{1+\nu})^{-A}\, d\nu}{\int_0^\infty h(\nu)(\frac{1}{1+\nu})^{\frac{p}{2}}(1+\frac{F}{1+\nu})^{-A}\, d\nu}\right)\,X. \end{array} \end{aligned} $$
(3.28)

Letting λ = (1 + ν)^{−1}, we obtain δ(X, S) = (1 − R(F)) X, which gives the form of the generalized Bayes estimator. □

Recall from Stein (1981) that when σ 2 is known the Bayes estimator under squared error loss and corresponding to a prior π(θ) is given by (3.2), that is, \( \delta ^{\pi }(X) = X + \sigma ^{2} \frac {\bigtriangledown m(X)}{m(X)}\).

The form of the Bayes estimator given in (3.25) gives an analogous form with the unknown variance replaced by a multiple of the usual unbiased estimator. In particular, define the “quasi-marginal”

$$\displaystyle \begin{aligned}{\mathbf{M}}(x,s)=\int \int f_X(x) \, f_S(s) \, \pi(\theta,\sigma^2) \, d\theta \, d\sigma^2 \end{aligned}$$

where

$$\displaystyle \begin{aligned}f_X(x) = \left(\frac{1}{2\pi \sigma^2}\right)^{p / 2} e^{-\frac{1}{2\sigma^2} ||x -\theta||{}^2} \end{aligned}$$

and

$$\displaystyle \begin{aligned}f_S(s)=\frac{1}{2^{k / 2}\varGamma(k / 2)}s^{k / 2 - 1} (\sigma^2)^{-k / 2} e^{-\frac{s}{2\sigma^2}}. \end{aligned}$$

A straightforward calculation shows M(x, s) is proportional to

$$\displaystyle \begin{aligned}\int_0^\infty h(\nu) \int_0^\infty [(\eta^2)^{A-\frac{3}{2}}e^{-\frac{1}{2}\eta^2 s }\int_{\mathbb{R}^p} (\frac{1}{2\pi \nu \eta^{-2}})^{\frac{p}{2}}e^{-\frac{1}{2} \eta^2(\frac{||\theta||{}^2}{\nu}+||x-\theta||{}^2)}d\theta]d\eta d\nu. \end{aligned}$$

It is interesting to note the unknown variance analog of (3.2) is

$$\displaystyle \begin{aligned}\delta(X,S)=X-\frac{1}{2}\frac{\nabla_X{\mathbf{M}}(X,S)}{\nabla_S{\mathbf{M}}(X,S)}. \end{aligned}$$

Lastly, note that the exponential term in the penultimate expression in the representation of δ(X, S) in (3.28) (that comes from the normal sampling distribution assumption) cancels. Hence there is a sort of robustness with respect to the sampling distribution. We will develop this theme in greater detail in Chap. 6 in the setting of spherically symmetric distributions.

3.2.2.1 Preliminary Results

The minimax property of the generalized Bayes estimator is closely related to the behavior of the r(F) and R(F) functions, which is in turn closely related to the behavior of

$$\displaystyle \begin{aligned} g(\nu)= -(\nu+1)\frac{h^\prime (\nu)}{h(\nu)}. \end{aligned} $$
(3.29)

Fourdrinier et al. (1998) gave a detailed analysis of this type of function in (3.29). However, their argument was based on the condition that the square root of the marginal is superharmonic. Baranchik (1970) and Efron and Morris (1976) gave certain regularity conditions on the shrinkage function r(⋅) such that an estimator

$$\displaystyle \begin{aligned} \widehat{\theta}(X,S)=X-\frac{r(F)}{F}X \end{aligned} $$
(3.30)

is minimax under the loss function (3.15) for the model (3.14). Both results require an upper bound on r(F) and a condition on how fast R(F) = r(F)∕F decreases with F. Both theorems follow from a general result for spherically symmetric distributions given in Chap. 6 (Proposition 6.1), or by applying Theorem 2.5 in a manner similar to that in Corollary 2.3. The proofs are left to the reader.

Theorem 3.6 (Baranchik 1970)

Assume that r(F) is increasing in F and 0 ≤ r(F) ≤ 2 (p − 2)∕(k + 2). Then any point estimator of the form (3.30) is minimax.
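For instance (a standard special case, included here only as an illustration of Theorem 3.6), taking r(F) ≡ c constant in (3.30) yields the James-Stein-type estimators

$$\displaystyle \begin{aligned} \widehat{\theta}(X,S) = \left(1 - \frac{c\,S}{\| X \|{}^{2}}\right) X, \qquad 0 \leq c \leq \frac{2\,(p-2)}{k+2}, \end{aligned}$$

which are therefore minimax under the loss (3.15).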

Theorem 3.7 (Efron and Morris 1976)

Define \(c_k = \frac {p-2}{k+2}\) . Assume that 0 ≤ r(F) ≤ 2 c k , that for all F with r(F) < 2c k ,

$$\displaystyle \begin{aligned} \frac{F^{p / 2-1} \, r(F)}{(2 - r(F) / c_k)^{1 + 2 \, c_k}} \mathit{\mbox{ is increasing in }} F, \end{aligned} $$
(3.31)

and that, if an F 0 exists such that r(F 0) = 2c k , then r(F) = 2 c k for all F ≥ F 0 . With the above assumptions, the estimator \(\widehat {\theta }(X,S) = X - r(F) / F \; X\) is minimax.

Consequently, to apply these results one has to establish an upper bound for r(F) in (3.27) and the monotonicity property for some variant of r(F). The candidate we use is \(\widetilde {r}(F)=F^cr(F)\) with a constant c. Note that the upper bound 2 c k is exactly the same upper bound needed in Corollary 3.7(iii). We develop the needed results below.

First note that if h(ν) is a continuously differentiable function on [0, ∞), and regularity conditions C1, C2 and C3 hold, then the integrations by parts used in Lemmas 3.5 and 3.6 are valid.

Lemma 3.5

Assume the regularity conditions C1, C2 and C3, and that g(ν) ≤ M, where M is a positive constant and g(ν) is defined as in (3.29) . Then, for the r(F) function (3.27) , we have

$$\displaystyle \begin{aligned} 0 \leq r(F) \leq \frac{\frac{p}{2}-1+M}{A-\frac{p}{2}-M} \, , \end{aligned}$$

where A is defined in condition C1.

Proof

By the definition in (3.26), R(F) ≥ 0. Then r(F) = FR(F) ≥ 0. Note that

$$\displaystyle \begin{aligned}r(F)=F\frac{\int_0^1 \lambda ^{\frac{p}{2}-1}(1+\lambda F)^{-A}h(\frac{1-\lambda}{\lambda})\,d \lambda}{\int_0^1\lambda ^{\frac{p}{2}-2}(1+\lambda F)^{-A}h(\frac{1-\lambda}{\lambda})\,d \lambda } = F\frac{I_{\frac{p}{2}-1,A,h}(F)}{I_{\frac{p}{2}-2,A,h} (F)}, \end{aligned}$$

where we are using the notation

$$\displaystyle \begin{aligned}I_{\alpha,A,h}(F) = \int_0^1 \lambda ^{\alpha}(1+\lambda F)^{-A} h(\frac{1-\lambda}{\lambda})\,d \lambda \, . \end{aligned}$$

Using integration by parts , we obtain

$$\displaystyle \begin{aligned} \begin{array}{rcl} FI_{\frac{p}{2}-1,A, h}(F)&\displaystyle =&\displaystyle \int_0^1 \lambda^{p/2-1} h\left(\frac{1-\lambda}{\lambda}\right) d \left[\frac{(1+\lambda F)^{1-A}}{1-A}\right]\\ &\displaystyle =&\displaystyle \lambda^{\frac{p}{2}-1}h\left(\frac{1-\lambda}{\lambda}\right)\frac{(1+\lambda F)^{1-A}}{1-A}|{}_0^1 +\frac{1}{A-1}\int_0^1 (1+\lambda F)^{-A}(1+\lambda F)\\&\displaystyle &\displaystyle \left[\left(\frac{p}{2}-1\right)\lambda^{\frac{p}{2}-2}h\left(\frac{ 1-\lambda}{\lambda}\right) -\frac{1}{\lambda^2}\lambda^{\frac{p}{2}-1} h^\prime\left(\frac{1-\lambda}{\lambda}\right)\right]\,d \lambda. \end{array} \end{aligned} $$

By C1 and C3, we know that the first term of the right hand side is nonpositive. The second term of the right hand side can be written as N 1 + N 2 + N 3 + N 4 where

$$\displaystyle \begin{aligned}N_1=\frac{1}{A-1}\int_0^1 (1+\lambda F)^{-A}\left(\frac{p}{2}-1\right) \lambda^{\frac{p}{2}-2} h\left(\frac{1-\lambda}{\lambda}\right)\,d\lambda = \frac{\frac{p}{2}-1}{A-1}I_{\frac{p}{2}-2,A,h}(F),\end{aligned}$$
$$\displaystyle \begin{aligned} \begin{array}{rcl} N_2&\displaystyle =&\displaystyle \frac{1}{A-1}\int_0^1 (1+\lambda F)^{-A}\lambda^{\frac{p}{2}-2}h^\prime\left(\frac{1-\lambda} {\lambda}\right)\left(\frac{-\lambda}{\lambda^2}\right)\,d \lambda\\ &\displaystyle =&\displaystyle \frac{I_{\frac{p}{2}-2,A,h}(F)}{A-1}\frac{\int_0^1 \lambda^{\frac{p}{2}-2} (1+\lambda F)^{-A} g(\frac{1-\lambda}{\lambda})h(\frac{1-\lambda}{\lambda})\,d \lambda}{\int_0^1 \lambda^{\frac{p}{2}-2} (1+\lambda F)^{-A}h(\frac{1-\lambda}{\lambda})\,d \lambda} \\ &\displaystyle \leq&\displaystyle \frac{M}{A-1}I_{\frac{p}{2}-2,A,h}(F), \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned}N_3=\frac{\frac{p}{2}-1}{A-1}FI_{\frac{p}{2}-1,A,h}(F)=\frac{(\frac{p}{2}-1)r(F) }{A-1}I_{\frac{p}{2}-2,A,h}(F), \end{aligned}$$

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} N_4&\displaystyle =&\displaystyle \frac{I_{\frac{p}{2}-2,A,h}(F)}{A-1}\frac{F\int_0^1 \lambda^{\frac{p}{2}-1}(1+\lambda F)^{-A} h^\prime(\frac{1-\lambda}{\lambda})(\frac{-1}{\lambda})d\lambda}{I_{\frac{p}{2} -2,A,h}(F)}\\ &\displaystyle =&\displaystyle \frac{I_{\frac{p}{2}-2,A,h}(F)}{A-1}\frac{F \int_0^1 (1+\lambda F)^{-A}\lambda^{\frac{p}{2}-1}g(\frac{1-\lambda}{\lambda}) h(\frac{1-\lambda}{\lambda})d\lambda}{I_{\frac{p}{2}-2,A,h}(F)} \\&\displaystyle \leq&\displaystyle \frac{Mr(F)}{A-1}I_{\frac{p}{2}-2,A,h}(F). \end{array} \end{aligned} $$

Combining all the terms, we get the following inequality

$$\displaystyle \begin{aligned} (A-1)r(F) \leq \left(\frac{p}{2}-1\right)+M+\left(\frac{p}{2}-1\right)r(F)+Mr(F)\Rightarrow r(F) \leq \frac{\frac{p}{2}-1+M}{A-\frac{p}{2}-M}.\end{aligned}$$

Therefore, we have the needed bound on the r(F) function. □

We will now show that under certain regularity conditions on g(ν), we have the monotonicity property for \(\tilde {r}(F)=F^cr(F)\) with a constant c. This monotonicity property enables us to establish the minimaxity of the generalized Bayes estimator. The following lemma is analogous to Theorem 3.3 in the known variance case.

Lemma 3.6

If \(g(\nu )= -(\nu +1)\frac {h^\prime (\nu )}{h(\nu )}=l_1(\nu )+l_2(\nu )\) such that l 1(ν) is increasing in ν and 0 ≤ l 2(ν) ≤ c, then \(\widetilde {r}(F)=F^{c}r(F)\) is nondecreasing.

Proof

By taking the derivative, we only need to show (since r(F) = FR(F))

$$\displaystyle \begin{aligned} 0 \leq FR^\prime(F) + (1+c)R(F), \end{aligned} $$
(3.32)

which is equivalent to

$$\displaystyle \begin{aligned} 0 \leq F\frac{I^\prime_{\frac{p}{2}-1,A,h}(F)I_{\frac{p}{2}-2,A,h}(F)-I^\prime_{\frac{ p}{2}-2,A,h}(F)I_{\frac{p}{2}-1,A,h}(F)}{I^2_{\frac{p}{2}-2,A,h}(F)} + (1+c)\frac{I_{\frac{p}{2}-1,A,h}(F)}{I_{\frac{p}{2}-2,A,h}(F)}. \end{aligned}$$

This is in turn equivalent to

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle {-FI^\prime_{\frac{p}{2}-1,A,h}(F)I_{\frac{p}{2}-2,A,h}(F)}\\ &\displaystyle &\displaystyle \ \leq - FI^\prime_{\frac{p}{2}-2,A,h}(F)I_{\frac{p}{2}-1,A,h}(F) + (1+c)I_{\frac{p}{2}-2,A,h}(F)I_{\frac{p}{2}-1,A,h}(F). \end{array} \end{aligned} $$
(3.33)

Now note that

$$\displaystyle \begin{aligned}-FI^\prime_{a,A,h}(F) = \int_0^1 \lambda^{a} (1+\lambda F)^{-A}h\left(\frac{1-\lambda}{\lambda}\right)\frac{A\lambda F}{1+\lambda F}d\lambda.\end{aligned}$$

Define the integral operator

$$\displaystyle \begin{aligned} J_{a}\left(f\left(u\right)\right)=\int_0^F u^{a}(1+u)^{-A}f\left(u\right)\, du. \end{aligned}$$

Therefore,

$$\displaystyle \begin{aligned}J_{a}\left(h\left(\frac{F-u}{u}\right)\right)=\int_0^F u^{a}(1+u)^{-A}h\left(\frac{F-u}{u}\right)\, du\end{aligned}$$

and

$$\displaystyle \begin{aligned} J_{a}\left(\frac{Au}{1+u}h\left(\frac{F-u}{u}\right)\right) = \int_0^F u^{a}(1+u)^{-A}\frac{Au}{1+u}h\left(\frac{F-u}{u}\right)\,du.\end{aligned}$$

Also, note that

$$\displaystyle \begin{aligned}J_{a}\left(\frac{Au}{1+u}h\left(\frac{F-u}{u}\right)\right)=F^{a+1}\int_0^1 \lambda^{a} (1+\lambda F)^{-A}h\left(\frac{1-\lambda}{\lambda}\right)\frac{A\lambda F}{1+\lambda F}\,d\lambda,\end{aligned}$$

and

$$\displaystyle \begin{aligned}J_{a}\left(h\left(\frac{F-u}{u}\right)\right)=F^{a+1}I_{a,A,h}(F).\end{aligned}$$

Now, with this new notation, it follows that (3.33) is equivalent to

$$\displaystyle \begin{aligned} \frac{J_{\frac{p}{2}-1}(\frac{Au}{1+u}h(\frac{F-u}{u}))}{J_{\frac{p}{2}-1} (h(\frac{F-u}{u}))} \leq \frac{J_{\frac{p}{2}-2}(\frac{Au}{1+u}h(\frac{F-u}{u}))}{J_{\frac{p}{2}-2} (h(\frac{F-u}{u}))} + (1+c). \end{aligned} $$
(3.34)

Using integration by parts , we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle J_{a}\left(\frac{Au}{1+u}h\left(\frac{F-u}{u}\right)\right) = \int_0^F u^a(1+u)^{-A}h\left(\frac{F-u}{u}\right)\frac{Au}{1+u}\,du\\ &\displaystyle =&\displaystyle -u^{a+1}h\left(\frac{F-u}{u}\right)(1+u)^{-A}|{}_0^F\\ &\displaystyle &\displaystyle + \int_0^F(1+u)^{-A}\left[(a+1)u^ah\left(\frac{F-u}{u}\right) + u^{a+1}h^\prime\left(\frac{F-u}{u}\right)\left(\frac{-F}{u^2}\right)\right] du. \end{array} \end{aligned} $$

Hence, (3.34) is equivalent to

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle {\frac{-F^{\frac{p}{2}}h(0)(1+F)^{-A}}{J_{\frac{p}{2}-1}(h(\frac{F-u}{u} ))}+\left(\frac{p}{2}\right)} \\ &\displaystyle &\displaystyle + \frac{\int_0^F u^{\frac{p}{2}-1}(1+u)^{-A}h(\frac{F-u}{u})\left[\frac{h^\prime(\frac{F-u}{u})} {h(\frac{F-u}{u})}(\frac{-F}{u})\right]\,du}{\int_0^F u^{\frac{p}{2}-1}(1+u)^{-A}h(\frac{F-u}{u})\,du}\\ &\displaystyle &\displaystyle \leq \frac{-F^{\frac{p}{2}-1}h(0)(1+F)^{-A}}{J_{\frac{p}{2}-2} (h(\frac{F-u}{u}))}+\left(\frac{p}{2}-1\right)\\ &\displaystyle &\displaystyle +\frac{\int_0^Fu^{\frac{p}{2}-2}(1+u)^{-A}h(\frac{F-u}{u}) \left[\frac{h^\prime(\frac{F-u}{u})}{h(\frac{F-u}{u})}(\frac{-F}{u})\right]\,du} {\int_0^F u^{\frac{p}{2}-2}(1+u)^{-A}h(\frac{F-u}{u})\,du} + (1+c). \end{array} \end{aligned} $$
(3.35)

Since \(-(\nu + 1)h^\prime (\nu )/h(\nu ) = l_1(\nu ) + l_2(\nu )\), (3.35) is equivalent to

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle \frac{-h(0)(1+F)^{-A}}{I_{\frac{p}{2}-1,A,h}(F)} + \frac{J_{\frac{p}{2}-1}(h(\frac{F-u}{u})l_1(\frac{F-u}{u}))}{J_{\frac{p}{2}-1} (h(\frac{F-u}{u}))} + \frac{J_{\frac{p}{2}-1}(h(\frac{F-u}{u})l_2(\frac{F-u}{u}))}{J_{\frac{p}{2}-1} (h(\frac{F-u}{u}))} \\ &\displaystyle &\displaystyle \leq \frac{-h(0)(1+F)^{-A}}{I_{\frac{p}{2}-2,A,h}(F)}+\frac{J_{\frac{p}{2}-2} (h(\frac{F-u}{u})l_1(\frac{F-u} {u}))}{J_{\frac{p}{2}-2}(h(\frac{F-u}{u}))} + \frac{J_{\frac{p}{2}-2}(h(\frac{F-u}{u})l_2(\frac{F-u}{u}))}{J_{\frac{p}{2}-2} (h(\frac{F-u}{u}))} + c. \end{array} \end{aligned} $$
(3.36)

It is clear that \(I_{\frac {p}{2}-1,A,h}(F) \leq I_{\frac {p}{2}-2,A,h}(F),\) so we then have

$$\displaystyle \begin{aligned}\frac{-h(0)(1+F)^{-A}}{I_{\frac{p}{2}-1,A,h}(F)} \leq \frac{-h(0)(1+F)^{-A}}{I_{\frac{p}{2}-2,A,h}(F)} \end{aligned}$$

which accounts for the first terms on the left and right hand sides of (3.36). As for the second term on each side of (3.36), note that the hypothesis that l 1(ν) is increasing in ν implies that, for each fixed F, \(l_1(\frac {F-u}{u})\) is decreasing in u; in particular, for t < u we have \(l_1\left(\frac{F-t}{t}\right) \geq l_1\left(\frac{F-u}{u}\right)\).

By a monotone likelihood ratio argument, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle {\frac{J_{\frac{p}{2}-1}(h(\frac{F-u}{u})l_1(\frac{F-u}{u}))}{J_{\frac{p }{2}-1}(h(\frac{F-u}{u}))} = \frac{\int_0^F u^{\frac{p}{2}-1}(1+u)^{-A}h(\frac{F-u}{u})l_1(\frac{F-u}{u})}{\int_0^F u^{\frac{p}{2}-1}(1+u)^{-A}h(\frac{F-u}{u})\,du}}\\ &\displaystyle &\displaystyle \leq \frac{\int_0^F u^{\frac{p}{2}-2}(1+u)^{-A}h(\frac{F-u}{u})l_1(\frac{F-u}{u})\,du}{\int_0^F u^{\frac{p}{2}-2}(1+u)^{-A}h(\frac{F-u}{u})\,du} = \frac{J_{\frac{p}{2}-2}(h(\frac{F-u}{u})l_1(\frac{F-u}{u}))}{J_{\frac{p}{2}-2} (h(\frac{F-u}{u}))}. \end{array} \end{aligned} $$

Finally, note that since 0 ≤ l 2(v) ≤ c for the third term on each side of (3.36) we have

$$\displaystyle \begin{aligned}0 \leq \frac{J_{\frac{p}{2}-i}(l_2(\frac{F-u}{u})h(\frac{F-u}{u}))} {J_{\frac{p}{2}-i}(h(\frac{F-u}{u}))} \leq c \; \; {\mathrm{for}}\; i=1, 2. \end{aligned}$$

Therefore we established the inequality (3.36) and the proof is complete. □

3.2.2.2 Minimaxity of the Generalized Bayes Estimators

In this subsection we apply Lemmas 3.4, 3.5, 3.6 and Theorems 3.6 and 3.7 to show minimaxity of the generalized Bayes estimator (3.25).

Theorem 3.8

Assume that g(ν) = −(ν + 1) h (ν)∕h(ν) is increasing in ν, g(ν) ≤ M, where M is a positive constant, and

$$\displaystyle \begin{aligned} \frac{p-2+2M}{k+3+d -2M} \leq 2 \, \frac{p-2}{k+2} \, . \end{aligned}$$

Then δ(X, S) in (3.25) is minimax.

Proof

Let l 2(ν) = 0 and l 1(ν) = g(ν). By applying Lemma 3.6 to the case c = 0, we have r(F) increasing in F. Applying the bound in Lemma 3.5 together with the assumed inequality, we get \(0 \leq r(F) \leq 2\frac {p-2}{k+2}\). Therefore, by Theorem 3.6, δ(X, S) is minimax. □

It is interesting to make connections to the result in Faith (1978), who considered generalized Bayes estimators for \(\mathcal {N}_p(\theta , I_p)\) and showed that when g(ν) is increasing in ν and \(M \leq \frac {p-2}{2}\), the generalized Bayes estimator is minimax. By taking k →∞, we deduce the same conditions as Faith (1978). The next theorem is a variant of a result of Alam (1973) for the known variance case.

Theorem 3.9

Define \(c_k=\frac {p-2}{k+2}\) . If there exists b ∈ (0, 1] and \(c=\frac {b(p-2)}{4+4(2-b)c_k}\) , such that 0 ≤ r(F) ≤ (2 − b)c k , and F c r(F) is increasing in F, then the generalized Bayes estimator δ(X, S) in (3.25) is minimax.

Proof

By differentiating the Efron and Morris condition, (3.31) can be satisfied by requiring

$$\displaystyle \begin{aligned} 0 \leq 2\left(\frac{p}{2}-1\right)R(F)\left(2-\frac{r(F)}{c_k} \right)+4r^\prime(F)(1+r(F)). \end{aligned} $$
(3.37)

Since r(F) ≤ (2 − b)c k, (3.37) is automatically satisfied at any point where r′(F) ≥ 0. Now set β = (2 − b)c k; since r(F) ≤ β,

$$\displaystyle \begin{aligned} 4r^\prime(F)(1+\beta) \leq 4r^\prime(F)(1+r(F)), \end{aligned} $$
(3.38)

at any point where r′(F) < 0. We now have

$$\displaystyle \begin{aligned} \begin{array}{rcl} 0 &\displaystyle \leq&\displaystyle (4+4\beta)(cR(F)+R(F)+FR^\prime(F))\\ &\displaystyle =&\displaystyle 2b\left(\frac{p}{2}-1\right)R(F)+4r^\prime(F)(1+\beta)\\ &\displaystyle \leq&\displaystyle 2\left(\frac{p}{2}-1\right)R(F)\left(2-\frac{r(F)}{c_k} \right)+4r^\prime(F)(1+r(F)) \end{array} \end{aligned} $$

since F c r(F) is increasing in F. Thus, for all values of F, we have proven (3.37), and combining with the bound on the r(F) function, we have proven the minimaxity of the generalized Bayes estimator. □

It is interesting to observe that by requiring a tighter upper bound on r(F), we can relax the monotonicity requirement on r(F). The tighter the upper bound, the more flexible r(F) can be. This result enriches the class of priors whose generalized Bayes estimators are minimax. Direct application of Lemmas 3.4, 3.5, and 3.6, together with Theorem 3.9, gives the following theorem.

Theorem 3.10

If there exists b ∈ (0, 1] such that g(ν) = l 1(ν) + l 2(ν) ≤ M, and l 1(ν) is increasing in ν, \(0 \leq l_2(\nu ) \leq c=\frac {b(p-2)}{4+4(2-b)\frac {p-2}{k+2}}\) , and \(\frac {p-2+2M}{k+3+d-2M}\leq \frac {(2-b)(p-2)}{k+2}\) , then the generalized Bayes estimator δ(X, S) in (3.25) is minimax.

3.2.2.3 Examples of the Priors in (3.24)

In this subsection, we will give several examples to which our results can be applied and make some connection to the existing literature found in Maruyama and Strawderman (2005) and Fourdrinier et al. (1998).

Example 3.7

Maruyama and Strawderman (2005) considered the priors with \(h(\nu ) \propto \nu ^{b}(1 + \nu )^{-a-b-2}\) for b > 0 and showed that \(r(F) \leq \frac {\frac {p}{2} + a+1}{\frac {k}{2} + \frac {d}{2} -a -\frac {1}{2}}\) (in terms of the Maruyama and Strawderman (2005) notation, d = 2e + 1). Condition C1 is equivalent to the condition that d + k + p > −1. C2 and C3 are equivalent here, and both are equivalent to the condition that \(a+\frac {p}{2}+1 >0\). Then, to apply Theorem 3.8, note that \(g(\nu ) = a + 2 - b\,\nu ^{-1}\). The condition that g(ν) is increasing in ν is equivalent to the condition that b ≥ 0. Clearly, we can let M = a + 2. Then the condition of Theorem 3.8 is that

$$\displaystyle \begin{aligned}\frac{k}{2} + \frac{d}{2} -\frac{1}{2}>a \;\;\;\ {\mathrm{and}} \;\;\;\ \frac{\frac{p}{2} + a+1}{\frac{k}{2} + \frac{d}{2} -a -\frac{1}{2}} \leq 2c_k. \end{aligned}$$

A close examination of the Maruyama and Strawderman (2005) proof shows that their upper bound on r(F) is sharp. This implies that our bound in Lemma 3.5 cannot be relaxed.
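As a check on the expression for g(ν) used in this example, note that, with \(h(\nu ) \propto \nu ^{b}(1+\nu )^{-a-b-2}\), a short computation (a sketch) gives

$$\displaystyle \begin{aligned} \frac{h^{\prime}(\nu)}{h(\nu)} = \frac{b}{\nu} - \frac{a+b+2}{1+\nu}, \qquad g(\nu) = -(\nu+1)\,\frac{h^{\prime}(\nu)}{h(\nu)} = a + 2 - \frac{b}{\nu}, \end{aligned}$$

which is increasing in ν precisely when b ≥ 0 and is bounded above by M = a + 2.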

Example 3.8

Generalized Student-t priors correspond to a mixing distribution of the form

$$\displaystyle \begin{aligned}h(\nu)=c(\nu+1)^{\beta-\alpha-\gamma-\frac{p-2}{2}}\nu^{\gamma-\beta}e^{\frac{ \gamma}{\nu}} \, . \end{aligned}$$

Consider the following two cases. The first case where α ≤ 0, β ≤ 0 and γ < 0 involves the construction of a monotonic r(⋅) function. The second case where α ≤ 0, β > 0 and γ < 0 does not require the r(⋅) function to be monotonic. In both cases,

$$\displaystyle \begin{aligned}\ln h(\nu) = (\beta-\alpha-\gamma-\frac{p-2}{2})\ln(1+\nu)+(\gamma-\beta)\ln\nu+\frac{\gamma} {\nu} \end{aligned}$$

and

$$\displaystyle \begin{aligned} g(\nu) = \left(\frac{p-2}{2}+\alpha+\gamma-\beta\right)+\frac{(1+\nu)(\beta-\gamma)}{\nu} +\frac{\gamma(1+\nu)}{\nu^2} = \frac{p-2}{2}+\alpha+\frac{\beta}{\nu}+\frac{\gamma}{\nu^2} \, . \end{aligned}$$

Clearly, g(ν) is monotonic in the first case, and minimaxity of the generalized Bayes estimator follows when

$$\displaystyle \begin{aligned}0 \leq \frac{p-2+\alpha}{\frac{k}{2}+\frac{1}{2}+\frac{d}{2}-\frac{p}{2}-\alpha} \leq \frac{p-2}{\frac{k}{2}+1} \end{aligned}$$

in addition to the conditions C1, C2, and C3. In the limiting case where k →∞, C1 holds trivially. Both C2 and C3 can be satisfied by α > 2 − p. The upper bound on R(F) can be satisfied by any α ≤ 0. Consequently, the conditions reduce to those in Example 3.4 for the case of known variance.

Next we consider spherical multivariate Student-t priors with f degrees of freedom and a scale parameter τ, with \(\alpha =\frac {f-p+4}{2}\), \(\beta =\frac {f(1-\tau )+2}{2}\), and \(\gamma =-\frac {f\tau }{2}\). The case of τ = 1 is of particular interest but does not necessarily give a monotonic r(⋅) function. However, we can use the result in Theorem 3.10 to show that the generalized Bayes estimator is minimax under the following conditions: for f ≤ p − 4, suppose there exists a constant b ∈ (0, 1] such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{p+f+\frac{1}{f}}{k+1+d-f-\frac{1}{f}} &\displaystyle \leq&\displaystyle (2-b) \frac{p-2}{k+2}, \\ \frac{1}{2f} &\displaystyle \leq&\displaystyle c=\frac{b(p-2)}{4+4(2-b)\frac{p-2}{k+2}}. \end{array} \end{aligned} $$
(3.39)

Condition (3.39) can be established by observing that for this case,

$$\displaystyle \begin{aligned}g(\nu)=\frac{p-2}{2}+\alpha+\frac{\beta}{\nu}+\frac{\gamma}{\nu^2}=\frac{f}{2} +1+\frac{1}{\nu}-\frac{f}{2\nu^2} \end{aligned}$$

is clearly nonmonotonic. We then let \(M=\frac {f}{2}+1+\frac {1}{2f}\) and apply Lemma 3.5 to get the upper bound on r(⋅). We define \(l_1(\nu )=g(\nu )-\frac {1}{2f}\) when ν ≤ f and \(l_1(\nu )=\frac {f}{2}+1\) otherwise. We also define \(l_2(\nu )=\frac {1}{2f}\) when ν ≤ f and \(l_2(\nu )=\frac {1}{\nu }-\frac {f}{2\nu ^2}\) otherwise. By applying Lemma 3.6, we get condition (3.39).

The spherical multivariate Cauchy prior corresponds to the case f = 1. If k = O(p) and d = 3, then condition (3.39) reduces to p ≥ 5, \(\frac {p+2}{k+2} \leq (2-b) \frac {p-2}{k+2}\), and \(\frac {1}{2} \leq \frac {b(p-2)}{4+8-4b}\).

3.3 Results for Known Σ and General Quadratic Loss

3.3.1 Results for the Diagonal Case

Much of this section is based on the review in Strawderman (2003). We begin with a discussion of the multivariate normal case where \(\varSigma = {\mathrm {diag}} (\sigma ^2_{1},\ldots ,\sigma ^2_p)\) is diagonal, which we assume throughout this subsection. Let

$$\displaystyle \begin{aligned} X\sim {\mathcal N}_p(\theta,\varSigma) \end{aligned} $$
(3.40)

and the loss be equal to a weighted sum of squared errors loss

$$\displaystyle \begin{aligned} L(\theta,\delta) = (\delta - \theta)^{\scriptscriptstyle{\mathrm{T}}} D(\delta - \theta) = \sum_{i = 1}^p (\delta_i-\theta_i)^2 d_i \, . \end{aligned} $$
(3.41)

The results in Sects. 2.3, 2.4 and 3.1 extend by the use of Stein’s lemma in a straightforward way to give the following basic theorem.

Theorem 3.11

Let X have the distribution (3.40) and let the loss be given by (3.41) .

  1. (1)

    If δ(X) = X + Σg(X), where g(X) is weakly differentiable and E||g||2 < ∞, then the risk of δ is

    $$\displaystyle \begin{aligned} \begin{array}{rcl} R(\delta,\boldsymbol\theta) &\displaystyle =&\displaystyle E_{\boldsymbol\theta} ((\delta - \boldsymbol\theta)^{\scriptscriptstyle{\mathrm{T}}} D (\delta -\boldsymbol\theta)) \\ &\displaystyle =&\displaystyle tr (\varSigma D)+ E_{\theta} \left[ {\sum_{i = 1}^p {\sigma _i^4 } d_i \left( {g_i^2 \left( X \right) + 2\frac{{\partial g_i \left( X \right)}}{{\partial X_i }}} \right)} \right]. \end{array} \end{aligned} $$
  2. (2)

    If θ∼ π(θ), then the Bayes estimator of θ is \(\delta _{\varPi } (X) = X + \varSigma \frac {{\nabla m(X)}}{{m(X)}},\) where m(X) is the marginal distribution of X.

  3. (3)

    If θ∼ π(θ), then the risk of a proper (generalized, pseudo-) Bayes estimator of the form \(\delta _m(X) = X+\varSigma \frac {{\nabla m(X)}}{{m(X)}}\) is given by

    $$\displaystyle \begin{aligned} \begin{array}{rcl} R(\delta_m, \theta) &\displaystyle =&\displaystyle {\mathrm{tr}} (\varSigma D) \\ &\displaystyle +&\displaystyle E_\theta \left[ \frac{2\,m(X)\sum_{i=1}^p\sigma_i^4 d_i\, \partial^2 m(X) /\partial X_i^2}{m^2(X)}- \frac{\sum_{i=1}^p\sigma_i^4 d_i \left(\partial m(X) /\partial X_i\right)^2}{m^2(X)} \right]\\ &\displaystyle =&\displaystyle {\mathrm{tr}} (\varSigma D) + 4 \, E_{\theta} \left[ \frac{\sum_{i=1}^p \sigma_i^4 d_i\, \partial^2\sqrt{m(X)}/\partial X_i^2}{\sqrt{m(X)}} \right]. \end{array} \end{aligned} $$
  4. (4)

    If \(\frac {\sum \limits _{i=1}^p \sigma _i^4 d_i\, \partial ^2\sqrt {m(X)}/\partial X_i^2}{\sqrt {m(X)}}\) is nonpositive, the proper (generalized, pseudo) Bayes δ m(X) is minimax.

The proof follows closely that of the corresponding results in Sects. 2.3, 2.4 and 3.1. The result is essentially from Stein (1981).
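To indicate the key step for Part (1) (a sketch only; the full argument parallels Sects. 2.3 and 2.4), expand the risk as tr(ΣD) plus a cross term and a quadratic term, and apply Stein's identity coordinatewise to the cross term:

$$\displaystyle \begin{aligned} 2\,E_{\theta}\!\left[(X-\theta)^{\scriptscriptstyle{\mathrm{T}}} D\,\varSigma\, g(X)\right] = 2\sum_{i=1}^{p} d_{i}\,\sigma_{i}^{2}\, E_{\theta}\!\left[(X_{i}-\theta_{i})\, g_{i}(X)\right] = 2\sum_{i=1}^{p} d_{i}\,\sigma_{i}^{4}\, E_{\theta}\!\left[\frac{\partial g_{i}(X)}{\partial X_{i}}\right], \end{aligned}$$

while \(E_{\theta}[\,g(X)^{\scriptscriptstyle{\mathrm{T}}}\varSigma D \varSigma\, g(X)\,] = E_{\theta}[\sum_{i=1}^{p}\sigma_{i}^{4} d_{i}\, g_{i}^{2}(X)]\); together these give the displayed risk expression.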

A key observation that allows us to construct Bayes minimax procedures for this situation, based on the procedures for the case Σ = D = I, is the following straightforward result from Strawderman (2003).

Lemma 3.7

Suppose η(X) is such that \(\varDelta \eta (X) =\sum \limits _{i = 1}^p \partial ^2 \eta (X) /\partial X_i^2 \le 0\) (i.e., η(X) is superharmonic). Then \(\eta ^{*}(X) = \eta (\varSigma ^{-1} D^{-1/2} X)\) is such that \(\sum \limits _{i = 1}^p \sigma _i^4 d_i\, \partial ^2 \eta ^{*}(X)/\partial X_i^2 \le 0\).
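Although no proof is given here, Lemma 3.7 follows from a one-line change of variables (a sketch): with \(y = \varSigma^{-1} D^{-1/2} x\), so that \(y_i = x_i/(\sigma_i^2 \sqrt{d_i})\) in the diagonal case,

$$\displaystyle \begin{aligned} \frac{\partial^{2} \eta^{*}(x)}{\partial x_{i}^{2}} = \frac{1}{\sigma_{i}^{4} d_{i}}\,\frac{\partial^{2} \eta(y)}{\partial y_{i}^{2}}, \qquad \mbox{so that} \qquad \sum_{i=1}^{p} \sigma_{i}^{4} d_{i}\, \frac{\partial^{2} \eta^{*}(x)}{\partial x_{i}^{2}} = \varDelta \eta\big(\varSigma^{-1} D^{-1/2} x\big) \leq 0 . \end{aligned}$$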

Note that, for any scalar a, if η(X) is superharmonic, then so is η(aX). This leads to the following result.

Theorem 3.12

Suppose X has the distribution (3.40) and the loss is given by (3.41) .

  1. (1)

    Suppose \(\sqrt {m(X)}\) is superharmonic (m(X) is a proper, generalized, or pseudo-marginal for the case Σ = D = I). Then

    $$\displaystyle \begin{aligned}\delta_m(X) = X+\varSigma \left( {\frac{{\nabla m(\varSigma ^{ - 1} D^{ - 1/2} X)}}{{m(\varSigma ^{ - 1} D^{ - 1/2} X)}}} \right)\end{aligned}$$

    is a minimax estimator.

  2. (2)

    If \(\sqrt {m(\left \| X \right \|{ }^2 )}\) is spherically symmetric and superharmonic, then

    $$\displaystyle \begin{aligned}\delta_m (X) = X+ \frac{{2m^\prime(X^{\scriptscriptstyle{\mathrm{T}}} \,\varSigma ^{ - 1} D^{ - 1} \varSigma ^{ - 1} X)D^{ - 1} \varSigma ^{ - 1} X}}{{m(X^{\scriptscriptstyle{\mathrm{T}}}\,\varSigma ^{ - 1} D^{ - 1} \varSigma ^{ - 1} X)}}\end{aligned}$$

    is minimax.

  3. (3)

    Suppose the prior distribution π(θ) has the hierarchical structure \(\theta |\lambda \sim \mathcal {N}_p(0, A_\lambda )\) for λ ∼ h(λ), 0 < λ < 1, where \(A_\lambda = (c/\lambda )\,\varSigma D \varSigma - \varSigma \), c is such that A 1 is positive definite, and h(λ) satisfies the conditions of Theorem 3.12 . Then

    $$\displaystyle \begin{aligned}\delta_\pi(X) = X+\varSigma \frac{{\nabla m(X)}}{{m(X)}}\end{aligned}$$

    is minimax.

  4. (4)

    Suppose m i(X), i = 1, 2, …, k, are superharmonic. Then the multiple shrinkage estimator

    is a minimax multiple shrinkage estimator.

Proof

Part (1) follows directly from Parts (3) and (4) of Theorem 3.11 and Lemma 3.7. Part (2) follows from Part (1) and Part (2) of Theorem 3.11 with a straightforward calculation.

For Part (3), first note that \(\theta |\lambda \sim \mathcal {N}_p(0, A_\lambda )\) and \(X - \theta |\lambda \sim \mathcal {N}_p(0,\varSigma )\). Thus, X − θ and θ are conditionally independent given λ. Hence we have \(X|\lambda \sim \mathcal {N}_p(0, A_\lambda + \varSigma )\). It follows that

but , where \(\sqrt {\eta \left ( {X^{\scriptscriptstyle {\mathrm {T}}}\,X} \right )}\) is superharmonic by Theorem 3.11. Hence, by Part (2), δ π(X) is minimax (and proper or generalized Bayes depending on whether h(λ) is integrable or not).

Since superharmonicity of η(X) implies the superharmonicity of \(\sqrt {\eta \left ( {\,X} \right )}\), Part (4) follows from Part (1) and the superharmonicity of mixtures of superharmonic functions. □

Example 3.9 (Pseudo-Bayes minimax estimators)

When Σ = D = σ 2 I, we saw earlier that by choosing \(m(X) = \frac {1}{{\left \| X \right \|{ }^{2b} }}\), the pseudo-Bayes estimator was the James-Stein estimator \(\delta _m(X) = (1- \frac {{2b\sigma ^2 }}{{\left \| X \right \|{ }^2 }})X\). It now follows from this and part (2) of Theorem 3.12 that m(X T Σ −1 D −1 Σ −1 X) = (1∕X T Σ −1 D −1 Σ −1 X)b has associated with it the pseudo-Bayes estimator \(\delta _m(X) = (1- \frac {{2 bD^{-1} \varSigma ^{-1} }}{{\left ( {X^{\scriptscriptstyle {\mathrm {T}}}\,\varSigma ^{-1} D^{-1} \varSigma ^{ - 1} X} \right )}})X\). This estimator is minimax for 0 < b ≤ p − 2.
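A small numerical sketch of this pseudo-Bayes estimator for diagonal Σ and D follows; the particular σ_i², d_i, b, and X below are illustrative choices of ours, not from the text.

```python
import numpy as np

# Illustrative diagonal Sigma and D (choices are ours, not from the text)
p = 6
sigma2 = np.array([1.0, 2.0, 0.5, 1.5, 1.0, 3.0])   # diagonal entries of Sigma
d = np.array([1.0, 0.5, 2.0, 1.0, 1.5, 1.0])        # diagonal entries of D
b = (p - 2) / 2                                      # a value inside the minimax range

def delta_m(x):
    # quadratic form X^T Sigma^{-1} D^{-1} Sigma^{-1} X (diagonal case)
    q = np.sum(x ** 2 / (sigma2 ** 2 * d))
    # X - (2b / q) D^{-1} Sigma^{-1} X, written componentwise
    return x - (2.0 * b / q) * x / (d * sigma2)

x = np.array([1.0, -2.0, 0.3, 4.0, -1.0, 2.5])
print(delta_m(x))
```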

Example 3.10 (Hierarchical proper Bayes minimax estimator)

As suggested by Berger (1976), suppose the prior distribution has the hierarchical structure \(\theta |\lambda \sim {\mathcal N}_p(0, A_\lambda )\) where \(A_\lambda = (c/\lambda )\,\varSigma D \varSigma - \varSigma \), \(c > 1/\min (\sigma _i^2 d_i)\) and \(h(\lambda ) = (1 + b)\lambda ^b\) for 0 < λ < 1 and \(-1< b \leq \frac {{(p - 6)}}{2}\). The resulting proper Bayes estimator will be minimax for p ≥ 5 by part (3) of Theorem 3.12 and Example 3.9. For p ≥ 3, the estimator δ π(X) given in part (3) of Theorem 3.12 is a generalized Bayes minimax estimator provided \(- \frac {{(p + 2)}}{2} < b \leq \frac {{(p - 6)}}{2}\).

It can be shown to be admissible if the lower bound is replaced by − 2, by the results of Brown (1971). Also see the development in Berger and Strawderman (1996) and Kubokawa and Strawderman (2007).

Example 3.11 (Multiple shrinkage minimax estimators)

It follows from Example 3.9 and Theorem 3.12 that \(m(X) = \sum \limits _{i = 1}^k {\left [ {\frac {1}{{\left ( {X - \nu _i } \right )^{\scriptscriptstyle {\mathrm {T}}} \varSigma ^{ - 1} D^{ - 1} \varSigma ^{ - 1} \left ( {X - \nu _i } \right )}}} \right ]^b }\) satisfies the conditions of Theorem 3.12 (4) for 0 < b ≤ (p − 2)∕2, and hence

(3.42)

is a minimax multiple shrinkage (pseudo-Bayes) estimator.

If, as in Example 3.11, we use the generalized prior

$$\displaystyle \begin{aligned} \pi (\theta) = \sum\limits_{i = 1}^k {\left[ {\frac{1}{{\left( {\theta - \nu _i } \right)^{\scriptscriptstyle{\mathrm{T}}} \varSigma ^{ - 1} D^{ - 1} \varSigma ^{ - 1} \left( {\theta - \nu _i } \right)}}} \right]^b ,} \end{aligned}$$

the resulting generalized Bayes (as opposed to pseudo-Bayes) estimator is minimax for 0 < b ≤ (p − 2)∕2.

3.3.2 General Σ and General Quadratic Loss

In this section, we generalize the above results to the case of

$$\displaystyle \begin{aligned} X\sim {\mathcal N}_p(\theta,\varSigma), \end{aligned} $$
(3.43)

where Σ is a general positive definite covariance matrix and the loss is given by

$$\displaystyle \begin{aligned} L(\theta,\delta)= (\delta - \theta)^{\scriptscriptstyle{\mathrm{T}}}Q(\delta - \theta), \end{aligned} $$
(3.44)

where Q is a general positive definite matrix. We will see that this case can be reduced to the canonical form Σ = I and Q = diag(d 1, d 2, …, d p) = D. We continue to follow the development in Strawderman (2003).

The following well known fact will be used repeatedly to obtain the desired generalization.

Lemma 3.8

For any pair of positive definite matrices, Σ and Q, there exists a non-singular matrix A such that AΣA T = I and (A T)−1 QA −1 = D where D is diagonal.

Using this fact we can now present the canonical form of the estimation problem.

Theorem 3.13

Let \(X\sim {\mathcal N}_p(\theta ,\varSigma )\) and suppose that the loss is L 1(δ, θ) = (δ − θ)T Q(δ − θ). Let A and D be as in Lemma 3.8 and let \(Y=AX \sim {\mathcal N}_p(v,I_p)\) , where v = Aθ and L 2(δ, v) = (δ − v)T D(δ − v).

  1. (1)

    If δ 1(X) is an estimator with risk function R 1(δ 1, θ) = E θ L 1(δ 1(X), θ), then the estimator δ 2(Y ) = Aδ 1(A −1 Y ) has risk function R 2(δ 2, v) = R 1(δ 1, θ) = E θ L 2(δ 2(Y ), v).

  2. (2)

    δ 1(X) is proper or generalized Bayes with respect to the proper prior distribution π 1(θ) (or pseudo-Bayes with respect to the pseudo-marginal m 1(X)) under loss L 1 if and only if δ 2(Y ) = Aδ 1(A −1 Y ) is proper or generalized Bayes with respect to π 2(v) = π 1(A −1 v) (or pseudo-Bayes with respect to the pseudo-marginal m 2(Y ) = m 1(A −1 Y )).

  3. (3)

    δ 1(X) is admissible (or minimax or dominates \(\delta _1^{\ast }(X)\) ) under L 1 if and only if δ 2(Y ) = Aδ 1(A −1 Y ) is admissible (or minimax or dominates \(\delta _2^{\ast }(Y)=A \delta _1^{\ast }(A^{-1} Y)\) under L 2 ).

Proof

To establish Part (1) note that the risk function

$$\displaystyle \begin{aligned} R_2(\delta_2,v)&= E_\theta L_2[\delta_2(Y),v] \\ &= E_\theta [(\delta_2(Y)-v)^{\scriptscriptstyle{\mathrm{T}}}D(\delta_2(Y)-v)]\\ &= E_\theta [(A \delta_1(A^{-1}(AX))-A\theta)^{\scriptscriptstyle{\mathrm{T}}}D (A \delta_1(A^{-1}(AX))-A\theta)] \\ &= E_\theta [(\delta_1(X)-\theta)^{\scriptscriptstyle{\mathrm{T}}}A^{\scriptscriptstyle{\mathrm{T}}}DA (\delta_1(X)-\theta)]\\ &=E_\theta [(\delta_1(X)-\theta)^{\scriptscriptstyle{\mathrm{T}}}Q (\delta_1(X)-\theta)] \\ &= R_1(\delta_1,\theta).\end{aligned} $$

Since the Bayes estimator for any quadratic loss is the posterior mean, and θ ∼ π 1(θ) implies v = Aθ ∼ π 2(v) = π 1(A −1 v) (ignoring constants), Part (2) follows by noting that

$$\displaystyle \begin{aligned}\delta_2(Y)\,{=}\,E[v|Y]\,{=}\,E[A\theta |Y]\,{=}\,E[A\theta|AX]\,{=}\,A\; E[\theta|X]\,{=}\,A \;\delta_1(X)\,{=}\,A \delta_1(A^{-1}Y).\end{aligned} $$

Lastly, Part (3) follows directly from Part (1). □

Note that if Σ 1∕2 is the positive definite square root of Σ and \(A = P\varSigma ^{-1/2}\), where P is orthogonal and diagonalizes \(\varSigma ^{1/2} Q \varSigma ^{1/2}\), then this A and \(D = P\varSigma ^{1/2} Q \varSigma ^{1/2} P^{\scriptscriptstyle {\mathrm {T}}}\) satisfy the requirements of the theorem.
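This construction is easy to carry out numerically; the sketch below (with arbitrary illustrative Σ and Q of our own choosing) builds A and D and checks the two identities of Lemma 3.8.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4)); Sigma = B @ B.T + 4 * np.eye(4)   # illustrative p.d. Sigma
C = rng.normal(size=(4, 4)); Q = C @ C.T + np.eye(4)           # illustrative p.d. Q

# Sigma^{1/2} and Sigma^{-1/2} from the spectral decomposition of Sigma
w, V = np.linalg.eigh(Sigma)
Sig_half = V @ np.diag(np.sqrt(w)) @ V.T
Sig_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# An orthogonal P diagonalizing Sigma^{1/2} Q Sigma^{1/2}; then A = P Sigma^{-1/2}
evals, U = np.linalg.eigh(Sig_half @ Q @ Sig_half)
P = U.T                                   # rows of P are the eigenvectors
A = P @ Sig_half_inv
D = P @ Sig_half @ Q @ Sig_half @ P.T     # diagonal with entries evals

print(np.allclose(A @ Sigma @ A.T, np.eye(4)))                     # True
print(np.allclose(np.linalg.inv(A.T) @ Q @ np.linalg.inv(A), D))   # True
print(np.allclose(D, np.diag(evals)))                              # True
```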

Example 3.12

Proceeding as we did in Example 3.9 and applying Theorem 3.13, the pseudo-marginal \(m(X^{\scriptscriptstyle {\mathrm {T}}} \varSigma ^{-1} Q^{-1} \varSigma ^{-1} X) = (X^{\scriptscriptstyle {\mathrm {T}}} \varSigma ^{-1} Q^{-1} \varSigma ^{-1} X)^{-b}\) has associated with it the pseudo-Bayes minimax James-Stein estimator

$$\displaystyle \begin{aligned} \delta_m(X) = \left(1-\frac{{2 \, b \, Q^{ - 1} \varSigma ^{ - 1} }} {{\left( {X^{\scriptscriptstyle{\mathrm{T}}}\,\varSigma^{ - 1} Q^{- 1} \varSigma ^{ - 1} X} \right)}} \right) X ,\end{aligned} $$

for 0 < b ≤ p − 2.

Generalizations of Example 3.10 to hierarchical Bayes minimax estimators and generalizations of Example 3.11 to multiple shrinkage estimators are straightforward. We omit the details.

3.4 Admissibility of Bayes Estimators

Recall from Sect. 2.4 that an admissible estimator is one that cannot be dominated in risk, i.e., δ(X) is admissible if there does not exist an estimator δ′(X) such that R(θ, δ′) ≤ R(θ, δ) for all θ, with strict inequality for some θ. We have already derived classes of minimax estimators in the previous sections.

In this section, we study their possible admissibility or inadmissibility. One reason that admissibility of these minimax estimators is interesting is that, as we have already seen, the usual estimator δ 0(X) = X is minimax but inadmissible if p ≥ 3. Actually, we have seen that it is possible to dominate X with a minimax estimator (e.g., \(\delta ^{JS}_{(p-2)}(X)\)) that has a substantially smaller risk at θ = 0. Hence, it is of interest to know if a particular (dominating) estimator is admissible.

Note that a unique proper Bayes estimator is automatically admissible (see Lemma 2.6), so we already have examples of admissible minimax estimators for p ≥ 5.

We also note that the class of generalized Bayes estimators contains all admissible estimators if loss is quadratic (i.e., it is a complete class; see e.g., Sacks 1963; Brown 1971; Berger and Srinivasan 1978). It follows that if an estimator is not generalized Bayes, it is not admissible. Further, in order to be generalized Bayes, an estimator must be everywhere differentiable by properties of the Laplace transform . In particular, the James-Stein estimators and the positive-part James-Stein estimators (for a ≠ 0) are not generalized Bayes and therefore not admissible.

In this section, we will study the admissibility of estimators corresponding to priors which are variance mixtures of normal distributions for the case of \(X \sim {\mathcal N}_p(\theta ,I)\) and quadratic loss ∥δ − θ2 as in Sect. 3.1.2. In particular, we consider prior densities of the form (3.4) and establish a connection between admissibility and the behavior of the mixing (generalized) density h(v) at infinity. The analysis will be based on Brown (1971), Theorem 1.2. An Abelian Theorem (see, e.g., Widder (1946), Corollary 1.a, p. 182) along with Brown’s theorem are our main tools. We use the notation f(x) ∼ g(x) as x → a to mean limxa f(x)∕g(x) = 1. Here is an adaptation of the Abelian theorem in Widder that meets our needs.

Theorem 3.14

Assume \(g: \mathbb {R}^{+} \rightarrow \mathbb {R}\) has a Laplace transform \(f(s) = \int ^{\infty }_{0} g(t) e^{-st}\, dt\) that is finite for s ≥ 0. If g(t) ∼ t γ as t → 0+ for some γ > −1, then f(s) ∼ s −(γ+1) Γ(γ + 1) as s →∞.

The proof is essentially as in Widder (1946) but the assumption of finiteness of the Laplace transform at s = 0 allows the extension from γ ≥ 0 to γ > −1.
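The prototype case (included here only as an illustration) is g(t) = t γ itself, for which the asymptotic relation is exact:

$$\displaystyle \begin{aligned} f(s) = \int_{0}^{\infty} t^{\gamma} e^{-st}\, dt = \frac{\varGamma(\gamma+1)}{s^{\gamma+1}}, \qquad s > 0,\; \gamma > -1. \end{aligned}$$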

We first give a lemma which relates the tail behavior of the mixing density h(v) to the tail behavior of π(∥θ2) and m(∥x2) and also shows that ∥δ(x) − x∥ is bounded whenever h(v) has polynomial tail behavior.

Lemma 3.9

Suppose \(X \sim {\mathcal N}_{p}(\theta , I_{p})\) , L(θ, δ) = ∥δ − θ∥2 and π(θ) is given by (3.4) where h(v) ∼ K v a as v →∞ with a < (p − 2)∕2 and where v p∕2 h(v) is integrable in a neighborhood of 0. Then

  1. (1)

    π(θ) ∼ K (∥θ∥2)a−(p−2)∕2 Γ((p − 2)∕2 − a) as ∥θ∥2 →∞,

    m(x) ∼ K (∥x∥2)a−(p−2)∕2 Γ((p − 2)∕2 − a) as ∥x∥2 →∞,

    and therefore π(∥x∥2) ∼ m(∥x∥2) as ∥x∥2 →∞,

  2. (2)

    ∥δ(x) − x∥ is uniformly bounded, where δ is the generalized Bayes estimator corresponding to π.

Proof

First note that (with t = 1∕v)

$$\displaystyle \begin{aligned} \pi (\theta) = \pi^{*}(\| \theta \|{}^{2}) = \int^{\infty}_{0} \exp\left\{- \frac{\| \theta \|{}^{2}}{2} t\right\} t^{\frac{p}{2} - 2} h(1/t) \,dt \end{aligned}$$

and \(g(t) = t^{\frac {p}{2} - 2} h(1/t) \sim K t^{\frac {p-4}{2} - a}\) as t → 0+. Therefore, by Theorem 3.14, \(\pi (\theta ) \sim K (\| \theta \|{ }^{2})^{a - \frac {p-2}{2}} \varGamma \left ( \frac {p-2}{2} - a \right )\) as ∥θ∥2 →∞. Similarly

$$\displaystyle \begin{aligned} \begin{array}{rcl} m(x) &\displaystyle = &\displaystyle \int^{\infty}_{0} e^{- \frac{\| x \|{}^{2}}{2(1 + v)}} (1 + v)^{-\frac{p}{2}} \,h(v)\, dv \,\left({\mathrm{for}}\,\; t = \frac{1}{1+v}\right) \\ &\displaystyle =&\displaystyle \int^{1}_{0} e^{- \frac{\| x \|{}^{2}}{2} t} t^{\frac{p}{2} - 2} h\left(\frac{1-t}{t} \right)\, dt. \end{array} \end{aligned} $$

We note that as t → 0+, \(t^{\frac {p}{2}-2} h\left (\frac {1-t}{t} \right ) \sim t^{\frac {p-4}{2}} \left (\frac {1-t}{t} \right )^{a} \sim t^{\frac {p-4}{2} - a}\). Thus, again by Theorem 3.14,

$$\displaystyle \begin{aligned}m(x) \sim K(\| x \|{}^{2})^{a - \frac{p-2}{2}} \varGamma \left( \frac{p-2}{2} - a \right) \; \mbox{as}\; \| x \|{}^{2} \rightarrow \infty,\end{aligned}$$

and Part (1) follows.

To prove Part (2) note that

$$\displaystyle \begin{aligned} \begin{array}{ll} \delta(x) - x & = \frac{\nabla m(x)}{m(x)}\\ & = - \frac{\int^{\infty}_{0} \exp \left\{- \frac{\| x \|{}^{2}}{2(1 + v)}\right\} (1 + v)^{-(\frac{p}{2}+1)}\, h(v) \,dv} {\int^{\infty}_{0} \exp \left\{- \frac{\| x \|{}^{2}}{2(1 + v)}\right\} \,(1+v)^{-\frac{p}{2}} \,h(v) \,dv}\; x. \end{array} \end{aligned}$$

The above argument applied to the numerator and denominator shows

$$\displaystyle \begin{aligned} \begin{array}{ll} \| \delta(x) - x \|{}^{2} & \sim \left[ \frac{(\| x \|{}^{2})^{a-\frac{p}{2}} \,\varGamma(\frac{p}{2} - a)}{(\| x \|{}^{2})^{a-\frac{p-2}{2}} \,\varGamma(\frac{p-2}{2} - a)} \right]^{2} \| x \|{}^{2}\\ {} & \sim \left( \frac{p-2}{2} - a \right)^{2} \frac{1}{\| x \|{}^{2}} \; {\mathrm{as}} \; \| x \|{}^{2} \rightarrow \infty. \end{array} \end{aligned}$$

Since δ(x) − x is in \(\mathcal {C}^{\infty }\) and tends to zero as ∥x2 →, the function is uniformly bounded. □

The following result characterizes admissibility and inadmissibility for generalized Bayes estimators when the mixing density h(v) ∼ v a as v →∞.

Theorem 3.15

For priors π(θ) of the form (3.4) with mixing density h(v) ∼ v a as v →∞, the corresponding generalized Bayes estimator δ is admissible if and only if a ≤ 0.

Proof (Admissibility if a ≤ 0)

By Lemma 3.9, we have \(\bar {m}(r) = m^{*}(r^{2}) \sim K^{*} (r^{2})^{a-(p-2)/2}\), where \(m(x) = m^{*}(\| x \|^{2})\). Thus, for any 𝜖 > 0, there is an r 0 > 0 such that, for r > r 0, \(\bar {m}(r) \leq (1 + \epsilon ) K^{*} r^{2a -(p-2)}\). Since ∥δ(x) − x∥ is uniformly bounded,

$$\displaystyle \begin{aligned} \int^{\infty}_{r_{0}} (r^{p-1} \bar{m}(r))^{-1} \,dr \geq (K^{*}(1 + \epsilon))^{-1} \int^{\infty}_{r_{0}} r^{- (2a + 1)} \,dr = \infty \end{aligned}$$

if a ≤ 0. Hence, δ(x) is admissible if a ≤ 0, by Theorem 1.2.

(Inadmissibility if a > 0) Similarly, we have, for r ≥ r 0,

$$\displaystyle \begin{aligned} \underline{m}(r) = \frac{1}{m^{*}(r^{2})} \sim \frac{1}{K^{*}} (r^{2})^{\frac{p-2}{2}-a}, \end{aligned}$$
$$\displaystyle \begin{aligned} \underline{m}(r) \leq \frac{1}{(1- \epsilon) K^{*}} r^{p-2-2a}, \end{aligned}$$

and

$$\displaystyle \begin{aligned} \int^{\infty}_{r_{0}} r^{1-p}\, \underline{m}\,(r)\, dr \leq \frac{1}{(1 - \epsilon) K^{*}} \int^{\infty}_{r_{0}} r^{-(1+2a)}\, dr < \infty \end{aligned}$$

if a > 0. Thus δ(x) is inadmissible if a > 0. □

Example 3.13 (Continued)

Recall for the Strawderman prior that \(h(v) = C(1 + v)^{- \alpha - (\frac {p-2}{2})} \sim v^{a}\) as v →∞ for \(a = - (\alpha + \frac {p-2}{2})\).

The above theorem implies that the generalized Bayes estimator is admissible if and only if \(\alpha + \frac {p-2}{2} \geq 0\) or \(1 - \frac {p}{2} \leq \alpha \). We previously established minimaxity when 2 − p < α ≤ 0 for p ≥ 3 and propriety of the prior when \(2 - \frac {p}{2} < \alpha \leq 0\) for p ≥ 5.

Note in general that for a mixing distribution of the form h(v) ∼ Kv a as v →∞, the prior distribution π(θ) will be proper if and only if a < −1 by the same argument as in the proof of Theorem 3.15. Hence the bound for admissibility, a ≤ 0, differs from the bound for propriety, a < −1, by 1.

3.5 Connections to Maximum a Posteriori Estimation

3.5.1 Hierarchical Priors

As we have seen in previous sections of this chapter, the classical Stein estimate and its positive-part modification can be motivated in a number of ways, perhaps most commonly as empirical Bayes estimates (i.e., posterior means) under a normal hierarchical model in which \(\theta \sim {\mathcal N}_p(0,\psi \, I_{p})\) where ψ, viewed as a hyperparameter, is estimated. In this section we look at shrinkage estimation through the lens of maximum a posteriori (MAP) estimation. The development of this section follows Strawderman and Wells (2012).

The class of proper Bayes minimax estimators constructed in Sect. 3.1 relies on the use of a hierarchically specified class of proper prior distributions π S(θ, κ). In particular, for the prior in Strawderman (1971), π S(θ, κ) is specified according to

(3.45)

where g(κ) = (1 − κ)∕κ and the constant a satisfies 0 ≤ a < 1, i.e., π S(κ) is a Beta(1 − a, 1) probability distribution . Suppose a = 1∕2; then, utilizing the transformation ψ = g(κ) > 0 in (3.45), we obtain the equivalent specification

(3.46)

Two interesting alternative formulations of (3.46) are given below for the case p = 1 and generalized later for arbitrary p. In what follows, we let Gamma(τ, ξ) denote a random variable with probability density function

and Exp(ξ) corresponds to the choice τ = 1 (i.e., an exponential random variable in its rate parametrization).

For p = 1, the marginal prior distribution on θ induced by (3.46) is equivalent to that obtained under the specification

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \theta | \psi, \lambda \sim {\mathcal N}(0, \psi), ~~ \psi | \lambda \sim {\mathrm{Exp}}\left(\frac{\lambda^2}{2} \right), ~~ \lambda | \alpha \sim {\mathrm{HN}}(\alpha^{-1}), \end{array} \end{aligned} $$
(3.47)

where α = 1 and HN(ζ) denotes the half-normal density

$$\displaystyle \begin{aligned} \pi(\lambda \,|\, \zeta) = \sqrt{\frac{2}{\pi \zeta}}\; e^{-\frac{\lambda^{2}}{2 \zeta}}, \qquad \lambda > 0. \end{aligned}$$

The marginal prior distribution on θ induced by (3.46) is also equivalent to that obtained under the alternative specification

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \theta | \lambda \sim {\mathrm{Laplace}}(\lambda),~~ \lambda | \alpha \sim {\mathrm{HN}}(\alpha^{-1}), \end{array} \end{aligned} $$
(3.48)

where α = 1 and Laplace(λ) denotes a random variable with the Laplace (double exponential) probability density function

$$\displaystyle \begin{aligned} \pi(\theta \,|\, \lambda) = \frac{\lambda}{2}\, e^{-\lambda |\theta|}, \qquad \theta \in \mathbb{R}. \end{aligned}$$

This result follows from Griffin and Brown (2010). Define

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \theta | \psi, \omega \sim {\mathcal N}(0, \psi), ~~ \psi | \omega \sim \mbox{Exp}(\omega), ~~ \omega | \delta, \alpha \sim \mbox{Gamma}(1/2,\alpha) \end{array} \end{aligned} $$
(3.49)

as a hierarchically specified prior distribution for θ, ψ and ω. The resulting marginal prior distribution for θ, obtained by integrating out ψ and ω, is exactly the quasi-Cauchy distribution of Johnstone and Silverman (2004); see Griffin and Brown (2010) for details. Carvalho et al. (2010) showed that this distribution also coincides with the marginal prior distribution for θ induced by taking a = 1∕2 in (3.45). The transformation \(\lambda = \sqrt {2 \omega }\) in (3.49) leads directly to (3.47) upon setting α = 1; (3.48) is then obtained by integrating out ψ in (3.47).
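The last integration is the familiar normal-exponential scale-mixture identity (shown here for p = 1 as a sketch):

$$\displaystyle \begin{aligned} \int_{0}^{\infty} \frac{1}{\sqrt{2\pi\psi}}\, e^{-\frac{\theta^{2}}{2\psi}}\; \frac{\lambda^{2}}{2}\, e^{-\frac{\lambda^{2}}{2}\psi}\, d\psi = \frac{\lambda}{2}\, e^{-\lambda|\theta|}, \end{aligned}$$

so that integrating ψ out of (3.47) indeed produces the Laplace(λ) prior of (3.48).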

3.5.2 The Positive-Part Estimator and Extensions as MAP Estimators

Takada (1979) showed that a positive-part type minimax estimator

$$\displaystyle \begin{aligned} \delta^{c}_{JS+}(X) = \left( 1 - \frac{c} {\| X \|{}_2^{2}} \right)_{+} X, \end{aligned} $$
(3.50)

where \((t)_{+} = \max (t, 0)\), is also the MAP estimator under a certain class of hierarchically specified generalized prior distributions, say π T(θ, κ) = π(θ|κ)π T(κ). For the specific choice c = p − 2 in (3.50), Takada’s prior reduces to

(3.51)

The improper prior (3.51) evidently behaves similarly to Strawderman’s proper prior (3.45) (i.e., for a = 1∕2). Notably, the numerator (1 − κ)p∕2 in π T(κ) explicitly offsets the contribution of (1 − κ)p∕2 arising from the determinant of the variance matrix g(κ) I p in the conditional prior specification θ|κ. Under the monotone decreasing variable transformation ψ = g(κ) > 0, (3.51) implies an alternative representation that is analogous to (3.46):

(3.52)

We observe that the proper prior (3.46) and improper prior (3.52) (almost) coincide when p = 1; in particular, multiplying the former by ψ 1∕2 yields the latter. In view of the fact that (3.46) and (3.47) lead to the same marginal prior on θ when p = 1, one is led to question whether a deeper connection between these two prior specifications might exist. Supposing p ≥ 1, consider the following straightforward generalization of (3.47):

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle \theta | \psi, \lambda \sim {\mathcal N}_p(0, \psi {I}_{p}), ~ \psi | \lambda \sim \mbox{Gamma}\left(\frac{p+1}{2},\frac{\lambda^2}{2} \right), ~ \lambda | \alpha \sim \mbox{HN}(\alpha^{-1}).\qquad \end{array} \end{aligned} $$
(3.53)

Integrating λ out of the higher level prior specification, the resulting marginal (proper) prior for ψ reduces to

(3.54)

For α = 1 and any p ≥ 1, we now observe that the proper prior (3.54) is simply the improper prior π T(ψ) in (3.52) multiplied by ψ −1∕2 and it reduces to Strawderman’s prior (3.46) for p = 1.

3.5.3 Penalized Likelihood and Hierarchical Priors

Expressed in modern terms of penalization, Takada (1979) proved that the positive-part estimator (3.50) is the solution to a certain penalized likelihood estimation problem in which the penalty (or regularization) term is determined by the prior (3.51). Penalized likelihood estimation, and more generally regularized estimation, has become a very important conceptual paradigm in both statistics and machine learning. Such methods suggest principled estimation and model selection procedures for a variety of high-dimensional problems. The statistical literature on penalized likelihood estimators has exploded, in part due to success in constructing procedures for regression problems in which one can simultaneously select variables and estimate their effects. The penalty functions leading to procedures with good asymptotic frequentist properties have singularities at the origin; important examples of separable penalties include the least absolute shrinkage and selection operator (LASSO) of Tibshirani (1996), the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001), and the minimax concave penalty (MCP) of Zhang (2010). In fact, most such penalties utilized in the literature behave similarly to the LASSO penalty near the origin, differing more in their respective behaviors away from the origin, where control of estimation bias for those parameters not estimated to be zero becomes the driving concern. Generalizations of the LASSO penalty have been proposed to deal with correlated groupings of parameters, such as those that might arise in problems with features that can be sensibly ordered, as in the fused LASSO of Tibshirani et al. (2005), or separated into distinct subgroups, as in the group LASSO of Yuan and Lin (2006). In such problems, the use of these penalties serves a related purpose.

The LASSO was initially formulated as a least squares estimation problem subject to an \(\ell _1\) constraint on the parameter vector. The more well-known penalized likelihood version arises from a Lagrange multiplier formulation of a convex relaxation of an \(\ell _0\) non-convex optimization problem. Since the underlying objective function is separable in the parameters, the underlying estimation problem is evidently directly related to the now-classical problem of estimating a bounded normal mean. From a decision theoretic point of view, if \(X \sim {\mathcal N}(\theta , 1)\; {\mathrm {for}} \; |\theta | \leq \lambda \), then the projection of the usual estimator onto the interval dominates the unrestricted MLE, but cannot be minimax for quadratic loss because it is not a Bayes estimator. Casella and Strawderman (1981) showed that the unique minimax estimator of θ is the Bayes estimator corresponding to a two-point prior on {−λ, λ} for λ sufficiently small; they further showed that the uniform boundary Bayes estimator, \(\lambda \tanh (\lambda x)\), is the unique minimax estimator if λ < λ 0 ≈ 1.0567. They also considered three-point priors supported on {−λ, 0, λ} and obtained sufficient conditions for such a prior to be least favorable. Marchand and Perron (2001) considered the multivariate extension, \({X} \sim \mathcal {N}_{p}(\theta , {I}_{p})\) with \(\|\theta \|_2 \leq \lambda \), and showed that the Bayes estimator with respect to a boundary uniform prior dominates the MLE whenever \(\lambda \leq \sqrt {p}\) under squared error loss.

It has long been recognized that the class of penalized likelihood estimators also has a Bayesian interpretation. For example, in the canonical version of the LASSO problem, minimizing

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{1}{2} \| {X} - \theta \|{}_2^2 + \lambda \|\theta\|{}_1, ~~~||\theta||{}_1 = \sum_{i=1}^p | \theta_i | \end{array} \end{aligned} $$
(3.55)

with respect to θ is easily seen to be equivalent to computing the MAP estimator of θ under a model specification in which \({X}\,{\sim }\,{\mathcal N}_p(\theta ,{I}_{p})\) and θ has a prior distribution satisfying \(\theta _i \stackrel {iid}{\sim } \mbox{Laplace}(\lambda ).\) It is easily shown that the solution to (3.55) is \(\widehat {\theta }_i({X}) = \mbox{sign}(X_i) (|X_i| - \lambda )_+,\) i = 1, …, p. The critical hyperparameter λ, though regarded as fixed for the purposes of estimating θ, is typically estimated in some ad hoc manner (e.g., cross validation), resulting in an estimator with an empirical Bayes flavor.

The Laplace prior inherent in the LASSO minimization problem (3.55) has broad connections to estimation under hierarchical prior specifications that lead to scale mixtures of normal distributions. As pointed out above, the conditional prior distribution of θ|λ obtained by integrating out ψ in (3.47) is exactly Laplace(λ). More generally, the conditional distribution for θ|λ under the hierarchical prior specification (3.53) is a special case of the class of multivariate exponential power distributions in Gomez-Sanchez-Manzano et al. (2008); in particular, we obtain

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \pi(\theta| \lambda) \propto \lambda^p \exp\left\{ - \lambda \| \theta \|{}_2 \right\}, \end{array} \end{aligned} $$
(3.56)

a direct generalization of the Laplace distribution that arises when p = 1. Treating λ as a fixed hyperparameter, computation of the resulting MAP estimator under the previous model specification \(X \sim {\mathcal N}_p(\theta ,I_{p})\) reduces to determining the value of θ that minimizes

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{1}{2} \| X - \theta \|{}^2_2 + \lambda \| \theta \|{}_2. \end{array} \end{aligned} $$
(3.57)

The resulting estimator is easily shown to be

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \delta_{GL}(X) = \left( 1 - \frac{\lambda}{\| X \|{}_2} \right)_+ X, \end{array} \end{aligned} $$
(3.58)

an estimator that coincides with the solution to the canonical version of the grouped LASSO problem involving a single group of parameters (see Yuan and Lin 2006) and equals \(\widehat {\theta }(X) = \mbox{sign}(X) (|X| - \lambda )_+\) for the case where p = 1.
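A short numerical sketch of (3.58) follows (the value of λ below is only illustrative):

```python
import numpy as np

def delta_GL(x, lam):
    # group-LASSO / soft-threshold estimator (3.58): shrink the whole vector toward 0
    s = np.linalg.norm(x)
    return max(0.0, 1.0 - lam / s) * x if s > 0 else x

print(delta_GL(np.array([3.0, -1.0, 0.5]), lam=1.0))   # p > 1: proportional shrinkage
print(delta_GL(np.array([0.4]), lam=1.0))              # p = 1: soft thresholded to 0
```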

Consider the problem of estimating θ in the canonical setting \(X \sim {\mathcal N}_p(\theta , I_{p})\). In view of the fact that (3.53) leads to (3.56) upon integrating out ψ, our starting point is the (possibly improper) generalized class of joint prior distributions π(θ, λ|α, β), which we define in the following hierarchical fashion

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \pi(\theta | \lambda, \alpha, \beta) &\displaystyle \propto&\displaystyle \lambda^p \exp\left\{ - \lambda \| \theta \|{}_2 \right\}, \\ \pi(\lambda|\alpha,\beta) &\displaystyle \propto&\displaystyle \lambda^{-p} \exp\{-\alpha (\lambda - \beta)^2 \}, \end{array} \end{aligned} $$
(3.59)

where α, β > 0 are hyperparameters. Equivalently,

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \pi(\theta, \lambda | \alpha, \beta) \propto \exp\left\{ - \lambda \| \theta \|{}_2 \right\} \exp\{ -\alpha (\lambda - \beta)^2 \}. \end{array} \end{aligned} $$
(3.60)

The prior on λ is an improper modification of that given in (3.53), in which a location parameter β is introduced and the factor λ −p is included to offset the contribution λ p in (3.56). This construction mimics the idea underlying the prior used by Takada (1979) to motivate (3.50) as a MAP estimator .

Considering (3.60) as motivation for defining a new class of hierarchical penalty functions , Strawderman and Wells (2012) propose deriving the MAP estimator for (θ, λ) through minimizing the objective function

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} G(\theta, \lambda) &\displaystyle = &\displaystyle \frac{1}{2} \| X - \theta \|{}^2_2 + \lambda \| \theta \|{}_2 + \alpha (\lambda - \beta)^2 \end{array} \end{aligned} $$
(3.61)

jointly in \(\theta \in \mathbb {R}^p\) and λ > 0, where α > 1∕2 and β > 0 are fixed. The resulting estimator for θ takes the closed form

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \delta^{(\alpha,\beta)}(X) = w_{\alpha,\beta}(\| X \|{}_2) X, \end{array} \end{aligned} $$
(3.62)

where

$$\displaystyle \begin{aligned} w_{\alpha,\beta}(s) = \begin{cases} 0 & s \leq \beta \\ \nu_{\alpha} \left( 1-\frac{\beta}{s} \right) & \beta < s \leq 2 \alpha \beta \\ 1 & s > 2 \alpha \beta \\ \end{cases} \end{aligned}$$

for ν α = 2α∕(2α − 1). Equivalently, we may write

$$\displaystyle \begin{aligned} w_{\alpha,\beta}(s) = \begin{cases} \nu_{\alpha} \left( 1-\frac{\beta}{s} \right)_+ & s \leq 2 \alpha \beta \\ 1 & s > 2 \alpha \beta \\ \end{cases} \end{aligned}$$

demonstrating that (3.62) has the flavor of a range-modified positive-part estimator. A detailed derivation of this estimator is in Strawderman and Wells (2012).
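The thresholding rule (3.62) is simple to evaluate numerically; in the sketch below, α and β are the fixed hyperparameters of the objective (3.61), and the particular values used are illustrative only. Sending α →∞ approaches (3.63).

```python
import numpy as np

def delta_alpha_beta(x, alpha, beta):
    # range-modified positive-part rule (3.62): kill, partially shrink, or keep X
    s = np.linalg.norm(x)
    nu = 2.0 * alpha / (2.0 * alpha - 1.0)
    if s <= beta:
        w = 0.0
    elif s <= 2.0 * alpha * beta:
        w = nu * (1.0 - beta / s)
    else:
        w = 1.0
    return w * x

x = np.array([0.8, -0.3, 1.1])
for alpha in (1.0, 5.0, 1e6):          # alpha -> infinity approaches (3.63)
    print(alpha, delta_alpha_beta(x, alpha, beta=1.0))
```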

Some interesting special cases of the estimator (3.62) arise when considering specific values of α, β and p. For example, letting α →∞, we obtain (for β > 0)

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \delta^{(\beta)}(X) &\displaystyle = &\displaystyle \left( 1-\frac{\beta}{\| X \|{}_2} \right)_+ X; \end{array} \end{aligned} $$
(3.63)

upon setting β = λ, we evidently recover (3.58); subsequently, setting \(\lambda = \sqrt {p-2}\), one then obtains an obvious modification of (3.50) for the case where c = p − 2:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \delta^*_{PP}(X) &\displaystyle = &\displaystyle \left( 1-\frac{\sqrt{p-2}}{\| X \|{}_2} \right)_+ X \end{array} \end{aligned} $$
(3.64)

In the special case p = 1, the estimator (3.62) reduces to

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \delta^{M}(X) = \left\{ \begin{array}{ll} 0 & \text{if } |X|\leq \beta \\[.5ex] \frac{2\alpha}{2\alpha-1}(X-{\text{sign}}(X)\beta) & \text{if } \beta < |X| \leq 2 \alpha \beta \\[.5ex] X & \text{if } |X| > 2 \alpha \beta \\ \end{array} \right.. \end{array} \end{aligned} $$
(3.65)

As shown in Strawderman et al. (2013), (3.65) is also the solution to the penalized minimization problem

$$\displaystyle \begin{aligned} \frac{1}{2}(X - \theta)^2 + \rho(\theta; \alpha, \beta), \end{aligned}$$

where β > 0, α > 1∕2 and

$$\displaystyle \begin{aligned} \rho(t; \alpha, \beta) = \beta \int_0^{|t|} (1- \frac{z}{2\alpha \beta})_+\ dz, ~~t \in \mathbb{R}. \end{aligned}$$

This optimization problem is the univariate equivalent of the penalized likelihood estimation problem considered in Zhang (2010), who referred to ρ(t;α, β) as MCP . It follows that (3.65) is equivalent to the univariate MCP thresholding operator; consequently, (3.62) may be regarded as a generalization of this operator for thresholding a vector of parameters. Zhang (2010) showed that the LASSO, SCAD, and MCP belong to a family of quadratic spline penalties with certain sparsity and continuity properties. MCP turns out to be the simplest penalty that results in an estimator that is nearly unbiased, sparse and continuous. As demonstrated above, MCP also has an interesting Bayesian motivation under a hierarchical modeling strategy. Strawderman et al. (2013) undertook a more detailed study of the connections between MCP, the hierarchically penalized estimator, and proximal operators for the case of p = 1. They also compared this estimator to several others through consideration of frequentist and Bayes risks.
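For reference, the integral defining ρ can be evaluated in closed form (a direct computation from the definition above):

$$\displaystyle \begin{aligned} \rho(t;\alpha,\beta) = \begin{cases} \beta\,|t| - \dfrac{t^{2}}{4\alpha}, & |t| \leq 2\alpha\beta, \\[1ex] \alpha\beta^{2}, & |t| > 2\alpha\beta, \end{cases} \end{aligned}$$

so ρ behaves like the LASSO penalty β|t| near the origin and is constant beyond 2αβ, which is the source of the near-unbiasedness noted above.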

3.6 Estimation of a Predictive Density

Consider a parametric model \( \{ {\mathcal Y} , { ( {\mathcal P^{\prime } }_{ \mu } ) }_{ \mu \in \varOmega } \}\) where \( { \mathcal Y}\) is the sample space, Ω is the parameter space and \( {\mathcal P^{\prime } } = \{ p^\prime(y | \mu ): \mu \in \varOmega \} \) is the class of densities of the \({\mathcal P^{\prime }}_{\mu }\) with respect to a σ-finite measure. In addition, suppose an observed value x of the random variable X follows a model \( \{ { \mathcal X} , { ( {\mathcal P}_{ \mu } ) }_{ \mu \in \varOmega } \}\) indexed by the same parameter. In this section, we examine the problem of estimating the true density \(p^\prime ( \cdot | \mu ) \in {\mathcal P^\prime }\) of a random variable Y. In this context, p′(⋅|μ) is referred to as the predictive density of Y.

Let the density \( \hat {q} (y|x)\) (belonging to some class of models \( { \mathcal C} \supset {\mathcal P^{\prime } } \)) be an estimate, based on the observed data x, of the true density p′(y|μ). Aitchison (1975) proposed using the Kullback and Leibler (1951) divergence, defined in (3.66) below, as a loss function for estimating p′(y|μ).

The class of estimates \( { \mathcal C}\) can be identical to the class \( {\mathcal P^{\prime } } \), that is, for any \(y \in { \mathcal Y}\)

$$\displaystyle \begin{aligned}\hat{q} (y|x) = p^\prime(y | \mu= \hat{ \mu} (x))\end{aligned}$$

where \( \hat { \mu }\) is some estimate of μ. This type of density estimator is called the “plug-in density estimate” associated with the estimate \( \hat {\mu } \). Alternatively, one may choose

$$\displaystyle \begin{aligned}\hat{q} (y|x) = \int_{ \varOmega} p^\prime(y| \mu) \, d \pi(\mu| x)\end{aligned}$$

where π(μ|x) may be a weight function (measure) or a posterior density associated with a prior measure π(μ). In this case, the class \( { \mathcal C}\) will be broader than the class of models \( {\mathcal P^{\prime } } \). Aitchison (1975) showed that this latter method is preferable to the plug-in approach for several families of probability distributions, by comparing the risks induced by the Kullback-Leibler divergence.
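
For a concrete feel for the two approaches, here is a minimal sketch of our own (the one-dimensional model, the prior variance τ², and all function names are our choices, not from the text). Take X ∼ N(μ, 1), Y ∼ N(μ, 1) and a N(0, τ²) prior; the plug-in density is N(x, 1), while standard conjugate calculations give the Bayesian predictive density N(τ²x∕(1 + τ²), 1 + τ²∕(1 + τ²)). The sketch compares their Kullback-Leibler risks at a fixed μ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_normal(m1, v1, m2, v2):
    """KL divergence between N(m1, v1) and N(m2, v2) (variances, not sds)."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_risks(mu, tau2, n_mc=100_000):
    """Monte Carlo KL risks of the plug-in and Bayesian predictive densities."""
    x = rng.normal(mu, 1.0, size=n_mc)            # X ~ N(mu, 1)
    # Plug-in estimate N(x, 1) of the true density N(mu, 1).
    risk_plugin = kl_normal(mu, 1.0, x, 1.0).mean()
    # Bayesian predictive density under the N(0, tau2) prior.
    post_mean = tau2 / (1.0 + tau2) * x
    pred_var = 1.0 + tau2 / (1.0 + tau2)
    risk_bayes = kl_normal(mu, 1.0, post_mean, pred_var).mean()
    return risk_plugin, risk_bayes

# For this configuration the Bayesian predictive density has the smaller risk,
# in line with Aitchison's comparison.
print(kl_risks(mu=0.5, tau2=4.0))
```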

3.6.1 The Kullback-Leibler Divergence

First, recall the definition of the Kullback-Leibler divergence and some of its properties.

Lemma 3.10

The Kullback-Leibler divergence (relative entropy) D KL(p, q) between two densities p and q is defined by

$$\displaystyle \begin{aligned} D_{\mathit{\mbox{{KL} }} } (p,q) = E_p \left[ \, \log \dfrac {p} {q} \, \right] = \int \log \left[ \frac{p(x) } {q(x) } \right] p(x) \,dx \geq 0 \end{aligned} $$
(3.66)

and equality is achieved if and only if p = q, p almost surely.

Note that the divergence can be finite only if the support of the density p is contained in the support of the density q. By convention, we define \(0 \, \log \frac { 0 }{ 0 } = 0\).

Proof

By definition of the Kullback-Leibler divergence we can write

$$\displaystyle \begin{aligned} \begin{array}{rcl} - D_{ {\mbox{{KL}}}} (p,q)&\displaystyle = &\displaystyle \int \log \left[ \frac{q(x) } {p(x) } \right] \, p(x) \,dx\\ &\displaystyle \leq &\displaystyle \log \left[ \int \frac{q(x) } {p(x)} \, p(x) \,dx\right] \mbox{ (by Jensen's inequality) } \\ &\displaystyle =&\displaystyle \log \left[ \int q(x) \,dx\right]\\ &\displaystyle = &\displaystyle 0. \end{array} \end{aligned} $$

Since the logarithm is strictly concave, equality holds in Jensen's inequality if and only if q∕p is constant p-almost surely; together with \(\log \int q(x)\,dx = 0\), this forces p = q, p-almost surely. Note that the lemma remains true if q is assumed only to be a subdensity (mass less than or equal to 1). □

The Kullback-Leibler divergence is not a true distance: it is neither symmetric nor does it satisfy the triangle inequality. Nevertheless, it arises as the natural discrepancy measure in information theory. An important property, given in the following lemma, is that it is strictly convex.

Lemma 3.11

The Kullback-Leibler divergence is strictly convex, that is to say, if (p 1, p 2) and (q 1, q 2) are two pairs of densities then, for any 0 ≤ λ ≤ 1,

$$\displaystyle \begin{aligned} D_{{\mathit{\mbox{{KL}}}}} ( \lambda \, p_1+ (1- \lambda) \, p_2, \lambda \, q_1+ (1- \lambda) \, q_2) \leq \lambda D_{{\mathit{\mbox{{KL}}}}} (p_1,q_1)+ (1- \lambda) D_{{\mathit{\mbox{{KL}}}}} (p_2,q_2) \, , \end{aligned} $$
(3.67)

with strict inequality unless (p 1, p 2) = (q 1, q 2) a.e. with respect to p 1 + p 2.

Proof

Note that \(f(t) =t \, \log (t)\) is strictly convex on (0, ∞). Let

$$\displaystyle \begin{aligned}\alpha_1 = \frac{\lambda q_1}{\lambda q_1 + (1- \lambda)q_2 }, \; \alpha_2 = \frac{(1- \lambda)q_2}{\lambda q_1 + (1- \lambda)q_2}, \; t_1 = \frac{p_1}{q_1} \; \mbox{and} \; t_2 = \frac{p_2}{q_2 } \, . \end{aligned}$$

From the convexity of the function f it follows that

$$\displaystyle \begin{aligned} f( \alpha_1 t_1 + \alpha_2 t_2) \leq \alpha_1 f(t_1) + \alpha_2 f( t_2) \end{aligned}$$

and consequently

$$\displaystyle \begin{aligned} ( \alpha_1 t_1 + \alpha_2 t_2) \log( \alpha_1 t_1+ \alpha_2 t_2) \leq t_1 \alpha_1 \log(t_1) +t_2 \alpha_2 \log(t_2) \, . \end{aligned}$$

Substituting the above values of α 1, α 2, t 1 and t 2, and multiplying both sides by λ q 1 + (1 − λ) q 2, gives

$$\displaystyle \begin{aligned}\left( \lambda p_1+ (1- \lambda) p_2 \right) \log \frac{ \lambda p_1 + (1- \lambda)p_2 } { \lambda q_1 + (1- \lambda)q_2} \leq \lambda p_1 \log \frac{p_1 } {q_1 }+ (1- \lambda) p_2 \log \frac{p_2 } {q_2 }. \end{aligned}$$

Finally, integrating both sides of the last inequality yields (3.67); the strict inequality assertion follows from the strict convexity of the function f. □

3.6.2 The Bayesian Predictive Density

Assume in the rest of this subsection that p(x|μ) and p′(y|μ) are densities with respect to the Lebesgue measure. For any estimator \(\hat {p} (\cdot |x)\) of the density p′(y|μ), define the Kullback-Leibler loss by

$$\displaystyle \begin{aligned} \mbox{KL}( \mu, \hat{p} (\cdot |x)) = \int p^\prime(y | \mu) \log \left[ \frac{p^\prime(y| \mu) } { \hat{p} (y|x)} \right] dy \end{aligned} $$
(3.68)

and its corresponding risk as

$$\displaystyle \begin{aligned} {{\mathcal R}_{\mbox{{KL}}}}( \mu, \hat{p} ) = \int p(x | \mu) \left[ \int p^\prime(y | \mu) \log \left[ \frac{p^\prime(y| \mu) } { \hat{p} (y|x)} \right] \,dy \right] dx. \end{aligned} $$
(3.69)

We say that the density estimate \( \hat {p}_2\) dominates the density estimate \( \hat {p}_1\) if, for any μ ∈ Ω, \( {{\mathcal R}_{\mbox{ {KL}}}}( \mu , \hat {p}_1) - {{\mathcal R}_{\mbox{ {KL}}}}( \mu , \hat {p}_2) \geq 0\), with strict inequality for at least some value of μ.

In the Bayesian framework we will compare estimates through their Bayes risks. We will consider the class, more general than that considered by Aitchison (1975), of all subdensities,

$$\displaystyle \begin{aligned}\mathcal D = \left\{q(\cdot |x) | \int q(y|x) \,dy \leq 1\; \; \mbox{ for all } x \right\} . \end{aligned}$$

Lemma 3.12 (Aitchison 1975)

The Bayes risk

$$\displaystyle \begin{aligned} r_{ \pi} ( \hat{p} ) = \int {{\mathcal R}_{\mathit{\mbox{{KL}}}}}( \mu, \hat{p}) \, \pi( \mu) \,d \mu \end{aligned}$$

is minimized by

$$\displaystyle \begin{aligned} \hat{p}_{\pi}(y |x)= \int p^\prime(y |\mu) \, p(\mu|x) \,d \mu = \frac{\int p^\prime(y |\mu) \, p(x|\mu) \pi(\mu) \,d \mu} {\int p(x|\mu) \, \pi(\mu) \,d \mu}. \end{aligned} $$
(3.70)

We call \( \hat {p}_{ \pi }\) the Bayesian predictive density.

Proof

The difference between the Bayes risks of \( \hat {p}_{ \pi }\) and another competing subdensity estimator \( \hat {q}\) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} r_{\pi}(\hat{q})-r_{\pi}(\hat{p}_{\pi}) &\displaystyle =&\displaystyle \int_{\varOmega} \left[ \int_{\mathcal X} \left\{ \int_{\mathcal Y} p^\prime(y|\mu) \log \frac{\hat{p}_{\pi}(y|x)}{\hat{q}(y|x)} \,dy \right\} p(x| \mu)\,dx\right] \pi(\mu) \,d \mu \\ &\displaystyle =&\displaystyle \int_{\varOmega} \left[ \int_{\mathcal X} \left\{ \int_{\mathcal Y} p^\prime(y|\mu) \log \frac{\hat{p}_{\pi}(y|x)}{\hat{q}(y|x)} \,dy \right\} p(x| \mu) \, \pi(\mu) \, \,dx\right] d \mu \\ &\displaystyle =&\displaystyle \int_{\varOmega} \left[ \int_{\mathcal X} \left\{ \int_{\mathcal Y} p^\prime(y|\mu) \log \frac{\hat{p}_{\pi}(y|x)}{\hat{q}(y|x)} \,dy \right\} p(\mu | x) \, m(x) \, \,dx\right] d \mu \, . \end{array} \end{aligned} $$

Rearranging the order of integration, thanks to Fubini's theorem, gives

$$\displaystyle \begin{aligned} \begin{array}{rcl} r_{\pi}(\hat{q})-r_{\pi}(\hat{p}_{\pi}) &\displaystyle =&\displaystyle \int_{\mathcal X} \left[ \int_{\mathcal Y} \left\{ \int_{\varOmega} p(\mu |x) \, p^\prime(y|\mu) \, \,d \mu \right\} \log \frac{\hat{p}_{\pi}(y|x)}{\hat{q}(y|x)} \,dy \right] m(x) \, dx \\ &\displaystyle =&\displaystyle \int_{\mathcal X} \left[ \int_{\mathcal Y} \hat{p}_{\pi}(y|x) \log \frac{\hat{p}_{\pi}(y|x)}{\hat{q}(y|x)} \,dy \right] m(x) \,dx \\ &\displaystyle =&\displaystyle \int_{\mathcal X} D_{{\mbox{{KL}}}}(\hat{p}_{\pi}(.|x),\hat{q}(.|x)) \, m(x) \,dx \geq 0. \end{array} \end{aligned} $$

The last inequality follows from Lemma 3.10, which completes the proof. □

3.6.3 Sufficiency Reduction in the Normal Case

Let X (n) = (X 1, …, X n) and Y (m) = (Y 1, …, Y m) be independent iid samples from p-dimensional normal distributions \({\mathcal N}_p(\mu ,\varSigma _1)\) and \({\mathcal N}_p(\mu ,\varSigma _2)\) with unknown common mean μ and known positive definite covariance matrices Σ 1 and Σ 2. On the basis of an observation x (n) = (x 1, …, x n) of X (n), consider the problem of estimating the true predictive density p′(y (m)|μ) of y (m) = (y 1, …, y m) under the Kullback-Leibler loss. For a prior density π(μ), the Bayesian predictive density is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{p}_{\pi} (y_{(m)} | x_{(n)} ) = \dfrac {\displaystyle \int_{\varOmega} p^\prime (y_{(m)} | \mu) p(x_{(n)} | \mu ) \, \pi(\mu) \, d\mu } {\displaystyle \int_{\varOmega} p(x_{(n)} | \mu ) \, \pi(\mu) \, d\mu }. \end{array} \end{aligned} $$
(3.71)

For simplicity, we consider the case where Σ 1 = Σ 2 = I p. According to Komaki (2001) the Bayesian predictive densities satisfy

$$\displaystyle \begin{aligned} \displaystyle \int_{\mathbb{R}^{pm}} p^\prime(y_{(m)} | \mu) \, \log \dfrac{p^\prime(y_{(m)} | \mu)}{\hat{p}_{\pi}(y_{(m)} | x_{(n)})} \, d y_{(m)} = \int_{\mathbb{R}^p} p^\prime(\bar{y}_m| \mu) \, \log \dfrac{p^\prime(\bar{y}_m | \mu)}{\hat{p}_{\pi}(\bar{y}_m | \bar{x}_n)} \, d \bar{y}_m \end{aligned} $$
(3.72)

where, denoting by ϕ p(⋅ |μ, Σ) the density of \({\mathcal N}_p(\mu ,\varSigma )\), in the left-hand side of (3.72),

$$\displaystyle \begin{aligned} p^\prime(y_{(m)} | \mu) = \prod_{i=1}^m \phi_p(y_i | \mu, I_p) \end{aligned}$$

while, in the right-hand side of (3.72),

$$\displaystyle \begin{aligned} p^\prime(\bar{y}_m | \mu) = \phi_p(\bar{y}_m | \mu, I_p / m) \end{aligned}$$

with \(\bar {y}_m = \sum _{j=1}^{m} y_{j} / m\). Similarly, \(\hat {p}_{\pi } (y_{(m)} | x_{(n)} ) \) corresponds to the conditional density of the p × m matrix y (m) given the p × n matrix x (n), while \(\hat {p}_{\pi }(\bar {y}_m | \bar {x}_n)\) corresponds to the conditional density of the p × 1 vector \(\bar {y}_m\) given the p × 1 vector \(\bar {x}_n = \sum _{i=1}^{n} x_i / n\).

To see this sufficiency reduction, use the fact that

$$\displaystyle \begin{aligned}\sum_{i=1}^{m} \|y_i -\mu \|{}^2 = \sum_{i=1}^{m} \|y_i-\bar{y}_m\|{}^2 \,+\, m\,\|\bar{y}_m -\mu \|{}^2. \end{aligned}$$

Then we can express p (y (m)|μ) as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} p^\prime(y_{(m)} | \mu) &\displaystyle =&\displaystyle \frac{1}{(2 \, \pi)^{m p/2}} \, \exp\left(-\frac{1}{2} \sum_{i=1}^{m} \|y_i-\bar{y}_m\|{}^2\right) \, \exp\left(-\frac{m}{2} \,\|\bar{y}_m -\mu \|{}^2 \right) \\ &\displaystyle =&\displaystyle \frac{1}{m^{p/2}\left(2\pi\right)^{(m-1)p/2}} \, \exp\left(-\frac{1}{2} \sum_{i=1}^{m} \|y_i-\bar{y}_m\|{}^2 \right) \, p^\prime(\bar{y}_m|\mu). \end{array} \end{aligned} $$
(3.73)

Similarly, it follows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} p(x_{(n)} | \mu) = \frac{1}{n^{p/2}\left(2\pi\right)^{(n-1)p/2}} \, \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \|x_i-\bar{x}_n\|{}^2 \right) p(\bar{x}_n|\mu) \, . \end{array} \end{aligned} $$

Substituting these expressions into the predictive density (3.71), we get

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle \hat{p}_{\pi} (y_{(m)} | x_{(n)} ) \\ &\displaystyle &\displaystyle \qquad = \left\{ \frac{1}{m^{p/2}\left(2\pi\right)^{(m-1)p/2}} \, \exp\left(-\frac{1}{2} \sum_{i=1}^{m} \|y_i-\bar{y}_m\|{}^2\right) \right\} \frac {\int p^\prime(\bar{y}_m|\mu) \, p(\bar{x}_n|\mu) \, \pi(\mu) \, d\mu } {\int p(\bar{x}_n|\mu) \, \pi(\mu) \, d\mu } \\ &\displaystyle &\displaystyle \qquad = \left\{ \frac{1}{m^{p/2}\left(2\pi\right)^{(m-1)p/2}} \, \exp\left(-\frac{1}{2} \sum_{i=1}^{m} \|y_i-\bar{y}_m\|{}^2\right) \right\} \hat{p}_{\pi}(\bar{y}_m|\bar{x}_n). {} \end{array} \end{aligned} $$
(3.74)

Finally, from (3.73) and (3.74), it follows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \displaystyle \int p^\prime(y_{(m)} | \mu) \log \dfrac{p^\prime(y_{(m)} | \mu)}{\hat{p}_{\pi}(y_{(m)} | x_{(n)})} d y_{(m)} &\displaystyle =&\displaystyle \int p^\prime(y_{(m)} | \mu) \log \dfrac{p^\prime(\bar{y}_m | \mu)}{\hat{p}_{\pi}(\bar{y}_m | \bar{x}_n)} d y_{(m)} \\ {} &\displaystyle =&\displaystyle \int p^\prime(\bar{y}_m| \mu) \log \dfrac{p^\prime(\bar{y}_m | \mu)}{\hat{p}_{\pi}(\bar{y}_m | \bar{x}_n)} d \bar{y}_m. \end{array} \end{aligned} $$

Therefore, for any prior π, the risk of the Bayesian predictive density estimator is equal to the risk of the Bayesian predictive density associated with π in the reduced model \(X \sim {\mathcal N}_p (\mu , \frac {1}{n} I_p)\) and \(Y \sim {\mathcal N}_p (\mu , \frac {1}{m} I_p)\). Thus, for Bayesian predictive densities, it is sufficient to consider the reduced model.

We will now compare two plug-in density estimators, \(\hat {p}_1\) and \( \hat {p}_2\), associated with two different estimators δ 1 and δ 2 of μ. That is, for i = 1, 2, define

$$\displaystyle \begin{aligned} \hat{p}_i(y_{(m)}|x_{(n)}) = p^\prime(y_{(m)} | \mu = \delta_i(x_{(n)})). \end{aligned} $$
(3.75)

The difference in risk between \( \hat {p}_2\) and \( \hat {p}_1\) is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} \varDelta {{{\mathcal R}_{\mbox{{KL}}}}}(\hat{p}_2,\hat{p}_1) &\displaystyle =&\displaystyle {{{\mathcal R}_{\mbox{{KL}}}}}(\mu,\hat{p}_2)- {{{\mathcal R}_{\mbox{{KL}}}}}(\mu,\hat{p}_1)\\ &\displaystyle =&\displaystyle \int p(x_{(n)}|\mu) \int p(y_{(m)} | \mu) \log \dfrac{\hat{p}_1(y_{(m)}|x_{(n)})}{\hat{p}_2(y_{(m)}|x_{(n)})} \, d y_{(m)} \, d x_{(n)} \\ &\displaystyle =&\displaystyle \int p(x_{(n)}|\mu) \int p(y_{(m)} | \mu) \bigg(\frac{1}{2} \sum_{i=1}^{m} \|\delta_2(x_{(n)})-y_i\|{}^2 \\ &\displaystyle \; \; \; \; \; \; \;-&\displaystyle \frac{1}{2} \sum_{i=1}^{m}\|\delta_1(x_{(n)}) -y_i\|{}^2 \bigg) \, d y_{(m)} \, d x_{(n)} \, . \end{array} \end{aligned} $$

By the independence of X (n) and Y (m) this can be reexpressed in terms of expectations as

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \varDelta {{{\mathcal R}_{\mbox{{KL}}}}}(\hat{p}_2,\hat{p}_1) \\ {} &\displaystyle = &\displaystyle \frac{1}{2} \sum_{i=1}^{m} E_{X_{(n)},Y_{(m)}} \bigg( \|\delta_2(X_{(n)})-\mu+\mu-Y_i\|{}^2 -\|\delta_1(X_{(n)})-\mu+\mu -Y_i\|{}^2 \bigg) \\ {} &\displaystyle = &\displaystyle \frac{m}{2} E_{X_{(n)},Y_{(m)}} \left[\|\delta_2(X_{(n)})-\mu\|{}^2 -\|\delta_1(X_{(n)})-\mu\|{}^2 \right]\\ {} &\displaystyle &\displaystyle \quad + \sum_{i=1}^{m} E_{X_{(n)},Y_{(m)}} \bigg(\left[ (\delta_2(X_{(n)})-\mu) (\mu-Y_i) \right] - \left[(\delta_1(X_{(n)})-\mu) (\mu-Y_i) \right] \bigg) \\ {} &\displaystyle =&\displaystyle \frac{m}{2} \left( E_{X_{(n)}} \left[\|\delta_2(X_{(n)})-\mu\|{}^2\right] - E_{X_{(n)}} \left[\|\delta_1(X_{(n)})-\mu\|{}^2\right] \right) \\ {} &\displaystyle =&\displaystyle \frac{m}{2} \, \bigg[\, {\mathcal R}_Q(\delta_2, \mu)- {\mathcal R}_Q(\delta_1, \mu) \, \bigg], \end{array} \end{aligned} $$

which shows that the risk difference between \( \hat {p}_2\) and \( \hat {p}_1\) is proportional to the risk difference between δ 2 and δ 1.

Note that, by sufficiency and completeness of the statistic \( \bar {X}_n\), it suffices to consider only estimators of μ that depend on \( \bar {X}_n\) alone.

3.6.4 Properties of the Best Invariant Density

In this subsection, we restrict our attention to location models. We assume X ∼ p(x|μ) = p(x − μ) and Y ∼ p′(y|μ) = p′(y − μ), where p and p′ are two known, possibly different, densities. A density \( \hat {q}\) is called invariant (equivariant) with respect to the location parameter if, for any \(a \in \mathbb {R}^p\), \(x \in \mathbb {R}^p\) and \(y \in \mathbb {R}^p\), \(\hat{q} (y|x + a) = \hat{q}(y - a|x)\). This is equivalent to \(\hat{q}(y + a|x + a) = \hat{q}(y|x)\). The following result shows that the risk of an invariant predictive density is constant.

Lemma 3.13

The invariant predictive densities with respect to the location group of translations have constant risk.

Proof

By the property of invariance, the risk of an invariant density \( \hat {q}\) is equal to

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} {\mathcal R}(\mu,\hat{q})&\displaystyle =&\displaystyle \int \log \frac{p^\prime (y-\mu)}{\hat{q}(y|x)} \, p(x-\mu) \, p^\prime(y-\mu) \, dy \, dx \\ {} &\displaystyle =&\displaystyle \int \log \frac{p^\prime(y-\mu)}{\hat{q}(y-\mu|x-\mu)} \, p(x-\mu) \, p^\prime(y-\mu) \,dy \,dx \\ {} &\displaystyle =&\displaystyle \int \log \frac{p^\prime(z^\prime)}{\hat{q}(z^\prime|z)} \, p(z) \, p^\prime(z^\prime) \,dz^\prime \, dz, {} \end{array} \end{aligned} $$
(3.76)

by the change of variables z = x − μ and z′ = y − μ. Therefore, the risk \( { \mathcal R} ( \mu , \hat {q})\) does not depend on μ and is constant. □

Any invariant predictive density which minimizes this risk is known as the best invariant predictive density.

Lemma 3.14

The best invariant predictive density is the Bayesian predictive density \( \hat {p}_{U}\) associated with the Lebesgue measure on \( \mathbb {R}^p\), π(μ) ≡ 1, given by

$$\displaystyle \begin{aligned} \hat{p}_U(y|x) = \frac{\int_{\mathbb{R}^p} p^\prime(y|\mu) \, p(x|\mu) \,d \mu} {\int_{\mathbb{R}^p} p(x|\mu) \, \,d \mu}. \end{aligned} $$
(3.77)

Proof

Let Z = X − μ, Z′ = Y − μ, and T = Y − X = Z′− Z. We will show that \( \hat {p} (t) \), the density of T, which does not depend on μ, is the best invariant density. As noted above, if \( \hat {q}\) is an invariant predictive density, then \( \hat {q}(y|x) = \hat {q}(y-x|0) = \hat {q}(y-x)\), by an abuse of notation. Hence,

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} {\mathcal R}(\mu,\hat{q}) - {\mathcal R}(\mu,\hat{p}) &\displaystyle =&\displaystyle \int_{\mathbb{R}^p} \int_{\mathbb{R}^p} \left[ \log \frac{\hat{p}(y-x)}{\hat{q}(y-x)} \right] p(x-\mu) p^\prime(y-\mu) \,dx \,dy \\ {} &\displaystyle =&\displaystyle \int_{\mathbb{R}^p} \int_{\mathbb{R}^p} \left[ \log \frac{\hat{p}(z^\prime-z)}{\hat{q}(z^\prime-z)} \right] p(z) p^\prime(z^\prime) \,dz \, dz^\prime \\ {} &\displaystyle =&\displaystyle \int_{\mathbb{R}^p} \left[ \log \frac{\hat{p}(t)}{\hat{q}(t)} \right] \hat{p}(t) \, dt, \end{array} \end{aligned} $$
(3.78)

which is nonnegative by the inequality in (3.66). The lemma then follows from the fact that \( \hat {p} (t) = \hat {p} (y-x) = \hat {p}_U(y|x),\) that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \hat{p}(t)&\displaystyle = &\displaystyle \int_{\mathbb{R}^p} p(z) \, p^\prime(z+t) \,dz \\ &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} p(z) \, p^\prime(z+y-x) \,dz \\ &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} p(x-\mu) \, p^\prime(y-\mu) \,d \mu \\ &\displaystyle = &\displaystyle \frac{\int_{\mathbb{R}^p} p^\prime(y|\mu) \, p(x|\mu) \,d \mu} {\int_{\mathbb{R}^p} p(x|\mu) \, \,d \mu} \end{array} \end{aligned} $$
(3.79)

which is the expression of \( \hat {p}_U\) given in (3.70) with π(μ) = 1. □

Murray (1977) showed that \( \hat {p}_{U}\) is the best invariant density under the action of translations and of linear transformations for a Gaussian model; Ng (1980) generalized this result. Liang and Barron (2004) showed, without the hypothesis of independence between X and Y and for the estimation of p′(y|x, μ), that \( \displaystyle \hat {p}_U = \frac {\int _{ \mathbb {R}^p} p^\prime (y|x, \mu ) \, p(x| \mu ) \,d \mu } {\int _{ \mathbb {R}^p} p(x| \mu ) \, \,d \mu }\) is the best invariant density.

We will now show that \( \hat {p}_U\) is minimax in location problems.

Lemma 3.15

Let X ∼ p(x|μ) = p(x − μ) and Y ∼ p′(y|μ) = p′(y − μ), with unknown location parameter \( \mu \in \mathbb {R}^p\). Assume that \(E_0 \left [ \|X\| ^ 2 \right ] < \infty \). Then the best invariant predictive density \( \hat {p}_{U}\) is minimax.

Proof

We show minimaxity using Lemma 1.8. Consider a sequence {π k} of normal \(\mathcal {N}_p(0, k \, I_p)\) priors. The difference in Bayes risks between \( \hat {p}_U\) and \( { \hat {p} }_{ \pi _k} \) is given by

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} r(\hat{p}_U,\pi_k)-r({\hat{p}}_{\pi_k},\pi_k) &\displaystyle =&\displaystyle \int_{\mathbb{R}^p} \left[ {\mathcal R}(\mu,\hat{p}_U) - {\mathcal R}(\mu,\hat{p}_{\pi_k}) \right] \, \pi_k(\mu) \, \,d \mu \\ &\displaystyle =&\displaystyle \int_{\mathbb{R}^p} \int_{\mathbb{R}^p} \int_{\mathbb{R}^p} \log \frac{{\hat{p}}_{\pi_k}(y|x)}{\hat{p}_U(y|x)} p(y|\mu) \, p(x|\mu) \pi_k(\mu) \, \,dy \, \,dx \, \,d \mu \\ &\displaystyle =&\displaystyle \int_{\mathbb{R}^p} \int_{\mathbb{R}^p} \log \frac{{\hat{p}}_{\pi_k}(y|x)}{\hat{p}_U(y|x)} \left[\int_{\mathbb{R}^p} p(y|\mu) \, p(x|\mu) \, \pi_k(\mu) \, \,d \mu \right]\,dy \, \,dx \\ &\displaystyle =&\displaystyle E_{\pi_k}^{X,Y} \log \frac{{\hat{p}}_{\pi_k}(Y|X)}{\hat{p}_U(Y|X)} \end{array} \end{aligned} $$
(3.80)

where \(E_{ \pi _k} ^ {X,Y} \) denotes the expectation with respect to the joint marginal of (X, Y ),

$$\displaystyle \begin{aligned}m_{ \pi_k} (x,y) ~ = ~ \int_{ \mathbb{R}^p} ~p(y| \mu) \, p(x| \mu) \, \pi_k( \mu) \, \,d \mu. \end{aligned}$$

Since \(r(\hat {p}_U,\pi _k) = {\mathcal R}(\mu ,\hat {p}_U)\) (\(\hat {p}_U\) has constant risk), it suffices to show that (3.80) tends to 0 as k tends to infinity. Simplifying, one gets

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle {r(\hat{p}_U,\pi_k)-r({\hat{p}}_{\pi_k},\pi_k)}\\ {} &\displaystyle &\displaystyle \ = E_{\pi_k}^{X,Y} \left[ \log \left(\frac{\int p(x,y|\mu)\, \pi_k(\mu) \,d \mu} {\int p(x|\mu)\, \pi_k(\mu) \,d \mu} \frac{1}{\int p(x,y|\mu)\,d \mu} \right) \right]\\ {} &\displaystyle &\displaystyle \ = E_{\pi_k}^{X,Y} \left[ -\log \frac {\int p(x,y|\mu)\, \pi_k(\mu) \frac{1}{\pi_k(\mu)}\,d \mu} {\int p(x,y|\mu)\, \pi_k(\mu) \,d \mu} - \log \left(\int p(x|\mu)\, \pi_k(\mu) \,d \mu \right) \right] \\ {} &\displaystyle &\displaystyle \ =E_{\pi_k}^{X,Y} \left[ - \log E_{\mu|X,Y} \frac{1}{\pi_k(\mu)} - \log \left(\int p(x|\mu)\, \pi_k(\mu) \,d \mu \right) \right] \end{array} \end{aligned} $$

where E μ|X,Y denotes the expectation with respect to the posterior of μ given (X, Y ). An application of Jensen’s inequality gives

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle {r(\hat{p}_U,\pi_k)-r({\hat{p}}_{\pi_k},\pi_k)}\\ &\displaystyle &\displaystyle \ \leq E_{\pi_k}^{X,Y} E_{\mu|X,Y} \log \pi_k(\mu) -E_{\pi_k}^{X,Y} \left[ \int p(X|\mu)\, \log \pi_k(\mu) \,d \mu \right] . \end{array} \end{aligned} $$
(3.81)

Expanding the expectations, it follows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} E_{\pi_k}^{X,Y} E_{\mu|X,Y} \log \pi_k(\mu)&\displaystyle =&\displaystyle \iint m_{\pi_k}(x,y) \frac{\int p(x,y|\mu) \pi_k(\mu) \log(\pi_k(\mu))d\mu} {m_{\pi_k}(x,y)} dx dy \\ &\displaystyle =&\displaystyle \iiint p(x,y|\mu) \, \pi_k(\mu) \log(\pi_k(\mu)) \,d \mu \, dx dy \\ &\displaystyle =&\displaystyle \int \pi_k(\mu) \log(\pi_k(\mu)) d\mu. {} \end{array} \end{aligned} $$
(3.82)

Similarly, by integrating with respect to y and interchanging the roles of μ and μ′, we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle {E_{\pi_k}^{X,Y} \left[ \int p(X|\mu)\, \log \pi_k(\mu) \,d \mu \right]}\\ &\displaystyle &\displaystyle =\iiiint p(x|\mu^\prime) p(y|\mu^\prime) \pi_k(\mu^\prime) p(x|\mu) \log \pi_k(\mu) \,d \mu^\prime d\mu dx dy \\ &\displaystyle &\displaystyle =\iiint \pi_k(\mu^\prime) p(x|\mu) p(x|\mu^\prime) \log {\pi_k(\mu)} d \mu^\prime \,dx \,d \mu \\ &\displaystyle &\displaystyle =\iiint \pi_k(\mu) p(x|\mu) p(x|\mu^\prime) \log {\pi_k(\mu^\prime)} d\mu \,dx d\mu^\prime. \end{array} \end{aligned} $$
(3.83)

By combining (3.81), (3.82) and (3.83), and making the changes of variables z = x − μ and z′ = x − μ′, it follows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle {r(\hat{p}_U,\pi_k)-r({\hat{p}}_{\pi_k},\pi_k)} \\ &\displaystyle &\displaystyle \ \leq \iiint p(x|\mu) p(x|\mu^\prime) \pi_k(\mu) \left[\log(\pi_k(\mu))-\log(\pi_k(\mu^\prime))\right] d\mu d\mu^\prime \,dx \\ &\displaystyle &\displaystyle \ = \iiint \pi_k(\mu) p(x-\mu) p(x-\mu^\prime)\log \left(\frac{\pi_k(\mu)}{\pi_k(\mu^\prime)}\right) \,d \mu \,dz \, d z^\prime \\ &\displaystyle &\displaystyle \ =\iiint \pi_k(\mu) p(z)p(z^\prime)\log \left(\frac{\pi_k(\mu)}{\pi_k(\mu+z-z^\prime)}\right) \,d \mu \,dz \, d z^\prime . \end{array} \end{aligned} $$
(3.84)

In view of the form π k(μ), the term on the right in (3.84) can be written as

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle E_{\pi_k}E_{Z,Z^\prime} \log \left( \frac{\pi_k(\mu)}{\pi_k(\mu+Z-Z^\prime)}\right)\\ &\displaystyle =&\displaystyle E_{\pi_k} E_{Z,Z^\prime} \left[ \frac{1}{2k} \left( \|\mu+Z-Z^\prime\|{}^2- \|\mu\|{}^2\right) \right] \\ &\displaystyle =&\displaystyle E_{\pi_k}E_{Z,Z^\prime} \left[ \frac{1}{2k} \left( \|Z-Z^\prime\|{}^2 +2 \langle \mu ,Z-Z^\prime \rangle \right) \right] \\ &\displaystyle =&\displaystyle E_{Z,Z^\prime}\left[ \frac{1}{2k} \, \|Z-Z^\prime\|{}^2 \right], \end{array} \end{aligned} $$

since μ is independent of (Z, Z′) and \(E_{\pi _k}(\mu ) = 0\) (here, \(E_{Z,Z^\prime }\) denotes the expectation with respect to p(z, z′) = p(z) p′(z′)). This last expression, and hence the difference of Bayes risks, tends to zero as k →∞. Therefore, \( \hat {p}_U\) is minimax by Lemma 1.8. □

This result is due to Liang and Barron (2004); a more direct proof for the Gaussian case can be found in George et al. (2006) and is given in the next section.

3.6.5 An Explicit Expression for \( \hat {p}_U\) and Its Risk in the Normal Case

We now give an explicit expression for \( \hat {p}_U\), described in the previous subsections, in the Gaussian setting. Let \(X \sim {\mathcal N}_p (\mu , v_x I_p)\) and \(Y \sim {\mathcal N}_p (\mu , v_y I_p)\).

Lemma 3.16

The Bayesian predictive density associated with the uniform prior on \( \mathbb {R}^p\) , π(μ) ≡ 1, is given by the following expression

$$\displaystyle \begin{aligned} \hat{p}_{U} (y|x) = \frac{1}{\big( (2 \, \pi) \, (v_y + v_x) \big)^{p/2}} \exp\left(-\frac{\|y-x\|{}^2}{2 \, (v_x+v_y)} \right) . \end{aligned} $$
(3.85)

Proof

For W = (v y X + v x Y )∕(v x + v y) and v w = (v x v y)∕(v x + v y) it is clear that \(W \sim {\mathcal N}_p ( \mu ,v_w I_p)\), by the independence of X and Y . Further, note that

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{\|x-\mu\|{}^2}{2 v_x}+\frac{\|y-\mu\|{}^2}{2 v_y} &\displaystyle =&\displaystyle \frac{\|\mu-w\|{}^2}{2 v_w} + \frac{\|y-x\|{}^2}{2(v_x+v_y)} \, . \end{array} \end{aligned} $$
(3.86)

By definition, and through the previous representation, it follows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \displaystyle \hat{p}_{U} (y|x) &\displaystyle = &\displaystyle \dfrac {\displaystyle \int_{\mathbb{R}^p} p(y| \mu, v_y) \, p(x | \mu, v_x) \, d \mu} {\displaystyle \int_{\mathbb{R}^p} p(x | \mu, v_x) \, d \mu} \\ {} &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} \frac{1}{(2 \, \pi)^{p} (v_y \, v_x)^{p/2}} \exp\left(-\frac{\|x-\mu\|{}^2}{2 \, v_x} -\frac{\|y-\mu\|{}^2}{2 \, v_y}\right) \,d \mu \\ {} &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} \frac{1}{(2 \, \pi)^{p} (v_y \, v_x)^{p/2}} \exp\left(-\frac{\|\mu-w\|{}^2}{2 \, v_w} \right) \exp\left(-\frac{\|y-x\|{}^2}{2 \, (v_x+v_y)} \right) \,d \mu \\ {} &\displaystyle =&\displaystyle \frac{(2 \, \pi v_w)^{p/2}}{(2 \, \pi)^{p} (v_y \, v_x)^{p/2}} \exp\left(-\frac{\|y-x\|{}^2}{2(v_x+v_y)} \right) \\ {} &\displaystyle =&\displaystyle \frac{1}{\big( (2 \, \pi) \, (v_y + v_x) \big)^{p/2}} \exp\left(-\frac{\|y-x\|{}^2}{2 \, (v_x+v_y)} \right) . \hspace{3.6cm} \end{array} \end{aligned} $$

Note that the risk of \( \hat {p}_U\) is constant, as we have previously seen for invariant densities. Given the form of \( \hat {p}_U( \cdot |x )\), it follows that the Kullback-Leibler divergence is

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} &\displaystyle &\displaystyle {\mbox{KL}(\hat{p}_U(.|x), \mu) } \\ {} &\displaystyle &\displaystyle \ = \int p(y | \mu, v_y) \log \frac{p(y|\mu,v)}{\hat{p}_U(y|x)} \,dy \\ {} &\displaystyle &\displaystyle \ = E^{Y} \left[ \log \frac{p(Y|\mu,v)}{\hat{p}_U(Y|x)} \right] \\ {} &\displaystyle &\displaystyle \ = E^{Y} \left[ -\frac{p}{2} \log \frac{v_y}{v_x+v_y}-\frac{1}{2 v_y} \|Y-\mu\|{}^2 +\frac{1}{2(v_x+v_y)}\|Y-x\|{}^2 \right] \\ {} &\displaystyle &\displaystyle \ = -\frac{p}{2} \log \frac{v_y}{v_x+v_y}-\frac{p}{2 }+ E^{Y} \left[\frac{1}{2(v_x+v_y)} \left(\|Y-\mu\|{}^2+ \|\mu-x\|{}^2 \right) \right] \\ {} &\displaystyle &\displaystyle \ = \left[-\frac{p}{2} \log \frac{v_y}{v_x+v_y}-\frac{p}{2 }+ \frac{p v_y}{2(v_x+v_y)}\right]+ \frac{1}{2(v_x+v_y)} \|\mu-x\|{}^2. {} \end{array} \end{aligned} $$
(3.87)

Hence, we can conclude that the risk of \( \hat {p}_U\) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} {{\mathcal R}_{\mbox{{KL}}}}(\hat{p}_U,\mu)&\displaystyle = &\displaystyle E^X \left[ \mbox{KL}(\hat{p}_U, \mu,X) \right] \\ &\displaystyle =&\displaystyle \left[-\frac{p}{2} \log \frac{v_y}{v_x+v_y}-\frac{p}{2 }+ \frac{p v_y}{2(v_x+v_y)}\right]+ \frac{p v_x}{2(v_x+v_y)} \\ &\displaystyle = &\displaystyle -\frac{p}{2} \log \left(\frac{v_y}{v_x+v_y} \right) = \frac{p}{2} \log \left( 1+ \frac{v_x}{v_y}\right). {} \end{array} \end{aligned} $$
(3.88)

In the framework of the iid sampling model presented in Sect. 3.6.3 with Σ 1 = Σ 2 = I p, we can express the risk as

$$\displaystyle \begin{aligned}{{\mathcal R}_{\mbox{{KL}}}}( \hat{p}_U, \mu) = \frac{p} { 2} \log \left( 1+ \frac{m} {n} \right) \, . \end{aligned}$$
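
The constant risk (3.88) is easy to confirm by simulation. The following sketch is our own (the dimension and the variances are arbitrary choices); it evaluates the Kullback-Leibler loss of \( \hat {p}_U\) in (3.85) over draws of X and Y and compares the average to (p∕2) log(1 + v x∕v y).

```python
import numpy as np

rng = np.random.default_rng(3)
p, v_x, v_y = 4, 2.0, 1.0
mu = rng.normal(size=p)            # arbitrary true mean
n_mc = 400_000

x = mu + np.sqrt(v_x) * rng.normal(size=(n_mc, p))
y = mu + np.sqrt(v_y) * rng.normal(size=(n_mc, p))

def log_normal_pdf(z, mean, var):
    return -0.5 * p * np.log(2 * np.pi * var) - np.sum((z - mean) ** 2, axis=-1) / (2 * var)

# Empirical KL risk of p_U(.|x) = N_p(x, (v_x + v_y) I_p), see (3.85).
risk_mc = np.mean(log_normal_pdf(y, mu, v_y) - log_normal_pdf(y, x, v_x + v_y))
risk_formula = 0.5 * p * np.log(1 + v_x / v_y)     # constant risk (3.88)
print(risk_mc, risk_formula)
```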

A predictive density is called a plug-in density relative to an estimator δ if it has the form

$$\displaystyle \begin{aligned}\hat{p}_{ \delta} (y|x) = \dfrac{ 1 } { ( 2 \pi v_y )^{p/2} }\exp \left( - \dfrac{ 1 }{ 2} \dfrac{ \|y- \delta(x) \|{}^ 2 } {v_y} \right). \end{aligned}$$

The plug-in predictive density corresponding to the standard estimator of the mean μ, δ 0(X) = X, is

$$\displaystyle \begin{aligned}\hat{p}_{ \delta_0} (y|x) = \dfrac{ 1 } { \left( 2 \pi v_y \right) ^ {p/ 2 }} \exp \left[ - \dfrac{ 1 } { 2} \dfrac{ \|y-x\|{}^ 2 } {v_y} \right]. \end{aligned}$$

We can directly verify that the predictive density \( \hat {p}_U\) dominates the plug-in density \( \hat {p}_{ \delta _0}\) for any \( \mu \in \mathbb {R}^p.\) In fact, their difference in risk is

$$\displaystyle \begin{aligned} {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_{\delta_0}) - {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_U) = \left( \frac{1}{2\,v_y} - \frac{1}{2\,(v_x+v_y)} \right) E^{X,Y} \left( \| Y - X \|{}^2 \right) - \frac{p}{2} \log \left( 1 + \frac{v_x}{v_y} \right). \end{aligned}$$

Since \( E^{X,Y} \left ({\| Y - X \|}^2 \right )\) equals

$$\displaystyle \begin{aligned} \begin{array}{rcl} E^{X,Y} \left( \| Y-\mu\|{}^2 \right) + E^{X,Y} \left( \| X-\mu\|{}^2 \right) - &\displaystyle 2&\displaystyle \left< E^{X,Y} ( Y-\mu) , E^{X,Y} ( X-\mu) \right> \\ &\displaystyle =&\displaystyle p (v_x + v_y), \end{array} \end{aligned} $$

we have

$$\displaystyle \begin{aligned} {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_{\delta_0}) - {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_U) = \frac{p\,v_x}{2\,v_y} - \frac{p}{2} \log \left( 1 + \frac{v_x}{v_y} \right) > 0, \end{aligned}$$

since \(\log (1+t) < t\) for all t > 0.

Surprisingly, the predictive density \( \hat {p}_U\) has properties similar to those of the standard estimator δ 0(X) = X for the estimation of the mean under quadratic loss. Komaki (2001) showed that the density \( \hat {p}_U\) is dominated by the Bayesian predictive density corresponding to the harmonic prior, π(μ) = ∥μ∥2−p. George et al. (2006) extended the analogy with point estimation. We give some of this development next.

Lemma 3.17 (George et al. 2006, Lemma 2)

For W = (v y X + v x Y )∕(v x + v y) and v w = (v x v y)∕(v x + v y), let m π(W;v w) and m π(X;v x) be the marginals of W and X, respectively, relative to the prior π. Then

$$\displaystyle \begin{aligned} \hat{p}_{\pi}(y|X) = \frac{m_{\pi}(W; v_w)}{m_{\pi}(X; v_x)} \; \hat{p}_{U}(y|X)\, \end{aligned} $$
(3.89)

where \( \hat {p}_{U} ( \cdot |X)\) is the Bayesian predictive density associated with the uniform prior on \( \mathbb {R}^p\) given by (3.85). In addition, for any prior measure π, the Kullback-Leibler risk difference between \( \hat {p}_U(\cdot |x)\) and the Bayesian predictive density \( \hat {p}_{ \pi } ( \cdot |x)\) is given by

$$\displaystyle \begin{aligned} {\mathcal R}_{{\mathit{\mbox{{KL}}}}}(\mu,\hat{p}_U)-{\mathcal R}_{{\mathit{\mbox{{KL}}}}}(\mu,\hat{p}_{\pi}) = E_{\mu,v_w} \left[ \log \, m_{\pi}(W;v_w)\right] - E_{\mu,v_x} \left[ \log \, m_{\pi}(X;v_x) \right] \end{aligned} $$
(3.90)

where E μ, v denotes the expectation with respect to the normal \(\mathcal {N}_p ( \mu ,vI_p)\) distribution.

Proof

The marginal density of (X, Y ) associated with π is equal to

$$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{p}_{\pi}(x,y) &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} p(x|\mu,v_x) \, p(y|\mu,v_y) \, \pi(\mu) \, \,d \mu \\ &\displaystyle = &\displaystyle \int_{\mathbb{R}^p} \frac{1}{(2 \pi v_x)^{p/2}} \exp\left( -\frac{\|x-\mu\|{}^2}{2 v_x} \right) \frac{1}{(2 \pi v_y)^{p/2}} \exp\left( -\frac{\|y-\mu\|{}^2}{2 v_y} \right) \pi(\mu) \, \,d \mu. \end{array} \end{aligned} $$

Applying (3.85) and (3.86) it follows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{p}_{\pi}(x,y) &\displaystyle = &\displaystyle \frac{1}{(2 \pi)^{p}\,(v_x \,v_y)^{p/2}} \int_{\mathbb{R}^p} \exp\left(-\frac{\|y-x\|{}^2}{2(v_x+v_y)} \right) \exp\left( -\frac{\|\mu-w\|{}^2}{2 v_w} \right) \, \pi(\mu) \,d \mu \\ &\displaystyle = &\displaystyle \frac{(2 \pi v_w)^{p/2}}{(2 \pi)^{p}\,(v_x \,v_y)^{p/2}} \, \exp\left( -\frac{\|y-x\|{}^2}{2(v_x+v_y)} \right) m_{\pi} (w;v_w)\\ &\displaystyle = &\displaystyle \hat{p}_U(y|x) \, m_{\pi} (w;v_w). \end{array} \end{aligned} $$

Since \(\hat {p}_{\pi }(y|x) = \hat {p}_{\pi }(x,y) / m_\pi (x)\), (3.89) follows.

Hence, we can write the risk difference as

$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_U) - {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_{\pi}) \\ {} &\displaystyle =&\displaystyle \int \int p(x|\mu,v_x)\, p(y|\mu,v_y) \log \frac{\hat{p}_{\pi}(y|x)}{\hat{p}_U(y|x)} \,dy \,dx \\ {} &\displaystyle = &\displaystyle \int \int p(x|\mu,v_x)\, p(y|\mu,v_y) \log \frac{m_{\pi}(W(x,y);v_w)}{m_{\pi}(x;v_x)} \,dy \,dx \\ {} &\displaystyle =&\displaystyle E^{X,Y} \log m_{\pi}(W(X,Y);v_w) - E^{X,Y} \log m_{\pi}(X;v_x) \\ {} &\displaystyle =&\displaystyle E^{W} \log m_{\pi}(W | v_w) - E^X \log m_{\pi}(X | v_x). \hspace{1.6cm} \end{array} \end{aligned} $$
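
A quick numerical sanity check of (3.89) is possible with a conjugate normal prior \(\pi = {\mathcal N}_p(0, k I_p)\), for which the predictive density and the marginals are available in closed form (standard conjugate normal algebra; the sketch below and its parameter choices are ours, not from the text).

```python
import numpy as np

rng = np.random.default_rng(4)
p, v_x, v_y, k = 3, 1.5, 0.5, 2.0        # arbitrary choices; prior is N_p(0, k I_p)
x, y = rng.normal(size=p), rng.normal(size=p)

def log_normal(z, mean, var):
    return -0.5 * p * np.log(2 * np.pi * var) - np.sum((z - mean) ** 2) / (2 * var)

# Left-hand side of (3.89): the conjugate-prior predictive density in closed form.
post_mean = k * x / (v_x + k)
pred_var = v_y + k * v_x / (v_x + k)
lhs = log_normal(y, post_mean, pred_var)

# Right-hand side of (3.89): m_pi(W; v_w) / m_pi(X; v_x) times p_U(y|x) from (3.85),
# with m_pi(z; v) the N_p(0, (v + k) I_p) density under the normal prior.
w = (v_y * x + v_x * y) / (v_x + v_y)
v_w = v_x * v_y / (v_x + v_y)
rhs = log_normal(w, 0.0, v_w + k) - log_normal(x, 0.0, v_x + k) + log_normal(y, x, v_x + v_y)

print(lhs, rhs)    # identical up to floating-point error
```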

Using this lemma, George et al. (2006) gave a simple proof of the result of Liang and Barron (2004) in the Gaussian setting. Taking the same sequence of priors \( \pi _k = \mathcal {N}_p(0, kI_p)\), the difference of Bayes risks equals (using the constancy of the risk of \(\hat {p}_U\))

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_U)&\displaystyle -&\displaystyle r(\pi_k,\hat{p}_{\pi_k}) = \int \pi_k(\mu) \left[ E_{\mu,v_w} \log m_{\pi_k} (W;v_w) - E_{\mu,v_x} \log m_{\pi_k}(X;v_x) \right] \,d \mu \\ {} &\displaystyle =&\displaystyle \int \pi_k(\mu) \bigg[ E_{\mu,v_w} \log \left\{ (2 \pi(v_w+k))^{-p/2} \exp \left( - \frac{\|W\|{}^2}{2(v_w+k)} \right) \right\} \\ {} &\displaystyle -&\displaystyle E_{\mu,v_x} \log \left\{ (2 \pi(v_x+k))^{-p/2} \exp \left( - \frac{\|X\|{}^2}{2(v_x+k)} \right) \right\}\bigg] \,d \mu \\ {} &\displaystyle =&\displaystyle \int \pi_k(\mu) \bigg[ -\frac{p}{2} \log(2 \pi(v_w+k))- \frac{p \, v_w + \|\mu\|{}^2}{2(v_w+k)} \\ {} &\displaystyle +&\displaystyle \frac{p}{2} \log(2 \pi(v_x+k))+ \frac{p \, v_x + \|\mu\|{}^2}{2(v_x+k)} \bigg] \,d \mu \\ {} &\displaystyle =&\displaystyle -\frac{p}{2} \log \frac{v_w+k}{v_x+k}-\frac{p\,(v_w+k)}{2(v_w+k)} +\frac{p\,(v_x+k)}{2(v_x+k)} \\ {} &\displaystyle =&\displaystyle -\frac{p}{2} \log \frac{v_w+k}{v_x+k}, \end{array} \end{aligned} $$

using \(E_{\mu ,v} \|Z\|{}^2 = p\, v + \|\mu \|{}^2\) and \(\int \pi _k(\mu ) \, \|\mu \|{}^2 \,d\mu = p\, k\). Hence, \( \lim _{k \rightarrow \infty } [ r( \pi _k, { \hat {p}_U} ) - r( \pi _k, \hat {p}_{ \pi _k} ) ] = 0\) and so \( \hat {p}_U\) is minimax by Lemma 1.8. George et al. (2006) also show that the best invariant predictive density is dominated by any Bayesian predictive density relative to a superharmonic prior. This result parallels the result of Stein for the estimation of the mean under quadratic loss and the use of differential operators discussed in Sect. 2.6. The following lemma from George et al. (2006) allows us to give sufficient conditions for domination. We use Stein's identity in the proof.

Lemma 3.18

If m π(z;v x) is finite for any z, then for any v w ≤ v ≤ v x the marginal m π(z;v) is finite. In addition,

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{\partial}{\partial v} E_{\mu,v} \log m_{\pi} (Z;v) &\displaystyle =&\displaystyle E_{\mu,v} \left[ \frac{\varDelta m_{\pi}(Z;v)}{m_{\pi}(Z;v)} - \frac{1}{2} \|\nabla \log m_{\pi}(Z;v) \|{}^2 \right] \\ &\displaystyle =&\displaystyle E_{\mu,v} \left[2\, \frac{\varDelta \sqrt{ m_{\pi}(Z;v)}}{\sqrt{m_{\pi}(Z;v)}}\right]. \end{array} \end{aligned} $$
(3.91)

Proof

For any v w ≤ v ≤ v x,

$$\displaystyle \begin{aligned} \begin{array}{rcl} m_{\pi}(z;v) &\displaystyle =&\displaystyle \int_{\mathbb{R}^p} \frac{1}{(2 \, \pi \, v)^{p/2}} \exp\left(- \frac{\|z-\mu \|{}^2}{2 v} \right) \pi(\mu) \, \,d \mu \\ &\displaystyle =&\displaystyle \left(\frac{v_x}{v}\right)^{p/2} \int_{\mathbb{R}^p} \frac{1}{(2 \, \pi \, v_x)^{p/2}} \exp\left(- \frac{v_x}{v} \frac{\|z-\mu \|{}^2}{2 v_x} \right) \pi(\mu) \, \,d \mu \\ &\displaystyle \leq&\displaystyle \left(\frac{v_x}{v}\right)^{p/2} \, m_{\pi}(z;v_x) < \infty. \end{array} \end{aligned} $$

Hence, the marginal m π is finite. Setting \(Z^\prime = (Z - \mu ) / \sqrt {v} \sim {\mathcal N} (0,I) \),

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{\partial}{\partial v} E_{\mu,v} \log m_{\pi}(Z;v) &\displaystyle =&\displaystyle \frac{\partial}{\partial v} \int p(z|\mu,v) \log \left( m_{\pi}(z;v) \right) \,dz \\ &\displaystyle =&\displaystyle \frac{\partial}{\partial v} \int p(z^\prime|0,1) \log \left( m_{\pi}(\sqrt{v}z^\prime+\mu;v) \right) \, dz^\prime \\ &\displaystyle =&\displaystyle E_{Z^\prime} \frac{(\partial /\partial v) m_{\pi}(\sqrt{v}Z^\prime+\mu;v)}{m_{\pi}(\sqrt{v}Z^\prime+\mu;v)} \end{array} \end{aligned} $$
(3.92)

where

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{\partial}{\partial v} &\displaystyle &\displaystyle m_{\pi}(\sqrt{v}z^{\prime} + \mu;v) = \frac{\partial}{\partial v} \int \frac{1}{(2\pi v)^{p/2}} \, \exp\left \{ - \frac{\|\sqrt{v}z^{\prime}+\mu-\mu^\prime \|{}^2}{2 v} \right \} \pi(\mu^{\prime}) \, d\mu^{\prime} \\ &\displaystyle =&\displaystyle \int \left( -\frac{p}{2 \, v} +\frac{\|z-\mu^{\prime}\|{}^2}{2 \, v^2} - \frac{\|z^{\prime}\|{}^2}{2 \, v}- \frac{\langle z^{\prime},\mu-\mu^{\prime}\rangle}{2 \, v^{3/2}} \right) p(z|\mu^{\prime}) \, \pi(\mu^{\prime}) \, d\mu^{\prime} \\ &\displaystyle =&\displaystyle \frac{\partial}{\partial v} \, m_{\pi}(z;v) -\int \frac{\langle z-\mu,z-\mu^\prime \rangle}{2 \, v^2} \, p(z|\mu^{\prime}) \, \pi(\mu^{\prime}) \, d\mu^{\prime}. \end{array} \end{aligned} $$
(3.93)

Note that

$$\displaystyle \begin{aligned} \nabla_z m_{\pi}(z,v)= \int \frac{-(z-\mu)}{v} p(z|\mu) \pi(\mu) d\mu \end{aligned} $$
(3.94)

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \varDelta_z m_{\pi}(z,v)&\displaystyle =&\displaystyle \int \left[\frac{-p}{v}+ \frac{\|z-\mu\|{}^2}{v^2} \right] p(z|\mu) \pi(\mu) d\mu \\ &\displaystyle =&\displaystyle 2 \, \frac{\partial}{\partial v} m_{\pi}(z;v). \end{array} \end{aligned} $$
(3.95)

It follows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} E_{Z^{\prime}} \frac{(\partial /\partial v) m_{\pi}(\sqrt{v}Z^{\prime} + \mu;v)}{m_{\pi}(\sqrt{v}Z^{\prime}+\mu;v)} &\displaystyle =&\displaystyle E_{\mu,v} \left(\frac{1}{2} \frac{\varDelta m_{\pi}(Z;v)}{m_{\pi}(Z;v)} +\frac{\langle Z-\mu,\nabla \log m_{\pi}(Z;v) \rangle}{2 \, v} \right). \end{array} \end{aligned} $$

Hence, using Stein’s identity,

$$\displaystyle \begin{aligned} \begin{array}{rcl} E_{\mu,v} \left[\frac{(Z-\mu)^{\scriptscriptstyle{\mathrm{T}}}\nabla \log m_{\pi}(Z;v) }{2 \, v} \right] &\displaystyle =&\displaystyle E_{\mu,v} \left[ \frac{1}{2} \varDelta \log m_{\pi}(Z;v) \right] \\ {} &\displaystyle =&\displaystyle E_{\mu,v} \left[ \frac{1}{2} \left( \frac{\varDelta m_{\pi}(Z;v)}{m_{\pi}(Z;v)} - \left \| \nabla \log m_{\pi}(Z;v) \right\|{}^2 \right) \right], \end{array} \end{aligned} $$

which is the desired result. □

Lemmas 3.17 and 3.18 give a result regarding minimaxity and domination from George et al. (2006). This result parallels those on minimax estimation of the mean under quadratic loss in Sect. 3.1.1. Its proof is contained in the proof of Theorem 3.17.

Theorem 3.16

Assume that m π(z;v x) is finite for any z in \( \mathbb {R}^p.\) If Δm π(⋅ ;v) ≤ 0 for all v w ≤ v ≤ v x , then the Bayesian predictive density \( \hat {p}_{ \pi } (y|x)\) is minimax and dominates \( \hat {p}_U\) (when π is not the uniform prior itself). If Δπ ≤ 0, then the Bayesian predictive density \( \hat {p}_{\pi } (y|x)\) is minimax and dominates \( \hat {p}_U\) (when π is not the uniform prior itself).

The next result from Brown et al. (2008) illuminates the link between the two problems of estimating the predictive density under the Kullback-Leibler loss and estimating the mean under quadratic loss. The result expresses this link in terms of risk differences.

Theorem 3.17

Suppose the prior π(μ) is such that the marginal m π(z;v) is finite for any \(z \in \mathbb {R}^p\) . Then,

$$\displaystyle \begin{aligned} {\mathcal R}_{{\mathit{\mbox{{KL}}}}}(\mu,\hat{p}_U)-{\mathcal R}_{{\mathit{\mbox{{KL}}}}}(\mu,\hat{p}_{\pi}) \, = \, \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left({\mathcal R}_Q^{v}(\mu, X) - {\mathcal R}_Q^{v}(\mu, \hat{\mu}_{\pi, v})\right) \,dv.\end{aligned} $$
(3.96)

Proof

From (3.90) and (3.91) it follows

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_U)-{\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_{\pi}) &\displaystyle = &\displaystyle \int_{v_w}^{v_x} -\frac{\partial}{\partial v} E_{\mu,v} [\log m_{\pi}(Z;v)] \, \,dv \\ &\displaystyle =&\displaystyle - \int_{v_w}^{v_x} E_{\mu,v} \left[2 \, \frac{\varDelta\sqrt{ m_{\pi}(Z;v)}}{\sqrt{m_{\pi}(Z;v)}}\right] \,dv. \end{array} \end{aligned} $$
(3.97)

On the other hand, Stein (1981) showed that

$$\displaystyle \begin{aligned} {\mathcal R}_Q^{v}(\mu, X) - {\mathcal R}_Q^{v}(\mu, \hat{\mu}_{\pi, v}) =-4 v^2 E_{\mu,v} \frac{\varDelta\sqrt{ m_{\pi}(Z;v)}}{\sqrt{m_{\pi}(Z;v)}}. \end{aligned} $$
(3.98)

Hence substituting (3.98) in the integral (3.97) gives (3.96). □

It is worth noting that using (3.88) and (3.96) leads to the following expression for the Kullback-Leibler risk of \( \hat {p}_U\):

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left({\mathcal R}_Q^{v}(\mu, X) \right) \, \,dv &\displaystyle =&\displaystyle \frac{1}{2} \int_{v_w}^{v_x} \frac{p}{v} \, \,dv \\ &\displaystyle =&\displaystyle \frac{p}{2} \, \log \frac{v_x}{v_w} \\ &\displaystyle =&\displaystyle \frac{p}{2} \log \left(1+ \frac{v_x}{v_y}\right) \\ &\displaystyle =&\displaystyle {\mathcal R}_{{\mbox{{KL}}}}(\mu,\hat{p}_U) \, . \end{array} \end{aligned} $$
(3.99)
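
The computation in (3.99) is easily reproduced numerically; the short sketch below is our own (arbitrary dimension and variances) and simply integrates p∕(2v) over [v w, v x].

```python
import numpy as np

p, v_x, v_y = 4, 2.0, 1.0
v_w = v_x * v_y / (v_x + v_y)

# Midpoint rule for (1/2) * integral over [v_w, v_x] of R_Q^v(mu, X) / v^2 = p / (2 v).
grid = np.linspace(v_w, v_x, 100_001)
mid = 0.5 * (grid[1:] + grid[:-1])
integral = np.sum(0.5 * p / mid) * (grid[1] - grid[0])

print(integral, 0.5 * p * np.log(1 + v_x / v_y))   # both approximately 2.197, as in (3.99)
```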

The area of predictive density estimation continues to develop. Recent research covers the case of a restricted parameter space (Fourdrinier et al. 2011), general α-divergence losses (Maruyama and Strawderman 2012; Boisbunon and Maruyama 2014), and integrated L1 and L2 losses (Kubokawa et al. 2015, 2017). For a general review, see George and Xu (2010).