Abstract
As we saw in Chap. 2, the frequentist paradigm is well suited for risk evaluations, but is less useful for estimator construction. It turns out that the Bayesian approach is complementary, as it is well suited for the construction of possibly optimal estimators. In this chapter we take a Bayesian view of minimax shrinkage estimation. In Sect. 3.1 we derive a general sufficient condition for minimaxity of Bayes and generalized Bayes estimators in the known variance case; we also illustrate the theory with numerous examples.
As we saw in Chap. 2, the frequentist paradigm is well suited for risk evaluations, but is less useful for estimator construction. It turns out that the Bayesian approach is complementary, as it is well suited for the construction of possibly optimal estimators. In this chapter we take a Bayesian view of minimax shrinkage estimation. In Sect. 3.1 we derive a general sufficient condition for minimaxity of Bayes and generalized Bayes estimators in the known variance case; we also illustrate the theory with numerous examples. In Sect. 3.2 we extend these results to the case when the variance is unknown. Section 3.3 considers the case of a known covariance matrix under a general quadratic loss. The admissibility of Bayes estimators is discussed in Sect. 3.4. Interesting connections to MAP estimation, penalized likelihood methods, and shrinkage estimation are developed in Sect. 3.5. The fascinating connections between Stein estimation and estimation of a predictive density under Kullback-Leibler divergence are outlined in Sect. 3.6.
3.1 Bayes Minimax Estimators
In this section, we derive a general sufficient condition for minimaxity of Bayes and generalized Bayes estimators when \(X \sim \mathcal {N}_{p}(\theta , \sigma ^{2} I_{p})\), with known σ 2, and the loss function is ∥δ − θ∥2, due to Stein (1973, 1981). The condition depends only on the marginal distribution and states that a generalized Bayes estimator is minimax if the square root of the marginal distribution is superharmonic. Alternative (stronger) sufficient conditions are that the prior distribution or the marginal distribution is superharmonic. We establish these results in Sect. 3.1.1 and apply them in Sect. 3.1.2 to obtain classes of prior distributions which lead to minimax (generalized and proper) Bayes estimators. Section 3.1.3 will be devoted to minimax multiple shrinkage estimators .
Throughout this section, let \(X \sim \mathcal {N}_{p}(\theta , \sigma ^{2} I_{p})\) (with σ 2 known) and the loss be L(θ, δ) = ∥δ − θ∥2. Let θ have the (generalized) prior distribution π and let the marginal density, m(x), of X be
Recall from Sect. 1.4 that the Bayes estimator corresponding to π(θ) is given by
Since the constant K in (3.1) plays no role in (3.2) we will typically take it to be equal to 1 for simplicity. It may happen that an estimator will have the form (3.2) where m(X) does not correspond to a true marginal distribution. In this case we will refer to such an estimator as a pseudo-Bayes estimator, provided x↦∇m(x)∕m(x) is weakly differentiable . Recall that, if δ π(X) is generalized Bayes, x↦m(x) is a positive analytic function and so x↦∇m(x)∕m(x) is automatically weakly differentiable.
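As an informal numerical illustration of the form (3.2) (a sketch in Python assuming only NumPy; the marginal, the dimension, and the evaluation point are illustrative choices and not part of the development), a pseudo-Bayes estimator can be evaluated by forming ∇m(x)∕m(x) with central finite differences and compared with its closed form when m(x) = ∥x∥−2b.

```python
import numpy as np

def pseudo_bayes(x, m, sigma2=1.0, eps=1e-5):
    """Evaluate delta(x) = x + sigma^2 * grad m(x) / m(x) by central differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (m(x + e) - m(x - e)) / (2 * eps)
    return x + sigma2 * grad / m(x)

# Illustrative choice m(x) = ||x||^{-2b}, which yields a James-Stein-type estimator.
p, b = 5, (5 - 2) / 2
m = lambda x: np.sum(x**2) ** (-b)
x = np.array([2.0, -1.0, 0.5, 1.5, -0.5])
print(pseudo_bayes(x, m))                    # numerical evaluation of the form (3.2)
print((1 - 2 * b / np.sum(x**2)) * x)        # closed form (1 - 2b sigma^2/||x||^2) x
```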
3.1.1 A Sufficient Condition for Minimaxity of (Proper, Generalized, and Pseudo) Bayes Estimators
Stein (1973, 1981) gave the following sufficient condition for a generalized Bayes estimator to be minimax . This condition relies on the superharmonicity of the square root of the marginal. Recall from Corollary A.2 in Appendix A.8.3 that a function f from \(\mathbb {R}^p\) into \(\mathbb {R}\) which is twice weakly differentiable and lower semicontinuous is superharmonic if and only if, for almost every \(x \in \mathbb {R}^p\), we have Δf(x) ≤ 0, where Δf is the weak Laplacian of f. Note that, if the function f is analytic, the last inequality holds for any \(x \in \mathbb {R}^p\).
Theorem 3.1
Under the model of this section, an estimator of the form (3.2) has finite risk if E θ [∥∇m(X)∕m(X)∥2 ] < ∞ and is minimax provided \(x \mapsto \sqrt {m(x)}\) is superharmonic (i.e., \(\varDelta \sqrt {m(x)} \le 0\) , for any \(x \in \mathbb {R}^p\) ).
Proof
First, note that, as noticed in Example 1.1, the marginal m is a positive analytic function, and so is \(\sqrt {m}\).
Using Corollary 2.1 and the fact that δ π(X) = X + σ 2 g(X) with g(X) = ∇m(X)∕ m(X), the estimator δ π(X) has finite risk if E θ[∥∇m(X)∕m(X)∥2] < ∞. Also, it is minimax provided, for almost any \(x \in \mathbb {R}^p\),
Now, for any \(x \in \mathbb {R}^p\),
where
is the Laplacian of m(x). Hence, by straightforward calculation,
Therefore \({\mathcal D}(x) \leq 0\) since \(x \mapsto \sqrt {m(x)}\) is superharmonic. □
It is convenient to assemble the following results for the case of spherically symmetric marginals. The proof is straightforward and left to the reader.
Corollary 3.1
Assume the prior density π(θ) is spherically symmetric around 0 (i.e., π(θ) = π(∥θ∥2)). Then
-
(1)
the marginal density m of X is spherically symmetric around 0 (i.e., m(x) = m(∥x∥2), for any \(x \in \mathbb {R}^p\) );
-
(2)
the Bayes estimator equals
$$\displaystyle \begin{aligned}\delta_\pi(X) = X + 2 \, \sigma^2\, \frac{m^\prime (\Vert X\Vert^2)}{m(\Vert X\Vert^2)} \, X \end{aligned}$$and has the form of a Baranchik estimator (2.19) with
$$\displaystyle \begin{aligned}a\, r(t) = - 2\, \frac{m^\prime (t)}{m(t)}\, t \qquad \forall t \geq 0 \, ; \end{aligned}$$ -
(3)
the unbiased estimator of the risk difference between δ π(X) and X is given by
$$\displaystyle \begin{aligned}{\mathcal D}(X) = 4 \, \sigma^4 \, \left\{ p\, \frac{m^\prime (\|X\|{}^2)}{m(\|X\|{}^2)} + 2 \, \|X\|{}^2\, \frac{m^{\prime \prime}(\|X\|{}^2)}{m(\|X\|{}^2)} - \|X\|{}^2 \left(\frac{m^\prime(\|X\|{}^2)}{m(\|X\|{}^2)}\right)^2\right\} . \end{aligned}$$
While, in Theorem 3.1, minimaxity of δ π(X) follows from the superharmonicity of \(\sqrt {m(X)}\), it is worth noting that, in the setting of Corollary 3.1, it can be obtained from the concavity of t↦m 1∕2(t 2∕(2−p)).
The following corollary is often useful. It shows that \(\sqrt {m(X)}\) is superharmonic if m(X) is superharmonic, which in turn follows if the prior density π(θ) is superharmonic.
Corollary 3.2
-
(1)
A finite risk (generalized, proper, or pseudo) Bayes estimator of the form (3.2) is minimax provided the marginal m is superharmonic (i.e. Δm(x) ≤ 0, for any \(x \in \mathbb {R}^p\) ).
-
(2)
If the prior distribution has a density, π, which is superharmonic, then a finite risk generalized or proper Bayes estimator of the form (3.2) is minimax.
Proof
Part (1) follows from the first equality in (3.3), which shows that superharmonicity of m implies superharmonicity of \(\sqrt {m}\). Indeed, the superharmonicity of m implies the superharmonicity of any nondecreasing concave function of m.
Part (2) follows since, for any \(x\in \mathbb {R}^p\),
where the second equality follows from exponential family properties and the last equality is Green’s formula (see also Sect. A.9). More generally, any mixture of superharmonic functions is superharmonic (Sect. A.8). □
Note that the condition of finiteness of risk is superfluous for proper Bayes estimators since the Bayes risk is bounded above by p σ 2, and Fubini’s theorem assures that the risk function is finite a.e. (π). Continuity of the risk function implies finiteness for all θ in the convex hull of the support of π (see Berger (1985a) and Lehmann and Casella (1998) for more discussion on finiteness and continuity of risk).
As an example of a pseudo-Bayes estimator, consider m(X) of the form
The case b = 0 corresponds to m(X) = 1 which is the marginal corresponding to the “uniform” generalized prior distribution π(θ) ≡ 1, which in turn corresponds to the generalized Bayes estimator δ 0(X) = X. If b > 0, m(X) is unbounded in a neighborhood of 0 and consequently is not analytic. Thus, m(X) cannot be a true marginal (for any generalized prior). However,
and
which is weakly differentiable if p ≥ 3 (see Sect. 2.3). Hence, for p ≥ 3, the James-Stein estimator
is a pseudo-Bayes estimator. Also, a simple calculation gives
It follows that m(X) is superharmonic for 0 ≤ b ≤ (p − 2)∕2 and similarly that \(\sqrt {m(X)}\) is superharmonic for 0 ≤ b ≤ p − 2. An application of Theorem 3.1 gives minimaxity for 0 ≤ b ≤ p − 2 which agrees with Theorem 2.2 (with a = 2b), while an application of Corollary 3.2 establishes minimaxity for only half of the interval, i.e. 0 ≤ b ≤ (p − 2)∕2. Thus, while useful, the corollary is considerably weaker than the theorem.
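The superharmonicity ranges in this example are easy to check numerically. The following sketch (Python with NumPy only; the dimension and the test point are arbitrary choices) compares finite-difference Laplacians of m(x) = ∥x∥−2b and of its square root with the closed forms Δm(x) = 2b(2b + 2 − p)∥x∥−2b−2 and Δ√m(x) = b(b + 2 − p)∥x∥−b−2, which are nonpositive exactly when b ≤ (p − 2)∕2 and b ≤ p − 2, respectively.

```python
import numpy as np

def laplacian(f, x, eps=1e-4):
    """Numerical Laplacian of f at x by central second differences."""
    val = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        val += (f(x + e) - 2 * f(x) + f(x - e)) / eps**2
    return val

p = 6
x = np.array([1.0, -0.5, 2.0, 0.3, -1.2, 0.7])
r2 = np.sum(x**2)
# columns: b, numerical and exact Laplacian of m, numerical and exact Laplacian of sqrt(m)
for b in [1.0, (p - 2) / 2, 3.0, float(p - 2), 5.0]:
    m = lambda z, b=b: np.sum(z**2) ** (-b)
    sqrt_m = lambda z, b=b: np.sum(z**2) ** (-b / 2)
    print(b,
          round(laplacian(m, x), 6), round(2 * b * (2 * b + 2 - p) * r2 ** (-b - 1), 6),
          round(laplacian(sqrt_m, x), 6), round(b * (b + 2 - p) * r2 ** (-b / 2 - 1), 6))
```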
Another interesting aspect of this example relates to the existence of proper Bayes minimax estimators for p ≥ 5. Considering the behavior of m(x) for ∥x∥≥ R for some positive R, note that
and that this integral is finite if and only if p − 2 b < 0. Thus, integrability of m(x) for ∥x∥≥ R and minimaxity of the (James-Stein) pseudo-Bayes estimator corresponding to m(X) are possible if and only if p∕2 < b ≤ p − 2, which implies p ≥ 5.
It is also interesting to note that superharmonicity of m(X) (i.e. 0 ≤ b ≤ (p − 2)∕2) is incompatible with integrability of m(x) on ∥x∥≥ R (i.e. b > p∕2). This is illustrative of a general fact that a generalized Bayes minimax estimator corresponding to a superharmonic marginal cannot be proper Bayes (see Theorem 3.2).
3.1.2 Construction of (Proper and Generalized) Minimax Bayes Estimators
Corollary 3.2 provides a method of constructing pseudo-Bayes minimax estimators. In this section, we concentrate on the construction of proper and generalized Bayes minimax estimators. The results in this section are primarily from Fourdrinier et al. (1998). Although Corollary 3.2 is helpful in constructing minimax estimators, it cannot be used to develop proper Bayes minimax estimators, as indicated in the example at the end of the previous section. The following result establishes that a superharmonic marginal (and consequently a superharmonic prior density) cannot lead to a proper Bayes estimator.
Theorem 3.2
Let m be a superharmonic marginal density corresponding to a prior π. Then π is not a probability measure.
Proof
Assume π is a probability measure. Then it follows that m is an integrable, strictly positive, and bounded function in C ∞ (the space of functions which have derivatives of all orders). Recall from Example 1.1 of Sect. 1.4 that the posterior risk is given, for any \(x \in \mathbb {R}^p\), by
Hence, the Bayes risk is
where E m is the expectation with respect to the marginal density m. Also, denoting by E π the expectation with respect to the prior π, we may use the unbiased estimate of risk to express r(π) as
since the unbiased estimate of risk does not depend on θ, by definition. Hence, by taking the difference,
Now, since the marginal m is superharmonic (Δm(x) ≤ 0 for any \(x \in \mathbb {R}^p\)), strictly positive and in C ∞, it follows that Δm ≡ 0. Finally, the strict positivity and harmonicity of m implies that m ≡ C where C is a positive constant (see Doob 1984), and hence, that \(\int _{\mathbb {R}^{p}}\,m(x)\, dx = \infty \), which contradicts the integrability of m. □
We now turn to the construction of Bayes minimax estimators. Consider prior densities of the form
for some constant k and some nonnegative function h on \(\mathbb {R}^{+}\) such that the integral exists, i.e. π(θ) is a variance mixture of normal distributions . It follows from Fubini’s theorem that, for any \(x \in \mathbb {R}^p\),
where
Lebesgue’s dominated convergence theorem ensures that we may differentiate under the integral sign and so
and
where
and
Then the following integral
exists for j ≥ p∕2. Hence, with y = ∥x∥2∕2σ 2, we have
Note that
since I j+p(y) ≤ I j(y). Hence,
which, according to Theorem 3.1, guarantees the finiteness of the risk of the Bayes estimator δ π(X) in (3.2). Furthermore, the unbiased estimator of risk difference (3.3) can be expressed as
Then the following intermediate result follows immediately from (3.2) and Theorem 3.1 since finiteness of risk has been guaranteed above.
Lemma 3.1
The generalized Bayes estimator corresponding to the prior density (3.4) is minimax provided
The next theorem gives sufficient conditions on the mixing density h(⋅) so that the resulting generalized Bayes estimator is minimax.
Theorem 3.3
Let h be a positive differentiable function such that the function − (v + 1)h ′(v)∕h(v) = l 1(v) + l 2(v) where l 1(v) ≤ A and is nondecreasing while 0 ≤ l 2 ≤ B with A + 2 B ≤ (p − 2)∕2. Assume also that limv→∞ h(v)∕(v + 1)p∕2−1 = 0 and that \(\int ^{\infty }_{0} \exp (-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv < \infty \) . Then the generalized Bayes estimator (3.2) for the prior density (3.4) corresponding to the mixing density h is minimax. Furthermore, if h is integrable, the resulting estimator is also proper Bayes.
Proof
Via integration by parts , we first find an alternative expression for
Letting u = (1 + v)−k+2 h(v) and \(dw = (1 + v)^{-2} \, \exp (-y/(1+v)) \, dv\), so that du = (−k + 2)(1 + v)−k+1 h(v) + (1 + v)−k+2 h ′(v) and \(w = \exp (-y/(1+v)) / y\), we have, for k ≥ p∕2 + 1,
Applying (3.10) to both numerators in the left-hand side of (3.9) we have
since I p∕2+1(y) < I p∕2(y). Then it follows from Lemma 3.1 that δ π(X) is minimax provided, for any y ≥ 0,
where
and where \(E^{y}_{k} [f(V)]\) is the expectation of f(V ) with respect to the random variable V with density \(g^{y}_{k}(v) = \exp (-y/(1+v)) \, (1+v)^{-k} \, h(v) / I_{k}(y)\). Now upon setting − (v + 1) h ′(v)∕h(v) = l 1(v) + l 2(v) and noting that \(g^{y}_{k}(v)\) has monotone decreasing likelihood ratio in k, for fixed y, we have
since l 2 ≥ 0. Also
since l 1 is nondecreasing. Then
since l 1 ≤ A and l 2 ≤ B and by the assumptions on A and B. The result follows. □
The following corollary allows the construction of mixing distributions so that the conditions of the theorem are met and the resulting (generalized or proper) Bayes estimators are minimax.
Corollary 3.3
Let ψ = ψ 1 + ψ 2 be a continuous function such that ψ 1 ≤ C and is nondecreasing, while 0 ≤ ψ 2 ≤ D, and where C ≤−2D. Define, for v > 0, \(h(v) = \exp \left [ - \frac {1}{2} \int ^{v}_{v_{0}} \frac {2 \, \psi (u) + p - 2}{u + 1} \,du \right ]\) where v 0 ≥ 0. Assume also that limv→∞ h(v)∕(1 + v)p∕2−1 = 0 and that \(I_{p/2}(y) = \int ^{\infty }_{0} \exp (-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv < \infty \).
Then the Bayes estimator corresponding to the mixing density h is minimax. Furthermore if h is integrable the estimator is proper Bayes.
Proof
A simple calculation shows that
Setting l 1(v) = ψ 1(v) + (p − 2)∕2 and l 2(v) = ψ 2(v), the result follows from Theorem 3.3 with A = (p − 2)∕2 + C and B = D. □
Note that finiteness of I p∕2(y) in Corollary 3.3 is assured if we strengthen the limit condition to limv→∞ h(v)∕(1 + v)p∕2−1−𝜖 = 0 for some 𝜖 > 0, since this implies that h(v)∕(1 + v)p∕2 ≤ M∕(1 + v)1+𝜖 for some M > 0 and any v > 0. Thus
3.1.3 Examples
An interesting and useful class of examples results from the choice
for some \((\alpha , \beta , \gamma ) \in \mathbb {R}^3\). A simple calculation shows
Example 3.1 (The Strawderman 1971 prior)
Suppose α ≤ 0 and β = γ = 0 so that h(v) ∝ (v + 1)−α−(p−2)∕2. Let ψ 1(v) = ψ(v) ≡ α and ψ 2(v) ≡ 0 so that C = D = 0. Then the minimaxity conditions of Corollary 3.3 require limv→∞ h(v)∕(1 + v)p∕2−1 =limv→∞(v + 1)−α−(p−2) = 0 and this is satisfied if α > 2 − p. Also
if α > 2 − p as above. Hence in this case the corresponding generalized Bayes estimator is minimax if 2 − p < α ≤ 0 (which requires p ≥ 3).
Furthermore it is proper Bayes minimax if \( \int ^{\infty }_{0} (1+v)^{- \alpha -(p-2)/2}\, dv < \infty \) which is equivalent to 2 − p∕2 < α ≤ 0. This latter condition requires p ≥ 5 and demonstrates the existence of proper Bayes minimax estimators for p ≥ 5. We will see below that this is the class of priors studied in Strawderman (1971) under the alternative parametrization λ = 1∕(1 + v).
Example 3.2
Consider ψ(v) given by (3.11) with α ≤ 0, β ≤ 0 and γ ≤ 0. Here we take ψ 1(v) = ψ(v), ψ 2(v) = 0, and C = D = 0. The minimaxity conditions of Corollary 3.3 require
This implies 2 − p < α ≤ 0. The finiteness condition on
also requires 2 − p < α ≤ 0. Therefore, minimaxity is ensured as soon as 2 − p < α ≤ 0.
Furthermore, the minimax estimator will be proper Bayes if
This holds if \(2 - \frac {p}{2} < \alpha \leq 0\) as in Example 3.1.
Example 3.3
Suppose α ≤ 0, β > 0, and γ < 0 and take
for C = α and D = −β 2∕4γ.
Note first that ψ 1(v) is monotone nondecreasing and bounded above by α; also, 0 ≤ ψ 2(v) ≤−β 2∕4γ. Therefore, we require C = α < −2D = β 2∕2γ. The conditions limv→∞ h(v)∕(1 + v)p∕2−1 = 0 and \(\int ^{\infty }_{0} \exp (-y/(1+v)) \, (1+v)^{-p/2} \, h(v) \, dv < \infty \) are, as in Example 3.2, 2 − p < α ≤ 0.
Thus, δ π(X) is minimax for 2 − p < α ≤ β 2∕2γ < 0. The condition for integrability of h is also, as in Example 3.2, i.e. \(2 - \frac {p}{2} < \alpha \leq \beta ^{2}/ 2 \gamma < 0\).
In this example, ψ(v) is not monotone but is increasing on [0, −2γ∕β) and decreasing thereafter. This typically corresponds to a non-monotone r(∥X∥2) in the Baranchik-type representation of δ π(X).
For simplicity, in the following examples, we assume σ 2 = 1.
Example 3.4 (Student-t priors)
In this example we take ψ(v) as in Examples 3.2 and 3.3 with the specific choices α = (m − p + 4)∕2 ≤ 0, β = (m (1 − φ) + 2)∕2, and γ = −m φ∕2 ≤ 0, where m ≥ 1. In this case \(h(v) = C \, v^{- (m+2) / 2} \exp (- m \, \varphi / 2 \, v)\), an inverse gamma density . Hence, as is well known, π(θ) is a multivariate-t distribution with m-degrees of freedom and scale parameter φ if m is an integer (see e.g. Muirhead 1982, p.33 or Robert 1994, p.174). If σ 2≠1, the scale of the t-distribution is φ σ.
For various different values of m and φ, either the conditions of Example 3.2 or the conditions of Example 3.3 apply. Both examples require α = (m − p + 4)∕2 ≤ 0, or equivalently 1 ≤ m ≤ p − 4 (so that p ≥ 5), and γ = −m φ∕2 ≤ 0.
Example 3.2 requires β = (m (1 − φ) + 2)∕2 < 0, or equivalently, φ ≥ (m + 2)∕m. The condition for minimaxity 2 − p < α ≤ 0 is satisfied since it is equivalent to m > −p. Furthermore the condition for proper Bayes minimaxity, \(2 - \frac {p}{2} < \alpha \le 0\), is satisfied as well since it reduces to m > 0. Hence, if φ ≥ (m + 2)∕m, the scaled p-variate t prior distribution leads to a proper Bayes minimax estimator for p ≥ 5 and m ≤ p − 4.
On the other hand, when φ < (m + 2)∕m, or equivalently, β > 0, the conditions of Example 3.3 are applicable. Considering the proper Bayes case only, the condition for minimaxity of the Bayes estimator is
The first inequality is satisfied by the fact that m > 0. The second inequality can be satisfied only for certain φ since, when φ goes to 0, the last expression tends to −∞. A straightforward calculation shows that the second inequality can hold only if
In particular, if φ = 1 (the standard multivariate t), the condition becomes \(2 - p/2 < \frac {m - p + 4}{2} \leq - \frac {1}{m}\). As m ≥ 1 this is equivalent to m + 2∕m ≤ p − 4, which requires p ≥ 7 for m = 1 or 2, and p ≥ m + 5 for m ≥ 3.
An alternative approach to the results of this section can be made using the techniques of Sect. 2.4.2 applied to Baranchik-type estimators of the form \(\left (1 - a \, r (\| X \|{ }^{2}) / \|X \|{ }^{2} \right ) X\). Indeed any spherically symmetric prior distribution will lead to an estimator of the form ϕ(∥X∥2)X. More to the point, for prior distributions of the form studied in this section, the r(⋅) function is closely connected to the function v↦ − (v + 1)h ′(v)∕h(v). To see this, note that
where \(E^{y}_{k}(f)\) is as in the proof of Theorem 3.3, the second to last equality following from (3.4).
Hence, the Bayes estimator is of Baranchik form with
□
Recall, as in the proof of Theorem 3.3, that the density \(g^{y}_{k}(v)\) has a monotone decreasing likelihood ratio in k, but notice also that it has a monotone increasing likelihood ratio (actually as an exponential family ) in y.
Hence, if \(- \frac {(v+1) h^{\prime }(v)}{h(v)}\) is nondecreasing, it follows that r is nondecreasing since e −y∕I p∕2(y) is also nondecreasing. Then the following corollary is immediate from Theorem 3.3.
Corollary 3.4
Suppose the prior is of the form (3.4) where − (v + 1) h ′(v)∕h(v) is nondecreasing and bounded above by A > 0. Then, the generalized Bayes estimator is minimax provided \(A \le \frac {p-2}{2}\).
Proof
As noted, r(⋅) is nondecreasing and is bounded above by p − 2 + 2A ≤ 2(p − 2). □
Corollary 3.4 yields an alternative proof of the minimaxity of the generalized Bayes estimator in Example 3.1.
Finally, as indicated earlier in this section, an alternative parametrization has often been used in minimaxity proofs for the mixture of normal priors, namely \(\lambda = \frac {1}{1+v}\), or equivalently, \(v = \frac {1 - \lambda }{\lambda }\).
Perhaps the easiest way to proceed is to reconsider the prior distribution as a hierarchical prior as discussed in Sect. 1.7. Here the distribution of \(\theta \mid v\) is \({\mathcal N}_p(0, v \sigma ^{2} I_p)\) and the unconditional density of v is the mixing density h(v). The conditional distribution of θ given X and v is \(\mathcal {N}_p(\frac {v}{1+v} X, \frac {v}{1+v} \sigma ^{2} I_p)\). The Bayes estimator is
Note also that the Bayes estimator for the first stage prior
is (1 − λ)X. Therefore, in terms of the λ parametrization, one may think of E[λ∣X] as the posterior mean of the shrinkage factor and of the (mixing) distribution on λ as the distribution of the shrinkage factor.
In particular, for the prior distribution of Example 3.1 where the mixing density on v is h(v) = C (1 + v)−α−(p−2)∕2, the corresponding mixing density on λ is given by \(g(\lambda ) = C \lambda ^{ \alpha + \frac {p-2}{2} - 2} = C \lambda ^{\beta }\) where β = α + p∕2 − 3. The resulting prior is proper Bayes minimax if 2 − p∕2 < α ≤ 0 or, equivalently, − 1 < β ≤ p∕2 − 3 (and p ≥ 5). Note that, if p ≥ 6, β = 0 satisfies the conditions and consequently the mixing prior g(λ) ≡ 1 on 0 ≤ λ ≤ 1, i.e. the uniform prior on the shrinkage factor λ, gives a proper Bayes minimax estimator. This class of priors is often referred to as the Strawderman priors.
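As an informal illustration of the λ parametrization (a Python sketch assuming SciPy; the uniform choice g(λ) ≡ 1, the dimension p = 6, and the observed point are illustrative), the posterior mean of the shrinkage factor, E[λ∣X], can be computed by one-dimensional quadrature, since X∣λ ∼ N p(0, (σ 2∕λ)I p) implies that the posterior density of λ is proportional to λ p∕2 exp(−λ∥x∥2∕2σ 2) g(λ) on (0, 1).

```python
import numpy as np
from scipy.integrate import quad

def shrinkage_estimate(x, sigma2=1.0, g=lambda lam: 1.0):
    """Bayes estimate (1 - E[lambda | x]) x under the hierarchical prior
    theta | lambda ~ N_p(0, ((1 - lambda)/lambda) sigma^2 I), lambda ~ g on (0, 1);
    the posterior of lambda is proportional to
    lambda^{p/2} exp(-lambda ||x||^2 / (2 sigma^2)) g(lambda)."""
    p, t = x.size, np.sum(x**2) / sigma2
    post = lambda lam, k: lam ** (p / 2 + k) * np.exp(-lam * t / 2) * g(lam)
    num, _ = quad(post, 0.0, 1.0, args=(1,))
    den, _ = quad(post, 0.0, 1.0, args=(0,))
    return (1 - num / den) * x

x = np.array([3.0, -1.0, 0.5, 2.0, -0.5, 1.0])  # p = 6; uniform g gives a proper Bayes minimax rule
print(shrinkage_estimate(x))
```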
To formalize the above discussion further we present a version of Theorem 3.3 in terms of the mixing distribution on λ. The proof follows from Theorem 3.3 and the change of variable λ = 1∕(1 + v).
Corollary 3.5
Let θ have the hierarchical prior \(\theta \mid \lambda \sim {\mathcal N}_{p}(0, (\{1 - \lambda \} / \lambda ) \, \sigma ^{2} \, I_p)\) where λ ∼ g(λ) for 0 ≤ λ ≤ 1. Assume that limλ→0 g(λ)λ p∕2+1 = 0 and that \(\int ^{1}_{0} e^{- \lambda } \lambda ^{p/2} g(\lambda ) d \lambda < \infty \) . Suppose λg ′(λ)∕g(λ) can be decomposed as \(l^{*}_{1}(\lambda ) + l^{*}_{2}(\lambda )\) where \(l^{*}_{1}(\lambda )\) is monotone nonincreasing and \(l^{*}_{1}(\lambda ) \leq A^{*}\) , \(0 \leq l^{*}_{2}(\lambda ) \leq B^{*}\) with A ∗ + 2B ∗≤ p∕2 − 3.
Then the generalized Bayes estimator is minimax. Furthermore, if \(\int ^{1}_{0}g(\lambda ) d \lambda < \infty \) , the estimator is also proper Bayes.
Example 3.5 (Beta priors)
Suppose the prior g(λ) on λ is a Beta (a, b) distribution, i.e. g(λ) = Kλ a−1(1 − λ)b−1. Note that the Strawderman (1971) prior is of this form if b = 1. An easy calculation shows \(\frac {\lambda g^{\prime }(\lambda )}{g(\lambda )} = a - 1 - (b - 1) \frac {\lambda }{1 - \lambda }\). Letting \(l^{*}_{1}(\lambda ) = \frac {\lambda g^{\prime }(\lambda )}{g(\lambda )}\) and \(l^{*}_{2}(\lambda ) \equiv 0\), we see that the resulting proper Bayes estimator is minimax for 0 < a ≤ p∕2 − 2 and b ≥ 1.
It is clear that our proof fails for 0 < b < 1 since in this case λg ′(λ)∕g(λ) is not bounded from above (and is also monotone increasing). Maruyama (1998) shows, using a different proof technique involving properties of confluent hypergeometric functions, that the generalized Bayes estimator is minimax (in our notation) for − p∕2 < a ≤ p∕2 − 2 and b ≥ (p + 2a + 2)(3p∕2 + a)−1. This bound in b is in (0, 1) for a < p∕2 − 2. Hence, certain Beta distributions with 0 < b < 1 also give proper Bayes minimax estimators. The generalized Bayes minimax estimators of Alam (1973) are also in Maruyama’s class.
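A quick numerical check of the decomposition in Example 3.5 (a Python sketch with NumPy only; the values of a, b, and p are arbitrary): for the Beta(a, b) mixing density, λg ′(λ)∕g(λ) = a − 1 − (b − 1)λ∕(1 − λ), and for b ≥ 1 this function is nonincreasing and bounded above by a − 1, so the condition A ∗ ≤ p∕2 − 3 of Corollary 3.5 reduces to a ≤ p∕2 − 2.

```python
import numpy as np

a, b, p = 1.5, 2.0, 8
lam = np.linspace(0.05, 0.95, 10)
g = lambda l: l ** (a - 1) * (1 - l) ** (b - 1)            # unnormalized Beta(a, b) density
eps = 1e-6
log_deriv = lam * (np.log(g(lam + eps)) - np.log(g(lam - eps))) / (2 * eps)
closed = a - 1 - (b - 1) * lam / (1 - lam)
print(np.max(np.abs(log_deriv - closed)))                  # the two expressions agree
print(np.all(np.diff(closed) <= 0))                        # nonincreasing when b >= 1
print(np.max(closed) <= p / 2 - 3)                         # bound A* <= p/2 - 3 of Corollary 3.5
```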
3.1.4 Multiple Shrinkage Estimators
In this subsection, we consider a class of estimators that adaptively choose a point (or subspace) toward which to shrink . George (1986a,b) originated work in this area and the results in this section are largely due to him. The basic fact upon which the results rely is that a mixture of superharmonic functions is superharmonic (see the discussion in the Appendix), that is, if m α(x) is superharmonic for each α, then \(\int m_{\alpha }(x) \, d G(\alpha )\) is superharmonic if G(⋅) is a positive measure such that \(\int m_{\alpha }(x) \, d G(\alpha ) < \infty \). Using this property, we have the following result from Corollary 3.1.
Theorem 3.4
Let m α(x) be a family of twice weakly differentiable nonnegative superharmonic functions and G a positive measure such that \(m(x) = \int m_{\alpha }(x) \, d G(\alpha )\) < ∞, for all \(x \in \mathbb {R}^{p}\).
Then the (generalized, proper, or pseudo) Bayes estimator
is minimax provided E[∥∇m∥2∕m 2] < ∞.
The following corollary for finite mixtures is useful.
Corollary 3.6
Suppose that m i(x) is superharmonic and \(E [\| \nabla m_{i}(X) \|{ }^{2} / m^{2}_{i}(X)] < \infty \) for i = 1, …, n. Then, if \(m(x) = \sum ^{n}_{i=1} m_{i}(x)\) , the (generalized, proper, or pseudo) Bayes estimator
where \(W_{i}(X) = m_{i}(X) / \sum ^{n}_{j=1} m_{j}(X) \), so that \(0 < W_{i}(X) < 1\) and \(\sum ^{n}_{i=1} W_{i}(X) = 1\), is minimax. (Note that \(E_{\theta }[ \| \nabla m(X) \|{ }^{2} / m^{2}(X)] < \sum ^{n}_{i=1} E_{\theta } [\| \nabla m_{i}(X) \|{ }^{2} / m^{2}_{i}(X)] < \infty \).)
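Corollary 3.6 translates directly into a small computational sketch (Python with NumPy only; the target points, dimension, and observation below are illustrative choices). It evaluates the adaptive combination for the choice m i(x) = (1∕∥x − X i∥2)(p−2)∕2, which is the multiple shrinkage James-Stein estimator discussed in Example 3.6 below.

```python
import numpy as np

def multiple_shrinkage_js(x, targets, sigma2=1.0):
    """Adaptive James-Stein estimator shrinking toward several target points:
    m_i(x) = ||x - t_i||^{-(p-2)}, W_i proportional to m_i(x), and
    delta(x) = sum_i W_i(x) [ t_i + (1 - (p-2) sigma^2 / ||x - t_i||^2)(x - t_i) ]."""
    p = x.size
    d2 = np.array([np.sum((x - t) ** 2) for t in targets])
    w = d2 ** (-(p - 2) / 2)
    w = w / w.sum()
    parts = [t + (1 - (p - 2) * sigma2 / dd) * (x - t) for t, dd in zip(targets, d2)]
    return sum(wi * part for wi, part in zip(w, parts))

targets = [np.zeros(5), np.full(5, 4.0)]        # two candidate shrinkage points
x = np.array([3.6, 4.2, 3.9, 4.4, 3.7])         # observation close to the second target
print(multiple_shrinkage_js(x, targets))
```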
Example 3.6
-
(1)
Multiple shrinkage James-Stein estimator . Suppose we have several possible points X 1, X 2, …, X n toward which to shrink. Recall that m i(x) = (1∕∥x − X i∥2)(p−2)∕2 is superharmonic if p ≥ 3 and the corresponding pseudo-Bayes estimator is \(\delta _{i}(X) = X_{i} + \left ( 1 - (p-2) \, \sigma ^{2} / \| X - X_{i} \|{ }^{2} \right ) (X - X_{i})\). Hence, if \(m(x) = \sum ^{n}_{i=1} m_{i}(x)\), the resulting minimax pseudo Bayes estimator is given by
$$\displaystyle \begin{aligned} \delta (X) = \sum^{n}_{i=1} \left[ X_{i} + (1 - \frac{(p-2) \sigma^{2}}{\| X - X_{i} \|{}^{2}}) (X - X_{i}) \right] W_{i}(X) \end{aligned}$$where \(W_{i}(X) \propto \left ( 1 / \| X - X_{i} \|{ }^{2} \right )^{(p-2) / 2}\) and \(\sum ^{n}_{i=1} W_{i}(X) = 1\). Note that W i(X) is large when X is close to X i and the estimator is seen to adaptively shrink toward X i.
-
(2)
Multiple shrinkage positive-part James-Stein estimators. Another possible choice for the m i(x) (leading to a positive-part James Stein estimator) is
$$\displaystyle \begin{aligned} m_{i}(x) = \left\{ \begin{array}{ll} C \; \exp \left( - {\frac{\| x - X_{i} \|{}^{2}}{2 \, \sigma^{2}}} \right) & {\mathrm{if}} \; \| x - X_{i} \|{}^{2} < (p-2) \, \sigma^{2}\\ \left(\frac{1}{\| x - X_{i} \|{}^{2}}\right)^{(p-2)/2} & {\mathrm{if}} \; \| x - X_{i} \|{}^{2} \geq (p-2) \, \sigma^{2} \end{array} \right. \end{aligned}$$where \(C = \left ( 1 / (p-2) \, \sigma ^{2} \right )^{(p-2) / 2} e^{(p-2) / 2}\) so that m i(x) is continuous. This gives
$$\displaystyle \begin{aligned} \delta_{i}(X) = X_{i} + \left( 1 - \frac{(p-2) \sigma^{2}}{\| X - X_{i} \|{}^{2}} \right)_{+} (X - X_{i}) \end{aligned}$$since
$$\displaystyle \begin{aligned} \frac{\nabla m_{i}(X)}{m_{i}(X)} = \left\{ \begin{array}{l} - \frac{X - X_{i}}{\sigma^{2}} \; \;\; {\mathrm{if}} \; \| X - X_{i} \|{}^{2} < (p-2) \sigma^{2},\\ - \frac{(p-2)\,(X - X_{i})}{\| X - X_{i} \|{}^{2}} \; \;\; {\mathrm{otherwise.}} \end{array} \right. \end{aligned}$$The adaptive combination is again minimax by the corollary and inherits the usual advantages of the positive-part estimator over the James-Stein estimator.
Note that a smooth alternative to the above is \(m_{i}(x) = \left (\frac {1}{b + \| x - X_{i} \|{ }^{2}} \right )^{\frac {p-2}{2}}\) for some b > 0.
In each of the above examples we may replace (p − 2)∕2 in the exponent by a∕2 where 0 ≤ a ≤ p − 2 (and where 0 ≤∥x − X i∥2 < (p − 2) σ 2 is replaced by 0 ≤∥x − X i∥2 < a σ 2 for the positive-part estimator). The choice of p − 2 as an upper bound for a ensures superharmonicity of m i(x). A choice of a in the range p − 2 < a ≤ 2 (p − 2) also seems quite natural since \(\sqrt {m_{i}(x)}\) is superharmonic (but m i(x) is not) for a in this range so that each δ i(X) is minimax. Unfortunately minimaxity of \(\delta (X) = \sum _{i=1}^n W_{i}(X) \delta _{i}(X)\) does not follow from Corollary 3.6 for p − 2 < a ≤ 2 (p − 2) since it need not be true that \(\sqrt {\sum ^{n}_{i=1} m_{i}(x)}\) is superharmonic even though \(\sqrt {m_{i}(x)}\) is superharmonic for each i.
-
(3)
A generalized Bayes multiple shrinkage estimator . If π i(θ) is superharmonic then \(\pi (\theta ) = \sum ^{n}_{i=1}\pi _{i}(\theta )\) is also superharmonic as is \(m(x) = \sum ^{n}_{i=1} m_{i}(x)\).
For example, \(\pi _{i}(\theta ) = \left ( 1 / \left ( b + \| \theta - X_{i} \|{ }^{2} \right ) \right )^{a/2}\), for b ≥ 0 and 0 ≤ a ≤ p − 2, is a suitable prior. Interestingly, according to a heuristic of Brown (1971), m(x) in this case should behave for large ∥x∥2 as \(\sum ^{n}_{i=1} 1 / \left ( b + \| x - X_{i} \|{ }^{2} \right )^{a/2}\), the “smooth” version of the adaptive positive-part multiple shrinkage pseudo-marginal in part (2) of this example.
By obvious modifications of the above, multiple shrinkage estimators may be constructed that shrink adaptively toward subspaces. Further examples can be found in George (1986a,b), Ki and Tsui (1990) and Wither (1991).
3.2 Bayes Estimators in the Unknown Variance Case
3.2.1 A Class of Proper Bayes Minimax Admissible Estimators
In this subsection, we give a class of hierarchical Bayes minimax estimators for the model
where S is independent of X, under scale invariant squared error loss
We reparameterize σ 2 as 1∕η and consider the following hierarchically structured prior on the unknown parameters (θ, η), which is reminiscent of the hierarchical version of the Strawderman prior in (3.13),
Lemma 3.2
For the model (3.14) and loss (3.15) , the (generalized or proper) Bayes estimator of θ is given by
where
where
provided A > −1, A − B < 0, and c > 0.
Proof
Under the loss in (3.15) the Bayes estimator for the model in (3.16) is given by
Expressing the expectation in the numerator of (3.20) gives
upon integrating with respect to θ and evaluating with the constants in (3.19). Similarly, for the denominator in (3.20)
Therefore from (3.21) and (3.22) the Bayes estimator in (3.20) has the form
where
where the change of variable u = λ (∥X∥2 + c)∕S is made in the next to last step. □
The properties of r(∥X∥2, S) in Lemma 3.2 are given in the following result.
Lemma 3.3
The function r(∥X∥2, S) given in (3.18) satisfies the following properties:
-
(i)
r(∥X∥2, S) is nondecreasing in ∥X∥2 for fixed S;
-
(ii)
r(∥X∥2, S) is nonincreasing in S for fixed ∥X∥2 ; and
-
(iii)
0 ≤ r(∥X∥2, S) ≤ (A + 1)∕(B − A − 1) = (p + a + b + 2)∕(k − a − 4)
provided the conditions of Lemma 3.2 hold.
Proof
Note first that \(\int _0^t u \, f(u) \, du / \int _0^t f(u) \, du\) is nondecreasing in t for any integrable nonnegative function f(⋅). Hence Part (i) follows since r(∥X∥2, S) is the product of two nonnegative nondecreasing functions ∥X∥2∕(∥X∥2 + c) and \(\int _0^{(\|X\|{ }^2 + c) / S} u \, f(u) \, du / \) \(\int _0^{(\|X\|{ }^2 + c) / S} f(u) \, du\) for f(u) = u A (1 + u)−(B+1).
Part (ii) follows from a similar reasoning since the first term is constant in S and (∥X∥2 + c)∕S is decreasing in S.
To show Part (iii) note that, by Parts (i) and (ii),
upon expressing the integrals as beta functions and using the conditions on A and B. □
We also need the following straightforward generalization of Corollary 2.6. The proof is left to the reader.
Corollary 3.7
Under model (3.14) and loss (3.15) an estimator of the form
is minimax provided
-
(i)
r(∥X∥2, S) is nondecreasing in ∥X∥2 for fixed S;
-
(ii)
r(∥X∥2, S) is nonincreasing in S for fixed ∥X∥2 ; and
-
(iii)
0 ≤ r(∥X∥2, S) ≤ 2 (p − 2)∕(k + 2).
Combining Lemmas 3.2 and 3.3 and Corollary 3.7 gives the following result.
Theorem 3.5
For the model (3.14) , loss (3.15) and hierarchical prior (3.16) , the generalized or proper Bayes estimator in Lemma 3.2 is minimax provided
Furthermore, if p ≥ 5, there exist values of a > −2 and b > 0 which satisfy (3.23) , i.e. such that the estimator is proper Bayes, minimax and admissible.
Proof
The first part is immediate. To see the second part, note that it suffices, if a = −2 + 𝜖 and b = δ, for 𝜖, δ > 0, that
equivalently \(p > 4 \, \frac {k - 2}{k - 6}\). Hence, for p ≥ 5 and k sufficiently large, k > 2 (3 p − 4)∕(p − 4), there are values of a and b such that the priors are proper. □
Note that there exist values of a and b satisfying (3.23) and the assumptions of Lemma 3.2 whenever p ≥ 3.
Strawderman (1973) gave the first examples of generalized and proper Bayes minimax estimators in the unknown variance setting. Zinodiny et al. (2011) also give classes of generalized and proper Bayes minimax estimators along somewhat similar lines as the above. The major difference is that the prior distribution on η (= 1∕σ 2) in the above development is also hierarchical, as it also depends on λ.
3.2.2 The Construction of a Class of Generalized Bayes Minimax Estimators
In this subsection we extend the generalized Bayes results of Sect. 3.1.2, using the ideas in Maruyama and Strawderman (2005) and Wells and Zhou (2008), to consider point estimation of the mean of a multivariate normal when the variance is unknown. Specifically, we assume the following model in (3.14) and the scaled squared loss function in (3.15).
In order to derive the (formal) Bayes estimator we reparameterize the model in (3.14) by replacing σ by η −1. The model then becomes
for some constant d. Under this model, the prior for θ is a scale mixture of normal distributions. Note that the above class of priors cannot be proper due to the impropriety of the distribution of η. However, as a consequence of the form of this model, the resulting generalized Bayes estimator is of the Baranchik form (3.17), with r(∥X∥2, S) = r(F), where F = ||X||2∕S.
We develop sufficient conditions on k, p, and h(ν) such that the generalized Bayes estimators with respect to the class of priors in (3.24) are minimax under the invariant loss function in (3.15). Maruyama and Strawderman (2005) and Wells and Zhou (2008) were able to obtain such sufficient conditions by applying the bounds and monotonicity results of Baranchik (1970), Efron and Morris (1976), and Fourdrinier et al. (1998).
Before we derive the formula for the generalized Bayes estimator under the model (3.24), we impose three regularity conditions on the parameters of priors. These conditions are easily satisfied by many hierarchical priors. These three conditions are assumed throughout this section.
- C1::
-
A > 1 where \(A =\frac {d + k + p + 3}{2}\);
- C2::
-
\(\; \;\; {\int _0^1 \lambda ^{\frac {p}{2}-2}h\left (\frac {1-\lambda }{\lambda }\right )} \,d \lambda < \infty \); and
- C3::
-
\(\; \; \;\lim _{\nu \rightarrow \infty }\frac {h(\nu )}{(1+\nu )^{p / 2-1}} = 0\).
Now, as in Sect. 3.1, we will first find the form of the Bayes estimator and then show that it satisfies some sufficient conditions for minimaxity. We start with the following lemma that corresponds to (3.2) in the known variance case and (3.18) in the previous subsection.
Lemma 3.4
Under the model in (3.24) , the generalized Bayes estimator can be written as
where F = ||X||2∕S,
and
Proof
Under the loss function (3.15), the generalized Bayes estimator for the model (3.24) is
Letting λ = (1 + ν)−1, δ(X, S) = (1 − R(F))X, which gives the form of the generalized Bayes estimator. □
Recall from Stein (1981) that when σ 2 is known the Bayes estimator under squared error loss and corresponding to a prior π(θ) is given by (3.2), that is, \( \delta ^{\pi }(X) = X + \sigma ^{2} \frac {\bigtriangledown m(X)}{m(X)}\).
The form of the Bayes estimator given in (3.25) gives an analogous form with the unknown variance replaced by a multiple of the usual unbiased estimator. In particular, define the “quasi-marginal”
where
and
A straightforward calculation shows M(x, s) is proportional to
It is interesting to note the unknown variance analog of (3.2) is
Lastly, note that the exponential term in the penultimate expression in the representation of δ(X, S) in (3.28) (that comes from the normal sampling distribution assumption) cancels. Hence there is a sort of robustness with respect to the sampling distribution. We will develop this theme in greater detail in Chap. 6 in the setting of spherically symmetric distributions.
3.2.2.1 Preliminary Results
The minimax property of the generalized Bayes estimator is closely related to the behavior of the r(F) and R(F) functions, which is in turn closely related to the behavior of
Fourdrinier et al. (1998) gave a detailed analysis of the type of function in (3.29). However, their argument was based on the condition that the square root of the marginal is superharmonic. Baranchik (1970) and Efron and Morris (1976) gave certain regularity conditions on the shrinkage function r(⋅) such that an estimator
is minimax under the loss function (3.15) for the model (3.14). Both results require an upper bound on r(F) and a condition on how fast R(F) = r(F)∕F decreases with F. Both theorems follow from a general result for spherically symmetric distributions given in Chap. 6 (Proposition 6.1), or by applying Theorem 2.5 in a manner similar to that in Corollary 2.3. The proofs are left to the reader.
Theorem 3.6 (Baranchik 1970)
Assume that r(F) is increasing in F and 0 ≤ r(F) ≤ 2 (p − 2)∕(k + 2). Then any point estimator of the form (3.30) is minimax.
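As a sanity check on Theorem 3.6 (a Monte Carlo sketch in Python with NumPy; the parameter values, the number of replications, and the choice of a constant shrinkage function r(F) ≡ c are illustrative, and the model (3.14) is taken here as X ∼ N p(θ, σ 2 I p) with S ∼ σ 2χ 2 k independent of X), the risk E θ[∥δ − θ∥2∕σ 2] of the estimator δ(X, S) = (1 − cS∕∥X∥2)X can be estimated by simulation and compared with the minimax risk p.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, sigma2 = 8, 10, 2.0
c = (p - 2) / (k + 2)                  # inside the range (0, 2(p-2)/(k+2)] of Theorem 3.6
theta = np.full(p, 1.0)
n_rep = 200_000

X = theta + np.sqrt(sigma2) * rng.standard_normal((n_rep, p))
S = sigma2 * rng.chisquare(k, size=n_rep)
shrink = 1 - c * S / np.sum(X**2, axis=1)        # r(F)/F = c S/||X||^2 with r(F) = c constant
delta = shrink[:, None] * X
risk = np.mean(np.sum((delta - theta) ** 2, axis=1)) / sigma2
print(risk, "vs the minimax risk p =", p)        # the estimated risk should be below p
```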
Theorem 3.7 (Efron and Morris 1976)
Define \(c_k = \frac {p-2}{k+2}\) . Assume that 0 ≤ r(F) ≤ 2 c k , that for all F with r(F) < 2c k ,
and that, if an F 0 exists such that r(F 0) = 2c k , then r(F) = 2 c k for all F ≥ F 0 . With the above assumptions, the estimator \(\widehat {\theta }(X,S) = X - r(F) / F \; X\) is minimax.
Consequently, to apply these results one has to establish an upper bound for r(F) in (3.27) and the monotonicity property for some variant of r(F). The candidate we use is \(\widetilde {r}(F)=F^cr(F)\) with a constant c. Note that the upper bound 2 c k is exactly the same upper bound needed in Corollary 3.7(iii). We develop the needed results below.
First note that if h(ν) is a continuously differentiable function on [0, ∞), and regularity conditions C1, C2 and C3 hold, then the integrations by parts used in Lemmas 3.5 and 3.6 are valid.
Lemma 3.5
Assume the regularity conditions C1, C2 and C3, and that g(ν) ≤ M, where M is a positive constant and g(ν) is defined as in (3.29) . Then, for the r(F) function (3.27) , we have
where A is defined in condition C1.
Proof
By the definition in (3.26), R(F) ≥ 0. Then r(F) = FR(F) ≥ 0. Note that
where we are using the notation
Using integration by parts , we obtain
By C1 and C3, we know that the first term of the right hand side is nonpositive. The second term of the right hand side can be written as N 1 + N 2 + N 3 + N 4 where
and
Combining all the terms, we get the following inequality
Therefore, we have the needed bound on the r(F) function. □
We will now show that under certain regularity conditions on g(ν), we have the monotonicity property for \(\tilde {r}(F)=F^cr(F)\) with a constant c. This monotonicity property enables us to establish the minimaxity of the generalized Bayes estimator. The following lemma is analogous to Theorem 3.3 in the known variance case.
Lemma 3.6
If \(g(\nu )= -(\nu +1)\frac {h^\prime (\nu )}{h(\nu )}=l_1(\nu )+l_2(\nu )\) such that l 1(ν) is increasing in ν and 0 ≤ l 2(ν) ≤ c, then \(\widetilde {r}(F)=F^{c}r(F)\) is nondecreasing.
Proof
By taking the derivative, we only need to show (since r(F) = FR(F))
which is equivalent to
This is in turn equivalent to
Now note that
Define the integral operator
Therefore,
and
Also, note that
and
Now, with this new notation, it follows that (3.33) is equivalent to
Using integration by parts , we have
Hence, (3.34) is equivalent to
Since − (v + 1)h ′(v)∕h(v) = l 1(v) + l 2(v), (3.35) is equivalent to
It is clear that \(I_{\frac {p}{2}-1,A,h}(F) \leq I_{\frac {p}{2}-2,A,h}(F),\) so we then have
which accounts for the first terms on the left and right hand sides of (3.36). As for the second term on each side of (3.36) note that the hypothesis l 1(ν) is increasing in ν implies that for all fixed F, \(l_1(\frac {F-u}{u})\) is decreasing in u. When t < u, we have
By a monotone likelihood ratio argument, we have
Finally, note that since 0 ≤ l 2(v) ≤ c for the third term on each side of (3.36) we have
Therefore we have established inequality (3.36) and the proof is complete. □
3.2.2.2 Minimaxity of the Generalized Bayes Estimators
In this subsection we apply Lemmas 3.4, 3.5, 3.6 and Theorems 3.6 and 3.7 to show minimaxity of the generalized Bayes estimator (3.25).
Theorem 3.8
Assume that g(ν) = −(ν + 1) h ′(ν)∕h(ν) is increasing in ν, g(ν) ≤ M, where M is a positive constant, and
Then δ(X, S) in (3.25) is minimax.
Proof
Let l 2(ν) = 0 and l 1(ν) = g(ν). By applying Lemma 3.6 to the case c = 0, we have r(F) increasing in F. Applying the bound in Lemma 3.5, we get \(0 \leq r(F) \leq 2\frac {p-2}{k+2}\). Therefore, by Theorem 3.6, δ(X, S) is minimax. □
It is interesting to make connections to the result in Faith (1978), who considered generalized Bayes estimators for \(\mathcal {N}_p(\theta , I_p)\) and showed that when g(ν) is increasing in ν and \(M \leq \frac {p-2}{2}\), the generalized Bayes estimator is minimax. By taking k →∞, we deduce the same conditions as Faith (1978). The next theorem is a variant of Alam (1973) for the known variance case.
Theorem 3.9
Define \(c_k=\frac {p-2}{k+2}\) . If there exists b ∈ (0, 1] and \(c=\frac {b(p-2)}{4+4(2-b)c_k}\) , such that 0 ≤ r(F) ≤ (2 − b)c k , and F c r(F) is increasing in F, then the generalized Bayes estimator δ(X, S) in (3.25) is minimax.
Proof
By taking the derivative in the Efron and Morris condition, (3.31) can be satisfied by requiring
Since r(F) ≤ (2 − b)c k, then (3.37) is satisfied at the point where r ′(F) ≥ 0. Since r(F) ≤ (2 − b)c k with β = (2 − b)c k
at the point where r ′(F) < 0. We now have
since F c r(F) is increasing in F. Thus, for all values of F, we have proven (3.37), and combining with the bound on the r(F) function, we have proven the minimaxity of the generalized Bayes estimator. □
It is interesting to observe that by requiring a tighter upper bound on r(F), we can relax the monotonicity requirement on r(F). The tighter the upper bound, the more flexible r(F) can be. This result enriches the class of priors whose generalized Bayes estimators are minimax. Direct application of Lemmas 3.4, 3.5, and 3.6 and Theorem 3.9 gives the following theorem.
Theorem 3.10
If there exists b ∈ (0, 1] such that g(ν) = l 1(ν) + l 2(ν) ≤ M, and l 1(ν) is increasing in ν, \(0 \leq l_2(\nu ) \leq c=\frac {b(p-2)}{4+4(2-b)\frac {p-2}{k+2}}\) , and \(\frac {p-2+2M}{k+3+d-2M}\leq \frac {(2-b)(p-2)}{k+2}\) , then the generalized Bayes estimator δ(X, S) in (3.25) is minimax.
3.2.2.3 Examples of the Priors in (3.24)
In this subsection, we will give several examples to which our results can be applied and make some connection to the existing literature found in Maruyama and Strawderman (2005) and Fourdrinier et al. (1998).
Example 3.7
Maruyama and Strawderman (2005) considered the priors with h(ν) ∝ ν b(1 + ν)−a−b−2 for b > 0 and showed that \(r(F) \leq \frac {\frac {p}{2} + a+1}{\frac {k}{2} + \frac {d}{2} -a -\frac {1}{2}}\) (in terms of the Maruyama and Strawderman (2005) notation d = 2e + 1). Condition C1 is equivalent to the condition that d + k + p > −1. C2 and C3 are equivalent here, and both are equivalent to the condition that \(a+\frac {p}{2}+1 >0\). Then, using Theorem 3.8, we have g(ν) = a + 2 − bν −1. The condition that g(ν) is increasing in ν is equivalent to the condition that b ≥ 0. Clearly, we can let M = a + 2. Then the condition of Theorem 3.8 is that
A close examination of the Maruyama and Strawderman (2005) proof shows that their upper bound on r(F) is sharp. This implies that our bound in Lemma 3.5 cannot be relaxed.
Example 3.8
Generalized Student-t priors correspond to a mixing distribution of the form
Consider the following two cases. The first case where α ≤ 0, β ≤ 0 and γ < 0 involves the construction of a monotonic r(⋅) function. The second case where α ≤ 0, β > 0 and γ < 0 does not require the r(⋅) function to be monotonic. In both cases,
and
Clearly, g(ν) is monotonic in the first case, and minimaxity of the generalized Bayes estimator follows when
in addition to the conditions C1, C2, and C3. In the limiting case where k →∞, C1 holds trivially. Both C2 and C3 can be satisfied by α > 2 − p. The upper bound on R(F) can be satisfied by any α ≤ 0. Consequently, the conditions reduce to those in Example 3.4 for the case of known variance.
Next we consider spherical multivariate Student-t priors with f degrees of freedom and a scale parameter τ and with \(\alpha =\frac {f-p+4}{2}\), \(\beta =\frac {f(1-\tau )+2}{2}\), and \(\gamma =-\frac {f\tau }{2}\). The case of τ = 1 is of particular interest but does not necessarily give a monotonic r(⋅) function. However, we can use the result in Theorem 3.10 to show that the generalized Bayes estimator is minimax under the following conditions: for f ≤ p − 4, suppose there exists a constant b ∈ (0, 1] such that
Condition (3.39) can be established by observing that for this case,
is clearly nonmonotonic. We then let \(M=\frac {f}{2}+1+\frac {1}{2f}\) and apply Lemma 3.5 to get the upper bound on r(⋅). We define \(l_1(\nu )=g(\nu )-\frac {1}{2f}\) when ν ≤ f and \(l_1(\nu )=\frac {f}{2}+1\) otherwise. We also define \(l_2(\nu )=\frac {1}{2f}\) when ν ≤ f and \(l_2(\nu )=\frac {1}{\nu }-\frac {f}{2\nu ^2}\) otherwise. By applying Lemma 3.6, we get condition (3.39).
The spherical multivariate Cauchy prior corresponds to the case f = 1. If k = O(p) and d = 3, then condition (3.39) reduces to p ≥ 5, \(\frac {p+2}{k+2} \leq (2-b) \frac {p-2}{k+2}\), and \(\frac {1}{2} \leq \frac {b(p-2)}{4+8-4b}\).
3.3 Results for Known Σ and General Quadratic Loss
3.3.1 Results for the Diagonal Case
Much of this section is based on the review in Strawderman (2003). We begin with a discussion of the multivariate normal case where \(\varSigma = {\mathrm {diag}} (\sigma ^2_{1},\ldots ,\sigma ^2_p)\) is diagonal, which we assume throughout this subsection. Let
and the loss be the weighted sum of squared errors
The results in Sects. 2.3, 2.4 and 3.1 extend by the use of Stein’s lemma in a straightforward way to give the following basic theorem.
Theorem 3.11
Let X have the distribution (3.40) and let the loss be given by (3.41) .
-
(1)
If δ(X) = X + Σg(X), where g(X) is weakly differentiable and E||g||2 < ∞, then the risk of δ is
$$\displaystyle \begin{aligned} \begin{array}{rcl} R(\delta,\boldsymbol\theta) &\displaystyle =&\displaystyle E_{\boldsymbol\theta} ((\delta - \boldsymbol\theta)^{\scriptscriptstyle{\mathrm{T}}} D (\delta -\boldsymbol\theta)) \\ &\displaystyle =&\displaystyle tr (\varSigma D)+ E_{\theta} \left[ {\sum_{i = 1}^p {\sigma _i^4 } d_i \left( {g_i^2 \left( X \right) + 2\frac{{\partial g_i \left( X \right)}}{{\partial X_i }}} \right)} \right]. \end{array} \end{aligned} $$ -
(2)
If θ∼ π(θ), then the Bayes estimator of θ is \(\delta _{\varPi } (X) = X + \varSigma \frac {{\nabla m(X)}}{{m(X)}},\) where m(X) is the marginal distribution of X.
-
(3)
If θ∼ π(θ), then the risk of a proper (generalized, pseudo-) Bayes estimator of the form \(\delta _m(X) = X+\varSigma \frac {{\nabla m(X)}}{{m(X)}}\) is given by
$$\displaystyle \begin{aligned} \begin{array}{rcl} R(\delta_m, \theta) &\displaystyle =&\displaystyle {\mathrm{tr}} (\varSigma D) \\ &\displaystyle +&\displaystyle E_\theta \left[ \frac{2m(X)\sum_{i=1}^p\sigma_i^4d_i \, \partial^2 m(X) /\partial X_i^2}{m^2(X)}- \frac{\sum_{i=1}^p\sigma_i^4d_i \left(\partial m(X) /\partial X_i\right)^2}{m^2(X)} \right]\\ &\displaystyle =&\displaystyle {\mathrm{tr}} (\varSigma D) + 4 \, E_{\theta} \left[ \frac{\sum_{i=1}^p \sigma_i^4d_i \, \partial^2\sqrt{m(X)}/\partial X_i^2}{\sqrt{m(X)}} \right]. \end{array} \end{aligned} $$ -
(4)
If \(\frac {\sum \limits _{i=1}^p \sigma _i^4d_i \, \partial ^2\sqrt {m(X)}/\partial X_i^2}{\sqrt {m(X)}}\) is nonpositive, the proper (generalized, pseudo) Bayes δ m(X) is minimax.
The proof follows closely to that of corresponding results in Sects. 2.3, 2.4 and 3.1. The result is essentially from Stein (1981).
A key observation that allows us to construct Bayes minimax procedures for this situation, based on the procedures for the case Σ = D = I, is the following straightforward result from Strawderman (2003).
Lemma 3.7
Suppose η(X) is such that \(\varDelta \eta (X) =\sum \limits _{i = 1}^p \partial ^2 \eta (X) /\partial X_i^2 \le 0\) (i.e. η(X) is superharmonic). Then η ∗(X) = η(Σ −1 D −1∕2 X) is such that \(\sum \limits _{i = 1}^p \sigma _i^4 d_i \, \partial ^2 \eta ^{*}(X)/\partial X_i^2 \le 0\).
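Lemma 3.7 is easy to verify numerically. The sketch below (Python with NumPy only; the diagonal entries of Σ and D, the test point, and the superharmonic choice η(x) = (1 + ∥x∥2)−(p−2)∕2, in the spirit of the smooth pseudo-marginals of Example 3.6, are illustrative) evaluates \(\sum _i \sigma _i^4 d_i \, \partial ^2 \eta ^{*}(x)/\partial x_i^2\) by central differences for η ∗(x) = η(Σ −1 D −1∕2 x) and confirms that it is nonpositive.

```python
import numpy as np

p = 5
sig2 = np.array([0.5, 1.0, 1.5, 2.0, 0.8])      # diagonal of Sigma
d = np.array([1.0, 2.0, 0.5, 1.5, 3.0])         # diagonal of D
eta = lambda z: (1 + np.sum(z**2)) ** (-(p - 2) / 2)   # superharmonic for p >= 3

def eta_star(x):
    return eta(x / (sig2 * np.sqrt(d)))         # eta(Sigma^{-1} D^{-1/2} x) in the diagonal case

def weighted_second_derivs(f, x, w, eps=1e-4):
    """sum_i w_i * (second partial of f at x in coordinate i), by central differences."""
    out = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        out += w[i] * (f(x + e) - 2 * f(x) + f(x - e)) / eps**2
    return out

x = np.array([1.0, -2.0, 0.5, 1.5, -0.7])
print(weighted_second_derivs(eta_star, x, sig2**2 * d))   # nonpositive, as Lemma 3.7 asserts
```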
Note that, for any scalar a, if η(X) is superharmonic, then so is η(aX). This leads to the following result.
Theorem 3.12
Suppose X has the distribution (3.40) and the loss is given by (3.41) .
-
(1)
Suppose \(\sqrt {m(X)}\) is superharmonic (m(X) is a proper, generalized, or pseudo-marginal for the case Σ = D = I). Then
$$\displaystyle \begin{aligned}\delta_m(X) = X+\varSigma \left( {\frac{{\nabla m(\varSigma ^{ - 1} D^{ - 1/2} X)}}{{m(\varSigma ^{ - 1} D^{ - 1/2} X)}}} \right)\end{aligned}$$is a minimax estimator.
-
(2)
If \(\sqrt {m(\left \| X \right \|{ }^2 )}\) is spherically symmetric and superharmonic, then
$$\displaystyle \begin{aligned}\delta_m (X) = X+ \frac{{2m^\prime(X^{\scriptscriptstyle{\mathrm{T}}} \,\varSigma ^{ - 1} D^{ - 1} \varSigma ^{ - 1} X)D^{ - 1} \varSigma ^{ - 1} X}}{{m(X^{\scriptscriptstyle{\mathrm{T}}}\,\varSigma ^{ - 1} D^{ - 1} \varSigma ^{ - 1} X)}}\end{aligned}$$is minimax.
-
(3)
Suppose the prior distribution π(θ) has the hierarchical structure \(\theta |\lambda \sim \mathcal {N}_p(0, A_\lambda )\) for λ ∼ h(λ), 0 < λ < 1, where A λ = (c∕λ)ΣDΣ − Σ, c is such that A 1 is positive definite, and h(λ) satisfies the conditions of Corollary 3.5. Then
$$\displaystyle \begin{aligned}\delta_\pi(X) = X+\varSigma \frac{{\nabla m(X)}}{{m(X)}}\end{aligned}$$is minimax.
-
(4)
Suppose m i(X), i = 1, 2… k are superharmonic. Then the multiple shrinkage estimator
is a minimax multiple shrinkage estimator.
Proof
Part (1) follows directly from Parts (3) and (4) of Theorem 3.11 and Lemma 3.7. Part (2) follows from Part (1) and Part (2) of Theorem 3.11 with a straightforward calculation.
For Part (3), first note that \(\theta |\lambda \sim \mathcal {N}_p(0, A_\lambda )\) and \(X - \theta |\lambda \sim \mathcal {N}_p(0,\varSigma )\). Thus, X − θ and θ are conditionally independent given λ. Hence we have \(X|\lambda \sim \mathcal {N}_p(0, A_\lambda + \varSigma )\). It follows that
but , where \(\sqrt {\eta \left ( {X^{\scriptscriptstyle {\mathrm {T}}}\,X} \right )}\) is superharmonic by Theorem 3.11. Hence, by Part (2), δ π(X) is minimax (and proper or generalized Bayes depending on whether h(λ) is integrable or not).
Since superharmonicity of η(X) implies the superharmonicity of \(\sqrt {\eta \left ( {\,X} \right )}\), Part (4) follows from Part (1) and the superharmonicity of mixtures of superharmonic functions. □
Example 3.9 (Pseudo-Bayes minimax estimators)
When Σ = D = σ 2 I, we saw in Sect. 3.1 that by choosing \(m(X) = \frac {1}{{\left \| X \right \|{ }^{2b} }}\), the pseudo-Bayes estimator was the James-Stein estimator \(\delta _m(X) = (1- \frac {{2b\sigma ^2 }}{{\left \| X \right \|{ }^2 }})X\). It now follows from this and part (2) of Theorem 3.12 that m(X T Σ −1 D −1 Σ −1 X) = (1∕X T Σ −1 D −1 Σ −1 X)b has associated with it the pseudo-Bayes estimator \(\delta _m(X) = (1- \frac {{2 bD^{-1} \varSigma ^{-1} }}{{\left ( {X^{\scriptscriptstyle {\mathrm {T}}}\,\varSigma ^{-1} D^{-1} \varSigma ^{ - 1} X} \right )}})X\). This estimator is minimax for 0 < b ≤ p − 2.
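The estimator in Example 3.9 is straightforward to evaluate. The sketch below (Python with NumPy only; the diagonal entries of Σ and D, the value of b, and the observation are illustrative choices) computes δ m(X) = (I − 2bD −1Σ −1∕(X TΣ −1 D −1Σ −1 X))X componentwise for diagonal Σ and D.

```python
import numpy as np

sig2 = np.array([0.5, 1.0, 1.5, 2.0, 0.8])      # diagonal of Sigma
d = np.array([1.0, 2.0, 0.5, 1.5, 3.0])         # diagonal of D (loss weights)
p = sig2.size
b = (p - 2) / 2                                 # a choice inside the minimax range

def delta_m(x):
    q = np.sum(x**2 / (sig2**2 * d))            # X^T Sigma^{-1} D^{-1} Sigma^{-1} X
    return (1 - 2 * b / (d * sig2 * q)) * x     # componentwise (I - 2b D^{-1} Sigma^{-1} / q) X

x = np.array([2.0, -1.0, 0.5, 1.5, -0.5])
print(delta_m(x))
```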
Example 3.10 (Hierarchical proper Bayes minimax estimator)
As suggested by Berger (1976) suppose the prior distribution has the hierarchical structure \(\theta |\lambda \sim {\mathcal N}_p(0, A_\lambda )\) where A λ = cΣDΣ − Σ, \(c > 1/\min (\sigma _i^2 d_i)\) and h(λ) = (1 + b)λ b for 0 < λ < 1 and \(-1< b \leq \frac {{(p - 6)}}{2}\). The resulting proper Bayes estimator will be minimax for p ≥ 5 by part (3) of Theorem 3.12 and Example 3.9. For p ≥ 3, the estimator δ π(X) given in part (3) of Theorem 3.12 is a generalized Bayes minimax estimator provided \(- \frac {{(p + 2)}}{2} < b \leq \frac {{(p - 6)}}{2}\).
It can be shown to be admissible if the lower bound is replaced by − 2, by the results of Brown (1971). Also see the development in Berger and Strawderman (1996) and Kubokawa and Strawderman (2007).
Example 3.11 (Multiple shrinkage minimax estimators)
It follows from Example 3.9 and Theorem 3.12 that \(m(X) = \sum \limits _{i = 1}^k {\left [ {\frac {1}{{\left ( {X - \nu _i } \right )^{\scriptscriptstyle {\mathrm {T}}} \varSigma ^{ - 1} D^{ - 1} \varSigma ^{ - 1} \left ( {X - \nu _i } \right )}}} \right ]^b }\) satisfies the conditions of Theorem 3.12 (4) for 0 < b ≤ (p − 2)∕2, and hence
is a minimax multiple shrinkage (pseudo-Bayes) estimator.
If, as in Example 3.11, we use the generalized prior
the resulting generalized Bayes (as opposed to pseudo-Bayes) estimator is minimax for 0 < b ≤ (p − 2)∕2.
3.3.2 General Σ and General Quadratic Loss
In this section, we generalize the above results to the case of
where Σ is a general positive definite covariance matrix and the loss is given by
where Q is a general positive definite matrix. We will see that this case can be reduced to the canonical form Σ = I and Q = diag(d 1, d 2, …, d p) = D. We continue to follow the development in Strawderman (2003).
The following well known fact will be used repeatedly to obtain the desired generalization.
Lemma 3.8
For any pair of positive definite matrices, Σ and Q, there exists a non-singular matrix A such that AΣA T = I and (A T)−1 QA −1 = D where D is diagonal.
Using this fact we can now present the canonical form of the estimation problem.
Theorem 3.13
Let \(X\sim {\mathcal N}_p(\theta ,\varSigma )\) and suppose that the loss is L 1(δ, θ) = (δ − θ)T Q(δ − θ). Let A and D be as in Lemma 3.8 and let \(Y=AX \sim {\mathcal N}_p(v,I_p)\) , where v = Aθ and L 2(δ, v) = (δ − v)T D(δ − v).
-
(1)
If δ 1(X) is an estimator with risk function R 1(δ 1, θ) = E θ L 1(δ 1(X), θ), then the estimator δ 2(Y ) = Aδ 1(A −1 Y ) has risk function R 2(δ 2, v) = R 1(δ 1, θ) = E θ L 2(δ 2(Y ), v).
-
(2)
δ 1(X) is proper or generalized Bayes with respect to the proper prior distribution π 1(θ) (or pseudo-Bayes with respect to the pseudo-marginal m 1(X)) under loss L 1 if and only if δ 2(Y ) = Aδ 1(A −1 Y ) is proper or generalized Bayes with respect to π 2(v) = π 1(A −1 v) (or pseudo-Bayes with respect to the pseudo-marginal m 2(Y ) = m 1(A −1 Y )).
-
(3)
δ 1(X) is admissible (or minimax or dominates \(\delta _1^{\ast }(X)\) ) under L 1 if and only if δ 2(Y ) = Aδ 1(A −1 Y ) is admissible (or minimax or dominates \(\delta _2^{\ast }(Y)=A \delta _1^{\ast }(A^{-1} Y)\) under L 2 ).
Proof
To establish Part (1) note that the risk function
Since the Bayes estimator for any quadratic loss is the posterior mean and θ ∼ π 1(θ) and v = Aθ ∼ π 2(v) = π 1(A −1 v) (ignoring constants), then Part (2) follows by noting that
Lastly, Part (3) follows directly from Part (1). □
Note that if Σ 1∕2 is the positive definite square root of Σ and A = PΣ −1∕2 where P is orthogonal and diagonalizes Σ 1∕2 QΣ 1∕2, then this A and D = PΣ 1∕2 QΣ 1∕2 P T satisfy the requirements of the theorem.
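The construction in this remark can be carried out numerically. The following sketch (Python with NumPy only; the positive definite matrices Σ and Q are randomly generated examples) forms Σ 1∕2 from the spectral decomposition, diagonalizes Σ 1∕2 QΣ 1∕2, sets A = PΣ −1∕2, and verifies that AΣA T = I and (A T)−1 QA −1 is diagonal, as required in Lemma 3.8.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)                 # an arbitrary positive definite Sigma
C = rng.standard_normal((4, 4))
Q = C @ C.T + np.eye(4)                         # an arbitrary positive definite Q

w, V = np.linalg.eigh(Sigma)                    # Sigma^{1/2} from the spectral decomposition
Sig_half = V @ np.diag(np.sqrt(w)) @ V.T
dvals, U = np.linalg.eigh(Sig_half @ Q @ Sig_half)   # U^T (Sigma^{1/2} Q Sigma^{1/2}) U = D

A = U.T @ np.linalg.inv(Sig_half)               # A = P Sigma^{-1/2} with P = U^T orthogonal
print(np.allclose(A @ Sigma @ A.T, np.eye(4)))                                 # A Sigma A^T = I
print(np.allclose(np.linalg.inv(A.T) @ Q @ np.linalg.inv(A), np.diag(dvals)))  # diagonal D
```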
Example 3.12
Proceeding as we did in Example 3.9 and applying Theorem 3.13, m(X T Σ −1 Q −1 Σ −1 X) = (X T Σ −1 Q −1 Σ −1 X)−b has associated with it the pseudo-Bayes minimax James-Stein estimator
for 0 < b ≤ p − 2.
Generalizations of Example 3.10 to hierarchical Bayes minimax estimators and generalizations of Example 3.11 to multiple shrinkage estimators are straightforward. We omit the details.
3.4 Admissibility of Bayes Estimators
Recall from Sect. 2.4 that an admissible estimator is one that cannot be dominated in risk, i.e. δ(X) is admissible if there does not exist an estimator δ ′(X) such that R(θ, δ ′) ≤ R(θ, δ) for all θ, with strict inequality for some θ. We have already derived classes of minimax estimators in the previous sections.
In this section, we study their possible admissibility or inadmissibility. One reason that admissibility of these minimax estimators is interesting is that, as we have already seen, the usual estimator δ 0(X) = X is minimax but inadmissible if p ≥ 3. Actually, we have seen that it is possible to dominate X with a minimax estimator (e.g., \(\delta ^{JS}_{(p-2)}(X)\)) that has a substantially smaller risk at θ = 0. Hence, it is of interest to know if a particular (dominating) estimator is admissible.
Note that a unique proper Bayes estimator is automatically admissible (see Lemma 2.6), so we already have examples of admissible minimax estimators for p ≥ 5.
We also note that the class of generalized Bayes estimators contains all admissible estimators if loss is quadratic (i.e., it is a complete class; see, e.g., Sacks 1963; Brown 1971; Berger and Srinivasan 1978). It follows that if an estimator is not generalized Bayes, it is not admissible. Further, in order to be generalized Bayes, an estimator must be everywhere differentiable by properties of the Laplace transform. In particular, the James-Stein estimators and the positive-part James-Stein estimators (for a ≠ 0) are not generalized Bayes and therefore not admissible.
In this section, we will study the admissibility of estimators corresponding to priors which are variance mixtures of normal distributions for the case of \(X \sim {\mathcal N}_p(\theta ,I)\) and quadratic loss ∥δ − θ∥2 as in Sect. 3.1.2. In particular, we consider prior densities of the form (3.4) and establish a connection between admissibility and the behavior of the mixing (generalized) density h(v) at infinity. The analysis will be based on Brown (1971), Theorem 1.2. An Abelian Theorem (see, e.g., Widder (1946), Corollary 1.a, p. 182) along with Brown’s theorem are our main tools. We use the notation f(x) ∼ g(x) as x → a to mean limx→a f(x)∕g(x) = 1. Here is an adaptation of the Abelian theorem in Widder that meets our needs.
Theorem 3.14
Assume \(g: \mathbb {R}^{+} \rightarrow \mathbb {R}\) has a Laplace transform \(f(s) = \int ^{\infty }_{0} g(t) e^{-st}\, dt\) that is finite for s ≥ 0. If g(t) ∼ t γ as t → 0+ for some γ > −1, then f(s) ∼ s −(γ+1) Γ(γ + 1) as s →∞.
The proof is essentially as in Widder (1946) but the assumption of finiteness of the Laplace transform at s = 0 allows the extension from γ ≥ 0 to γ > −1.
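As a quick numerical sanity check of Theorem 3.14 (our own illustration, with an arbitrary choice of g), take g(t) = t^γ e^{−t}, which behaves like t^γ as t → 0+ and has a Laplace transform finite for all s ≥ 0; the ratio f(s)∕(s^{−(γ+1)} Γ(γ + 1)) should approach 1 as s grows.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

gam = 0.5                                 # any gamma > -1
def laplace_transform(s):
    """f(s) = integral_0^infty g(t) exp(-s t) dt for g(t) = t^gam * exp(-t)."""
    val, _ = quad(lambda t: t**gam * np.exp(-t) * np.exp(-s * t), 0, np.inf)
    return val

for s in [10.0, 100.0, 1000.0]:
    print(s, laplace_transform(s) / (gamma(gam + 1) * s ** (-(gam + 1))))  # ratio tends to 1
```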
We first give a lemma which relates the tail behavior of the mixing density h(v) to the tail behavior of π(∥θ∥2) and m(∥x∥2) and also shows that ∥δ(x) − x∥ is bounded whenever h(v) has polynomial tail behavior.
Lemma 3.9
Suppose \(X \sim {\mathcal N}_{p}(\theta , I_{p})\) , L(θ, δ) = ∥δ − θ∥2 and π(θ) is given by (3.4) where h(v) ∼ K v a as v →∞ with a < (p − 2)∕2 and where v −p∕2 h(v) is integrable in a neighborhood of 0. Then
(1) π(θ) ∼ K (∥θ∥2)a−(p−2)∕2 Γ((p − 2)∕2 − a) as ∥θ∥2 →∞,
m(x) ∼ K(∥x∥2)a−(p−2)∕2 Γ((p − 2)∕2 − a) as ∥x∥2 →∞,
and therefore π(∥x∥2) ∼ m(∥x∥2) as ∥x∥2 →∞;
(2) ∥δ(x) − x∥ is uniformly bounded, where δ is the generalized Bayes estimator corresponding to π.
Proof
First note that (with t = 1∕v)
and \(g(t) = t^{\frac {p}{2} - 2} h(1/t) \sim K t^{\frac {p-4}{2} - a}\) as t → 0+. Therefore, by Theorem 3.14, \(\pi (\theta ) \sim K (\| \theta \|{ }^{2})^{a - \frac {p-2}{2}} \varGamma \left ( \frac {p-2}{2} - a \right )\) as ∥θ∥2 →∞. Similarly
We note that as t → 0+, \(t^{\frac {p}{2}-2} h\left (\frac {1-t}{t} \right ) \sim t^{\frac {p-4}{2}} \left (\frac {1-t}{t} \right )^{a} \sim t^{\frac {p-4}{2} - a}\). Thus, again by Theorem 3.14,
and Part (1) follows.
To prove Part (2) note that
The above argument applied to the numerator and denominator shows
Since δ(x) − x is in \(\mathcal {C}^{\infty }\) and tends to zero as ∥x∥2 →∞, the function is uniformly bounded. □
The following result characterizes admissibility and inadmissibility for generalized Bayes estimators when the mixing density h(v) ∼ v a as v →∞.
Theorem 3.15
For priors π(θ) of the form (3.4) with mixing density h(v) ∼ v a as v →∞, the corresponding generalized Bayes estimator δ is admissible if and only if a ≤ 0.
Proof (Admissibility if a ≤ 0)
By Lemma 3.9, we have \(\bar {m}(r) = m^{*}(r^{2}) \sim K^{*} (r^{2})^{a - (p-2)/2}\), with m(x) = m ∗(∥x∥2). Thus, for any 𝜖 > 0, there is an r 0 > 0 such that, for r > r 0, \(\bar {m}(r) \leq (1 + \epsilon ) K^{*} r^{2a -(p-2)}\). Since ∥δ(x) − x∥ is uniformly bounded,
if a ≤ 0. Hence, δ(x) is admissible if a ≤ 0, by Theorem 1.2.
(Inadmissibility if a > 0) Similarly, we have, for r ≥ r 0,
and
if a > 0. Thus δ(x) is inadmissible if a > 0. □
Example 3.13 (Continued)
Recall for the Strawderman prior that \(h(v) = C(1 + v)^{- \alpha - (\frac {p-2}{2})} \sim v^{a}\) as v →∞ for \(a = - (\alpha + \frac {p-2}{2})\).
The above theorem implies that the generalized Bayes estimator is admissible if and only if \(\alpha + \frac {p-2}{2} \geq 0\), or equivalently \(1 - \frac {p}{2} \leq \alpha \). We previously established minimaxity when 2 − p < α ≤ 0 for p ≥ 3 and propriety of the prior when \(2 - \frac {p}{2} < \alpha \leq 0\) for p ≥ 5.
Note in general that for a mixing distribution of the form h(v) ∼ Kv a as v →∞, the prior distribution π(θ) will be proper if and only if a < −1 by the same argument as in the proof of Theorem 3.15. Hence the bound for admissibility, a ≤ 0, differs from the bound for propriety, a < −1, by 1.
3.5 Connections to Maximum a Posteriori Estimation
3.5.1 Hierarchical Priors
As we have seen in previous sections of this chapter, the classical Stein estimate and its positive-part modification can be motivated in a number of ways, perhaps most commonly as empirical Bayes estimates (i.e., posterior means) under a normal hierarchical model in which \(\theta \sim {\mathcal N}_p(0,\psi \, I_{p})\) where ψ, viewed as a hyperparameter, is estimated. In this section we look at shrinkage estimation through the lens of maximum a posteriori (MAP) estimation. The development of this section follows Strawderman and Wells (2012).
The class of proper Bayes minimax estimators constructed in Sect. 3.1 relies on the use of a hierarchically specified class of proper prior distributions π S(θ, κ). In particular, for the prior in Strawderman (1971), π S(θ, κ) is specified according to
where g(κ) = (1 − κ)∕κ and the constant a satisfies 0 ≤ a < 1, i.e., π S(κ) is a Beta(1 − a, 1) probability distribution. Suppose a = 1∕2; then, utilizing the transformation ψ = g(κ) > 0 in (3.45), we obtain the equivalent specification
Two interesting alternative formulations of (3.46) are given below for the case p = 1 and generalized later for arbitrary p. In what follows, we let Gamma(τ, ξ) denote a random variable with probability density function
and Exp(ξ) corresponds to the choice τ = 1 (i.e., an exponential random variable in its rate parametrization).
For p = 1, the marginal prior distribution on θ induced by (3.46) is equivalent to that obtained under the specification
where α = 1 and HN(ζ) denotes the half-normal density
The marginal prior distribution on θ induced by (3.46) is also equivalent to that obtained under the alternative specification
where α = 1 and Laplace(λ) denotes a random variable with the Laplace (double exponential) probability density function
This result follows from Griffin and Brown (2010). Define
as a hierarchically specified prior distribution for θ, ψ and ω. The resulting marginal prior distribution for θ, obtained by integrating out ψ and ω, is exactly the quasi-Cauchy distribution of Johnstone and Silverman (2004); see Griffin and Brown (2010) for details. Carvalho et al. (2010) showed that this distribution also coincides with the marginal prior distribution for θ induced by taking a = 1∕2 in (3.45). The transformation \(\lambda = \sqrt {2 \omega }\) in (3.49) leads directly to (3.47) upon setting α = 1; (3.48) is then obtained by integrating out ψ in (3.47).
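The hierarchical specification (3.45) is straightforward to simulate. The following Python sketch (added here for illustration; the function name is ours) draws θ from the hierarchy κ ∼ Beta(1 − a, 1), θ|κ ∼ N_p(0, g(κ)I_p) with g(κ) = (1 − κ)∕κ; with a = 1∕2 and p = 1 the resulting marginal is the heavy-tailed quasi-Cauchy distribution mentioned above.

```python
import numpy as np

def sample_strawderman_prior(n_draws, p, a=0.5, seed=None):
    """Draw theta from kappa ~ Beta(1 - a, 1), theta | kappa ~ N_p(0, g(kappa) I_p),
    where g(kappa) = (1 - kappa) / kappa, as in the hierarchical specification (3.45)."""
    rng = np.random.default_rng(seed)
    kappa = rng.beta(1 - a, 1, size=n_draws)
    psi = (1 - kappa) / kappa                        # the latent scale psi = g(kappa)
    return rng.standard_normal((n_draws, p)) * np.sqrt(psi)[:, None]

theta = sample_strawderman_prior(100_000, p=1, a=0.5, seed=0)
print(np.quantile(np.abs(theta), [0.5, 0.9, 0.99]))  # heavy tails of the scale mixture
```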
3.5.2 The Positive-Part Estimator and Extensions as MAP Estimators
Takada (1979) showed that a positive-part type minimax estimator
where \((t)_{+} = \max (t, 0)\), is also the MAP estimator under a certain class of hierarchically specified generalized prior distributions, say π T(θ, κ) = π(θ|κ)π T(κ). For the specific choice c = p − 2 in (3.50), Takada’s prior reduces to
The improper prior (3.51) evidently behaves similarly to Strawderman’s proper prior (3.45) (i.e., for a = 1∕2). Notably, the numerator (1 − κ)p∕2 in π T(κ) explicitly offsets the contribution of (1 − κ)−p∕2 arising from the determinant of the variance matrix g(κ) I p in the conditional prior specification θ|κ. Under the monotone decreasing variable transformation ψ = g(κ) > 0, (3.51) implies an alternative representation that is analogous to (3.46):
We observe that the proper prior (3.46) and improper prior (3.52) (almost) coincide when p = 1; in particular, multiplying the former by ψ 1∕2 yields the latter. In view of the fact that (3.46) and (3.47) lead to the same marginal prior on θ when p = 1, one is led to question whether a deeper connection between these two prior specifications might exist. Supposing p ≥ 1, consider the following straightforward generalization of (3.47):
Integrating λ out of the higher-level prior specification, the resulting marginal (proper) prior for ψ reduces to
For α = 1 and any p ≥ 1, we now observe that the proper prior (3.54) is simply the improper prior π T(ψ) in (3.52) multiplied by ψ −1∕2 and it reduces to Strawderman’s prior (3.46) for p = 1.
3.5.3 Penalized Likelihood and Hierarchical Priors
Expressed in modern terms of penalization, Takada (1979) proved that the positive-part estimator (3.50) is the solution to a certain penalized likelihood estimation problem in which the penalty (or regularization) term is determined by the prior (3.51). Penalized likelihood estimation, and more generally regularized estimation, has become a very important conceptual paradigm in both statistics and machine learning. Such methods suggest principled estimation and model selection procedures for a variety of high-dimensional problems. The statistical literature on penalized likelihood estimators has exploded, in part due to success in constructing procedures for regression problems in which one can simultaneously select variables and estimate their effects. The penalty functions leading to procedures with good asymptotic frequentist properties have singularities at the origin; important examples of separable penalties include the least absolute shrinkage and selection operator (LASSO) of Tibshirani (1996), the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001), and the minimax concave penalty (MCP) of Zhang (2010). In fact, most such penalties utilized in the literature behave similarly to the LASSO penalty near the origin, differing more in their respective behaviors away from the origin, where control of estimation bias for those parameters not estimated to be zero becomes the driving concern. Generalizations of the LASSO penalty have been proposed to deal with correlated groupings of parameters, such as those that might arise in problems with features that can be sensibly ordered, as in the fused LASSO of Tibshirani et al. (2005), or separated into distinct subgroups, as in the group LASSO of Yuan and Lin (2006). In such problems, the use of these penalties serves a related purpose.
The LASSO was initially formulated as a least squares estimation problem subject to an ℓ 1 constraint on the parameter vector. The more well-known penalized likelihood version arises from a Lagrange multiplier formulation of a convex relaxation of an ℓ 0 non-convex optimization problem. Since the underlying objective function is separable in the parameters, the estimation problem is directly related to the now-classical problem of estimating a bounded normal mean. From a decision theoretic point of view, if \(X \sim {\mathcal N}(\theta , 1)\; {\mathrm {for}} \; |\theta | \leq \lambda \), then the projection of the usual estimator dominates the unrestricted MLE, but cannot be minimax for quadratic loss because it is not a Bayes estimator. Casella and Strawderman (1981) showed that, for λ sufficiently small, the unique minimax estimator of θ is the Bayes estimator corresponding to a two-point prior on {−λ, λ}; specifically, the uniform boundary Bayes estimator, \(\lambda \tanh (\lambda x)\), is the unique minimax estimator if λ < λ 0 ≈ 1.0567. They also considered three-point priors supported on {−λ, 0, λ} and obtained sufficient conditions for such a prior to be least favorable. Marchand and Perron (2001) considered the multivariate extension, \({X} \sim \mathcal {N}_{p}(\theta , {I}_{p})\) with ∥θ∥≤ λ, and showed that the Bayes estimator with respect to a boundary uniform prior dominates the MLE whenever \(\lambda \leq \sqrt {p}\) under squared error loss.
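To make the bounded-mean discussion concrete, the following sketch (our own numerical check, not part of the original text) verifies that for X ∼ N(θ, 1) and the symmetric two-point prior placing mass 1∕2 on each of −λ and λ, the posterior mean is exactly λ tanh(λx), the uniform boundary Bayes estimator cited above.

```python
import numpy as np
from scipy.stats import norm

def two_point_posterior_mean(x, lam):
    """Posterior mean of theta when X ~ N(theta, 1) and theta puts mass 1/2 on each of -lam, +lam."""
    w_plus, w_minus = norm.pdf(x - lam), norm.pdf(x + lam)
    return lam * (w_plus - w_minus) / (w_plus + w_minus)

x, lam = np.linspace(-3, 3, 7), 0.8
print(np.allclose(two_point_posterior_mean(x, lam), lam * np.tanh(lam * x)))  # True
```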
It has long been recognized that the class of penalized likelihood estimators also has a Bayesian interpretation. For example, in the canonical version of the LASSO problem, minimizing
with respect to θ is easily seen to be equivalent to computing the MAP estimator of θ under a model specification in which \({X}\,{\sim }\,{\mathcal N}_p(\theta ,{I}_{p})\) and θ has a prior distribution satisfying \(\theta _i \stackrel {iid}{\sim } \mbox{Laplace}(\lambda ).\) It is easily shown that the solution to (3.55) is \(\widehat {\theta }_i({X}) = \mbox{sign}(X_i) (|X_i| - \lambda )_+,\) i = 1, …, p. The critical hyperparameter λ, though regarded as fixed for the purposes of estimating θ, is typically estimated in some ad hoc manner (e.g., cross validation), resulting in an estimator with an empirical Bayes flavor.
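As an illustration (assuming, as is standard, that the canonical objective (3.55) is \(\tfrac{1}{2}\|X - \theta\|^2 + \lambda \sum_i |\theta_i|\)), the following sketch implements the componentwise soft-thresholding solution quoted above and checks it against a direct numerical minimization.

```python
import numpy as np
from scipy.optimize import minimize

def soft_threshold(x, lam):
    """Componentwise LASSO/MAP solution under independent Laplace(lam) priors."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(1)
x, lam = rng.standard_normal(5), 0.7
objective = lambda t: 0.5 * np.sum((x - t) ** 2) + lam * np.sum(np.abs(t))
numerical = minimize(objective, x0=np.zeros_like(x), method="Powell", tol=1e-10).x
print(np.max(np.abs(soft_threshold(x, lam) - numerical)))  # close to 0
```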
The Laplace prior inherent in the LASSO minimization problem (3.55) has broad connections to estimation under hierarchical prior specifications that lead to scale mixtures of normal distributions. As pointed out above, the conditional prior distribution of θ|λ obtained by integrating out ψ in (3.47) is exactly Laplace(λ). More generally, the conditional distribution for θ|λ under the hierarchical prior specification (3.53) is a special case of the class of multivariate exponential power distributions in Gomez-Sanchez-Manzano et al. (2008); in particular, we obtain
a direct generalization of the Laplace distribution that arises when p = 1. Treating λ as a fixed hyperparameter, computation of the resulting MAP estimator under the previous model specification \(X \sim {\mathcal N}_p(\theta ,I_{p})\) reduces to determining the value of θ that minimizes
The resulting estimator is easily shown to be
an estimator that coincides with the solution to the canonical version of the grouped LASSO problem involving a single group of parameters (see Yuan and Lin 2006) and equals \(\widehat {\theta }(X) = \mbox{sign}(X) (|X| - \lambda )_+\) for the case where p = 1.
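A minimal sketch of the closed-form estimator just described, \(\widehat{\theta}(X) = (1 - \lambda/\|X\|)_{+}\, X\), which is the single-group grouped-LASSO solution and reduces to scalar soft thresholding when p = 1:

```python
import numpy as np

def block_soft_threshold(x, lam):
    """Single-group grouped-LASSO / MAP estimate: shrink the whole vector toward 0,
    and set it exactly to 0 once ||x|| <= lam."""
    norm_x = np.linalg.norm(x)
    return np.zeros_like(x) if norm_x <= lam else (1.0 - lam / norm_x) * x

# For p = 1 this is ordinary soft thresholding.
for x in [-2.0, -0.3, 0.0, 0.5, 1.7]:
    assert np.isclose(block_soft_threshold(np.array([x]), 0.6)[0],
                      np.sign(x) * max(abs(x) - 0.6, 0.0))
```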
Consider the problem of estimating θ in the canonical setting \(X \sim {\mathcal N}_p(\theta , I_{p})\). In view of the fact that (3.53) leads to (3.56) upon integrating out ψ, our starting point is the (possibly improper) generalized class of joint prior distributions π(θ, λ|α, β), which we define in the following hierarchical fashion
where α, β > 0 are hyperparameters. Equivalently,
The prior on λ is an improper modification of that given in (3.53), in which a location parameter β is introduced and a factor λ −p offsets the contribution λ p in (3.56). This construction mimics the idea underlying the prior used by Takada (1979) to motivate (3.50) as a MAP estimator.
Considering (3.60) as motivation for defining a new class of hierarchical penalty functions, Strawderman and Wells (2012) proposed deriving the MAP estimator for (θ, λ) by minimizing the objective function
jointly in \(\theta \in \mathbb {R}^p\) and λ > 0, where α > 1∕2 and β > 0 are fixed. The resulting estimator for θ takes the closed form
where
for ν α = 2α∕(2α − 1). Equivalently, we may write
demonstrating that (3.62) has the flavor of a range-modified positive-part estimator. A detailed derivation of this estimator is in Strawderman and Wells (2012).
Some interesting special cases of the estimator (3.62) arise when considering specific values of α, β and p. For example, letting α →∞, we obtain (for β > 0)
upon setting β = λ, we evidently recover (3.58); subsequently, setting \(\lambda = \sqrt {p-2}\), one then obtains an obvious modification of (3.50) for the case where c = p − 2:
In the special case p = 1, the estimator (3.62) reduces to
As shown in Strawderman et al. (2013), (3.65) is also the solution to the penalized minimization problem
where β > 0, α > 1∕2 and
This optimization problem is the univariate equivalent of the penalized likelihood estimation problem considered in Zhang (2010), who referred to ρ(t;α, β) as MCP. It follows that (3.65) is equivalent to the univariate MCP thresholding operator; consequently, (3.62) may be regarded as a generalization of this operator for thresholding a vector of parameters. Zhang (2010) showed that the LASSO, SCAD, and MCP belong to a family of quadratic spline penalties with certain sparsity and continuity properties. MCP turns out to be the simplest penalty that results in an estimator that is nearly unbiased, sparse and continuous. As demonstrated above, MCP also has an interesting Bayesian motivation under a hierarchical modeling strategy. Strawderman et al. (2013) undertook a more detailed study of the connections between MCP, the hierarchically penalized estimator, and proximal operators for the case of p = 1. They also compared this estimator to several others through consideration of frequentist and Bayes risks.
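For concreteness, here is a sketch of the univariate MCP (firm) thresholding operator in the (λ, γ) parametrization of Zhang (2010); the text above uses an (α, β) parametrization, so the correspondence is only up to reparametrization. The closed form is checked against a grid search on the penalized objective.

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """MCP penalty rho(t; lam, gamma) = lam * integral_0^t (1 - u / (gamma * lam))_+ du, for t >= 0."""
    t = min(t, gamma * lam)
    return lam * t - t ** 2 / (2.0 * gamma) if t < gamma * lam else gamma * lam ** 2 / 2.0

def mcp_threshold(x, lam, gamma):
    """Minimizer of 0.5 * (x - theta)^2 + rho(|theta|; lam, gamma) for gamma > 1 (firm thresholding).
    Nearly unbiased: the observation is returned unchanged once |x| > gamma * lam."""
    ax = np.abs(x)
    if ax <= lam:
        return 0.0
    if ax <= gamma * lam:
        return np.sign(x) * (ax - lam) / (1.0 - 1.0 / gamma)
    return x

lam, gamma = 0.8, 2.5
grid = np.linspace(-5, 5, 20001)
for x in [-3.0, -1.5, 0.5, 1.0, 2.5]:
    obj = 0.5 * (x - grid) ** 2 + np.array([mcp_penalty(abs(t), lam, gamma) for t in grid])
    assert abs(grid[np.argmin(obj)] - mcp_threshold(x, lam, gamma)) < 1e-2
```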
3.6 Estimation of a Predictive Density
Consider a parametric model \( \{ {\mathcal Y} , { ( {\mathcal P^{\prime } }_{ \mu } ) }_{ \mu \in \varOmega } \}\) where \( { \mathcal Y}\) is the sample space, Ω is the parameter space, and \( {\mathcal P^{\prime } } = \{ p^{\prime}(y | \mu ): \mu \in \varOmega \} \) is the corresponding class of densities with respect to a σ-finite measure. In addition, suppose an observed value x of the random variable X follows a model \( \{ { \mathcal X} , { ( {\mathcal P}_{ \mu } ) }_{ \mu \in \varOmega } \}\) indexed by the same parameter. In this section, we examine the problem of estimating the true density \(p^\prime ( \cdot | \mu ) \in {\mathcal P^\prime }\) of a random variable Y. In this context, p ′(⋅|μ) is referred to as the predictive density of Y.
Let the density \( \hat {q} (y|x)\) (belonging to some class of models \( { \mathcal C} \supset {\mathcal P^{\prime } } \)) be an estimate, based on the observed data x, of the true density p ′(y|μ). Aitchison (1975) proposed using the Kullback and Leibler (1951) divergence, defined in (3.66) below, as a loss function for estimating p ′(y|μ).
The class of estimates \( { \mathcal C}\) can be identical to the class \( {\mathcal P^{\prime } } \), that is, for any \(y \in { \mathcal Y}\)
where \( \hat { \mu }\) is some estimate of μ. This type of density estimator is called the “plug-in density estimate” associated with the estimate \( \hat {\mu } \). Alternatively, one may choose
where dπ(μ|x) may be a weight function (measure) or an a posteriori density associated with a priori measure π(μ). In this case, the class \( { \mathcal C}\) will be broader than the class of the models \( {\mathcal P^{\prime } } \). Aitchison (1975) showed that this latter method is preferable to the plug-in approach for several families of probability distributions by comparing their risks induced by the Kullback-Leibler divergence.
3.6.1 The Kullback-Leibler Divergence
First, recall the definition of the Kullback-Leibler divergence and some of its properties.
Lemma 3.10
The Kullback-Leibler divergence (relative entropy) D KL(p, q) between two densities p and q is defined by
\[ D_{KL}(p, q) = \int p(y) \, \log \frac{p(y)}{q(y)} \, dy \;\geq\; 0 , \]
and equality is achieved if and only if p = q, p-almost surely.
Note that the divergence can be finite only if the support of the density p is contained in the support of the density q. By convention, we define \(0 \, \log \frac { 0 }{ 0 } = 0\).
Proof
By definition of the Kullback-Leibler divergence we can write
We have equality, using Jensen’s inequality, if and only if p = q, p -almost surely. Note that the lemma is true if q is assumed only to be a subdensity (mass less than or equal to 1). □
The Kullback-Leibler divergence is not a true distance since it is not symmetric and it does not satisfy the triangle inequality. But it appears as the natural discrepancy measure in information theory. An important property, given in the following lemma, is that it is strictly convex.
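As a small numerical illustration of these properties (using the well-known closed form for the Kullback-Leibler divergence between two univariate normal densities, a formula not taken from the text), the divergence is nonnegative and visibly asymmetric:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_gauss(m1, s1, m2, s2):
    """Closed-form D_KL(N(m1, s1^2) || N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def kl_numeric(p, q):
    """D_KL(p, q) = E_p[log p - log q], computed by quadrature."""
    val, _ = quad(lambda y: p.pdf(y) * (p.logpdf(y) - q.logpdf(y)), -np.inf, np.inf)
    return val

p, q = norm(0.0, 1.0), norm(1.0, 2.0)
print(kl_numeric(p, q), kl_gauss(0.0, 1.0, 1.0, 2.0))  # agree, and both are >= 0
print(kl_numeric(q, p))                                # differs: the divergence is not symmetric
```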
Lemma 3.11
The Kullback-Leibler divergence is strictly convex, that is to say, if (p 1, p 2) and (q 1, q 2) are two pairs of densities then, for any 0 ≤ λ ≤ 1,
with strict inequality unless (p 1, p 2) = (q 1, q 2) a.e. with respect to p 1 + p 2.
Proof
Note that \(f(t) =t \, \log (t)\) is strictly convex on (0, ∞). Let
From the convexity of the function f it follows that
and consequently
Substituting the above values of α 1, α 2, t 1 and t 2 gives
Finally, by integrating the latter term, (3.67) and the strict convexity follow from the strict convexity of the function f. □
3.6.2 The Bayesian Predictive Density
Assume in the rest of this subsection that p(x|μ) and p ′(y|μ) are densities with respect to the Lebesgue measure. For any estimator \(\hat {p} (\cdot |x)\) of the density p ′(y|μ), define the Kullback-Leibler loss by
and its corresponding risk as
We say that the density estimate \( \hat {p}_2\) dominates the density estimate \( \hat {p}_1\) if, for any μ ∈ Ω, \( {{\mathcal R}_{\mbox{ {KL}}}}( \mu , \hat {p}_2) \leq {{\mathcal R}_{\mbox{ {KL}}}}( \mu , \hat {p}_1)\), with strict inequality for at least some value of μ.
In the Bayesian framework we will compare estimates using Bayes risk. We will consider a class more general than that of Aitchison (1975), namely the class of all subdensities,
Lemma 3.12 (Aitchison 1975)
The Bayes risk
is minimized by
We call \( \hat {p}_{ \pi }\) the Bayesian predictive density.
Proof
The difference between the Bayes risks of \( \hat {p}_{ \pi }\) and another competing subdensity estimator \( \hat {q}\) is
Rearranging the order of integration thanks to Fubini’s Theorem gives
□
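To fix ideas, the following sketch (our own conjugate-normal example, not taken from the text) computes the Bayesian predictive density (3.70) when X ∼ N(μ, v_x), Y ∼ N(μ, v_y) and μ ∼ N(0, τ²): integrating φ(y; μ, v_y) against the normal posterior of μ gives a normal density centered at the posterior mean of μ with variance inflated by the posterior variance, which we verify by Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

def predictive_density(y, x, v_x, v_y, tau2):
    """Bayesian predictive density p_hat_pi(y | x) for X ~ N(mu, v_x), Y ~ N(mu, v_y), prior mu ~ N(0, tau2):
    a normal density with the posterior mean of mu and variance v_y + posterior variance of mu."""
    post_mean = tau2 * x / (tau2 + v_x)
    post_var = tau2 * v_x / (tau2 + v_x)
    return norm.pdf(y, loc=post_mean, scale=np.sqrt(v_y + post_var))

# Monte Carlo check: average phi(y; mu, v_y) over draws of mu from its posterior.
rng = np.random.default_rng(2)
x, v_x, v_y, tau2, y = 1.3, 1.0, 0.5, 4.0, 0.2
post_mean, post_var = tau2 * x / (tau2 + v_x), tau2 * v_x / (tau2 + v_x)
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=200_000)
print(norm.pdf(y, loc=mu_draws, scale=np.sqrt(v_y)).mean(),
      predictive_density(y, x, v_x, v_y, tau2))        # the two values agree
```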
3.6.3 Sufficiency Reduction in the Normal Case
Let X (n) = (X 1, …, X n) and Y (m) = (Y 1, …, Y m) be independent iid samples from p-dimensional normal distributions \({\mathcal N}_p(\mu ,\varSigma _1)\) and \({\mathcal N}_p(\mu ,\varSigma _2)\) with unknown common mean μ and known positive definite covariance matrices Σ 1 and Σ 2. On the basis of an observation x (n) = (x 1, …, x n) from X (n), consider the problem of estimating the true predictive density p ′(y (m)|μ) of y (m) = (y 1, …, y m), under the Kullback-Leibler loss . For a prior density π(μ), the Bayesian predictive density is given by
For simplicity, we consider the case where Σ 1 = Σ 2 = I p. According to Komaki (2001) the Bayesian predictive densities satisfy
where, denoting by ϕ p(⋅ |μ, Σ) the density of \({\mathcal N}_p(\mu ,\varSigma )\), in the left-hand side of (3.72),
while, in the right-hand side of (3.72),
with \(\bar {y}_m = \sum _{j=1}^{m} y_{j} / m\). Similarly, \(\hat {p}_{\pi } (y_{(m)} | x_{(n)} ) \) corresponds to the conditional density of the p × m matrix y (m) given the p × n matrix x (n) while \(\hat {p}_{\pi }(\bar {y}_m | \bar {x}_n)\) corresponds to the conditional density of the p × 1 vector \(\bar {y}_m\) given the p × 1 vector \(\bar {x}_n = \sum _{i=1}^{n} x_i / n\).
To see this sufficiency reduction , use the fact that
Then we can express p ′(y (m)|μ) as
Similarly, it follows that
By replacing these expressions in the form of the predictive density in (3.71), we get
Finally, for (3.73) and (3.74), it follows that
Therefore, for any prior π, the risk of the Bayesian predictive density estimator is equal to the risk of the Bayesian predictive density associated to π in the reduced model \(X \sim {\mathcal N}_p (\mu , \frac {1}{n} I_p)\) and \(Y \sim {\mathcal N}_p (\mu , \frac {1}{m} I_p).\) Thus, for the Bayesian predictive densities, it is sufficient to consider the reduced model.
Now we will compare two plug-in density estimators, \(\hat {p}_1\) and \( \hat {p}_2\) associated with the two different estimators of μ, δ 1 and δ 2. That is, for i = 1, 2, define
The difference in risk between \( \hat {p}_2\) and \( \hat {p}_1\) is given by
By the independence of X (n) and Y (m) this can be reexpressed in terms of expectations as
which shows that the risk difference between \( \hat {p}_2\) and \( \hat {p}_1\) is proportional to the risk difference between δ 2 and δ 1.
Note that, by completeness of the statistic \( \bar {X}_n\), it suffices to consider estimates of μ that depend only on \( \bar {X}_n\).
3.6.4 Properties of the Best Invariant Density
In this subsection, we restrict our attention to location models. We assume X ∼ p(x|μ) = p(x − μ) and Y ∼ p ′(y|μ) = p ′(y − μ), where p and p ′ are two known, possibly different, densities. A density \( \hat {q}\) is called invariant (equivariant) with respect to a location parameter if, for any \(a \in \mathbb {R}^p\), \(x \in \mathbb {R}^p\), and \(y \in \mathbb {R}^p\), \(\hat{q}(y|x + a) = \hat{q}(y - a|x)\). This is equivalent to \(\hat{q}(y + a|x + a) = \hat{q}(y|x)\). The following result shows that the risk of an invariant predictive density is constant.
Lemma 3.13
The invariant predictive densities with respect to the location group of translations have constant risk.
Proof
By the property of invariance, the risk of an invariant density \( \hat {q}\) is equal to
by the change of variables z = x − μ and z ′ = y − μ. Therefore, the risk \( { \mathcal R} ( \mu , \hat {q})\) does not depend on μ and is constant. □
Any invariant predictive density which minimizes this risk is known as the best invariant predictive density.
Lemma 3.14
The best invariant predictive density is the Bayesian predictive density \( \hat {p}_{U}\) associated with the Lebesgue measure on \( \mathbb {R}^p\), π(μ) = 1, and is given by
Proof
Let Z = X − μ, Z ′ = Y − μ, and T = Y − X = Z ′− Z. We will show that \( \hat {p} (t) \), the density of T, which is independent of μ, is the best invariant density. As noted in the previous section, if \( \hat {q}\) is an invariant predictive density, \( \hat {q}(y|x) = \hat {q}(y-x|0) = \hat {q}(y-x)\), by an abuse of notation. Hence,
which is always positive by the inequality in (3.66). The result of the equality in (3.78), and hence the lemma, follows from the fact that \( \hat {p} (t) = \hat {p} (y-x) = \hat {p}_U(y|x),\) that is,
which is the expression of \( \hat {p}_U\) given in (3.70) with π(μ) = 1. □
Murray (1977) showed that \( \hat {p}_{U}\) is the best invariant density under the action of translations and of linear transformations for a Gaussian model. Ng (1980) generalized this result. Liang and Barron (2004) showed, without the hypothesis of independence between X and Y, that for the estimation of p ′(y|x, μ) the best invariant density is \( \displaystyle \hat {p}_U = \frac {\int _{ \mathbb {R}^p} p^\prime (y|x, \mu ) \, p(x| \mu ) \,d \mu } {\int _{ \mathbb {R}^p} p(x| \mu ) \, \,d \mu }\).
We will now show that \( \hat {p}_U\) is minimax in location problems.
Lemma 3.15
Let X ∼ p(x|μ) = p(x − μ) and Y ∼ p ′(y|μ) = p ′(y − μ), with unknown location parameter \( \mu \in \mathbb {R}^p\). Assume that \(E_0 \left [ \|X\| ^ 2 \right ] < \infty \). Then the best invariant predictive density \( \hat {p}_{U}\) is minimax.
Proof
We show minimaxity using Lemma 1.8. Consider a sequence {π k} of normal \(\mathcal {N}_p(0, k \, I_p)\) priors. The difference of Bayes risks between \( \hat {p}_U\) and \( { \hat {p} }_{ \pi _k} \) is given by
where \(E_{ \pi _k} ^ {x,y} \) denotes the expectation with respect to the joint marginal of (X, Y ),
Since \(r(\hat {p}_U,\pi _k) = {\mathcal R}(\mu ,\hat {p}_U)\) (\(\hat {p}_U\) has constant risk) it suffices to show (3.80) tends to 0 as k tends to infinity. By simplifying one gets
where E μ|X,Y denotes the expectation with respect to the posterior of μ given (X, Y ). An application of Jensen’s inequality gives
By developing the expectations, it follows that
Similarly, by integrating with respect to y and by interchanging between μ and μ ′ we have
By grouping the expressions (3.81), (3.83) and (3.84) and making the changes of variables z = x − μ and z ′ = x − μ ′ it follows that
In view of the form of π k(μ), the term on the right in (3.84) can be written as
since E(Z) = E(Z ′) = E 0(X) (here, \(E_{Z,Z^\prime }\) denotes the expectation with respect to p(z, z ′) = p(z)p(z ′)). We then see that the difference of Bayes risks tends to zero as k →∞. Therefore, \( \hat {p}_U\) is minimax by Lemma 1.8. □
This result is in Liang and Barron (2004); a more direct proof for the Gaussian case can be found in George et al. (2006) and is given in the next section.
3.6.5 An Explicit Expression for \( \hat {p}_U\) and Its Risk in the Normal Case
We now give an explicit expression for \( \hat {p}_U\), described in the previous subsections, in the Gaussian setting. Let \(X \sim {\mathcal N}_p (\mu , v_x I_p)\) and \(Y \sim {\mathcal N}_p (\mu , v_y I_p)\).
Lemma 3.16
The Bayesian predictive density associated with the uniform prior on \( \mathbb {R}^p\) , π(μ) ≡ 1, is given by the following expression
Proof
For W = (v y X + v x Y )∕(v x + v y) and v w = (v x v y)∕(v x + v y) it is clear that \(W \sim {\mathcal N}_p ( \mu ,v_w I_p)\), by the independence of X and Y . Further, note that
By definition, and through the previous representation, it follows that
□
Note that the risk of \( \hat {p}_U\) is constant, as we have previously seen for invariant densities. Given the form of \( \hat {p}_U( \cdot |x )\) it follows that the Kullback-Leibler divergence is
Hence, we can conclude that the risk of \( \hat {p}_U\) is
In the framework of the iid sampling model presented in Sect. 3.6.3 with Σ 1 = Σ 2 = I p, we can express the risk as
A predictive density is called the plug-in density relative to an estimator δ if it has the form
The predictive plug-in density, which corresponds to the standard estimator of the mean, μ, δ 0(X) = X, is
We can directly verify that the predictive density \( \hat {p}_U\) dominates the plug-in density \( \hat {p}_{ \delta _0}\) for any \( \mu \in \mathbb {R}^p.\) In fact, their difference in risk is
Since \( E^{X,Y} \left ({\| Y - X \|}^2 \right )\) equals
we have
Surprisingly, the predictive density \( \hat {p}_U\) has similar properties to the standard estimator, δ 0(X) = X, for the estimation of the mean under quadratic loss. Komaki (2001) showed that the density \( \hat {p}_U\) is dominated by the Bayesian predictive density using the harmonic prior , π(μ) = ∥μ∥2−p. George et al. (2006) extended the analogy with point estimation. We give some of this development next.
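A Monte Carlo sketch of the domination just described (our own illustration; the closed-form risks (p∕2) log(1 + v x∕v y) for \(\hat{p}_U\) and p v x∕(2 v y) for the plug-in density are standard computations consistent with the formulas above, and log(1 + t) < t for t > 0 gives the domination):

```python
import numpy as np

def log_normal_pdf(y, mean, var):
    """Log density of N_p(mean, var * I_p), evaluated rowwise."""
    return -0.5 * (y.shape[-1] * np.log(2 * np.pi * var) + np.sum((y - mean) ** 2, axis=-1) / var)

rng = np.random.default_rng(4)
p, v_x, v_y, n_sim = 3, 1.0, 1.0, 200_000
mu = rng.standard_normal(p)                               # both risks are constant in mu

X = mu + np.sqrt(v_x) * rng.standard_normal((n_sim, p))
Y = mu + np.sqrt(v_y) * rng.standard_normal((n_sim, p))

# Kullback-Leibler loss = E_Y[log p'(Y | mu) - log p_hat(Y | X)], averaged over (X, Y).
log_true = log_normal_pdf(Y, mu, v_y)
risk_U = np.mean(log_true - log_normal_pdf(Y, X, v_x + v_y))   # p_hat_U: N_p(x, (v_x + v_y) I_p)
risk_plug = np.mean(log_true - log_normal_pdf(Y, X, v_y))      # plug-in: N_p(x, v_y I_p)

print(risk_U, p / 2 * np.log(1 + v_x / v_y))   # approximately equal
print(risk_plug, p * v_x / (2 * v_y))          # approximately equal, and larger than risk_U
```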
Lemma 3.17 (George et al. 2006, Lemma 2)
For W = (v y X + v x Y )∕(v x + v y) and v w = (v x v y)∕(v x + v y), let m π(W;v w) and m π(X;v x) be the marginals of W and X, respectively, relative to the prior π. Then
where \( \hat {p}_{U} ( \cdot |X)\) is the Bayesian predictive density associated with the uniform prior on \( \mathbb {R}^p\) given by (3.85). In addition, for any prior measure π, the Kullback-Leibler risk difference between \( \hat {p}_U(\cdot |x)\) and the Bayesian predictive density \( \hat {p}_{ \pi } ( \cdot |x)\) is given by
where E μ, v denotes the expectation with respect to the normal \(\mathcal {N}_p ( \mu ,vI_p)\) distribution.
Proof
The marginal density of (X, Y ) associated with π is equal to
Applying (3.85) and (3.86) it follows that
Since \(\hat {p}_{\pi }(y|x) = \hat {p}_{\pi }(x,y) / m_\pi (x)\), (3.89) follows.
Hence, we can write the risk difference as
□
Using this lemma, George et al. (2006) gave a simple proof of the result of Liang and Barron (2004) for the Gaussian setting. By taking the same sequence of priors \( \{ \pi _k \} = \mathcal {N}_p(0, kI_p)\), the difference of the Bayes risk equals (using constancy of the risk of \(\hat {p}_U\))
Hence, we see that \( \lim _{k \rightarrow \infty } r( \pi _k, { \hat {p}_U} ) -r( \pi _k, \hat {p}_{ \pi _k} ) = 0\) and so \( \hat {p}_U\) is minimax by Lemma 1.8. George et al. (2006) also show that the best predictive invariant density is dominated by any Bayesian predictive density relative to a superharmonic prior. This result parallels the result of Stein for the estimation of the mean under quadratic loss and the use of differential operators discussed in Sect. 2.6. The following lemma from George et al. (2006) allows us to give sufficient conditions for domination. We use Stein’s identity in the proof.
Lemma 3.18
If m π(z;v x) is finite for any z, then for any v w ≤ v ≤ v x the marginal m π(z;v) is finite. In addition,
Proof
For any v w ≤ v ≤ v x,
Hence, the marginal m π is finite. Setting \(Z^\prime = (Z - \mu ) / \sqrt {v} \sim {\mathcal N} (0,I) \),
where
Note that
and
It follows that
Hence, using Stein’s identity,
which is the desired result. □
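Although the display (3.91) is not reproduced here, we assume it states (a form of) the standard heat-equation property of normal marginals, namely that m π(z; v) = ∫ ϕ p(z |μ, v I p) π(μ) dμ satisfies ∂m π∕∂v = (1∕2) Δ m π; this is consistent with the role Stein's identity plays in the proof above. A minimal numerical check (our own, for the Gaussian prior π = N p(0, τ² I p), in which case m π(z; v) is the N p(0, (v + τ²) I p) density):

```python
import numpy as np

def marginal(z, v, tau2):
    """m_pi(z; v) when pi = N_p(0, tau2 * I_p): the N_p(0, (v + tau2) * I_p) density at z."""
    p, c = len(z), v + tau2
    return (2 * np.pi * c) ** (-p / 2) * np.exp(-np.sum(z ** 2) / (2 * c))

def laplacian(f, z, h=1e-3):
    """Central finite-difference Laplacian of f at the point z."""
    total = 0.0
    for i in range(len(z)):
        e = np.zeros_like(z); e[i] = h
        total += (f(z + e) - 2 * f(z) + f(z - e)) / h ** 2
    return total

z, v, tau2, h = np.array([0.7, -1.2, 0.4]), 0.8, 2.0, 1e-3
dm_dv = (marginal(z, v + h, tau2) - marginal(z, v - h, tau2)) / (2 * h)
print(dm_dv, 0.5 * laplacian(lambda u: marginal(u, v, tau2), z))  # the two values agree
```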
Lemmas 3.17 and 3.18 give a result regarding minimaxity and domination from George et al. (2006). This result parallels those on minimax estimation of the mean under quadratic loss in Sect. 3.1.1. Its proof is contained in the proof of Theorem 3.17.
Theorem 3.16
Assume that m π(z;v x) is finite for any z in \( \mathbb {R}^p.\) If Δm π ≤ 0 for all v w ≤ v ≤ v x, then the Bayesian predictive density \( \hat {p}_{ \pi } (y|x)\) is minimax and dominates \( \hat {p}_U\) (when π is not the uniform itself). If Δπ ≤ 0, then the Bayesian predictive density \( \hat {p}_{\pi } (y|x)\) is minimax and dominates \( \hat {p}_U\) (when π is not the uniform itself).
The next result from Brown et al. (2008) illuminates the link between the two problems of estimating the predictive density under the Kullback-Leibler loss and estimating the mean under quadratic loss. The result expresses this link in terms of risk differences.
Theorem 3.17
Suppose the prior π(μ) is such that the marginal m π(z;v) is finite for any \(z \in \mathbb {R}^p\) . Then,
Proof
From (3.90) and (3.91) it follows that
On the other hand, Stein (1981) showed that
Hence substituting (3.98) in the integral (3.97) gives (3.96). □
It is worth noting that using (3.88) and (3.96) leads to the following expression for the Kullback-Leibler risk of \( \hat {p}_U\):
The area of predictive density estimation continues to develop. Recent research covers the case of restricted parameters (Fourdrinier et al. 2011), general α-divergence losses (Maruyama and Strawderman 2012; Boisbunon and Maruyama 2014), and integrated L1 and L2 losses (Kubokawa et al. 2015, 2017). For a general review, see George and Xu (2010).
References
Aitchison J (1975) Goodness of prediction fit. Biometrika 62:547–554
Alam K (1973) A family of admissible minimax estimators of the mean of a multivariate normal distribution. Ann Stat 1:517–525
Baranchik A (1970) A family of minimax estimators of the mean of a multivariate normal distribution. Ann Math Stat 41:642–645
Berger JO (1976) Inadmissibility results for the best invariant estimator of two coordinates of a location vector. Ann Stat 4(6):1065–1076
Berger JO (1985a) Statistical decision theory and Bayesian analysis, 2nd edn. Springer, New York
Berger JO, Srinivasan C (1978) Generalized Bayes estimators in multivariate problems. Ann Stat 6(4):783–801
Berger JO, Strawderman WE (1996) Choice of hierarchical priors. Admissibility in estimation of normal means. Ann Stat 24:931–951
Boisbunon A, Maruyama Y (2014) Inadmissibility of the best equivariant predictive density in the unknown variance case. Biometrika 101(3):733–740
Brown LD (1971) Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann Math Stat 42:855–903
Brown LD, George I, Xu X (2008) Admissible predictive density estimation. Ann Stat 36:1156–1170
Carvalho C, Polson N, Scott J (2010) The horseshoe estimator for sparse signals. Biometrika 97:465–480
Casella G, Strawderman WE (1981) Estimating a bounded normal mean. Ann Stat 9:870–878
Doob JL (1984) Classical potential theory and its probabilistic counterpart. Springer, Berlin/Heidelberg/New York
Efron B, Morris C (1976) Families of minimax estimators of the mean of a multivariate normal distribution. Ann Stat 4(1):11–21
Faith RE (1978) Minimax Bayes point estimators of a multivariate normal mean. J Multivar Anal 8:372–379
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fourdrinier D, Strawderman WE, Wells MT (1998) On the construction of Bayes minimax estimators. Ann Stat 26:660–671
Fourdrinier D, Marchand E, Righi A, Strawderman WE (2011) On improved predictive density estimation with parametric constraints. Electron J Stat 5:172–191
George EI (1986a) A formal Bayes multiple shrinkage estimator. Commun Stat Theory Methods 15(7):2099–2114
George EI (1986b) Minimax multiple shrinkage estimation. Ann Stat 14(1):188–205
George EI, Xu X (2010) Bayesian predictive density estimation. In: Chen M-H, Müller P, Sun D, Ye K, Dey DK (eds) Frontiers of statistical decision making and Bayesian analysis in honor of James O Berger. Springer, New York, pp 83–95
George EI, Feng L, Xu X (2006) Improved minimax predictive densities under Kullback-Leibler loss. Ann Stat 34:78–91
Gomez-Sanchez-Manzano E, Gomez-Villegas M, Marin J (2008) Multivariate exponential power distributions as mixtures of normal distributions with Bayesian applications. Commun Stat Theory Methods 37:972–985
Griffin JE, Brown PJ (2010) Inference with normal-gamma prior distributions in regression problems. Bayesian Anal 5(1):171–188. https://doi.org/10.1214/10-BA507
Johnstone IM, Silverman BW (2004) Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann Stat 32(4):1594–1649
Ki F, Tsui KW (1990) Multiple-shrinkage estimators of means in exponential families. Can J Stat/La Revue Canadienne de Statistique 18:31–46
Komaki F (2001) A shrinkage predictive distribution for multivariate normal observables. Biometrika 88:859–864
Kubokawa T, Strawderman WE (2007) On minimaxity and admissibility of hierarchical Bayes estimators. J Multivar Anal 98(4):829–851
Kubokawa T, Marchand E, Strawderman WE (2015) On predictive density estimation for location families under integrated squared error loss. J Multivar Anal 142:57–74
Kubokawa T, Marchand E, Strawderman WE (2017) On predictive density estimation for location families under integrated absolute error loss. Bernoulli 23(4B):3197–3212
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, New York
Liang F, Barron A (2004) Exact minimax strategies for predictive density estimation, data compression and model selection. IEEE Trans Inf Theory 50:2708–2726
Marchand É, Perron F (2001) Improving on the MLE of a bounded normal mean. Ann Stat 29(4):1078–1093
Maruyama Y (1998) A unified and broadened class of admissible minimax estimators of a multivariate normal mean. J Multivar Anal 64:196–205
Maruyama Y, Strawderman WE (2005) A new class of generalized Bayes minimax ridge regression estimators. Ann Stat 33:1753–1770
Maruyama Y, Strawderman WE (2012) Bayesian predictive densities for linear regression models under α-divergence loss: some results and open problems. In: Fourdrinier D, Marchand E, Rukhin AL (eds) Contemporary developments in Bayesian analysis and statistical decision theory: a Festschrift for William E. Strawderman, vol 8. Institute of Mathematical Statistics, Beachwood, pp 42–56
Muirhead RJ (1982) Aspects of multivariate statistics. Wiley, New York
Murray GD (1977) A note on the estimation of probability density functions. Biometrika 64:150–152
Ng VM (1980) On the estimation of parametric density functions. Biometrika 67:505–506
Robert CP (1994) The Bayesian choice: a decision theoretic motivation. Springer, New York
Sacks J (1963) Generalized Bayes solutions in estimation problems. Ann Math Stat 34:751–768
Stein C (1973) Estimation of the mean of a multivariate normal distribution. In: Proceedings of Prague symposium asymptotic statistics, pp 345–381
Stein C (1981) Estimation of the mean of multivariate normal distribution. Ann Stat 9:1135–1151
Strawderman WE (1971) Proper Bayes minimax estimators of the multivariate normal mean. Ann Math Stat 42:385–388
Strawderman WE (1973) Proper Bayes minimax estimators of the multivariate normal mean vector for the case of common unknown variances. Ann Stat 1:1189–1194
Strawderman WE (2003) On minimax estimation of a normal mean vector for general quadratic loss. In: Moore M, Froda S, Leger C (eds) Mathematical statistics and applications: Festschrift for constance van Eeden. Lecture notes–monograph series, vol 42. Institute of Mathematical Statistics, Beachwood, pp 3–14
Strawderman RL, Wells MT (2012) On hierarchical prior specifications and penalized likelihood. In: Fourdrinier D, Marchand E, Rukhin AL (eds) Contemporary developments in Bayesian analysis and statistical decision theory: a Festschrift for William E. Strawderman, collections, vol 8. Institute of Mathematical Statistics, Beachwood, pp 154–180
Strawderman RL, Wells MT, Schifano ED (2013) Hierarchical Bayes, maximum a posteriori estimators, and minimax concave penalized likelihood estimation. Electron J Stat 7:973–990
Takada Y (1979) Stein’s positive part estimator and Bayes estimator. Ann Inst Stat Math 31:177–183
Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58(1):267–288
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused LASSO. J R Stat Soc Ser B 67(1):91–108
Wells MT, Zhou G (2008) Generalized Bayes minimax estimators of the mean of multivariate normal distribution with unknown variance. J Multivar Anal 99:2208–2220
Widder DV (1946) The Laplace transform. Princeton University Press, Princeton
Wither C (1991) A class of multiple shrinkage estimators. Ann Inst Stat Math 43:147–156
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67
Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zinodiny S, Strawderman WE, Parsian A (2011) Bayes minimax estimation of the multivariate normal mean vector for the case of common unknown variance. J Multivar Anal 102(9):1256–1262