5.1 Introduction
In Chaps. 2 and 3 we studied estimators that improve over the “usual” estimator of the location vector for the case of a normal distribution. In this chapter, we extend the discussion to spherically symmetric distributions discussed in Chap. 4. Section 5.2 is devoted to a discussion of domination results for Baranchik type estimators while Sect. 5.3 examines more general estimators. Section 5.4 discusses Bayes minimax estimation. Finally, Sect. 5.5 discusses estimation with a concave loss.
We close this introductory section by extending the discussion of Sect. 2.2 on the empirical Bayes justification of the James-Stein estimator to the general multivariate (but not necessarily normal) case.
Suppose X has a p-variate distribution with density f(∥x − θ∥2), unknown location vector θ and known scale matrix σ 2 I p. The problem is to estimate θ under loss L(θ, δ) = ∥δ − θ∥2. Let the prior distribution on θ be given by π(θ) = f ⋆n(θ), the n-fold convolution of the density f(⋅) with itself. Note that the distribution of θ is the same as that of \(\sum _{i=1}^{n} Y_i \) where the Y i are iid with density f(⋅). Recall from Example 1.3 that the Bayes estimator of θ is given by
$$\displaystyle \begin{aligned}\delta^{\pi}(X) = \Big(1 - \frac{1}{n+1}\Big)X = \frac{n}{n+1}\,X.\end{aligned} $$
Assume now that n is unknown. Since
$$\displaystyle \begin{aligned}E[X^{T} X] = E\big[\Vert \theta\Vert^2\big] + p\,\sigma^2 = n\,p\,\sigma^2 + p\,\sigma^2 = (n+1)\,p\,\sigma^2,\end{aligned} $$
an unbiased estimator of n + 1 is X T X∕(pσ 2), and so p σ 2∕(X T X) is a reasonable estimator of 1∕(n + 1). Substituting p σ 2∕(X T X) for 1∕(n + 1) in the Bayes estimator, we have that
$$\displaystyle \begin{aligned}\delta^{EB}(X) = \Big(1 - \frac{p\,\sigma^2}{X^{T} X}\Big)X\end{aligned} $$
can be viewed as an empirical Bayes estimator of θ without any assumption on the form of the density (and in fact there is not even any need to assume there is a density). Hence this Stein-like estimator can be viewed as a reasonable alternative to X from an empirical Bayes perspective regardless of the form of the underlying distribution.
Note that Diaconis and Ylvisaker (1979) introduced the prior f ⋆n(θ) as a reasonable conjugate prior for location families since it gives linear Bayes estimators. Strawderman (1992) gave the above empirical Bayes argument. In the normal case the sequence of priors corresponds to that in Sect. 2.2.3 with τ 2 = n σ 2. The shrinkage factor p σ 2 in the present argument differs from (p − 2) σ 2 in the normal case since in this general case we use a “plug-in” estimator of 1∕(n + 1) as opposed to the unbiased estimator (in the normal case) of 1∕(σ 2 + τ 2).
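The substitution argument above can be sketched numerically. The function name and the default σ² = 1 below are illustrative choices of mine, not the text's:

```python
import numpy as np

def empirical_bayes_shrinkage(x, sigma2=1.0):
    """Stein-like empirical Bayes estimate (1 - p*sigma^2 / x'x) x.

    Follows the substitution argument above: p*sigma^2 / x'x is
    plugged in for 1/(n+1) in the linear Bayes estimator n x/(n+1).
    """
    x = np.asarray(x, dtype=float)
    p = x.size
    shrink = 1.0 - p * sigma2 / np.dot(x, x)
    return shrink * x
```

Note that nothing in the function depends on the form of the sampling density, matching the distribution-free nature of the argument.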
5.2 Baranchik-Type Estimators
In this section, assuming that X has a spherically symmetric distribution with mean vector θ and that loss is L(θ, δ) = ∥δ − θ∥2, we consider estimators of the Baranchik-type, as (2.19) in the normal setting, for different families of densities. In Sect. 5.3, we consider results for general estimators of the form X + g(X).
5.2.1 Variance Mixtures of Normal Distributions
We first consider spherically symmetric densities which are variance mixtures of normal distributions. Suppose
$$\displaystyle \begin{aligned}f(\Vert x - \theta\Vert^2) = \bigg(\frac{1}{\sqrt {2\pi}}\bigg)^p \int_0^\infty v^{-p/2} \exp \left\{- \frac{\Vert x -\theta\Vert^2}{2v}\right\} d G(v)\end{aligned} $$ (5.1)
where G(⋅) is a probability distribution on (0, ∞), i.e., a mixture of \(\mathcal {N}_p(\theta , vI)\) distributions with mixing distribution G(⋅).
Our first result gives a domination result for Baranchik type estimators for such distributions. This result is analogous to Theorem 2.3 in the normal case.
Theorem 5.1 (Strawderman 1974b)
Let X have density of the form (5.1) and let
$$\displaystyle \begin{aligned}\delta_{a,r}^{B}(X) = \Big(1 - \frac{a\,r(\Vert X\Vert^2)}{\Vert X\Vert^2}\Big)X\end{aligned} $$
where the function r(⋅) is absolutely continuous. Assume the expectations E[V ] and E[V −1] are finite where V has distribution G. Then \(\delta _{a,r}^B(X)\) is minimax for the loss L(θ, δ) = ∥δ − θ∥2 provided
(1) 0 ≤ a ≤ 2(p − 2)∕E[V −1],
(2) 0 ≤ r(t) ≤ 1 for any t ≥ 0,
(3) r(t) is nondecreasing in t, and
(4) r(t)∕t is nonincreasing in t.
Furthermore, \(\delta _{a,r}^B(X)\) dominates X provided the inequalities in (1) are strict, or those in (2) are strict on a set of positive measure, or r ′(t) is strictly positive on a set of positive measure.
Proof
The proof proceeds by calculating the conditional risk given V = v, noting that the distribution of X|V = v is normal N(θ, vI p). First note that E[V ] < ∞ is equivalent to E 0[∥X∥2] < ∞ so that the risk of X is finite. Similarly, it can be seen that E[V −1] < ∞ if and only if E 0[∥X∥−2] < ∞. Then, thanks to (2), we have E 0[r 2(∥X∥2)∥X∥−2] < ∞. Actually, we will see below that, for any θ, E θ[∥X∥−2] ≤ E 0[∥X∥−2], and hence, E θ[r 2(∥X∥2)∥X∥−2] < ∞ which guarantees that the risk of \(\delta _{a,r}^B(X)\) is finite. Note that, conditionally on V , ∥X∥2∕V has a noncentral chi-square distribution with p degrees of freedom and noncentrality parameter ∥θ∥2∕V . Hence, since the family of noncentral chi-square distributions has monotone (increasing) likelihood ratio in the noncentrality parameter (and therefore is stochastically increasing), ∥X∥2∕V is (conditionally) stochastically decreasing in V and increasing in ∥θ∥2.
Hence,
and, as a result,
This suffices to establish finiteness of the risk of \(\delta _{a,r}^B(X)\). We now deal with the main part of the theorem. Using Corollary 2.1 and Theorem 2.3, we have
since r 2(∥X∥2) ≤ r(∥X∥2) and r ′(∥X∥2) ≥ 0. Now, as a consequence of the above monotone likelihood property, ∥X∥2∕V is stochastically decreasing in V . It follows that the conditional expectation in (5.2) is nondecreasing in V since, if v 1 < v 2, we have
The first inequality follows since r(∥X∥2) is nondecreasing while the second since r(t)∕t is nonincreasing and ∥X∥2∕V is stochastically decreasing in V . Finally, using the fact that aV −1 − 2(p − 2) is decreasing in V , and the fact that E[g(Y )h(Y )] ≤ E[g(Y )]E[h(Y )] if g and h are monotone in opposite directions, it follows that
by assumption (1). Hence \(\delta _{a,r}^B(X)\) is minimax, since X is minimax.
The dominance result follows since the inequality in (5.2) is strict if there is strict inequality in (2) or if r ′(⋅) is strictly positive on a set of positive measure and the inequality in (5.3) is strict if the inequalities in (1) are strict. □
Example 5.1 (The multivariate Student-t distribution)
If V has an inverse Gamma (ν∕2, ν∕2) distribution (that is, \(V \sim \nu/\chi _{\nu}^2\)), then the distribution of X is a multivariate Student-t distribution with ν degrees of freedom. Since \(E[V] = E[\nu/\chi _{\nu}^2] = \nu/(\nu - 2)\) for ν > 2 and \(E[V^{-1}] = E[\chi _{\nu}^2/\nu]=1\), the conditions of Theorem 5.1 require 0 ≤ a ≤ 2(p − 2) and ν > 2.
Example 5.2 (Examples of the function r(t))
The James-Stein estimator has r(t) ≡ 1 and hence satisfies conditions (2), (3) and (4) of Theorem 5.1. Also r(t) = t∕(t + b), for b > 0, satisfies these conditions. Similarly, the positive-part James-Stein estimator (1 − a∕X T X)+ X is such that
$$\displaystyle \begin{aligned}r(t) = \min(t/a,\, 1)\end{aligned} $$
and
$$\displaystyle \begin{aligned}r(t)/t = \min(1/a,\, 1/t),\end{aligned} $$
and hence also satisfies conditions (2), (3) and (4) of Theorem 5.1.
It is worth noting, and easy to see, that if the sampling distribution is N(θ, I p) and the prior distribution is any variance mixture of normal distributions as in (3.4), in the Baranchik representation of the Bayes estimator (see Corollary 3.1), the function r(t)∕t is always nonincreasing. This fact leads to the following observation on the (sampling distribution) robustness of Bayes minimax estimators for a normal sampling distribution. If δ π(X) = (1 − a r(∥X∥2)∕∥X∥2)X is a Bayes minimax estimator with respect to a scale mixture of normal priors for a N(θ, I p) sampling distribution, and if r(t) is nondecreasing, this Bayes minimax estimator remains minimax for a multivariate-t sampling distribution in Example 5.1 as long as the degrees of freedom is greater than two.
It is also interesting to note that, in general, there will be no uniformly optimal choice of the shrinkage constant “a” in the James-Stein estimator if the mixing distribution G(⋅) is nondegenerate. The optimal choice will typically depend on ∥θ∥2. This is in contrast to the normal sampling distribution case, where G(⋅) is degenerate, and where the optimal choice is a = (p − 2)σ 2.
5.2.2 Densities with Tails Flatter Than the Normal
In this section we consider the subclass of spherically symmetric densities f(∥x − θ∥2) such that, for any t ≥ 0 for which f(t) > 0,
$$\displaystyle \begin{aligned}\frac{F(t)}{f(t)} \ge c\end{aligned} $$ (5.4)
for some fixed positive c, where
$$\displaystyle \begin{aligned}F(t) = \frac{1}{2}\int_t^\infty f(s)\, ds.\end{aligned} $$ (5.5)
This class was introduced in Berger (1975) (without the constant 1/2 multiplier).
This class of densities contains a large subclass of variance mixtures of normal densities but also many others. The following lemma gives some conditions which guarantee inclusion or exclusion from the class satisfying (5.4) and (5.5).
Lemma 5.1
Suppose X has density f(∥x − θ∥2).
(1) (Mixture of normals). If, for some distribution G on (0, ∞),
$$\displaystyle \begin{aligned}f(\Vert x - \theta\Vert^2) = \bigg(\frac{1}{\sqrt {2\pi}}\bigg)^p \int_0^\infty v^{-p/2} \exp \left\{- \frac{\Vert x -\theta\Vert^2}{2v}\right\} d G(v)\end{aligned} $$
where E[V −p∕2] is finite, E denoting the expectation with respect to G, then f(⋅) is in the class (5.4) with c = E[V −p∕2+1]∕E[V −p∕2] for p ≥ 3.
(2) If \(f(t) = h(t)\,e^{-at}\) with h(t) nondecreasing and a > 0, then f(⋅) is in the class (5.4).
(3) If \(f(t) = e^{-a t g(t)}\) where g(t) is nondecreasing and \(\lim_{t\to\infty} g(t) = \infty\), then f(t) is not in the class (5.4).
Proof
(1) Applying the definition of F in (5.5) we have, by Fubini's theorem,
$$\displaystyle \begin{aligned}F(t) = \frac{1}{2}\int_t^\infty \bigg(\frac{1}{\sqrt{2\pi}}\bigg)^p \int_0^\infty v^{-p/2} \exp \left\{-\frac{s}{2v}\right\} dG(v)\, ds = \bigg(\frac{1}{\sqrt{2\pi}}\bigg)^p \int_0^\infty v^{-p/2+1} \exp \left\{-\frac{t}{2v}\right\} dG(v).\end{aligned} $$
Hence the ratio in (5.4) equals
$$\displaystyle \begin{aligned}\frac{F(t)}{f(t)} = \frac{\int_0^\infty v^{-p/2+1} \exp \left\{-t/2v\right\} dG(v)}{\int_0^\infty v^{-p/2} \exp \left\{-t/2v\right\} dG(v)} \ge \frac{E[V^{-p/2+1}]}{E[V^{-p/2}]}.\end{aligned} $$
The inequality follows since the family of densities proportional to the function \(v \mapsto v^{-p/2} \, \exp \left \{-t/2v\right \}\) has monotone (increasing) likelihood ratio in the parameter t. Note that if p ≥ 3, E[V −p∕2] < ∞ implies E[V −p∕2+1] < ∞. This completes the proof of (1).
(2) In this case it follows that
$$\displaystyle \begin{aligned}F(t) = \frac{1}{2}\int_t^\infty h(s)\,e^{-as}\, ds \ge \frac{h(t)}{2}\int_t^\infty e^{-as}\, ds = \frac{1}{2a}\,h(t)\,e^{-at} = \frac{1}{2a}\,f(t).\end{aligned} $$
Hence (5.4) is satisfied with c = 1∕2a, which proves (2).
(3) In this case, since g is nondecreasing,
$$\displaystyle \begin{aligned}F(t) = \frac{1}{2}\int_t^\infty e^{-asg(s)}\, ds \le \frac{1}{2}\int_t^\infty e^{-asg(t)}\, ds = \frac{e^{-atg(t)}}{2ag(t)} = \frac{f(t)}{2ag(t)},\end{aligned} $$
so that F(t)∕f(t) ≤ 1∕(2ag(t)) → 0 as t →∞. Hence f(t) is not in the class (5.4), which shows (3). □
Part (2) of the lemma shows that densities with tails flatter than the normal (and including the normal) are in the class (5.4), while densities with tails “sufficiently lighter” than the normal are not included. Also the condition in part (3) is stronger than necessary in that it suffices that the condition hold only for all t larger than some positive K. See Berger (1975) for further details and discussion.
Example 5.3
Some specific examples in the class (5.4) include (see Berger 1975 for more details)
The latter two distributions are known as the logistic-type and Kotz distributions, respectively.
The following lemma plays the role of Stein’s lemma (Theorem 2.1) for the family of spherically symmetric densities.
Lemma 5.2
Let X have density f(∥x − θ∥2) and let g(X) be a weakly differentiable function such that E θ[|(X − θ)T g(X)|] < ∞. Then
$$\displaystyle \begin{aligned}E_\theta\big[(X - \theta)^{T} g(X)\big] = C\, E_\theta^*\big[\mathrm{div}\, g(X)\big]\end{aligned} $$
where F(t) is defined as in (5.5) and \(E_\theta ^*\) denotes expectation with respect to the density
$$\displaystyle \begin{aligned}x \mapsto \frac{1}{C}\, F(\Vert x - \theta\Vert^2),\end{aligned} $$
and where it is assumed that the normalizing constant \(C = \int_{\mathbb{R}^p} F(\Vert x - \theta\Vert^2)\, dx\) is finite.
Proof
Note that the existence of the expectations in Lemma 5.2 will be guaranteed for any function g(x) such that E θ[∥g(X)∥2] < ∞ as soon as E 0[∥X∥2] < ∞. The proof follows along the lines of Sect. 2.4, making use of Stokes’ theorem. It follows that
□
Now, with the important analog of Stein’s lemma in hand, we can extend some of the minimaxity results from the Gaussian setting to the case of spherically symmetric distributions. The following result gives conditions for minimaxity of estimators of the Baranchik type.
Theorem 5.2
Let X have density f(∥x − θ∥2) which satisfies (5.4) for some 0 < c < ∞. Assume also that E 0[∥X∥2] < ∞ and E 0[∥X∥−2] < ∞. Let
$$\displaystyle \begin{aligned}\delta_{a,r}^{B}(X) = \Big(1 - \frac{a\,r(\Vert X\Vert^2)}{\Vert X\Vert^2}\Big)X\end{aligned} $$
where r(⋅) is absolutely continuous. Then \(\delta _{a,r}^B(X)\) is minimax for p ≥ 3 provided
(1) 0 < a ≤ 2 c (p − 2),
(2) 0 ≤ r(t) ≤ 1, and
(3) r(⋅) is nondecreasing.
Furthermore \( \delta _{a,r}^B(X) \) dominates X provided both inequalities are strict in (1) or in (2) on a set of positive measure or if r ′(⋅) is strictly positive on a set of positive measure.
Proof
We note that the conditions ensure finiteness of the risk so that Lemma 5.2 is applicable. Hence we have
by Lemma 5.2. Therefore the risk difference between \(\delta _{a,r}^B(X)\) and X equals
The domination part follows as in Theorem 5.1. □
Theorem 5.2 applies for certain densities for which Theorem 5.1 is not applicable and additionally lifts the restriction that r(t)∕t is nonincreasing. However, if the density is a mixture of normals, and both theorems apply, the shrinkage constant “a” given by Theorem 5.1 (with a = 2(p − 2)∕E[V −1]) is strictly larger than that for Theorem 5.2 (with a = 2(p − 2)c) whenever the mixing distribution G(⋅) is not degenerate. To see this note that, since V −1 and V −p∕2+1 are both decreasing functions of V for p ≥ 3, the covariance inequality gives
$$\displaystyle \begin{aligned}E[V^{-p/2}] = E\big[V^{-1}\, V^{-p/2+1}\big] > E[V^{-1}]\, E[V^{-p/2+1}]\end{aligned} $$
or equivalently
$$\displaystyle \begin{aligned}\frac{1}{E[V^{-1}]} > \frac{E[V^{-p/2+1}]}{E[V^{-p/2}]} = c\end{aligned} $$
whenever the positive random variable V is non-degenerate. Note also that E[V −1] < ∞ whenever E[V −p∕2] < ∞ and p ≥ 3.
Example 5.4 (The multivariate Student-t distribution, continued)
Suppose X has a p-variate Student-t distribution with ν degrees of freedom as in Example 5.1, so that V has an inverse Gamma(ν∕2, ν∕2) distribution. In this case
$$\displaystyle \begin{aligned}E[V^{-p/2}] = \Big(\frac{2}{\nu}\Big)^{p/2}\, \frac{\varGamma((\nu + p)/2)}{\varGamma(\nu/2)}\end{aligned} $$
which is finite for all ν > 0 and p > 0.
The bound on the shrinkage constant, “a”, in Theorem 5.1 is 2(p − 2) as shown in Example 5.1, while the bound on “a”, in Theorem 5.2, as indicated above, is given by
$$\displaystyle \begin{aligned}2(p-2)\,c = 2(p-2)\, \frac{E[V^{-p/2+1}]}{E[V^{-p/2}]} = 2(p-2)\, \frac{\nu}{\nu + p - 2}.\end{aligned} $$
Hence, for large p, the bound on the shrinkage factor “a” can be substantially less for Theorem 5.2 than for Theorem 5.1 in the case of a multivariate-t sampling distribution. Note that, for fixed p, as ν tends to infinity the smaller bound tends to the larger one (and the Student-t distribution tends to the normal).
Example 5.5 (Examples 5.3 continued)
All of the distributions in Example 5.3 satisfy the assumptions of Theorem 5.2 (under suitable moment conditions for the second density). It is interesting to note that for the Kotz distribution, the value of c (= 1), as in (5.4), doesn’t depend on the parameter n > 0. Hence the bound on the shrinkage factor “a” is 2(p − 2) and is also independent of n, indicating a certain distributional robustness of the minimaxity property of Baranchik type estimators with a < 2(p − 2).
With additional assumptions on the function F(t)∕f(t) in (5.4) (i.e. it is either monotone increasing or monotone decreasing), theorems analogous to Theorem 5.2 can be developed which further improve the bounds on the shrinkage factor “a”. These typically may involve additional assumptions on the function r(⋅). We will see examples of this type in the next section.
5.3 More General Minimax Estimators
We now consider minimaxity of general estimators of the form X + a g(X). The initial results rely on Lemma 5.2. The first result follows immediately from this lemma and gives an expression for the risk.
Corollary 5.1
Let X have a density f(∥x − θ∥2) such that E 0[∥X∥2] < ∞ and let g(X) be weakly differentiable and be such that E θ[∥g(X)∥2] < ∞.
Then, for loss L(θ, δ) = ∥δ − θ∥2 , the risk of X + a g(X) can be expressed as
$$\displaystyle \begin{aligned}R(\theta, X + a\,g(X)) = E_\theta\big[\Vert X - \theta\Vert^2\big] + E_\theta\big[a^2\, \Vert g(X)\Vert^2 + 2\,a\, Q(\Vert X - \theta\Vert^2)\, \mathrm{div}\, g(X)\big]\end{aligned} $$
where
$$\displaystyle \begin{aligned}Q(t) = \frac{F(t)}{f(t)}\end{aligned} $$ (5.8)
and where F(∥X − θ∥2) is defined in (5.5).
An immediate consequence of Corollary 5.1 when the density of f satisfies (5.4), i.e. Q(t) ≥ c > 0 for some constant c, is the following.
Corollary 5.2
Under the assumptions of Corollary 5.1 , assume that, for some c > 0, we have Q(t) ≥ c for any t ≥ 0. Then X + g(X) is minimax and dominates X provided, for any \(x\in \mathbb {R}^p\) ,
$$\displaystyle \begin{aligned}2\,c\ \mathrm{div}\, g(x) + \Vert g(x)\Vert^2 \le 0\end{aligned} $$
with strict inequality on a set of positive measure.
The following two theorems establish minimaxity results under the assumption that Q(t) is monotone.
Theorem 5.3 (Brandwein et al. 1993)
Suppose X has density f(∥x − θ∥2) such that E 0[∥X∥2] < ∞ and that Q(t) in (5.8) is nonincreasing. Suppose there exists a nonpositive function h(⋅) such that E R,θ[h(U)] is nondecreasing in R, where U ∼ U R,θ (the uniform distribution on the sphere of radius R centered at θ), and such that E θ[|h(X)|] < ∞. Furthermore suppose that g(X) is weakly differentiable and also satisfies
(1) div g(X) ≤ h(X),
(2) ∥g(X)∥2 + 2 h(X) ≤ 0, and
(3) 0 ≤ a ≤ E 0(∥X∥2)∕p.
Then δ(X) = X + ag(X) is minimax. Also δ(X) dominates X provided g(⋅) is nonzero with positive probability and strict inequality holds with positive probability in (1) or (2) , or both inequalities are strict in (3) .
Proof
Note that g(x) satisfies the conditions of Corollary 5.1. Then we have
where E R,θ is as above and E denotes the expectation with respect to the radial distribution. Now, using (1) and (2), we have
by the monotonicity assumptions on E R,θ[h(⋅)] and Q(t) as well as the covariance inequality.
Hence, since − h(X) ≥ 0, we have R(θ, δ) ≤ R(θ, X), provided 0 ≤ a ≤ E[Q(R 2)]. Now E[Q(R 2)] = E 0[∥X∥2]∕p by Lemma 5.3 below, hence δ is minimax. The domination result follows since the additional conditions imply strict inequality between the risks. □
Lemma 5.3
For any k > −p such that E[R k+2] < ∞,
$$\displaystyle \begin{aligned}E\big[R^k\, Q(R^2)\big] = \frac{E[R^{k+2}]}{k+p}.\end{aligned} $$
In particular, we have
$$\displaystyle \begin{aligned}E\big[Q(R^2)\big] = \frac{E[R^2]}{p} = \frac{E_0[\Vert X\Vert^2]}{p}\end{aligned} $$
and, for p ≥ 3,
$$\displaystyle \begin{aligned}E\Big[\frac{Q(R^2)}{R^2}\Big] = \frac{1}{p-2}.\end{aligned} $$
Proof
Recall that the radial density φ(r) of R = ∥X − θ∥ can be expressed as φ(r) = σ(S)r p−1 f(r 2) where σ(S) is the area of the unit sphere S in R p. By (5.8) and (5.5), we have
Note that positivity of integrands and E[R k+2] < ∞ implies E[R k Q(R 2)] < ∞. □
The next theorem reverses the monotonicity assumption on Q(⋅) and changes the condition on the function h(X) which, in turn, bounds the divergence of g(X).
Theorem 5.4 (Brandwein et al. 1993)
Suppose X has a density f(∥x − θ∥2) such that E 0[∥X∥2] < ∞ and E 0[1∕∥X∥2] < ∞ and such that Q(t) in (5.8) is nondecreasing. Suppose there exists a nonpositive function h(⋅) such that \(E_{R,\theta } \left [R^2 h(U)\right ]\) is nonincreasing in R, where U ∼ U R,θ, and such that E θ[−h(X)] < ∞.
Furthermore suppose that g(X) is weakly differentiable and also satisfies
(1) div g(X) ≤ h(X),
(2) ∥g(X)∥2 + 2 h(X) ≤ 0, and
(3) \(0 \le a \le \frac {1}{(p-2)E_0(1/\Vert X\Vert ^2)}\).
Then δ(X) = X + a g(X) is minimax. Also δ(X) dominates X provided g(⋅) is nonzero with positive probability and strict inequality holds with positive probability in (1) or (2) , or both inequalities are strict in (3) .
Proof
As in the proof of Theorem 5.3, we have
where R 0 is a point such that \(a - Q(R_0^2) = 0\), provided such a point exists. Here we have used the version of the covariance inequality that states
$$\displaystyle \begin{aligned}E\big[f(X)\,g(X)\big] \le g(X_0)\, E\big[f(X)\big]\end{aligned} $$
provided that g(X) is nondecreasing (respectively, nonincreasing) and f(X) changes sign once from + to − (respectively, − to +) at X 0. But such a point R 0 does exist provided
since Q(R 2) is nondecreasing.
It follows that R(θ, δ) ≤ R(θ, X) provided that \(aE[\frac {1}{R^2}] \le E [\frac {Q(R^2)}{R^2}]\). However \(E[\frac {Q(R^2)}{R^2}] = \frac {1}{p-2}\) by Lemma 5.3 and hence the result follows as in Theorem 5.3. □
Note that the bound on “a” in both of these theorems is strictly larger than the bound in Theorem 5.2 provided Q(R 2) is not constant. This is so since the bound in Theorem 5.2 is based on \(c = \inf Q(R^2)\) while, in these results, the bound is equal to a (possibly weighted) average of Q(R 2).
We indicate the utility of these two results by applying them to the James-Stein estimator.
Corollary 5.3
Let X ∼ f(∥x − θ∥2) for p ≥ 4 and let \(\delta _b^{JS} (X) = (1 - b / \Vert X \Vert ^2 )X\) . Assume also that E 0[∥X∥2] < ∞ and E 0[1∕∥X∥2] < ∞. Then \(\delta _b^{JS}(X)\) is minimax and dominates X provided either
(1) Q(R 2) is nonincreasing and
$$\displaystyle \begin{aligned}0 < b < 2(p-2)\frac{E_0\Vert X\Vert^2}{p}, \end{aligned}$$ or
(2) Q(R 2) is nondecreasing and
$$\displaystyle \begin{aligned}0 < b < \frac{2}{E_0(1/\Vert X\Vert^2)}. \end{aligned}$$
Proof
We apply Theorems 5.3 and 5.4 with g(X) = −[2 (p − 2)∕∥X∥2]X, for which div g(X) = −2 (p − 2)2∕∥X∥2 = h(X). It follows from Lemma A.5 in Appendix A.10 that when p ≥ 4, E θ,R[h(U)] is nondecreasing in R and E θ,R[R 2 h(U)] is nonincreasing in R. Hence, if Q(R 2) is nonincreasing, Theorem 5.3 implies that
$$\displaystyle \begin{aligned}\delta_a(X) = X + a\,g(X) = \Big(1 - \frac{2(p-2)\,a}{\Vert X\Vert^2}\Big)X\end{aligned} $$
is minimax and dominates X provided 0 < a < E 0[∥X∥2]∕p, or equivalently 0 < 2 (p − 2) a < 2 (p − 2) E 0(∥X∥2)∕p, which is (1) with b = 2 (p − 2) a. Similarly, applying Theorem 5.4 when Q(R 2) is nondecreasing, we find that δ a(X) is minimax and dominates X if
$$\displaystyle \begin{aligned}0 < a < \frac{1}{(p-2)\,E_0(1/\Vert X\Vert^2)}, \quad \text{that is,} \quad 0 < b = 2(p-2)\,a < \frac{2}{E_0(1/\Vert X\Vert^2)},\end{aligned} $$
which is (2). □
Example 5.6 (Densities with increasing and decreasing Q(R 2))
Note first that variance mixtures of normal distributions have increasing Q(R 2) since, by (5.6) and (5.8), Q(R 2) may be viewed as the expected value of V with respect to a family of distributions with monotone increasing likelihood ratio in t = R 2. Note also that the bound for the shrinkage constant “a” in a James-Stein estimator is the same in Corollary 5.3 as it is in Theorem 5.1 for mixtures of normals.
We also note that, if we consider f(t) to be proportional to a density of a positive random variable, then 2 Q(t) is the reciprocal of the hazard rate. There is a large literature on increasing and decreasing hazard rates (see, for example, Barlow and Proschan 1981).
We note that the monotonicity of Q(t) may be determined in many cases by studying the log-convexity or the log-concavity of f(t). In particular, if ln f(t) is convex (concave), then Q(t) is nondecreasing (nonincreasing). To see this, note that
$$\displaystyle \begin{aligned}Q(t) = \frac{F(t)}{f(t)} = \frac{1}{2}\int_t^\infty \frac{f(s)}{f(t)}\, ds = \frac{1}{2}\int_0^\infty \frac{f(s+t)}{f(t)}\, ds\end{aligned} $$
and hence Q(t) will be nondecreasing (nonincreasing) if \(\frac {f(s+t)}{f(t)}\) is nondecreasing (nonincreasing) in t for each s > 0. But, assuming for simplicity that f is differentiable, for any t ≥ 0 such that f(t) > 0,
$$\displaystyle \begin{aligned}\frac{\partial}{\partial t}\,\frac{f(s+t)}{f(t)} = \frac{f(s+t)}{f(t)}\left[\frac{f^{\prime}(s+t)}{f(s+t)} - \frac{f^{\prime}(t)}{f(t)}\right].\end{aligned} $$
This is positive or negative when ln f(s + t) is convex or concave in t, respectively. For example if X has a Kotz distribution with parameter n, f(t) ∝ t n e −t∕2. Then \(\mathrm {ln}\, f(t) = K + n\, \mathrm {ln}\, t - \frac {t}{2}\) which is concave if n ≥ 0 and convex if n ≤ 0. Hence Q(t) is decreasing if n > 0 and increasing if n < 0. Of course the log-convexity (log-concavity) of f(t) is not a necessary condition for the nondecreasing (nonincreasing) monotonicity of Q(t). Thus, it is easy to check that \( f(t) \, \propto \, \exp (-t^2) \exp [-1/2 \int ^t_0 \exp (-u^2)\, \,du] \) leads to \(Q(t) = \exp (t^2) \), which is increasing. But \(\log f(t)\) is not convex.
An important class of distributions is covered by the following corollary.
Corollary 5.4
Let X ∼ f(∥x − θ∥2) for p ≥ 4 with \(f(t) \propto \exp (-\beta t^\alpha )\) where α > 0 and β > 0. Then \(\delta _b^{JS}(X) = (1 - b/\Vert X\Vert ^2)X\) is minimax and dominates X provided either
(1) α ≥ 1 and \(0 < b < \frac {2} {\beta ^{1/\alpha }} \, \frac {p-2} {p} \, \frac {\varGamma ((p+2)/2\alpha )} {\varGamma (p/2\alpha )}\), or
(2) α ≤ 1 and \(0 < b < \frac {2} {\beta ^{1/\alpha }} \, \frac {\varGamma ( p/2\alpha )} {\varGamma ((p- 2)/2\alpha )}\).
Proof
By the above discussion, Q(R 2) is nonincreasing (nondecreasing) for α ≥ 1 (α ≤ 1). Then the result follows from Corollary 5.3 and the fact that
$$\displaystyle \begin{aligned}E_0\big[\Vert X\Vert^k\big] = \frac{1}{\beta^{k/2\alpha}}\, \frac{\varGamma((p+k)/2\alpha)}{\varGamma(p/2\alpha)}\end{aligned} $$
for k > −p. □
The final theorem of this section gives conditions for minimaxity of estimators of the form X + a g(X) for general spherically symmetric distributions. Note that no density is needed for this result which relies on the radial distribution defined in Theorem 4.1.
We first need the following lemma which will play the role of the Stein lemma in the proof of the domination and minimaxity results.
Lemma 5.4
Let X have a spherically symmetric distribution around θ, and let g(X) be a weakly differentiable function such that E θ[ |(X − θ)T g(X)| ] < ∞. Then
$$\displaystyle \begin{aligned}E_\theta\big[(X - \theta)^{T} g(X)\big] = E\Big[\frac{R^2}{p}\, E_{\mathcal{V}_{R,\theta}}\big[\mathrm{div}\, g(X)\big]\Big]\end{aligned} $$
where E denotes the expectation with respect to the radial distribution and where \(\mathcal {V}_{R,\theta }(\cdot )\) is the uniform distribution on B R, θ , the ball of radius R centered at θ.
Proof
Let ρ be the radial distribution. Then, according to Theorem 4.1, we have
since the volume of B R,θ equals λ(B R,θ) = Rσ R,θ(S R,θ)∕p. □
Theorem 5.5 (Brandwein and Strawderman 1991a)
Let X have a spherically symmetric distribution around θ, and suppose E 0[∥X∥2] < ∞ and E 0[1∕∥X∥2] < ∞. Suppose there exists a nonpositive function h(⋅) such that h(X) is subharmonic and E R,θ[R 2 h(U)] is nonincreasing in R, where \(U\sim {\mathcal U}_{R,\theta }\), and such that E θ[|h(X)|] < ∞. Furthermore suppose that g(X) is weakly differentiable and also satisfies
(1) div g(X) ≤ h(X),
(2) ∥g(X)∥2 + 2 h(X) ≤ 0, and
(3) \(0 \le a \le \frac {1}{pE_0(1/\Vert X\Vert ^2)}\).
Then δ(X) = X + a g(X) is minimax. Also δ(X) dominates X provided g(⋅) is non-zero with positive probability and strict inequality holds with positive probability in (1) or (2) , or both inequalities are strict in (3) .
Proof
Using Lemma 5.4 and Conditions (1) and (2), we have
By subharmonicity of h (see Appendix A.8 and Sections 1.3 and 2.5 in du Plessis 1970),
Hence,
The last inequality follows from the monotonicity of E R,θ[R 2 h(U)] and the covariance inequality. Hence R(θ, δ) ≤ R(θ, X) when E[a∕R 2 − 1∕p] ≤ 0 which is equivalent to (3). The domination part follows as before. □
We note that the shrinkage constant in the above result 1∕{pE 0[1∕∥X∥2]} is somewhat smaller than the constant in Theorem 5.4 (a = 1∕{(p − 2)E 0[1∕∥X∥2]}), but Theorem 5.5 has essentially no restrictions on the distribution of X aside from moment conditions (which coincide in Theorems 5.4 and 5.5). In particular we do not even assume that a density exists! However there is an additional assumption of subharmonicity of h.
The following useful corollary gives minimaxity for James-Stein estimators in dimension p ≥ 4 for all spherically symmetric distributions with finite E 0[∥X∥2] and E 0[1∕∥X∥2].
Corollary 5.5
Let X have a spherically symmetric distribution with p ≥ 4, and suppose E 0[∥X∥2] < ∞ and E 0[1∕∥X∥2] < ∞. Then
$$\displaystyle \begin{aligned}\delta_b^{JS}(X) = \Big(1 - \frac{b}{\Vert X\Vert^2}\Big)X\end{aligned} $$
is minimax and dominates X provided
Proof
Here g(X) = −X∕∥X∥2 and is weakly differentiable for p ≥ 3. Then div g(X) = −(p − 2)∕∥X∥2 and ∥g(X)∥2 = 1∕∥X∥2 so that Conditions (1) and (2) of Theorem 5.5 are satisfied with h(X) = −α∕∥X∥2 where 1∕2 ≤ α ≤ p − 2. Now the subharmonicity of h(X) and its monotonicity condition hold since it is shown in the appendix that, for p ≥ 4, 1∕∥x∥2 is superharmonic (so that E R,θ[1∕∥U∥2] is nonincreasing in R) and that R 2 E R,θ[1∕∥U∥2] is nondecreasing in R.
Furthermore, it is worth noting that E R,θ[1∕∥U∥2] is nonincreasing in ∥θ∥ (see Lemma A.5 and remark that follows). Hence, for any \( \theta \in \mathbb {R} ^p \), we have E θ[−h(X)] < ∞ since
so that
by assumption. □
Example 5.7 (Nonspherical minimax estimators)
In Sect. 2.4.4, we considered estimators which shrink toward a subspace. Theorem 5.5 allows us to show that estimators of this type are minimax for general spherically symmetric distributions. To be specific, suppose V is an s-dimensional linear subspace with s < p and let
As in the proof of Theorem 2.6, it can be shown that the risk of δ a(X) equals
where Y 1, Y 2, ν 1 and ν 2 are as in Theorem 2.6.
In the present case, Y 2 has a spherically symmetric distribution about ν 2 of dimension p − s. Hence, by Theorem 5.5,
provided p − s ≥ 4 and
5.4 Bayes Estimators
In this section, we consider (generalized) Bayes estimators of the location vector \(\theta \in \mathbb {R}^p\) of a spherically symmetric distribution. More specifically let X be a random vector in \(\mathbb {R}^p\) with density f(∥x − θ∥2) and let π(θ) be a prior density. Under quadratic loss ∥δ − θ∥2, the (generalized) Bayes estimator of θ is the posterior mean given by
$$\displaystyle \begin{aligned}\delta^{\pi}(x) = \frac{\int_{\mathbb{R}^p} \theta\, f(\Vert x - \theta\Vert^2)\,\pi(\theta)\, d\theta}{m(x)}\end{aligned} $$ (5.10)
where m(x) is the marginal
$$\displaystyle \begin{aligned}m(x) = \int_{\mathbb{R}^p} f(\Vert x - \theta\Vert^2)\,\pi(\theta)\, d\theta.\end{aligned} $$
Recall from Sect. 3.1.1 that, in the normal case (that is, \(f(t) \propto \exp (-t/2\sigma ^2)\) with σ 2 known), the superharmonicity of \(\sqrt {m(x)}\) is a sufficient condition for minimaxity of δ π(X). This superharmonicity is implied by that of m(x) and, in turn, by that of π(θ). While minimaxity in the nonnormal case has been studied by many authors (for example, see Strawderman (1974b); Berger (1975); Brandwein and Strawderman (1978, 1991a)), relatively few results on minimaxity of Bayes estimators are known. The primary technique to establish minimaxity is through a Baranchik representation of the form (1 − a r (∥X∥2)∕∥X∥2)X. The minimaxity conditions are essentially those developed in Theorems 5.3 and 5.4 and most of the derivations are in the context of variance mixtures of normals. See Strawderman (1974b), Maruyama (2003a) and Fourdrinier et al. (2008) for more discussion and results on Bayes estimation in this setting.
The main difficulty in using Theorem 5.1 with mixtures of normals densities for the sampling distribution is to prove the monotonicity (and boundedness) properties of the function r(⋅). Maruyama (2003a) and Fourdrinier et al. (2008) consider priors which are mixtures of normals as well. Their main condition for obtaining minimaxity of the corresponding Bayes estimator is that the mixing density g of the sampling distribution has monotone nondecreasing likelihood ratio when considered as a scale parameter family. In Fourdrinier et al. (2008), explicit use is made of that monotone likelihood ratio property for the mixing (possibly generalized) density h of the prior distribution.
The main result of Fourdrinier et al. (2008) is the following. Consult that paper for the somewhat technical proof.
Theorem 5.6
Let X be a random vector in \(\mathbb {R}^p\) (p ≥ 3) distributed as a variance mixture of multivariate normal distributions with density
$$\displaystyle \begin{aligned}f(\Vert x - \theta\Vert^2) = \bigg(\frac{1}{\sqrt{2\pi}}\bigg)^p \int_0^\infty v^{-p/2} \exp \left\{- \frac{\Vert x -\theta\Vert^2}{2v}\right\} g(v)\, dv\end{aligned} $$
where g is the density of a known nonnegative random variable V . Let π be a (generalized) prior with density of the form
$$\displaystyle \begin{aligned}\pi(\theta) = \int_0^\infty \bigg(\frac{1}{\sqrt{2\pi t}}\bigg)^p \exp \left\{- \frac{\Vert \theta\Vert^2}{2t}\right\} h(t)\, dt\end{aligned} $$
where h is a function from \(\mathbb {R}_+\) into \(\mathbb {R}_+\) such that this integral exists.
Assume that the mixing density g is such that
Assume also that the mixing function h of the (possibly improper) prior density π is absolutely continuous and satisfies
for some β < p∕2 − 1 and some 0 < c < ∞. Assume, finally, that h and g have monotone increasing likelihood ratio when considered as a scale parameter family.
Then, if there exist K > 0, t 0 > 0 and α < 1 such that
the (generalized or proper) Bayes estimator δ h with respect to the prior distribution corresponding to the mixing function h is minimax provided that β satisfies
For priors with mixing distribution h satisfying (5.16) and (5.17) an argument as in Maruyama (2003a) using Brown (1979) and a Tauberian theorem suggests that the resulting generalized Bayes estimator is admissible if β ≤ 0. Maruyama and Takemura (2008) have verified this under additional conditions which imply, in the setting of Theorem 5.6, that E θ[∥X∥3] < ∞.
As an illustration assume that the sampling distribution is a p-variate Student-t with n 0 degrees of freedom which corresponds to the inverse gamma mixing density (n 0∕2, n 0∕2), that is, to \(g(v) \propto v^{-(n_0 + 2)/2} \exp (- n_0/2v)\). Let the prior be a Student-t distribution with n degrees of freedom, that is, with mixing density \(h(t) \propto t^{-(n+2)/2} \exp (-n/2t)\). It is clear that Conditions (5.14) and (5.15) are satisfied with n 0 ≥ 7. It is also clear that Condition (5.16) holds for any α < 1. Finally a simple calculation shows that
so that Condition (5.17) reduces to
Note that, as n > 0, this condition holds if and only if p ≥ 5 and
Other examples (including generalized priors) can be found in Fourdrinier et al. (2008).
In the following, we consider broader classes of spherically symmetric distributions which are not restricted to variance mixtures of normals. Minimaxity of generalized Bayes estimators is obtained for unimodal spherically symmetric superharmonic priors π(∥θ∥2) under the additional assumption that the Laplacian of π(∥θ∥2) is a nondecreasing function of ∥θ∥2. The results presented below are derived in Fourdrinier and Strawderman (2008a). An interesting feature is that their approach does not rely on the Baranchik representation used in Maruyama (2003a) and Fourdrinier et al. (2008). Note, however, that the superharmonicity property of the priors implies that the corresponding Bayes estimators cannot be proper (see Theorem 3.2).
First note that, for any prior π(θ), the Bayes estimator in (5.10) can be written as
$$\displaystyle \begin{aligned}\delta^{\pi}(X) = X + \frac{\nabla M(X)}{m(X)}\end{aligned} $$
where, for any \(X\in \mathbb {R}^p\),
$$\displaystyle \begin{aligned}M(X) = \int_{\mathbb{R}^p} F(\Vert X - \theta\Vert^2)\,\pi(\theta)\, d\theta\end{aligned} $$
with F given in (5.5). Thus δ π(X) has the general form δ π(X) = X + g(X) (with g(X) = ∇M(X)∕m(X)). If the density f(∥x − θ∥2) is as in Sect. 5.2.2, that is, such that F(t)∕f(t) ≥ c > 0 for some fixed positive constant c, then Corollary 5.2 applies and δ π(X) = X + g(X) = X + ∇M(X)∕m(X) is minimax provided, for any \(x\in \mathbb {R}^p\),
$$\displaystyle \begin{aligned}2\,c\ \mathrm{div}\Big(\frac{\nabla M(x)}{m(x)}\Big) + \Big\Vert \frac{\nabla M(x)}{m(x)}\Big\Vert^2 \le 0.\end{aligned} $$
In particular, it follows that if
$$\displaystyle \begin{aligned}\Delta M(x) \le 0\end{aligned} $$
and
$$\displaystyle \begin{aligned}\Vert \nabla M(x)\Vert^2 - 2\,c\, \nabla M(x) \cdot \nabla m(x) \le 0,\end{aligned} $$
δ π is minimax.
For a spherically symmetric prior π(∥θ∥2), the main result of Fourdrinier and Strawderman (2008a) is the following.
Theorem 5.7
Assume that X has a spherically symmetric distribution in \(\mathbb {R}^p\) with density f(∥x − θ∥2). Assume that \(\theta \in \mathbb {R}^p\) has a superharmonic prior π(∥θ∥2) such that π(∥θ∥2) is nonincreasing and Δπ(∥θ∥2) is nondecreasing in ∥θ∥2 . Assume also that
Then the Bayes estimator δ π is minimax under quadratic loss provided that f(t) is log-convex, \(c = \frac {F(0)}{f(0)} > 0\) and
To prove Theorem 5.7 we need some preliminary lemmas whose proofs are given in Appendix A.9 . Note first that it follows from the spherical symmetry of π that, for any \(x\in \mathbb {R}^p\), m(x) and M(x) are functions of t = ∥x∥2. Then, setting
we have
Lemma 5.5
Assume that π ′(t) ≤ 0, for any t ≥ 0. Then we have M ′(t) ≤ 0, for any t ≥ 0.
Lemma 5.6
For any \(x\in \mathbb {R}^p\) ,
and
where, for u ≥ 0 and for t ≥ 0,
and \(\mathcal {V}_{\sqrt {u}, x}\) is the uniform distribution on the ball \(B_{\sqrt {u}, x}\) of radius \(\sqrt {u}\) centered at x and λ(B) is the volume of the unit ball.
Lemma 5.7
For any t ≥ 0, the function H(u, t) in (5.22) is nondecreasing in u provided that Δπ(∥θ∥2) is nondecreasing in ∥θ∥2.
Lemma 5.8
Let h(∥θ − x∥2) be a unimodal density and let ψ(θ) be a symmetric function. Then
as soon as ψ is nonnegative.
Proof (Proof of Theorem 5.7 )
By the superharmonicity of π(∥θ∥2), we have ΔM(x) ≤ 0 for all \(x\in \mathbb {R}^p\) so that by (5.19), it suffices to prove that
for all \(x\in \mathbb {R}^p\). Since m and M are spherically symmetric, by (5.21), (5.23) reduces to − 2cM ′(t)m ′(t) + (M ′(t))2 ≤ 0 where t = ∥x∥2. Since M ′(t) ≤ 0 by Lemma 5.5, (5.23) reduces to − 2cm ′(t) + M ′(t) ≥ 0 or, by (5.21), to − 2 c x ⋅∇m(x) + x ⋅∇M(x) ≥ 0 or, by Lemma 5.6, to
where E denotes the expectation with respect to the density proportional to u p∕2 f(u). Since, by assumption, Δπ(∥θ∥2) is nondecreasing in ∥θ∥2, H(u, t) is nondecreasing in u by Lemma 5.7. Furthermore f ′(u)∕f(u) is nondecreasing by log-convexity of f so that (5.24) is satisfied as soon as
Finally, as π ′(∥θ∥2) ≤ 0 by assumption, Lemma 5.8 guarantees that H(u, t) ≤ 0 (note that \(\mathcal {V}_{\sqrt {u}, x}\) has a unimodal density) and hence (5.25) reduces to
which is equivalent to (5.20). □
Several examples of priors and sampling distributions which satisfy the assumptions of Theorem 5.7 are given in Fourdrinier and Strawderman (2008a). We briefly summarize these.
Example 5.8 (Priors related to the fundamental harmonic prior)
Let \(\displaystyle \pi (\|\theta \|{ }^2) = \left (\frac {1}{A+\|\theta \|{ }^2}\right )^c\) with A ≥ 0 and \(0\leq c \leq \frac {p}{2}-1\).
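The bound 0 ≤ c ≤ p∕2 − 1 can be checked numerically. A minimal sketch, using the radial form of the Laplacian Δπ = 4tπ″(t) + 2pπ′(t) for t = ∥θ∥2 (the values of p and A below are illustrative choices, not from the text):

```python
# Sketch: superharmonicity of pi(t) = (A + t)^(-c), t = ||theta||^2, via the
# radial form of the Laplacian: Delta pi = 4 t pi''(t) + 2 p pi'(t).
import numpy as np

def laplacian(t, p, A, c):
    d1 = -c * (A + t) ** (-c - 1)             # pi'(t)
    d2 = c * (c + 1) * (A + t) ** (-c - 2)    # pi''(t)
    return 4 * t * d2 + 2 * p * d1

p, A = 6, 1.0                     # illustrative dimension and A
ts = np.linspace(0, 100, 1001)

assert np.all(laplacian(ts, p, A, c=p / 2 - 1) <= 1e-12)  # boundary case c = p/2 - 1
assert np.all(laplacian(ts, p, A, c=1.0) <= 1e-12)        # interior case
assert np.any(laplacian(ts, p, A, c=p / 2) > 0)           # superharmonicity fails beyond the bound
```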
Example 5.9 (Mixtures of priors)
Let \((\pi _\alpha )_{\alpha \in A}\) be a family of priors such that the assumptions of Theorem 5.7 are satisfied for any α ∈ A. Then any mixture of the form ∫A π α(∥θ∥2) dH(α), where H is a probability measure on A, satisfies these assumptions as well. For instance, Example 5.8 with c = 1, p ≥ 4, A = α and the gamma density \(\displaystyle \alpha \longmapsto \frac {\beta ^{1-v}}{\varGamma (1-v)}\alpha ^{-v} e^{-\beta \alpha }\) with β > 0 and 0 < v < 1 leads to the prior
where
is the complement of the incomplete gamma function.
Example 5.10 (Variance mixtures of normals)
Let
a mixture of normals with respect to the inverse of the variance. As soon as, for any u > 0,
the prior π(∥θ∥2) satisfies the assumptions of Theorem 5.7. Note that the priors in Example 5.8 arise as such a mixture with \(h(u) \propto u^{k-p/2 -1} \exp (- Au/2)\).
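This mixture representation reduces to a gamma integral and can be checked numerically; a sketch with illustrative values of p, A and k = c (our own choices, scipy assumed):

```python
# Sketch: the priors of Example 5.8 as variance mixtures of normals.  With
# h(u) proportional to u^(c - p/2 - 1) exp(-A u / 2), the gamma integral gives
#   int (u/2pi)^(p/2) exp(-u t / 2) h(u) du = Gamma(c) 2^c (2pi)^(-p/2) (A + t)^(-c).
from math import pi, exp, gamma, inf
from scipy.integrate import quad

p, A, c = 6, 1.0, 2.0   # illustrative choices

def mixed_prior(t):
    integrand = lambda u: (u / (2 * pi)) ** (p / 2) * exp(-u * t / 2) * \
                          u ** (c - p / 2 - 1) * exp(-A * u / 2)
    return quad(integrand, 0, inf)[0]

const = gamma(c) * 2 ** c / (2 * pi) ** (p / 2)
for t in (0.0, 1.0, 5.0, 20.0):
    # the ratio to (A + t)^(-c) is the same constant for every t
    assert abs(mixed_prior(t) * (A + t) ** c - const) < 1e-8
```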
Other examples can be given and a constructive approach is proposed in Fourdrinier and Strawderman (2008a).
We now give examples of sampling distributions which satisfy the assumptions of Theorem 5.7.
Example 5.11 (Variance mixtures of normals)
Let
where h is a mixing density and let V be a nonnegative random variable with density h. If E[V −p∕2] < ∞ and E[V ] E[V −p∕2]∕E[V −p∕2+1] < 2, then the sampling density f satisfies the assumptions of Theorem 5.7.
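The moment condition can be evaluated in closed form for specific mixing distributions. A sketch for an illustrative Gamma(shape a, scale s) choice (our own example, not from the text): for Gamma, E[V r] = s r Γ(a + r)∕Γ(a) whenever a + r > 0, so the ratio reduces to a∕(a − p∕2), and the condition holds exactly when a > p.

```python
# Sketch: checking the moment condition of Example 5.11 for an illustrative
# Gamma(shape=a, scale=s) mixing distribution.  For Gamma,
# E[V^r] = s^r * Gamma(a + r) / Gamma(a) whenever a + r > 0.
from math import gamma

def condition_holds(a, s, p):
    if a <= p / 2:                 # E[V^(-p/2)] must be finite
        return False
    m1 = a * s                                                   # E[V]
    m_neg = s ** (-p / 2) * gamma(a - p / 2) / gamma(a)          # E[V^(-p/2)]
    m_neg1 = s ** (1 - p / 2) * gamma(a - p / 2 + 1) / gamma(a)  # E[V^(-p/2 + 1)]
    return m1 * m_neg / m_neg1 < 2   # the ratio equals a / (a - p/2), free of s

p = 8
assert condition_holds(a=10, s=0.5, p=p)      # a > p: condition holds
assert not condition_holds(a=6, s=0.5, p=p)   # p/2 < a <= p: ratio >= 2, fails
```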
Example 5.12 (Densities proportional to \(e^{-\alpha t^\beta }\))
Let
where α > 0, \(\frac {1}{2} < \beta \le 1\) and K is the normalizing constant. Then the sampling density f satisfies the assumptions of Theorem 5.7 as soon as β lies in an interval of the form (1 − 𝜖, 1] with 𝜖 > 0. Note, however, that these assumptions are not satisfied when β = 1∕2.
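The log-convexity requirement of Theorem 5.7 is easy to check directly for this family: log f(t) = const − αt β has second derivative − αβ(β − 1)t β−2, which is nonnegative exactly when β ≤ 1 (so log-convexity itself would also hold at β = 1∕2; the failure there involves the theorem's remaining conditions, which this sketch does not address).

```python
# Sketch: log-convexity of f(t) = K exp(-alpha t^beta) for beta <= 1.
# log f(t) = const - alpha t^beta, whose second derivative is
# -alpha * beta * (beta - 1) * t^(beta - 2), nonnegative exactly when beta <= 1.
import numpy as np

def second_deriv_logf(t, alpha, beta):
    return -alpha * beta * (beta - 1) * t ** (beta - 2)

ts = np.linspace(0.01, 50, 500)
assert np.all(second_deriv_logf(ts, alpha=1.0, beta=0.8) >= 0)       # log-convex
assert np.all(second_deriv_logf(ts, alpha=1.0, beta=1.0) >= -1e-15)  # affine boundary
assert np.all(second_deriv_logf(ts, alpha=1.0, beta=1.5) <= 0)       # beta > 1: log-concave
```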
Fourdrinier and Strawderman (2008a) give other examples with densities proportional to e −αt+βφ(t) where φ is a convex function.
5.5 Shrinkage Estimators for Concave Loss
In this section we consider improved shrinkage estimators for loss functions that are concave functions of squared error loss. The basic results are due to Brandwein and Strawderman (1980, 1991a) and we largely follow the method of proof in the later paper. The general nature of the main result is that (under mild conditions) if an estimator can be shown to dominate X under squared error loss, then the same estimator, with a suitably altered shrinkage constant, will dominate X for a loss which is a concave function of squared error loss.
Let X have a spherically symmetric distribution around θ, and let g(X) be a weakly differentiable function. The estimators considered are of the form
The loss functions are of the form
where ℓ(⋅) is a differentiable, nonnegative, nondecreasing concave function (so that, in particular, ℓ ′(⋅) ≥ 0).
One basic tool needed for the main result is Theorem 5.5, and the other is the basic property of the concave function ℓ(⋅) that ℓ(t + a) ≤ ℓ(t) + aℓ ′(t).
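The tangent-line bound ℓ(t + a) ≤ ℓ(t) + aℓ ′(t) can be checked numerically for a concrete concave choice such as ℓ(t) = √t (an illustration only, not the only loss covered by the theory):

```python
# Quick numerical check of the concavity bound l(t + a) <= l(t) + a l'(t)
# for the concrete concave choice l(t) = sqrt(t).
from math import sqrt

def ell(t):
    return sqrt(t)

def ell_prime(t):
    return 0.5 / sqrt(t)

for t in (0.5, 1.0, 4.0, 25.0):
    for a in (-0.4, -0.1, 0.3, 2.0, 10.0):
        if t + a >= 0:   # the bound is tested wherever the loss is defined
            assert ell(t + a) <= ell(t) + a * ell_prime(t) + 1e-12
```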
The following result shows that shrinkage estimators that improve on X for squared error loss also improve on X for concave loss provided the shrinkage constant is adjusted properly.
Theorem 5.8 (Brandwein and Strawderman 1991a)
Let X have a spherically symmetric distribution around θ, let g(X) be a weakly differentiable function, and let the loss be given by (5.27) .
Suppose there exists a subharmonic function h(⋅) such that E θ,R[R 2 h(U)] is nonincreasing in R, where \(U\sim {\mathcal U}_{R,\theta }\) . Furthermore suppose that the function g(⋅) satisfies \( E^*_\theta [||g(X)||{ }^2]< \infty \) and also satisfies
(1) div g(x) ≤ h(x), for any \(x \in \mathbb {R}^p\),
(2) ∥g(x)∥2 + 2h(x) ≤ 0, for any \(x \in \mathbb {R}^p\), and
(3) \(0 \le a \le \frac {1}{pE^*_0(1/\Vert X\Vert ^2)}\),
where \(E^*_\theta \) refers to the expectation with respect to the distribution whose Radon-Nikodym derivative with respect to the distribution of X is proportional to ℓ ′(∥X − θ∥2).
Then δ(X) = X + ag(X) is minimax. Also δ(X) dominates X provided g(⋅) is non-zero with positive probability and strict inequality holds with positive probability in (1) or (2) , or both inequalities are strict in (3) .
Proof
Note, by concavity of ℓ(⋅) and the usual identity
Hence, the difference in risk, R(θ, δ) − R(θ, X) is bounded by
by Theorem 5.5 applied to the distribution corresponding to \(E^{*}_\theta \). □
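A small simulation illustrates the phenomenon behind Theorem 5.8. The sketch below uses a normal sampling distribution (one special spherically symmetric case), the James-Stein-type choice g(x) = −(p − 2)x∕∥x∥2, the concave loss ℓ(t) = √t, and the illustrative constant a = 1 rather than the theorem's exact bound; θ = 0 is the most favorable case for shrinkage.

```python
# Monte Carlo sketch: a shrinkage estimator X + a g(X) with
# g(x) = -(p - 2) x / ||x||^2 can also improve on X under the concave loss
# l(||d - theta||^2) = ||d - theta||.  Normal sampling, a = 1 and theta = 0
# are illustrative choices, not the theorem's exact setting.
import numpy as np

rng = np.random.default_rng(0)
p, n, a = 10, 50_000, 1.0
theta = np.zeros(p)                     # most favorable case for shrinkage

X = rng.standard_normal((n, p)) + theta
shrink = X * (1 - a * (p - 2) / np.sum(X**2, axis=1, keepdims=True))

loss_X = np.linalg.norm(X - theta, axis=1)            # sqrt of squared error
loss_shrink = np.linalg.norm(shrink - theta, axis=1)

assert loss_shrink.mean() < loss_X.mean()
print(f"risk(X) ~ {loss_X.mean():.3f}, risk(shrinkage) ~ {loss_shrink.mean():.3f}")
```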
References
Barlow RE, Proschan F (1981) Statistical theory of reliability and life testing: probability models. To Begin With, Silver Spring
Berger JO (1975) Minimax estimation of location vectors for a wide class of densities. Ann Stat 3(6):1318–1328
Brandwein AC, Strawderman WE (1978) Minimax estimation of location parameters for spherically symmetric unimodal distributions under quadratic loss. Ann Stat 6:377–416
Brandwein AC, Strawderman WE (1980) Minimax estimation of location parameters for spherically symmetric distributions with concave loss. Ann Stat 8:279–284
Brandwein AC, Strawderman WE (1991a) Generalizations of James-Stein estimators under spherical symmetry. Ann Stat 19:1639–1650
Brandwein AC, Strawderman WE (1991b) Improved estimates of location in the presence of unknown scale. J Multivar Anal 39:305–314
Brandwein AC, Ralescu S, Strawderman W (1993) Shrinkage estimators of the location parameter for certain spherically symmetric distributions. Ann Inst Stat Math 45(3):551–565
Brown LD (1979) A heuristic method for determining admissibility of estimators–with applications. Ann Stat 7(5):960–994. https://doi.org/10.1214/aos/1176344782
Diaconis P, Ylvisaker D (1979) Conjugate priors for exponential families. Ann Stat 7:269–281
du Plessis N (1970) An introduction to potential theory. Hafner, Darien
Fourdrinier D, Strawderman WE (2008a) Generalized Bayes minimax estimators of location vector for spherically symmetric distributions. J Multivar Anal 99(4):735–750
Fourdrinier D, Kortbi O, Strawderman W (2008) Bayes minimax estimators of the mean of a scale mixture of multivariate normal distributions. J Multivar Anal 99(1):74–93
Maruyama Y (2003a) Admissible minimax estimators of a mean vector of scale mixtures of multivariate normal distributions. J Multivar Anal 84:274–283
Maruyama Y, Takemura A (2008) Admissibility and minimaxity of generalized Bayes estimators for spherically symmetric family. J Multivar Anal 99:50–73
Strawderman WE (1974b) Minimax estimation of powers of the variance of a normal population under squared error loss. Ann Stat 2:190–198
Strawderman WE (1992) The James-Stein estimator as an empirical Bayes estimator for an arbitrary location family. In: Bayesian statistics, 4. Oxford University Press, New York, pp 821–824
© 2018 Springer Nature Switzerland AG
Fourdrinier, D., Strawderman, W.E., Wells, M.T. (2018). Estimation of a Mean Vector for Spherically Symmetric Distributions I: Known Scale. In: Shrinkage Estimation. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-02185-6_5
Print ISBN: 978-3-030-02184-9
Online ISBN: 978-3-030-02185-6