
1.1 Introduction to Two Papers on Higher Order Asymptotics

1.1.1 Introduction

Peter Bickel has contributed substantially to the study of rank-based nonparametric statistics. Of his many contributions to research in this area I shall discuss his work on second order asymptotics, which yielded surprising results and set off more than a decade of research that deepened our understanding of asymptotic statistics. I shall restrict my discussion to two papers: Albers et al. (1976), “Asymptotic expansions for the power of distribution free tests in the one-sample problem”, and Bickel (1974), “Edgeworth expansions in nonparametric statistics”, in which the entire area is reviewed.

1.1.2 Asymptotic Expansions for the Power of Distribution Free Tests in the One-Sample Problem

Let X 1, X 2, ⋯ be i.i.d. random variables with a common distribution function F θ for some real-valued parameter θ. For N = 1, 2, ⋯, let A N and B N be two tests of level α ∈ (0, 1) based on X 1, X 2, ⋯, X N for the null-hypothesis H : θ = 0 against a contiguous sequence of alternatives \({K}_{N,c} : \theta = c{N}^{-1/2}\) for a fixed c > 0. Let π A, N (c) and π B, N (c) denote the powers of A N and B N for this testing problem and suppose that A N performs at least as well as B N , i.e. π A, N (c) ≥ π B, N (c). Then we may look for a sample size k = k N  ≥ N such that B k performs as well against alternative K N, c as A N does for sample size N, i.e. \({\pi }_{B,k}(c{(k/N)}^{1/2}) = {\pi }_{A,N}(c)\). For finite sample size N it is generally impossible to find a usable expression for k = k N , so one resorts to large sample theory and defines the asymptotic relative efficiency (ARE) of sequence {B N } with respect to {A N } as

$$e = e(B,A) ={ \lim }_{N\rightarrow \infty }N/{k}_{N}.$$

If π A, N (c) → π A (c) and π B, N (c) → π B (c) uniformly for bounded c, and π A and π B are continuous, then e is the solution of

$${\pi }_{B}(c{e}^{-1/2}) = {\pi }_{ A}(c).$$

Since we assumed that A N performs at least as well as B N , we have e ≤ 1.

If e < 1, the ARE provides a useful indication of the quality of the sequence {B N } as compared to {A N }. To mimic the performance of A N by B k we need \({k}_{N} - N = N(1 - e)/e + o(N)\) additional observations, where the remainder term o(N) is relatively unimportant. If e = 1, however, all we know is that the number of additional observations needed is o(N), which may be of any order of magnitude, such as 1 or N ∕ loglogN. Hence Hodges and Lehmann (1970) considered the case e = 1 and proposed to investigate the asymptotic behavior of what they named the deficiency of B with respect to A,

$${d}_{N} = {k}_{N} - N,$$

rather than k N  ∕ N. Of course this is a much harder problem than determining the ARE. To compute e, all we have to show is that \({k}_{N} = N/e + o(N)\), and only the limiting powers π A and π B enter into the solution. If e = 1, then \({k}_{N} = N + o(N)\), but for determining the deficiency, we need to evaluate k N to the next lower order, which may well be O(1) in which case we have to evaluate k N with an error of the order o(1). To do this, one will typically need asymptotic expansions for the power functions π A, N and π B, N with remainder term o(N  − 1). For this we need similar expansions for the distribution functions of the test statistics of the two tests under the hypothesis as well as under the alternative.

In their paper Hodges and Lehmann computed deficiencies for some parametric tests and estimators, but they clearly had a more challenging problem in mind. When Frank Wilcoxon introduced his one- and two-sample rank tests (Wilcoxon 1945) most people thought that replacing the observations by ranks would lead to a considerable loss of power compared to the best parametric procedures. Since then, research had consistently shown that this is not the case. Many rank tests have ARE 1 when compared to the optimal test for a particular parametric problem, so it was not surprising that the first question that Hodges and Lehmann raised for further research was: “What is the deficiency (for contiguous normal shift alternatives) of the normal scores test or of van der Waerden’s X-test with respect to the t-test?”.

In the paper under discussion this question is generalized to other distributions than the normal and answered for the appropriate one-sample rank test as compared with the optimal parametric test. Let X 1, X 2, ⋯, X N be i.i.d. with a common distribution function G and density g, and let Z 1 < Z 2 < ⋯ < Z N be the order statistics of the absolute values | X 1 |, | X 2 |, ⋯, | X N  | . If Z j  =  | X R(j) |, define V j  = 1 if X R(j) > 0 and V j  = 0 otherwise. Let a = (a 1, a 2, ⋯, a N ) be a vector of scores and define

$$T ={ \sum \nolimits }_{1\leq j\leq N}{a}_{j}{V }_{j}.$$
(1.1)

T is the linear rank statistic for testing the hypothesis that g is symmetric about zero. Note that the dependence of G, g and a on N is suppressed in the notation. Conditionally on Z, the random variables V 1, V 2, ⋯, V N are independent with

$${P}_{j} = P({V }_{j} = 1\vert Z) = g({Z}_{j})/\{g({Z}_{j}) + g(-{Z}_{j})\}.$$
(1.2)

Under the null hypothesis, V 1, V 2, ⋯, V N are i.i.d. with \(P({V }_{j} = 1) = 1/2\). Hence the obvious strategy for obtaining an expansion for the distribution function of T is to introduce independent random variables W 1, W 2, ⋯, W N with \({p}_{j} = P({W}_{j} = 1) = 1 - P({W}_{j} = 0)\) and obtain an expansion for the distribution function of ∑1 ≤ j ≤ N a j W j . In this expansion we substitute the random vector P = (P 1, P 2, ⋯, P N ) for p = (p 1, p 2, ⋯, p N ). The expected value of the resulting expression will then yield an expansion for the distribution function of T.
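As a small concrete illustration of the statistic (1.1) (a sketch, not taken from the paper; the default scores a j  = j are an illustrative choice giving Wilcoxon's one-sample signed-rank statistic):

```python
import numpy as np

def linear_signed_rank_statistic(x, scores=None):
    """Compute T = sum_j a_j V_j of (1.1).

    Z_1 < ... < Z_N are the ordered absolute values |X_i|, and
    V_j = 1 if the observation with the j-th smallest absolute
    value is positive.  The default scores a_j = j (an illustrative
    choice) give Wilcoxon's one-sample signed-rank statistic.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    a = np.arange(1, n + 1) if scores is None else np.asarray(scores, dtype=float)
    order = np.argsort(np.abs(x))       # permutation R with Z_j = |X_{R(j)}|
    v = (x[order] > 0).astype(int)      # V_1, ..., V_N
    return float(np.sum(a * v))

# Ordered |x|: 0.5(-), 1.2(+), 2.0(-), 3.1(+), so V = (0,1,0,1) and T = 2 + 4 = 6.
print(linear_signed_rank_statistic([1.2, -0.5, 3.1, -2.0]))  # 6.0
```

Under the null hypothesis of symmetry each V j is an independent fair coin flip, which is exactly the i.i.d. Bernoulli structure the substitution argument above exploits.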

This approach is not without problems. Consider i.i.d. random variables Y 1, Y 2, ⋯, Y N with a common continuous distribution with mean EY j  = 0, variance EY j 2 = 1, third and fourth moments μ3 = EY j 3 and μ4 = EY j 4, and third and fourth cumulants κ3 = μ3 and \({\kappa }_{4} = {\mu }_{4} - 3\) (since the variance equals 1). Let \({S}_{N} = {N}^{-1/2}{ \sum \nolimits }_{1\leq j\leq N}{Y }_{j}\) denote the normalized sum of these variables. Edgeworth (1905) provided a formal series expansion of the distribution function F N (x) = P(S N  ≤ x) in powers of \({N}^{-1/2}\). Up to and including the terms of order 1, \({N}^{-1/2}\) and N  − 1, Edgeworth’s expansion of F N (x) reads

$$\begin{array}{rcl}{ F}_{N}^{{_\ast}}(x)& =& \Phi (x) - \phi (x) \cdot [({\kappa }_{ 3}/6)({x}^{2} - 1){N}^{-1/2} \\ & & \qquad \qquad \qquad +\{ ({\kappa }_{4}/24)({x}^{3} - 3x) + ({\kappa }_{ 3}^{2}/72)({x}^{5} - 10{x}^{3} + 15x)\}{N}^{-1}]. \\ & & \end{array}$$
(1.3)

We shall call this the three-term Edgeworth expansion. Though it was a purely formal series expansion, the Edgeworth expansion caught on and became a popular tool to approximate the distribution function of any sequence of continuous random variables U N with expected value 0 and variance 1 that was asymptotically standard normal. As \({\lambda }_{3,N} = {\kappa }_{3}{N}^{-1/2}\) and \({\lambda }_{4,N} = {\kappa }_{4}{N}^{-1}\) are the third and fourth cumulants of the random variable S N under discussion, one merely replaced these quantities by the cumulants of U N in 1.3. Incidentally, I recently learned from Professor Ibragimov that the Edgeworth expansion was first proposed in Chebyshev (1890), which predates Edgeworth’s paper by 15 years. Apparently this is one more example of Stigler’s law of eponymy, which states that no scientific discovery – including Stigler’s law – is named after its original discoverer (Stigler 1980).
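To see the expansion 1.3 at work numerically, here is a sketch (an illustrative choice, not from the paper) comparing it with the exact distribution of a standardized sum of Exp(1) variables, for which κ3 = 2, κ4 = 6 and the unstandardized sum is exactly Erlang:

```python
import math

def phi(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def edgeworth3(x, n, k3, k4):
    """Three-term Edgeworth approximation (1.3) to P(S_N <= x)."""
    return Phi(x) - phi(x) * (
        (k3 / 6) * (x**2 - 1) / math.sqrt(n)
        + ((k4 / 24) * (x**3 - 3 * x)
           + (k3**2 / 72) * (x**5 - 10 * x**3 + 15 * x)) / n
    )

def erlang_cdf(x, n):
    """Exact P(Gamma(n, 1) <= x) for integer n (Erlang distribution)."""
    if x <= 0:
        return 0.0
    term, s = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        s += term
    return 1.0 - math.exp(-x) * s

# S_N = (sum of N Exp(1) variables - N)/sqrt(N); kappa_3 = 2, kappa_4 = 6.
N = 10
grid = [i / 10 for i in range(-25, 41)]
err_normal = max(abs(erlang_cdf(N + math.sqrt(N) * x, N) - Phi(x)) for x in grid)
err_edge = max(abs(erlang_cdf(N + math.sqrt(N) * x, N) - edgeworth3(x, N, 2.0, 6.0))
               for x in grid)
print(err_edge < err_normal)  # the expansion beats the plain normal approximation
```

Already at N = 10 the three-term expansion reduces the sup-norm error of the plain normal approximation by roughly an order of magnitude in this example.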

A proof of the validity of the Edgeworth expansion for normalized sums S N was given by Cramér (1937) (cf. also Feller 1966). He showed that for the three-term Edgeworth expansion 1.3, the error \({F}_{N}^{{_\ast}}(x) - {F}_{N}(x) = o({N}^{-1})\) uniformly in x, provided that μ4 < ∞ and the characteristic function ψ(t) = E exp{itY j } satisfies Cramér’s condition

$${\limsup }_{\vert t\vert \rightarrow \infty }\vert \psi (t)\vert < 1.$$
(1.4)

Assumption 1.4 cannot be satisfied if Y 1 is a discrete random variable, as then its characteristic function is almost periodic and the lim sup equals 1. In the case we are discussing, the summands a j W j of the statistic ∑1 ≤ j ≤ N a j W j are independent discrete variables taking only two values, 0 and a j . However, the summands are not identically distributed unless the a j as well as the p j are equal. Hence the only case where the summands are i.i.d. is that of the sign test under the null-hypothesis, where a j  = 1 for all j, and the values 0 and 1 are assumed with probability 1 ∕ 2. In that case the statistic ∑1 ≤ j ≤ N a j W j has a binomial distribution with point probabilities of the order \({N}^{-1/2}\), and it is obviously not possible to approximate a function F N with jumps of order \({N}^{-1/2}\) by a continuous function F N  ∗  with error o(N  − 1).

In all other cases the summands a j W j of ∑1 ≤ j ≤ N a j W j are independent but not identically distributed. Cramér also studied the validity of the Edgeworth expansion for the case where the Y j are independent but not identically distributed. Assume again that EY j  = 0 and define S N as the normalized sum \({S}_{N} = {\sigma }^{-1}{ \sum \nolimits }_{1\leq j\leq N}{Y }_{j}\) with σ2 = ∑1 ≤ j ≤ N EY j 2. As before F N (x) = P(S N  ≤ x) and in the three-term Edgeworth expansion F N  ∗ (x) we replace \({\kappa }_{3}{N}^{-1/2}\) and κ4 N  − 1 by the third and fourth cumulants of S N . Cramér’s conditions to ensure that \({F}_{N}^{{_\ast}}(x) - {F}_{N}(x) = o({N}^{-1})\) uniformly in x, are uniform versions of the earlier ones for the i.i.d. case: EY j 2 ≥ c > 0, EY j 4 ≤ C < ∞ for j = 1, 2, ⋯, N, and for every δ > 0 there exists q δ < 1 such that the characteristic functions ψ j (t) = E exp{itY j } satisfy

$${ \sup }_{\vert t\vert \geq \delta }\vert {\psi }_{j}(t)\vert < {q}_{\delta }\quad \text{ for all $j$}.$$
(1.5)

As the a j W j are lattice variables, 1.5 does not hold for even a single j and the plan of attack of this problem is beginning to look somewhat dubious. However, as Feller points out, condition 1.5 is “extravagantly luxurious” for validating the three-term Edgeworth expansion and can obviously be replaced by \({\sup }_{\vert t\vert \geq \delta }\vert {\Pi }_{1\leq j\leq N}{\psi }_{j}(t)\vert = o({N}^{-1})\) (cf. Feller 1966, Theorem XVI.7.2 and Problem XVI.8.12). This, in turn, is slightly too optimistic, but it is true that the condition

$${ \sup }_{\delta \leq \vert t\vert \leq N}\vert {\Pi }_{1\leq j\leq N}{\psi }_{j}(t)\vert = o({(N\log N)}^{-1})$$
(1.6)

is sufficient and the presence of logN is not going to make any difference. Hence 1.6 has to be proved for the case where \({Y }_{j} = {a}_{j}({W}_{j} - {p}_{j})\) and \({S}_{N} ={ \sum \nolimits }_{1\leq j\leq N}{a}_{j}({W}_{j} - {p}_{j})/\tau (p)\) with \(\tau {(p)}^{2} ={ \sum \nolimits }_{1\leq j\leq N}{p}_{j}(1 - {p}_{j}){a}_{j}^{2}\) and ρ(t) = ∏1 ≤ j ≤ N ψ j (t) is the characteristic function of S N .

This problem is solved in Lemma 2.2 of the paper. The moment assumptions (2.15) of this lemma simply state that N  − 1τ(p)2 ≥ c > 0 and N  − 1 ∑1 ≤ j ≤ N a j 4 ≤ C < ∞, and assumption (2.16) ensures the desired behavior of | ∏1 ≤ j ≤ N ψ j (t) | by requiring that there exist δ > 0 and 0 < ε < 1 ∕ 2 such that

$$\lambda \{x : \exists j : \vert x - {a}_{j}\vert < \zeta,\epsilon \leq {p}_{j} \leq 1 - \epsilon \} \geq \delta N\zeta \quad \text{ for some}\ \zeta \geq {N}^{-3/2}\log N,$$
(1.7)

where λ is Lebesgue measure. This assumption ensures that the set of the scores a j for which p j is bounded away from 0 and 1, does not cluster too much about too few points. As is shown in the proof of Lemma 2.2 and Theorem 2.1 of the paper, assumptions (2.15) and (2.16) imply

$${ \sup }_{\delta \leq \vert t\vert \leq N}\vert {\prod \nolimits }_{1\leq j\leq N}{\psi }_{j}(t)\vert \leq \exp \{-d{(\log N)}^{2}\} = {N}^{-d\log N},$$
(1.8)

which obviously implies 1.6. Hence the three-term Edgeworth expansion for \({S}_{N} ={ \sum \nolimits }_{1\leq j\leq N}{a}_{j}({W}_{j} - {p}_{j})/\tau (p)\) is valid with remainder o(N  − 1), and in fact \(O({N}^{-5/4})\). This was a very real extension of the existing theory at the time.

To obtain an expansion for the distribution of the rank statistic T = ∑1 ≤ j ≤ N a j V j , the next step is to replace the probabilities p j by the random quantities P j in 1.2 and take the expectation. Under the null-hypothesis that the density g of the X j is symmetric this is straightforward because \({P}_{j} = 1/2\) for all j. The alternatives discussed in the paper are contiguous location alternatives where \(G(x) = F(x\,-\,\theta )\) for a specific known F with symmetric density f and \(0 \leq \theta \leq C{N}^{-1/2}\) for a fixed C > 0. Finding an expansion for the distribution of T under these alternatives is highly technical and laborious, but fairly straightforward under the assumptions N  − 11 ≤ j ≤ N a j 2 ≥ c, N  − 11 ≤ j ≤ N a j 4 ≤ C,

$$\lambda \{x : \exists j : \vert x - {a}_{j}\vert < \zeta \} \geq \delta N\zeta \quad \text{ for some}\ \zeta \geq {N}^{-3/2}\log N$$
(1.9)

and some technical assumptions concerning f and its first four derivatives. Among many other things, the latter ensure that ε ≤ P j  ≤ 1 − ε for a substantial proportion of the P j . Having obtained expansions for the distribution function of \((2T -\sum \nolimits {a}_{j})/{(\sum \nolimits {a}_{j}^{2})}^{1/2}\) both under the hypothesis and the alternative, an expansion for the power is now immediate.

It remains to discuss the choice of the scores a j  = a j, N . For a comparison between best rank tests and best parametric tests we choose a distribution function F with a symmetric smooth density f and consider the locally most powerful (LMP) rank test based on the scores

$${a}_{j,N} = E\Psi ({U}_{j:N})\quad \text{ where}\ \Psi (t) = -f^{\prime}({F}^{-1}((1 + t)/2))/f({F}^{-1}((1 + t)/2))$$
(1.10)

and U j: N denotes the j-th order statistic of a sample of size N from the uniform distribution on (0, 1). Since \({F}^{-1}((1 + t)/2)\) is the inverse function of the distribution function (2F − 1) on (0, ∞), \({F}^{-1}((1 + {U}_{j:N})/2)\) is distributed as the j-th order statistic Z j of the absolute values | X 1 |, | X 2 |, ⋯, | X N  | of a sample X 1, X 2, ⋯, X N from F. Hence \({a}_{j} = -Ef^{\prime}({Z}_{j})/f({Z}_{j})\). As f is symmetric, the function f′ ∕ f can only be constant on the positive half-line if f is the density \(f(x) = 1/2\gamma {e}^{-\gamma \vert x\vert }\) of a Laplace distribution on R 1, for which the sign test is the LMP rank test. We already concluded that this test cannot be handled with the tools of this paper, but for every other symmetric four times differentiable f, the important condition 1.9 will hold.

If, instead of the so-called exact scores a j, N  = EΨ(U j: N ), one uses the approximate scores \({a}_{j,N} = \Psi (j/(N + 1))\), then the power expansions remain unchanged. This is generally not the case for score generating functions other than Ψ.
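For the normal case, − f′(x) ∕ f(x) = x, so Ψ(t) = Φ − 1((1 + t) ∕ 2) and the approximate scores are exactly those of van der Waerden's one-sample test. A minimal sketch (using the standard library's inverse normal distribution function):

```python
from statistics import NormalDist

def van_der_waerden_scores(n):
    """Approximate scores a_{j,N} = Psi(j/(N+1)) in the normal case,
    where Psi(t) = Phi^{-1}((1 + t)/2): van der Waerden's one-sample scores.
    """
    inv = NormalDist().inv_cdf   # Phi^{-1}
    return [inv((1 + j / (n + 1)) / 2) for j in range(1, n + 1)]

print(van_der_waerden_scores(3))
# middle score is Phi^{-1}(0.75), roughly 0.6745; the scores increase with j
```

For the exact scores one would instead need the expectation of Ψ at the uniform order statistics, which has no closed form; as noted above, either choice leaves the power expansion unchanged.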

The most powerful parametric test for the null-hypothesis F against the contiguous shift alternative F(x − θ) with \(\theta = c{N}^{-1/2}\) for fixed c > 0 will serve as a basis for comparison with the LMP rank test. Its test statistic is simply \({\sum \nolimits }_{1\leq j\leq N}\{\log f({X}_{j} - \theta ) -\log f({X}_{j})\}\), which is a sum of i.i.d. random variables; therefore its distribution functions under the hypothesis and the alternative admit Edgeworth expansions under the usual assumptions, and so does the power. Explicit expressions are found for the deficiency of the LMP rank test and some examples are:

Normal distribution (Hodges-Lehmann problem). For normal location alternatives the deficiency of the one-sample normal scores test, as well as of van der Waerden’s one-sample rank test, with respect to the most powerful parametric test based on the sample mean equals

$${d}_{N} = 1/2\log \log N + 1/2({u}_{\alpha }^{2} - 1) + 1/2\gamma + o(1),$$

where \(\Phi ({u}_{\alpha }) = 1 - \alpha \) and γ = 0.577216⋯ is Euler’s constant. Note that in the paper there is an error in the constant (cf. Albers et al. 1978). In this case the deficiency does tend to infinity, but no one is likely to notice, as \(1/2\log \log N = 1.568\cdots \) for \(N = 1{0}^{10}\) (logarithms to base e).
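A quick numerical evaluation of this expansion (a sketch; the level α = 0.05 is an illustrative choice, and the o(1) term is dropped):

```python
import math
from statistics import NormalDist

def deficiency_normal_scores(n, alpha=0.05):
    """Leading terms of d_N = (1/2)loglog N + (1/2)(u_a^2 - 1) + gamma/2
    from the expansion above; the o(1) remainder is dropped.
    alpha = 0.05 is an illustrative choice."""
    u = NormalDist().inv_cdf(1 - alpha)   # u_alpha with Phi(u_alpha) = 1 - alpha
    euler_gamma = 0.577216                # Euler's constant
    return 0.5 * math.log(math.log(n)) + 0.5 * (u * u - 1) + 0.5 * euler_gamma

# The loglog term grows, but absurdly slowly:
print(round(0.5 * math.log(math.log(10**10)), 3))  # 1.568, matching the text
```

Even at N = 10^10 the whole expression stays below three observations, which is why the unbounded deficiency is of no practical concern.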

It is also shown that the deficiency of the permutation test based on the sample mean with respect to Student’s one-sample test tends to zero as \(O({N}^{-1/2})\).

Logistic distribution. For logistic location alternatives the deficiency of Wilcoxon’s one-sample test with respect to the most powerful test for testing \(F(x) = {(1 + {e}^{-x})}^{-1}\) against \(F(x - b{N}^{-1/2})\) tends to a finite limit and equals

$${d}_{N} =\{ 18 + 12{u}_{\alpha }^{2} + {(48)}^{1/2}b{u}_{ \alpha } + {b}^{2}\}/60 + o(1).$$

It came as somewhat of a surprise that Wilcoxon’s test statistic admits a three-term Edgeworth expansion, as it is a purely lattice random variable. As we pointed out above, the reason that this is possible is that its conditional distribution is that of a sum of independent but not identically distributed random variables. Intuitively the reason is that the point probabilities of the Wilcoxon statistic are of the order \({N}^{-3/2}\) which is allowed as the error of the expansion is o(N  − 1).

The final section of the paper discusses deficiencies of estimators of location. It is shown that the deficiency of the Hodges-Lehmann type of location estimator associated with the LMP rank test for location alternatives with respect to the maximum likelihood estimator for location, differs by \(O({N}^{-1/4})\) from the deficiency of the parent tests.

The paper deals with a technically highly complicated subject and is therefore not easy to read. At the time of appearance it had the dubious distinction of being the second longest paper published in the Annals. With 49 pages it was second only to Larry Brown’s 50 pages on the admissibility of invariant estimators (Brown 1966). However, for those interested in expansions and higher order asymptotics it contains a veritable treasure of technical achievements that improve our understanding of asymptotic statistics. I hope this review will facilitate the reading. While I’m about it, let me also recommend reading the companion paper (Bickel and van Zwet 1978) where the same program is carried out for two-sample rank tests. With its 68 pages it was regrettably the longest paper in the Annals at the time it was published, but don’t let that deter you! Understanding the technical tricks in this area will come in handy in all sorts of applications.

1.1.3 Edgeworth Expansions in Nonparametric Statistics

This paper is a very readable review of the state of the art at the time in the area of Edgeworth expansions. It discusses the extension of Cramér’s work to sums of i.i.d. random vectors, as well as expansions for M-estimators. It also gives a preview of the results of the paper we have just discussed on one-sample rank tests and the paper we just mentioned on two-sample rank tests. There is also a new result of Bickel on U-statistics that may be viewed as the precursor of a move towards a general theory of expansions for functions of independent random variables. As we have already discussed Cramér’s work as well as rank statistics, let me restrict the discussion of the present paper to the result on U-statistics.

First of all, recall the classical Berry-Esseen inequality for normalized sums \({S}_{N} = {N}^{-1/2}\) ⋅ ∑1 ≤ j ≤ N X j of i.i.d. random variables X 1, ⋯, X N , with EX 1 = 0 and EX 1 2 = 1. If E | X 1 | 3 < ∞, and Φ denotes the standard normal distribution function, then there exists a constant C such that for all N,

$${ \sup }_{x}\vert P({S}_{N} \leq x) - \Phi (x)\vert \leq CE\vert {X}_{1}{\vert }^{3}{N}^{-1/2}.$$
(1.11)

In the present paper a bound of Berry-Esseen type is proved for U-statistics. Let X 1, X 2, ⋯ be i.i.d. random variables with a common distribution function F and let ψ be a measurable, real-valued function on R 2 that is bounded, say | ψ | ≤ M < ∞, and symmetric, i.e. ψ(x, y) = ψ(y, x). Define

$$\gamma (x) = E(\psi ({X}_{1},{X}_{2})\vert {X}_{1} = x) =\int \psi (x,y)\ dF(y)$$

and suppose that \(E\psi ({X}_{1},{X}_{2}) = E\gamma ({X}_{1}) = 0\). Define a normalized U-statistic T N by

$${T}_{N} = {\sigma }_{N}^{-1}{ \sum \nolimits }_{1\leq i<j\leq N}\psi ({X}_{i},{X}_{j})\quad \text{ with}\ {\sigma }_{N}^{2} = E\{{\sum \nolimits }_{1\leq i<j\leq N}\psi {({X}_{i},{X}_{j})\}}^{2},$$
(1.12)

and hence ET N  = 0 and ET N 2 = 1. In the paper it is proved that if Eγ2(X 1) > 0, then there exists a constant C depending on ψ but not on N such that

$${ \sup }_{x}\vert P({T}_{N} \leq x) - \Phi (x)\vert \leq C{N}^{-1/2}.$$
(1.13)

When comparing this result with the Berry-Esseen bound for the normalized sum S N , one gets the feeling that the assumption that ψ is bounded is perhaps a bit too restrictive and that it should be possible to replace it by one or more moment conditions. But it was a good start and improvements were made in quick succession. The boundedness assumption for ψ was dropped: Chan and Wierman (1977) proved the result under the conditions that Eγ2(X 1) > 0 and E{ψ(X 1, X 2)}4 < ∞. Next Callaert and Janssen (1978) showed that Eγ2(X 1) > 0 and E | ψ(X 1, X 2) | 3 < ∞ suffice. Finally Helmers and van Zwet (1982) proved the bound under the assumptions Eγ2(X 1) > 0, E | γ(X 1) | 3 < ∞ and Eψ(X 1, X 2)2 < ∞.
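A Monte Carlo sketch of the bound 1.13, using the illustrative kernel ψ(x, y) = (x − y)2 ∕ 2 − 1 under a standard normal F (an assumed choice, not from the paper; this kernel is unbounded but satisfies the moment conditions of Helmers and van Zwet):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def t_n(x):
    """Normalized U-statistic T_N of (1.12) for psi(x, y) = (x-y)**2/2 - 1.
    Under the standard normal F: E psi = 0, gamma(x) = (x**2 - 1)/2,
    zeta_1 = E gamma**2 = 1/2 and zeta_2 = E psi**2 = 2, so the standard
    variance formula gives sigma_N**2 = C(N,2)*(2*(N-2)*zeta_1 + zeta_2)
                                      = C(N,2)*N.
    """
    n = len(x)
    d2 = np.subtract.outer(x, x) ** 2 / 2 - 1
    s = (d2.sum() + n) / 2            # sum over i < j (each diagonal term is -1)
    sigma = math.sqrt(math.comb(n, 2) * n)
    return s / sigma

# Monte Carlo check of the Berry-Esseen-type behavior in (1.13):
N, reps = 50, 2000
samples = np.sort([t_n(rng.standard_normal(N)) for _ in range(reps)])
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
ecdf = np.arange(1, reps + 1) / reps
sup_dist = max(abs(ecdf[i] - Phi(samples[i])) for i in range(reps))
print(sup_dist < 0.1)  # consistent with an O(N**(-1/2)) sup-norm error
```

The observed sup-distance is small but clearly nonzero, reflecting the skewness of the dominant linear part Σγ(X i ) that the first Edgeworth correction term would capture.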

Why is this development of interest? The U-statistics discussed so far are a special case of U-statistics of order k which are of the form

$$T ={ \sum \nolimits }_{{ 1\leq j(1)<j(2)< \atop \cdots <j(k)\leq N} }{\psi }_{k}({X}_{j(1)},{X}_{j(2)},\cdots \,,{X}_{j(k)}),$$
(1.14)

where ψ k is a symmetric function of k variables with Eψ k (X 1, X 2, ⋯, X k ) = 0 and the summation is over all distinct k-tuples chosen from X 1, X 2, ⋯, X N . Clearly the U-statistics discussed above have degree k = 2, but extension of the Berry-Esseen inequality to U-statistics of fixed finite degree k is straightforward. In an unpublished technical report (Hoeffding 1961) Wassily Hoeffding showed that any symmetric function T = t(X 1, ⋯, X N ) of N i.i.d. random variables X 1, ⋯, X N that has ET = 0 and finite variance \({\sigma }^{2} = E{T}^{2} -\{ E{T\}}^{2} < \infty \) can be written as a sum of U-statistics of orders k = 1, 2, ⋯, N in such a way that all terms involved in this decomposition are uncorrelated and have several additional desirable properties. Hence it seems that it might be possible to obtain results for symmetric functions of N i.i.d. random variables through a study of U-statistics. For the Berry-Esseen theorem this was done in van Zwet (1984) where the result was obtained under fairly mild moment conditions that reduce to the best conditions for U-statistics when specialized to this case. A first step for obtaining Edgeworth expansions for symmetric functions of i.i.d. random variables was taken in Bickel et al. (1986) where the case of U-statistics of degree k = 2 was treated. More work is needed here.