1 Introduction

In the early 1930s R. A. Fisher discovered a very general exact method of testing hypotheses based on permuting the data in ways that do not change its distribution under the null hypothesis. This permutation method does not require standard parametric assumptions such as normality of the data. It does require, however, certain invariance properties under the null hypothesis that restrict application to fairly simple designs. But in such situations, the method results in exact tests with level α under very weak distributional assumptions. Moreover, the method is statistic-inclusive in the sense that any test statistic can be used and inherits the level-α property, although some statistics are much more powerful than others.

Tests based on this method are called permutation tests or randomization tests depending on whether the data can be viewed as samples from populations or not. That is, when sampling from populations, “permutation tests” refer to use of the permutation method to obtain level α tests under weak distributional assumptions. In Fisher’s words (1935, Sec. 21), these are tests of a “wider” null hypothesis (as compared to assuming normal distributions, for example).

However, experiments may be performed on units that cannot be viewed as arising from random sampling of any population. In such situations “randomization inference” refers to inference based only on the physical randomization of the units to different treatments, and on the test statistic calculated at all possible randomizations of the data. The same test that we called a permutation test in random sampling contexts is now called a randomization test. Of course one needs to qualify all statements of significance about such experiments with the disclaimer that randomization inference only applies to the units used in the experiment.

Permutation tests are the foundation of classical nonparametric statistics (also called distribution-free statistics), which itself is often identified with rank tests. Rank tests are actually a special subclass of permutation tests with three distinct advantages:

  1.

    For data without ties, the conditional permutation distribution of a rank test is actually unconditional (does not change from sample to sample) because the ranks of a continuous data set are the same for every sample. Thus, the distribution of an important rank statistic like the Wilcoxon Rank Sum statistic can be tabulated or programmed. However, this computing advantage is less important today, and when there are ties in the data (a very common occurrence), the tabulated values are not appropriate, and the conditional permutation distribution is required for exact inference.

  2.

    The key philosophical foundation of rank tests arises from the theory of invariant tests as described in Lehmann [1986, Ch. 5]. The idea with invariant tests is to reduce the class of tests considered to those that are naturally invariant with respect to a group of transformations G on the sample space of the data. Given G, a maximal invariant is a statistic \(M(x)\) with the property that any invariant test with respect to G must be a function of \(x\) only through \(M(x)\). Now consider the two-sample problem with \({H}_{0} : {F}_{\mbox{ X}}(x) = {F}_{\mbox{ Y}}(x)\) versus the alternative “F Y is stochastically larger than F X,” that is, \({H}_{a} : 1 - {F}_{\mbox{ Y}}(x) \geq 1 - {F}_{\mbox{ X}}(x)\) for all x with strict inequality for at least one x. This alternative is more general than the usual shift alternative, \({F}_{\mbox{ Y}}(x) = {F}_{\mbox{ X}}(x - \Delta )\), but it certainly includes the shift alternative as a special case. Let G be the group of transformations such that each g ∈ G is continuous and strictly increasing. For this testing problem and group G, the set of ranks of the combined X and Y samples is the maximal invariant statistic. Thus, any invariant test must be a function of the ranks. Does it make sense to require tests to be invariant with respect to monotone transformations? Whenever data are ordinal or we do not trust the measurement scale, then invariance certainly makes sense, and rank tests are the obvious choice.

  3.

    Rank tests may be preferred in many situations because of their Type II error robustness. That is, for an appropriate data generation model, the permutation method can make any statistic Type I error robust (level α), but because rank tests are a function of the data only through the ranks, the influence of outliers is automatically limited. Thus, rank tests are power robust in outlier-prone situations. The key example is the Wilcoxon Rank Sum test, which is powerful in the face of a wide variety of distributional shapes. In fact, Hodges and Lehmann [1956] showed that the asymptotic relative efficiency (ARE) of the Wilcoxon Rank Sum test to the t test satisfies the following:

    a)

      ARE = .955 for normal shift alternatives, and thus the Wilcoxon Rank Sum test loses little in comparison to the t test where the t is best;

    b)

      and ARE ≥ .864 for any continuous unimodal shift alternative with finite variance, and thus the Wilcoxon Rank Sum test can never be much worse than the t test but possibly much better.

    Optimality for permutation and rank procedures is discussed in more detail later.

Although the term “nonparametric” was classically associated with permutation and rank procedures, in recent times it is more commonly used for nonparametric density and regression estimation methods based on smoothing. Thus, when describing rank or permutation procedures, it is best to use the specific names “rank” or “permutation” rather than “nonparametric.” Although permutation tests are inherently defined in terms of randomization, they overlap with a variety of conditional procedures and uniformly most powerful unbiased (UMPU) “Neyman structure similar” tests based on exponential family theory (the most well known is Fisher’s Exact Test).

Permutation procedures are very computationally intensive. These extensive computations prevented widespread use of the method until the 1990s. Thus, asymptotic approximations were dominant until the 1990s, although exact small-sample distributions were tabled for a number of important rank test statistics.

The asymptotic approximations are basically of three kinds: normal approximations based on the Central Limit Theorem, F or beta approximations based on matching permutation moments with normal theory moments, and Edgeworth expansions that improve on the normal approximations. The normal approximations have been used the most due to their simplicity. However, the F approximations initiated by Pitman [1937a,b] and Welch [1937] in the 1930s and updated by Box and Andersen [1955] are generally better for situations where they apply. The Edgeworth approximations are very good for the Wilcoxon Rank Sum and Wilcoxon Signed Rank statistics, but are somewhat more complicated for other statistics and seem not to be in general usage. Thus, we emphasize the F approximations rather than the normal or Edgeworth approximations. In fact these F approximations appear to be underused in general, but the work of Conover and Iman [1981] may have rekindled their use. Asymptotic normal theory remains important for comparing different methods according to asymptotic power, rather than for finding critical values. We give an overview of these results and then a few technical details in an appendix. There are excellent texts such as Hajek and Sidak [1967] and Randles and Wolfe [1979] that carefully explain asymptotic normality proof techniques for rank statistics. We add that most nonparametric texts of the last forty years are mainly about rank statistics, although Lehmann [1975] and Pratt and Gibbons [1981] have portions devoted to permutation tests. Puri and Sen [1971] emphasize the theory of permutation tests in multivariate settings.

In our current situation of extensive computing power, Monte Carlo approximations are the most important alternative to exact calculations. By Monte Carlo approximation we mean random sampling from the set of all permutations. This method can be used for any statistic in a situation where permutation methods are appropriate. Moreover, the error of approximation can be reduced by just adding more replications. This sampling (or resampling) in the “permutation world” is very similar to sampling in the bootstrap world; the main difference is that bootstrap p-values are typically approximate, even in the limit as the number of resamples B goes to ∞. In contrast, the limiting p-value in the permutation world is exact, and even the finite-B estimated p-value has an exact interpretation.

Thus, our treatment of nonparametric methods is quite a bit different from most texts written in the last half of the twentieth century, which have emphasized rank tests and asymptotic normal approximations. We believe the basic permutation approach is the most important idea because it provides Type I error robustness for any statistic. Monte Carlo approximations can handle any problem for which the exact permutation distribution is too difficult to compute. Rank methods are still very important, but now because they provide Type II error robustness (good power in the face of outliers), not because they are easy to use or their distributions are tabled.

We start first with the two-sample problem to illustrate the basic permutation test approach. We then give some general theory for permutation tests along with approximations and discuss optimality results. Then we review results for the most important designs admitting permutation tests, their use in contingency tables, and estimators and confidence procedures derived from inverting permutation and rank tests.

2 A Simple Example: The Two-Sample Location Problem

We illustrate here the basic permutation approach with a simple two-treatment experiment.

A clever middle school student believes that she has discovered a new method for teaching fractions to third graders. To test her hypothesis, she selects six students from her father’s third grade class and randomly assigns four to learn the new method and two to use the standard method. After training, both groups are given twenty test problems. The scores for the standard method group are \({x}_{1} = 6\), \({x}_{2} = 8\) and for the new method group are \({y}_{1} = 7\), \({y}_{2} = 18\), \({y}_{3} = 11\), \({y}_{4} = 9\). The results look promising for the new method, but how shall we assess statistical significance?

One possible test statistic is the standard two-sample t,

$$t(X,Y ) = \frac{\overline{Y } -\overline{X}} {\sqrt{{s}_{p }^{2 }\left ( \frac{1} {m} + \frac{1} {n}\right )}},$$
(12.1)

where \({s}_{p}^{2} =\{ \sum\nolimits {({X}_{i} -\overline{X})}^{2} + \sum\nolimits {({Y }_{j} -\overline{Y })}^{2}\}/(m + n - 2)\). If t is large, then one might be convinced that the new method is better than the standard one.

Another commonly used statistic is W = the sum of the ranks of the Y values when both X and Y samples are thrown together and ranked from smallest to largest. Let \(Z\) denote the joint sample of both \(X\) and \(Y\) together: \(Z = (X,Y )\) with observed values here (6, 8, 7, 18, 11, 9). The ranks of these observed values are then (1, 3, 2, 6, 5, 4) and \(W = 2 + 6 + 5 + 4 = 17\), the sum of the Y ranks. If the new teaching method is better, then on average we would expect W to be large. Assuming that either t or W is a reasonable statistic for our testing problem, we still need to agree on what is a proper reference distribution for each. A simple but very general approach is to recognize that there were actually \(\left({{6}\atop {2}}\right) = 15\) different ways that two students could have been selected from the original six to go in the X sample (with the remaining four assigned to the Y sample). Table 12.1 is a listing of the possible samples and the values of t and W for both.

Table 12.1 All Possible Permutations for Example Data
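Before looking at the full permutation distribution, the observed values t = 1.17 and W = 17 are easy to reproduce. The following R sketch is ours, not the text's, and uses only base R:

x <- c(6, 8)                     # standard method scores
y <- c(7, 18, 11, 9)             # new method scores
m <- length(x); n <- length(y)
sp2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2))/(m + n - 2)
t0 <- (mean(y) - mean(x))/sqrt(sp2*(1/m + 1/n))   # two-sample t of (12.1)
W0 <- sum(rank(c(x, y))[(m + 1):(m + n)])         # sum of the Y ranks
round(t0, 2)    # 1.17
W0              # 17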

If the treatments produce identical results, then the outcomes for each student would have been exactly the same for any of the 15 possible randomizations. Thus, a suitable reference distribution for t or W is just the possible 15 values of t or W along with the probability 1/15 of each. This reference distribution for t, called the permutation distribution, is in Table 12.2.

Table 12.2 Permutation Distribution of t

Note that the permutation distribution of t is discrete even when sampling from a continuous distribution. (Here the distribution of the data is also discrete because the possible test scores are 0, 1, …, 20.)

Using the distribution in Table 12.2, a conditional test for this experiment with \(\alpha= 1/15\) would be to reject if t ≥ 1.47. A one-sided p-value for the observed value of t = 1.17 is 2/15. Similarly a conditional \(\alpha= 1/15\) level test based on the rank sum W would reject if W ≥ 18, and the one-sided p-value is 2/15.
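Both p-values can be checked by brute force, since there are only 15 assignments to enumerate. This sketch is ours (it assumes the x, y, t0, and W0 objects from the snippet above):

z <- c(x, y)
idx <- combn(6, 2)    # each column gives the positions assigned to X
tvals <- apply(idx, 2, function(i) {
  xs <- z[i]; ys <- z[-i]
  sp2 <- (sum((xs - mean(xs))^2) + sum((ys - mean(ys))^2))/4
  (mean(ys) - mean(xs))/sqrt(sp2*(1/2 + 1/4))
})
Wvals <- apply(idx, 2, function(i) sum(rank(z)[-i]))
mean(tvals >= t0)    # 2/15 = 0.133
mean(Wvals >= W0)    # 2/15 = 0.133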

In general, the tests based on t and W would not give exactly the same results. For example, suppose the original data had been the 14th permutation, (7,9,6,18,11,8). Then the permutation p-value for t would be \(5/15 =.33\), whereas the permutation p-value for W would be \(6/15 =.40\). Note, however, the column in Table 12.1 (p. 453) for the sum of the Y values. Comparing the \(\sum {Y }_{i}\) and t values, one can see that the permutation p-values from \(\sum {Y }_{i}\) and t are identical if the original data had been any of the 15 permutations. In such a case, we say that the two statistics are permutationally equivalent because they give exactly the same testing results.

In Problem 12.1 (p. 523) we ask for the permutation distribution of W from Table 12.1 (p. 453). A unique feature of rank statistics when there are no ties in the data is that the permutation distribution is the same for every such data set. That is, although the data values would change for every data set, as long as there are no ties in the 6 data points, the ranks would always be (1,2,3,4,5,6). Thus, the results for W in Table 12.1 (p. 453) would be exactly the same except in a different order, and therefore the distribution would be the same. This is one reason that rank statistics gained popularity: without ties, the exact distribution does not change and can then be tabled for easy lookup.

For simplicity we purposely started with a data set having no ties. However, ties occur frequently in real data, even in continuous data settings, due to rounding or inaccurate measurement. The standard way to rank data with ties is to assign the average rank to each of a set of tied values. For example, suppose our second X data point had been 7 instead of 8. Then the Z vector would have been (6,7,7,18,11,9), and instead of (1,3,2,6,5,4) for the ranks we would have (1,2.5,2.5,6,5,4). These are now called the midranks. We have taken the values 7 and 7 that would have occupied ranks 2 and 3 and replaced them by \((2 + 3)/2 = 2.5\). If the first X data point had also been a 7, then the midrank vector would have been (2,2,2,6,5,4), where we have used \((1 + 2 + 3)/3 = 2\) for the first three midranks. The use of midranks has no effect on the general permutation approach, but tabling distributions as mentioned in the previous paragraph is no longer possible since every configuration of tied values has a different permutation distribution.
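In R, midranks are the default behavior of rank() (ties.method = "average"); this one-line check is ours, not the text's:

rank(c(6, 7, 7, 18, 11, 9))    # 1.0 2.5 2.5 6.0 5.0 4.0
rank(c(7, 7, 7, 18, 11, 9))    # 2 2 2 6 5 4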

3 The General Two-Sample Setting

The two-sample problem assumes that N experimental units (rats, for example) are available to compare two treatments A and B. First, m units are randomly assigned to receive treatment A, and the \(n = N - m\) remaining units are assigned to receive treatment B. After the experiment is run, we obtain realizations of some measurement X 1, …, X m for treatment A and Y 1, …, Y n for treatment B. The null hypothesis H 0 is that both treatments are the same or have identical effects on the rats. In other words, if the third rat in group A, whose measurement is X 3, had been assigned to group B instead, then X 3 would still have been the result under H 0 for that rat, but now it would have a Y label. In fact, we can think of all possible \(\left({{N}\atop {m}}\right)\) random assignments of m rats to group A and n rats to group B, and assume that under H 0 the individual results would be the same regardless of group assignment.

We might then formulate a test procedure as follows.

  1.

    Randomly assign m units to A and n units to B.

  2.

    Run the experiment to obtain X 1, …, X m and Y 1, …, Y n .

  3.

    Think of the collection \(Z = ({X}_{1},\ldots,{X}_{m},{Y }_{1},\ldots,{Y }_{n})\) as fixed and order the \(M_{N} = \left({{N}\atop {m}}\right)\) values of some statistic T calculated for each \({Z}^{{_\ast}}\) obtained by permuting \(Z\) to have different sets of m first coordinates. Call these ordered values \({T}_{(1)} \leq{T}_{(2)} \leq \ldots\leq{T}_{({M}_{N})}\), and let \({T}_{0} = T(X,Y )\) be the statistic calculated for the original data.

  4.

    Reject H 0 if \({T}_{0} > {T}_{(k)}\).

This test, conditional on \(Z\), has conditional α-level

$$1 - \frac{k} {{M}_{N}}$$

if \({T}_{(k)} < {T}_{(k+1)}\) (not tied) since \({M}_{N} - k\) values of T are larger than \({T}_{(k)}\). The exact conditional p-value is the proportion of values greater than or equal to T 0,

$$\frac{[\#{T}_{(i)} \geq{T}_{0}]} {{M}_{N}}.$$
(12.2)

When T is the t statistic in (12.1, p. 452), the above two-sample permutation procedure was proposed by Pitman [1937a]. The credit for the permutation approach, however, goes to R. A. Fisher, who had earlier introduced it in the fifth edition of Statistical Methods for Research Workers (2 ×2 table example) published in 1934 and in the first edition of The Design of Experiments (one-sample t example) in 1935.

Besides computational problems, the main drawbacks of the procedure described in steps 1-4 above are that:

a):

the results pertain to theN units obtained and not to a larger population;

b):

computations of test power are difficult.

Thus, it is often useful to assume a population sampling model of the usual form

$${X}_{1},\ldots,{X}_{m}\;\;\;\mbox{ iid}\;\;{F}_{\mbox{ X}}(x) = P({X}_{1} \leq x),$$
$${Y }_{1},\ldots,{Y }_{n}\;\;\;\mbox{ iid}\;\;{F}_{\mbox{ Y}}(x) = P({Y }_{1} \leq x),$$

with \({H}_{0} : {F}_{\mbox{ X}}(x) = {F}_{\mbox{ Y}}(x)\). Under this model we can show that the conditional permutation test actually has exact size α unconditionally, i.e.,

$$P(\mbox{ rejection}\mid {H}_{0}) = \alpha.$$

The permutation approach has the advantage that no assumption regarding distributions of random variables is required. Moreover, one can often show using permutational Central Limit Theorems (e.g., Theorem 12.2, p. 465) that the conditional distribution of \(T(X,Y )\), properly standardized, converges to a standard normal as min(m, n) → ∞. Thus, in large samples one can use normal critical values rather than list all M N possible values of T. Alternatively, one can randomly sample B of the possible permutations and base a test on the ordered values of \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\). First we give the general theory of permutation tests and then discuss these approximations as well as the Box-Andersen F approximation.

4 Theory of Permutation Tests

4.1 Size α Property of Permutation Tests

In this subsection we show that permutation tests used in random sampling contexts can have exact size α when randomizing on rejection region boundaries, and otherwise have level α when the test is carried out without such randomization. Recall that a size α test is one for which \({\sup }_{{H}_{0}}P(\mbox{ reject }{H}_{0}) = \alpha \) and level α means \({\sup }_{{H}_{0}}P(\mbox{ reject }{H}_{0}) \leq\alpha \). The reference to randomization merely refers to flipping a biased coin for sample points on the boundary between the rejection and acceptance regions in order to obtain size α and has nothing to do with the randomization used in the definition of a permutation test.

To prove size-α results rigorously, we need some additional notation. Two useful sources are Hoeffding [1952] and Puri and Sen [1971]. Let \(Z = {({Z}_{1},\ldots,{Z}_{N})}^{T}\) have joint distribution function \({F}_{Z}(z)\) and sample space S. Let G be a group of M N transformations of S onto S such that under H 0 the distribution of each \({g}_{i}(Z)\), g i  ∈ G, i = 1, …, M N , is exactly the same as the distribution of \(Z\). Two examples of such groups are as follows.

Permutations: G consists of all N! permutations of \(Z\). If \(Z\) is exchangeable or iid, then \({g}_{i}(Z)\stackrel{d}{=}Z\). However, in the two-sample problem (two independent samples), we usually consider only the \(\left({{N}\atop {m}}\right)\) partitions into two groups since the statistics used do not change by permuting elements within each sample. In the k-sample problem (k independent samples), we consider only the

$$\left ({ N \atop {n}_{1}{n}_{2}\ldots {n}_{k}} \right ) = \frac{N!} {{n}_{1}!\cdots {n}_{k}!}$$

partitions into k groups, where \({n}_{1} + {n}_{2} + \cdots+ {n}_{k} = N\). The group of N! permutations is relevant for the two-sample, k-sample, and correlation problems.

Sign Changes: G consists of all \({2}^{N}\) sign change transformations, \({g}_{1}(Z) = ({Z}_{1},{Z}_{2},\ldots,{Z}_{N})\), \({g}_{2}(Z) = (-{Z}_{1},{Z}_{2},\ldots,{Z}_{N})\), \({g}_{3}(Z) = ({Z}_{1},-{Z}_{2},{Z}_{3},\ldots,{Z}_{N})\), etc. If the Z i ’s are independently (but not necessarily identically) distributed, where each Z i is symmetrically distributed about 0, then \({g}_{i}(Z)\stackrel{d}{=}Z\). The sign change group is relevant for the paired two-sample problem and the one-sample symmetry problem.

The following development is due to Hoeffding [1952]. Because the permutation distribution is discrete, it is not possible to achieve arbitrarily chosen α-levels like α = .05 without using a randomized testing procedure. This makes the details seem harder than they really are.

Let \(T(z)\) be a real-valued function on S such that for each \(z \in S\)

$${T}_{(1)}(z) \leq{T}_{(2)}(z) \leq \cdots \leq {T}_{({M}_{N})}(z)$$

are the ordered values of \(T({g}_{i}(z)),i = 1,\ldots,{M}_{N}\). Given α, 0 < α < 1, let k be defined by

$$k = {M}_{N} - [{M}_{N}\alpha ],$$

where [ ⋅] is the greatest integer function. Let \({M}_{N}^{+}(z)\) and \({M}_{N}^{0}(z)\) be the numbers of \({T}_{(j)}(z),j = 1,\ldots,{M}_{N},\) which are greater than \({T}_{(k)}(z)\) and equal to \({T}_{(k)}(z)\), respectively. Define

$$a(z) = \frac{{M}_{N}\alpha- {M}_{N}^{+}(z)} {{M}_{N}^{0}(z)}.$$

Then define the test function\(\phi (z)\) by

$$\phi (z) = \left \{\begin{array}{ll} 1, &\;\;\mbox{ if}\ T(z) > {T}_{(k)}(z); \\ a(z),&\;\;\mbox{ if}\ T(z) = {T}_{(k)}(z); \\ 0, &\;\;\mbox{ if}\ T(z)< {T}_{(k)}(z). \end{array} \right.$$

Note that for a test function, \(\phi (z) = 1\) means rejection of H 0, \(\phi (z) = 0\) means acceptance of H 0, and \(\phi (z) = \pi \) means to reject H 0 randomly with probability π. The test defined by ϕ is an exact conditional level α test by construction. The following theorem tells us that under \({g}_{i}(Z)\stackrel{d}{=}Z\) for each g i  ∈ G, the test is unconditionally a size-α test.

Theorem 12.1.

(Hoeffding). Let the data \(Z = ({Z}_{1},\ldots,{Z}_{N})\) and the group G of transformations be such that \({g}_{i}(Z)\stackrel{d}{=}Z\) for each g i ∈ G under H 0 . Then the test defined above by \(\phi (Z)\) has size α.

Proof.

First note that by the definition of \(a(z)\) and ϕ, we have for each \(z \in S\)

$$\frac{1} {{M}_{N}} \sum\limits _{i=1}^{{M}_{N} }\phi ({g}_{i}(z)) = \frac{{M}_{N}^{+}(z) + a(z){M}_{N}^{0}(z)} {{M}_{N}} = \alpha.$$

Now since \({g}_{i}(Z)\stackrel{d}{=}Z\) and G is a group, \({\mbox{ E}}_{{H}_{0}}\phi (Z) ={ \mbox{ E}}_{{H}_{0}}\phi ({g}_{i}(Z))\) for each i, and

$$\begin{array}{rcl}{ P}_{{H}_{0}}(\mbox{ rejection}) = {E}_{{H}_{0}}\phi (Z)& =& \frac{1} {{M}_{N}} \sum\limits _{i=1}^{{M}_{N} }\mathrm{{E}}_{{H}_{0}}\phi ({g}_{i}(Z)) \\ & =&{ \mbox{ E}}_{{H}_{0}}\left [ \frac{1} {{M}_{N}} \sum\limits _{i=1}^{{M}_{N} }\phi ({g}_{i}(Z))\right ] = \alpha \end{array}$$

The above proof is deceptively simple. The key fact that makes it work is that E\({}_{{H}_{0}}\phi ({g}_{i}(Z))\) is the same for each g i , including the identity \(g(Z) = Z\). This fact rests on the identical distribution of \({g}_{i}(Z)\) for each i and on the group nature of G. The identical distribution requirement is intuitive, but why do we need G to be a group? Recall that the test procedure consists of computing T for each member of G and then rejecting if \(T(Z)\) is larger than an order statistic of the \(T({g}_{i}(Z))\) values. Now \(\phi ({g}_{i}(Z))\) is the test that computes \(T({g}_{j}({g}_{i}(Z)))\), j = 1, …, M N , orders all of them, and rejects if \(T({g}_{i}(Z))\) is larger than one of the ordered values. If G is not a group, then the set of ordered values will not be the same for each test \(\phi ({g}_{i}(Z))\) because g j (g i ) will not be in G for some i and j. Since the sets of ordered values could be different, there would be no basis for believing that a test based on \({g}_{i}(Z)\) would have the same expectation as that based on \(Z\).

Note also that the use of \(a(z)\) in \(\phi (z)\) is a way of randomizing to get an exact size-α test. In practice we might just define \(\phi (z)\) to be one if \(t(z) > {t}_{(k)}(z)\) and zero otherwise. The resulting unconditional level is a weighted average of the discrete levels less than or equal to α and will usually be less than α.

The conditional test procedure described in steps 1-4 may be used for any test statistic, but the rejection region in Step 4 should be modified to correspond to the situation. For example, the alternative hypothesis might be that the mean of A is less than that of B. We would then look for small values of t. Or the test could be two-sided, and we would reject if \(t < {t}_{(k)}\) or if \(t > {t}_{(m)}\).

4.2 Permutation Moments of Linear Statistics

The exact permutation distribution may be difficult to compute. For certain linear statistics, though, we can calculate the moments of the permutation distribution quite easily. These moments are then used in the various normal and F approximations found in later sections.

We consider general results for situations associated with the group of transformations consisting of all permutations. These situations include the two-sample and k-sample situations, and bivariate data (X 1, Y 1), …, (X N , Y N ) where correlation and regression of Y on X are of interest. Let \(a = ({a}_{1},\ldots,{a}_{N})\) and \(c = ({c}_{1},\ldots,{c}_{N})\) be two vectors of real constants. We select a random permutation of the a values, call them A 1, …, A N , and form the statistic

$$T = \sum\limits _{i=1}^{N}{c}_{ i}{A}_{i}.$$
(12.3)

In applications \(a\) is actually the observed vector \(Z\) (or a function of \(Z\) such as the rank vector), and \(c\) is chosen for the particular problem at hand. For example, in the two-sample problem, with \(a = Z\) and c i  = 0 for i = 1, …, m and 1 otherwise, the observed value of T for the original data is \(\sum\nolimits _{i=1}^{n}{Y }_{i}\), and here \(T = \sum\limits _{i=m+1}^{N}{A}_{i}\) is a sum of the last n elements of a random permutation of \(Z\). A very important subclass of (12.3) are the linear rank statistics given in the next section.

Assuming that each permutation of \(A\) is equally likely and thus has probability \(1/N!\), it is easy to see that

$$P({A}_{i} = {a}_{s}) = \frac{1} {N}\quad \mbox{ for}\;s = 1,\ldots,N,$$

and

$$P({A}_{i} = {a}_{s},{A}_{j} = {a}_{t}) = \frac{1} {N(N - 1)}\quad \mbox{ for}\;s\neq t = 1,\ldots,N.$$

Then, using those two results, we get

$$\mbox{ E}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{a}_{ i} \equiv \overline{a},\quad \mbox{ for}\;i = 1,\ldots,N,$$
$$\mbox{ Var}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{({a}_{ i} -\overline{a})}^{2},\quad \mbox{ for}\;i = 1,\ldots,N,$$

and

$$\mbox{ Cov}({A}_{i},{A}_{j}) = \frac{-1} {N(N - 1)}\sum\limits _{i=1}^{N}{({a}_{ i} -\overline{a})}^{2},\quad \mbox{ for}\;i\neq j = 1,\ldots,N.$$

Finally, putting these last three results together, we get

$$\mbox{ E}(T) = N\overline{c}\;\overline{a},$$

and

$$\mbox{ Var}(T) = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{({c}_{ i} -\overline{c})}^{2} \sum\limits _{j=1}^{N}{({a}_{ j} -\overline{a})}^{2},$$
(12.4)

where \(\overline{a}\) and \(\overline{c}\) are the averages of the a’s and c’s, respectively. These first two moments of T are sufficient for normal approximations based on the asymptotic normality of T as N → ∞. In some cases it may be of value to use more complex approximations involving the third and fourth moments of T. Thus, the central third moment is

$$\mathrm{E}\{T -\mathrm{ E}{(T)\}}^{3} = \frac{N} {(N - 1)(N - 2)}\sum\limits _{i=1}^{N}{({c}_{ i} -\overline{c})}^{3} \sum\limits _{j=1}^{N}{({a}_{ j} -\overline{a})}^{3},$$

and the standardized third moment (skewness coefficient) is

$$\mathrm{Skew}(T) = \frac{\mathrm{E}\{T -\mathrm{ E}{(T)\}}^{3}} {\{\mathrm{Var}{(T)\}}^{3/2}} = \frac{{(N - 1)}^{1/2}} {(N - 2)} \frac{{\mu }_{3}(c){\mu }_{3}(a)} {\{{\mu }_{2}(c){\mu }_{2}{(a)\}}^{3/2}},$$

where we have introduced the notation \({\mu }_{q}(c) = {N}^{-1} \sum\limits _{i=1}^{N}{({c}_{i} -\overline{c})}^{q}\) for q ≥ 2. Similarly the standardized central fourth moment (kurtosis coefficient) is

$$\begin{array}{rcl} \mathrm{Kurt}(T) = \frac{\mathrm{E}\{T -\mathrm{ E}{(T)\}}^{4}} {\{\mathrm{Var}{(T)\}}^{2}} & =& \frac{(N + 1)(N - 1)} {N(N - 2)(N - 3)} \frac{{\mu }_{4}(c){\mu }_{4}(a)} {\{{\mu }_{2}(c){\mu }_{2}{(a)\}}^{2}} \\ & -& \frac{3{(N - 1)}^{2}} {N(N - 2)(N - 3)}\left [ \frac{{\mu }_{4}(c)} {\{{\mu }_{2}{(c)\}}^{2}} + \frac{{\mu }_{4}(a)} {\{{\mu }_{2}{(a)\}}^{2}}\right ] \\ & +& \frac{3({N}^{2} - 3N + 3)(N - 1)} {N(N - 2)(N - 3)} \end{array}$$
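These formulas are easy to check numerically. The following sketch (ours, not the text's) compares the exact mean and variance from (12.4) with Monte Carlo estimates over random permutations, using the two-sample choice of c and the example data of Section 2:

set.seed(1)
a  <- c(6, 8, 7, 18, 11, 9)    # observed Z plays the role of a
cc <- c(0, 0, 1, 1, 1, 1)      # two-sample regression constants
N  <- 6
Tsim <- replicate(1e5, sum(cc * sample(a)))    # T = sum c_i A_i
c(mean(Tsim), N * mean(cc) * mean(a))                                # both near 39.33
c(var(Tsim), sum((cc - mean(cc))^2) * sum((a - mean(a))^2)/(N - 1))  # both near 25.29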

4.3 Linear Rank Tests

Many popular rank tests have the general form

$$T = \sum\limits _{i=1}^{N}c(i)a({R}_{ i})$$
(12.5)

of a linear rank statistic, where c(1), …, c(N) are called the regression constants and a(1), …, a(N) are called the scores, and \(R\) is the vector of ranks (possibly midranks due to ties) of some data vector \(Z\). There is room for confusion here in the use of the notation for \(c\) and \(a\), because in the general notation of the last section, (c 1, …, c N ) and (a 1, …, a N ) are vectors of real numbers, but here c( ⋅) and a( ⋅) are functions so that \({c}_{1} = c(1),\ldots,{c}_{N} = c(N)\) and \({a}_{1} = a(1),\ldots,{a}_{N} = a(N)\). This function notation just makes it easier to work with rank statistics. In particular, the score functions a( ⋅) are typically derived from scores generating functions ϕ via \(a(i) = \phi (i/(N + 1))\). In tied rank situations, a( ⋅) needs to be defined for non-integer values.

The simplest setting is the two-sample problem where \({Z}^{T} = ({X}_{1},\ldots,{X}_{m},\) \({Y }_{1},\ldots,{Y }_{n})\) and the c values are all zeroes for the Xs and ones for the Y s, or vice-versa. A different situation covered by T, though, is for trend alternatives, where c(1), …, c(N) are the integers 1, …, N and \(T = \sum\limits _{i=1}^{N}i{R}_{i}\) will tend to be large when Z i + 1 tends to be larger than Z i . A related problem is for N independent pairs (X 1, Y 1), …, (X N , Y N ). Here, tests based on Spearman’s Correlation (Section 12.7, p. 487) are equivalent to ones having the same null distribution as \(T = \sum\limits _{i=1}^{N}i{R}_{i}\).

Clearly T in (12.5) is a subclass of the linear permutation statistics given in (12.3, p. 458). Thus results for that class are inherited by T. For example, if \(R\) is uniformly distributed on the permutations of 1, …, N (no tied ranks), then

$$\mbox{ E}(T) = N\overline{c}\;\overline{a},$$

and

$$\mbox{ Var}(T) = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{(c(i) -\overline{c})}^{2} \sum\limits _{j=1}^{N}{(a(j) -\overline{a})}^{2},$$

where of course \(\overline{c}\) and \(\overline{a}\) are the means of the c and a values, respectively. For a tied rank situation with observed vector of midranks \(R\), the expressions above still hold but with a(j) replaced by a(R j ).

For deciding on a score function in a given problem, we first select a parametric family and then derive an optimal score function for that family. An overview of how to do this is given in Section 12.5 (p. 473). The most important linear rank statistic is the Wilcoxon Rank Sum, so we give a few more details about it in the next section.

4.4 Wilcoxon-Mann-Whitney Two-Sample Statistic

For two independent samples X 1, …, X m and Y 1, …, Y n , Wilcoxon [1945] introduced the linear rank statistic

$$W = \sum\limits _{i=m+1}^{N}{R}_{ i},$$
(12.6)

where R 1, …, R N are the joint rankings of \(Z = {({X}_{1},\ldots,{X}_{m},{Y }_{1},\ldots,{Y }_{n})}^{T}\), \(N = m + n\). The Wilcoxon Rank Sum test has a number of optimal properties that are mentioned in Section 12.5 (p. 473). Along with the Wilcoxon Signed Rank test for paired data (Section 12.8.3), it is the simplest and most important rank test.

Independently, Mann and Whitney [1947] proposed the equivalent statistic

$${W}_{\mathrm{YX}} = \sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}I({Y }_{ j}< {X}_{i}),$$
(12.7)

where I( ⋅) is the indicator function. In the absence of ties \({W}_{\mathrm{YX}} = mn + n(n + 1)/2 - W\). Another equivalent version is

$${W}_{\mathrm{XY}} = \sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}I({Y }_{ j} > {X}_{i}),$$
(12.8)

with \({W}_{\mathrm{XY}} = W - n(n + 1)/2\). We prefer this latter version and define the U-statistic estimator of \({\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1})\),

$$\widehat{{\theta }}_{\mathrm{XY}} = \frac{{W}_{\mathrm{XY}}} {mn} = \frac{1} {mn}\sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}I({Y }_{ j} > {X}_{i}).$$
(12.9)

In a clinical trial, θXY can be viewed as the probability of a more favorable response for a randomly selected patient getting Treatment 2 compared to another patient getting Treatment 1. For screening tests where a “positive” is declared if Y > c for a diseased subject or if X > c for a non-diseased subject, θXY is the area under the receiver operating characteristic (ROC) curve. This interpretation is developed in Problem 12.8 (p. 525).
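For the teaching example of Section 2, the relationships among W, W YX, W XY, and \(\widehat{{\theta }}_{\mathrm{XY}}\) are easy to verify numerically; this sketch is ours, not the text's:

x <- c(6, 8); y <- c(7, 18, 11, 9); m <- 2; n <- 4
W   <- sum(rank(c(x, y))[(m + 1):(m + n)])         # 17
Wxy <- sum(outer(x, y, function(xi, yj) yj > xi))  # 7 = W - n(n+1)/2
Wyx <- sum(outer(x, y, function(xi, yj) yj < xi))  # 1 = mn + n(n+1)/2 - W
theta.hat <- Wxy/(m*n)                             # 0.875 estimates P(Y > X)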

For hand computations, W is much easier to handle than these U-statistic versions. The null moments follow easily from Section 12.4.2 (p. 458) after noting that \(c(1) = \cdots= c(m) = 0\) and \(c(m + 1) = \cdots= c(N) = 1\) lead to \(\overline{c} = n/N\) and \(\sum\limits _{i=1}^{N}{(c(i) -\overline{c})}^{2} = mn/N\). The null mean is \(n(N + 1)/2\) whether there are ties or not. The variance follows from (12.4, p. 459). With no ties, we have

$$\mbox{ Var}(W) = \frac{mn(N + 1)} {12}.$$
(12.10)

With ties, so that (R 1, …, R N ) are the tied ranks, we have

$$\mathrm{Var}(W) = \frac{mn} {N(N - 1)}\left \{\sum\limits _{i=1}^{N}{R}_{ i}^{2} -\frac{N{(N + 1)}^{2}} {4} \right \}.$$
(12.11)

Lehmann [1975, p. 20] gives a different expression for the variance ofW in the face of ties,

$$\mathrm{Var}(W) = \frac{mn(N + 1)} {12} -\frac{mn\sum\limits _{i=1}^{e}({d}_{i}^{3} - {d}_{i})} {12N(N - 1)},$$
(12.12)

where e is the number of tied groups, and d i is the number of tied observations in group i. For example, with the simple example data modified to ({6, 7}, {7, 18, 11, 9}), the midranks are (1, 2.5, 2.5, 6, 5, 4) and e = 1, d 1 = 2 (counting only groups with ties, since untied groups contribute nothing to the sum); so \(\mathrm{Var}(W) = (2)(4)(6 + 1)/12 - (2)(4)[{2}^{3} - 2]/[12(6)(5)] = 4.53\). Expression (12.12) may be easier to use by hand than (12.11), but its main value may be to show that the variance of W for tied data is always smaller than (12.10) for untied data.
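Both variance formulas are easily checked in R; the sketch below (ours) reproduces Var(W) = 4.53 for the modified data:

R <- rank(c(6, 7, 7, 18, 11, 9))   # midranks (1, 2.5, 2.5, 6, 5, 4)
m <- 2; n <- 4; N <- 6
v11 <- m*n/(N*(N - 1)) * (sum(R^2) - N*(N + 1)^2/4)        # eq. (12.11)
d <- table(R)                      # group sizes; d^3 - d = 0 for untied groups
v12 <- m*n*(N + 1)/12 - m*n*sum(d^3 - d)/(12*N*(N - 1))    # eq. (12.12)
c(v11, v12)    # both 4.5333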

The U-statistic versions in (12.7)–(12.9) are useful for easy calculation of moments and derivation of asymptotic normality under non-null distributions. For example, using equation (3.4.7, p. 91) of Randles and Wolfe [1979] for the variance of a two-sample U-statistic from independent iid samples, we have that

$$\begin{array}{rcl} \mathrm{\mathrm{Var}}(\widehat{{\theta }}_{\mathrm{XY}}) = \frac{1} {mn}\left \{(m - 1)({\gamma }_{0,1} - {\theta }_{\mathrm{XY}}^{2}) + (n - 1)({\gamma }_{ 1,0} - {\theta }_{\mathrm{XY}}^{2}) + {\gamma }_{ 1,1} - {\theta }_{\mathrm{XY}}^{2}\right \},& & \\ & &\end{array}$$
(12.13)

where in the absence of ties \({\gamma }_{0,1} = P({Y }_{1} > {X}_{1},{Y }_{1} > {X}_{2})\), \({\gamma }_{1,0} = P({Y }_{1} > {X}_{1},{Y }_{2} > {X}_{1})\), and \({\gamma }_{1,1} = {\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1})\). If the X and Y samples have identical continuous distributions, then it is easy to show that \({\gamma }_{0,1} = {\gamma }_{1,0} = 1/3\) and \({\gamma }_{1,1} = {\theta }_{\mathrm{XY}} = 1/2\), and (12.13) reduces to (12.10).

In the presence of ties, the U-statistic quantities need to be modified by adding \(I({Y }_{j} = {X}_{i})/2\) to the indicators in the sums. For example,

$$\widehat{{\theta }}_{\mathrm{XY}} = \frac{{W}_{\mathrm{XY}}} {mn} = \frac{1} {mn}\sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}\left \{I({Y }_{ j} > {X}_{i}) + I({Y }_{j} = {X}_{i})/2\right \}.$$
(12.14)

The relationships \({W}_{\mathrm{YX}} = mn + n(n + 1)/2 - W\) and \({W}_{\mathrm{XY}} = W - n(n + 1)/2\) then continue to hold. The definitions of γ0, 1, γ1, 0, and γ1, 1 for use in (12.13) have to be modified in the face of ties; see, for example, Boos and Brownie [1992, p. 72]. In the next section we give the basic asymptotic normal results for linear statistics under the null hypothesis of identical populations. Those general results are useful for approximate critical regions for permutation and rank statistics. However, the Wilcoxon statistics are special because they are related to the U-statistic \(\widehat{{\theta }}_{\mathrm{XY}}\) for which a large body of theory exists. In particular, \(\widehat{{\theta }}_{\mathrm{XY}}\) is AN\(\left \{{\theta }_{\mathrm{XY}},\mathrm{Var}(\widehat{{\theta }}_{\mathrm{XY}})\right \}\), and this follows from basic U-statistic theory with no assumptions except that X 1, …, X m are iid with any distribution function F(x), and Y 1, …, Y n are iid with any distribution function G(x). Because this asymptotic result is not just for null situations, it helps us think about i) the form of the alternative hypothesis, ii) the classes of distribution functions for which the Wilcoxon Rank Sum is consistent, in other words, rejects with probability converging to 1, and iii) asymptotic power and sample size determination. We now discuss these ideas.

In general, the null hypothesis of interest is

$${H}_{0} : F(x) = G(x),\;\mbox{ each }x \in(-\infty,\infty ).$$

However, the alternative hypothesis can be formulated in several ways. The most common way is to assume the shift model \(G(x) = F(x - \Delta )\), and then the alternative hypothesis is purely in terms of Δ, for example

$${H}_{1} : \Delta> 0.$$

Another popular, more nonparametric, way to phrase the alternative is

$${H}_{2} : F(x) \geq G(x),\;\mbox{ each }x \in(-\infty,\infty ),$$

and with strict inequality for at least one x. Here, G is said to be stochastically larger than F. Clearly, H 2 is a larger class of alternatives since (F, G) ∈ H 1 implies (F, G) ∈ H 2. Lastly, the natural alternative when thinking in terms of \(\widehat{{\theta }}_{\mathrm{XY}}\) is

$${H}_{3} : {\theta }_{\mathrm{XY}} > \frac{1} {2}.$$

Now if F and G are continuous distribution functions and (F, G) ∈ H 2, then (F, G) ∈ H 3. This follows from

$${\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1}) = \int\nolimits \nolimits \int\nolimits \nolimits I(y > x)\,dF(x)\,dG(y) = \int\nolimits \nolimits \{1 - G(x)\}\,dF(x),$$

after noting that if continuous distribution functions satisfy F(x) > G(x) for at least one x, then this strict inequality must hold for an interval of x values, and \(\int\nolimits \nolimits F(x)\,dF(x) = 1/2\). Assuming that H 3 holds, the Wilcoxon Rank Sum test is consistent because of the general asymptotic normality result mentioned above. This also means that it is consistent under alternatives H 1 and H 2.

Lastly, following Noether [1987], the approximate power of a one-sided α level test when \({\theta }_{\mathrm{XY}} > 1/2\) is given by

$$1 - \Phi \left \{\frac{1/2 - {\theta }_{\mathrm{XY}}} {\rho {\sigma }_{0}} + \frac{{\Phi }^{-1}(1 - \alpha )} {\rho } \right \},$$
(12.15)

where σ0 is the square root of the null variance of W (12.10, p. 462), ρ is the ratio of the square root of the non-null variance of W (\({m}^{2}{n}^{2}\) times eq. 12.13, p. 462) to σ0, and Φ is the standard normal distribution function. Typically, ρ is close to 1. Letting ρ = 1 and m = λN, the total sample size N required to have power 1 − β for alternative θXY is given by Noether [1987] to be

$$N = \frac{{\left \{{\Phi }^{-1}(1 - \alpha ) + {\Phi }^{-1}(1 - \beta )\right \}}^{2}} {12\lambda (1 - \lambda ){({\theta }_{\mathrm{XY}} - 1/2)}^{2}}.$$
(12.16)

This is a fairly simple formula, but it might be preferable to state power and sample size in terms of the shift model. Plugging in \(G(x) = F(x - \Delta )\), we have

$${\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1}) = \int\nolimits \nolimits \{1 - F(x - \Delta )\}\,dF(x).$$

For example, if we wanted shifts of size Δ ∕ σ in a normal(μ, σ 2) population, then a simple R program to get θXY using the midpoint rule is

theta.xy <- function(delta, n = 10000){
  # U-statistic parameter theta_XY for a normal shift of delta/sigma
  # (computed for sigma = 1); n is the number of points for the midpoint rule
  points <- (2*(1:n) - 1)/(2*n)
  mean(1 - pnorm(qnorm(points) - delta))
}

If\(\Delta /\sigma=.5\), then

  > theta.xy(.5,10000)

  [1] 0.6381632

so that θXY = .638. Choosing α = .05, power 1 − β = .80, and \(\lambda= 1/2\), we find N = 108, or \(m = n = 54\).
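Formula (12.16) is just as easy to program. The sketch below is ours (the function name noether.N is our choice); it reuses theta.xy from above:

noether.N <- function(alpha, power, theta, lambda = 1/2){
  # total sample size N of (12.16) for a one-sided alpha-level test
  (qnorm(1 - alpha) + qnorm(power))^2 /
    (12*lambda*(1 - lambda)*(theta - 1/2)^2)
}
ceiling(noether.N(.05, .80, theta.xy(.5)))    # 108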

4.5 Asymptotic Normal Approximation

Approximate normal distributions for linear statistics have been the most popular approximation to permutation distributions, especially for rank statistics. Here we use the following permutation Central Limit Theorem for \(T = \sum\limits _{i=1}^{N}{c}_{i}{A}_{i}\), introduced in (12.3, p. 458), taken directly from Puri and Sen [1971, p. 73], who give credit to Wald and Wolfowitz [1944], Noether [1949], and Hoeffding [1951]. The notation \({\mu }_{q}(c)\) is for the qth central moment \({N}^{-1} \sum\limits _{i=1}^{N}{({c}_{i} -\overline{c})}^{q}\).

Theorem 12.2 (Wald-Wolfowitz-Noether-Hoeffding). 

If for N →∞

(i):
$$\frac{{\mu }_{q}(c)} {{\mu }_{2}{(c)}^{q/2}} = O(1)\;\;\;\mbox{ for all}\;q = 3,4,\ldots $$
(ii):
$$\frac{{\mu }_{q}(a)} {{\mu }_{2}{(a)}^{q/2}} = o({N}^{q/2-1})\;\;\;\mbox{ for all}\;q = 3,4,\ldots,$$

then

$$\frac{T -\mbox{ E}(T)} {\sqrt{\mbox{ Var} (T)}}\stackrel{d}{\rightarrow }N(0,1).$$

In a particular problem either or both of the vectors \(c\) and \(a\) may be random, that is, calculated from the data \(Z\). In such cases we would need to show that the appropriate conditions (i) and/or (ii) hold wp1 (with probability one) with respect to the random vector \(Z\). Moreover, the conclusion of Theorem 12.2 is that the permutation distribution of the standardized T converges to a standard normal distribution with probability one with respect to \(Z\).

In the case of linear rank statistics without ties, we can give a much simpler theorem due to Hajek [1961]. We follow the exposition given in Randles and Wolfe [1979, Ch. 8] and state their version of Hajek’s theorem.

Theorem 12.3 (Hajek). 

Let \(T = \sum\limits _{i=1}^{N}c(i)a({R}_{i})\) be the linear rank statistic, where the rank vector \(R\) comes from data vector \(Z\) that is continuous (no ties with probability one) and exchangeable, the constants c(1),…,c(N) satisfy the Noether condition

$$\frac{{\sum}_{i=1}^{N}{(c(i) -\overline{c})}^{2}} {{\max }_{1\leq i\leq N}{(c(i) -\overline{c})}^{2}} \rightarrow \infty \quad \mbox{ as $N \rightarrow \infty $},$$
(12.17)

and the scores have the form \(a(i) = \phi (i/(N + 1))\) , where ϕ can be written as the difference of two nondecreasing functions, \(0 < \int\nolimits _{0}^{1}{\phi (t)}^{2}\,dt < \infty \) , and \(\int\nolimits _{0}^{1}\vert \phi (t)\vert \,dt < \infty \) . Then T is AN \(\{N\overline{c}\,\overline{a},\mathrm{Var}(T)\}\) as N →∞.

It has been customary to use the normal approximation with rank statistics, often with a continuity correction. For example, in the two-sample problem, consider the Wilcoxon Rank Sum W of (12.6, p. 461). Note that for application of Theorem 12.3 above, ϕ(u) = u, and the theorem actually applies directly to \(W/(N + 1)\). For the simple example of Section 12.2 where \(z = (x,y) = (6,8,7,18,11,9)\) with ranks \(R = (1,3,2,6,5,4)\), we find W = 17, E(W) = 4(6 + 1)/2 = 14, Var(W) = (2)(4)(6 + 1)/12 = 14/3 (from 12.10, p. 462), and the normal approximation p-value is

$$p \approx P\left (N(0,1) \geq\frac{17 - 14} {\sqrt{14/3}} \right ) = P(N(0,1) \geq1.39) = 0.08.$$

With continuity correction the normal approximationp-value is

$$p \approx P\left (N(0,1) \geq\frac{17 - 14 - 1/2} {\sqrt{14/3}} \right ) = P(N(0,1) \geq1.16) = 0.12.$$

Lehmann [1975, p. 16] cites Kruskal and Wallis [1952, p. 591] with the recommendation that the continuity correction be used when the probability is above 0.02. Recall that the exact null distribution of W can be obtained from Table 12.1, leading to the usual p-value \(P(W \geq 17) = 2/15 = 0.13\), which is closer to the continuity corrected value.
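These approximations take one line each in R; the check below is ours, not the text's:

W <- 17; EW <- 4*(6 + 1)/2; VW <- (2)*(4)*(6 + 1)/12
1 - pnorm((W - EW)/sqrt(VW))          # 0.082, no continuity correction
1 - pnorm((W - EW - 1/2)/sqrt(VW))    # 0.124, with continuity correction
2/15                                  # 0.133, exact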

When there are tied values, we can still use the normal approximation with W, but we must be sure to use the null variance from (12.11, p. 462) or (12.12, p. 462) and not from (12.10, p. 462). Lehmann [1975, p. 20] does not use the continuity correction in the presence of ties.

We can also look at approximations to the permutation p-value of \(T = \sum\limits _{i=1}^{n}{Y }_{i}\), which is permutationally equivalent to the two-sample t statistic. For the simple example, \(c = (0,0,1,1,1,1)\) and \(a = z = (6,8,7,18,11,9)\). Thus, E(T) = (6)(4/6)(59/6) = 39.33, Var(T) = 25.29, and the normal approximation p-value is

$$p \approx P\left (N(0,1) \geq\frac{45 - 39.33} {\sqrt{25.29}} \right ) = P(N(0,1) \geq1.13) = 0.13.$$

This seems almost too good an approximation to the true permutation p-value of \(2/15 = 0.13\). Usually the t approximation p-value is more accurate, but here it is \(P({t}_{4} \geq 1.17) = 0.15\).

4.6 Edgeworth Approximation

Edgeworth approximations were mentioned briefly in Ch. 3 (5.6, p. 219) and Ch. 9 (11.7, p. 428). Basically, an Edgeworth expansion is an approximation to the distribution function of an asymptotically normal statistic. It is based on estimation of Skew and/or Kurt and other higher moments of the statistic. Rigorous development of Edgeworth expansions for general permutation statistics under the null hypothesis may be found in Bickel [1974], Bickel and van Zwet [1978], and Robinson [1980]. However, the approach has not proved of much practical use for obtaining critical values or p-values of permutation statistics except in the special cases of the Wilcoxon Rank Sum W and the one-sample Wilcoxon Signed Rank statistic.

Here we give the approximation for W originally due to Fix and Hodges [1955]. For \(W = \sum\limits _{i=m+1}^{N}{R}_{i}\),

$$P(W \geq w) \approx1 - \Phi (t) -\left \{\frac{{m}^{2} + {n}^{2} + mn + m + n} {20mn(m + n + 1)} \right \}({t}^{3} - 3t)\phi (t),$$
(12.18)

where ϕ and Φ are the standard normal density and distribution function, respectively, and \(t =\{ w -\mathrm{ E}(W) - 1/2\}/\sqrt{\mathrm{Var } (W)}\), \(\mathrm{E}(W) = n(N + 1)/2\), \(\mathrm{Var}(W) = mn(N + 1)/12\).
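Formula (12.18) is straightforward to program; the function below is our sketch (the name edgeworth.W is ours), not library code:

edgeworth.W <- function(w, m, n){
  # Fix-Hodges Edgeworth approximation (12.18) to P(W >= w), untied data
  N <- m + n
  EW <- n*(N + 1)/2; VW <- m*n*(N + 1)/12
  t <- (w - EW - 1/2)/sqrt(VW)
  corr <- (m^2 + n^2 + m*n + m + n)/(20*m*n*(N + 1))
  1 - pnorm(t) - corr*(t^3 - 3*t)*dnorm(t)
}
edgeworth.W(17, 2, 4)    # 0.135, versus the exact 2/15 = 0.133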

Figure 12.1 gives the error = true p-value − (12.18) and the relative error = [true p-value − (12.18)]/(true p-value) of (12.18) compared to the true p-value, and similar quantities for the normal approximations. The range of the p-values covers most of the right tail of the distribution function of W plotted in reverse order, that is, 0.0005 to 0.11. The Edgeworth approximation is excellent for p-values larger than 0.0024, but then deteriorates as the p-value gets very small. For example, when the true p-value is 0.00087, the Edgeworth approximation is 0.00073, and at 0.00025 it is 0.00009. The right panel of Figure 12.1 is especially helpful for illuminating what happens at small p-values. The normal approximation is much cruder, and below 0.02 we can see that the continuity correction is no longer useful.

Fig. 12.1 Error (Left Panel) and relative error (Right Panel) of approximations to Wilcoxon Rank Sum p-values for m = 10, n = 6: normal approximation, normal approximation with continuity correction, and the Edgeworth approximation in (12.18, p. 467)

Figure 12.1 suggests that (12.18) can be used for most values of W, thus essentially replacing tabled values of the distribution of W. However, when there are ties in the data, (12.18) as well as tabled values are no longer correct, and the exact permutation distribution (or a Monte Carlo approximation) is required.

4.7 Box-Andersen Approximation

Pitman [1937a,b] and Welch [1937] pioneered an approximation to permutation distributions that was modernized by Box and Andersen [1955] and Box and Watson [1962]. These later authors mainly used the approach to show the Type I error robustness of F statistics for tests comparing means and the nonrobustness of tests comparing variances. However, we follow the Box and Andersen [1955] formulation since it is the most straightforward.

The basic idea of the approximation is to put F statistics into their equivalent “beta” form, then match the first two permutation moments of this beta version to the first two moments of a beta distribution with degrees of freedom multiplied by a constant d. Solving for d leads to approximating the permutation distribution of the F statistic by an F distribution with the usual degrees of freedom multiplied by d. We develop the approximation here for the two-sample problem and later give it for one-way and two-way ANOVA situations.

The square of the t statistic in (12.1, p. 452) may be written in the one-way ANOVA F form

$${t}^{2} = \frac{m{(\overline{X} -\overline{Z})}^{2} + n{(\overline{Y } -\overline{Z})}^{2}} {{s}_{p}^{2}} = \frac{\mbox{ SSTR}} {\mbox{ SSE}/(N - 2)},$$
(12.19)

where recall we use the Z’s to denote all the X and Y values thrown together, and SSTR and SSE are sums of squares for treatments and error, respectively. Using the fact that \(\sum\limits _{i=1}^{N}{({Z}_{i} -\overline{Z})}^{2} = \mbox{ SSTR} + \mbox{ SSE}\), we have for the beta version of the F statistic

$$b({t}^{2}) = \frac{{t}^{2}} {{t}^{2} + N - 2} = \frac{\mbox{ SSTR}} {{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}}.$$

Note that for normal data under the null hypothesis, b(t 2) has a beta\((1/2,(N - 2)/2)\) distribution. Originally b(t 2) was used with the beta critical values rather than t 2 with F(1, N − 2) critical values. Although t 2 and b(t 2) are equivalent test statistics, for permutation analysis b(t 2) is much simpler because the denominator is constant over permutations. Thus, the first permutation moment is

$$\mathrm{{E}}_{\mathrm{P}}\{b({t}^{2})\} = \frac{m\mathrm{{Var}}_{\mathrm{P}}(\overline{X}) + n\mathrm{{Var}}_{\mathrm{P}}(\overline{Y })} {{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}} = \frac{1} {N - 1},$$

where we have used (12.4, p. 459) to get

$$\mathrm{{Var}}_{\mathrm{P}}(\overline{X}) = \frac{n{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}} {mN(N - 1)} \qquad \mathrm{{Var}}_{\mathrm{P}}(\overline{Y }) = \frac{m{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}} {nN(N - 1)}.$$

Note also that under normal theory \(\mathrm{E}\{b({t}^{2})\} = (1/2)/\{1/2 + (N - 2)/2\} = 1/(N - 1)\) from the beta distribution. Thus, the normal theory and permutation first moments of b(t 2) are both \(1/(N - 1)\). The next step is to calculate the permutation variance of b(t 2) (involving fourth moments), equate it to the variance of a beta\((d/2,d(N - 2)/2)\) distribution, \(2(N - 2)/[d(N - 1)(N + 3)]\), and solve for d. Box and Andersen [1955, p. 13] give d for the general one-way ANOVA situation with k groups and sample sizes n 1, n 2, …, n k :

$$d = 1 + \left (\frac{N + 1} {N - 1}\right ) \frac{{c}_{2}} {{({N}^{-1} + A)}^{-1} - {c}_{2}},$$
(12.20)

where

$$A = \frac{N + 1} {2(k - 1)(N - k)}\left (\frac{{k}^{2}} {N} -\sum\limits _{i=1}^{k} \frac{1} {{n}_{i}}\right ),$$

\({c}_{2} = {k}_{4}/{k}_{2}^{2}\),

$${k}_{2} = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2},$$
(12.21)
$${k}_{4} = \frac{N(N + 1){\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{4} - 3(N - 1){\left \{{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}\right \}}^{2}} {(N - 1)(N - 2)(N - 3)}.$$
(12.22)

The statistics k 2 and k 4 are unbiased estimators of the population cumulants introduced in Chapter 1.
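A function computing d from (12.20)-(12.22) takes only a few lines. The sketch below is ours for the general one-way layout (the name boxandersen.d is our choice), not code from the text:

boxandersen.d <- function(z, n.vec){
  # z = all N observations combined; n.vec = the k group sizes
  N <- length(z); k <- length(n.vec)
  k2 <- sum((z - mean(z))^2)/(N - 1)                              # eq. (12.21)
  k4 <- (N*(N + 1)*sum((z - mean(z))^4) -
         3*(N - 1)*sum((z - mean(z))^2)^2)/((N - 1)*(N - 2)*(N - 3))  # eq. (12.22)
  c2 <- k4/k2^2
  A  <- (N + 1)/(2*(k - 1)*(N - k)) * (k^2/N - sum(1/n.vec))
  1 + ((N + 1)/(N - 1)) * c2/(1/(1/N + A) - c2)                   # eq. (12.20)
}

For the two-sample problem, t² would then be compared to an F(d, d(N − 2)) distribution with d = boxandersen.d(z, c(m, n)).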

For our two-sample t 2, k = 2, n 1 = m, n 2 = n, \(m + n = N\), and the Pitman-Welch-Box-Andersen approximation is to compare t 2 to an \(F(d,d(m + n - 2))\) distribution. Box and Andersen [1955] show that \(\mathrm{E}(d) \approx 1 + (\mathrm{Kurt} - 3)/N\) under the null hypothesis of sampling from equal populations with kurtosis Kurt. Thus, t 2 with the usual \(F(1,(m + n - 2))\) is quite Type I error robust to nonnormality since the correction d is relatively small for moderate size N. Also, for long-tailed distributions with thicker tails than the normal distribution, Kurt > 3 and thus d > 1, so that using the \(F(1,(m + n - 2))\) critical values results in conservative tests, that is, true test levels less than the nominal α values. For example, with Laplace data, Kurt = 6 and \(d \approx 1 + 3/N\); at \(m = n = 10\), d ≈ 1.15, and a nominal α = .05 level test would actually have true level approximately .043. For continuous uniform data, Kurt = 1.8; at \(m = n = 10\), d ≈ .94, and a nominal α = .05 level test would have true level approximately .053. Since these deviations from α are small, common practice is to just use the standard \(F(1,(m + n - 2))\) reference distribution with the t 2 statistic rather than the permutation distribution or an approximation to it.

Although t 2 is Type I error robust in the face of outliers, it loses power because outliers inflate the variance estimate in the denominator of t 2. Thus t 2 is not Type II error robust when sampling from distributions heavier-tailed than the normal. In contrast, as we mentioned in the Chapter introduction, the Wilcoxon Rank Sum statistic W is Type II error robust, and later we use asymptotic power calculations to verify its superiority to t 2. But for the moment, we note that W is related to t 2 applied to the ranks of the data, and therefore inherits robustness to outliers because the ranks themselves are resistant to the effects of outliers. This relationship also allows us to use the above approximation for the permutation distribution of W.

Define the standardized Wilcoxon Rank Sum statistic by

$${W}_{\mathrm{S}} = \frac{W -\mathrm{ E}(W)} {{\left \{\mathrm{Var}(W)\right \}}^{1/2}}.$$

Then, t 2 applied to the ranks of the observations, that is, the X ranks R 1, …, R m replacing X 1, …, X m , and the Y ranks R m + 1, …, R N replacing Y 1, …, Y n , results in

$${t}_{\mathrm{R}}^{2} = \frac{(N - 2){W}_{\mathrm{S}}^{2}} {N - 1 - {W}_{\mathrm{S}}^{2}}.$$

Thus t R 2 and W are equivalent test statistics, and we can apply the Box-Andersen approximation to t R 2 using \(d \approx 1 + (1.8 - 3)/N\) because the ranks are a uniform distribution on the integers 1 to N and thus have Kurt ≈ 1.8, the kurtosis of a continuous uniform distribution. For example, in the case of m = 10 and n = 6 given in Figure 12.1 (p. 467), the Box-Andersen approximation along with the continuity correction gives results that are considerably better than the normal approximation with continuity correction but not quite as good as the Edgeworth approximation. In later sections we see that the Box-Andersen approximation is very good in one-way and two-way ANOVA situations when the number of treatments is greater than two.

4.8 Monte Carlo Approximation

In the previous sections, approximations to permutation distributions were given for statistics based on linear forms, relying essentially on the Central Limit Theorem and its extensions. However, the simplest and most important approximation to a permutation distribution is to randomly sample from the set of all possible permutations and directly estimate the permutation distribution. This approach can be used for any statistic T, and its accuracy is determined simply by the number B of random permutations used. This resampling of permutations is very similar to resampling in the bootstrap world, and we suggest sampling with replacement for simplicity, although sampling without replacement could be used.

Suppose that T calculated on all permutations has distinct values \({t}_{1},\ldots,{t}_{k}\). For example, in Table 12.1 (p. 453) the t statistic has k = 13 distinct values − 2.98, − 1.72, − 1.36, − 1.08, − 0.84, − 0.06, 0.12, 0.30, 0.49, 0.69, 0.91, 1.17, 1.47, corresponding to the 15 permutations (0.49 and 0.91 appeared twice). The Monte Carlo approach is to randomly select B times from the 15 possible permutations, calculate the statistic for each random selection, say \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\), and let the number of T  ∗ s equal to t i be denoted N i , i = 1, …, k. If we select permutations with replacement, then (N 1, …, N k ) is multinomial(B; p 1, …, p k ), where p i is the permutation distribution probability of obtaining t i . The estimates N i  ∕ B have binomial variances \({p}_{i}(1 - {p}_{i})/B\). Thus, if we were trying to estimate the probabilities in Table 12.2 (p. 453), most of the estimates would have variance \((1/15)(14/15)/B\), although two of them would have variance \((2/15)(13/15)/B\) because of the duplication of values 0.49 and 0.91.

In typical applications, we are not interested in the whole permutation distribution, but merely want to estimate the \(p\)-value given in (12.2, p. 455) using

$$\widehat{p} = \frac{\#\left \{{T}_{i}^{{_\ast}}\geq {T}_{0}\right \}} {B},$$

where \({T}_{0}\) is the value of the statistic for the original data. In the simple example, \({T}_{0} = 1.17\). Recall that in this case the true permutation \(p\)-value is \(2/15 = .13\). Thus, \(B = 1000\) would yield an estimate with standard deviation \(\{(.13)(.87)/1000\}^{1/2} = .01\) that would be adequate for most purposes. However, if the \(p\)-value were smaller, say .005, then we would want to take \(B\) larger so that the standard deviation of the estimate would be a small fraction of the \(p\)-value, say not more than 10–20%. For example, setting \(.001 = \{(.005)(.995)/B\}^{1/2}\) would suggest \(B = 4975\). When the estimated \(p\)-value is to be used with rejection rules like “reject \({H}_{0}\) if \(\widehat{p} \leq \alpha\),” then it is wise to choose \(B\) so that \((B + 1)\alpha\) is an integer, as was discussed in the bootstrap Section 11.6.2 (p. 442) as the “99 rule.” Mainly this would be used in Monte Carlo simulation studies where \(B = 99\) or \(B = 199\) might be used to save computing time. However, in situations where computations of the test statistic are extremely expensive, one may view the random partitions as part of the test itself, and the procedure “reject \({H}_{0}\) if \(\widehat{p} \leq \alpha\)” is called a Monte Carlo test, not just an approximation to the permutation test. This approach was first introduced by Barnard [1963] and later studied by Hope [1968], Jöckel [1986], and Hall and Titterington [1989].
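As a concrete illustration, the following R sketch estimates a one-sided permutation \(p\)-value for the pooled two-sample \(t\) statistic; the data vectors x and y are hypothetical, and the choice of B follows the guidance above.

B <- 1000
t0 <- t.test(y, x, var.equal = TRUE)$statistic     # observed statistic
z <- c(x, y); m <- length(x)
tstar <- replicate(B, {
  zs <- sample(z)                                  # a random permutation
  t.test(zs[-(1:m)], zs[1:m], var.equal = TRUE)$statistic
})
mean(tstar >= t0)                                  # estimated p-value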

4.9 Comparing the Approximations in a Study of Two Drugs

A new drug regimen (B) was given to 16 subjects, and one week later each subject's status was assessed. A second independent group of 13 subjects received the standard drug regimen (A). Both sets of measurements were compared to baseline measurements taken before the treatment period began. The difference from baseline data is given in Figure 12.2. (This is real data, but the actual details are confidential.) The drug company wanted to prove that regimen B involving their new drug had larger differences from baseline than the standard. In terms of means of the differences, the testing situation is \({H}_{0} : {\mu }_{B} = {\mu }_{A}\) versus \({H}_{a} : {\mu }_{B} > {\mu }_{A}\). The sample means and standard deviations are \(\overline{X} = .92\), \(\overline{Y} = 3.19\), \({s}_{X} = 5.45\), \({s}_{Y} = 10.21\). The standard pooled \(t\) from (12.1, p. 452) is .72 with one-sided \(p\)-value .24 from the \(t\) distribution. The exact permutation \(t\) \(p\)-value is .249; with a large \(p\)-value like this, the \(t\) distribution approximation is adequate and agrees with the Type I error robustness mentioned previously. The Box-Andersen \(d = 1.074\) leads to an adjusted \(t\) \(p\)-value of .245.

Fig. 12.2 Change from Baseline for Drugs A and B

However, Figure 12.2 reveals that most of the Drug B subjects have positive changes from baseline, whereas the Drug A changes are more centered around 0. The two large negative values −22 and −11 have a strong effect on the \(t\) statistic. The Wilcoxon Rank Sum statistic \(W\) is less affected by outliers and might paint a different picture. First we compute the midranks, listing them with the data ordered within samples.

Summing the Drug B midranks gives \(W = 271.5\). The null mean of \(W\) is \((16)(16 + 13 + 1)/2 = 240\). To compute the null variance using the formula for ties, (12.12, p. 462), note that there are \(e = 16\) distinct values, with 2 values tied at −3, 7 tied at −1, 3 tied at 0, 2 tied at 2, 2 tied at 4, 2 tied at 6, and 2 tied at 10. Thus the null variance is

$$\frac{(16)(13)(16 + 13 + 1)}{12} - \frac{(16)(13)}{(12)(29)(29 - 1)}\left[({7}^{3} - 7) + ({3}^{3} - 3) + 5({2}^{3} - 2)\right] = 520 - 8.325 = 511.675.$$

The approximate normal statistic is \((271.5 - 240)/\sqrt{511.675} = 1.39\) with \(p\)-value .082. The \(t\) statistic on the ranks is 1.42 with \(p\)-value .084. The Box and Andersen [1955] degrees of freedom approximation with \(d = (1 - 1.2/29) = 0.96\) does not change the latter \(p\)-value until the fourth decimal. The Edgeworth approximation \(p\)-value is .084 without continuity correction and .087 with continuity correction.
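The tie-corrected variance calculation is easy to script. Here is a minimal R sketch using the tie pattern listed above; the function name and arguments are our own.

wilcox.var.ties <- function(m, n, tie.sizes) {
  # Null variance of W with ties, (12.12): mn(N+1)/12 minus the tie correction
  N <- m + n
  m*n*(N + 1)/12 - m*n/(12*N*(N - 1)) * sum(tie.sizes^3 - tie.sizes)
}
v <- wilcox.var.ties(13, 16, c(2, 7, 3, 2, 2, 2, 2))   # 511.675
pnorm((271.5 - 240)/sqrt(v), lower.tail = FALSE)       # about .082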


Unfortunately, because of the ties we cannot trust the exact tables or a continuity correction or the Edgeworth approximation. Thus, it seems wise to either calculate the exact permutation \(p\)-value or estimate it by Monte Carlo methods. With \(B = 10,000\) we got \(\widehat{p} = .085\) with 95% confidence interval (.080, .090). Rather than make \(B\) larger, in this case it is fairly easy to get the exact \(p\)-value = .0849 with existing software. Summarizing the one-sided \(p\)-values, we have:

Method                        One-sided p-value
Normal approximation          .082
t on ranks                    .084
Box-Andersen                  .084
Edgeworth (no corr.)          .084
Edgeworth (cont. corr.)       .087
Monte Carlo (B = 10,000)      .085
Exact permutation             .0849

So this is a situation where the Wilcoxon Rank Sum statistic might be preferred to the \(t\) because of its robustness to outliers. Here it apparently downweighted the outliers −22 and −11 enough to have a much lower \(p\)-value than the \(t\) statistic. The normal and \(t\) approximations to the \(W\) \(p\)-value are quite reasonable here, but we would not know that without getting the exact \(p\)-value = .0849 or by estimating it fairly accurately.


5 Optimality Properties of Rank and Permutation Tests

There are actually very few results available on the optimality properties of permutation tests. The main source is Lehmann and Stein [1949] (see also Lehmann [1986, Ch. 5]), who give the form of the most powerful permutation test for shift alternatives and note that it depends on a variety of unknown quantities, including the form of the distribution. In the particular case of normal data with common unknown variance, they show that the most powerful permutation statistic is \(\overline{Y}\), or equivalently \(\overline{Y} -\overline{X}\) or the pooled two-sample \(t\) statistic. Thus general optimality results are not available, but a general approach is clear: derive an (asymptotically) optimal parametric test statistic under a specific parametric family assumption (your best guess), and use the permutation approach for critical values. The resulting permutation test is valid under the null hypothesis for any distribution as long as the conditions of Theorem 12.1 (p. 457) hold, and is close to optimal if the distribution of the data is close to the one used to derive the test statistic.

For rank statistics there are two main bodies of results: locally most powerful rank tests and asymptotically most powerful rank tests based on Pitman Asymptotic Relative Efficiency (ARE). Here we briefly give the flavor of these approaches and the main results, leaving technical details for the Appendix.

5.1 Locally Most Powerful Rank Tests

For simplicity we focus on the two-sample shift model where \({X}_{1},\ldots,{X}_{m}\) are iid with distribution function \(F\), and \({Y}_{1},\ldots,{Y}_{n}\) are iid with distribution \(G(y) = F(y - \Delta)\). We assume that \(F\) is continuous with density \(f\). Consider

$${H}_{0} : \Delta= 0\quad \mbox{ versus}\quad {H}_{a} : \Delta> 0.$$

If there exists a rank test that is uniformly most powerful of level α for some ε > 0 in the restricted testing problem

$${H}_{0} : \Delta= 0\quad \mbox{ versus}\quad {H}_{a,\epsilon } : 0< \Delta< \epsilon,$$

then we say that the test is the locally most powerful rank test for the original testing problem.

The basic approach to finding a locally most powerful rank test is to take a Taylor expansion of the probability of the rank vector as a function of \(\Delta\) and maximize its derivative at \(\Delta = 0\). For sufficiently small \(\Delta\), ordering the values of the rank vector by their probabilities under the alternative \(\Delta\) is the same as ordering them by the derivative at \(\Delta = 0\). Thus, we need only obtain an expression for the derivative and maximize it. These details are left for the Appendix.

For the two-sample shift problem, the locally most powerful rank test rejects for large values of

$$T = \sum\limits _{i=m+1}^{N}a({R}_{ i}),$$

where \(a(i) = \mathrm{E}\{\phi ({U}_{(i)},f)\}\),

$$\phi (u,f) = -\frac{{f}^{{\prime}}({F}^{-1}(u))} {f({F}^{-1}(u))}$$
(12.23)

is called the optimal score function, and \({U}_{(1)} \leq {U}_{(2)} \leq \cdots \leq {U}_{(N)}\) are the order statistics from a uniform (0,1) distribution. Recall that \({R}_{m+1},\ldots,{R}_{N}\) are the ranks of the \(Y\) values in the joint ranking of all the \(X\)'s and \(Y\)'s together. We see in the next section that a closely related statistic, \(\sum\limits _{i=m+1}^{N}\phi ({R}_{i}/(N + 1),f)\), is asymptotically equivalent and comes naturally from asymptotic relative efficiency considerations.

If \(F\) is the logistic distribution, then we are led to the Wilcoxon Rank Sum as the locally most powerful rank test for shift alternatives because \(-{f}^{{\prime}}(x)/f(x) = 2F(x) - 1\) and \(\mathrm{E}\{{U}_{(i)}\} = i/(N + 1)\). When \(F\) is a normal distribution, the optimal score function is \(\phi (u,f) = {\Phi }^{-1}(u)\), and the locally most powerful test is based on the normal scores

$$a(i) =\mathrm{ E}\{{\Phi }^{-1}({U}_{ (i)})\} =\mathrm{ E}\{{Z}_{(i)}\},$$

where \({Z}_{(i)}\) is a standard normal order statistic. For shifts in the scale of an exponential distribution, \(F(x;\sigma ) = 1 -\exp (-x/\sigma )\), we can turn the problem into a shift in location of the negative of an extreme value distribution, \(F(x) = 1 -\exp \{-\exp (x)\}\), by taking the natural logarithm of the exponential data. The resulting optimal test has score

$$a(i) + 1 = \sum\limits _{j=N+1-i}^{N}\;\frac{1} {j},$$

where the latter sum is the expected value of the \(i\)th order statistic from a standard exponential distribution. These are called Savage scores from Savage [1956]. In censored data situations, the analogous test is called the logrank test.
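The three score families just described are simple to generate. Below is a minimal R sketch; note that qnorm(i/(N + 1)) is the quantile approximation to the exact normal scores \(\mathrm{E}\{{Z}_{(i)}\}\), which would otherwise require numerical integration.

N <- 10
wilcoxon <- 1:N                                       # a(i) = i
normal.approx <- qnorm((1:N)/(N + 1))                 # approximates E{Z_(i)}
savage <- sapply(1:N, function(i) sum(1/((N + 1 - i):N))) - 1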

Lehmann [1953] studied alternatives of the form

$${F}_{\Delta }(x) = (1 - \Delta )F(x) + \Delta {F}^{2}(x),$$

and showed that the Wilcoxon Rank Sum is the locally most powerful rank test for these alternatives. In general, alternatives of the form \({F}_{\Delta }(x) = {h}_{\Delta }(F(x))\) for some function \({h}_{\Delta }(u)\) are called Lehmann alternatives. They have the property that two-sample rank tests have the same distribution under an alternative \(\Delta\) for all continuous \(F\).

Johnson et al. [1987] consider locally most powerful rank tests using Lehmann alternatives for the nonresponder problem, where only a fraction of subjects respond to treatment. Conover and Salsburg [1988] consider other locally most powerful rank tests for the nonresponder problem. Additional situations where locally most powerful rank tests are considered include Doksum and Bickel [1969] and Bhattacharyya and Johnson [1973].

The optimal score functions (12.23, p. 475) appear in the \(k\)-sample problem, Section 12.6 (p. 480), and in the correlation problem, Section 12.7 (p. 487). Analogous results are also available in the one-sample location or matched pairs problem, Section 12.8, and are mentioned there.

Theoretical development and rigorous theorems on locally most powerful rank tests may be found in Hajek and Sidak [1967, Ch. 2], Conover [1973], and Randles and Wolfe [1979, Chs. 4 and 9].

5.2 Pitman Asymptotic Relative Efficiency

Perhaps the most useful way to evaluate and compare rank tests is due to Pitman [1948] and was further developed by Noether [1955] and others. The basic idea is that the Pitman Asymptotic Relative Efficiency (ARE) is the ratio of sample sizes required for two different tests to have the same power at a sequence of alternatives converging to the null hypothesis.

Let \(S\) and \(T\) be two test statistics for \(H : \theta = {\theta }_{0}\), where \({\theta }_{k}\) is a sequence of alternatives converging to \({\theta }_{0}\) as \(k \rightarrow \infty\). If we can choose sample sizes \({N}_{{S}_{k}}\) and \({N}_{{T}_{k}}\) and critical values \({c}_{{S}_{k}}\) and \({c}_{{T}_{k}}\) for \(S\) and \(T\), respectively, such that \(S > {c}_{{S}_{k}}\) and \(T > {c}_{{T}_{k}}\) have levels that converge to α and powers under \({\theta }_{k}\) that converge to β, α < β < 1, then the Pitman asymptotic relative efficiency of \(S\) to \(T\) is given by

$$\mbox{ ARE}(S,T) =\lim \limits_{k\rightarrow \infty }\frac{{N}_{{T}_{k}}} {{N}_{{S}_{k}}}.$$

Note that if ARE\((S,T) > 1\), then \(S\) is preferred to \(T\) because it takes fewer observations (\({N}_{{S}_{k}}\) is less than \({N}_{{T}_{k}}\)) to achieve the same power. Technical conditions in the Appendix and \(P({S}_{k} > {c}_{{S}_{k}}) \rightarrow \beta < 1\) require that the alternatives have a specific form: for some δ > 0

$${\theta }_{k} = {\theta }_{0} + \frac{\delta } {\sqrt{{N}_{{S}_{k }}}} + o\left ( \frac{1} {\sqrt{{N}_{{S}_{k }}}}\right )\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
(12.24)

Such sequences of alternatives are called Pitman alternatives. Another important quantity arising from the technical details is the efficacy of a test statistic \(S\),

$$\mbox{ eff}(S) =\lim \limits_{k\rightarrow \infty } \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} {\sqrt{{N}_{{S}_{k } } {\sigma }_{{S}_{k } }^{2 }({\theta }_{0 } )}},$$

where \({\mu }_{{S}_{k}}({\theta }_{0})\) and \({\sigma }_{{S}_{k}}({\theta }_{0})\) are the asymptotic mean and standard deviation of \(S\). Thus, the efficacy of a test is the rate of change of its asymptotic mean at the null hypothesis relative to its asymptotic standard deviation (the factor \(1/\sqrt{{N}_{{S}_{k }}}\) is introduced in the derivative because of (12.24)). A powerful test in the Pitman sense is one that is able to detect changes in the parameter value near the null hypothesis. The ARE of \(S\) to \(T\) turns out to be

$$\mbox{ ARE}(S,T) ={ \left \{\frac{\mbox{ eff}(S)} {\mbox{ eff}(T)}\right \}}^{2}.$$

The Pitman ARE is both a limiting ratio of sample sizes required to give the same power and the square of the ratio of the test efficacies. High efficacies lead to high AREs.

In the Appendix we give details for finding efficacies in the one-sample problem, but here we use similar standard results on efficacies for the two-sample problem from Randles and Wolfe [1979, Chs. 5 and 9]. The most important comparison is between the two-sample \(t\) test and the Wilcoxon Rank Sum test. The efficacy of the \(t\) test is

$$\mbox{ eff}(t) = \frac{\sqrt{\lambda (1 - \lambda )}} {\sigma },$$

where σ is the standard deviation of the \(X\) distribution function \(F(x)\) and of the \(Y\) distribution function \(G(y) = F(y - \Delta )\), and \(\lambda =\lim \limits_{\min (m,n)\rightarrow \infty }m/(m + n)\). For the Wilcoxon Rank Sum statistic \(W\) we have

$$\mbox{ eff}(W) = \sqrt{12\lambda (1 - \lambda )}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)\,dx,$$

where \(f\) is the density of \(F(x)\), and the integral is assumed to exist. Putting these efficacies together, we have that the Pitman ARE of \(W\) to \(t\) is

$$\mbox{ ARE}(W,t) = 12{\sigma }^{2}{\left \{{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)\,dx\right \}}^{2}.$$
(12.25)

We put ARE\((W,t)\) into Table 12.3 for a number of distributions. Remember that ARE\((W,t) > 1\) means that the Wilcoxon Rank Sum test is preferred to the \(t\) test. The first number is the lower bound 0.864 derived by Hodges and Lehmann [1956], which shows that the Wilcoxon Rank Sum cannot do much worse than the \(t\) test for any continuous unimodal distribution. The second number, 0.955, is for the normal distribution and shows that the Wilcoxon loses very little efficiency at the normal distribution, where the \(t\) test is optimal. At the uniform distribution the tests perform equivalently, and at the remaining examples in Table 12.3 the Wilcoxon is preferred.

Table 12.3 ARE(W, t) for the Two-Sample Shift Model
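Formula (12.25) is easy to check numerically. The R sketch below verifies the normal-distribution entry, ARE\((W,t) = 3/\pi \approx 0.955\), using \(\sigma = 1\).

int.f2 <- integrate(function(x) dnorm(x)^2, -Inf, Inf)$value  # equals 1/(2*sqrt(pi))
12 * 1^2 * int.f2^2                                           # 3/pi = 0.9549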

One might think that these ARE results are just asymptotic and may not relate to small-sample results. To supplement the ARE results, in Figure 12.3 we plot power results for \(m = n = 15\) taken from Table 4.1.10 of Randles and Wolfe [1979, p. 118–119]. They simulated the power of the \(t\) and Wilcoxon using 1000 replications. Here we see good correspondence between small-sample power and the ARE results of Table 12.3. For the normal, uniform, and logistic distributions, there is little power difference, as one might expect from ARE values of .955, 1.00, and 1.10, respectively. For the Laplace, the Wilcoxon has a significant power advantage, perhaps not quite as large as the ARE\((W,t) = 1.5\) would imply. The \({t}_{1}\) (Cauchy) and exponential power results strongly favor the Wilcoxon and are consistent with the large ARE values.

Fig. 12.3 Power of Wilcoxon Rank Sum (⋯) and \(t\) (_______) for \(m = n = 15\) from Table 4.1.10 of Randles and Wolfe [1979]

We should mention that the Laplace distribution with density \(f(x) = (1/2)\exp (-\vert x\vert )\) has been used quite a bit in the rank literature as a model for data, especially for ARE comparisons and simulation studies. But it may not be very useful as a model for real data, and ARE results for it are not as consistent with small-sample simulation results as they are for other densities. The optimal rank test for the Laplace uses scores \(a(i) = 1\) for \(i > (N + 1)/2\) and 0 otherwise, and is called the two-sample median test. However, its power performance in small samples, even when simulating from the Laplace distribution, is poor. Freidlin and Gastwirth [2000] show by simulation that the Wilcoxon Rank Sum test outperforms the median test at the Laplace distribution for sample sizes \(m = n\) less than or equal to 25. They recommend that the median test “be retired” from general usage, and we agree.

It turns out that in the scale problem mentioned briefly in Section 12.6.6 (p. 486), ARE values are overly optimistic when compared to small-sample power results. This may reflect the fact that measuring scale (standard deviation) is an inherently harder problem that is not as well suited to rank statistics. Klotz [1962] pointed out this discrepancy between small-sample power and ARE results. Fortunately, ARE results have been used mainly in location comparisons, where they yield good intuition about the qualitative behavior of tests.

Another result from Randles and Wolfe [1979, p. 307] is that under suitable regularity conditions on the score functions, the efficacy of any linear rank test \(S = \sum\limits _{i=m+1}^{N}\phi ({R}_{i}/(N + 1))\) in the two-sample shift model is given by

$$\mbox{ eff}(S) = \sqrt{\lambda (1 - \lambda )}\, \frac{{\int\nolimits }_{0}^{1}\phi (u)\phi (u,f)\,du} {{\left [{\int\nolimits }_{0}^{1}{\{\phi (u) -\overline{\phi }\}}^{2}\,du\right ]}^{1/2}},$$
(12.26)

where \(\phi (u,f)\) is given in (12.23, p. 475). Expression (12.26) now justifies the name optimal score function, since the efficacy in (12.26) is optimized by choosing \(\phi (u) = \phi (u,f)\). This can be seen by noting that

$${\int\nolimits \nolimits }_{0}^{1}{\phi }^{2}(u,f)\,du ={ \int\nolimits \nolimits }_{-\infty }^{\infty }{\left \{\frac{f^{\prime}(x)} {f(x)}\right \}}^{2}f(x)\,dx = I(f),$$

where \(I(f)\) is the Fisher information for the model \(f(x;\theta ) = f(x - \theta )\). Now, noting that \({\int\nolimits }_{0}^{1}\phi (u,f)\,du = 0\), (12.26) can be reexpressed as

$$\mbox{ eff}(S) = \sqrt{\lambda (1 - \lambda )I(f)}\mbox{ Corr}(\phi (U),\phi (U,f)),$$
(12.27)

where \(U\) is a uniform random variable and Corr is the correlation. Clearly, the correlation is maximized by choosing \(\phi (u) = \phi (u,f)\). Moreover, it can also be shown that \(\sqrt{\lambda (1 - \lambda )I(f)}\) is not only the largest possible efficacy among linear rank tests but also among all α-level tests. Thus, optimal linear rank tests are asymptotically equivalent in terms of Pitman ARE to the best possible tests, say likelihood ratio or score or Wald tests for the shift model in a parametric framework. Of course, this optimality in either the rank test or the parametric test requires that the assumed family is correct.

In the next sections we consider (i) the \(k\)-sample problem, a generalization of the two-sample problem to \(k > 2\) samples; (ii) the correlation or regression problem; and (iii) the matched pairs or one-sample symmetry problem. The Pitman ARE analysis has to be adjusted to handle each situation, but the numbers found in Table 12.3 (p. 477) continue to hold for these situations as well. Thus Wilcoxon procedures, in other words rank methods using scores \(a(i) = i\), tend to give very good results across a wide range of distributions in each of these situations.

6 The k-sample Problem, One-way ANOVA

The extension of the two-sample case to \(k\) samples or treatments is straightforward. Suppose that we have available \(k\) independent random samples \(\left \{{Y}_{i1},\ldots,{Y}_{i{n}_{i}}\right \}\), \(i = 1,\ldots,k\), where in each sample the \({Y}_{ij}\) (\(j = 1,\ldots,{n}_{i}\)) are iid with distribution function \({F}_{i}(x)\), and \(N = {n}_{1} + \cdots + {n}_{k}\). The linear model representation is

$${Y }_{ij} = \mu+ {\alpha }_{i} + {e}_{ij}.$$
(12.28)

If the errors \({e}_{ij}\) all come from the same distribution, then (12.28) is an extension of the shift model for two-sample data.

For example, the following are data on the ratio of Assessed Value to Sale Price for single-family dwellings (\({n}_{1} = 27\)), two-family dwellings (\({n}_{2} = 22\)), three-family dwellings (\({n}_{3} = 17\)), and four-or-more-family dwellings (\({n}_{4} = 14\)) in Fitchburg, Massachusetts, in 1979.

The null hypothesis of interest is that of identical distribution functions,

$${H}_{0} : {F}_{1}(y) = {F}_{2}(y) = \cdots= {F}_{k}(y),$$
(12.29)

which arises most naturally if we randomly assigned \(N\) experimental units to \(k\) treatment groups with sample sizes \({n}_{1},{n}_{2},\ldots,{n}_{k}\). (The above data are not of this type.) There are

$${M}_{N} = \left ({ N \atop {n}_{1}{n}_{2}\cdots {n}_{k}} \right ) = \frac{N!} {{n}_{1}!{n}_{2}!\cdots {n}_{k}!}$$

possible assignments, which of course is the relevant number of permutations even if the data do not come from a randomized experiment. Pitman [1938] proposed the permutation approach for the ANOVA \(F\) statistic

$$F = \frac{ \frac{1} {k - 1}\sum\limits _{i=1}^{k}{n}_{ i}{({\overline{Y }}_{i.} -{\overline{Y }}_{..})}^{2}} { \frac{1} {N - k}\sum\limits _{i=1}^{k} \sum\limits _{j=1}^{{n}_{i} }{({Y }_{ij} -{\overline{Y }}_{i.})}^{2}},$$
(12.30)

where \({\overline{Y}}_{i.} = {n}_{i}^{-1}\sum\limits _{j=1}^{{n}_{i}}{Y}_{ij}\) and \({\overline{Y}}_{..} = {N}^{-1}\sum\limits _{i=1}^{k}{n}_{i}{\overline{Y}}_{i.}\). The number of permutations \({M}_{N}\) gets large very fast. For example, with \(k = 3\), \(N = 15\), \({n}_{1} = {n}_{2} = {n}_{3} = 5\), we get \({M}_{N} = \left({15 \atop 5\;5\;5}\right) = 756,756\). Thus Monte Carlo or asymptotic approximations are more important than in the two-sample case. For the above housing data, the ANOVA \(F\) in (12.30) is \(F = 1.24\) with \(p\)-value = .30 from the \(F(3,76)\) distribution. The exact permutation \(p\)-value is obtained by computing \(F\) for each of the \(1.9 \times 10^{44}\) distinct allocations of \(\left \{{Y}_{i1},\ldots,{Y}_{i{n}_{i}};\, i = 1,\ldots,4\right \}\) to samples of size \({n}_{1} = 27\), \({n}_{2} = 22\), \({n}_{3} = 17\), and \({n}_{4} = 14\), and finding the proportion of these greater than or equal to \(F = 1.24\). A Monte Carlo estimate of the exact permutation \(p\)-value is .267 based on 100,000 resamples with standard error = .0014. Because the housing ratios are quite skewed with a number of large observations, it is not surprising that \(F\) is small. Now we turn to rank methods that naturally limit the effect of outliers.
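Since full enumeration is hopeless here, a Monte Carlo approximation like the following R sketch is the practical route; y is the vector of all N responses and g the group factor, both hypothetical names.

B <- 100000
F0 <- oneway.test(y ~ g, var.equal = TRUE)$statistic   # observed ANOVA F (12.30)
Fstar <- replicate(B, oneway.test(sample(y) ~ g, var.equal = TRUE)$statistic)
mean(Fstar >= F0)                                      # estimated permutation p-value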


6.1 Rank Methods for the k-Sample Location Problem

Kruskal and Wallis [1952] proposed the rank extension of the Wilcoxon Rank Sum statistic to the \(k\)-sample situation. The rank approach is to put all \(N\) observations together and rank them; let \({R}_{ij}\) be the rank of \({Y}_{ij}\) in the combined sample. Further define the sample sums

$${S}_{i} = \sum\limits _{j=1}^{{n}_{i} }a({R}_{ij}),$$

where the scores \(a(i)\) could be of any form for permutational analysis, but for asymptotic results we assume \(a(i) = \phi (i/(N + 1))\), where ϕ is a scores generating function as in Theorem 12.3 (p. 465). The Kruskal-Wallis statistic uses \(a(i) = i\) or equivalently \(a(i) = i/(N + 1)\). Note that \({S}_{i}\) is just a two-sample linear rank statistic for comparing the \(i\)th population to all the others combined. The general linear rank statistic form for comparing the \(k\) populations is then

$$Q = \sum\limits _{i=1}^{k} \frac{1} {{s}_{a}^{2}{n}_{i}}{({S}_{i} - {n}_{i}\overline{a})}^{2} = \sum\limits _{i=1}^{k}\left (\frac{N - {n}_{i}} {N} \right )\frac{{({S}_{i} -\mathrm{ E}{S}_{i})}^{2}} {\mathrm{Var}({S}_{i})},$$
(12.31)

where \({s}_{a}^{2} = {(N - 1)}^{-1}\sum\limits _{i=1}^{N}{\{a(i) -\overline{a}\}}^{2}\), \(\overline{a} = {N}^{-1}\sum\limits _{i=1}^{N}a(i)\), and Var\(({S}_{i})\) is given by (12.4, p. 459) with the constants \({c}_{i}\) in that expression equal to 1 for \({n}_{i}\) of them and 0 otherwise. The reason for giving the second form in (12.31) is that it is then clear that \(\mathrm{E}(Q) = k - 1\) under the null hypothesis of equal populations. The Kruskal-Wallis statistic that allows for ties is explicitly given by

$$H = \frac{(N - 1)\left \{\sum\limits _{i=1}^{k}{n}_{ i}{\left ({\overline{R}}_{i.} -\frac{N + 1} {2} \right )}^{2}\right \}} {\left (\sum\limits _{i=1}^{k} \sum\limits _{j=1}^{{n}_{i} }{R}_{ij}^{2}\right ) - N{(N + 1)}^{2}/4},$$

where \({\overline{R}}_{i.} = {n}_{i}^{-1}\sum\limits _{j=1}^{{n}_{i}}{R}_{ij}\). If there are no ties in the data, then

$$\sum\limits _{i=1}^{k} \sum\limits _{j=1}^{{n}_{i} }{R}_{ij}^{2} = N(N + 1)(2N + 1)/6,$$

and \(H\) reduces to the more familiar form

$$H = \frac{12} {N(N + 1)}\sum\limits _{i=1}^{k}{n}_{ i}{\left ({\overline{R}}_{i.} -\frac{N + 1} {2} \right )}^{2}.$$

Under the null hypothesis (12.29, p. 480), standard asymptotic theory similar to Theorem 12.3 (p. 465) yields that \(Q\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\) as \(\min \{{n}_{1},\ldots,{n}_{k}\} \rightarrow \infty\). The \({\chi }_{k-1}^{2}\) approximation is not very good in small samples, but fortunately the \(F\) statistic on the scores \(a({R}_{ij})\) is a monotone function of \(Q\),

$${F}_{\mathrm{R}} = \left (\frac{N - k} {k - 1} \right )\left ( \frac{Q} {N - 1 - Q}\right ),$$

and using \(F(k - 1,N - k)\) as a reference distribution, or the Box-Andersen adjusted \(F(d(k - 1),d(N - k))\) distribution, yields excellent results. For the housing data above, \(H = 9.8856\) with \(p\)-value = 0.020 from the \({\chi }_{3}^{2}\) distribution, and \({F}_{\mathrm{R}} = 3.6283\) with \(p\)-value = 0.017 from the \(F(3,76)\) distribution. The Box-Andersen \(d = 0.9876\), and so the adjustment is very minor, only in the fourth decimal place. A Monte Carlo approximation to the exact \(p\)-value is .017 based on 100,000 samples with standard error .0004. So here the \(F\) distribution approximation is right on target to 3 decimals, but the \({\chi }^{2}\) approximation is not bad due to the fairly large samples.
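In R, the Kruskal-Wallis statistic and its \({F}_{\mathrm{R}}\) version take only a few lines; y and g are the hypothetical response vector and group factor as before.

Q <- unname(kruskal.test(y ~ g)$statistic)
k <- nlevels(factor(g)); N <- length(y)
FR <- ((N - k)/(k - 1)) * Q/(N - 1 - Q)          # monotone transform of Q
pf(FR, k - 1, N - k, lower.tail = FALSE)         # F reference distribution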

In Figure 12.4 we look at much smaller sample sizes for \(k = 3\) and \(k = 5\). Figure 12.4 shows the difference between the exact permutation \(p\)-value and each approximation versus the exact \(p\)-value for the Kruskal-Wallis statistic. Note that the left panel is more expanded in the vertical scale than the right panel and actually has less error. Nevertheless, the Box-Andersen approximation is the best in both plots and is generally very good for \(k > 2\). The \({\chi }_{k-1}^{2}\) approximation gets more conservative as \(k\) gets larger. This can be explained by the following large-\(k\) asymptotic results.

Fig. 12.4 (Exact \(P\)-Values − Approximate \(P\)-Values) versus Exact \(P\)-Values for the Kruskal-Wallis Statistic. \(F = F(k - 1,N - k)\), \({F}_{BA} = F(d(k - 1),d(N - k))\), and \({\chi }^{2} = {\chi }_{k-1}^{2}\)

6.2 Large-k Asymptotics for the ANOVA F Statistic

Brownie and Boos [1994] show under the null hypothesis of equal populations that

$$\sqrt{k}({F}_{\mathrm{R}} - 1)\stackrel{d}{\rightarrow }\mbox{ N}\left (0, \frac{2n} {n - 1}\right ),$$
(12.32)

for equal sample sizes \({n}_{1} = {n}_{2} = \cdots = {n}_{k} = n\) and \(k \rightarrow \infty\) with \(n\) fixed. Note that the usual result with \(n \rightarrow \infty\) and \(k\) fixed is \((k - 1){F}_{\mathrm{R}}\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\), similar to the result for \(Q\). The “large \(k\)” asymptotic result (12.32) implies that

$$\sqrt{k}\left ( \frac{Q} {k - 1} - 1\right )\stackrel{d}{\rightarrow }\mbox{ N}\left (0, \frac{2(n - 1)} {n} \right ),$$
(12.33)

as \(k \rightarrow \infty\) with \(n\) fixed, using

$$Q = \frac{(N - 1){F}_{\mathrm{R}}} {(N - k)/(k - 1) + {F}_{\mathrm{R}}}$$
(12.34)

(see Problem 12.17, p. 527). Note that comparing \(Q\) to a \({\chi }_{k-1}^{2}\) is asymptotically (\(k \rightarrow \infty\)) like comparing \(Q/(k - 1)\) to a N\(\{1,2/(k - 1)\}\) because a \({\chi }_{k-1}^{2}\) random variable obeys the Central Limit Theorem (it is a sum of \({\chi }_{1}^{2}\) random variables). However, (12.33) says that \(Q/(k - 1)\) should be compared to a N\(\{1,2(n - 1)/(kn)\}\) distribution. Because \(2(n - 1)/(kn) < 2/(k - 1)\), using the \({\chi }_{k-1}^{2}\) distribution with \(Q\) results in conservative true levels. For example, if \(k = 5\) and \(n = 5\), then the large-sample 95th percentile from N\(\{1,2/(k - 1)\}\) is \(1 + {(2/4)}^{1/2}(1.645) = 2.16\), and the approximate true level of a nominal α = .05 test is

$$P(Q \geq{\chi }_{4}^{2}(.95)) \approx P(1 + {(8/25)}^{1/2}Z \geq2.16) = P(Z \geq2.05) =.02.$$

In contrast, use of \({F}_{\mathrm{R}}\) with an \(F(k - 1,N - k)\) reference distribution is supported by (12.32) under \(k \rightarrow \infty\) and by the usual asymptotics \((k - 1){F}_{\mathrm{R}}\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\) when \(n \rightarrow \infty\) with \(k\) fixed. We leave those details for Problem 12.18 (p. 527). Thus, it is not surprising that the \(F\) approximations in Figure 12.4 are much better than the \({\chi }_{k-1}^{2}\) ones.

6.3 Comparison of Approximate P-Values – Data on Cadmium in Rat Diet

Nation et al. [1984] studied the effect of diets containing cadmium (Cd) on the neurobehavior of adult rats. The data consist of the number of platform descents during a passive-avoidance training scheme for 27 rats randomly assigned to three groups.


The control group had no Cd in the diet, and Cd1 and Cd5 refer to daily diets containing 1 milligram and 5 milligrams, respectively, of Cd per kilogram of body weight. The usual one-way ANOVA \(F = 5.10\), and the permutation \(p\)-value for the \(F\) statistic is \(\widehat{p} = 0.016\) based on 100,000 random permutations. The \(F(2,24)\) distribution gives \(p\)-value = .014, and the Box-Andersen correction factor is \(d = .954\), leading to \(p\)-value = .016. The Kruskal-Wallis rank statistic is \(Q = 8.18\) with permutation \(p\)-value \(\widehat{p} = .012\) based on 100,000 random permutations. The \({\chi }_{2}^{2}\) approximation gives \(p\)-value = .017. The associated \(F\) statistic is \({F}_{\mathrm{R}} = 5.51\) with \(p\)-value = .011. The Box-Andersen correction factor is \(d = 1 - 1.2/24 = .95\), leading to \(p\)-value = .012. A summary is as follows:

Statistic   Method                      P-value
F           Monte Carlo (B = 100,000)   0.016
            F(2, 24)                    0.014
            Box-Andersen                0.016
KW          Monte Carlo (B = 100,000)   0.012
            \({\chi }_{2}^{2}\)         0.017
            F(2, 24)                    0.011
            Box-Andersen                0.012

As expected, the \(F\) approximations give excellent \(p\)-values.

6.4 Other Types of Alternative Hypotheses

The \(k\)-sample \(F\) statistic and Kruskal-Wallis statistic are used to compare the centers or locations of the \(k\) populations. Other statistics could be used for that purpose, perhaps ones more suited to long-tailed or skewed populations. The logrank or Savage scores, for example, are asymptotically optimal for detecting shifts in the scale parameter of exponential populations (or the shift parameter of extreme value distributions).

Other types of alternatives may also be of interest. For example, there may be an implied order in the populations, say increasing doses, and there may be interest in trends in location. There might also be interest in comparing the spread of the populations or even the skewness.

These latter alternatives present a problem to permutation and rank methods because the null hypothesis of interest may not be the one of identical populations. For comparing spread, the usual null hypothesis of interest would be equal spread rather than identical populations. In such a situation, use of the permutation approach would require subtraction of unknown location parameters. We first discuss ordered alternatives in location.

6.5 Ordered Means or Location Parameters

Recall Section 3.6.1a (p. 154), where we discussed likelihood-based methods for ordered alternatives. Here we discuss permutation methods with simple statistics in the context of a Phase I toxicology study where there seem to be trends in both the means and variances with dose:

Dose                             \(\overline{Y}\)   \({s}_{n-1}\)
0     1.44   1.63   1.40   1.59    1.52    0.11
1     1.27   1.50   1.45   1.57    1.45    0.13
2     1.26   1.07   1.38   1.75    1.37    0.29
3     1.04   1.14   1.46   1.06    1.18    0.19
4     1.37   0.79   1.32   1.42    1.23    0.29

The \(F\) statistic for comparing means is \(F = 1.77\), and the usual \(F(4,16)\) distribution and the Box-Andersen approximation give \(p\)-value = 0.19. Similarly, a Monte Carlo estimated \(p\)-value based on 10,000 random permutations gives \(\widehat{p} = 0.19\). The Kruskal-Wallis statistic is \(H = 6.73\) with \({\chi }_{4}^{2}\) \(p\)-value = 0.15. The \(F\) approximation from \({F}_{\mathrm{R}} = 2.06\) and the Box-Andersen approximation both give \(p\)-value = 0.14. A Monte Carlo estimated \(p\)-value based on 10,000 random permutations gives \(\widehat{p} = 0.14\). So the global comparison of location is not significant at usual levels.

Suppose that we consider \({H}_{0}\): identical populations versus \({H}_{a}\): means are decreasing. The permutation approach with \({M}_{N} = \left({20 \atop 4\;4\;4\;4\;4}\right)\) permutations may be used with the \(t\) statistic from a regression of the observations on dose, or equivalently Pearson's correlation coefficient (see also the next section). Pearson's correlation coefficient is \(r = -0.53\) with Monte Carlo estimated \(p\)-value \(\widehat{p} = 0.007\) based on 10,000 random permutations. Spearman's correlation coefficient is −0.56 with \(\widehat{p} = 0.005\). Another statistic that could have been used is the likelihood ratio statistic for decreasing means assuming the data are normally distributed (see Section 3.6.1a, p. 154). In addition to Spearman's correlation coefficient, the standard rank-based statistic is the Jonckheere-Terpstra statistic based on summing pairwise Wilcoxon Rank Sum statistics in increasing order, \(\sum\limits_{i<j}{W}_{ij}\), where \({W}_{ij}\) is the Wilcoxon Rank Sum for comparing dose group \(i\) with dose group \(j\) (see Lehmann 1975, p. 233). Its value here is −2.458 with exact permutation \(p\)-value = 0.0069. So we can be pretty confident that there is a downward trend in means or other location measures.
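A Monte Carlo version of the correlation-based trend test is a minimal sketch in R, using the dose-response data tabled above.

y <- c(1.44, 1.63, 1.40, 1.59,  1.27, 1.50, 1.45, 1.57,
       1.26, 1.07, 1.38, 1.75,  1.04, 1.14, 1.46, 1.06,
       1.37, 0.79, 1.32, 1.42)
dose <- rep(0:4, each = 4)
B <- 10000
r0 <- cor(dose, y)                     # observed Pearson correlation, about -0.53
rstar <- replicate(B, cor(dose, sample(y)))
mean(rstar <= r0)                      # one-sided p-value for a decreasing trend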

6.6 Scale or Variance Comparisons

Motivated by the apparent increase in variances for the dose-response data above, we now discuss hypotheses about variances or scale parameters. Unfortunately, there is a philosophical dilemma for using permutation procedures here. Usually, the typical set of hypotheses when testing for unequal variances is for a semiparametric model, \(P({Y}_{ij} \leq y) = {F}_{0}((y - {\mu }_{i})/{\sigma }_{i})\), \(j = 1,\ldots,{n}_{i}\); \(i = 1,\ldots,k\), where \({F}_{0}\) is an unknown distribution function. Note that if \({F}_{0}(x)\) has mean 0 and variance 1, then \({\mu }_{i}\) is the \(i\)th population mean, and \({\sigma }_{i}^{2}\) is the \(i\)th population variance. In any case, under this semiparametric model, the \(i\)th standard deviation is \(c{\sigma }_{i}\) for some constant \(c\), and we can always refer to \({\sigma }_{i}\) as a scale parameter. The hypotheses for increasing scale are then \({H}_{0} : {\sigma }_{1} = \cdots = {\sigma }_{k}\) versus \({H}_{a} : {\sigma }_{1} \leq \cdots \leq {\sigma }_{k}\) with at least one inequality. The reason for this hypothesis formulation is that we often know that the means are different; therefore it makes little sense to assume identical populations when testing for variance differences. Basically, we usually want to test for variance differences in the presence of location differences.

Unfortunately, the permutation argument requires that the null hypothesis be one of identical populations. It makes intuitive sense to center the data first by subtracting means, but these residuals \({Y}_{ij} -{\overline{Y}}_{i}\) no longer satisfy the exchangeability required for using Theorem 12.1 (p. 457). The permutation distribution is correct asymptotically, but the exact level-α property no longer holds. An overview of the scale testing problem is given in Boos and Brownie [2004]. The best method that has emerged for comparing scales is to use \(t\) or \(F\) statistics on the data \({Y}_{ij}\) replaced by \(\vert {Y}_{ij} - {M}_{i}\vert\), where \({M}_{i}\) is the \(i\)th sample median.
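In R, this recommended approach amounts to an ANOVA \(F\) on absolute deviations from the group medians, as in the sketch below (hypothetical y and g again).

ad <- abs(y - ave(y, g, FUN = median))          # |Y_ij - M_i|
oneway.test(ad ~ g, var.equal = TRUE)           # F test on the absolute deviations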

One way to avoid the centering problem for the dose-response data is to reduce the data to the sample standard deviations (or some other scale estimator) and then calculate an appropriate statistic for the \(5! = 120\) permutations possible. For the correlation between dose and standard deviation, we get \(r = 0.79\) and \(p\)-value \(= 7/120 = .058\). If we use the likelihood ratio test for increasing variances for normal distributions, we get \(p\)-value = 5/120 = .042. There is a loss of information when the number of permutations gets reduced so much, from \({M}_{N} = \left({20 \atop 4\;4\;4\;4\;4}\right)\) to \({M}_{N} = 120\); perhaps the loss of information is just a discreteness problem caused by having too few permutations. This can be seen more clearly by calculating the exact permutation test on the data reduced to the five means; the correlation is higher than when using all the data, but the \(p\)-value = 2/120 = .017 is much larger than the .007 value we obtained previously with the whole data set.

We note that the use of rank statistics for scale comparisons has not been very successful. The subtraction of means or medians ruins the permutation argument, as mentioned above. Rank statistics for scale based on centered data are asymptotically distribution free if the samples are symmetrically distributed, but the larger problem is that rank tests for scale tend to have low power in small samples. Whereas rank tests for location perform well in small samples and are consistent with asymptotic relative efficiency comparisons, rank tests for scale are not nearly as powerful in small samples as asymptotic relative efficiency calculations would suggest.

7 Testing Independence and Regression Relationships

Regression methods are among the most important tools of statistics. Unfortunately, permutation methods can really be applied only in the simplest setting of \((X,Y)\) pairs, that is, correlation or simple regression (not necessarily linear). Here we discuss that simple situation and explain at the end of the section why permutation methods cannot handle the more interesting case of multiple explanatory variables.

Suppose that we have iid random pairs \(({X}_{1},{Y}_{1}),\ldots,({X}_{n},{Y}_{n})\) and permute each coordinate independently to get \(n!\) different pairings. In reality, we need only permute one of the coordinates to obtain all the different pairings. For example, suppose that \(n = 3\) with pairs (1, 2.5), (2, 3.7), (3, 6.4). Then the 6 possible permutations are

1          2          3          4          5          6
(1,2.5)    (1,3.7)    (1,6.4)    (1,2.5)    (1,3.7)    (1,6.4)
(2,3.7)    (2,2.5)    (2,3.7)    (2,6.4)    (2,6.4)    (2,2.5)
(3,6.4)    (3,6.4)    (3,2.5)    (3,3.7)    (3,2.5)    (3,3.7)

Pitman [1937b] suggested that a test for independence of \(X\) and \(Y\) based on the sample correlation

$$r = \frac{\sum\limits _{i=1}^{n}({X}_{ i} -\overline{X})({Y }_{i} -\overline{Y })} {{\left [\sum\limits _{i=1}^{n}{({X}_{ i} -\overline{X})}^{2} \sum\limits _{i=1}^{n}{({Y }_{ i} -\overline{Y })}^{2}\right ]}^{1/2}}$$

use this permutation distribution for critical values. A permutationally equivalent statistic is the least squares slope estimate \(\widehat{\beta } = \sum\limits _{i=1}^{n}({X}_{i} -\overline{X})({Y}_{i} -\overline{Y})/\sum\limits _{i=1}^{n}{({X}_{i} -\overline{X})}^{2}\). Other popular measures that could be used to test independence are Kendall's rank correlation and Spearman's rank correlation. Spearman's correlation coefficient \({r}_{\mathrm{S}}\) is obtained by replacing \({X}_{i}\) by its rank among \({X}_{1},\ldots,{X}_{n}\) and \({Y}_{i}\) by its rank among \({Y}_{1},\ldots,{Y}_{n}\), and computing the Pearson correlation \(r\) between these pairs of ranks. It is important to keep in mind that the null hypothesis is independence of \(X\) and \(Y\) and not zero correlation. Independence is needed for the \(n!\) different pairings to have the same distribution and thus for Theorem 12.1 (p. 457) to apply.


Typical approximations to the permutation distribution of \(r\) (and similarly of \({r}_{\mathrm{S}}\)) are to compare \({(n - 1)}^{1/2}r\) to a standard normal distribution or \({(n - 2)}^{1/2}r/{(1 - {r}^{2})}^{1/2}\) to a \(t(n - 2)\) distribution. Pitman [1937b] gave the first two permutation moments of \({r}^{2}\), \(\mathrm{{E}}_{\mathrm{P}}({r}^{2}) = 1/(n - 1)\), and

$$\mathrm{{E}}_{\mathrm{P}}({r}^{4}) = \frac{3} {(n - 1)(n + 1)} + \frac{(n - 2)(n - 3)} {n(n + 1){(n - 1)}^{3}}\left \{ \frac{{k}_{4}(X)} {{k}_{2}{(X)}^{2}}\right \}\left \{ \frac{{k}_{4}(Y )} {{k}_{2}{(Y )}^{2}}\right \},$$

where the sample cumulants \({k}_{2}\) and \({k}_{4}\) were given in (12.21, p. 469) and (12.22, p. 469), respectively. Note that these moments are straightforward from the results in Section 12.4.2 (p. 458), since the numerator of \(r\) has the form (12.3, p. 458) of a linear statistic, and the denominator is constant over permutations. If the pairs are iid with a bivariate normal distribution, then \({r}^{2}\) has a beta\((1/2,n/2 - 1)\) distribution with \(\mathrm{E}({r}^{2}) = 1/(n - 1)\) and \(\mathrm{E}({r}^{4}) = 3/\{(n - 1)(n + 1)\}\). Because the permutation moments and normal theory moments are so close, Pitman [1937b] suggested using the beta approximation, which is equivalent to comparing \((n - 2){r}^{2}/(1 - {r}^{2})\) to an \(F(1,n - 2)\) distribution. Box and Watson [1962] generalized these results to the full \(p\)-regressor case for the test that all regressors are independent of \(Y\). They derived the adjusted \(F\) approximation (see Box and Watson 1962, p. 100), which for the \(p = 1\) case here is to compare \((n - 2){r}^{2}/(1 - {r}^{2})\) to an \(F(d,d(n - 2))\) distribution, where

$$\frac{1} {d} = 1 + \frac{(n + 1){\alpha }_{1}} {n - 1 - 2{\alpha }_{1}},\quad {\alpha }_{1} = \frac{n - 3} {2n(n - 1)}\left \{ \frac{{k}_{4}(X)} {{k}_{2}{(X)}^{2}}\right \}\left \{ \frac{{k}_{4}(Y )} {{k}_{2}{(Y )}^{2}}\right \}.$$

In large samples, \(d \approx 1 +\{\mathrm{Kurt}(X) - 3\}\{\mathrm{Kurt}(Y) - 3\}/2n\), revealing a double Type I error robustness to nonnormality: if either \(X\) or \(Y\) is approximately normally distributed, then the usual \(F\) approximation is very good. To illustrate numerically, recall \(r = -.53\) from the dose-response data of Section 12.6.5, where the Monte Carlo estimated one-sided \(p\)-value was \(\widehat{p} = .007\). Taking half of the \(F(1,18)\) \(p\)-value approximation for \(18{r}^{2}/(1 - {r}^{2}) = 7.03\), we get \(p\)-value = .008. Similarly, for Spearman's \({r}_{\mathrm{S}} = -.56\) we obtained previously \(\widehat{p} = .005\). Using one half of the \(F(1,18)\) \(p\)-value for \(18{r}_{\mathrm{S}}^{2}/(1 - {r}_{\mathrm{S}}^{2}) = 8.22\) yields \(p\)-value = .005.


Now let us move to the more complicated situation of the linear model,

$${Y}_{i} = {\beta }_{0} + {\beta }_{1}{X}_{1i} + {\beta }_{2}{X}_{2i} + {e}_{i},\quad i = 1,\ldots,n,$$

where we assume \({e}_{1},\ldots,{e}_{n}\) are iid from some distribution and independent of all the \({X}_{ij}\). As mentioned above, permuting the \(Y\)'s under the assumption \({H}_{0} : {\beta }_{1} = {\beta }_{2} = 0\) yields a suitable permutation distribution for testing independence of \(Y\) and \(({X}_{1},{X}_{2})\). Unfortunately, we are usually much more interested in testing \({H}_{0} : {\beta }_{2} = 0\) with \({\beta }_{0}\) and \({\beta }_{1}\) unrestricted. Without knowledge of \({\beta }_{1}\), however, an exact permutation procedure for \({H}_{0} : {\beta }_{2} = 0\) is not possible. (Actually, it is possible to take the maximum over permutation \(p\)-values for each value of \({\beta }_{1}\) in a confidence interval under \({H}_{0}\), as described in Berger and Boos [1994], but the loss in power is typically not worth the gain in exactness.) Anderson and Robinson [2001] review a number of different proposals that use residuals from first fitting the reduced model, and show that they are asymptotically correct but do not satisfy the assumptions of Theorem 12.1 (p. 457). Fortunately, standard linear model and rank-based linear model testing procedures have good Type I error robustness properties in general. The rank-based linear model methods given in Ch. 5 of Hettmansperger [1984] have good Type II error robustness properties as well. Similarly, the M-estimation regression methods discussed in Ch. 5 also have good robustness properties.

We conclude this section with an example that illustrates how easy it is to use Monte Carlo approximation in an autocorrelation setting.

Example 12.1 (Raleigh snowfall). 

Is the total snowfall in one year independent of the total snowfall in other years? The left panel of Figure 12.5 plots Raleigh, NC, annual snowfall for 1962–1991 versus year. The right panel plots each year's snowfall versus the previous year's snowfall. The sample correlation from the right panel is \(r = .32\). Does that suggest nonzero autocorrelation? The null hypothesis for a permutation approach is that the sequence of yearly snowfalls is iid or at least exchangeable. Below we give R code for sampling \(B\) permutations from the set of 30! possible permutations, computing the lag-1 sample correlation for each, and estimating the one-sided \(p\)-value for a positive autocorrelation. Using \(B = 10,000\), we get \(\widehat{p} = .027\) with standard error .0016. Thus there is good evidence of a positive autocorrelation. The main point here is to illustrate how easy it is to carry out the permutation test.

Fig. 12.5 Annual snowfall in Raleigh, NC, 1962–1991 (left panel) and annual snowfall versus annual snowfall of previous year (right panel)

r.auto <- function(x) {
  # Lag-1 sample autocorrelation via the Pearson correlation
  n <- length(x)
  cor(x[1:(n-1)], x[2:n])
}

perm1 <- function(b, x, stat, ...) {
  # Gives est. permutation p-value for vector x.
  # Assumes test rejects for large values of stat.
  call <- match.call()
  n <- length(x)
  t0 <- stat(x)
  res <- numeric(b)
  for (i in 1:b) {
    perm.xx <- sample(x)
    res[i] <- stat(perm.xx)
  }
  pvalue <- sum(res >= t0)/b
  se <- sqrt(pvalue*(1 - pvalue)/b)
  return(list(call = call,
              results = data.frame(nperm = b, stat0 = round(t0, 4),
                                   pvalue = pvalue, se = round(se, 5))))
}

> set.seed(2458)
> perm1(10000, raleigh.snow$snow, r.auto)
  nperm  stat0 pvalue      se
1 10000 0.3245 0.0269 0.00162

8 One-Sample Test for Symmetry about θ0 or Matched Pairs Problem

Fisher [1935] introduced the permutation approach for the matched-pairs problem in a discussion of Darwin’s data on self-fertilized and cross-fertilized plants. There were 15 pairs of plants, and the differences

$$49,-67,8,16,6,23,28,41,14,29,56,24,75,60,-48$$

have mean \(\overline{D} = 20.933\), \(s = 37.744\), and \(t = 2.148\) for testing \({H}_{0} : {\mu }_{\mathrm{D}} = 0\) versus \({H}_{a} : {\mu }_{\mathrm{D}}\neq 0\), where \({\mu }_{\mathrm{D}}\) is the population mean difference. The two-sided \(p\)-value is .0497 from the \(t\) table with 14 degrees of freedom. Alternatively, consider Fisher's permutation argument. There were \({2}^{15}\) possible random assignments of types of seeds to the 15 blocks of size 2. Thus, Fisher considered all \({2}^{15}\) sums \(\sum\limits_{i=1}^{15}{D}_{i}\), where \({D}_{i}\) is the \(i\)th difference, and found only 835 + 28 = 863 that are greater than or equal to the observed sum of 314. The two-sided \(p\)-value is (2)(863)/32,768 = .0527 (by symmetry there are 863 sums ≤ −314). Note that \(t = \sqrt{n}\,\overline{D}/s\) is permutationally equivalent to \(\sum\limits_{i=1}^{15}{D}_{i}\) because \(t\) is a monotonic function of \(\sum\limits_{i=1}^{15}{D}_{i}\) and otherwise depends only on \(\sum\limits_{i=1}^{15}{D}_{i}^{2}\), which is constant over all \({2}^{15}\) permutations.
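With only \({2}^{15} = 32{,}768\) sign assignments, Fisher's enumeration can be reproduced exactly. Here is a short R sketch using the differences listed above.

dd <- c(49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48)
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(dd))))  # all 2^15 sign vectors
sums <- signs %*% dd                      # every sign-change value of sum(D_i)
sum(sums >= sum(dd))                      # 863
2*mean(sums >= sum(dd))                   # two-sided p-value = .0527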

Let us consider the theory behind Fisher's approach. The population null model is that the differences \({D}_{1},\ldots,{D}_{n}\) are independent, each with a symmetric distribution about some \({\theta }_{0}\); often \({\theta }_{0} = 0\). The distributions do not need to be the same, merely symmetric about \({\theta }_{0}\). Thus

$${H}_{0} : {D}_{i} - {\theta }_{0}\stackrel{d}{=}{\theta }_{0} - {D}_{i},\quad i = 1,\ldots,n.$$
(12.35)

The group of transformations to be used with Theorem 12.1 (p. 457) is the set of \({2}^{n}\) sign changes applied to the data with \({\theta }_{0}\) subtracted. For notational simplicity, let \({D}_{i0} = {D}_{i} - {\theta }_{0}\), \(i = 1,\ldots,n\). Then, for example, if \(n = 4\), one such transformation is \((-,+,+,-)\). It would transform

$$({D}_{10},{D}_{20},{D}_{30},{D}_{40})$$
(12.36)

into

$$(-{D}_{10},{D}_{20},{D}_{30},-{D}_{40}).$$
(12.37)

Because of (12.35) and independence, all \({2}^{n}\) transformations of the original data have the same distribution. That is, under (12.35) and independence, the joint distribution of (12.36) is the same as (12.37), etc. Thus, the conditions of Theorem 12.1 (p. 457) apply with the group of sign changes, and Fisher's original method is a valid permutation approach.

8.1 Moments and Normal Approximation

Now let us abstract the above situation slightly in order to compute moments and approximations. Suppose that \({d}_{1},\ldots,{d}_{n}\) is a sequence of real constants, playing the role of the observed \({D}_{i} - {\theta }_{0}\) above. Let \({c}_{1},\ldots,{c}_{n}\) be iid random variables with \(P({c}_{i} = 1) = P({c}_{i} = -1) = 1/2\); these play the role of making the sign changes. Now consider the linear statistic \(T = \sum\limits _{i=1}^{n}{c}_{i}{d}_{i}\). Note that the \({c}_{i}\) are symmetrically distributed around 0, so that all odd moments of \({c}_{i}\) are 0 and all even moments are equal to 1. Then \(T\) is also symmetrically distributed about 0 with odd moments 0, \(\mathrm{E}({T}^{2}) =\mathrm{Var}(T) = \sum\limits _{i=1}^{n}{d}_{i}^{2}\), and \(\mathrm{E}({T}^{4}) = 3{(\sum\limits _{i=1}^{n}{d}_{i}^{2})}^{2} - 2\sum\limits _{i=1}^{n}{d}_{i}^{4}\). Now we give a Central Limit Theorem for \(T\). A more general version and proof are given in Hettmansperger [1984, p. 302–303].

Theorem 12.4.

Suppose that \({d}_{1},\ldots,{d}_{n}\) and \({c}_{1},\ldots,{c}_{n}\) are defined as above and

$$\frac{1} {n}\sum\limits _{i=1}^{n}{d}_{ i}^{2}\rightarrow {\sigma }^{2}< \infty \qquad \mbox{ as}\;\;n \rightarrow \infty.$$

Then

$$\frac{T} {\sqrt{\mbox{ Var} (T)}} = \frac{\sum\limits _{i=1}^{n}{c}_{ i}{d}_{i}} {{\left (\sum\limits _{i=1}^{n}{d}_{ i}^{2}\right )}^{1/2}}\stackrel{d}{\rightarrow }N(0,1)\qquad \mbox{ as}\;\;n \rightarrow \infty.$$

Now we apply this theorem to the permutation distribution of \(\sum\limits_{i=1}^{n}{D}_{i}\) when sampling from a population.

Theorem 12.5.

Suppose that \({D}_{1},\ldots,{D}_{n}\) are iid random variables satisfying (12.35) and with variance \({\sigma }^{2} < \infty\). Then the permutation distribution function of \(\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})\) under the group of sign changes satisfies

$${P}^{{_\ast}}\left \{ \frac{\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})}{\sqrt{n}\,\sigma } \leq x\right \}\stackrel{wp1}{\rightarrow }\Phi (x)\qquad \mbox{ for all}\;x\;\mbox{ as}\;\;n \rightarrow \infty,$$

where \(\Phi\) is the standard normal distribution function.

We have used the notation \({P}^{{_\ast}}\) to emphasize that the probability is taken with respect to the permutation distribution holding \({D}_{1},\ldots,{D}_{n}\) fixed. An alternative statement of the result is that the permutation distribution of \(\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})/\sqrt{n}\sigma \) converges in distribution to a standard normal distribution with probability 1. Note also that we could just as well have put \(\{\sum\limits _{i=1}^{n}{({D}_{i} - {\theta }_{0})}^{2}\}^{1/2}\) in place of \(\sqrt{n}\sigma \) in the conclusion, giving

$$\frac{\sum\limits _{i=1}^{n}({D}_{ i} - {\theta }_{0})} {{\left \{\sum\limits _{i=1}^{n}{({D}_{ i} - {\theta }_{0})}^{2}\right \}}^{1/2}}\stackrel{{d}^{{_\ast}}}{\rightarrow }\mbox{ N}(0,1)\qquad \mbox{ as}\;\;n \rightarrow \infty \quad wp1.$$
(12.38)

The result follows from Theorem 12.4 because, for each infinite sequence \({D}_{1}(\omega ),{D}_{2}(\omega ),\ldots\) where \(\omega \in \Omega \) with \(P(\Omega ) = 1\),

$$\frac{1} {n}\sum\limits _{i=1}^{n}{({D}_{ i}(\omega ) - {\theta }_{0})}^{2}\rightarrow {\sigma }^{2}\qquad \mbox{ as}\;\;n \rightarrow \infty $$

by the Strong Law of Large Numbers. For each of these sequences, Theorem 12.4 holds, and thus the convergence in distribution holds with probability 1.

8.2 Box-Andersen Approximation

The Box-Andersen adjusted \(F\) approximation to the permutation distribution of \(\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})\) uses the beta version of \({t}^{2} = n{(\overline{D} - {\theta }_{0})}^{2}/{s}^{2}\),

$$b({t}^{2}) = \frac{{t}^{2}} {n - 1 + {t}^{2}} = \frac{n{(\overline{D} - {\theta }_{0})}^{2}} {\sum\limits _{i=1}^{n}{({D}_{ i} - {\theta }_{0})}^{2}}.$$

Under an iid normal distribution assumption for \({D}_{1},\ldots,{D}_{n}\), \(b({t}^{2})\) has a beta\((1/2,(n - 1)/2)\) distribution with mean \(1/n\) and variance \(2(n - 1)/\{{n}^{2}(n + 2)\}\). Using the results in the previous section for \(T = \sum\limits _{i=1}^{n}{c}_{i}{d}_{i}\), where \({d}_{i} = ({D}_{i} - {\theta }_{0})/n\), the permutation moments of \(b({t}^{2})\) are \(\mathrm{{E}}_{\mathrm{P}}\{b({t}^{2})\} = 1/n\) and

$$\mathrm{{Var}}_{\mathrm{P}}\{b({t}^{2})\} = \frac{2(n - 1)} {{n}^{2}(n + 2)}\left (1 -\frac{{f}_{2} - 3} {n - 1} \right ),$$
(12.39)

where \({f}_{2} = (n + 2)\sum\limits _{i=1}^{n}{({D}_{i} - {\theta }_{0})}^{4}/\{\sum\limits _{i=1}^{n}{({D}_{i} - {\theta }_{0})}^{2}\}^{2}\). Equating the permutation moments to those of a beta\((d/2,d(n - 1)/2)\) distribution leads to

$$d = 1 + \frac{{f}_{2} - 3} {n\{1 - {f}_{2}/(n + 2)\}}.$$
(12.40)

In the above derivation we have followed the notation in Box and Andersen [1955, p. 9], but their \(W\) is \(1 - b({t}^{2})\), and we relabeled their \({b}_{2}\) as \({f}_{2}\). Note that \({f}_{2}\) is close to the sample kurtosis of the \({D}_{i} - {\theta }_{0}\), and thus \(d \approx 1 +\{\mathrm{Kurt}(D) - 3\}/n\).

For the Darwin data, \(d = .94\) and the \(F\) adjusted two-sided \(p\)-value is .053. Recall from the previous analysis that the exact two-sided permutation \(p\)-value is .0527. The normal approximation here is \(Z = 1.9282\) with two-sided \(p\)-value = .054. Thus, the normal approximation is surprisingly good here, better than the \(F = {t}^{2}\) approximation that Fisher gave (.0497), but the Box-Andersen adjustment has made the \(F\) approximation slightly better than the normal approximation.
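These Darwin-data numbers are easy to reproduce; the R sketch below computes \({f}_{2}\), the adjustment \(d\) from (12.40), and the adjusted \(F\) \(p\)-value by referring \({t}^{2}\) to an \(F(d,d(n - 1))\) distribution.

dd <- c(49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48)
n <- length(dd)
t2 <- n*mean(dd)^2/var(dd)                     # t^2 = 2.148^2
f2 <- (n + 2)*sum(dd^4)/sum(dd^2)^2
d <- 1 + (f2 - 3)/(n*(1 - f2/(n + 2)))         # d = .94
pf(t2, d, d*(n - 1), lower.tail = FALSE)       # adjusted two-sided p-value = .053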

8.3 Signed Rank Methods

Now we turn to signed rank methods. Here again for simplicity we use the notation \({D}_{i0}\) for \({D}_{i} - {\theta }_{0}\). Let \({R}_{i}\) be the rank of \(\vert {D}_{i0}\vert\) among \(\vert {D}_{10}\vert ,\ldots,\vert {D}_{n0}\vert\). Let the sign function be defined by \(\mbox{ sign}(x) = I(x > 0) - I(x < 0)\), so that sign(0) = 0. Then the signed rank of \({D}_{i0}\) is \(\mbox{ sign}({D}_{i0}){R}_{i}\), although some authors use \(I({D}_{i0} > 0){R}_{i}\) as the definition of the signed rank. We illustrate with a simple data set from Wilcoxon [1945] on the difference between wheat yields in two treatments in 8 blocks:

\({D}_{i0}\)                        58    32    30    5    −7    6    11    10
\({R}_{i}\)                          8     7     6    1     3    2     5     4
\(\mbox{sign}({D}_{i0}){R}_{i}\)     8     7     6    1    −3    2     5     4
\(I({D}_{i0} > 0){R}_{i}\)           8     7     6    1     0    2     5     4

Then define \({W}^{+} = \sum\limits _{i=1}^{n}I({D}_{i0} > 0){R}_{i}\), \({W}^{-} = \sum\limits _{i=1}^{n}I({D}_{i0} < 0){R}_{i}\), and \(W = \sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0}){R}_{i}\). As long as there are no ties in the data, all three of these are equivalent, and \(W = {W}^{+} - {W}^{-}\). For the above sample we have \({W}^{+} = 33\), \({W}^{-} = 3\), and \(W = 30\). It is perhaps more standard to call \({W}^{+}\) the Wilcoxon Signed Rank statistic. Under (12.35, p. 491) and continuity of the data (implying no ties with probability 1), the basic facts are that:

  1. \(\mbox{sign}({D}_{10}),\ldots,\mbox{sign}({D}_{n0})\) and \(I({D}_{10} > 0),\ldots,I({D}_{n0} > 0)\) are independent of \(\vert {D}_{10}\vert ,\ldots,\vert {D}_{n0}\vert\) and thus also independent of \({R}_{1},\ldots,{R}_{n}\);

  2. \({W}^{+}\stackrel{d}{=}{W}^{-}\stackrel{d}{=}\sum\limits _{i=1}^{n}I({D}_{i0} > 0)\,i\), and \(I({D}_{10} > 0),\ldots,I({D}_{n0} > 0)\) are independent Bernoulli(1/2) random variables;

  3. \(W\stackrel{d}{=}\sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0})\,i\), and \(\mbox{sign}({D}_{10}),\ldots,\mbox{sign}({D}_{n0})\) are iid with \(P(\mbox{ sign}({D}_{i0}) = 1) = 1/2\);

  4. $$\mathrm{E}({W}^{+}) = \frac{1}{2}\sum\limits _{i=1}^{n}i = \frac{n(n + 1)}{4},\quad \mathrm{Var}({W}^{+}) = \frac{1}{4}\sum\limits _{i=1}^{n}{i}^{2} = \frac{n(n + 1)(2n + 1)}{24};$$

  5. $$\mathrm{E}(W) = 0,\quad \mathrm{Var}(W) = \sum\limits _{i=1}^{n}{i}^{2} = \frac{n(n + 1)(2n + 1)}{6}.$$

For the simple example above with \(n = 8\), we have \(\mathrm{E}({W}^{+}) = (8)(9)/4 = 18\) and \(\mathrm{Var}({W}^{+}) = (8)(9)(17)/24 = 51\), leading to the standardized value \((33 - 18)/\sqrt{51} = 2.1\), which is clearly the same for \({W}^{-}\) and \(W\) as well. From a normal table, we get the right-tailed \(p\)-value .018, whereas the exact permutation \(p\)-value for the signed rank statistics is \(5/256 = .01953\).
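The exact \(p\)-value is available directly from R's wilcox.test, which enumerates the null distribution of \({W}^{+}\) when \(n\) is small and there are no ties.

dd <- c(58, 32, 30, 5, -7, 6, 11, 10)
wilcox.test(dd, alternative = "greater")   # V = 33, exact p-value = 0.01953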

Although the Wilcoxon Signed Rank is by far the most important of the signed rank procedures, the general signed rank procedures are \({T}^{+} = \sum\limits _{i=1}^{n}I({D}_{i0} > 0)a({R}_{i})\), \({T}^{-} = \sum\limits _{i=1}^{n}I({D}_{i0} < 0)a({R}_{i})\), and

$$T = \sum\limits _{i=1}^{n}\mbox{ sign}({D}_{ i0})a({R}_{i}),$$
(12.41)

where the scores \(a(i)\) could be of any form. The analogues of the above properties for \(W\) hold for the general signed rank statistics. In particular, \(T\stackrel{d}{=}\sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0})a(i)\) simplifies the distribution and moment calculations in the case of no ties. In the case of ties, the permutation variance of \(T\), given the midranks \({R}_{1},\ldots,{R}_{n}\), is \(\sum\limits _{i=1}^{n}\{a({R}_{i})\}^{2}\). Thus, for the normal approximation, it is simplest to use the form

$$Z = \sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0})a({R}_{i})/{\left [\sum\limits _{i=1}^{n}\{a({R}_{i})\}^{2}\right ]}^{1/2},$$
(12.42)

which automatically adjusts for ties (see Section 12.8.6, p. 497, for a discussion of ties).

The most well-known score functions are \(a(i) = i\) for the Wilcoxon, the quantile normal scores \(a(i) = {\Phi }^{-1}(1/2 + i/[2(n + 1)])\), and the sign test scores \(a(i) = 1\). These are asymptotically optimal for shifts in the center of symmetry \({\theta }_{0}\) of the logistic distribution, the normal distribution, and the Laplace distribution, respectively. For asymptotic analysis we assume \(a(i) = {\phi }^{+}(i/(n + 1))\), where \({\phi }^{+}(u)\) is nonnegative and nondecreasing and \({\int\nolimits }_{0}^{1}{[{\phi }^{+}(u)]}^{2}\,du < \infty\). The asymptotically optimal general form for data with density \(f(x - {\theta }_{0})\) and \(f(x) = f(-x)\) is

$${\phi }^{+}(u) = -\frac{f^{\prime}\left \{{F}^{-1}\left (\frac{1} {2} + \frac{u} {2}\right )\right \}} {f\left \{{F}^{-1}\left (\frac{1} {2} + \frac{u} {2}\right )\right \}}.$$

Asymptotic normality is similar to Theorem 12.5 (p. 492) (see for example, Theorem 10.2.5, p. 333 of Randles and Wolfe, 1979). The Edgeworth expansion forW  +  andT  +  may be found on p. 37 and p. 89, respectively, of Hettmansperger [1984].

8.4 Sign Test

The sign test mentioned in the last section as (12.41) witha(i) = 1 is usually given in the form\({T}^{+} = \sum\limits _{i=1}^{n}I({D}_{i0} > 0)\), the number of positive differences. Under the null hypothesis (12.35, p. 491),T  +  has a binomial(n, 1 ∕ 2) distribution and is extremely easy to use. Because of this simple distribution,T  +  is often given early in a nonparametric course to illustrate exact null distributions.

The sign test does not require symmetry of the distributions to be valid. It can be used as a test of H 0 : median of \({D}_{i} - {\theta }_{0} = 0\), where it is assumed only that D 1, …, D n are independent, each with the same median. Thus, the test is often used in skewed distributions to test that the median has value θ 0. This generality, though, comes with a price because typically the sign test is not as powerful as the signed rank or t test in situations where all three are valid. If there are zeroes in D 1, …, D n, the standard approach is to remove them before applying the sign test.

8.5 Pitman ARE for the One-Sample Symmetry Problem

In the Appendix, we give some details for finding expressions for the efficacy and Pitman efficiency of tests for the one-sample symmetry problem. Here we just report some Pitman AREs in Table 12.4 for the sign test, the t test, and the Wilcoxon signed rank. The comparisons of the signed rank and the t are very similar to those given in Table 12.3 (p. 477) for the two-sample problem. The only difference is that skewed distributions are allowed in the shift problem but not here.

Table 12.4 Pitman ARE’s for the One-Sample Symmetry Problem

The general message from Table 12.4 is that the tails of the distribution must be very heavy compared to the normal distribution in order for the sign test to be preferred. This is a little unfair to the sign test because symmetry off is not required for the sign test to be valid, whereas symmetry is required for the Wilcoxon signed rank test. In fact Hettmansperger [1984, p. 10–12] shows that the sign test is uniformly most powerful among size-α tests if no shape assumptions are made about the density off. Moreover, in the matched pairs situation where symmetry is justified by differencing, the uniform distribution is not possible, and that is where the sign test performs so poorly.

Monte Carlo power estimates in Randles and Wolfe [1979, p. 116] show that generally the ARE results in Table 12.4 correspond qualitatively to power comparisons. For example, at n = 10 and normal alternative \({\theta }_{0} +.4\sigma \), the Wilcoxon signed rank has power .330 compared to .263 for the sign test. The ratio \(.263/.330 =.80\) is not too far from ARE = .64. The estimated power ratio at n = 20 is \(.417/.546 =.76\). The Laplace distribution AREs in Table 12.4 are not as consistent. For example, at n = 20 for a similar alternative, the ratio is \(.644/.571 = 1.13\), not all that close to ARE = 2.00.

The Wilcoxon signed rank test is seen to have good power relative to the sign test and to the t test. The Hodges and Lehmann [1956] result that ARE(W +, t) ≥ .864 also holds here for all symmetric unimodal densities. Coupled with the fact that there is little loss of power relative to the t test at the normal distribution (ARE(\({W}^{+},t) = 0.955\)), W + should be the statistic of choice in many situations.

8.6 Treatment of Ties

The general permutation approach is not usually bothered by ties in the data, although rank methods typically require some thought about how to handle the definition of ranks in the case of ties. For the original situation of n pairs of data and a well-defined statistic like the paired t statistic, the \({2}^{n}\) permutations of the data merely yield redundancy if members of a pair are equal. For example, consider n = 3 and the following data with all 8 permutations (1 is the original data pairing):

Permutation    1     2     3     4     5     6     7     8
Pair 1        3,5   5,3   3,5   5,3   3,5   5,3   3,5   5,3
Pair 2        2,2   2,2   2,2   2,2   2,2   2,2   2,2   2,2
Pair 3        7,4   7,4   4,7   4,7   7,4   7,4   4,7   4,7

Permutations 1–4 are exactly the same as permutations 5–8 because permuting the 2nd pair has no effect. Thus, a permutation p-value defined from just permutations 1–4 is exactly the same as for using the full set 1–8. After taking differences between members of each pair, the \({2}^{n}\) sign changes work in the same way by using sign(0) = 0; that is, there is the same kind of redundancy in that there are really just \({2}^{n-{n}_{0}}\) unique permutations, where n 0 is the number of zero differences.

For signed rank statistics, there are two kinds of ties to consider after converting to differences: multiple zeros and multiple non-zero values. For the non-zero multiple values, we just use midranks (average ranks) as before. For the multiple zeros, there are basically two recommended approaches:

Method 1: Remove the differences that are zero and proceed with the reduced sample in the usual fashion. This is the simplest approach and the most powerful for the sign statistic (see Lehmann 1975, p. 144). Pratt and Gibbons [1981, p. 169] discuss anomalies when using this procedure with W +.

Method 2: First rank all \(\vert {D}_{10}\vert,\ldots,\vert {D}_{n0}\vert \). Then remove the ranks associated with the zero values before getting the permutation distribution of the rank statistic, but do not change the ranks associated with the non-zero values. However, as above, since the permutation distribution is the same with and without the redundancy, it really just makes the computing easier to remove the ranks associated with the zero values. The normal approximation in (12.42, p. 495) automatically eliminates the ranks associated with the zero values because sign(0) = 0. For the Box-Andersen approximation, the degrees of freedom are different depending on whether the reduced set is used or not. It appears best to use the reduced set for the Box-Andersen approximation, although a few zero values make little difference.

Example 12.2 (Fault rates of telephone lines). 

Welch [1987] gives the difference (times \(1{0}^{5}\)) of a transformation of telephone line fault rates for 14 matched areas. We modify the data by dividing by 10 and rounding to 2 digits, leading to

D i0             − 99    31    27    23    20     20     19    − 14    11    9    8      − 8     6     0
sign(D i0)R i    − 14    13    12    11    9.5    9.5    8     − 7     6     5    3.5    − 3.5   2     0

Notice that there are two ties in the absolute values, 20 and 8, for which the midranks are given. The exact right-tailed permutation p-value based on the t statistic is .38, whereas the t tables give .33 and the Box-Andersen approximation is .40. The large outlier − 99 essentially kills the power of the t statistic. The sign test first removes the 0 value, and then the binomial probability of getting 10 or more positives out of 13 is .046. Welch [1987] used the sample median as a statistic, and for these data we get exact p-value .062. Note that the mean and sum and t statistic are all permutationally equivalent, but the median is not permutationally equivalent to using a Wald statistic based on the median. So, the properties of using the median as a test statistic are not totally clear.
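The binomial calculation for the sign test can be checked directly from the binomial cdf; a one-line R sketch:

pbinom(9, 13, 0.5, lower.tail = FALSE)  # P(binomial(13, 1/2) >= 10) = .046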

For the Wilcoxon Signed Rank, no tables can be used because of the ties and the 0. However, it is straightforward to get the permutation distribution after choosing one of the methods above for dealing with the 0 difference.

Method 1: First remove the 0, then rank. The remaining data are

D i0             − 99    31    27    23    20     20     19    − 14    11    9    8      − 8     6
sign(D i0)R i    − 13    12    11    10    8.5    8.5    7     − 6     5     4    2.5    − 2.5   1

The exact p-value based on the sign(D i0)R i values above (for example, just insert the signed ranks into the R program below) is .048, the normal approximation is .047, and the Box-Andersen approximation is .049.

Method 2: Rank the data first, then throw away the signed rank associated with the 0. The exact p-value is .044. Recall, for the permutation p-value, it does not matter whether we drop the 0 or not after ranking. Similarly, the normal approximation p-value .042 based on (12.42, p. 495) automatically handles the 0 value. For the Box-Andersen approximation, we get .0437 based on all 14 signed ranks and .0441 after throwing out the 0; so it matters very little whether we include the 0 or not.

For problems with n ≤ 20, the following R code, modified from Venables and Ripley [1997, p. 189-190], gives the exact permutation p-value for signed statistics:

perm.sign <- function(d, stat, pr = FALSE, ...) {
  # Exact permutation p-value for the one-sample problem.
  # Assumes the test rejects for large values of stat.
  # Looks at all 2^n sign-change samples; use only for small n.
  # Helper giving the binary representation of x
  bi <- function(x, digits = if (x > 0) 1 + floor(log(x, base = 2)) else 1) {
    ans <- 0:(digits - 1)
    (x %/% 2^ans) %% 2  # note %/% and %% are different
  }
  # The main program
  t0 <- stat(d, ...)
  digits <- length(d)
  b <- 2^digits
  res <- numeric(b)
  for (i in 1:b) {
    x <- d * 2 * (bi(i, digits = digits) - 0.5)  # one sign-change sample
    res[i] <- stat(x, ...)
    if (pr) cat(i, x, res[i], fill = TRUE)  # prints
  }
  pvalue <- sum(res >= t0)/b
  co <- sum(res == t0)
  return(data.frame(b = b, stat0 = round(t0, 4),
                    eq.t0 = co, rt.pvalue = pvalue, pv2 = 2 * pvalue))
}
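As a usage sketch (the inputs are our choices, not part of the original listing), inserting the Method 1 signed ranks of Example 12.2 with stat = sum reproduces the exact right-tailed p-value .048 reported above:

sr <- c(-13, 12, 11, 10, 8.5, 8.5, 7, -6, 5, 4, 2.5, -2.5, 1)
perm.sign(sr, stat = sum)  # rt.pvalue = .048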

9 Randomized Complete Block Data—the Two-Way Design

Blocking is one of the most important techniques for reducing variation in experimental designs. The usual Randomized Complete Block design may be viewed as a generalization of matched pairs to situations with more than two treatments. To use the permutation argument with blocked data, we do not need the treatments to be assigned randomly, but it is most natural to discuss blocked data in that context. The key assumption required under H 0 is that the data are exchangeable within blocks.

Suppose that k treatments are to be assigned at random within each block of size k. For n blocks, there are \({(k!)}^{n}\) possible permutations of the data corresponding to permuting independently among treatments within each block. In the following table there are k = 10 treatments in n = 4 blocks, thus \({M}_{N} = {(10!)}^{4} \approx 1.73 \times 1{0}^{26}\) possible permutations. These data are actually treatments 6–15 from an example of aphid infestation of crepe myrtle cultivars given in Table 1 of Brownie and Boos [1994]. The response variable is the number of aphids on the three most heavily infested leaves plus the percent of foliage covered with sooty mold.

The linear model representation is

$${Y }_{ij} = \mu+ {\beta }_{i} + {\alpha }_{j} + {e}_{ij},$$
(12.43)

where α 1, …, α k are the treatment effects, and β 1, …, β n are the block effects. Note that we have switched subscripts on Y ij compared to the one-way model (12.28, p. 480) so that the blocks can be the rows. Often the block effects are assumed random, but the nonparametric literature typically considers them fixed effects.

 

                              Treatments
Block     1     2     3     4     5     6     7     8     9     10
1         0     0     93    78    5     1     0     21    1     1
2         0     24    0     3     2     180   0     0     3     9
3         0     2     10    0     0     3     2     3     3     140
4         0     4     2     2     0     0     1     47    1     52

The usual ANOVAF statistic could be used with these data:

$$F = \frac{ \frac{1} {k - 1}\sum\limits _{j=1}^{k}n{({\overline{Y }}_{.j} -{\overline{Y }}_{..})}^{2}} { \frac{1} {(k - 1)(n - 1)}\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{({Y }_{ ij} -{\overline{Y }}_{i.} -{\overline{Y }}_{.j} +{ \overline{Y }}_{..})}^{2}},$$
(12.44)

where \({\overline{Y }}_{i.} = {k}^{-1} \sum\limits _{j=1}^{k}{Y }_{ij}\), \({\overline{Y }}_{.j} = {n}^{-1} \sum\limits _{i=1}^{n}{Y }_{ij}\), and \({\overline{Y }}_{..} = {n}^{-1} \sum\limits _{i=1}^{n}{\overline{Y }}_{i.}\). For the above data F = 0.80 with p-value = 0.62 from an F distribution with 9 and 27 degrees of freedom. Since the F distribution approximates the permutation distribution, the value 0.62 should be satisfactory. A Monte Carlo approximation to the exact permutation p-value based on 10,000 samples gave .60 with standard error .005, thus confirming the Type I error robustness of the usual F procedure. However, the nonnormality of the response variable is cause for concern because the F statistic is not Type II error robust in the face of outliers. Transformations are an obvious approach, and F on log(Y ij + 1) resulted in p-value = .29. Fortunately, with rank procedures we do not have to guess the correct transformation.
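As a check on these numbers, the following minimal R sketch carries out the Monte Carlo permutation test by permuting within blocks; the matrix Y holds the data above, and Fstat (our name) implements (12.44):

Y <- rbind(c(0, 0, 93, 78, 5, 1, 0, 21, 1, 1),
           c(0, 24, 0, 3, 2, 180, 0, 0, 3, 9),
           c(0, 2, 10, 0, 0, 3, 2, 3, 3, 140),
           c(0, 4, 2, 2, 0, 0, 1, 47, 1, 52))
Fstat <- function(Y) {  # two-way ANOVA F of (12.44)
  n <- nrow(Y); k <- ncol(Y)
  num <- n * sum((colMeans(Y) - mean(Y))^2)/(k - 1)
  resid <- Y - outer(rowMeans(Y), rep(1, k)) -
    outer(rep(1, n), colMeans(Y)) + mean(Y)
  num/(sum(resid^2)/((k - 1) * (n - 1)))
}
F0 <- Fstat(Y)  # 0.80 for these data
set.seed(1)
perm <- replicate(10000, Fstat(t(apply(Y, 1, sample))))  # permute within blocks
mean(perm >= F0)  # approximately .60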

9.1 Friedman’s Rank Test

The standard rank procedure was introduced by Friedman [1937]. For the untied case, it has the form

$$T = \frac{12n} {k(k + 1)}\sum\limits _{j=1}^{k}{\left ({\overline{R}}_{.j} -\frac{k + 1} {2} \right )}^{2},$$
(12.45)

where R ij is the rank of Y ij within the ith row, and \({\overline{R}}_{.j} = {n}^{-1} \sum\limits _{i=1}^{n}{R}_{ij}\) is the jth treatment mean rank. Note that \((k + 1)/2\) is \({\overline{R}}_{..}\) since the average of the integers 1 to k is \((k + 1)/2\). The within-row ranks R ij for the above table are

                              Treatments
Block     1      2      3      4      5      6      7      8      9      10
1         2      2      10     9      7      5      2      8      5      5
2         2.5    9      2.5    6.5    5      10     2.5    2.5    6.5    8
3         2      4.5    9      2      2      7      4.5    7      7      10
4         2      8      6.5    6.5    2      2      4.5    9      4.5    10

We see immediately that there are numerous ties in the data. The form of the Friedman statistic that accommodates ties is (see, for example, Conover and Iman, 1981, p. 126)

$$T = \frac{(k - 1){n}^{2} \sum\limits _{j=1}^{k}{\left ({\overline{R}}_{.j} -\frac{k + 1} {2} \right )}^{2}} {\left (\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{R}_{ ij}^{2}\right ) -\frac{nk{(k + 1)}^{2}} {4} }.$$
(12.46)

Under the null hypothesis of identical treatments, T converges to a \({\chi }_{k-1}^{2}\) distribution as n → \(\infty \) and k remains fixed. For the above data, T = 13.7732, and comparing to a \({\chi }_{9}^{2}\) distribution gives p-value = .13. However, as in the one-way design, the \({\chi }^{2}\) approximation becomes increasingly conservative as the number of treatments gets large relative to the number of blocks. F distribution p-values provide much better approximations and can be justified by either asymptotic theory or the Box-Andersen permutation moment approximations.
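The tied-rank form (12.46) is easily computed; a short R sketch (reusing the matrix Y from the sketch in Section 9; the other names are ours):

R <- t(apply(Y, 1, rank))  # within-block midranks
n <- nrow(R); k <- ncol(R)
Tstat <- (k - 1) * n^2 * sum((colMeans(R) - (k + 1)/2)^2)/
  (sum(R^2) - n * k * (k + 1)^2/4)             # 13.77
pchisq(Tstat, df = k - 1, lower.tail = FALSE)  # p-value approximately .13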

9.2 F Approximations

Friedman [1937, pp. 694–695] conjectured that the Friedman statistic is asymptotically normal as k → \(\infty \) with mean k − 1 and variance 2(n − 1)(k − 1) ∕ n (a proof may be found in Lemma 4 of Brownie and Boos, 1994). Similar to the one-way design, this asymptotic normal result is consistent with applying the F statistic (12.44, p. 500) to the within-row Friedman ranks and then using the F(k − 1, (k − 1)(n − 1)) distribution for p-values. This argument is fleshed out in Problem 12.22 (p. 528). Of course, the F distribution should be used in practice; the asymptotic normal result just supports use of the F distribution.

From Box and Andersen [1955, p. 14-15], we may approximate the permutation distribution of F of (12.44, p. 500), or of the same F applied to the within-row Friedman ranks, by an F(d(k − 1), d(k − 1)(n − 1)) distribution, where

$$d = 1 + \frac{(nk - n + 2){V }_{2} - 2n} {n(k - 1)(n - {V }_{2})},$$
$${V }_{2} = \frac{1} {n - 1}\sum\limits _{i=1}^{n}{({s}_{ i}^{2} -{\overline{s}}^{2})}^{2}/{({\overline{s}}^{2})}^{2},$$

and the s i 2 are the within-row variances, and \({\overline{s}}^{2} = {n}^{-1} \sum\limits _{i=1}^{n}{s}_{i}^{2}\). In the case of the Friedman ranks with no ties in the data, d = 1 − 2 ∕ {n(k − 1)}. For the Crepe Myrtle data this latter expression is d = .944, the same (to three decimals) as the actual d value from the tied ranks.
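A brief R sketch of this adjustment (reusing R and Fstat from the earlier sketches; s2, V2, and d follow the formulas above):

n <- nrow(R); k <- ncol(R)
s2 <- apply(R, 1, var)  # within-row variances s_i^2
V2 <- sum((s2 - mean(s2))^2)/((n - 1) * mean(s2)^2)
d <- 1 + ((n * k - n + 2) * V2 - 2 * n)/(n * (k - 1) * (n - V2))  # .944
FR <- Fstat(R)  # F applied to the Friedman ranks
pf(FR, d * (k - 1), d * (k - 1) * (n - 1), lower.tail = FALSE)  # about .11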

We summarize the various approximations in the following table:

                 Approximate P-Values for the Crepe Myrtle Data

                     Monte Carlo    F(9, 27)    Box-And. F(9d, 27d)    χ 9 2
Friedman                .10            -                -               .13
F R                     .10           .10              .11               -
F on Y                  .60           .62              .63               -
F on log(Y + 1)         .29           .29              .30               -

The Monte Carlo estimates are based on 10,000 random permutations and have standard error bounded by .005. The F approximations are good, but the Box-Andersen adjustments do not help here. Interestingly, d = 1.08 for the usual F (row 3), but the p-value is adjusted upwards because F = .80 is so small. Typically, a d value greater than 1 lowers the p-value from the F approximation.

 

9.3 Pitman ARE for Blocked Data

From van Elteren and Noether [1959] we find the surprising result that the Pitman asymptotic relative efficiency of the Friedman test to the ANOVAF depends on the number of treatmentsk,

$$\mbox{ ARE(Friedman},F) = \left \{ \frac{k} {k + 1}\right \}12{\sigma }^{2}{\left \{{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)\,dx\right \}}^{2},$$
(12.47)

where σ2 is the variance of the observations. Expression (12.47) is just\(k/(k + 1)\) times the ARE(W, t) in (12.25, p. 477). Table 12.5 gives a few values of (12.47) for several distributions.

Table 12.5 Pitman ARE of the Friedman Test to theF Test

The value .64 at k = 2 for the normal distribution is the same as the ARE of the sign test to the t in Table 12.4 (p. 496). That is no accident. It turns out that for k = 2, the Friedman test is equivalent to the sign test. (The other values in Table 12.4, p. 496, do not correspond to the k = 2 values in Table 12.5 because Table 12.4 refers to the distribution after taking differences, whereas Table 12.5 is for the distribution of the individual treatment results, not the difference of treatment results. For the normal distribution, the difference of normal random variables is also normally distributed; so for the normal the results are the same in both tables.)

The reason for the low efficiency in Table 12.5 is that ranking within rows (intrablock ranking) takes no advantage of between block (interblock) information. For thek = 2 case, the Wilcoxon signed rank statistic uses interblock information by ranking the absolute differences (note the improved efficiencies in Table 12.4, p. 496, for the signed rank test compared to the sign test). In the next section we discuss some rank approaches that use interblock information.

9.4 Aligned Ranks and the Rank Transform

Many approaches have been used to remedy the low efficiency in Table 12.5 for small values ofk. Perhaps the earliest approach (and still one of the best) is the aligned rank method due to Hodges and Lehmann [1962]. The aligned rank approach is to first subtract the block mean (or any other location measure such as the median) from each observationY ij , then rank all the resultingnk residuals together. These latter ranks on the residuals, denoted\(\widehat{{R}}_{ij}\), are calledaligned ranks. We suggest usingF of (12.44, p. 500) on these aligned ranks.

Actually, Sen [1968] and Lehmann [1975, p. 272] use

$$\widehat{Q} = \frac{{n}^{2}(k - 1)\sum\limits _{j=1}^{k}{\left ({\overline{\widehat{R}}}_{.j} -\frac{nk + 1} {2} \right )}^{2}} {\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{\left (\widehat{{R}}_{ ij} -{\overline{\widehat{R}}}_{i.}\right )}^{2}},$$
(12.48)

a statistic that is asymptotically χ k − 1 2 underH 0. The justification for the form (12.48) comes from noting that the permutation mean of\({\overline{\widehat{R}}}_{.j}\) is\((nk + 1)/2\), and the permutation covariance matrix of\(({\overline{\widehat{R}}}_{.1},\ldots,{\overline{\widehat{R}}}_{.k})\) is

$$\frac{{\sigma }^{2}k} {k - 1}\mbox{ diag}\left ({I}_{k} -\frac{{\mathbf{1}}_{k}{\mathbf{1}}_{k}^{T}} {k} \right ),$$
(12.49)

where\({I}_{k}\) is thek-dimensional identity matrix,1 k is a vector of ones, and

$${\sigma }^{2} = \frac{1} {{n}^{2}k}\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{(\widehat{{R}}_{ ij} -{\overline{\widehat{R}}}_{i.})}^{2}$$
(12.50)

is the permutation variance of\({\overline{\widehat{R}}}_{.j}\).\(\widehat{Q}\) in (12.48) is the appropriate quadratic form in\(({\overline{\widehat{R}}}_{.1},\ldots,{\overline{\widehat{R}}}_{.k})\) upon noting that\((k - 1){I}_{k}/(k{\sigma }^{2})\) is a generalized inverse of the covariance matrix (12.49).

Other authors (Fawcett and Salter, 1984, and O’Gorman, 2001) use a one-way ANOVAF on the aligned ranks, but we prefer the two-wayF of (12.44, p. 500) because the Box-Andersen adjustment is readily available. All three statistics,\(\widehat{Q}\) and the twoF statistics on the aligned ranks, are permutationally equivalent to the numerator of\(\widehat{Q}\); so if exact or Monte Carlo approximations are used, it does not matter which of the three statistics is chosen. Clearly, either of the twoFs gives better approximatep-values than\(\widehat{Q}\) with χ k − 1 2 p-values.

Mehra and Sarangi [1967] give somewhat complicated formulas for the Pitman ARE of the aligned rank approach to the usualF and to Friedman’s statistic, but the bottom line is that the AREs of the aligned rank procedure to the usualF are close to the last column of Table 12.5 (p. 503). Thus, the aligned rank approach is able to recover most of the interblock information.

Another approach to recovering the interblock information is to just rank all the observations together and apply F of (12.44, p. 500) to the resulting ranks. This rank transform approach, due to Conover and Iman [1981], works well as long as the block effects are not strong. When the block effects are strong, this approach is similar to Friedman's test. Hora and Iman [1988] give Pitman ARE results for this approach.

There is an extensive literature on rank methods in block models. Mahfoud and Randles [2005] and Kepner and Wackerly [1996] are several places that briefly review many of the approaches. The latter also gives extensions to incomplete blocks.

9.5 Replications within Blocks

In the preceding discussion we have been talking about cases where there is just one observation per cell, nk total observations for n blocks and k treatments, and no block by treatment interaction. Consider the k = 2 case and n blocks where there are m i Xs for the first treatment in block i and n i Ys for the second treatment, i = 1, …, n. Data of this type arise naturally in clinical trials at n centers or sites. The sites might be hospitals or clinics or individual doctors. The usual rank approach is the van Elteren statistic (van Elteren, 1960, or Lehmann 1975, p. 145), a weighted sum of individual Wilcoxon rank sum statistics W i within each block,

$${W}_{\mathrm{VE}} = \sum\limits _{i=1}^{n} \frac{{W}_{i}} {{m}_{i} + {n}_{i} + 1}.$$

van Elteren [1960] showed that the weights\(1/({m}_{i} + {n}_{i} + 1)\) are asymptotically optimal among all linear combinations of theW i . This optimality makes sense if we write the standardized version ofW VE as

$$\frac{\sum\limits _{i=1}^{n} \frac{1} {{\sigma }_{0}^{2}(\widehat{{\theta }}_{i})}\left (\widehat{{\theta }}_{i} -\frac{1} {2}\right )} {{\left \{\sum\limits _{i=1}^{n} \frac{1} {{\sigma }_{0}^{2}(\widehat{{\theta }}_{i})}\right \}}^{1/2}},$$
(12.51)

where\(\widehat{{\theta }}_{i}\) is the Mann-Whitney estimator of\({\theta }_{i} = P({Y }_{i1} > {X}_{i1}) + (1/2)P({Y }_{i1} = {X}_{i1})\) given in (12.14, p. 463) (here we have dropped theXY subscript for simplicity), and\({\sigma }_{0}^{2}(\widehat{{\theta }}_{i})\) is the variance of\(\widehat{{\theta }}_{i}\) under the null hypothesis of identicalX andY populations. In the completely nonparametric case (in the absence of the shift model), θ i is the underlying parameter of interest for Wilcoxon statistics. For continuous data (no ties),\({\sigma }_{0}^{2}(\widehat{{\theta }}_{i}) = ({m}_{i} + {n}_{i} + 1)/(12{m}_{i}{n}_{i})\). Thus, the numerator of the standardized version ofW VE is a weighted average of\(\widehat{{\theta }}_{i} - 1/2\), where the weights are inversely proportional to null variances.

The analogoust procedure is based on standardizing

$$\sum\limits _{i=1}^{n} \frac{{m}_{i}{n}_{i}} {{m}_{i} + {n}_{i}}({\overline{Y }}_{i} -{\overline{X}}_{i}).$$
(12.52)

Thus, thet procedure uses a weighted linear combination of the difference of sample means, where the weights are inversely proportional to\(\mathrm{Var}\left ({\overline{Y }}_{i} -{\overline{X}}_{i}\right ) = {\sigma }^{2}(1/{m}_{i} + 1/{n}_{i})\).

The standard permutation approach is to consider all possible

$${M}_{N} ={ \prod\nolimits }_{i=1}^{n}\left ({ {m}_{i} + {n}_{i} \atop {n}_{i}} \right )$$

independent permutations within sites. The normal approximation for W VE should be very good if ∑ i = 1 n m i and ∑ i = 1 n n i are reasonably large, and therefore it is widely used in practice. In the case that ∑ i = 1 n m i and ∑ i = 1 n n i converge to \(\infty \), Hodges and Lehmann [1962] give the Pitman ARE of (12.51) to (12.52) for normal data as

$$.955\sum\limits _{i=1}^{n} \frac{{m}_{i}{n}_{i}} {{m}_{i} + {n}_{i} + 1}\left /\sum\limits _{i=1}^{n} \frac{{m}_{i}{n}_{i}} {{m}_{i} + {n}_{i}}\right..$$

Thus, if m i + n i is reasonably large, then the ARE is close to the best value .955. For example, if \({m}_{i} + {n}_{i} = 10\) for each site, then the ARE is .955(10/11) = .868.

For the case that there are small numbers of replications per block (site), we are led back to the procedures of the previous section, aligned ranks and possibly the rank transform. With replications within blocks, however, we now have the ability to test for block by treatment interactions. Unfortunately, standard permutation procedures are not available for testing the no interaction hypothesis in the face of main effects. A large literature exists evaluating and criticizing the rank transform approach for testing interactions. See, for example, Akritas [1990, 1991] and Thompson [1991]. In general, for more complicated fixed effects models with interaction, to achieve robustness via rank methods, we feel it is better to use the general R-estimation linear model approach mentioned at the end of Section 12.7 (p. 487).

Boos and Brownie [1992] argue that a mixed model approach is usually more appropriate, allowing inferences to be made to a larger population, but the mixed model leads away from van Elteren's statistic (12.51, p. 505) and permutation inference.

10 Contingency Tables

10.1 2 × 2 Table – Fisher's Exact Test

The first use of the permutation method was given by Fisher [1934a, Statistical Methods for Research Workers, fifth edition] in an analysis of 2 × 2 tables. Fisher's example was of 13 identical twins and 17 fraternal twins (of the same sex) who had at least one of the pair convicted of a crime. Of the 13 identical twins only 3 had a twin free of conviction. Of the 17 fraternal twins 15 had a twin free of conviction. Thus the table is as follows,

             Both          One
             Convicted     Convicted     Total
Identical       10             3           13
Fraternal        2            15           17
Total           12            18           30

To fix notation, a general 2 × 2 table is

             Category 1    Category 2    Total
Group 1        N 11          N 12         N 1.
Group 2        N 21          N 22         N 2.
Total          N .1          N .2         N

 

A standard analysis of these data assumes thatN 11 is binomial (N 1. , p 1) and independent ofN 21 assumed to be binomial (N 2. , p 2). The usual statistic for testingH 0 : p 1 = p 2 is the pooledZ, the square root of the score statistic found in Section 3.2.9 (p. 144),

$$Z = \frac{\widehat{{p}}_{1} -\widehat{ {p}}_{2}} {{\left \{\frac{\widetilde{p}(1 -\widetilde{ p})} {{N}_{1.}} + \frac{\widetilde{p}(1 -\widetilde{ p})} {{N}_{2.}} \right \}}^{1/2}},$$

where\(\widehat{{p}}_{1} = {N}_{11}/{N}_{1.}\),\(\widehat{{p}}_{2} = {N}_{21}/{N}_{2.}\), and\(\widetilde{p} = {N}_{.1}/N\). To testH a : p 1 > p 2, the standard approach would be to compareZ toz α, the 1 − α quantile of the standard normal.

Instead of this approximate procedure, Fisher noted that, conditional on the margins N .1 and N .2 being held fixed in addition to N 1. and N 2., a given table (n 11, n 12, n 21, n 22) has hypergeometric probability given by

$$\frac{\left ({ {N}_{1.} \atop {n}_{11}} \right )\left ({ {N}_{2.} \atop {n}_{21}} \right )} {\left ({ N \atop {N}_{.1}} \right )} = \frac{{N}_{1.}!{N}_{2.}!{N}_{.1}!{N}_{.2}!} {N!{n}_{11}!{n}_{12}!{n}_{21}!{n}_{22}!}.$$

This hypergeometric probability is easily obtained if one thinks about an urn withN . 1 balls of type 1 andN . 2 of type 2. If we draw outN 1.  balls without replacement, then the above probability is the probability of gettingn 11 of type 1 andn 21 of type 2.

One can also think of the above table arising in the two-sample problem where the data consist of just 1's and 0's. Although there are \(\left ({ N \atop {N}_{1.}} \right )\) permutations of interest, many of them yield the same table. The numerator of the above hypergeometric probability just gives the number of permutations which lead to a given table.

Now a variety of statistics can be used to order the possible tables from supportingH 0 to strongly rejectingH 0 and to calculate ap-value. Or one can just use intuition for the ordering: most people would agree that for testingH a : p 1 > p 2, the table below is more extreme than the original.

 

             Category 1    Category 2    Total
Group 1      N 11 + 1      N 12 − 1      N 1.
Group 2      N 21 − 1      N 22 + 1      N 2.
Total        N .1          N .2          N

 

Thus, a one-tailedp-value would be obtained by summing up the hypergeometric probabilities of those tables as extreme or more extreme than the original table (N 11, N 12, N 21, N 22). A number of seemingly different ways of ordering the tables lead to the same definition of “more extreme” and are called Fisher’s Exact Test. The simplest way to order is either the intuitive notion above or to order via the pooledZ statistic.

For the twins data, Fisher noted that the two more extreme tables haveN 11 = 11,N 12 = 2,N 21 = 1,N 22 = 16 andN 11 = 12,N 12 = 1,N 21 = 0,N 22 = 17. Thus thep-value is the probability of the original table plus the probability of these two more extreme tables:

$$\frac{13!17!12!18!} {30!} \left \{ \frac{1} {10!3!2!15!} + \frac{1} {11!2!1!16!} + \frac{1} {12!1!0!17!}\right \} = \frac{619} {1330665} =.000465.$$
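This sum can be checked in R either from the hypergeometric distribution or with fisher.test; a small sketch:

sum(dhyper(10:12, m = 12, n = 18, k = 13))  # P(N11 >= 10) = .000465
fisher.test(matrix(c(10, 3, 2, 15), 2, 2, byrow = TRUE),
            alternative = "greater")$p.value  # same value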

The definition of a two-sided p-value is not so clear, but the usual practice is to add in the probabilities of tables as extreme or more extreme in the other direction (having probabilities less than or equal to the probability of the observed table). In the above example we would need to add the probabilities of tables with N 11 = 0, N 12 = 13, N 21 = 12, N 22 = 5 and N 11 = 1, N 12 = 12, N 21 = 11, N 22 = 6 but not N 11 = 2, N 12 = 11, N 21 = 10, N 22 = 7 since it has higher probability than the original table.

When accompanied by a randomization rule to yield exact α levels, Fisher’s Exact Test is uniformly most powerful unbiased as discussed in Lehmann [1986, Ch. 4]. But many people have noted how conservative it is whenp-values are used with the rule: rejectH 0 whenp-value ≤ α. In this case the discreteness of the permutation distribution does prove costly in terms of power.

Barnard [1945, 1947], Boschloo [1970], and Suissa and Shuster [1985] proposed unconditional tests in the 2 × 2 table that are typically more powerful than the Fisher Exact Test without randomization. See Berger [1996] for details and power comparisons.

We have given Fisher's Exact Test in the context of two independent binomials and H 0 : p 1 = p 2. It also applies in the context of multinomial data where the data consist of a pair of binary variables (X, Y ) with values x 1 and x 2 and y 1 and y 2, respectively:

 

                     Y
             y 1     y 2     Total
X    x 1     N 11    N 12    N 1.
     x 2     N 21    N 22    N 2.
Total        N .1    N .2    N

 

The entries (N 11, N 12, N 21, N 22) are multinomial(N; p 11, p 12, p 21, p 22) with associated parameters

 

                     Y
             y 1     y 2     Total
X    x 1     p 11    p 12    p 1.
     x 2     p 21    p 22    p 2.
Total        p .1    p .2    1

 

In this paired variable context, the null hypothesis for Fisher’s Exact Test is independence ofX andY,

$${H}_{0} : {p}_{ij} = {p}_{i.}{p}_{.j},\quad i = 1,2;j = 1,2.$$
(12.53)

Of course, if \({p}_{11} = {p}_{1.}{p}_{.1}\), then all the other equalities such as \({p}_{12} = {p}_{1.}{p}_{.2}\) hold as well.

10.2 Paired Binary Data – McNemar’s Test

In the context of paired binary data introduced in the last section, we might expect association betweenX andY, but our main interest could be in their marginal probabilities. In particular, the null hypothesis is often

$${H}_{0} : {p}_{1.} = {p}_{.1}.$$
(12.54)

A typical application is in matched pair studies such as the following well-known case-control data from Miller [1980],

 

                              Sibling (Control)
                         Tons.    No Tons.    Total
Hodgkin's    Tons.        26        15         41
Patient      No Tons.      7        37         44
Total                     33        52         85

 

where Hodgkin’s patients were paired with a sibling and it was determined whether they each had a tonsillectomy or not. If the marginal estimates\(\widehat{{p}}_{1.} = {N}_{1.}/N = 41/85\) and\(\widehat{{p}}_{.1} = {N}_{.1}/N = 33/85\) differ significantly, then incidence of tonsillectomies may be associated with contracting Hodgkin’s disease. Noting that\(\widehat{{p}}_{1.} -\widehat{ {p}}_{.1} = {N}_{12}/N - {N}_{21}/N\) has multinomial variance\(\{{p}_{12} + {p}_{21} - {({p}_{12} - {p}_{21})}^{2}\}/N = ({p}_{12} + {p}_{21})/N\) underH 0, the score statistic is

$$Z = \frac{{N}_{12} - {N}_{21}} {{\left ({N}_{12} + {N}_{21}\right )}^{1/2}}.$$

Exact inference follows by noting that under (12.54, p. 509), N 12 | N 12 + N 21 has a binomial \(({N}_{12} + {N}_{21},1/2)\) distribution. Thus, Z = 1.71 has approximate normal one-sided p-value = .044, but \(P(\mbox{ binomial}(22,1/2) \geq 15) =.067\). These procedures are generally referred to as McNemar's test.
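A quick R check of both p-values (variable names ours):

N12 <- 15; N21 <- 7
Z <- (N12 - N21)/sqrt(N12 + N21)  # 1.71
pnorm(Z, lower.tail = FALSE)      # .044
pbinom(N12 - 1, N12 + N21, 0.5, lower.tail = FALSE)  # P(binomial(22, 1/2) >= 15) = .067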

What do these tests have to do with permutation and rank statistics? LetX = 1 denote that a Hodgkin’s patient had a tonsillectomy, andX = 0 denote that he/she did not, and similarlyY = 1 andY = 0 for the sibling control. Then the paired data and their differences are

Pair    Hodgkin's Patient    Sibling (Control)    Diff.
1              1                    1               0
.              .                    .               .
.              .                    .               .
26             1                    1               0
27             1                    0               1
.              .                    .               .
.              .                    .               .
41             1                    0               1
42             0                    1              − 1
.              .                    .               .
.              .                    .               .
48             0                    1              − 1
49             0                    0               0
.              .                    .               .
.              .                    .               .
85             0                    0               0

Note that there are N 12 = 15 positive differences out of \({N}_{12} + {N}_{21} = 22\) nonzero differences. Thus, the exact binomial procedure above is just the sign test for the differences, and Z is exactly (12.42, p. 495) for a(i) = 1. In fact, since all the nonzero absolute differences are identically 1, the exact signed rank test (assuming zeroes are deleted) yields the same binomial procedure, and Z is also (12.42, p. 495) with a(i) = i.

10.3 I by J Tables

We now consider the general I by J contingency table

                          Y
              y 1    ...    y J     Total
       x 1    N 11   ...    N 1J    N 1.
        .      .             .       .
X       .      .             .       .
        .      .             .       .
       x I    N I1   ...    N IJ    N I.
Total         N .1   ...    N .J    N

 

The distribution of these data could be a full multinomial withIJ cells orI independent rows of multinomial data. In either case, exact permutation analysis is achieved by conditioning on the marginal totals resulting in a multiple hypergeometric for the joint distribution of the entriesN ij having probability\(P({N}_{ij} = {n}_{ij},i = 1,\ldots,I;j = 1,\ldots,J\mid {N}_{1.},\ldots,{N}_{I.},{N}_{.1},\ldots,{N}_{.J})\) given by

$$\frac{\left ({\prod\nolimits }_{i=1}^{I}{N}_{ i.}!\right )\left ({\prod\nolimits }_{j=1}^{J}{N}_{.j}!\right )} {N!{\prod\nolimits }_{i=1}^{I}{ \prod\nolimits }_{j=1}^{J}{n}_{ ij}!}.$$

The question remains as to what statistic should be used. If both X and Y have nominal categories, then the chi-squared goodness-of-fit statistic is natural, but not very interesting. If X and Y have numerical scores or are at least ordered, then some type of association or correlation statistic should be used. For example, one might use Pearson's r or Spearman's rank correlation. If X has nominal categories and Y has numerical categories, then ANOVA-type comparisons among the row means make sense. If X has nominal categories and Y has ordered categories, then the Kruskal-Wallis test might be a good choice of statistic. Moreover, all these situations can be generalized to multi-way tables, say I by J by K tables, usually viewed as stratified comparisons of X and Y.

All these options for statistics in two-way and multiway tables come under the general purview of Generalized Cochran-Mantel-Haenszel statistics. Expositions of these statistics may be found in Landis et al. [1978] and Agresti [2002, Section 7.5.3], and implementation is found in SAS PROC FREQ.

11 Confidence Intervals and R-Estimators

Confidence intervals can be obtained from permutation and rank test statistics in the same way as for other types of statistics: choose values of θ appearing in a null hypothesis such that the statistic T(θ), viewed as a function of θ, does not reject the null hypothesis (see 3.19, p. 146). We often refer to this approach as "inverting a test statistic." For example, in the one-sample problem with data D 1, …, D n assumed to be symmetrically distributed about θ 0, a two-sided permutation t test could just as well be based on \(T({\theta }_{0}) = \vert \sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})\vert \). The permutation distribution depends on the \({2}^{n}\) sign change configurations of \({D}_{1} - {\theta }_{0},\ldots,{D}_{n} - {\theta }_{0}\); we reject if T(θ 0) falls in the upper α fraction of the \({2}^{n}\) values of T(θ 0) computed over those sign changes. So the 1 − α confidence interval can be found by trial and error, but it would seem to be a pretty laborious task because the permutation distribution changes with each θ 0. A somewhat easier computing method is suggested in Lehmann [1986, p. 263], but in general, the usual t interval is close enough to the permutation interval that it is mostly used in practice.

Inverting the signed rank statistic W + leads to an interval \([{W}_{({k}_{1})},{W}_{({k}_{2})}]\), where \({W}_{(1)} \leq {W}_{(2)} \leq \cdots \leq {W}_{(n(n+1)/2)}\) are the ordered values of the Walsh averages

$${W}_{ij} = \frac{{D}_{i} + {D}_{j}} {2},\qquad 1 \leq i \leq j \leq n.$$
(12.55)

The order number k 2 is such that \(P({W}^{+} \geq {k}_{2}) \leq \alpha /2\), and \({k}_{1} = n(n + 1)/2 + 1 - {k}_{2}\). We have specified a closed interval so that the probability of coverage is at least 1 − α for tied data situations (see Randles and Wolfe, 1979, p. 181-183). For example, at n = 7 with continuous data and α = .05, \(P({W}^{+} \geq 26) = P({W}^{+} \leq 2) =.0234\), and thus the interval [W (3), W (26)] has exact confidence level \(1 -.0468 =.9532\). Often k 1 and k 2 are taken from the normal approximation to the permutation distribution of W +. For example, \({k}_{1} = q + 1\) and \({k}_{2} = n(n + 1)/2 - q\), where q is the closest integer to

$$\frac{n(n + 1)} {4} - {z}_{\alpha /2}{\left \{\frac{1} {4}\sum\limits _{i=1}^{n}{R}_{ i}^{2}\right \}}^{1/2}.$$

In the n = 7 example above, this latter calculation gives 2.4, and thus q = 2, k 1 = 3, and \({k}_{2} = 28 - 2 = 26\) as before. For the sample − 1.11, 2.23, 3.35, 4.67, 5.34, 6.17, 7.44, the interval is [W (3), W (26)] = [1.12, 6.39].
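A small R sketch computing the Walsh averages and this interval for the sample above (names ours):

D <- c(-1.11, 2.23, 3.35, 4.67, 5.34, 6.17, 7.44)
A <- outer(D, D, "+")/2
walsh <- sort(A[upper.tri(A, diag = TRUE)])  # the n(n+1)/2 = 28 Walsh averages
c(walsh[3], walsh[26])                       # [1.12, 6.39]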

Inverting the sign test leads to an interval of order statistics

$$({D}_{(k)},{D}_{(n-k+1)}),\quad 1 \leq k \leq n - k + 1.$$

This interval has exact coverage probability\({C}_{n}(k) = 1 - {(1/2)}^{n-1} \sum\limits _{i=0}^{k-1}\left ({ n \atop i} \right )\) for the population median from any continuous, not necessarily symmetric distribution. To obtain at least the same coverage for any discrete distribution, we need to again change to the closed interval\([{D}_{(k)},{D}_{(n-k+1)}]\). An interesting addendum to these intervals due to Guilbaud [1979] is that the average of two such intervals,

$$\left [\frac{{D}_{(k)} + {D}_{(k+t)}} {2}, \frac{{D}_{(n-k-t+1)} + {D}_{(n-k+1)}} {2} \right ],\;\;k + t \leq n - k - t + 1,$$

has guaranteed coverage \(\{{C}_{n}(k) + {C}_{n}(k + t)\}/2\) for any distribution. This latter interval is useful for small n because it gives more options for the confidence level than given by C n (k) alone. A more practical solution is given by Hettmansperger and Sheather [1986], who interpolate between adjacent order statistics to get an interval with approximately the specified confidence, say 95%. The intervals are no longer distribution-free, but the confidence is close to the specified value.
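The coverage C n (k) is a simple binomial cdf calculation; a one-line R sketch (Cn is our name):

Cn <- function(n, k) 1 - 2 * pbinom(k - 1, n, 0.5)  # coverage of [D_(k), D_(n-k+1)]
Cn(7, 1)  # .984
Cn(7, 2)  # .875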

Moving to the two-sample problem, the permutation interval based on the two-samplet is hard to compute, similar to the one-sample interval, and the usualt interval is mostly used in practice. Inversion of the Wilcoxon Rank Sum statistic for the shift model\(G(x) = F(x - \Delta )\) leads to a confidence interval forΔ of the form\([{U}_{({k}_{1})},{U}_{({k}_{2})}]\), whereU (1) ≤ U (2)⋯ ≤ U (mn) are the ordered values of the pairwise differences

$${U}_{ij} = {Y }_{j} - {X}_{i},\quad i = 1,\ldots,m;j = 1,\ldots,n.$$
(12.56)

Similar to the one-sample case,k 2 is chosen so that\(P(W \geq{k}_{2} + n(n + 1)/2) = \alpha /2\) and\({k}_{1} = mn + 1 - {k}_{2}\). In practice, one often uses the normal approximation interval with\({k}_{1} = q + 1\) and\({k}_{2} = mn - q\), whereq is the integer closest to

$$\frac{mn} {2} - {z}_{\alpha /2}{\left \{\mathrm{Var}(W)\right \}}^{1/2},$$

where Var(W) is given by (12.10, p. 462) or (12.11, p. 462).

Point estimators obtained from rank test statistics were introduced by Hodges and Lehmann [1963]. These R-estimators inherit some of the natural robustness properties of rank methods; see, for example, Huber [1981], Serfling [1980, Ch. 9], Randles and Wolfe [1979, Ch. 7], and Hettmansperger [1984, Ch. 5]. The most well known are: i) the one-sample center of symmetry estimator \(\widehat{\theta } = \mbox{ median}\{{W}_{ij}\}\), where the W ij are in (12.55, p. 512); and ii) the two-sample shift estimator \(\widehat{\Delta } = \mbox{ median}\{{U}_{ij}\}\), where the U ij are in (12.56, p. 513). Asymptotic relative efficiency comparisons for confidence intervals and estimators derived from rank tests are exactly the same as for the associated rank tests.

12 Appendix – Technical Topics for Rank Tests

12.1 Locally Most Powerful Rank Tests

Recall from Section 12.5.1 (p. 474) that forH 0 : Δ = 0 versusH a : Δ > 0, if there exists a rank test that is uniformly most powerful of level α for some ε > 0 in the restricted testing problemH 0 : Δ = 0 versusH a, ε : 0 < Δ < ε, we say that the test is thelocally most powerful rank test for the original testing problem. By using a Taylor expansion of the probability of the rank vector\(R\) as a function ofΔ,\({L}_{r}(\Delta ) \equiv{P}_{\Delta }(R = r)\), we need only obtain an expression for the derivative of\({L}_{r}(\Delta )\) and maximize it.

To see this consider the Taylor expansion

$${L}_{r}(\Delta ) = {L}_{r}(0) + {L}_{r}^{{\prime}}(0)\Delta+ o(\vert \Delta \vert ),$$

and a rank test with\(\alpha= k/N!\) based on maximizing\({L}_{r}^{{\prime}}(0)\). Let\({r}^{(1)}\) be the rank configuration that makes\({L}_{r}^{{\prime}}(0)\) largest among allN! rank configurations,\({r}^{(2)}\) makes\({L}_{r}^{{\prime}}(0)\) second largest among allN! rank configurations, etc. Such a rank test has power

$$\beta (\Delta ) = \sum\limits _{j=1}^{k}{L}_{{ r}^{(j)}}(\Delta ) = \sum\limits _{j=1}^{k}\left [ \frac{1} {N!} + {L}_{{r}^{(j)}}^{{\prime}}(0)\Delta+ o(\vert \Delta \vert )\right ].$$

For each rank configuration \({r}^{(j)}\), we can choose Δ j small enough so that \({L}_{{r}^{(j)}}(\Delta )\) is also the jth largest among \({L}_{{r}^{(1)}}(\Delta ),\ldots,{L}_{{r}^{(N!)}}(\Delta )\) for all 0 < Δ < Δ j. Now take ε to be smaller than all of the Δ j. This shows that for 0 < Δ < ε, the test that places points in the rejection region as ordered by \({L}_{r}^{{\prime}}(0)\) also puts points in the rejection region as ordered by \({P}_{\Delta }(R = r) = {L}_{r}(\Delta )\); in other words, it is the locally most powerful rank test.

Let us now consider the two-sample problem where X 1, …, X m are iid with distribution function F(x), and Y 1, …, Y n are iid with distribution function G(x). Suppose that F and G have densities f(x) and g(x), respectively, whose support is contained in that of a density h(x). This means that h(x) is positive whenever f(x) and g(x) are positive; for example, when all three densities have support on \((-\infty,\infty )\). From Theorem 12.6 (p. 515), we have

$$P(R = r) = \frac{1} {N!}\mbox{ E}\left [\frac{{\prod\nolimits }_{i=1}^{m}f({V }_{({r}_{i})}){\prod\nolimits }_{i=m+1}^{N}g({V }_{({r}_{i})})} {{\prod\nolimits }_{i=1}^{m}h({V }_{({r}_{i})}){\prod\nolimits }_{i=m+1}^{N}h({V }_{({r}_{i})})}\right ],$$

whereV (1) < ⋯ < V (N) are the order statistics of an iid sample of sizeN fromh(x).

Shift alternatives have the form \(g(x) = f(x - \Delta )\) so that the Y distribution has the same shape as the X distribution but is shifted Δ to the right of it. If f(x) has support on \((-\infty,\infty )\), then we may take h(x) = f(x) and obtain

$${P}_{\Delta }(R = r) = \frac{1} {N!}\mbox{ E}\left [\frac{{\prod\nolimits }_{i=m+1}^{N}f({V }_{({r}_{i})} - \Delta )} {{\prod\nolimits }_{i=m+1}^{N}f({V }_{({r}_{i})})} \right ],$$
(12.57)

where nowV (1) < ⋯ < V (N) are order statistics for a random sample fromf. Now suppose thatf(x) is differentiable and that we can take the derivative inside the expectation in (12.57). Then,

$${L}_{r}^{{\prime}}(0) ={ \left. \frac{\partial } {\partial \Delta }{P}_{\Delta }(R = r)\right \vert }_{\Delta =0} = \frac{1} {N!}\sum\limits _{i=m+1}^{N}\mbox{ E}\left [\frac{-{f}^{{\prime}}({V }_{ ({r}_{i})})} {f({V }_{({r}_{i})})} \right ].$$
(12.58)

The locally most powerful rank test places points in the rejection region according to large values of this latter expression.

If we letV (1) < ⋯ < V (N) be replaced by\({F}^{-1}({U}_{(1)})< \cdots< {F}^{-1}({U}_{(N)})\) where theU (i) are uniform order statistics from an iid sampleU 1, , U N , then the locally most powerful rank test rejects for large values of

$$T = \sum\limits _{i=m+1}^{N}a({R}_{ i}),$$

wherea(i) = Eϕ(U (i), f), and\(\phi (u,f) = -{f}^{{\prime}}({F}^{-1}(u))/f({F}^{-1}(u))\) is given in (12.23, p. 475) and called the optimal score function.

12.2 Distribution of the Rank Vector under Alternatives

A version of the following result first appeared in Hoeffding [1951].

Theorem 12.6.

Suppose that Z 1, …, Z N are independent continuous random variables with respective densities f 1, …, f N. Let \(R = {({R}_{1},\ldots,{R}_{N})}^{T}\) be the corresponding rank vector. If h is the density of a continuous random variable whose support contains the support of each of f 1, …, f N, then

$$P(R = r) = \frac{1} {N!}\mbox{ E}\left [\frac{{\prod\nolimits }_{i=1}^{N}{f}_{i}({V }_{({r}_{i})})} {{\prod\nolimits }_{i=1}^{N}h({V }_{({r}_{i})})} \right ],$$

where V (1) < ⋯ < V (N) are the order statistics of an iid sample from h.

Proof.

Let\(C =\{ t : {t}_{i}\;\;\mbox{ has rank}\;\;{r}_{i}\}\). Then by definition

$$P(R = r) = \int\nolimits \nolimits \cdots \int\nolimits \nolimits I(t \in C)\left \{{\prod\nolimits }_{i=1}^{N}{f}_{ i}({t}_{i})\right \}d{t}_{1}d{t}_{2}\cdots d{t}_{N}.$$

Now let\({v}_{({r}_{i})} = {t}_{i}\) so thatv (1) < ⋯ < v (N). On the setC this is just a 1-to-1 change of variable, but its implications are important. For a given vector\(t\) suppose thatt 1 has rankr 1 = 3; that is,t 1 is third from the bottom when the components of\(t\) are ranked. Then\({v}_{({r}_{1})} = {v}_{(3)} = {t}_{1}\). Ift 2 has rankr 2 = 9, then\({v}_{({r}_{2})} = {v}_{(9)} = {t}_{2}\). Now we make the change of variable, and multiply and divide by\(N!{\prod\nolimits }_{i=1}^{N}h({v}_{({r}_{i})})\) to get

$$\begin{array}{rcl} P(R = r)& =& \frac{1} {N!}\int\nolimits \nolimits \cdots \int\nolimits \nolimits \left [\frac{{\prod\nolimits }_{i=1}^{N}{f}_{i}({v}_{({r}_{i})})} {{\prod\nolimits }_{i=1}^{N}h({v}_{({r}_{i})})} \right ]I({v}_{(1)}< \cdots< {v}_{(N)})N! \\ & & \quad \times \left \{{\prod\nolimits }_{i=1}^{N}h({v}_{ (i)})\right \}d{v}_{(1)}d{v}_{(2)}\cdots d{v}_{(N)}\end{array}$$

The result follows by noticing thatI(v (1) < ⋯ < v (N))N!  ∏ i = 1 N h(v (i)) is the density of the order statistic vector fromh.

12.3 Pitman Efficiency

Recall from Section (12.5.2, p. 476) that the Pitman asymptotic relative efficiency of testS to testT is given by

$$\mbox{ ARE}(S,T) =\lim \limits_{k\rightarrow \infty }\frac{{N}_{k}^{{\prime}}} {{N}_{k}},$$

where N k and N k ′ are the sample sizes required for the two tests to have the same limiting level α and power β under the sequence of alternatives

$${\theta }_{k} = {\theta }_{0} + \frac{\delta } {\sqrt{{N}_{k}}} + o\left ( \frac{1} {\sqrt{{N}_{k}}}\right )\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
(12.59)

These sequences of alternatives are calledPitman alternatives, and the basic approach is due to Pitman [1948] and Noether [1955]. In the following we have drawn heavily from the accounts in Lehmann [1975] and Randles and Wolfe [1979].

We assume in Theorem 12.7 below that both test statistics satisfy 1–7 below. For simplicity we state the conditions for justS and then give a result on asymptotic power before giving the main theorem.

In the following\({\mu }_{{S}_{k}}(\theta )\) and\({\sigma }_{{S}_{k}}(\theta )\) refer to sequences of constants associated withS k under θ. They might be the means and standard deviations, but need not be.

  1. 1.
    $${\theta }_{k} \rightarrow{\theta }_{0}\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  2. 2.
    $${N}_{k} \rightarrow \infty \;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  3. 3.

    Under θ = θ0

    $$\frac{{S}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \stackrel{d}{\rightarrow }\mbox{ N}(0,1)\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  4. 4.

    Under θ = θ k

    $$\frac{{S}_{k} - {\mu }_{{S}_{k}}({\theta }_{k})} {{\sigma }_{{S}_{k}}({\theta }_{k})} \stackrel{d}{\rightarrow }\mbox{ N}(0,1)\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  5. 5.

    The derivative\({\mu }_{{S}_{k}}^{{\prime}}(\theta )\) exists in a neighborhood of θ = θ0 with\({\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0}) > 0\) and

    $$\frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{k}^{{_\ast}})} {{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} \rightarrow1\;\;\mbox{ for all}\;\;{\theta }_{k}^{{_\ast}}\rightarrow{\theta }_{ 0}\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  6. 6.
    $$\frac{{\sigma }_{{S}_{k}}({\theta }_{k})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \rightarrow1\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  7. 7.

    There exists a positive constantc such that

    $$c =\lim \limits_{k\rightarrow \infty } \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} {\sqrt{{N}_{k } {\sigma }_{{S}_{k } }^{2 }({\theta }_{0 } )}}.$$

This constant c is called the efficacy of S and denoted eff(S). Based on these conditions we first give a result on asymptotic power. The result shows that the higher the efficacy of a test, the more power it has. The result also gives a way to approximate the power of a test based on S. Let Z be a standard normal random variable, and let z α be its 1 − α quantile.

Theorem 12.7.

Suppose that the test that rejects for S k > c k has level α k → α as k →∞ under H 0 : θ = θ 0.

  1. a)

    If Conditions 1–7 and (12.59, p. 516) hold, then

    $${\beta }_{k} = P({S}_{k} > {c}_{k}) \rightarrow P(Z > {z}_{\alpha } - c\delta )\;\;\mbox{ as}\;\;k \rightarrow \infty,$$
    (12.60)

    where δ is given in (12.59, p. 516).

  2. b)

    If Conditions 1–7 and (12.60) hold, then (12.24, p. 476) holds.

Proof.

Note first that if Condition 3. holds, then since α k  → α

$$\frac{{c}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \rightarrow{z}_{\alpha }\;\;\mbox{ as}\;\;k \rightarrow \infty.$$

NowP(S k  > c k ) is given by

$$\begin{array}{rcl} & & P\left (\frac{{S}_{k} - {\mu }_{{S}_{k}}({\theta }_{k})} {{\sigma }_{{S}_{k}}({\theta }_{k})} > \left [\frac{{c}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} -\frac{{\mu }_{{S}_{k}}({\theta }_{k}) - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \right ]\frac{{\sigma }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{k})}\right ) \\ & \rightarrow & P(Z > {z}_{\alpha } - c\delta )\;\;\mbox{ as}\;\;k \rightarrow \infty \end{array}$$

To see this last step, note that by the mean value theorem there exists a θ k  ∗  such that

$$\begin{array}{rcl} \frac{{\mu }_{{S}_{k}}({\theta }_{k}) - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} & =& \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{k}^{{_\ast}})({\theta }_{k} - {\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \\ & =& \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{k}^{{_\ast}})} {{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} {\sqrt{{N}_{k } {\sigma }_{{S}_{k } }^{2 }({\theta }_{0 } )}}\sqrt{{ N}_{k}}({\theta }_{k} - {\theta }_{0}) \rightarrow c\delta \end{array}$$

For part b) we just work backwards and note that (12.60) and Conditions 1–7 force the convergence to cδ, which means that \(\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0}) \rightarrow \delta \), which is equivalent to (12.59, p. 516).

Now we give the main Pitman ARE theorem.

Theorem 12.8.

Suppose that the tests that reject for S k > c k and T k > c k ′ based on sample sizes N k and N k ′, respectively, have levels α k and α k ′ that converge to α under H 0 : θ = θ 0 and their powers under θ k both converge to β, α < β < 1. If Conditions 1–7 hold and their efficacies are c = eff(S) and c ′ = eff(T), respectively, then the Pitman asymptotic relative efficiency of S to T is given by

$$\mbox{ ARE} ={ \left \{\frac{\mbox{ eff}(S)} {\mbox{ eff}(T)}\right \}}^{2}.$$

Proof.

By Theorem 12.7 (p. 517) b), \(\beta = P(Z > {z}_{\alpha } - c\delta ) = P(Z > {z}_{\alpha } - {c}^{{\prime}}{\delta }^{{\prime}})\). Thus \(c\delta = {c}^{{\prime}}{\delta }^{{\prime}}\) and

$$\begin{array}{rcl} \mbox{ ARE}(S,T)& =& \lim \limits_{k\rightarrow \infty }\frac{{N}_{k}^{{\prime}}} {{N}_{k}} \\& =& \lim \limits_{k\rightarrow \infty }{\left (\frac{\sqrt{{N}_{k }^{{\prime}}}({\theta }_{k} - {\theta }_{0})} {\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})} \right )}^{2} \\ & =&{ \left (\frac{{\delta }^{{\prime}}} {\delta } \right )}^{2} ={ \left ( \frac{c} {{c}^{{\prime}}}\right )}^{2}\end{array}$$

To apply Theorem 12.8 it would appear that we have to verify Conditions 3–6 above for arbitrary subsequences θ k converging to θ 0 and then compute the efficacy in Condition 7 for such sequences. However, if Conditions 1–7 and (12.60, p. 517) hold, we know by Theorem 12.7 (p. 517) that (12.24, p. 476) holds. Thus, we really only need to assume Condition 2 and verify Conditions 3–6 for alternatives of the form (12.59, p. 516). Moreover, the efficacy need only be computed for a simple sequence N converging to \(\infty \) since the numerator and denominator in Condition 7 only involve θ 0.

12.4 Pitman ARE for the One-Sample Location Problem

Using the notation of Section 12.8 (p. 419), let D 1, …, D N be iid from F(x − θ), where F(x) has density f(x) that is symmetric about 0, \(f(x) = f(-x)\). Thus D i has density f(x − θ) that is symmetric about θ. The testing problem is H 0 : θ = θ 0 versus H a : θ = θ k, where θ k is given by (12.59).

12.4.1 a Efficacy for the One-Sample t

The one-samplet statistic is

$$t = \frac{\sqrt{N}(\overline{D} - {\theta }_{0})} {s},$$

where s is the n − 1 version of the sample standard deviation. The simplest choice of standardizing constants is

$${\mu }_{{t}_{k}}({\theta }_{k}) = \frac{\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})} {\sigma }$$

and\({\sigma }_{{t}_{k}}({\theta }_{k}) = 1\), where σ is the standard deviation ofD 1 (under both θ = θ0 and θ = θ k ). To verify Conditions 3 and 4 (p. 517), we have

$$\begin{array}{rcl} \frac{{t}_{k} - {\mu }_{{t}_{k}}({\theta }_{0})} {{\sigma }_{{t}_{k}}({\theta }_{0})} & =& \frac{\sqrt{{N}_{k}}(\overline{D} - {\theta }_{0})} {s} -\frac{\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})} {\sigma } \\ & =& \frac{\sqrt{{N}_{k}}(\overline{D} - {\theta }_{k})} {\sigma } \left ( \frac{s} {\sigma }\right ) + \sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})\left (\frac{1} {s} - \frac{1} {\sigma }\right )\end{array}$$

Under both θ = θ 0 and θ = θ k, s has the same distribution and converges in probability to σ if D has a finite variance. Thus, under θ = θ k the last term in the latter display converges to 0 in probability since (12.59) forces \(\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})\) to converge to δ. Of course under θ = θ 0 this last term is identically 0. The standardized means converge to standard normals under both θ = θ 0 and θ = θ k by Theorem 5.33 (p. 263). Two applications of Slutsky's Theorem then give Conditions 3 and 4 (p. 517). Since the derivative of \({\mu }_{{t}_{k}}(\theta )\) is \({\mu }_{{t}_{k}}^{{\prime}}(\theta ) = \sqrt{{N}_{k}}/\sigma \) for all θ, Condition 5 (p. 517) is satisfied. Since \({\sigma }_{{t}_{k}}({\theta }_{k}) = 1\), Condition 6 (p. 517) is satisfied. Finally, dividing \({\mu }_{{t}_{k}}^{{\prime}}({\theta }_{0}) = \sqrt{{N}_{k}}/\sigma \) by \(\sqrt{{N}_{k}}\) yields

$$\mbox{ eff}(t) = \frac{1} {\sigma }.$$

It should be pointed out that this efficacy expression also holds for the permutation version of the t test because the permutation distribution of the t statistic also converges to a standard normal under \(\theta= {\theta }_{0}\).

12.4.2 Efficacy for the Sign Test

The sign test statistic is the number of observations above θ0,

$$S = \sum\limits _{i=1}^{N}I({D}_{ i} > {\theta }_{0}).$$

S has a binomial\((N,1/2)\) distribution under \(\theta= {\theta }_{0}\) and a binomial\((N,1 - F({\theta }_{0} - \theta ))\) distribution under general θ. Let \({\mu }_{{S}_{k}}(\theta ) = N[1 - F({\theta }_{0} - \theta )]\) and \({\sigma }_{{S}_{k}}^{2}(\theta ) = N[1 - F({\theta }_{0} - \theta )]F({\theta }_{0} - \theta )\). Conditions 3 and 4 (p. 517) follow again by Theorem 5.33 (p. 263), and \({\mu }_{{S}_{k}}^{{\prime}}(\theta ) = Nf({\theta }_{0} - \theta )\). Since F is continuous, Condition 6 (p. 517) is satisfied; if f is continuous, then Condition 5 (p. 517) is satisfied, and the efficacy is

$$\mbox{ eff}(S) =\lim \limits_{N\rightarrow \infty } \frac{Nf(0)} {\sqrt{{N}^{2 } /4}} = 2f(0).$$

Now we are able to compute the Pitman ARE of the sign test to the t test:

$$\mbox{ ARE}(S,t) = 4{\sigma }^{2}{f}^{2}(0).$$

Table 12.4 (p. 496) gives values of ARE(S, t) for some standard distributions.
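
As a quick check of this formula in two familiar cases: a normal density with variance \({\sigma }^{2}\) has \(f(0) = 1/(\sigma \sqrt{2\pi })\), and a Laplace density with variance \({\sigma }^{2}\) has \(f(0) = 1/(\sigma \sqrt{2})\), so

$$\mbox{ ARE}(S,t) = \frac{4{\sigma }^{2}} {2\pi {\sigma }^{2}} = \frac{2} {\pi } \approx.64\;\;\mbox{ (normal)}\quad \mbox{ and}\quad \mbox{ ARE}(S,t) = \frac{4{\sigma }^{2}} {2{\sigma }^{2}} = 2\;\;\mbox{ (Laplace)},$$

illustrating that the sign test loses efficiency at the normal but can dominate the t test for heavier-tailed distributions.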

12.4.3 Efficacy for the Wilcoxon Signed Rank Test

Recall that the signed rank statistic is

$${W}^{+} = \sum\limits _{i=1}^{N}I({D}_{ i} > {\theta }_{0}){R}_{i}^{+},$$

where \({R}_{i}^{+}\) is the rank of \(\vert {D}_{i} - {\theta }_{0}\vert \) among \(\vert {D}_{1} - {\theta }_{0}\vert,\ldots,\vert {D}_{N} - {\theta }_{0}\vert \). The asymptotic distribution of \({W}^{+}\) under \({\theta }_{k}\) requires more theory than we have developed so far, but Olshen [1967] showed that the efficacy of \({W}^{+}\) is

$$\sqrt{12}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx$$

under the condition that \({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx < \infty \). Thus the Pitman asymptotic relative efficiency of the sign test to the Wilcoxon Signed Rank test is

$$\mbox{ ARE}(S,{W}^{+}) = \frac{{f}^{2}(0)} {3{\left ({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx\right )}^{2}}.$$

Similarly, the Pitman asymptotic relative efficiency of the Wilcoxon Signed Rank test to the t test is

$$\mbox{ ARE}({W}^{+},t) = 12{\sigma }^{2}{\left ({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx\right )}^{2}.$$

Table 12.4 (p. 496) displays these AREs for a number of distributions.
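
For instance, a normal density with variance \({\sigma }^{2}\) has \({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx = 1/(2\sigma \sqrt{\pi })\), so that

$$\mbox{ ARE}({W}^{+},t) = \frac{12{\sigma }^{2}} {4\pi {\sigma }^{2}} = \frac{3} {\pi } \approx.955\quad \mbox{ and}\quad \mbox{ ARE}(S,{W}^{+}) = \frac{1/(2\pi )} {3/(4\pi )} = \frac{2} {3};$$

the signed rank test gives up very little relative to the t test at the normal, whereas the sign test gives up a third relative to the signed rank test.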

12.4.4 Power Approximations for the One-Sample Location Problem

Theorem 12.7 (p. 517) gives the asymptotic power approximation

$$P(Z > {z}_{\alpha } - c\delta ) = 1 - \Phi \left ({z}_{\alpha } - c\,\sqrt{N}(\theta- {\theta }_{0})\right )$$

based on setting \(\delta= \sqrt{N}(\theta- {\theta }_{0})\) in (12.60, p. 517), where θ is the alternative of interest at sample size N.

For example, let us first consider the t statistic with \(c = 1/\sigma \) and \({\theta }_{0} = 0\). The power approximation is then

$$1 - \Phi \left ({z}_{\alpha } -\sqrt{N}\theta /\sigma \right ).$$

This is the exact power we get for the Z statistic \(\sqrt{N}(\overline{D} - {\theta }_{0})/\sigma \) when we know σ instead of estimating it. At \(\theta /\sigma=.2\) and N = 10, we get power 0.16, which may be compared with the estimated exact powers for the first four distributions in Randles and Wolfe [1979, p. 116]: .14, .15, .16, .17. These latter estimates were based on 5000 simulations and have standard deviation around .005. At \(\theta /\sigma=.4\) and N = 10, the approximate power is 0.35, and the estimated exact powers for those first four distributions in Randles and Wolfe [1979, p. 116] are .29, .33, .35, and .37, respectively. So here our asymptotic approximation may be viewed as substituting a Z for the t, and the approximation is quite good. Of course, for the normal distribution we could easily have used the noncentral t distribution to get the exact power.
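
These numbers are easy to reproduce. The following minimal sketch (Python with scipy; not part of the original text) evaluates the Z approximation to the t test power:

from scipy.stats import norm

# power approximation: 1 - Phi(z_alpha - sqrt(N) * theta / sigma), one-sided alpha = .05
z05 = norm.ppf(0.95)
N = 10
for ratio in (0.2, 0.4):                      # ratio = theta / sigma
    print(ratio, round(1 - norm.cdf(z05 - N ** 0.5 * ratio), 2))
# prints 0.16 and 0.35, matching the values above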

For the sign test, the approximation is

$$1 - \Phi \left ({z}_{\alpha } -\sqrt{N}2f(0)\theta \right ) = 1 - \Phi \left ({z}_{\alpha } -\sqrt{N}2{f}_{0}(0)\theta /\sigma \right ),$$

where we have put f in the form of a location-scale model \(f(x) = {f}_{0}((x - \theta )/\sigma )/\sigma \), where \({f}_{0}(x)\) has standard deviation 1, so that σ is the standard deviation. For the uniform distribution, \({f}_{0}(x) = I(-\sqrt{3}< x< \sqrt{3})/\sqrt{12}\), so that \(2{f}_{0}(0) = 2/\sqrt{12}\). The approximate power at \(\theta /\sigma=.2,.4,.6,.8\) and N = 10 is then .10, .18, .29, .43, respectively. The corresponding Randles and Wolfe [1979, p. 116] estimates are .10, .19, .30, and .45, respectively. Here, of course, we could calculate the power exactly using the binomial. The approximate power we have used is similar to the normal approximation to the binomial but not the same, because our approximation has replaced the difference between \(p = F(0) = 1/2\) and \(p = F(\theta )\) by a derivative times θ (a Taylor expansion) and has also used the null variance. It is perhaps surprising how good the approximation is.
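
The same computation with the uniform's \(2{f}_{0}(0) = 2/\sqrt{12}\) in place of 1 ∕ σ reproduces the sign test values (again a Python/scipy sketch):

from math import sqrt
from scipy.stats import norm

z05 = norm.ppf(0.95)
N = 10
for ratio in (0.2, 0.4, 0.6, 0.8):            # ratio = theta / sigma
    power = 1 - norm.cdf(z05 - sqrt(N) * (2 / sqrt(12)) * ratio)
    print(ratio, round(power, 2))
# prints .10, .18, .29, .43 as in the text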

The most interesting case is the signed rank statistic because we do not have any standard way of calculating the power. The approximate power for an alternative θ when θ0 = 0 is

$$\begin{array}{rcl} P(Z > {z}_{\alpha } - c\delta )& =& 1 - \Phi \left ({z}_{\alpha } - \theta \sqrt{12N}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx\right ) \\ & =& 1 - \Phi \left ({z}_{\alpha } - \frac{\theta } {\sigma }\sqrt{12N}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}_{ 0}^{2}(x)dx\right )\end{array}$$

Here again in the second part we have substituted so that σ is the standard deviation of \(f(x)\). For example, at the standard normal, \({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}_{0}^{2}(x)dx = 1/\sqrt{4\pi }\), and the approximate power is

$$1 - \Phi \left ({z}_{\alpha } -\sqrt{\frac{3N} {\pi }} \frac{\theta } {\sigma }\right ).$$

Plugging in \(\theta /\sigma=.2,.4,.6\), and .8 at N = 10, we obtain .15, .34, .58, and .80, respectively. The estimates of the exact powers from Randles and Wolfe [1979, p. 116] are .14, .32, .53, and .74. Thus the asymptotic approximation is a bit too high, especially at the larger θ ∕ σ values.
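
A similar sketch (Python/scipy) reproduces these values from the display above:

from math import pi, sqrt
from scipy.stats import norm

z05 = norm.ppf(0.95)
N = 10
for ratio in (0.2, 0.4, 0.6, 0.8):            # ratio = theta / sigma
    print(ratio, round(1 - norm.cdf(z05 - sqrt(3 * N / pi) * ratio), 2))
# prints .15, .34, .58, .80, versus the simulated .14, .32, .53, .74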

Although the approximation is a little high, it could easily be used for planning purposes. For example, suppose that a clinical trial is to be run with power .80 at the α = .05 level against alternatives expected to be around \(\theta /\sigma=.5\). Since the FDA requires two-sided procedures, we use \({z}_{.025} = 1.96\) and solve \({\Phi }^{-1}(1 -.8) = 1.96 -\sqrt{3N/\pi }(.5)\) to get

$$N ={ \left [\frac{1.96 - {\Phi }^{-1}(.2)} {.5} \right ]}^{2}\frac{\pi } {3} = 32.9.$$

Notice that if we invert the Z statistic power formula used above for approximating the power of the t statistic, the only difference from the last display is that the factor π ∕ 3 does not appear. Thus for the t test the calculations result in 31.4 observations. Of course, this ratio \(3/\pi= 31.4/32.9\) is just the ARE of the signed rank test to the t test at the normal distribution.
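
The planning calculation is equally simple to script (a Python/scipy sketch; as noted, the π ∕ 3 factor is dropped for the t):

from math import pi
from scipy.stats import norm

z_total = norm.ppf(0.975) - norm.ppf(0.20)    # 1.96 - Phi^{-1}(.2)
N_t = (z_total / 0.5) ** 2                    # Z-based calculation for the t
print(round(N_t, 1), round(N_t * pi / 3, 1))  # 31.4 and 32.9 (signed rank)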

Problems

  1. 12.1.

    For the permutations in Table 12.1 (p. 453), give the permutation distribution of the Wilcoxon Rank Sum statistic W.

  2. 12.2.

    For the two-sample problem with samples \({X}_{1},\ldots,{X}_{m}\) and \({Y }_{1},\ldots,{Y }_{n}\), show that the permutation test based on \(\sum\limits _{i=1}^{n}{Y }_{i}\) is equivalent to the permutation tests based on \(\sum\limits _{i=1}^{m}{X}_{i}\), \(\sum\limits _{i=1}^{n}{Y }_{i} -\sum\limits _{i=1}^{m}{X}_{i}\), and \(\overline{Y } -\overline{X}\).

  3. 12.3.

    A one-way ANOVA situation with k = 3 groups and two observations within each group (\({n}_{1} = {n}_{2} = {n}_{3} = 2\)) results in the following data. Group 1: 37, 24; Group 2: 12, 15; Group 3: 9, 16. The ANOVA \(F = 5.41\) results in a p-value of .101 from the F table. If we exchange the 15 in Group 2 for the 9 in Group 3, then \(F = 7.26\).

    1. a.

      What is the total number of ways of grouping the data that are relevant to testing that the means are equal?

    2. b.

      Without resorting to the computer, give reasons why the permutation p-value using the F statistic is 2/15.

  4. 12.4.

    In a one-sided testing problem with continuous test statistic T, the p-value is either \({F}_{H}({T}_{\mbox{ obs.}})\) or \(1 - {F}_{H}({T}_{\mbox{ obs.}})\) depending on the direction of the hypotheses, where \({F}_{H}\) is the distribution function of T under the null hypothesis H, and \({T}_{\mbox{ obs.}}\) is the observed value of the test statistic. In either case, under the null hypothesis the p-value is a uniform random variable, as seen from the probability integral transformation. Now consider the case where T has a discrete distribution with values \({t}_{1},\ldots,{t}_{k}\) and probabilities \(P(T = {t}_{i}) = {p}_{i},i = 1,\ldots,k\), under the null hypothesis \({H}_{0}\). If we are rejecting \({H}_{0}\) for small values of T, then the p-value is \(p = P(T \leq{T}_{\mbox{ obs.}}) = {p}_{1} + \cdots+ P(T = {T}_{\mbox{ obs.}})\), and the mid-p value is \(p - (1/2)P(T = {T}_{\mbox{ obs.}})\). Under the null hypothesis \({H}_{0}\), show that E(mid-p) = 1/2 and thus that the expected value of the usual p-value must be greater than 1/2 (and thus greater than the expected value of the p-value in continuous cases).

  5. 12.5.

    Consider a finite population of values \({a}_{1},\ldots,{a}_{N}\) and a set of constants \({c}_{1},\ldots,{c}_{N}\). We select a random permutation of the a values, call them \({A}_{1},\ldots,{A}_{N}\), and form the statistic

    $$T = \sum\limits _{i=1}^{N}{c}_{ i}{A}_{i}.$$

    The purpose of this problem is to derive the first two permutation moments of T given in Section 12.4.2 (p. 458).

    1. a.

      First show that

      $$P({A}_{i} = {a}_{s}) = \frac{1} {N}\quad \mbox{ for}\;s = 1,\ldots,N,$$

      and

      $$P({A}_{i} = {a}_{s},{A}_{j} = {a}_{t}) = \frac{1} {N(N - 1)}\quad \mbox{ for}\;s\neq t = 1,\ldots,N.$$

      (Hint: for the first result there are (N − 1)! permutations with \({a}_{s}\) in the ith slot out of a total of N! equally likely permutations.)

    2. b.

      Using a. show that

      $$\mbox{ E}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{a}_{ i} \equiv \overline{a},\quad \mathrm{Var}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{({a}_{ i}-\overline{a})}^{2},\quad \mbox{ for}\;i = 1,\ldots,N,$$

      and

      $$\mbox{ Cov}({A}_{i},{A}_{j}) = \frac{-1} {N(N - 1)}\sum\limits _{i=1}^{N}{({a}_{ i} -\overline{a})}^{2},\quad \mbox{ for}\;i\neq j = 1,\ldots,N.$$
    3. c.

      Now use b. to show that

      $$\mbox{ E}(T) = N\overline{c}\;\overline{a}\quad \mbox{ and}\quad \mathrm{Var}(T) = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{({c}_{ i} -\overline{c})}^{2} \sum\limits _{j=1}^{N}{({a}_{ j} -\overline{a})}^{2},$$

      where\(\overline{a}\) and\(\overline{c}\) are the averages of thea’s andc’s, respectively.
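
    A brute-force check of both formulas is feasible for small N by averaging over all N! permutations; here is a sketch in Python (the particular a and c values are arbitrary illustrations):

    import itertools, statistics

    a = [1.0, 2.0, 4.0, 8.0]                  # arbitrary finite population
    c = [0.3, -1.0, 0.0, 2.5]                 # arbitrary constants
    N = len(a)
    vals = [sum(ci * ai for ci, ai in zip(c, perm))
            for perm in itertools.permutations(a)]
    abar, cbar = sum(a) / N, sum(c) / N
    print(statistics.mean(vals), N * cbar * abar)        # E(T) both ways
    var_formula = (sum((ci - cbar) ** 2 for ci in c)
                   * sum((ai - abar) ** 2 for ai in a) / (N - 1))
    print(statistics.pvariance(vals), var_formula)       # Var(T) both ways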

  6. 12.6.

    As an application of the previous problem, consider the Wilcoxon Rank Sum statistic W = sum of the ranks of the Y's in a two-sample problem where we assume continuous distributions so that there are no ties. The c values are 1 for \(i = m + 1,\ldots,N = m + n\) and 0 otherwise. With no ties the a's are just the integers \(1,\ldots,N\) corresponding to the ranks. Show that

    $$\mbox{ E}(W) = \frac{n(m + n + 1)} {2}$$

    and

    $$\mathrm{Var}(W) = \frac{mn(m + n + 1)} {12}.$$
  7. 12.7.

    In Section 12.4.4 (p. 461), the integral

    $$\begin{array}{rcl} P({X}_{1}< {X}_{2}) =\mathrm{ E}\left \{I({X}_{1}< {X}_{2})\right \}& =& \int\nolimits \nolimits \int\nolimits \nolimits I({x}_{1}< {x}_{2})\,dF({x}_{1})\,dF({x}_{2}) \\ & =& \int\nolimits \nolimits F(x)\,dF(x) \\ \end{array}$$

    arises, where \({X}_{1}\) and \({X}_{2}\) are independent with distribution function F. If F is continuous, argue that \(P({X}_{1}< {X}_{2}) = 1/2\) since \({X}_{1} < {X}_{2}\) and \({X}_{1} > {X}_{2}\) are equally likely. Also use iterated expectations and the probability integral transformation to get the same result. Finally, let \(u = F(x)\) in the final integral to get the result.

  8. 12.8.

    Suppose that X and Y represent some measurement that signals the presence of disease via a threshold to be used in screening for the disease. Assume that Y has distribution function G(y) and represents a diseased population, and X has distribution function F(x) and represents a disease-free population. A “positive” for a disease-free subject is declared if X > c and has probability 1 − F(c), where F(c) is called the specificity of the screening test. A “positive” for a diseased subject is declared if Y > c and has probability 1 − G(c), called the sensitivity of the test. The receiver operating characteristic (ROC) curve is a plot of \(1 - G({c}_{i})\) versus \(1 - F({c}_{i})\) for a sequence of thresholds \({c}_{1},\ldots,{c}_{k}\). Instead of a discrete set of points, we may let \(t = 1 - F(c)\), solve to get \(c = {F}^{-1}(1 - t)\), and plug into 1 − G(c) to get the ROC curve \(R(t) = 1 - G({F}^{-1}(1 - t))\). Show that

    $${\int\nolimits \nolimits }_{0}^{1}R(t)\,dt = \int\nolimits \nolimits \{1 - G(u)\}\,dF(u) = {\theta }_{\mathrm{XY}}$$

    for continuous F and G.

  9. 12.9.

    Use the asymptotic normality result for\(\widehat{{\theta }}_{\mathrm{XY}}\) to derive (12.15, p. 464).

  10. 12.10.

    Use (12.15, p. 464) to prove that the power of the Wilcoxon Rank Sum Test goes to 1 as m and n go to \(\infty \) and m ∕ N converges to a number λ between 0 and 1. You may assume that F and G are continuous.

  11. 12.11.

    Use (12.15, p. 464) to derive (12.16, p. 464).

  12. 12.12.

    Suppose that \(\widehat{{\theta }}_{\mathrm{XY}}\) is .7 and m = n. How large should m = n be in order to have approximately 80% power at α = .05 with the Wilcoxon Rank Sum Test?

  13. 12.13.

    Suppose that two normal populations with the same standard deviation σ differ in means by \(\Delta /\sigma=.7\). How large should m = n be in order to have approximately 80% power at α = .05 with the Wilcoxon Rank Sum Test?

  14. 12.14.

    The number of permutations needed to carry out a permutation test can be computationally overwhelming. Thus the typical use of a permutation test involves estimating the true permutation p-value by randomly selecting \(B = 1{,}000\), \(B = 10{,}000\), or even more of the possible permutations. If we use sampling with replacement, then \(B\widehat{p}\) has a binomial distribution with the true p-value p being the probability in the binomial. Consider the following situation where an approach of questionable ethics is under consideration. A company has just run a clinical trial comparing a placebo to a new drug that they want to market, but unfortunately the estimated p-value based on B = 1000 is around \(\widehat{p} =.10\). Everybody is upset because they “know” the drug is good. One clever doctor suggests that they run the simulation of B = 1000 over and over again until they get a \(\widehat{p}\) less than .05. Are they likely to find a run for which \(\widehat{p}\) is less than .05 if the true p-value is p = .10? Use the following calculation based on k separate (independent) runs resulting in \(\widehat{{p}}_{1},\ldots,\widehat{{p}}_{k}\):

    $$\begin{array}{rcl} P{(\min }_{1\leq i\leq k}\widehat{{p}}_{i} \leq.05)& =& 1 - P{(\min }_{1\leq i\leq k}\widehat{{p}}_{i} >.05) \\ & =& 1 - {[1 - P(\widehat{{p}}_{1} \leq.05)]}^{k} \\ & =& 1 - {[1 - P(\mbox{ Bin(1000,.1)} \leq50)]}^{k}\end{array}$$

    Plug in some values of k to find out how large k would need to be to get a \(\widehat{p}\) under .05 with reasonably high probability.
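
    A sketch of this calculation in Python/scipy (the values of k tried are arbitrary):

    from scipy.stats import binom

    tail = binom.cdf(50, 1000, 0.1)           # P(Bin(1000, .1) <= 50)
    for k in (10, 10**3, 10**5, 10**7):
        print(k, 1 - (1 - tail) ** k)         # P(min p-hat_i <= .05)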

  15. 12.15.

    The above problem is for given data, and we were trying to estimate the true permutation p-value conditional on the data set and therefore conditional on the set of test statistics computed for every possible permutation. In the present problem we want to think in terms of the overall unconditional probability distribution of \(B\widehat{p}\), where we have two stages: first the data are generated, and then we randomly select \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) from the set of permutations. The calculation of importance for justifying Monte Carlo tests is the unconditional probability \(P(\widehat{p} \leq\alpha ) = P(B\widehat{p} \leq B\alpha )\) that takes both stages into account.

    1. a.

      First we consider a simpler problem. Suppose that we get some data that seem to be normally distributed and decide to compute a t statistic, call it \({T}_{0}\). Then we discover that we have lost our t tables, but fortunately we have a computer. Thus we can generate B independent normal data sets and compute \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\), one for each data set. In this case \({T}_{0},{T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) are iid from a continuous distribution so that there are no ties among them with probability one. Let \(\widehat{p} = \sum\limits _{i=1}^{B}I({T}_{i}^{{_\ast}}\geq{T}_{0})/B\) and prove that \(B\widehat{p}\) has a discrete uniform distribution on the integers \((0,1,\ldots,B)\). (Hint: just use the argument that each ordering has equal probability \(1/((B + 1)!)\). For example, \(B\widehat{p} = 0\) occurs when \({T}_{0}\) is the largest value. How many orderings have \({T}_{0}\) as the largest value?)

    2. b.

      The above result also holds if \({T}_{0},{T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) have no ties and are merely exchangeable. However, if we are sampling \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) with replacement from a finite set of permutations, then ties occur with probability greater than zero. Think of a way to randomly break ties so that we can get the same discrete uniform distribution.

    3. c.

      Assuming that \(B\widehat{p}\) has a discrete uniform distribution on the integers \((0,1,\ldots,B)\), show that \(P(\widehat{p} \leq\alpha ) = \alpha \) as long as (B + 1)α is an integer.

  16. 12.16.

    From (12.20, p. 469), \(d =.933\) for the Wilcoxon Rank Sum statistic for m = 10 and n = 6, assuming no ties. This corresponds to \(Z\) being the integers 1 to 16. For no ties and W = 67, the exact p-value for a one-sided test is .0467. Show that the normal approximation p-value is .0413 and the Box-Andersen p-value is .0426. Also find the Box-Andersen p-values using the approximations \(d = 1 + (1.8 - 3)/(m + n)\) and d = 1.

  17. 12.17.

    Show that the result “\(Q/(k - 1)\) of (12.31, p. 482) is AN\(\{1,2(n - 1)/(kn)\}\) as \(k \rightarrow \infty \) with n fixed” follows from (12.32, p. 483) and writing

    $$\begin{array}{lr} \sqrt{ k}\left ( \frac{Q} {k - 1} - \frac{n{F}_{\mathrm{R}}} {n - 1 + {F}_{\mathrm{R}}}\right ) = \frac{\sqrt{k}\{(N - 1)/(k - 1) - n\}{F}_{\mathrm{R}}} {(n - 1)\left ( \frac{k} {k - 1}\right ) + {F}_{\mathrm{R}}} \\ + \sqrt{k}(n{F}_{\mathrm{R}})\left ( \frac{1} {(n - 1)\left ( \frac{k} {k - 1}\right ) + {F}_{\mathrm{R}}} - \frac{1} {n - 1 + {F}_{\mathrm{R}}}\right )\end{array}$$

    Then show that each of the above two pieces converges to 0 in probability and use the delta theorem on \(n{F}_{\mathrm{R}}/(n - 1 + {F}_{\mathrm{R}})\). (Keep in mind that n is a fixed constant.)

  18. 12.18.

    Justify the statement: “use of \({F}_{\mathrm{R}}\) with an \(F(k - 1,N - k)\) reference distribution is supported by (12.32, p. 483) under \(k \rightarrow \infty \) and by the usual asymptotics \((k - 1){F}_{\mathrm{R}}\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\) when \(n \rightarrow \infty \) with k fixed.” Hint: for the \(k \rightarrow \infty \) asymptotics, write an \(F(k - 1,N - k)\) random variable as an average of k − 1 \({\chi }_{1}^{2}\) random variables divided by an independent average of k(n − 1) \({\chi }_{1}^{2}\) random variables. Then subtract 1, multiply by \(\sqrt{k}\), and use the Central Limit Theorem and Slutsky's Theorem.

  19. 12.19.

    From Section 12.8.1 (p. 492), show that for \(T = \sum\limits _{i=1}^{n}{c}_{i}{d}_{i}\), \(\mathrm{E}({T}^{4}) = 3{(\sum\limits _{i=1}^{n}{d}_{i}^{2})}^{2} - 2\sum\limits _{i=1}^{n}{d}_{i}^{4}\). (Hint: first show that

    $${\left (\sum\nolimits {c}_{i}{d}_{i}\right )}^{4} = \sum\nolimits {c}_{i}^{4}{d}_{ i}^{4} + 6\sum\limits _{i<j}{c}_{i}^{2}{d}_{ i}^{2}{c}_{ j}^{2}{d}_{ j}^{2}$$

    plus sums of odd moments.)

  20. 12.20.

    Verify (12.39, p. 493) and (12.40, p. 493) for the Box-Andersen approximation in the matched pairs problem.

  21. 12.21.

    Using results in Section 12.4.2 (p. 458), show that \(\mathrm{E}\{{\overline{R}}_{.j}\} = (k + 1)/2\), \(\mathrm{Var}\{{\overline{R}}_{.j}\} = ({k}^{2} - 1)/(12n)\), and \(\mathrm{Cov}\{{\overline{R}}_{.j},{\overline{R}}_{.m}\} = -({k}^{2} - 1)/\{12n(k - 1)\}\), where \({R}_{i1},\ldots,{R}_{ik}\) are Friedman ranks in the ith block randomly assigned to the integers 1 to k and independent of the ranks in the other blocks. Putting these results together, the covariance matrix of \(\overline{R} = {({\overline{R}}_{.1},\ldots,{\overline{R}}_{.k})}^{T}\) is \(\{k(k + 1)/(12n)\}{C}_{k}\), where \({C}_{k} = {I}_{k} -{\mathbf{1}}_{k}{\mathbf{1}}_{k}^{T}/k\). Using the fact that \({C}_{k}\) is idempotent, find a generalized inverse of the covariance matrix of \(\overline{R}\), call it G, and show that (12.45, p. 501) is given by \({\overline{R}}^{T}G\overline{R}\).

  22. 12.22.

    Similar to Problem 12.18, explain why asymptotic normality of the Friedman statistic (12.45, p. 501) supports use of the F statistic in (12.44, p. 500) on the within-row Friedman ranks with an \(F(k - 1,(k - 1)(n - 1))\) reference distribution.

  23. 12.23.

    From Section 12.9.4 (p. 503) verify the permutation moments in (12.49, p. 504) and (12.50, p. 504). Use results from Section 12.4.2 (p. 458) under the assumption that permutations are independently carried out within rows.

  24. 12.24.

    From Section 12.10.1 (p. 506) consider the two independent binomial testing problem where \(m = 12\) (\(= {N}_{11} + {N}_{12}\)) for Group 1 and \(n = 4\) (\(= {N}_{21} + {N}_{22}\)) for Group 2, and we want to test \({H}_{0} : {p}_{1} = {p}_{2}\) versus \({H}_{a} : {p}_{1} < {p}_{2}\), where \({p}_{1}\) and \({p}_{2}\) are the respective probabilities of falling in Category 1. Suppose that \(T = 4\) (\(= {N}_{11} + {N}_{21}\)) is observed. Write down the conditional probability distribution of \({N}_{11}\,\vert \,T = 4\) (just the hypergeometric probabilities for \({n}_{11} = 0,1,2,3,4\)). Also, letting each of 0, 1, 2, 3, 4 be considered observed values for \({N}_{11}\), list:

    1. a.

      the Fisher Exact p-values

    2. b.

      the Fisher Exact mid-p values.

  25. 12.25.

    For a multinomial vector \(({N}_{11},{N}_{12},{N}_{21},{N}_{22})\), \({N}_{11} + {N}_{12} + {N}_{21} + {N}_{22} = N\), with associated probabilities \(({p}_{11},{p}_{12},{p}_{21},{p}_{22})\), show that the variance of \({N}_{12} - {N}_{21}\) is \(N\{{p}_{12} + {p}_{21} - {({p}_{12} - {p}_{21})}^{2}\}\).

  26. 12.26.

    Show that (12.58, p. 515) follows from (12.57, p. 515) if the derivative can be taken inside the expectation.

  27. 12.27.

    Show why \({\alpha }_{k} \rightarrow \alpha \) and Condition 3 (p. 517) imply that

    $$\frac{{c}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \rightarrow{z}_{\alpha }\;\;\mbox{ as}\;\;k \rightarrow \infty.$$

    (Hint: it helps to use Pólya’s result on uniform convergence, Theorem 5.6, p. 222.)

  28. 12.28.

    Verify that Theorem 5.33 (p. 263) applies to\(\overline{X}\) when\({X}_{1}^{{_\ast}},\ldots,{X}_{{N}_{k}}^{{_\ast}}\) are iid fromF(x) having mean 0 and finite variance σ2, and\({X}_{i} = {X}_{i}^{{_\ast}} + \delta /\sqrt{{N}_{k}},i = 1,\ldots,{N}_{k}\).

  29. 12.29.

    Verify that Theorem 5.33 (p. 263) applies to \(S = \sum\limits _{i=1}^{N}I({X}_{i} > 0)\) when \({X}_{1}^{{_\ast}},\ldots,{X}_{{N}_{k}}^{{_\ast}}\) are iid from F(x) having median 0 and \({X}_{i} = {X}_{i}^{{_\ast}} + \delta /\sqrt{{N}_{k}},i = 1,\ldots,{N}_{k}\).

  30. 12.30.

    The data are \({Y }_{1},\ldots,{Y }_{n}\) iid with median θ. For \({H}_{0} : \theta= 0\) versus \({H}_{a} : \theta > 0\), use the normal approximation to the binomial distribution to find a power approximation for the sign test and compare it to the expression \(1 - \Phi \left ({z}_{\alpha } -\sqrt{N}2f(0){\theta }_{a}\right )\) derived from Theorem 12.7 (p. 517), where \({\theta }_{a}\) is an alternative. Where are the differences?

  31. 12.31.

    For the Wilcoxon Signed Rank statistic, calculate an approximation to the power of a .05 level test for a sample of size N = 20 from the Laplace distribution with a shift of .6 in standard deviation units. Compare with the simulation estimate .63 from Randles and Wolfe [1979, p. 116].

  32. 12.32.

    Consider the two-sample problem where \({X}_{1},\ldots,{X}_{m}\) and \({Y }_{1},\ldots,{Y }_{n}\) are iid from F(x) under \({H}_{0}\), but the Y's are shifted to the right by \({\Delta }_{k} = \delta /\sqrt{{N}_{k}}\) under a sequence of Pitman alternatives. Verify Conditions 3–6 (p. 517), making any assumptions necessary, and show that the efficacy of the two-sample t test is given by eff\((t) = \sqrt{\lambda (1 - \lambda )}/\sigma \), where σ is the standard deviation of F.

  33. 12.33.

    Consider a variable having a Likert scale with possible answers 1, 2, 3, 4, 5. Suppose that we are thinking of a situation where the treatment group has answers that tend to be spread toward 1 or 5 and away from the middle. Can we design a rank test to handle this? Here is one formulation. For the two-sample problem suppose that the base density is a beta density of the following form:

    $$\frac{\Gamma (2(1 - \theta ))} {\Gamma (1 - \theta )\Gamma (1 - \theta )}{x}^{-\theta }{(1 - x)}^{-\theta },\;\;0< x< 1,\;\;\theta< 1.$$

    A sketch of this density shows that it spreads toward the ends as θ increases. Using the LMPRT theory, find the optimal score function for \({H}_{0} : \theta= {\theta }_{0}\) versus \({H}_{a} : \theta > {\theta }_{0}\), where \(0 \leq{\theta }_{0} < 1\). At \({\theta }_{0} = 0\), the score function simplifies to \(\phi (u) = -2 -\log [u(1 - u)]\). Sketch this score function and comment on whether a linear rank statistic of the form \(S = \sum\limits _{i=1}^{m}\phi ({R}_{i}/(N + 1))\) makes sense here.

  34. 12.34.

    For the two-sample problem with \(G(x) = (1 - \Delta )F(x) + \Delta {F}^{2}(x)\) and \({H}_{0} : \Delta= 0\) versus \({H}_{a} : \Delta > 0\), show that the Wilcoxon Rank Sum test is the locally most powerful rank test. (You may take \(h(x) = f(x)\) in the expression for \(P(R = r)\).)

  35. 12.35.

    In some two-sample situations (treatment and control), only a small proportion of the treatment group responds to the treatment. Johnson et al. [1987] were motivated by data on sister chromatid exchanges in the chromosomes of smokers, where only a small number of units are affected by the treatment, that is, where the treatment group seemed to have a small but higher proportion of large values than the control group. For this two-sample problem, they proposed a mixture alternative,

    $$G(x) = (1 - \Delta )F(x) + \Delta K(x),$$

    where K(x) is stochastically larger than F(x), i.e., \(K(x) \leq F(x)\) for all x, and Δ refers to the proportion of responders. For \({H}_{0} : \Delta= 0\) versus \({H}_{a} : \Delta > 0\), verify that the locally most powerful rank test has optimal score function \(k({F}^{-1}(u))/f({F}^{-1}(u)) - 1\). Let F(x) and K(x) be normal distribution functions with means \({\mu }_{1}\) and \({\mu }_{2}\), respectively, \({\mu }_{2} > {\mu }_{1}\), and variance \({\sigma }^{2}\). Show that the optimal score function is

    $$\phi (u) =\exp (-{\delta }^{2}/2)\exp (\delta {\Phi }^{-1}(u)) - 1,$$
    (12.61)

    where\(\delta= ({\mu }_{2} - {\mu }_{1})/\sigma \).

  36. 12.36.

    Related to the previous problem, Johnson et al. [1987] give the following example data:

    X: 99 10 10 14 14 14 15 16 20

    Y: 6 10 13 15 18 21 22 23 30 37

    By sampling from the permutation distribution of the linear rank statistic \(\sum\limits _{i=m+1}^{m+n}\phi ({R}_{i}/(m + n + 1))\) with the score function in (12.61), estimate the one-sided permutation p-values with δ = 1 and δ = 2, as in the sketch below. For comparison, also give one-sided p-values for the Wilcoxon rank sum (exact) and pooled t tests (from the t table).
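
    One way to organize the sampling, sketched in Python (B = 10,000 and the seed are arbitrary choices; the data are entered exactly as printed above):

    import numpy as np
    from scipy.stats import norm, rankdata

    x = np.array([99, 10, 10, 14, 14, 14, 15, 16, 20])
    y = np.array([6, 10, 13, 15, 18, 21, 22, 23, 30, 37])
    m, n, delta = len(x), len(y), 1.0         # rerun with delta = 2.0

    def phi(u):                               # score function (12.61)
        return np.exp(-delta ** 2 / 2) * np.exp(delta * norm.ppf(u)) - 1

    def stat(z):                              # sum of scores over the Y slots
        r = rankdata(z)                       # midranks handle any ties
        return phi(r[m:] / (m + n + 1)).sum()

    z = np.concatenate([x, y])
    obs = stat(z)
    rng = np.random.default_rng(0)
    sims = np.array([stat(rng.permutation(z)) for _ in range(10_000)])
    print((sims >= obs).mean())               # estimated one-sided p-value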

  37. 12.37.

    Similar in motivation to Problem 12.35 (p. 529), Conover and Salsburg [1988] proposed the mixture alternative

    $$G(x) = (1 - \Delta )F(x) + \Delta {\left \{F(x)\right \}}^{a}.$$

    Note that \({\left \{F(x)\right \}}^{a}\) is the distribution function of the maximum of a random variables with distribution function F(x). For \({H}_{0} : \Delta= 0\) versus \({H}_{a} : \Delta > 0\), verify that the locally most powerful rank test has optimal score function \({u}^{a-1}\).

  38. 12.38.

    For the data in Problem 12.36 (p. 530), by sampling from the permutation distribution of the linear rank statistic \(\sum\limits _{i=m+1}^{m+n}\phi ({R}_{i}/(m + n + 1))\) with score function \(\phi (u) = {u}^{a-1}\), estimate the one-sided permutation p-value with a = 5. For comparison, also give one-sided p-values for the Wilcoxon rank sum (exact) and pooled t tests (from the t table).

  39. 12.39.

    Conover and Salsburg [1988] gave the following example data set on changes from baseline of serum glutamic oxaloacetic transaminase (SGOT):

    X: -50 -17 -10 -3 4 7 8 12 26 37

    Y: -116 -56 20 24 29 29 35 35 37 41

    Plot the data and decide what type of test should be used to detect larger values in some or all of the Y's. Then give the one-sided p-value for that test and for one other possible test.

  40. 12.40.

    Use perm.sign to get the exact one-sided p-value 0.044 for the data given in Example 12.2 (p. 498). Then, by trial and error, get an exact confidence interval for the center of the distribution with coverage at least 90%. Also give the exact confidence interval for the median based on the order statistics with coverage at least 90%.