1 Introduction

In the early 1930s R. A. Fisher discovered a very general exact method of testing hypotheses based on permuting the data in ways that do not change its distribution under the null hypothesis. This permutation method does not require standard parametric assumptions such as normality of the data. It does require, however, certain invariance properties under the null hypothesis that restrict application to fairly simple designs. But in such situations, the method results in exact tests with level α under very weak distributional assumptions. Moreover, the method is statistic-inclusive in the sense that any test statistic can be used and inherits the level-α property, although some statistics are much more powerful than others.

Tests based on this method are called permutation tests or randomization tests depending on whether the data can be viewed as samples from populations or not. That is, when sampling from populations, “permutation tests” refer to use of the permutation method to obtain level α tests under weak distributional assumptions. In Fisher’s words (1935, Sec. 21), these are tests of a “wider” null hypothesis (as compared to assuming normal distributions, for example).

However, experiments may be performed on units that cannot be viewed as arising from random sampling of any population. In such situations “randomization inference” refers to inference based only on the physical randomization of the units to different treatments, and on the test statistic calculated at all possible randomizations of the data. The same test that we called a permutation test in random sampling contexts is now called a randomization test. Of course one needs to qualify all statements of significance about such experiments with the disclaimer that randomization inference only applies to the units used in the experiment.

Permutation tests are the foundation of classical nonparametric statistics (also called distribution-free statistics), which itself is often identified with rank tests. Rank tests are actually a special subclass of permutation tests with three distinct advantages:

  1.

    For data without ties, the conditional permutation distribution of a rank test is actually unconditional (does not change from sample to sample) because the ranks of a continuous data set are the same for every sample. Thus, the distribution of an important rank statistic like the Wilcoxon Rank Sum statistic can be tabulated or programmed. However, this computing advantage is less important today, and when there are ties in the data (a very common occurrence), the tabulated values are not appropriate, and the conditional permutation distribution is required for exact inference.

  2.

    The key philosophical foundation of rank tests arises from the theory of invariant tests as described in Lehmann [1986, Ch. 5]. The idea with invariant tests is to reduce the class of tests considered to those that are naturally invariant with respect to a group of transformations G on the sample space of the data. Given G, a maximal invariant is a statistic \(M(x)\) with the property that any invariant test with respect to G must be a function of \(x\) only through \(M(x)\). Now consider the two-sample problem with \({H}_{0} : {F}_{\mbox{ X}}(x) = {F}_{\mbox{ Y}}(x)\) versus the alternative “F Y is stochastically larger than F X,” that is, \({H}_{a} : 1 - {F}_{\mbox{ Y}}(x) \geq 1 - {F}_{\mbox{ X}}(x)\) for all x with strict inequality for at least one x. This alternative is more general than the usual shift alternative, \({F}_{\mbox{ Y}}(x) = {F}_{\mbox{ X}}(x - \Delta )\), but it certainly includes the shift alternative as a special case. Let G be the group of transformations such that each g ∈ G is continuous and strictly increasing. For this testing problem and group G, the set of ranks of the combined X and Y samples is the maximal invariant statistic. Thus, any invariant test must be a function of the ranks. Does it make sense to require tests to be invariant with respect to monotone transformations? Whenever data are ordinal or we do not trust the measurement scale, then invariance certainly makes sense, and rank tests are the obvious choice.

  3.

    Rank tests may be preferred in many situations because of their Type II error robustness. That is, for an appropriate data generation model, the permutation method can make any statistic Type I error robust (level α), but because rank tests are a function of the data only through the ranks, the influence of outliers is automatically limited. Thus, rank tests are power robust in outlier-prone situations. The key example is the Wilcoxon Rank Sum test, which is powerful in the face of a wide variety of distributional shapes. In fact, Hodges and Lehmann [1956] showed that the asymptotic relative efficiency (ARE) of the Wilcoxon Rank Sum test to the t test satisfies the following:

    a)

      ARE = .955 for normal shift alternatives, and thus the Wilcoxon Rank Sum test loses little in comparison to the t test where the t is best;

    b)

      and ARE ≥ .864 for any continuous unimodal shift alternative with finite variance, and thus the Wilcoxon Rank Sum test can never be much worse than the t test but possibly much better.

    Optimality for permutation and rank procedures is discussed in more detail later.

Although the term “nonparametric” was classically associated with permutation and rank procedures, in recent times it is more commonly used for nonparametric density and regression estimation methods based on smoothing. Thus, when describing rank or permutation procedures, it is best to use the specific names “rank” or “permutation” rather than “nonparametric.” Although permutation tests are inherently defined in terms of randomization, they overlap with a variety of conditional procedures and uniformly most powerful unbiased (UMPU) “Neyman structure similar” tests based on exponential family theory (the most well known is Fisher’s Exact Test).

Permutation procedures are very computationally intensive. These extensive computations prevented widespread use of the method until the 1990s. Thus, asymptotic approximations were dominant until the 1990s, although exact small-sample distributions were tabled for a number of important rank test statistics.

The asymptotic approximations are basically of three kinds: normal approximations based on the Central Limit Theorem, F or beta approximations based on matching permutation moments with normal theory moments, and Edgeworth expansions that improve on the normal approximations. The normal approximations have been used the most due to their simplicity. However, the F approximations initiated by Pitman [1937a,b] and Welch [1937] in the 1930s and updated by Box and Andersen [1955] are generally better for situations where they apply. The Edgeworth approximations are very good for the Wilcoxon Rank Sum and Wilcoxon Signed Rank statistics, but are somewhat more complicated for other statistics and seem not to be in general usage. Thus, we emphasize the F approximations rather than the normal or Edgeworth approximations. In fact these F approximations appear to be underused in general, but the work of Conover and Iman [1981] may have rekindled their use. Asymptotic normal theory remains important for comparing different methods according to asymptotic power, rather than for finding critical values. We give an overview of these results and then a few technical details in an appendix. There are excellent texts such as Hajek and Sidak [1967] and Randles and Wolfe [1979] that carefully explain asymptotic normality proof techniques for rank statistics. We add that most nonparametric texts of the last forty years are mainly about rank statistics, although Lehmann [1975] and Pratt and Gibbons [1981] have portions devoted to permutation tests. Puri and Sen [1971] emphasize the theory of permutation tests in multivariate settings.

In our current situation of extensive computing power, Monte Carlo approximations are the most important alternative to exact calculations. By Monte Carlo approximation we mean random sampling from the set of all permutations. This method can be used for any statistic in a situation where permutation methods are appropriate. Moreover, the error of approximation can be reduced by just adding more replications. This sampling (or resampling) in the “permutation world” is very similar to sampling in the bootstrap world; the main difference is that bootstrap p-values are typically approximate, even in the limit as the number of resamples B goes to ∞. In contrast, the limiting p-value in the permutation world is exact, and even the finite-B estimated p-value has an exact interpretation.

Thus, our treatment of nonparametric methods is quite a bit different from most texts written in the last half of the twentieth century, which have emphasized rank tests and asymptotic normal approximations. We believe the basic permutation approach is the most important idea because it provides Type I error robustness for any statistic. Monte Carlo approximations can handle any problem for which the exact permutation distribution is too difficult to compute. Rank methods are still very important, but now because they provide Type II error robustness (good power in the face of outliers), not because they are easy to use or their distributions are tabled.

We start first with the two-sample problem to illustrate the basic permutation test approach. We then give some general theory for permutation tests along with approximations and discuss optimality results. Then we review results for the most important designs admitting permutation tests, their use in contingency tables, and estimators and confidence procedures derived from inverting permutation and rank tests.

2 A Simple Example: The Two-Sample Location Problem

We illustrate here the basic permutation approach with a simple two-treatment experiment.

A clever middle school student believes that she has discovered a new method for teaching fractions to third graders. To test her hypothesis, she selects six students from her father’s third grade class and randomly assigns four to learn the new method and two to use the standard method. After training, both groups are given twenty test problems. The scores for the standard method group are \({x}_{1} = 6\), \({x}_{2} = 8\) and for the new method group are \({y}_{1} = 7\), \({y}_{2} = 18\), \({y}_{3} = 11\), \({y}_{4} = 9\). The results look promising for the new method, but how shall we assess statistical significance?

One possible test statistic is the standard two-sample t,

$$t(X,Y ) = \frac{\overline{Y } -\overline{X}} {\sqrt{{s}_{p }^{2 }\left ( \frac{1} {m} + \frac{1} {n}\right )}},$$
(12.1)

where \({s}_{p}^{2} =\{ \sum\nolimits {({X}_{i} -\overline{X})}^{2} + \sum\nolimits {({Y }_{j} -\overline{Y })}^{2}\}/(m + n - 2)\). If t is large, then one might be convinced that the new method is better than the standard one.

Another commonly used statistic is W = the sum of the ranks of the Y values when both X and Y samples are thrown together and ranked from smallest to largest. Let \(Z\) denote the joint sample of both \(X\) and \(Y\) together: \(Z = (X,Y )\) with observed values here (6, 8, 7, 18, 11, 9). The ranks of these observed values are then (1, 3, 2, 6, 5, 4) and \(W = 2 + 6 + 5 + 4 = 17\), the sum of the Y ranks. If the new teaching method is better, then on average we would expect W to be large. Assuming that either t or W is a reasonable statistic for our testing problem, we still need to agree on what is a proper reference distribution for each. A simple but very general approach is to recognize that there were actually \(\left({{6}\atop {2}}\right) = 15\) different ways that two students could have been selected from the original six to go in the X sample (with the remaining four assigned to the Y sample). Table 12.1 is a listing of the possible samples and the values of t and W for both.

Table 12.1 All Possible Permutations for Example Data
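Before looking at the full permutation distribution, the observed values t = 1.17 and W = 17 are easy to reproduce. The following R sketch is ours, not the text's, and uses only base R:

x <- c(6, 8)                     # standard method scores
y <- c(7, 18, 11, 9)             # new method scores
m <- length(x); n <- length(y)
sp2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2))/(m + n - 2)
t0 <- (mean(y) - mean(x))/sqrt(sp2*(1/m + 1/n))   # two-sample t of (12.1)
W0 <- sum(rank(c(x, y))[(m + 1):(m + n)])         # sum of the Y ranks
round(t0, 2)    # 1.17
W0              # 17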

If the treatments produce identical results, then the outcomes for each student would have been exactly the same for any of the 15 possible randomizations. Thus, a suitable reference distribution for t or W is just the possible 15 values of t or W along with the probability 1/15 of each. This reference distribution for t, called the permutation distribution, is in Table 12.2.

Table 12.2 Permutation Distribution of t

Note that the permutation distribution of t is discrete even when sampling from a continuous distribution. (Here the distribution of the data is also discrete because the possible test scores are 0, 1, …, 20.)

Using the distribution in Table 12.2, a conditional test for this experiment with \(\alpha= 1/15\) would be to reject if t ≥ 1.47. A one-sided p-value for the observed value of t = 1.17 is 2/15. Similarly a conditional \(\alpha= 1/15\) level test based on the rank sum W would reject if W ≥ 18, and the one-sided p-value is 2/15.
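Both p-values can be checked by brute force, since there are only 15 assignments to enumerate. This sketch is ours (it assumes the x, y, t0, and W0 objects from the snippet above):

z <- c(x, y)
idx <- combn(6, 2)    # each column gives the positions assigned to X
tvals <- apply(idx, 2, function(i) {
  xs <- z[i]; ys <- z[-i]
  sp2 <- (sum((xs - mean(xs))^2) + sum((ys - mean(ys))^2))/4
  (mean(ys) - mean(xs))/sqrt(sp2*(1/2 + 1/4))
})
Wvals <- apply(idx, 2, function(i) sum(rank(z)[-i]))
mean(tvals >= t0)    # 2/15 = 0.133
mean(Wvals >= W0)    # 2/15 = 0.133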

In general, the tests based on t and W would not give exactly the same results. For example, suppose the original data had been the 14th permutation, (7,9,6,18,11,8). Then the permutation p-value for t would be \(5/15 =.33\), whereas the permutation p-value for W would be \(6/15 =.40\). Note, however, the column in Table 12.1 (p. 453) for the sum of the Y values. Comparing the \(\sum {Y }_{i}\) and t values, one can see that the permutation p-values from \(\sum {Y }_{i}\) and t are identical if the original data had been any of the 15 permutations. In such a case, we say that the two statistics are permutationally equivalent because they give exactly the same testing results.

In Problem 12.1 (p. 523) we ask for the permutation distribution of W from Table 12.1 (p. 453). A unique feature of rank statistics when there are no ties in the data is that the permutation distribution is the same for every such data set. That is, although the data values would change for every data set, as long as there are no ties in the 6 data points, the ranks would always be (1,2,3,4,5,6). Thus, the results for W in Table 12.1 (p. 453) would be exactly the same except in a different order, and therefore the distribution would be the same. This is one reason that rank statistics gained popularity: without ties, the exact distribution does not change and can then be tabled for easy lookup.

For simplicity we purposely started with a data set having no ties. However, ties occur frequently in real data, even in continuous data settings, due to rounding or inaccurate measurement. The standard way to rank data with ties is to assign the average rank to each of a set of tied values. For example, suppose our second X data point had been 7 instead of 8. Then the Z vector would have been (6,7,7,18,11,9), and instead of (1,3,2,6,5,4) for the ranks we would have (1,2.5,2.5,6,5,4). These are now called the midranks. We have taken the values 7 and 7 that would have occupied ranks 2 and 3 and replaced them by \((2 + 3)/2 = 2.5\). If the first X data point had also been a 7, then the midrank vector would have been (2,2,2,6,5,4), where we have used \((1 + 2 + 3)/3 = 2\) for the first three midranks. The use of midranks has no effect on the general permutation approach, but tabling distributions as mentioned in the previous paragraph is no longer possible since every configuration of tied values has a different permutation distribution.
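In R, midranks are the default behavior of rank() (ties.method = "average"); this one-line check is ours, not the text's:

rank(c(6, 7, 7, 18, 11, 9))    # 1.0 2.5 2.5 6.0 5.0 4.0
rank(c(7, 7, 7, 18, 11, 9))    # 2 2 2 6 5 4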

3 The General Two-Sample Setting

The two-sample problem assumes that N experimental units (rats, for example) are available to compare two treatments A and B. First, m units are randomly assigned to receive treatment A, and the \(n = N - m\) remaining units are assigned to receive treatment B. After the experiment is run, we obtain realizations of some measurement X 1, …, X m for treatment A and Y 1, …, Y n for treatment B. The null hypothesis H 0 is that both treatments are the same or have identical effects on the rats. In other words, if the third rat in group A, whose measurement is X 3, had been assigned to group B instead, then X 3 would still have been the result under H 0 for that rat, but now it would have a Y label. In fact, we can think of all possible \(\left({{N}\atop {m}}\right)\) random assignments of m rats to group A and n rats to group B, and assume that under H 0 the individual results would be the same regardless of group assignment.

We might then formulate a test procedure as follows.

  1.

    Randomly assign m units to A and n units to B.

  2.

    Run the experiment to obtain X 1, …, X m and Y 1, …, Y n .

  3.

    Think of the collection \(Z = ({X}_{1},\ldots,{X}_{m},{Y }_{1},\ldots,{Y }_{n})\) as fixed and order the \(M_{N} = \left({{N}\atop {m}}\right)\) values of some statistic T calculated for each \({Z}^{{_\ast}}\) obtained by permuting \(Z\) to have different sets of m first coordinates. Call these ordered values \({T}_{(1)} \leq{T}_{(2)} \leq \ldots\leq{T}_{({M}_{N})}\), and let \({T}_{0} = T(X,Y )\) be the statistic calculated for the original data.

  4.

    Reject H 0 if \({T}_{0} > {T}_{(k)}\).

This test, conditional on \(Z\), has conditional α-level

$$1 - \frac{k} {{M}_{N}}$$

if \({T}_{(k)} < {T}_{(k+1)}\) (not tied) since \({M}_{N} - k\) values of T are larger than \({T}_{(k)}\). The exact conditional p-value is the proportion of values greater than or equal to T 0,

$$\frac{[\#{T}_{(i)} \geq{T}_{0}]} {{M}_{N}}.$$
(12.2)

When T is the t statistic in (12.1, p. 452), the above two-sample permutation procedure was proposed by Pitman [1937a]. The credit for the permutation approach, however, goes to R. A. Fisher, who had earlier introduced it in the fifth edition of Statistical Methods for Research Workers (2 ×2 table example) published in 1934 and in the first edition of The Design of Experiments (one-sample t example) in 1935.

Besides computational problems, the main drawbacks of the procedure described in steps 1-4 above are that:

a):

the results pertain to theN units obtained and not to a larger population;

b):

computations of test power are difficult.

Thus, it is often useful to assume a population sampling model of the usual form

$${X}_{1},\ldots,{X}_{m}\;\;\;\mbox{ iid}\;\;{F}_{\mbox{ X}}(x) = P({X}_{1} \leq x),$$
$${Y }_{1},\ldots,{Y }_{n}\;\;\;\mbox{ iid}\;\;{F}_{\mbox{ Y}}(x) = P({Y }_{1} \leq x),$$

with \({H}_{0} : {F}_{\mbox{ X}}(x) = {F}_{\mbox{ Y}}(x)\). Under this model we can show that the conditional permutation test actually has exact size α unconditionally, i.e.,

$$P(\mbox{ rejection}\mid {H}_{0}) = \alpha.$$

The permutation approach has the advantage that no assumption regarding distributions of random variables is required. Moreover, one can often show using permutational Central Limit Theorems (e.g., Theorem 12.2, p. 465) that the conditional distribution of \(T(X,Y )\), properly standardized, converges to a standard normal as min(m, n) → ∞. Thus, in large samples one can use normal critical values rather than list all M N possible values of T. Alternatively, one can randomly sample B of the possible permutations and base a test on the ordered values of \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\). First we give the general theory of permutation tests and then discuss these approximations as well as the Box-Andersen F approximation.

4 Theory of Permutation Tests

4.1 Size α Property of Permutation Tests

In this subsection we show that permutation tests used in random sampling contexts can have exact size α when randomizing on rejection region boundaries, and otherwise have level α when the test is carried out without such randomization. Recall that a size α test is one for which \({\sup }_{{H}_{0}}P(\mbox{ reject }{H}_{0}) = \alpha \) and level α means \({\sup }_{{H}_{0}}P(\mbox{ reject }{H}_{0}) \leq\alpha \). The reference to randomization merely refers to flipping a biased coin for sample points on the boundary between the rejection and acceptance regions in order to obtain size α and has nothing to do with the randomization used in the definition of a permutation test.

To prove size-α results rigorously, we need some additional notation. Two useful sources are Hoeffding [1952] and Puri and Sen [1971]. Let \(Z = {({Z}_{1},\ldots,{Z}_{N})}^{T}\) have joint distribution function \({F}_{Z}(z)\) and sample space S. Let G be a group of M N transformations of S onto S such that under H 0 the distribution of each \({g}_{i}(Z)\), g i  ∈ G, i = 1, …, M N , is exactly the same as the distribution of \(Z\). Two examples of such groups are as follows.

Permutations: G consists of all N! permutations of \(Z\). If \(Z\) is exchangeable or iid, then \({g}_{i}(Z)\stackrel{d}{=}Z\). However, in the two-sample problem (two independent samples), we usually consider only the \(\left({{N}\atop {m}}\right)\) partitions into two groups since the statistics used do not change by permuting elements within each sample. In the k-sample problem (k independent samples), we consider only the

$$\left ({ N \atop {n}_{1}{n}_{2}\ldots {n}_{k}} \right ) = \frac{N!} {{n}_{1}!\cdots {n}_{k}!}$$

partitions into k groups, where \({n}_{1} + {n}_{2} + \cdots+ {n}_{k} = N\). The group of N! permutations is relevant for the two-sample, k-sample, and correlation problems.

Sign Changes: G consists of all \({2}^{N}\) sign change transformations, \({g}_{1}(Z) = ({Z}_{1},{Z}_{2},\ldots,{Z}_{N})\), \({g}_{2}(Z) = (-{Z}_{1},{Z}_{2},\ldots,{Z}_{N})\), \({g}_{3}(Z) = ({Z}_{1},-{Z}_{2},{Z}_{3},\ldots,{Z}_{N})\), etc. If the Z i ’s are independently (but not necessarily identically) distributed, where each Z i is symmetrically distributed about 0, then \({g}_{i}(Z)\stackrel{d}{=}Z\). The sign change group is relevant for the paired two-sample problem and the one-sample symmetry problem.

The following development is due to Hoeffding [1952]. Because the permutation distribution is discrete, it is not possible to achieve arbitrarily chosen α-levels like α = .05 without using a randomized testing procedure. This makes the details seem harder than they really are.

Let \(T(z)\) be a real-valued function on S such that for each \(z \in S\)

$${T}_{(1)}(z) \leq{T}_{(2)}(z) \leq \cdots \leq {T}_{({M}_{N})}(z)$$

are the ordered values of \(T({g}_{i}(z)),i = 1,\ldots,{M}_{N}\). Given α, 0 < α < 1, let k be defined by

$$k = {M}_{N} - [{M}_{N}\alpha ],$$

where [ ⋅] is the greatest integer function. Let \({M}_{N}^{+}(z)\) and \({M}_{N}^{0}(z)\) be the numbers of \({T}_{(j)}(z),j = 1,\ldots,{M}_{N},\) which are greater than \({T}_{(k)}(z)\) and equal to \({T}_{(k)}(z)\), respectively. Define

$$a(z) = \frac{{M}_{N}\alpha- {M}_{N}^{+}(z)} {{M}_{N}^{0}(z)}.$$

Then define the test function\(\phi (z)\) by

$$\phi (z) = \left \{\begin{array}{ll} 1, &\;\;\mbox{ if}\ T(z) > {T}_{(k)}(z); \\ a(z),&\;\;\mbox{ if}\ T(z) = {T}_{(k)}(z); \\ 0, &\;\;\mbox{ if}\ T(z)< {T}_{(k)}(z). \end{array} \right.$$

Note that for a test function, \(\phi (z) = 1\) means rejection of H 0, \(\phi (z) = 0\) means acceptance of H 0, and \(\phi (z) = \pi \) means to reject H 0 randomly with probability π. The test defined by ϕ is an exact conditional level α test by construction. The following theorem tells us that under \({g}_{i}(Z)\stackrel{d}{=}Z\) for each g i  ∈ G, the test is unconditionally a size-α test.

Theorem 12.1.

(Hoeffding). Let the data \(Z = ({Z}_{1},\ldots,{Z}_{N})\) and the group G of transformations be such that \({g}_{i}(Z)\stackrel{d}{=}Z\) for each g i ∈ G under H 0 . Then the test defined above by \(\phi (Z)\) has size α.

Proof.

First note that by the definition of \(a(z)\) and ϕ, we have for each \(z \in S\)

$$\frac{1} {{M}_{N}} \sum\limits _{i=1}^{{M}_{N} }\phi ({g}_{i}(z)) = \frac{{M}_{N}^{+}(z) + a(z){M}_{N}^{0}(z)} {{M}_{N}} = \alpha.$$

Now since \({g}_{i}(Z)\stackrel{d}{=}Z\) and G is a group, \({\mbox{ E}}_{{H}_{0}}\phi (Z) ={ \mbox{ E}}_{{H}_{0}}\phi ({g}_{i}(Z))\) for each i, and

$$\begin{array}{rcl}{ P}_{{H}_{0}}(\mbox{ rejection}) = {E}_{{H}_{0}}\phi (Z)& =& \frac{1} {{M}_{N}} \sum\limits _{i=1}^{{M}_{N} }\mathrm{{E}}_{{H}_{0}}\phi ({g}_{i}(Z)) \\ & =&{ \mbox{ E}}_{{H}_{0}}\left [ \frac{1} {{M}_{N}} \sum\limits _{i=1}^{{M}_{N} }\phi ({g}_{i}(Z))\right ] = \alpha \end{array}$$

The above proof is deceptively simple. The key fact that makes it work is that E\({}_{{H}_{0}}\phi ({g}_{i}(Z))\) is the same for each g i , including the identity \(g(Z) = Z\). This fact rests on the identical distribution of \({g}_{i}(Z)\) for each i and on the group nature of G. The identical distribution requirement is intuitive, but why do we need G to be a group? Recall that the test procedure consists of computing T for each member of G and then rejecting if \(T(Z)\) is larger than an order statistic of the \(T({g}_{i}(Z))\) values. Now \(\phi ({g}_{i}(Z))\) is the test that computes \(T({g}_{j}({g}_{i}(Z)))\), j = 1, …, M N , orders all of them, and rejects if \(T({g}_{i}(Z))\) is larger than one of the ordered values. If G is not a group, then the set of ordered values will not be the same for each test \(\phi ({g}_{i}(Z))\) because g j (g i ) will not be in G for some i and j. Since the sets of ordered values could be different, there would be no basis for believing that a test based on \({g}_{i}(Z)\) would have the same expectation as that based on \(Z\).

Note also that the use of \(a(z)\) in \(\phi (z)\) is a way of randomizing to get an exact size-α test. In practice we might just define \(\phi (z)\) to be one if \(t(z) > {t}_{(k)}(z)\) and zero otherwise. The resulting unconditional level is a weighted average of the discrete levels less than or equal to α and will usually be less than α.

The conditional test procedure described in steps 1-4 may be used for any test statistic, but the rejection region in Step 4 should be modified to correspond to the situation. For example, the alternative hypothesis might be that the mean of A is less than that of B. We would then look for small values of t. Or the test could be two-sided, and we would reject if \(t < {t}_{(k)}\) or if \(t > {t}_{(m)}\).

4.2 Permutation Moments of Linear Statistics

The exact permutation distribution may be difficult to compute. For certain linear statistics, though, we can calculate the moments of the permutation distribution quite easily. These moments are then used in the various normal and F approximations found in later sections.

We consider general results for situations associated with the group of transformations consisting of all permutations. These situations include the two-sample and k-sample situations, and bivariate data (X 1, Y 1), …, (X N , Y N ) where correlation and regression of Y on X are of interest. Let \(a = ({a}_{1},\ldots,{a}_{N})\) and \(c = ({c}_{1},\ldots,{c}_{N})\) be two vectors of real constants. We select a random permutation of the a values, call them A 1, …, A N , and form the statistic

$$T = \sum\limits _{i=1}^{N}{c}_{ i}{A}_{i}.$$
(12.3)

In applications \(a\) is actually the observed vector \(Z\) (or a function of \(Z\) such as the rank vector), and \(c\) is chosen for the particular problem at hand. For example, in the two-sample problem, with \(a = Z\) and c i  = 0 for i = 1, …, m and 1 otherwise, the observed value of T for the original data is \(\sum\nolimits _{i=1}^{n}{Y }_{i}\), and here \(T = \sum\limits _{i=m+1}^{N}{A}_{i}\) is a sum of the last n elements of a random permutation of \(Z\). A very important subclass of (12.3) are the linear rank statistics given in the next section.

Assuming that each permutation of \(A\) is equally likely and thus has probability \(1/N!\), it is easy to see that

$$P({A}_{i} = {a}_{s}) = \frac{1} {N}\quad \mbox{ for}\;s = 1,\ldots,N,$$

and

$$P({A}_{i} = {a}_{s},{A}_{j} = {a}_{t}) = \frac{1} {N(N - 1)}\quad \mbox{ for}\;s\neq t = 1,\ldots,N.$$

Then, using those two results, we get

$$\mbox{ E}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{a}_{ i} \equiv \overline{a},\quad \mbox{ for}\;i = 1,\ldots,N,$$
$$\mbox{ Var}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{({a}_{ i} -\overline{a})}^{2},\quad \mbox{ for}\;i = 1,\ldots,N,$$

and

$$\mbox{ Cov}({A}_{i},{A}_{j}) = \frac{-1} {N(N - 1)}\sum\limits _{i=1}^{N}{({a}_{ i} -\overline{a})}^{2},\quad \mbox{ for}\;i\neq j = 1,\ldots,N.$$

Finally, putting these last three results together, we get

$$\mbox{ E}(T) = N\overline{c}\;\overline{a},$$

and

$$\mbox{ Var}(T) = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{({c}_{ i} -\overline{c})}^{2} \sum\limits _{j=1}^{N}{({a}_{ j} -\overline{a})}^{2},$$
(12.4)

where \(\overline{a}\) and \(\overline{c}\) are the averages of the a’s and c’s, respectively. These first two moments of T are sufficient for normal approximations based on the asymptotic normality of T as N → ∞. In some cases it may be of value to use more complex approximations involving the third and fourth moments of T. Thus, the central third moment is

$$\mathrm{E}\{T -\mathrm{ E}{(T)\}}^{3} = \frac{N} {(N - 1)(N - 2)}\sum\limits _{i=1}^{N}{({c}_{ i} -\overline{c})}^{3} \sum\limits _{j=1}^{N}{({a}_{ j} -\overline{a})}^{3},$$

and the standardized third moment (skewness coefficient) is

$$\mathrm{Skew}(T) = \frac{\mathrm{E}\{T -\mathrm{ E}{(T)\}}^{3}} {\{\mathrm{Var}{(T)\}}^{3/2}} = \frac{{(N - 1)}^{1/2}} {(N - 2)} \frac{{\mu }_{3}(c){\mu }_{3}(a)} {\{{\mu }_{2}(c){\mu }_{2}{(a)\}}^{3/2}},$$

where we have introduced the notation \({\mu }_{q}(c) = {N}^{-1} \sum\limits _{i=1}^{N}{({c}_{i} -\overline{c})}^{q}\) for q ≥ 2. Similarly the standardized central fourth moment (kurtosis coefficient) is

$$\begin{array}{rcl} \mathrm{Kurt}(T) = \frac{\mathrm{E}\{T -\mathrm{ E}{(T)\}}^{4}} {\{\mathrm{Var}{(T)\}}^{2}} & =& \frac{(N + 1)(N - 1)} {N(N - 2)(N - 3)} \frac{{\mu }_{4}(c){\mu }_{4}(a)} {\{{\mu }_{2}(c){\mu }_{2}{(a)\}}^{2}} \\ & -& \frac{3{(N - 1)}^{2}} {N(N - 2)(N - 3)}\left [ \frac{{\mu }_{4}(c)} {\{{\mu }_{2}{(c)\}}^{2}} + \frac{{\mu }_{4}(a)} {\{{\mu }_{2}{(a)\}}^{2}}\right ] \\ & +& \frac{3({N}^{2} - 3N + 3)(N - 1)} {N(N - 2)(N - 3)} \end{array}$$
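These formulas are easy to check numerically. The following sketch (ours, not the text's) compares the exact mean and variance from (12.4) with Monte Carlo estimates over random permutations, using the two-sample choice of c and the example data of Section 2:

set.seed(1)
a  <- c(6, 8, 7, 18, 11, 9)    # observed Z plays the role of a
cc <- c(0, 0, 1, 1, 1, 1)      # two-sample regression constants
N  <- 6
Tsim <- replicate(1e5, sum(cc * sample(a)))    # T = sum c_i A_i
c(mean(Tsim), N * mean(cc) * mean(a))                                # both near 39.33
c(var(Tsim), sum((cc - mean(cc))^2) * sum((a - mean(a))^2)/(N - 1))  # both near 25.29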

4.3 Linear Rank Tests

Many popular rank tests have the general form

$$T = \sum\limits _{i=1}^{N}c(i)a({R}_{ i})$$
(12.5)

of a linear rank statistic, where c(1), …, c(N) are called the regression constants and a(1), …, a(N) are called the scores, and \(R\) is the vector of ranks (possibly midranks due to ties) of some data vector \(Z\). There is room for confusion here in the use of the notation for \(c\) and \(a\), because in the general notation of the last section, (c 1, …, c N ) and (a 1, …, a N ) are vectors of real numbers, but here c( ⋅) and a( ⋅) are functions so that \({c}_{1} = c(1),\ldots,{c}_{N} = c(N)\) and \({a}_{1} = a(1),\ldots,{a}_{N} = a(N)\). This function notation just makes it easier to work with rank statistics. In particular, the score functions a( ⋅) are typically derived from scores generating functions ϕ via \(a(i) = \phi (i/(N + 1))\). In tied rank situations, a( ⋅) needs to be defined for non-integer values.

The simplest setting is the two-sample problem where \({Z}^{T} = ({X}_{1},\ldots,{X}_{m},\) \({Y }_{1},\ldots,{Y }_{n})\) and the c values are all zeroes for the Xs and ones for the Y s, or vice-versa. A different situation covered by T, though, is for trend alternatives, where c(1), …, c(N) are the integers 1, …, N and \(T = \sum\limits _{i=1}^{N}i{R}_{i}\) will tend to be large when Z i + 1 tends to be larger than Z i . A related problem is for N independent pairs (X 1, Y 1), …, (X N , Y N ). Here, tests based on Spearman’s Correlation (Section 12.7, p. 487) are equivalent to ones having the same null distribution as \(T = \sum\limits _{i=1}^{N}i{R}_{i}\).

Clearly T in (12.5) is a subclass of the linear permutation statistics given in (12.3, p. 458). Thus results for that class are inherited by T. For example, if \(R\) is uniformly distributed on the permutations of 1, …, N (no tied ranks), then

$$\mbox{ E}(T) = N\overline{c}\;\overline{a},$$

and

$$\mbox{ Var}(T) = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{(c(i) -\overline{c})}^{2} \sum\limits _{j=1}^{N}{(a(j) -\overline{a})}^{2},$$

where of course \(\overline{c}\) and \(\overline{a}\) are the means of the c and a values, respectively. For a tied rank situation with observed vector of midranks \(R\), the expressions above still hold but with a(j) replaced by a(R j ).

For deciding on a score function in a given problem, we first select a parametric family and then derive an optimal score function for that family. An overview of how to do this is given in Section 12.5 (p. 473). The most important linear rank statistic is the Wilcoxon Rank Sum, so we give a few more details about it in the next section.

4.4 Wilcoxon-Mann-Whitney Two-Sample Statistic

For two independent samples X 1, …, X m and Y 1, …, Y n , Wilcoxon [1945] introduced the linear rank statistic

$$W = \sum\limits _{i=m+1}^{N}{R}_{ i},$$
(12.6)

where R 1, …, R N are the joint rankings of \(Z = {({X}_{1},\ldots,{X}_{m},{Y }_{1},\ldots,{Y }_{n})}^{T}\), \(N = m + n\). The Wilcoxon Rank Sum test has a number of optimal properties that are mentioned in Section 12.5 (p. 473). Along with the Wilcoxon Signed Rank test for paired data (Section 12.8.3), it is the simplest and most important rank test.

Independently, Mann and Whitney [1947] proposed the equivalent statistic

$${W}_{\mathrm{YX}} = \sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}I({Y }_{ j}< {X}_{i}),$$
(12.7)

where I( ⋅) is the indicator function. In the absence of ties \({W}_{\mathrm{YX}} = mn + n(n + 1)/2 - W\). Another equivalent version is

$${W}_{\mathrm{XY}} = \sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}I({Y }_{ j} > {X}_{i}),$$
(12.8)

with \({W}_{\mathrm{XY}} = W - n(n + 1)/2\). We prefer this latter version and define the U-statistic estimator of \({\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1})\),

$$\widehat{{\theta }}_{\mathrm{XY}} = \frac{{W}_{\mathrm{XY}}} {mn} = \frac{1} {mn}\sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}I({Y }_{ j} > {X}_{i}).$$
(12.9)

In a clinical trial, θXY can be viewed as the probability of a more favorable response for a randomly selected patient getting Treatment 2 compared to another patient getting Treatment 1. For screening tests where a “positive” is declared if Y > c for a diseased subject or if X > c for a non-diseased subject, θXY is the area under the receiver operating characteristic (ROC) curve. This interpretation is developed in Problem 12.8 (p. 525).
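For the teaching example of Section 2, the relationships among W, W YX, W XY, and \(\widehat{{\theta }}_{\mathrm{XY}}\) are easy to verify numerically; this sketch is ours, not the text's:

x <- c(6, 8); y <- c(7, 18, 11, 9); m <- 2; n <- 4
W   <- sum(rank(c(x, y))[(m + 1):(m + n)])         # 17
Wxy <- sum(outer(x, y, function(xi, yj) yj > xi))  # 7 = W - n(n+1)/2
Wyx <- sum(outer(x, y, function(xi, yj) yj < xi))  # 1 = mn + n(n+1)/2 - W
theta.hat <- Wxy/(m*n)                             # 0.875 estimates P(Y > X)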

For hand computations, W is much easier to handle than these U-statistic versions. The null moments follow easily from Section 12.4.2 (p. 458) after noting that \(c(1) = \cdots= c(m) = 0\) and \(c(m + 1) = \cdots= c(N) = 1\) lead to \(\overline{c} = n/N\) and \(\sum\limits _{i=1}^{N}{(c(i) -\overline{c})}^{2} = mn/N\). The null mean is \(n(N + 1)/2\) whether there are ties or not. The variance follows from (12.4, p. 459). With no ties, we have

$$\mbox{ Var}(W) = \frac{mn(N + 1)} {12}.$$
(12.10)

With ties, so that (R 1, …, R N ) are the tied ranks, we have

$$\mathrm{Var}(W) = \frac{mn} {N(N - 1)}\left \{\sum\limits _{i=1}^{N}{R}_{ i}^{2} -\frac{N{(N + 1)}^{2}} {4} \right \}.$$
(12.11)

Lehmann [1975, p. 20] gives a different expression for the variance ofW in the face of ties,

$$\mathrm{Var}(W) = \frac{mn(N + 1)} {12} -\frac{mn\sum\limits _{i=1}^{e}({d}_{i}^{3} - {d}_{i})} {12N(N - 1)},$$
(12.12)

where e is the number of tied groups, and d i is the number of tied observations in group i. For example, with the simple example data modified to ({6, 7}, {7, 18, 11, 9}), the midranks are (1, 2.5, 2.5, 6, 5, 4) and e = 1, d 1 = 2 (counting only groups with ties, since untied groups contribute nothing to the sum); so \(\mathrm{Var}(W) = (2)(4)(6 + 1)/12 - (2)(4)[{2}^{3} - 2]/[12(6)(5)] = 4.53\). Expression (12.12) may be easier to use by hand than (12.11), but its main value may be to show that the variance of W for tied data is always smaller than (12.10) for untied data.
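Both variance formulas are easily checked in R; the sketch below (ours) reproduces Var(W) = 4.53 for the modified data:

R <- rank(c(6, 7, 7, 18, 11, 9))   # midranks (1, 2.5, 2.5, 6, 5, 4)
m <- 2; n <- 4; N <- 6
v11 <- m*n/(N*(N - 1)) * (sum(R^2) - N*(N + 1)^2/4)        # eq. (12.11)
d <- table(R)                      # group sizes; d^3 - d = 0 for untied groups
v12 <- m*n*(N + 1)/12 - m*n*sum(d^3 - d)/(12*N*(N - 1))    # eq. (12.12)
c(v11, v12)    # both 4.5333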

The U-statistic versions in (12.7)–(12.9) are useful for easy calculation of moments and derivation of asymptotic normality under non-null distributions. For example, using equation (3.4.7, p. 91) of Randles and Wolfe [1979] for the variance of a two-sample U-statistic from independent iid samples, we have that

$$\begin{array}{rcl} \mathrm{\mathrm{Var}}(\widehat{{\theta }}_{\mathrm{XY}}) = \frac{1} {mn}\left \{(m - 1)({\gamma }_{0,1} - {\theta }_{\mathrm{XY}}^{2}) + (n - 1)({\gamma }_{ 1,0} - {\theta }_{\mathrm{XY}}^{2}) + {\gamma }_{ 1,1} - {\theta }_{\mathrm{XY}}^{2}\right \},& & \\ & &\end{array}$$
(12.13)

where in the absence of ties \({\gamma }_{0,1} = P({Y }_{1} > {X}_{1},{Y }_{1} > {X}_{2})\), \({\gamma }_{1,0} = P({Y }_{1} > {X}_{1},{Y }_{2} > {X}_{1})\), and \({\gamma }_{1,1} = {\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1})\). If the X and Y samples have identical continuous distributions, then it is easy to show that \({\gamma }_{0,1} = {\gamma }_{1,0} = 1/3\) and \({\gamma }_{1,1} = {\theta }_{\mathrm{XY}} = 1/2\), and (12.13) reduces to (12.10).

In the presence of ties, the U-statistic quantities need to be modified by adding \(I({Y }_{j} = {X}_{i})/2\) to the indicators in the sums. For example,

$$\widehat{{\theta }}_{\mathrm{XY}} = \frac{{W}_{\mathrm{XY}}} {mn} = \frac{1} {mn}\sum\limits _{i=1}^{m} \sum\limits _{j=1}^{n}\left \{I({Y }_{ j} > {X}_{i}) + I({Y }_{j} = {X}_{i})/2\right \}.$$
(12.14)

The relationships \({W}_{\mathrm{YX}} = mn + n(n + 1)/2 - W\) and \({W}_{\mathrm{XY}} = W - n(n + 1)/2\) then continue to hold. The definitions of γ0, 1, γ1, 0, and γ1, 1 for use in (12.13) have to be modified in the face of ties; see, for example, Boos and Brownie [1992, p. 72]. In the next section we give the basic asymptotic normal results for linear statistics under the null hypothesis of identical populations. Those general results are useful for approximate critical regions for permutation and rank statistics. However, the Wilcoxon statistics are special because they are related to the U-statistic \(\widehat{{\theta }}_{\mathrm{XY}}\) for which a large body of theory exists. In particular, \(\widehat{{\theta }}_{\mathrm{XY}}\) is AN\(\left \{{\theta }_{\mathrm{XY}},\mathrm{Var}(\widehat{{\theta }}_{\mathrm{XY}})\right \}\), and this follows from basic U-statistic theory with no assumptions except that X 1, …, X m are iid with any distribution function F(x), and Y 1, …, Y n are iid with any distribution function G(x). Because this asymptotic result is not just for null situations, it helps us think about i) the form of the alternative hypothesis, ii) the classes of distribution functions for which the Wilcoxon Rank Sum is consistent, in other words, rejects with probability converging to 1, and iii) asymptotic power and sample size determination. We now discuss these ideas.

In general, the null hypothesis of interest is

$${H}_{0} : F(x) = G(x),\;\mbox{ each }x \in(-\infty,\infty ).$$

However, the alternative hypothesis can be formulated in several ways. The most common way is to assume the shift model \(G(x) = F(x - \Delta )\), and then the alternative hypothesis is purely in terms of Δ, for example

$${H}_{1} : \Delta> 0.$$

Another popular, more nonparametric, way to phrase the alternative is

$${H}_{2} : F(x) \geq G(x),\;\mbox{ each }x \in(-\infty,\infty ),$$

and with strict inequality for at least one x. Here, G is said to be stochastically larger than F. Clearly, H 2 is a larger class of alternatives since (F, G) ∈ H 1 implies (F, G) ∈ H 2. Lastly, the natural alternative when thinking in terms of \(\widehat{{\theta }}_{\mathrm{XY}}\) is

$${H}_{3} : {\theta }_{\mathrm{XY}} > \frac{1} {2}.$$

Now if F and G are continuous distribution functions and (F, G) ∈ H 2, then (F, G) ∈ H 3. This follows from

$${\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1}) = \int\nolimits \nolimits \int\nolimits \nolimits I(y > x)\,dF(x)\,dG(y) = \int\nolimits \nolimits \{1 - G(x)\}\,dF(x),$$

after noting that if continuous distribution functions satisfy F(x) > G(x) for at least one x, then this strict inequality must hold for an interval of x values, and \(\int\nolimits \nolimits F(x)\,dF(x) = 1/2\). Assuming that H 3 holds, the Wilcoxon Rank Sum test is consistent because of the general asymptotic normality result mentioned above. This also means that it is consistent under alternatives H 1 and H 2.

Lastly, following Noether [1987], the approximate power of a one-sided α level test when \({\theta }_{\mathrm{XY}} > 1/2\) is given by

$$1 - \Phi \left \{\frac{1/2 - {\theta }_{\mathrm{XY}}} {\rho {\sigma }_{0}} + \frac{{\Phi }^{-1}(1 - \alpha )} {\rho } \right \},$$
(12.15)

where σ0 is the square root of the null variance of W (12.10, p. 462), ρ is the ratio of the square root of the non-null variance of W (\({m}^{2}{n}^{2}\) times eq. 12.13, p. 462) to σ0, and Φ is the standard normal distribution function. Typically, ρ is close to 1. Letting ρ = 1 and m = λN, the total sample size N required to have power 1 − β for alternative θXY is given by Noether [1987] to be

$$N = \frac{{\left \{{\Phi }^{-1}(1 - \alpha ) + {\Phi }^{-1}(1 - \beta )\right \}}^{2}} {12\lambda (1 - \lambda ){({\theta }_{\mathrm{XY}} - 1/2)}^{2}}.$$
(12.16)

This is a fairly simple formula, but it might be preferable to state power and sample size in terms of the shift model. Plugging in \(G(x) = F(x - \Delta )\), we have

$${\theta }_{\mathrm{XY}} = P({Y }_{1} > {X}_{1}) = \int\nolimits \nolimits \{1 - F(x - \Delta )\}\,dF(x).$$

For example, if we wanted shifts of size Δ ∕ σ in a normal(μ, σ 2) population, then a simple R program to get θXY using the midpoint rule is

theta.xy <- function(delta, n = 10000){
  # U-statistic parameter theta_XY for a normal shift of delta/sigma
  # (computed for sigma = 1); n is the number of points for the midpoint rule
  points <- (2*(1:n) - 1)/(2*n)
  mean(1 - pnorm(qnorm(points) - delta))
}

If\(\Delta /\sigma=.5\), then

  > theta.xy(.5,10000)

  [1] 0.6381632

so that θXY = .638. Choosing α = .05, power 1 − β = .80, and \(\lambda= 1/2\), we find N = 108, or \(m = n = 54\).
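Formula (12.16) is just as easy to program. The sketch below is ours (the function name noether.N is our choice); it reuses theta.xy from above:

noether.N <- function(alpha, power, theta, lambda = 1/2){
  # total sample size N of (12.16) for a one-sided alpha-level test
  (qnorm(1 - alpha) + qnorm(power))^2 /
    (12*lambda*(1 - lambda)*(theta - 1/2)^2)
}
ceiling(noether.N(.05, .80, theta.xy(.5)))    # 108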

4.5 Asymptotic Normal Approximation

Approximate normal distributions for linear statistics have been the most popular approximation to permutation distributions, especially for rank statistics. Here we use the following permutation Central Limit Theorem for \(T = \sum\limits _{i=1}^{N}{c}_{i}{A}_{i}\), introduced in (12.3, p. 458), taken directly from Puri and Sen [1971, p. 73], who give credit to Wald and Wolfowitz [1944], Noether [1949], and Hoeffding [1951]. The notation \({\mu }_{q}(c)\) is for the qth central moment \({N}^{-1} \sum\limits _{i=1}^{N}{({c}_{i} -\overline{c})}^{q}\).

Theorem 12.2 (Wald-Wolfowitz-Noether-Hoeffding). 

If for N →∞

(i):
$$\frac{{\mu }_{q}(c)} {{\mu }_{2}{(c)}^{q/2}} = O(1)\;\;\;\mbox{ for all}\;q = 3,4,\ldots $$
(ii):
$$\frac{{\mu }_{q}(a)} {{\mu }_{2}{(a)}^{q/2}} = o({N}^{q/2-1})\;\;\;\mbox{ for all}\;q = 3,4,\ldots,$$

then

$$\frac{T -\mbox{ E}(T)} {\sqrt{\mbox{ Var} (T)}}\stackrel{d}{\rightarrow }N(0,1).$$

In a particular problem either or both of the vectors \(c\) and \(a\) may be random, that is, calculated from the data \(Z\). In such cases we would need to show that the appropriate conditions (i) and/or (ii) hold wp1 (with probability one) with respect to the random vector \(Z\). Moreover, the conclusion of Theorem 12.2 is that the permutation distribution of the standardized T converges to a standard normal distribution with probability one with respect to \(Z\).

In the case of linear rank statistics without ties, we can give a much simpler theorem due to Hajek [1961]. We follow the exposition given in Randles and Wolfe [1979, Ch. 8] and state their version of Hajek’s theorem.

Theorem 12.3 (Hajek). 

Let \(T = \sum\limits _{i=1}^{N}c(i)a({R}_{i})\) be the linear rank statistic, where the rank vector \(R\) comes from data vector \(Z\) that is continuous (no ties with probability one) and exchangeable, the constants c(1),…,c(N) satisfy the Noether condition

$$\frac{{\sum}_{i=1}^{N}{(c(i) -\overline{c})}^{2}} {{\max }_{1\leq i\leq N}{(c(i) -\overline{c})}^{2}} \rightarrow \infty \quad \mbox{ as $N \rightarrow \infty $},$$
(12.17)

and the scores have the form \(a(i) = \phi (i/(N + 1))\) , where ϕ can be written as the difference of two nondecreasing functions, \(0 < \int\nolimits _{0}^{1}{\phi (t)}^{2}\,dt < \infty \) , and \(\int\nolimits _{0}^{1}\vert \phi (t)\vert \,dt < \infty \) . Then T is AN \(\{N\overline{c}\,\overline{a},\mathrm{Var}(T)\}\) as N →∞.

It has been customary to use the normal approximation with rank statistics, often with a continuity correction. For example, in the two-sample problem, consider the Wilcoxon Rank Sum W of (12.6, p. 461). Note that for application of Theorem 12.3 above, ϕ(u) = u, and the theorem actually applies directly to \(W/(N + 1)\). For the simple example of Section 12.2 where \(z = (x,y) = (6,8,7,18,11,9)\) with ranks \(R = (1,3,2,6,5,4)\), we find W = 17, E(W) = 4(6 + 1)/2 = 14, Var(W) = (2)(4)(6 + 1)/12 = 14/3 (from 12.10, p. 462), and the normal approximation p-value is

$$p \approx P\left (N(0,1) \geq\frac{17 - 14} {\sqrt{14/3}} \right ) = P(N(0,1) \geq1.39) = 0.08.$$

With continuity correction the normal approximationp-value is

$$p \approx P\left (N(0,1) \geq\frac{17 - 14 - 1/2} {\sqrt{14/3}} \right ) = P(N(0,1) \geq1.16) = 0.12.$$

Lehmann [1975, p. 16] cites Kruskal and Wallis [1952, p. 591] with the recommendation that the continuity correction be used when the probability is above 0.02. Recall that the exact null distribution of W can be obtained from Table 12.1, leading to the usual p-value \(P(W \geq 17) = 2/15 = 0.13\), which is closer to the continuity corrected value.
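These approximations take one line each in R; the check below is ours, not the text's:

W <- 17; EW <- 4*(6 + 1)/2; VW <- (2)*(4)*(6 + 1)/12
1 - pnorm((W - EW)/sqrt(VW))          # 0.082, no continuity correction
1 - pnorm((W - EW - 1/2)/sqrt(VW))    # 0.124, with continuity correction
2/15                                  # 0.133, exact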

When there are tied values, we can still use the normal approximation with W, but we must be sure to use the null variance from (12.11, p. 462) or (12.12, p. 462) and not from (12.10, p. 462). Lehmann [1975, p. 20] does not use the continuity correction in the presence of ties.

We can also look at approximations to the permutation p-value of \(T = \sum\limits _{i=1}^{n}{Y }_{i}\), which is permutationally equivalent to the two-sample t statistic. For the simple example, \(c = (0,0,1,1,1,1)\) and \(a = z = (6,8,7,18,11,9)\). Thus, E(T) = (6)(4/6)(59/6) = 39.33, Var(T) = 25.29, and the normal approximation p-value is

$$p \approx P\left (N(0,1) \geq\frac{45 - 39.33} {\sqrt{25.29}} \right ) = P(N(0,1) \geq1.13) = 0.13.$$

This seems almost too good an approximation to the true permutation p-value of \(2/15 = 0.13\). Usually the t approximation p-value is more accurate, but here it is \(P({t}_{4} \geq 1.17) = 0.15\).

4.6 Edgeworth Approximation

Edgeworth approximations were mentioned briefly in Ch. 3 (5.6, p. 219) and Ch. 9 (11.7, p. 428). Basically, an Edgeworth expansion is an approximation to the distribution function of an asymptotically normal statistic. It is based on estimation of Skew and/or Kurt and other higher moments of the statistic. Rigorous development of Edgeworth expansions for general permutation statistics under the null hypothesis may be found in Bickel [1974], Bickel and van Zwet [1978], and Robinson [1980]. However, the approach has not proved of much practical use for obtaining critical values or p-values of permutation statistics except in the special cases of the Wilcoxon Rank Sum W and the one-sample Wilcoxon Signed Rank statistic.

Here we give the approximation for W originally due to Fix and Hodges [1955]. For \(W = \sum\limits _{i=m+1}^{N}{R}_{i}\),

$$P(W \geq w) \approx1 - \Phi (t) -\left \{\frac{{m}^{2} + {n}^{2} + mn + m + n} {20mn(m + n + 1)} \right \}({t}^{3} - 3t)\phi (t),$$
(12.18)

where ϕ and Φ are the standard normal density and distribution function, respectively, and \(t =\{ w -\mathrm{ E}(W) - 1/2\}/\sqrt{\mathrm{Var } (W)}\), \(\mathrm{E}(W) = n(N + 1)/2\), \(\mathrm{Var}(W) = mn(N + 1)/12\).
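Formula (12.18) is straightforward to program; the function below is our sketch (the name edgeworth.W is ours), not library code:

edgeworth.W <- function(w, m, n){
  # Fix-Hodges Edgeworth approximation (12.18) to P(W >= w), untied data
  N <- m + n
  EW <- n*(N + 1)/2; VW <- m*n*(N + 1)/12
  t <- (w - EW - 1/2)/sqrt(VW)
  corr <- (m^2 + n^2 + m*n + m + n)/(20*m*n*(N + 1))
  1 - pnorm(t) - corr*(t^3 - 3*t)*dnorm(t)
}
edgeworth.W(17, 2, 4)    # 0.135, versus the exact 2/15 = 0.133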

Figure 12.1 gives the error = true p-value − (12.18) and the relative error = [true p-value − (12.18)]/(true p-value) of (12.18) compared to the true p-value, and similar quantities for the normal approximations. The range of the p-values covers most of the right tail of the distribution function of W plotted in reverse order, that is, 0.0005 to 0.11. The Edgeworth approximation is excellent for p-values larger than 0.0024, but then deteriorates as the p-value gets very small. For example, when the true p-value is 0.00087, the Edgeworth approximation is 0.00073, and at 0.00025 it is 0.00009. The right panel of Figure 12.1 is especially helpful for illuminating what happens at small p-values. The normal approximation is much cruder, and below 0.02 we can see that the continuity correction is no longer useful.

Fig. 12.1 Error (Left Panel) and relative error (Right Panel) of approximations to Wilcoxon Rank Sum p-values for m = 10, n = 6: normal approximation, normal approximation with continuity correction, and the Edgeworth approximation in (12.18, p. 467)

Figure 12.1 suggests that (12.18) can be used for most values of W, thus essentially replacing tabled values of the distribution of W. However, when there are ties in the data, (12.18) as well as tabled values are no longer correct, and the exact permutation distribution (or a Monte Carlo approximation) is required.

4.7 Box-Andersen Approximation

Pitman [1937a,b] and Welch [1937] pioneered an approximation to permutation distributions that was modernized by Box and Andersen [1955] and Box and Watson [1962]. These later authors mainly used the approach to show the Type I error robustness of F statistics for tests comparing means and the nonrobustness of tests comparing variances. However, we follow the Box and Andersen [1955] formulation since it is the most straightforward.

The basic idea of the approximation is to put F statistics into their equivalent “beta” form, then match the first two permutation moments of this beta version to the first two moments of a beta distribution with degrees of freedom multiplied by a constant d. Solving for d leads to approximating the permutation distribution of the F statistic by an F distribution with the usual degrees of freedom multiplied by d. We develop the approximation here for the two-sample problem and later give it for one-way and two-way ANOVA situations.

The square of the t statistic in (12.1, p. 452) may be written in the one-way ANOVA F form

$${t}^{2} = \frac{m{(\overline{X} -\overline{Z})}^{2} + n{(\overline{Y } -\overline{Z})}^{2}} {{s}_{p}^{2}} = \frac{\mbox{ SSTR}} {\mbox{ SSE}/(N - 2)},$$
(12.19)

where recall we use the Z’s to denote all the X and Y values thrown together, and SSTR and SSE are sums of squares for treatments and error, respectively. Using the fact that \(\sum\limits _{i=1}^{N}{({Z}_{i} -\overline{Z})}^{2} = \mbox{ SSTR} + \mbox{ SSE}\), we have for the beta version of the F statistic

$$b({t}^{2}) = \frac{{t}^{2}} {{t}^{2} + N - 2} = \frac{\mbox{ SSTR}} {{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}}.$$

Note that for normal data under the null hypothesis, b(t 2) has a beta\((1/2,(N - 2)/2)\) distribution. Originally b(t 2) was used with the beta critical values rather than t 2 with F(1, N − 2) critical values. Although t 2 and b(t 2) are equivalent test statistics, for permutation analysis b(t 2) is much simpler because the denominator is constant over permutations. Thus, the first permutation moment is

$$\mathrm{{E}}_{\mathrm{P}}\{b({t}^{2})\} = \frac{m\mathrm{{Var}}_{\mathrm{P}}(\overline{X}) + n\mathrm{{Var}}_{\mathrm{P}}(\overline{Y })} {{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}} = \frac{1} {N - 1},$$

where we have used (12.4, p. 459) to get

$$\mathrm{{Var}}_{\mathrm{P}}(\overline{X}) = \frac{n{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}} {mN(N - 1)} \qquad \mathrm{{Var}}_{\mathrm{P}}(\overline{Y }) = \frac{m{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}} {nN(N - 1)}.$$

Note also that under normal theory \(\mathrm{E}\{b({t}^{2})\} = (1/2)/\{1/2 + (N - 2)/2\} = 1/(N - 1)\) from the beta distribution. Thus, the normal theory and permutation first moments of b(t 2) are both \(1/(N - 1)\). The next step is to calculate the permutation variance of b(t 2) (involving fourth moments), equate it to the variance of a beta\((d/2,d(N - 2)/2)\) distribution, \(2(N - 2)/[d(N - 1)(N + 3)]\), and solve for d. Box and Andersen [1955, p. 13] give d for the general one-way ANOVA situation with k groups and sample sizes n 1, n 2, …, n k :

$$d = 1 + \left (\frac{N + 1} {N - 1}\right ) \frac{{c}_{2}} {{({N}^{-1} + A)}^{-1} - {c}_{2}},$$
(12.20)

where

$$A = \frac{N + 1} {2(k - 1)(N - k)}\left (\frac{{k}^{2}} {N} -\sum\limits _{i=1}^{k} \frac{1} {{n}_{i}}\right ),$$

\({c}_{2} = {k}_{4}/{k}_{2}^{2}\),

$${k}_{2} = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2},$$
(12.21)
$${k}_{4} = \frac{N(N + 1){\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{4} - 3(N - 1){\left \{{\sum}_{i=1}^{N}{({Z}_{ i} -\overline{Z})}^{2}\right \}}^{2}} {(N - 1)(N - 2)(N - 3)}.$$
(12.22)

The statistics k 2 and k 4 are unbiased estimators of the population cumulants introduced in Chapter 1.
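A function computing d from (12.20)-(12.22) takes only a few lines. The sketch below is ours for the general one-way layout (the name boxandersen.d is our choice), not code from the text:

boxandersen.d <- function(z, n.vec){
  # z = all N observations combined; n.vec = the k group sizes
  N <- length(z); k <- length(n.vec)
  k2 <- sum((z - mean(z))^2)/(N - 1)                              # eq. (12.21)
  k4 <- (N*(N + 1)*sum((z - mean(z))^4) -
         3*(N - 1)*sum((z - mean(z))^2)^2)/((N - 1)*(N - 2)*(N - 3))  # eq. (12.22)
  c2 <- k4/k2^2
  A  <- (N + 1)/(2*(k - 1)*(N - k)) * (k^2/N - sum(1/n.vec))
  1 + ((N + 1)/(N - 1)) * c2/(1/(1/N + A) - c2)                   # eq. (12.20)
}

For the two-sample problem, t² would then be compared to an F(d, d(N − 2)) distribution with d = boxandersen.d(z, c(m, n)).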

For our two-sample t 2, k = 2, n 1 = m, n 2 = n, \(m + n = N\), and the Pitman-Welch-Box-Andersen approximation is to compare t 2 to an \(F(d,d(m + n - 2))\) distribution. Box and Andersen [1955] show that \(\mathrm{E}(d) \approx 1 + (\mathrm{Kurt} - 3)/N\) under the null hypothesis of sampling from equal populations with kurtosis Kurt. Thus, t 2 with the usual \(F(1,(m + n - 2))\) is quite Type I error robust to nonnormality since the correction d is relatively small for moderate size N. Also, for long-tailed distributions with thicker tails than the normal distribution, Kurt > 3 and thus d > 1, so that using the \(F(1,(m + n - 2))\) critical values results in conservative tests, that is, true test levels less than the nominal α values. For example, with Laplace data, Kurt = 6 and \(d \approx 1 + 3/N\); at \(m = n = 10\), d ≈ 1.15, and a nominal α = .05 level test would actually have true level approximately .043. For continuous uniform data, Kurt = 1.8; at \(m = n = 10\), d ≈ .94, and a nominal α = .05 level test would have true level approximately .053. Since these deviations from α are small, common practice is to just use the standard \(F(1,(m + n - 2))\) reference distribution with the t 2 statistic rather than the permutation distribution or an approximation to it.

Although t 2 is Type I error robust in the face of outliers, it loses power because outliers inflate the variance estimate in the denominator of t 2. Thus t 2 is not Type II error robust when sampling from distributions heavier-tailed than the normal. In contrast, as we mentioned in the Chapter introduction, the Wilcoxon Rank Sum statistic W is Type II error robust, and later we use asymptotic power calculations to verify its superiority to t 2. But for the moment, we note that W is related to t 2 applied to the ranks of the data, and therefore inherits robustness to outliers because the ranks themselves are resistant to the effects of outliers. This relationship also allows us to use the above approximation for the permutation distribution of W.

Define the standardized Wilcoxon Rank Sum statistic by

$${W}_{\mathrm{S}} = \frac{W -\mathrm{ E}(W)} {{\left \{\mathrm{Var}(W)\right \}}^{1/2}}.$$

Then, t 2 applied to the ranks of the observations, that is, the X ranks R 1, …, R m replacing X 1, …, X m , and the Y ranks R m + 1, …, R N replacing Y 1, …, Y n , results in

$${t}_{\mathrm{R}}^{2} = \frac{(N - 2){W}_{\mathrm{S}}^{2}} {N - 1 - {W}_{\mathrm{S}}^{2}}.$$

Thus t R 2 and W are equivalent test statistics, and we can apply the Box-Andersen approximation to t R 2 using \(d \approx 1 + (1.8 - 3)/N\) because the ranks are a uniform distribution on the integers 1 to N and thus have Kurt ≈ 1.8, the kurtosis of a continuous uniform distribution. For example, in the case of m = 10 and n = 6 given in Figure 12.1 (p. 467), the Box-Andersen approximation along with the continuity correction gives results that are considerably better than the normal approximation with continuity correction but not quite as good as the Edgeworth approximation. In later sections we see that the Box-Andersen approximation is very good in one-way and two-way ANOVA situations when the number of treatments is greater than two.

4.8 Monte Carlo Approximation

In the previous sections, approximations to permutation distributions were given for statistics based on linear forms, relying essentially on the Central Limit Theorem and its extensions. However, the simplest and most important approximation to a permutation distribution is to randomly sample from the set of all possible permutations and directly estimate the permutation distribution. This approach can be used for any statistic T, and its accuracy is determined simply by the number B of random permutations used. This resampling of permutations is very similar to resampling in the bootstrap world, and we suggest sampling with replacement for simplicity, although sampling without replacement could be used.

Suppose that T calculated on all permutations has distinct values \({t}_{1},\ldots,{t}_{k}\). For example, in Table 12.1 (p. 453) the t statistic has k = 13 distinct values − 2.98, − 1.72, − 1.36, − 1.08, − 0.84, − 0.06, 0.12, 0.30, 0.49, 0.69, 0.91, 1.17, 1.47, corresponding to the 15 permutations (0.49 and 0.91 appeared twice). The Monte Carlo approach is to randomly select B times from the 15 possible permutations, calculate the statistic for each random selection, say \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\), and let the number of T  ∗ s equal to t i be denoted N i , i = 1, …, k. If we select permutations with replacement, then (N 1, …, N k ) is multinomial(B; p 1, …, p k ), where p i is the permutation distribution probability of obtaining t i . The estimates N i  ∕ B have binomial variances \({p}_{i}(1 - {p}_{i})/B\). Thus, if we were trying to estimate the probabilities in Table 12.2 (p. 453), most of the estimates would have variance \((1/15)(14/15)/B\), although two of them would have variance \((2/15)(13/15)/B\) because of the duplication of values 0.49 and 0.91.

In typical applications, we are not interested in the whole permutation distribution, but merely want to estimate the \(p\)-value given in (12.2, p. 455) using

$$\widehat{p} = \frac{\#\left \{{T}_{i}^{{_\ast}}\geq {T}_{0}\right \}} {B},$$

where \({T}_{0}\) is the value of the statistic for the original data. In the simple example, \({T}_{0} = 1.17\). Recall that in this case the true permutation \(p\)-value is \(2/15 = .13\). Thus, \(B = 1000\) would yield an estimate with standard deviation \(\{(.13)(.87)/1000\}^{1/2} = .01\) that would be adequate for most purposes. However, if the \(p\)-value were smaller, say .005, then we would want to take \(B\) larger so that the standard deviation of the estimate would be a small fraction of the \(p\)-value, say not more than 10–20%. For example, setting \(.001 = \{(.005)(.995)/B\}^{1/2}\) would suggest \(B = 4975\). When the estimated \(p\)-value is to be used with rejection rules like “reject \({H}_{0}\) if \(\widehat{p} \leq \alpha\),” then it is wise to choose \(B\) so that \((B + 1)\alpha\) is an integer, as was discussed in the bootstrap Section 11.6.2 (p. 442) as the “99 rule.” Mainly this would be used in Monte Carlo simulation studies where \(B = 99\) or \(B = 199\) might be used to save computing time. However, in situations where computations of the test statistic are extremely expensive, one may view the random partitions as part of the test itself, and the procedure “reject \({H}_{0}\) if \(\widehat{p} \leq \alpha\)” is called a Monte Carlo test, not just an approximation to the permutation test. This approach was first introduced by Barnard [1963] and later studied by Hope [1968], Jöckel [1986], and Hall and Titterington [1989].
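As a concrete illustration, the following R sketch estimates a one-sided permutation \(p\)-value for the pooled two-sample \(t\) statistic; the data vectors x and y are hypothetical, and the choice of B follows the guidance above.

B <- 1000
t0 <- t.test(y, x, var.equal = TRUE)$statistic     # observed statistic
z <- c(x, y); m <- length(x)
tstar <- replicate(B, {
  zs <- sample(z)                                  # a random permutation
  t.test(zs[-(1:m)], zs[1:m], var.equal = TRUE)$statistic
})
mean(tstar >= t0)                                  # estimated p-value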

4.9 Comparing the Approximations in a Study of Two Drugs

A new drug regimen (B) was given to 16 subjects, and one week later each subject's status was assessed. A second independent group of 13 subjects received the standard drug regimen (A). Both sets of measurements were compared to baseline measurements taken before the treatment period began. The difference from baseline data is given in Figure 12.2. (This is real data, but the actual details are confidential.) The drug company wanted to prove that regimen B involving their new drug had larger differences from baseline than the standard. In terms of means of the differences, the testing situation is \({H}_{0} : {\mu }_{B} = {\mu }_{A}\) versus \({H}_{a} : {\mu }_{B} > {\mu }_{A}\). The sample means and standard deviations are \(\overline{X} = .92\), \(\overline{Y} = 3.19\), \({s}_{X} = 5.45\), \({s}_{Y} = 10.21\). The standard pooled \(t\) from (12.1, p. 452) is .72 with one-sided \(p\)-value .24 from the \(t\) distribution. The exact permutation \(t\) \(p\)-value is .249; with a large \(p\)-value like this, the \(t\) distribution approximation is adequate and agrees with the Type I error robustness mentioned previously. The Box-Andersen \(d = 1.074\) leads to an adjusted \(t\) \(p\)-value of .245.

Fig. 12.2 Change from Baseline for Drugs A and B

However, Figure 12.2 reveals that most of the Drug B subjects have positive changes from baseline, whereas the Drug A changes are more centered around 0. The two large negative values −22 and −11 have a strong effect on the \(t\) statistic. The Wilcoxon Rank Sum statistic \(W\) is less affected by outliers and might paint a different picture. First we compute the midranks, listing them with the data ordered within samples.

Summing the Drug B midranks gives \(W = 271.5\). The null mean of \(W\) is \((16)(16 + 13 + 1)/2 = 240\). To compute the null variance using the formula for ties, (12.12, p. 462), note that there are \(e = 16\) distinct values, with 2 values tied at −3, 7 tied at −1, 3 tied at 0, 2 tied at 2, 2 tied at 4, 2 tied at 6, and 2 tied at 10. Thus the null variance is

$$\frac{(16)(13)(16 + 13 + 1)}{12} - \frac{(16)(13)}{(12)(29)(29 - 1)}\left[({7}^{3} - 7) + ({3}^{3} - 3) + 5({2}^{3} - 2)\right] = 520 - 8.325 = 511.675.$$

The approximate normal statistic is \((271.5 - 240)/\sqrt{511.675} = 1.39\) with \(p\)-value .082. The \(t\) statistic on the ranks is 1.42 with \(p\)-value .084. The Box and Andersen [1955] degrees of freedom approximation with \(d = (1 - 1.2/29) = 0.96\) does not change the latter \(p\)-value until the fourth decimal. The Edgeworth approximation \(p\)-value is .084 without continuity correction and .087 with continuity correction.
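The tie-corrected variance calculation is easy to script. Here is a minimal R sketch using the tie pattern listed above; the function name and arguments are our own.

wilcox.var.ties <- function(m, n, tie.sizes) {
  # Null variance of W with ties, (12.12): mn(N+1)/12 minus the tie correction
  N <- m + n
  m*n*(N + 1)/12 - m*n/(12*N*(N - 1)) * sum(tie.sizes^3 - tie.sizes)
}
v <- wilcox.var.ties(13, 16, c(2, 7, 3, 2, 2, 2, 2))   # 511.675
pnorm((271.5 - 240)/sqrt(v), lower.tail = FALSE)       # about .082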


Unfortunately, because of the ties we cannot trust the exact tables or a continuity correction or the Edgeworth approximation. Thus, it seems wise to either calculate the exact permutation \(p\)-value or estimate it by Monte Carlo methods. With \(B = 10,000\) we got \(\widehat{p} = .085\) with 95% confidence interval (.080, .090). Rather than make \(B\) larger, in this case it is fairly easy to get the exact \(p\)-value = .0849 with existing software. Summarizing the one-sided \(p\)-values, we have:

Method                        One-sided p-value
Normal approximation          .082
t on ranks                    .084
Box-Andersen                  .084
Edgeworth (no corr.)          .084
Edgeworth (cont. corr.)       .087
Monte Carlo (B = 10,000)      .085
Exact permutation             .0849

So this is a situation where the Wilcoxon Rank Sum statistic might be preferred to the \(t\) because of its robustness to outliers. Here it apparently downweighted the outliers −22 and −11 enough to have a much lower \(p\)-value than the \(t\) statistic. The normal and \(t\) approximations to the \(W\) \(p\)-value are quite reasonable here, but we would not know that without getting the exact \(p\)-value = .0849 or by estimating it fairly accurately.


5 Optimality Properties of Rank and Permutation Tests

There are actually very few results available on the optimality properties of permutation tests. The main source is Lehmann and Stein [1949] (see also Lehmann [1986, Ch. 5]), who give the form of the most powerful permutation test for shift alternatives and note that it depends on a variety of unknown quantities, including the form of the distribution. In the particular case of normal data with common unknown variance, they show that the most powerful permutation statistic is \(\overline{Y}\), or equivalently \(\overline{Y} -\overline{X}\) or the pooled two-sample \(t\) statistic. Thus general optimality results are not available, but a general approach is clear: derive an (asymptotically) optimal parametric test statistic under a specific parametric family assumption (your best guess), and use the permutation approach for critical values. The resulting permutation test is valid under the null hypothesis for any distribution as long as the conditions of Theorem 12.1 (p. 457) hold, and is close to optimal if the distribution of the data is close to the one used to derive the test statistic.

For rank statistics there are two main bodies of results: locally most powerful rank tests and asymptotically most powerful rank tests based on Pitman Asymptotic Relative Efficiency (ARE). Here we briefly give the flavor of these approaches and the main results, leaving technical details for the Appendix.

5.1 Locally Most Powerful Rank Tests

For simplicity we focus on the two-sample shift model where \({X}_{1},\ldots,{X}_{m}\) are iid with distribution function \(F\), and \({Y}_{1},\ldots,{Y}_{n}\) are iid with distribution \(G(y) = F(y - \Delta)\). We assume that \(F\) is continuous with density \(f\). Consider

$${H}_{0} : \Delta= 0\quad \mbox{ versus}\quad {H}_{a} : \Delta> 0.$$

If there exists a rank test that is uniformly most powerful of level α for some ε > 0 in the restricted testing problem

$${H}_{0} : \Delta= 0\quad \mbox{ versus}\quad {H}_{a,\epsilon } : 0< \Delta< \epsilon,$$

then we say that the test is the locally most powerful rank test for the original testing problem.

The basic approach to finding a locally most powerful rank test is to take a Taylor expansion of the probability of the rank vector as a function of \(\Delta\) and maximize its derivative at \(\Delta = 0\). For sufficiently small \(\Delta\), ordering the values of the rank vector by their probabilities under the alternative \(\Delta\) is the same as ordering them by the derivative at \(\Delta = 0\). Thus, we need only obtain an expression for the derivative and maximize it. These details are left for the Appendix.

For the two-sample shift problem, the locally most powerful rank test rejects for large values of

$$T = \sum\limits _{i=m+1}^{N}a({R}_{ i}),$$

where \(a(i) = \mathrm{E}\{\phi ({U}_{(i)},f)\}\),

$$\phi (u,f) = -\frac{{f}^{{\prime}}({F}^{-1}(u))} {f({F}^{-1}(u))}$$
(12.23)

is called the optimal score function, and \({U}_{(1)} \leq {U}_{(2)} \leq \cdots \leq {U}_{(N)}\) are the order statistics from a uniform (0,1) distribution. Recall that \({R}_{m+1},\ldots,{R}_{N}\) are the ranks of the \(Y\) values in the joint ranking of all the \(X\)'s and \(Y\)'s together. We see in the next section that a closely related statistic, \(\sum\limits _{i=m+1}^{N}\phi ({R}_{i}/(N + 1),f)\), is asymptotically equivalent and comes naturally from asymptotic relative efficiency considerations.

If \(F\) is the logistic distribution, then we are led to the Wilcoxon Rank Sum as the locally most powerful rank test for shift alternatives because \(-{f}^{{\prime}}(x)/f(x) = 2F(x) - 1\) and \(\mathrm{E}\{{U}_{(i)}\} = i/(N + 1)\). When \(F\) is a normal distribution, the optimal score function is \(\phi (u,f) = {\Phi }^{-1}(u)\), and the locally most powerful test is based on the normal scores

$$a(i) =\mathrm{ E}\{{\Phi }^{-1}({U}_{ (i)})\} =\mathrm{ E}\{{Z}_{(i)}\},$$

where \({Z}_{(i)}\) is a standard normal order statistic. For shifts in the scale of an exponential distribution, \(F(x;\sigma ) = 1 -\exp (-x/\sigma )\), we can turn the problem into a shift in location of the negative of an extreme value distribution, \(F(x) = 1 -\exp \{-\exp (x)\}\), by taking the natural logarithm of the exponential data. The resulting optimal test has score

$$a(i) + 1 = \sum\limits _{j=N+1-i}^{N}\;\frac{1} {j},$$

where the latter sum is the expected value of the \(i\)th order statistic from a standard exponential distribution. These are called Savage scores from Savage [1956]. In censored data situations, the analogous test is called the logrank test.
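The three score families just described are simple to generate. Below is a minimal R sketch; note that qnorm(i/(N + 1)) is the quantile approximation to the exact normal scores \(\mathrm{E}\{{Z}_{(i)}\}\), which would otherwise require numerical integration.

N <- 10
wilcoxon <- 1:N                                       # a(i) = i
normal.approx <- qnorm((1:N)/(N + 1))                 # approximates E{Z_(i)}
savage <- sapply(1:N, function(i) sum(1/((N + 1 - i):N))) - 1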

Lehmann [1953] studied alternatives of the form

$${F}_{\Delta }(x) = (1 - \Delta )F(x) + \Delta {F}^{2}(x),$$

and showed that the Wilcoxon Rank Sum is the locally most powerful rank test for these alternatives. In general, alternatives of the form \({F}_{\Delta }(x) = {h}_{\Delta }(F(x))\) for some function \({h}_{\Delta }(u)\) are called Lehmann alternatives. They have the property that two-sample rank tests have the same distribution under an alternative \(\Delta\) for all continuous \(F\).

Johnson et al. [1987] consider locally most powerful rank tests using Lehmann alternatives for the nonresponder problem, where only a fraction of subjects respond to treatment. Conover and Salsburg [1988] consider other locally most powerful rank tests for the nonresponder problem. Additional situations where locally most powerful rank tests are considered include Doksum and Bickel [1969] and Bhattacharyya and Johnson [1973].

The optimal score functions (12.23, p. 475) appear in the \(k\)-sample problem, Section 12.6 (p. 480), and in the correlation problem, Section 12.7 (p. 487). Analogous results are also available in the one-sample location or matched pairs problem, Section 12.8, and are mentioned there.

Theoretical development and rigorous theorems on locally most powerful rank tests may be found in Hajek and Sidak [1967, Ch. 2], Conover [1973], and Randles and Wolfe [1979, Chs. 4 and 9].

5.2 Pitman Asymptotic Relative Efficiency

Perhaps the most useful way to evaluate and compare rank tests is due to Pitman [1948] and was further developed by Noether [1955] and others. The basic idea is that the Pitman Asymptotic Relative Efficiency (ARE) is the ratio of sample sizes required for two different tests to have the same power at a sequence of alternatives converging to the null hypothesis.

Let \(S\) and \(T\) be two test statistics for \(H : \theta = {\theta }_{0}\), where \({\theta }_{k}\) is a sequence of alternatives converging to \({\theta }_{0}\) as \(k \rightarrow \infty\). If we can choose sample sizes \({N}_{{S}_{k}}\) and \({N}_{{T}_{k}}\) and critical values \({c}_{{S}_{k}}\) and \({c}_{{T}_{k}}\) for \(S\) and \(T\), respectively, such that \(S > {c}_{{S}_{k}}\) and \(T > {c}_{{T}_{k}}\) have levels that converge to α and powers under \({\theta }_{k}\) that converge to β, α < β < 1, then the Pitman asymptotic relative efficiency of \(S\) to \(T\) is given by

$$\mbox{ ARE}(S,T) =\lim \limits_{k\rightarrow \infty }\frac{{N}_{{T}_{k}}} {{N}_{{S}_{k}}}.$$

Note that if ARE\((S,T) > 1\), then \(S\) is preferred to \(T\) because it takes fewer observations (\({N}_{{S}_{k}}\) is less than \({N}_{{T}_{k}}\)) to achieve the same power. Technical conditions in the Appendix and \(P({S}_{k} > {c}_{{S}_{k}}) \rightarrow \beta < 1\) require that the alternatives have a specific form: for some δ > 0

$${\theta }_{k} = {\theta }_{0} + \frac{\delta } {\sqrt{{N}_{{S}_{k }}}} + o\left ( \frac{1} {\sqrt{{N}_{{S}_{k }}}}\right )\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
(12.24)

Such sequences of alternatives are called Pitman alternatives. Another important quantity arising from the technical details is the efficacy of a test statistic \(S\),

$$\mbox{ eff}(S) =\lim \limits_{k\rightarrow \infty } \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} {\sqrt{{N}_{{S}_{k } } {\sigma }_{{S}_{k } }^{2 }({\theta }_{0 } )}},$$

where \({\mu }_{{S}_{k}}({\theta }_{0})\) and \({\sigma }_{{S}_{k}}({\theta }_{0})\) are the asymptotic mean and standard deviation of \(S\). Thus, the efficacy of a test is the rate of change of its asymptotic mean at the null hypothesis relative to its asymptotic standard deviation (the factor \(1/\sqrt{{N}_{{S}_{k }}}\) is introduced in the derivative because of (12.24)). A powerful test in the Pitman sense is one that is able to detect changes in the parameter value near the null hypothesis. The ARE of \(S\) to \(T\) turns out to be

$$\mbox{ ARE}(S,T) ={ \left \{\frac{\mbox{ eff}(S)} {\mbox{ eff}(T)}\right \}}^{2}.$$

The Pitman ARE is both a limiting ratio of sample sizes required to give the same power and the square of the ratio of the test efficacies. High efficacies lead to high AREs.

In the Appendix we give details for finding efficacies in the one-sample problem, but here we use similar standard results on efficacies for the two-sample problem from Randles and Wolfe [1979, Chs. 5 and 9]. The most important comparison is between the two-sample \(t\) test and the Wilcoxon Rank Sum test. The efficacy of the \(t\) test is

$$\mbox{ eff}(t) = \frac{\sqrt{\lambda (1 - \lambda )}} {\sigma },$$

where σ is the standard deviation of the \(X\) distribution function \(F(x)\) and of the \(Y\) distribution function \(G(y) = F(y - \Delta )\), and \(\lambda =\lim \limits_{\min (m,n)\rightarrow \infty }m/(m + n)\). For the Wilcoxon Rank Sum statistic \(W\) we have

$$\mbox{ eff}(W) = \sqrt{12\lambda (1 - \lambda )}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)\,dx,$$

where \(f\) is the density of \(F(x)\), and the integral is assumed to exist. Putting these efficacies together, we have that the Pitman ARE of \(W\) to \(t\) is

$$\mbox{ ARE}(W,t) = 12{\sigma }^{2}{\left \{{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)\,dx\right \}}^{2}.$$
(12.25)

We put ARE\((W,t)\) into Table 12.3 for a number of distributions. Remember that ARE\((W,t) > 1\) means that the Wilcoxon Rank Sum test is preferred to the \(t\) test. The first number is the lower bound 0.864 derived by Hodges and Lehmann [1956], which shows that the Wilcoxon Rank Sum cannot do much worse than the \(t\) test for any continuous unimodal distribution. The second number, 0.955, is for the normal distribution and shows that the Wilcoxon loses very little efficiency at the normal distribution, where the \(t\) test is optimal. At the uniform distribution the tests perform equivalently, and at the remaining examples in Table 12.3 the Wilcoxon is preferred.

Table 12.3 ARE(W, t) for the Two-Sample Shift Model
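Formula (12.25) is easy to check numerically. The R sketch below verifies the normal-distribution entry, ARE\((W,t) = 3/\pi \approx 0.955\), using \(\sigma = 1\).

int.f2 <- integrate(function(x) dnorm(x)^2, -Inf, Inf)$value  # equals 1/(2*sqrt(pi))
12 * 1^2 * int.f2^2                                           # 3/pi = 0.9549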

One might think that these ARE results are just asymptotic and may not relate to small-sample results. To supplement the ARE results, in Figure 12.3 we plot power results for \(m = n = 15\) taken from Table 4.1.10 of Randles and Wolfe [1979, p. 118–119]. They simulated the power of the \(t\) and Wilcoxon using 1000 replications. Here we see good correspondence between small-sample power and the ARE results of Table 12.3. For the normal, uniform, and logistic distributions, there is little power difference, as one might expect from ARE values of .955, 1.00, and 1.10, respectively. For the Laplace, the Wilcoxon has a significant power advantage, perhaps not quite as large as the ARE\((W,t) = 1.5\) would imply. The \({t}_{1}\) (Cauchy) and exponential power results strongly favor the Wilcoxon and are consistent with the large ARE values.

Fig. 12.3 Power of Wilcoxon Rank Sum (⋯) and \(t\) (_______) for \(m = n = 15\) from Table 4.1.10 of Randles and Wolfe [1979]

We should mention that the Laplace distribution with density \(f(x) = (1/2)\exp (-\vert x\vert )\) has been used quite a bit in the rank literature as a model for data, especially for ARE comparisons and simulation studies. But it may not be very useful as a model for real data, and ARE results for it are not as consistent with small-sample simulation results as they are for other densities. The optimal rank test for the Laplace uses scores \(a(i) = 1\) for \(i > (N + 1)/2\) and 0 otherwise, and is called the two-sample median test. However, its power performance in small samples, even when simulating from the Laplace distribution, is poor. Freidlin and Gastwirth [2000] show by simulation that the Wilcoxon Rank Sum test outperforms the median test at the Laplace distribution for sample sizes \(m = n\) less than or equal to 25. They recommend that the median test “be retired” from general usage, and we agree.

It turns out that in the scale problem mentioned briefly in Section 12.6.6 (p. 486), ARE values are overly optimistic when compared to small-sample power results. This may reflect the fact that measuring scale (standard deviation) is an inherently harder problem that is not as well suited to rank statistics. Klotz [1962] pointed out this discrepancy between small-sample power and ARE results. Fortunately, ARE results have been used mainly in location comparisons, where they yield good intuition about the qualitative behavior of tests.

Another result from Randles and Wolfe [1979, p. 307] is that under suitable regularity conditions on the score functions, the efficacy of any linear rank test \(S = \sum\limits _{i=m+1}^{N}\phi ({R}_{i}/(N + 1))\) in the two-sample shift model is given by

$$\mbox{ eff}(S) = \sqrt{\lambda (1 - \lambda )}\, \frac{{\int\nolimits }_{0}^{1}\phi (u)\phi (u,f)\,du} {{\left [{\int\nolimits }_{0}^{1}{\{\phi (u) -\overline{\phi }\}}^{2}\,du\right ]}^{1/2}},$$
(12.26)

where \(\phi (u,f)\) is given in (12.23, p. 475). Expression (12.26) now justifies the name optimal score function, since the efficacy in (12.26) is optimized by choosing \(\phi (u) = \phi (u,f)\). This can be seen by noting that

$${\int\nolimits \nolimits }_{0}^{1}{\phi }^{2}(u,f)\,du ={ \int\nolimits \nolimits }_{-\infty }^{\infty }{\left \{\frac{f^{\prime}(x)} {f(x)}\right \}}^{2}f(x)\,dx = I(f),$$

where \(I(f)\) is the Fisher information for the model \(f(x;\theta ) = f(x - \theta )\). Now, noting that \({\int\nolimits }_{0}^{1}\phi (u,f)\,du = 0\), (12.26) can be reexpressed as

$$\mbox{ eff}(S) = \sqrt{\lambda (1 - \lambda )I(f)}\mbox{ Corr}(\phi (U),\phi (U,f)),$$
(12.27)

where \(U\) is a uniform random variable and Corr is the correlation. Clearly, the correlation is maximized by choosing \(\phi (u) = \phi (u,f)\). Moreover, it can also be shown that \(\sqrt{\lambda (1 - \lambda )I(f)}\) is not only the largest possible efficacy among linear rank tests but also among all α-level tests. Thus, optimal linear rank tests are asymptotically equivalent in terms of Pitman ARE to the best possible tests, say likelihood ratio or score or Wald tests for the shift model in a parametric framework. Of course, this optimality in either the rank test or the parametric test requires that the assumed family is correct.

In the next sections we consider (i) the \(k\)-sample problem, a generalization of the two-sample problem to \(k > 2\) samples; (ii) the correlation or regression problem; and (iii) the matched pairs or one-sample symmetry problem. The Pitman ARE analysis has to be adjusted to handle each situation, but the numbers found in Table 12.3 (p. 477) continue to hold for these situations as well. Thus Wilcoxon procedures, in other words rank methods using scores \(a(i) = i\), tend to give very good results across a wide range of distributions in each of these situations.

6 The k-sample Problem, One-way ANOVA

The extension of the two-sample case to \(k\) samples or treatments is straightforward. Suppose that we have available \(k\) independent random samples \(\left \{{Y}_{i1},\ldots,{Y}_{i{n}_{i}}\right \}\), \(i = 1,\ldots,k\), where in each sample the \({Y}_{ij}\) (\(j = 1,\ldots,{n}_{i}\)) are iid with distribution function \({F}_{i}(x)\), and \(N = {n}_{1} + \cdots + {n}_{k}\). The linear model representation is

$${Y }_{ij} = \mu+ {\alpha }_{i} + {e}_{ij}.$$
(12.28)

If the errors \({e}_{ij}\) all come from the same distribution, then (12.28) is an extension of the shift model for two-sample data.

For example, the following are data on the ratio of Assessed Value to Sale Price for single-family dwellings (\({n}_{1} = 27\)), two-family dwellings (\({n}_{2} = 22\)), three-family dwellings (\({n}_{3} = 17\)), and four-or-more-family dwellings (\({n}_{4} = 14\)) in Fitchburg, Massachusetts, in 1979.

The null hypothesis of interest is that of identical distribution functions,

$${H}_{0} : {F}_{1}(y) = {F}_{2}(y) = \cdots= {F}_{k}(y),$$
(12.29)

which arises most naturally if we randomly assigned \(N\) experimental units to \(k\) treatment groups with sample sizes \({n}_{1},{n}_{2},\ldots,{n}_{k}\). (The above data are not of this type.) There are

$${M}_{N} = \left ({ N \atop {n}_{1}{n}_{2}\cdots {n}_{k}} \right ) = \frac{N!} {{n}_{1}!{n}_{2}!\cdots {n}_{k}!}$$

possible assignments, which of course is the relevant number of permutations even if the data do not come from a randomized experiment. Pitman [1938] proposed the permutation approach for the ANOVA \(F\) statistic

$$F = \frac{ \frac{1} {k - 1}\sum\limits _{i=1}^{k}{n}_{ i}{({\overline{Y }}_{i.} -{\overline{Y }}_{..})}^{2}} { \frac{1} {N - k}\sum\limits _{i=1}^{k} \sum\limits _{j=1}^{{n}_{i} }{({Y }_{ij} -{\overline{Y }}_{i.})}^{2}},$$
(12.30)

where \({\overline{Y}}_{i.} = {n}_{i}^{-1}\sum\limits _{j=1}^{{n}_{i}}{Y}_{ij}\) and \({\overline{Y}}_{..} = {N}^{-1}\sum\limits _{i=1}^{k}{n}_{i}{\overline{Y}}_{i.}\). The number of permutations \({M}_{N}\) gets large very fast. For example, with \(k = 3\), \(N = 15\), \({n}_{1} = {n}_{2} = {n}_{3} = 5\), we get \({M}_{N} = \left({15 \atop 5\;5\;5}\right) = 756,756\). Thus Monte Carlo or asymptotic approximations are more important than in the two-sample case. For the above housing data, the ANOVA \(F\) in (12.30) is \(F = 1.24\) with \(p\)-value = .30 from the \(F(3,76)\) distribution. The exact permutation \(p\)-value is obtained by computing \(F\) for each of the \(1.9 \times 10^{44}\) distinct allocations of \(\left \{{Y}_{i1},\ldots,{Y}_{i{n}_{i}};\, i = 1,\ldots,4\right \}\) to samples of size \({n}_{1} = 27\), \({n}_{2} = 22\), \({n}_{3} = 17\), and \({n}_{4} = 14\), and finding the proportion of these greater than or equal to \(F = 1.24\). A Monte Carlo estimate of the exact permutation \(p\)-value is .267 based on 100,000 resamples with standard error = .0014. Because the housing ratios are quite skewed with a number of large observations, it is not surprising that \(F\) is small. Now we turn to rank methods that naturally limit the effect of outliers.
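Since full enumeration is hopeless here, a Monte Carlo approximation like the following R sketch is the practical route; y is the vector of all N responses and g the group factor, both hypothetical names.

B <- 100000
F0 <- oneway.test(y ~ g, var.equal = TRUE)$statistic   # observed ANOVA F (12.30)
Fstar <- replicate(B, oneway.test(sample(y) ~ g, var.equal = TRUE)$statistic)
mean(Fstar >= F0)                                      # estimated permutation p-value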


6.1 Rank Methods for the k-Sample Location Problem

Kruskal and Wallis [1952] proposed the rank extension of the Wilcoxon Rank Sum statistic to the \(k\)-sample situation. The rank approach is to put all \(N\) observations together and rank them; let \({R}_{ij}\) be the rank of \({Y}_{ij}\) in the combined sample. Further define the sample sums

$${S}_{i} = \sum\limits _{j=1}^{{n}_{i} }a({R}_{ij}),$$

where the scores \(a(i)\) could be of any form for permutational analysis, but for asymptotic results we assume \(a(i) = \phi (i/(N + 1))\), where ϕ is a scores generating function as in Theorem 12.3 (p. 465). The Kruskal-Wallis statistic uses \(a(i) = i\) or equivalently \(a(i) = i/(N + 1)\). Note that \({S}_{i}\) is just a two-sample linear rank statistic for comparing the \(i\)th population to all the others combined. The general linear rank statistic form for comparing the \(k\) populations is then

$$Q = \sum\limits _{i=1}^{k} \frac{1} {{s}_{a}^{2}{n}_{i}}{({S}_{i} - {n}_{i}\overline{a})}^{2} = \sum\limits _{i=1}^{k}\left (\frac{N - {n}_{i}} {N} \right )\frac{{({S}_{i} -\mathrm{ E}{S}_{i})}^{2}} {\mathrm{Var}({S}_{i})},$$
(12.31)

where \({s}_{a}^{2} = {(N - 1)}^{-1}\sum\limits _{i=1}^{N}{\{a(i) -\overline{a}\}}^{2}\), \(\overline{a} = {N}^{-1}\sum\limits _{i=1}^{N}a(i)\), and Var\(({S}_{i})\) is given by (12.4, p. 459) with the constants \({c}_{i}\) in that expression equal to 1 for \({n}_{i}\) of them and 0 otherwise. The reason for giving the second form in (12.31) is that it is then clear that \(\mathrm{E}(Q) = k - 1\) under the null hypothesis of equal populations. The Kruskal-Wallis statistic that allows for ties is explicitly given by

$$H = \frac{(N - 1)\left \{\sum\limits _{i=1}^{k}{n}_{ i}{\left ({\overline{R}}_{i.} -\frac{N + 1} {2} \right )}^{2}\right \}} {\left (\sum\limits _{i=1}^{k} \sum\limits _{j=1}^{{n}_{i} }{R}_{ij}^{2}\right ) - N{(N + 1)}^{2}/4},$$

where \({\overline{R}}_{i.} = {n}_{i}^{-1}\sum\limits _{j=1}^{{n}_{i}}{R}_{ij}\). If there are no ties in the data, then

$$\sum\limits _{i=1}^{k} \sum\limits _{j=1}^{{n}_{i} }{R}_{ij}^{2} = N(N + 1)(2N + 1)/6,$$

and \(H\) reduces to the more familiar form

$$H = \frac{12} {N(N + 1)}\sum\limits _{i=1}^{k}{n}_{ i}{\left ({\overline{R}}_{i.} -\frac{N + 1} {2} \right )}^{2}.$$

Under the null hypothesis (12.29, p. 480), standard asymptotic theory similar to Theorem 12.3 (p. 465) yields that \(Q\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\) as \(\min \{{n}_{1},\ldots,{n}_{k}\} \rightarrow \infty\). The \({\chi }_{k-1}^{2}\) approximation is not very good in small samples, but fortunately the \(F\) statistic on the scores \(a({R}_{ij})\) is a monotone function of \(Q\),

$${F}_{\mathrm{R}} = \left (\frac{N - k} {k - 1} \right )\left ( \frac{Q} {N - 1 - Q}\right ),$$

and using \(F(k - 1,N - k)\) as a reference distribution, or the Box-Andersen adjusted \(F(d(k - 1),d(N - k))\) distribution, yields excellent results. For the housing data above, \(H = 9.8856\) with \(p\)-value = 0.020 from the \({\chi }_{3}^{2}\) distribution, and \({F}_{\mathrm{R}} = 3.6283\) with \(p\)-value = 0.017 from the \(F(3,76)\) distribution. The Box-Andersen \(d = 0.9876\), and so the adjustment is very minor, only in the fourth decimal place. A Monte Carlo approximation to the exact \(p\)-value is .017 based on 100,000 samples with standard error .0004. So here the \(F\) distribution approximation is right on target to 3 decimals, but the \({\chi }^{2}\) approximation is not bad due to the fairly large samples.
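In R, the Kruskal-Wallis statistic and its \({F}_{\mathrm{R}}\) version take only a few lines; y and g are the hypothetical response vector and group factor as before.

Q <- unname(kruskal.test(y ~ g)$statistic)
k <- nlevels(factor(g)); N <- length(y)
FR <- ((N - k)/(k - 1)) * Q/(N - 1 - Q)          # monotone transform of Q
pf(FR, k - 1, N - k, lower.tail = FALSE)         # F reference distribution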

In Figure 12.4 we look at much smaller sample sizes for \(k = 3\) and \(k = 5\). Figure 12.4 shows the difference between the exact permutation \(p\)-value and each approximation versus the exact \(p\)-value for the Kruskal-Wallis statistic. Note that the left panel is more expanded in the vertical scale than the right panel and actually has less error. Nevertheless, the Box-Andersen approximation is the best in both plots and is generally very good for \(k > 2\). The \({\chi }_{k-1}^{2}\) approximation gets more conservative as \(k\) gets larger. This can be explained by the following large-\(k\) asymptotic results.

Fig. 12.4 (Exact \(P\)-Values − Approximate \(P\)-Values) versus Exact \(P\)-Values for the Kruskal-Wallis Statistic. \(F = F(k - 1,N - k)\), \({F}_{BA} = F(d(k - 1),d(N - k))\), and \({\chi }^{2} = {\chi }_{k-1}^{2}\)

6.2 Large-k Asymptotics for the ANOVA F Statistic

Brownie and Boos [1994] show under the null hypothesis of equal populations that

$$\sqrt{k}({F}_{\mathrm{R}} - 1)\stackrel{d}{\rightarrow }\mbox{ N}\left (0, \frac{2n} {n - 1}\right ),$$
(12.32)

for equal sample sizes \({n}_{1} = {n}_{2} = \cdots = {n}_{k} = n\) and \(k \rightarrow \infty\) with \(n\) fixed. Note that the usual result with \(n \rightarrow \infty\) and \(k\) fixed is \((k - 1){F}_{\mathrm{R}}\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\), similar to the result for \(Q\). The “large \(k\)” asymptotic result (12.32) implies that

$$\sqrt{k}\left ( \frac{Q} {k - 1} - 1\right )\stackrel{d}{\rightarrow }\mbox{ N}\left (0, \frac{2(n - 1)} {n} \right ),$$
(12.33)

as \(k \rightarrow \infty\) with \(n\) fixed, using

$$Q = \frac{(N - 1){F}_{\mathrm{R}}} {(N - k)/(k - 1) + {F}_{\mathrm{R}}}$$
(12.34)

(see Problem 12.17, p. 527). Note that comparing \(Q\) to a \({\chi }_{k-1}^{2}\) is asymptotically (\(k \rightarrow \infty\)) like comparing \(Q/(k - 1)\) to a N\(\{1,2/(k - 1)\}\) because a \({\chi }_{k-1}^{2}\) random variable obeys the Central Limit Theorem (it is a sum of \({\chi }_{1}^{2}\) random variables). However, (12.33) says that \(Q/(k - 1)\) should be compared to a N\(\{1,2(n - 1)/(kn)\}\) distribution. Because \(2(n - 1)/(kn) < 2/(k - 1)\), using the \({\chi }_{k-1}^{2}\) distribution with \(Q\) results in conservative true levels. For example, if \(k = 5\) and \(n = 5\), then the large-sample 95th percentile from N\(\{1,2/(k - 1)\}\) is \(1 + {(2/4)}^{1/2}(1.645) = 2.16\), and the approximate true level of a nominal α = .05 test is

$$P(Q \geq{\chi }_{4}^{2}(.95)) \approx P(1 + {(8/25)}^{1/2}Z \geq2.16) = P(Z \geq2.05) =.02.$$

In contrast, use of \({F}_{\mathrm{R}}\) with an \(F(k - 1,N - k)\) reference distribution is supported by (12.32) under \(k \rightarrow \infty\) and by the usual asymptotics \((k - 1){F}_{\mathrm{R}}\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\) when \(n \rightarrow \infty\) with \(k\) fixed. We leave those details for Problem 12.18 (p. 527). Thus, it is not surprising that the \(F\) approximations in Figure 12.4 are much better than the \({\chi }_{k-1}^{2}\) ones.

6.3 Comparison of Approximate P-Values – Data on Cadmium in Rat Diet

Nation et al. [1984] studied the effect of diets containing cadmium (Cd) on the neurobehavior of adult rats. The data consist of the number of platform descents during a passive-avoidance training scheme for 27 rats randomly assigned to three groups.


The control group had no Cd in the diet, and Cd1 and Cd5 refer to daily diets containing 1 milligram and 5 milligrams, respectively, of Cd per kilogram of body weight. The usual one-way ANOVA \(F = 5.10\), and the permutation \(p\)-value for the \(F\) statistic is \(\widehat{p} = 0.016\) based on 100,000 random permutations. The \(F(2,24)\) distribution gives \(p\)-value = .014, and the Box-Andersen correction factor is \(d = .954\), leading to \(p\)-value = .016. The Kruskal-Wallis rank statistic is \(Q = 8.18\) with permutation \(p\)-value \(\widehat{p} = .012\) based on 100,000 random permutations. The \({\chi }_{2}^{2}\) approximation gives \(p\)-value = .017. The associated \(F\) statistic is \({F}_{\mathrm{R}} = 5.51\) with \(p\)-value = .011. The Box-Andersen correction factor is \(d = 1 - 1.2/24 = .95\), leading to \(p\)-value = .012. A summary is as follows:

Statistic   Method                      P-value
F           Monte Carlo (B = 100,000)   0.016
            F(2, 24)                    0.014
            Box-Andersen                0.016
KW          Monte Carlo (B = 100,000)   0.012
            \({\chi }_{2}^{2}\)         0.017
            F(2, 24)                    0.011
            Box-Andersen                0.012

As expected, the \(F\) approximations give excellent \(p\)-values.

6.4 Other Types of Alternative Hypotheses

The \(k\)-sample \(F\) statistic and Kruskal-Wallis statistic are used to compare the centers or locations of the \(k\) populations. Other statistics could be used for that purpose, perhaps ones more suited to long-tailed or skewed populations. The logrank or Savage scores, for example, are asymptotically optimal for detecting shifts in the scale parameter of exponential populations (or the shift parameter of extreme value distributions).

Other types of alternatives may also be of interest. For example, there may be an implied order in the populations, say increasing doses, and there may be interest in trends in location. There might also be interest in comparing the spread of the populations or even the skewness.

These latter alternatives present a problem to permutation and rank methods because the null hypothesis of interest may not be the one of identical populations. For comparing spread, the usual null hypothesis of interest would be equal spread rather than identical populations. In such a situation, use of the permutation approach would require subtraction of unknown location parameters. We first discuss ordered alternatives in location.

6.5 Ordered Means or Location Parameters

Recall Section 3.6.1a (p. 154), where we discussed likelihood-based methods for ordered alternatives. Here we discuss permutation methods with simple statistics in the context of a Phase I toxicology study where there seem to be trends in both the means and variances with dose:

Dose                             \(\overline{Y}\)   \({s}_{n-1}\)
0     1.44   1.63   1.40   1.59    1.52    0.11
1     1.27   1.50   1.45   1.57    1.45    0.13
2     1.26   1.07   1.38   1.75    1.37    0.29
3     1.04   1.14   1.46   1.06    1.18    0.19
4     1.37   0.79   1.32   1.42    1.23    0.29

The \(F\) statistic for comparing means is \(F = 1.77\), and the usual \(F(4,16)\) distribution and the Box-Andersen approximation give \(p\)-value = 0.19. Similarly, a Monte Carlo estimated \(p\)-value based on 10,000 random permutations gives \(\widehat{p} = 0.19\). The Kruskal-Wallis statistic is \(H = 6.73\) with \({\chi }_{4}^{2}\) \(p\)-value = 0.15. The \(F\) approximation from \({F}_{\mathrm{R}} = 2.06\) and the Box-Andersen approximation both give \(p\)-value = 0.14. A Monte Carlo estimated \(p\)-value based on 10,000 random permutations gives \(\widehat{p} = 0.14\). So the global comparison of location is not significant at usual levels.

Suppose that we consider \({H}_{0}\): identical populations versus \({H}_{a}\): means are decreasing. The permutation approach with \({M}_{N} = \left({20 \atop 4\;4\;4\;4\;4}\right)\) permutations may be used with the \(t\) statistic from a regression of the observations on dose, or equivalently Pearson's correlation coefficient (see also the next section). Pearson's correlation coefficient is \(r = -0.53\) with Monte Carlo estimated \(p\)-value \(\widehat{p} = 0.007\) based on 10,000 random permutations. Spearman's correlation coefficient is −0.56 with \(\widehat{p} = 0.005\). Another statistic that could have been used is the likelihood ratio statistic for decreasing means assuming the data are normally distributed (see Section 3.6.1a, p. 154). In addition to Spearman's correlation coefficient, the standard rank-based statistic is the Jonckheere-Terpstra statistic based on summing pairwise Wilcoxon Rank Sum statistics in increasing order, \(\sum\limits_{i<j}{W}_{ij}\), where \({W}_{ij}\) is the Wilcoxon Rank Sum for comparing dose group \(i\) with dose group \(j\) (see Lehmann 1975, p. 233). Its value here is −2.458 with exact permutation \(p\)-value = 0.0069. So we can be pretty confident that there is a downward trend in means or other location measures.
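A Monte Carlo version of the correlation-based trend test is a minimal sketch in R, using the dose-response data tabled above.

y <- c(1.44, 1.63, 1.40, 1.59,  1.27, 1.50, 1.45, 1.57,
       1.26, 1.07, 1.38, 1.75,  1.04, 1.14, 1.46, 1.06,
       1.37, 0.79, 1.32, 1.42)
dose <- rep(0:4, each = 4)
B <- 10000
r0 <- cor(dose, y)                     # observed Pearson correlation, about -0.53
rstar <- replicate(B, cor(dose, sample(y)))
mean(rstar <= r0)                      # one-sided p-value for a decreasing trend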

6.6 Scale or Variance Comparisons

Motivated by the apparent increase in variances for the dose-response data above, we now discuss hypotheses about variances or scale parameters. Unfortunately, there is a philosophical dilemma for using permutation procedures here. Usually, the typical set of hypotheses when testing for unequal variances is for a semiparametric model, \(P({Y}_{ij} \leq y) = {F}_{0}((y - {\mu }_{i})/{\sigma }_{i})\), \(j = 1,\ldots,{n}_{i}\); \(i = 1,\ldots,k\), where \({F}_{0}\) is an unknown distribution function. Note that if \({F}_{0}(x)\) has mean 0 and variance 1, then \({\mu }_{i}\) is the \(i\)th population mean, and \({\sigma }_{i}^{2}\) is the \(i\)th population variance. In any case, under this semiparametric model, the \(i\)th standard deviation is \(c{\sigma }_{i}\) for some constant \(c\), and we can always refer to \({\sigma }_{i}\) as a scale parameter. The hypotheses for increasing scale are then \({H}_{0} : {\sigma }_{1} = \cdots = {\sigma }_{k}\) versus \({H}_{a} : {\sigma }_{1} \leq \cdots \leq {\sigma }_{k}\) with at least one inequality. The reason for this hypothesis formulation is that we often know that the means are different; therefore it makes little sense to assume identical populations when testing for variance differences. Basically, we usually want to test for variance differences in the presence of location differences.

Unfortunately, the permutation argument requires that the null hypothesis be one of identical populations. It makes intuitive sense to center the data first by subtracting means, but these residuals \({Y}_{ij} -{\overline{Y}}_{i}\) no longer satisfy the exchangeability required for using Theorem 12.1 (p. 457). The permutation distribution is correct asymptotically, but the exact level-α property no longer holds. An overview of the scale testing problem is given in Boos and Brownie [2004]. The best method that has emerged for comparing scales is to use \(t\) or \(F\) statistics on the data \({Y}_{ij}\) replaced by \(\vert {Y}_{ij} - {M}_{i}\vert\), where \({M}_{i}\) is the \(i\)th sample median.
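In R, this recommended approach amounts to an ANOVA \(F\) on absolute deviations from the group medians, as in the sketch below (hypothetical y and g again).

ad <- abs(y - ave(y, g, FUN = median))          # |Y_ij - M_i|
oneway.test(ad ~ g, var.equal = TRUE)           # F test on the absolute deviations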

One way to avoid the centering problem for the dose-response data is to reduce the data to the sample standard deviations (or some other scale estimator) and then calculate an appropriate statistic for the \(5! = 120\) permutations possible. For the correlation between dose and standard deviation, we get \(r = 0.79\) and \(p\)-value \(= 7/120 = .058\). If we use the likelihood ratio test for increasing variances for normal distributions, we get \(p\)-value = 5/120 = .042. There is a loss of information when the number of permutations gets reduced so much, from \({M}_{N} = \left({20 \atop 4\;4\;4\;4\;4}\right)\) to \({M}_{N} = 120\); perhaps the loss of information is just a discreteness problem caused by having too few permutations. This can be seen more clearly by calculating the exact permutation test on the data reduced to the five means; the correlation is higher than when using all the data, but the \(p\)-value = 2/120 = .017 is much larger than the .007 value we obtained previously with the whole data set.

We note that the use of rank statistics for scale comparisons has not been very successful. The subtraction of means or medians ruins the permutation argument, as mentioned above. Rank statistics for scale based on centered data are asymptotically distribution free if the samples are symmetrically distributed, but the larger problem is that rank tests for scale tend to have low power in small samples. Whereas rank tests for location perform well in small samples and are consistent with asymptotic relative efficiency comparisons, rank tests for scale are not nearly as powerful in small samples as asymptotic relative efficiency calculations would suggest.

7 Testing Independence and Regression Relationships

Regression methods are among the most important tools of statistics. Unfortunately, permutation methods can really be applied only in the simplest setting of \((X,Y)\) pairs, that is, correlation or simple regression (not necessarily linear). Here we discuss that simple situation and explain at the end of the section why permutation methods cannot handle the more interesting case of multiple explanatory variables.

Suppose that we have iid random pairs \(({X}_{1},{Y}_{1}),\ldots,({X}_{n},{Y}_{n})\) and permute each coordinate independently to get \(n!\) different pairings. In reality, we need only permute one of the coordinates to obtain all the different pairings. For example, suppose that \(n = 3\) with pairs (1, 2.5), (2, 3.7), (3, 6.4). Then the 6 possible permutations are

1          2          3          4          5          6
(1,2.5)    (1,3.7)    (1,6.4)    (1,2.5)    (1,3.7)    (1,6.4)
(2,3.7)    (2,2.5)    (2,3.7)    (2,6.4)    (2,6.4)    (2,2.5)
(3,6.4)    (3,6.4)    (3,2.5)    (3,3.7)    (3,2.5)    (3,3.7)

Pitman [1937b] suggested that a test for independence of \(X\) and \(Y\) based on the sample correlation

$$r = \frac{\sum\limits _{i=1}^{n}({X}_{ i} -\overline{X})({Y }_{i} -\overline{Y })} {{\left [\sum\limits _{i=1}^{n}{({X}_{ i} -\overline{X})}^{2} \sum\limits _{i=1}^{n}{({Y }_{ i} -\overline{Y })}^{2}\right ]}^{1/2}}$$

use this permutation distribution for critical values. A permutationally equivalent statistic is the least squares slope estimate \(\widehat{\beta } = \sum\limits _{i=1}^{n}({X}_{i} -\overline{X})({Y}_{i} -\overline{Y})/\sum\limits _{i=1}^{n}{({X}_{i} -\overline{X})}^{2}\). Other popular measures that could be used to test independence are Kendall's rank correlation and Spearman's rank correlation. Spearman's correlation coefficient \({r}_{\mathrm{S}}\) is obtained by replacing \({X}_{i}\) by its rank among \({X}_{1},\ldots,{X}_{n}\) and \({Y}_{i}\) by its rank among \({Y}_{1},\ldots,{Y}_{n}\), and computing the Pearson correlation \(r\) between these pairs of ranks. It is important to keep in mind that the null hypothesis is independence of \(X\) and \(Y\) and not zero correlation. Independence is needed for the \(n!\) different pairings to have the same distribution and thus for Theorem 12.1 (p. 457) to apply.


Typical approximations to the permutation distribution of \(r\) (and similarly of \({r}_{\mathrm{S}}\)) are to compare \({(n - 1)}^{1/2}r\) to a standard normal distribution or \({(n - 2)}^{1/2}r/{(1 - {r}^{2})}^{1/2}\) to a \(t(n - 2)\) distribution. Pitman [1937b] gave the first two permutation moments of \({r}^{2}\), \(\mathrm{{E}}_{\mathrm{P}}({r}^{2}) = 1/(n - 1)\), and

$$\mathrm{{E}}_{\mathrm{P}}({r}^{4}) = \frac{3} {(n - 1)(n + 1)} + \frac{(n - 2)(n - 3)} {n(n + 1){(n - 1)}^{3}}\left \{ \frac{{k}_{4}(X)} {{k}_{2}{(X)}^{2}}\right \}\left \{ \frac{{k}_{4}(Y )} {{k}_{2}{(Y )}^{2}}\right \},$$

where the sample cumulants \({k}_{2}\) and \({k}_{4}\) were given in (12.21, p. 469) and (12.22, p. 469), respectively. Note that these moments are straightforward from the results in Section 12.4.2 (p. 458), since the numerator of \(r\) has the form (12.3, p. 458) of a linear statistic, and the denominator is constant over permutations. If the pairs are iid with a bivariate normal distribution, then \({r}^{2}\) has a beta\((1/2,n/2 - 1)\) distribution with \(\mathrm{E}({r}^{2}) = 1/(n - 1)\) and \(\mathrm{E}({r}^{4}) = 3/\{(n - 1)(n + 1)\}\). Because the permutation moments and normal theory moments are so close, Pitman [1937b] suggested using the beta approximation, which is equivalent to comparing \((n - 2){r}^{2}/(1 - {r}^{2})\) to an \(F(1,n - 2)\) distribution. Box and Watson [1962] generalized these results to the full \(p\)-regressor case for the test that all regressors are independent of \(Y\). They derived the adjusted \(F\) approximation (see Box and Watson 1962, p. 100), which for the \(p = 1\) case here is to compare \((n - 2){r}^{2}/(1 - {r}^{2})\) to an \(F(d,d(n - 2))\) distribution, where

$$\frac{1} {d} = 1 + \frac{(n + 1){\alpha }_{1}} {n - 1 - 2{\alpha }_{1}},\quad {\alpha }_{1} = \frac{n - 3} {2n(n - 1)}\left \{ \frac{{k}_{4}(X)} {{k}_{2}{(X)}^{2}}\right \}\left \{ \frac{{k}_{4}(Y )} {{k}_{2}{(Y )}^{2}}\right \}.$$

In large samples, \(d \approx 1 +\{\mathrm{Kurt}(X) - 3\}\{\mathrm{Kurt}(Y) - 3\}/2n\), revealing a double Type I error robustness to nonnormality: if either \(X\) or \(Y\) is approximately normally distributed, then the usual \(F\) approximation is very good. To illustrate numerically, recall \(r = -.53\) from the dose-response data of Section 12.6.5, where the Monte Carlo estimated one-sided \(p\)-value was \(\widehat{p} = .007\). Taking half of the \(F(1,18)\) \(p\)-value approximation for \(18{r}^{2}/(1 - {r}^{2}) = 7.03\), we get \(p\)-value = .008. Similarly, for Spearman's \({r}_{\mathrm{S}} = -.56\) we obtained previously \(\widehat{p} = .005\). Using one half of the \(F(1,18)\) \(p\)-value for \(18{r}_{\mathrm{S}}^{2}/(1 - {r}_{\mathrm{S}}^{2}) = 8.22\) yields \(p\)-value = .005.


Now let us move to the more complicated situation of the linear model,

$${Y}_{i} = {\beta }_{0} + {\beta }_{1}{X}_{1i} + {\beta }_{2}{X}_{2i} + {e}_{i},\quad i = 1,\ldots,n,$$

where we assume \({e}_{1},\ldots,{e}_{n}\) are iid from some distribution and independent of all the \({X}_{ij}\). As mentioned above, permuting the \(Y\)'s under the assumption \({H}_{0} : {\beta }_{1} = {\beta }_{2} = 0\) yields a suitable permutation distribution for testing independence of \(Y\) and \(({X}_{1},{X}_{2})\). Unfortunately, we are usually much more interested in testing \({H}_{0} : {\beta }_{2} = 0\) with \({\beta }_{0}\) and \({\beta }_{1}\) unrestricted. Without knowledge of \({\beta }_{1}\), however, an exact permutation procedure for \({H}_{0} : {\beta }_{2} = 0\) is not possible. (Actually, it is possible to take the maximum over permutation \(p\)-values for each value of \({\beta }_{1}\) in a confidence interval under \({H}_{0}\), as described in Berger and Boos [1994], but the loss in power is typically not worth the gain in exactness.) Anderson and Robinson [2001] review a number of different proposals that use residuals from first fitting the reduced model, and show that they are asymptotically correct but do not satisfy the assumptions of Theorem 12.1 (p. 457). Fortunately, standard linear model and rank-based linear model testing procedures have good Type I error robustness properties in general. The rank-based linear model methods given in Ch. 5 of Hettmansperger [1984] have good Type II error robustness properties as well. Similarly, the M-estimation regression methods discussed in Ch. 5 also have good robustness properties.

We conclude this section with an example that illustrates how easy it is to use Monte Carlo approximation in an autocorrelation setting.

Example 12.1 (Raleigh snowfall). 

Is the total snowfall in one year independent of the total snowfall in other years? The left panel of Figure 12.5 plots Raleigh, NC, annual snowfall for 1962–1991 versus year. The right panel plots each year's snowfall versus the previous year's snowfall. The sample correlation from the right panel is \(r = .32\). Does that suggest nonzero autocorrelation? The null hypothesis for a permutation approach is that the sequence of yearly snowfalls is iid or at least exchangeable. Below we give R code for sampling \(B\) permutations from the set of 30! possible permutations, computing the lag-1 sample correlation for each, and estimating the one-sided \(p\)-value for a positive autocorrelation. Using \(B = 10,000\), we get \(\widehat{p} = .027\) with standard error .0016. Thus there is good evidence of a positive autocorrelation. The main point here is to illustrate how easy it is to carry out the permutation test.

Fig. 12.5 Annual snowfall in Raleigh, NC, 1962–1991 (left panel) and annual snowfall versus annual snowfall of previous year (right panel)

r.auto <- function(x) {
  # Lag-1 sample autocorrelation via the Pearson correlation
  n <- length(x)
  cor(x[1:(n-1)], x[2:n])
}

perm1 <- function(b, x, stat, ...) {
  # Gives est. permutation p-value for vector x.
  # Assumes test rejects for large values of stat.
  call <- match.call()
  n <- length(x)
  t0 <- stat(x)
  res <- numeric(b)
  for (i in 1:b) {
    perm.xx <- sample(x)
    res[i] <- stat(perm.xx)
  }
  pvalue <- sum(res >= t0)/b
  se <- sqrt(pvalue*(1 - pvalue)/b)
  return(list(call = call,
              results = data.frame(nperm = b, stat0 = round(t0, 4),
                                   pvalue = pvalue, se = round(se, 5))))
}

> set.seed(2458)
> perm1(10000, raleigh.snow$snow, r.auto)
  nperm  stat0 pvalue      se
1 10000 0.3245 0.0269 0.00162

8 One-Sample Test for Symmetry about θ0 or Matched Pairs Problem

Fisher [1935] introduced the permutation approach for the matched-pairs problem in a discussion of Darwin’s data on self-fertilized and cross-fertilized plants. There were 15 pairs of plants, and the differences

$$49,-67,8,16,6,23,28,41,14,29,56,24,75,60,-48$$

have mean \(\overline{D} = 20.933\), \(s = 37.744\), and \(t = 2.148\) for testing \({H}_{0} : {\mu }_{\mathrm{D}} = 0\) versus \({H}_{a} : {\mu }_{\mathrm{D}}\neq 0\), where \({\mu }_{\mathrm{D}}\) is the population mean difference. The two-sided \(p\)-value is .0497 from the \(t\) table with 14 degrees of freedom. Alternatively, consider Fisher's permutation argument. There were \({2}^{15}\) possible random assignments of types of seeds to the 15 blocks of size 2. Thus, Fisher considered all \({2}^{15}\) sums \(\sum\limits_{i=1}^{15}{D}_{i}\), where \({D}_{i}\) is the \(i\)th difference, and found only 835 + 28 = 863 that are greater than or equal to the observed sum of 314. The two-sided \(p\)-value is (2)(863)/32,768 = .0527 (by symmetry there are 863 sums ≤ −314). Note that \(t = \sqrt{n}\,\overline{D}/s\) is permutationally equivalent to \(\sum\limits_{i=1}^{15}{D}_{i}\) because \(t\) is a monotonic function of \(\sum\limits_{i=1}^{15}{D}_{i}\) and otherwise depends only on \(\sum\limits_{i=1}^{15}{D}_{i}^{2}\), which is constant over all \({2}^{15}\) permutations.
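With only \({2}^{15} = 32{,}768\) sign assignments, Fisher's enumeration can be reproduced exactly. Here is a short R sketch using the differences listed above.

dd <- c(49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48)
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(dd))))  # all 2^15 sign vectors
sums <- signs %*% dd                      # every sign-change value of sum(D_i)
sum(sums >= sum(dd))                      # 863
2*mean(sums >= sum(dd))                   # two-sided p-value = .0527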

Let us consider the theory behind Fisher's approach. The population null model is that the differences \({D}_{1},\ldots,{D}_{n}\) are independent, each with a symmetric distribution about some \({\theta }_{0}\); often \({\theta }_{0} = 0\). The distributions do not need to be the same, merely symmetric about \({\theta }_{0}\). Thus

$${H}_{0} : {D}_{i} - {\theta }_{0}\stackrel{d}{=}{\theta }_{0} - {D}_{i},\quad i = 1,\ldots,n.$$
(12.35)

The group of transformations to be used with Theorem 12.1 (p. 457) is the set of \({2}^{n}\) sign changes applied to the data with \({\theta }_{0}\) subtracted. For notational simplicity, let \({D}_{i0} = {D}_{i} - {\theta }_{0}\), \(i = 1,\ldots,n\). Then, for example, if \(n = 4\), one such transformation is \((-,+,+,-)\). It would transform

$$({D}_{10},{D}_{20},{D}_{30},{D}_{40})$$
(12.36)

into

$$(-{D}_{10},{D}_{20},{D}_{30},-{D}_{40}).$$
(12.37)

Because of (12.35) and independence, all \({2}^{n}\) transformations of the original data have the same distribution. That is, under (12.35) and independence, the joint distribution of (12.36) is the same as (12.37), etc. Thus, the conditions of Theorem 12.1 (p. 457) apply with the group of sign changes, and Fisher's original method is a valid permutation approach.

8.1 Moments and Normal Approximation

Now let us abstract the above situation slightly in order to compute moments and approximations. Suppose that \({d}_{1},\ldots,{d}_{n}\) is a sequence of real constants, playing the role of the observed \({D}_{i} - {\theta }_{0}\) above. Let \({c}_{1},\ldots,{c}_{n}\) be iid random variables with \(P({c}_{i} = 1) = P({c}_{i} = -1) = 1/2\); these play the role of making the sign changes. Now consider the linear statistic \(T = \sum\limits _{i=1}^{n}{c}_{i}{d}_{i}\). Note that the \({c}_{i}\) are symmetrically distributed around 0, so that all odd moments of \({c}_{i}\) are 0 and all even moments are equal to 1. Then \(T\) is also symmetrically distributed about 0 with odd moments 0, \(\mathrm{E}({T}^{2}) =\mathrm{Var}(T) = \sum\limits _{i=1}^{n}{d}_{i}^{2}\), and \(\mathrm{E}({T}^{4}) = 3{(\sum\limits _{i=1}^{n}{d}_{i}^{2})}^{2} - 2\sum\limits _{i=1}^{n}{d}_{i}^{4}\). Now we give a Central Limit Theorem for \(T\). A more general version and proof are given in Hettmansperger [1984, p. 302–303].

Theorem 12.4.

Suppose that \({d}_{1},\ldots,{d}_{n}\) and \({c}_{1},\ldots,{c}_{n}\) are defined as above and

$$\frac{1} {n}\sum\limits _{i=1}^{n}{d}_{ i}^{2}\rightarrow {\sigma }^{2}< \infty \qquad \mbox{ as}\;\;n \rightarrow \infty.$$

Then

$$\frac{T} {\sqrt{\mbox{ Var} (T)}} = \frac{\sum\limits _{i=1}^{n}{c}_{ i}{d}_{i}} {{\left (\sum\limits _{i=1}^{n}{d}_{ i}^{2}\right )}^{1/2}}\stackrel{d}{\rightarrow }N(0,1)\qquad \mbox{ as}\;\;n \rightarrow \infty.$$

Now we apply this theorem to the permutation distribution of \(\sum\limits_{i=1}^{n}{D}_{i}\) when sampling from a population.

Theorem 12.5.

Suppose that \({D}_{1},\ldots,{D}_{n}\) are iid random variables satisfying (12.35) and with variance \({\sigma }^{2} < \infty\). Then the permutation distribution function of \(\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})\) under the group of sign changes satisfies

$${P}^{{_\ast}}\left \{ \frac{\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})}{\sqrt{n}\,\sigma } \leq x\right \}\stackrel{wp1}{\rightarrow }\Phi (x)\qquad \mbox{ for all}\;x\;\mbox{ as}\;\;n \rightarrow \infty,$$

where \(\Phi\) is the standard normal distribution function.

We have used the notation \({P}^{{_\ast}}\) to emphasize that the probability is taken with respect to the permutation distribution holding \({D}_{1},\ldots,{D}_{n}\) fixed. An alternative statement of the result is that the permutation distribution of \(\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})/\sqrt{n}\sigma \) converges in distribution to a standard normal distribution with probability 1. Note also that we could just as well have put \(\{\sum\limits _{i=1}^{n}{({D}_{i} - {\theta }_{0})}^{2}\}^{1/2}\) in place of \(\sqrt{n}\sigma \) in the conclusion, giving

$$\frac{\sum\limits _{i=1}^{n}({D}_{ i} - {\theta }_{0})} {{\left \{\sum\limits _{i=1}^{n}{({D}_{ i} - {\theta }_{0})}^{2}\right \}}^{1/2}}\stackrel{{d}^{{_\ast}}}{\rightarrow }\mbox{ N}(0,1)\qquad \mbox{ as}\;\;n \rightarrow \infty \quad wp1.$$
(12.38)

The result follows from Theorem 12.4 because, for each infinite sequence \({D}_{1}(\omega ),{D}_{2}(\omega ),\ldots\) where \(\omega \in \Omega \) with \(P(\Omega ) = 1\),

$$\frac{1} {n}\sum\limits _{i=1}^{n}{({D}_{ i}(\omega ) - {\theta }_{0})}^{2}\rightarrow {\sigma }^{2}\qquad \mbox{ as}\;\;n \rightarrow \infty $$

by the Strong Law of Large Numbers. For each of these sequences, Theorem 12.4 holds, and thus the convergence in distribution holds with probability 1.

8.2 Box-Andersen Approximation

The Box-Andersen adjusted \(F\) approximation to the permutation distribution of \(\sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})\) uses the beta version of \({t}^{2} = n{(\overline{D} - {\theta }_{0})}^{2}/{s}^{2}\),

$$b({t}^{2}) = \frac{{t}^{2}} {n - 1 + {t}^{2}} = \frac{n{(\overline{D} - {\theta }_{0})}^{2}} {\sum\limits _{i=1}^{n}{({D}_{ i} - {\theta }_{0})}^{2}}.$$

Under an iid normal distribution assumption for \({D}_{1},\ldots,{D}_{n}\), \(b({t}^{2})\) has a beta\((1/2,(n - 1)/2)\) distribution with mean \(1/n\) and variance \(2(n - 1)/\{{n}^{2}(n + 2)\}\). Using the results in the previous section for \(T = \sum\limits _{i=1}^{n}{c}_{i}{d}_{i}\), where \({d}_{i} = ({D}_{i} - {\theta }_{0})/n\), the permutation moments of \(b({t}^{2})\) are \(\mathrm{{E}}_{\mathrm{P}}\{b({t}^{2})\} = 1/n\) and

$$\mathrm{{Var}}_{\mathrm{P}}\{b({t}^{2})\} = \frac{2(n - 1)} {{n}^{2}(n + 2)}\left (1 -\frac{{f}_{2} - 3} {n - 1} \right ),$$
(12.39)

where \({f}_{2} = (n + 2)\sum\limits _{i=1}^{n}{({D}_{i} - {\theta }_{0})}^{4}/\{\sum\limits _{i=1}^{n}{({D}_{i} - {\theta }_{0})}^{2}\}^{2}\). Equating the permutation moments to those of a beta\((d/2,d(n - 1)/2)\) distribution leads to

$$d = 1 + \frac{{f}_{2} - 3} {n\{1 - {f}_{2}/(n + 2)\}}.$$
(12.40)

In the above derivation we have followed the notation in Box and Andersen [1955, p. 9], but their \(W\) is \(1 - b({t}^{2})\), and we relabeled their \({b}_{2}\) as \({f}_{2}\). Note that \({f}_{2}\) is close to the sample kurtosis of the \({D}_{i} - {\theta }_{0}\), and thus \(d \approx 1 +\{\mathrm{Kurt}(D) - 3\}/n\).

For the Darwin data, \(d = .94\) and the \(F\) adjusted two-sided \(p\)-value is .053. Recall from the previous analysis that the exact two-sided permutation \(p\)-value is .0527. The normal approximation here is \(Z = 1.9282\) with two-sided \(p\)-value = .054. Thus, the normal approximation is surprisingly good here, better than the \(F = {t}^{2}\) approximation that Fisher gave (.0497), but the Box-Andersen adjustment has made the \(F\) approximation slightly better than the normal approximation.
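These Darwin-data numbers are easy to reproduce; the R sketch below computes \({f}_{2}\), the adjustment \(d\) from (12.40), and the adjusted \(F\) \(p\)-value by referring \({t}^{2}\) to an \(F(d,d(n - 1))\) distribution.

dd <- c(49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48)
n <- length(dd)
t2 <- n*mean(dd)^2/var(dd)                     # t^2 = 2.148^2
f2 <- (n + 2)*sum(dd^4)/sum(dd^2)^2
d <- 1 + (f2 - 3)/(n*(1 - f2/(n + 2)))         # d = .94
pf(t2, d, d*(n - 1), lower.tail = FALSE)       # adjusted two-sided p-value = .053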

8.3 Signed Rank Methods

Now we turn to signed rank methods. Here again for simplicity we use the notation \({D}_{i0}\) for \({D}_{i} - {\theta }_{0}\). Let \({R}_{i}\) be the rank of \(\vert {D}_{i0}\vert\) among \(\vert {D}_{10}\vert ,\ldots,\vert {D}_{n0}\vert\). Let the sign function be defined by \(\mbox{ sign}(x) = I(x > 0) - I(x < 0)\), so that sign(0) = 0. Then the signed rank of \({D}_{i0}\) is \(\mbox{ sign}({D}_{i0}){R}_{i}\), although some authors use \(I({D}_{i0} > 0){R}_{i}\) as the definition of the signed rank. We illustrate with a simple data set from Wilcoxon [1945] on the difference between wheat yields in two treatments in 8 blocks:

\({D}_{i0}\)                        58    32    30    5    −7    6    11    10
\({R}_{i}\)                          8     7     6    1     3    2     5     4
\(\mbox{sign}({D}_{i0}){R}_{i}\)     8     7     6    1    −3    2     5     4
\(I({D}_{i0} > 0){R}_{i}\)           8     7     6    1     0    2     5     4

Then define \({W}^{+} = \sum\limits _{i=1}^{n}I({D}_{i0} > 0){R}_{i}\), \({W}^{-} = \sum\limits _{i=1}^{n}I({D}_{i0} < 0){R}_{i}\), and \(W = \sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0}){R}_{i}\). As long as there are no ties in the data, all three of these are equivalent, and \(W = {W}^{+} - {W}^{-}\). For the above sample we have \({W}^{+} = 33\), \({W}^{-} = 3\), and \(W = 30\). It is perhaps more standard to call \({W}^{+}\) the Wilcoxon Signed Rank statistic. Under (12.35, p. 491) and continuity of the data (implying no ties with probability 1), the basic facts are that:

  1. \(\mbox{sign}({D}_{10}),\ldots,\mbox{sign}({D}_{n0})\) and \(I({D}_{10} > 0),\ldots,I({D}_{n0} > 0)\) are independent of \(\vert {D}_{10}\vert ,\ldots,\vert {D}_{n0}\vert\) and thus also independent of \({R}_{1},\ldots,{R}_{n}\);

  2. \({W}^{+}\stackrel{d}{=}{W}^{-}\stackrel{d}{=}\sum\limits _{i=1}^{n}I({D}_{i0} > 0)\,i\), and \(I({D}_{10} > 0),\ldots,I({D}_{n0} > 0)\) are independent Bernoulli(1/2) random variables;

  3. \(W\stackrel{d}{=}\sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0})\,i\), and \(\mbox{sign}({D}_{10}),\ldots,\mbox{sign}({D}_{n0})\) are iid with \(P(\mbox{ sign}({D}_{i0}) = 1) = 1/2\);

  4. $$\mathrm{E}({W}^{+}) = \frac{1}{2}\sum\limits _{i=1}^{n}i = \frac{n(n + 1)}{4},\quad \mathrm{Var}({W}^{+}) = \frac{1}{4}\sum\limits _{i=1}^{n}{i}^{2} = \frac{n(n + 1)(2n + 1)}{24};$$

  5. $$\mathrm{E}(W) = 0,\quad \mathrm{Var}(W) = \sum\limits _{i=1}^{n}{i}^{2} = \frac{n(n + 1)(2n + 1)}{6}.$$

For the simple example above with \(n = 8\), we have \(\mathrm{E}({W}^{+}) = (8)(9)/4 = 18\) and \(\mathrm{Var}({W}^{+}) = (8)(9)(17)/24 = 51\), leading to the standardized value \((33 - 18)/\sqrt{51} = 2.1\), which is clearly the same for \({W}^{-}\) and \(W\) as well. From a normal table, we get the right-tailed \(p\)-value .018, whereas the exact permutation \(p\)-value for the signed rank statistics is \(5/256 = .01953\).
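The exact \(p\)-value is available directly from R's wilcox.test, which enumerates the null distribution of \({W}^{+}\) when \(n\) is small and there are no ties.

dd <- c(58, 32, 30, 5, -7, 6, 11, 10)
wilcox.test(dd, alternative = "greater")   # V = 33, exact p-value = 0.01953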

Although the Wilcoxon Signed Rank is by far the most important of the signed rank procedures, the general signed rank procedures are \({T}^{+} = \sum\limits _{i=1}^{n}I({D}_{i0} > 0)a({R}_{i})\), \({T}^{-} = \sum\limits _{i=1}^{n}I({D}_{i0} < 0)a({R}_{i})\), and

$$T = \sum\limits _{i=1}^{n}\mbox{ sign}({D}_{ i0})a({R}_{i}),$$
(12.41)

where the scores \(a(i)\) could be of any form. The analogues of the above properties for \(W\) hold for the general signed rank statistics. In particular, \(T\stackrel{d}{=}\sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0})a(i)\) simplifies the distribution and moment calculations in the case of no ties. In the case of ties, the permutation variance of \(T\), given the midranks \({R}_{1},\ldots,{R}_{n}\), is \(\sum\limits _{i=1}^{n}\{a({R}_{i})\}^{2}\). Thus, for the normal approximation, it is simplest to use the form

$$Z = \sum\limits _{i=1}^{n}\mbox{ sign}({D}_{i0})a({R}_{i})/{\left [\sum\limits _{i=1}^{n}\{a({R}_{i})\}^{2}\right ]}^{1/2},$$
(12.42)

which automatically adjusts for ties (see Section 12.8.6, p. 497, for a discussion of ties).

The most well-known score functions are \(a(i) = i\) for the Wilcoxon, the quantile normal scores \(a(i) = {\Phi }^{-1}(1/2 + i/[2(n + 1)])\), and the sign test scores \(a(i) = 1\). These are asymptotically optimal for shifts in the center of symmetry \({\theta }_{0}\) of the logistic distribution, the normal distribution, and the Laplace distribution, respectively. For asymptotic analysis we assume \(a(i) = {\phi }^{+}(i/(n + 1))\), where \({\phi }^{+}(u)\) is nonnegative and nondecreasing and \({\int\nolimits }_{0}^{1}{[{\phi }^{+}(u)]}^{2}\,du < \infty\). The asymptotically optimal general form for data with density \(f(x - {\theta }_{0})\) and \(f(x) = f(-x)\) is

$${\phi }^{+}(u) = -\frac{f^{\prime}\left \{{F}^{-1}\left (\frac{1} {2} + \frac{u} {2}\right )\right \}} {f\left \{{F}^{-1}\left (\frac{1} {2} + \frac{u} {2}\right )\right \}}.$$

Asymptotic normality is similar to Theorem 12.5 (p. 492) (see for example, Theorem 10.2.5, p. 333 of Randles and Wolfe, 1979). The Edgeworth expansion forW  +  andT  +  may be found on p. 37 and p. 89, respectively, of Hettmansperger [1984].

8.4 Sign Test

The sign test mentioned in the last section as (12.41) witha(i) = 1 is usually given in the form\({T}^{+} = \sum\limits _{i=1}^{n}I({D}_{i0} > 0)\), the number of positive differences. Under the null hypothesis (12.35, p. 491),T  +  has a binomial(n, 1 ∕ 2) distribution and is extremely easy to use. Because of this simple distribution,T  +  is often given early in a nonparametric course to illustrate exact null distributions.

The sign test does not require symmetry of the distributions to be valid. It can be used as a test of H 0 : median of \({D}_{i} - {\theta }_{0} = 0\), where it is assumed only that D 1, …, D n are independent, each with the same median. Thus, the test is often used in skewed distributions to test that the median has value θ 0. This generality, though, comes with a price because typically the sign test is not as powerful as the signed rank or t test in situations where all three are valid. If there are zeroes in D 1, …, D n, the standard approach is to remove them before applying the sign test.

8.5 Pitman ARE for the One-Sample Symmetry Problem

In the Appendix, we give some details for finding expressions for the efficacy and Pitman efficiency of tests for the one-sample symmetry problem. Here we just report some Pitman AREs in Table 12.4 for the sign test, the t test, and the Wilcoxon signed rank. The comparisons of the signed rank and the t are very similar to those given in Table 12.3 (p. 477) for the two-sample problem. The only difference is that skewed distributions are allowed in the shift problem but not here.

Table 12.4 Pitman ARE’s for the One-Sample Symmetry Problem

The general message from Table 12.4 is that the tails of the distribution must be very heavy compared to the normal distribution in order for the sign test to be preferred. This is a little unfair to the sign test because symmetry off is not required for the sign test to be valid, whereas symmetry is required for the Wilcoxon signed rank test. In fact Hettmansperger [1984, p. 10–12] shows that the sign test is uniformly most powerful among size-α tests if no shape assumptions are made about the density off. Moreover, in the matched pairs situation where symmetry is justified by differencing, the uniform distribution is not possible, and that is where the sign test performs so poorly.

Monte Carlo power estimates in Randles and Wolfe [1979, p. 116] show that generally the ARE results in Table 12.4 correspond qualitatively to power comparisons. For example, at n = 10 and normal alternative \({\theta }_{0} +.4\sigma \), the Wilcoxon signed rank has power .330 compared to .263 for the sign test. The ratio \(.263/.330 =.80\) is not too far from ARE = .64. The estimated power ratio at n = 20 is \(.417/.546 =.76\). The Laplace distribution AREs in Table 12.4 are not as consistent. For example, at n = 20 for a similar alternative, the ratio is \(.644/.571 = 1.13\), not all that close to ARE = 2.00.

The Wilcoxon signed rank test is seen to have good power relative to the sign test and to the t test. The Hodges and Lehmann [1956] result that ARE(W +, t) ≥ .864 also holds here for all symmetric unimodal densities. Coupled with the fact that there is little loss of power relative to the t test at the normal distribution (ARE(\({W}^{+},t) = 0.955\)), W + should be the statistic of choice in many situations.

8.6 Treatment of Ties

The general permutation approach is not usually bothered by ties in the data, although rank methods typically require some thought about how to handle the definition of ranks in the case of ties. For the original situation of n pairs of data and a well-defined statistic like the paired t statistic, the \({2}^{n}\) permutations of the data merely yield redundancy if members of a pair are equal. For example, consider n = 3 and the following data with all 8 permutations (1 is the original data pairing):

Permutation    1     2     3     4     5     6     7     8
Pair 1        3,5   5,3   3,5   5,3   3,5   5,3   3,5   5,3
Pair 2        2,2   2,2   2,2   2,2   2,2   2,2   2,2   2,2
Pair 3        7,4   7,4   4,7   4,7   7,4   7,4   4,7   4,7

Permutations 1–4 are exactly the same as permutations 5–8 because permuting the 2nd pair has no effect. Thus, a permutation p-value defined from just permutations 1–4 is exactly the same as for using the full set 1–8. After taking differences between members of each pair, the \({2}^{n}\) sign changes work in the same way by using sign(0) = 0; that is, there is the same kind of redundancy in that there are really just \({2}^{n-{n}_{0}}\) unique permutations, where n 0 is the number of zero differences.

For signed rank statistics, there are two kinds of ties to consider after converting to differences: multiple zeros and multiple non-zero values. For the non-zero multiple values, we just use midranks (average ranks) as before. For the multiple zeros, there are basically two recommended approaches:

Method 1: Remove the differences that are zero and proceed with the reduced sample in the usual fashion. This is the simplest approach and the most powerful for the sign statistic (see Lehmann 1975, p. 144). Pratt and Gibbons [1981, p. 169] discuss anomalies when using this procedure with W +.

Method 2: First rank all \(\vert {D}_{10}\vert,\ldots,\vert {D}_{n0}\vert \). Then remove the ranks associated with the zero values before getting the permutation distribution of the rank statistic, but do not change the ranks associated with the non-zero values. However, as above, since the permutation distribution is the same with and without the redundancy, it really just makes the computing easier to remove the ranks associated with the zero values. The normal approximation in (12.42, p. 495) automatically eliminates the ranks associated with the zero values because sign(0) = 0. For the Box-Andersen approximation, the degrees of freedom are different depending on whether the reduced set is used or not. It appears best to use the reduced set for the Box-Andersen approximation, although a few zero values make little difference.

Example 12.2 (Fault rates of telephone lines). 

Welch [1987] gives the difference (times \(1{0}^{5}\)) of a transformation of telephone line fault rates for 14 matched areas. We modify the data by dividing by 10 and rounding to 2 digits, leading to

D i0             − 99    31    27    23    20     20     19    − 14    11    9    8      − 8     6     0
sign(D i0)R i    − 14    13    12    11    9.5    9.5    8     − 7     6     5    3.5    − 3.5   2     0

Notice that there are two ties in the absolute values, 20 and 8, for which the midranks are given. The exact right-tailed permutation p-value based on the t statistic is .38, whereas the t tables give .33 and the Box-Andersen approximation is .40. The large outlier − 99 essentially kills the power of the t statistic. The sign test first removes the 0 value, and then the binomial probability of getting 10 or more positives out of 13 is .046. Welch [1987] used the sample median as a statistic, and for these data we get exact p-value .062. Note that the mean and sum and t statistic are all permutationally equivalent, but the median is not permutationally equivalent to using a Wald statistic based on the median. So, the properties of using the median as a test statistic are not totally clear.
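The binomial calculation for the sign test can be checked directly from the binomial cdf; a one-line R sketch:

pbinom(9, 13, 0.5, lower.tail = FALSE)  # P(binomial(13, 1/2) >= 10) = .046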

For the Wilcoxon Signed Rank, no tables can be used because of the ties and the 0. However, it is straightforward to get the permutation distribution after choosing one of the methods above for dealing with the 0 difference.

Method 1: First remove the 0, then rank. The remaining data are

D i0             − 99    31    27    23    20     20     19    − 14    11    9    8      − 8     6
sign(D i0)R i    − 13    12    11    10    8.5    8.5    7     − 6     5     4    2.5    − 2.5   1

The exact p-value based on the sign(D i0)R i values above (for example, just insert the signed ranks into the R program below) is .048, the normal approximation is .047, and the Box-Andersen approximation is .049.

Method 2: Rank the data first, then throw away the signed rank associated with the 0. The exact p-value is .044. Recall, for the permutation p-value, it does not matter whether we drop the 0 or not after ranking. Similarly, the normal approximation p-value .042 based on (12.42, p. 495) automatically handles the 0 value. For the Box-Andersen approximation, we get .0437 based on all 14 signed ranks and .0441 after throwing out the 0; so it matters very little whether we include the 0 or not.

For problems with n ≤ 20, the following R code, modified from Venables and Ripley [1997, p. 189-190], gives the exact permutation p-value for signed statistics:

perm.sign <- function(d, stat, pr = FALSE, ...) {
  # Exact permutation p-value for the one-sample problem.
  # Assumes the test rejects for large values of stat.
  # Looks at all 2^n sign-change samples; use only for small n.
  # Helper giving the binary representation of x
  bi <- function(x, digits = if (x > 0) 1 + floor(log(x, base = 2)) else 1) {
    ans <- 0:(digits - 1)
    (x %/% 2^ans) %% 2  # note %/% and %% are different
  }
  # The main program
  t0 <- stat(d, ...)
  digits <- length(d)
  b <- 2^digits
  res <- numeric(b)
  for (i in 1:b) {
    x <- d * 2 * (bi(i, digits = digits) - 0.5)  # one sign-change sample
    res[i] <- stat(x, ...)
    if (pr) cat(i, x, res[i], fill = TRUE)  # prints
  }
  pvalue <- sum(res >= t0)/b
  co <- sum(res == t0)
  return(data.frame(b = b, stat0 = round(t0, 4),
                    eq.t0 = co, rt.pvalue = pvalue, pv2 = 2 * pvalue))
}
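As a usage sketch (the inputs are our choices, not part of the original listing), inserting the Method 1 signed ranks of Example 12.2 with stat = sum reproduces the exact right-tailed p-value .048 reported above:

sr <- c(-13, 12, 11, 10, 8.5, 8.5, 7, -6, 5, 4, 2.5, -2.5, 1)
perm.sign(sr, stat = sum)  # rt.pvalue = .048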

9 Randomized Complete Block Data—the Two-Way Design

Blocking is one of the most important techniques for reducing variation in experimental designs. The usual Randomized Complete Block design may be viewed as a generalization of matched pairs to situations with more than two treatments. To use the permutation argument with blocked data, we do not need the treatments to be assigned randomly, but it is most natural to discuss blocked data in that context. The key assumption required under H 0 is that the data are exchangeable within blocks.

Suppose that k treatments are to be assigned at random within each block of size k. For n blocks, there are \({(k!)}^{n}\) possible permutations of the data corresponding to permuting independently among treatments within each block. In the following table there are k = 10 treatments in n = 4 blocks, thus \({M}_{N} = {(10!)}^{4} \approx 1.73 \times 1{0}^{26}\) possible permutations. These data are actually treatments 6–15 from an example of aphid infestation of crepe myrtle cultivars given in Table 1 of Brownie and Boos [1994]. The response variable is the number of aphids on the three most heavily infested leaves plus the percent of foliage covered with sooty mold.

The linear model representation is

$${Y }_{ij} = \mu+ {\beta }_{i} + {\alpha }_{j} + {e}_{ij},$$
(12.43)

where α 1, …, α k are the treatment effects, and β 1, …, β n are the block effects. Note that we have switched subscripts on Y ij compared to the one-way model (12.28, p. 480) so that the blocks can be the rows. Often the block effects are assumed random, but the nonparametric literature typically considers them fixed effects.

 

                              Treatments
Block     1     2     3     4     5     6     7     8     9     10
1         0     0     93    78    5     1     0     21    1     1
2         0     24    0     3     2     180   0     0     3     9
3         0     2     10    0     0     3     2     3     3     140
4         0     4     2     2     0     0     1     47    1     52

The usual ANOVAF statistic could be used with these data:

$$F = \frac{ \frac{1} {k - 1}\sum\limits _{j=1}^{k}n{({\overline{Y }}_{.j} -{\overline{Y }}_{..})}^{2}} { \frac{1} {(k - 1)(n - 1)}\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{({Y }_{ ij} -{\overline{Y }}_{i.} -{\overline{Y }}_{.j} +{ \overline{Y }}_{..})}^{2}},$$
(12.44)

where \({\overline{Y }}_{i.} = {k}^{-1} \sum\limits _{j=1}^{k}{Y }_{ij}\), \({\overline{Y }}_{.j} = {n}^{-1} \sum\limits _{i=1}^{n}{Y }_{ij}\), and \({\overline{Y }}_{..} = {n}^{-1} \sum\limits _{i=1}^{n}{\overline{Y }}_{i.}\). For the above data F = 0.80 with p-value = 0.62 from an F distribution with 9 and 27 degrees of freedom. Since the F distribution approximates the permutation distribution, the value 0.62 should be satisfactory. A Monte Carlo approximation to the exact permutation p-value based on 10,000 samples gave .60 with standard error .005, thus confirming the Type I error robustness of the usual F procedure. However, the nonnormality of the response variable is cause for concern because the F statistic is not Type II error robust in the face of outliers. Transformations are an obvious approach, and F on log(Y ij + 1) resulted in p-value = .29. Fortunately, with rank procedures we do not have to guess the correct transformation.
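As a check on these numbers, the following minimal R sketch carries out the Monte Carlo permutation test by permuting within blocks; the matrix Y holds the data above, and Fstat (our name) implements (12.44):

Y <- rbind(c(0, 0, 93, 78, 5, 1, 0, 21, 1, 1),
           c(0, 24, 0, 3, 2, 180, 0, 0, 3, 9),
           c(0, 2, 10, 0, 0, 3, 2, 3, 3, 140),
           c(0, 4, 2, 2, 0, 0, 1, 47, 1, 52))
Fstat <- function(Y) {  # two-way ANOVA F of (12.44)
  n <- nrow(Y); k <- ncol(Y)
  num <- n * sum((colMeans(Y) - mean(Y))^2)/(k - 1)
  resid <- Y - outer(rowMeans(Y), rep(1, k)) -
    outer(rep(1, n), colMeans(Y)) + mean(Y)
  num/(sum(resid^2)/((k - 1) * (n - 1)))
}
F0 <- Fstat(Y)  # 0.80 for these data
set.seed(1)
perm <- replicate(10000, Fstat(t(apply(Y, 1, sample))))  # permute within blocks
mean(perm >= F0)  # approximately .60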

9.1 Friedman’s Rank Test

The standard rank procedure was introduced by Friedman [1937]. For the untied case, it has the form

$$T = \frac{12n} {k(k + 1)}\sum\limits _{j=1}^{k}{\left ({\overline{R}}_{.j} -\frac{k + 1} {2} \right )}^{2},$$
(12.45)

where R ij is the rank of Y ij within the ith row, and \({\overline{R}}_{.j} = {n}^{-1} \sum\limits _{i=1}^{n}{R}_{ij}\) is the jth treatment mean rank. Note that \((k + 1)/2\) is \({\overline{R}}_{..}\) since the average of the integers 1 to k is \((k + 1)/2\). The within-row ranks R ij for the above table are

                              Treatments
Block     1      2      3      4      5      6      7      8      9      10
1         2      2      10     9      7      5      2      8      5      5
2         2.5    9      2.5    6.5    5      10     2.5    2.5    6.5    8
3         2      4.5    9      2      2      7      4.5    7      7      10
4         2      8      6.5    6.5    2      2      4.5    9      4.5    10

We see immediately that there are numerous ties in the data. The form of the Friedman statistic that accommodates ties is (see, for example, Conover and Iman, 1981, p. 126)

$$T = \frac{(k - 1){n}^{2} \sum\limits _{j=1}^{k}{\left ({\overline{R}}_{.j} -\frac{k + 1} {2} \right )}^{2}} {\left (\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{R}_{ ij}^{2}\right ) -\frac{nk{(k + 1)}^{2}} {4} }.$$
(12.46)

Under the null hypothesis of identical treatments, T converges to a \({\chi }_{k-1}^{2}\) distribution as n → \(\infty \) and k remains fixed. For the above data, T = 13.7732, and comparing to a \({\chi }_{9}^{2}\) distribution gives p-value = .13. However, as in the one-way design, the \({\chi }^{2}\) approximation becomes increasingly conservative as the number of treatments gets large relative to the number of blocks. F distribution p-values provide much better approximations and can be justified by either asymptotic theory or the Box-Andersen permutation moment approximations.
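The tied-rank form (12.46) is easily computed; a short R sketch (reusing the matrix Y from the sketch in Section 9; the other names are ours):

R <- t(apply(Y, 1, rank))  # within-block midranks
n <- nrow(R); k <- ncol(R)
Tstat <- (k - 1) * n^2 * sum((colMeans(R) - (k + 1)/2)^2)/
  (sum(R^2) - n * k * (k + 1)^2/4)             # 13.77
pchisq(Tstat, df = k - 1, lower.tail = FALSE)  # p-value approximately .13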

9.2 F Approximations

Friedman [1937, pp. 694–695] conjectured that the Friedman statistic is asymptotically normal as k → \(\infty \) with mean k − 1 and variance 2(n − 1)(k − 1) ∕ n (a proof may be found in Lemma 4 of Brownie and Boos, 1994). Similar to the one-way design, this asymptotic normal result is consistent with applying the F statistic (12.44, p. 500) to the within-row Friedman ranks and then using the F(k − 1, (k − 1)(n − 1)) distribution for p-values. This argument is fleshed out in Problem 12.22 (p. 528). Of course, the F distribution should be used in practice; the asymptotic normal result just supports use of the F distribution.

From Box and Andersen [1955, p. 14-15], we may approximate the permutation distribution of F of (12.44, p. 500), or of the same F applied to the within-row Friedman ranks, by an F(d(k − 1), d(k − 1)(n − 1)) distribution, where

$$d = 1 + \frac{(nk - n + 2){V }_{2} - 2n} {n(k - 1)(n - {V }_{2})},$$
$${V }_{2} = \frac{1} {n - 1}\sum\limits _{i=1}^{n}{({s}_{ i}^{2} -{\overline{s}}^{2})}^{2}/{({\overline{s}}^{2})}^{2},$$

and the s i 2 are the within-row variances, and \({\overline{s}}^{2} = {n}^{-1} \sum\limits _{i=1}^{n}{s}_{i}^{2}\). In the case of the Friedman ranks with no ties in the data, d = 1 − 2 ∕ {n(k − 1)}. For the Crepe Myrtle data this latter expression is d = .944, the same (to three decimals) as the actual d value from the tied ranks.
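A brief R sketch of this adjustment (reusing R and Fstat from the earlier sketches; s2, V2, and d follow the formulas above):

n <- nrow(R); k <- ncol(R)
s2 <- apply(R, 1, var)  # within-row variances s_i^2
V2 <- sum((s2 - mean(s2))^2)/((n - 1) * mean(s2)^2)
d <- 1 + ((n * k - n + 2) * V2 - 2 * n)/(n * (k - 1) * (n - V2))  # .944
FR <- Fstat(R)  # F applied to the Friedman ranks
pf(FR, d * (k - 1), d * (k - 1) * (n - 1), lower.tail = FALSE)  # about .11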

We summarize the various approximations in the following table:

                 Approximate P-Values for the Crepe Myrtle Data

                     Monte Carlo    F(9, 27)    Box-And. F(9d, 27d)    χ 9 2
Friedman                .10            -                -               .13
F R                     .10           .10              .11               -
F on Y                  .60           .62              .63               -
F on log(Y + 1)         .29           .29              .30               -

The Monte Carlo estimates are based on 10,000 random permutations and have standard error bounded by .005. The F approximations are good, but the Box-Andersen adjustments do not help here. Interestingly, d = 1.08 for the usual F (row 3), but the p-value is adjusted upwards because F = .80 is so small. Typically, a d value greater than 1 lowers the p-value from the F approximation.

 

9.3 Pitman ARE for Blocked Data

From van Elteren and Noether [1959] we find the surprising result that the Pitman asymptotic relative efficiency of the Friedman test to the ANOVAF depends on the number of treatmentsk,

$$\mbox{ ARE(Friedman},F) = \left \{ \frac{k} {k + 1}\right \}12{\sigma }^{2}{\left \{{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)\,dx\right \}}^{2},$$
(12.47)

where σ2 is the variance of the observations. Expression (12.47) is just\(k/(k + 1)\) times the ARE(W, t) in (12.25, p. 477). Table 12.5 gives a few values of (12.47) for several distributions.

Table 12.5 Pitman ARE of the Friedman Test to theF Test

The value .64 at k = 2 for the normal distribution is the same as the ARE of the sign test to the t in Table 12.4 (p. 496). That is no accident. It turns out that for k = 2, the Friedman test is equivalent to the sign test. (The other values in Table 12.4, p. 496, do not correspond to the k = 2 values in Table 12.5 because Table 12.4 refers to the distribution after taking differences, whereas Table 12.5 is for the distribution of the individual treatment results, not the difference of treatment results. For the normal distribution, the difference of normal random variables is also normally distributed; so for the normal the results are the same in both tables.)

The reason for the low efficiency in Table 12.5 is that ranking within rows (intrablock ranking) takes no advantage of between block (interblock) information. For thek = 2 case, the Wilcoxon signed rank statistic uses interblock information by ranking the absolute differences (note the improved efficiencies in Table 12.4, p. 496, for the signed rank test compared to the sign test). In the next section we discuss some rank approaches that use interblock information.

9.4 Aligned Ranks and the Rank Transform

Many approaches have been used to remedy the low efficiency in Table 12.5 for small values ofk. Perhaps the earliest approach (and still one of the best) is the aligned rank method due to Hodges and Lehmann [1962]. The aligned rank approach is to first subtract the block mean (or any other location measure such as the median) from each observationY ij , then rank all the resultingnk residuals together. These latter ranks on the residuals, denoted\(\widehat{{R}}_{ij}\), are calledaligned ranks. We suggest usingF of (12.44, p. 500) on these aligned ranks.

Actually, Sen [1968] and Lehmann [1975, p. 272] use

$$\widehat{Q} = \frac{{n}^{2}(k - 1)\sum\limits _{j=1}^{k}{\left ({\overline{\widehat{R}}}_{.j} -\frac{nk + 1} {2} \right )}^{2}} {\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{\left (\widehat{{R}}_{ ij} -{\overline{\widehat{R}}}_{i.}\right )}^{2}},$$
(12.48)

a statistic that is asymptotically χ k − 1 2 underH 0. The justification for the form (12.48) comes from noting that the permutation mean of\({\overline{\widehat{R}}}_{.j}\) is\((nk + 1)/2\), and the permutation covariance matrix of\(({\overline{\widehat{R}}}_{.1},\ldots,{\overline{\widehat{R}}}_{.k})\) is

$$\frac{{\sigma }^{2}k} {k - 1}\mbox{ diag}\left ({I}_{k} -\frac{{\mathbf{1}}_{k}{\mathbf{1}}_{k}^{T}} {k} \right ),$$
(12.49)

where\({I}_{k}\) is thek-dimensional identity matrix,1 k is a vector of ones, and

$${\sigma }^{2} = \frac{1} {{n}^{2}k}\sum\limits _{i=1}^{n} \sum\limits _{j=1}^{k}{(\widehat{{R}}_{ ij} -{\overline{\widehat{R}}}_{i.})}^{2}$$
(12.50)

is the permutation variance of\({\overline{\widehat{R}}}_{.j}\).\(\widehat{Q}\) in (12.48) is the appropriate quadratic form in\(({\overline{\widehat{R}}}_{.1},\ldots,{\overline{\widehat{R}}}_{.k})\) upon noting that\((k - 1){I}_{k}/(k{\sigma }^{2})\) is a generalized inverse of the covariance matrix (12.49).

Other authors (Fawcett and Salter, 1984, and O’Gorman, 2001) use a one-way ANOVAF on the aligned ranks, but we prefer the two-wayF of (12.44, p. 500) because the Box-Andersen adjustment is readily available. All three statistics,\(\widehat{Q}\) and the twoF statistics on the aligned ranks, are permutationally equivalent to the numerator of\(\widehat{Q}\); so if exact or Monte Carlo approximations are used, it does not matter which of the three statistics is chosen. Clearly, either of the twoFs gives better approximatep-values than\(\widehat{Q}\) with χ k − 1 2 p-values.

Mehra and Sarangi [1967] give somewhat complicated formulas for the Pitman ARE of the aligned rank approach to the usualF and to Friedman’s statistic, but the bottom line is that the AREs of the aligned rank procedure to the usualF are close to the last column of Table 12.5 (p. 503). Thus, the aligned rank approach is able to recover most of the interblock information.

Another approach to recovering the interblock information is to just rank all the observations together and apply F of (12.44, p. 500) to the resulting ranks. This rank transform approach, due to Conover and Iman [1981], works well as long as the block effects are not strong. When the block effects are strong, this approach is similar to Friedman's test. Hora and Iman [1988] give Pitman ARE results for this approach.

There is an extensive literature on rank methods in block models. Mahfoud and Randles [2005] and Kepner and Wackerly [1996] are several places that briefly review many of the approaches. The latter also gives extensions to incomplete blocks.

9.5 Replications within Blocks

In the preceding discussion we have been talking about cases where there is just one observation per cell, nk total observations for n blocks and k treatments, and no block by treatment interaction. Consider the k = 2 case and n blocks where there are m i Xs for the first treatment in block i and n i Ys for the second treatment, i = 1, …, n. Data of this type arise naturally in clinical trials at n centers or sites. The sites might be hospitals or clinics or individual doctors. The usual rank approach is the van Elteren statistic (van Elteren, 1960, or Lehmann 1975, p. 145), a weighted sum of individual Wilcoxon rank sum statistics W i within each block,

$${W}_{\mathrm{VE}} = \sum\limits _{i=1}^{n} \frac{{W}_{i}} {{m}_{i} + {n}_{i} + 1}.$$

van Elteren [1960] showed that the weights\(1/({m}_{i} + {n}_{i} + 1)\) are asymptotically optimal among all linear combinations of theW i . This optimality makes sense if we write the standardized version ofW VE as

$$\frac{\sum\limits _{i=1}^{n} \frac{1} {{\sigma }_{0}^{2}(\widehat{{\theta }}_{i})}\left (\widehat{{\theta }}_{i} -\frac{1} {2}\right )} {{\left \{\sum\limits _{i=1}^{n} \frac{1} {{\sigma }_{0}^{2}(\widehat{{\theta }}_{i})}\right \}}^{1/2}},$$
(12.51)

where\(\widehat{{\theta }}_{i}\) is the Mann-Whitney estimator of\({\theta }_{i} = P({Y }_{i1} > {X}_{i1}) + (1/2)P({Y }_{i1} = {X}_{i1})\) given in (12.14, p. 463) (here we have dropped theXY subscript for simplicity), and\({\sigma }_{0}^{2}(\widehat{{\theta }}_{i})\) is the variance of\(\widehat{{\theta }}_{i}\) under the null hypothesis of identicalX andY populations. In the completely nonparametric case (in the absence of the shift model), θ i is the underlying parameter of interest for Wilcoxon statistics. For continuous data (no ties),\({\sigma }_{0}^{2}(\widehat{{\theta }}_{i}) = ({m}_{i} + {n}_{i} + 1)/(12{m}_{i}{n}_{i})\). Thus, the numerator of the standardized version ofW VE is a weighted average of\(\widehat{{\theta }}_{i} - 1/2\), where the weights are inversely proportional to null variances.

The analogoust procedure is based on standardizing

$$\sum\limits _{i=1}^{n} \frac{{m}_{i}{n}_{i}} {{m}_{i} + {n}_{i}}({\overline{Y }}_{i} -{\overline{X}}_{i}).$$
(12.52)

Thus, thet procedure uses a weighted linear combination of the difference of sample means, where the weights are inversely proportional to\(\mathrm{Var}\left ({\overline{Y }}_{i} -{\overline{X}}_{i}\right ) = {\sigma }^{2}(1/{m}_{i} + 1/{n}_{i})\).

The standard permutation approach is to consider all possible

$${M}_{N} ={ \prod\nolimits }_{i=1}^{n}\left ({ {m}_{i} + {n}_{i} \atop {n}_{i}} \right )$$

independent permutations within sites. The normal approximation for W VE should be very good if ∑ i = 1 n m i and ∑ i = 1 n n i are reasonably large, and therefore it is widely used in practice. In the case that ∑ i = 1 n m i and ∑ i = 1 n n i converge to \(\infty \), Hodges and Lehmann [1962] give the Pitman ARE of (12.51) to (12.52) for normal data as

$$.955\sum\limits _{i=1}^{n} \frac{{m}_{i}{n}_{i}} {{m}_{i} + {n}_{i} + 1}\left /\sum\limits _{i=1}^{n} \frac{{m}_{i}{n}_{i}} {{m}_{i} + {n}_{i}}\right..$$

Thus, if m i + n i is reasonably large, then the ARE is close to the best value .955. For example, if \({m}_{i} + {n}_{i} = 10\) for each site, then the ARE is .955(10/11) = .868.

For the case that there are small numbers of replications per block (site), we are led back to the procedures of the previous section, aligned ranks and possibly the rank transform. With replications within blocks, however, we now have the ability to test for block by treatment interactions. Unfortunately, standard permutation procedures are not available for testing the no interaction hypothesis in the face of main effects. A large literature exists evaluating and criticizing the rank transform approach for testing interactions. See, for example, Akritas [1990, 1991] and Thompson [1991]. In general, for more complicated fixed effects models with interaction, to achieve robustness via rank methods, we feel it is better to use the general R-estimation linear model approach mentioned at the end of Section 12.7 (p. 487).

Boos and Brownie [1992] argue that a mixed model approach is usually more appropriate, allowing inferences to be made to a larger population, but the mixed model leads away from van Elteren's statistic (12.51, p. 505) and permutation inference.

10 Contingency Tables

10.1 2 × 2 Table – Fisher's Exact Test

The first use of the permutation method was given by Fisher [1934a, Statistical Methods for Research Workers, fifth edition] in an analysis of 2 × 2 tables. Fisher's example was of 13 identical twins and 17 fraternal twins (of the same sex) who had at least one of the pair convicted of a crime. Of the 13 identical twins only 3 had a twin free of conviction. Of the 17 fraternal twins 15 had a twin free of conviction. Thus the table is as follows,

             Both          One
             Convicted     Convicted     Total
Identical       10             3           13
Fraternal        2            15           17
Total           12            18           30

To fix notation, a general 2 × 2 table is

             Category 1    Category 2    Total
Group 1        N 11          N 12         N 1.
Group 2        N 21          N 22         N 2.
Total          N .1          N .2         N

 

A standard analysis of these data assumes thatN 11 is binomial (N 1. , p 1) and independent ofN 21 assumed to be binomial (N 2. , p 2). The usual statistic for testingH 0 : p 1 = p 2 is the pooledZ, the square root of the score statistic found in Section 3.2.9 (p. 144),

$$Z = \frac{\widehat{{p}}_{1} -\widehat{ {p}}_{2}} {{\left \{\frac{\widetilde{p}(1 -\widetilde{ p})} {{N}_{1.}} + \frac{\widetilde{p}(1 -\widetilde{ p})} {{N}_{2.}} \right \}}^{1/2}},$$

where\(\widehat{{p}}_{1} = {N}_{11}/{N}_{1.}\),\(\widehat{{p}}_{2} = {N}_{21}/{N}_{2.}\), and\(\widetilde{p} = {N}_{.1}/N\). To testH a : p 1 > p 2, the standard approach would be to compareZ toz α, the 1 − α quantile of the standard normal.

Instead of this approximate procedure, Fisher noted that, conditional on the margins N .1 and N .2 being held fixed in addition to N 1. and N 2., a given table (n 11, n 12, n 21, n 22) has hypergeometric probability given by

$$\frac{\left ({ {N}_{1.} \atop {n}_{11}} \right )\left ({ {N}_{2.} \atop {n}_{21}} \right )} {\left ({ N \atop {N}_{.1}} \right )} = \frac{{N}_{1.}!{N}_{2.}!{N}_{.1}!{N}_{.2}!} {N!{n}_{11}!{n}_{12}!{n}_{21}!{n}_{22}!}.$$

This hypergeometric probability is easily obtained if one thinks about an urn withN . 1 balls of type 1 andN . 2 of type 2. If we draw outN 1.  balls without replacement, then the above probability is the probability of gettingn 11 of type 1 andn 21 of type 2.

One can also think of the above table arising in the two-sample problem where the data consist of just 1's and 0's. Although there are \(\left ({ N \atop {N}_{1.}} \right )\) permutations of interest, many of them yield the same table. The numerator of the above hypergeometric probability just gives the number of permutations which lead to a given table.

Now a variety of statistics can be used to order the possible tables from supportingH 0 to strongly rejectingH 0 and to calculate ap-value. Or one can just use intuition for the ordering: most people would agree that for testingH a : p 1 > p 2, the table below is more extreme than the original.

 

             Category 1    Category 2    Total
Group 1      N 11 + 1      N 12 − 1      N 1.
Group 2      N 21 − 1      N 22 + 1      N 2.
Total        N .1          N .2          N

 

Thus, a one-tailedp-value would be obtained by summing up the hypergeometric probabilities of those tables as extreme or more extreme than the original table (N 11, N 12, N 21, N 22). A number of seemingly different ways of ordering the tables lead to the same definition of “more extreme” and are called Fisher’s Exact Test. The simplest way to order is either the intuitive notion above or to order via the pooledZ statistic.

For the twins data, Fisher noted that the two more extreme tables haveN 11 = 11,N 12 = 2,N 21 = 1,N 22 = 16 andN 11 = 12,N 12 = 1,N 21 = 0,N 22 = 17. Thus thep-value is the probability of the original table plus the probability of these two more extreme tables:

$$\frac{13!17!12!18!} {30!} \left \{ \frac{1} {10!3!2!15!} + \frac{1} {11!2!1!16!} + \frac{1} {12!1!0!17!}\right \} = \frac{619} {1330665} =.000465.$$
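This sum can be checked in R either from the hypergeometric distribution or with fisher.test; a small sketch:

sum(dhyper(10:12, m = 12, n = 18, k = 13))  # P(N11 >= 10) = .000465
fisher.test(matrix(c(10, 3, 2, 15), 2, 2, byrow = TRUE),
            alternative = "greater")$p.value  # same value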

The definition of a two-sided p-value is not so clear, but the usual practice is to add in the probabilities of tables as extreme or more extreme in the other direction (having probabilities less than or equal to the probability of the observed table). In the above example we would need to add the probabilities of tables with N 11 = 0, N 12 = 13, N 21 = 12, N 22 = 5 and N 11 = 1, N 12 = 12, N 21 = 11, N 22 = 6 but not N 11 = 2, N 12 = 11, N 21 = 10, N 22 = 7 since it has higher probability than the original table.

When accompanied by a randomization rule to yield exact α levels, Fisher’s Exact Test is uniformly most powerful unbiased as discussed in Lehmann [1986, Ch. 4]. But many people have noted how conservative it is whenp-values are used with the rule: rejectH 0 whenp-value ≤ α. In this case the discreteness of the permutation distribution does prove costly in terms of power.

Barnard [1945, 1947], Boschloo [1970], and Suissa and Shuster [1985] proposed unconditional tests in the 2 × 2 table that are typically more powerful than the Fisher Exact Test without randomization. See Berger [1996] for details and power comparisons.

We have given Fisher's Exact Test in the context of two independent binomials and H 0 : p 1 = p 2. It also applies in the context of multinomial data where the data consist of a pair of binary variables (X, Y ) with values x 1 and x 2 and y 1 and y 2, respectively:

 

                     Y
             y 1     y 2     Total
X    x 1     N 11    N 12    N 1.
     x 2     N 21    N 22    N 2.
Total        N .1    N .2    N

 

The entries (N 11, N 12, N 21, N 22) are multinomial(N; p 11, p 12, p 21, p 22) with associated parameters

 

                     Y
             y 1     y 2     Total
X    x 1     p 11    p 12    p 1.
     x 2     p 21    p 22    p 2.
Total        p .1    p .2    1

 

In this paired variable context, the null hypothesis for Fisher’s Exact Test is independence ofX andY,

$${H}_{0} : {p}_{ij} = {p}_{i.}{p}_{.j},\quad i = 1,2;j = 1,2.$$
(12.53)

Of course, if \({p}_{11} = {p}_{1.}{p}_{.1}\), then all the other equalities such as \({p}_{12} = {p}_{1.}{p}_{.2}\) hold as well.

10.2 Paired Binary Data – McNemar’s Test

In the context of paired binary data introduced in the last section, we might expect association betweenX andY, but our main interest could be in their marginal probabilities. In particular, the null hypothesis is often

$${H}_{0} : {p}_{1.} = {p}_{.1}.$$
(12.54)

A typical application is in matched pair studies such as the following well-known case-control data from Miller [1980],

 

                              Sibling (Control)
                         Tons.    No Tons.    Total
Hodgkin's    Tons.        26        15         41
Patient      No Tons.      7        37         44
Total                     33        52         85

 

where Hodgkin’s patients were paired with a sibling and it was determined whether they each had a tonsillectomy or not. If the marginal estimates\(\widehat{{p}}_{1.} = {N}_{1.}/N = 41/85\) and\(\widehat{{p}}_{.1} = {N}_{.1}/N = 33/85\) differ significantly, then incidence of tonsillectomies may be associated with contracting Hodgkin’s disease. Noting that\(\widehat{{p}}_{1.} -\widehat{ {p}}_{.1} = {N}_{12}/N - {N}_{21}/N\) has multinomial variance\(\{{p}_{12} + {p}_{21} - {({p}_{12} - {p}_{21})}^{2}\}/N = ({p}_{12} + {p}_{21})/N\) underH 0, the score statistic is

$$Z = \frac{{N}_{12} - {N}_{21}} {{\left ({N}_{12} + {N}_{21}\right )}^{1/2}}.$$

Exact inference follows by noting that under (12.54, p. 509), N 12 | N 12 + N 21 has a binomial \(({N}_{12} + {N}_{21},1/2)\) distribution. Thus, Z = 1.71 has approximate normal one-sided p-value = .044, but \(P(\mbox{ binomial}(22,1/2) \geq 15) =.067\). These procedures are generally referred to as McNemar's test.
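A quick R check of both p-values (variable names ours):

N12 <- 15; N21 <- 7
Z <- (N12 - N21)/sqrt(N12 + N21)  # 1.71
pnorm(Z, lower.tail = FALSE)      # .044
pbinom(N12 - 1, N12 + N21, 0.5, lower.tail = FALSE)  # P(binomial(22, 1/2) >= 15) = .067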

What do these tests have to do with permutation and rank statistics? LetX = 1 denote that a Hodgkin’s patient had a tonsillectomy, andX = 0 denote that he/she did not, and similarlyY = 1 andY = 0 for the sibling control. Then the paired data and their differences are

Pair    Hodgkin's Patient    Sibling (Control)    Diff.
1              1                    1               0
.              .                    .               .
.              .                    .               .
26             1                    1               0
27             1                    0               1
.              .                    .               .
.              .                    .               .
41             1                    0               1
42             0                    1              − 1
.              .                    .               .
.              .                    .               .
48             0                    1              − 1
49             0                    0               0
.              .                    .               .
.              .                    .               .
85             0                    0               0

Note that there are N 12 = 15 positive differences out of \({N}_{12} + {N}_{21} = 22\) nonzero differences. Thus, the exact binomial procedure above is just the sign test for the differences, and Z is exactly (12.42, p. 495) for a(i) = 1. In fact, since all the nonzero absolute differences are identically 1, the exact signed rank test (assuming zeroes are deleted) yields the same binomial procedure, and Z is also (12.42, p. 495) with a(i) = i.

10.3 I by J Tables

We now consider the general I by J contingency table

                          Y
              y 1    ...    y J     Total
       x 1    N 11   ...    N 1J    N 1.
        .      .             .       .
X       .      .             .       .
        .      .             .       .
       x I    N I1   ...    N IJ    N I.
Total         N .1   ...    N .J    N

 

The distribution of these data could be a full multinomial withIJ cells orI independent rows of multinomial data. In either case, exact permutation analysis is achieved by conditioning on the marginal totals resulting in a multiple hypergeometric for the joint distribution of the entriesN ij having probability\(P({N}_{ij} = {n}_{ij},i = 1,\ldots,I;j = 1,\ldots,J\mid {N}_{1.},\ldots,{N}_{I.},{N}_{.1},\ldots,{N}_{.J})\) given by

$$\frac{\left ({\prod\nolimits }_{i=1}^{I}{N}_{ i.}!\right )\left ({\prod\nolimits }_{j=1}^{J}{N}_{.j}!\right )} {N!{\prod\nolimits }_{i=1}^{I}{ \prod\nolimits }_{j=1}^{J}{n}_{ ij}!}.$$

The question remains as to what statistic should be used. If both X and Y have nominal categories, then the chi-squared goodness-of-fit statistic is natural, but not very interesting. If X and Y have numerical scores or are at least ordered, then some type of association or correlation statistic should be used. For example, one might use Pearson's r or Spearman's rank correlation. If X has nominal categories and Y has numerical categories, then ANOVA-type comparisons among the row means make sense. If X has nominal categories and Y has ordered categories, then the Kruskal-Wallis test might be a good choice of statistic. Moreover, all these situations can be generalized to multi-way tables, say I by J by K tables, usually viewed as stratified comparisons of X and Y.

All these options for statistics in two-way and multiway tables come under the general purview of Generalized Cochran-Mantel-Haenszel statistics. Expositions of these statistics may be found in Landis et al. [1978] and Agresti [2002, Section 7.5.3], and implementation is found in SAS PROC FREQ.

11 Confidence Intervals and R-Estimators

Confidence intervals can be obtained from permutation and rank test statistics in the same way as for other types of statistics: choose values of θ appearing in a null hypothesis such that the statistic T(θ), viewed as a function of θ, does not reject the null hypothesis (see 3.19, p. 146). We often refer to this approach as "inverting a test statistic." For example, in the one-sample problem with data D 1, …, D n assumed to be symmetrically distributed about θ 0, a two-sided permutation t test could just as well be based on \(T({\theta }_{0}) = \vert \sum\limits _{i=1}^{n}({D}_{i} - {\theta }_{0})\vert \). The permutation distribution depends on the \({2}^{n}\) sign change configurations of \({D}_{1} - {\theta }_{0},\ldots,{D}_{n} - {\theta }_{0}\); we reject if T(θ 0) falls in the upper α fraction of the \({2}^{n}\) values of T(θ 0) computed over those sign changes. So the 1 − α confidence interval can be found by trial and error, but it would seem to be a pretty laborious task because the permutation distribution changes with each θ 0. A somewhat easier computing method is suggested in Lehmann [1986, p. 263], but in general, the usual t interval is close enough to the permutation interval that it is mostly used in practice.

Inverting the signed rank statistic W + leads to an interval \([{W}_{({k}_{1})},{W}_{({k}_{2})}]\), where \({W}_{(1)} \leq {W}_{(2)} \leq \cdots \leq {W}_{(n(n+1)/2)}\) are the ordered values of the Walsh averages

$${W}_{ij} = \frac{{D}_{i} + {D}_{j}} {2},\qquad 1 \leq i \leq j \leq n.$$
(12.55)

The order number k 2 is such that \(P({W}^{+} \geq {k}_{2}) \leq \alpha /2\), and \({k}_{1} = n(n + 1)/2 + 1 - {k}_{2}\). We have specified a closed interval so that the probability of coverage is at least 1 − α for tied data situations (see Randles and Wolfe, 1979, p. 181-183). For example, at n = 7 with continuous data and α = .05, \(P({W}^{+} \geq 26) = P({W}^{+} \leq 2) =.0234\), and thus the interval [W (3), W (26)] has exact confidence level \(1 -.0468 =.9532\). Often k 1 and k 2 are taken from the normal approximation to the permutation distribution of W +. For example, \({k}_{1} = q + 1\) and \({k}_{2} = n(n + 1)/2 - q\), where q is the closest integer to

$$\frac{n(n + 1)} {4} - {z}_{\alpha /2}{\left \{\frac{1} {4}\sum\limits _{i=1}^{n}{R}_{ i}^{2}\right \}}^{1/2}.$$

In the n = 7 example above, this latter calculation gives 2.4, and thus q = 2, k 1 = 3, and \({k}_{2} = 28 - 2 = 26\) as before. For the sample − 1.11, 2.23, 3.35, 4.67, 5.34, 6.17, 7.44, the interval is [W (3), W (26)] = [1.12, 6.39].
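A small R sketch computing the Walsh averages and this interval for the sample above (names ours):

D <- c(-1.11, 2.23, 3.35, 4.67, 5.34, 6.17, 7.44)
A <- outer(D, D, "+")/2
walsh <- sort(A[upper.tri(A, diag = TRUE)])  # the n(n+1)/2 = 28 Walsh averages
c(walsh[3], walsh[26])                       # [1.12, 6.39]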

Inverting the sign test leads to an interval of order statistics

$$({D}_{(k)},{D}_{(n-k+1)}),\quad 1 \leq k \leq n - k + 1.$$

This interval has exact coverage probability\({C}_{n}(k) = 1 - {(1/2)}^{n-1} \sum\limits _{i=0}^{k-1}\left ({ n \atop i} \right )\) for the population median from any continuous, not necessarily symmetric distribution. To obtain at least the same coverage for any discrete distribution, we need to again change to the closed interval\([{D}_{(k)},{D}_{(n-k+1)}]\). An interesting addendum to these intervals due to Guilbaud [1979] is that the average of two such intervals,

$$\left [\frac{{D}_{(k)} + {D}_{(k+t)}} {2}, \frac{{D}_{(n-k-t+1)} + {D}_{(n-k+1)}} {2} \right ],\;\;k + t \leq n - k - t + 1,$$

has guaranteed coverage \(\{{C}_{n}(k) + {C}_{n}(k + t)\}/2\) for any distribution. This latter interval is useful for small n because it gives more options for the confidence level than given by C n (k) alone. A more practical solution is given by Hettmansperger and Sheather [1986], who interpolate between adjacent order statistics to get an interval with approximately the specified confidence, say 95%. The intervals are no longer distribution-free, but the confidence is close to the specified value.
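The coverage C n (k) is a simple binomial cdf calculation; a one-line R sketch (Cn is our name):

Cn <- function(n, k) 1 - 2 * pbinom(k - 1, n, 0.5)  # coverage of [D_(k), D_(n-k+1)]
Cn(7, 1)  # .984
Cn(7, 2)  # .875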

Moving to the two-sample problem, the permutation interval based on the two-samplet is hard to compute, similar to the one-sample interval, and the usualt interval is mostly used in practice. Inversion of the Wilcoxon Rank Sum statistic for the shift model\(G(x) = F(x - \Delta )\) leads to a confidence interval forΔ of the form\([{U}_{({k}_{1})},{U}_{({k}_{2})}]\), whereU (1) ≤ U (2)⋯ ≤ U (mn) are the ordered values of the pairwise differences

$${U}_{ij} = {Y }_{j} - {X}_{i},\quad i = 1,\ldots,m;j = 1,\ldots,n.$$
(12.56)

Similar to the one-sample case,k 2 is chosen so that\(P(W \geq{k}_{2} + n(n + 1)/2) = \alpha /2\) and\({k}_{1} = mn + 1 - {k}_{2}\). In practice, one often uses the normal approximation interval with\({k}_{1} = q + 1\) and\({k}_{2} = mn - q\), whereq is the integer closest to

$$\frac{mn} {2} - {z}_{\alpha /2}{\left \{\mathrm{Var}(W)\right \}}^{1/2},$$

where Var(W) is given by (12.10, p. 462) or (12.11, p. 462).

Point estimators obtained from rank test statistics were introduced by Hodges and Lehmann [1963]. These R-estimators inherit some of the natural robustness properties of rank methods; see, for example, Huber [1981], Serfling [1980, Ch. 9], Randles and Wolfe [1979, Ch. 7], and Hettmansperger [1984, Ch. 5]. The most well known are: i) the one-sample center of symmetry estimator \(\widehat{\theta } = \mbox{ median}\{{W}_{ij}\}\), where the W ij are in (12.55, p. 512); and ii) the two-sample shift estimator \(\widehat{\Delta } = \mbox{ median}\{{U}_{ij}\}\), where the U ij are in (12.56, p. 513). Asymptotic relative efficiency comparisons for confidence intervals and estimators derived from rank tests are exactly the same as for the associated rank tests.

12 Appendix – Technical Topics for Rank Tests

12.1 Locally Most Powerful Rank Tests

Recall from Section 12.5.1 (p. 474) that forH 0 : Δ = 0 versusH a : Δ > 0, if there exists a rank test that is uniformly most powerful of level α for some ε > 0 in the restricted testing problemH 0 : Δ = 0 versusH a, ε : 0 < Δ < ε, we say that the test is thelocally most powerful rank test for the original testing problem. By using a Taylor expansion of the probability of the rank vector\(R\) as a function ofΔ,\({L}_{r}(\Delta ) \equiv{P}_{\Delta }(R = r)\), we need only obtain an expression for the derivative of\({L}_{r}(\Delta )\) and maximize it.

To see this consider the Taylor expansion

$${L}_{r}(\Delta ) = {L}_{r}(0) + {L}_{r}^{{\prime}}(0)\Delta+ o(\vert \Delta \vert ),$$

and a rank test with\(\alpha= k/N!\) based on maximizing\({L}_{r}^{{\prime}}(0)\). Let\({r}^{(1)}\) be the rank configuration that makes\({L}_{r}^{{\prime}}(0)\) largest among allN! rank configurations,\({r}^{(2)}\) makes\({L}_{r}^{{\prime}}(0)\) second largest among allN! rank configurations, etc. Such a rank test has power

$$\beta (\Delta ) = \sum\limits _{j=1}^{k}{L}_{{ r}^{(j)}}(\Delta ) = \sum\limits _{j=1}^{k}\left [ \frac{1} {N!} + {L}_{{r}^{(j)}}^{{\prime}}(0)\Delta+ o(\vert \Delta \vert )\right ].$$

For each rank configuration \({r}^{(j)}\), we can choose Δ j small enough so that \({L}_{{r}^{(j)}}(\Delta )\) is also the jth largest among \({L}_{{r}^{(1)}}(\Delta ),\ldots,{L}_{{r}^{(N!)}}(\Delta )\) for all 0 < Δ < Δ j. Now take ε to be smaller than all of the Δ j. This shows that for 0 < Δ < ε, the test that places points in the rejection region as ordered by \({L}_{r}^{{\prime}}(0)\) also puts points in the rejection region as ordered by \({P}_{\Delta }(R = r) = {L}_{r}(\Delta )\); in other words, it is the locally most powerful rank test.

Let us now consider the two-sample problem where X 1, …, X m are iid with distribution function F(x), and Y 1, …, Y n are iid with distribution function G(x). Suppose that F and G have densities f(x) and g(x), respectively, whose support is contained in that of a density h(x). This means that h(x) is positive whenever f(x) and g(x) are positive; for example, when all three densities have support on \((-\infty,\infty )\). From Theorem 12.6 (p. 515), we have

$$P(R = r) = \frac{1} {N!}\mbox{ E}\left [\frac{{\prod\nolimits }_{i=1}^{m}f({V }_{({r}_{i})}){\prod\nolimits }_{i=m+1}^{N}g({V }_{({r}_{i})})} {{\prod\nolimits }_{i=1}^{m}h({V }_{({r}_{i})}){\prod\nolimits }_{i=m+1}^{N}h({V }_{({r}_{i})})}\right ],$$

whereV (1) < ⋯ < V (N) are the order statistics of an iid sample of sizeN fromh(x).

Shift alternatives have the form \(g(x) = f(x - \Delta )\) so that the Y distribution has the same shape as the X distribution but is shifted Δ to the right of it. If f(x) has support on \((-\infty,\infty )\), then we may take h(x) = f(x) and obtain

$${P}_{\Delta }(R = r) = \frac{1} {N!}\mbox{ E}\left [\frac{{\prod\nolimits }_{i=m+1}^{N}f({V }_{({r}_{i})} - \Delta )} {{\prod\nolimits }_{i=m+1}^{N}f({V }_{({r}_{i})})} \right ],$$
(12.57)

where nowV (1) < ⋯ < V (N) are order statistics for a random sample fromf. Now suppose thatf(x) is differentiable and that we can take the derivative inside the expectation in (12.57). Then,

$${L}_{r}^{{\prime}}(0) ={ \left. \frac{\partial } {\partial \Delta }{P}_{\Delta }(R = r)\right \vert }_{\Delta =0} = \frac{1} {N!}\sum\limits _{i=m+1}^{N}\mbox{ E}\left [\frac{-{f}^{{\prime}}({V }_{ ({r}_{i})})} {f({V }_{({r}_{i})})} \right ].$$
(12.58)

The locally most powerful rank test places points in the rejection region according to large values of this latter expression.

If we letV (1) < ⋯ < V (N) be replaced by\({F}^{-1}({U}_{(1)})< \cdots< {F}^{-1}({U}_{(N)})\) where theU (i) are uniform order statistics from an iid sampleU 1, , U N , then the locally most powerful rank test rejects for large values of

$$T = \sum\limits _{i=m+1}^{N}a({R}_{ i}),$$

wherea(i) = Eϕ(U (i), f), and\(\phi (u,f) = -{f}^{{\prime}}({F}^{-1}(u))/f({F}^{-1}(u))\) is given in (12.23, p. 475) and called the optimal score function.

12.2 Distribution of the Rank Vector under Alternatives

A version of the following result first appeared in Hoeffding [1951].

Theorem 12.6.

Suppose that Z 1, …, Z N are independent continuous random variables with respective densities f 1, …, f N. Let \(R = {({R}_{1},\ldots,{R}_{N})}^{T}\) be the corresponding rank vector. If h is the density of a continuous random variable whose support contains the support of each of f 1, …, f N, then

$$P(R = r) = \frac{1} {N!}\mbox{ E}\left [\frac{{\prod\nolimits }_{i=1}^{N}{f}_{i}({V }_{({r}_{i})})} {{\prod\nolimits }_{i=1}^{N}h({V }_{({r}_{i})})} \right ],$$

where V (1) < ⋯ < V (N) are the order statistics of an iid sample from h.

Proof.

Let\(C =\{ t : {t}_{i}\;\;\mbox{ has rank}\;\;{r}_{i}\}\). Then by definition

$$P(R = r) = \int\nolimits \nolimits \cdots \int\nolimits \nolimits I(t \in C)\left \{{\prod\nolimits }_{i=1}^{N}{f}_{ i}({t}_{i})\right \}d{t}_{1}d{t}_{2}\cdots d{t}_{N}.$$

Now let\({v}_{({r}_{i})} = {t}_{i}\) so thatv (1) < ⋯ < v (N). On the setC this is just a 1-to-1 change of variable, but its implications are important. For a given vector\(t\) suppose thatt 1 has rankr 1 = 3; that is,t 1 is third from the bottom when the components of\(t\) are ranked. Then\({v}_{({r}_{1})} = {v}_{(3)} = {t}_{1}\). Ift 2 has rankr 2 = 9, then\({v}_{({r}_{2})} = {v}_{(9)} = {t}_{2}\). Now we make the change of variable, and multiply and divide by\(N!{\prod\nolimits }_{i=1}^{N}h({v}_{({r}_{i})})\) to get

$$\begin{array}{rcl} P(R = r)& =& \frac{1} {N!}\int\nolimits \nolimits \cdots \int\nolimits \nolimits \left [\frac{{\prod\nolimits }_{i=1}^{N}{f}_{i}({v}_{({r}_{i})})} {{\prod\nolimits }_{i=1}^{N}h({v}_{({r}_{i})})} \right ]I({v}_{(1)}< \cdots< {v}_{(N)})N! \\ & & \quad \times \left \{{\prod\nolimits }_{i=1}^{N}h({v}_{ (i)})\right \}d{v}_{(1)}d{v}_{(2)}\cdots d{v}_{(N)}\end{array}$$

The result follows by noticing thatI(v (1) < ⋯ < v (N))N!  ∏ i = 1 N h(v (i)) is the density of the order statistic vector fromh.

12.3 Pitman Efficiency

Recall from Section (12.5.2, p. 476) that the Pitman asymptotic relative efficiency of testS to testT is given by

$$\mbox{ ARE}(S,T) =\lim \limits_{k\rightarrow \infty }\frac{{N}_{k}^{{\prime}}} {{N}_{k}},$$

where N k and N k ′ are the sample sizes required for the two tests to have the same limiting level α and power β under the sequence of alternatives

$${\theta }_{k} = {\theta }_{0} + \frac{\delta } {\sqrt{{N}_{k}}} + o\left ( \frac{1} {\sqrt{{N}_{k}}}\right )\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
(12.59)

These sequences of alternatives are calledPitman alternatives, and the basic approach is due to Pitman [1948] and Noether [1955]. In the following we have drawn heavily from the accounts in Lehmann [1975] and Randles and Wolfe [1979].

We assume in Theorem 12.7 below that both test statistics satisfy 1–7 below. For simplicity we state the conditions for justS and then give a result on asymptotic power before giving the main theorem.

In the following\({\mu }_{{S}_{k}}(\theta )\) and\({\sigma }_{{S}_{k}}(\theta )\) refer to sequences of constants associated withS k under θ. They might be the means and standard deviations, but need not be.

  1. 1.
    $${\theta }_{k} \rightarrow{\theta }_{0}\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  2. 2.
    $${N}_{k} \rightarrow \infty \;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  3. 3.

    Under θ = θ0

    $$\frac{{S}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \stackrel{d}{\rightarrow }\mbox{ N}(0,1)\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  4. 4.

    Under θ = θ k

    $$\frac{{S}_{k} - {\mu }_{{S}_{k}}({\theta }_{k})} {{\sigma }_{{S}_{k}}({\theta }_{k})} \stackrel{d}{\rightarrow }\mbox{ N}(0,1)\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  5. 5.

    The derivative\({\mu }_{{S}_{k}}^{{\prime}}(\theta )\) exists in a neighborhood of θ = θ0 with\({\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0}) > 0\) and

    $$\frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{k}^{{_\ast}})} {{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} \rightarrow1\;\;\mbox{ for all}\;\;{\theta }_{k}^{{_\ast}}\rightarrow{\theta }_{ 0}\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  6. 6.
    $$\frac{{\sigma }_{{S}_{k}}({\theta }_{k})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \rightarrow1\;\;\mbox{ as}\;\;k \rightarrow \infty.$$
  7. 7.

    There exists a positive constantc such that

    $$c =\lim \limits_{k\rightarrow \infty } \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} {\sqrt{{N}_{k } {\sigma }_{{S}_{k } }^{2 }({\theta }_{0 } )}}.$$

This constant c is called the efficacy of S and denoted eff(S). Based on these conditions we first give a result on asymptotic power. The result shows that the higher the efficacy of a test, the more power it has. The result also gives a way to approximate the power of a test based on S. Let Z be a standard normal random variable, and let z α be its 1 − α quantile.

Theorem 12.7.

Suppose that the test that rejects for S k > c k has level α k → α as k →∞ under H 0 : θ = θ 0.

  1. a)

    If Conditions 1–7 and (12.59, p. 516) hold, then

    $${\beta }_{k} = P({S}_{k} > {c}_{k}) \rightarrow P(Z > {z}_{\alpha } - c\delta )\;\;\mbox{ as}\;\;k \rightarrow \infty,$$
    (12.60)

    where δ is given in (12.59, p. 516).

  2. b)

    If Conditions 1–7 and (12.60) hold, then (12.24, p. 476) holds.

Proof.

Note first that if Condition 3. holds, then since α k  → α

$$\frac{{c}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \rightarrow{z}_{\alpha }\;\;\mbox{ as}\;\;k \rightarrow \infty.$$

NowP(S k  > c k ) is given by

$$\begin{array}{rcl} & & P\left (\frac{{S}_{k} - {\mu }_{{S}_{k}}({\theta }_{k})} {{\sigma }_{{S}_{k}}({\theta }_{k})} > \left [\frac{{c}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} -\frac{{\mu }_{{S}_{k}}({\theta }_{k}) - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \right ]\frac{{\sigma }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{k})}\right ) \\ & \rightarrow & P(Z > {z}_{\alpha } - c\delta )\;\;\mbox{ as}\;\;k \rightarrow \infty \end{array}$$

To see this last step, note that by the mean value theorem there exists a θ k  ∗  such that

$$\begin{array}{rcl} \frac{{\mu }_{{S}_{k}}({\theta }_{k}) - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} & =& \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{k}^{{_\ast}})({\theta }_{k} - {\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \\ & =& \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{k}^{{_\ast}})} {{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} \frac{{\mu }_{{S}_{k}}^{{\prime}}({\theta }_{0})} {\sqrt{{N}_{k } {\sigma }_{{S}_{k } }^{2 }({\theta }_{0 } )}}\sqrt{{ N}_{k}}({\theta }_{k} - {\theta }_{0}) \rightarrow c\delta \end{array}$$

For part b) we just work backwards and note that (12.60) and Conditions 1–7 force the convergence to cδ, which means that \(\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0}) \rightarrow \delta \), which is equivalent to (12.59, p. 516).

Now we give the main Pitman ARE theorem.

Theorem 12.8.

Suppose that the tests that reject for S k > c k and T k > c k ′ based on sample sizes N k and N k ′, respectively, have levels α k and α k ′ that converge to α under H 0 : θ = θ 0 and their powers under θ k both converge to β, α < β < 1. If Conditions 1–7 hold and their efficacies are c = eff(S) and c ′ = eff(T), respectively, then the Pitman asymptotic relative efficiency of S to T is given by

$$\mbox{ ARE} ={ \left \{\frac{\mbox{ eff}(S)} {\mbox{ eff}(T)}\right \}}^{2}.$$

Proof.

By Theorem 12.7 (p. 517) b), \(\beta = P(Z > {z}_{\alpha } - c\delta ) = P(Z > {z}_{\alpha } - {c}^{{\prime}}{\delta }^{{\prime}})\). Thus \(c\delta = {c}^{{\prime}}{\delta }^{{\prime}}\) and

$$\begin{array}{rcl} \mbox{ ARE}(S,T)& =& \lim \limits_{k\rightarrow \infty }\frac{{N}_{k}^{{\prime}}} {{N}_{k}} \\& =& \lim \limits_{k\rightarrow \infty }{\left (\frac{\sqrt{{N}_{k }^{{\prime}}}({\theta }_{k} - {\theta }_{0})} {\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})} \right )}^{2} \\ & =&{ \left (\frac{{\delta }^{{\prime}}} {\delta } \right )}^{2} ={ \left ( \frac{c} {{c}^{{\prime}}}\right )}^{2}\end{array}$$

To apply Theorem 12.8 it would appear that we have to verify Conditions 3–6 above for arbitrary subsequences θ k converging to θ 0 and then compute the efficacy in Condition 7 for such sequences. However, if Conditions 1–7 and (12.60, p. 517) hold, we know by Theorem 12.7 (p. 517) that (12.24, p. 476) holds. Thus, we really only need to assume Condition 2 and verify Conditions 3–6 for alternatives of the form (12.59, p. 516). Moreover, the efficacy need only be computed for a simple sequence N converging to \(\infty \) since the numerator and denominator in Condition 7 only involve θ 0.

12.4 Pitman ARE for the One-Sample Location Problem

Using the notation of Section 12.8 (p. 419), let D 1, …, D N be iid from F(x − θ), where F(x) has density f(x) that is symmetric about 0, \(f(x) = f(-x)\). Thus D i has density f(x − θ) that is symmetric about θ. The testing problem is H 0 : θ = θ 0 versus H a : θ = θ k, where θ k is given by (12.59).

12.4.1 a Efficacy for the One-Sample t

The one-samplet statistic is

$$t = \frac{\sqrt{N}(\overline{D} - {\theta }_{0})} {s},$$

where s is the n − 1 version of the sample standard deviation. The simplest choice of standardizing constants is

$${\mu }_{{t}_{k}}({\theta }_{k}) = \frac{\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})} {\sigma }$$

and\({\sigma }_{{t}_{k}}({\theta }_{k}) = 1\), where σ is the standard deviation ofD 1 (under both θ = θ0 and θ = θ k ). To verify Conditions 3 and 4 (p. 517), we have

$$\begin{array}{rcl} \frac{{t}_{k} - {\mu }_{{t}_{k}}({\theta }_{0})} {{\sigma }_{{t}_{k}}({\theta }_{0})} & =& \frac{\sqrt{{N}_{k}}(\overline{D} - {\theta }_{0})} {s} -\frac{\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})} {\sigma } \\ & =& \frac{\sqrt{{N}_{k}}(\overline{D} - {\theta }_{k})} {\sigma } \left ( \frac{s} {\sigma }\right ) + \sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})\left (\frac{1} {s} - \frac{1} {\sigma }\right )\end{array}$$

Under both θ = θ 0 and θ = θ k, s has the same distribution and converges in probability to σ if D has a finite variance. Thus, under θ = θ k the last term in the latter display converges to 0 in probability since (12.59) forces \(\sqrt{{N}_{k}}({\theta }_{k} - {\theta }_{0})\) to converge to δ. Of course under θ = θ 0 this last term is identically 0. The standardized means converge to standard normals under both θ = θ 0 and θ = θ k by Theorem 5.33 (p. 263). Two applications of Slutsky's Theorem then give Conditions 3 and 4 (p. 517). Since the derivative of \({\mu }_{{t}_{k}}(\theta )\) is \({\mu }_{{t}_{k}}^{{\prime}}(\theta ) = \sqrt{{N}_{k}}/\sigma \) for all θ, Condition 5 (p. 517) is satisfied. Since \({\sigma }_{{t}_{k}}({\theta }_{k}) = 1\), Condition 6 (p. 517) is satisfied. Finally, dividing \({\mu }_{{t}_{k}}^{{\prime}}({\theta }_{0}) = \sqrt{{N}_{k}}/\sigma \) by \(\sqrt{{N}_{k}}\) yields

$$\mbox{ eff}(t) = \frac{1} {\sigma }.$$

It should be pointed out that this efficacy expression also holds for the permutation version of the t test because the permutation distribution of the t statistic also converges to a standard normal under \(\theta= {\theta }_{0}\).

12.4.2 Efficacy for the Sign Test

The sign test statistic is the number of observations above θ0,

$$S = \sum\limits _{i=1}^{N}I({D}_{ i} > {\theta }_{0}).$$

S has a binomial\((N,1/2)\) distribution under \(\theta= {\theta }_{0}\) and a binomial\((N,1 - F({\theta }_{0} - \theta ))\) distribution under general θ. Let \({\mu }_{{S}_{k}}(\theta ) = N[1 - F({\theta }_{0} - \theta )]\) and \({\sigma }_{{S}_{k}}^{2}(\theta ) = N[1 - F({\theta }_{0} - \theta )]F({\theta }_{0} - \theta )\). Conditions 3 and 4 (p. 517) follow again by Theorem 5.33 (p. 263), and \({\mu }_{{S}_{k}}^{{\prime}}(\theta ) = Nf({\theta }_{0} - \theta )\). Since F is continuous, Condition 6 (p. 517) is satisfied; if f is continuous, then Condition 5 (p. 517) is satisfied, and the efficacy is

$$\mbox{ eff}(S) =\lim \limits_{N\rightarrow \infty } \frac{Nf(0)} {\sqrt{{N}^{2 } /4}} = 2f(0).$$

Now we are able to compute the Pitman ARE of the sign test to the t test:

$$\mbox{ ARE}(S,t) = 4{\sigma }^{2}{f}^{2}(0).$$

Table 12.4 (p. 496) gives values of ARE(S, t) for some standard distributions.
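
As a quick check of this formula in two familiar cases: a normal density with variance \({\sigma }^{2}\) has \(f(0) = 1/(\sigma \sqrt{2\pi })\), and a Laplace density with variance \({\sigma }^{2}\) has \(f(0) = 1/(\sigma \sqrt{2})\), so

$$\mbox{ ARE}(S,t) = \frac{4{\sigma }^{2}} {2\pi {\sigma }^{2}} = \frac{2} {\pi } \approx.64\;\;\mbox{ (normal)}\quad \mbox{ and}\quad \mbox{ ARE}(S,t) = \frac{4{\sigma }^{2}} {2{\sigma }^{2}} = 2\;\;\mbox{ (Laplace)},$$

illustrating that the sign test loses efficiency at the normal but can dominate the t test for heavier-tailed distributions.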

12.4.3 Efficacy for the Wilcoxon Signed Rank Test

Recall that the signed rank statistic is

$${W}^{+} = \sum\limits _{i=1}^{N}I({D}_{ i} > {\theta }_{0}){R}_{i}^{+},$$

where \({R}_{i}^{+}\) is the rank of \(\vert {D}_{i} - {\theta }_{0}\vert \) among \(\vert {D}_{1} - {\theta }_{0}\vert,\ldots,\vert {D}_{N} - {\theta }_{0}\vert \). The asymptotic distribution of \({W}^{+}\) under \({\theta }_{k}\) requires more theory than we have developed so far, but Olshen [1967] showed that the efficacy of \({W}^{+}\) is

$$\sqrt{12}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx$$

under the condition that \({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx < \infty \). Thus the Pitman asymptotic relative efficiency of the sign test to the Wilcoxon Signed Rank test is

$$\mbox{ ARE}(S,{W}^{+}) = \frac{{f}^{2}(0)} {3{\left ({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx\right )}^{2}}.$$

Similarly, the Pitman asymptotic relative efficiency of the Wilcoxon Signed Rank test to the t test is

$$\mbox{ ARE}({W}^{+},t) = 12{\sigma }^{2}{\left ({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx\right )}^{2}.$$

Table 12.4 (p. 496) displays these AREs for a number of distributions.
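
For instance, a normal density with variance \({\sigma }^{2}\) has \({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx = 1/(2\sigma \sqrt{\pi })\), so that

$$\mbox{ ARE}({W}^{+},t) = \frac{12{\sigma }^{2}} {4\pi {\sigma }^{2}} = \frac{3} {\pi } \approx.955\quad \mbox{ and}\quad \mbox{ ARE}(S,{W}^{+}) = \frac{1/(2\pi )} {3/(4\pi )} = \frac{2} {3};$$

the signed rank test gives up very little relative to the t test at the normal, whereas the sign test gives up a third relative to the signed rank test.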

12.4.4 Power Approximations for the One-Sample Location Problem

Theorem 12.7 (p. 517) gives the asymptotic power approximation

$$P(Z > {z}_{\alpha } - c\delta ) = 1 - \Phi \left ({z}_{\alpha } - c\,\sqrt{N}(\theta- {\theta }_{0})\right )$$

based on setting \(\delta= \sqrt{N}(\theta- {\theta }_{0})\) in (12.60, p. 517), where θ is the alternative of interest at sample size N.

For example, let us first consider the t statistic with \(c = 1/\sigma \) and \({\theta }_{0} = 0\). The power approximation is then

$$1 - \Phi \left ({z}_{\alpha } -\sqrt{N}\theta /\sigma \right ).$$

This is the exact power we get for the Z statistic \(\sqrt{N}(\overline{D} - {\theta }_{0})/\sigma \) when we know σ instead of estimating it. At \(\theta /\sigma=.2\) and N = 10, we get power 0.16, which may be compared with the estimated exact powers for the first four distributions in Randles and Wolfe [1979, p. 116]: .14, .15, .16, .17. These latter estimates were based on 5000 simulations and have standard deviation around .005. At \(\theta /\sigma=.4\) and N = 10, the approximate power is 0.35, and the estimated exact powers for those first four distributions in Randles and Wolfe [1979, p. 116] are .29, .33, .35, and .37, respectively. So here our asymptotic approximation may be viewed as substituting a Z for the t, and the approximation is quite good. Of course, for the normal distribution we could easily have used the noncentral t distribution to get the exact power.
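
These numbers are easy to reproduce. The following minimal sketch (Python with scipy; not part of the original text) evaluates the Z approximation to the t test power:

from scipy.stats import norm

# power approximation: 1 - Phi(z_alpha - sqrt(N) * theta / sigma), one-sided alpha = .05
z05 = norm.ppf(0.95)
N = 10
for ratio in (0.2, 0.4):                      # ratio = theta / sigma
    print(ratio, round(1 - norm.cdf(z05 - N ** 0.5 * ratio), 2))
# prints 0.16 and 0.35, matching the values above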

For the sign test, the approximation is

$$1 - \Phi \left ({z}_{\alpha } -\sqrt{N}2f(0)\theta \right ) = 1 - \Phi \left ({z}_{\alpha } -\sqrt{N}2{f}_{0}(0)\theta /\sigma \right ),$$

where we have put f in the form of a location-scale model \(f(x) = {f}_{0}((x - \theta )/\sigma )/\sigma \), where \({f}_{0}(x)\) has standard deviation 1, so that σ is the standard deviation. For the uniform distribution, \({f}_{0}(x) = I(-\sqrt{3}< x< \sqrt{3})/\sqrt{12}\), so that \(2{f}_{0}(0) = 2/\sqrt{12}\). The approximate power at \(\theta /\sigma=.2,.4,.6,.8\) and N = 10 is then .10, .18, .29, .43, respectively. The corresponding Randles and Wolfe [1979, p. 116] estimates are .10, .19, .30, and .45, respectively. Here, of course, we could calculate the power exactly using the binomial. The approximate power we have used is similar to the normal approximation to the binomial but not the same, because our approximation has replaced the difference between \(p = F(0) = 1/2\) and \(p = F(\theta )\) by a derivative times θ (a Taylor expansion) and has also used the null variance. It is perhaps surprising how good the approximation is.
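
The same computation with the uniform's \(2{f}_{0}(0) = 2/\sqrt{12}\) in place of 1 ∕ σ reproduces the sign test values (again a Python/scipy sketch):

from math import sqrt
from scipy.stats import norm

z05 = norm.ppf(0.95)
N = 10
for ratio in (0.2, 0.4, 0.6, 0.8):            # ratio = theta / sigma
    power = 1 - norm.cdf(z05 - sqrt(N) * (2 / sqrt(12)) * ratio)
    print(ratio, round(power, 2))
# prints .10, .18, .29, .43 as in the text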

The most interesting case is the signed rank statistic because we do not have any standard way of calculating the power. The approximate power for an alternative θ when θ0 = 0 is

$$\begin{array}{rcl} P(Z > {z}_{\alpha } - c\delta )& =& 1 - \Phi \left ({z}_{\alpha } - \theta \sqrt{12N}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}^{2}(x)dx\right ) \\ & =& 1 - \Phi \left ({z}_{\alpha } - \frac{\theta } {\sigma }\sqrt{12N}{\int\nolimits \nolimits }_{-\infty }^{\infty }{f}_{ 0}^{2}(x)dx\right )\end{array}$$

Here again in the second part we have substituted so that σ is the standard deviation of \(f(x)\). For example, at the standard normal, \({\int\nolimits \nolimits }_{-\infty }^{\infty }{f}_{0}^{2}(x)dx = 1/\sqrt{4\pi }\), and the approximate power is

$$1 - \Phi \left ({z}_{\alpha } -\sqrt{\frac{3N} {\pi }} \frac{\theta } {\sigma }\right ).$$

Plugging in \(\theta /\sigma=.2,.4,.6\), and .8 at N = 10, we obtain .15, .34, .58, and .80, respectively. The estimates of the exact powers from Randles and Wolfe [1979, p. 116] are .14, .32, .53, and .74. Thus the asymptotic approximation is a bit too high, especially at the larger θ ∕ σ values.
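
A similar sketch (Python/scipy) reproduces these values from the display above:

from math import pi, sqrt
from scipy.stats import norm

z05 = norm.ppf(0.95)
N = 10
for ratio in (0.2, 0.4, 0.6, 0.8):            # ratio = theta / sigma
    print(ratio, round(1 - norm.cdf(z05 - sqrt(3 * N / pi) * ratio), 2))
# prints .15, .34, .58, .80, versus the simulated .14, .32, .53, .74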

Although the approximation is a little high, it could easily be used for planning purposes. For example, suppose that a clinical trial is to be run with power .80 at the α = .05 level against alternatives expected to be around \(\theta /\sigma=.5\). Since the FDA requires two-sided procedures, we use \({z}_{.025} = 1.96\) and solve \({\Phi }^{-1}(1 -.8) = 1.96 -\sqrt{3N/\pi }(.5)\) to get

$$N ={ \left [\frac{1.96 - {\Phi }^{-1}(.2)} {.5} \right ]}^{2}\frac{\pi } {3} = 32.9.$$

Notice that if we invert the Z statistic power formula used above for approximating the power of the t statistic, the only difference from the last display is that the factor π ∕ 3 does not appear. Thus for the t test the calculations result in 31.4 observations. Of course, this ratio \(3/\pi= 31.4/32.9\) is just the ARE of the signed rank test to the t test at the normal distribution.
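
The planning calculation is equally simple to script (a Python/scipy sketch; as noted, the π ∕ 3 factor is dropped for the t):

from math import pi
from scipy.stats import norm

z_total = norm.ppf(0.975) - norm.ppf(0.20)    # 1.96 - Phi^{-1}(.2)
N_t = (z_total / 0.5) ** 2                    # Z-based calculation for the t
print(round(N_t, 1), round(N_t * pi / 3, 1))  # 31.4 and 32.9 (signed rank)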

Problems

  1. 12.1.

    For the permutations in Table 12.1 (p. 453), give the permutation distribution of the Wilcoxon Rank Sum statistic W.

  2. 12.2.

    For the two-sample problem with samples \({X}_{1},\ldots,{X}_{m}\) and \({Y }_{1},\ldots,{Y }_{n}\), show that the permutation test based on \(\sum\limits _{i=1}^{n}{Y }_{i}\) is equivalent to the permutation tests based on \(\sum\limits _{i=1}^{m}{X}_{i}\), \(\sum\limits _{i=1}^{n}{Y }_{i} -\sum\limits _{i=1}^{m}{X}_{i}\), and \(\overline{Y } -\overline{X}\).

  3. 12.3.

    A one-way ANOVA situation with k = 3 groups and two observations within each group (\({n}_{1} = {n}_{2} = {n}_{3} = 2\)) results in the following data. Group 1: 37, 24; Group 2: 12, 15; Group 3: 9, 16. The ANOVA \(F = 5.41\) results in a p-value of .101 from the F table. If we exchange the 15 in Group 2 for the 9 in Group 3, then \(F = 7.26\).

    1. a.

      What is the total number of ways of grouping the data that are relevant to testing that the means are equal?

    2. b.

      Without resorting to the computer, give reasons why the permutation p-value using the F statistic is 2/15.

  4. 12.4.

    In a one-sided testing problem with continuous test statistic T, the p-value is either \({F}_{H}({T}_{\mbox{ obs.}})\) or \(1 - {F}_{H}({T}_{\mbox{ obs.}})\) depending on the direction of the hypotheses, where \({F}_{H}\) is the distribution function of T under the null hypothesis H, and \({T}_{\mbox{ obs.}}\) is the observed value of the test statistic. In either case, under the null hypothesis the p-value is a uniform random variable, as seen from the probability integral transformation. Now consider the case where T has a discrete distribution with values \({t}_{1},\ldots,{t}_{k}\) and probabilities \(P(T = {t}_{i}) = {p}_{i},i = 1,\ldots,k\), under the null hypothesis \({H}_{0}\). If we are rejecting \({H}_{0}\) for small values of T, then the p-value is \(p = P(T \leq{T}_{\mbox{ obs.}}) = {p}_{1} + \cdots+ P(T = {T}_{\mbox{ obs.}})\), and the mid-p value is \(p - (1/2)P(T = {T}_{\mbox{ obs.}})\). Under the null hypothesis \({H}_{0}\), show that E(mid-p) = 1/2 and thus that the expected value of the usual p-value must be greater than 1/2 (and thus greater than the expected value of the p-value in continuous cases).

  5. 12.5.

    Consider a finite population of values \({a}_{1},\ldots,{a}_{N}\) and a set of constants \({c}_{1},\ldots,{c}_{N}\). We select a random permutation of the a values, call them \({A}_{1},\ldots,{A}_{N}\), and form the statistic

    $$T = \sum\limits _{i=1}^{N}{c}_{ i}{A}_{i}.$$

    The purpose of this problem is to derive the first two permutation moments of T given in Section 12.4.2 (p. 458).

    1. a.

      First show that

      $$P({A}_{i} = {a}_{s}) = \frac{1} {N}\quad \mbox{ for}\;s = 1,\ldots,N,$$

      and

      $$P({A}_{i} = {a}_{s},{A}_{j} = {a}_{t}) = \frac{1} {N(N - 1)}\quad \mbox{ for}\;s\neq t = 1,\ldots,N.$$

      (Hint: for the first result there are (N − 1)! permutations with \({a}_{s}\) in the ith slot out of a total of N! equally likely permutations.)

    2. b.

      Using a. show that

      $$\mbox{ E}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{a}_{ i} \equiv \overline{a},\quad \mathrm{Var}({A}_{i}) = \frac{1} {N}\sum\limits _{i=1}^{N}{({a}_{ i}-\overline{a})}^{2},\quad \mbox{ for}\;i = 1,\ldots,N,$$

      and

      $$\mbox{ Cov}({A}_{i},{A}_{j}) = \frac{-1} {N(N - 1)}\sum\limits _{i=1}^{N}{({a}_{ i} -\overline{a})}^{2},\quad \mbox{ for}\;i\neq j = 1,\ldots,N.$$
    3. c.

      Now use b. to show that

      $$\mbox{ E}(T) = N\overline{c}\;\overline{a}\quad \mbox{ and}\quad \mathrm{Var}(T) = \frac{1} {N - 1}\sum\limits _{i=1}^{N}{({c}_{ i} -\overline{c})}^{2} \sum\limits _{j=1}^{N}{({a}_{ j} -\overline{a})}^{2},$$

      where\(\overline{a}\) and\(\overline{c}\) are the averages of thea’s andc’s, respectively.
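
    A brute-force check of both formulas is feasible for small N by averaging over all N! permutations; here is a sketch in Python (the particular a and c values are arbitrary illustrations):

    import itertools, statistics

    a = [1.0, 2.0, 4.0, 8.0]                  # arbitrary finite population
    c = [0.3, -1.0, 0.0, 2.5]                 # arbitrary constants
    N = len(a)
    vals = [sum(ci * ai for ci, ai in zip(c, perm))
            for perm in itertools.permutations(a)]
    abar, cbar = sum(a) / N, sum(c) / N
    print(statistics.mean(vals), N * cbar * abar)        # E(T) both ways
    var_formula = (sum((ci - cbar) ** 2 for ci in c)
                   * sum((ai - abar) ** 2 for ai in a) / (N - 1))
    print(statistics.pvariance(vals), var_formula)       # Var(T) both ways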

  6. 12.6.

    As an application of the previous problem, consider the Wilcoxon Rank Sum statistic W = sum of the ranks of the Y's in a two-sample problem where we assume continuous distributions so that there are no ties. The c values are 1 for \(i = m + 1,\ldots,N = m + n\) and 0 otherwise. With no ties the a's are just the integers \(1,\ldots,N\) corresponding to the ranks. Show that

    $$\mbox{ E}(W) = \frac{n(m + n + 1)} {2}$$

    and

    $$\mathrm{Var}(W) = \frac{mn(m + n + 1)} {12}.$$
  7. 12.7.

    In Section 12.4.4 (p. 461), the integral

    $$\begin{array}{rcl} P({X}_{1}< {X}_{2}) =\mathrm{ E}\left \{I({X}_{1}< {X}_{2})\right \}& =& \int\nolimits \nolimits \int\nolimits \nolimits I({x}_{1}< {x}_{2})\,dF({x}_{1})\,dF({x}_{2}) \\ & =& \int\nolimits \nolimits F(x)\,dF(x) \\ \end{array}$$

    arises, where \({X}_{1}\) and \({X}_{2}\) are independent with distribution function F. If F is continuous, argue that \(P({X}_{1}< {X}_{2}) = 1/2\) since \({X}_{1} < {X}_{2}\) and \({X}_{1} > {X}_{2}\) are equally likely. Also use iterated expectations and the probability integral transformation to get the same result. Finally, let \(u = F(x)\) in the final integral to get the result.

  8. 12.8.

    Suppose that X and Y represent some measurement that signals the presence of disease via a threshold to be used in screening for the disease. Assume that Y has distribution function G(y) and represents a diseased population, and X has distribution function F(x) and represents a disease-free population. A “positive” for a disease-free subject is declared if X > c and has probability 1 − F(c), where F(c) is called the specificity of the screening test. A “positive” for a diseased subject is declared if Y > c and has probability 1 − G(c), called the sensitivity of the test. The receiver operating characteristic (ROC) curve is a plot of \(1 - G({c}_{i})\) versus \(1 - F({c}_{i})\) for a sequence of thresholds \({c}_{1},\ldots,{c}_{k}\). Instead of a discrete set of points, we may let \(t = 1 - F(c)\), solve to get \(c = {F}^{-1}(1 - t)\), and plug into 1 − G(c) to get the ROC curve \(R(t) = 1 - G({F}^{-1}(1 - t))\). Show that

    $${\int\nolimits \nolimits }_{0}^{1}R(t)\,dt = \int\nolimits \nolimits \{1 - G(u)\}\,dF(u) = {\theta }_{\mathrm{XY}}$$

    for continuous F and G.

  9. 12.9.

    Use the asymptotic normality result for\(\widehat{{\theta }}_{\mathrm{XY}}\) to derive (12.15, p. 464).

  10. 12.10.

    Use (12.15, p. 464) to prove that the power of the Wilcoxon Rank Sum Test goes to 1 as m and n go to \(\infty \) and m ∕ N converges to a number λ between 0 and 1. You may assume that F and G are continuous.

  11. 12.11.

    Use (12.15, p. 464) to derive (12.16, p. 464).

  12. 12.12.

    Suppose that \(\widehat{{\theta }}_{\mathrm{XY}}\) is .7 and m = n. How large should m = n be in order to have approximately 80% power at α = .05 with the Wilcoxon Rank Sum Test?

  13. 12.13.

    Suppose that two normal populations with the same standard deviation σ differ in means by \(\Delta /\sigma=.7\). How large should m = n be in order to have approximately 80% power at α = .05 with the Wilcoxon Rank Sum Test?

  14. 12.14.

    The number of permutations needed to carry out a permutation test can be computationally overwhelming. Thus the typical use of a permutation test involves estimating the true permutation p-value by randomly selecting \(B = 1{,}000\), \(B = 10{,}000\), or even more of the possible permutations. If we use sampling with replacement, then \(B\widehat{p}\) has a binomial distribution with the true p-value p being the probability in the binomial. Consider the following situation where an approach of questionable ethics is under consideration. A company has just run a clinical trial comparing a placebo to a new drug that they want to market, but unfortunately the estimated p-value based on B = 1000 is around \(\widehat{p} =.10\). Everybody is upset because they “know” the drug is good. One clever doctor suggests that they run the simulation of B = 1000 over and over again until they get a \(\widehat{p}\) less than .05. Are they likely to find a run for which \(\widehat{p}\) is less than .05 if the true p-value is p = .10? Use the following calculation based on k separate (independent) runs resulting in \(\widehat{{p}}_{1},\ldots,\widehat{{p}}_{k}\):

    $$\begin{array}{rcl} P{(\min }_{1\leq i\leq k}\widehat{{p}}_{i} \leq.05)& =& 1 - P{(\min }_{1\leq i\leq k}\widehat{{p}}_{i} >.05) \\ & =& 1 - {[1 - P(\widehat{{p}}_{1} \leq.05)]}^{k} \\ & =& 1 - {[1 - P(\mbox{ Bin(1000,.1)} \leq50)]}^{k}\end{array}$$

    Plug in some values of k to find out how large k would need to be to get a \(\widehat{p}\) under .05 with reasonably high probability.
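
    A sketch of this calculation in Python/scipy (the values of k tried are arbitrary):

    from scipy.stats import binom

    tail = binom.cdf(50, 1000, 0.1)           # P(Bin(1000, .1) <= 50)
    for k in (10, 10**3, 10**5, 10**7):
        print(k, 1 - (1 - tail) ** k)         # P(min p-hat_i <= .05)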

  15. 12.15.

    The above problem is for given data, and we were trying to estimate the true permutation p-value conditional on the data set and therefore conditional on the set of test statistics computed for every possible permutation. In the present problem we want to think in terms of the overall unconditional probability distribution of \(B\widehat{p}\), where we have two stages: first the data are generated, and then we randomly select \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) from the set of permutations. The calculation of importance for justifying Monte Carlo tests is the unconditional probability \(P(\widehat{p} \leq\alpha ) = P(B\widehat{p} \leq B\alpha )\) that takes both stages into account.

    1. a.

      First we consider a simpler problem. Suppose that we get some data that seem to be normally distributed and decide to compute a t statistic, call it \({T}_{0}\). Then we discover that we have lost our t tables, but fortunately we have a computer. Thus we can generate B independent normal data sets and compute \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\), one for each data set. In this case \({T}_{0},{T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) are iid from a continuous distribution so that there are no ties among them with probability one. Let \(\widehat{p} = \sum\limits _{i=1}^{B}I({T}_{i}^{{_\ast}}\geq{T}_{0})/B\) and prove that \(B\widehat{p}\) has a discrete uniform distribution on the integers \((0,1,\ldots,B)\). (Hint: just use the argument that each ordering has equal probability \(1/((B + 1)!)\). For example, \(B\widehat{p} = 0\) occurs when \({T}_{0}\) is the largest value. How many orderings have \({T}_{0}\) as the largest value?)

    2. b.

      The above result also holds if \({T}_{0},{T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) have no ties and are merely exchangeable. However, if we are sampling \({T}_{1}^{{_\ast}},\ldots,{T}_{B}^{{_\ast}}\) with replacement from a finite set of permutations, then ties occur with probability greater than zero. Think of a way to randomly break ties so that we can get the same discrete uniform distribution.

    3. c.

      Assuming that \(B\widehat{p}\) has a discrete uniform distribution on the integers \((0,1,\ldots,B)\), show that \(P(\widehat{p} \leq\alpha ) = \alpha \) as long as (B + 1)α is an integer.

  16. 12.16.

    From (12.20, p. 469), \(d =.933\) for the Wilcoxon Rank Sum statistic for m = 10 and n = 6, assuming no ties. This corresponds to \(Z\) being the integers 1 to 16. For no ties and W = 67, the exact p-value for a one-sided test is .0467. Show that the normal approximation p-value is .0413 and the Box-Andersen p-value is .0426. Also find the Box-Andersen p-values using the approximations \(d = 1 + (1.8 - 3)/(m + n)\) and d = 1.

  17. 12.17.

    Show that the result “\(Q/(k - 1)\) of (12.31, p. 482) is AN\(\{1,2(n - 1)/(kn)\}\) as \(k \rightarrow \infty \) with n fixed” follows from (12.32, p. 483) and writing

    $$\begin{array}{lr} \sqrt{ k}\left ( \frac{Q} {k - 1} - \frac{n{F}_{\mathrm{R}}} {n - 1 + {F}_{\mathrm{R}}}\right ) = \frac{\sqrt{k}\{(N - 1)/(k - 1) - n\}{F}_{\mathrm{R}}} {(n - 1)\left ( \frac{k} {k - 1}\right ) + {F}_{\mathrm{R}}} \\ + \sqrt{k}(n{F}_{\mathrm{R}})\left ( \frac{1} {(n - 1)\left ( \frac{k} {k - 1}\right ) + {F}_{\mathrm{R}}} - \frac{1} {n - 1 + {F}_{\mathrm{R}}}\right )\end{array}$$

    Then show that each of the above two pieces converges to 0 in probability and use the delta theorem on \(n{F}_{\mathrm{R}}/(n - 1 + {F}_{\mathrm{R}})\). (Keep in mind that n is a fixed constant.)

  18. 12.18.

    Justify the statement: “use of \({F}_{\mathrm{R}}\) with an \(F(k - 1,N - k)\) reference distribution is supported by (12.32, p. 483) under \(k \rightarrow \infty \) and by the usual asymptotics \((k - 1){F}_{\mathrm{R}}\stackrel{d}{\rightarrow }{\chi }_{k-1}^{2}\) when \(n \rightarrow \infty \) with k fixed.” Hint: for the \(k \rightarrow \infty \) asymptotics, write an \(F(k - 1,N - k)\) random variable as an average of k − 1 \({\chi }_{1}^{2}\) random variables divided by an independent average of k(n − 1) \({\chi }_{1}^{2}\) random variables. Then subtract 1, multiply by \(\sqrt{k}\), and use the Central Limit Theorem and Slutsky's Theorem.

  19. 12.19.

    From Section 12.8.1 (p. 492), show that for \(T = \sum\limits _{i=1}^{n}{c}_{i}{d}_{i}\), \(\mathrm{E}({T}^{4}) = 3{(\sum\limits _{i=1}^{n}{d}_{i}^{2})}^{2} - 2\sum\limits _{i=1}^{n}{d}_{i}^{4}\). (Hint: first show that

    $${\left (\sum\nolimits {c}_{i}{d}_{i}\right )}^{4} = \sum\nolimits {c}_{i}^{4}{d}_{ i}^{4} + 6\sum\limits _{i<j}{c}_{i}^{2}{d}_{ i}^{2}{c}_{ j}^{2}{d}_{ j}^{2}$$

    plus sums of odd moments.)

  20. 12.20.

    Verify (12.39, p. 493) and (12.40, p. 493) for the Box-Andersen approximation in the matched pairs problem.

  21. 12.21.

    Using results in Section 12.4.2 (p. 458), show that \(\mathrm{E}\{{\overline{R}}_{.j}\} = (k + 1)/2\), \(\mathrm{Var}\{{\overline{R}}_{.j}\} = ({k}^{2} - 1)/(12n)\), and \(\mathrm{Cov}\{{\overline{R}}_{.j},{\overline{R}}_{.m}\} = -({k}^{2} - 1)/\{12n(k - 1)\}\), where \({R}_{i1},\ldots,{R}_{ik}\) are Friedman ranks in the ith block randomly assigned to the integers 1 to k and independent of the ranks in the other blocks. Putting these results together, the covariance matrix of \(\overline{R} = {({\overline{R}}_{.1},\ldots,{\overline{R}}_{.k})}^{T}\) is \(\{k(k + 1)/(12n)\}{C}_{k}\), where \({C}_{k} = {I}_{k} -{\mathbf{1}}_{k}{\mathbf{1}}_{k}^{T}/k\). Using the fact that \({C}_{k}\) is idempotent, find a generalized inverse of the covariance matrix of \(\overline{R}\), call it G, and show that (12.45, p. 501) is given by \({\overline{R}}^{T}G\overline{R}\).

  22. 12.22.

    Similar to Problem 12.18, explain why asymptotic normality of the Friedman statistic (12.45, p. 501) supports use of the F statistic in (12.44, p. 500) on the within-row Friedman ranks with an \(F(k - 1,(k - 1)(n - 1))\) reference distribution.

  23. 12.23.

    From Section 12.9.4 (p. 503) verify the permutation moments in (12.49, p. 504) and (12.50, p. 504). Use results from Section 12.4.2 (p. 458) under the assumption that permutations are independently carried out within rows.

  24. 12.24.

    From Section 12.10.1 (p. 506) consider the two independent binomial testing problem where \(m = 12\) (\(= {N}_{11} + {N}_{12}\)) for Group 1 and \(n = 4\) (\(= {N}_{21} + {N}_{22}\)) for Group 2, and we want to test \({H}_{0} : {p}_{1} = {p}_{2}\) versus \({H}_{a} : {p}_{1} < {p}_{2}\), where \({p}_{1}\) and \({p}_{2}\) are the respective probabilities of falling in Category 1. Suppose that \(T = 4\) (\(= {N}_{11} + {N}_{21}\)) is observed. Write down the conditional probability distribution of \({N}_{11}\,\vert \,T = 4\) (just the hypergeometric probabilities for \({n}_{11} = 0,1,2,3,4\)). Also, letting each of 0, 1, 2, 3, 4 be considered observed values for \({N}_{11}\), list:

    1. a.

      the Fisher Exact p-values

    2. b.

      the Fisher Exact mid-p values.

  25. 12.25.

    For a multinomial vector \(({N}_{11},{N}_{12},{N}_{21},{N}_{22})\), \({N}_{11} + {N}_{12} + {N}_{21} + {N}_{22} = N\), with associated probabilities \(({p}_{11},{p}_{12},{p}_{21},{p}_{22})\), show that the variance of \({N}_{12} - {N}_{21}\) is \(N\{{p}_{12} + {p}_{21} - {({p}_{12} - {p}_{21})}^{2}\}\).

  26. 12.26.

    Show that (12.58, p. 515) follows from (12.57, p. 515) if the derivative can be taken inside the expectation.

  27. 12.27.

    Show why \({\alpha }_{k} \rightarrow \alpha \) and Condition 3 (p. 517) imply that

    $$\frac{{c}_{k} - {\mu }_{{S}_{k}}({\theta }_{0})} {{\sigma }_{{S}_{k}}({\theta }_{0})} \rightarrow{z}_{\alpha }\;\;\mbox{ as}\;\;k \rightarrow \infty.$$

    (Hint: it helps to use Pólya’s result on uniform convergence, Theorem 5.6, p. 222.)

  28. 12.28.

    Verify that Theorem 5.33 (p. 263) applies to\(\overline{X}\) when\({X}_{1}^{{_\ast}},\ldots,{X}_{{N}_{k}}^{{_\ast}}\) are iid fromF(x) having mean 0 and finite variance σ2, and\({X}_{i} = {X}_{i}^{{_\ast}} + \delta /\sqrt{{N}_{k}},i = 1,\ldots,{N}_{k}\).

  29. 12.29.

    Verify that Theorem 5.33 (p. 263) applies to \(S = \sum\limits _{i=1}^{N}I({X}_{i} > 0)\) when \({X}_{1}^{{_\ast}},\ldots,{X}_{{N}_{k}}^{{_\ast}}\) are iid from F(x) having median 0 and \({X}_{i} = {X}_{i}^{{_\ast}} + \delta /\sqrt{{N}_{k}},i = 1,\ldots,{N}_{k}\).

  30. 12.30.

    The data are \({Y }_{1},\ldots,{Y }_{n}\) iid with median θ. For \({H}_{0} : \theta= 0\) versus \({H}_{a} : \theta > 0\), use the normal approximation to the binomial distribution to find a power approximation for the sign test and compare it to the expression \(1 - \Phi \left ({z}_{\alpha } -\sqrt{N}2f(0){\theta }_{a}\right )\) derived from Theorem 12.7 (p. 517), where \({\theta }_{a}\) is an alternative. Where are the differences?

  31. 12.31.

    For the Wilcoxon Signed Rank statistic, calculate an approximation to the power of a .05 level test for a sample of size N = 20 from the Laplace distribution with a shift of .6 in standard deviation units. Compare with the simulation estimate .63 from Randles and Wolfe [1979, p. 116].

  32. 12.32.

    Consider the two-sample problem where \({X}_{1},\ldots,{X}_{m}\) and \({Y }_{1},\ldots,{Y }_{n}\) are iid from F(x) under \({H}_{0}\), but the Y's are shifted to the right by \({\Delta }_{k} = \delta /\sqrt{{N}_{k}}\) under a sequence of Pitman alternatives. Verify Conditions 3–6 (p. 517), making any assumptions necessary, and show that the efficacy of the two-sample t test is given by eff\((t) = \sqrt{\lambda (1 - \lambda )}/\sigma \), where σ is the standard deviation of F.

  33. 12.33.

    Consider a variable having a Likert scale with possible answers 1, 2, 3, 4, 5. Suppose that we are thinking of a situation where the treatment group has answers that tend to be spread toward 1 or 5 and away from the middle. Can we design a rank test to handle this? Here is one formulation. For the two-sample problem suppose that the base density is a beta density of the following form:

    $$\frac{\Gamma (2(1 - \theta ))} {\Gamma (1 - \theta )\Gamma (1 - \theta )}{x}^{-\theta }{(1 - x)}^{-\theta },\;\;0< x< 1,\;\;\theta< 1.$$

    A sketch of this density shows that it spreads toward the ends as θ increases. Using the LMPRT theory, find the optimal score function for \({H}_{0} : \theta= {\theta }_{0}\) versus \({H}_{a} : \theta > {\theta }_{0}\), where \(0 \leq{\theta }_{0} < 1\). At \({\theta }_{0} = 0\), the score function simplifies to \(\phi (u) = -2 -\log [u(1 - u)]\). Sketch this score function and comment on whether a linear rank statistic of the form \(S = \sum\limits _{i=1}^{m}\phi ({R}_{i}/(N + 1))\) makes sense here.

  34. 12.34.

    For the two-sample problem with \(G(x) = (1 - \Delta )F(x) + \Delta {F}^{2}(x)\) and \({H}_{0} : \Delta= 0\) versus \({H}_{a} : \Delta > 0\), show that the Wilcoxon Rank Sum test is the locally most powerful rank test. (You may take \(h(x) = f(x)\) in the expression for \(P(R = r)\).)

  35. 12.35.

    In some two-sample situations (treatment and control), only a small proportion of the treatment group responds to the treatment. Johnson et al. [1987] were motivated by data on sister chromatid exchanges in the chromosomes of smokers, where only a small number of units are affected by the treatment, that is, where the treatment group seemed to have a small but higher proportion of large values than the control group. For this two-sample problem, they proposed a mixture alternative,

    $$G(x) = (1 - \Delta )F(x) + \Delta K(x),$$

    where K(x) is stochastically larger than F(x), i.e., \(K(x) \leq F(x)\) for all x, and Δ refers to the proportion of responders. For \({H}_{0} : \Delta= 0\) versus \({H}_{a} : \Delta > 0\), verify that the locally most powerful rank test has optimal score function \(k({F}^{-1}(u))/f({F}^{-1}(u)) - 1\). Let F(x) and K(x) be normal distribution functions with means \({\mu }_{1}\) and \({\mu }_{2}\), respectively, \({\mu }_{2} > {\mu }_{1}\), and variance \({\sigma }^{2}\). Show that the optimal score function is

    $$\phi (u) =\exp (-{\delta }^{2}/2)\exp (\delta {\Phi }^{-1}(u)) - 1,$$
    (12.61)

    where\(\delta= ({\mu }_{2} - {\mu }_{1})/\sigma \).

  36. 12.36.

    Related to the previous problem, Johnson et al. [1987] give the following example data:

    X: 99 10 10 14 14 14 15 16 20

    Y: 6 10 13 15 18 21 22 23 30 37

    By sampling from the permutation distribution of the linear rank statistic \(\sum\limits _{i=m+1}^{m+n}\phi ({R}_{i}/(m + n + 1))\) with the score function in (12.61), estimate the one-sided permutation p-values with δ = 1 and δ = 2, as in the sketch below. For comparison, also give one-sided p-values for the Wilcoxon rank sum (exact) and pooled t tests (from the t table).
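
    One way to organize the sampling, sketched in Python (B = 10,000 and the seed are arbitrary choices; the data are entered exactly as printed above):

    import numpy as np
    from scipy.stats import norm, rankdata

    x = np.array([99, 10, 10, 14, 14, 14, 15, 16, 20])
    y = np.array([6, 10, 13, 15, 18, 21, 22, 23, 30, 37])
    m, n, delta = len(x), len(y), 1.0         # rerun with delta = 2.0

    def phi(u):                               # score function (12.61)
        return np.exp(-delta ** 2 / 2) * np.exp(delta * norm.ppf(u)) - 1

    def stat(z):                              # sum of scores over the Y slots
        r = rankdata(z)                       # midranks handle any ties
        return phi(r[m:] / (m + n + 1)).sum()

    z = np.concatenate([x, y])
    obs = stat(z)
    rng = np.random.default_rng(0)
    sims = np.array([stat(rng.permutation(z)) for _ in range(10_000)])
    print((sims >= obs).mean())               # estimated one-sided p-value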

  37. 12.37.

    Similar in motivation to Problem 12.35 (p. 529), Conover and Salsburg [1988] proposed the mixture alternative

    $$G(x) = (1 - \Delta )F(x) + \Delta {\left \{F(x)\right \}}^{a}.$$

    Note that \({\left \{F(x)\right \}}^{a}\) is the distribution function of the maximum of a random variables with distribution function F(x). For \({H}_{0} : \Delta= 0\) versus \({H}_{a} : \Delta > 0\), verify that the locally most powerful rank test has optimal score function \({u}^{a-1}\).

  38. 12.38.

    For the data in Problem 12.36 (p. 530), by sampling from the permutation distribution of the linear rank statistic \(\sum\limits _{i=m+1}^{m+n}\phi ({R}_{i}/(m + n + 1))\) with score function \(\phi (u) = {u}^{a-1}\), estimate the one-sided permutation p-value with a = 5. For comparison, also give one-sided p-values for the Wilcoxon rank sum (exact) and pooled t tests (from the t table).

  39. 12.39.

    Conover and Salsburg [1988] gave the following example data set on changes from baseline of serum glutamic oxaloacetic transaminase (SGOT):

    X: -50 -17 -10 -3 4 7 8 12 26 37

    Y: -116 -56 20 24 29 29 35 35 37 41

    Plot the data and decide what type of test should be used to detect larger values in some or all of the Y's. Then give the one-sided p-value for that test and for one other possible test.

  40. 12.40.

    Use perm.sign to get the exact one-sided p-value 0.044 for the data given in Example 12.2 (p. 498). Then, by trial and error, get an exact confidence interval for the center of the distribution with coverage at least 90%. Also give the exact confidence interval for the median based on the order statistics with coverage at least 90%.