1 Mixtures of Experiments

The present chapter has a somewhat different character from the preceding ones. It is concerned with problems regarding the proper choice and interpretation of tests and confidence procedures, problems which—despite a large literature—have not found a definitive solution. The discussion will thus be more tentative than in earlier chapters, and will focus on conceptual aspects more than on technical ones.

Consider the situation in which either the experiment \({\mathcal {E}}\) of observing a random quantity X with density \(p_{\theta }\) (with respect to \(\mu \)) or the experiment \({\mathcal {F}}\) of observing an X with density \(q_{\theta }\) (with respect to \(\nu \)) is performed with probability p and \(q=1-p\), respectively. On the basis of X, and knowledge of which of the two experiments was performed, it is desired to test \(H_0:\theta =\theta _0\) against \(H_1:\theta =\theta _1\). For the sake of convenience it will be assumed that the two experiments have the same sample space and the same \(\sigma \)-field of measurable sets. The sample space of the overall experiment consists of the union of the sets

$$ {\mathcal {X}}_0=\{(I,x):I=0,\ x\in {\mathcal {X}}\}\quad {\mathcal {X}}_1=\{(I,x):I=1,\ x\in {\mathcal {X}}\}, $$

where I is 0 or 1 as \({\mathcal {E}}\) or \({\mathcal {F}}\) is performed.

A level-\(\alpha \) test of \(H_0\) is defined by its critical function

$$ \phi _i(x)=\phi (i,x) $$

and must satisfy

$$\begin{aligned} \begin{aligned}&pE_0\bigl [\phi _0(X)\mid {\mathcal {E}}\bigr ]+qE_0\bigl [\phi _1(X)\mid {\mathcal {F}}\bigr ] =\\&p\int \phi _0p_{\theta _0}\,d\mu +q\int \phi _1q_{\theta _0}\,d\nu \le \alpha . \end{aligned} \end{aligned}$$
(10.1)

Suppose that p is unknown, so that \(H_0\) is composite. Then a level-\(\alpha \) test of \(H_0\) satisfies (10.1) for all \(0<p<1\), and must therefore satisfy

$$\begin{aligned} \alpha _0=\int \phi _0p_{\theta _0}\,d\mu \le \alpha \quad \alpha _1=\int \phi _1q_{\theta _0}\,d\nu \le \alpha . \end{aligned}$$
(10.2)

As a result, a UMP test against \(H_1\) exists and is given by

$$\begin{aligned} {\phi _0(x)=\left\{ \begin{array}{l} 1 \quad \text {when } p_{\theta _1}(x)> c_0p_{\theta _0}(x), \\ \gamma _0 \;\, \text {when } p_{\theta _1}(x) = c_0p_{\theta _0}(x),\\ 0 \quad \text {when } p_{\theta _1}(x)< c_0p_{\theta _0}(x),\end{array}\right. \text {and}\quad \phi _1(x)=\left\{ \begin{array}{l} 1\quad \text {when } q_{\theta _1}(x) > c_1q_{\theta _0}(x),\\ \gamma _1\;\, \text {when } q_{\theta _1}(x) = c_1q_{\theta _0}(x),\\ 0\quad \text {when } q_{\theta _1}(x) < c_1q_{\theta _0}(x),\end{array}\right. } \end{aligned}$$
(10.3)

where the \(c_i\) and \(\gamma _i\) are determined by

$$\begin{aligned} E_{\theta _0}\bigl [\phi _0(X)\mid {\mathcal {E}}\bigr ]= E_{\theta _0}\bigl [\phi _1(X)\mid {\mathcal {F}}\bigr ]=\alpha . \end{aligned}$$
(10.4)

The power of this test against \(H_1\) is

$$\begin{aligned} \beta (p)=p\beta _0+q\beta _1 \end{aligned}$$
(10.5)

with

$$\begin{aligned} \beta _0=E_{\theta _1}\bigl [\phi _0(X)\mid {\mathcal {E}}\bigr ],\qquad \beta _1=E_{\theta _1}\bigl [\phi _1(X)\mid {\mathcal {F}}\bigr ]. \end{aligned}$$
(10.6)

The situation is analogous to that of Section 4.4 and, as was discussed there, it may be more appropriate to consider the conditional power \(\beta _i\) when \(I=i\), since this is the power pertaining to the experiment that has been performed. As in the earlier case, the conditional power \(\beta _I\) can also be interpreted as an unbiased estimate of the unknown \(\beta (p)\), since

$$ E(\beta _I)=p\beta _0+q\beta _1=\beta (p). $$

So far, the probability p of performing experiment \({\mathcal {E}}\) has been assumed to be unknown. Suppose instead that the value of p is known, say \(p=\frac{1}{2}\). The hypothesis \(H_0\) can be tested at level \(\alpha \) by means of (10.3) as before, but the power of the test is now known to be \(\frac{1}{2}(\beta _0+\beta _1)\). Suppose that \(\beta _0=0.3\) and \(\beta _1=0.9\), so that at the start of the experiment the power is \(\frac{1}{2}(0.3+0.9)=0.6\). Now a fair coin is tossed to decide whether to perform \({\mathcal {E}}\) (in case of heads) or \({\mathcal {F}}\) (in case of tails). If the coin shows heads, should the assessment of the power be revised down to 0.3?

Let us postpone the answer and first consider another change resulting from the knowledge of p. A level-\(\alpha \) test of \(H_0\) now no longer needs to satisfy (10.2) but only the weaker condition

$$\begin{aligned} \frac{1}{2} \left[ \int \phi _0p_{\theta _0}\,d\mu +\int \phi _1q_{\theta _0}\,d\nu \right] \le \alpha . \end{aligned}$$
(10.7)

The most powerful test against \(H_1\) is then again given by (10.3), but now with \(c_0=c_1=c\) and \(\gamma _0=\gamma _1=\gamma \) determined by (Problem 10.3)

$$\begin{aligned} {\textstyle \frac{1}{2}}(\alpha _0+\alpha _1)=\alpha , \end{aligned}$$
(10.8)

where

$$\begin{aligned} \alpha _0=E_{\theta _0}\bigl [\phi _0(X)\mid {\mathcal {E}}\bigr ],\qquad \alpha _1 =E_{\theta _0}\bigl [\phi _1(X)\mid {\mathcal {F}}\bigr ]. \end{aligned}$$
(10.9)

As an illustration of the change, suppose that experiment \({\mathcal {F}}\) is reasonably informative, say that the power \(\beta _1\) given by (10.6) is 0.8, but that \({\mathcal {E}}\) has little ability to distinguish between \(p_{\theta _0}\) and \(p_{\theta _1}\). Then it will typically not pay to put much of the rejection probability into \(\alpha _0\); if \(\beta _0\) [given by (10.6)] is sufficiently small, the best choice of \(\alpha _0\) and \(\alpha _1\) satisfying (10.8) is approximately \(\alpha _0\approx 0\), \(\alpha _1\approx 2\alpha \). The situation will be reversed if \({\mathcal {F}}\) is so informative that it can attain power close to 1 with an \(\alpha _1\) much smaller than \(\alpha /2\).
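The effect can be made concrete with a small numerical sketch. The code below is purely illustrative: the two normal experiments, the mean shift, and the level are assumed values, not taken from the text. It searches over the splits \((\alpha _0,\alpha _1)\) allowed by (10.8) and reports the one maximizing the average power (10.5) when \({\mathcal {E}}\) is nearly uninformative and \({\mathcal {F}}\) is informative.

```python
# Illustrative sketch only: experiments, mean shift, and level are assumed values.
import numpy as np
from scipy.stats import norm

alpha = 0.05                    # overall level in (10.8)
delta = 2.0                     # assumed value of theta_1 - theta_0
sigma_E, sigma_F = 20.0, 1.0    # E nearly uninformative, F informative

def power(a, sigma):
    """Power of the one-sided size-a test based on one N(theta, sigma^2) observation."""
    return 0.0 if a <= 0 else norm.sf(norm.isf(a) - delta / sigma)

# average power (10.5) with p = q = 1/2, as a function of alpha_0 (alpha_1 = 2*alpha - alpha_0)
grid = np.linspace(0.0, 2 * alpha, 2001)
avg = [0.5 * (power(a0, sigma_E) + power(2 * alpha - a0, sigma_F)) for a0 in grid]
a0_best = grid[int(np.argmax(avg))]
print(f"alpha_0 = {a0_best:.4f}, alpha_1 = {2 * alpha - a0_best:.4f}, average power = {max(avg):.4f}")
# With these assumed numbers the maximum is attained with alpha_0 near 0 and
# alpha_1 near 2*alpha, in line with the discussion above.
```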

When p is known, there are therefore two issues. Should one choose the procedure that is best on average over both experiments, or the best conditional procedure? And, for a given test or confidence procedure, should probabilities such as level, power, and confidence coefficient be calculated conditionally, given the experiment that has been selected, or unconditionally? The underlying question is of course the same: Is a conditional or unconditional point of view more appropriate?

The answer cannot be found within the model but depends on the context. If the overall experiment will be performed many times, for example in an industrial or agricultural setting, the average performance may be the principal feature of interest, and an unconditional approach suitable. However, if repetitions refer to different clients, or are potential rather than actual, interest will focus on the particular event at hand, and conditioning seems more appropriate. Unfortunately, as will be seen in later sections, it is then often not clear how the conditioning events should be chosen.

The difference between the conditional and the unconditional approaches tends to be most striking, and a choice between them therefore most pressing, when the two experiments \({\mathcal {E}}\) and \({\mathcal {F}}\) differ sharply in the amount of information they contain, if for example the difference \(|\beta _1-\beta _0|\) in (10.6) is large. To illustrate an extreme situation in which this is not the case, suppose that \({\mathcal {E}}\) and \({\mathcal {F}}\) consist in observing X with distribution \(N(\theta ,1)\) and \(N(-\theta ,1)\) respectively, that one of them is selected with known probabilities p and q, respectively, and that it is desired to test \(H:\theta =0\) against \(K:\theta >0\). Here \({\mathcal {E}}\) and \({\mathcal {F}}\) contain exactly the same amount of information about \(\theta \). The unconditional most powerful level-\(\alpha \) test of H against \(\theta _1>0\) is seen to reject (Problem 10.5) when \(X>c\) if \({\mathcal {E}}\) is performed, and when \(X<-c\) if \({\mathcal {F}}\) is performed, where \(P_0(X>c)=\alpha \). The test is UMP against \(\theta >0\), and happens to coincide with the UMP conditional test.
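A quick numerical check of the last claim (with arbitrary choices of \(\alpha \) and \(\theta _1\)) confirms that the test rejecting \(X>c\) under \({\mathcal {E}}\) and \(X<-c\) under \({\mathcal {F}}\) has conditional level \(\alpha \) and identical conditional power in the two experiments, so that conditional and unconditional evaluations agree here.

```python
# Illustrative check; alpha and theta_1 are arbitrary choices.
from scipy.stats import norm

alpha, theta1 = 0.05, 1.5
c = norm.isf(alpha)             # P_0(X > c) = alpha

level_E = norm.sf(c)            # conditional level under E: X ~ N(0, 1)
level_F = norm.cdf(-c)          # conditional level under F: X ~ N(0, 1)
power_E = norm.sf(c - theta1)   # conditional power under E: X ~ N(theta1, 1)
power_F = norm.cdf(theta1 - c)  # conditional power under F: X ~ N(-theta1, 1)
print(level_E, level_F)         # both equal alpha
print(power_E, power_F)         # equal, so conditioning changes nothing in this case
```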

The issues raised here extend in an obvious way to mixtures of more than two experiments. As an illustration of a mixture over a continuum, consider a regression situation. Suppose that \(X_1,\ldots ,X_n\) are independent, and that the conditional density of \(X_i\) given \(t_i\) is

$$ \frac{1}{\sigma }f\left( {x_i-\alpha -\beta t_i\over \sigma }\right) . $$

The \(t_i\) themselves are obtained with error. They may for example be independently normally distributed with mean \(c_i\) and known variance \(\tau ^2\), where the \(c_i\) are the intended values of the \(t_i\). Then it will again often be the case that the most appropriate inference concerning \(\alpha \), \(\beta \), and \(\sigma \) is conditional on the observed values of the t’s (which represent the experiment actually being performed). Whether this is the case will, as before, depend on the context.

The argument for conditioning also applies when the probabilities of performing the various experiments are unknown, say depend on a parameter \(\vartheta \), provided \(\vartheta \) is unrelated to \(\theta \), so that which experiment is chosen provides no information concerning \(\theta \). A more precise statement of this generalization is given at the end of the next section.

2 Ancillary Statistics

Mixture models can be described in the following general terms. Let \( \{{\mathcal {E}}_z, z\in \mathcal{Z} \}\) denote a collection of experiments of which one is selected according to a known probability distribution over \(\mathcal Z\). For any given z, the experiment \({\mathcal {E}}_z\) consists in observing a random quantity X, which has a distribution \(P_{\theta }(\cdot \mid z)\). Although this structure seems rather special, it is common to many statistical models.

Consider a general statistical model in which the observations X are distributed according to \(P_\theta \), \(\theta \in \Omega \), and suppose there exists an ancillary statistic, that is, a statistic Z whose distribution F does not depend on \(\theta \). Then one can think of X as being obtained by a two-stage experiment: observe first a random quantity Z with distribution F; given \(Z=z\), observe a quantity X with distribution \(P_\theta (\cdot \mid z)\). The resulting X is distributed according to the original distribution \(P_\theta \). Under these circumstances, the argument of the preceding section suggests that it will frequently be appropriate to take the conditional point of view. (Unless Z is discrete, these definitions involve technical difficulties concerning sets of measure zero and the existence of conditional distributions, which we shall disregard.)

An important class of models in which ancillary statistics exist is obtained by invariance considerations. Suppose the model \({\mathcal {P}}=\{P_\theta , \theta \in \Omega \}\) remains invariant under the transformations

$$ X\rightarrow gX,\quad \theta \rightarrow \bar{g}{}\theta ;\qquad g\in G,\quad \bar{g}{}\in \bar{G}{}, $$

and that \(\bar{G}{}\) is transitive over \(\Omega \).

Theorem 10.2.1

If \({\mathcal {P}}\) remains invariant under G and if \(\bar{G}{}\) is transitive over \(\Omega \), then a maximal invariant T (and hence any invariant) is ancillary.

Proof. It follows from Theorem 6.3.2 that the distribution of a maximal invariant under G is invariant under \(\bar{G}{}\). Since \(\bar{G}{}\) is transitive, only constants are invariant under \(\bar{G}{}\). The probability \(P_\theta (T\in B)\) is therefore constant, independent of \(\theta \), for all B, as was to be proved. \(\blacksquare \)

As an example, suppose that \(X=(X_1,\ldots ,X_n)\) is distributed according to a location family with joint density \(f(x_1-\theta ,\ldots ,x_n-\theta )\). The most powerful test of \(H:\theta =\theta _0\) against \(K:\theta =\theta _1>\theta _0\) rejects when

$$\begin{aligned} {f(x_1-\theta _1,\ldots ,x_n-\theta _1)\over f(x_1-\theta _0,\ldots ,x_n-\theta _0)}\ge c. \end{aligned}$$
(10.10)

Here the set of differences \(Y_i=X_i-X_n\) (\(i=1,\ldots ,n-1\)) is ancillary. This is obvious by inspection and follows from Theorem 10.2.1 in conjunction with Example 6.2.1(i). It may therefore be more appropriate to consider the testing problem conditionally given \(Y_1=y_1,\ldots ,Y_{n-1}=y_{n-1}\). To determine the most powerful conditional test, transform to \(Y_1,\ldots ,Y_n\), where \(Y_n=X_n\). The conditional density of \(Y_n\) given \(y_1,\ldots ,y_{n-1}\) is

$$\begin{aligned} p_\theta (y_n\mid y_1,\ldots ,y_{n-1})={f(y_1+y_n-\theta ,\ldots ,y_{n-1}+y_n-\theta ,y_n-\theta )\over \int f(y_1+u,\ldots ,y_{n-1}+u,u)\,du} \end{aligned}$$
(10.11)

and the most powerful conditional test rejects when

$$\begin{aligned} {p_{\theta _1}(y_n\mid y_1,\ldots ,y_{n-1})\over p_{\theta _0}(y_n\mid y_1,\ldots ,y_{n-1})}>c(y_1,\ldots ,y_{n-1}). \end{aligned}$$
(10.12)

In terms of the original variables this becomes

$$\begin{aligned} {f(x_1-\theta _1,\ldots ,x_n-\theta _1)\over f(x_1-\theta _0,\ldots ,x_n-\theta _0)}>c(x_1-x_n,\ldots ,x_{n-1}-x_n). \end{aligned}$$
(10.13)

The constant \(c(x_1-x_n,\ldots ,x_{n-1}-x_n)\) is determined by the fact that the conditional probability of (10.13), given the differences of the x’s, is equal to \(\alpha \) when \(\theta =\theta _0\).

For describing the conditional test (10.12) and calculating the critical value \(c(y_1,\ldots ,y_{n-1})\), it is useful to note that the statistic \(Y_n=X_n\) could be replaced by any other \(Y_n\) satisfying the equivariance condition

$$\begin{aligned} Y_n(x_1+a,\ldots ,x_n+a)=Y_n(x_1,\ldots ,x_n)+a \quad \text {for all} \; a. \end{aligned}$$
(10.14)

This condition is satisfied for example by the mean of the X’s, the median, or any of the order statistics. As will be shown in Lemma 10.2.1, any two statistics \(Y_n\) and \(Y_n'\) satisfying (10.14) differ only by a function of the differences \(Y_i=X_i-X_n\) (\(i=1,\ldots ,n-1\)). Thus conditionally, given the values \(y_1,\ldots ,y_{n-1}\), \(Y_n\) and \(Y_n'\) differ only by a constant, and their conditional distributions (and the critical values \(c(y_1,\ldots ,y_{n-1})\)) differ by the same constant. One can therefore choose \(Y_n\), subject to (10.14), to make the conditional calculations as convenient as possible.

Lemma 10.2.1

If \(Y_n\) and \(Y_n'\) both satisfy (10.14), then their difference \(\Delta =Y_n'-Y_n\) depends on \((x_1,\ldots ,x_n)\) only through the differences \((x_1-x_n,\ldots ,x_{n-1}-x_n)\).

Proof. Since \(Y_n\) and \(Y_n'\) satisfy (10.14),

$$ \Delta (x_1+a,\ldots ,x_n+a)=\Delta (x_1,\ldots ,x_n)\quad \text {for all}\; a. $$

Putting \(a=-x_n\), one finds

$$ \Delta (x_1,\ldots ,x_n)=\Delta (x_1-x_n,\ldots ,x_{n-1}-x_n,0), $$

which is a function of the differences. \(\blacksquare \)

The existence of ancillary statistics is not confined to models that remain invariant under a transitive group \(\bar{G}{}\). The mixture and regression examples of Section 10.1 provide illustrations of ancillaries without the benefit of invariance. Further examples are given in Problems 10.8–10.13.

If conditioning on an ancillary statistic is considered appropriate because it makes the inference more relevant to the situation at hand, it is desirable to carry the process as far as possible and hence to condition on a maximal ancillary. An ancillary Z is said to be maximal if there does not exist an ancillary U such that \(Z=f(U)\) without Z and U being equivalent. [For a more detailed treatment, which takes account of the possibility of modifying statistics on sets of measure zero without changing their probabilistic properties, see Basu (1959).]

Conditioning, like sufficiency and invariance, leads to a reduction of the data. In the conditional model, the ancillary is no longer part of the random data but has become a constant. As a result, conditioning often leads to a great simplification of the inference. Choosing a maximal ancillary for conditioning thus has the additional advantage of providing the greatest reduction of the data.

Unfortunately, maximal ancillaries are not always unique, and one must then decide which maximal ancillary to choose for conditioning. [This problem is discussed by Cox (1971) and Becker and Gordon (1983).] If attention is restricted to ancillary statistics that are invariant under a given group G, the maximal ancillary of course coincides with the maximal invariant.

Another issue concerns the order in which to apply reduction by sufficiency and ancillarity.

Example 10.2.1

Let \((X_i,Y_i)\), \(i=1,\ldots ,n\), be independently distributed according to a bivariate normal distribution with \(E(X_i)=E(Y_i)=0\), \({\text {Var}(X_i)=\text {Var}(Y_i)=1}\), and unknown correlation coefficient \(\rho \). Then \(X_1,\ldots ,X_n\) are independently distributed as N(0, 1) and are therefore ancillary. The conditional density of the Y’s given \(X_1=x_1,\ldots ,X_n=x_n\) is

$$ C\exp \left( -{1\over 2(1-\rho ^2)}\sum (y_i-\rho x_i)^2\right) , $$

with the sufficient statistics \((\sum Y^2_i,\sum x_iY_i)\).

Alternatively, one could begin by noticing that \((Y_1,\ldots ,Y_n)\) is ancillary. The conditional distribution of the X’s given \(Y_1=y_1,\ldots ,Y_n=y_n\) then admits the sufficient statistics \((\sum X^2_i,\sum X_iy_i)\). A unique maximal ancillary V does not exist in this case, since both the X’s and Y’s would have to be functions of V. Thus V would have to be equivalent to the full sample \((X_1,Y_1),\ldots ,(X_n,Y_n)\), which is not ancillary.

Suppose instead that the data are first reduced to the sufficient statistics \(T=(\sum X^2_i+\sum Y^2_i,\sum X_iY_i)\). Based on T, no nonconstant ancillaries appear to exist. This example and others like it suggest that it is desirable to reduce the data as far as possible through sufficiency, before attempting further reduction by means of ancillary statistics. \(\blacksquare \)

Note that contrary to this suggestion, in the location example at the beginning of the section, the problem was not first reduced to the sufficient statistics \(X_{(1)}<\cdots <X_{(n)}\). The omission can be justified in hindsight by the fact that the optimal conditional tests are the same whether or not the observations are first reduced to the order statistics.

In the structure described at the beginning of the section, the variable Z that labels the experiment was assumed to have a known distribution. The argument for conditioning on the observed value of Z does not depend on this assumption. It applies also when the distribution of Z depends on an unknown parameter \(\vartheta \), which is unrelated to \(\theta \) and hence by itself contains no information about \(\theta \); that is, when the distribution of Z depends only on \(\vartheta \), the conditional distribution of X given \(Z=z\) depends only on \(\theta \), and the parameter space \(\Omega \) for \((\theta ,\vartheta )\) is a Cartesian product \(\Omega =\Omega _1 \times \Omega _2\), with

$$\begin{aligned} (\theta ,\vartheta )\in \Omega \quad \Leftrightarrow \quad \theta \in \Omega _1 \quad \text {and}\quad \vartheta \in \Omega _2~. \end{aligned}$$
(10.15)

(The parameters \(\theta \) and \(\vartheta \) are then said to be variation independent, or unrelated.)

Statistics Z satisfying this more general definition are called partial ancillary or S-ancillary. (The term ancillary without modification will be reserved here for a statistic that has a known distribution.) Note that if \(X=(T,Z)\) and Z is a partial ancillary, then T is a partial sufficient statistic in the sense of Problem 3.65. For a more detailed discussion of this and related concepts of partial ancillarity, see for example Basu (1978) and Barndorff-Nielsen (1978).

Example 10.2.2

Let X and Y be independent with Poisson distributions \(P(\lambda )\) and \(P(\mu )\), and let the parameter of interest be \(\theta =\mu /\lambda \). It was seen in Section 4.5 that the conditional distribution of Y given \(Z=X+Y=z\) is binomial \(b(p,\,z)\) with \(p=\mu /(\lambda +\mu )=\theta /(\theta +1)\) and therefore depends only on \(\theta \), while the distribution of Z is Poisson with mean \(\vartheta =\lambda +\mu \). Since the parameter space \(0<\lambda \), \(\mu <\infty \) is equivalent to the Cartesian product of \(0<\theta <\infty \), \(0<\vartheta <\infty \), it follows that Z is S-ancillary for \(\theta \).

The UMP unbiased level-\(\alpha \) test of \(H:\mu \le \lambda \) against \(\mu >\lambda \) is UMP also among all tests whose conditional level given z is \(\alpha \) for all z. (The class of conditional tests coincides exactly with the class of all tests that are similar on the boundary \(\mu =\lambda \).) \(\blacksquare \)
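For a concrete sketch of this conditional test (hypothetical counts; the randomization needed to attain level \(\alpha \) exactly is omitted), one may compute the conditional p-value of \(H:\mu \le \lambda \) from the binomial distribution of Y given \(Z=z\), which under the boundary \(\mu =\lambda \) is \(b(\frac{1}{2},z)\).

```python
# Sketch with assumed data; the randomized part of the exact test is omitted.
from scipy.stats import binom

x, y, alpha = 3, 11, 0.05           # hypothetical Poisson counts and level
z = x + y
p_value = binom.sf(y - 1, z, 0.5)   # P(Y >= y | Z = z) under mu = lambda
print(f"conditional p-value given Z = {z}: {p_value:.4f}")
# Reject H at conditional level alpha when p_value <= alpha; by the S-ancillarity
# of Z this is also the UMP unbiased unconditional test described above.
```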

When Z is S-ancillary for \(\theta \) in the presence of a nuisance parameter \(\vartheta \), the unconditional power \(\beta (\theta ,\vartheta )\) of a test \(\varphi \) of \(H:\theta =\theta _0\) may depend on \(\vartheta \) as well as on \(\theta \). The conditional power \(\beta (\theta \mid z)=E_\theta [\varphi (X)\mid z]\) can then be viewed as an unbiased estimator of the (unknown) \(\beta (\theta ,\vartheta )\), as was discussed at the end of Section 4.4. On the other hand, if no nuisance parameters \(\vartheta \) are present and Z is ancillary for \(\theta \), the unconditional power \(\beta (\theta )=E_\theta \varphi (X)\) and the conditional power \(\beta (\theta \mid z)\) provide two alternative evaluations of the power of \(\varphi \) against \(\theta \), which refer to different sampling frameworks, and of which the latter of course becomes available only after the data have been obtained.

Surprisingly, the S-ancillarity of \(X+Y\) in Example 10.2.2 does not extend to the corresponding binomial problem.

Example 10.2.3

Let X and Y have independent binomial distributions \(b(p_1,m)\) and \(b(p_2,n)\) respectively. Then it was seen in Section 4.5 that the conditional distribution of Y given \(Z=X+Y=z\) depends only on the cross-product ratio \(\Delta =p_2q_1/p_1q_2\) (\(q_i=1-p_i\)). However, Z is not S-ancillary for \(\Delta \). To see this, note that S-ancillarity of Z implies the existence of a parameter \(\vartheta \) unrelated to \(\Delta \) and such that the distribution of Z depends only on \(\vartheta \). As \(\Delta \) changes, the family of distributions \(\{P_\vartheta ,\vartheta \in \Omega _2 \}\) of Z would remain unchanged. This is not the case, since Z is binomial when \(\Delta =1\) and not otherwise (Problem 10.15). Thus Z is not S-ancillary.

In this example, all unbiased tests of \(H:\Delta =\Delta _0\) have a conditional level given z that is independent of z, but conditioning on z cannot be justified by S-ancillarity. \(\blacksquare \)

Closely related to this example is the situation of the multinomial \(2\times 2\) table discussed from the point of view of unbiasedness in Section 4.6.

Example 10.2.4

In the notation of Section 4.6, let the four cell entries of a \(2\times 2\) table be X, \(X'\), Y, \(Y'\) with row totals \(X+X'=M\), \(Y+Y'=N\), and column totals \(X+Y=T\), \(X'+Y'=T'\), and with total sample size \(M+N=T+T'=s\). Here it is easy to check that \((M,N)\) is S-ancillary for \(\theta =(\theta _1,\theta _2)=(p_{AB}/p_B,p_{A{\tilde{B}}{}}/p_{{\tilde{B}}{}})\) with \(\vartheta =p_B\). Since the cross-product ratio \(\Delta \) can be expressed as a function of \((\theta _1,\theta _2)\), it may be appropriate to condition a test of \(H:\Delta =\Delta _0\) on \((M,N)\). Exactly analogously one finds that \((T,T')\) is S-ancillary for \(\theta '=(\theta _1',\theta _2')=(p_{AB}/p_A,p_{{\tilde{A}}{}B}/p_{{\tilde{A}}{}})\), and since \(\Delta \) is also a function of \((\theta _1',\theta _2')\), it may be equally appropriate to condition a test of H on \((T,T')\). One might hope that the set of all four marginals \((M,N,T,T')=Z\) would be S-ancillary for \(\Delta \). However, it is seen from the preceding example that this is not the case.

Here, all unbiased tests have a constant conditional level given z. However, S-ancillarity permits conditioning on only one set of margins (without giving any guidance as to which of the two to choose), not on both. \(\blacksquare \)

Despite such difficulties, the principle of carrying out tests and confidence estimation conditionally on ancillaries or S-ancillaries frequently provides an attractive alternative to the corresponding unconditional procedures, primarily because it is more appropriate for the situation at hand. However, insistence on such conditioning leads to another difficulty, which is illustrated by the following example.

Example 10.2.5

Consider N populations \(\Pi _i\), and suppose that an observation \(X_i\) from \(\Pi _i\) has a normal distribution \(N(\xi _i,1)\). The hypothesis to be tested is \(H:\xi _1=\cdots =\xi _N\). Unfortunately, N is so large that it is not practicable to take an observation from each of the populations; the total sample size is restricted to be \(n<N\). A sample \(\Pi _{J_1},\ldots ,\Pi _{J_n}\) of n of the N populations is therefore selected at random, with probability \(1/{N\atopwithdelims ()n}\) for each set of n, and an observation \(X_{j_i}\) is obtained from each of the populations \(\Pi _{j_i}\) in the sample.

Here the variables \(J_1,\ldots ,J_n\) are ancillary, and the requirement of conditioning on ancillaries would restrict any inference to the n populations from which observations are taken. Systematic adherence to this requirement would therefore make it impossible to test the original hypothesis H. Of course, rejection of the partial hypothesis \(H_{j_1,\ldots ,j_n}:\xi _{j_1}=\cdots =\xi _{j_n}\) would imply rejection of the original H. However, acceptance of \(H_{j_1,\ldots ,j_n}\) would permit no inference concerning H.

The requirement to condition in this case runs counter to the belief, which underlies much of statistical practice, that a sample may permit inferences concerning the whole set of populations from which it was drawn.

With an unconditional approach such an inference is provided by the test with rejection region

$$ \sum \left[ X_{j_i}-\left( \frac{1}{n}\sum ^n_{k=1} X_{j_k}\right) \right] ^2\ge c, $$

where c is the upper \(\alpha \)-percentage point of \(\chi ^2\) with \(n-1\) degrees of freedom. Not only does this test actually have unconditional level \(\alpha \), but its conditional level given \(J_1=j_1,\ldots ,J_n=j_n\) also equals \(\alpha \) for all \((j_1,\ldots ,j_n)\). There is in fact no difference in the present case between the conditional and the unconditional tests: they will accept or reject for the same sample points. However, as has been pointed out, there is a crucial difference between the conditional and unconditional interpretations of the results.
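As a small illustration of this test (the observations, the number of sampled populations, and the level below are assumed values), the statistic and critical value can be computed as follows.

```python
# Illustration with assumed data.
import numpy as np
from scipy.stats import chi2

alpha = 0.05
x = np.array([0.4, -1.2, 2.1, 0.3, 1.7])   # observations from the n sampled populations (assumed)
n = len(x)
stat = np.sum((x - x.mean()) ** 2)
c = chi2.isf(alpha, n - 1)                 # upper alpha-point of chi^2 with n - 1 df
print(f"statistic = {stat:.3f}, critical value = {c:.3f}, reject = {bool(stat >= c)}")
```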

If \(\beta _{j_1,\ldots ,j_n}(\xi _{j_1},\ldots ,\xi _{j_n})\) denotes the conditional power of this test given \(J_1=j_1,\ldots ,J_n=j_n\), its unconditional power is

$$ {\sum \beta _{j_1,\ldots ,j_n}(\xi _{j_1},\ldots ,\xi _{j_n})\over {N\atopwithdelims ()n}} $$

summed over all \(N\atopwithdelims ()n\) n-tuples \(j_1<\ldots <j_n\). As with any test, the conditional power given an ancillary (in the present case \(J_1,\ldots ,J_n\)) can be viewed as an unbiased estimate of the unconditional power. \(\blacksquare \)

3 Optimal Conditional Tests

Although conditional tests are often sensible and are beginning to be employed in practice [see for example Lawless (1972, 1973, 1978) and Kappenman (1975)], not much theory has been developed for the resulting conditional models. Since the conditional model tends to be simpler than the original unconditional one, the conditional point of view will frequently bring about a simplification of the theory. This possibility will be illustrated in the present section on some simple examples.

Example 10.3.1

Specializing the example discussed at the beginning of Section 10.1, suppose that a random variable is distributed according to \(N(\theta ,\sigma ^2_1)\) or \(N(\theta ,\sigma ^2_0)\) as \(I=1\) or 0, and that \(P(I=1)=P(I=0)=\frac{1}{2}\). Then the most powerful test of \(H:\theta =\theta _0\) against \(\theta =\theta _1(>\theta _0)\) based on (IX) rejects when

$$ {x-\frac{1}{2}(\theta _0+\theta _1)\over 2\sigma ^2_i}\ge k. $$

A UMP test against the alternatives \(\theta >\theta _0\) therefore does not exist. On the other hand, if H is tested conditionally given \(I=i\), a UMP conditional test exists and rejects when \(X>c_i\) where \(P(X>c_i\mid I=i)=\alpha \) for \(i=0\), 1. \(\blacksquare \)

The nonexistence of UMP unconditional tests found in this example is typical for mixtures with known probabilities of two or more families with monotone likelihood ratio, despite the existence of UMP conditional tests in these cases.
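The following sketch (with assumed values of \(\sigma _0\), \(\sigma _1\), \(\theta _0\), \(\theta _1\), and \(\alpha \)) computes the conditional critical values \(c_i\) and the two conditional powers of Example 10.3.1; the fact that the optimal cutoff depends on i is what precludes a UMP unconditional test.

```python
# Sketch of Example 10.3.1 with assumed parameter values.
from scipy.stats import norm

alpha, theta0, theta1 = 0.05, 0.0, 1.0
sigma = {0: 3.0, 1: 1.0}                       # assumed standard deviations for I = 0, 1
for i in (0, 1):
    c_i = theta0 + sigma[i] * norm.isf(alpha)  # P_{theta_0}(X > c_i | I = i) = alpha
    power_i = norm.sf((c_i - theta1) / sigma[i])
    print(f"I = {i}: c_{i} = {c_i:.3f}, conditional power = {power_i:.3f}")
# The conditional cutoff depends on i, which is why no single unconditional
# rejection region in x is most powerful for both values of i simultaneously.
```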

Example 10.3.2

Let \(X_1,\ldots ,X_n\) be a sample from a normal distribution \(N(\xi ,a^2\xi ^2)\), \(\xi >0\), with known coefficient of variation \(a>0\), and consider the problem of testing \(H:\xi =\xi _0\) against \(K:\xi >\xi _0\). Here \(T=(T_1,T_2)\) with \(T_1=\bar{X}{}\), \({T_2=\sqrt{(1/n)\sum X^2_i}}\) is sufficient, and \(Z=T_1/T_2\) is ancillary. If we let \({V=\sqrt{n}T_2/a}\), the conditional density of V given \(Z=z\) is equal to (Problem 10.18)

$$\begin{aligned} p_\xi (v\mid z)={k\over \xi ^n}v^{n-1}\exp \left\{ -\frac{1}{2}\left[ \frac{v}{\xi } - \frac{z\sqrt{n}}{a}\right] ^2\right\} . \end{aligned}$$
(10.16)

The density has monotone likelihood ratio, so that the rejection region \(V>C(z)\) constitutes a UMP conditional test.

Unconditionally, \(Y=\bar{X}{}\) and \(S^2=\sum (X_i-\bar{X}{})^2\) are independent with joint density

$$\begin{aligned} cs^{(n-3)/2}\exp \left( -\frac{n}{2a^2\xi ^2}(y-\xi )^2-\frac{1}{2a^2\xi ^2} s^2\right) , \end{aligned}$$
(10.17)

and a UMP test does not exist. [For further discussion of this example, see Hinkley (1977).] \(\blacksquare \)
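The conditional critical value \(C(z)\) of the test just described is easily obtained numerically from (10.16). The following sketch (with assumed values of n, a, \(\xi _0\), z, and \(\alpha \)) normalizes the density by quadrature and solves for the upper \(\alpha \)-point.

```python
# Sketch: numerical computation of C(z) from (10.16); all values are assumed.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

n, a, xi0, alpha = 10, 0.5, 1.0, 0.05
z = 0.9                                   # observed value of the ancillary Z = T1/T2 (assumed)

def kernel(v):
    # unnormalized form of (10.16) at xi = xi0, for v > 0
    return v ** (n - 1) * np.exp(-0.5 * (v / xi0 - z * np.sqrt(n) / a) ** 2)

const = quad(kernel, 0, 100)[0]           # normalizing constant
upper_tail = lambda c: quad(kernel, c, 100)[0] / const
C_z = brentq(lambda c: upper_tail(c) - alpha, 1e-6, 100.0)
print(f"conditional critical value C(z) = {C_z:.3f}")
# The conditional test rejects H: xi = xi_0 in favor of xi > xi_0 when V > C(z).
```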

An important class of examples is obtained from situations in which the model remains invariant under a group of transformations that is transitive over the parameter space, that is, when the given class of distributions constitutes a group family. The maximal invariant V then provides a natural ancillary on which to condition, and an optimal conditional test may exist even when such a test does not exist unconditionally. Perhaps the simplest class of examples of this kind is provided by location families under the conditions of the following lemma.

Lemma 10.3.1

Let \(X_1,\ldots ,X_n\) be independently distributed according to \(f(x_i-\theta )\), with f strongly unimodal. Then the family of conditional densities of \(Y_n=X_n\) given \(Y_i=X_i-X_n\) \((i=1,\ldots ,n-1)\) has monotone likelihood ratio.

Proof. The conditional density (10.11) is proportional to

$$\begin{aligned} f(y_n+y_1-\theta )\cdots f(y_n+y_{n-1}-\theta )f(y_n-\theta ). \end{aligned}$$
(10.18)

By taking logarithms and using the fact that each factor is strongly unimodal, it is seen that the product is also strongly unimodal, and the result follows from Example 8.2.1. \(\blacksquare \)

Lemma 10.3.1 shows that for strongly unimodal f there exists a UMP conditional test of \(H:\theta \le \theta _0\) against \(K:\theta >\theta _0\) which rejects when

$$\begin{aligned} X_n>c(X_1-X_n,\ldots ,X_{n-1}-X_n). \end{aligned}$$
(10.19)

Conditioning has reduced the model to a location family with sample size one. The double exponential and logistic distributions are both strongly unimodal (Section 9.2), and thus provide examples of UMP conditional tests. In neither case does there exist a UMP unconditional test unless \(n=1\).
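The critical value in (10.19) is a one-dimensional quadrature away. The following sketch carries out the computation for the double-exponential density with a hypothetical sample (the data, level, and \(\theta _0\) are assumed values).

```python
# Sketch: conditional critical value for a double-exponential location family;
# the sample, theta_0, and alpha are assumed values.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

alpha, theta0 = 0.05, 0.0
x = np.array([0.3, -0.8, 1.4, 0.2, -0.1])   # hypothetical sample
y = x[:-1] - x[-1]                           # ancillary differences y_1, ..., y_{n-1}

def f(u):
    return 0.5 * np.exp(-np.abs(u))          # double-exponential (Laplace) density

def kernel(yn):
    # numerator of (10.11) at theta = theta_0
    return np.prod(f(np.append(y + yn, yn) - theta0))

const = quad(kernel, -50, 50)[0]             # denominator of (10.11)
upper = lambda c: quad(kernel, c, 50)[0] / const
c_cond = brentq(lambda c: upper(c) - alpha, -20, 20)
print(f"c(y_1, ..., y_{{n-1}}) = {c_cond:.3f}; reject when x_n = {x[-1]} exceeds it")
```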

As a last class of examples, we shall consider a situation with a nuisance parameter. Let \(X_1,\ldots ,X_m\) and \(Y_1,\ldots ,Y_n\) be independent samples from location families with densities \(f(x_1-\xi ,\ldots ,x_m-\xi )\) and \(g(y_1-\eta ,\ldots ,y_n-\eta )\) respectively, and consider the problem of testing \(H:\eta \le \xi \) against \(K:\eta >\xi \). Here the differences \(U_i = X_i-X_m\) and \(V_j=Y_j-Y_n\) are ancillary. The conditional density of \(X=X_m\) and \(Y=Y_n\) given the u’s and v’s is seen from (10.18) to be of the form

$$\begin{aligned} f^*_u(x-\xi )g_v^*(y-\eta ), \end{aligned}$$
(10.20)

where the subscripts u and v indicate that \(f^*\) and \(g^*\) depend on the u’s and v’s respectively. The problem of testing H in the conditional model remains invariant under the transformations: \(x'=x+c\), \(y'=y+c\), for which \(Y-X\) is maximal invariant. A UMP invariant conditional test will then exist provided the distribution of \(Z=Y-X\), which depends only on \(\Delta =\eta -\xi \), has monotone likelihood ratio. The following lemma shows that a sufficient condition for this to be the case is that \(f^*_u\) and \(g_v^*\) have monotone likelihood ratio in x and y respectively.

Lemma 10.3.2

Let X, Y be independently distributed with densities \(f^*(x-\xi )\), \({g^*(y-\eta )}\) respectively. If \(f^*\) and \(g^*\) have monotone likelihood ratio with respect to \(\xi \) and \(\eta \), then the family of densities of \(Z=Y-X\) has monotone likelihood ratio with respect to \(\Delta =\eta -\xi \).

Proof. The density of Z is

$$\begin{aligned} h_\Delta (z)=\int g^*(y-\Delta )f^*(y-z)\,dy. \end{aligned}$$
(10.21)

To see that \(h_\Delta (z)\) has monotone likelihood ratio, one must show that for any \(\Delta <\Delta '\), \(h_{\Delta '}(z)/h_{\Delta }(z)\) is an increasing function of z. For this purpose, write

$$ {h_{\Delta '}(z)\over h_\Delta (z)} =\int {g^*(y-\Delta ')\over g^*(y-\Delta )} \cdot {g^*(y-\Delta )f^*(y-z)\over \int g^*(u-\Delta )f^*(u-z)\,du} \,dy. $$

The second factor is a probability density for Y,

$$\begin{aligned} p_z(y)=C_zg^*(y-\Delta )f^*(y-z), \end{aligned}$$
(10.22)

which has monotone likelihood ratio in the parameter z by the assumption made about \(f^*\). The ratio

$$\begin{aligned} {h_{\Delta '}(z)\over h_\Delta (z)} = \int {g^*(y-\Delta ')\over g^*(y-\Delta )} p_z(y)\,dy \end{aligned}$$
(10.23)

is the expectation of \(g^*(Y-\Delta ')/g^*(Y-\Delta )\) under the distribution \(p_z(y)\). By the assumption about \(g^*\), \(g^*(y-\Delta ')/g^*(y-\Delta )\) is an increasing function of y, and it follows from Lemma 3.4.2 that its expectation is an increasing function of z. \(\blacksquare \)

It follows from (10.18) that \(f^*_u(x-\xi )\) and \(g^*_v(y-\eta )\) have monotone likelihood ratio provided this condition holds for \(f(x-\xi )\) and \(g(y-\eta )\), i.e., provided f and g are strongly unimodal. Under this assumption, the conditional distribution \(h_\Delta (z)\) then has monotone likelihood ratio by Lemma 10.3.2, and a UMP conditional test exists and rejects for large values of Z. (This result also follows from Problem 8.12.)
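The conclusion of Lemma 10.3.2 can be checked numerically in particular cases. The sketch below takes logistic densities for \(f^*\) and \(g^*\) (an arbitrary strongly unimodal choice) and arbitrary \(\Delta <\Delta '\), evaluates (10.21) by quadrature, and verifies that \(h_{\Delta '}(z)/h_\Delta (z)\) increases in z.

```python
# Numerical check of Lemma 10.3.2; the logistic densities and the values of
# Delta, Delta', and z are arbitrary illustrative choices.
import numpy as np
from scipy.integrate import quad

fstar = lambda u: np.exp(-u) / (1 + np.exp(-u)) ** 2   # logistic density
gstar = fstar

def h(delta, z):
    # density (10.21) of Z = Y - X, evaluated by quadrature
    return quad(lambda y: gstar(y - delta) * fstar(y - z), -50, 50)[0]

d0, d1 = 0.0, 1.0
zs = np.linspace(-4, 4, 9)
ratios = [h(d1, z) / h(d0, z) for z in zs]
print(np.all(np.diff(ratios) > 0))                     # True: the ratio increases in z
```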

The difference between conditional tests of the kind considered in this section and the corresponding (e.g., locally most powerful) unconditional tests typically disappears as the sample size(s) tend(s) to infinity. Some results in this direction are given by Liang (1984); see also Barndorff-Nielsen (1983).

The following multivariate example provides one more illustration of a UMP conditional test when unconditionally no UMP test exists. The results will only be sketched. The details of this and related problems can be found in the original literature reviewed by Marden and Perlman (1980) and Marden (1983).

Example 10.3.3

Suppose you observe \(m+1\) independent normal vectors of dimension \(p=p_1+p_2\),

$$ Y=(Y_1,\,Y_2),\qquad Z_1,\ldots ,Z_m, $$

with common covariance matrix \(\Sigma \) and expectations

$$ E(Y_1)=\eta _1,\qquad E(Y_2)=E(Z_1)=\cdots =E(Z_m)=0. $$

(The normal multivariate two-sample problem with covariates can be reduced to this canonical form.) The hypothesis being tested is \(H:\eta _1=0\). Without the restriction \(E(Y_2)=0\), the model would remain invariant under the group G of transformations: \(Y^*=YB\), \(Z^*=ZB\), where B is any nonsingular \(p\times p\) matrix. However, the stated problem remains invariant only under the subgroup \(G'\) in which B is of the form (Problem 10.22(i))

$$ B=\left( \begin{array}{cc} B_{11} &{} 0\\ B_{21} &{} B_{22}\end{array}\right) , $$

where \(B_{11}\) is of order \(p_1\times p_1\) and \(B_{22}\) of order \(p_2\times p_2\).

If

$$ Z'Z=S=\left( \begin{array}{cc} S_{11} &{} S_{12}\\ S_{21} &{} S_{22}\end{array}\right) \quad \Sigma =\left( \begin{array}{cc} \Sigma _{11} &{} \Sigma _{12}\\ \Sigma _{21} &{} \Sigma _{22}\end{array}\right) , $$

the maximal invariants under \(G'\) are the two statistics \(D=Y_2 S^{-1}_{22}Y_2'\) and

$$ N={(Y_1-S_{12}S^{-1}_{22}Y_2)(S_{11}-S_{12}S^{-1}_{22}S_{21})^{-1}(Y_1-S_{12}S^{-1}_{22}Y_2)'\over 1+D}, $$

and the joint distribution of \((N,D)\) depends only on the maximal invariant under \(G'\),

$$ \Delta =\eta _1(\Sigma _{11}-\Sigma _{12}\Sigma _{22}^{-1}\Sigma _{21})^{-1}\eta _1'. $$

The statistic D is ancillary (Problem 10.22(ii)), and the conditional distribution of N given \(D=d\) is that of the ratio of two independent \(\chi ^2\)-variables: the numerator noncentral \(\chi ^2\) with \(p_1\) degrees of freedom and noncentrality parameter \(\Delta /(1+d)\), and the denominator central \(\chi ^2\) with \({m+1-p}\) degrees of freedom. It follows from Section 7.1 that the conditional density has monotone likelihood ratio. A conditionally UMP invariant test therefore exists, and rejects H when \({(m+1-p)N/p_1>C}\), where C is the critical value of the F-distribution with \(p_1\) and \({m+1-p}\) degrees of freedom. On the other hand, a UMP invariant (unconditional) test does not exist; comparisons of the optimal conditional test with various competitors are provided by Marden and Perlman (1980). \(\blacksquare \)
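The conditional test and its conditional power are readily evaluated from standard F-distributions. The following sketch (the dimensions, the parameter \(\Delta \), and the values of d are assumptions chosen for illustration) shows the computation and how the conditional power varies with the ancillary D.

```python
# Sketch of the conditionally UMP invariant test of Example 10.3.3; all
# numerical values below are assumed for illustration.
from scipy.stats import f, ncf

p1, p2, m, alpha = 2, 3, 20, 0.05
p = p1 + p2
C = f.isf(alpha, p1, m + 1 - p)                 # critical value of the conditional F test
Delta = 8.0                                     # assumed value of the invariant parameter
for d in (0.1, 1.0, 5.0):
    cond_power = ncf.sf(C, p1, m + 1 - p, Delta / (1 + d))
    print(f"d = {d}: conditional power = {cond_power:.3f}")
# Larger observed values of the ancillary D correspond to smaller noncentrality
# Delta/(1 + d), so the conditional power decreases accordingly.
```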

4 Relevant Subsets

The conditioning variables considered so far have been ancillary statistics, i.e., random variables whose distribution is fixed, independent of the parameters governing the distribution of X, or at least of the parameter of interest. We shall now examine briefly some implications of conditioning without this constraint. Throughout most of the section we shall be concerned with the simple case in which the conditioning variable is the indicator of some subset C of the sample space, so that there are only two conditioning events \(I=1\) (i.e., \(X\in C\)) and \(I=0\) (i.e., \(X\in C^c\), the complement of C). The mixture problem at the beginning of Section 10.1, with \({\mathcal {X}}_1=C\) and \({\mathcal {X}}_0=C^c\), is of this type.

Suppose X is distributed with density \(p_\theta \), and R is a level-\(\alpha \) rejection region for testing the simple hypothesis \(H:\theta =\theta _0\) against some class of alternatives. For any subset C of the sample space, consider the conditional rejection probabilities

$$\begin{aligned} \alpha _C=P_{\theta _0}(X\in R\mid C)\quad \alpha _{C^c}=P_{\theta _0}(X\in R\mid C^c), \end{aligned}$$
(10.24)

and suppose that \(\alpha _C>\alpha \) and \(\alpha _{C^c}<\alpha \). Then we are in the difficulty described in Section 10.1. Before X was observed, the probability of falsely rejecting H was stated to be \(\alpha \). Now that X is known to have fallen into C (or \(C^c\)), should the original statement be adjusted and the higher value \(\alpha _C\) (or lower value \(\alpha _{C^c}\)) be quoted? An extreme case of this possibility occurs when C is a subset of R or \(R^c\), since then \(P(X\in R\mid X\in C)=1\) or 0.

It is clearly always possible to choose C so that the conditional level \(\alpha _C\) exceeds the stated \(\alpha \). It is not so clear whether the corresponding possibility always exists for the levels of a family of confidence sets for \(\theta \), since the inequality must now hold for all \(\theta \).

Definition 10.4.1

A subset C of the sample space is said to be a negatively biased relevant subset for a family of confidence sets S(X) with unconditional confidence level \(\gamma =1-\alpha \) if for some \(\epsilon >0\)

$$\begin{aligned} \gamma _C(\theta )=P_\theta [\theta \in S(X)\mid X\in C]\le \gamma -\epsilon \quad \text {for all}\; \theta , \end{aligned}$$
(10.25)

and a positively biased relevant subset if

$$\begin{aligned} P_\theta [\theta \in S(X)\mid X\in C]\ge \gamma +\epsilon \quad \text {for all} \; \theta . \end{aligned}$$
(10.26)

The set C is semirelevant, negatively or positively biased, if respectively

$$\begin{aligned} P_\theta [\theta \in S(X)\mid X\in C]\le \gamma \quad \text {for all} \;\theta \end{aligned}$$
(10.27)

or

$$\begin{aligned} P_\theta [\theta \in S(X)\mid X\in C]\ge \gamma \quad \text {for all}\; \theta , \end{aligned}$$
(10.28)

with strict inequality holding for at least some \(\theta \).

Obvious examples of relevant subsets are provided by the subsets \({\mathcal {X}}_0\) and \({\mathcal {X}}_1\) of the two-experiment example of Section 10.1.

Relevant subsets do not always exist. The following four examples illustrate the various possibilities.

Example 10.4.1

Let X be distributed as \(N(\theta ,1)\), and consider the standard confidence intervals for \(\theta \):

$$ S(X)=\{\theta :X-c<\theta <X+c\}, $$

where \(\Phi (c)-\Phi (-c)=\gamma \). In this case, there exists not even a semirelevant subset.

To see this, suppose first that a positively biased semirelevant subset C exists, so that

$$ A(\theta )=P_\theta [X-c<\theta <X+c\hbox { and }X\in C]-\gamma P_\theta [X\in C]\ge 0 $$

for all \(\theta \), with strict inequality for some \(\theta _0\). Consider a prior normal density \(\lambda (\theta )\) for \(\theta \) with mean 0 and variance \(\tau ^2\), and let

$$ \beta (x)=P[x-c<\Theta <x+c\mid x], $$

where \(\Theta \) has density \(\lambda (\theta )\). The posterior distribution of \(\Theta \) given x is then normal with mean \(\tau ^2x/(1+\tau ^2)\) and variance \(\tau ^2/(1+\tau ^2)\) (Problem 10.24(i)), and it follows that

$$\begin{aligned} \beta (x)= & {} \Phi \left[ {x\over \tau \sqrt{1+\tau ^2}}+{c\sqrt{1+\tau ^2}\over \tau }\right] -\Phi \left[ {x\over \tau \sqrt{1+\tau ^2}}-{c\sqrt{1+\tau ^2}\over \tau }\right] \\\le & {} \Phi \left[ {c\sqrt{1+\tau ^2}\over \tau }\right] -\Phi \left[ {-c\sqrt{1+\tau ^2}\over \tau }\right] \le \gamma +{c\over \sqrt{2\pi }\tau ^2}. \end{aligned}$$

Next let \(h(\theta )=\sqrt{2\pi }\tau \lambda (\theta )=e^{-\theta ^2/2\tau ^2}\) and

$$\begin{aligned} D=\int h(\theta )A(\theta )\,d\theta \le \sqrt{2\pi }\tau \int \lambda (\theta )\{P_\theta [X-c<\theta <X+c\hbox { and }X\in C]\qquad \\ -E_\theta [\beta (X)I_C(X)]\}\,d\theta +{c\over \tau }. \end{aligned}$$

The integral on the right side is the difference of two integrals each of which equals \(P[X-c<\Theta <X+c\) and \(X\in C]\), and is therefore 0, so that \(D\le c/\tau \).

Consider now a sequence of normal priors \(\lambda _m(\theta )\) with variances \(\tau ^2_m\rightarrow \infty \), and the corresponding sequences \(h_m(\theta )\) and \(D_m\). Then \(0\le D_m\le c/\tau _m\) and hence \(D_m\rightarrow 0\). On the other hand, \(D_m\) is of the form \(D_m=\int ^\infty _{-\infty }A(\theta )h_m(\theta )\,d\theta \), where \(A(\theta )\) is continuous, nonnegative, and\({}>0\) for some \(\theta _0\). There exists \(\delta >0\) such that \(A(\theta )\ge \frac{1}{2}A(\theta _0)\) for \(|\theta -\theta _0|<\delta \) and hence

$$ D_m\ge \int ^{\theta _0+\delta }_{\theta _0-\delta }\frac{1}{2}A(\theta _0)h_m(\theta )\,d\theta \rightarrow \delta A(\theta _0)>0\quad \text {as}\quad m\rightarrow \infty . $$

This provides the desired contradiction. \(\blacksquare \)

That also no negatively biased semirelevant subsets exist is a consequence of the following result.

Theorem 10.4.1

Let S(x) be a family of confidence sets for \(\theta \) such that \(P_\theta [\theta \in S(X)]=\gamma \) for all \(\theta \), and suppose that \(0<P_\theta (C)<1\) for all \(\theta \).

(i) If C is semirelevant, then its complement \(C^c\) is semirelevant with opposite bias.

(ii) If there exists a constant a such that

$$ 1>P_\theta (C)>a>0\quad \text {for all}\; \theta $$

and C is relevant, then \(C^c\) is relevant with opposite bias.

Proof. The result is an immediate consequence of the identity

$$ \gamma =P_\theta (C)\,\gamma _C(\theta )+P_\theta (C^c)\,\gamma _{C^c}(\theta )\qquad \text {for all}\;\theta , $$

where \(\gamma _{C^c}(\theta )=P_\theta [\theta \in S(X)\mid X\in C^c]\). \(\blacksquare \)

The next example illustrates the situation in which a semirelevant subset exists but no relevant one.

Example 10.4.2

Let X be \(N(\theta ,1)\), and consider the uniformly most accurate lower confidence bounds \(\underline{\theta }=X-c\) for \(\theta \), where \(\Phi (c)=\gamma \). Here S(X) is the interval \([X-c,\infty )\) and it seems plausible that the conditional probability of \(\theta \in S(X)\) will be lowered for a set C of the form \(X\ge k\). In fact

$$\begin{aligned} P_\theta (X-c\le \theta \mid X\ge k)=\left\{ \begin{array}{ll} {\Phi (c)-\Phi (k-\theta )\over 1-\Phi (k-\theta )} &{} \quad \theta >k-c,\\ 0 &{} \quad \theta <k-c.\end{array}\right. \end{aligned}$$
(10.29)

The probability (10.29) is always less than \(\gamma \), and tends to \(\gamma \) as \(\theta \rightarrow \infty \). The set \(X\ge k\) is therefore a negatively biased semirelevant subset for the confidence sets S(X).
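A short calculation (with assumed values of \(\gamma \) and k) makes this behavior explicit.

```python
# Conditional coverage (10.29) for assumed gamma and k; all theta below exceed k - c.
from scipy.stats import norm

gamma, k = 0.9, 1.0
c = norm.ppf(gamma)
for theta in [0.0, 1.0, 2.0, 4.0, 8.0]:
    cond = (norm.cdf(c) - norm.cdf(k - theta)) / (1 - norm.cdf(k - theta))
    print(f"theta = {theta}: conditional coverage = {cond:.4f}")
# Every value is below gamma = 0.9, and the coverage tends to gamma as theta grows.
```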

We shall now show that no relevant subset C with \(P_\theta (C)>0\) exists in this case. It is enough to prove the result for negatively biased sets; the proof for positive bias is exactly analogous. Let A be the set of x-values \(-\infty<x<c+\theta \), and suppose that C is negatively biased and relevant, so that

$$ P_\theta [X\in A\mid C]\le \gamma -\epsilon \quad \text {for all}\;\theta . $$

If

$$ a(\theta )=P_\theta (X\in C),\qquad b(\theta )=P_\theta (X\in A\cap C), $$

then

$$\begin{aligned} b(\theta )\le (\gamma -\epsilon )\,a(\theta )\quad \text {for all}\;\theta . \end{aligned}$$
(10.30)

The result is proved by comparing the integrated coverage probabilities

$$ A(R)=\int ^R_{-R}a(\theta )\,d\theta ,\qquad B(R)=\int ^R_{-R}b(\theta )\,d\theta $$

with the Lebesgue measure of the intersection \(C\cap (-R,R)\),

$$ \mu (R)=\int ^R_{-R} I_C(x)\,dx, $$

where \(I_C(x)\) is the indicator of C, and showing that

$$\begin{aligned} {A(R)\over \mu (R)}\rightarrow 1,\quad {B(R)\over \mu (R)}\rightarrow \gamma \quad \text {as}\quad R\rightarrow \infty . \end{aligned}$$
(10.31)

This contradicts the fact that by (10.30),

$$ B(R)\le (\gamma -\epsilon )A(R)\quad \text {for all}\; R, $$

and so proves the desired result.

To prove (10.31), suppose first that \(\mu (\infty )<\infty \). Then if \(\phi \) is the standard normal density

$$ A(\infty ) = \int ^\infty _{-\infty }\,d\theta \int _C\phi (x-\theta )\,dx = \int _C\,dx = \mu (\infty ), $$

and analogously \(B(\infty )=\gamma \mu (\infty )\), which establishes (10.31).

When \(\mu (\infty )=\infty \), (10.31) will be proved by showing that

$$\begin{aligned} A(R)=\mu (R)+K_1(R),\qquad B(R)=\gamma \mu (R)+K_2(R), \end{aligned}$$
(10.32)

where \(K_1(R)\) and \(K_2(R)\) are bounded. To see (10.32), note that

$$\begin{aligned} \mu (R)=\int ^R_{-R}I_C(x)\,dx= & {} \int ^R_{-R}I_C(x)\left[ \int ^\infty _{-\infty }\phi (x-\theta )\,d\theta \right] \,dx\\= & {} \int ^\infty _{-\infty }\left[ \int ^R_{-R}I_C(x)\phi (x-\theta )\,dx\right] \,d\theta , \end{aligned}$$

while

$$\begin{aligned} A(R)=\int ^R_{-R}\left[ \int ^\infty _{-\infty }I_C(x)\phi (x-\theta )\,dx\right] \,d\theta . \end{aligned}$$
(10.33)

A comparison of each of these double integrals with that over the region \(-R<x<R\), \(-R<\theta <R\), shows that the difference \(A(R)-\mu (R)\) is made up of four integrals, each of which can be seen to be bounded by using the fact that \(\int |t|\phi (t)\,dt<\infty \) (Problem 10.24(ii)). This completes the proof. \(\blacksquare \)

Example 10.4.3

Let \(X_1,\ldots ,X_n\) be independently normally distributed as \(N(\xi ,\sigma ^2)\), and consider the uniformly most accurate equivariant (and unbiased) confidence intervals for \(\xi \) given by (5.36).

It was shown by Buehler and Feddersen (1963) and Brown (1967) that in this case there exist positively biased relevant subsets of the form

$$\begin{aligned} C:{|\bar{X}{}|\over S}\le k. \end{aligned}$$
(10.34)

In particular, for confidence level \(\gamma =.5\) and \(n=2\), Brown shows that with \(C:|\bar{X}{}|/|X_2-X_1|\le \frac{1}{2}(1+\sqrt{2})\), the conditional level is\({}>\frac{2}{3}\) for all values of \(\xi \) and \(\sigma \). Goutis and Casella (1992) provide detailed values for general n.
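A Monte Carlo sketch (not a substitute for Brown's proof; the parameter values and simulation size are arbitrary) illustrates the conditional coverage for \(n=2\) and \(\gamma =.5\). In this case the interval (5.36) reduces to \([\min (X_1,X_2),\,\max (X_1,X_2)]\), since the relevant t-quantile equals 1.

```python
# Monte Carlo illustration for n = 2, gamma = 0.5; parameter values and the
# simulation size are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
k = (1 + np.sqrt(2)) / 2
for xi, sigma in [(0.0, 1.0), (1.0, 1.0), (3.0, 2.0)]:
    x = rng.normal(xi, sigma, size=(500_000, 2))
    xbar = x.mean(axis=1)
    spread = np.abs(x[:, 1] - x[:, 0])
    in_C = np.abs(xbar) / spread <= k          # the conditioning set (10.34)
    cover = np.abs(xbar - xi) <= spread / 2    # xi lies between the two observations
    print(f"xi = {xi}, sigma = {sigma}: P(C) ~ {in_C.mean():.3f}, "
          f"conditional coverage ~ {cover[in_C].mean():.3f}")
# The unconditional coverage is 0.5; the conditional coverage given C is markedly
# higher, in line with the result quoted above.
```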

It follows from Theorem 10.4.1 that \(C^c\) is negatively biased semirelevant, and Buehler (1959) shows that any set \(C^*:S\le k\) has the same property. These results are intuitively plausible, since the length of the confidence intervals is proportional to S, and one would expect short intervals to cover the true value less often than long ones.

Theorem 10.4.1 does not show that \(C^c\) is negatively biased relevant, since the probability of the set (10.34) tends to zero as \(\xi /\sigma \rightarrow \infty \). It was in fact proved by Robinson (1976) that no negatively biased relevant subset exists in this case.

The calculations for \(C^c\) throw some light on the common practice of stating confidence intervals for \(\xi \) only when a preliminary test of \(H:\xi =0\) rejects the hypothesis. For a discussion of this practice see Olshen (1973), and Meeks and D’Agostino (1983). \(\blacksquare \)

The only type of example still missing is that of a negatively biased relevant subset. It was pointed out by Fisher (1956a, 1956b, 1959, 1973) that the Welch–Aspin solution of the Behrens–Fisher problem (discussed in Sections 6.6 and 13.2) provides an illustration of this possibility. The following are much simpler examples of both negatively and positively biased relevant subsets.

Example 10.4.4

An extreme form of both positively and negatively biased subsets was encountered in Section 7.7, where lower and upper confidence bounds \(\underline{\Delta }<\Delta \) and \(\Delta <\bar{\Delta }\) were obtained in (7.42) and (7.43) for the ratio \(\Delta =\sigma ^2_A/\sigma ^2\) in a model II one-way classification. Since

$$ P(\underline{\Delta }\le \Delta \mid \underline{\Delta }<0)=1\quad P(\Delta \le \bar{\Delta }\mid \bar{\Delta }<0)=0, $$

the sets \(C_1:\underline{\Delta }<0\) and \(C_2:\bar{\Delta }<0\) are relevant subsets with positive and negative bias respectively. \(\blacksquare \)

The existence of conditioning sets C for which the conditional coverage probability of level-\(\gamma \) confidence sets is 0 or 1, such as in Example 10.4.4 or Problems 10.27–10.28, is an embarrassment to confidence theory, but fortunately such sets are rare. The significance of more general relevant subsets is less clear, particularly when a number of such subsets are available. Especially awkward in this connection is the possibility [discussed by Buehler (1959)] of the existence of two relevant subsets C and \(C'\) with nonempty intersection and opposite bias.

If a conditional confidence level is to be cited for some relevant subset C, it seems appropriate to take account also of the possibility that X may fall into \(C^c\) and to state in advance the three confidence coefficients \(\gamma \), \(\gamma _C\), and \(\gamma _{C^c}\). The (unknown) probabilities \(P_\theta (C)\) and \(P_\theta (C^c)\) should also be considered. These points have been stressed by Kiefer, who has also suggested the extension to a partition of the sample space into more than two sets. For an account of these ideas, see Kiefer (1977a, 1977b), Brownie and Kiefer (1977), and Brown (1978).

Kiefer’s theory does not consider the choice of conditioning set or statistic. The same question arose in Section 10.2 with respect to conditioning on ancillaries. The problem is similar to that of the choice of model. The answer depends on the context and purpose of the analysis, and must be determined from case to case.

5 Problems

Section 10.1

Problem 10.1

Let the experiments \({\mathcal {E}}\) and \({\mathcal {F}}\) consist in observing \(X:N(\xi ,\sigma ^2_0)\) and \(X:N(\xi ,\sigma ^2_1)\) respectively \((\sigma _0<\sigma _1)\), and let one of the two experiments be performed, with \(P({\mathcal {E}})=P({\mathcal {F}})=\frac{1}{2}\). For testing \(H:\xi =0\) against \(\xi =\xi _1\), determine values \(\sigma _0\), \(\sigma _1\), \(\xi _1\), and \(\alpha \) such that

$$ \mathrm{(i)}\quad \alpha _0<\alpha _1;\qquad \mathrm{(ii)}\quad \alpha _0>\alpha _1, $$

where the \(\alpha _i\) are defined by (10.9).

Problem 10.2

Under the assumptions of Problem 10.1, determine the most accurate invariant (under the transformation \(X'=-X\)) confidence sets S(X) with

$$ P(\xi \in S(X)\mid {\mathcal {E}})+P(\xi \in S(X)\mid {\mathcal {F}})=2\gamma . $$

Find examples in which the conditional confidence coefficients \(\gamma _0\) given \({\mathcal {E}}\) and \(\gamma _1\) given \({\mathcal {F}}\) satisfy

$$ \mathrm{(i)}\quad \gamma _0<\gamma _1;\qquad \mathrm{(ii)}\quad \gamma _0> \gamma _1. $$

Problem 10.3

The test given by (10.3), (10.8), and (10.9) is most powerful under the stated assumptions.

Problem 10.4

Let \(X_1,\ldots ,X_n\) be independently distributed, each with probability p or q as \(N(\xi ,\sigma ^2_0)\) or \(N(\xi ,\sigma ^2_1)\).

  1. (i)

    If p is unknown, determine the UMP unbiased test of \(H:\xi =0\) against \(K:\xi >0\).

  2. (ii)

    Determine the most powerful test of H against the alternative \(\xi _1\) when it is known that \(p=\frac{1}{2}\), and show that a UMP unbiased test does not exist in this case.

  3. (iii)

    Let \(\alpha _k\) (\(k=0,\ldots ,n\)) be the conditional level of the unconditional most powerful test of part (ii) given that k of the X’s came from \(N(\xi ,\sigma ^2_0)\) and \(n-k\) from \(N(\xi ,\sigma ^2_1)\). Investigate the possible values \(\alpha _0,\alpha _1,\ldots ,\alpha _n\).

Problem 10.5

With known probabilities p and q perform either \({\mathcal {E}}\) or \({\mathcal {F}}\), with X distributed as \(N(\theta ,1)\) under \({\mathcal {E}}\) or \(N(-\theta ,1)\) under \({\mathcal {F}}\). For testing \(H:\theta =0\) against \(\theta >0\) there exist a UMP unconditional and a UMP conditional level-\(\alpha \) test. These coincide and do not depend on the value of p.

Problem 10.6

In the preceding problem, suppose that the densities of X under \({\mathcal {E}}\) and \({\mathcal {F}}\) are \(\theta e^{-\theta x}\) and \((1/\theta )e^{-x/\theta }\) respectively. Compare the UMP conditional and unconditional tests of \(H:\theta =1\) against \(K:\theta >1\).

Section 10.2

Problem 10.7

Let X, Y be independently normally distributed as \(N(\theta ,1)\), and let \(V=Y-X\) and

$$ W=\left\{ \begin{array}{lll} Y-X &{} \hbox {if} &{} X+Y>0,\\ X-Y &{} \hbox {if} &{} X+Y\le 0.\end{array}\right. $$
  1. (i)

    Both V and W are ancillary, but neither is a function of the other.

  2. (ii)

    \((V,W)\) is not ancillary. [Basu (1959).]

Problem 10.8

An experiment with n observations \(X_1,\ldots ,X_n\) is planned, with each \(X_i\) distributed as \(N(\theta ,1)\). However, some of the observations do not materialize (for example, some of the subjects die, move away, or turn out to be unsuitable). Let \({I_j=1}\) or 0 as \(X_j\) is observed or not, and suppose the \(I_j\) are independent of the X’s and of each other and that \(P(I_j=1)=p\) for all j.

  1. (i)

    If p is known, the effective sample size \(M=\sum I_j\) is ancillary.

  2. (ii)

    If p is unknown, there exists a UMP unbiased level-\(\alpha \) test of \(H:\theta \le 0\) versus \(K:\theta >0\). Its conditional level (given \(M=m\)) is \(\alpha _m=\alpha \) for all \(m=0,\ldots ,n\).

Problem 10.9

Consider n tosses with a biased die, for which the probabilities of \(1,\ldots ,6\) points are given by

$$ \begin{array}{cccccc} 1 &{} 2 &{} 3 &{} 4 &{} 5 &{} 6\\ \hline {} ~~{1-\theta \over 12}~~ &{} ~~{2-\theta \over 12}~~ &{} ~~{3-\theta \over 12}~~ &{} ~~{1+\theta \over 12}~~ &{} ~~{2+\theta \over 12}~~ &{} ~~{3+\theta \over 12}~~ \end{array} $$

and let \(X_i\) be the number of tosses showing i points.

  1. (i)

    Show that the triple \(Z_1=X_1+X_5\), \(Z_2=X_2+X_4\), \(Z_3=X_3+X_6\) is a maximal ancillary; determine its distribution and the distribution of \(X_1,\ldots ,X_6\) given \(Z_1=z_1\), \(Z_2=z_2\), \(Z_3=z_3\).

  2. (ii)

    Exhibit five other maximal ancillaries. [Basu (1964).]

Problem 10.10

In the preceding problem, suppose the probabilities are given by

$$ \begin{array}{cccccc} 1 &{} 2 &{} 3 &{} 4 &{} 5 &{} 6\\ \hline {} {1-\theta \over 6}&{} {1-2\theta \over 6}&{} {1-3\theta \over 6}&{} {1+\theta \over 6}&{} {1+2\theta \over 6}&{} {1+3\theta \over 6}\end{array} $$

Exhibit two different maximal ancillaries.

Problem 10.11

Let X be uniformly distributed on \((\theta ,\theta +1)\), \(0<\theta <\infty \), let [X] denote the largest integer\({}\le X\), and let \(V=X-[X]\).

  1. (i)

    The statistic V(X) is uniformly distributed on (0, 1) and is therefore ancillary.

  2. (ii)

    The marginal distribution of [X] is given by

    $$ [X]=\left\{ \begin{array}{ll} [\theta ] &{} \hbox {with probability } 1-V(\theta ),\\ {[\theta ]+1} &{} \hbox {with probability }V(\theta ).\end{array}\right. $$
  3. (iii)

    Conditionally, given that \(V=v\), [X] assigns probability 1 to the value \([\theta ]\) if \(V(\theta )\le v\) and to the value \([\theta ]+1\) if \(V(\theta )>v\). [Basu (1964).]
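
A quick numerical check of part (ii) (illustrative only, with \(\theta =2.3\) chosen arbitrarily): here \([\theta ]=2\) and \(V(\theta )=0.3\), so [X] should equal 2 with probability about 0.7 and 3 with probability about 0.3.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n_sim = 2.3, 200_000

X = rng.uniform(theta, theta + 1, n_sim)
floorX = np.floor(X)
# part (ii): [X] = [theta] with probability 1 - V(theta), [theta] + 1 with probability V(theta)
print("P([X] = 2) ≈", round((floorX == 2).mean(), 3))   # expect about 0.7
print("P([X] = 3) ≈", round((floorX == 3).mean(), 3))   # expect about 0.3
```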

Problem 10.12

Let X, Y have joint density

$$ p(x,y)=2f(x)f(y)F(\theta xy), $$

where f is a known probability density symmetric about 0, and F its cumulative distribution function. Then

  1. (i)

    p(x, y) is a probability density.

  2. (ii)

    X and Y each have marginal density f and are therefore ancillary, but (X, Y) is not.

  3. (iii)

    \(X\cdot Y\) is a sufficient statistic for \(\theta \). [Dawid (1977).]
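
For a concrete feel for part (ii), here is a small sampler of my own for the special case where f is the standard normal density and F its cdf: draw X from f, then draw Y from its conditional density \(2f(y)F(\theta xy)\) by acceptance–rejection (propose \(y\sim f\), accept with probability \(F(\theta xy)\)). The marginal of Y should look standard normal for every \(\theta \), while the sample correlation of X and Y moves with \(\theta \), reflecting that the pair is not ancillary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

def sample(theta, n=50_000):
    # joint density 2 f(x) f(y) F(theta x y), with f the N(0,1) density and F its cdf
    X = rng.normal(size=n)
    Y = np.empty(n)
    todo = np.arange(n)
    while todo.size:                              # accept-reject: propose y ~ f, accept w.p. F(theta x y)
        y = rng.normal(size=todo.size)
        acc = rng.random(todo.size) < norm.cdf(theta * X[todo] * y)
        Y[todo[acc]] = y[acc]
        todo = todo[~acc]
    return X, Y

for theta in (0.0, 2.0):
    X, Y = sample(theta)
    print(theta, "mean/sd of Y:", round(float(Y.mean()), 3), round(float(Y.std()), 3),
          " corr(X, Y):", round(float(np.corrcoef(X, Y)[0, 1]), 3))
```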

Problem 10.13

A sample of size n is drawn with replacement from a population consisting of N distinct unknown values \(\{a_1,\ldots ,a_N\}\). The number of distinct values in the sample is ancillary.

Problem 10.14

Assuming the distribution (4.22) of Section 4.9, show that Z is S-ancillary for \(p=p_+/(p_++p_-)\).

Problem 10.15

In the situation of Example 10.2.3, \(X+Y\) is binomial if and only if \({\Delta =1}\).

Problem 10.16

In the situation of Example 10.2.2, the statistic Z remains S-ancillary when the parameter space is \(\Omega =\{(\lambda ,\mu ):\mu \le \lambda \}\).

Problem 10.17

Suppose \(X=(U,Z)\), the density of X factors into

$$ p_{\theta ,\vartheta }(x)=c(\theta ,\vartheta )g_\theta (u;z)h_\vartheta (z)k(u,z), $$

and the parameters \(\theta \), \(\vartheta \) are unrelated. To see that these assumptions are not enough to insure that Z is S-ancillary for \(\theta \), consider the joint density

$$ C(\theta ,\vartheta )e^{-\frac{1}{2}(u-\theta )^2-\frac{1}{2}(z-\vartheta )^2} I(u,z), $$

where I(u, z) is the indicator of the set \(\{(u,z):u\le z\}\). [Basu (1978).]

Section 10.3

Problem 10.18

Verify the density (10.16) of Example 10.3.2.

Problem 10.19

Let the real-valued function f be defined on an open interval.

  1. (i)

    If f is logconvex, it is convex.

  2. (ii)

    If f is strongly unimodal, it is unimodal.

Problem 10.20

Let \(X_1,\ldots ,X_m\) and \(Y_1,\ldots ,Y_n\) be positive, independent random variables distributed with densities \(f(x/\sigma )\) and \(g(y/\tau )\), respectively. If f and g have monotone likelihood ratios in \((x,\sigma )\) and \((y,\tau )\), respectively, there exists a UMP conditional test of \(H:\tau /\sigma \le \Delta _0\) against \(\tau /\sigma >\Delta _0\) given the ancillary statistics \(U_i=X_i/X_m\) and \(V_j=Y_j/Y_n\) (\(i=1,\ldots ,m-1\); \(j=1,\ldots ,n-1\)).

Problem 10.21

Let \(V_1,\ldots ,V_n\) be independently distributed as N(0, 1), and given \({V_1=v_1},\ldots ,{V_n=v_n}\), let \(X_i\) (\(i=1,\ldots ,n\)) be independently distributed as \(N(\theta v_i,1)\).

  1. (i)

    There does not exist a UMP test of \(H:\theta =0\) against \(K:\theta >0\).

  2. (ii)

    There does exist a UMP conditional test of H against K given the ancillary \((V_1,\ldots ,V_n)\). [Buehler (1982).]
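
For part (ii), a minimal sketch (my own illustration with arbitrary n and \(\alpha \)): given \(V=v\) the \(X_i\) are independent \(N(\theta v_i,1)\), so a natural conditional test of \(H:\theta =0\) rejects when \(\sum v_iX_i>z_{1-\alpha }\sqrt{\sum v_i^2}\); the simulation estimates its conditional level and power for one realized v.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, alpha, n_sim = 4, 0.05, 300_000
z = norm.ppf(1 - alpha)

v = rng.normal(size=n)                            # condition on one realized (v_1, ..., v_n)

def rejects(theta):
    X = rng.normal(theta * v, 1.0, (n_sim, n))    # given V = v, X_i ~ N(theta v_i, 1)
    T = X @ v                                     # T ~ N(theta * sum(v_i^2), sum(v_i^2))
    return T > z * np.sqrt((v ** 2).sum())

print("conditional level        :", round(float(rejects(0.0).mean()), 3))   # close to alpha
print("conditional power (θ = 1):", round(float(rejects(1.0).mean()), 3))
```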

Problem 10.22

In Example 10.3.3,

  1. (i)

    the problem remains invariant under \(G'\) but not under G;

  2. (ii)

    the statistic D is ancillary.

Section 10.4

Problem 10.23

In Example 10.4.1, check directly that the set \(C=\{x:x\le -k\hbox { or }x\ge k\}\) is not a negatively biased semirelevant subset for the confidence intervals \((X-c,X+c)\).

Problem 10.24

  1. (i)

    Verify the posterior distribution of \(\Theta \) given x claimed in Example 10.4.1.

  2. (ii)

    Complete the proof of (10.32).

Problem 10.25

Let X be a random variable with cumulative distribution function F. If \(E|X|<\infty \), then \(\int ^0_{-\infty }F(x)\,dx\) and \(\int ^\infty _0[1-F(x)]\,dx\) are both finite. [Apply integration by parts to the two integrals.]
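
One route to the hinted conclusion (a sketch of the key identity, equivalent to the suggested integration by parts, not a full solution): by Tonelli's theorem,

$$ \int ^\infty _0[1-F(x)]\,dx=\int ^\infty _0P(X>x)\,dx=E\bigl [\max (X,0)\bigr ],\qquad \int ^0_{-\infty }F(x)\,dx=\int ^0_{-\infty }P(X\le x)\,dx=E\bigl [\max (-X,0)\bigr ], $$

and both expectations are finite when \(E|X|<\infty \).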

Problem 10.26

Let X have probability density \(f(x-\theta )\), and suppose that \(E|X|<\infty \). For the confidence intervals \(X-c<\theta \) there exist semirelevant but no relevant subsets. [Buehler (1959).]

Problem 10.27

Let \(X_1,\ldots ,X_n\) be independently distributed according to the uniform distribution \(U(\theta ,\theta +1)\).

  1. (i)

    Uniformly most accurate lower confidence bounds \(\underline{\theta }\) for \(\theta \) at confidence level \({1-\alpha }\) exist and are given by

    $$ \underline{\theta }=\max (X_{(1)}-k,X_{(n)}-1), $$

    where \(X_{(1)}=\min (X_1,\ldots ,X_n)\), \(X_{(n)}=\max (X_1,\ldots ,X_n)\), and \({(1-k)^n=\alpha }\).

  2. (ii)

    The set \(C:x_{(n)}-x_{(1)}\ge 1-k\) is a relevant subset with \(P_\theta (\underline{\theta }\le \theta \mid C)=1\) for all \(\theta \).

  3. (iii)

    Determine the uniformly most accurate conditional lower confidence bounds \(\underline{\theta }(v)\) given the ancillary statistic \(V=X_{(n)}-X_{(1)}=v\), and compare them with \(\underline{\theta }\). [The conditional distribution of \(Y=X_{(1)}\) given \(V=v\) is \(U(\theta ,\theta +1-v)\).]

[Pratt (1961a), Barnard (1976).]
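
A quick numerical check of parts (i) and (ii) (my own sketch; n, \(\alpha \), and \(\theta \) are arbitrary): the unconditional coverage of \(\underline{\theta }\) should be near \(1-\alpha \), while the coverage given the relevant subset C should be essentially 1.

```python
import numpy as np

rng = np.random.default_rng(7)
n, alpha, theta, n_sim = 5, 0.1, 0.0, 400_000
k = 1 - alpha ** (1 / n)                         # so that (1 - k)^n = alpha

X = rng.uniform(theta, theta + 1, (n_sim, n))
x1, xn = X.min(axis=1), X.max(axis=1)
lower = np.maximum(x1 - k, xn - 1)               # the lower confidence bound of part (i)
cover = lower <= theta
C = xn - x1 >= 1 - k                             # the relevant subset of part (ii)

print("unconditional coverage:", round(float(cover.mean()), 3))      # about 1 - alpha
print("coverage given C      :", round(float(cover[C].mean()), 3))   # equal to 1
print("coverage given not C  :", round(float(cover[~C].mean()), 3))  # below 1 - alpha
```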

Problem 10.28

  1. (i)

    Under the assumptions of the preceding problem, the uniformly most accurate unbiased (or invariant) confidence intervals for \(\theta \) at confidence level \(1-\alpha \) are

    $$ \underline{\theta }=\max (X_{(1)}+d,X_{(n)})-1<\theta <\min (X_{(1)},X_{(n)}-d)=\bar{\theta }, $$

    where d is the solution of the equation

    $$ \begin{array}{lll} 2d^n=\alpha &{} \hbox {if} &{} \alpha <1/2^{n-1},\\ 2d^n-(2d-1)^n=\alpha &{} \hbox {if} &{} \alpha >1/2^{n-1}.\end{array} $$
  2. (ii)

    The sets \(C_1:X_{(n)}-X_{(1)}>d\) and \(C_2:X_{(n)}-X_{(1)}<2d-1\) are relevant subsets with coverage probabilities

    $$ P_\theta [\underline{\theta }<\theta<\bar{\theta }\mid C_1]=1 \quad P_\theta [\underline{\theta }<\theta <\bar{\theta }\mid C_2]=0. $$
  3. (iii)

    Determine the uniformly most accurate unbiased (or invariant) conditional confidence intervals \(\underline{\theta }(v)<\theta <\bar{\theta }(v)\) given \(V=v\) at confidence level \(1-\alpha \), and compare \(\underline{\theta }(v)\), \(\bar{\theta }(v)\), and \(\bar{\theta }(v)-\underline{\theta }(v)\) with the corresponding unconditional quantities.

[Welch (1939), Pratt (1961a), Kiefer (1977a).]

Problem 10.29

Suppose \(X_1\) and \(X_2\) are i.i.d. with

$$P \{ X_i = \theta - 1 \} = P \{ X_i = \theta +1 \} = {1 \over 2}~.$$

Let C be the confidence set consisting of the single point \((X_1 + X_2 )/2\) if \(X_1 \ne X_2\) and \(X_1 -1\) if \(X_1 = X_2\). Show that, for all \(\theta \),

$$P_{\theta } \{ \theta \in C \} = 0.75~,$$

but

$$P_{\theta } \{ \theta \in C | X_1 = X_2 \} = 0.5$$

and

$$P_{\theta } \{ \theta \in C | X_1 \ne X_2 \} = 1~.$$

[Berger and Wolpert (1988).]
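
A direct simulation of the three probabilities (illustrative only; \(\theta =0\) is taken without loss of generality):

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n_sim = 0.0, 400_000

X = theta + rng.choice([-1.0, 1.0], size=(n_sim, 2))
tie = X[:, 0] == X[:, 1]
C = np.where(tie, X[:, 0] - 1, X.mean(axis=1))   # the single-point confidence set described above

cover = C == theta
print("P(theta in C)            :", round(float(cover.mean()), 3))         # about 0.75
print("P(theta in C | X1 == X2) :", round(float(cover[tie].mean()), 3))    # about 0.5
print("P(theta in C | X1 != X2) :", round(float(cover[~tie].mean()), 3))   # exactly 1
```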

Problem 10.30

Instead of conditioning the confidence sets \(\theta \in S(X)\) on a set C, consider a randomized procedure which assigns to each point x a probability \(\psi (x)\) and makes the confidence statement \(\theta \in S(x)\) with probability \(\psi (x)\) when x is observed.

  1. (i)

    The randomized procedure can be represented by a nonrandomized conditioning set for the observations (XU), where U is uniformly distributed on (0, 1) and independent of X, by letting \(C=\{(x,u):u<\psi (x)\}\).

  2. (ii)

    Extend the definition of relevant and semirelevant subsets to randomized conditioning (without the use of U).

  3. (iii)

    Let \(\theta \in S(X)\) be equivalent to the statement \(X\in A(\theta )\). Show that \(\psi \) is positively biased semirelevant if and only if the random variables \(\psi (X)\) and \(I_{A(\theta )}(X)\) are positively correlated, where \(I_A\) denotes the indicator of the set A.

Problem 10.31

The nonexistence of (i) semirelevant subsets in Example 10.4.1 and (ii) relevant subsets in Example 10.4.2 extends to randomized conditioning procedures.

6 Notes

Conditioning on ancillary statistics was introduced by Fisher (1934a, 1935b, 1936). The idea was emphasized in Fisher (1956b, 1959, 1973) and by Cox (1958), who motivated it in terms of mixtures of experiments providing different amounts of information. The consequences of adopting a general principle of conditioning in mixture situations were explored by Birnbaum (1962) and Durbin (1970). Following Fisher’s suggestion (1934), Pitman (1938b) developed a theory of conditional tests and confidence intervals for location and scale parameters. For a recent paradox concerning conditioning on an ancillary statistic, see Brown (1990) and Wang (1999).

The possibility of relevant subsets was pointed out by Fisher (1956a, 1956b, 1959, 1973) (who called them recognizable). Its implications (in terms of betting procedures) were developed by Buehler (1959), who in particular introduced the distinctions between relevant and semirelevant, and between positively and negatively biased, subsets, and proved the nonexistence of relevant subsets in location models. The role of relevant subsets in statistical inference, and their relationship to Bayes and admissibility properties, was discussed by Pierce (1973), Robinson (1976, 1979a, 1979b), Bondar (1977), and Casella (1988), among others.

Fisher (1956a, 1956b, 1959, 1973) introduced the idea of relevant subsets in the context of the Behrens–Fisher problem. As a criticism of the Welch–Aspin solution, he established the existence of negatively biased relevant subsets for that procedure. It was later shown by Robinson (1976) that no such subsets exist for Fisher’s preferred solution, the so-called Behrens–Fisher intervals. This fact may be related to the conjecture [supported by substantial numerical evidence in Robinson (1976) but so far unproved] that the unconditional coverage probability of the Behrens–Fisher intervals always exceeds the nominal level. For a review of these issues, see Wallace (1980) and Robinson (1982).

Maatta and Casella (1987) examine the conditional properties of some confidence intervals for the variance in the one-sample normal problem. For the conditional properties of some confidence sets for the multivariate normal mean, including confidence sets centered at James-Stein or shrinkage estimators, see Casella (1987) and George and Casella (1994). The conditional properties of the standard confidence sets in a normal linear model are studied in Hwang and Brown (1991).

In testing a simple hypothesis against a simple alternative, Berger et al. (1994) present a conditional frequentist methodology that agrees with a Bayesian approach.