1 Introduction

The conditionality principle C has played a puzzling role in attempts to develop a frequentist theory of statistical inference. On the one hand it seems intuitively obvious, and even a necessary component of such a theory. On the other hand it produces a significant ambiguity through nonequivalent applications, with no easy way of determining which application is correct, or even whether any are. Attempts to ignore this problem, typically by treating certain applications as equivalent, produce the somewhat strange phenomenon that C, a frequentist principle, can lead to the likelihood principle L, which precludes any frequentist inferences; see Evans et al. (1986) and Evans (2013) for discussion of this.

The fact that C is not an equivalence relation, which any valid characterization of statistical evidence must be, calls into question the justification for C. This can be regarded as a logical inconsistency in the definition of C. Moreover, as will be shown, an ancillary statistic can become informative when the distribution of another ancillary statistic changes. This raises the issue of whether the distribution of such a statistic is truly irrelevant for inference, which can be regarded as a statistical inconsistency in the definition of C.

The purpose of this paper is to propose a resolution to these problems. It is argued that a correct characterization of the ancillary concept requires restricting the ancillaries available for conditioning to a subset determined by very natural statistical criteria. Once this restriction is made, the subset has a unique maximal member, and this is the ancillary to condition on, as it produces the maximal reduction in the set of possible data values to which the observed data is compared in the conditional model. We show that these natural statistical criteria identify the subset as the minimal ancillaries, whose maximal member is the laminal ancillary in the taxonomy of Basu (1959).

One could argue that this isn’t much of an advance, particularly because the laminal ancillary is often trivial, but we would counterargue that it is significant because it shows that the other ancillaries, besides the laminal, are ineligible to be used in the conditioning step. This establishes the validity of some form of C for inference and this has broad implications. In particular, the idea that C together with the sufficiency principle S can lead to L, as discussed, for example, in Birnbaum (1962), Evans et al. (1986), Evans (2013) and many others, is completely avoided and this applies similarly to the argument that C alone can produce L. Additionally, it leads to a new and uncontroversial principle that combines S and a modified C that still permits frequentist considerations for inferences.

In Section 2 the conditionality principle is discussed. In Section 3 we introduce the statistical criterion that assesses whether an ancillary statistic is unstable, that is, whether it can become informative when the distribution of another ancillary statistic is changed. We show how this connects to the minimal and laminal ancillaries. In Section 4 a principle is introduced which satisfies both S and the new conditionality principle. This principle forms an equivalence relation on the class of all inference bases and so is indeed a valid partial characterization of statistical evidence, which was Birnbaum’s intention. The proofs of all propositions are placed in the Appendix.

The conditionality principle has attracted many authors, some of whom have attempted resolutions of these difficulties. The papers Basu (1959, 1964), Cox (1958, 1971), Kalbfleisch (1975), Buehler (1982), Stigler (2001) and Ghosh et al. (2010) all represent interesting contributions, and many more can be found in the references of these papers. To the best of our knowledge, however, nobody has presented a forceful argument for the laminal ancillary as the natural resolution, and that is the outcome of the discussion in Section 3.

2 Principles and Ancillaries

All of the principles S, C and L are applied to inference bases. An inference base I = (M,x) is comprised of a statistical model

$$ M=(\mathcal{X},\mathcal{B},\{P_{\theta,X}:\theta\in{\Theta}\}), $$

where \(\mathcal {X}\) is a sample space containing all possible values for the observed data x of random object X, \({\mathscr{B}}\) is a σ-field on \(\mathcal {X}\) and {P𝜃,X : 𝜃 ∈Θ} is a collection of probability measures defined on \({\mathscr{B}}\) indexed by model parameter 𝜃 ∈Θ. For inference, the assumption is made that there is a true value of 𝜃, say 𝜃true, such that, before it is observed, \(x\sim P_{\theta _{true},X}.\) The goal, once x is observed, is to make inference about which of the possible values of 𝜃 ∈Θ corresponds to 𝜃true and these inferences are based somehow on the ingredients I = (M,x). More generally, our interest is in some marginal parameter ψ = Ψ(𝜃) that has a real-world interpretation and it is desired to know the value ψtrue = Ψ(𝜃true) and this requires dealing with so-called nuisance parameters. This more general problem is ignored here except to say that the concept of conditioning on an ancillary for the model is still relevant for that context. Birnbaum (1962) considered the set of all inference bases and, for inference bases I1 and I2 with essentially the same model parameter (or bijective relabellings thereof), indicated that these inference bases contain the same statistical evidence about the true value of the model parameter by writing Ev(I1) = Ev(I2).

It is worth noting that there is an implicit assumption in the developments here, namely, it is assumed that all relevant aspects of the statistical investigation are captured by saying that the true distribution of the response is a member of a set of distributions, on a given sample space, and indexed by 𝜃 ∈Θ. This restriction is commonly made in discussions of inference but it is still an assumption, namely, that there are no other aspects of the problem that need to be included. This assumption is stated as the Distribution Principle in Dawid (1977). So all results derived here require this assumption as does our interpretation of the results of Birnbaum (1962).

An ancillary statistic for the model M is a map \(A:(\mathcal{X},\mathcal{B})\rightarrow(\mathcal{A},\mathcal{C})\) such that the marginal probability measure induced by A satisfies P𝜃,A = PA for every 𝜃 ∈Θ. In other words, A is ancillary when its marginal distribution is independent of the model parameter, and it is then claimed that the observed value A(x) contains no information about 𝜃true. More than this, simple examples, like the two measuring instruments example in Cox (1958), suggest that for frequentist inferences the initial model M in I = (M,x) be replaced by M|A(x) = {P𝜃,X(⋅|A(x)) : 𝜃 ∈Θ}, where P𝜃,X(⋅|A(x)) is the conditional probability measure for X given the value A(x). A statement of the principle C is then,

$$ C\text{: If }A:(\mathcal{X},\mathcal{B})\rightarrow(\mathcal{A},\mathcal{C}) \text{ is ancillary for }M,\text{ then }Ev(M,x) = Ev(M_{\vert A(x)},x). $$
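To make the conditioning step in C concrete, the following sketch, in Python, checks ancillarity and forms the conditional model M|A(x) for a hypothetical two-distribution model on {1,2,3,4}; the probabilities are illustrative only and are not taken from any example in this paper.

```python
from fractions import Fraction as F

# Hypothetical model: sample space {1,2,3,4}, Theta = {theta1, theta2}
# (illustrative probabilities only).
model = {
    "theta1": {1: F(1, 10), 2: F(4, 10), 3: F(3, 10), 4: F(2, 10)},
    "theta2": {1: F(2, 10), 2: F(3, 10), 3: F(2, 10), 4: F(3, 10)},
}

# A statistic is identified with the partition it induces on the sample space;
# it is ancillary when every cell has the same probability under every theta.
A = [{1, 2}, {3, 4}]

def is_ancillary(model, partition):
    return all(len({sum(p[x] for x in cell) for p in model.values()}) == 1
               for cell in partition)

def conditional_model(model, partition, x):
    # M_{|A(x)}: renormalize each P_theta on the cell containing the observed x
    cell = next(c for c in partition if x in c)
    return {th: {y: p[y] / sum(p[z] for z in cell) for y in cell}
            for th, p in model.items()}

print(is_ancillary(model, A))          # True: both cells have probability 1/2 under each theta
print(conditional_model(model, A, 1))  # the model C says to use when A(x) = {1,2} is observed
```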

An ancillary A is a maximal ancillary if, whenever \(A^{\prime }\) is another ancillary and there exists a function h such that \(A=h(A^{\prime }),\) then h is effectively a 1-1 function. So, the set of possible data values {z : A(z) = A(x)} that is conditioned on via C, when the value A(x) of a maximal ancillary is observed, cannot be made smaller without losing ancillarity.

It is natural to make the greatest possible reduction in the set of possible sample values used for inference and so a possible full statement of C would be to condition on a maximal ancillary. When there is a unique maximal ancillary this is uncontroversial. As Example 1 shows, however, there can be several maximal ancillaries. In such a case there is an ambiguity concerning which maximal ancillary to use when applying C since, for two maximal ancillaries A1 and A2, the inference bases \((M_{\vert A_{1}(x)},x)\) and \((M_{\vert A_{2}(x)},x)\) can lead to quite different inferences; see Example 2. It is shown in Evans (2013) that the lack of a unique maximal ancillary implies that C is not an equivalence relation on the set of all inference bases and therefore, as currently stated, it is not a correct characterization of statistical evidence. It is also shown there that, if \(\bar {C}\) is the smallest equivalence relation containing C, then \(\bar {C}=L.\) Similarly, the smallest equivalence relation containing S ∪ C, which is also not an equivalence relation, satisfies \(\overline {S\cup C}=L\) and this is what the proof of Birnbaum’s theorem establishes. So the lack of a unique maximal ancillary leaves open the question of whether or not C, or some modification of it, is indeed a valid statistical principle that should be employed in statistical work.

Basu (1959) defined a minimal ancillary as any ancillary which is a function of every maximal ancillary and showed that there is a unique ancillary in the class of minimal ancillaries, called the laminal ancillary, which is maximal in this class. The following example illustrates these concepts.

Example 1.

Suppose M consists of two distributions, as provided in Table 1 together with the likelihood ratio (LR). This is actually a range of examples, as 𝜖 is any value satisfying 0 < 𝜖 < 1/64. For each such 𝜖 the minimal sufficient statistic (mss) is the identity, which is not the case if 𝜖 = 0. This implies that all the ancillaries are functions of the mss and this will prove important for our later discussion.

Table 1 Distributions in Example 1 together with likelihood ratios

Since any 1-1 function of an ancillary is ancillary, it is equivalent to present the preimage partitions induced by such statistics when considering the ancillary structure of this model, and some of these are provided in Table 2. It is clear from this table that the maximal ancillaries are given by A1 and A2, as these give the finest ancillary partitions, and so the laminal ancillary must be L, as it is the finest ancillary partition that is a coarsening of both maximal ancillaries. The minimal ancillaries are given by {T,B1,B2,B3,L}, where T is the trivial ancillary, as these are all coarsenings of both A1 and A2, and they are presented in Table 2.

Table 2 The minimal ancillaries in Example 1

There are ancillaries that are coarsenings of single maximal ancillaries such as

$$ \begin{array}{@{}rcl@{}} C_{1} & :\{1,3\},\{2,4\},\{5,6,7\}\\ C_{2} & :\{1,3,5,6\},\{2,4\},\{7\} \end{array} $$

which are coarsenings of A2 but not of A1 and there are many others.

If the sample space were shrunk to {1,2,3,4}, with the 1/2 probability for {5,6,7} redistributed equally among the four remaining sample points, then the laminal ancillary would become the trivial ancillary. This is not uncommon, as noted in Basu (1959), where conditions for this to occur are discussed.
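The taxonomy above can be computed by brute force in small discrete models. The following sketch, in Python, enumerates all ancillary partitions of a hypothetical two-distribution model on {1,…,6} (not the model of Table 1) and extracts the maximal, minimal and laminal ancillaries; the model is chosen so that, as in Example 1, there are two maximal ancillaries, five minimal ancillaries and a nontrivial laminal.

```python
from fractions import Fraction as F

# Hypothetical two-distribution model on {1,...,6}; illustrative probabilities only.
model = {
    "theta1": {1: F(1, 20), 2: F(4, 20), 3: F(3, 20), 4: F(2, 20), 5: F(4, 20), 6: F(6, 20)},
    "theta2": {1: F(2, 20), 2: F(3, 20), 3: F(2, 20), 4: F(3, 20), 5: F(4, 20), 6: F(6, 20)},
}
points = sorted(model["theta1"])

def partitions(items):
    # enumerate all set partitions of a list
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def is_ancillary(part):
    return all(len({sum(p[x] for x in cell) for p in model.values()}) == 1
               for cell in part)

def coarser(p, q):
    # p is a coarsening of q: every cell of q lies inside some cell of p
    return all(any(set(cq) <= set(cp) for cp in p) for cq in q)

anc = [p for p in partitions(points) if is_ancillary(p)]
maximal = [p for p in anc if not any(q != p and coarser(p, q) for q in anc)]
minimal = [p for p in anc if all(coarser(p, q) for q in maximal)]
laminal = [p for p in minimal if all(coarser(q, p) for q in minimal)][0]

print("maximal:", maximal)  # two maximal ancillaries, playing the roles of A1 and A2
print("minimal:", minimal)  # five minimal ancillaries, playing the roles of T, B1, B2, B3, L
print("laminal:", laminal)  # the cells {1,2,3,4}, {5} and {6} for this model
```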

The following example demonstrates the ambiguity that a nonunique maximal ancillary can produce and is adapted from Evans (2015).

Example 2.

Consider the model given by Table 3 and suppose x = 1 is observed. The MLE of 𝜃 is \(\hat {\theta }(1)=\theta _{1}.\)

Table 3 Distributions in Example 2

There are two maximal ancillaries as given by their partitions, namely A1 = {{1,2},{3,4}} and A2 = {{1,3},{2,4}}. The sampling distributions of the MLE obtained by conditioning on the maximal ancillaries are as displayed in Table 4.

Table 4 Conditional distributions of the MLE in Example 2

As can be seen, these sampling distributions are quite different and it is not clear which to use as part of quantifying the uncertainty in the estimate.
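The ambiguity can be reproduced in any small model with two maximal ancillaries. The following sketch, in Python, uses a hypothetical two-distribution model on {1,2,3,4} (not the model of Table 3) with maximal ancillaries A1 = {{1,2},{3,4}} and A2 = {{1,3},{2,4}}, and computes the sampling distribution of the MLE conditional on each.

```python
from fractions import Fraction as F

# Hypothetical two-distribution model on {1,2,3,4}; illustrative probabilities only.
model = {
    "theta1": {1: F(1, 10), 2: F(4, 10), 3: F(3, 10), 4: F(2, 10)},
    "theta2": {1: F(2, 10), 2: F(3, 10), 3: F(2, 10), 4: F(3, 10)},
}
A1 = [{1, 2}, {3, 4}]
A2 = [{1, 3}, {2, 4}]

def mle(x):
    return max(model, key=lambda th: model[th][x])

def mle_dist_given(partition, x):
    # distribution of the MLE under each theta, conditional on the cell containing x
    cell = next(c for c in partition if x in c)
    out = {}
    for th, p in model.items():
        norm = sum(p[y] for y in cell)
        dist = {}
        for y in cell:
            dist[mle(y)] = dist.get(mle(y), 0) + p[y] / norm
        out[th] = dist
    return out

x = 1
print("conditioning on A1:", mle_dist_given(A1, x))
print("conditioning on A2:", mle_dist_given(A2, x))
```

The two conditional sampling distributions differ, so the accuracy attributed to the estimate depends on which maximal ancillary is chosen for the conditioning.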

3 Stable and Strong Ancillaries

Despite the rich structure of the ancillary statistics, standard evidence theory assumes (through the standard conditionality principle C) that conditioning on different ancillary statistics is equally valid. We challenge this assumption through two main perspectives, which give rise to a resolution.

Reproducing the Structure with a Single Maximal Ancillary

As noted in Evans (2013), the fact that more than one maximal ancillary can exist results in C not forming an equivalence relation on the set of all inference bases. If we want to claim that a given principle does properly characterize when two inference bases contain the same amount of statistical evidence concerning an unknown 𝜃, then it seems clear that the principle must induce an equivalence relation. Therefore, C needs to be modified if it is desirable for conditioning on ancillaries to play a role in inference.

Basu (1959) introduced the concept that two ancillary subsets \(A,B\in {\mathscr{B}}\) for model M conform when A∩B is also ancillary. The set of all ancillary subsets that conform to every other ancillary subset is denoted by Γ0, and it is proved that Γ0 is a σ-field. Moreover, Γ0 is the laminal ancillary σ-field in the sense that it is the largest σ-field contained in all the σ-fields induced by the individual maximal ancillaries. This is effectively saying that, allowing for 1-1 equivalences, the laminal ancillary statistic is a function of every maximal ancillary. A further implication is that the laminal ancillary σ-field is the largest minimal ancillary σ-field and so the laminal ancillary statistic is the maximal minimal ancillary statistic. Also, if there is a unique maximal ancillary, then it is also the laminal ancillary. This points to a special role for the laminal ancillary, especially since the laminal ancillary always exists and a conditionality principle that prescribes conditioning on the laminal forms an equivalence relation on the set of inference bases; see Section 4.

Although logical, this role has not been explored, perhaps because the laminal often does not produce a meaningful reduction. In addition, Basu’s development, while logical, does not provide a good statistical reason to adopt the laminal as the ancillary to condition on. It is argued here, however, that there is a key element that can be added to the story, and with this addition the laminal is not only a logical resolution but a statistical necessity.

Addressing the Transition of Ancillaries to Informative Statistics

The key idea in this development is the supposed irrelevance of the distribution of an ancillary that is to be conditioned on. For, after all, as far as inference goes, this distribution plays absolutely no role whatsoever. The statistical intuition behind this is that the distribution of the ancillary is free of the parameter and so an observation from it contains no information about 𝜃true. As such, it must be the case that, no matter what distribution is assumed for an ancillary, this cannot change the basic information structure of the problem. Note that this is a more severe requirement for what it means for a statistic to be ancillary. Two definitions that capture this idea are now provided and their equivalence proved. It is then proved that the set of ancillaries which satisfy this criterion has a maximal member and that it is the laminal ancillary. To avoid a measure-theoretic presentation via σ-fields, as in Basu (1959), it will be assumed here that all ancillaries are discretely distributed on \( \mathbb {N} \) and that there are at most a countable number of ancillaries, as this is sufficient for conveying the key ideas.

For ancillary U for model M, the following notation is adopted

$$ M={\sum}_{i}P_{U}(\{i\})M_{\vert U=i}. $$

This expresses the idea that the model M is a mixture of the component models obtained by conditioning on U = i where the mixture probabilities are given by the marginal distribution of U. The following definitions capture the idea that the distribution of U should be irrelevant for the inference problem.

Definition 1.

An ancillary U for model M is called a stable ancillary for M if, whenever V is ancillary for M, then U is ancillary for the mixture \({\sum }_{i}p_{i}M_{_{\vert V=i}}\) for every probability distribution (p1,p2,…) on the set of possible values for V. An ancillary U for model M is called a strong ancillary for M if any ancillary V for M is also ancillary for the mixture \({\sum }_{i}p_{i}M_{_{\vert U=i}}\) for every probability distribution (p1,p2,…) on the set of possible values for U.

So U is a stable ancillary when changing the distribution of any other ancillary has no effect on the ancillarity of U, and U is a strong ancillary if changing the distribution of U has no effect on the ancillarity of any other ancillary. If an ancillary U is not stable, then conditioning on the value of some other ancillary renders the value U(x) informative, which contradicts the underlying motivation that the value of an ancillary statistic contains no evidence concerning 𝜃true. Similarly, if U is not strong, then conditioning on the value U(x) renders the value of some other ancillary informative. Accordingly, it is difficult to accept the claim that the value of an ancillary that is not stable/strong is noninformative with respect to 𝜃true.
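The failure of stability can be exhibited directly. The following sketch, in Python, uses a hypothetical two-distribution model on {1,…,6} (illustrative probabilities, not those of Example 1) with maximal ancillaries A1 and A2 and laminal ancillary {{1,2,3,4},{5},{6}}; changing the mixture weights attached to A1, as in Definition 1, destroys the ancillarity of A2, while the laminal remains ancillary for every choice of weights.

```python
from fractions import Fraction as F

# Hypothetical model on {1,...,6}; illustrative probabilities only.
model = {
    "theta1": {1: F(1, 20), 2: F(4, 20), 3: F(3, 20), 4: F(2, 20), 5: F(4, 20), 6: F(6, 20)},
    "theta2": {1: F(2, 20), 2: F(3, 20), 3: F(2, 20), 4: F(3, 20), 5: F(4, 20), 6: F(6, 20)},
}
A1  = [{1, 2}, {3, 4}, {5}, {6}]   # a maximal ancillary
A2  = [{1, 3}, {2, 4}, {5}, {6}]   # the other maximal ancillary
LAM = [{1, 2, 3, 4}, {5}, {6}]     # the laminal ancillary (a coarsening of both)

def is_ancillary(m, partition):
    return all(len({sum(p[x] for x in cell) for p in m.values()}) == 1
               for cell in partition)

def remix(m, partition, weights):
    # sum_i p_i * M_{|partition=i}: replace the marginal distribution of the
    # ancillary by the given weights, leaving each conditional model untouched
    new = {}
    for th, p in m.items():
        q = {}
        for w, cell in zip(weights, partition):
            norm = sum(p[x] for x in cell)
            for x in cell:
                q[x] = w * p[x] / norm
        new[th] = q
    return new

# change the distribution of A1 from (1/4, 1/4, 1/5, 3/10) to new weights
perturbed = remix(model, A1, [F(4, 10), F(1, 10), F(2, 10), F(3, 10)])

print(is_ancillary(model, A2), is_ancillary(model, LAM))          # True True
print(is_ancillary(perturbed, A2), is_ancillary(perturbed, LAM))  # False True
```

So A2 is not a stable ancillary in this model, while the laminal behaves exactly as a stable ancillary should.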

In actuality, a stable ancillary is strong and a strong ancillary is stable as the following result shows.

Proposition 1.

U is a strong ancillary for M iff it is a stable ancillary for M.

Given that stable and strong ancillaries are just different expressions of the same concept, these will be referred to hereafter as stable ancillaries.

In part (i) of the following result it is now shown that a stable ancillary is a minimal ancillary and a minimal ancillary is a stable ancillary. Since Basu (1959) proved that the laminal ancillary is the maximal minimal ancillary this establishes that the laminal ancillary is the maximal stable ancillary and, for the sake of completeness, this is proved in part (ii).

Proposition 2.

(i) A stable/strong ancillary is a minimal ancillary and conversely. (ii) There exists a maximal minimal ancillary (the laminal ancillary).

Since the word minimal does not really convey the positive aspects of such ancillaries, these will be referred to as stable ancillaries hereafter.

It is worth noting that the structure given by the minimal and laminal ancillaries is really the largest ancillary structure within the model that replicates the situation where there is a single maximal ancillary and, as such, there is no ambiguity about which ancillary to condition on. This coherence points to the laminal ancillary as playing a special role and this is reinforced by the notion of stability of an ancillary.

The following example demonstrates numerically the extent to which assuming an incorrect distribution for an unstable ancillary (i) can render another unstable ancillary informative, yet (ii) preserves the ancillary status of a stable ancillary.

Example 3.

Consider again Example 1 with 𝜖 = 0.01, but now consider what happens to the ancillary status of the unstable ancillary C2 and the stable ancillary L, when the distribution of the unstable ancillary A1 is changed from \(P_{A_{1}},\) as given by (1/4,1/4,3/14,4/14), to a true distribution that is unknown to the researcher, \(P_{A_{1}}^{unknown},\) as given by (7/100,13/100,27/100,53/100), see Fig. 1. It is then observed that L stays ancillary, as the theory assures, namely, for both 𝜃 = 𝜃1 and 𝜃 = 𝜃2, the distribution of L is (1/2,3/14,4/14) under the first scenario and (20/100,27/100,53/100) under the second. However, the likelihood ratios of C2, \(P_{\theta _{1}}(C_{2}=\) a given value\()/P_{\theta _{2}}(C_{2}=\) a given value), are now far from 1; C2 has lost its ancillary status and is now informative.

Figure 1 The result of changing the distribution of a nonstable ancillary in Example 3

One may reasonably consider that such sensitivity of the ancillary status of a statistic suggests that its ancillarity is not a structural feature of the design, but is rather a coincidence. This possibility, while not testable within the model, suggests that any conditioning should be focused only on stable ancillaries.

To see further why C needs to be modified we examine the motivation for conditioning as part of the inference process. This arises from considering mixture experiments. Suppose there is a set of models, say \(\{M_{a} :a\in \mathcal {A}\},\) with \(M_{a}=(\mathcal {X},\{P_{\theta ,a}:\theta \in {\Theta }\}),\) where the data x will arise from one of these models. The model that produces the data is obtained via a randomization procedure where a value a is produced with probability PA({a}) = P(A = a) on \(\mathcal {A}.\) This mixing produces the overall model \(M={\sum }_{a\in \mathcal {A}}P_{A}(\{a\})M_{a}\) and A is ancillary for M. If the value A = a0 is observed, then C says that the inference base \((M_{a_{0}},x)\) is the one that is relevant for inference about 𝜃. This seems uncontroversial and therein lies the appeal of C.

The controversy surrounding C arises when, rather than being presented with a physical randomization device as part of a two-stage procedure, as just described, we are presented with the inference base (M,x) with A being ancillary for M. Since M can at least formally be considered as a mixture model via A, it then seems reasonable to replace (M,x) by \((M_{a_{0}},x),\) where A(x) = a0, for inference about 𝜃.

But now consider two studies conducted by statisticians 1 and 2 concerning the true value of the quantity 𝜃, but suppose different randomization schemes are used in each. So, in the i-th study the collection of models is given by \(\{M_{ia}:a\in \mathcal {A}_{i}\}\) and the relevant ancillary is Ai. Suppose that the mixing produces the same overall model M in both studies and, furthermore, the same data x is obtained. This may seem unrealistic, but recall that in the end this is the situation that confronts us when considering a model with multiple ancillaries and we wish to justify conditioning on one of them.

It would seem then that both studies would conclude that the evidence about the true value of 𝜃 in the inference base (M,x) is the same, but the expression of this will be different, and result in different conditional inference bases, unless effectively the same maximal ancillary is being used for the mixing. In Example 1, suppose the two randomization schemes are specified by the maximal ancillaries A1 and A2, as this is a case where the conditional inference bases will be different. Recall, however, that the specific distributions for the Ai are supposedly irrelevant for inference about 𝜃 and indeed these play no role in the actual inferences. But now suppose, for whatever reason, statistician 1 decides to modify their randomization scheme by changing the distribution of A1, say from \(P_{A_{1}}\) to \(P_{A_{1}}^{\prime }.\) This does not change the submodels M1a, and so this change in the ancillary distribution seems innocuous to statistician 1, as their inferences will not change due to the irrelevance of the distribution of the ancillary. The overall model M, however, has changed to \(M^{\prime }\) and this may produce a conflict with statistician 2, because it may be that A2 is no longer ancillary in \(M^{\prime }\) and is now informative. Statistician 2 can then rightly claim that the distribution of A1 is definitely relevant to the inference process and so there is a contradiction between the two statisticians.

This demonstrates that there is a clear contradiction that resides within the reasoning that justifies C, at least as long as it is silent about which ancillaries are appropriate for the conditioning step. The content of this paper has demonstrated how to resolve this contradiction by making sure that any ancillaries that are used do not produce the phenomenon just described. The relevant ancillaries to use are the stable ancillaries and indeed their marginal distributions are irrelevant for inference. The irrelevance of the marginal distribution of a stable ancillary is similar to the irrelevance of the conditional distribution of the data given a mss and both can be discarded for inference. This recovers conditioning on an ancillary as a valid part of the inference process. Of course, we want to make the maximal reduction via conditioning, to eliminate as much of the variation as possible that has nothing to do with 𝜃, and this leads to conditioning on the laminal.

4 Stable Conditionality and Evidence

In discussing statistical evidence Birnbaum (1962) introduced the Ev function defined on the set of all inference bases. When two inference bases I1, I2 were considered to be equivalent with respect to their content of statistical evidence, this was denoted by Ev(I1) = Ev(I2). Birnbaum did not, however, specify the value of Ev(I). While this is understandable, that approach is modified here, as evidence functions are fully defined (up to 1-1 equivalence due to relabellings) for the principles discussed. The basic reason for this is that a principle of inference should not only state an equivalence, but also prevent the usage of aspects of an inference base that are identified as irrelevant for the inference process. As pointed out in Durbin (1970), ensuring that such irrelevant aspects were not used was one way of preventing Birnbaum’s proof of his well-known theorem. We still do not give a full definition of Ev, but it is argued that this takes us some steps closer and that such restrictions are a necessity.

In what follows, we examine the consequences that arise for statistical evidence, as described in Birnbaum (1962), if one focuses on the set of stable ancillaries that are functions of a mss for a model M, namely,

$$ \mathcal{A}_{M}=\{A:A\text{ is a stable ancillary and a function of a mss for model }M\}. $$
(1)

It was pointed out in Durbin (1970) that restricting to ancillaries that are functions of a mss voided the proof of Birnbaum’s theorem. Evans et al. (1986) argued that this was a natural restriction because otherwise the information being conditioned on via the ancillary was precisely the information being discarded as irrelevant via sufficiency in Birnbaum’s proof. As such, there existed a contradiction between the principles S and C in that context. The restriction to ancillaries that are functions of a mss also seems implicit in Fisher’s development of the ancillarity concept, as documented in Stigler (2001).

Based on the developments in Section 3, the restriction is made to those ancillaries that are stable because these are in a sense the ancillaries that truly introduce no information into the analysis concerning the true distribution. It is to be noted that there still is a place in a statistical analysis for ancillaries that are not functions of a mss as, for example, in regression analysis with normal error where the standardized residuals are ancillaries that are not functions of the mss but play a key role in model checking. Our concern here, however, is with the inference step and the restriction to (1) seems essential in that context.

For simplicity, we suppose that the parameter space Θ = {𝜃1,𝜃2,...,𝜃m} and the sample space \(\mathcal {X} =\{x_{1},...,x_{n}\}\) are both finite, as this does not change the essential meaning of the principles. Also we take \({\mathscr{B}}=2^{\mathcal {X}},\) the power set of \(\mathcal {X},\) and suppress this in the notation hereafter. It is assumed that Θ is the same in any two inference bases that we consider related via Ev, although it is possible to allow one parameter space to be a 1-1 relabelling of the other; this is ignored here. Also, it will always be assumed that, for each \(x_{i}\in \mathcal {X},\) there is at least one 𝜃 ∈Θ such that P𝜃({xi}) > 0, so the sample space \(\mathcal {X}\) cannot be made smaller.

A sufficient statistic T is any function defined on \(\mathcal {X}\) such that, if T(x) = T(y), then x and y are in the same equivalence class associated with the sufficiency equivalence relation on \(\mathcal {X}\) given by x ≡S y whenever there is a constant c such that P𝜃,X({x}) = cP𝜃,X({y}) for every 𝜃 ∈Θ. A mss is a sufficient statistic T such that when x ≡S y, then T(x) = T(y) and so it is any function on \(\mathcal {X}\) that indexes the equivalence classes. The value of the mss represents the maximal reduction in the observed data that results in no information loss concerning 𝜃true. A canonical representative of the mss is, as discussed in Evans (2015), Lemma 3.3.2, given by T(x) = [x] where \([x]\subset \mathcal {X}\) is the equivalence class induced by ≡S on \(\mathcal {X}.\) Any function on \(\mathcal {X}\) that is constant on each set [x] and different on [x] and [y] when [x]≠[y], can also serve as a mss. For example, when there is 𝜃i ∈Θ such that \(P_{\theta _{i},X}(\{x\})>0\) for all \(x\in \mathcal {X},\) then the mss can be taken to be

$$ T(x)=(P_{\theta_{1},X}(\{x\})/P_{\theta_{i},X}(\{x\}),\ldots,P_{\theta_{m},X}(\{x\})/P_{\theta_{i},X}(\{x\})). $$

Let \(T:\mathcal {X}\overset {onto}{\mathcal {\rightarrow }}\mathcal {T}\) denote the mss, however it is chosen, with model \(M_{T}=(\mathcal {T},\{P_{\theta ,T}:\theta \in {\Theta }\}).\)

The following statement of the sufficiency principle is equivalent to the statement in Birnbaum (1962), but it is easier to use this version to prove that S is indeed an equivalence relation on the set of all inference bases, see Evans (2015), Lemma 3.3.3. Here we allow for any version of the mss as h(T), where h is a 1-1 function (a relabelling) defined on \(\mathcal {T}.\) This allows for relating two inference bases (M1,x1) and (M2,x2) that may have very different models but whose minimal sufficient statistics are essentially equivalent under such a relabelling, and so the principle is defined as a relation on the set of all inference bases.

Sufficiency Principle

(S) The inference bases (M1,x1) and (M2,x2), with minimal sufficient statistics T1 and T2 respectively, are equivalent under S whenever there is a 1-1, onto function \(h:\mathcal{T}_{2}\rightarrow\mathcal{T}_{1}\) such that T1 = h ∘ T2 and

$$ (M_{1,T_{1}},T_{1}(x_{1}))=(M_{2,h(T_{2})},h(T_{2}(x_{2}))). $$

So when (M1,x1) and (M2,x2) are related via S, the sampling distributions of T1 and T2 are essentially the same as are the observed values of these statistics. For example, as a particular application, if model \((\mathcal {X},\{P_{X\mid \theta }:\theta \in {\Theta }\})\) has mss T, then observations \(x,y\in \mathcal {X}\) satisfying T(x) = T(y), together with the model, contain the same evidence about 𝜃true, i.e.,

$$ Ev(\mathcal{X},\{P_{\theta,X}:\theta\in{\Theta}\},x)=Ev(\mathcal{X} ,\{P_{\theta,X}:\theta\in{\Theta}\},y) $$

where the function h is just the identity in this case.

While no image space is defined for Ev, it is necessary to do this for a specific principle so that it is clear that the goal of the principle is also to exclude ingredients that are really extraneous to its intent. It is immediate from S that

$$ Ev(\mathcal{X},\{P_{\theta,X}:\theta\in{\Theta}\},x)=Ev(\mathcal{T} ,\{P_{\theta,T}:\theta\in{\Theta}\},T(x)) $$

and this is undoubtedly the most important application of the principle, namely, all inferences about the true value of 𝜃 are based on the model for a mss and its observed value. This leads to the definition of the minimal sufficiency evidence function EvMS given by

$$ Ev_{MS}(\mathcal{X},\{P_{\theta,X}:\theta\in{\Theta}\},x)=(\mathcal{T} ,\{P_{\theta,T}:\theta\in{\Theta}\},T(x))=(M_{T},T(x)), $$

for, say, the canonical mss T, although any other equivalent version of the mss could be used. In other words, we are restricting what we consider an appropriate presentation of the evidence based on S. The ultimate evidence function, whatever it may be, will be composed with EvMS.
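For a finite model the mss and the evidence function EvMS can be computed directly from the likelihood ratio representation above. The following sketch, in Python, does this for a hypothetical two-distribution model on {1,…,6} (illustrative probabilities only), using 𝜃1 as the reference value in the ratio vector.

```python
from fractions import Fraction as F

# Hypothetical model on {1,...,6}; every point has positive probability under theta1,
# so theta1 can serve as the reference value theta_i in the ratio vector above.
model = {
    "theta1": {1: F(1, 20), 2: F(4, 20), 3: F(3, 20), 4: F(2, 20), 5: F(4, 20), 6: F(6, 20)},
    "theta2": {1: F(2, 20), 2: F(3, 20), 3: F(2, 20), 4: F(3, 20), 5: F(4, 20), 6: F(6, 20)},
}
thetas = sorted(model)

def T(x):
    # mss as the vector of likelihood ratios relative to theta1; sample points
    # with the same vector are equivalent under sufficiency
    return tuple(model[th][x] / model[thetas[0]][x] for th in thetas)

def Ev_MS(x):
    # Ev_MS = (M_T, T(x)): the model induced by the mss and its observed value
    t_vals = sorted({T(y) for y in model[thetas[0]]})
    M_T = {th: {t: sum(p[y] for y in p if T(y) == t) for t in t_vals}
           for th, p in model.items()}
    return M_T, T(x)

M_T, t_obs = Ev_MS(5)
print(t_obs)  # points 5 and 6 share the same ratio vector, so the mss merges them
print(M_T)
```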

For ancillary statistic A for model \(M=(\mathcal {X},\{P_{\theta ,X}:\theta \in {\Theta }\})\) we write \(M_{\vert A(x)}=(\mathcal {X},\{P_{\theta ,X\mid A(x)} :\theta \in {\Theta }\})\) for the family of derived conditional distributions on \(\mathcal {X}\) obtained by conditioning on the event specified by A(x). The discussion in Section 3 about ancillarity then leads to the following modified conditionality principle where again we state a general version of the principle that can be applied to relate (or not) any inference bases.

Stable Conditionality Principle

(SC) The inference bases (M1,x1) and (M2,x2), with minimal sufficient statistics Ti and laminal ancillaries \(L_{i}\in \mathcal {A}_{M_{i}}\) respectively, are equivalent under SC whenever there is a 1-1, onto function \(h:\mathcal{T}_{2}\rightarrow\mathcal{T}_{1}\) such that T1 = h ∘ T2 and

$$ (M_{1,T_{1}\mid L_{1}(T_{1}(x_{1}))},T_{1}(x_{1}))=(M_{2,h(T_{2})\mid L_{2}(T_{2}(x_{2}) )},h(T_{2}(x_{2}))). $$
(2)

For example, if model \((\mathcal {X},\{P_{X\mid \theta }:\theta \in {\Theta }\})\) has mss T and laminal ancillary \(L\in \mathcal {A}_{M},\) then observations \(x,y\in \mathcal {X}\) satisfying T(x) = T(y), together with the conditional model, contain the same evidence about 𝜃true, i.e.,

$$ Ev(\mathcal{X},\{P_{\theta,X\vert L(T(x))}:\theta\in{\Theta}\},x)=Ev(\mathcal{X} ,\{P_{\theta,X\vert L(T(y))}:\theta\in{\Theta}\},y) $$

where the function h is just the identity in this case.

It follows from SC that

$$ Ev(\mathcal{X},\{P_{\theta,X}:\theta\in{\Theta}\},x)=Ev(\mathcal{T} ,\{P_{\theta,T\vert L(T(x))}:\theta\in{\Theta}\},T(x)) $$

and this is undoubtedly the most important application of the principle. This leads to the definition of the stable conditionality evidence function EvSC given by

$$ Ev_{SC}(\mathcal{X},\{P_{\theta,X}:\theta\in{\Theta}\},x)=(\mathcal{T} ,\{P_{\theta,T\vert L(T(x))}:\theta\in{\Theta}\},T(x)) $$
(3)

for, say, the canonical mss T, although any other equivalent version of the mss could be used.
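The evidence function EvSC can also be computed by brute force in a finite model: reduce to the mss, enumerate the ancillaries that are functions of the mss, take the laminal, and condition. The following sketch, in Python, does this for the same hypothetical two-distribution model on {1,…,6} used in the earlier sketches (illustrative probabilities only).

```python
from fractions import Fraction as F

# Hypothetical model on {1,...,6}; illustrative probabilities only.
model = {
    "theta1": {1: F(1, 20), 2: F(4, 20), 3: F(3, 20), 4: F(2, 20), 5: F(4, 20), 6: F(6, 20)},
    "theta2": {1: F(2, 20), 2: F(3, 20), 3: F(2, 20), 4: F(3, 20), 5: F(4, 20), 6: F(6, 20)},
}
thetas = sorted(model)

def T(x):  # mss via the likelihood ratio vector relative to theta1
    return tuple(model[th][x] / model[thetas[0]][x] for th in thetas)

t_vals = sorted({T(x) for x in model[thetas[0]]})
M_T = {th: {t: sum(p[x] for x in p if T(x) == t) for t in t_vals}
       for th, p in model.items()}   # the model induced by the mss

def partitions(items):
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def is_ancillary(m, part):
    return all(len({sum(p[t] for t in cell) for p in m.values()}) == 1 for cell in part)

def coarser(p, q):  # p is a coarsening of q
    return all(any(set(cq) <= set(cp) for cp in p) for cq in q)

# ancillaries for M_T, i.e., ancillaries that are functions of the mss
anc = [p for p in partitions(t_vals) if is_ancillary(M_T, p)]
maximal = [p for p in anc if not any(q != p and coarser(p, q) for q in anc)]
minimal = [p for p in anc if all(coarser(p, q) for q in maximal)]
laminal = [p for p in minimal if all(coarser(q, p) for q in minimal)][0]

def Ev_SC(x):
    # (3): the conditional model for the mss given the laminal cell containing T(x),
    # together with the observed value of the mss
    cell = next(c for c in laminal if T(x) in c)
    cond = {th: {t: q[t] / sum(q[s] for s in cell) for t in cell}
            for th, q in M_T.items()}
    return cond, T(x)

print(laminal)    # a two-cell laminal ancillary for this model
print(Ev_SC(1))
```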

It is necessary to prove that SC is an equivalence relation on the set of all inference bases as part of establishing that EvSC is a valid characterization of statistical evidence.

Proposition 3.

SC is an equivalence relation on the set of inference bases.

It is obvious that, as relations on the set of all inference bases, SC ⊂ C. The fact that SC is an equivalence relation establishes that this containment is proper, because it has been established that C is not an equivalence relation; see Evans (2013) or Evans (2015), Lemma 3.3.4. It has also been shown in these references that the smallest equivalence relation containing C is L. So an interesting consequence of Proposition 3 is that L cannot be obtained from SC in this way.

Similarly, the same references establish that the relation given by S ∪ C is not an equivalence relation, and the proof of Birnbaum’s Theorem establishes that the smallest equivalence relation containing S ∪ C is L. In this case the following establishes that S ⊂ SC, so Birnbaum’s Theorem does not follow from S and SC.

Proposition 4.

As relations on the set of all inference bases SSC.

Note that SC only requires that the conditional models \(M_{i,T_{i}\mid L_{i}(T_{i}(x_{i}))}\) be effectively the same for given xi and this does not imply that the unconditional models \(M_{i,T_{i}}\) are effectively the same so we cannot conclude that SCS. We do have, however, that the conditional inference bases are equivalent under S.

Proposition 5.

If (M1,x1) and (M2,x2) are equivalent under SC, then the conditional inference bases \((M_{1,T_{1}\mid L_{1}(T_{1}(x_{1}))},T_{1}(x_{1}))\) and \((M_{2,h(T_{2})\mid L_{2}(T_{2}(x_{2}))},h(T_{2}(x_{2})))\) are equivalent under S.

The following result demonstrates that the evidence function EvSC is the ultimate presentation of the evidence based upon S and SC. The symbol ∘ refers to the composition of relations.

Proposition 6.

For data x and model \(M=(\mathcal {X} ,\{P_{\theta ,X}:\theta \in {\Theta }\}),\) since MSSC, the evidence function defined by (3) satisfies EvSC = EvMSEvSC = EvSCEvMS.

So the evidence function that results from the two principles can be unambiguously defined as the inference base containing both the observed value of the mss and the collection of conditional distributions of the mss, given the laminal ancillary function of the mss, as indexed by the model parameter.

The consequence of this development is that the application of the two principles can be thought of unambiguously as a function on the set of all inference bases. It is not clear that there shouldn’t be further reductions in \((\mathcal {T},\{P_{\theta ,T\vert L(T(x))}:\theta \in {\Theta }\},T(x))\) to remove ingredients that are still extraneous to the expression of the evidence concerning 𝜃true, but at this point it is not obvious what form those would take.

Also, statistical evidence is ultimately expressed as part of answering statistical questions. For example, what is the appropriate estimate of ψtrue = Ψ(𝜃true) and how accurate is it or is there evidence for or against a hypothesis H0 : Ψ(𝜃true) = ψ0 and how strong is this evidence? Simply stating an inference base does not answer such questions but at least it does tell us what to focus on when devising the answer.

An application of these principles can be given to what is perhaps the archetypal example that has supplied much of the intuition underlying the necessity of conditioning on an ancillary statistic.

Example 4.

Two distinct sampling regimes as determined by an ancillary.

Consider the two measuring instruments example discussed in Cox (1958). A sample x = (x1,…,xn) is obtained from either the model \(\{N(\mu ,{\sigma _{1}^{2}}):\mu \in \mathbb {R} ^{1}\}\) or the model \(\{N(\mu ,{\sigma _{2}^{2}}):\mu \in \mathbb {R}^{1}\}\) where the variances are known and reflect the inherent accuracy of two possible measuring instruments. The instrument used is determined by a coin toss, before the data x is observed, where i = 1 occurs with known probability p1 and i = 2 occurs with probability p2 = 1 − p1. The full observed data is (i,x) and clearly A(i,x) = i is ancillary. Given that it is known which measuring instrument is used, it seems necessary to condition on this, as the accuracy of the inferences will be quite different when the variances are quite different. For example, if \({\sigma _{1}^{2}}\ll {\sigma _{2}^{2}}\), then the variance of \(\bar {x},\) based on the mixture model \(p_{1}N_{n}(\mu 1_{n},{\sigma _{1}^{2}}I_{n})+p_{2}N_{n}(\mu 1_{n},\sigma _{2}^{2}I_{n}),\) is \((p_{1}{\sigma _{1}^{2}}+p_{2}{\sigma _{2}^{2}})/n\) and, if i = 1 is observed, this will be much greater than \({\sigma _{1}^{2}}/n,\) at least when p2 is not too small. The principle C suggests that conditioning on A(i,x) is the correct analysis and this example plays a key role in justifying C more generally.
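A short Monte Carlo check of this variance comparison is given below, in Python, with illustrative values of p1, σ1, σ2 and n; none of these numbers come from Cox (1958).

```python
import random

# Illustrative values only: p1, sigma1, sigma2 and n are not taken from the example above.
random.seed(1)
mu, n = 0.0, 10
sigma = {1: 0.1, 2: 2.0}   # sigma1^2 much smaller than sigma2^2
p1 = 0.5

def one_run():
    i = 1 if random.random() < p1 else 2                      # coin toss picks the instrument
    xbar = sum(random.gauss(mu, sigma[i]) for _ in range(n)) / n
    return i, xbar

runs = [one_run() for _ in range(100000)]
xbars_all = [xb for _, xb in runs]
xbars_i1 = [xb for i, xb in runs if i == 1]

var = lambda v: sum(x * x for x in v) / len(v) - (sum(v) / len(v)) ** 2
print(var(xbars_all))  # close to (p1*sigma1^2 + p2*sigma2^2)/n = 0.2005
print(var(xbars_i1))   # close to sigma1^2/n = 0.001, the conditional analysis given i = 1
```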

The model for the data (i,x) has likelihood

$$ L(\mu \vert i,x) = c\exp\{-n(\bar{x}-\mu)^{2} /2{\sigma_{i}^{2}}\}. $$

Therefore, \((i,\bar {x})\) is sufficient. Since \(\bar {x}\) maximizes \(\log L(\mu \vert i,x),\) with second derivative at \(\mu =\bar {x}\) given by \(-n/{\sigma _{i}^{2}}\) then, assuming \({\sigma _{1}^{2}}{\neq \sigma _{2}^{2}},\) the value of \((i,\bar {x})\) can be recovered from the likelihood and so is minimal sufficient. Note that the situation where \({\sigma _{1}^{2}}={\sigma _{2}^{2}}\) is not relevant here as then the two measuring instruments have the same characteristics and conditioning plays no role. Clearly, the unique maximal ancillary based on the mss \((i,\bar {x})\) is i and so this is also the laminal ancillary. Conditioning on this statistic gives \(\bar {x}\sim N(\mu ,{\sigma _{i}^{2}}/n)\) for the mss and so the intuitively correct basis for inference about μ is obtained.

As another example consider a situation where x1,x2… are i.i.d. Bernoulli (𝜃) with 𝜃 ∈ (0,1) unknown. Suppose that there are two possible sampling regimes and the actual one used is determined by a coin toss as before, where model 1 corresponds to n fixed and model 2 corresponds to negative binomial sampling with k (the number of 1’s observed before stopping) fixed. Let N(i) denote the observed sample size. Note that, before observing the data, N(1) = n is known but \(N(1)\bar {x}\) is not known while \(N(2)\bar {x}=k\) is known but N(2) is unknown. The likelihood is given by

$$ L(\theta \vert i,x)=c\theta^{N(i)\bar{x}}(1-\theta)^{N(i)(1-\bar{x})}. $$

Since the likelihood is determined by \((N(i),N(i)\bar {x})\) it is sufficient. Also, \(\log L(\theta \vert i,x)\) is maximized at \(\bar {x}\) with second derivative at the maximum given by \(-N(i)/\bar {x}(1-\bar {x})\) so N(i) can also be recovered from the likelihood. This shows that \((N(i),N(i)\bar {x})\) is minimal sufficient. In this case, however, \((i,\bar {x})\) is not generally minimal sufficient and that is because, if \((N(i),N(i)\bar {x})=(n,k),\) then i cannot be recovered. This situation corresponds to the well-known example that shows that the likelihood principle leads to ignoring the sampling rule for inference. This can only occur, however, when kn. If it is required that k > n, then i is always recoverable from \((N(i),N(i)\bar {x})\) and so \((i,\bar {x})\) is minimal sufficient with i the unique maximal ancillary and is thus the laminal ancillary as well.

The problem arises here because the mss discards the information as to which sampling regime has been used, whenever \((N(i),N(i)\bar {x})=(n,k).\) This is similar to the situation encountered in the proof in Birnbaum (1962) that the likelihood principle follows from sufficiency and conditionality. In that context, the sufficient statistic used in the proof discards precisely the information that the conditionality principle invokes for conditioning. Durbin (1970) proposed always reducing first to the mss before invoking conditionality and this does void Birnbaum’s proof.

This example demonstrates that Durbin’s restriction does not avoid conflicts between S and C for frequentist inferences, as for confidence intervals for 𝜃 it is necessary to take into account the actual sampling regime used. One possibility for a resolution of this issue is to consider the situation where the two sampling regimes have different 𝜃 parameters, say 𝜃 and 𝜃(1 − δ) where δ > 0 is known and small. In that case the mss never takes the same value for the two sampling regimes and i is always recoverable from the mss. One could then argue that inference should be continuous in δ and so conditioning on i is always appropriate. This would require a significant modification of a conditionality principle, however, and this is not pursued further here.

For Bayesian inferences about 𝜃 the issues around ancillarity pose no difficulties as these do not depend on the sampling rule. If we consider model checking as part of good practice for any approach to statistical analysis, then this can be based on the conditional distribution given the mss which does involve the sampling plan used and so this information is not simply discarded. For example, when \((N(i),N(i)\bar {x})\neq (n,k),\) then the conditional distribution of (i,x) is uniform on the set of possible sequences arising from binomial sampling when i = 1 and is uniform on the set of possible sequences arising from negative binomial sampling when i = 2. If \((N(i),N(i)\bar {x})=(n,k),\) then the conditional distribution of (i,x) assigns the probabilities

$$ p_{1}/\left\{ p_{1}\binom{n}{k}+(1 - p_{1})\binom{n-1}{k-1}\right\} \text{ and }(1-p_{1})/\left\{ p_{1}\binom{n}{k} + (1-p_{1})\binom{n-1}{k-1}\right\} $$

to each of the possible sequences arising from binomial sampling and negative binomial sampling, respectively. If the observed sequence x = (x1,…,xn) is such that xn≠ 1, then this check categorically eliminates negative binomial sampling as there are no such sequences. When xn = 1 then this check assigns probability p1/(1 + p1(n − k)/k) to binomial sampling which is small when p1 is small or n is large relative to k. For a Bayesian analysis the goal in model checking is not to distinguish between the two sampling regimes, but rather to assess whether or not the observed data is reasonable given the stated model. A runs test based on these probabilities would then be in order.
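One way to reproduce the quoted value, taken here as an assumption about the intended reading, is to sum the per-sequence probability in the last display over the C(n − 1, k − 1) binomial sequences that end in a 1. The following sketch, in Python with illustrative n, k and p1, checks that this sum equals p1/(1 + p1(n − k)/k).

```python
from math import comb
from fractions import Fraction as F

# Illustrative values only.
n, k, p1 = 10, 4, F(1, 2)

denom = p1 * comb(n, k) + (1 - p1) * comb(n - 1, k - 1)
per_binomial_seq = p1 / denom          # probability of each sequence attributed to binomial sampling

# total conditional probability of the C(n-1, k-1) binomial sequences ending in a 1
prob_binomial = comb(n - 1, k - 1) * per_binomial_seq

print(prob_binomial)                   # 2/7 for these values
print(p1 / (1 + p1 * F(n - k, k)))     # the quoted expression, also 2/7
```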

A feature of the problems discussed in Example 4 is that sometimes there is only one maximal ancillary that is a function of the mss, which is perhaps what misleads one into accepting the standard conditionality principle C. In more general problems, such as Example 3, however, the experiment may be rich enough that there are several maximal ancillary functions of the mss, e.g., as with A1 and A2 of Example 3. In these more general problems, the reasoning behind the stable conditionality principle leads to the (stable) evidence of Proposition 6 being not just the observed value of the mss, but also the conditional distribution of the mss given the laminal ancillary function of the mss.

A further aspect of the two measuring instruments example discussed in Example 4 is that, as Cox (1958) points out, there is a conflict between conditioning and decision-theoretic criteria for determining correct statistical procedures. In particular, the optimal unconditional test for the hypothesis H0 : μ = 0 is not the same as the optimal conditional test. The role of conditioning in decision-theoretic approaches to statistics would appear to be an unresolved issue at this time.

5 Conclusions

Various ambiguities have raised doubts about the possibility of a successful theory of frequentist inference. Birnbaum’s theorem, in which S and C seemingly imply L, and the argument that C alone implies L, are but two examples. While the validity of these conclusions has been challenged, consideration of these results still raises concerns as to what the correct applications of the principles are. For S this is undoubtedly discarding all aspects of the inference base that are extraneous to expressing the evidence about 𝜃true, and this leads to the principle as expressed by Durbin (1970) together with the evidence function EvMS which we add to the development. For C our thesis is that the fundamental idea underlying the principle is better expressed by SC and the evidence function EvSC, as this removes the ambiguity about which ancillary to condition on and avoids any contradictions in the justification for the irrelevance of the distribution of the ancillary. While the laminal ancillary may often be trivial, namely, a function constant on the sample space, it seems clear that we have to accept the verdict that conditioning on any ancillary other than the laminal is not appropriate. The results developed here have shown that the principles S and SC are mutually compatible and satisfy the basic requirement of any statistical principle by inducing equivalence relations on the set of all inference bases. As such the logical and statistical inconsistencies in the definition of C have been avoided.

It is true that the stable conditionality principle proposed here is, in part, mathematically supported by the taxonomy results in Basu (1959). The present paper shows, however, that conditioning on stable ancillaries removes the logical inconsistencies of the standard conditionality principle and provides a coherent framework for the assessment of statistical evidence.

Certainly this is not the end of the story concerning the concept of statistical evidence and how it should be measured and expressed, but our hope is that clarifying the roles of two key principles contributes to a more solid foundation for statistics.