1 Introduction

Significance testing is widely used across the natural and social sciences. Given its popularity in scientific practice, it might come as a surprise that significance testing has attracted severe criticism in both the statistical and philosophical literature. For instance, the relationship between significance testing and Bayesian inference as illustrated by Lindley’s paradox has led to an ongoing discussion (e.g., Sprenger 2013; Spanos 2013; Robert 2014). Further, the relationship between significance tests and effect size has been subject to criticism (McCloskey and Ziliak 1996; Ziliak and McCloskey 2008). In addition, significance testing has been criticised on the grounds that p-values depend on unobserved data (Wagenmakers 2007) and that their interpretation is problematic (Trafimow 2003). This paper is concerned with an objection made by Sober (2008): the claim that significance testing violates the Principle of Total Evidence (PTE). If significance testing violates an independent and widely accepted methodological principle, then this would constitute a forceful criticism as it does not rely on the prior commitment to a particular statistical methodology.

I will offer a limited defence of significance testing against Sober’s objection. My argument proceeds in two steps. First, I will show that the application of PTE requires the prior specification of a criterion for evidential assessment. Second, I will demonstrate that when a plausible criterion for evidential assessment is presupposed, using p-values for inductive inference does not violate PTE for a large and important class of significance tests. In particular, I will argue that p-values violate PTE for two-sided tests but satisfy PTE for one-sided tests with a sufficient test statistic from likelihoodist, Bayesian and error-statistical perspectives. Along the way, I will also shed some light on how PTE should be read. Given the importance of significance testing in scientific practice, it should be emphasised that I do not aim to defend the use of p-values tout court. Every particular objection against significance testing merits careful investigation. Here, the focus is on the relationship between significance testing and PTE.

Before turning to Sober’s argument, some terminology has to be introduced. Suppose one is interested in the mean adult size of a certain fish species. In order to infer the mean size in this species, one takes measurements of a particular fish population in a pond. The size measurements constitute a random sample \(X = (X_1, X_2, \ldots, X_n)\) of size n. The random variables \(X_i\) are assumed to be independent and normally distributed with unknown mean μ and known standard deviation σ = 1. Now, suppose one would like to test the hypothesis \(H_0\), referred to as the ‘null hypothesis’ by statisticians, asserting that the mean μ is equal to, say, 4 cm (i.e., \(H_0: \mu = 4\)). In order to measure the discrepancy between the parameter value of the mean postulated by the null hypothesis and the sample mean, a test statistic has to be specified. A canonical choice is the test statistic \(\tau(X)=\sqrt{n}(\bar{X} - \mu_0)/\sigma\), where \(\bar{X}\) is the sample mean and \(\mu_0\) equals 4. As a result, the test statistic τ(X) follows the standard normal distribution under the null hypothesis. After observing a sample realisation x, a significance tester then calculates the ‘p-value’, formally defined as \(P(\tau(X) \geq \tau(x); H_0 \text{ is true})\) for a one-sided test. That is, the p-value is the probability of observing a sample realisation that would have given rise to a value of the test statistic equal to or larger than the one actually observed. While a one-sided test examines only deviations in one direction from the null hypothesis, a two-sided test takes deviations in both directions into account. In the two-sided case the p-value is therefore given by \(P(|\tau(X)| \geq |\tau(x)|; H_0 \text{ is true})\).
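For readers who prefer to see these definitions at work, here is a minimal numerical sketch of the fish example; the sample size, true mean and random seed are illustrative assumptions rather than values taken from the text:

```python
import numpy as np
from scipy import stats

# Illustrative setup (not from the paper): n = 25 fish, true mean 4.3 cm.
rng = np.random.default_rng(1)
n, mu_0, sigma = 25, 4.0, 1.0
x = rng.normal(loc=4.3, scale=sigma, size=n)   # sample realisation

tau = np.sqrt(n) * (x.mean() - mu_0) / sigma   # test statistic, N(0,1) under H0

p_one_sided = stats.norm.sf(tau)               # P(tau(X) >= tau(x); H0 is true)
p_two_sided = 2 * stats.norm.sf(abs(tau))      # P(|tau(X)| >= |tau(x)|; H0 is true)
print(p_one_sided, p_two_sided)
```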

Having calculated the p-value, the question of what to do next arises. At this stage there are two different approaches to significance testing within the camp of frequentist statistics. One school of thought, tracing back to Fisher (1925), considers the p-value as a measure of the strength of evidence for or against the null hypothesis: the smaller the p-value, the less plausible the null hypothesis. Statisticians in this tradition reluctantly specify particular thresholds according to which the data are evidence for (or against) the null hypothesis. Based on some early writings by Fisher, Spanos (1999, 690) offers the following rules of thumb, while maintaining that they can be criticised as ad hoc and unwarranted:

  • p-value > 0.1 indicates strong support for \(H_0\)

  • 0.05 < p-value < 0.1 indicates some support for \(H_0\)

  • 0.01 < p-value < 0.05 indicates lack of support for \(H_0\)

  • p-value < 0.01 indicates strong lack of support for \(H_0\)

An alternative approach to significance testing is more closely related to the decision-theoretic framework associated with Neyman and Pearson (1933). Here, a significance test is specified such that the probability of rejecting a true null hypothesis, denoted by α, is fixed at some small number, usually 0.05 or 0.01, which is called the ‘significance level’ of the test. If the p-value is smaller than α, then the null hypothesis is rejected. Otherwise the null hypothesis is not rejected.
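The two uses of the p-value can be kept apart cleanly in a short sketch; the category labels simply transcribe Spanos’s rules of thumb above, and the significance level 0.05 is the conventional choice:

```python
def spanos_assessment(p: float) -> str:
    """Fisherian reading: map a p-value to Spanos's rules of thumb."""
    if p > 0.1:
        return "strong support for H0"
    if p > 0.05:
        return "some support for H0"
    if p > 0.01:
        return "lack of support for H0"
    return "strong lack of support for H0"

def np_decision(p: float, alpha: float = 0.05) -> str:
    """Decision-theoretic reading: reject H0 iff p < alpha."""
    return "reject H0" if p < alpha else "do not reject H0"

print(spanos_assessment(0.03), "|", np_decision(0.03))
```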

Sober (2008) objects that using p-values for inductive inference violates PTE. When calculating p-values one considers a disjunction of events, in which the actual event is one of the disjuncts and, hence, uses a logically weaker description of the observed data. In Sober’s own words:

Fisher’s test of significance [...] has the additional defect that it violates the principle of total evidence. In a significance test, the hypothesis you are testing is called the “null” hypothesis, and your question is whether the observations are sufficiently improbable according to the null hypothesis. However, you don’t consider the observations in all their detail but rather the fact that they fall in a certain region. You use a logically weaker rather than a logically stronger description of the data. (Sober 2008, 53)

While both the evidentialist (or ‘Fisherian’) and the decision-theoretic approach to significance testing invoke the concept of a p-value, Sober’s objection applies in different ways. In the case of the Fisherian approach, Sober’s objection applies directly, as the notion of evidential support characterised by Spanos’s scheme is based on the p-value. In contrast, Sober’s objection applies to the decision-theoretic approach in an indirect way; it requires a principle connecting accept/reject decisions with the notion of evidence. One such principle is given by Sober:

If learning that e is true justifies you in rejecting (i.e., disbelieving) the proposition P, and you were not justified in rejecting P before you gained this information, then e must be evidence against P. If learning that e is true justifies you in accepting (i.e., believing) the proposition P, and you were not justified in rejecting P before you gained this information, then e must be evidence for P. (Sober 2008, 5)

The details of such a principle are not of concern here. What matters is that rejection needs to be understood as a form of ‘evidential rejection’ for Sober’s objection to apply to the decision-theoretic approach to significance testing.

2 Interpreting PTE

PTE is regularly invoked in philosophical discussions of scientific method. For instance, it has been argued that consensus methods in phylogenetic inference are in conflict with PTE (Barrett et al. 1991). Further, meta-analysis in medicine has been criticised on the grounds that it violates PTE (Stegenga 2011). In order to assess whether significance tests violate PTE, it has to be asked what this principle asserts in the first place. I will approach this question in an iterative manner by refining the interpretation of PTE in a number of steps. Sober (2008, 41) describes PTE as a ‘pragmatic’ principle, asserting that you should take account of everything you know. The roots of this principle can be traced back to Carnap’s inductive logic. Inductive logic aims to assign an objective probability, called ‘degree of confirmation’, to a hypothesis based on the relationship between hypothesis and evidence. In this context Carnap introduces what he calls the ‘requirement of total evidence’:

In the application of inductive logic to a given knowledge situation, the total evidence available must be taken as basis for determining the degree of confirmation. (Carnap 1962, 211)

Synthesizing Sober’s and Carnap’s remarks, a first interpretation of PTE, denoted as PTE1, could then read like this: Take into account all available information when making inferences about a hypothesis of interest.

In order to assess the merits of PTE1, let us return to the fish example introduced earlier. Following PTE1, one should take into account all available information when making inferences regarding the mean adult fish size. One problem with PTE1 is that in any real-life situation it is unclear what the term ‘all available information’ amounts to. There is no such thing as the logically strongest data set. We can always add further attributes to the description of the data set. For instance, we can enrich the description of the data set containing the measurements of the fish population by noting whether, say, the fish were difficult to catch, whether it was raining, and whether Chelsea FC played that day.

Given the problems with the notion of a logically strongest data set aiming to capture ‘all available information’ in an inference situation, an obvious remedy is to formulate PTE in terms of a contrastive principle. The second reading of PTE, denoted as PTE2, therefore reads as follows: Suppose data \(d_1\) are strictly logically stronger than data \(d_2\); then one should use data \(d_1\) when making inferences about the hypothesis of interest.

While PTE2 is more satisfactory than PTE1, it still has consequences that will strike many readers as counterintuitive. In particular, PTE2 implies that we are always doing something wrong when we use a logically weaker data set, which seems false. It seems uncontroversial that PTE only requires using relevant information. So, using a strictly logically weaker data set is unproblematic if the additional information in the logically stronger data set is irrelevant. Sober writes:

Although the principle of total evidence says that you must use all the relevant evidence you have, it does not require the spilling of needless ink. It does not require you to record irrelevant information. (Sober 2008, 44, my italics)

In a similar vein, Carnap (1962, 211) distinguishes between relevant and irrelevant evidence and demands either that an agent knows “nothing beyond [evidence] e or that the totality of his additional knowledge i be irrelevant for [hypothesis] H with respect to e”. Both Sober’s and Carnap’s refinements of PTE point to a third reading, asserting that one should take into account all relevant information when making inferences regarding a hypothesis. Again, it is preferable to phrase PTE in terms of a comparative claim (denoted as PTE3): Suppose data \(d_1\) are strictly logically stronger than data \(d_2\); then one should use data \(d_1\) if the additional information contained in \(d_1\) is relevant for the inference at hand.

PTE3 naturally raises the question of how to establish whether the strictly logically stronger data are relevant for the inference at hand. Again, the existing literature offers some insights. Suppose data \(d_1\) are strictly logically stronger than data \(d_2\). Carnap’s criterion for establishing that \(d_1\) is relevant for hypothesis H given \(d_2\) requires checking whether changing between \(d_1\) and \(d_2\) changes the degree of confirmation of H. Obviously, Carnap’s relevance criterion is formulated in terms of his inductive logic. Abstracting from the details of Carnap’s account leads to the following, more general relevance criterion (denoted as RC): data \(d_1\) are relevant for hypothesis H given data \(d_2\) (with \(d_1 \Rightarrow d_2\) and \(d_2 \nRightarrow d_1\)) if and only if using \(d_1\) rather than \(d_2\) changes the evidential assessment.

How can RC be put into practice? I will argue that applying RC presupposes what I will call a ‘criterion for evidential assessment’ (or ‘theory of evidence’ for short). Here, a criterion for evidential assessment refers to any account that specifies conditions under which some data d provide evidential support for a hypothesis H. As understood here, a criterion for evidential assessment is generic in character and supposed to capture a variety of philosophical and statistical accounts of evidence. According to the Bayesian theory of evidence, for instance, data d provide evidential support for hypothesis H if and only if the posterior probability of H exceeds the prior probability of H:

Data d are evidence for hypothesis H if and only if P(H|d) > P(H).

Similarly, the law of likelihood (LL) (Hacking 1965) qualifies as a criterion for evidential assessment even though it warrants only contrastive evidential claims. That is, LL establishes conditions under which some data d provide evidential support for one hypothesis \(H_1\) over another hypothesis \(H_2\):

Data d favour hypothesis \(H_1\) over hypothesis \(H_2\) if and only if \(P(d|H_1) > P(d|H_2)\).

A third prominent theory of evidence is provided by Mayo (1996). Mayo suggests that data d are evidence for hypothesis H just in case H passes what she calls a ‘severe test’ with d. Hypothesis H passes a severe test with d if and only if a) d ‘fits’ or ‘agrees with’ H (with some suitable notion of ‘fit’) and b) there is a low probability that the test would have produced a result that fits H at least as well as d does, if H were false.
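To make the generic character of a criterion for evidential assessment concrete, the three accounts can be written as predicates with a common shape. This is only a schematic sketch: the severity clause is reduced to its tail-probability condition b), and the 0.05 threshold is an illustrative assumption:

```python
def bayesian_evidence(posterior: float, prior: float) -> bool:
    """d is evidence for H iff P(H|d) > P(H)."""
    return posterior > prior

def law_of_likelihood(lik_h1: float, lik_h2: float) -> bool:
    """d favours H1 over H2 iff P(d|H1) > P(d|H2)."""
    return lik_h1 > lik_h2

def severe_test(fits: bool, p_fit_if_false: float, threshold: float = 0.05) -> bool:
    """H passes a severe test with d iff d fits H (condition a) and the
    probability of a fit at least this good, were H false, is low (condition b)."""
    return fits and p_fit_if_false < threshold
```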

Having introduced the notion of a theory of evidence, I am in a position to state my preferred reading of PTE, denoted as PTE4. The principle reads as follows:

Suppose data \(d_1\) are strictly logically stronger than data \(d_2\); then an inference about hypothesis H should be based on \(d_1\) if changing between \(d_1\) and \(d_2\) changes the evidential assessment.

Alternatively, PTE4 can be formulated in terms of the notion of relevance captured by RC: Suppose data \(d_1\) are strictly logically stronger than data \(d_2\); then an inference about hypothesis H should be based on \(d_1\) if data \(d_1\) are relevant for H given \(d_2\). As discussed, a theory of evidence has to be presupposed in order to apply PTE4.

The function of PTE4 can best be illustrated by means of an example. Suppose we evaluate evidential claims within a likelihoodist framework. We observe ten coin tosses. It is assumed that the tosses are independent and each toss follows a Bernoulli distribution with parameter p denoting the probability of ‘heads’. The hypotheses under consideration are \(H_1: p = 0.5\) and \(H_2: p = 0.6\). We are given the following three descriptions of the observational data:

  • \(d_1\) = (H, H, T, H, T, T, T, H, H, H)

  • \(d_2\) = 6 × H, 4 × T

  • \(d_3\) = (H, H, T, H, T, T, T, (H∨T), (H∨T), (H∨T))

That is, data \(d_1\) contain the outcomes of the ten coin tosses in their temporal order, data \(d_2\) only note the frequencies of the events ‘heads’ and ‘tails’, and data \(d_3\) record the outcomes of the first seven tosses but only tell us that the last three tosses have occurred, not what their outcomes were. As a result, \(d_1\) strictly logically entails both \(d_2\) and \(d_3\). Since both hypotheses assign probabilities to all three data sets, we do not need to invoke any further assumptions in order to specify the probability measure required for applying LL. Suppose we start with data \(d_3\). According to LL, the data favour hypothesis \(H_1\) over hypothesis \(H_2\) since \(P(d_3|H_1) > P(d_3|H_2)\). Does PTE4 prescribe using the strictly logically stronger data \(d_1\) when making inferences regarding the two hypotheses of interest? The answer is yes, since data \(d_1\) favour hypothesis \(H_2\) over hypothesis \(H_1\) and, hence, change the (qualitative) evidential assessment. Now, suppose we start with data \(d_2\). Data \(d_2\) favour hypothesis \(H_2\) over hypothesis \(H_1\). Hence, the evidential assessment remains unchanged if we move from data \(d_2\) to data \(d_1\): both data sets favour the same hypothesis. As a result, PTE4 does not force us to operate on the logically stronger data set in this case.
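The likelihood calculations behind this verdict are elementary and can be checked directly; the following sketch reproduces the three comparisons:

```python
from math import comb

def lik(p, heads, tails, orderings=1):
    """Probability of a data description: fully observed heads/tails contribute
    their probabilities; unreported (H or T) tosses contribute a factor of 1."""
    return orderings * p**heads * (1 - p)**tails

descriptions = {
    "d1": dict(heads=6, tails=4),                         # exact ordered sequence
    "d2": dict(heads=6, tails=4, orderings=comb(10, 6)),  # frequencies only
    "d3": dict(heads=3, tails=4),                         # first seven tosses only
}
for label, kw in descriptions.items():
    l1, l2 = lik(0.5, **kw), lik(0.6, **kw)
    print(f"{label}: favours {'H1' if l1 > l2 else 'H2'} ({l1:.6f} vs {l2:.6f})")
# d1 and d2 favour H2; d3 favours H1, so the extra information in d1 is relevant.
```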

At this stage one might think about further aspects that should be taken into account when formulating PTE. For instance, I have presumed that the data \(d_1\) and \(d_2\) are freely available and that they can be analysed without any difference in computational cost. These assumptions might not be warranted in a more general discussion of PTE. However, for the purpose of examining Sober’s argument against significance testing I set these issues aside.

3 Sober’s objection revisited

Having made the case for PTE4 as an adequate interpretation of PTE, I will now turn to the question of whether significance testing violates PTE, as Sober suggests. In order to assess what data set should be used for inductive inference in any particular application, PTE4 requires the prior specification of a theory of evidence. Without such a specification PTE4 cannot be applied and, hence, can neither be satisfied nor violated. The statistical framework that determines what counts as evidence is therefore primary to PTE. Sober, however, does not explicitly endorse a theory of evidence in his argument. In order to proceed, I will first adopt LL as the theory of evidence, given the central role of LL in Sober’s writings (e.g., Sober 2009). PTE4, however, does not force us to make this choice, as the principle is neutral regarding the question of what theory of evidence to adopt in the first place.

As PTE4 is concerned with prescribing the choice of data for inductive inference, the question is how this principle can be used to evaluate a statistical technique such as significance testing. A first answer might suggest comparing the data set used by the significance tester with the data set used by the likelihoodist. This suggestion, however, is problematic as both approaches start with the same data set, that is, a realisation of a random sample. So, there is no difference between the significance tester and the likelihoodist in this respect. In order to get Sober’s argument off the ground, we have to compare a different pair of data sets. Since Sober’s objection is concerned with the use of p-values for inductive inference, we will compare the realisation of the random sample used by the likelihoodist with a ‘data’ set containing only information about the p-value. In that case it is an open question whether changing between these two data sets affects the evidential assessment by means of LL. I will show that there is no universal conflict between PTE and the use of p-values for inductive inference. While violations do occur, there exists a large and important class of significance tests for which no conflict arises.

As an illustration, let us return to the test of the mean of a normal distribution with known variance (i.e., the ‘fish example’) introduced earlier. In that case the data are given by the realisation x of the random sample \(X = (X_1, X_2, \ldots, X_n)\), denoted as \(d_1\), and the p-value resulting from this sample realisation, denoted as \(d_2\). My argument proceeds in two steps. In a first step, I will show that the data can be weakened in accordance with PTE4 by moving from data \(d_1\) to data \(\tilde{d_1}\) consisting of a realisation of the sample mean \(\bar{X}\). In a second step, I will examine whether the data can be further weakened from data \(\tilde{d_1}\) to data \(d_2\). As it will turn out, the second step requires distinguishing between one-sided and two-sided tests.

The first step of modifying the problem by considering the logically weaker data \(\tilde{d_1}\) rather than data \(d_1\) is warranted since the sample mean \(T(X)=\bar{X}\) is a sufficient statistic for the mean of the normal distribution. Formally, any real-valued function \(T = r(X_1, X_2, \ldots, X_n)\) of the observations in the random sample is called a statistic. A statistic T is a sufficient statistic for parameter \(\theta\) if, for each t, the conditional distribution of \(X_1, X_2, \ldots, X_n\) given T = t and \(\theta\) does not depend on \(\theta\). Speaking informally, a sufficient statistic summarizes all the information in a random sample that is relevant for estimating the parameter of interest. In particular, summarizing the data by means of a sufficient statistic T(X) rather than the random sample X leaves the likelihood ratio within a class of hypotheses (here, hypotheses regarding the mean of the normal distribution) constant (Hacking 1965, 110). Hence, PTE4 does not demand using the strictly logically stronger data \(d_1\) rather than data \(\tilde{d_1}\) when the theory of evidence is provided by LL.
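This invariance can be verified numerically for the fish example: the likelihood ratio computed from the full sample coincides exactly with the ratio computed from the density of the sample mean alone. In the sketch below, the two hypothesised means (4.0 and 4.5) and the simulated sample are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, sigma = 25, 1.0
x = rng.normal(loc=4.2, scale=sigma, size=n)
mu_1, mu_2 = 4.0, 4.5                      # two hypotheses about the mean

def full_sample_lik(mu):
    """Likelihood of the complete random sample."""
    return np.prod(stats.norm.pdf(x, loc=mu, scale=sigma))

def sample_mean_lik(mu):
    """Likelihood of the sample mean alone: X-bar ~ N(mu, sigma^2/n)."""
    return stats.norm.pdf(x.mean(), loc=mu, scale=sigma / np.sqrt(n))

lr_full = full_sample_lik(mu_1) / full_sample_lik(mu_2)
lr_mean = sample_mean_lik(mu_1) / sample_mean_lik(mu_2)
print(lr_full, lr_mean)                    # agree up to floating-point error
```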

Next, we have to evaluate whether using data \(d_2\) rather than data \(\tilde{d_1}\) violates PTE4. I will show that there exists a one-to-one function between the p-value and the value of the sufficient statistic \(T(X) = \bar{X}\) in the case of the one-sided test but not in the case of the two-sided test. As a one-to-one function of a sufficient statistic is itself sufficient, the one-sided p-value is therefore a sufficient statistic for the mean of the normal distribution.

Let us consider the one-sided test first. Needless to say, there exists a mapping from the value of the sample mean \(\bar{X}\) to the p-value \(P(\tau(X) \geq \tau(x); H_0 \text{ is true})\), since the test statistic is defined as \(\tau(X)=\sqrt{n}(\bar{X} - \mu_0)/\sigma\). What about the opposite direction? Suppose we are given the p-value \(P(\tau(X) \geq \tau(x); H_0 \text{ is true})\) resulting from the realisation of the sample mean \(\bar{X}\). As the test statistic τ(X) follows a standard normal distribution under hypothesis \(H_0\), we can use a standard normal table to infer τ(x) from the p-value. From the definition of the test statistic \(\tau(X)=\sqrt{n}(\bar{X} - \mu_0)/\sigma\), we can then recover the realisation of the sample mean \(\bar{X}\) by simple algebraic transformations. So, there exists a function from the p-value to the value of the sufficient statistic \(\bar{X}\).
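The inversion just described amounts to composing the inverse survival function of the standard normal with the definition of τ. A sketch of the round trip, with an illustrative sample mean of 4.3:

```python
import numpy as np
from scipy import stats

n, mu_0, sigma = 25, 4.0, 1.0
x_bar = 4.3                                        # illustrative sample mean

# Forward: sample mean -> one-sided p-value.
p = stats.norm.sf(np.sqrt(n) * (x_bar - mu_0) / sigma)

# Backward: p-value -> tau(x) -> sample mean.
tau_recovered = stats.norm.isf(p)
x_bar_recovered = mu_0 + sigma * tau_recovered / np.sqrt(n)
print(x_bar_recovered)                             # 4.3, up to rounding
```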

Summing up, I have established a one-to-one function between the value of the sufficient statistic \(\bar{X}\) and the p-value. This implies that the one-sided p-value constitutes a sufficient statistic for the mean of the normal distribution. While Sober (2008, 45) stresses the importance of sufficiency in the context of PTE, he does not mention that for a large class of significance tests the p-value constitutes a sufficient statistic. By applying the same reasoning that warranted the use of data \(\tilde{d_1}\) rather than data \(d_1\), I conclude that using data \(d_2\) instead of data \(\tilde{d_1}\) does not violate PTE4.

It is worth pointing out that the argument developed here sits well with the result that one-sided p-values can be interpreted as likelihood ratios (DeGroot 1973). DeGroot shows that for a given null hypothesis \(H_0\), a set of alternative hypotheses \(H_1\) can be constructed such that the p-value of a one-sided test is numerically identical to the likelihood ratio of the null hypothesis and the family of alternative hypotheses. At the same time, my argument differs from DeGroot’s result. I have made no specific assumptions about the alternative hypothesis (or the family of alternative hypotheses) considered in a likelihood evaluation that would warrant drawing conclusions regarding the numerical equivalence between p-values and likelihood ratios. My argument holds for any alternative hypothesis about the mean of the normal distribution. This does not mean, however, that using p-values for inductive inference will yield the same conclusions as inferences by means of LL. In particular, I do not claim that p-values serve as a proxy for likelihood-based inferences. Rather, I argue that, from a likelihoodist perspective, there is no loss of relevant information when using the information contained in p-values as opposed to the original data set.

Returning to the discussion of Sober’s objection, matters are different in the case of the two-sided test. Here, the p-value is given by \(P(|\tau(X)| \geq |\tau(x)|; H_0 \text{ is true})\). As a result, the p-value does not stand in a one-to-one correspondence with the value of the sample mean \(\bar{X}\). Speaking graphically, learning the p-value does not tell us in which of the two tails of the normal distribution the realisation of the sample mean is to be found. Hence, there is no mapping from the two-sided p-value to the value of the sufficient statistic \(T(X)=\bar{X}\). It can then be shown that changing between data \(d_2\) and data \(\tilde{d_1}\) can lead to conflicting evidential assessments given LL (see ??). As a result, the use of p-values violates PTE in the two-sided case.
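The failure of invertibility can be exhibited concretely: two sample means placed symmetrically around \(\mu_0\) yield the same two-sided p-value, yet under LL one favours the alternative while the other favours the null. In the sketch below, the alternative mean 4.5 is an illustrative choice:

```python
import numpy as np
from scipy import stats

n, mu_0, mu_alt, sigma = 25, 4.0, 4.5, 1.0
se = sigma / np.sqrt(n)                    # standard error of the sample mean

for x_bar in (4.4, 3.6):                   # symmetric around mu_0
    tau = (x_bar - mu_0) / se
    p_two = 2 * stats.norm.sf(abs(tau))    # identical for both sample means
    lr = stats.norm.pdf(x_bar, mu_alt, se) / stats.norm.pdf(x_bar, mu_0, se)
    verdict = "favours H1" if lr > 1 else "favours H0"
    print(f"x_bar={x_bar}: p={p_two:.4f}, LR(H1:H0)={lr:.4f} -> {verdict}")
# Same two-sided p-value, opposite likelihoodist verdicts.
```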

Given that the choice between the one-sided and the two-sided test has implications for the question of whether the use of p-values violates PTE4, it is natural to ask which of the two tests statisticians should employ. The two-sided test is typically used to assess whether there is “some effect” in the data if the null hypothesis denotes, say, the absence of a difference between two treatments. However, Casella and Berger (1987, 106) critically remark that, in their experience, few experimenters are actually interested in the question of whether there is “some difference”. Rather, in many experiments there is a direction of interest, such as establishing that “the new treatment is better”, which renders the use of a two-sided test inappropriate. While the statistical issue of one-sided versus two-sided testing cannot be resolved in the current paper, it is clear that a one-sided p-value contains information about the direction of the effect, which is lost in the two-sided p-value. So, if the direction of the effect matters to the investigator, there is a prima facie reason for employing a one-sided test. One-sided tests therefore constitute an important class of significance tests.

4 Other theories of evidence

So far, the discussion has presupposed LL as the theory of evidence needed to apply PTE4. In order to complete the discussion of Sober’s argument, I will also consider the Bayesian and the error-statistical accounts of evidence. As it turns out, the conclusion will be the same: for the class of one-sided significance tests with a sufficient test statistic there is no conflict with PTE, while the use of two-sided tests violates PTE.

In order to relate the previous discussion to the analysis of the Bayesian account, the following observation is helpful: Suppose T = T(X) is a sufficient statistic for parameter \(\theta\) with parameter space Θ equal to an interval of real numbers. Then, for every possible prior probability density for \(\theta\), the posterior probability density of \(\theta\) given X = x depends on x only through T(x). No matter what prior one uses, one only has to consider the sufficient statistic for Bayesian inference, because the posterior distribution given T = T(x) is the same as the posterior distribution given the data X = x. As the p-value of a one-sided test invoking a sufficient statistic can itself be considered a sufficient statistic, conditioning on a data set containing information about the p-value is the same as conditioning on the data X = x. Hence, there is no conflict between the use of p-values and PTE4 for this class of significance tests from a Bayesian perspective.
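For the fish example this observation can be checked with a conjugate normal prior: the posterior computed from the full sample likelihood coincides with the posterior computed from the sample mean alone. The prior parameters in the sketch are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, sigma = 25, 1.0
x = rng.normal(loc=4.2, scale=sigma, size=n)
m0, s0 = 4.0, 2.0                                  # illustrative N(4, 4) prior on mu

mu = np.linspace(3.0, 5.5, 2001)                   # grid over the parameter

# Posterior on the grid from the FULL sample X = x.
log_full = stats.norm.logpdf(mu, m0, s0) \
    + stats.norm.logpdf(x[:, None], mu, sigma).sum(axis=0)
post_full = np.exp(log_full - log_full.max())
post_full /= post_full.sum()

# Posterior on the grid from the SAMPLE MEAN alone: X-bar ~ N(mu, sigma^2/n).
log_mean = stats.norm.logpdf(mu, m0, s0) \
    + stats.norm.logpdf(x.mean(), mu, sigma / np.sqrt(n))
post_mean = np.exp(log_mean - log_mean.max())
post_mean /= post_mean.sum()

print(np.max(np.abs(post_full - post_mean)))       # ~0: the posteriors coincide
```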

Again, it is important to stress that this argument differs from DeGroot’s (1973) and Casella and Berger’s (1987) results that under certain assumptions p-values can be interpreted as posterior probabilities. Analogous to the observation that p-values are numerically identical to likelihood ratios, DeGroot identifies improper priors for which a one-sided p-value and posterior probability match. Similarly, Casella and Berger demonstrate that for many classes of priors there is a close numerical relationship between the posterior probability of the null hypothesis and a one-sided p-value. In contrast, showing that from a Bayesian perspective the use of a one-sided p-value is not in conflict with PTE does not allow any inferences with regard to the numerical equality of p-values and posterior probabilities.

Turning to Mayo’s error-statistical account, an important difference from Bayesian and likelihoodist theories of evidence has to be noted right from the start. As the error statistician does not see a general problem in invoking tail probabilities for inductive inference, the relevant question is what kind of tail probability is suitable for evidential assessment. At the heart of the error-statistical theory is the quantitative measure of severity. In order to illustrate this tail probability, consider the following test scenario. Suppose a random variable is normally distributed with known variance and unknown mean μ. Further, suppose one wants to assess the severity with which the hypothesis \(H_0: \mu \leq \mu_0\) passes a test with the realisation of random sample X = x against the alternative \(H_1: \mu > \mu_0\). Again, the test statistic \(\tau(X)=\sqrt{n}(\bar{X} - \mu_0)/\sigma\) is employed to measure deviations from \(H_0\) in the direction of the alternative hypothesis \(H_1\). The severity with which \(H_0\) passes the test with data x is then defined as the probability that the test statistic would have taken a larger value if the alternative hypothesis \(H_1\) had been true:

$$SEV(\mu \leq \mu_{0})(x, H_{1})= P(\tau(X)>\tau(x); \mu > \mu_{0}).$$

Since the alternative hypothesis \(H_1\) consists of a continuum of point hypotheses, it is unclear, however, how to evaluate this probability from a frequentist perspective. Mayo and Spanos (2006) observe that \(SEV(\mu \leq \mu_0)(x, H_1)\) is bounded from below by the probability \(P(\tau(X) > \tau(x); \mu = \mu_0)\), which is the one-sided p-value of the point null hypothesis \(\mu = \mu_0\). As a result, there is a close mathematical relationship between severity and one-sided p-values.
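Numerically, the bound arises because \(P(\tau(X) > \tau(x); \mu)\) increases in μ, so its infimum over the alternative \(\mu > \mu_0\) is attained at the boundary, i.e., at the one-sided p-value. A sketch with an illustrative observed sample mean of 4.3:

```python
import numpy as np
from scipy import stats

n, mu_0, sigma = 25, 4.0, 1.0
x_bar = 4.3                                      # illustrative observation
tau_x = np.sqrt(n) * (x_bar - mu_0) / sigma

def prob_exceed(mu):
    """P(tau(X) > tau(x); mu): under mean mu, tau(X) ~ N(sqrt(n)(mu - mu_0)/sigma, 1)."""
    shift = np.sqrt(n) * (mu - mu_0) / sigma
    return stats.norm.sf(tau_x - shift)

for mu in (4.0, 4.1, 4.3, 4.5):
    print(f"mu={mu}: P(tau(X) > tau(x)) = {prob_exceed(mu):.4f}")
# The value at mu = mu_0 is the one-sided p-value; every mu > mu_0 exceeds it,
# so the p-value is a lower bound on the severity.
```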

In order to assess whether the use of p-values violates PTE from an error-statistical perspective, one has to ask whether changing from data \(d_1 = x\) to data \(d_2\), containing information only about the p-value, changes the evidential assessment. Again, the difference between one-sided and two-sided p-values is crucial. As the one-sided p-value stands in a one-to-one correspondence with the value of the test statistic τ(X) (and, hence, the test statistic \(T(X)=\bar{X}\)), using data \(d_2\) rather than \(d_1\) is sufficient for establishing the severity of the test. Once the value of \(T(X)=\bar{X}\) is known, one can calculate the severity of the test. Using a one-sided p-value does therefore not violate PTE from an error-statistical perspective. In contrast, the two-sided p-value does not allow one to establish the severity of a test, as information about the direction of the effect is lost and the value of the test statistic \(T(X)=\bar{X}\) cannot be recovered from knowledge of the two-sided p-value.

By highlighting a difference between one-sided and two-sided tests, the error-statistical position mirrors the likelihoodist and Bayesian views on the relationship between PTE and significance testing. All three accounts agree that the use of one-sided p-values with a sufficient test statistic is in accordance with PTE, while the use of two-sided p-values violates this principle (see Table 1).

Table 1 Summary of results

  Theory of evidence    One-sided p-value (sufficient statistic)    Two-sided p-value
  Likelihoodist (LL)    satisfies PTE                                violates PTE
  Bayesian              satisfies PTE                                violates PTE
  Error-statistical     satisfies PTE                                violates PTE

This result should not be too surprising, since all three accounts of evidence subscribe to the Sufficiency Principle (SP). In order to state SP, the notion of the evidential meaning of an experimental outcome has to be introduced. The ‘evidential meaning’ of outcome x of experiment E, denoted as \(Ev(E, x)\), is supposed to capture the “essential properties” of the statistical evidence provided by the observed outcome x of experiment E (Birnbaum 1962, 270). Two experiments E (with outcome x) and \(E'\) (with outcome y) being ‘evidentially equivalent’ is denoted by \(Ev(E, x) = Ev(E', y)\). SP then reads as follows (Birnbaum 1962):

If E is a specified experiment, with outcome x; if T = T(X) is any sufficient statistic; and if \(E'\) is the experiment derived from E, in which any outcome x of E is represented only by the corresponding value T(x) of the sufficient statistic; then for each x, \(Ev(E, x) = Ev(E', T(x))\).

In essence, SP states that the evidential meaning of an observation depends only on the observed value of a sufficient statistic. Since the p-value of a one-sided test with a sufficient test statistic is itself sufficient, all three accounts of evidence agree that this quantity captures the evidential meaning of the observed data. SP is therefore to be seen as a statistical explication of PTE by specifying the conditions under which an evidential assessment should be unaffected when moving to a strictly logically weaker description of the data.

A final word on the question of whether to use a one-sided or a two-sided test. The present discussion suggests a further argument for the use of one-sided p-values. As using one-sided tests with a sufficient test statistic is in accordance with PTE from a variety of perspectives on what counts as evidence (including likelihoodist, Bayesian and error-statistical positions), this supports choosing a one-sided over a two-sided test.

5 Conclusion

The paper proposed PTE4 as an adequate interpretation of PTE. According to PTE4, strictly logically stronger data should be used if they affect the evidential assessment. Adopting this interpretation of PTE has consequences for assessing the claim that significance testing violates PTE. First, there is no theory-independent assessment of whether significance testing violates PTE. Second, when prominent theories of evidence are presupposed there is no conflict between the use of p-values and PTE for a large and important class of significance tests. Whatever the flaws of p-values and significance tests, violating PTE is not one of them under the premise that a one-sided test with a sufficient test statistic is employed.