1 Introduction

Scientists seek to publish observations that constitute sufficient evidence to accept a theory or a working hypothesis as a contribution to scientific knowledge. That contrasts with the position of Fisher that a null hypothesis can be rejected but never accepted, as Patriota (2017) discussed. The disagreement may be more apparent than real. For example, there is nothing self-contradictory about accepting a scientific theory as a working hypothesis because it is consistent with a 99% confidence interval. Even though each parameter value in that interval, when considered as a null hypothesis, has a p value greater than 0.01, none is accepted for that reason alone.

For measuring the strength of evidence involving scientific or statistical hypotheses, the likelihood paradigm may have advantages over the frequentist and Bayesian paradigms (Edwards 1992; Royall 1997; Blume 2011; Bickel 2012; Rohde 2014). In this paradigm, the likelihood ratio serves as a measure of the strength of statistical evidence for one hypothesis over another through the lens of a family of distributions (Royall 1997, 2000a). That differs from the more familiar uses of the likelihood function as a tool for the construction of point estimators, p values, confidence intervals, and posterior probabilities. It has been used to analyze data both in basic domains such as genetics (Strug and Hodge 2006a, b; Strug et al. 2007; Hodge et al. 2011; Strug et al. 2010; Strug 2018) and in more applied domains such as health care (Blume 2002; Hoch and Blume 2008). Rohde (2014) provides an accessible exposition of the likelihood paradigm.

The paradigm has roots in the likelihood intervals of R. A. Fisher. In a certain sense, a scalar parameter value \(\theta\) is “consistent with the observations” at some level \(\Lambda\) if and only if \(\theta ^{-}\left( \Lambda \right) \le \theta \le \theta ^{+}\left( \Lambda \right)\), where \(\left[ \theta ^{-}\left( \Lambda \right) ,\theta ^{+}\left( \Lambda \right) \right]\) is the interval of parameter values with likelihood within a factor of \(\Lambda\) of the maximum likelihood, provided that \(\Lambda >1\) (Royall 1997, p. 26). For example, Fisher (1973, pp. 75–76) considered \(\Lambda =2,5,15\), flagging parameter values outside the \(\left[ \theta ^{-}\left( \Lambda \right) ,\theta ^{+}\left( \Lambda \right) \right]\) intervals as “implausible” and those outside even the \(\left[ \theta ^{-}\left( 15\right) ,\theta ^{+}\left( 15\right) \right]\) interval as “obviously open to grave suspicion” (cf. Barnard 1967; Hoch and Blume 2008). In that context, Fisher (1973, p. 71; cf. 74–75) remarked that the p value is “not very defensible save as an approximation” (see Bickel and Patriota 2019). Royall (1997) instead used \(\Lambda =2^{3}\) for strong evidence and \(\Lambda =2^{5}\) for very strong evidence; Bickel and Rahal (2019) suggest additional gradations. For vector parameters, the level-\(\Lambda\) likelihood set is the set of parameter values with likelihood within a factor of \(\Lambda\) of the maximum likelihood.
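
To make the construction concrete, a level-\(\Lambda\) likelihood interval can be computed by evaluating the likelihood on a grid and retaining the parameter values within a factor of \(\Lambda\) of the maximum. The Python sketch below does so for a binomial model; the data, the grid, and the model itself are illustrative assumptions rather than anything drawn from the works cited above.

```python
# A minimal sketch of level-Lambda likelihood intervals on a grid.
# The binomial model and the data (7 successes in 20 trials) are
# assumptions made only for illustration.
import numpy as np
from scipy.stats import binom

x, n = 7, 20                              # hypothetical observations
theta = np.linspace(0.001, 0.999, 9991)   # grid over the parameter
lik = binom.pmf(x, n, theta)              # likelihood of each theta
rel = lik / lik.max()                     # likelihood relative to the maximum

for Lam in (2, 5, 15):                    # Fisher's levels
    inside = theta[rel >= 1 / Lam]        # within a factor Lambda of the max
    print(f"Lambda = {Lam:2d}: [{inside.min():.3f}, {inside.max():.3f}]")
```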

Just as nested confidence sets may be inverted to define a p value for each parameter value, likelihood sets may be inverted to obtain the likelihood ratio of each parameter value relative to the maximum likelihood. Edwards (1992) and Royall (1997) interpreted the likelihood ratio as the strength of evidence, carefully limiting the scope to comparisons between simple (point) hypotheses, in which case the Bayes factor is the likelihood ratio. According to the (special) law of likelihood attributed to Hacking (1965), the likelihood ratio between two simple hypotheses quantifies the strength of evidence of one hypothesis over the other, apart from prior distributions, loss functions, and the sample size (Edwards 1992; Royall 1997). This contrasts with the more generally applicable practice of measuring statistical evidence for general hypotheses with the Bayes factor (cf. Jeffreys 1948).

The primary motivation for the limitation to simple hypotheses was to avoid specifying the prior distributions needed to define a Bayes factor for composite hypotheses, for the Bayes factor is the ratio of the means of the likelihood function, each mean taken with respect to the prior distribution conditional on one of the hypotheses compared. To achieve applicability to composite hypotheses without a prior distribution of the interest parameter, the prior mean likelihood given each composite hypothesis is replaced with the maximum likelihood over the parameter values of each composite hypothesis. The resulting ratio of maximum likelihoods is interpreted as the weight of the statistical evidence that supports one composite hypothesis over another under the general law of likelihood (Bickel 2012, §2.2.3), applicable to pseudolikelihood as well as to likelihood. For instance, \(\Lambda\) is the weight of the evidence substantiating the hypothesis that \(\theta ^{-}\left( \Lambda \right) \le \theta \le \theta ^{+}\left( \Lambda \right)\) over the hypothesis that \(\theta \notin \left[ \theta ^{-}\left( \Lambda \right) ,\theta ^{+}\left( \Lambda \right) \right]\). Denoting the function for the weight of evidence by W, that can be concisely expressed as

$$\begin{aligned} W\left( \left[ \theta ^{-}\left( \Lambda \right) ,\theta ^{+}\left( \Lambda \right) \right] ;\left[ \theta ^{-}\left( \Lambda \right) ,\theta ^{+}\left( \Lambda \right) \right] ^{c}\right) =\Lambda , \end{aligned}$$
(1)

where the complement \(\left[ \theta ^{-}\left( \Lambda \right) ,\theta ^{+}\left( \Lambda \right) \right] ^{c}\) is \(\left]-\infty ,\theta ^{-}\left( \Lambda \right) \right[\cup \left]\theta ^{+}\left( \Lambda \right) ,\infty \right[\). With an eye toward clinical trials, Zhang and Zhang (2013a) recommended a special case of the general law for regular models and sufficiently large samples. Motivated by different concerns, Dubois et al. (1997), Walley and Moral (1999), Giang and Shenoy (2005), and Coletti et al. (2009) had previously considered a general form of \(L\left( \left[ \theta ^{-}\left( \Lambda \right) ,\theta ^{+}\left( \Lambda \right) \right] \right)\), the normalized maximum likelihood of the hypothesis that \(\theta ^{-}\left( \Lambda \right) \le \theta \le \theta ^{+}\left( \Lambda \right)\).
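
Equation (1) lends itself to a direct numeric check. The sketch below reuses the assumed binomial setup above and confirms that the ratio of maximum likelihoods between the level-\(\Lambda\) likelihood interval and its complement recovers \(\Lambda\), up to the resolution of the grid.

```python
# A numeric check of Eq. (1) under the assumed binomial setup:
# the weight of evidence for the level-Lambda likelihood interval
# over its complement is Lambda, up to grid resolution.
import numpy as np
from scipy.stats import binom

x, n = 7, 20
theta = np.linspace(0.001, 0.999, 9991)
lik = binom.pmf(x, n, theta)
inside = lik / lik.max() >= 1 / 8           # level-Lambda set with Lambda = 8

W = lik[inside].max() / lik[~inside].max()  # ratio of maximum likelihoods
print(W)                                    # approximately 8
```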

Example 1

Bickel (2012, Example 4, altered). Let \(\theta\) represent a cosmological theory, with \(\theta =0\) for the big bang theory and \(\theta =1\) for the steady state theory. Let \(f_{0}\left( x\right) =2^{-3}\) and \(f_{1}\left( x\right) =2^{-7}\) be the probabilities of the sample x of astronomical data under \(\theta =0\) and \(\theta =1\), respectively. More generally, with \(\theta\) as any real number, each corresponding to an astronomical theory, suppose that the probability of observing x given the corresponding theory is

$$\begin{aligned} f_{\theta }\left( x\right) ={\left\{ \begin{array}{ll} 2^{-3} & {\text {if }}\theta =0\\ 2^{-7} & {\text {if }}0<\theta \le 1\\ 0 & {\text {otherwise}} \end{array}\right. }. \end{aligned}$$

The maximum likelihood estimate is \({\widehat{\theta }}=0\) since \(f_{0}\left( x\right) >f_{\theta }\left( x\right)\) for all \(\theta \ne 0\). For \(\Lambda =2^{3}\) and \(\Lambda =2^{5}\), the likelihood intervals are

$$\begin{aligned} \left[ \theta ^{-}\left( 2^{3}\right) ,\theta ^{+}\left( 2^{3}\right) \right]&=\left\{ \theta :f_{{\widehat{\theta }}}\left( x\right) /f_{\theta }\left( x\right) \le 2^{3}\right\} =\left[ 0,0\right] ;\\ \left[ \theta ^{-}\left( 2^{5}\right) ,\theta ^{+}\left( 2^{5}\right) \right]&=\left\{ \theta :f_{{\widehat{\theta }}}\left( x\right) /f_{\theta }\left( x\right) \le 2^{5}\right\} =\left[ 0,1\right] . \end{aligned}$$

The weight of the evidence substantiating the big bang theory as opposed to the set of all other theories is

$$\begin{aligned} W\left( \left\{ 0\right\} ;\left\{ \theta :\theta \ne 0\right\} \right) =\frac{f_{0}\left( x\right) }{\sup _{\theta \ne 0}f_{\theta }\left( x\right) }=\frac{2^{-3}}{2^{-7}}=2^{4}. \end{aligned}$$

Likewise, the weight of the evidence substantiating the big bang theory as opposed to the set of all theories, including the big bang, is

$$\begin{aligned} W\left( \left\{ 0\right\} ;\Theta \right) =\frac{f_{0}\left( x\right) }{\sup _{\theta }f_{\theta }\left( x\right) } =\frac{2^{-3}}{2^{-3}}=1, \end{aligned}$$

in which \(\Theta\) is the real line.

This example can be extended to problems of null hypothesis significance testing by letting \(\theta =0\) correspond to the null hypothesis. \({{\,\mathrm{\,\blacktriangle }\,}}\)
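
The computations of Example 1 are easily reproduced. In the sketch below, the model values are those stated in the example, while the finite grid standing in for the real line and the helper name likelihood_interval are assumptions of this illustration.

```python
# Example 1 in code; the finite grid standing in for the real line
# is an assumption made for illustration.
def f(theta):
    """Probability of the astronomical data x under theory theta."""
    if theta == 0:
        return 2.0 ** -3
    if 0 < theta <= 1:
        return 2.0 ** -7
    return 0.0

grid = [i / 1000 for i in range(-500, 1501)]     # stand-in for the real line

def likelihood_interval(Lam):
    f_hat = max(f(t) for t in grid)              # f at the MLE, theta-hat = 0
    inside = [t for t in grid if f(t) > 0 and f_hat / f(t) <= Lam]
    return min(inside), max(inside)

print(likelihood_interval(2 ** 3))               # (0.0, 0.0)
print(likelihood_interval(2 ** 5))               # (0.0, 1.0)

# Weight of evidence for the big bang theory over all other theories:
print(f(0) / max(f(t) for t in grid if t != 0))  # 16.0 = 2**4
```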

Although the general law of likelihood overcomes multiplicity paradoxes without resorting to a prior distribution and has been applied to genomics data (Bickel 2012; Bickel and Rahal 2019) and genetics data (Strug 2018), it remains controversial. Blume (2013), while advocating the special law of likelihood, does not recognize a need for assigning a strength of evidence to a composite hypothesis, maintaining that the level-\(\Lambda\) likelihood set simply indicates which distributions are better supported than the others by the data (cf. Zhang and Zhang 2013b).

Further, it is often thought that the likelihood ratio cannot be directly compared to a fixed threshold \(\Lambda\) but that it requires calibration (Severini 2000; Kalbfleisch 2000; Morgenthaler and Staudte 2012; Spanos 2013). For example, Vieland and Seok (2016) made several adjustments to the case of \(L\left( \bullet \right)\) defined in Zhang and Zhang (2013a). Frequentist calibrations include those that Bickel (2018) bases on the fixed-confidence likelihood intervals of Sprott (2000, §5.3), and Patriota (2013) proposed a quantity based on the likelihood ratio test. Frequentist calibration would indeed be needed to achieve specified repeated-sampling coverage rates since a level-\(\Lambda\) likelihood set can cover the true value of the parameter with much less than, say, 95% confidence even if \(\Lambda\) is relatively high. Likewise, from a Bayesian perspective, a level-\(\Lambda\) likelihood set can have a very low posterior probability.

Largely due to those concerns, the most commonly used extension of the special law of likelihood to composite hypotheses is the Bayes factor rather than the general law’s \(W\left( \bullet ;\bullet \right)\). Being defined as the posterior odds divided by the prior odds, the Bayes factor captures the intuitive appeal of the special law. Indeed, Edwards (1992) commended the special law for its compatibility with data analyses in the presence of priors, and Royall (2000b) interpreted the likelihood ratio as the Bayes factor for the case of comparing two simple hypotheses. To overcome the objection against the Bayes factor as a measure of evidence for composite hypotheses, Bickel (2013a) presented general classes of prior-free approximations to Bayes factors.

The Bayes factor is a well-known measure of how relevant the data are when considered as evidence for or against a composite hypothesis. The degree of that relevance is known as the relevancy of the evidence to whether some hypothesis is true (Koehler 2002). The many other proposed measures of the relevancy of the evidence include the relative belief ratio, which is the posterior probability of a hypothesis divided by its prior probability (Evans 2015), and the relevance measure of Carnap (1962, §67); see Koscholke (2017).

The data or other evidence can be relevant to the truth of a hypothesis without warranting the conclusion that the hypothesis is true or the conclusion that it is false. That is why the relevancy of the evidence, also called the probative value of the evidence, is distinguished from the sufficiency of the evidence to justify drawing a conclusion about the hypothesis (Kaye and Koehler 2003). (The concept of sufficient evidence should not be confused with that of a sufficient statistic: a body of evidence is sufficient for drawing a conclusion about a hypothesis if the data carry enough information to warrant that conclusion.)

While the Bayes factor succeeds in quantifying the relevancy of the data to the truth of the hypothesis, it fails to quantify the sufficiency of the data to warrant a conclusion about the hypothesis (Lavine and Schervish 1999). Conversely, the posterior probability and the posterior odds of a hypothesis quantify the sufficiency of the data to justify a conclusion but not the relevancy of the data. Fiducial probability defined as an observed confidence level is an alternative measure of the sufficiency of the evidence (Bickel 2011).

Nonetheless, the Bayes factor qualifies as a measure of the sufficiency of the data as well as its relevancy when the prior probability of the hypothesis is fixed at 50%, for the Bayes factor is then equal to the posterior odds. The commonly used thresholds for Bayes factors to achieve certain scales of evidence were originally intended for that case (Jeffreys 1948).

Another measure of both the sufficiency and relevancy of data is the weight of evidence under the general law of likelihood, defined in Sect. 2. That section also defines a generalization of \(L\left( \bullet \right)\) as a particular conditional possibility measure that is dual to a necessity measure, as those terms are used in possibility theory (§2.2). Section 3 derives the general law from idealizations of sufficiency and relevancy as opposed to the idealization of inference to the best explanation found in Bickel (2012). Section 4 summarizes the paper’s developments.

Appendix A of Bickel (2019) contrasts this paper’s approach to possibility theory with the interpretation of possibility as an upper probability. Walley and Moral (1999) used the latter interpretation to argue against an application of possibility theory.

2 Weight of evidence

2.1 Preliminary notation and definitions

Let x denote an observed scalar, vector, or matrix in some set \({\mathcal {X}}\) of possible observations. This x, a realization of a random element X, may be a statistic that depends on other observations.

Consider a set \(\Theta\) and a family of density functions \(\left\{ f_{\theta _{0}}:\theta _{0}\in \Theta \right\}\) such that the parameter is identifiable in the sense that \(f_{\theta _{0}}\ne f_{\theta }\) (except in a set of measure zero) for all \(\theta _{0},\theta \in \Theta :\theta _{0}\ne \theta\). If the interest parameter value were equal to \(\theta\), then \(f_{\theta }\left( x\right)\) would be the probability density or probability mass of the observation that \(X=x\). The likelihood function \(\ell\) is a function on \(\Theta\) such that \(\ell \left( \theta \right)\) is proportional to \(f_{\theta }\left( x\right)\) for all \(\theta \in \Theta\). Thus, the maximum likelihood estimate is

$$\begin{aligned} {\widehat{\theta }}=\arg \sup _{\theta \in \Theta }\ell \left( \theta \right) =\arg \sup _{\theta \in \Theta }f_{\theta }\left( x\right) , \end{aligned}$$

where the supremum rather than the maximum will be used throughout in case \(\Theta\) or a subset used in its place is not a closed set; see Example 4 of Sect. 2.4.

The function \(\ell\) may be any pseudo-likelihood function such that \(\ell \left( \theta \right)\) is approximately proportional to a probability density for every \(\theta \in \Theta\). Thus, \(\ell\) may be a marginal, conditional, estimated, or integrated likelihood, eliminating a nuisance parameter. If the profile likelihood does not approximate a density for a particular model, it may nevertheless be corrected to approximate a conditional or marginal likelihood in certain cases (Severini 2000, pp. 310–312, 323). The prefix “pseudo” is somewhat misleading: even the “true” likelihood function might be considered a pseudo-likelihood function since a statistical model cannot completely capture the data-generation process (Lindsey 1996, §6.5).

An anonymous reviewer suggested letting \(\ell\) be an extended likelihood function (Bjørnstad 1990) or a hierarchical likelihood function (Lee and Nelder 1996; Lee et al. 2006) for applications to predicting random quantities of interest. The relationship between that approach to random parameters and the distinction that Bickel (2012) made between complex hypotheses and intrinsically simple hypotheses has not been investigated.

Each hypothesis about \(\theta\) may be expressed as “\(\theta \in {\mathcal {H}}\)” for an \({\mathcal {H}}\subset \Theta\). Thus, all possible hypotheses about \(\theta\) correspond to members of \({\mathfrak {H}}\), a set of subsets of \(\Theta\). For example, if \(\Theta\) is the real line, \({\mathfrak {H}}\) is the set of Borel subsets of \(\Theta\), and \(\overline{\left\{ 0\right\} }\) is the complement \(\Theta \backslash \left\{ 0\right\}\) of \(\left\{ 0\right\}\), then the hypothesis that \(\theta \ne 0\) is the hypothesis that \(\theta \in \overline{\left\{ 0\right\} }\), corresponding to the subset \(\overline{\left\{ 0\right\} }\), which is a member of \({\mathfrak {H}}\).

A restricted parameter space (Mandelkern 2002; Zhang and Woodroofe 2003; Marchand and Strawderman 2004; Wang 2006; Wang 2007; Marchand and Strawderman 2013; Marchand and Strawderman 2006; Fraser 2011; Bickel 2020a; Bickel and Patriota 2019; Bickel 2020b) is denoted by \({\mathcal {R}}\), a measurable subset of \(\Theta\). To rule out pathological cases, \({\mathcal {R}}\) is assumed to contain at least one parameter value. That assumption is reasonable since a restricted parameter space in a real application would only be empty if the statistical model were inadequate for the purpose at hand.

2.2 Likeliness and unlikeliness

For any \({\mathcal {H}}\in {\mathfrak {H}}\) and \({\mathcal {R}}\in {\mathfrak {H}}\backslash \left\{ \emptyset \right\}\), call

$$\begin{aligned} L\left( {\mathcal {H}}\right) =\frac{\sup _{\theta \in {\mathcal {H}}} \ell \left( \theta \right) }{\sup _{\theta \in \Theta }\ell \left( \theta \right) } =\frac{\sup _{\theta \in {\mathcal {H}}} f_{\theta }\left( x\right) }{\sup _{\theta \in \Theta }f_{\theta }\left( x\right) } \end{aligned}$$
(2)

the marginal likeliness of the hypothesis that \(\theta \in {\mathcal {H}}\) and, if \(L\left( {\mathcal {R}}\right) >0\),

$$\begin{aligned} L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\frac{L\left( {\mathcal {H}}\cap {\mathcal {R}}\right) }{L\left( {\mathcal {R}}\right) } \end{aligned}$$
(3)

the conditional likeliness of the hypothesis that \(\theta \in {\mathcal {H}}\) given \(\theta \in {\mathcal {R}}\) (Bickel and Rahal 2019). Here, the supremum is the least upper bound in \([0,\infty [\), with \(\sup \emptyset \equiv 0\). It follows that \(L\left( {\mathcal {H}}\right) \in \left[ 0,1\right]\) and \(L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) \in \left[ 0,1\right]\).

The likeliness of a hypothesis is insufficient as a measure of its strength of evidence since the likeliness of the hypothesis’s alternative must also be considered. For that reason, it is convenient to define the marginal unlikeliness of the hypothesis that \(\theta \in {\mathcal {H}}\) as \(U\left( {\mathcal {H}}\right) =L\left( \overline{{\mathcal {H}}}\right)\) and the conditional unlikeliness of the hypothesis that \(\theta \in {\mathcal {H}}\) given \(\theta \in {\mathcal {R}}\) as \(U\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =L\left( \overline{{\mathcal {H}}}\vert {\mathcal {R}}\right)\), where \(\overline{{\mathcal {H}}}\) is the complement of \({\mathcal {H}}\). The likeliness and unlikeliness of a hypothesis are combined into a single measure of evidence in Section 2.4. According to possibility theory, \(L\left( \bullet \right)\) is a possibility measure, and \(1-U\left( \bullet \right)\) is a necessity measure (Bickel and Rahal 2019).
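
Representing hypotheses as predicates on \(\theta\) makes Eqs. (2) and (3) and the unlikeliness straightforward to implement. The sketch below does so for the model of Example 1; the grid and the function names L and U are assumptions of this illustration.

```python
# Grid-based likeliness (Eqs. (2)-(3)) and unlikeliness for Example 1's
# model; hypotheses are predicates on theta, and the supremum over the
# empty set is 0, as in the text. The grid is an assumed stand-in for Theta.
f = lambda t: 2.0 ** -3 if t == 0 else (2.0 ** -7 if 0 < t <= 1 else 0.0)
grid = [i / 1000 for i in range(-500, 1501)]
sup = lambda pred: max((f(t) for t in grid if pred(t)), default=0.0)

def L(H, R=lambda t: True):
    """Likeliness of {theta in H} given {theta in R}; marginal if R is Theta."""
    return sup(lambda t: H(t) and R(t)) / sup(R)   # requires L(R) > 0

def U(H, R=lambda t: True):
    """Unlikeliness of {theta in H}: the likeliness of its complement."""
    return L(lambda t: not H(t), R)

print(L(lambda t: t == 0))  # 1.0: marginal likeliness of the big bang theory
print(U(lambda t: t == 0))  # 0.0625 = 2**-4
```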

Example 2

Example 1, continued. The marginal likeliness of the big bang theory is

$$\begin{aligned} L\left( \left\{ 0\right\} \right) =\frac{f_{0}\left( x\right) }{\sup _{\theta \in \Theta }f_{\theta }\left( x\right) }=\frac{2^{-3}}{2^{-3}}=1. \end{aligned}$$

Likewise, the conditional likeliness of the big bang theory, given that the truth is between the big bang theory and the steady state theory, is

$$\begin{aligned} L\left( \left\{ 0\right\} \vert \left[ 0,1\right] \right) =\frac{L\left( \left\{ 0\right\} \right) }{L\left( \left[ 0,1\right] \right) }=\frac{f_{0}\left( x\right) }{\sup _{\theta \in \left[ 0,1\right] }f_{\theta }\left( x\right) }=\frac{2^{-3}}{2^{-3}}=1. \end{aligned}$$
(4)

In the same way, the marginal likeliness that the true theory is any other theory is

$$\begin{aligned} L\left( \left\{ \theta :\theta \ne 0\right\} \right) =\frac{\sup _{\theta \ne 0}f_{\theta }\left( x\right) }{\sup _{\theta \in \Theta }f_{\theta }\left( x\right) }=\frac{2^{-7}}{2^{-3}}=2^{-4}, \end{aligned}$$

and the conditional likeliness that the true theory is any theory other than the big bang theory, given that the truth is between the big bang theory and the steady state theory, is

$$\begin{aligned} L\left( \left\{ \theta :\theta \ne 0\right\} \vert \left[ 0,1\right] \right) =\frac{L\left( \left\{ \theta \in \left[ 0,1\right] :\theta \ne 0\right\} \right) }{L\left( \left[ 0,1\right] \right) }=\frac{\sup _{0<\theta \le 1}f_{\theta }\left( x\right) }{\sup _{\theta \in \left[ 0,1\right] }f_{\theta }\left( x\right) }=\frac{2^{-7}}{2^{-3}}=2^{-4}. \end{aligned}$$
(5)

A more extreme case is the conditional likeliness for the truth of a theory between the big bang theory and the steady state theory, given the truth of a theory between the big bang theory and the steady state theory:

$$\begin{aligned} L\left( \left[ 0,1\right] \vert \left[ 0,1\right] \right) =\frac{L\left( \left[ 0,1\right] \cap \left[ 0,1\right] \right) }{L\left( \left[ 0,1\right] \right) }=\frac{L\left( \left[ 0,1\right] \right) }{L\left( \left[ 0,1\right] \right) }=1. \end{aligned}$$
(6)

Finally, the conditional likeliness for the falsity of every theory between the big bang theory and the steady state theory, given the truth of a theory between the big bang theory and the steady state theory, is

$$\begin{aligned} L\left( \left\{ \theta :\theta \notin \left[ 0,1\right] \right\} \vert \left[ 0,1\right] \right) =\frac{L\left( \left\{ \theta :\theta \notin \left[ 0,1\right] \right\} \cap \left[ 0,1\right] \right) }{L\left( \left[ 0,1\right] \right) }=\frac{0}{2^{-3}}=0 \end{aligned}$$
(7)

since \(\left\{ \theta :\theta \notin \left[ 0,1\right] \right\} \cap \left[ 0,1\right] =\emptyset\). \({{\,\mathrm{\,\blacktriangle }\,}}\)
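
Continuing the likeliness sketch of Sect. 2.2 (with its assumed grid), the four conditional likeliness values above can be verified numerically:

```python
# Numeric checks of Eqs. (4)-(7), reusing the L function sketched
# in Sect. 2.2.
R01 = lambda t: 0 <= t <= 1
print(L(lambda t: t == 0, R01))       # 1.0,             Eq. (4)
print(L(lambda t: t != 0, R01))       # 0.0625 = 2**-4,  Eq. (5)
print(L(R01, R01))                    # 1.0,             Eq. (6)
print(L(lambda t: not R01(t), R01))   # 0.0,             Eq. (7)
```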

2.3 Marginal and conditional weight of evidence

Suppose \({\mathcal {H}}_{1},{\mathcal {H}}_{2}\in {\mathfrak {H}}\). According to the general law of likelihood (§1), the weight of evidence in the observation that \(X=x\) substantiating the hypothesis that \(\theta \in {\mathcal {H}}_{1}\), as opposed to the hypothesis that \(\theta \in {\mathcal {H}}_{2}\), is the extended real number

$$\begin{aligned} W\left( {\mathcal {H}}_{1};{\mathcal {H}}_{2}\right) ={\left\{ \begin{array}{ll} \frac{\sup _{\theta \in {\mathcal {H}}_{1}}\ell \left( \theta \right) }{\sup _{\theta \in {\mathcal {H}}_{2}}\ell \left( \theta \right) }=\frac{\sup _{\theta \in {\mathcal {H}}_{1}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {H}}_{2}}f_{\theta }\left( x\right) } & {\text {if }}\sup _{\theta \in {\mathcal {H}}_{1}}\ell \left( \theta \right) \ge 0,\ \sup _{\theta \in {\mathcal {H}}_{2}}\ell \left( \theta \right) >0\\ \infty & {\text {if }}\sup _{\theta \in {\mathcal {H}}_{1}}\ell \left( \theta \right) >0,\ \sup _{\theta \in {\mathcal {H}}_{2}}\ell \left( \theta \right) =0\\ 1 & {\text {if }}\sup _{\theta \in {\mathcal {H}}_{1}}\ell \left( \theta \right) =0,\ \sup _{\theta \in {\mathcal {H}}_{2}}\ell \left( \theta \right) =0 \end{array}\right. }. \end{aligned}$$
(8)

That will be called the marginal weight of evidence to distinguish it from the conditional weight of evidence, defined below. For a simple special case, recall Eq. (1) of the introduction. If \(f_{\bullet }\left( x\right)\) is a profile likelihood function and \(W\left( {\mathcal {H}}_{1};{\mathcal {H}}_{2}\right) \ne 1\), then \(W\left( {\mathcal {H}}_{1};{\mathcal {H}}_{2}\right)\) reduces to the quantity considered by Zhang and Zhang (2013a), as discussed in Bickel (2013b).

In the rest of this paper, likelihood ratios with a denominator of 0 are to be understood in analogy with Eq. (8) to avoid explicitly listing all the cases. Example 3 shows how a 0 might appear in the denominator.
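
A sketch of Eq. (8), with the zero-denominator cases handled explicitly, may clarify the definition; the grid-based setup (an assumption of this illustration) is the one used for Example 1.

```python
# The marginal weight of evidence of Eq. (8), including the cases
# with a zero denominator; the grid is an assumed stand-in for Theta.
import math

f = lambda t: 2.0 ** -3 if t == 0 else (2.0 ** -7 if 0 < t <= 1 else 0.0)
grid = [i / 1000 for i in range(-500, 1501)]
sup = lambda pred: max((f(t) for t in grid if pred(t)), default=0.0)

def W(H1, H2):
    s1, s2 = sup(H1), sup(H2)
    if s2 > 0:
        return s1 / s2                    # first case of Eq. (8)
    return math.inf if s1 > 0 else 1.0    # remaining two cases

print(W(lambda t: t == 0, lambda t: t != 0))   # 16.0, as in Example 1
```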

The conditional weight of evidence in the observation that \(X=x\) substantiating the hypothesis that \(\theta \in {\mathcal {H}}_{1}\) as opposed to the hypothesis that \(\theta \in {\mathcal {H}}_{2}\) given \(\theta \in {\mathcal {R}}\) is

$$\begin{aligned} W\left( {\mathcal {H}}_{1};{\mathcal {H}}_{2}\vert {\mathcal {R}}\right) =W\left( {\mathcal {H}}_{1}\cap {\mathcal {R}};{\mathcal {H}}_{2}\cap {\mathcal {R}}\right) \end{aligned}$$
(9)

for all \({\mathcal {H}}_{1},{\mathcal {H}}_{2},{\mathcal {R}}\in {\mathfrak {H}}\) such that \(L\left( {\mathcal {R}}\right) >0\); a one-line implementation is sketched below. Theorem 1 connects the conditional weight of evidence to the likeliness of Sect. 2.2.
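
The helper name W_given in the sketch is an assumption of this illustration; it builds on the marginal W sketched after Eq. (8).

```python
# Eq. (9): the conditional weight of evidence as the marginal weight
# after intersecting both hypotheses with R.
def W_given(H1, H2, R):
    return W(lambda t: H1(t) and R(t), lambda t: H2(t) and R(t))

R01 = lambda t: 0 <= t <= 1
print(W_given(lambda t: t == 0, lambda t: t != 0, R01))   # 16.0
```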

Theorem 1

For any \({\mathcal {H}}_{1},{\mathcal {H}}_{2},{\mathcal {R}}\in {\mathfrak {H}}\) such that \(L\left( {\mathcal {R}}\right) >0\),

$$\begin{aligned} W\left( {\mathcal {H}}_{1};{\mathcal {H}}_{2}\vert {\mathcal {R}}\right) =\frac{L\left( {\mathcal {H}}_{1}\vert {\mathcal {R}}\right) }{L\left( {\mathcal {H}}_{2}\vert {\mathcal {R}}\right) }. \end{aligned}$$
(10)

For any set \({\mathfrak {H}}_{0}\subset {\mathfrak {H}}\) such that \(\bigcup _{{\mathcal {H}}_{0}\in {\mathfrak {H}}_{0}}{\mathcal {H}}_{0}={\mathcal {H}}\),

$$\begin{aligned} L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\frac{\sup _{\theta \in {\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {R}}}f_{\theta }\left( x\right) }=\sup _{{\mathcal {H}}_{0}\in {\mathfrak {H}}_{0}}L\left( {\mathcal {H}}_{0}\vert {\mathcal {R}}\right) . \end{aligned}$$
(11)

As Bickel and Rahal (2019) claimed, for any partition \({\mathfrak {P}}\subset {\mathfrak {H}}\) of \(\Theta\) such that \(L\left( {\mathcal {R}}\right) >0\) for all \({\mathcal {R}}\in {\mathfrak {P}}\),

$$\begin{aligned} L\left( {\mathcal {H}}\right) =\sup _{{\mathcal {R}}\in {\mathfrak {P}}}L\left( {\mathcal {R}}\right) L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) . \end{aligned}$$
(12)

Proof

By Eqs. (2), (8), and (9),

$$\begin{aligned} W\left( {\mathcal {H}}_{1};{\mathcal {H}}_{2}\vert {\mathcal {R}}\right) =\frac{\sup _{\theta \in {\mathcal {H}}_{1}\cap {\mathcal {R}}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {H}}_{2}\cap {\mathcal {R}}}f_{\theta }\left( x\right) }=\frac{\sup _{\theta \in {\mathcal {H}}_{1}\cap {\mathcal {R}}}f_{\theta }\left( x\right) /\sup _{\theta \in {\mathcal {R}}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {H}}_{2}\cap {\mathcal {R}}}f_{\theta }\left( x\right) /\sup _{\theta \in {\mathcal {R}}}f_{\theta }\left( x\right) }, \end{aligned}$$

which is the right-hand side of Eq. (10) according to Eq. (3). Equations (2) and (3) imply that

$$\begin{aligned} L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =L\left( \bigcup _{{\mathcal {H}}_{0}\in {\mathfrak {H}}_{0}}{\mathcal {H}}_{0}\vert {\mathcal {R}}\right) =\frac{\sup _{{\mathcal {H}}_{0}\in {\mathfrak {H}}_{0}}\sup _{\theta \in {\mathcal {H}}_{0}\cap {\mathcal {R}}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {R}}}f_{\theta }\left( x\right) }=\sup _{{\mathcal {H}}_{0}\in {\mathfrak {H}}_{0}}\frac{\sup _{\theta \in {\mathcal {H}}_{0}\cap {\mathcal {R}}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {R}}}f_{\theta }\left( x\right) }, \end{aligned}$$

yielding \(L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\sup _{{\mathcal {H}}_{0}\in {\mathfrak {H}}_{0}}L\left( {\mathcal {H}}_{0}\vert {\mathcal {R}}\right)\). The other portion of formula (11) is established by substituting \(\left\{ \left\{ \theta \right\} :\theta \in {\mathcal {H}}\right\}\) for \({\mathfrak {H}}_{0}\). Since \({\mathfrak {P}}\subset {\mathfrak {H}}\) is a partition,

$$\begin{aligned} L\left( {\mathcal {H}}\right) =L\left( {\mathcal {H}}\cap \Theta \right) =L\left( {\mathcal {H}}\cap \bigcup _{{\mathcal {R}}\in {\mathfrak {P}}}{\mathcal {R}}\right) =L\left( \bigcup _{{\mathcal {R}}\in {\mathfrak {P}}}\left( {{\mathcal {H}}\cap {\mathcal {R}}}\right) \right) =L\left( \bigcup _{{\mathcal {R}}_{0}\in {\mathfrak {P}}\left( {\mathcal {H}}\right) }{\mathcal {R}}_{0}\right) , \end{aligned}$$

where \({\mathfrak {P}}\left( {\mathcal {H}}\right) =\left\{ {\mathcal {R}}\in {\mathfrak {P}}:{\mathcal {R}}\subseteq {\mathcal {H}}\right\}\). Thus, using Eq. (11),

$$\begin{aligned} L\left( {\mathcal {H}}\right) =\sup _{{\mathcal {R}}_{0}\in {\mathfrak {P}}\left( {\mathcal {H}}\right) }L\left( {\mathcal {R}}_{0}\right) =\sup _{{\mathcal {R}}\in {\mathfrak {P}}}L\left( {\mathcal {H}}\cap {\mathcal {R}}\right) =\sup _{{\mathcal {R}}\in {\mathfrak {P}}}L\left( {\mathcal {R}}\right) L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) , \end{aligned}$$

with the last equality following from Eq. (3). \(\square\)

Example 3

Example 2, continued. The conditional weight of evidence substantiating the big bang theory, as opposed to the hypothesis that the truth is any other theory, given that the truth is between the big bang theory and the steady state theory, is

$$\begin{aligned} W\left( \left\{ 0\right\} ;\left\{ \theta :\theta \ne 0\right\} \vert \left[ 0,1\right] \right) =\frac{L\left( \left\{ 0\right\} \vert \left[ 0,1\right] \right) }{L\left( \left\{ \theta :\theta \ne 0\right\} \vert \left[ 0,1\right] \right) }=2^{4} \end{aligned}$$
(13)

according to Eqs. (4), (5), and (10). Similarly, the conditional weight of evidence substantiating the truth of a theory between the big bang theory and the steady state theory, as opposed to the truth of any theory outside that range, given the truth of a theory between the big bang theory and the steady state theory, is

$$\begin{aligned} W\left( \left[ 0,1\right] ;\left\{ \theta :\theta \notin \left[ 0,1\right] \right\} \vert \left[ 0,1\right] \right) =\frac{L\left( \left[ 0,1\right] \vert \left[ 0,1\right] \right) }{L\left( \left\{ \theta :\theta \notin \left[ 0,1\right] \right\} \vert \left[ 0,1\right] \right) }=\frac{1}{0}=\infty \end{aligned}$$

by Eqs. (6, 7). \({{\,\mathrm{\,\blacktriangle }\,}}\)

Equation (11) is the foundation of the multiple hypothesis method of Bickel and Rahal (2019).
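
Equations (10) and (12) can likewise be checked in the running example, reusing the grid-based L and W_given sketched above (with their assumed grid and a two-cell partition chosen so that each cell has positive likeliness):

```python
# Numeric checks of Eqs. (10) and (12) in the running example.
H, Hc = (lambda t: t == 0), (lambda t: t != 0)
R01 = lambda t: 0 <= t <= 1
print(W_given(H, Hc, R01), L(H, R01) / L(Hc, R01))   # 16.0 16.0, Eq. (10)

P = [lambda t: t <= 0.5, lambda t: t > 0.5]          # partition with L(R) > 0
print(L(H), max(L(R) * L(H, R) for R in P))          # 1.0 1.0,   Eq. (12)
```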

2.4 Absolute weight of evidence

The strength of evidence favoring the hypothesis that \(\theta \in {\mathcal {H}}\) can also be quantified without explicit reference to a second hypothesis by taking that second hypothesis to be the alternative to the first (i.e., the hypothesis that \(\theta \notin {\mathcal {H}}\)). The conditional weight of evidence in the observation that \(X=x\) substantiating the hypothesis that \(\theta \in {\mathcal {H}}\) given \(\theta \in {\mathcal {R}}\) is \(W\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =W\left( {\mathcal {H}};\overline{{\mathcal {H}}}\vert {\mathcal {R}}\right)\). Likewise, the marginal weight of evidence in the observation that \(X=x\) substantiating the hypothesis that \(\theta \in {\mathcal {H}}\) is \(W\left( {\mathcal {H}}\right) =W\left( {\mathcal {H}}\vert \Theta \right)\).

The word “absolute” could be added to those terms to prevent confusion with the terms defined in Sect. 2.3. Doing so, however, could make them too cumbersome to use in practice. The conditional and marginal weights of evidence are instead designated as absolute by the absence of the relative hypothesis. For example, whereas

$$\begin{aligned} {\text {marginal weight of evidence substantiating the big bang theory}} \end{aligned}$$

is absolute,

$$\begin{aligned} {\text {marginal weight of evidence substantiating the big bang theory as opposed to the steady state theory}} \end{aligned}$$

is relative. The words “absolute” and “relative” would then be redundant but could be added as needed for additional clarity.

Corollary 1

Under the assumptions of Theorem 1, for any \({\mathcal {H}},{\mathcal {R}}\in {\mathfrak {H}}\),

$$\begin{aligned} W\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\frac{L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) }{U\left( {\mathcal {H}}\vert {\mathcal {R}}\right) }=\frac{L\left( {\mathcal {H}}\cap {\mathcal {R}}\right) }{L\left( {\mathcal {R}}\backslash {\mathcal {H}}\right) }=\frac{\sup _{\theta \in {\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {R}}\backslash {\mathcal {H}}}f_{\theta }\left( x\right) } \end{aligned}$$
(14)
$$\begin{aligned} W\left( {\mathcal {H}}\right) =\frac{\sup _{{\mathcal {R}}\in {\mathfrak {P}}}L\left( {\mathcal {R}}\right) L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) }{\sup _{{\mathcal {R}}\in {\mathfrak {P}}}L\left( {\mathcal {R}}\right) L\left( \overline{{\mathcal {H}}}\vert {\mathcal {R}}\right) }. \end{aligned}$$
(15)

Proof

The claims follow directly from \(U\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =L\left( \overline{{\mathcal {H}}}\vert {\mathcal {R}}\right)\) and from Eqs. (2), (10), and (12). \(\square\)

Example 4

Examples 1, 2, and 3, continued. The conditional weight of evidence substantiating the big bang theory, given that the truth is between the big bang theory and the steady state theory, is

$$\begin{aligned} W\left( \left\{ 0\right\} \vert \left[ 0,1\right] \right) =W\left( \left\{ 0\right\} ;\left\{ \theta :\theta \ne 0\right\} \vert \left[ 0,1\right] \right) =2^{4} \end{aligned}$$

by Eq. (13). In the same way, the conditional weight of evidence substantiating the truth of a theory between the big bang theory and the steady state theory, given the truth of a theory between the big bang theory and the steady state theory, is

$$\begin{aligned} W\left( \left[ 0,1\right] \vert \left[ 0,1\right] \right) =W\left( \left[ 0,1\right] ;\left\{ \theta :\theta \notin \left[ 0,1\right] \right\} \vert \left[ 0,1\right] \right) =\frac{L\left( \left[ 0,1\right] \vert \left[ 0,1\right] \right) }{L\left( \left\{ \theta :\theta \notin \left[ 0,1\right] \right\} \vert \left[ 0,1\right] \right) }=\frac{1}{0}=\infty \end{aligned}$$

by Eqs. (6, 7). \({{\,\mathrm{\,\blacktriangle }\,}}\)

Equation (14) (Bickel and Rahal 2019) indicates that \(W\left( {\mathcal {H}}\vert {\mathcal {R}}\right)\) is a coherent measure of evidence in the sense to be defined in Sect. 3. As will be seen, that property supports calling \(W\left( {\mathcal {H}}\vert {\mathcal {R}}\right)\) the weight of evidence.

2.5 Likeliness and unlikeliness from the weight of evidence

While the weight of evidence is the ratio of the likeliness to the unlikeliness, as in Eq. (14), it is convenient in some applications to derive the likeliness and unlikeliness from the weight of evidence.

Lemma 1

Given \({\mathcal {H}},{\mathcal {R}}\in {\mathfrak {H}}\) such that \(L\left( {\mathcal {R}}\right) >0\), it follows that \(L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =1\) and \(U\left( {\mathcal {H}}\vert {\mathcal {R}}\right) = \frac{1}{W\left( {\mathcal {H}}\vert {\mathcal {R}}\right) }\) if \(W\left( {\mathcal {H}}\vert {\mathcal {R}}\right) \ge 1\) but that \(L\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =W\left( {\mathcal {H}}\vert {\mathcal {R}}\right)\) and \(U\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =1\) if \(W\left( {\mathcal {H}}\vert {\mathcal {R}}\right) <1\).

Bickel and Rahal (2019) prove the result and relate it to the theory of ranking functions treated in Spohn (2012, §5.2).
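
In computational terms, Lemma 1 says that the pair of likeliness and unlikeliness can be read off the single number \(W\left( {\mathcal {H}}\vert {\mathcal {R}}\right)\); the helper name in the sketch below is an assumption of this illustration.

```python
# Lemma 1 in code: recovering the conditional likeliness and
# unlikeliness from the weight of evidence.
def likeliness_unlikeliness(w):
    return (1.0, 1.0 / w) if w >= 1 else (w, 1.0)

print(likeliness_unlikeliness(16.0))   # (1.0, 0.0625), matching Example 2
```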

3 Derivation from coherence and Bayes compatibility

3.1 Theory of coherence and Bayes compatibility

Let P stand for a probability measure on \(\left( \Theta \times {\mathcal {X}},{\mathfrak {H}}\otimes {\mathfrak {X}}\right)\), where \({\mathfrak {X}}\) is a \(\sigma\)-algebra of subsets of \({\mathcal {X}}\), and \({\mathfrak {H}}\otimes {\mathfrak {X}}\) is the smallest \(\sigma\)-algebra that contains \({\mathfrak {H}}\times {\mathfrak {X}}\). Consider a random parameter \(\vartheta\) of prior distribution \(P_{0}=P\left( \bullet \times {\mathcal {X}}\right)\) on \(\left( \Theta ,{\mathfrak {H}}\right)\) such that the posterior probability that \(\vartheta \in {\mathcal {H}}\) is

$$\begin{aligned} P\left( \vartheta \in {\mathcal {H}}\vert x\right) =\frac{P_{0}\left( \vartheta \in {\mathcal {H}}\right) \int _{{\mathcal {H}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert {\mathcal {H}}\right) }{\int f_{\theta }\left( x\right) dP_{0}\left( \theta \right) }. \end{aligned}$$

The posterior probability is regarded as a function of P: if Q were instead the joint distribution on \(\left( \Theta \times {\mathcal {X}},{\mathfrak {H}}\otimes {\mathfrak {X}}\right)\), the posterior probability would be \(Q\left( \vartheta \in {\mathcal {H}}\vert x\right)\), with Q in place of P and \(Q_{0}=Q\left( \bullet \times {\mathcal {X}}\right)\) in place of \(P_{0}\). The increase in the odds ratio due to the observation that \(X=x\) in favor of the hypothesis that \(\theta \in {\mathcal {H}}\) given \(\theta \in {\mathcal {R}}\) is the ratio of the conditional posterior odds to the conditional prior odds:

$$\begin{aligned} \Delta \left( {\mathcal {H}};P\vert {\mathcal {R}}\right) =\frac{P\left( \vartheta \in {\mathcal {H}}\vert x,{\mathcal {R}}\right) /P\left( \vartheta \notin {\mathcal {H}}\vert x,{\mathcal {R}}\right) }{P_{0}\left( \vartheta \in {\mathcal {H}}\vert {\mathcal {R}}\right) /P_{0}\left( \vartheta \notin {\mathcal {H}}\vert {\mathcal {R}}\right) }=\frac{\int _{{\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert {\mathcal {H}},{\mathcal {R}}\right) }{\int _{{\mathcal {R}}\backslash {\mathcal {H}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert \overline{{\mathcal {H}}},{\mathcal {R}}\right) }. \end{aligned}$$
(16)

The conditional Bayes factor \(B\left( {\mathcal {H}}\vert {\mathcal {R}}\right)\), as a function of \({\mathcal {H}}\) and \({\mathcal {R}}\), is defined such that \(B\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\Delta \left( {\mathcal {H}};P\vert {\mathcal {R}}\right)\) for some fixed probability measure P on \(\left( \Theta \times {\mathcal {X}},{\mathfrak {H}}\otimes {\mathfrak {X}}\right)\).

The requirement of Edwards (1992) that a measure of support for one hypothesis over another be compatible with Bayes’s theorem is generalized to composite hypotheses by the following definition, which differs from the generalization that often forbids accepting a hypothesis of sufficiently high weight of evidence (Bickel 2013a). Any function \(u:{\mathfrak {H}}^{2}\rightarrow \left[ 0,\infty \right]\) measures the odds ratio increase due to the observation that \(X=x\) if, for every \({\mathcal {H}}\in {\mathfrak {H}}\), there is a probability measure \(P_{{\mathcal {H}}}\) on \(\left( \Theta \times {\mathcal {X}},{\mathfrak {H}}\otimes {\mathfrak {X}}\right)\) such that

$$\begin{aligned} u\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\Delta \left( {\mathcal {H}};P_{{\mathcal {H}}} \vert {\mathcal {R}}\right) \end{aligned}$$
(17)

for all \({\mathcal {R}}\in {\mathfrak {H}}\) satisfying \(L\left( {\mathcal {R}}\right) >0\). Functions that measure the odds ratio increase quantify the relevancy of the body of evidence to whether or not a hypothesis is true.

On the other hand, a property of a measure of the sufficiency of the body of evidence for concluding that a hypothesis is true is the avoidance of asserting that contradictory statements are individually supported by the evidence (Schervish 1996; Lavine and Schervish 1999; Zhang and Zhang 2013a). More generally, the functions \(v:{\mathfrak {H}}\rightarrow \left[ 0,\infty \right]\) and \(v:{\mathfrak {H}}^{2}\rightarrow \left[ 0,\infty \right]\) are logically coherent if

$$\begin{aligned} v\left( {\mathcal {H}}_{0}\right) \le v\left( {\mathcal {H}}_{1}\right) \iff \left( \theta \in {\mathcal {H}}_{0}\implies \theta \in {\mathcal {H}}_{1}\right) \end{aligned}$$
(18)
$$\begin{aligned} v\left( {\mathcal {H}}_{0}\vert {\mathcal {R}}\right) \le v\left( {\mathcal {H}}_{1}\vert {\mathcal {R}}\right) \iff \left( \theta \in {\mathcal {H}}_{0}\cap {\mathcal {R}}\implies \theta \in {\mathcal {H}}_{1}\cap {\mathcal {R}}\right) \end{aligned}$$
(19)

for all \({\mathcal {H}}_{0},{\mathcal {H}}_{1}\in {\mathfrak {H}}\) and all \({\mathcal {R}}\in {\mathfrak {H}}\). The main Bayesian logically coherent measure is the posterior probability. A frequentist logically coherent measure is the compatibility or c value, a generalization of the p value (Bickel and Patriota 2019).

In short, whereas the odds ratio increase quantifies the relevancy of the evidence, logical coherence is a minimal requirement of a measure of the sufficiency of evidence. Putting them together leads to the following definition and theorem.

Definition 1

A function \(w:{\mathfrak {H}}^{2}\rightarrow \left[ 0,\infty \right]\) measures the relevancy and sufficiency of the evidence if it both measures the odds ratio increase and is logically coherent.

Theorem 2

If \({\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}= \arg \sup _{\theta \in {\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right)\) and \({\widehat{\theta }}_{{\mathcal {R}}\backslash {\mathcal {H}}}=\arg \sup _{\theta \in {\mathcal {R}}\backslash {\mathcal {H}}}f_{\theta }\left( x\right)\) are unique, then a function \(w:{\mathfrak {H}}^{2}\rightarrow \left[ 0,\infty \right]\) measures the relevancy and sufficiency of the evidence if and only if it is the weight of evidence function \(W\left( \bullet \vert \bullet \right)\).

Proof

\(( \Leftarrow )\). The following statements apply for all \({\mathcal {H}},{\mathcal {H}}_{0},{\mathcal {H}}_{1}\in {\mathfrak {H}}\) and all \({\mathcal {R}}\in {\mathfrak {H}}\). Let \(\delta \left( \bullet ;{\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\right)\) and \(\delta \left( \bullet ;{\widehat{\theta }}_{{\mathcal {R}}\backslash {\mathcal {H}}}\right)\) denote the Dirac probability measures on \(\left( \Theta ,{\mathfrak {H}}\right)\) with mass at \({\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\) and \({\widehat{\theta }}_{{\mathcal {R}}\backslash {\mathcal {H}}}\), respectively. By Eq. (14),

$$\begin{aligned} W\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\frac{\int f_{\theta }\left( x\right) d\delta \left( \theta ;{\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\right) }{\int f_{\theta }\left( x\right) d\delta \left( \theta ;{\widehat{\theta }}_{{\mathcal {R}} \backslash {\mathcal {H}}}\right) }. \end{aligned}$$
(20)

There is a probability measure P on \(\left( \Theta \times {\mathcal {X}},{\mathfrak {H}}\otimes {\mathfrak {X}}\right)\) such that \(P_{0}\left( \bullet \vert {\mathcal {H}},{\mathcal {R}}\right) =\delta \left( \bullet ;{\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\right)\) and \(P_{0}\left( \bullet \vert \overline{{\mathcal {H}}},{\mathcal {R}}\right) =\delta \left( \bullet ;{\widehat{\theta }}_{{\mathcal {R}}\backslash {\mathcal {H}}}\right)\), in which case Eqs. (16, 20) imply that \(W\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\Delta \left( {\mathcal {H}};P\vert {\mathcal {R}}\right)\). Thus, \(W\left( \bullet \vert \bullet \right)\) measures the odds ratio increase. The fact that \(W\left( {\mathcal {H}}_{0}\vert {\mathcal {R}}\right) \le W\left( {\mathcal {H}}_{1}\vert {\mathcal {R}}\right)\) if and only if \({\mathcal {H}}_{0}\subseteq {\mathcal {H}}_{1}\) demonstrates Eq. (19). Therefore, \(W\left( \bullet \vert \bullet \right)\) is logically coherent. Thus, both criteria of Definition 1 are satisfied.

\(\left( \implies \right)\). Let \(w:{\mathfrak {H}}^{2}\rightarrow \left[ 0,\infty \right]\) denote a function that measures the relevancy and sufficiency of the evidence. By Definition 1, w both measures the odds ratio increase and is logically coherent. Assume, contrary to the \(w=W\) claim and Eq. (14), that there are \({\mathcal {H}}\in {\mathfrak {H}}\) and \({\mathcal {R}}\in {\mathfrak {H}}\) such that

$$\begin{aligned} w\left( {\mathcal {H}}\vert {\mathcal {R}}\right) \ne \frac{\sup _{\theta \in {\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right) }{\sup _{\theta \in {\mathcal {R}}\backslash {\mathcal {H}}}f_{\theta }\left( x\right) } \end{aligned}$$
(21)

in order to prove the claim by contradiction. Since w is logically coherent, substituting \(v=w\) into Eqs. (16), (17), and (19) yields

$$\begin{aligned} \int _{{\mathcal {H}}_{0}\cap {\mathcal {R}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert {\mathcal {H}}_{0},{\mathcal {R}}\right) \le \int _{{\mathcal {H}}_{1}\cap {\mathcal {R}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert {\mathcal {H}}_{1},{\mathcal {R}}\right) \iff {\mathcal {H}}_{0}\subseteq {\mathcal {H}}_{1}. \end{aligned}$$

Since \(\left\{ {\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\right\} \subseteq {\mathcal {H}}\cap {\mathcal {R}}\),

$$\begin{aligned} \int _{{\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert {\mathcal {H}},{\mathcal {R}}\right)&\ge \int _{\left\{ {\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\right\} }f_{\theta }\left( x\right) dP_{0}\left( \theta \vert \left\{ {\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\right\} ,{\mathcal {R}}\right) \\&=\int f_{\theta }\left( x\right) d\delta \left( \theta ;{\widehat{\theta }}_{{\mathcal {H}}\cap {\mathcal {R}}}\right) =\sup _{\theta \in {\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right) , \end{aligned}$$

but that requires that \(\int _{{\mathcal {H}}\cap {\mathcal {R}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert {\mathcal {H}},{\mathcal {R}}\right) =\sup _{\theta \in {\mathcal {H}} \cap {\mathcal {R}}}f_{\theta }\left( x\right)\) (cf. Coletti et al. 2009). Analogous reasoning leads to \(\int _{{\mathcal {R}}\backslash {\mathcal {H}}}f_{\theta }\left( x\right) dP_{0}\left( \theta \vert \overline{{\mathcal {H}}},{\mathcal {R}}\right) =\sup _{\theta \in {\mathcal {R}}\backslash {\mathcal {H}}}f_{\theta }\left( x\right)\). Thus, Eqs. (16), (17), and (19) establish Eq. (14), contradicting Eq. (21) and thereby proving the \(w=W\) claim. \(\square\)

Theorem 2 says the weight of evidence uniquely measures both the relevancy and the sufficiency of the evidence. That raises questions about the senses in which the posterior probability and the Bayes factor fall short as measures of the relevancy and sufficiency of the evidence. Lavine and Schervish (1999) demonstrated that the posterior probability, but not the Bayes factor, is coherent as a measure of evidence. The following corollaries restate those results and record whether each quantity measures the odds ratio increase.

Corollary 2

Given any probability measure P on \(\left( \Theta \times {\mathcal {X}},{\mathfrak {H}}\otimes {\mathfrak {X}}\right)\) satisfying the above conditions, the conditional Bayes factor function B measures the odds ratio increase but is not necessarily logically coherent.

Proof

By the definition of the conditional Bayes factor, \(B\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =\Delta \left( {\mathcal {H}};P\vert {\mathcal {R}}\right)\). Thus, \(u=B\) yields Eq. (17), establishing the first claim. The second claim is established by noting that, according to Theorem 2, B is only logically coherent in the special case that \(B\left( {\mathcal {H}}\vert {\mathcal {R}}\right) =W\left( {\mathcal {H}}\vert {\mathcal {R}}\right)\) for all \({\mathcal {H}}\in {\mathfrak {H}}\) and all \({\mathcal {R}}\in {\mathfrak {H}}\). \(\square\)

The next corollary uses the posterior odds function, the function \({{\,\mathrm{odds}\,}}\left( \bullet \vert x,\bullet \right)\) defined on \(\mathcal {{\mathfrak {H}}}^{2}\) such that \({{\,\mathrm{odds}\,}}\left( {\mathcal {H}}\vert x,{\mathcal {R}}\right) =P\left( \vartheta \in {\mathcal {H}}\vert x,{\mathcal {R}}\right) /P\left( \vartheta \notin {\mathcal {H}}\vert x,{\mathcal {R}}\right)\) for all \({\mathcal {H}},{\mathcal {R}}\in {\mathfrak {H}}\).

Corollary 3

Given any probability measure P on \(\left( \Theta \times {\mathcal {X}},{\mathfrak {H}}\otimes {\mathfrak {X}}\right)\) satisfying the above conditions, the posterior odds function is logically coherent but does not necessarily measure the odds ratio increase.

Proof

Consider an \({\mathcal {R}}\in {\mathfrak {H}}\) that satisfies \(L\left( {\mathcal {R}}\right) >0\) and \({\mathcal {H}}_{0},{\mathcal {H}}_{1}\in {\mathfrak {H}}\) such that \({\mathcal {H}}_{0}\subseteq {\mathcal {H}}_{1}\). By the additivity of probability measures, \(P\left( \vartheta \in {\mathcal {H}}_{0}\vert x,{\mathcal {R}}\right) \le P\left( \vartheta \in {\mathcal {H}}_{1}\vert x,{\mathcal {R}}\right)\), from which \({{\,\mathrm{odds}\,}}\left( {\mathcal {H}}_{0}\vert x,{\mathcal {R}}\right) \le {{\,\mathrm{odds}\,}}\left( {\mathcal {H}}_{1}\vert x,{\mathcal {R}}\right)\) follows. Likewise, any \({\mathcal {H}}_{0},{\mathcal {H}}_{1}\in {\mathfrak {H}}\) such that \({{\,\mathrm{odds}\,}}\left( {\mathcal {H}}_{0}\vert x,{\mathcal {R}}\right) \le {{\,\mathrm{odds}\,}}\left( {\mathcal {H}}_{1}\vert x,{\mathcal {R}}\right)\) are related by \({\mathcal {H}}_{0}\subseteq {\mathcal {H}}_{1}\). Thus, \(v={{\,\mathrm{odds}\,}}\left( \bullet \vert x,\bullet \right)\) yields Eq. (19), establishing the first claim. The second claim is established by noting that, according to Theorem 2, \({{\,\mathrm{odds}\,}}\left( \bullet \vert x,\bullet \right)\) only measures the odds ratio increase in the special case that \({{\,\mathrm{odds}\,}}\left( {\mathcal {H}}\vert x,{\mathcal {R}}\right) =W\left( {\mathcal {H}}\vert {\mathcal {R}}\right)\) for all \({\mathcal {H}}\in {\mathfrak {H}}\) and all \({\mathcal {R}}\in {\mathfrak {H}}\) satisfying \(L\left( {\mathcal {R}}\right) >0\). \(\square\)

Lavine and Schervish (1999) likewise argued that the posterior probability is coherent as a measure of evidence.

3.2 Examples of coherence and Bayes compatibility

The (counter)examples of this section build on Examples 1, 2, 3, and 4.

Example 5

Let \(P_{0}\) denote Lebesgue measure, the (improper) uniform prior distribution on the real line. The conditional Bayes factor for the big bang theory, given the truth of a theory between the big bang theory and the steady state theory, is

$$\begin{aligned} B\left( \left\{ 0\right\} \vert \left[ 0,1\right] \right) =\Delta \left( \left\{ 0\right\} ;P\vert \left[ 0,1\right] \right) =\frac{f_{0}\left( x\right) }{\int _{0}^{1}f_{\theta }\left( x\right) d\theta }=\frac{2^{-3}}{2^{-7}}=2^{4}, \end{aligned}$$

as per Eq. (16). However, the conditional Bayes factor for the truth of a theory between the big bang theory and the steady state theory, given the truth of a theory between the big bang theory and the steady state theory, is

$$\begin{aligned} B\left( \left[ 0,1\right] \vert \left[ 0,1\right] \right) =\Delta \left( \left[ 0,1\right] ;P\vert \left[ 0,1\right] \right) =\frac{\int _{0}^{1}f_{\theta }\left( x\right) d\theta }{\int _{0}^{1}f_{\theta }\left( x\right) d\theta }=\frac{2^{-7}}{2^{-7}}=1. \end{aligned}$$

Even though \(\theta \in {\left[ 0,1\right] }\) is a consequence of \(\theta \in {\left\{ 0\right\} }\), the former hypothesis has a smaller conditional Bayes factor, in violation of logical coherence. This counterexample illustrates Corollary 2. \({{\,\mathrm{\,\blacktriangle }\,}}\)
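
The two conditional Bayes factors of Example 5 can be checked with a Riemann sum; the discretization below is an assumption of this illustration, and the hypothesis \(\left\{ 0\right\}\) contributes nothing to the integral because it has prior measure zero.

```python
# A numeric check of Example 5 under the uniform prior on [0, 1].
f = lambda t: 2.0 ** -3 if t == 0 else (2.0 ** -7 if 0 < t <= 1 else 0.0)
n = 100000
mid = [(i + 0.5) / n for i in range(n)]   # midpoints of (0, 1)
integral = sum(f(t) for t in mid) / n     # int_0^1 f d(theta) = 2**-7

B_point = f(0) / integral                 # B({0} | [0, 1]) = 2**4
B_whole = integral / integral             # B([0, 1] | [0, 1]) = 1
print(B_point, B_whole)                   # 16.0 1.0: coherence is violated
```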

Example 6

Under the uniform prior of Example 5, the conditional posterior odds for the big bang theory, given the truth of a theory between the big bang theory and the steady state theory, is

$$\begin{aligned} {{\,\mathrm{odds}\,}}\left( \left\{ 0\right\} \vert \left[ 0,1\right] \right) =\frac{P\left( \vartheta =0\vert x,\left[ 0,1\right] \right) }{P\left( \vartheta \ne 0\vert x,\left[ 0,1\right] \right) }=0. \end{aligned}$$

The odds vanish because the prior, and hence the posterior, assigns zero probability to the singleton \(\left\{ 0\right\}\). If the posterior odds measured the odds ratio increase, some probability measure P would satisfy \(\Delta \left( \left\{ 0\right\} ;P\vert \left[ 0,1\right] \right) =0\). But for any P, the odds ratio increase of Eq. (16) satisfies \(\Delta \left( \left\{ 0\right\} ;P\vert \left[ 0,1\right] \right) \ge 1\) since \(f_{0}\left( x\right) \ge f_{\theta }\left( x\right)\) for all \(\theta \in \left[ 0,1\right]\). It follows that the posterior odds function does not necessarily measure the odds ratio increase. That counterexample illustrates Corollary 3. \({{\,\mathrm{\,\blacktriangle }\,}}\)

Example 7

From Example 4, we see that \(W\left( \left[ 0,1\right] \vert \left[ 0,1\right] \right) >W\left( \left\{ 0\right\} \vert \left[ 0,1\right] \right)\), which illustrates how the conditional weight of evidence satisfies logical coherence, in contrast with Example 5. In addition, since \(W\left( \left[ 0,1\right] \vert \left[ 0,1\right] \right)\) and \(W\left( \left\{ 0\right\} \vert \left[ 0,1\right] \right)\) are likelihood ratios, they measure the odds ratio increase, in contrast with Example 6. Those properties hold not only in this example but also for all weights of evidence (Theorem 2). \({{\,\mathrm{\,\blacktriangle }\,}}\)

4 Discussion

Recall the distinction that Sect. 1 made between the sufficiency of the evidence and the relevancy of the evidence. It is often assumed that measures of the sufficiency of the evidence and measures of the relevancy of the evidence are mutually exclusive. While the measures of evidence commonly used in practice do fall into one category or the other, the weight of evidence defined in Sect. 2 falls into both.

That claim is made precise and strengthened as follows. While the Bayes factor measures the odds ratio increase and the posterior probability is logically coherent, the weight of evidence is the only quantity with both properties in the sense of Sect. 3.