17.1 Introduction

As we have seen in previous chapters, the likelihood plays a central role in both frequentist and Bayesian methods. In this chapter we explore the use of the likelihood function in another context: as a self-contained method of statistical inference. Richard Royall, in his book Statistical Evidence: A Likelihood Paradigm, carefully developed the foundation for this method, building on the work of Ian Hacking and Anthony Edwards. Royall lists three questions of interest to statisticians and scientists after having observed some data:

  1. What do I do?

  2. What do I believe?

  3. What evidence do I now have?

In the context of the usual parametric statistical model, where we have an observed value \(\boldsymbol{x}_{obs}\) of a random quantity \(\boldsymbol{X}\) with sample space \(\mathcal{X}\), parameter space \(\Theta \), and probability density function \(f(\boldsymbol{x}_{obs};\theta )\) at the observed value \(\boldsymbol{x}_{obs}\), the first question is a decision-theoretic problem concerning the actions to be taken on the basis of the model and the observed data, and the second concerns what I believe about θ given the observed data and, presumably, some prior knowledge about θ. The third question concerns characterizing what evidence the data have provided about θ and requires no actions or beliefs. It is simply a question of “what do the data say” (about θ).

We have already stated the Law of Likelihood:

Axiom 17.1.1.

(Law of Likelihood). For two parameter values, θ 1 and θ 0 , in the model \((\mathcal{X},f(\boldsymbol{x};\theta ),\Theta )\) , the magnitude of the likelihood ratio

$$\displaystyle{\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) = \frac{f(\boldsymbol{x}_{obs};\theta _{1})} {f(\boldsymbol{x}_{obs};\theta _{0})}}$$

measures the statistical evidence for θ 1 vs θ 0 . If the ratio is greater than 1 we have statistical evidence for θ 1 vs θ 0 while if less than 1 we have statistical evidence for θ 0 vs θ 1.

I have used the term statistical evidence so as to not conflict with the use of the word evidence in other contexts, e.g., in P-values. We say the statistical evidence for θ 1 vs θ 0 is of strength k > 1 if \(\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) > k\).
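As a concrete illustration (not from the text), the likelihood ratio in the Law of Likelihood can be computed directly for a simple model; in the sketch below the binomial data (7 successes in 10 trials) and the comparison p 1 = 0.7 vs p 0 = 0.5 are hypothetical choices.

```python
from scipy.stats import binom

# Hypothetical data: 7 successes in 10 Bernoulli trials.
x_obs, n = 7, 10
p1, p0 = 0.7, 0.5   # the two parameter values being compared

# Law of Likelihood: the evidence for p1 vs p0 is the ratio of the
# probability of the observed data under each parameter value.
LR = binom.pmf(x_obs, n, p1) / binom.pmf(x_obs, n, p0)
print(f"L(p1, p0; x_obs) = {LR:.2f}")      # > 1, so the data favor p1 over p0

k = 8   # a common benchmark for "strong" evidence
print("evidence of strength k:", LR > k)   # False here: the evidence is not strong
```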

17.2 Misleading Statistical Evidence

Since we are dealing with probability models it is possible to observe a value, \(\boldsymbol{x}_{obs}\), for which \(\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) > k\), and yet θ 0 is true. This is called misleading evidence. The following is called the universal bound and shows that the probability of misleading evidence can be kept small by choice of k.

Theorem 17.2.1.

The probability of misleading evidence is bounded by 1∕k, i.e.,

$$\displaystyle{\mathbb{P}_{\theta _{0}}\left \{ \frac{f(\boldsymbol{X};\theta _{1})} {f(\boldsymbol{X};\theta _{0})} \geq k\right \} \leq \frac{1} {k}}$$

Proof.

Let M be the set

$$\displaystyle{M = \left \{\boldsymbol{x}\::\: \frac{f(\boldsymbol{x};\theta _{1})} {f(\boldsymbol{x};\theta _{0})} \geq k\right \}}$$

Then

$$\displaystyle\begin{array}{rcl} \int _{M}f(\boldsymbol{x};\theta _{0})d\mu (\boldsymbol{x})& \leq & \int _{M}\frac{1} {k}f(\boldsymbol{x};\theta _{1})d\mu (\boldsymbol{x}) {}\\ & \leq & \frac{1} {k}\int _{\mathcal{X}}f(\boldsymbol{x};\theta _{1})d\mu (\boldsymbol{x}) {}\\ & =& \frac{1} {k} {}\\ \end{array}$$
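A quick Monte Carlo check of the universal bound; this is only a sketch in which the normal model with unit variance, the values of θ 0, θ 1 and k, and the number of simulations are all illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta0, theta1, k = 0.0, 1.0, 8
n_sims, n_obs = 100_000, 5

# Simulate data under theta0 and estimate P{ f(X; theta1)/f(X; theta0) >= k }.
x = rng.normal(theta0, 1.0, size=(n_sims, n_obs))
log_lr = norm.logpdf(x, theta1, 1.0).sum(axis=1) - norm.logpdf(x, theta0, 1.0).sum(axis=1)
p_misleading = np.mean(log_lr >= np.log(k))

print(f"estimated P(misleading evidence) = {p_misleading:.4f}")
print(f"universal bound 1/k              = {1 / k:.4f}")
```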

In fact a much stronger result is true. Consider a sequence of observations

$$\displaystyle{\mathbf{X}_{n} = (X_{1},X_{2},\ldots,X_{n})}$$

such that if A is true then X n  ∼ f n and when B is true X n  ∼ g n . The likelihood ratio

$$\displaystyle{\frac{g_{n}(\mathbf{x}_{n})} {f_{n}(\mathbf{x}_{n})} = z_{n}}$$

is the likelihood ratio in favor of B after n observations, with Z n denoting the corresponding random variable. Then we have the following theorem.

Theorem 17.2.2.

If A is true then

$$\displaystyle{P_{A}(Z_{n} \geq k\;\;\mbox{ for some $n = 1,2,\ldots $}) \leq \frac{1} {k}}$$

Robbins [41]

In many circumstances the universal bound is far too conservative. Consider the situation where we observe X 1, X 2, …, X n, where the X i are iid N(μ, σ 2) and, for simplicity, σ 2 is assumed known. The joint density is given by

$$\displaystyle{f(\mathbf{x};\mu ) =\prod _{ i=1}^{n}(2\pi \sigma ^{2})^{-\frac{1} {2} }\exp \left \{-\frac{(x_{i}-\mu )^{2}} {2\sigma ^{2}} \right \}}$$

After some algebraic simplification the likelihood ratio for comparing μ 1 vs μ 0 is given by

$$\displaystyle{\exp \left \{\left (\bar{x} -\frac{\mu _{0} +\mu _{1}} {2} \right )\frac{n(\mu _{1} -\mu _{0})} {\sigma ^{2}} \right \}}$$

It follows that the likelihood ratio exceeds k if and only if

$$\displaystyle{\frac{n(\mu _{1} -\mu _{0})} {\sigma ^{2}} \left (\bar{x} -\frac{\mu _{1} +\mu _{0}} {2} \right ) \geq \ln (k)}$$

Thus, assuming without loss of generality that μ 1 − μ 0 > 0, the likelihood ratio exceeds k if and only if

$$\displaystyle{\bar{x} \geq \frac{\mu _{1} +\mu _{0}} {2} + \frac{\sigma ^{2}} {n(\mu _{1} -\mu _{0})}\:\ln (k)}$$
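This equivalence between the likelihood ratio and a threshold on \(\bar{x}\) is easy to verify numerically; in the sketch below the values of μ 0, μ 1, σ, n, k and the simulated data are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu0, mu1, sigma, n, k = 0.0, 0.5, 1.0, 25, 8

x = rng.normal(mu0, sigma, size=n)
xbar = x.mean()

# Log likelihood ratio for mu1 vs mu0, from the joint density and from the
# closed-form expression in terms of xbar; the two agree.
log_lr_direct = norm.logpdf(x, mu1, sigma).sum() - norm.logpdf(x, mu0, sigma).sum()
log_lr_closed = (xbar - (mu0 + mu1) / 2) * n * (mu1 - mu0) / sigma**2
print(np.isclose(log_lr_direct, log_lr_closed))                   # True

# The likelihood ratio exceeds k exactly when xbar exceeds this threshold
# (here mu1 - mu0 > 0).
threshold = (mu1 + mu0) / 2 + sigma**2 * np.log(k) / (n * (mu1 - mu0))
print((log_lr_direct >= np.log(k)) == (xbar >= threshold))        # True
```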

Thus the probability of misleading statistical evidence when H 0:  μ = μ 0 is true is given by

$$\displaystyle\begin{array}{rcl} \mbox{ PMLEV}_{1}& =& P_{\mu =\mu _{0}}\left \{ \frac{f(\boldsymbol{X};\mu _{1})} {f(\boldsymbol{X};\mu _{0})} \geq k\right \} {}\\ & =& P_{\mu =\mu _{0}}\left \{\overline{X} \geq \frac{\mu _{1} +\mu _{0}} {2} + \frac{\sigma ^{2}} {n(\mu _{1} -\mu _{0})}\:\ln (k)\right \} {}\\ & =& P_{\mu =\mu _{0}}\left \{ \frac{\sqrt{n}(\overline{X} -\mu _{0})} {\sigma } \geq \frac{\sqrt{n}(\mu _{1} -\mu _{0})} {2\sigma } + \frac{\sigma \ln (k)} {\sqrt{n}(\mu _{1} -\mu _{0})}\right \} {}\\ & =& P\left (Z \geq \frac{\sqrt{n}c} {2} + \frac{\ln (k)} {\sqrt{n}c}\right ) {}\\ & =& \Phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right ) {}\\ \end{array}$$

where \(\Phi (z)\) is the standard normal distribution function evaluated at z and

$$\displaystyle{c = \frac{\vert \mu _{1} -\mu _{0}\vert } {\sigma } }$$

If μ 1 − μ 0 < 0, similar calculations show that the probability of misleading evidence when μ 0 is assumed true is given by the same expression. It follows that the probability of misleading evidence when H 0:  μ = μ 0 is true is

$$\displaystyle{\mbox{ PMLEV} = \Phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )}$$

where

$$\displaystyle{c = \frac{\vert \mu _{1} -\mu _{0}\vert } {\sigma } }$$

and \(\Phi \) is the standard normal distribution function. The function

$$\displaystyle{B(c,k,n) = \Phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )}$$

has been called the bump function by Royall.
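The bump function is easy to tabulate; the sketch below evaluates it for a few effect sizes, with k = 8 and several sample sizes (all illustrative choices).

```python
import numpy as np
from scipy.stats import norm

def bump(c, k, n):
    """Royall's bump function B(c, k, n): the probability of misleading
    evidence in the known-variance normal model, as a function of the
    effect size c, the evidence level k, and the sample size n."""
    c = np.asarray(c, dtype=float)
    return norm.cdf(-c * np.sqrt(n) / 2 - np.log(k) / (c * np.sqrt(n)))

effect_sizes = np.array([0.1, 0.5, 1.0, 2.0])   # illustrative values of c
for n in (1, 10, 25):
    print(n, np.round(bump(effect_sizes, k=8, n=n), 4))
```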

Also note that c is often called the effect size in the social science literature and represents the difference between μ 0 and μ 1 in standard deviation units. The following are rules of thumb for judging the magnitude of the effect size:

  • c ≤ 0.1 trivial

  • 0.1 < c ≤ 0.6 small

  • 0.6 < c ≤ 1.2 moderate

  • c > 1.2 large

Note that the derivative with respect to c of the bump function is

$$\displaystyle{\phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )\left (-\frac{\sqrt{n}} {2} + \frac{\ln (k)} {c^{2}\sqrt{n}}\right )}$$

which vanishes when

$$\displaystyle{\frac{\sqrt{n}} {2} = \frac{\ln (k)} {c^{2}\sqrt{n}}}$$

i.e., when

$$\displaystyle{c = \sqrt{\frac{2\ln (k)} {n}} = c^{{\ast}}}$$

The second derivative with respect to c is

$$\displaystyle{\begin{array}{c} -\left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )\phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )\left (-\frac{\sqrt{n}} {2} + \frac{\ln (k)} {c^{2}\sqrt{n}}\right )^{2} \\ +\phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )\left (-\frac{2\ln (k)} {c^{3}\sqrt{n}}\right )\end{array} }$$

which is negative when c = c∗ so that the bump function has a maximum at c = c∗ given by

$$\displaystyle{B(c^{{\ast}},k,n) = \Phi \left (-\frac{c^{{\ast}}\sqrt{n}} {2} - \frac{\ln (k)} {c^{{\ast}}\sqrt{n}}\right ) = \Phi (-\sqrt{2\ln (k)})}$$

It is well known that, for t > 0,

$$\displaystyle{ \frac{t} {1 + t^{2}}\phi (t) \leq \Phi (-t) \leq \frac{1} {t}\phi (t)}$$

so that

$$\displaystyle\begin{array}{rcl} \Phi (-\sqrt{2\ln (k)})& \leq & \frac{1} {\sqrt{2\ln (k)}}\phi (\sqrt{2\ln (k)}) {}\\ & =& \frac{1} {2\sqrt{\pi \ln (k)}}\exp \left \{-(\sqrt{2\ln (k)})^{2}/2\right \} {}\\ & =& \frac{1} {2\sqrt{\pi \ln (k)}}\exp \left \{-\ln (k)\right \} {}\\ & =& \frac{1} {k2\sqrt{\pi \ln (k)}} {}\\ \end{array}$$

which is considerably less than the universal bound of 1∕k.
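The comparison between the maximum probability of misleading evidence, the bound just derived, and the universal bound can be checked numerically; k = 8 in the sketch below is an illustrative choice.

```python
import numpy as np
from scipy.stats import norm

k = 8
max_bump = norm.cdf(-np.sqrt(2 * np.log(k)))             # maximum over c of B(c, k, n)
derived_bound = 1 / (2 * k * np.sqrt(np.pi * np.log(k))) # bound obtained above
universal_bound = 1 / k

print(f"maximum of the bump function = {max_bump:.4f}")
print(f"derived bound                = {derived_bound:.4f}")
print(f"universal bound 1/k          = {universal_bound:.4f}")
```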

17.2.1 Weak Statistical Evidence

Again, since we are dealing with probability models, it is possible to observe a value, \(\boldsymbol{x}_{obs}\), for which

$$\displaystyle{\frac{1} {k} <\boldsymbol{ L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) < k}$$

This is called weak statistical evidence. We have weak evidence in the example of normally distributed observations if and only if

$$\displaystyle{-\ln (k) \leq \left (\bar{x} -\frac{\mu _{1} +\mu _{0}} {2} \right )\frac{n(\mu _{1} -\mu _{0})} {\sigma ^{2}} \leq \ln (k)}$$

If μ 1 − μ 0 > 0 the condition for weak statistical evidence is that \(\overline{x}\) must lie between

$$\displaystyle{\frac{\mu _{1} +\mu _{0}} {2} - \frac{\sigma ^{2}} {n(\mu _{1} -\mu _{0})}\:\ln (k)}$$

and

$$\displaystyle{\frac{\mu _{1} +\mu _{0}} {2} + \frac{\sigma ^{2}} {n(\mu _{1} -\mu _{0})}\:\ln (k)}$$

If we define

$$\displaystyle{g_{n}(\mu _{0},\mu _{1},k) = \frac{\mu _{1} +\mu _{0}} {2} + \frac{\sigma ^{2}} {n(\mu _{1} -\mu _{0})}\:\ln (k)}$$

then the condition for weak evidence becomes

$$\displaystyle{g_{n}(\mu _{0},\mu _{1},1/k) \leq \bar{ x} \leq g_{n}(\mu _{0},\mu _{1},k)}$$

and the probability of weak evidence is given by

$$\displaystyle{\mbox{ Pr}\left \{g_{n}(\mu _{0},\mu _{1},1/k) \leq \bar{ X} \leq g_{n}(\mu _{0},\mu _{1},k)\right \}}$$

which is easily evaluated under H 0 and H 1 since \(\bar{X}\) has an \(\mbox{ N}\left (\mu, \frac{\sigma ^{2}} {n}\right )\) distribution.

Now we note that

$$\displaystyle{\frac{\sqrt{n}} {\sigma } [g_{n}(\mu _{0},\mu _{1},k) -\mu _{0}] = \frac{c\sqrt{n}} {2} + \frac{\ln (k)} {c\sqrt{n}}}$$

It follows that the probability of weak evidence, \(P_{\mu =\mu _{0}}(\mbox{ WEV})\), is given by

$$\displaystyle{\Phi \left (\frac{c\sqrt{n}} {2} + \frac{\ln (k)} {c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )}$$

Similarly

$$\displaystyle{\frac{\sqrt{n}} {\sigma } [g_{n}(\mu _{0},\mu _{1},k) -\mu _{1}] = \frac{-c\sqrt{n}} {2} + \frac{\ln (k)} {c\sqrt{n}}}$$

It follows that the probability of weak evidence \(P_{\mu =\mu _{1}}(\mbox{ WEV})\) is given by

$$\displaystyle{\Phi \left (\frac{c\sqrt{n}} {2} + \frac{\ln (k)} {c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}} {2} - \frac{\ln (k)} {c\sqrt{n}}\right )}$$

There is nothing that requires the level of statistical evidence for μ 1 vs μ 0 to be the same as that for μ 0 vs μ 1. That is, we say we have statistical evidence for μ 1 vs μ 0 of level k 1 if \(\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) > k_{1}\) and statistical evidence for μ 0 vs μ 1 of level k 0 if \(\boldsymbol{L}(\theta _{0},\theta _{1};\boldsymbol{x}_{obs}) > k_{0}\). If we define

$$\displaystyle{W(x,y) = \Phi \left (\frac{c\sqrt{n}} {2} + \frac{\ln (x)} {c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}} {2} - \frac{\ln (y)} {c\sqrt{n}}\right )}$$

then the two probabilities of weak evidence are given by

$$\displaystyle{\begin{array}{c} W_{0} = P_{\mu =\mu _{0}}(\mbox{ WEV}) = W(k_{1},k_{0}) \\ \mbox{ and} \\ W_{1} = P_{\mu =\mu _{1}}(\mbox{ WEV}) = W(k_{0},k_{1}) \end{array} }$$

We then have the following summary of results for the normal distribution example.

  • When H 0 is true the probability of misleading evidence for H 1 at level k 1 defined by (\(\mbox{ L}_{1}/\mbox{ L}_{0} \geq k_{1}\)) is

    $$\displaystyle{M_{0} = \Phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k_{1})} {c\sqrt{n}}\right )}$$
  • When H 0 is true the probability of weak evidence is

    $$\displaystyle\begin{array}{rcl} W_{0}& =& P_{\mu =\mu _{0}}\left ( \frac{1} {k_{0}} \leq \frac{\mbox{ L}_{1}} {\mbox{ L}_{0}} \leq k_{1}\right ) {}\\ & =& \Phi \left (\frac{c\sqrt{n}} {2} + \frac{\ln (k_{1})} {c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}} {2} - \frac{\ln (k_{0})} {c\sqrt{n}}\right ) {}\\ \end{array}$$
  • When H 1 is true the probability of misleading evidence for H 0 at level k 0 defined by (\(\mbox{ L}_{0}/\mbox{ L}_{1} \geq k_{0}\)) is

    $$\displaystyle{M_{1} = \Phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (k_{0})} {c\sqrt{n}}\right )}$$
  • When H 1 is true the probability of weak evidence is

    $$\displaystyle\begin{array}{rcl} W_{1}& =& P_{\mu =\mu _{1}}\left ( \frac{1} {k_{0}} \leq \frac{\mbox{ L}_{1}} {\mbox{ L}_{0}} \leq k_{1}\right ) {}\\ & =& \Phi \left (\frac{c\sqrt{n}} {2} + \frac{\ln (k_{0})} {c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}} {2} - \frac{\ln (k_{1})} {c\sqrt{n}}\right ) {}\\ \end{array}$$
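The four probabilities in this summary can be collected into a small helper; in the sketch below the evidence levels k 0 = k 1 = 8, the effect size c = 0.5, and n = 25 are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def evidence_probs(c, n, k0, k1):
    """Misleading and weak evidence probabilities in the known-variance
    normal model, with evidence level k1 for H1 vs H0 and k0 for H0 vs H1."""
    a = c * np.sqrt(n) / 2
    b0 = np.log(k0) / (c * np.sqrt(n))
    b1 = np.log(k1) / (c * np.sqrt(n))
    M0 = norm.cdf(-a - b1)                    # misleading evidence for H1 when H0 true
    W0 = norm.cdf(a + b1) - norm.cdf(a - b0)  # weak evidence when H0 true
    M1 = norm.cdf(-a - b0)                    # misleading evidence for H0 when H1 true
    W1 = norm.cdf(a + b0) - norm.cdf(a - b1)  # weak evidence when H1 true
    return M0, W0, M1, W1

M0, W0, M1, W1 = evidence_probs(c=0.5, n=25, k0=8, k1=8)
print([round(p, 4) for p in (M0, W0, M1, W1)])
```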

17.2.2 Sample Size

There is no doubt that one of the questions most asked of a statistician is

$$\displaystyle{\mbox{ How many observations do I need?}}$$

Actually the usual question is how many subjects do I need to get a statistically significant result that is publishable? This question is easily answered, so let us consider refining the question.

Suppose that we will observe X 1, X 2, …, X n, assumed independent and identically distributed as normal with mean μ and variance σ 2, the variance assumed known (usually based on past work with similar instruments, and so on). Of interest is a (null) hypothesis H 0:  μ = μ 0 and an alternative H 1:  μ = μ 1 where, without loss of generality, we assume that μ 1 > μ 0. It is assumed that μ 1 represents a value of μ which is of scientific importance, i.e., if μ 1 is true then a result of scientific or practical importance has been discovered.

The Neyman–Pearson theory has been the default method for determining sample size for decades. It is required in submitting grants to NIH, NSF, FDA, etc., as well as in reporting the results of published studies and dissertations. The Neyman–Pearson approach to sample size selection is as follows:

  1. Choose a value α for the significance level (usually α = 0.05).

  2. Choose a value 1 − β for the power (usually β = 0.20 so that the power is 0.8).

  3. Select the sample size n so that

    $$\displaystyle{\begin{array}{c} P(\mbox{ Type I error}) = P(\mbox{ reject $H_{0}$}\vert H_{0}\;\mbox{ true}) =\alpha \\ 1 - P(\mbox{ Type II error}) = P(\mbox{ reject $H_{0}$}\vert H_{1}\;\mbox{ true}) = 1-\beta \end{array} }$$

In the case of a normal distribution with known variance the test rejects H 0 when \(\overline{X} \geq C\), so that

$$\displaystyle\begin{array}{rcl} P(\mbox{ Type I error})& =& P(\overline{X} \geq C\vert \mu =\mu _{0}) {}\\ & =& P\left (\frac{\sqrt{n}(\overline{X} -\mu _{0})} {\sigma } \geq \frac{\sqrt{n}(C -\mu _{0})} {\sigma } \Bigg\vert \mu =\mu _{0}\right ) {}\\ & =& 1 - \Phi \left (\frac{\sqrt{n}(C -\mu _{0})} {\sigma } \right ) {}\\ \end{array}$$

and it follows that

$$\displaystyle{C =\mu _{0} + z_{1-\alpha } \frac{\sigma } {\sqrt{n}} =\mu _{0} + 1.645 \frac{\sigma } {\sqrt{n}}\;\;\mbox{ if $\alpha = 0.05$}}$$

and

$$\displaystyle\begin{array}{rcl} \mbox{ Power}& =& P(\overline{X} \geq C\vert \mu =\mu _{1}) {}\\ & =& P\left (\overline{X} \geq \mu _{0} + z_{1-\alpha } \frac{\sigma } {\sqrt{n}}\Bigg\vert \mu =\mu _{1}\right ) {}\\ & =& P\left (\frac{\sqrt{n}(\overline{X} -\mu _{1})} {\sigma } \geq z_{1-\alpha } - (\mu _{1} -\mu _{0})\frac{\sqrt{n}} {\sigma } \Bigg\vert \mu =\mu _{1}\right ) {}\\ & =& 1 - \Phi \left (z_{1-\alpha } - (\mu _{1} -\mu _{0})\frac{\sqrt{n}} {\sigma } \right ) {}\\ & =& 1 - \Phi (z_{1-\alpha } - c\sqrt{n}) {}\\ \end{array}$$

In order to have power 1 −β we must have

$$\displaystyle{\Phi \left (z_{1-\alpha } - (\mu _{1} -\mu _{0})\frac{\sqrt{n}} {\sigma } \right ) =\beta }$$

i.e.,

$$\displaystyle{z_{1-\alpha } - (\mu _{1} -\mu _{0})\frac{\sqrt{n}} {\sigma } = z_{\beta }}$$

or

$$\displaystyle{z_{1-\alpha } - z_{\beta } = (\mu _{1} -\mu _{0})\frac{\sqrt{n}} {\sigma } }$$

and, since \(z_{\beta } = -z_{1-\beta }\), it follows that

$$\displaystyle{n = \frac{(z_{1-\alpha } + z_{1-\beta })^{2}} {c^{2}} }$$

where

$$\displaystyle{c = \frac{\mu _{1} -\mu _{0}} {\sigma } }$$

This is the prototype of sample size formulas.
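The prototype formula translates directly into code; the sketch below uses the conventional choices α = 0.05, β = 0.20 and an effect size of 0.5 purely for illustration.

```python
import math
from scipy.stats import norm

def np_sample_size(alpha, beta, c):
    """Neyman-Pearson sample size for the one-sided known-variance normal
    test of mu0 vs mu1, with effect size c = (mu1 - mu0) / sigma."""
    z_alpha = norm.ppf(1 - alpha)   # 1.645 for alpha = 0.05
    z_beta = norm.ppf(1 - beta)     # 0.84 for beta = 0.20
    return math.ceil((z_alpha + z_beta) ** 2 / c ** 2)

print(np_sample_size(alpha=0.05, beta=0.20, c=0.5))   # about 25
```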

The Neyman–Pearson approach is inadequate when we want to quantify statistical evidence for H 1 vs H 0, so we now consider the selection of the sample size necessary to quantify statistical evidence. Recall that there are four probabilities involved:

  1. The probability of misleading statistical evidence for H 1 when H 0 is true

  2. The probability of misleading statistical evidence for H 0 when H 1 is true

  3. The probability of weak statistical evidence when H 0 is true

  4. The probability of weak statistical evidence when H 1 is true

The analogue of the Type I error probability is the probability of finding misleading evidence for H 1 when H 0 is true. For the normal distribution, taking k 1 = 8, we have the correspondence between α = 0.05 and M 0 given by

$$\displaystyle{\alpha = 0.05\;\;\;M_{0} = \Phi \left (-\frac{c\sqrt{n}} {2} - \frac{\ln (8)} {c\sqrt{n}}\right )}$$

and if we take c = 0. 5, a moderate effect size, we have the correspondence

$$\displaystyle{\alpha = 0.05\;\;\;M_{0} = \Phi \left (-\frac{\sqrt{n}} {4} -\frac{2\ln (8)} {\sqrt{n}} \right )}$$

For the analogue to the Type II error we must be more careful. The probability of failing to find evidence supporting H 1 when H 1 is true is composed of two parts:

  1. The probability of misleading evidence in favor of H 0 when H 1 is true

  2. The probability of weak evidence when H 1 is true

For the normal distribution we have the correspondence

$$\displaystyle\begin{array}{rcl} & \beta = P(\mbox{ Type II error}) = \Phi \left (z_{1-\alpha } - (\mu _{1} -\mu _{0})\frac{\sqrt{n}} {\sigma } \right ) & {}\\ & M_{1} + W_{1} = \Phi \left (-\frac{c\sqrt{n}} {2} -\frac{\ln (k_{0})} {c\sqrt{n}}\right ) + \Phi \left (\frac{c\sqrt{n}} {2} + \frac{\ln (k_{0})} {c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}} {2} -\frac{\ln (k_{1})} {c\sqrt{n}}\right )& {}\\ & = 1 - \Phi \left (\frac{c\sqrt{n}} {2} -\frac{\ln (k_{1})} {c\sqrt{n}}\right ) & {}\\ & = \Phi \left (\frac{-c\sqrt{n}} {2} + \frac{\ln (k_{1})} {c\sqrt{n}}\right ) & {}\\ \end{array}$$

and if β = 0.2, c = 0.5 and k 0 = k 1 = 8 we have

$$\displaystyle{\beta = 0.2\;;\;M_{1} + W_{1} = \Phi \left (-\frac{\sqrt{n}} {4} + \frac{2\ln (8)} {\sqrt{n}} \right )}$$

For the Neyman–Pearson sample size formula with α = 0.05, β = 0.20 and c = 0.5 we get a sample size of

$$\displaystyle{n = \frac{(1.645 + 0.84)^{2}} {0.5^{2}} = 25}$$

For this sample size we find that M 1 + W 1 is equal to

$$\displaystyle{M_{1} + W_{1} = \Phi \left (-\frac{\sqrt{25}} {4} + \frac{2\ln (8)} {\sqrt{25}}\right ) = \Phi (-0.418) = 0.34}$$

Thus the conventional sample size formula does not lead to a small probability of failing to find strong evidence for H 1.
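A short numerical check of this point, together with the sample size needed to make the probability of failing to find strong evidence small; in the sketch below k = 8 and c = 0.5 as above, and the 0.05 target is an illustrative choice.

```python
import numpy as np
from scipy.stats import norm

def prob_fail_strong_evidence(c, n, k):
    """M1 + W1: the probability, when H1 is true, of not obtaining strong
    evidence of level k for H1 in the known-variance normal model."""
    return norm.cdf(-c * np.sqrt(n) / 2 + np.log(k) / (c * np.sqrt(n)))

c, k = 0.5, 8
print(round(prob_fail_strong_evidence(c, 25, k), 3))   # about 0.34 at the NP sample size

# Smallest n that brings this probability down to an (illustrative) 0.05 target.
n = 25
while prob_fail_strong_evidence(c, n, k) > 0.05:
    n += 1
print(n)
```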

Exercises in Royall’s book show that this is true in general, i.e., conventional sample size formulas do not guarantee finding strong evidence.

17.3 Birnbaum’s Confidence Concept

Recall Birnbaum’s confidence concept, which he advocated after becoming skeptical of the likelihood principle:

A concept of statistical evidence is not plausible unless it finds “strong evidence” for H 2 as against H 1 with small probability (α) when H 1 is true and with much larger probability (1 −β) when H 2 is true.

What the results in the sample size section show is that it is possible in certain cases to satisfy the confidence concept with sufficient observations.

17.4 Combining Evidence

Suppose that we have two independent estimators, t 1 and t 2 of a parameter θ where t 1 is normal with expected value θ and variance v 1 and t 2 is normal with expected value θ and variance v 2. Assume that v 1 and v 2 are known.

The joint density of t 1 and t 2 is

$$\displaystyle{f(t_{1},t_{2};\theta ) = \frac{1} {2\pi \sqrt{v_{1 } v_{2}}}\exp \left \{-\frac{(t_{1}-\theta )^{2}} {2v_{1}} -\frac{(t_{2}-\theta )^{2}} {2v_{2}} \right \}}$$

which has logarithm

$$\displaystyle{-\mathrm{ln}[2\pi \sqrt{v_{1 } v_{2}}] -\frac{(t_{1}-\theta )^{2}} {2v_{1}} -\frac{(t_{2}-\theta )^{2}} {2v_{2}} }$$

The derivative with respect to θ is thus

$$\displaystyle{\frac{(t_{1}-\theta )} {v_{1}} + \frac{(t_{2}-\theta )} {v_{2}} }$$

and hence the maximum likelihood estimate of θ is

$$\displaystyle{\widehat{\theta }= \frac{ \frac{t_{1}} {v_{1}} + \frac{t_{2}} {v_{2}} } { \frac{1} {v_{1}} + \frac{1} {v_{2}} } }$$

At this value of θ the joint density is

$$\displaystyle{f(t_{1},t_{2};\widehat{\theta }) = \frac{1} {2\pi \sqrt{v_{1 } v_{2}}}\exp \left \{-\frac{(t_{1}-\widehat{\theta })^{2}} {2v_{1}} -\frac{(t_{2}-\widehat{\theta })^{2}} {2v_{2}} \right \}}$$

and hence the likelihood for θ, standardized to have maximum 1, is

$$\displaystyle{\mathcal{L}(\theta;t_{1},t_{2}) = \frac{f(t_{1},t_{2};\theta )} {f(t_{1},t_{2};\widehat{\theta })} = \frac{ \frac{1} {2\pi \sqrt{v_{1 } v_{2}}} \exp \left \{-\frac{(t_{1}-\theta )^{2}} {2v_{1}} -\frac{(t_{2}-\theta )^{2}} {2v_{2}} \right \}} { \frac{1} {2\pi \sqrt{v_{1 } v_{2}}} \exp \left \{-\frac{(t_{1}-\widehat{\theta })^{2}} {2v_{1}} -\frac{(t_{2}-\widehat{\theta })^{2}} {2v_{2}} \right \}} }$$

or

$$\displaystyle\begin{array}{rcl} \mathcal{L}(\theta;t_{1},t_{2})& =& \exp \left \{ \frac{t_{1}\theta } {v_{1}} + \frac{t_{2}\theta } {v_{2}} - \frac{\theta ^{2}} {2v_{1}} - \frac{\theta ^{2}} {2v_{2}} - \frac{t_{1}\widehat{\theta }} {v_{1}} - \frac{t_{2}\widehat{\theta }} {v_{2}} + \frac{\widehat{\theta }^{2}} {2v_{1}} + \frac{\widehat{\theta }^{2}} {2v_{2}}\right \} {}\\ & =& \exp \left \{\theta \left ( \frac{t_{1}} {v_{1}} + \frac{t_{2}} {v_{2}}\right ) -\frac{\theta ^{2}} {2}\left ( \frac{1} {v_{1}} + \frac{1} {v_{2}}\right ) -\widehat{\theta }\left ( \frac{t_{1}} {v_{1}} + \frac{t_{2}} {v_{2}}\right )\right. {}\\ & & +\left.\frac{\widehat{\theta }^{2}} {2}\left ( \frac{1} {v_{1}} + \frac{1} {v_{2}}\right )\right \} {}\\ & =& \exp \left \{\theta \widehat{\theta }\left ( \frac{1} {v_{1}} + \frac{1} {v_{2}}\right ) -\frac{\theta ^{2}} {2}\left ( \frac{1} {v_{1}} + \frac{1} {v_{2}}\right ) -\frac{\widehat{\theta }^{2}} {2}\left ( \frac{1} {v_{1}} + \frac{1} {v_{2}}\right )\right \} {}\\ & =& \exp \left \{-\frac{1} {2}\left ( \frac{1} {v_{1}} + \frac{1} {v_{2}}\right )\left (\theta ^{2} - 2\theta \widehat{\theta } +\widehat{\theta } ^{2}\right )\right \} {}\\ & =& \exp \left \{-\left ( \frac{1} {v_{1}} + \frac{1} {v_{2}}\right )\frac{(\theta -\widehat{\theta })^{2}} {2} \right \} {}\\ & =& \exp \left \{-\frac{(\theta -\widehat{\theta })^{2}} {2v} \right \} {}\\ \end{array}$$

where

$$\displaystyle{v = \frac{1} { \frac{1} {v_{1}} + \frac{1} {v_{2}} } }$$

which is a normal-shaped likelihood centered at \(\widehat{\theta }\) with variance v.

This is a likelihood version of the standard result that unbiased, uncorrelated estimators are combined by weighting each inversely by its variance and dividing by the sum of the weights.

In general, note that if we have observations \(\boldsymbol{x}_{1}\) from f(x; θ) and independent observations \(\boldsymbol{x}_{2}\) from f(x; θ), then the evidence for θ 1 vs θ 0 based on \((\boldsymbol{x}_{1},\boldsymbol{x}_{2})\) is

$$\displaystyle{\frac{f(\boldsymbol{x}_{1},\boldsymbol{x}_{2};\theta _{1})} {f(\boldsymbol{x}_{1},\boldsymbol{x}_{2};\theta _{0})} = \left [\frac{f(\boldsymbol{x}_{1};\theta _{1})} {f(\boldsymbol{x}_{1};\theta _{0})}\right ]\left [\frac{f(\boldsymbol{x}_{2};\theta _{1})} {f(\boldsymbol{x}_{2};\theta _{0})}\right ]}$$

i.e., evidence is multiplicative.
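Both results are easy to illustrate numerically; in the sketch below the two estimates, their variances, and the comparison θ 1 = 1 vs θ 0 = 0 are all hypothetical choices.

```python
import numpy as np
from scipy.stats import norm

# Two hypothetical independent, normally distributed estimates of theta.
t1, v1 = 1.2, 0.50
t2, v2 = 0.8, 0.25

# Inverse-variance (precision-weighted) combination and its variance.
w1, w2 = 1 / v1, 1 / v2
theta_hat = (w1 * t1 + w2 * t2) / (w1 + w2)
v = 1 / (w1 + w2)
print(f"combined estimate = {theta_hat:.3f}, variance = {v:.3f}")

# Evidence for theta1 vs theta0 (arbitrary illustrative values).
theta1, theta0 = 1.0, 0.0

# Multiplicativity: the likelihood ratio from the two estimates together is the
# product of the individual likelihood ratios ...
lr_product = (norm.pdf(t1, theta1, np.sqrt(v1)) / norm.pdf(t1, theta0, np.sqrt(v1))
              * norm.pdf(t2, theta1, np.sqrt(v2)) / norm.pdf(t2, theta0, np.sqrt(v2)))

# ... and it agrees with the ratio of the combined normal likelihood
# exp{-(theta - theta_hat)^2 / (2v)} evaluated at theta1 and theta0.
lr_combined = np.exp(-(theta1 - theta_hat) ** 2 / (2 * v)
                     + (theta0 - theta_hat) ** 2 / (2 * v))

print(lr_product, lr_combined, np.isclose(lr_product, lr_combined))   # equal
```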

17.5 Exercises

  1. Suppose that X 1, X 2, …, X n are iid, each Poisson with parameter λ. Let k = 8 and n = 1, 10, 25. Draw graphs of the probability of misleading evidence for λ 1 = 2 vs λ 0 = 1.

  2. Repeat Exercise 1 for the binomial with n = 10, 25, 100, 1000 and p = 0.6 vs p = 0.5.