17.1 Introduction
As we have seen in previous chapters, the likelihood is important in both frequentist and Bayesian methods. In this chapter we explore the use of the likelihood function in another context: that of providing a self-contained method of statistical inference. Richard Royall, in his book Statistical Evidence: A Likelihood Paradigm, carefully developed the foundation for this method, building on the work of Ian Hacking and Anthony Edwards. Royall lists three questions of interest to statisticians and scientists after having observed some data:
1. What do I do?
2. What do I believe?
3. What evidence do I now have?
In the context of the usual parametric statistical model, where we have an observed value \(\boldsymbol{x}_{obs}\) of a random \(\boldsymbol{X}\) having sample space \(\mathcal{X}\), parameter space \(\Theta \), and probability density function \(f(\boldsymbol{x}_{obs};\theta )\) at the observed value \(\boldsymbol{x}_{obs}\) of \(\boldsymbol{X}\), the first question is a decision-theoretic problem regarding the actions to be taken on the basis of the model and the observed data, and the second concerns what I believe about θ given the observed data and, presumably, some prior knowledge about θ. The third question concerns characterizing what evidence the data have provided us about θ and requires no actions or beliefs. It is simply a question of “what do the data say” (about θ).
We have already stated the Law of Likelihood:
Axiom 17.1.1.
(Law of Likelihood). For two parameter values, θ 1 and θ 0 , in the model \((\mathcal{X},f(\boldsymbol{x};\theta ),\Theta )\) , the magnitude of the likelihood ratio
$$\displaystyle{\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) = \frac{f(\boldsymbol{x}_{obs};\theta _{1})}{f(\boldsymbol{x}_{obs};\theta _{0})}}$$
measures the statistical evidence for θ 1 vs θ 0 . If the ratio is greater than 1 we have statistical evidence for θ 1 vs θ 0 , while if it is less than 1 we have statistical evidence for θ 0 vs θ 1 .
I have used the term statistical evidence so as to not conflict with the use of the word evidence in other contexts, e.g., in P-values. We say the statistical evidence for θ 1 vs θ 0 is of strength k > 1 if \(\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) > k\).
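As a quick numerical illustration of evidence strength (my own sketch; the data and parameter values below are made up, not from the text), the Law of Likelihood can be applied to a binomial observation:

```python
from math import comb

def likelihood_ratio(x, n, p1, p0):
    # L(p1, p0; x): ratio of binomial densities at the observed count x
    f1 = comb(n, x) * p1 ** x * (1 - p1) ** (n - x)
    f0 = comb(n, x) * p0 ** x * (1 - p0) ** (n - x)
    return f1 / f0

# 7 successes in 10 trials: compare p1 = 0.7 vs p0 = 0.5
lr = likelihood_ratio(7, 10, 0.7, 0.5)
print(round(lr, 2))  # ≈ 2.28: evidence for p1 vs p0, but below strength k = 8
```

Here the data favor p 1 (the ratio exceeds 1) yet fall well short of the strength-8 benchmark.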
17.2 Misleading Statistical Evidence
Since we are dealing with probability models it is possible to observe a value, \(\boldsymbol{x}_{obs}\), for which \(\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) > k\), and yet θ 0 is true. This is called misleading evidence. The following is called the universal bound and shows that the probability of misleading evidence can be kept small by choice of k.
Theorem 17.2.1.
The probability of misleading evidence is bounded by 1∕k, i.e.,
$$\displaystyle{P_{\theta _{0}}\left (\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{X}) \geq k\right ) \leq \frac{1}{k}}$$
Proof.
Let M be the set
$$\displaystyle{M =\{ \boldsymbol{x}:\; f(\boldsymbol{x};\theta _{1}) \geq kf(\boldsymbol{x};\theta _{0})\}}$$
Then
$$\displaystyle{P_{\theta _{0}}(M) =\int _{M}f(\boldsymbol{x};\theta _{0})\,d\boldsymbol{x} \leq \frac{1}{k}\int _{M}f(\boldsymbol{x};\theta _{1})\,d\boldsymbol{x} \leq \frac{1}{k}}$$
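A Monte Carlo sketch makes the bound concrete (my own illustration; the values μ 0 = 0, μ 1 = 1, σ = 1, n = 5, k = 8 are arbitrary choices):

```python
import random
from math import log

# Estimate the probability of misleading evidence, P(L >= k) under mu0,
# for iid N(mu, 1) observations; Theorem 17.2.1 bounds this by 1/k.
random.seed(1)
mu0, mu1, n, k, reps = 0.0, 1.0, 5, 8.0, 20000

def log_lr(xs):
    # log L(mu1, mu0; x) for iid N(mu, 1) observations
    return sum((x - mu0) ** 2 / 2 - (x - mu1) ** 2 / 2 for x in xs)

misleading = sum(
    log_lr([random.gauss(mu0, 1) for _ in range(n)]) >= log(k)
    for _ in range(reps)
)
rate = misleading / reps
print(rate, "universal bound:", 1 / k)
```

The observed frequency sits far below 1∕k = 0.125, anticipating the point made below that the universal bound is often very conservative.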
In fact a much stronger result is true. Consider a sequence of observations
$$\displaystyle{X_{1},X_{2},\ldots ,X_{n},\ldots }$$
such that if A is true then X n ∼ f n and when B is true X n ∼ g n . The likelihood ratio
$$\displaystyle{\mbox{ LR}_{n} =\prod _{i=1}^{n}\frac{g_{i}(x_{i})}{f_{i}(x_{i})}}$$
is the LR in favor of B after n observations. Then we have the following theorem.
Theorem 17.2.2.
If A is true then
$$\displaystyle{P\left (\mbox{ LR}_{n} \geq k\;\mbox{ for some }n = 1,2,\ldots \right ) \leq \frac{1}{k}}$$
Robbins [41]
In many circumstances the universal bound is far too conservative. Consider the situation where we have X 1 , X 2 , …, X n where the X i are iid as N(μ, σ 2 ) where, for simplicity, σ 2 is assumed known. The joint density is given by
$$\displaystyle{f(\boldsymbol{x};\mu ) = \left (2\pi \sigma ^{2}\right )^{-n/2}\exp \left \{-\frac{1}{2\sigma ^{2}}\sum _{i=1}^{n}(x_{i}-\mu )^{2}\right \}}$$
After some algebraic simplification the likelihood ratio for comparing μ 1 vs μ 0 is given by
$$\displaystyle{\boldsymbol{L}(\mu _{1},\mu _{0};\boldsymbol{x}) =\exp \left \{\frac{n(\mu _{1}-\mu _{0})}{\sigma ^{2}}\left (\overline{x} -\frac{\mu _{0}+\mu _{1}}{2}\right )\right \}}$$
It follows that the likelihood ratio exceeds k if and only if
$$\displaystyle{\frac{n(\mu _{1}-\mu _{0})}{\sigma ^{2}}\left (\overline{x} -\frac{\mu _{0}+\mu _{1}}{2}\right ) >\ln (k)}$$
Thus, without loss of generality, if μ 1 −μ 0 > 0, the likelihood ratio exceeds k if and only if
$$\displaystyle{\overline{x} > \frac{\mu _{0}+\mu _{1}}{2} + \frac{\sigma ^{2}\ln (k)}{n(\mu _{1}-\mu _{0})}}$$
Thus the probability of misleading statistical evidence when H 0 : μ = μ 0 is assumed true is given by
$$\displaystyle{\Phi \left (-\frac{c\sqrt{n}}{2} - \frac{\ln (k)}{c\sqrt{n}}\right )}$$
where \(\Phi (z)\) is the standard normal distribution function evaluated at z and
$$\displaystyle{c = \frac{\mu _{1}-\mu _{0}}{\sigma }}$$
If μ 1 −μ 0 < 0 similar calculations show that the probability of misleading evidence when μ 0 is assumed true is given by the same expression. It follows that the probability of misleading evidence when H 0 : μ = μ 0 is true is
$$\displaystyle{\Phi \left (-\frac{c\sqrt{n}}{2} - \frac{\ln (k)}{c\sqrt{n}}\right )}$$
where
$$\displaystyle{c = \frac{\vert \mu _{1}-\mu _{0}\vert }{\sigma }}$$
and \(\Phi \) is the standard normal distribution function. The function
$$\displaystyle{c\mapsto \Phi \left (-\frac{c\sqrt{n}}{2} - \frac{\ln (k)}{c\sqrt{n}}\right )}$$
has been called the bump function by Royall.
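The bump function is easy to evaluate numerically. The following sketch (the choices k = 8, n = 10, and the effect sizes are arbitrary, for illustration) tabulates the probability of misleading evidence:

```python
from math import erf, log, sqrt

def Phi(z):
    # standard normal distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bump(c, n, k):
    # Royall's bump function: probability of misleading evidence
    # for the normal mean with known sigma, effect size c
    return Phi(-c * sqrt(n) / 2 - log(k) / (c * sqrt(n)))

# tabulate for k = 8, n = 10 at a few effect sizes
for c in (0.2, 0.5, 1.0):
    print(c, round(bump(c, 10, 8), 4))
```

Every value is well below the universal bound 1∕8 = 0.125.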
Also note that c is often called the effect size in the social science literature and represents the difference between μ 0 and μ 1 in standard deviation units. The following are rules of thumb for judging the magnitude of the effect size:
- c ≤ 0.1 trivial
- 0.1 < c ≤ 0.6 small
- 0.6 < c ≤ 1.2 moderate
- c > 1.2 large
Note that the bump function may be written as \(\Phi (z(c))\) where
$$\displaystyle{z(c) = -\frac{c\sqrt{n}}{2} - \frac{\ln (k)}{c\sqrt{n}}}$$
so that its derivative with respect to c is
$$\displaystyle{\phi (z(c))\,z^{{\prime}}(c)\quad \mbox{ where}\quad z^{{\prime}}(c) = -\frac{\sqrt{n}}{2} + \frac{\ln (k)}{c^{2}\sqrt{n}}}$$
and ϕ is the standard normal density. The derivative vanishes when
$$\displaystyle{\frac{\sqrt{n}}{2} = \frac{\ln (k)}{c^{2}\sqrt{n}}}$$
i.e., when
$$\displaystyle{c = c^{{\ast}} = \sqrt{\frac{2\ln (k)}{n}}}$$
The second derivative with respect to c is
$$\displaystyle{\phi ^{{\prime}}(z(c))\,[z^{{\prime}}(c)]^{2} + \phi (z(c))\,z^{{\prime\prime}}(c)\quad \mbox{ where}\quad z^{{\prime\prime}}(c) = -\frac{2\ln (k)}{c^{3}\sqrt{n}}}$$
which is negative when c = c ∗ (since z ′ (c ∗ ) = 0) so that the bump function has a maximum at c = c ∗ given by
$$\displaystyle{\Phi \left (-\frac{c^{{\ast}}\sqrt{n}}{2} - \frac{\ln (k)}{c^{{\ast}}\sqrt{n}}\right ) = \Phi \left (-\sqrt{2\ln (k)}\right )}$$
It is well known that
$$\displaystyle{\Phi (-z) \leq \frac{\phi (z)}{z}\;\;\mbox{ for }z > 0}$$
so that
$$\displaystyle{\Phi \left (-\sqrt{2\ln (k)}\right ) \leq \frac{1}{2k\sqrt{\pi \ln (k)}}}$$
which is considerably less than the universal bound of 1∕k.
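The location and size of the maximum can be checked numerically (my own sketch; k = 8 and n = 10 are arbitrary illustrative choices):

```python
from math import erf, log, sqrt

def Phi(z):
    # standard normal distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bump(c, n, k):
    # probability of misleading evidence as a function of effect size c
    return Phi(-c * sqrt(n) / 2 - log(k) / (c * sqrt(n)))

k, n = 8.0, 10
c_star = sqrt(2 * log(k) / n)    # maximizing effect size c*
m_max = Phi(-sqrt(2 * log(k)))   # maximum probability of misleading evidence
print(round(c_star, 3), round(m_max, 4), "universal bound:", 1 / k)
```

Evaluating the bump function slightly to either side of c ∗ confirms that it is a maximum, and the maximum is an order of magnitude below 1∕k.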
17.2.1 Weak Statistical Evidence
Again, since we are dealing with probability models, it is possible to observe a value, \(\boldsymbol{x}_{obs}\), for which
$$\displaystyle{\frac{1}{k} <\boldsymbol{ L}(\theta _{1},\theta _{0};\boldsymbol{x}_{obs}) < k}$$
This is called weak statistical evidence. We have weak evidence in the example of normally distributed observations if and only if
$$\displaystyle{\frac{1}{k} <\exp \left \{\frac{n(\mu _{1}-\mu _{0})}{\sigma ^{2}}\left (\overline{x} -\frac{\mu _{0}+\mu _{1}}{2}\right )\right \} < k}$$
If μ 1 −μ 0 > 0 the condition for weak statistical evidence is that \(\overline{x}\) must lie between
$$\displaystyle{\frac{\mu _{0}+\mu _{1}}{2} - \frac{\sigma ^{2}\ln (k)}{n(\mu _{1}-\mu _{0})}}$$
and
$$\displaystyle{\frac{\mu _{0}+\mu _{1}}{2} + \frac{\sigma ^{2}\ln (k)}{n(\mu _{1}-\mu _{0})}}$$
If we define
$$\displaystyle{Z = \frac{\sqrt{n}}{\sigma }\left (\overline{X} -\frac{\mu _{0}+\mu _{1}}{2}\right )}$$
then the condition for weak evidence becomes
$$\displaystyle{-\frac{\ln (k)}{c\sqrt{n}} < Z < \frac{\ln (k)}{c\sqrt{n}}}$$
and the probability of weak evidence is given by
$$\displaystyle{P\left (-\frac{\ln (k)}{c\sqrt{n}} < Z < \frac{\ln (k)}{c\sqrt{n}}\right )}$$
which is easily evaluated under H 0 and H 1 since \(\bar{X}\) has an \(\mbox{ N}\left (\mu , \frac{\sigma ^{2}}{n}\right )\) distribution.
Now we note that when μ = μ 0
$$\displaystyle{Z = Z_{0} -\frac{c\sqrt{n}}{2}\;\;\mbox{ where}\;\;Z_{0} = \frac{\sqrt{n}(\overline{X}-\mu _{0})}{\sigma }\sim \mbox{ N}(0,1)}$$
It follows that the probability of weak evidence, \(P_{\mu =\mu _{0}}(\mbox{ WEV})\), is given by
$$\displaystyle{\Phi \left (\frac{c\sqrt{n}}{2} + \frac{\ln (k)}{c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}}{2} - \frac{\ln (k)}{c\sqrt{n}}\right )}$$
Similarly when μ = μ 1
$$\displaystyle{Z = Z_{1} + \frac{c\sqrt{n}}{2}\;\;\mbox{ where}\;\;Z_{1} = \frac{\sqrt{n}(\overline{X}-\mu _{1})}{\sigma }\sim \mbox{ N}(0,1)}$$
It follows that the probability of weak evidence \(P_{\mu =\mu _{1}}(\mbox{ WEV})\) is given by
$$\displaystyle{\Phi \left (-\frac{c\sqrt{n}}{2} + \frac{\ln (k)}{c\sqrt{n}}\right ) - \Phi \left (-\frac{c\sqrt{n}}{2} - \frac{\ln (k)}{c\sqrt{n}}\right )}$$
If we define
$$\displaystyle{a = \frac{c\sqrt{n}}{2}\;\;\mbox{ and}\;\;b = \frac{\ln (k)}{c\sqrt{n}}}$$
then the two probabilities are given by
$$\displaystyle{P_{\mu =\mu _{0}}(\mbox{ WEV}) = \Phi (a + b) - \Phi (a - b)\;\;\mbox{ and}\;\;P_{\mu =\mu _{1}}(\mbox{ WEV}) = \Phi (-a + b) - \Phi (-a - b)}$$
which, by the symmetry of \(\Phi \), are equal.
There is nothing that requires the level of statistical evidence for μ 1 vs μ 0 to be the same as for μ 0 vs μ 1 . That is, we say we have statistical evidence for μ 1 vs μ 0 of level k 1 if \(\boldsymbol{L}(\mu _{1},\mu _{0};\boldsymbol{x}_{obs}) > k_{1}\) and statistical evidence for μ 0 vs μ 1 of level k 0 if \(\boldsymbol{L}(\mu _{0},\mu _{1};\boldsymbol{x}_{obs}) > k_{0}\).
We then have the following summary of results for the normal distribution example.
- When H 0 is true the probability of misleading evidence for H 1 at level k 1 (defined by \(\mbox{ L}_{1}/\mbox{ L}_{0} \geq k_{1}\)) is
$$\displaystyle{M_{0} = \Phi \left (-\frac{c\sqrt{n}}{2} - \frac{\ln (k_{1})}{c\sqrt{n}}\right )}$$
- When H 0 is true the probability of weak evidence is
$$\displaystyle\begin{array}{rcl} W_{0}& =& P_{\mu =\mu _{0}}\left ( \frac{1}{k_{0}} \leq \frac{\mbox{ L}_{1}}{\mbox{ L}_{0}} \leq k_{1}\right ) {}\\ & =& \Phi \left (\frac{c\sqrt{n}}{2} + \frac{\ln (k_{1})}{c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}}{2} - \frac{\ln (k_{0})}{c\sqrt{n}}\right ) {}\\ \end{array}$$
- When H 1 is true the probability of misleading evidence for H 0 at level k 0 (defined by \(\mbox{ L}_{0}/\mbox{ L}_{1} \geq k_{0}\)) is
$$\displaystyle{M_{1} = \Phi \left (-\frac{c\sqrt{n}}{2} - \frac{\ln (k_{0})}{c\sqrt{n}}\right )}$$
- When H 1 is true the probability of weak evidence is
$$\displaystyle\begin{array}{rcl} W_{1}& =& P_{\mu =\mu _{1}}\left ( \frac{1}{k_{0}} \leq \frac{\mbox{ L}_{1}}{\mbox{ L}_{0}} \leq k_{1}\right ) {}\\ & =& \Phi \left (\frac{c\sqrt{n}}{2} + \frac{\ln (k_{0})}{c\sqrt{n}}\right ) - \Phi \left (\frac{c\sqrt{n}}{2} - \frac{\ln (k_{1})}{c\sqrt{n}}\right ) {}\\ \end{array}$$
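These four probabilities are straightforward to compute; the following sketch does so for the normal-mean example (the choices c = 0.5, n = 100, and symmetric levels k 0 = k 1 = 8 are my own, for illustration):

```python
from math import erf, log, sqrt

def Phi(z):
    # standard normal distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def evidence_probs(c, n, k0, k1):
    """M0, W0, M1, W1 for the normal-mean example with known sigma,
    evidence level k1 for H1 vs H0 and k0 for H0 vs H1."""
    s = c * sqrt(n)
    m0 = Phi(-s / 2 - log(k1) / s)
    w0 = Phi(s / 2 + log(k1) / s) - Phi(s / 2 - log(k0) / s)
    m1 = Phi(-s / 2 - log(k0) / s)
    w1 = Phi(s / 2 + log(k0) / s) - Phi(s / 2 - log(k1) / s)
    return m0, w0, m1, w1

m0, w0, m1, w1 = evidence_probs(0.5, 100, 8.0, 8.0)
print([round(p, 4) for p in (m0, w0, m1, w1)])
```

With symmetric levels k 0 = k 1 the formulas give M 0 = M 1 and W 0 = W 1 , as the symmetry argument above predicts.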
17.2.2 Sample Size
There is no doubt that one of the questions most asked of a statistician is
“How many observations do I need?”
Actually the usual question is: how many subjects do I need to get a statistically significant result that is publishable? This question is easily answered, so let us consider refining the question.
Suppose that we will observe X 1 , X 2 , …, X n assumed independent and identically distributed as normal with mean μ and variance σ 2 assumed known (usually based on past work with similar instruments, and so on). Of interest is a (null) hypothesis H 0 : μ = μ 0 and an alternative H 1 : μ = μ 1 where without loss of generality we assume that μ 1 > μ 0 . It is assumed that μ 1 represents a value of μ which is of scientific importance, i.e., if μ 1 is true then a result of scientific or practical importance has been discovered.
Neyman–Pearson theory has been the default method for determining sample size for decades. It is required in submitting grants to NIH, NSF, FDA, etc., as well as in reporting the results of published studies and dissertations. The Neyman–Pearson approach to sample size selection is as follows:
1. Choose a value α for the significance level (usually α = 0.05).
2. Choose a value 1 −β for the power (usually β = 0.20 so that the power is 0.8).
3. Select the sample size n so that
$$\displaystyle{\begin{array}{c} P(\mbox{ Type I error}) = P(\mbox{ reject $H_{0}$}\vert H_{0}\;\mbox{ true}) =\alpha \\ 1 - P(\mbox{ Type II error}) = P(\mbox{ reject $H_{0}$}\vert H_{1}\;\mbox{ true}) = 1-\beta \end{array} }$$
In the case of a normal distribution with known variance we have that H 0 is rejected when
$$\displaystyle{\overline{x} >\mu _{0} + z_{1-\alpha }\frac{\sigma }{\sqrt{n}}}$$
where \(z_{1-\alpha } = \Phi ^{-1}(1-\alpha )\), and it follows that
$$\displaystyle{P(\mbox{ Type I error}) = P_{\mu =\mu _{0}}\left (\overline{X} >\mu _{0} + z_{1-\alpha }\frac{\sigma }{\sqrt{n}}\right ) =\alpha }$$
and
$$\displaystyle{P_{\mu =\mu _{1}}\left (\overline{X} >\mu _{0} + z_{1-\alpha }\frac{\sigma }{\sqrt{n}}\right ) = \Phi \left (c\sqrt{n} - z_{1-\alpha }\right )}$$
In order to have power 1 −β we must have
$$\displaystyle{\Phi \left (c\sqrt{n} - z_{1-\alpha }\right ) = 1-\beta }$$
i.e.,
$$\displaystyle{c\sqrt{n} - z_{1-\alpha } = z_{1-\beta }}$$
or
$$\displaystyle{c\sqrt{n} = z_{1-\alpha } + z_{1-\beta }}$$
and it follows that
$$\displaystyle{n = \left (\frac{z_{1-\alpha } + z_{1-\beta }}{c}\right )^{2}}$$
where
$$\displaystyle{c = \frac{\mu _{1}-\mu _{0}}{\sigma }}$$
This is the prototype of sample size formulas.
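The prototype formula is easily coded; the following is a minimal stdlib-only sketch (the bisection inverse of Φ is my own convenience, not from the text):

```python
from math import erf, sqrt, ceil

def Phi(z):
    # standard normal distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_quantile(p):
    # inverse of Phi by bisection (simple stdlib-only approach)
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def np_sample_size(alpha, beta, c):
    # prototype formula: n = ((z_{1-alpha} + z_{1-beta}) / c)^2, rounded up
    return ceil(((z_quantile(1 - alpha) + z_quantile(1 - beta)) / c) ** 2)

print(np_sample_size(0.05, 0.20, 0.5))  # → 25
```

For the conventional α = 0.05, β = 0.20 and a moderate effect size c = 0.5 this gives n = 25.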
The Neyman–Pearson approach is inadequate when we want to quantify statistical evidence for H 1 vs H 0 , so we now consider the selection of the sample size necessary to quantify statistical evidence. Recall that there are four probabilities involved:
1. The probability of misleading statistical evidence for H 1 when H 0 is true
2. The probability of misleading statistical evidence for H 0 when H 1 is true
3. The probability of weak statistical evidence when H 0 is true
4. The probability of weak statistical evidence when H 1 is true
The analogue to the Type I error probability is the probability of finding misleading evidence for H 1 when H 0 is true. For the normal distribution this gives the correspondence, for α = 0.05, between α and M 0 given by
$$\displaystyle{M_{0} = \Phi \left (-\frac{c\sqrt{n}}{2} - \frac{\ln (k_{1})}{c\sqrt{n}}\right )}$$
and if we take c = 0.5, a moderate effect size, the two quantities can be compared directly as functions of n.
For the analogue to the Type II error we must be more careful. The probability of failing to find evidence supporting H 1 when H 1 is true is composed of two parts:
1. The probability of misleading evidence in favor of H 0 when H 1 is true
2. The probability of weak evidence when H 1 is true
For the normal distribution this gives the correspondence between β and M 1 + W 1 , and if β = 0.2, c = 0.5 and k 0 = k 1 = 8 we can evaluate M 1 + W 1 as a function of n.
For the Neyman–Pearson sample size formula with α = 0.05, β = 0.20 and c = 0.5 we get a sample size of
$$\displaystyle{n = \left (\frac{1.645 + 0.842}{0.5}\right )^{2} = 24.7\;\;\mbox{ or}\;\;n = 25}$$
For this sample size we find that M 1 + W 1 is approximately 0.34.
Thus the conventional sample size formula does not lead to a small probability of finding weak evidence.
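The size of M 1 + W 1 at the conventional sample size is easily checked (a sketch of my own; the choice k 0 = k 1 = 8 is for illustration, with the Neyman–Pearson z-quantiles hard-coded):

```python
from math import erf, log, sqrt

def Phi(z):
    # standard normal distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Neyman-Pearson sample size for alpha = 0.05, beta = 0.20, c = 0.5 is n = 25
c, n = 0.5, 25

# probability of failing to find strong evidence for H1 when H1 is true,
# with evidence levels k0 = k1 = 8 (illustrative choice)
k = 8.0
a, b = c * sqrt(n) / 2, log(k) / (c * sqrt(n))
m1 = Phi(-a - b)                 # misleading evidence in favor of H0
w1 = Phi(a + b) - Phi(a - b)     # weak evidence
print(round(m1 + w1, 3))
```

The result is roughly a one-in-three chance of failing to obtain strong evidence, far larger than the nominal β = 0.20.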
Exercises in Royall’s book show that this is true in general, i.e., conventional sample size formulas do not guarantee finding strong evidence.
17.3 Birnbaum’s Confidence Concept
Recall Birnbaum’s confidence concept which he advocated after becoming skeptical of the likelihood principle.
A concept of statistical evidence is not plausible unless it finds “strong evidence” for H 2 as against H 1 with small probability (α) when H 1 is true and with much larger probability (1 −β) when H 2 is true.
What the results in the sample size section show is that it is possible in certain cases to satisfy the confidence concept with sufficient observations.
17.4 Combining Evidence
Suppose that we have two independent estimators, t 1 and t 2 of a parameter θ where t 1 is normal with expected value θ and variance v 1 and t 2 is normal with expected value θ and variance v 2. Assume that v 1 and v 2 are known.
The joint density of t 1 and t 2 is
$$\displaystyle{f(t_{1},t_{2};\theta ) = \frac{1}{\sqrt{2\pi v_{1}}}\exp \left \{-\frac{(t_{1}-\theta )^{2}}{2v_{1}}\right \}\; \frac{1}{\sqrt{2\pi v_{2}}}\exp \left \{-\frac{(t_{2}-\theta )^{2}}{2v_{2}}\right \}}$$
which has logarithm
$$\displaystyle{-\frac{1}{2}\ln (2\pi v_{1}) -\frac{1}{2}\ln (2\pi v_{2}) -\frac{(t_{1}-\theta )^{2}}{2v_{1}} -\frac{(t_{2}-\theta )^{2}}{2v_{2}}}$$
The derivative with respect to θ is thus
$$\displaystyle{\frac{t_{1}-\theta }{v_{1}} + \frac{t_{2}-\theta }{v_{2}}}$$
and hence the maximum likelihood estimate of θ is
$$\displaystyle{\widehat{\theta }= \frac{\frac{t_{1}}{v_{1}} + \frac{t_{2}}{v_{2}}}{\frac{1}{v_{1}} + \frac{1}{v_{2}}}}$$
At this value of θ the joint density is \(f(t_{1},t_{2};\widehat{\theta })\) and hence the likelihood for θ is
$$\displaystyle{\boldsymbol{L}(\theta ) = \frac{f(t_{1},t_{2};\theta )}{f(t_{1},t_{2};\widehat{\theta })}}$$
or
$$\displaystyle{\boldsymbol{L}(\theta ) =\exp \left \{-\frac{(\theta -\widehat{\theta })^{2}}{2v}\right \}}$$
where
$$\displaystyle{\frac{1}{v} = \frac{1}{v_{1}} + \frac{1}{v_{2}}}$$
which is a normal likelihood centered at \(\widehat{\theta }\) with variance v.
This is a likelihood version of the standard result that to combine unbiased uncorrelated estimators one weights them inversely by their variances and divides the result by the sum of the weights.
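The inverse-variance combination is simple to implement; the numbers in the example below are hypothetical, chosen only to illustrate the arithmetic:

```python
def combine(t1, v1, t2, v2):
    """Inverse-variance combination of two independent unbiased
    normal estimators of the same theta (the result derived above)."""
    w1, w2 = 1.0 / v1, 1.0 / v2
    theta_hat = (w1 * t1 + w2 * t2) / (w1 + w2)  # weighted by 1/variance
    v = 1.0 / (w1 + w2)                          # combined variance
    return theta_hat, v

# hypothetical numbers: two estimates of the same parameter
theta_hat, v = combine(10.0, 4.0, 12.0, 1.0)
print(theta_hat, v)  # → 11.6 0.8
```

Note that the more precise estimate (smaller variance) dominates the combination, pulling \(\widehat{\theta }\) toward it.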
In general note that if we have \(\boldsymbol{x}_{1}\) observations from f(x; θ) and independent observations \(\boldsymbol{x}_{2}\) from f(x; θ) then the evidence for θ 1 vs θ 0 based on \((\boldsymbol{x}_{1},\boldsymbol{x}_{2})\) is
$$\displaystyle{\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{1},\boldsymbol{x}_{2}) =\boldsymbol{ L}(\theta _{1},\theta _{0};\boldsymbol{x}_{1})\,\boldsymbol{L}(\theta _{1},\theta _{0};\boldsymbol{x}_{2})}$$
i.e., evidence is multiplicative.
17.5 Exercises
1. Suppose that X 1 , X 2 , …, X n are iid, each Poisson with parameter λ. Let k = 8, n = 1, 10, 25. Draw graphs of the probability of misleading evidence for λ 1 = 2 vs λ 0 = 1.
2. Repeat Exercise 1 for the binomial with n = 10, 25, 100, 1000 and p = 0.6 vs p = 0.5.
References
41. Robbins, H.: Statistical methods related to the law of the iterated logarithm. Ann. Math. Stat. 41, 1397–1409 (1970)
© 2014 Springer International Publishing Switzerland
Rohde, C.A. (2014). Pure Likelihood Methods. In: Introductory Statistical Inference with the Likelihood Function. Springer, Cham. https://doi.org/10.1007/978-3-319-10461-4_17