Abstract
As we have seen, once we have an estimator and its sampling distribution we can easily obtain confidence intervals and tests regarding the parameter. We now develop the theory of estimation focusing on the method of maximum likelihood, which for parametric models is the most widely used method. This will also supply us with a collection of statistical methods for important problems.
7.1 Basic Properties
As we have seen, once we have an estimator and its sampling distribution we can easily obtain confidence intervals and tests regarding the parameter. We now develop the theory of estimation focusing on the method of maximum likelihood, which for parametric models is the most widely used method. This will also supply us with a collection of statistical methods for important problems.
For comparing two values of a parameter, \(\theta _{2}\) vs \(\theta _{1}\), a natural role is played by the likelihood ratio
$$\displaystyle{\frac{f(x;\theta _{2})}{f(x;\theta _{1})}}$$
According to the Law of Likelihood the likelihood ratio represents the statistical evidence in the data for comparing \(\theta _{2}\) to \(\theta _{1}\).
The score function is defined by
$$\displaystyle{u(\theta ) = \frac{\partial \ln [f(x;\theta )]}{\partial \theta }}$$
The score function plays a major role in the theory of maximum likelihood estimation.
Example.
Consider n iid normal random variables with parameters θ and \(\sigma ^{2}\), where \(\sigma ^{2}\) is known. Then
and
It follows that
As a random variable, the score function has expected value 0 and variance \(n/\sigma ^{2}\) when evaluated at the true θ.
Because of the Law of Likelihood, a natural estimate of θ is that value of θ which maximizes the likelihood or, equivalently, the log of the likelihood.
Assuming that ln[f(x; θ)] is differentiable with respect to θ, the maximum likelihood estimate is then the solution to
$$\displaystyle{u(\theta ) = \frac{\partial \ln [f(x;\theta )]}{\partial \theta } = 0}$$
which is called the likelihood or score equation. If there are r parameters we differentiate with respect to each and equate to 0, obtaining r equations. Note that one needs to check the second derivative to ensure a maximum.
Example 1 (Binomial).
If X is binomial with parameter θ, then
First note that if x = 0 then \(f(0;\theta ) = (1-\theta )^{n}\) and in this case \(\widehat{\theta }= 0\). If x = n then \(f(n;\theta ) =\theta ^{n}\) and in this case \(\widehat{\theta }= 1\).
For \(x = 1,2,\ldots,n - 1\) we have that
and
It follows that
$$\displaystyle{\widehat{\theta }= \frac{x}{n}.}$$
Note that \(\widehat{\theta }\) is unbiased with variance \(\theta (1-\theta )/n\), so that it is also consistent.
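This closed-form answer is easy to confirm numerically. The following sketch (the values n = 20 and x = 7, and the use of SciPy's bounded optimizer, are illustrative choices only) maximizes the binomial log likelihood and recovers x/n:

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, x = 20, 7  # arbitrary illustrative values

# Negative binomial log likelihood as a function of theta
def neg_log_lik(theta):
    return -binom.logpmf(x, n, theta)

# Numerically maximize the likelihood on (0, 1)
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(res.x)   # numerical maximizer, approximately 0.35
print(x / n)   # closed-form MLE x/n
```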
Example 2.
Let \(Y_{1},Y_{2},\ldots,Y_{n}\) be iid, each normal with mean μ and variance \(\sigma ^{2}\). Then we have
It follows that the log likelihood is given by
Thus we have that
and
and it follows that
$$\displaystyle{\widehat{\mu }= \overline{y}\quad \mbox{ and}\quad \widehat{\sigma }^{2} = \frac{1}{n}\sum _{i=1}^{n}(y_{i} -\overline{y})^{2}.}$$
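As a numerical sketch (with simulated data; the true values 2 and 3 are arbitrary), maximizing the normal log likelihood directly reproduces the closed-form estimates; note that np.var with its default ddof=0 uses exactly the 1/n divisor:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=200)   # simulated data

# Negative log likelihood in (mu, log sigma); log sigma keeps sigma positive
def neg_log_lik(params):
    mu, log_sigma = params
    return -norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1]) ** 2

print(mu_hat, y.mean())             # both close to the sample mean
print(sigma2_hat, y.var(ddof=0))    # both close to (1/n) * sum of squared deviations
```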
7.2 Consistency of Maximum Likelihood
1. Consider the case where there are only two possible values of the parameter, \(\theta _{2}\) and \(\theta _{1}\).
2. Also suppose that we have n observations which are realized values of independent and identically distributed random variables having density \(f(x;\theta _{2})\) or \(f(x;\theta _{1})\).
The maximum likelihood estimate is defined by
1. Assume with no loss of generality that \(\theta _{2}\) is the true value of the parameter.
2. The maximum likelihood estimator is consistent if
$$\displaystyle{\mathbb{P}_{\theta _{2}}(\widehat{\theta }=\theta _{2})\;\;\longrightarrow \;\;1}$$
We note that \(\widehat{\theta }=\theta _{2}\) if and only if
Equivalently
Now note that the random variables
are independent and identically distributed.
Moreover
By the law of large numbers we have that
and hence
i.e., \(\widehat{\theta }\) is consistent.
1. The same proof holds provided the parameter space \(\Theta \) is finite.
2. The more general case where \(\Theta \) is an interval requires more delicate arguments and is of technical, not statistical, interest.
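The argument can be illustrated with a small simulation (the normal model with candidate means 1 and 0, and the sample sizes shown, are arbitrary choices): the fraction of replications in which the two-point maximum likelihood estimate equals the true value approaches 1 as n grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta2, theta1 = 1.0, 0.0      # theta2 plays the role of the true value
reps = 2000

for n in (5, 20, 100):
    correct = 0
    for _ in range(reps):
        x = rng.normal(loc=theta2, scale=1.0, size=n)
        # The two-point MLE picks theta2 exactly when the log likelihood ratio is positive
        llr = norm.logpdf(x, theta2, 1.0).sum() - norm.logpdf(x, theta1, 1.0).sum()
        correct += llr > 0
    print(n, correct / reps)   # fraction selecting theta2; approaches 1 as n grows
```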
7.3 General Results on the Score Function
We know that
for any density function f(x; θ). Recall that for a function g we write
Assuming that we can differentiate under the integral or summation sign, we have that
Now note that
It follows that
Thus the expected value of the score function is 0.
If we differentiate again we have that
Noting that
we see that
The right-hand side may be written as
It follows that
and hence
The quantity
$$\displaystyle{I(\theta ) = \mathbb{E}_{\theta }\left [-\frac{\partial ^{2}\ln [f(X;\theta )]}{\partial \theta ^{2}} \right ]}$$
is called the (expected) Fisher information and
$$\displaystyle{-\frac{\partial ^{2}\ln [f(x;\theta )]}{\partial \theta ^{2}} }$$
(usually evaluated at \(\widehat{\theta }\)) is called the (observed) Fisher information.
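For concreteness, here is a small Bernoulli illustration (the model and numerical values are chosen only for illustration): the observed information is the negative second derivative of the log likelihood evaluated at \(\widehat{\theta }\), which for this model equals \(n/[\widehat{\theta }(1 -\widehat{\theta })]\), while the expected information at the true θ is \(n/[\theta (1-\theta )]\).

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 0.3
x = rng.binomial(1, theta_true, size=500)   # iid Bernoulli sample
n, theta_hat = x.size, x.mean()             # the MLE is the sample proportion

# Observed information: minus the second derivative of the log likelihood at theta_hat
observed = np.sum(x / theta_hat**2 + (1 - x) / (1 - theta_hat) ** 2)

# Expected information at the true theta: n / (theta (1 - theta))
expected = n / (theta_true * (1 - theta_true))

print(observed)   # equals n / (theta_hat (1 - theta_hat))
print(expected)
```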
7.4 General Maximum Likelihood
1. Let X be a random variable with density f(x; θ).
2. Assume that the parameter space \(\Theta \) is an interval and that f(x; θ) is sufficiently smooth so that derivatives with respect to θ are defined and that differentiation under a summation or integral is allowed.
3. Finally, assume that the range of X does not depend on θ.
Under weak regularity conditions it follows from the previous section that
Thus the random variable
i.e., the score function has expected value and variance given by
where
is the expected Fisher information for a sample size of one.
Example.
If X is normal with mean θ and variance \(\sigma ^{2}\), with \(\sigma ^{2}\) known, then
and hence
and
so Fisher’s information is
Example.
If X is Bernoulli with parameter θ, then
and hence
It follows that
and
so Fisher’s information is
If we have a random sample \(X_{1},X_{2},\ldots,X_{n}\) from f(x; θ) and if
then
is the sample mean of n iid random variables with expected value 0 and variance i(θ). It follows that
by the central limit theorem.
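A quick simulation sketch of this normal approximation, using Bernoulli observations (the values of θ, n, and the number of replications are illustrative): the standardized score \(u(\theta )/\sqrt{n\,i(\theta )}\) should have mean near 0 and standard deviation near 1.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.4, 50, 5000
i_theta = 1 / (theta * (1 - theta))          # Fisher information for one Bernoulli observation

z = np.empty(reps)
for r in range(reps):
    x = rng.binomial(1, theta, size=n)
    score = np.sum(x / theta - (1 - x) / (1 - theta))   # u(theta) for the whole sample
    z[r] = score / np.sqrt(n * i_theta)                 # standardized score

print(z.mean(), z.std())   # approximately 0 and 1, as the normal approximation predicts
```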
Define the maximum likelihood estimate of θ as that value of θ which maximizes f(x; θ) or equivalently ln[f(x; θ)].
Thus we solve
or when \(f(\mathbf{x};\theta ) =\prod _{ i=1}^{n}f(x_{i};\theta )\) we solve
Since we can write, using Taylor’s theorem,
where
and \(\theta ^{{\ast}}\) is between θ and \(\widehat{\theta }\).
Since \(u(\widehat{\theta }) = 0\) we have
It follows that
Application of the results of the preceding section shows that
where i(θ) is Fisher’s information for a sample of size 1.
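The following simulation sketch illustrates this limit for the exponential model with rate θ (an arbitrary illustrative choice), where \(\widehat{\theta }= 1/\overline{y}\) and \(i(\theta ) = 1/\theta ^{2}\), so that \(\sqrt{n}(\widehat{\theta }-\theta )\) is approximately N(0, θ²):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 200, 5000               # true rate, sample size, replications

samples = rng.exponential(scale=1 / theta, size=(reps, n))
theta_hat = 1 / samples.mean(axis=1)          # MLE of the exponential rate is 1 / sample mean
z = np.sqrt(n) * (theta_hat - theta)

print(z.mean())   # roughly 0
print(z.std())    # roughly theta = 2, since i(theta) = 1/theta^2 gives variance theta^2
```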
7.5 Cramer-Rao Inequality
If t(x) is any unbiased estimator of θ, i.e.,
$$\displaystyle{\mathbb{E}_{\theta }[t(X)] =\theta \quad \mbox{ for all }\theta }$$
then
Assuming that we can differentiate under the integral or summation sign, we have that
and hence
It follows that
or
where I(θ) is the expected Fisher information. Thus the variance of any unbiased estimator is at least the inverse of Fisher's information. This result is called the Cramer–Rao inequality.
Since 1∕I(θ) is the large-sample variance of the maximum likelihood estimator, we have the result that the method of maximum likelihood produces estimators which are asymptotically efficient, i.e., have the smallest possible asymptotic variance.
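For example, in the binomial model the sample proportion is unbiased with variance θ(1 − θ)/n, which equals 1/I(θ), so the bound is attained; the following trivial check (illustrative values only) spells this out:

```python
theta, n = 0.3, 50   # illustrative values

# Variance of the unbiased estimator X/n (the sample proportion)
var_unbiased = theta * (1 - theta) / n

# Cramer-Rao bound: the inverse of the expected Fisher information I(theta) = n / (theta (1 - theta))
fisher_info = n / (theta * (1 - theta))
bound = 1 / fisher_info

print(var_unbiased, bound)   # equal, so the sample proportion attains the bound
```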
7.6 Summary Properties of Maximum Likelihood
1. Maximum likelihood estimators have the equivariance property, i.e., the maximum likelihood estimate of g(θ), \(\widehat{g(\theta )}\), is \(g(\widehat{\theta })\).
2. Under weak regularity conditions maximum likelihood estimators are consistent, i.e.,
$$\displaystyle{\widehat{\theta }\:\stackrel{p}{\longrightarrow }\;\theta }$$
3. Maximum likelihood estimators are asymptotically normal:
$$\displaystyle{\sqrt{n}(\widehat{\theta }-\theta _{0})\;\stackrel{d}{\longrightarrow }\;\mbox{ N}(0,v(\theta _{0}))}$$
where \(v(\theta _{0})\) is the inverse of Fisher's information for a single observation.
4. Maximum likelihood estimators are asymptotically efficient, i.e., in large samples
$$\displaystyle{\mathbb{V}(\widehat{\theta }) \leq \mathbb{V}(\widetilde{\theta })}$$
where \(\widetilde{\theta }\) is any other consistent estimator which is asymptotically normal.
The regularity conditions under which the results on maximum likelihood estimators are true consist of conditions of the form:
(i) The range of the distributions cannot depend on the parameter.
(ii) The first three derivatives of the log likelihood function with respect to θ exist, are continuous, and have finite expected values as functions of X.
7.7 Multiparameter Case
All of the results for maximum likelihood generalize to the case where there are p parameters \(\theta _{1},\theta _{2},\ldots,\theta _{p}\). Let
If the pdf is given by
the maximum likelihood or score equation is
Fisher’s information matrix
has (i, j) element given by
Note that it is a p × p matrix.
Under regularity conditions similar to those for the single-parameter case, we have
1. The maximum likelihood estimate of \(g(\boldsymbol{\theta })\), \(\widehat{g(\boldsymbol{\theta })}\), is \(g(\widehat{\boldsymbol{\theta }})\).
2. Maximum likelihood estimators are consistent, i.e.,
$$\displaystyle{\widehat{\boldsymbol{\theta }}\:\stackrel{p}{\longrightarrow }\;\boldsymbol{\theta }}$$
3. Maximum likelihood estimators are asymptotically normal:
$$\displaystyle{(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta }_{0})\; \approx \;\mbox{ N}(0,\mathbf{V}_{n}(\boldsymbol{\theta }_{0}))}$$
where \(\mathbf{V}_{n}(\boldsymbol{\theta }_{0})\) is the inverse of Fisher's information matrix. We can replace \(\boldsymbol{\theta }_{0}\) by \(\widehat{\boldsymbol{\theta }}\) to use this result to determine confidence intervals.
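As a sketch of this confidence-interval construction, consider the two-parameter normal model, for which the Fisher information matrix has the standard diagonal form diag(n/σ², n/(2σ⁴)); the data below are simulated and the values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(loc=10.0, scale=2.0, size=400)   # simulated data
n = y.size

# Maximum likelihood estimates for (mu, sigma^2)
mu_hat = y.mean()
sigma2_hat = y.var(ddof=0)

# Inverse Fisher information matrix evaluated at the MLE (diagonal for this model)
V = np.diag([sigma2_hat / n, 2 * sigma2_hat**2 / n])
se = np.sqrt(np.diag(V))

# Approximate 95% Wald confidence intervals
for name, est, s in zip(("mu", "sigma^2"), (mu_hat, sigma2_hat), se):
    print(name, est - 1.96 * s, est + 1.96 * s)
```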
7.8 Maximum Likelihood in the Multivariate Normal
Let \(\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{n}\) be independent, each having a multivariate normal distribution with parameters \(\boldsymbol{\mu }\) and \(\boldsymbol{\Sigma }\), i.e.,
The joint density is thus
We will show that the maximum likelihood estimates of \(\boldsymbol{\mu }\) and \(\boldsymbol{\Sigma }\) are
and
i.e., the (j, k) element of S is
essentially the sample covariance between the jth and kth variable.
The first step is to note that
can be written as
or
where the trace of a square matrix, tr(A), is the sum of the diagonal elements, i.e.,
Thus the joint density \(f_{\mathbf{Y}}(\mathbf{y};\boldsymbol{\mu },\boldsymbol{\Sigma })\) can be written as
It follows immediately that the maximum likelihood estimate of \(\boldsymbol{\mu }\) is \(\overline{\mathbf{y}}\) and the joint density at \(\widehat{\boldsymbol{\mu }}\) and \(\widehat{\boldsymbol{\Sigma }} = \mathbf{S}\) is thus
The ratio
is thus equal to
which is greater than or equal to
This ratio is greater than or equal to 1 if and only if its logarithm is greater than or equal to 0. The logarithm is
If \(\lambda _{1},\lambda _{2},\ldots,\lambda _{p}\) are the characteristic roots of \(\boldsymbol{\Sigma }^{-1}\mathbf{S}\) then it can be shown that
1. \(\lambda _{i} \geq 0\) for each i
2. \(\det (\boldsymbol{\Sigma }^{-1}\mathbf{S}) =\prod _{ i=1}^{p}\lambda _{i}\)
3. \(\mbox{ tr}(\boldsymbol{\Sigma }^{-1}\mathbf{S}) =\sum _{ i=1}^{p}\lambda _{i}\)
It follows that the log of the ratio is greater than or equal to
or
which is greater than or equal to zero since
Thus the maximum likelihood estimators for the multivariate normal are
We usually use
$$\displaystyle{ \frac{n}{n - 1}\mathbf{S} = \frac{1}{n - 1}\sum _{i=1}^{n}(\mathbf{y}_{i} -\overline{\mathbf{y}})(\mathbf{y}_{i} -\overline{\mathbf{y}})^{\prime}}$$
as the estimator so that the estimated components of \(\boldsymbol{\Sigma }\) are exactly the sample covariances and variances.
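A brief numerical check (with simulated bivariate data; the mean vector and covariance matrix are arbitrary): the maximum likelihood estimate S uses the 1/n divisor, whereas np.cov defaults to 1/(n − 1), which corresponds to the usual estimator just mentioned.

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
y = rng.multivariate_normal(mu, Sigma, size=500)   # n x p data matrix
n = y.shape[0]

mu_hat = y.mean(axis=0)                 # MLE of mu: the sample mean vector
centered = y - mu_hat
S = centered.T @ centered / n           # MLE of Sigma: note the 1/n divisor

print(mu_hat)
print(S)
print(np.cov(y, rowvar=False) * (n - 1) / n)   # np.cov uses 1/(n-1); rescaled it matches S
```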
7.9 Multinomial
Suppose that \(X_{1},X_{2},\ldots,X_{k}\) have a multinomial distribution, i.e.,
where
and
Note that
The maximum likelihood estimates of the \(\theta _{i}\) are found by taking the partial derivatives of the log likelihood with respect to \(\theta _{i}\) for \(i = 1,2,\ldots,k - 1\), where the log likelihood is
Since \(\theta _{k} = 1 -\theta _{1} -\theta _{2} -\cdots -\theta _{k-1}\) we have
for \(i = 1,2,\ldots,k - 1\). It follows that the maximum likelihood estimates satisfy
Summing from i = 1 to k − 1 yields
and hence
so that
$$\displaystyle{\widehat{\theta }_{i} = \frac{x_{i}}{n},\quad i = 1,2,\ldots,k.}$$
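This can be confirmed numerically; in the sketch below (cell probabilities and n chosen for illustration) the sample proportions maximize the multinomial log likelihood:

```python
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(7)
theta = np.array([0.5, 0.3, 0.2])   # true cell probabilities, chosen for illustration
n = 1000
x = rng.multinomial(n, theta)       # observed counts

theta_hat = x / n                    # MLE: the sample proportions
print(theta_hat)

# The log likelihood at the sample proportions is at least as large as at the true theta
print(multinomial.logpmf(x, n, theta_hat))
print(multinomial.logpmf(x, n, theta))
```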
The second derivatives of the log likelihood are given by
which has expected value
and
which has expected value
Thus Fisher’s information matrix, \(\mathcal{I}(\boldsymbol{\theta })\), is given by
Fisher’s information can be written in matrix form as
where \(\mathbf{D}(\boldsymbol{\theta })\) is a \((k - 1) \times (k - 1)\) matrix with diagonal elements \(\theta _{1},\theta _{2},\ldots,\theta _{k-1}\) and \(\mathbf{1}\) is a (k − 1)-dimensional column vector with each element equal to 1.
The general theory of maximum likelihood then implies that
where \(i(\boldsymbol{\theta })\) is Fisher’s information matrix with n = 1.
It is easy to check that
or
which we recognize as the variance–covariance matrix of \(X_{1},X_{2},\ldots,X_{k-1}\).
Standard maximum likelihood theory implies that
Now note that
is equal to
and hence to
This last expression simplifies to
which in turn simplifies to
and to
This finally reduces to
Noting that \(\mathbb{E}(X_{i}) = n\theta _{i} = E_{i}\), this last formula may be written as
$$\displaystyle{\sum _{i=1}^{k}\frac{(X_{i} - E_{i})^{2}}{E_{i}} }$$
which is called Pearson’s chi-square statistic. For large n, it has a chi-square distribution with k − 1 degrees of freedom.
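A short numerical sketch (an invented three-cell example) computes the statistic directly and via scipy.stats.chisquare, and evaluates the chi-square tail area with k − 1 = 2 degrees of freedom:

```python
import numpy as np
from scipy.stats import chisquare, chi2

rng = np.random.default_rng(8)
theta = np.array([0.5, 0.3, 0.2])   # hypothesized cell probabilities
n = 1000
x = rng.multinomial(n, theta)       # observed counts under the hypothesis

expected = n * theta                              # E_i = n * theta_i
stat = np.sum((x - expected) ** 2 / expected)     # Pearson's chi-square statistic

print(stat)
print(chisquare(x, f_exp=expected))               # same statistic with a p-value
print(chi2.sf(stat, df=len(theta) - 1))           # upper tail area with k - 1 = 2 df
```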