7.1 Basic Properties

As we have seen, once we have an estimator and its sampling distribution, we can easily obtain confidence intervals and tests regarding the parameter. We now develop the theory of estimation, focusing on the method of maximum likelihood, which for parametric models is the most widely used method. This will also supply us with a collection of statistical methods for important problems.

For comparing two values of a parameter, \(\theta _{2}\) versus \(\theta _{1}\), a natural role is played by the likelihood ratio

$$\displaystyle{\mathcal{L}\mathcal{R}(\theta _{2},\theta _{1};x) = \frac{f(x;\theta _{2})} {f(x;\theta _{1})}}$$

According to the Law of Likelihood, the likelihood ratio represents the statistical evidence in the data for comparing \(\theta _{2}\) to \(\theta _{1}\).

The score function is defined by

$$\displaystyle{s(\theta;x) = \frac{\partial \ln [f(x;\theta )]} {\partial \theta } }$$

The score function plays a major role in the theory of maximum likelihood estimation.

Example.

Consider n iid normal random variables with parameters θ and \(\sigma ^{2}\), where \(\sigma ^{2}\) is known. Then

$$\displaystyle{f(x;\theta ) = (2\pi \sigma ^{2})^{-n/2}\exp \left \{-\frac{1} {2\sigma ^{2}}\sum _{i=1}^{n}(x_{ i}-\theta )^{2}\right \}}$$

and

$$\displaystyle{f^{'}(x;\theta ) = f(x;\theta )\frac{1} {\sigma ^{2}} \sum _{i=1}^{n}(x_{ i}-\theta )}$$

It follows that

$$\displaystyle{\frac{f^{'}(x;\theta )} {f(x;\theta )} = \frac{n(\overline{x}-\theta )} {\sigma ^{2}} }$$

Viewed as a random variable, the score function has expected value 0 and variance \(n/\sigma ^{2}\) when evaluated at the true θ.
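As a quick numerical check of this claim (our own illustration; the particular values of θ, \(\sigma ^{2}\), n, and the number of replications are arbitrary), the following sketch simulates repeated samples and evaluates the score at the true θ:

```python
# Minimal simulation sketch: the normal-model score n*(xbar - theta)/sigma^2,
# evaluated at the true theta, should have mean ~ 0 and variance ~ n/sigma^2.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma2, n, reps = 1.5, 2.0, 25, 20_000

x = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))  # reps samples of size n
scores = n * (x.mean(axis=1) - theta) / sigma2          # score at the true theta

print(scores.mean())  # close to 0
print(scores.var())   # close to n / sigma2 = 12.5
```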

Because of the Law of Likelihood, a natural estimate of θ is the value of θ which maximizes the likelihood or, equivalently, the log of the likelihood.

Assuming that ln[f(x; θ)] is differentiable with respect to θ, the maximum likelihood estimate is then the solution to

$$\displaystyle{\frac{\partial \ln [f(x;\theta )]} {\partial \theta } = 0}$$

which is called the likelihood or score equation. If there are r parameters, we differentiate with respect to each and equate to 0, obtaining r equations. Note that one needs to check the second derivative to ensure a maximum.

Example 1 (Binomial).

If X is binomial with parameters n and θ, then

$$\displaystyle{f(x;\theta ) ={ n\choose x}\theta ^{x}(1-\theta )^{n-x}\;\;\mbox{ $x = 0,1,\ldots,n$}}$$

First note that if x = 0 then \(f(0;\theta ) = (1-\theta )^{n}\) and in this case \(\widehat{\theta }= 0\). If x = n then \(f(n;\theta ) =\theta ^{n}\) and in this case \(\widehat{\theta }= 1\).

For \(x = 1,2,\ldots,n - 1\) we have that

$$\displaystyle{\ln [f(x;\theta )] =\ln [{n\choose x}] + x\ln (\theta ) + (n - x)\ln (1-\theta )}$$

and

$$\displaystyle{\frac{\partial \ln [f(x;\theta )]} {\partial \theta } = \frac{x} {\theta } -\frac{n - x} {1-\theta } = \frac{x - n\theta } {\theta (1-\theta )} }$$

It follows that

$$\displaystyle{\widehat{\theta }= \frac{x} {n}\;\;\mbox{ for $x = 0,1,\ldots,n$}}$$

Note that \(\widehat{\theta }\) is unbiased with variance \(\theta (1-\theta )/n\), so it is also consistent.
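The following small check (not part of the original text; the values n = 30 and x = 12 are invented) confirms that maximizing the binomial log-likelihood numerically reproduces the closed-form estimate x∕n:

```python
# Numerical maximization of the binomial log-likelihood versus the closed form x/n.
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, x = 30, 12  # illustrative data

neg_loglik = lambda theta: -binom.logpmf(x, n, theta)
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(res.x)   # numerical MLE, approximately 0.4
print(x / n)   # closed-form MLE
```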

Example 2 (Normal).

Let \(Y _{1},Y _{2},\ldots,Y _{n}\) be iid, each normal with mean μ and variance \(\sigma ^{2}\). Then we have

$$\displaystyle\begin{array}{rcl} f(y;\theta )& =& \prod _{i=1}^{n}(2\pi \sigma ^{2})^{-1/2}\exp \left \{-\frac{(y_{i}-\mu )^{2}} {2\sigma ^{2}} \right \} {}\\ & =& (2\pi \sigma ^{2})^{-n/2}\exp \left \{-\frac{\sum _{i=1}^{n}(y_{ i}-\mu )^{2}} {2\sigma ^{2}} \right \} {}\\ \end{array}$$

It follows that the log likelihood is given by

$$\displaystyle{\ln [f(y;\theta )] = -\frac{n} {2} \ln (2\pi ) -\frac{n} {2} \ln (\sigma ^{2}) - \frac{1} {2\sigma ^{2}}\sum _{i=1}^{n}(y_{ i}-\mu )^{2}}$$

Thus we have that

$$\displaystyle{\frac{\partial \ln [f(y;\theta )]} {\partial \mu } = \frac{1} {\sigma ^{2}} n(\overline{y}-\mu )}$$

and

$$\displaystyle{\frac{\partial \ln [f(y;\theta )]} {\partial \sigma ^{2}} = -\frac{n} {2\sigma ^{2}} + \frac{\sum _{i=1}^{n}(y_{i}-\mu )^{2}} {2\sigma ^{4}} }$$

and it follows that

$$\displaystyle{\widehat{\mu }= \overline{y}\;\;\mbox{ and}\;\;\widehat{\sigma }^{2} = \frac{1} {n}\sum _{i=1}^{n}(y_{ i} -\overline{y})^{2}}$$
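The sketch below (simulated data with arbitrarily chosen true values, not an example from the text) compares these closed-form estimates with a direct numerical maximization of the normal log-likelihood:

```python
# Closed-form normal MLEs (mean, and variance with divisor n) versus numerical maximization.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=200)

def neg_loglik(params):
    mu, log_sigma2 = params              # optimize log(sigma^2) so the variance stays positive
    sigma = np.sqrt(np.exp(log_sigma2))
    return -norm.logpdf(y, loc=mu, scale=sigma).sum()

res = minimize(neg_loglik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, y.mean())                          # numerical vs closed form
print(sigma2_hat, ((y - y.mean())**2).mean())    # note the divisor n, not n - 1
```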

7.2 Consistency of Maximum Likelihood

  1. Consider the case where there are only two possible values of the parameter, \(\theta _{2}\) and \(\theta _{1}\).

  2. Also suppose that we have n observations which are realized values of independent and identically distributed random variables having density \(f(x;\theta _{2})\) or \(f(x;\theta _{1})\).

The maximum likelihood estimate is defined by

$$\displaystyle{\widehat{\theta }= \left \{\begin{array}{rl} \theta _{2} & \mbox{ if $f(x_{1},x_{2},\ldots,x_{n};\theta _{2}) \geq f(x_{1},x_{2},\ldots,x_{n};\theta _{1})$} \\ \theta _{1} & \mbox{ otherwise} \end{array} \right.}$$
  1. Assume with no loss of generality that \(\theta _{2}\) is the true value of the parameter.

  2. The maximum likelihood estimator is consistent if

    $$\displaystyle{\mathbb{P}_{\theta _{2}}(\widehat{\theta }=\theta _{2})\;\;\longrightarrow \;\;1}$$

We note that \(\widehat{\theta }=\theta _{2}\) if and only if

$$\displaystyle{\frac{f(x_{1},x_{2},\ldots,x_{n};\theta _{2})} {f(x_{1},x_{2},\ldots,x_{n};\theta _{1})} =\prod _{ i=1}^{n}\frac{f(x_{i};\theta _{2})} {f(x_{i};\theta _{1})} > 1}$$

Equivalently

$$\displaystyle{\sum _{i=1}^{n}\ln \left [\frac{f(x_{i};\theta _{2})} {f(x_{i};\theta _{1})}\right ] > 0}$$

Now note that the random variables

$$\displaystyle{Y _{i} =\ln \left [\frac{f(X_{i};\theta _{2})} {f(X_{i};\theta _{1})}\right ]\;\;i = 1,2,\ldots,n}$$

are independent and identically distributed.

Moreover

$$\displaystyle\begin{array}{rcl} \mathbb{E}_{\theta _{2}}(Y _{i})& =& \int \ln \left [\frac{f(x;\theta _{2})} {f(x;\theta _{1})}\right ]f(x;\theta _{2})\lambda (dx) {}\\ & =& -\int \ln \left [\frac{f(x;\theta _{1})} {f(x;\theta _{2})}\right ]f(x;\theta _{2})\lambda (dx) {}\\ & >& -\int \left [\frac{f(x;\theta _{1})} {f(x;\theta _{2})} - 1\right ]f(x;\theta _{2})\lambda (dx) {}\\ & =& 0 {}\\ \end{array}$$

The strict inequality uses \(\ln (a) < a - 1\) for \(a\neq 1\), and the last equality holds because both densities integrate to 1. By the law of large numbers we have that

$$\displaystyle{ \frac{1} {n}\sum _{i=1}^{n}Y _{ i}\;\;\stackrel{p}{\longrightarrow }\;\;\mathbb{E}_{\theta _{2}}(Y ) > 0}$$

and hence

$$\displaystyle{\mathbb{P}_{\theta _{2}}(\widehat{\theta }=\theta _{2})\;\;\longrightarrow \;\;1}$$

i.e., \(\widehat{\theta }\) is consistent. Two remarks:

  1. The same proof holds provided the parameter space \(\Theta \) is finite.

  2. The more general case where \(\Theta \) is an interval requires more delicate arguments and is of technical, not statistical, interest.
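To illustrate the two-point consistency argument above, here is a simulation sketch (our own example: the candidate densities are taken to be N(0, 1) and N(1, 1), with \(\theta _{2} = 1\) the true value). The probability that the likelihood-ratio rule selects \(\theta _{2}\) approaches 1 as n grows:

```python
# Two-point consistency: P(theta_hat = theta_2) under theta_2 as the sample size grows.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta1, theta2, reps = 0.0, 1.0, 5_000

for n in (1, 5, 25, 100):
    x = rng.normal(theta2, 1.0, size=(reps, n))  # data generated under theta_2
    loglr = (norm.logpdf(x, theta2, 1.0) - norm.logpdf(x, theta1, 1.0)).sum(axis=1)
    print(n, (loglr > 0).mean())                 # estimated P(theta_hat = theta_2), rising toward 1
```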

7.3 General Results on the Score Function

We know that

$$\displaystyle{\int f(x;\theta )d\lambda (x) = 1}$$

for any density function f(x; θ). Recall that for a function g we write

$$\displaystyle{\int g(x;\theta )d\lambda (x) = \left \{\begin{array}{rl} \int g(x;\theta )dx&\mbox{ $X$ continuous}\\ \sum _{x}g(x;\theta ) &\mbox{ $X$ discrete} \end{array} \right.}$$

Assuming that we can differentiate under the integral or summation sign, we have that

$$\displaystyle{\int \frac{\partial f(x;\theta )} {\partial \theta } d\lambda (x) = 0}$$

Now note that

$$\displaystyle{\frac{\partial f(x;\theta )} {\partial \theta } = \frac{\partial \ln [f(x;\theta )]} {\partial \theta } f(x;\theta )}$$

It follows that

$$\displaystyle{\mathbb{E}_{\theta }\left \{\frac{\partial \ln [f(x;\theta )]} {\partial \theta } \right \} = 0}$$

Thus the expected value of the score function is 0.

If we differentiate again we have that

$$\displaystyle{\int \frac{\partial ^{2}f(x;\theta )} {\partial \theta ^{2}} d\lambda (x) = 0}$$

Noting that

$$\displaystyle\begin{array}{rcl} \frac{\partial ^{2}f(x;\theta )} {\partial \theta ^{2}} & =& \frac{\partial } {\partial \theta }\left [\frac{\partial f(x;\theta )} {\partial \theta } \right ] {}\\ & =& \frac{\partial } {\partial \theta }\left [\frac{\partial \ln [f(x;\theta )]} {\partial \theta } f(x;\theta )\right ] {}\\ \end{array}$$

we see that

$$\displaystyle\begin{array}{rcl} \frac{\partial ^{2}f(x;\theta )} {\partial \theta ^{2}} & =& \left [\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} \right ]f(x;\theta ) {}\\ & & +\left [\frac{\partial \ln [f(x;\theta )]} {\partial \theta } \right ]\frac{\partial f(x;\theta )} {\partial \theta } {}\\ \end{array}$$

The right-hand side may be written as

$$\displaystyle{\left [\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} \right ]f(x;\theta ) + \left [\frac{\partial \ln [f(x;\theta )]} {\partial \theta } \right ]^{2}f(x;\theta )}$$

It follows that

$$\displaystyle{\mathbb{E}_{\theta }\left \{\left [\frac{\partial \ln [f(x;\theta )]} {\partial \theta } \right ]^{2}\right \} = -\mathbb{E}_{\theta }\left \{\left [\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} \right ]\right \}}$$

and hence

$$\displaystyle{\mathbb{V}_{\theta }\left \{\frac{\partial \ln [f(x;\theta )]} {\partial \theta } \right \} = -\mathbb{E}_{\theta }\left \{\left [\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} \right ]\right \}}$$

The quantity

$$\displaystyle{-\mathbb{E}_{\theta }\left \{\left [\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} \right ]\right \}}$$

is called the (expected) Fisher information and

$$\displaystyle{-\left [\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} \right ]}$$

is called the (observed) Fisher information.
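As an illustration of these two facts (the example model here, Poisson with mean λ = 4, is our own choice and not from the text), the following Monte Carlo sketch checks that the score has mean zero and that the average squared score agrees with the average of minus the second derivative:

```python
# Monte Carlo check of the score identities for a Poisson(lam) model:
# score = x/lam - 1, second derivative of the log density = -x/lam**2.
import numpy as np

rng = np.random.default_rng(3)
lam, reps = 4.0, 200_000
x = rng.poisson(lam, size=reps)

score = x / lam - 1.0        # d/d(lam) of ln f(x; lam)
second = -x / lam**2         # d^2/d(lam)^2 of ln f(x; lam)

print(score.mean())          # close to 0
print((score**2).mean())     # close to 1/lam = 0.25, the expected Fisher information
print(-second.mean())        # also close to 1/lam, agreeing with the previous line
```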

7.4 General Maximum Likelihood

  1. Let X be a random variable with density f(x; θ).

  2. Assume that the parameter space \(\Theta \) is an interval and that f(x; θ) is sufficiently smooth, so that derivatives with respect to θ are defined and differentiation under a summation or integral sign is allowed.

  3. Finally, assume that the range of X does not depend on θ.

Under weak regularity conditions it follows from the previous section that

$$\displaystyle\begin{array}{rcl} \mathbb{E}_{\theta }\left \{\left [\frac{\partial \ln [f(X;\theta )]} {\partial \theta } \right ]\right \}& =& 0 {}\\ \mathbb{E}_{\theta }\left \{\left [\frac{\partial \ln [f(X;\theta )]} {\partial \theta } \right ]^{2}\right \}& =& -\mathbb{E}_{\theta }\left \{\left [\frac{\partial ^{2}\ln [f(X;\theta )]} {\partial \theta ^{2}} \right ]\right \} {}\\ \end{array}$$

Thus the random variable

$$\displaystyle{U(\theta ) = \left [\frac{\partial \ln [f(X;\theta )]} {\partial \theta } \right ]}$$

i.e., the score function has expected value and variance given by

$$\displaystyle{\mathbb{E}_{\theta }[U(\theta )] = 0\;\;, \mathbb{V}_{\theta }[U(\theta )] = i(\theta )}$$

where

$$\displaystyle{i(\theta ) = -\mathbb{E}_{\theta }\left \{\left [\frac{\partial ^{2}\ln [f(X;\theta )]} {\partial \theta ^{2}} \right ]\right \}}$$

is the expected Fisher information for a sample size of one.

Example.

If X is normal with mean θ and variance \(\sigma ^{2}\), with \(\sigma ^{2}\) known, then

$$\displaystyle{\ln [f(x;\theta )] = -\frac{1} {2}\ln [2\pi \sigma ^{2}] - \frac{1} {2\sigma ^{2}}(x-\theta )^{2}}$$

and hence

$$\displaystyle{\frac{\partial \ln [f(x;\theta )]} {\partial \theta } = \frac{x-\theta } {\sigma ^{2}} }$$

and

$$\displaystyle{\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} = -\frac{1} {\sigma ^{2}} }$$

so Fisher’s information is

$$\displaystyle{i(\theta ) = \frac{1} {\sigma ^{2}} }$$

Example.

If X is Bernoulli with parameter θ, then

$$\displaystyle{f(x;\theta ) =\theta ^{x}(1-\theta )^{1-x}}$$

and hence

$$\displaystyle{\ln [f(x;\theta )] = x\ln (\theta ) + (1 - x)\ln (1-\theta )}$$

It follows that

$$\displaystyle{\frac{\partial \ln [f(x;\theta )]} {\partial \theta } = \frac{x} {\theta } -\frac{1 - x} {1-\theta } }$$

and

$$\displaystyle{\frac{\partial ^{2}\ln [f(x;\theta )]} {\partial \theta ^{2}} = -\frac{x} {\theta ^{2}} - \frac{1 - x} {(1-\theta )^{2}}}$$

so Fisher’s information is

$$\displaystyle{i(\theta ) = \frac{1} {\theta } + \frac{1} {1-\theta } = \frac{1} {\theta (1-\theta )}}$$

If we have a random sample \(X_{1},X_{2},\ldots,X_{n}\) from f(x; θ) and if

$$\displaystyle{u_{i}(\theta ) = \frac{\partial \ln [f(x_{i};\theta )]} {\partial \theta } }$$

then, writing \(U_{i}(\theta )\) for the corresponding random variable,

$$\displaystyle{\overline{U}(\theta ) = \frac{1} {n}\sum _{i=1}^{n}U_{ i}(\theta )}$$

is the sample mean of n iid random variables with expected value 0 and variance i(θ). It follows that

$$\displaystyle{\sqrt{n}\:\overline{U}\;\;\stackrel{d}{\longrightarrow }\;\;\mbox{ N}[0,i(\theta )]}$$

by the central limit theorem.

Define the maximum likelihood estimate of θ as that value of θ which maximizes f(x; θ) or equivalently ln[f(x; θ)].

Thus we solve

$$\displaystyle{\frac{\partial \ln [f(\mathbf{x};\theta )]} {\partial \theta } = 0}$$

or when \(f(\mathbf{x};\theta ) =\prod _{ i=1}^{n}f(x_{i};\theta )\) we solve

$$\displaystyle{u(\theta ) =\sum _{ i=1}^{n}u_{ i}(\theta ) = 0}$$

Using Taylor's theorem we can write

$$\displaystyle{u(\widehat{\theta }) = u(\theta ) + \frac{du(\theta )} {d\theta } (\widehat{\theta }-\theta ) + v(\theta ^{{\ast}})\frac{(\widehat{\theta }-\theta )^{2}} {2} }$$

where

$$\displaystyle{v(\theta ^{{\ast}}) = \frac{d^{2}u(\theta )} {d\theta ^{2}} \Bigg\vert _{\theta =\theta ^{{\ast}}}}$$

and \(\theta ^{{\ast}}\) is between θ and \(\widehat{\theta }\).

Since \(u(\widehat{\theta }) = 0\) we have

$$\displaystyle{(\widehat{\theta }-\theta )\left [\frac{du(\theta )} {d\theta } + v(\theta ^{{\ast}})\frac{(\widehat{\theta }-\theta )} {2} \right ] = -u(\theta )}$$

It follows that

$$\displaystyle{\sqrt{n}(\widehat{\theta }-\theta ) = \frac{ \frac{1} {\sqrt{n}}u(\theta )} {\left [-\frac{1} {n} \frac{du(\theta )} {d\theta } - \frac{1} {n}v(\theta ^{{\ast}})\frac{(\widehat{\theta }-\theta )} {2} \right ]}}$$

Application of the results of the preceding section (the numerator converges in distribution to N[0, i(θ)], while the bracketed denominator converges in probability to i(θ)) shows that

$$\displaystyle{\sqrt{n}(\widehat{\theta }-\theta )\;\;\stackrel{d}{\longrightarrow }\;\;\mbox{ N}(0,[i(\theta )]^{-1})}$$

where i(θ) is Fisher’s information for a sample of size 1.
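A simulation sketch (our own Bernoulli example, using \(i(\theta ) = 1/[\theta (1-\theta )]\) from the earlier example; the values of θ, n, and the number of replications are arbitrary) illustrates this limiting distribution:

```python
# Asymptotic normality of the MLE: sqrt(n)*(theta_hat - theta) ~ N(0, 1/i(theta))
# for Bernoulli data, where 1/i(theta) = theta*(1 - theta).
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 400, 50_000

x = rng.binomial(1, theta, size=(reps, n))
theta_hat = x.mean(axis=1)            # MLE for each replication
z = np.sqrt(n) * (theta_hat - theta)

print(z.mean())                       # close to 0
print(z.var())                        # close to theta*(1 - theta) = 0.21
```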

7.5 Cramer-Rao Inequality

If t(X) is any unbiased estimator of θ, i.e.,

$$\displaystyle{\mathbb{E}[t(X)] =\theta }$$

then

$$\displaystyle{\int t(x)f(x;\theta )d\lambda (x) =\theta }$$

Assuming that we can differentiate under the integral or summation sign, we have that

$$\displaystyle{\int t(x)\frac{\partial \ln [f(x;\theta )]} {\partial \theta } f(x;\theta )d\lambda (x) = 1}$$

and hence, since the score has expected value 0,

$$\displaystyle{\mathbb{C}\left \{t(X),\left [\frac{\partial \ln [f(X;\theta )]} {\partial \theta } \right ]\right \} = 1}$$

It follows from the Cauchy–Schwarz inequality that

$$\displaystyle{\mathbb{V}[t(X)]\mathbb{V}\left \{\frac{\partial \ln [f(X;\theta )]} {\partial \theta } \right \} \geq 1}$$

or

$$\displaystyle{\mathbb{V}[t(X)] \geq \frac{1} {I(\theta )}}$$

where I(θ) is the expected Fisher information. Thus the smallest variance for an unbiased estimator is the inverse of Fisher’s information. This result is called the Cramer–Rao inequality.

Since 1∕I(θ) is the large-sample variance of the maximum likelihood estimator, the method of maximum likelihood produces estimators which are asymptotically efficient, i.e., which attain the smallest possible asymptotic variance.
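The following sketch (an invented normal-mean example with \(\sigma ^{2}\) known, not from the text) illustrates the bound: the sample mean, which is the maximum likelihood estimator here, attains the bound \(\sigma ^{2}/n\), while the sample median, another unbiased estimator, does not:

```python
# Cramer-Rao illustration for N(theta, sigma^2) with sigma^2 known:
# the bound is sigma^2/n; the mean attains it, the median has variance ~ pi*sigma^2/(2n).
import numpy as np

rng = np.random.default_rng(5)
theta, sigma2, n, reps = 0.0, 1.0, 100, 50_000

x = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
print(sigma2 / n)                    # Cramer-Rao lower bound: 0.01
print(x.mean(axis=1).var())          # ~ 0.01, attained by the sample mean (the MLE)
print(np.median(x, axis=1).var())    # ~ pi*sigma2/(2*n) ~ 0.0157, above the bound
```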

7.6 Summary Properties of Maximum Likelihood

  1. Maximum likelihood estimators have the equivariance property: the maximum likelihood estimate of g(θ), \(\widehat{g(\theta )}\), is \(g(\widehat{\theta })\).

  2. Under weak regularity conditions maximum likelihood estimators are consistent, i.e.,

    $$\displaystyle{\widehat{\theta }\:\stackrel{p}{\longrightarrow }\;\theta }$$

  3. Maximum likelihood estimators are asymptotically normal:

    $$\displaystyle{\sqrt{n}(\widehat{\theta }-\theta _{0})\;\stackrel{d}{\longrightarrow }\;\mbox{ N}(0,v(\theta _{0}))}$$

    where \(v(\theta _{0})\) is the inverse of Fisher's information.

  4. Maximum likelihood estimators are asymptotically efficient, i.e., in large samples

    $$\displaystyle{\mathbb{V}(\widehat{\theta }) \leq \mathbb{V}(\widetilde{\theta })}$$

    where \(\widetilde{\theta }\) is any other consistent estimator which is asymptotically normal.

The regularity conditions under which the results on maximum likelihood estimators are true consist of conditions of the form:

  (i) The range of the distributions cannot depend on the parameter.

  (ii) The first three derivatives of the log likelihood function with respect to θ exist, are continuous, and have finite expected values as functions of X.

7.7 Multiparameter Case

All of the results for maximum likelihood generalize to the case where there are p parameters \(\theta _{1},\theta _{2},\ldots,\theta _{p}\). Let

$$\displaystyle{\boldsymbol{\theta }= \left [\begin{array}{c} \theta _{1}\\ \theta _{ 2}\\ \vdots\\ \theta _{p} \end{array} \right ]}$$

If the pdf is given by

$$\displaystyle{f(\mathbf{x};\boldsymbol{\theta })}$$

the maximum likelihood or score equation is

$$\displaystyle{\frac{\partial \ln [f(\mathbf{x};\boldsymbol{\theta })]} {\partial \boldsymbol{\theta }} = \left [\begin{array}{c} \frac{\partial \ln [f(\mathbf{x};\boldsymbol{\theta })]} {\partial \theta _{1}} \\ \frac{\partial \ln [f(\mathbf{x};\boldsymbol{\theta })]} {\partial \theta _{2}}\\ \vdots \\ \frac{\partial \ln [f(\mathbf{x};\boldsymbol{\theta })]} {\partial \theta _{p}} \end{array} \right ] = \mathbf{0}}$$

Fisher’s information matrix

$$\displaystyle{\mathcal{I}(\boldsymbol{\theta })}$$

has ij element given by

$$\displaystyle{-\:\mathbb{E}_{\boldsymbol{\theta }}\left \{\frac{\partial ^{2}\ln [f(\mathbf{x};\boldsymbol{\theta })]} {\partial \theta _{i}\partial \theta _{j}} \right \}}$$

Note that it is a p × p matrix.

Under regularity conditions similar to those for the single-parameter case, we have

  1. The maximum likelihood estimate of \(g(\boldsymbol{\theta })\), \(\widehat{g(\boldsymbol{\theta })}\), is \(g(\widehat{\boldsymbol{\theta }})\).

  2. Maximum likelihood estimators are consistent, i.e.,

    $$\displaystyle{\widehat{\boldsymbol{\theta }}\:\stackrel{p}{\longrightarrow }\;\boldsymbol{\theta }}$$

  3. Maximum likelihood estimators are asymptotically normal:

    $$\displaystyle{(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta }_{0})\; \approx \;\mbox{ N}(\mathbf{0},\mathbf{V}_{n}(\boldsymbol{\theta }_{0}))}$$

    where \(\mathbf{V}_{n}(\boldsymbol{\theta }_{0})\) is the inverse of Fisher's information matrix. We can replace \(\boldsymbol{\theta }_{0}\) by \(\widehat{\boldsymbol{\theta }}\) to use this result to determine confidence intervals, as in the sketch below.
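As an illustration (simulated normal data with both μ and \(\sigma ^{2}\) unknown; the information entries used below are the standard ones for this model and are not quoted in the text), the following sketch builds approximate 95% Wald intervals from the inverse information evaluated at the maximum likelihood estimates:

```python
# Approximate Wald confidence intervals from the inverse Fisher information,
# evaluated at the MLE, for the normal model with unknown mean and variance.
# Expected information for n observations: diag(n/sigma^2, n/(2*sigma^4)).
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(10.0, 3.0, size=150)
n = y.size

mu_hat = y.mean()
sigma2_hat = ((y - mu_hat) ** 2).mean()

info = np.diag([n / sigma2_hat, n / (2 * sigma2_hat ** 2)])
cov = np.linalg.inv(info)                 # approximate covariance of (mu_hat, sigma2_hat)
se = np.sqrt(np.diag(cov))

for name, est, s in zip(["mu", "sigma^2"], [mu_hat, sigma2_hat], se):
    print(f"{name}: {est:.3f} +/- {1.96 * s:.3f}")   # approximate 95% intervals
```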

7.8 Maximum Likelihood in the Multivariate Normal

Let \(\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{n}\) be independent, each having a multivariate normal distribution with parameters \(\boldsymbol{\mu }\) and \(\boldsymbol{\Sigma }\), i.e.,

$$\displaystyle{f_{\mathbf{Y}_{i}}(\mathbf{y}_{i};\boldsymbol{\mu },\boldsymbol{\Sigma }) = (2\pi )^{-\frac{p} {2} }[\det (\boldsymbol{\Sigma })]^{-\frac{1} {2} }\exp \left \{-\frac{1} {2}(\mathbf{y}_{i}-\boldsymbol{\mu })^{\top }\boldsymbol{\Sigma }^{-1}(\mathbf{y}_{ i}-\boldsymbol{\mu })\right \}}$$

The joint density is thus

$$\displaystyle{f_{\mathbf{Y}}(\mathbf{y};\boldsymbol{\mu },\boldsymbol{\Sigma }) = (2\pi )^{-\frac{np} {2} }[\det (\boldsymbol{\Sigma })]^{-\frac{n} {2} }\exp \left \{-\frac{1} {2}\sum _{i=1}^{n}(\mathbf{y}_{ i}-\boldsymbol{\mu })^{\top }\boldsymbol{\Sigma }^{-1}(\mathbf{y}_{ i}-\boldsymbol{\mu })\right \}}$$

We will show that the maximum likelihood estimates of \(\boldsymbol{\mu }\) and \(\boldsymbol{\Sigma }\) are

$$\displaystyle{\widehat{\boldsymbol{\mu }}= \overline{\mathbf{y}} = \frac{1} {n}\sum _{i=1}^{n}\mathbf{y}_{ i}}$$

and

$$\displaystyle{\widehat{\boldsymbol{\Sigma }} = \mathbf{S} = \frac{1} {n}\sum _{i=1}^{n}(\mathbf{y}_{ i} -\overline{\mathbf{y}})(\mathbf{y}_{i} -\overline{\mathbf{y}})^{\top }}$$

i.e., the jk element of S is

$$\displaystyle{ \frac{1} {n}\sum _{i=1}^{n}(y_{ ij} -\overline{y}_{j})(y_{ik} -\overline{y}_{k})}$$

essentially the sample covariance between the jth and kth variable.

The first step is to note that

$$\displaystyle{\sum _{i=1}^{n}(\mathbf{y}_{ i}-\boldsymbol{\mu })^{\top }\boldsymbol{\Sigma }^{-1}(\mathbf{y}_{ i}-\boldsymbol{\mu })}$$

can be written as

$$\displaystyle{\sum _{i=1}^{n}(\mathbf{y}_{ i} -\overline{\mathbf{y}})^{\top }\boldsymbol{\Sigma }^{-1}(\mathbf{y}_{ i} -\overline{\mathbf{y}}) + n(\overline{\mathbf{y}}-\boldsymbol{\mu })^{\top }\boldsymbol{\Sigma }^{-1}(\overline{\mathbf{y}}-\boldsymbol{\mu })}$$

or

$$\displaystyle{n\mbox{ tr}\left [\boldsymbol{\Sigma }^{-1}\mathbf{S}\right ] + n(\overline{\mathbf{y}}-\boldsymbol{\mu })^{\top }\boldsymbol{\Sigma }^{-1}(\overline{\mathbf{y}}-\boldsymbol{\mu })}$$

where the trace of a square matrix, tr(A), is the sum of the diagonal elements, i.e.,

$$\displaystyle{\mbox{ tr}(A) =\sum _{ i=1}^{p}a_{ ii}}$$

Thus the joint density \(f_{\mathbf{Y}}(\mathbf{y};\boldsymbol{\mu },\boldsymbol{\Sigma })\) can be written as

$$\displaystyle{(2\pi )^{-\frac{np} {2} }[\det (\boldsymbol{\Sigma })]^{-\frac{n} {2} }\exp \left \{-\frac{n} {2} \mbox{ tr}\left [\boldsymbol{\Sigma }^{-1}\mathbf{S}\right ] -\frac{n} {2} (\overline{\mathbf{y}}-\boldsymbol{\mu })^{\top }\boldsymbol{\Sigma }^{-1}(\overline{\mathbf{y}}-\boldsymbol{\mu })\right \}}$$

It follows immediately that the maximum likelihood estimate of \(\boldsymbol{\mu }\) is \(\overline{\mathbf{y}}\) and the joint density at \(\widehat{\boldsymbol{\mu }}\) and \(\widehat{\boldsymbol{\Sigma }} = \mathbf{S}\) is thus

$$\displaystyle{f_{\mathbf{Y}}(\mathbf{y};\widehat{\boldsymbol{\mu }},\widehat{\boldsymbol{\Sigma }}) = (2\pi )^{-\frac{np} {2} }[\det (\mathbf{S})]^{-\frac{n} {2} }\exp \left \{-\frac{np} {2} \right \}}$$

The ratio

$$\displaystyle{\frac{f_{\mathbf{Y}}(\mathbf{y};\widehat{\boldsymbol{\mu }},\widehat{\boldsymbol{\Sigma }})} {f_{\mathbf{Y}}(\mathbf{y};\boldsymbol{\mu },\boldsymbol{\Sigma })}}$$

is thus equal to

$$\displaystyle{ \frac{[\det (\mathbf{S})]^{-\frac{n} {2} }\exp \left \{-\frac{np} {2}\right \}} {[\det (\boldsymbol{\Sigma })]^{-\frac{n} {2} }\exp \left \{-\frac{n}{2}\mbox{ tr}\left [\boldsymbol{\Sigma }^{-1}\mathbf{S}\right ] -\frac{n} {2}(\overline{\mathbf{y}}-\boldsymbol{\mu })^{\top }\boldsymbol{\Sigma }^{-1}(\overline{\mathbf{y}}-\boldsymbol{\mu })\right \}}}$$

which is greater than or equal to

$$\displaystyle{\det (\boldsymbol{\Sigma }^{-1}\mathbf{S})^{-\frac{n} {2} }\exp \left \{-\frac{np} {2} + \frac{n} {2} \mbox{ tr}\left [\boldsymbol{\Sigma }^{-1}\mathbf{S}\right ]\right \}}$$

This ratio is greater than or equal to 1 if and only if its logarithm is greater than or equal to 0. The logarithm is

$$\displaystyle{\frac{n} {2} \left \{-\ln \left [\det (\boldsymbol{\Sigma }^{-1}\mathbf{S})\right ] - p + \mbox{ tr}\left [\boldsymbol{\Sigma }^{-1}\mathbf{S}\right ]\right \}}$$

If \(\lambda _{1},\lambda _{2},\ldots,\lambda _{p}\) are the characteristic roots of \(\boldsymbol{\Sigma }^{-1}\mathbf{S}\) then it can be shown that

  1. \(\lambda _{i} \geq 0\) for each i

  2. \(\det (\boldsymbol{\Sigma }^{-1}\mathbf{S}) =\prod _{ i=1}^{p}\lambda _{i}\)

  3. \(\mbox{ tr}(\boldsymbol{\Sigma }^{-1}\mathbf{S}) =\sum _{ i=1}^{p}\lambda _{i}\)

It follows that the log of the ratio is greater than or equal to

$$\displaystyle{\frac{n} {2} \left \{-\sum _{i=1}^{p}\ln (\lambda _{ i}) - p +\sum _{ i=1}^{p}\lambda _{ i}\right \}}$$

or

$$\displaystyle{\frac{n} {2} \left \{\sum _{i=1}^{p}\left [\lambda _{ i} - 1 -\ln (\lambda _{i})\right ]\right \}}$$

which is greater than or equal to zero since

$$\displaystyle{a - 1 -\ln (a) \geq 0\;\;\mbox{ for any positive real number $a$}}$$

Thus the maximum likelihood estimators for the multivariate normal are

$$\displaystyle{\widehat{\boldsymbol{\mu }}= \overline{\mathbf{y}}\;\;\;\mbox{ and}\;\;\;\widehat{\boldsymbol{\Sigma }} = \mathbf{S}}$$

We usually use

$$\displaystyle{ \frac{n} {n - 1}\mathbf{S}}$$

as the estimator so that the estimated components of \(\boldsymbol{\Sigma }\) are exactly the sample covariances and variances.
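A short sketch (simulated bivariate normal data with invented \(\boldsymbol{\mu }\) and \(\boldsymbol{\Sigma }\)) computes these estimates and shows that numpy's default covariance uses the divisor n − 1, i.e., equals \(\frac{n}{n-1}\mathbf{S}\):

```python
# Multivariate normal MLEs: sample mean vector and S with divisor n,
# compared with np.cov, which uses divisor n - 1 by default.
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
y = rng.multivariate_normal(mu, Sigma, size=500)   # rows are observations
n = y.shape[0]

mu_hat = y.mean(axis=0)
centered = y - mu_hat
S = centered.T @ centered / n                      # MLE of Sigma (divisor n)

print(mu_hat)
print(S)
print(np.cov(y, rowvar=False))                     # equals (n / (n - 1)) * S
```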

7.9 Multinomial

Suppose that \(X_{1},X_{2},\ldots,X_{k}\) have a multinomial distribution, i.e.,

$$\displaystyle{f(x_{1},x_{2},\ldots,x_{k};\theta _{1},\theta _{2},\ldots,\theta _{k}) = n!\prod _{i=1}^{k}\frac{\theta _{i}^{x_{i}}} {x_{i}!}}$$

where

$$\displaystyle{0 \leq x_{i} \leq n\;\;\mbox{ each $i = 1,2,\ldots,k$ and}\;\;\sum _{i=1}^{k}x_{ i} = n}$$

and

$$\displaystyle{0 \leq \theta _{i} \leq 1\;\;\mbox{ each $i = 1,2,\ldots,k$ and}\;\;\sum _{i=1}^{k}\theta _{ i} = 1}$$

Note that

$$\displaystyle{\theta _{k} = 1 -\sum _{i=1}^{k-1}\theta _{ i}\;\;\mbox{ and}\;\;x_{k} = n -\sum _{i=1}^{k-1}x_{ i}}$$

The maximum likelihood estimates of the \(\theta _{i}\) are found by taking the partial derivatives of the log likelihood with respect to \(\theta _{i}\) for \(i = 1,2,\ldots,k - 1\), where the log likelihood is

$$\displaystyle{\ln [f(\mathbf{x},\boldsymbol{\theta })] =\ln (n!) -\sum _{i=1}^{k}\ln (x_{ i}!) +\sum _{ i=1}^{k}x_{ i}\ln (\theta _{i})}$$

Since \(\theta _{k} = 1 -\theta _{1} -\theta _{2} -\cdots -\theta _{k-1}\) we have

$$\displaystyle{\frac{\partial \ln [f(\mathbf{x},\boldsymbol{\theta })]} {\partial \theta _{i}} = \frac{x_{i}} {\theta _{i}} -\frac{x_{k}} {\theta _{k}} }$$

for \(i = 1,2,\ldots,k - 1\). It follows that the maximum likelihood estimates satisfy

$$\displaystyle{x_{i}\widehat{\theta }_{k} =\widehat{\theta } _{i}x_{k}\;\;\mbox{ for $i = 1,2,\ldots,k - 1$}}$$

Summing from i = 1 to k − 1 yields

$$\displaystyle{(n - x_{k})\widehat{\theta }_{k} = (1 -\widehat{\theta }_{k})x_{k}}$$

and hence

$$\displaystyle{n\widehat{\theta }_{k} = x_{k}}$$

so that

$$\displaystyle{\frac{x_{i}x_{k}} {n} =\widehat{\theta } _{i}x_{k}\;\;\mbox{ or}\;\;\widehat{\theta }_{i} = \frac{x_{i}} {n} }$$

The second derivatives of the log likelihood are given by

$$\displaystyle{\frac{\partial ^{2}\ln [f(\mathbf{x},\boldsymbol{\theta })]} {\partial \theta _{i}^{2}} = -\frac{x_{i}} {\theta _{i}^{2}} -\frac{x_{k}} {\theta _{k}^{2}} }$$

which has expected value

$$\displaystyle{-\frac{n\theta _{i}} {\theta _{i}^{2}} -\frac{n\theta _{k}} {\theta _{k}^{2}} = -\frac{n} {\theta _{i}} -\frac{n} {\theta _{k}} }$$

and, for \(i\neq j\),

$$\displaystyle{\frac{\partial ^{2}\ln [f(\mathbf{x},\boldsymbol{\theta })]} {\partial \theta _{i}\partial \theta _{j}} = -\frac{x_{k}} {\theta _{k}^{2}} }$$

which has expected value

$$\displaystyle{-\frac{n\theta _{k}} {\theta _{k}^{2}} = -\frac{n} {\theta _{k}} }$$

Thus Fisher’s information matrix, \(\mathcal{I}(\boldsymbol{\theta })\), is given by

$$\displaystyle{\mathcal{I}(\boldsymbol{\theta }) = n\:\left [\begin{array}{cccc} \frac{1} {\theta _{1}} + \frac{1} {\theta _{k}} & \frac{1} {\theta _{k}} & \cdots & \frac{1} {\theta _{k}} \\ \frac{1} {\theta _{k}} & \frac{1} {\theta _{2}} + \frac{1} {\theta _{k}} & \cdots & \frac{1} {\theta _{k}}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{1} {\theta _{k}} & \frac{1} {\theta _{k}} & \cdots & \frac{1} {\theta _{k-1}} + \frac{1} {\theta _{k}} \end{array} \right ]}$$

Fisher’s information can be written in matrix form as

$$\displaystyle{n\:\left [\mathbf{D}(\boldsymbol{\theta })^{-1} + \frac{1} {\theta _{k}}\mathbf{1}\mathbf{1}^{\top }\right ]}$$

where \(\mathbf{D}(\boldsymbol{\theta })\) is a \((k - 1) \times (k - 1)\) diagonal matrix with diagonal elements \(\theta _{1},\theta _{2},\ldots,\theta _{k-1}\) and \(\mathbf{1}\) is a \((k - 1)\)-dimensional column vector with each element equal to 1.

The general theory of maximum likelihood then implies that

$$\displaystyle{\sqrt{n}(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })\;\;\stackrel{d}{\longrightarrow }\;\;\mbox{ N}\left (\mathbf{0},[i(\boldsymbol{\theta })]^{-1}\right )}$$

where \(i(\boldsymbol{\theta })\) is Fisher’s information matrix with n = 1.

It is easy to check that

$$\displaystyle{[i(\boldsymbol{\theta })]^{-1} = \mathbf{D}(\boldsymbol{\theta }) -\boldsymbol{\theta }\boldsymbol{\theta }^{\top }}$$

or

$$\displaystyle{[i(\boldsymbol{\theta })]^{-1} = \left [\begin{array}{cccc} \theta _{1}(1 -\theta _{1})& -\theta _{1}\theta _{2} & \cdots & -\theta _{1}\theta _{k-1} \\ -\theta _{2}\theta _{1} & \theta _{2}(1 -\theta _{2})&\cdots & -\theta _{2}\theta _{k-1}\\ \vdots & \vdots & \ddots & \vdots \\ -\theta _{k-1}\theta _{1} & -\theta _{k-1}\theta _{2} & \cdots &\theta _{k-1}(1 -\theta _{k-1}) \end{array} \right ]}$$

which we recognize as the variance–covariance matrix of \(X_{1},X_{2},\ldots,X_{k-1}\) for a single trial (n = 1).

Standard maximum likelihood theory implies that

$$\displaystyle{n(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })^{\top }[i(\boldsymbol{\theta })]\:(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })\;\;\stackrel{d}{\longrightarrow }\;\;\chi ^{2}(k - 1)}$$

Now note that

$$\displaystyle{n(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })^{\top }[i(\boldsymbol{\theta })](\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })}$$

is equal to

$$\displaystyle{n(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })^{\top }\left [\mathbf{D}(\boldsymbol{\theta })^{-1} + \frac{1} {\theta _{k}}\mathbf{1}\mathbf{1}^{\top }\right ](\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })}$$

and hence to

$$\displaystyle{n(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })^{\top }\mathbf{D}(\boldsymbol{\theta })^{-1}(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta }) + \frac{n} {\theta _{k}} (\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })^{\top }\mathbf{1}\mathbf{1}^{\top }(\widehat{\boldsymbol{\theta }}-\boldsymbol{\theta })}$$

This last expression simplifies to

$$\displaystyle{n\sum _{i=1}^{k-1}\frac{(\widehat{\theta }_{i} -\theta _{i})^{2}} {\theta _{i}} + \frac{n} {\theta _{k}} \left [\sum _{i=1}^{k-1}(\widehat{\theta }_{ i} -\theta _{i})\right ]^{2}}$$

which in turn simplifies to

$$\displaystyle{\sum _{i=1}^{k-1}\frac{(x_{i} - n\theta _{i})^{2}} {n\theta _{i}} + \frac{n} {\theta _{k}} (\theta _{k} -\widehat{\theta }_{k})^{2}}$$

and to

$$\displaystyle{\sum _{i=1}^{k-1}\frac{(x_{i} - n\theta _{i})^{2}} {n\theta _{i}} + \frac{(x_{k} - n\theta _{k})^{2}} {n\theta _{k}} }$$

This finally reduces to

$$\displaystyle{\sum _{i=1}^{k}\frac{(x_{i} - n\theta _{i})^{2}} {n\theta _{i}} }$$

Noting that \(\mathbb{E}(X_{i}) = n\theta _{i} = E_{i}\), this last formula may be written as

$$\displaystyle{\sum _{i=1}^{k}\frac{(X_{i} - E_{i})^{2}} {E_{i}} }$$

which is called Pearson's chi-square statistic. For large n it has approximately a chi-square distribution with k − 1 degrees of freedom, as illustrated below.
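A brief sketch (invented counts and hypothesized cell probabilities) computes Pearson's statistic directly and compares it with scipy.stats.chisquare and the \(\chi ^{2}(k - 1)\) reference distribution:

```python
# Pearson's chi-square statistic for multinomial counts against hypothesized probabilities.
import numpy as np
from scipy.stats import chi2, chisquare

x = np.array([18, 22, 30, 30])           # observed counts, n = 100
theta0 = np.array([0.2, 0.2, 0.3, 0.3])  # hypothesized cell probabilities
n, k = x.sum(), x.size

expected = n * theta0
stat = ((x - expected) ** 2 / expected).sum()

print(stat)                              # Pearson's statistic, here 0.4
print(chi2.sf(stat, df=k - 1))           # p-value from the chi-square(k - 1) limit
print(chisquare(x, f_exp=expected))      # same statistic and p-value via SciPy
```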