Point estimation concerns making inferences about a quantity that is unknown but about which some information is available, e.g., a fixed quantity θ for which we have n imperfect measurements x1,…, xn. The theory of estimation deals with how best to use the information (combine the values x1,…, xn) to obtain a single number, an estimate, for θ, say \( \widehat{\theta} \). Interval estimation does not reduce the available information to a single number and is a special case of hypothesis testing. This entry deals only with point estimation.

Justification for any particular way of combining the available information can be given only in terms of a model connecting the x’s to θ. For example, in the case of imperfect measurements x1,…, xn, we could regard the errors, xi − θ, i = 1,…, n, as independent outcomes of a random process so that the joint distribution of the x’s depends on θ:

$$ p\left({x}_1,\dots, {x}_n|\theta \right)=\prod_1^nf\left({x}_i-\theta \right). $$

In general, a statistical model represents the data, observations x1,…, xn, where the x’s may be vectors of quantities, as having arisen as a drawing from a joint distribution depending on some unknown parameters θ = (θ1,…, θk)′. For example, consider x1,…, xT, where each xt is identically and independently distributed according to a univariate normal distribution with mean μ and variance σ2 (Cramér 1946). The “location parameter,” μ, and the “scale parameter,” σ2, are unknown but, because they determine the distribution from which the data are supposed to arise, the data may be used to form a point estimate of the vector θ = (μ, σ2)′, e.g., \( \widehat{\theta}={\left(\overline{x}={\varSigma}_1^T{x}_t/T,\kern0.24em {s}^2={\varSigma}_1^T{\left({x}_t-\overline{x}\right)}^2/T\right)}^{\prime } \), the properties of which may be discussed in terms of various criteria and the properties of the family of probability distributions p(x|θ) from which the data are assumed to come. An estimator is a function of the observations; an estimate is the value of such a function for a particular set of observations. The theory of point estimation concerns the justification for estimators in terms of the properties of the estimates which they yield relative to specified criteria.
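As a concrete illustration, a minimal sketch of the estimator \( \widehat{\theta} \) above (the simulated data, sample size and seed are illustrative, not part of the original discussion):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate T imperfect "measurements" from a normal model whose location mu
# and scale sigma^2 are unknown to the analyst.
mu_true, sigma_true, T = 5.0, 2.0, 200
x = rng.normal(mu_true, sigma_true, size=T)

# Point estimates of theta = (mu, sigma^2)':
x_bar = x.sum() / T                    # sample mean, estimate of mu
s2 = ((x - x_bar) ** 2).sum() / T      # sample variance (divisor T), estimate of sigma^2

print("theta_hat =", (x_bar, s2))
```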

General treatments of the theory of point estimation may be found in Lehmann (1983), Cox and Hinkley (1974), Rao (1973) and Zellner (1971), inter alia.

Econometric estimation problems usually concern inferences about the parameters of conditional rather than unconditional distributions. For example, if the observations (y1, x1),…, (yn, xn) are assumed to represent a drawing from a multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, then the conditional distribution of y given x, p(y|x, θ), is univariate normal with mean \( {\theta}_1={\mu}_1+{\sigma}_{12}{\sigma}_{22}^{-1}\left(x-{\mu}_2\right) \) and variance \( {\theta}_2={\sigma}_{11}-{\sigma}_{12}{\sigma}_{22}^{-1}{\sigma}_{21} \), where

$$ \mu =\left({\mu}_1,{\mu}_2\right)\kern0.48em \mathrm{and}\kern0.48em \sum =\left[\begin{array}{cc}\hfill {\sigma}_{11}\hfill & \hfill {\sigma}_{12}\hfill \\ {}\hfill {\sigma}_{21}\hfill & \hfill {\sigma}_{22}\hfill \end{array}\right]. $$

Note that θ1 is a linear function of x which depends upon the parameters of the originally assumed joint distribution; this function is called the regression of y on x. Regression analysis deals with the general problem of estimating such functions which characterize conditional distributions, usually those derived from normal distributions.
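A small sketch of this conditional-mean and conditional-variance calculation, with made-up values for μ and Σ:

```python
import numpy as np

# Illustrative parameters of the joint normal distribution of (y, x).
mu = np.array([1.0, 2.0])                  # (mu_1, mu_2) = (E y, E x)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 3.0]])             # [[s11, s12], [s21, s22]]

x = 2.5                                    # a particular value of the conditioning variable

# Regression of y on x: conditional mean and conditional variance.
theta1 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x - mu[1])        # mu_1 + s12 s22^{-1} (x - mu_2)
theta2 = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]  # s11 - s12 s22^{-1} s21

print("conditional mean:", theta1, " conditional variance:", theta2)
```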

A standard method, and the one most common in econometrics, for obtaining estimators is the method of maximum likelihood. Consideration of this method provides a good introduction to alternative principles of estimation. Let the data x = (x1,…, xn)′ be fixed and regard p(x|θ) as a function of θ; it is then called the likelihood. The value \( \widehat{\theta} \) = \( \widehat{\theta} \)(x1,…, xn) which maximizes p(x|θ), if it exists and is unique, is called the maximum-likelihood estimator, or estimate (MLE). (For a general survey, see Norden 1972–1973, or Lehmann 1983.) The MLE of a continuous function g(θ) is g(\( \widehat{\theta} \)), where \( \widehat{\theta} \) is the MLE of θ. Other desirable properties of the MLE are asymptotic as n → ∞. Under regularity conditions: (1) The MLE is weakly consistent, i.e., \( {\lim}_{n\to \infty }\Pr \left(|{\widehat{\theta}}_n-\theta |<\epsilon \right)=1 \) for all ε > 0. (2) The MLE is asymptotically normal, i.e., the distribution of \( \widehat{\theta} \) appropriately normalized, \( \surd n\left({\widehat{\theta}}_n-\theta \right) \), tends to the normal distribution, with mean 0 and variance-covariance matrix [I(θ)]−1 where

$$ I\left(\theta \right)=-E\left[{\partial}^2\log\ p\left(x|\theta \right)/\partial \theta \partial {\theta}^{\prime}\right]. $$

I(θ) is called the information matrix and measures the information a single observation contains about the parameter θ. (3) The MLE is asymptotically efficient in the sense that if θ* is any other estimator such that \( \surd n\left({\theta}_n^{\ast }-\theta \right) \) tends in distribution to the normal with mean zero and variance-covariance matrix ∑(θ), the matrix [∑(θ) − I−1(θ)] is positive semi-definite. For example, in the case of one parameter this means that no other asymptotically normal estimator has, as n → ∞, a smaller variance than the MLE. The conditions for asymptotic normality do ensure, with probability tending to one, a solution to the likelihood equation ∂ log p(x|θ)/∂θ = 0 which is consistent, asymptotically normal and efficient. The problem is that there may be more than one solution, but only one can be the MLE. When the number of parameters to be estimated (elements of the vector θ) tends to infinity with n, the MLE’s for some may exist but may not be consistent (Neyman and Scott 1948).
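These ideas can be illustrated numerically. The following minimal sketch assumes a normal likelihood and uses a generic quasi-Newton optimizer; the data, starting values and parameterization are illustrative choices, not prescribed by the text:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=500)   # simulated data

def neg_loglik(theta, x):
    """Negative log-likelihood of N(mu, sigma^2); theta = (mu, log_sigma)."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                  # keep sigma > 0
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / sigma**2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(x,))   # default BFGS
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# res.hess_inv approximates the inverse Hessian of the total negative log-likelihood,
# i.e. the usual large-sample estimate of the variance-covariance matrix of the MLE
# (here in the (mu, log sigma) parameterization).
print("MLE:", mu_hat, sigma_hat)
print("approximate asymptotic covariance:\n", res.hess_inv)
```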

Solutions to the likelihood equation are not the only estimators which may be consistent, asymptotically normal and efficient, but comparison with the MLE, assuming correct specification of p(x|θ), is facilitated by the fact that all have a normal distribution as n → ∞. For fixed n, the distributions of different estimators are difficult to determine and may, indeed, be quite different. Moreover, when the distributions underlying the data are misspecified, the MLE’s generally no longer have these optimal properties (White 1982; Gourieroux et al. 1984), although other, weaker, optimality properties remain. Apart from specification problems, however, the likelihood function provides an important and useful summary of the data, and point estimates and hypothesis testing procedures based on it are often justified in this way (Fisher 1925; Barnard et al. 1962; Edwards 1972).

The ‘accuracy’ of an estimator \( \widehat{\theta} \) of a scalar parameter θ may be measured (defined) in a variety of ways: by its expected squared or absolute error, relative error, or by Pr{|\( \widehat{\theta} \) − θ| ≤ α} for some α. Any choice is arbitrary; for convenience expected squared error is the usual choice. Some justification for a particular choice may be provided in terms of a loss function L(θ, \( \widehat{\theta} \)) or the expected loss EL(θ, \( \widehat{\theta} \)), or risk function, of statistical decision theory. Choice of estimators may be justified in terms of the extent to which the choice minimizes risk or some aspect thereof. Both the sampling theory and Bayesian approaches to estimation can be interpreted in these terms.

A very weak property that any estimator should have is that no other estimator exists which dominates it in the sense that the latter leads to estimates having uniformly lower expected loss irrespective of θ. Estimators satisfying this criterion are called admissible.

In the sampling theoretic approach, emphasis is placed on finding estimators which have desirable properties in terms of relative frequencies in hypothetically repeated samples. For example, we might require that the distribution of an estimator be centred on the true parameter value, i.e., E(\( \widehat{\theta} \) − θ) = 0. Such estimators are called unbiased. Among all unbiased estimators we presumably would prefer one yielding estimates with a distribution concentrated about the mean. Such minimum variance unbiased estimators (MVU) play a key role in the theory of estimation. Specifically, the famous Rao-Blackwell Theorem states that if an unbiased estimator \( \widehat{\theta} \) is a function of a complete sufficient statistic for θ then it is MVU. A statistic, say T, is said to be sufficient for θ if the conditional distribution of the observations given T is independent of θ. Completeness is a property of the family of distributions of T: a family P of distributions (of T) indexed by a parameter θ is said to be complete if there is no ‘unbiased estimator of zero’ other than Φ(x) ≡ 0. Note that choosing an estimator so as to minimize the expected squared error of the estimate it yields is equivalent to minimizing the unweighted sum of the variance and the squared bias. From a decision theoretic point of view, it may be better to accept an estimator with a small bias if such an estimator has a smaller risk.

In the sampling theoretic approach, emphasis is given to the distribution of estimates yielded by a specified estimator. The likelihood approach, on the other hand, emphasizes the distribution of the observations, given a parametrically specified distribution, under alternative values of these parameters. Concern is primarily with the maximum value of the likelihood function with respect to the parameters and its curvature near the point at which the global maximum occurs, but some approaches stress the relevance of the likelihood function in other neighbourhoods (Barnard et al. 1962; Edwards 1972). The Bayesian approach carries concern with the entire likelihood function further: estimation and inference are based on the posterior density of the unknown parameters of the distribution generating the observations. This posterior density is proportional to the likelihood function multiplied by a prior density of the parameters, i.e., a weighted average of likelihoods for different parameter values where the weights are determined by prior (subjective) beliefs. (See BAYESIAN INFERENCE.)

In the Bayesian approach, both observations and parameters are taken to be stochastic. Let p(x, θ) be the joint probability density function for an observation vector, x, and a parameter vector θ; then p(x, θ) = p(x|θ)p(θ) = p(θ|x)p(x), where p(ξ|η) denotes the conditional density of ξ given η and p(ξ) denotes the marginal density of ξ. Thus p(θ|x) is proportional to p(θ)p(x|θ), the factor of proportionality being the reciprocal of

$$ p(x)=\int p\left(\theta \right)p\left(x|\theta \right)\mathrm{d}\theta . $$

p(θ|x) is the posterior distribution of θ after having observed the data; p(θ) is the prior distribution of θ and p(x|θ) is the likelihood. Alternatively, consider the weighted average risk (as defined above):

$$ \int EL\left(\theta, \widehat{\theta}\right)w\left(\theta \right)\mathrm{d}\theta, $$

with weights w(θ) such that

$$ \int w\left(\theta \right)\mathrm{d}\theta =1. $$

When L(θ, \( \widehat{\theta} \)) = (\( \widehat{\theta} \) − θ)2, the estimator which minimizes such a weighted average risk is

$$ \widehat{\theta}(x)=\frac{\int \theta\, w\left(\theta \right)p\left(x|\theta \right)\,\mathrm{d}\theta }{\int w\left(\theta \right)p\left(x|\theta \right)\,\mathrm{d}\theta }. $$

If the weights w(θ) are taken to be the values of the marginal density p(θ), the mean of the posterior distribution minimizes the expected squared error of the estimates when both the variation of the data and the uncertainty with respect to θ are taken into account: \( \widehat{\theta} \) is the expected value of θ based on the posterior distribution of θ.

As n → ∞, it may be shown that the influence of the prior distribution diminishes until in the limit it disappears; then, under general circumstances, the minimization of mean square error in the Bayesian framework yields the MLE. The principal difficulty in the Bayesian approach is the choice of a reasonable prior for θ, p(θ). (For a comprehensive discussion, see Zellner 1971.)
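A minimal sketch of the Bayes (posterior-mean) estimator under squared error loss, assuming a conjugate normal prior and known data variance (all numerical values are illustrative); it also shows the prior’s influence vanishing as n grows:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data: x_i ~ N(theta, sigma^2) with sigma known; prior: theta ~ N(m0, tau0^2).
sigma, theta_true = 1.0, 0.7
m0, tau0 = 0.0, 2.0

for n in (5, 50, 5000):
    x = rng.normal(theta_true, sigma, size=n)
    # Conjugate updating: the posterior is normal with precision-weighted mean.
    prec_post = 1 / tau0**2 + n / sigma**2
    post_mean = (m0 / tau0**2 + x.sum() / sigma**2) / prec_post
    # The posterior mean is the Bayes estimate under squared error loss; as n grows
    # it approaches the MLE (the sample mean), illustrating the vanishing prior influence.
    print(n, "posterior mean:", round(post_mean, 4), " sample mean:", round(x.mean(), 4))
```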

Instead of minimizing the expected loss, one may minimize the maximum loss. Estimators which do so are called minimax; the theory is developed in Wald (1950).

There are three general approaches to choice of a prior in Bayesian analysis. First, the prior may be obtained empirically (Maritz 1970). For example, suppose that the problem is to estimate the percentage of defective items in a particular batch. If similar batches were produced in the past, the proportions of defective items observed in those batches suggest a prior. This kind of ‘updating’ forms the basis for the celebrated Kalman filter. Second, the prior may be viewed as representing a ‘rational degree of belief’ (Jeffreys 1961). What represents a ‘rational degree’ is not specified, but the idea leads directly to the use of priors that represent knowing little or nothing, so-called non-informative priors. However, total ignorance has proved difficult to capture in many cases. A third approach is that the prior represents a subjective degree of belief (Savage 1954; Raiffa and Schlaifer 1961). But of whom? And how arrived at? Minimax-estimation theory offers one possible approach, for it leads to the minimum mean-square-error Bayes estimator, i.e., the mean of the posterior distribution of the parameters, when the prior is least favourable in the sense of making expected loss the largest for whatever class of priors is chosen.

Related to this problem is the more general question of robust estimation. In order to make sense of any data, it is necessary to assume something. For example, the justification for using the sample mean to estimate the mean of the distribution generating the data is often the assumption that that distribution is normal or nearly so. In that case, the sample mean is not only asymptotically efficient but uniformly MVU, minimax, admissible, etc. But suppose that the distribution is Cauchy (having roughly the same shape as the normal but with very thick tails); then the sample mean has the same distribution as any individual observation, its accuracy does not improve with n, and it is not even a consistent estimator. At least within the class of distributions which includes the Cauchy, the properties of the sample mean, and similarly of ordinary least squares, are quite sensitive to the true nature of the underlying distribution of the data. We say that such estimators are not robust. Complete discussions are contained in Huber (1981) and Hampel et al. (1985).
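A small simulation sketch of this lack of robustness, contrasting the sample mean with the sample median (one simple robust alternative; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Cauchy data centred at 0: the sample mean does not settle down as n grows,
# whereas the sample median converges to the centre of the distribution.
for n in (100, 10_000, 1_000_000):
    x = rng.standard_cauchy(size=n)
    print(f"n={n:>9}: mean={x.mean():10.3f}  median={np.median(x):8.4f}")
```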

To conclude, three estimation problems of special concern in economics are discussed: (1) classical linear regression; (2) non-linear regression; and (3) estimation of simultaneous structural equations.

The classical theory of linear regression deals with the following problem: Let X be an n × k matrix of nonstochastic observations (n on each of the variables x1,…, xk), β be a k × 1 vector of parameters (one of which becomes an intercept if x1 ≡ 1, say), and y be an n × 1 vector of stochastic variables such that y = Xβ + ε, ε ∼ N(0, Σ). The ordinary least-squares estimates (OLS), \( \widehat{\beta} \) = (X′X)−1X′y, are MLE and MVU when Σ = σ2I. When this is not true, although the OLS estimates are unbiased and consistent, they are not asymptotically efficient or minimum variance. The generalized least squares estimates (GLS), \( \widehat{\beta} \) = (X′Σ−1X)−1X′Σ−1y, are efficient, but of course Σ, and therefore Σ−1, is generally unknown. Often, however, a consistent estimate of Σ is available, leading to feasible, or estimated, GLS estimates.
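A compact sketch of the OLS and GLS formulas above, with a simulated design and a heteroscedastic Σ that is treated as known for illustration (in practice it would be estimated, giving feasible GLS):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # x1 = 1 gives an intercept
beta = np.array([1.0, 2.0, -0.5])

# Heteroscedastic errors: Sigma = diag(sigma_i^2), so OLS is unbiased but not efficient.
sig2 = np.exp(X[:, 1])                  # illustrative variance pattern
y = X @ beta + rng.normal(scale=np.sqrt(sig2))

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)            # (X'X)^{-1} X'y

Sigma_inv = np.diag(1 / sig2)                           # treated as known for this sketch
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

print("OLS:", beta_ols)
print("GLS:", beta_gls)
```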

Many problems in economics lead to non-linear relationships. Linear regression may be a good (local) approximation to such relationships if the data do not vary too widely. Moreover, many non-linear relationships may be transformed into linear ones (e.g., the Cobb-Douglas production function). Often, however, the data are sufficiently variable to make a linear relationship a poor approximation and no linearizing transformation exists. The general non-linear regression model is y = f(X, β, ε) or, more frequently, y = f(X, β) + ε. Least-squares or maximum-likelihood estimates may still be obtained, but the first-order conditions for a minimum or a maximum will generally be non-linear, frequently ruling out analytic expressions for the estimates. Consider the problem of minimizing the sum of squared residuals, (y − f(X, β))′(y − f(X, β)), with respect to β (non-linear least squares); numerical methods for solving this problem are of the general form: \( {\widehat{\beta}}_{i+1}={\widehat{\beta}}_i-{s}_i{P}_i{\nabla}_i, \) where \( {\widehat{\beta}}_i \) = the value of the estimated parameter vector at iteration i, si = the step size at iteration i, Pi = the direction matrix at iteration i, and ∇i = the gradient of the objective function at iteration i. The matrix Pi determines the direction in which the parameter vector is changed at each iteration; it is generally taken to be the inverse of the Hessian matrix evaluated at the current value of the parameter vector, or some approximation to it. Let g(β) be the objective function; then

$$ {P}_i={\left[{\partial}^2g\left(\beta \right)/\partial \beta \partial {\beta}^{\prime }|\beta ={\widehat{\beta}}_i\right]}^{-1} $$

is the inverse of the Hessian. A justification for this choice is obtained from the second-order (quadratic) approximation to the objective function in the neighbourhood of the current estimate. For a detailed treatment of this problem, as well as constrained non-linear estimation, see Quandt (1983). The statistical properties of non-linear estimators are discussed by Amemiya (1983).
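A bare-bones sketch of the iteration \( {\widehat{\beta}}_{i+1}={\widehat{\beta}}_i-{s}_i{P}_i{\nabla}_i \) for an illustrative exponential regression, using the Gauss–Newton approximation to the Hessian and a unit step size (the model, data and tolerance are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative non-linear model: y = b0 * exp(b1 * x) + error.
x = np.linspace(0.0, 2.0, 100)
b_true = np.array([2.0, 0.8])
y = b_true[0] * np.exp(b_true[1] * x) + rng.normal(scale=0.1, size=x.size)

b = np.array([1.0, 0.1])                               # starting values
for i in range(50):
    f = b[0] * np.exp(b[1] * x)
    r = y - f                                          # residuals
    J = np.column_stack([np.exp(b[1] * x),             # df/db0
                         b[0] * x * np.exp(b[1] * x)]) # df/db1
    grad = -2 * J.T @ r                                # gradient of g(b) = r'r
    P = np.linalg.inv(2 * J.T @ J)                     # Gauss-Newton approximation to the inverse Hessian
    step = P @ grad
    b = b - 1.0 * step                                 # s_i = 1
    if np.linalg.norm(step) < 1e-10:
        break

print("non-linear least-squares estimates:", b)
```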

Economic theory teaches us that the values of many economic variables are often determined simultaneously by the joint operation of several economic relationships, for example, supply and demand determine price and quantity. This leads to a representation in terms of a system of simultaneous structural equations (simultaneous equations model, or SEM). The problem of how to estimate the parameters of an SEM has occupied a central place in econometrics since Haavelmo (1944). A linear SEM is given by Byt + Γxt = ut, t = 1,…, T, where B is G × G, Γ is G × K, yt is G × 1, xt is K × 1, and ut is G × 1. ut is assumed to have zero mean with variance-covariance matrix Σ, often normally distributed, independently and identically for each t. Thus the ut are serially independent. It is also assumed that plim \( {\varSigma}_1^T{x}_{it}{u}_{jt}/T=0 \) for all i = 1,…, K and j = 1,…, G and that plim X′X/T is a positive definite matrix, where X = (x1,…, xT)′. If B is non-singular this system of structural equations, as they are called, may be solved for the so-called ‘endogenous’ variables, yt, in terms of the ‘exogenous’ variables xt: yt = Πxt + vt, where Π = −B−1Γ, vt = B−1ut, so that Evt = 0 and Evtv′t = B−1Σ(B−1)′ = Ω. It is, in general, not possible to determine B, Γ and Σ from knowledge of the reduced form (RF) parameters Π and Ω: there are, in principle, many structural systems compatible with the same RF. Given sufficient restrictions on the structural system, however, knowledge of the RF parameters can be used, together with the assumed restrictions, to determine the structural parameters. The SEM is then said to be identified.
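A short sketch of the mapping from structural to reduced-form parameters, Π = −B−1Γ and Ω = B−1Σ(B−1)′, for an invented two-equation system:

```python
import numpy as np

# Illustrative structural parameters for a two-equation (G = 2), two-exogenous-variable (K = 2) SEM.
B = np.array([[1.0, -0.5],
              [-0.3, 1.0]])
Gamma = np.array([[0.8, 0.0],
                  [0.0, 1.2]])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])

B_inv = np.linalg.inv(B)
Pi = -B_inv @ Gamma                  # reduced-form coefficients
Omega = B_inv @ Sigma @ B_inv.T      # reduced-form disturbance covariance

print("Pi =\n", Pi)
print("Omega =\n", Omega)
```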

For linear structural equations with normally distributed disturbances, the conditions for identification may be derived from the condition that for any system \( {B}^{\ast }{y}_t+{\varGamma}^{\ast }{x}_t={u}_t^{\ast } \) for which \( {u}_t^{\ast } \) and ut are identically distributed, where B* = FB, Γ* = FΓ and \( {u}_t^{\ast }=F{u}_t \), then F = fI is implied by the restrictions, where f is any positive scalar (Hsiao 1983).

Methods of estimating the parameters of SEMs may be put into two categories: (1) limited-information methods, which estimate parameters of a subset of the equations, usually a subset consisting of a single equation, taking into account only the identifying restrictions on the parameters of equations in that subset, and (2) full-information methods, which estimate all of the identifiable parameters in the system simultaneously and therefore take into account all identifying restrictions. Full- or limited-information methods may be based on either least-squares or maximum-likelihood principles. ML-based methods yield estimates which are invariant with respect to the normalization rule (choice of f).

For systems or single equations in SEMs for which there are restrictions just sufficient to identify the parameters of interest, estimates may be based on indirect least squares, that is, derived directly from the reduced form parameters estimated by applying OLS to each equation of the RF; such estimates are ML. If the restrictions are just sufficient to identify the parameters of each equation, the resulting estimates are full-information maximum-likelihood (FIML) estimates. When an equation is over-identified, in the sense that there are more than enough restrictions to identify it, two-stage least squares (2SLS) or limited-information maximum likelihood (LIML) may be applied equation by equation to each equation which is identified. Provided the model is correctly specified, such estimates are consistent and asymptotically unbiased but not asymptotically efficient, because some restrictions are neglected in the estimation of some parameters. An analog of 2SLS, three-stage least squares (3SLS), yields estimates which are asymptotically equivalent to FIML and therefore efficient.
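A minimal sketch of 2SLS applied to a single over-identified equation from a simulated system (the instrument set, coefficients and error correlation are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 1000

# Three exogenous variables; x3 is excluded from the equation of interest,
# so that equation is over-identified (3 instruments, 2 right-hand-side variables).
x = rng.normal(size=(T, 3))
e = rng.normal(size=(T, 2))
u1 = e[:, 0]                      # disturbance of the equation of interest
u2 = 0.4 * e[:, 0] + e[:, 1]      # correlated with u1, making y2 endogenous

y2 = 0.8 * x[:, 0] + 1.2 * x[:, 1] - 0.6 * x[:, 2] + u2
y1 = 0.5 * y2 + 1.0 * x[:, 0] + u1          # equation of interest: y1 on y2 and x1

Z = np.column_stack([y2, x[:, 0]])          # included regressors (one endogenous, one exogenous)
W = x                                       # all exogenous variables serve as instruments

# 2SLS: regress the included regressors on all instruments, then apply OLS to the fitted values.
Z_hat = W @ np.linalg.solve(W.T @ W, W.T @ Z)
beta_2sls = np.linalg.solve(Z_hat.T @ Z, Z_hat.T @ y1)

print("2SLS estimates (true values 0.5 and 1.0):", beta_2sls)
```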

Amemiya (1983) extends all of these methods to non-linear systems. Sargan (1980) discusses identification in non-linear systems.

See Also