Panel or longitudinal data are becoming increasingly popular in applied work as they offer a number of advantages over pure cross-sectional or pure time-series data. They allow researchers to model unobserved heterogeneity at the level of the observational unit, where the latter may be an individual, a household, a firm or a country. This article describes several estimation methods that are available for nonlinear panel data models, that is, models which are nonlinear in the parameters of interest and which include models that arise frequently in applied work, such as discrete choice models and limited dependent variable models, among others.

Introduction

Panel or longitudinal data are becoming increasingly popular in applied work, as they offer a number of advantages over pure cross-sectional or pure time-series data. A particularly useful feature is that they allow researchers to model unobserved heterogeneity at the level of the observational unit, where the latter may be an individual, a household, a firm or a country. Standard practice in the econometric literature is to model this heterogeneity as an individual-specific effect that enters additively in the model, typically assumed to be linear, which captures the statistical relationship between the dependent and the independent variables. The presence of these individual effects may cause problems in estimation. In particular, in short panels, that is, panels where the time-series dimension is of smaller order than the cross-sectional dimension, their estimation in conjunction with the other parameters of interest usually yields inconsistent estimators for both. (Notable exceptions are the static linear and the Poisson count panel data models, where estimation of the individual effects along with the finite-dimensional coefficient vector yields consistent estimators of the latter.) This is the well-known incidental parameters problem (Neyman and Scott 1948). In linear regression models, this problem may be dealt with by taking transformations of the model, such as first differences or differences from time averages (the ‘within transformation’), which remove the individual effect from the equation under consideration. Such transformations, however, are not available for nonlinear econometric models, that is, models which are nonlinear in the parameters of interest and which include models that arise frequently in applied work, such as discrete choice models, limited dependent variable models, and duration models, among others.

This article describes several estimation methods that are available for nonlinear panel data models. An approach that is available for estimating certain linear and nonlinear parametric models with individual effects is the conditional maximum likelihood approach. This is described in section “The Conditional Maximum Likelihood (CML) Approach”. Section “The Fixed Effects Approach” describes estimation techniques that have been recently developed for several semiparametric nonlinear panel data models. A common feature in the methods discussed in that section is that we do not make any assumptions about the nature of these individual effects, that is, whether they are fixed constants or random variables. Thus, we do not make any assumptions about whether they are related to the conditioning variables and, if so, in what manner. This approach is typically referred to as the fixed effects approach. Section “The Random Effects Approach” describes the so-called random effects approach in estimating nonlinear panel data models. In contrast to the fixed effects approach, the random effects approach does make assumptions about the individual effects.

The discussion distinguishes between two types of models, static and dynamic. In static models, the conditioning set includes past, present and future values of the variables. In this case the conditioning variables are said to be strictly exogenous. In dynamic models, the conditioning set may also include lags of the dependent variable and other endogenous variables, that is, variables that are only weakly exogenous or predetermined.

Our discussion is limited in several aspects. First, we focus only on the case when the time series dimension of the panel (T) is short so that it makes sense to consider the asymptotic properties of the estimators when the cross-sectional dimension (N) is large while T remains fixed. Second, we do not consider estimation of random coefficient models, that is, models where all the parameters are varying at the individual level. Finally, we do not discuss the Bayesian approach to estimating panel data models.

The Conditional Maximum Likelihood (CML) Approach

Suppose that a random variable yit has density f (·,θ,αi), where θ is the parameter of interest which is common across all units i, whereas αi is a nuisance parameter which is allowed to differ across i. A sufficient statistic Si for αi is a function of the data such that the conditional distribution of the data given Si does not depend on αi. However, the conditional distribution may depend on θ. In this case, one can estimate θ by maximizing the conditional likelihood function, which conditions on the sufficient statistic(s). Such sufficient statistics are readily available for the exponential family that includes the normal, Poisson, gamma, logistic, and binomial distributions. The CML approach, when a sufficient statistic exists, yields consistent and asymptotically normal estimators for parametric panel data models with individual effects (Andersen 1970). We will next demonstrate how the CML approach works in the case of a static and a dynamic logit model with individual effects.

The Static Panel Data Logit Model

Consider the binary choice logit model with individual effects

$$ {y}_{it}=1\left\{{x}_{it}{\beta}_0+{\alpha}_i+{\varepsilon}_{it}\ge 0\right\}\;i=1,\dots, N;t=1,\dots, T $$

where 1{A} = 1 if A occurs and is 0 otherwise. Let xi ≡ (xi1,…, xiT). Here the error term εit is distributed i.i.d. over t with a logistic distribution conditional on (xi, αi). Note that this assumption implies that εit is in fact independent of αi and xit for all t. We can easily calculate that

$$ \Pr \left({y}_{it}=1|{x}_i,{\alpha}_i\right)=\frac{\exp \left({x}_{it}{\beta}_0+{\alpha}_i\right)}{1+\exp \left({x}_{it}{\beta}_0+{\alpha}_i\right)}. $$

In this model it turns out that ∑tyit is a sufficient statistic for αi. Indeed, let T = 2. Note that

$$ \Pr \left({y}_{it}=1|{y}_{i1}+{y}_{i2}=0,{x}_i,{\alpha}_i\right)=0,\qquad \Pr \left({y}_{it}=1|{y}_{i1}+{y}_{i2}=2,{x}_i,{\alpha}_i\right)=1 $$

that is, individuals who do not switch states (i.e. who are 0 or 1 in both periods) do not offer any information about β0. But it can be easily shown that

$$ \Pr \left({y}_{i1}=1|{y}_{i1}+{y}_{i2}=1,{x}_i,{\alpha}_i\right)=\frac{1}{1+\exp \left(\left({x}_{i2}-{x}_{i1}\right){\beta}_0\right)} $$

and

$$ \Pr \left({y}_{i1}=0|{y}_{i1}+{y}_{i2}=1,{x}_i,{\alpha}_i\right)=\frac{\exp \left(\left({x}_{i2}-{x}_{i1}\right){\beta}_0\right)}{1+\exp \left(\left({x}_{i2}-{x}_{i1}\right){\beta}_0\right)}. $$

In other words, conditional on the individual switching states (from 0 to 1 or from 1 to 0), the probability that yit is 1 or 0 depends on β0 (that is, contains information about β0) but is independent of αi.

The conditional log-likelihood is

$$ \mathscr{L}_C\left(\beta \right)=\sum \limits_{i=1}^{N}1\left\{{y}_{i1}+{y}_{i2}=1\right\}\times \ln \left(\frac{\exp {\left(\left({x}_{i2}-{x}_{i1}\right)\beta \right)}^{\left(1-{y}_{i1}\right)}}{1+\exp \left(\left({x}_{i2}-{x}_{i1}\right)\beta \right)}\right) $$

and may be maximized over β to produce a consistent and root-N asymptotically normal estimator of β0. Note that the approach uses a subset of the data, since only individuals who switch states enter the likelihood. For the expression of the conditional log-likelihood in the general T case, see Chamberlain (1984).
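To make the mechanics concrete, the following sketch evaluates and maximizes the T = 2 conditional log-likelihood above numerically. It is a minimal illustration, not code from the literature: the array layout (y of shape (N, 2), x of shape (N, 2, K)) and the use of scipy's BFGS optimizer are assumptions of the example.

```python
import numpy as np
from scipy.optimize import minimize

def neg_cond_loglik(beta, y, x):
    """Negative conditional log-likelihood using only 'switchers' (y_i1 + y_i2 = 1)."""
    switch = (y[:, 0] + y[:, 1] == 1)
    dx = x[switch, 1, :] - x[switch, 0, :]      # x_i2 - x_i1
    y1 = y[switch, 0]                           # y_i1
    idx = dx @ beta
    # log Pr(y_i1 | y_i1 + y_i2 = 1, x_i) = (1 - y_i1)(x_i2 - x_i1)β - log(1 + exp((x_i2 - x_i1)β))
    return -np.sum((1 - y1) * idx - np.log1p(np.exp(idx)))

def fit_conditional_logit(y, x):
    k = x.shape[2]
    return minimize(neg_cond_loglik, np.zeros(k), args=(y, x), method="BFGS").x
```

Individuals with yi1 = yi2 drop out automatically, mirroring the observation that non-switchers carry no information about β0.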

The Dynamic Panel Data Logit Model

Chamberlain (1985) noticed that the conditional maximum likelihood approach also applies to the ‘AR(1)’ logit model with individual effects:

$$ {y}_{it}=1\left\{{\upgamma}_0{y}_{it-1}+{\alpha}_i+{\varepsilon}_{it}\ge 0\right\}\;i=1,\dots, N;t=1,\dots, T $$

where the error term εit is distributed i.i.d. with a logistic distribution conditional on αi and the initial observation of the sample yi0. Note that we are not making any assumption about the distribution of the initial yi0. As we will see, the approach requires at least four observations for each individual (including the initial observation). In fact, let that be the case and consider the events:

$$ A=\left\{{y}_{i0}={d}_0,{y}_{i1}=0,{y}_{i2}=1,{y}_{i3}={d}_3\right\},\qquad B=\left\{{y}_{i0}={d}_0,{y}_{i1}=1,{y}_{i2}=0,{y}_{i3}={d}_3\right\} $$

where d0 and d3 are either 0 or 1. It is rather easy to derive the following probabilities which condition on the individual switching states in the two middle periods

$$ \Pr \left(A|A\cup B,{\alpha}_i\right)=\frac{1}{1+\exp \left({\upgamma}_0\left({d}_0-{d}_3\right)\right)},\qquad \Pr \left(B|A\cup B,{\alpha}_i\right)=\frac{\exp \left({\upgamma}_0\left({d}_0-{d}_3\right)\right)}{1+\exp \left({\upgamma}_0\left({d}_0-{d}_3\right)\right)}. $$

Note that these depend on γ0 but are independent of αi. The conditional log-likelihood of the model for four periods is:

$$ \mathscr{L}_C\left(\upgamma \right)=\sum \limits_{i}1\left\{{y}_{i1}+{y}_{i2}=1\right\}\times \ln \left(\frac{\exp {\left(\upgamma \left({y}_{i0}-{y}_{i3}\right)\right)}^{y_{i1}}}{1+\exp \left(\upgamma \left({y}_{i0}-{y}_{i3}\right)\right)}\right) $$

and maximizing it with respect to γ produces a consistent and root-N asymptotically normal estimator. The approach generalizes to logit models with more than one lag of yit (see Magnac 2000).

It is important to note that the CML approach described above does not work in the logit model

$$ {y}_{it}=1\left\{{\upgamma}_0{y}_{it-1}+{x}_{it}{\beta}_0+{\alpha}_i+{\varepsilon}_{it}\ge 0\right\}\;i=1,\dots, N;t=1,\dots, T $$

that is, when the conditioning set also includes exogenous variables. Honoré and Kyriazidou (2000a) show that β0 and γ0 in the model above are in fact identified both for the case when the errors εit are logistic and when they are only assumed to have the same distribution over time conditional on (xi, yi0) (see below). In the logistic case identification is based on the fact that the following probabilities

$$ \Pr \left(A|A\cup B,{x}_{i2}={x}_{i3},{x}_i,{\alpha}_i\right)=\frac{1}{1+\exp \left(\left({x}_{i1}-{x}_{i2}\right){\beta}_0+{\upgamma}_0\left({d}_0-{d}_3\right)\right)} $$
$$ \Pr \left(B|A\cup B,{x}_{i2}={x}_{i3},{x}_i,{\alpha}_i\right)=\frac{\exp \left(\left({x}_{i1}-{x}_{i2}\right){\beta}_0+{\upgamma}_0\left({d}_0-{d}_3\right)\right)}{1+\exp \left(\left({x}_{i1}-{x}_{i2}\right){\beta}_0+{\upgamma}_0\left({d}_0-{d}_3\right)\right)} $$

are independent of αi. Note that the probabilities above condition not only on the individual switching states in the middle two periods so that yi1 + yi2 = 1 but also on the event that xi2 = xi3. Honoré and Kyriazidou (2000a) propose estimating β0 and γ0 by maximizing

$$ \sum \limits_i1\left\{{x}_{i2}-{x}_{i3}=0\right\}1\left\{{y}_{i1}+{y}_{i2}=1\right\}\times \ln \left(\frac{\exp {\left(\left({x}_{i1}-{x}_{i2}\right)\beta +\upgamma \left({y}_{i0}-{y}_{i3}\right)\right)}^{y_{i1}}}{1+\exp \left(\left({x}_{i1}-{x}_{i2}\right)\beta +\upgamma \left({y}_{i0}-{y}_{i3}\right)\right)}\right) $$

when Pr(xi2 = xi3) > 0. When xi2xi3 is continuously distributed with support around 0, β0 and γ0 can be obtained by maximizing

$$ \sum \limits_iK\left(\frac{x_{i2}-{x}_{i3}}{h_N}\right)1\left\{{y}_{i1}+{y}_{i2}=1\right\}\times \ln \left(\frac{\exp {\left(\left({x}_{i1}-{x}_{i2}\right)\beta +\upgamma \left({y}_{i0}-{y}_{i3}\right)\right)}^{y_{i1}}}{1+\exp \left(\left({x}_{i1}-{x}_{i2}\right)\beta +\upgamma \left({y}_{i0}-{y}_{i3}\right)\right)}\right) $$

where K () is a kernel density function and hN is a bandwidth sequence, chosen so as to satisfy certain assumptions that guarantee consistency and asymptotic normality of the proposed estimators.
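The kernel-weighted objective can be coded in the same way. The sketch below is only illustrative of the estimator's structure: it assumes four observations per individual (t = 0, 1, 2, 3), stacks β and γ in a single parameter vector, and uses a Gaussian product kernel; the bandwidth h and the array layout are choices of the example, not of Honoré and Kyriazidou (2000a).

```python
import numpy as np
from scipy.optimize import minimize

def neg_hk_objective(theta, y, x, h):
    """Kernel-weighted conditional logit objective; y is (N, 4), x is (N, 4, K)."""
    k = x.shape[2]
    beta, gamma = theta[:k], theta[k]
    switch = (y[:, 1] + y[:, 2] == 1).astype(float)
    dx23 = x[:, 2, :] - x[:, 3, :]                       # x_i2 - x_i3
    # Gaussian product kernel; multiplicative constants do not affect the maximizer
    w = np.prod(np.exp(-0.5 * (dx23 / h) ** 2), axis=1)
    idx = (x[:, 1, :] - x[:, 2, :]) @ beta + gamma * (y[:, 0] - y[:, 3])
    loglik = y[:, 1] * idx - np.log1p(np.exp(idx))       # y_i1 exponent as in the formula above
    return -np.sum(w * switch * loglik)

def fit_hk(y, x, h):
    k = x.shape[2]
    return minimize(neg_hk_objective, np.zeros(k + 1), args=(y, x, h), method="BFGS").x
```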

The Fixed Effects Approach

The conditional maximum likelihood approach is not always available. For example, there are no sufficient statistics for the binary choice model with individual effects when the errors are normally distributed. Furthermore, like all ML approaches, the approach suffers from the fact that the distribution of the unobserved idiosyncratic errors needs to be parametrically specified. There do exist, however, methods for some semiparametric nonlinear panel data models with individual effects where the distribution of the underlying idiosyncratic errors is left unspecified. These include the binary choice model, the censored and truncated regression models, and the sample selection model.

The Semiparametric Panel Data Binary Choice Model

Manski (1987) considers the model

$$ {y}_{it}=1\left\{{x}_{it}{\beta}_0+{\alpha}_i-{\varepsilon}_{it}\ge 0\right\}\;i=1,\dots, N;t=1,\dots, T $$

where εit is identically distributed over time conditional on (xi, αi), with distribution function F that is continuous and strictly increasing on \( \mathcal{R} \). Note that, in contrast to the models considered above, F here is not assumed to have a specific functional form, hence the characterization of the model as semiparametric.

He observes that for T = 2 the time invariance of F implies that

$$ {\displaystyle \begin{array}{l}\Pr \left({y}_{i1}=1|{x}_i\right)\hfill \\ {}\lesseqgtr \Pr \left({y}_{i2}=1|{x}_i\right)\;\mathrm{if}\;\mathrm{and}\;\mathrm{only}\kern0.17em \mathrm{if}\;{x}_{i1}{\beta}_0\lesseqgtr {x}_{i2}{\beta}_0\hfill \end{array}} $$

or equivalently that

$$ \mathit{\operatorname{sgn}}\left(\Pr \left({y}_{i2}=1|{x}_i,{\alpha}_i\right)-\Pr \left({y}_{i1}=1|{x}_i,{\alpha}_i\right)\right)=\mathit{\operatorname{sgn}}\left(\left({x}_{i2}-{x}_{i1}\right){\beta}_0\right). $$

In fact it can be shown that, under appropriate regularity conditions on the joint distribution of Δxi ≡(xi2−xi1), β0 uniquely (up to scale) maximizes the so-called population ‘score function’

$$ E\left[\Delta {y}_i\cdot \mathit{\operatorname{sgn}}\left(\Delta {x}_i\beta \right)\right] $$

where sgn(x) equals 1 if x > 0, equals − 1 if x < 0 and is equal to 0 if x = 0. This suggests estimating β0 by the so-called conditional maximum score estimator which maximizes the sample analog of the population score function

$$ \widehat{\beta}=\arg \max \limits_{\beta } \sum \limits_i\Delta {y}_i\cdot \mathit{\operatorname{sgn}}\left(\Delta {x}_i\beta \right). $$

Note that only observations for which yi1 ≠ yi2 are used here, similarly to the conditional logit. The estimator is consistent under some additional assumptions, but it is not asymptotically normal and its rate of convergence is not root-N.
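Because the score function is a step function in β, gradient-based optimizers are not applicable. The sketch below, assuming the same (N, 2, K) data layout as before, simply searches over random directions on the unit sphere, which imposes the scale normalization ‖β‖ = 1; it is a crude illustrative device rather than the search method used in practice.

```python
import numpy as np

def conditional_max_score(y, x, n_draws=100_000, seed=0):
    """Maximize sum_i Δy_i * sgn(Δx_i b) over directions b with ||b|| = 1."""
    dy = y[:, 1] - y[:, 0]                     # Δy_i
    dx = x[:, 1, :] - x[:, 0, :]               # Δx_i
    rng = np.random.default_rng(seed)
    best_b, best_score = None, -np.inf
    for _ in range(n_draws):
        b = rng.standard_normal(dx.shape[1])
        b /= np.linalg.norm(b)                 # scale normalization
        score = np.sum(dy * np.sign(dx @ b))
        if score > best_score:
            best_score, best_b = score, b
    return best_b
```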

Honoré and Kyriazidou (2000a) show that it is possible to extend the conditional maximum score approach to the dynamic binary choice model:

$$ \Pr \left({y}_{i0}=1|{x}_i,{\alpha}_i\right)={p}_0\left({x}_i,{\alpha}_i\right) $$
$$ \Pr \left({y}_{it}=1|{x}_i,{\alpha}_i,{y}_{i0},\dots, {y}_{it-1}\right)=F\left({x}_{it}{\beta}_0+{\upgamma}_0{y}_{it-1}+{\alpha}_i\right),\quad t=1,\dots, T $$

where yi0 is assumed to be observed and F is strictly increasing.

We will next demonstrate their identification scheme. Assume T = 3 and define the events A and B as above. Then

$$ {\displaystyle \begin{array}{ll}\Pr \left(A|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)=& {p}_0{\left({x}_i,{\alpha}_i\right)}^{d_0}{\left(1-{p}_0\left({x}_i,{\alpha}_i\right)\right)}^{1-{d}_0}\hfill \\ {}& \times \left(1-F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)\right)\times F\left({x}_{i2}{\beta}_0+{\alpha}_i\right)\hfill \\ {}& \times {\left(1-F\left({x}_{i2}{\beta}_0+{\upgamma}_0+{\alpha}_i\right)\right)}^{1-{d}_3}\times F{\left({x}_{i2}{\beta}_0+{\upgamma}_0+{\alpha}_i\right)}^{d_3}\hfill \end{array}} $$
$$ {\displaystyle \begin{array}{ll}\Pr \left(B|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)=& {p}_0{\left({x}_i,{\alpha}_i\right)}^{d_0}{\left(1-{p}_0\left({x}_i,{\alpha}_i\right)\right)}^{1-{d}_0}\hfill \\ {}& \times F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)\times \left(1-F\left({x}_{i2}{\beta}_0+{\upgamma}_0+{\alpha}_i\right)\right)\hfill \\ {}& \times {\left(1-F\left({x}_{i2}{\beta}_0+{\alpha}_i\right)\right)}^{1-{d}_3}\times F{\left({x}_{i2}{\beta}_0+{\alpha}_i\right)}^{d_3}.\hfill \end{array}} $$

If d3 = 0, then,

$$ {\displaystyle \begin{array}{ll}\frac{\Pr \left(A|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)}{\Pr \left(B|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)}& =\frac{1-F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}{1-F\left({x}_{i2}{\beta}_0+{\alpha}_i\right)}\times \frac{F\left({x}_{i2}{\beta}_0+{\alpha}_i\right)}{F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}\hfill \\ {}& =\frac{1-F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}{1-F\left({x}_{i2}{\beta}_0+{\upgamma}_0{d}_3+{\alpha}_i\right)}\times \frac{F\left({x}_{i2}{\beta}_0+{\upgamma}_0{d}_3+{\alpha}_i\right)}{F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}\hfill \end{array}} $$

while if d3 = 1, then,

$$ {\displaystyle \begin{array}{ll}\frac{\Pr \left(A|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)}{\Pr \left(B|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)}& =\frac{1-F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}{1-F\left({x}_{i2}{\beta}_0+{\upgamma}_0+{\alpha}_i\right)}\times \frac{F\left({x}_{i2}{\beta}_0+{\upgamma}_0+{\alpha}_i\right)}{F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}\hfill \\ {}& =\frac{1-F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}{1-F\left({x}_{i2}{\beta}_0+{\upgamma}_0{d}_3+{\alpha}_i\right)}\times \frac{F\left({x}_{i2}{\beta}_0+{\upgamma}_0{d}_3+{\alpha}_i\right)}{F\left({x}_{i1}{\beta}_0+{\upgamma}_0{d}_0+{\alpha}_i\right)}.\hfill \end{array}} $$

Monotonicity of F implies that

$$ \mathit{\operatorname{sgn}}\left(\Pr \left(A|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)-\Pr \left(B|{x}_i,{\alpha}_i,{x}_{i2}={x}_{i3}\right)\right)=\mathit{\operatorname{sgn}}\left(\left({x}_{i2}-{x}_{i1}\right){\beta}_0+{\upgamma}_0\left({d}_3-{d}_0\right)\right). $$

This last equation suggests that β0 and γ0 can be estimated by conditional maximum score using only the observations satisfying yi1 + yi2 = 1 and xi2 = xi3, that is, by maximizing

$$ \sum \limits_i1\left\{{x}_{i2}-{x}_{i3}=0\right\}\left({y}_{i2}-{y}_{i1}\right)\,\mathit{\operatorname{sgn}}\left(\left({x}_{i2}-{x}_{i1}\right)\beta +\upgamma \left({y}_{i3}-{y}_{i0}\right)\right). $$

As in the logit case, when xi2 − xi3 is continuously distributed with support around 0, β0 and γ0 can be estimated by maximizing

$$ \sum \limits_iK\left(\frac{x_{i2}-{x}_{i3}}{h_N}\right)\left({y}_{i2}-{y}_{i1}\right)\,\mathit{\operatorname{sgn}}\left(\left({x}_{i2}-{x}_{i1}\right)\beta +\upgamma \left({y}_{i3}-{y}_{i0}\right)\right). $$
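A sketch of this kernel-weighted score, combining the two previous sketches, is given below; as before, the data layout (y of shape (N, 4), x of shape (N, 4, K)), the Gaussian product kernel, and the bandwidth h are assumptions of the example.

```python
import numpy as np

def dynamic_max_score(theta, y, x, h):
    """Kernel-weighted maximum score objective for the dynamic binary choice model."""
    k = x.shape[2]
    beta, gamma = theta[:k], theta[k]
    dx23 = x[:, 2, :] - x[:, 3, :]                         # x_i2 - x_i3
    w = np.prod(np.exp(-0.5 * (dx23 / h) ** 2), axis=1)    # kernel weight K((x_i2 - x_i3)/h)
    idx = (x[:, 2, :] - x[:, 1, :]) @ beta + gamma * (y[:, 3] - y[:, 0])
    return np.sum(w * (y[:, 2] - y[:, 1]) * np.sign(idx))  # to be maximized over (β, γ) up to scale
```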

The Semiparametric Panel Data Censored Regression Model

The standard censored panel data (or Type 1 Tobit) model with individual effects is given by

$$ {y}_{it}=\max \left\{{x}_{it}{\beta}_0+{\alpha}_i+{\varepsilon}_{it},0\right\}\;i=1,\dots, N;t=1,\dots, T. $$

Estimation of this model was first considered by Honoré (1992) and later by Honoré and Kyriazidou (2000b), who extend the results of the former paper. We will present here Honoré (1992), who assumes that (εit, εis) are pairwise exchangeable conditional on (xi, αi). This implies that εit and εis are identically distributed conditional on (xi, αi), although it does not require (conditional) independence over time. (Fristedt and Gray (1997) give the following definition of exchangeability: let \( \mathscr{I} \) be a countable set. A sequence \(( X_i : i \in \mathscr{I}) \), finite or infinite, of random variables on a probability space (Ω, F, P) is exchangeable if, for every permutation ρ of \( \mathscr{I} \), the distributions of \(( X_{\rho(i)} : i \in \mathscr{I}) \) and \(( X_{i} : i \in \mathscr{I}) \) are identical. Note that a finite or infinite i.i.d. sequence is exchangeable and that exchangeability allows for certain types of serial correlation. Furthermore, exchangeability implies strict stationarity, although the converse is not true.)

Consider the ‘pseudo-error’:

$$ {e}_{is t}\left(\beta \right)=\max \left\{{y}_{is},\left({x}_{is}-{x}_{it}\right)\beta \right\}-{x}_{is}\beta . $$

With this definition, at the true β0

$$ {\displaystyle \begin{array}{ll}{e}_{ist}\left({\beta}_0\right)& =\max \left\{{y}_{is},\left({x}_{is}-{x}_{it}\right){\beta}_0\right\}-{x}_{is}{\beta}_0\hfill \\ {}& =\max \left\{\max \left\{{x}_{is}{\beta}_0+{\alpha}_i+{\varepsilon}_{is},0\right\},\left({x}_{is}-{x}_{it}\right){\beta}_0\right\}-{x}_{is}{\beta}_0\hfill \\ {}& =\max \left\{\max \left\{{\alpha}_i+{\varepsilon}_{is},-{x}_{is}{\beta}_0\right\},-{x}_{it}{\beta}_0\right\}\hfill \\ {}& =\max \left\{{\alpha}_i+{\varepsilon}_{is},-{x}_{is}{\beta}_0,-{x}_{it}{\beta}_0\right\}.\hfill \end{array}} $$

The conditional exchangeability assumption implies that (eist(β0), eits(β0)) is distributed like (eits(β0), eist(β0)) conditional on (xit, xis, αi), and hence the difference eist(β0) − eits(β0) is distributed symmetrically around 0 conditional on (xit, xis, αi). Since this is true for any αi, the symmetry also holds conditional on (xit, xis) alone. Therefore, for any odd function ξ (that is, a function ξ that satisfies ξ(−d) = −ξ(d)) we have

$$ E\left[\xi \left({e}_{ist}\left({\beta}_0\right)-{e}_{its}\left({\beta}_0\right)\right)|{x}_{it},{x}_{is}\right]=0 $$

which also implies the following moment restriction:

$$ E\left[\xi \left({e}_{ist}\left({\beta}_0\right)-{e}_{its}\left({\beta}_0\right)\right){\left({x}_{is}-{x}_{it}\right)}^{\prime }|{x}_{it},{x}_{is}\right]=0. $$

The moment condition above may be thought of as the first-order condition of the following population minimization problem

$$ \min \limits_{\beta }E\left[q\left({y}_{is},{y}_{it},\left({x}_{is}-{x}_{it}\right)\beta \right)|{x}_{it},{x}_{is}\right] $$

where

$$ q\left({y}_i,{y}_j,\delta \right)=\left\{\begin{array}{lll}\Xi \left({y}_i\right)-\left({y}_j+\delta \right)\xi \left({y}_i\right)& \mathrm{if}& \delta \le -{y}_j\\ {}\Xi \left({y}_i-{y}_j-\delta \right)& \mathrm{if}& -{y}_j<\delta <{y}_i\\ {}\Xi \left(-{y}_j\right)-\left(\delta -{y}_i\right)\xi \left(-{y}_j\right)& \mathrm{if}& {y}_i\le \delta \end{array}\right. $$

and Ξ(d): \( \mathcal{R}\to {\mathcal{R}}^{+} \) is an even function (that is, Ξ(−d) = Ξ(d)) which is convex, strictly increasing for d > 0, and satisfies Ξ(0) = 0 and Ξ′(d) = ξ(d) with ξ(0) = 0. Note that for Ξ to be convex, ξ has to be monotone. Obvious choices for Ξ are Ξ(d) = d² (which corresponds to ξ(d) = 2d) and Ξ(d) = |d| (which corresponds to ξ(d) = sgn(d)).

The fact that the true β0 solves the population minimization problem above suggests the following estimator for β0:

$$ \widehat{\beta}=\arg \min \limits_{\beta } \sum \limits_i\sum \limits_{s<t}q\left({y}_{is},{y}_{it},\left({x}_{is}-{x}_{it}\right)\beta \right). $$

Honoré (1992) shows that the estimators corresponding to Ξ (d) = d2 and Ξ (d) = |d| are root-N consistent and asymptotically normal.
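As an illustration, the sketch below implements the quadratic case Ξ(d) = d², ξ(d) = 2d for T = 2, using the three-branch function q defined above; the (N, 2, K) data layout and the optimizer are, again, assumptions of the example.

```python
import numpy as np
from scipy.optimize import minimize

def q_quadratic(y1, y2, delta):
    """q(y_i, y_j, δ) with Ξ(d) = d², ξ(d) = 2d, evaluated elementwise."""
    low = delta <= -y2                       # branch δ ≤ -y_j
    high = delta >= y1                       # branch y_i ≤ δ
    mid = ~(low | high)
    out = np.empty_like(delta, dtype=float)
    out[low] = y1[low] ** 2 - (y2[low] + delta[low]) * 2.0 * y1[low]
    out[mid] = (y1[mid] - y2[mid] - delta[mid]) ** 2
    out[high] = y2[high] ** 2 + (delta[high] - y1[high]) * 2.0 * y2[high]
    return out

def fit_honore(y, x):
    """Minimize sum_i q(y_i1, y_i2, (x_i1 - x_i2)β) over β; y is (N, 2), x is (N, 2, K)."""
    def objective(beta):
        delta = (x[:, 0, :] - x[:, 1, :]) @ beta
        return np.sum(q_quadratic(y[:, 0], y[:, 1], delta))
    return minimize(objective, np.zeros(x.shape[2]), method="BFGS").x
```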

Honoré (1993) considers a dynamic version of the model where the lag of the observed (censored) dependent variable appears in the model instead of the latent one. Hu (2002) considers the case where one lag of the latent (unobserved) dependent variable is included along with the set of exogenous variables xit.

The Semiparametric Panel Data Sample Selection Model

The standard panel data sample selection (or Type 2 Tobit) model is defined as:

$$ {y}_{it}^{\ast }={x}_{it}^{\ast }{\beta}_0+{\alpha}_i^{\ast }+{\varepsilon}_{it}^{\ast },\qquad {y}_{it}={d}_{it}\cdot {y}_{it}^{\ast },\qquad {d}_{it}=1\left\{{z}_{it}{\upgamma}_0+{\eta}_i-{u}_{it}\ge 0\right\} $$

where i = 1,2,…,N; t = 1,…,T. Kyriazidou (1997) considers estimation without any parametric assumptions on the form of the joint distribution of \( \left({\varepsilon}_{it}^{\ast },{u}_{it}\right) \) or on the individual effects (αi,ηi).

Consider the case where T = 2 and focus on those individuals for whom di1 = di2 = 1. Let \( {\xi}_i=\left({z}_{i1},{z}_{i2},{x}_{i1}^{\ast },{x}_{i2}^{\ast },{\alpha}_i,{\eta}_i\right) \) denote all the information about individual i. Note that

$$ E\left({y}_{i1}-{y}_{i2}|{d}_{i1}={d}_{i2}=1,{\xi}_i\right)=\left({x}_{i1}^{\ast }-{x}_{i2}^{\ast}\right){\beta}_0+E\left({\varepsilon}_{i1}^{\ast }-{\varepsilon}_{i2}^{\ast }|{d}_{i1}={d}_{i2}=1,{\xi}_i\right) $$

and hence OLS estimation of the first differenced model will not yield consistent estimation of β0 since in general the so-called ‘sample selection bias term’

$$ {\lambda}_{it}\equiv E\left({\varepsilon}_{it}^{\ast }|{d}_{i1}={d}_{i2}=1,{\xi}_i\right)=E\left({\varepsilon}_{it}^{\ast }|{u}_{i1}\le {z}_{i1}{\upgamma}_0+{\eta}_i,{u}_{i2}\le {\mathrm{z}}_{i2}{\upgamma}_0+{\eta}_i,{\xi}_i\right) $$

is not zero. Nor do we have in general that λi1 = λi2, so that first differencing removes the sample selection bias along with the individual effects. Kyriazidou (1997) makes a conditional exchangeability assumption that \( \left({\varepsilon}_{i1}^{\ast },{\varepsilon}_{i2}^{\ast },{u}_{i1},{u}_{i2}\right)\;\mathrm{and}\;\left({\varepsilon}_{i2}^{\ast },{\varepsilon}_{i1}^{\ast },{u}_{i2},{u}_{i1}\right) \) are identically distributed conditional on ξi. Under this assumption, it is easy to see that if zi1γ0 = zi2γ0 then

$$ {\displaystyle \begin{array}{ll}\hfill & {\lambda}_{i1}\\ {}& =E\left({\varepsilon}_{i1}^{\ast }|{u}_{i1}\le {z}_{i1}{\upgamma}_0+{\eta}_i,{u}_{i2}\le {z}_{i2}{\upgamma}_0+{\eta}_i,{\xi}_i\right)\hfill \\ {}& =E\left({\varepsilon}_{i2}^{\ast }|{u}_{i1}\le {z}_{i1}{\upgamma}_0+{\eta}_i,{u}_{i2}\le {z}_{i2}{\upgamma}_0+{\eta}_i,{\xi}_i\right)\hfill \\ {}& ={\lambda}_{i2}\hfill \end{array}} $$

so that first differencing will eliminate both αi and λit simultaneously. So β0 can be estimated by first-difference OLS for the subsample of individuals that are observed in both periods (that is, that have di1 = di2 = 1) and also have the selection index, zitγ0, constant over time (that is, zi1γ0 = zi2γ0). Of course, this estimation scheme cannot be directly implemented, since γ0 is unknown. And it is quite possible that no observation has zi1γ0 = zi2γ0 if ziγ0 is continuously distributed. If, however, λit is a sufficiently smooth function and \( \widehat{\upgamma} \) is a consistent estimator of γ0, then zi1γ0 ≈ zi2γ0 implies λi1 ≈ λi2, and the preceding argument holds approximately. Kyriazidou proposes a two-step estimation procedure, in the spirit of Powell (2001) and Ahn and Powell (1993), who consider estimation of cross-section versions of the sample selection model. In the first step, γ0 is consistently estimated based on the selection equation. In the second step, the estimate \( \widehat{\upgamma} \) is used to estimate β0 based on those pairs of observations for which zi1\( \widehat{\upgamma} \) and zi2\( \widehat{\upgamma} \) are ‘close’. To this end define

$$ {\widehat{\psi}}_i=\frac{1}{h_N}K\left(\frac{\Delta {z}_i\widehat{\upgamma}}{h_N}\right) $$

where K () is a kernel density function and hN is a bandwidth sequence. The proposed estimator takes the form:

$$ \widehat{\beta}={\left[\sum \limits_{i=1}^N{\widehat{\psi}}_i\Delta {x}_i^{\prime}\Delta {x}_i{d}_{i1}{d}_{i2}\right]}^{-1}\sum \limits_{i=1}^N{\widehat{\psi}}_i\Delta {x}_i^{\prime}\Delta {y}_i{d}_{i1}{d}_{i2}. $$

Under some assumptions and by appropriately choosing hN, the estimator can be shown to be asymptotically normal, although the rate of convergence is slower than the parametric \( \sqrt{N} \) rate. Apart from the conditional exchangeability assumption, another important assumption that underlies the approach is that there is at least one variable in zit not contained in xit, which is an exclusion restriction common in semiparametric sample selection models.
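Given a first-step estimate \( \widehat{\upgamma} \) of γ0 (obtained, for example, from the selection equation by conditional logit or smoothed maximum score), the second step is a kernel-weighted first-difference OLS regression. The sketch below mirrors the formula for \( \widehat{\beta} \) above; the Gaussian kernel, the bandwidth h, and the array layout are assumptions of the example.

```python
import numpy as np

def kyriazidou_second_step(y, x, z, d, gamma_hat, h):
    """y, d: (N, 2); x: (N, 2, K); z: (N, 2, L); gamma_hat: (L,); returns beta_hat."""
    dz = (z[:, 0, :] - z[:, 1, :]) @ gamma_hat                       # Δz_i γ̂
    psi = np.exp(-0.5 * (dz / h) ** 2) / (np.sqrt(2 * np.pi) * h)    # ψ̂_i = K(Δz_i γ̂ / h) / h
    w = psi * d[:, 0] * d[:, 1]                                      # keep d_i1 = d_i2 = 1 only
    dx = x[:, 0, :] - x[:, 1, :]                                     # Δx_i
    dy = y[:, 0] - y[:, 1]                                           # Δy_i
    xtx = (dx * w[:, None]).T @ dx
    xty = (dx * w[:, None]).T @ dy
    return np.linalg.solve(xtx, xty)
```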

A dynamic version of the panel data sample selection model, with the own lagged dependent variable appearing in each equation, is considered by Kyriazidou (2001).

The Random Effects Approach

Fixed effects methods and conditional maximum likelihood methods (when they exist) estimate the coefficients of time-varying regressors consistently without making any assumptions on how the individual effects are related to the observed covariates or to the time-varying errors or to the initial observations of the sample. However, these methods do not deliver estimates of coefficients of time-invariant regressors and of the individual effects, and hence cannot be used for prediction, or for computation of marginal effects and elasticities which are often the quantities of interest. Furthermore, none of these approaches allows for non-stationary errors and hence for time-series heteroskedasticity.

These problems do not arise in the random effects approach. The approach essentially consists of treating (αi + εit) as a two-component error term and making assumptions about its relationship with the observed covariates and, in the case of dynamic models, with the initial conditions as well. A downside of the approach is that misspecification of any part of the model typically yields inconsistent estimates.

Static Case

In the static panel data linear regression model, the traditional random effects approach (sometimes also called the uncorrelated random effects approach) assumes that the individual effects αi, along with the time-varying errors εit, are uncorrelated with the observed covariates xit. Then the coefficients of both time-varying and time-invariant regressors may be estimated consistently (albeit not efficiently) by pooled OLS. In static nonlinear models, the traditional random effects approach, apart from parameterizing the conditional distribution of εit given xit, also assumes that αi is independent of xit and εit for all t, and has a distribution, say H, that depends on a finite set of unknown parameters, say δ0. For example, in the binary choice model,

$$ {y}_{it}=1\left\{{x}_{it}{\beta}_0+{\alpha}_i+{\varepsilon}_{it}\ge 0\right\}\;i=1,\dots, N;t=1,\dots, T $$
(1)

assuming that εit are i.i.d. over time and independent of xi and αi with known distribution F (say, standard normal or logistic), we may estimate the unknown parameters (β0,δ0) via ML. The log-likelihood is

$$ \ln L\left(\beta, \delta \right)=\sum \limits_i\ln \int \prod \limits_{t=1}^TF{\left({x}_{it}\beta +\alpha \right)}^{y_{it}}{\left(1-F\left({x}_{it}\beta +\alpha \right)\right)}^{1-{y}_{it}} dH\left(\alpha, \delta \right) $$

and involves a one-dimensional integral which may be calculated numerically, for example, by quadrature procedures (see Butler and Moffitt 1982).
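For concreteness, the following sketch evaluates this one-dimensional integral by Gauss–Hermite quadrature for a random effects probit, assuming αi ~ N(0, σ²) independent of xi; the parameterization (σ entered in logs) and the number of nodes are choices of the example.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import roots_hermite

def re_probit_loglik(params, y, x, n_nodes=12):
    """Random effects probit log-likelihood; y is (N, T), x is (N, T, K)."""
    k = x.shape[2]
    beta, sigma = params[:k], np.exp(params[k])
    nodes, weights = roots_hermite(n_nodes)
    alpha = np.sqrt(2.0) * sigma * nodes            # change of variables for N(0, σ²)
    idx = x @ beta                                  # (N, T)
    ll = 0.0
    for i in range(y.shape[0]):
        p = norm.cdf(idx[i][:, None] + alpha[None, :])          # (T, n_nodes)
        lik_t = np.where(y[i][:, None] == 1, p, 1 - p)          # F(·)^y (1 - F(·))^(1-y)
        ll += np.log(weights @ np.prod(lik_t, axis=0) / np.sqrt(np.pi))
    return ll
```

The log-likelihood can then be maximized over (β, log σ) with a standard numerical optimizer.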

However, things become quite complicated if we want to allow for arbitrary serial correlation in the εit’s. Consider the binary choice model

$$ {y}_{it}=1\left\{{x}_{it}{\beta}_0-{u}_{it}\ge 0\right\} $$

where uit = αi + εit is the composite error term. For T = 3 there are 2³ = 8 possible sequences of 0’s and 1’s. The likelihood for an individual for whom the sequence of observed yit’s is (0,1,0) takes the form

$$ {\int}_{x_{i1}\beta}^{\infty }{\int}_{-\infty}^{x_{i2}\beta }{\int}_{x_{i3}\beta}^{\infty }f\left({u}_1,{u}_2,{u}_3\right){du}_3{du}_2{du}_1 $$

where f is the trivariate density of (u1,u2,u3) conditional on xi. The log-likelihood is

$$ {\displaystyle \begin{array}{ll}\ln L\left(\beta, \delta \right)=& \sum \limits_{i:\left(0,0,0\right)}\ln {\int}_{x_{i1}\beta}^{\infty }{\int}_{x_{i2}\beta}^{\infty }{\int}_{x_{i3}\beta}^{\infty }f\left({u}_1,{u}_2,{u}_3\right){du}_3{du}_2{du}_1\hfill \\ {}& +\sum \limits_{i:\left(0,0,1\right)}\ln {\int}_{x_{i1}\beta}^{\infty }{\int}_{x_{i2}\beta}^{\infty }{\int}_{-\infty}^{x_{i3}\beta }f\left({u}_1,{u}_2,{u}_3\right){du}_3{du}_2{du}_1+\dots \hfill \end{array}} $$

which requires the computation of multiple trivariate integrals. Multivariate integration is basically infeasible for large T. This is where simulation methods come in very handy.
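As an illustration of the idea, the crude frequency simulator below approximates the likelihood contribution of an individual with observed sequence (0, 1, 0), assuming (u1, u2, u3) is multivariate normal with covariance Sigma; in practice a smooth simulator such as GHK would be used, but the principle is the same.

```python
import numpy as np

def simulated_prob_010(x_i, beta, Sigma, n_draws=10_000, seed=0):
    """Approximate Pr(y_i = (0, 1, 0) | x_i) by simulation; x_i is (3, K)."""
    rng = np.random.default_rng(seed)
    u = rng.multivariate_normal(np.zeros(3), Sigma, size=n_draws)   # draws of (u1, u2, u3)
    idx = x_i @ beta                                                # (x_i1β, x_i2β, x_i3β)
    # y_it = 1{x_itβ - u_it ≥ 0}, so (0, 1, 0) means u1 > x_i1β, u2 ≤ x_i2β, u3 > x_i3β
    event = (u[:, 0] > idx[0]) & (u[:, 1] <= idx[1]) & (u[:, 2] > idx[2])
    return event.mean()
```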

The assumption that αi is independent of xi is often found unsatisfactory. A possible solution is to assume a specific functional form for the relationship of αi with xi. This approach (recently also called the correlated random effects approach) was first proposed by Chamberlain (1984). Suppose that

$$ {\alpha}_i=\sum \limits_{t=1}^T{x}_{it}{\upgamma}_{0,t}+{v}_i $$

where vi is independent of xi, similarly to the time-varying error component εit, and the composite new error term vi + εit follows a specific distribution, say normal. In the case of the binary choice model, for example, assuming that εit + vi conditional on xi is \( N\left(0,{\sigma}_{0,t}^2\right) \) implies that

$$ \Pr \left({y}_{it}=1|{x}_i\right)=\Phi \left(\frac{x_{it}{\beta}_0+{\sum}_{s=1}^T{x}_{is}{\upgamma}_{0,s}}{\sigma_{0,t}}\right)=\Phi \left({x}_{i}{\theta}_{0,t}\right). $$

For computational simplicity, Chamberlain proposes to estimate the unknown parameters θ0,t via period-by-period probit. The ‘structural parameters’ \( {\beta}_0,{\left\{{\sigma}_{0,t}^2\right\}}_{t=1}^T,\mathrm{and}\;{\left\{{\upgamma}_{0,t}\right\}}_{t=1}^T \) can then be recovered by minimum distance estimation. Note that the approach allows for time-series heteroskedasticity and requires only one normalization, for example that \( {\sigma}_{0,t}^2=1 \) for some period t.
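A sketch of the first (reduced-form) step is given below: a separate probit of yit on the full regressor vector xi = (xi1, …, xiT) for each period, which delivers the θ̂0,t. The use of statsmodels and the data layout are assumptions of the example, and the minimum distance step that recovers the structural parameters is omitted.

```python
import numpy as np
import statsmodels.api as sm

def period_by_period_probit(y, x):
    """y: (N, T) binary outcomes; x: (N, T, K) regressors; returns the T reduced-form θ̂_t."""
    n, t_periods, k = x.shape
    x_full = sm.add_constant(x.reshape(n, t_periods * k))   # stack x_i1, ..., x_iT
    thetas = []
    for t in range(t_periods):
        res = sm.Probit(y[:, t], x_full).fit(disp=0)        # probit of y_it on the full x_i
        thetas.append(res.params)
    return np.vstack(thetas)
```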

Newey (1994) generalizes Chamberlain’s approach by postulating that

$$ {\alpha}_i=\rho \left({x}_{i1},\dots, {x}_{iT}\right)+{v}_i $$

where ρ () is an unknown function of xi. Assuming again that vi and εit are independent of xi and that the composite new error term vi + εit follows a specific distribution, say Ft, we obtain

$$ {\pi}_t=\Pr \left({y}_{it}=1|{x}_i\right)={F}_t\left(\rho \left({x}_i\right)+{x}_{it}{\beta}_0\right) $$

which for a strictly monotonic Ft implies that

$$ {F}_t^{-1}\left({\pi}_t\right)=\rho \left({x}_i\right)+{x}_{it}{\beta}_0. $$

For example in the normal case

$$ {\Phi}^{-1}\left({\pi}_t\right)=\frac{\rho \left({x}_i\right)+{x}_{it}{\beta}_0}{\sigma_{0,t}}. $$

Thus for two periods t and s we obtain

$$ {\Phi}^{-1}\left({\pi}_t\right)=\frac{\sigma_{0,s}}{\sigma_{0,t}}{\Phi}^{-1}\left({\pi}_s\right)+\frac{1}{\sigma_{0,t}}\left({x}_{it}-{x}_{is}\right){\beta}_0. $$

Normalizing σ0,t = 1 and estimating πt and πs nonparametrically, we can recover σ0,s and β0 from the regression of \( {\Phi}^{-1}\left({\widehat{\pi}}_t\right)\;\mathrm{on}\;{\Phi}^{-1}\left({\widehat{\pi}}_s\right)\;\mathrm{and}\;\left({x}_{it}-{x}_{is}\right). \)
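The final step is then an ordinary regression, as the brief sketch below illustrates; it assumes that nonparametric estimates of πt and πs (for example, kernel regressions of yit and yis on xi) are already available, and that σ0,t is normalized to 1.

```python
import numpy as np
from scipy.stats import norm

def newey_final_regression(pi_t_hat, pi_s_hat, x_t, x_s):
    """Regress Φ⁻¹(π̂_t) on Φ⁻¹(π̂_s) and (x_it - x_is); returns (σ̂_0,s, β̂_0)."""
    lhs = norm.ppf(pi_t_hat)
    rhs = np.column_stack([norm.ppf(pi_s_hat), x_t - x_s])
    coef, *_ = np.linalg.lstsq(rhs, lhs, rcond=None)
    return coef[0], coef[1:]
```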

A criticism of all these correlated random effects approaches is the following. In the linear model, writing \( {\alpha}_i={\sum}_{t=1}^T{x}_{it}{\upgamma}_{0,t}+{u}_i \) with E(uixit) = 0 for all t does not impose any restrictions on the joint distribution of αi and xi (apart from the requirement that it has second moments), since this is just the best linear projection of αi on xi. In a nonlinear model, by contrast, assuming αi = ρ(xi1,…,xiT) + ui, even without specifying the functional form of ρ, imposes implausible restrictions, in the sense that, if this relationship holds for the T observations, a similar one will not in general hold for T + 1.

Dynamic Case

In the case where there are genuine dynamics in the model in the form of lags of the dependent variable or other endogenous regressors, random effects methods become even more complicated and require additional assumptions about the relationship of the individual effects with the initial observations. We next describe a general approach for estimating dynamic random effects models suggested by Wooldridge (2000). For simplicity we will drop the subscripts i.

We are interested in the conditional distribution of yt given a vector of strictly exogenous variables z^T ≡ (z1, …, zT), own lags and lags of other endogenous variables x^{t−1} ≡ (yt−1, wt−1, yt−2, wt−2, …, y0, w0), and an unobserved scalar or vector random effect α. Here zt is strictly exogenous in the sense that

$$ F\left({w}_t|{z}^T,{x}^{t-1},\alpha \right)=F\left({w}_t|{z}_t,{x}^{t-1},\alpha \right). $$

The conditional density of xt ≡ (yt, wt) is

$$ {f}_t\left({x}_t|{z}^T,{x}^{t-1},\alpha \right)={f}_t\left({x}_t|{z}_t,{x}^{t-1},\alpha \right)={f}_t\left({y}_t|{w}_t,{z}_t,{x}^{t-1},\alpha \right)\cdot {f}_t\left({w}_t|{z}_t,{x}^{t-1},\alpha \right) $$

and the joint density for all T periods is

$$ f\left({x}_1,{x}_2,\dots, {x}_T|{z}^T,{x}_0,\alpha \right)=\prod \limits_{t=1}^T{f}_t\left({x}_t|{z}_t,{x}^{t-1},\alpha \right). $$

But α is unobserved, and we need to integrate it out. One solution is to parameterize the distribution of α conditional on z^T and x0, say h(α|z^T, x0). Then

$$ f\left({x}_1,{x}_2,\dots, {x}_T|{z}^T,{x}_0\right)=\int \prod \limits_{t=1}^T{f}_t\left({x}_t|{z}_t,{x}^{t-1},\alpha \right)h\left(\alpha |{z}^T,{x}_0\right) d\alpha . $$
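As one concrete (and commonly used) parameterization of h, one may take the model to be a dynamic probit and specify α conditional on (z^T, x0) as normal with mean linear in the initial observation and in the time averages of the exogenous variables. The sketch below is written under exactly these auxiliary assumptions, which are an illustration rather than part of the general framework described here.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import roots_hermite

def dyn_re_probit_loglik(params, y, z, y0, n_nodes=12):
    """y: (N, T) outcomes; z: (N, T, K) exogenous variables; y0: (N,) initial observations."""
    n, t_periods, k = z.shape
    rho, beta = params[0], params[1:1 + k]
    a0, a1, a2 = params[1 + k], params[2 + k], params[3 + k:3 + 2 * k]
    sigma_a = np.exp(params[3 + 2 * k])
    nodes, weights = roots_hermite(n_nodes)
    mean_a = a0 + a1 * y0 + z.mean(axis=1) @ a2          # E(α | z^T, y_0): auxiliary assumption
    ll = 0.0
    for i in range(n):
        alpha = mean_a[i] + np.sqrt(2.0) * sigma_a * nodes
        ylag = np.concatenate(([y0[i]], y[i, :-1]))      # y_{t-1}, including the initial observation
        idx = z[i] @ beta + rho * ylag
        p = norm.cdf(idx[:, None] + alpha[None, :])
        lik_t = np.where(y[i][:, None] == 1, p, 1 - p)
        ll += np.log(weights @ np.prod(lik_t, axis=0) / np.sqrt(np.pi))
    return ll
```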

Notice that in the traditional random effects approach (in the line of Anderson and Hsiao 1981) we would have to make assumptions about the distribution of x0 conditional on α and z^T.

See Also