The control function approach is an econometric method used to correct for biases that arise as a consequence of selection and/or endogeneity. It is the leading approach for dealing with selection bias in the correlated random coefficients model (see Heckman and Robb 1985, 1986; Heckman and Vytlacil 1998; Wooldridge 1997, 2003; Heckman and Navarro 2004), but it can be applied in more general semiparametric settings (see Newey et al. 1999; Altonji and Matzkin 2005; Chesher 2003; Imbens and Newey 2006; Florens et al. 2007).

The basic idea behind the control function methodology is to model the dependence of the variables not observed by the analyst on the observables in a way that allows us to construct a function K such that, conditional on that function, the endogeneity problem (relative to the object of interest) disappears.

In this article I deal exclusively with the problem of identification. That is, I assume access to data on an arbitrarily large population. As a consequence, I do not discuss estimation, standard errors or inference. In the examples, I analyse how to recover parameters in a way that, I hope, shows directly how to perform estimation via sample analogues.

The Set-Up

The general set-up I consider is the following two-equation structural model; an outcome equation:

$$ Y=g\left(X,D,\varepsilon \right), $$
(1)

and an equation describing the mechanism assigning values of D to individuals:

$$ D=h\left(X,Z,\nu \right), $$
(2)

where X and Z are vectors of observed random variables, D is a (possibly vector valued) observed random variable, and ε and ν are general disturbance vectors not independent of each other but satisfying some form of independence of X and Z.

The problem of endogeneity arises because D is correlated with ε via the dependence between ε and ν. Because Eq. (2) represents an assignment mechanism in many economic models, it is generically called the ‘selection’ or ‘choice’ equation. This set-up has been applied to problems like earnings and schooling (Willis and Rosen 1979; Cunha et al. 2005), wages and sectoral choice (Heckman and Sedlacek 1985) and production functions and productivity (Olley and Pakes 1996), among others.

The goal of the analysis is to recover some functional of interest of g(X, D, ε),

$$ a\left(X,D\right) $$
(3)

that cannot be recovered in a straightforward way because of the endogeneity/selection problem. As an example, when D is binary, interest sometimes centres on the effect of going from D = 0 to D = 1 for an individual chosen at random from the population, the so-called average treatment effect:

$$ a\left(X,D\right)=E\left(g\left(X,1,\varepsilon \right)-g\left(X,0,\varepsilon \right)\right). $$

The key behind the control function approach is to notice that (conditional on X, Z) the only source of dependence is given by the relation between ε and ν. If ν were known, we could condition on it and analyse Eq. (1) without having to worry about endogeneity. The main idea behind the control function approach is to recover some function of ν via its relationship with the model observables so that we can condition on it and solve the endogeneity problem.

Definition

The control function approach proposes a function K (the control function) that allows us to recover a(X, D) such that K satisfies

A-1. K is a function of X, Z, D.

A-2. ε satisfies some form of independence of D conditional on ρ(X, K), with ρ a knowable function.

A-3. K is identified.

Assumption A-2 is the key assumption of the approach. It states that, once we condition on K, the dependence between ε and D (that is, the endogeneity) is no longer a problem. To help fix ideas, consider the following example of a simple linear in parameters additively separable version of the model of Eqs. (1 and 2).

Example 1

Linear regression with constant effects. Write the outcome Eq. (1) as

$$ Y= X\beta + D\alpha +\varepsilon $$

and assume that our object of interest (3) is α. Assume that we can write Eq. (2) as

$$ D= X\rho + Z\pi +\nu $$
(4)

with ν, ε ⊥⊥ X, Z, where ⊥⊥ denotes statistical independence. Such a model arises, for example, if Y is log earnings and D is years of schooling, as in Heckman et al. (2003). If ability is unobservable, then, since high ability is associated with both higher earnings and more schooling, ε and ν would be correlated.

If we let K = ν be the residual of the regression in (4), then we can recover α from the following regression

$$ Y= X\beta + D\alpha + K\psi +\eta, $$

where it follows that E(η|X, K) = 0. It is easy to show that in this case the control function estimator and the two-stage least squares estimator are equivalent. (To my knowledge, although in a different context – a SUR model – Telser 1964, was the first to use the residuals from other equations as regressors in the equation of interest.)

The previous case is a simple example of a control function where K = D − E(D|X, Z). In this case, because of the constant effects assumption (that is, α is not random), standard instrumental variables methods and the control function approach coincide. In general, this is not the case.
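The numerical equivalence claimed in Example 1 is easy to check by simulation. The following sketch is purely illustrative (the coefficient values and the normal design are arbitrary choices, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

X = rng.normal(size=n)
Z = rng.normal(size=n)                  # instrument, excluded from the outcome
nu = rng.normal(size=n)
eps = 0.8 * nu + rng.normal(size=n)     # eps correlated with nu => endogeneity

D = 0.5 * X + 1.0 * Z + nu              # Eq. (4): D = X*rho + Z*pi + nu
Y = 1.0 * X + 2.0 * D + eps             # alpha = 2 is the target parameter

def ols(y, W):
    return np.linalg.lstsq(W, y, rcond=None)[0]

# Control function: include the first-stage residual K as a regressor.
W1 = np.column_stack([np.ones(n), X, Z])
K = D - W1 @ ols(D, W1)
alpha_cf = ols(Y, np.column_stack([np.ones(n), X, D, K]))[2]

# Two-stage least squares: replace D by its first-stage fitted value D - K.
alpha_2sls = ols(Y, np.column_stack([np.ones(n), X, D - K]))[2]

print(alpha_cf, alpha_2sls)  # numerically identical, both near 2
```

The equality is exact (up to floating point), since the residual K is orthogonal to the fitted value D − K and to (1, X), so adding it does not change the coefficient attributed to the exogenous variation in D.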

In the next section I describe in detail the control function methodology for the binary choice case (Roy 1951). This case is interesting both because it is the workhorse of the policy evaluation literature and because, by virtue of its nonlinearity, it highlights the implications of a nonlinear structure in a relatively simple context. I then briefly describe extensions to more general cases. For simplicity, I focus on the additively separable in unobservables case, but recent research provides generalizations to non-additive functions (see Blundell and Powell 2003; Imbens and Newey 2006, among others).

The Case of a Binary Endogenous Variable

In this section I describe how the control function approach solves the selection/endogeneity problem when the endogenous variable is binary. This problem has a long tradition in economics going back (at least) to Roy (1951). In Roy’s original version of the model (see Roy model) an individual is deciding whether to become a fisherman (D = 0) or a hunter (D = 1).

Associated with each occupation is a payoff YD = gD(X) + εD. Since we can only observe individuals in one sector at a time, the observed outcome for an individual is given by Y1 if he becomes a hunter (D = 1) and by Y0 if he becomes a fisherman (D = 0). That is, the observed outcome (Y) can be written as:

$$ Y={DY}_1+\left(1\hbox{--} D\right){Y}_0={g}_0(X)+D\left({g}_1(X)\hbox{--} {g}_0(X)\right)+{\varepsilon}_0+D\left({\varepsilon}_1\hbox{--} {\varepsilon}_0\right). $$
(5)

The model is closed by assuming that individuals choose the occupation with the highest payoff. That is,

$$ D=1\left({Y}_1-{Y}_0>0\right)=1\left({g}_1(X)-{g}_0(X)+{\varepsilon}_1-{\varepsilon}_0>0\right), $$
(6)

where 1(a) is an indicator function that takes value 1 if a is true and 0 if it is false. Endogeneity arises because the error term in choice Eq. (6) contains the same random variables as the outcome Eq. (5). A generalized version of the model replaces the simple income maximization rule in (6) with a more general decision rule

$$ D=1\left(h\left(X,Z\right)-\nu >0\right). $$
(7)

The model described by Eqs. (5 and 7) is general enough to be used in many different cases. Many questions of interest in economics fit this framework if, instead of thinking of two sectors, fishing and hunting, we think of two generic potential states, the treated state (D = 1) and the untreated state (D = 0) with their associated potential outcomes. The decision rule in (7) is general enough to capture not only income maximization but also utility maximization and even a deciding actor different from the agent directly affected by the outcomes (parents deciding for their children, for example). The simple income maximization rule in (6) shows why, in general, if ε1 ≠ ε0, then ε1 − ε0 is likely to be correlated with D.

The correlated random coefficients model is a special case of the model described by (5) and (7) when ε1 − ε0 is not independent of D and gj(X) = αj + Xβj for j = 0, 1. (For simplicity I assume β1 = β0 = β. The case where β1 ≠ β0 follows directly.) To see why, simply rewrite (5) as

$$ Y={\alpha}_0+ X\beta +D\left({\alpha}_1-{\alpha}_0+{\varepsilon}_1-{\varepsilon}_0\right)+{\varepsilon}_0 $$
(8)

so that now the coefficient on D is (a) random and (b) correlated with D. In this case we have that the gains from treatment (α1 − α0 + ε1 − ε0) are heterogeneous (that is, they are not constant even after controlling for X) and they are correlated with D. I come back to this special linear in parameters case in Example 2.

Though other parameters of interest can be defined, I consider the case in which we are interested in the two particular functionals that receive the most attention in the evaluation literature – the average treatment effect and the average effect of treatment on the treated. I impose that ε1, ε0, ν are absolutely continuous with finite means, and that ε1, ε0, ν ⊥⊥ X, Z. (One could weaken the assumption to be ε1, ε0 ⊥⊥ X|Z and ν ⊥⊥ X, Z.)

Under these assumptions the average treatment effect is given by

$$ ATE(x)=E\left({Y}_1-{Y}_0|X=x\right)={g}_1(x)-{g}_0(x)={\alpha}_1-{\alpha}_0+x\left({\beta}_1-{\beta}_0\right) $$

where the last equality follows if Eq. (8) applies. ATE(x) is of interest for answering questions like the average effect of a policy that is mandatory, for example. When receipt of treatment is not mandatory or randomly assigned, the average effect of treatment among those individuals who select into treatment is commonly the functional of interest (see Heckman 1997; Heckman and Smith 1998). This effect is measured by the average effect of treatment on the treated:

$$ TT(x)=E\left({Y}_1-{Y}_0|X=x,D=1\right)={g}_1(x)-{g}_0(x)+E\left({\varepsilon}_1-{\varepsilon}_0|X=x,D=1\right)={\alpha}_1-{\alpha}_0+E\left({\varepsilon}_1-{\varepsilon}_0|X=x,D=1\right), $$

where the last equality follows for the linear in parameters case of Eq. (8).

Now, suppose we ignored the endogeneity problem and attempted to recover either of these objects from the data on outcomes at hand. In particular, if we used the (observed) conditional means of the outcome

$$ E\left(Y|X=x,D=1\right)-E\left(Y|X=x,D=0\right)={g}_1(x)-{g}_0(x)+E\left({\varepsilon}_1|X=x,D=1\right)-E\left({\varepsilon}_0|X=x,D=0\right) $$

we would not recover either ATE(X) or TT(x). Notice too that, since the endogenous variable D is binary, we cannot directly recover ν and use it as a control as we did in the linear case of Example 1 above. Instead, we can recover a function of ν that satisfies the definition of a control function.

Let Fν() denote the cumulative distribution function of ν. To form the control function in this case, first take Eq. (7) and write the choice probability

$$ P\left(x,z\right)=\Pr \left(D=1|X=x,Z=z\right)=\Pr \left(\nu <h\left(x,z\right)\right)={F}_{\nu}\left(h\left(x,z\right)\right), $$

which under our assumptions implies

$$ h\left(x,z\right)={F}_{\nu}^{-1}\left(P\left(x,z\right)\right). $$

Following the analysis in Matzkin (1992), we can recover both h(x, z) and Fν() nonparametrically up to normalization.

Next, take the conditional (on X, Z) expectation of the outcome for the treated group

$$ E\left(Y|X=x,Z=z,D=1\right)={g}_1(x)+E\left({\varepsilon}_1|X=x,Z=z,D=1\right). $$

We can write the last term as

$$ E\left({\varepsilon}_1|X=x,Z=z,D=1\right)\operatorname{}=E\left({\varepsilon}_1|\nu <h\left(x,z\right)\right)=E\left({\varepsilon}_1|\nu <{F}_{\nu}^{-1}\left(P\left(x,z\right)\right)\right). $$

That is, we can write it as a function of the known h(x, z) or, equivalently, as a function of the probability of selection P(x, z),

$$ E\left(Y|X=x,Z=z,D=1\right)={g}_1(x)+{K}_1\left(P\left(x,z\right)\right), $$

where K1(P(X, Z)) satisfies our definition of a control function. So, provided that we can vary K1(P(X, Z)) independently of g1(X), we can recover g1(X) up to a constant. We can identify the constant in a limit set such that P → 1, since limP→1 K1(P) = 0. Provided that we have enough support in the probability of treatment – that is, provided that some people choose treatment with probability arbitrarily close to 1 – we can recover the constant. (See Example 2.) Using the same argument we can form

$$ E\left(Y|X=x,Z=z,D=0\right)={g}_0(x)+{K}_0\left(P\left(x,z\right)\right) $$

and identify g0(X) (up to a constant) and the control function K0(P(X, Z)). As before, we can recover the constant in g0(X) by noting that limP→0 K0(P) = 0.

Intuitively, we need to be able to vary the K1(P(X, Z)) function relative to the g1(X) function so that we can identify them from the observed variation in Y1. One possibility is to impose that g1 and K1 are measurably separated functions. (That is, if g1(X) = K1(P(X, Z)) almost surely, then g1(X) is constant almost surely; see Florens et al. 1990.) The simplest way to satisfy this restriction is by exclusion. That is, if K1(P(X, Z)) is a nontrivial function of Z conditional on X and Z shows enough variation, we can vary the K1 function by varying Z while keeping g1(X) constant. Another related possibility is to assume that g1 and K1 live in different function spaces. For example, g1 may be a linear function and K1 the nonlinear inverse Mills ratio term that results from assuming that (ε0, ε1, ν) are jointly normal, as in the original Heckman (1979) selection correction model.
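The jointly normal case can be sketched in a few lines. In this illustrative simulation (all parameter values are arbitrary, and for clarity the choice probability P(x, z) is taken as known rather than estimated nonparametrically), K1 is proportional to the negative inverse Mills ratio, and including it as a regressor in the treated subsample recovers g1:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 20_000

X = rng.normal(size=n)
Z = rng.normal(size=n)
nu = rng.normal(size=n)
eps1 = 0.6 * nu + rng.normal(size=n)      # Corr(eps1, nu) > 0: selection bias

alpha1, beta1 = 1.0, 2.0
Y1 = alpha1 + beta1 * X + eps1            # g1(X) = alpha1 + beta1 * X
h = 0.5 * X + 1.0 * Z
D = h > nu                                # Eq. (7): D = 1(h(X, Z) - nu > 0)

# Take the choice probability as known for clarity: P = F_nu(h) = Phi(h).
P = norm.cdf(h)

# Under joint normality, K1(P) is proportional to
# E(nu | nu < h) = -phi(h) / Phi(h), with h = Phi^{-1}(P).
K1 = -norm.pdf(norm.ppf(P)) / P

# OLS of Y on (1, X, K1) in the treated subsample: K1 is the control
# function, and the remaining coefficients identify g1(X).
W = np.column_stack([np.ones(D.sum()), X[D], K1[D]])
coef = np.linalg.lstsq(W, Y1[D], rcond=None)[0]
print(coef)  # approx [1.0, 2.0, 0.6]
```

The exclusion restriction is visible in the code: conditional on X, the regressor K1 still varies through Z, which is what separates it from g1(X).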

Once we have recovered g0(X), g1(X), K0(P(X, Z)) and K1(P(X, Z)), we can form our parameters of interest. Given g0(X) and g1(X), ATE(X) = g1(X) − g0(X) follows immediately. To recover TT(X), first notice that, by the law of iterated expectations,

$$ {\displaystyle \begin{array}{l}E\left({\varepsilon}_0|X=x,Z=z\right)=E\left({\varepsilon}_0|X=x,Z=z,D=1\right)P\left(x,z\right)\\ {}\qquad\qquad\quad\, +E\left({\varepsilon}_0|X=x,Z=z,D=0\right)\left(1-P\left(x,z\right)\right)\\ {}\qquad\qquad\quad\, =0,\end{array}} $$

where P(x, z) is known from our analysis above and E(ε0| X = x, Z = z, D = 0) = K0(P(x, z)). Rewriting the expression above we get \( E\left({\varepsilon}_0|X=x,Z=z,D=1\right)=-\frac{K_0\left(P\left(x,z\right)\right)\left(1-P\left(x,z\right)\right)}{P\left(x,z\right)} \). With this expectation in hand we can recover \( TT\left(X,Z\right)={g}_1(X)-{g}_0(X)+{K}_1\left(P\left(X,Z\right)\right)+\frac{K_0\left(P\left(X,Z\right)\right)\left(1-P\left(X,Z\right)\right)}{P\left(X,Z\right)} \). By integrating against the appropriate distribution, we can recover TT(x) = ∫TT(x, z)dFZ|X=x,D=1(z).
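The iterated-expectations step – that E(ε0|X, Z, D = 1) is pinned down by K0(P) and P – can be verified numerically. A minimal check, assuming (purely for illustration) joint normality of (ε0, ν) and fixing a single (x, z) cell so that h(x, z) is one threshold:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Fix one (x, z) cell: h(x, z) is a single threshold c and P = Pr(nu < c).
nu = rng.normal(size=n)
eps0 = 0.7 * nu + rng.normal(size=n)   # E(eps0) = 0, eps0 dependent on nu
c = 0.3                                # illustrative value of h(x, z)
D = nu < c

P = D.mean()                           # P(x, z)
K0 = eps0[~D].mean()                   # K0(P) = E(eps0 | D = 0)
lhs = eps0[D].mean()                   # E(eps0 | D = 1)
rhs = -K0 * (1 - P) / P                # implied by E(eps0) = 0
print(lhs, rhs)                        # agree up to simulation error
```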

The following example shows how the control function methodology can be applied to recover average effects of treatment in a linear in parameters model with correlated random coefficients. This model arises when there are unobservable gains that vary over individuals and these gains are correlated with the choice of treatment (that is, when there is essential heterogeneity. See Heckman et al. 2006; Basu et al. 2006). The Roy model of Eqs. (5 and 6) in which the unobservable individual gains (ε1ε0) are correlated with the choice of sector is an example of this case.

Example 2

Correlated random coefficients with binary treatment. Assume we can write the outcome equations in linear in parameters form,

$$ {Y}_j={\alpha}_j+X{\beta}_j+{\varepsilon}_j,\qquad j=0,1. $$

Let D be an indicator of whether an individual receives treatment (D = 1) or not (D = 0). We also write a linear in parameters decision rule:

$$ D=1\left( X\delta + Z\gamma -\nu >0\right). $$

From the analysis in Manski (1988) we can recover δ, γ and Fν (up to scale). With P(x, z) = Pr(D = 1|X = x, Z = z) in hand, we then form

$$ {Y}_j={\alpha}_j+X{\beta}_j+{K}_j\left(P\left(X,Z\right)\right)+{\eta}_j $$

where E(ηj| X = x, Kj(P(X, Z)) = kj) = 0. To emphasize the problem of identification of the constant αj we can rewrite the outcome as

$$ {Y}_j={\tau}_j+X{\beta}_j+{\tilde{K}}_j\left(P\left(X,Z\right)\right)+{\eta}_j $$

where \( {K}_j\left(P\left(X,Z\right)\right)={\kappa}_j+{\tilde{K}}_j\left(P\left(X,Z\right)\right) \) and τj = αj + κj.

The elements of the outcome equations can be recovered by various methods. One could, for example, follow Robinson (1988) and use residualized nonparametric regressions to recover βj, τj and Kj(P(X, Z)). Alternatively, one could approximate Kj(P(X, Z)) with a polynomial in P(X, Z). In this case we would have

$$ {Y}_j={\tau}_j+X{\beta}_j+{\pi}_{j1}P\left(X,Z\right)+{\pi}_{j2}P{\left(X,Z\right)}^2+\cdots +{\pi}_{jn}P{\left(X,Z\right)}^n+{\eta}_j $$

where \( {\tilde{K}}_j\left(P\left(X,Z\right)\right)={\sum}_{i=1}^n{\pi}_{ji}P{\left(X,Z\right)}^i \). When j = 0, limP→0 K0(P) = 0 and it follows that \( {\tilde{K}}_0(P)={K}_0(P) \) and τ0 = α0. For the treated case (j = 1) we have that limP→1 K1(P) = 0. Since \( {\tilde{K}}_1(1)={\sum}_{i=1}^n{\pi}_{1i} \), it follows that \( {\kappa}_1=-{\sum}_{i=1}^n{\pi}_{1i} \) and, from τ1 = α1 + κ1, that \( {\alpha}_1={\tau}_1+{\sum}_{i=1}^n{\pi}_{1i} \).
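A sketch of this series approach follows. All numerical values are illustrative, the propensity score is taken as known rather than estimated, and the errors are normal so that a low-order polynomial approximates Kj(P) well; since limP→1 K1(P) = 0, the intercept α1 equals the fitted polynomial (including its constant) evaluated at P = 1:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 50_000

# Illustrative data-generating process with correlated random coefficients.
X = rng.normal(size=n)
Z = rng.normal(size=n)
nu = rng.normal(size=n)
eps0 = 0.4 * nu + rng.normal(size=n)
eps1 = 0.8 * nu + rng.normal(size=n)      # gains eps1 - eps0 depend on nu

a0, b0, a1, b1 = 0.0, 1.0, 1.0, 2.0
Y0 = a0 + b0 * X + eps0                   # Y_j = alpha_j + X*beta_j + eps_j
Y1 = a1 + b1 * X + eps1
D = 0.5 * X + 1.0 * Z > nu                # D = 1(X*delta + Z*gamma - nu > 0)
Y = np.where(D, Y1, Y0)

# For clarity the propensity score is taken as known, not estimated.
P = norm.cdf(0.5 * X + 1.0 * Z)

def fit(sel, deg=4):
    """OLS of Y on (1, X, P, ..., P^deg) within one treatment arm."""
    cols = [np.ones(sel.sum()), X[sel]] + [P[sel] ** i for i in range(1, deg + 1)]
    return np.linalg.lstsq(np.column_stack(cols), Y[sel], rcond=None)[0]

c1 = fit(D)    # tau_1, beta_1, pi_11, ..., pi_1n
c0 = fit(~D)   # tau_0, beta_0, pi_01, ..., pi_0n

# lim_{P->1} K_1(P) = 0 implies kappa_1 = -sum_i pi_1i, so
# alpha_1 = tau_1 - kappa_1 = tau_1 + sum_i pi_1i (fitted value at P = 1).
alpha1_hat = c1[0] + c1[2:].sum()
# lim_{P->0} K_0(P) = 0 implies kappa_0 = 0, so tau_0 = alpha_0.
alpha0_hat = c0[0]
print(alpha1_hat, alpha0_hat)  # approx 1.0 and 0.0
```

The support condition in the text is visible here: recovering α1 amounts to extrapolating the fitted polynomial to P = 1, which is reliable only if some treated individuals have P close to 1.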

Extensions for a Continuous Endogenous Variable

In this section I briefly review the use of the control function approach for the case in which the endogenous variable D is continuous and we assume that X, Z ⊥⊥ε, ν. Following Blundell and Powell (2003) I assume that the object of interest is the average structural function

$$ a\left(X,D\right)=\int g\left(X,D,\varepsilon \right){dF}_{\varepsilon}\left(\varepsilon \right), $$

which, in the additively separable case g(X, D, ε) = μ(X, D) + ε, is simply the regression function μ(X, D).

If we assume that the choice equation

$$ D=h\left(X,Z,\nu \right) $$

is strictly monotonic in ν (which would follow automatically if it were additively separable in ν), we can recover h() and Fν from the analysis of Matzkin (2003) up to normalization. A convenient normalization is to assume that ν ∼ Uniform(0, 1), in which case we can recover ν directly as the conditional quantile of D given (X, Z); other normalizations are possible. From the independence assumption it follows that E(ε|X, D, Z) = E(ε|ν), so we can write the outcome equation as

$$ {\displaystyle \begin{array}{c}Y=\mu \left(X,D\right)+E\left(\varepsilon |\nu \right)\\ {}=\mu \left(X,D\right)+K\left(\nu \right)\end{array}} $$

which allows us to recover μ(X, D) directly (up to normalization). In the additively separable case we analyse, we can relax the full independence assumption and instead assume directly that the weaker mean independence assumption E(ε|X, D, Z) = E(ε|ν) holds.
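In the additively separable first stage, the residual of D on (X, Z) is a monotone transform of ν and can itself serve as the control. The following sketch is illustrative only: μ, the linear K (which arises, for instance, under joint normality), and all parameter values are assumptions of the example, not part of the general method.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

X = rng.normal(size=n)
Z = rng.normal(size=n)
nu = rng.normal(size=n)
eps = 0.5 * nu + rng.normal(size=n)     # E(eps | nu) = 0.5*nu, so K(nu) = 0.5*nu

D = 1.0 * X + 1.0 * Z + nu              # additively separable choice equation
Y = 1.0 + 1.0 * X + 1.0 * D**2 + eps    # mu(X, D) = 1 + X + D^2, nonlinear in D

# First stage: the control nu_hat is the residual of D on (1, X, Z).
W1 = np.column_stack([np.ones(n), X, Z])
nu_hat = D - W1 @ np.linalg.lstsq(W1, D, rcond=None)[0]

# Second stage: regress Y on a basis for mu(X, D) plus the control K(nu).
W2 = np.column_stack([np.ones(n), X, D, D**2, nu_hat])
coef = np.linalg.lstsq(W2, Y, rcond=None)[0]
print(coef)  # approx [1.0, 1.0, 0.0, 1.0, 0.5]
```

With a nonlinear or unknown K one would replace the single nu_hat column by a flexible basis (for example, a polynomial) in the control, exactly as in the binary case.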

See Also