Introduction

This survey aims to (re-)introduce applied labor economists to nonparametric regression techniques. Specifically, we discuss both spline and kernel regression, in an approachable manner. We present an intuitive discussion of estimation and model selection for said methods. We also address the use of nonparametric methods in the presence of endogeneity, a common issue in the labor literature, but seldom accounted for in applied nonparametric work.

Accounting for endogeneity is well understood in the parametric literature once a suitable instrument is obtained. Standard methods have been around for some time, but these methods do not always transfer in a straightforward manner to the nonparametric setting. This has caused many to shy away from nonparametric methods, even with the knowledge that they can lead to additional insight (Henderson and Parmeter 2015).

To showcase these methods, we will look at the relationship between experience, education and earnings. We will begin by ignoring the endogeneity of education and then will discuss how to control for this via a nonparametric control function approach. While nonparametric estimation may seem like merely one modeling choice, it should be stated that the parametric alternative requires strict functional form assumptions, which, if false, likely lead to biased and inconsistent estimators. In practice, the functional relationship between education and earnings as well as between education and its instruments is typically unknown. By using nonparametric regression, we relax these functional form restrictions and are more likely to uncover the causal relationship.

To empirically illustrate these methods, we use individual-level data obtained from the March Current Population Survey (CPS) to highlight each concept discussed. To eliminate additional complications, we primarily focus on a relatively homogeneous sub-group, specifically, working age (20 to 59 years old) males with four-year college degrees.

In what follows, we first slowly introduce the fundamentals of spline and kernel estimators and then discuss how to decide upon the various options for each estimator. This should build the foundation for understanding the more advanced topic of handling endogenous regressors. By illustrating these techniques in the context of labor-specific examples, we hope to encourage widespread use of these methods in labor applications.

Nonparametric Regression

In a parametric regression model, we assume that a particular functional form describes the relationship between the response and explanatory variables. If this form is correct, and the remaining Gauss-Markov assumptions hold, we will have unbiased and efficient estimators. However, if these assumptions do not hold, these estimators are likely biased and inconsistent. Nonlinear parametric models exist, but are often complicated to estimate and still require a priori knowledge of the underlying functional form.

Nonparametric regression offers an alternative. The methods discussed here estimate the unknown conditional mean by using a “local” approach. Specifically, the estimators use data near the point of interest to estimate the function at that point and then use these local estimates to construct the global function. This can be a major advantage over parametric estimators which use all data points to build their estimates (global estimators). In other words, nonparametric estimators can focus on local peculiarities inherent in a data set. Those observations which are more similar to the point of interest carry more weight in the estimation procedure.

This section will introduce two commonly used nonparametric techniques, and will provide the notation and concepts that will be used for the remainder of this review. Specifically, we discuss spline and kernel regression estimation. To help bridge gaps, we make connections to well-known techniques such as ordinary and weighted least-squares.

Spline Regression

Spline regression can be thought of as an extension of ordinary least-squares (OLS). Consider the basic univariate linear model:

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} + \epsilon_{i}, \quad i = 1,2,\ldots,n, $$
(1)

where for a sample of n observations, y is our response variable, x is our explanatory variable, 𝜖 is our usual error term and we have two parameters: a constant and a slope (β0 and β1, respectively). The right-hand side of Eq. 1 can be thought of as a linear combination of 1 and x, which we call the “bases” of the model. One popular way to transform (1) into a nonlinear function is to add higher-order polynomials. A quadratic model would add one extra basis function, \(x^{2}\), to the model, which corresponds to adding the term \(\beta _{2}{x}_{i}^{2}\) to Eq. 1. In matrix form, the number of bases would correspond to the number of columns in the matrix X:

$$ y = X\beta + \epsilon, $$
(2)

where

$$X = \left[\begin{array}{ll} 1 & x_{1} \\ 1 & x_{2} \\ {\vdots} & {\vdots} \\ 1 & x_{n} \end{array}\right] $$

for the linear case (2 bases), and

$$X = \left[\begin{array}{lll} 1 & x_{1} & {x^{2}_{1}} \\ 1 & x_{2} & {x^{2}_{2}} \\ {\vdots} & {\vdots} & {\vdots} \\ 1 & x_{n} & {x^{2}_{n}} \end{array}\right] $$

for the quadratic case (3 bases).
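To make the basis construction concrete, the two design matrices above can be generated and fit by OLS in R. This is only a sketch: the data frame cps and the variable names lwage (log wage) and exper (years of experience) are hypothetical stand-ins for the CPS extract described earlier.

```r
# Hypothetical CPS extract with columns lwage (log wage) and exper (experience)
# cps <- read.csv("cps_extract.csv")

# Linear case: two bases (1, x)
fit_lin <- lm(lwage ~ exper, data = cps)

# Quadratic case: three bases (1, x, x^2)
fit_quad <- lm(lwage ~ exper + I(exper^2), data = cps)

# The model matrices correspond to the X matrices displayed above
head(model.matrix(fit_lin))
head(model.matrix(fit_quad))
```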

These two cases are illustrated in Fig. 1 where x is years of experience and y is the log wage (adjusted for inflation). To highlight a relatively homogeneous group, we restrict our sample to college-educated (16 years of schooling) males working in personal care and service (or related occupations) between 2006 and 2016.Footnote 1 For each panel, the solid line represents white males and the dashed line non-whites. Our linear model (i.e., OLS) shows a strong wage gap between whites and non-whites which seems to remain constant (in percentage terms) as these workers gain experience (i.e., similar slopes). Adding experience squared to the model (quadratic model) allows us to better capture the well-known nonlinear relationship between log wage and experience. As workers gain experience, we expect their log wage to increase, but at a decreasing rate. The quadratic model (bottom-left panel) shows a large increase in log wages early in a career with a slight decline towards the end. Also, this model tends to suggest that the wage gap between white and non-white males working in personal care and service varies with experience. Non-white workers appear to have a more constant and slower increase in their predicted log wages.

Fig. 1 Log-wages versus experience for white versus non-white college-educated males working in personal care and service

Linear Spline Bases

In our example, we could argue that although wages should increase with experience (increase in competence/knowledge), there may be a point where more experience will not increase wages or perhaps even decrease them (slower cognitive ability/decreases in efficiency). Suppose we created a model with the equivalent of two linear regressions: one for the first 20 years of experience, and another for the later years. This would be equivalent to adding the following basis function to our linear model:

$$(x-20)_{+}, $$

where the + sign indicates that the function is set to zero for all values of x where (x − 20) is negative. This model is sometimes called the broken stick model because of its shape, but more generally is referred to as a linear spline base model with 3 knots. The 3 knots are at 0 (minimum value), 20, and 37 (maximum value) years of experience. Note that the maximum and minimum values of x will always be considered to be knots. For example, the linear model in Eq. 1 has two knots. Here we arbitrarily fixed the middle knot at 20 years of experience. We will discuss which knots to select and how many to select in “Model Selection”.

The broken stick model with a break at x = 20 is written as

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} +\beta_{2} {(x_{i}-20)}_{+} + \epsilon_{i} $$
(3)

and is illustrated in the upper-right panel of Fig. 1. We see a similar result to the quadratic model, that is, for white workers, we see a strong increase in wages in the first part of their career followed by a smaller decrease towards the end of their career. That being said, we arbitrarily fixed the middle knot at 20 years of experience. Without strong reasons to do so, it is premature to say anything about when the increase in the log wage stops and when the decrease begins. Noting the aforementioned issue, we also observe the wage gap widen at first with experience, but then converge at higher levels of experience.
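A minimal sketch of the broken stick model in Eq. 3, again using the hypothetical cps data frame; pmax() constructs the truncated basis (x − 20)+.

```r
# Truncated linear basis (x - 20)_+ built with pmax()
fit_break <- lm(lwage ~ exper + pmax(exper - 20, 0), data = cps)
coef(fit_break)  # beta0, beta1 and beta2 in Eq. 3
```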

Figure 2 illustrates how adding knots at different values can change the results. We present a model with 5 knots at x = 0,10,20,30,37, and a model with 20 knots (every 2 years) at x = 0,2,4,…,34,36,37. In the matrix form of Eq. 2, the X matrix with 5 knots is given as

$$X = \left[\begin{array}{llllll} 1 & x_{1} & {(x_{1}-10)}_{+} & {(x_{1}-20)}_{+} & {(x_{1}-30)}_{+} & {(x_{1}-37)}_{+} \\ 1 & x_{2} & {(x_{2}-10)}_{+} & {(x_{2}-20)}_{+} & {(x_{2}-30)}_{+} & {(x_{2}-37)}_{+} \\ {\vdots} & {\vdots} & {\vdots} & & {\vdots} & {\vdots} \\ 1 & x_{n} & {(x_{n}-10)}_{+} & {(x_{n}-20)}_{+} & {(x_{n}-30)}_{+} & {(x_{n}-37)}_{+} \end{array}\right] $$

and with 20 knots,

$$X = \left[\begin{array}{llllll} 1 & x_{1} & {(x_{1}-2)}_{+} & {\ldots} & {(x_{1}-36)}_{+} & {(x_{1}-37)}_{+} \\ 1 & x_{2} & {(x_{2}-2)}_{+} & {\ldots} & {(x_{2}-36)}_{+} & {(x_{2}-37)}_{+} \\ {\vdots} & {\vdots} & {\vdots} & & {\vdots} & {\vdots} \\ 1 & x_{n} & {(x_{n}-2)}_{+} & {\ldots} & {(x_{n}-36)}_{+} & {(x_{n}-37)}_{+} \end{array}\right]. $$
Fig. 2 Log-wage versus experience for white versus non-white college-educated males working in personal care and service

Adding knots at 10 and 30 years of experience allows the model to account for the commonly seen mid-career flattening period. However, the function is still not very smooth and it is hard to tell from this model when log wages start to flatten out. Adding more knots allows for more flexibility, but this can potentially lead to overfitting. For example, in the linear base model with 20 knots (upper-right panel of Fig. 2), the fitted line appears to be modeling noise.

Quadratic Spline Bases

The linear spline base model is a combination of linear bases. The quadratic spline base model is a combination of quadratic bases. In other words, we simply add the corresponding squared function for each of the linear base functions. Consider our previous broken stick model with a middle knot at x = 20; we can transform it into a quadratic spline base model with a knot at x = 20 by replacing (x − 20)+ with the following bases:

$$x^{2} , {(x-20)}_{+}^{2}. $$

This quadratic spline base model is represented by the following equation

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} +\beta_{2} {x}_{i}^{2} +\beta_{3} {(x_{i}-20)}_{+}^{2} + \epsilon_{i}, $$
(4)

and is illustrated in the bottom-right panel of Fig. 1. We can see that the quadratic spline base model suggests a slightly different relationship between experience and log wage. The predicted log wage increases more dramatically for the first 5 years of work experience, but flattens out thereafter. The racial gap seems to be small at first, but widens greatly over the first 5 years. Non-white workers appear to slowly catch up over the course of their careers.

One of the main advantages of the quadratic over the linear spline base model is that it does not have any sharp corners (i.e., undefined gradients). It follows that for any number of knots, the resulting function will have continuous first derivatives. This is both a useful and aesthetically pleasing property. Adding more knots (lower-right panel of Fig. 2) to the model adds more variability. It appears that for this example, 5 knots would be sufficient.

An important concept in economics (typically of secondary importance in statistics textbooks) is recovery of the gradients. In the linear case, the gradient between two particular knots is simply the sum of the estimated slope coefficients whose basis functions are “active” over that interval. In the quadratic (or higher-order) case, we use the same method to get the gradient as in a simple quadratic OLS model. The difference is that we calculate it between each knot. That is, to estimate a particular gradient for any type of spline model, we can simply take the partial derivative with respect to the regressor x. In its general form, our estimated gradient \(\widehat {\beta }(x)\) for a particular regressor x is

$$ \widehat{\beta}(x)= \frac{\partial\widehat{y}(x)}{\partial x}. $$
(5)

For our linear spline base example with 3 knots, this is

$$ \widehat{\beta}(x)= \widehat{\beta}_{1} + \left\{\begin{array}{ll} \widehat{\beta}_{2}, & \text{if } x \in [20, 37) \\ 0, & \text{otherwise} \end{array}\right. $$
(6)

and for our quadratic spline base example with 3 knots

$$ \widehat{\beta}(x)= \widehat{\beta}_{1} + 2 \widehat{\beta}_{2} x + \left\{\begin{array}{ll} 2\widehat{\beta}_{3} (x-20) , & \text{if } x \in [20, 37) \\ 0, & \text{otherwise} \end{array}\right. $$
(7)
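Given the fitted coefficients, the gradients in Eqs. 6 and 7 can be evaluated directly. A sketch for the quadratic spline case (Eq. 4), with the hypothetical cps data and the knot fixed at 20 as above:

```r
# Quadratic spline with one interior knot at 20 (Eq. 4)
fit_q20 <- lm(lwage ~ exper + I(exper^2) + I(pmax(exper - 20, 0)^2), data = cps)
b <- coef(fit_q20)

# Gradient in Eq. 7: beta1 + 2*beta2*x + 2*beta3*(x - 20)_+
grad_q20 <- function(x) unname(b[2] + 2 * b[3] * x + 2 * b[4] * pmax(x - 20, 0))

grad_q20(c(5, 25))  # marginal effect of experience at 5 and 25 years
```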

B-Splines

We introduced linear and quadratic spline models with the truncated power basis function. Using the same truncated power functions, those models can be generalized to

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} + {\ldots} +\beta_{p} {x}_{i}^{p} + \sum\limits_{j = 1}^{K} \beta_{p+j} {(x_{i}-\kappa_{j})}_{+}^{p} + \epsilon_{i}, $$
(8)

where p is the degree of the power basis (truncated power basis of degree p) and κ1,…,κK are the knots. This generalizes our model by allowing for (1) other spline models (using p degrees), and (2) other bases for a given spline model (using knots). This function has p − 1 continuous derivatives and thus higher values of p should lead to “smoother” spline functions. Similar to before, the general form of the gradient is defined as

$$ \widehat{\beta}(x)= \frac{\partial\widehat{y}(x)}{\partial x} = \widehat{\beta}_{1} + 2\widehat{\beta}_{2} x + {\ldots} + p\widehat{\beta}_{p} x^{p-1} + \sum\limits_{j = 1}^{K} p\widehat{\beta}_{p+j} {(x-\kappa_{j})}_{+}^{p-1}. $$
(9)

While this general form seems reasonable, splines computed from the truncated power bases in Eq. 8 may be numerically unstable. The values in the X-matrix may become very large (for large p), and the columns of the X-matrix may be highly correlated. This problem will only become worse with a higher number of knots. Therefore, Eq. 8 is rarely used in practice, but is instead typically transformed into equivalent bases with more stable numerical properties. One of the most popular is the B-spline basis.

This can be relatively difficult to present and code, but luckily there exist regression packages to easily transform the X-matrix into the more numerically stable version. Formally, we can compute the equivalence as

$$X_{b} = XL_{p}, $$

where X is a matrix of the bases (explanatory variables) used in Eq. 8 and Lp is a square invertible matrix. The most commonly used transformation in the linear case is

$$B(x)_{j}= \left\{\begin{array}{ll} \frac{x-\kappa_{j}}{\kappa_{j + 1}-\kappa_{j}}, & \text{if } x \in [\kappa_{j}, \kappa_{j + 1}) \\ \frac{\kappa_{j + 2}-x}{\kappa_{j + 2}-\kappa_{j + 1}}, & \text{if } x \in [\kappa_{j + 1}, \kappa_{j + 2})\\ 0, & \text{otherwise} \end{array}\right. $$

for each knot \(\kappa_{j}\) in the knot sequence.

To better illustrate this, consider our broken stick example from Fig. 1: the linear spline with one middle knot at 20 years of experience. The corresponding bases for this model are 1, x, and (x − 20)+ and are shown in the upper-left panel of Fig. 3. The B-spline transformation of the second knot (20 years of experience) for this example is

$$B(x)_{j = 2}= \left\{\begin{array}{ll} \frac{x-0}{20-0}, & \text{if } x \in [0, 20)\\ \frac{37-x}{37-20}, & \text{if } x \in [20, 37)\\ 0, & \text{otherwise} \end{array}\right. $$

The corresponding bases of this transformation are shown in the upper-right panel of Fig. 3. B(x)j= 2 corresponds to the inverse V-shaped function which equals 1 when experience equals 20. The other two functions can be computed similarly using j = − 1, and 3. Adding a higher degree to our model will change the shape of our basis functions. The two bottom panels of Fig. 3 show the equivalent truncated spline basis and B-spline basis for the cubic case (p = 3).

Fig. 3 Truncated and B-spline corresponding bases with knots at 0, 20, and 37 years of experience

While other basis functions exist (for example, radial basis functions), practitioners may prefer B-splines as they are both numerically more stable and relatively easy to compute. Both R and Stata packages are available. We used the bs(⋅) function in the splines packageFootnote 2 in R. The bspline module is available in Stata for B-splines.
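For example, the B-spline bases can be generated in R with bs() from the splines package. The knot placement below mirrors the broken stick example (interior knot at 20, boundary knots at 0 and 37); the cps data frame remains a hypothetical placeholder.

```r
library(splines)

# Linear B-spline basis with an interior knot at 20 years of experience
fit_b1 <- lm(lwage ~ bs(exper, degree = 1, knots = 20,
                        Boundary.knots = c(0, 37)), data = cps)

# Cubic B-splines (p = 3) use the default degree of bs()
fit_b3 <- lm(lwage ~ bs(exper, knots = 20, Boundary.knots = c(0, 37)),
             data = cps)
```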

Kernel Regression

Instead of assuming that the relationship between y and x comes from a polynomial family, we can state that the conditional mean is an unspecified smooth function m(⋅) and our model will be given as

$$ y_{i}=m(x_{i})+\epsilon_{i}, \quad i = 1,2,\ldots,n, $$
(10)

where the remaining variables are described as before. In much the same way spline regression can be thought of as an extension of OLS, kernel regression can be seen as an extension of weighted least-squares (WLS). That is, we are still minimizing a weighted residual sum of squares, but now we will weight observations by how close they are to the point of interest (i.e., a “local” sample). With spline regression, our local sample is defined as all the points included between two knots, where each point within that sample is weighted equally. Kernel regression goes a step further by estimating each point using a weighted local sample that is centered around the point of interest. The local sample is weighted using a kernel function, which possesses several useful properties.

A kernel function defines a weight for each observation within a (typically) symmetric predetermined bandwidth. Unlike an OLS regression which makes no distinction of where the data are located when estimating the conditional expectation, kernel regression will estimate the point of interest using data within a bandwidth.

Before introducing the kernel estimators, let us first derive a kernel function. Consider x, our point of interest; we can write an indicator function that counts the data falling within a range h (our bandwidth) around x:

$$n_{x} = \sum\limits_{i = 1}^{n} 1\left\{x-\frac{h}{2}\leq x_{i} \leq x+\frac{h}{2}\right\}.$$

The corresponding probability of falling in this box (centered on x) is thus \(n_{x}/n\). This sum of indicators can be rewritten as

$$ n_{x} = \sum\limits_{i = 1}^{n} \left( \frac{1}{2}\right) 1 \left\{\left| \frac{x_{i}-x}{h} \right| \leq 1 \right\}. $$
(11)

This function is better known as the uniform kernel and is more commonly written as

$$k(\psi)= \left\{\begin{array}{ll} 1/2,& \text{if } |\psi|\leq 1\\ 0, & \text{otherwise} \end{array}\right. $$

where we have written k(ψ) for convenience, where ψ is defined as \((x_{i}-x)/h\) and represents how “local” the observation xi is relative to x. Though very simple and intuitive, the uniform kernel is not smooth. It is discontinuous at − 1 and 1 (when the weight switches from 1/2 to zero) and has a derivative of 0 everywhere except at these two points (where it is undefined).

This kernel is rarely used, but it does possess some basic properties that we typically require of kernel functions. More formally, if we let the moments of the kernel be defined as

$$ \kappa_{j} (k)= {\int}_{-\infty}^{\infty} \psi^{j} k(\psi)d\psi, $$
(12)

these properties are

  1. \(\kappa_{0}(k) = 1\) (k(ψ) integrates to one),

  2. \(\kappa_{1}(k) = 0\) (k(ψ) is symmetric), and

  3. \(\kappa_{2}(k) < \infty\) (k(ψ) has a finite variance).

These are known as second-order kernels. In addition to the uniform kernel, several commonly used kernel functions can be found in Table 1 (with their second moments) and Fig. 4. Most of them are derived from the general polynomial family:

$$ k_{s}(\psi)= \frac{(2s + 1)!!}{2^{s + 1}s!} (1-\psi^{2})^{s} \textbf{1}\{|\psi|\leq 1\}, $$
(13)

where !! is the double factorial. The most commonly used kernel function in econometrics is the Gaussian kernel as it has derivatives of all orders. The most commonly used kernel function in statistics is the Epanechnikov kernel function as it has many desirable properties with respect to mean squared error. We will discuss how to choose the kernel function and smoothing parameter (h) in “Model Selection”.
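The kernels in Eq. 13 are simple to code directly. A sketch of the uniform (s = 0) and Epanechnikov (s = 1) kernels, plus the Gaussian kernel, which is not a member of the polynomial family:

```r
# Uniform kernel: s = 0 in Eq. 13
k_uniform <- function(psi) 0.5 * (abs(psi) <= 1)

# Epanechnikov kernel: s = 1 in Eq. 13
k_epan <- function(psi) 0.75 * (1 - psi^2) * (abs(psi) <= 1)

# Gaussian kernel (not in the polynomial family, but has derivatives of all orders)
k_gauss <- function(psi) dnorm(psi)

# Each integrates to one, is symmetric, and has a finite second moment
integrate(k_epan, -1, 1)$value
```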

Table 1 Commonly used second-order kernel functions

Fig. 4 Commonly used second-order kernel functions

Local-Constant Least-Squares

The classic kernel regression estimator is the local-constant least-squares (LCLS) estimator (also known as the Nadaraya-Watson kernel regression estimator, see Nadaraya (1964) and Watson (1964)). While it has fallen out of fashion recently, it remains useful as a teaching tool and in many situations (e.g., binary left-hand-side variables).

To begin, recall how we construct the OLS estimator. Our objective function is

$$\underset{\alpha,\beta}{\min}\sum\limits_{i = 1}^{n}(y_{i}-\alpha-x_{i}\beta)^{2}, $$

which leads to the slope and intercept estimators, \(\widehat {\beta }\) and \(\widehat {\alpha }\).

Suppose instead of a linear function of x, we simply regress y on a constant (a). Our objective function becomes

$$\underset{a}{\min} \sum\limits_{i = 1}^{n}[y_{i}-a]^{2} , $$

which leads to the estimator \(\widehat {a}=(1/n){\sum }^{n}_{i = 1}y_{i}=\bar {y}\). A weighted least-squares version of this objective function can be written as

$$\underset{a}\min\sum\limits_{i = 1}^{n}[y_{i}-a]^{2} W(x_{i}) , $$

where W(xi) is the weighting function, unique to the point xi. If we replace the weighting function with a kernel function, minimizing this objective function yields the LCLS estimator

$$ \widehat{a}=\widehat{m}(x) = \frac{{\sum}^{n}_{i = 1}y_{i} k\left( \frac{x_{i}-x}{h} \right)}{{\sum}^{n}_{i = 1} k\left( \frac{x_{i}-x}{h} \right)}. $$
(14)

This estimator represents a local average. Essentially, we regress y locally, on a constant, weighting observations via their distance to x.
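A base-R sketch of the LCLS estimator in Eq. 14, reusing the Gaussian kernel defined earlier; the bandwidth of 2 is purely illustrative.

```r
# LCLS fit (Eq. 14) at each evaluation point in x0
lcls <- function(x0, x, y, h, k = k_gauss) {
  sapply(x0, function(pt) {
    w <- k((x - pt) / h)   # kernel weights relative to the point of interest
    sum(w * y) / sum(w)    # local (weighted) average
  })
}

# Fitted log-wage profile over the observed range of experience
grid <- seq(min(cps$exper), max(cps$exper), length.out = 100)
mhat <- lcls(grid, cps$exper, cps$lwage, h = 2)
```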

While Eq. 14 gives us the fit, economists are typically interested in the marginal effects (i.e., gradients). To estimate a particular gradient, we simply take the partial derivative of \(\widehat {m}(x)\) with respect to the regressor of interest, x. Our estimated gradient \(\widehat {\beta }(x)\) is thus

$$ \widehat{\beta}(x)= \frac{ \left( {\sum}^{n}_{i = 1} y_{i} \frac{\partial k\left( \frac{x_{i}-x}{h} \right)} {\partial x} \right) \left( {\sum}^{n}_{i = 1} k\left( \frac{x_{i}-x}{h} \right) \right) - \left( {\sum}^{n}_{i = 1} y_{i} k\left( \frac{x_{i}-x}{h} \right) \right) \left( {\sum}^{n}_{i = 1} \frac{\partial k\left( \frac{x_{i}-x}{h} \right)}{\partial x} \right) } { \left( {\sum}^{n}_{i = 1} k\left( \frac{x_{i}-x}{h} \right) \right)^{2} } , $$
(15)

where, for example, \(\frac {\partial k\left (\frac {x_{i}-x}{h} \right )}{\partial x}=\left (\frac {x_{i}-x}{h^{2}}\right )k\left (\frac {x_{i}-x}{h}\right )\) for the Gaussian kernel. Higher-order derivatives can be derived in a similar manner.

Local-Linear Least-Squares

While the LCLS estimator is intuitive, it suffers from biases near the boundary of the support of the data. As an alternative, most applied researchers use the local-linear least-squares (LLLS) estimator. The LLLS estimator locally fits a line as opposed to a constant.

The local-linear estimator is obtained by taking a first-order Taylor approximation of Eq. 10 via

$$y_{i} \approx m(x) + (x_{i}-x)\beta(x)+ \epsilon_{i},$$

where β(x) is the gradient. Similar to the LCLS case, by labeling m(x) and β(x) as the parameters a and b, we get the following minimization problem

$$\underset{a,b}\min \sum\limits_{i = 1}^{n}[y_{i}-a - (x_{i}-x)b]^{2} k\left( \frac{x_{i}-x}{h} \right),$$

which, in matrix notation (with q regressors) is

$$\underset{\delta}\min (y-X\delta)'K(x) (y-X\delta) ,$$

where δ = (a,b), X is an n × (q + 1) matrix with its i th row equal to \((1,(x_{i}-x))\) and K(x) is an n × n diagonal matrix with its i th element equal to \({\prod }_{j = 1}^{q} k\left (\frac {x_{ji}-x_{j}}{h_{j}}\right )\). This leads to the LLLS estimators of the conditional expectation (\(\widehat {m}(x)\)) and gradient (\(\widehat {\beta }(x)\)) as

$$\widehat{\delta}(x) = \begin{pmatrix} \widehat{m}(x) \\ \widehat{\beta}(x) \end{pmatrix} = (X^{\prime}K(x)X)^{-1}X^{\prime}K(x)y.$$

Notice that we can obtain the OLS estimator by replacing K(x) by an identity matrix (giving all observations equal weight, i.e., each bandwidth tending towards infinity), the weighted least-squares (WLS) estimator by replacing it with some other weighting function, and the generalized least-squares (GLS) estimator by replacing it with the inverse of the variance-covariance matrix of the errors (Ω).
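A sketch of the LLLS estimator for a single regressor: solve the weighted least-squares problem at each evaluation point, where the first element of \(\widehat{\delta}(x)\) is the fit and the second is the gradient. The kernel and bandwidth are again illustrative.

```r
# LLLS at a single point x0: delta_hat = (X'K X)^{-1} X'K y
llls_point <- function(x0, x, y, h, k = k_gauss) {
  X <- cbind(1, x - x0)           # i-th row: (1, x_i - x0)
  w <- k((x - x0) / h)            # diagonal elements of K(x0)
  delta <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
  c(fit = delta[1], gradient = delta[2])
}

# Fit and marginal effect of experience at 10 years
llls_point(10, cps$exper, cps$lwage, h = 2)
```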

Figure 5 gives both the LCLS and LLLS estimates for white (solid line) and non-white (dashed line) college-educated males working in personal care and service. The gradients for each level of experience are also shown. Compared to the LCLS model, the LLLS model captures a stronger increase in log wage during the first 5 years of work experience with gradients ranging from 0.10 to 0.17. If taken literally, after only a year of working in personal care and service, white college-educated males’ wages increase by almost 17% on average, while non-white college-educated males’ wages increase by about 7%. The LCLS model, while showing a similar overall shape, shows a much slower increase in those first few years of work experience with less than 4% increases in wages for non-whites and 5% to 8% increases for whites. Both models suggest that while white workers have much higher percentage increases in their wages in the first few years, those year-to-year percentage increases in their wages fall below those of non-white workers after 10 years of experience.

Fig. 5 Log-wage versus experience for white versus non-white college-educated males working in personal care and service

Local-Polynomial Least-Squares

The derivation of the LLLS estimator can be generalized to include higher-order expansions. The resulting family of estimators is called local-polynomial least-squares (LPLS) estimators. For the general case, if we are interested in the p th-order Taylor expansion, and we assume that the (p + 1)th derivative of the conditional mean at the point x exists, we can write our equation as

$$y_{i} \approx m(x) + (x_{i}-x)\frac{\partial m(x)}{\partial x} + (x_{i}-x)^{2}\frac{\partial^{2} m(x)}{\partial x^{2}} \frac{1}{2!}+ {\ldots} + (x_{i}-x)^{p}\frac{\partial^{p} m(x)}{\partial x^{p}} \frac{1}{p!} + \epsilon_{i}.$$

Replacing the parameters by (a0,…,ap), our kernel weighted least-squares problem can be written as

$$\underset{a_{0},\ldots,a_{p}}{\min}\sum\limits_{i = 1}^{n}\left[y_{i}-a_{0} - (x_{i}-x)a_{1} - (x_{i}-x)^{2}a_{2} -\ldots- (x_{i}-x)^{p}a_{p}\right]^{2} k\left( \frac{x_{i}-x}{h} \right) .$$

In matrix notation, our objective function becomes

$$\underset{\delta}{\min} (y-X\delta)'K(x) (y-X\delta), $$

where the only difference from the LLLS case (p = 1) is that the i th row of X is defined as [1,(xix),(xix)2,…,(xix)p] and δ = (a0,a1,…,ap). Minimizing the objective function leads to the local-polynomial least-square estimator

$$\widehat{\delta}(x)= \left( \widehat{m}(x), \frac{\partial \widehat{m}(x)}{\partial x}, \frac{\partial^{2} \widehat{m}(x)}{\partial x^{2}}, \ldots, \frac{\partial^{p} \widehat{m}(x)}{\partial x^{p}} \right)^{\prime}= (X^{\prime}K(x)X)^{-1}X^{\prime}K(x)y.$$

The first question then becomes, how many expansions should we take? More expansions lead to less bias, but increased variability. This becomes a bigger problem when the number of covariates (q) is large and the sample size (n) is small. One promising data driven method to determine the number of expansions is considered in Hall and Racine (2015).

As is the case for splines, there exist options to employ these methods in popular software packages. In R we recommend the np package (Hayfield and Racine 2008) and in Stata we recommend the npregress command.
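A hedged sketch of a typical np workflow in R (local-linear regression with least-squares cross-validated bandwidths); argument names should be checked against the package documentation, and cps is again a hypothetical data frame.

```r
library(np)

# Bandwidths by least-squares cross-validation, local-linear regression
bw  <- npregbw(lwage ~ exper, regtype = "ll", bwmethod = "cv.ls", data = cps)
fit <- npreg(bws = bw, gradients = TRUE)

summary(fit)
head(fitted(fit))     # conditional mean estimates
head(gradients(fit))  # marginal effects of experience
```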

Model Selection

For both spline and kernel regression, many seemingly arbitrary choices can greatly influence fit. The typical trade-off is between bias and variance. We want to make selections such that we avoid overfitting or underfitting. In this section, we first discuss penalty selection, knot selection, and degree selection in spline models; and then, kernel and bandwidth selection in kernel models.

Spline Penalty and Knot Selection

In “Spline Regression”, we saw that the fit is influenced by both our choice of degree of the piecewise polynomials, and by the number and locations of knots we include. However, in spline models, there is a third, more direct way, to influence fit: add an explicit penalty. In short, we want to select the degree of the piecewise polynomials, the knot locations, and the smoothing parameter λ (penalty) which best capture the underlying shape of our data. Though we will briefly discuss the selection of all three, it is easy to show that the choices of degree and knots are much less crucial than the choice of λ, the smoothing parameter (we will see a similar result for kernel regression). That is, when using a high enough number of knots and degrees, the “smoothness” of our fit can be controlled by λ. Hence, we will focus most of our discussion on the choice of λ when the degree and number of knots are fixed. Although there exist several ways to select our parameters in a data-driven manner, we will concentrate on one of the most commonly used approaches: cross-validation (CV).

Penalty Selection Using Cross Validation

There are several ways to impose a penalty, but here we focus on a method that avoids extreme values (and hence too much variability). In a univariate setting using a linear spline, this penalty is

$$\sum\limits_{j = 1}^{K} \beta_{1+j}^{2} \leq C, $$

where \(\beta_{1+j}\) is the coefficient on the j th knot.Footnote 3 In matrix form, our constrained objective function can thus be written as

$$\underset{\beta}{\min} \mid \mid y- X\beta \mid \mid^{2} \textmd{ s.t. } \beta^{\prime} D\beta \leq C, $$

and leads to the LagrangianFootnote 4

$$ \mathcal{L}(\beta,\lambda) = \underset{\beta,\lambda}{\min} \mid \mid y - X\beta \mid \mid^{2} + \lambda^{2} \beta^{\prime} D\beta, $$
(16)

where D is a diagonal matrix with diagonal elements (0, 0, 1, …, 1), that is, zeros for the intercept and slope coefficients and ones for the K knot coefficients. Note that consistency will require that λ tends towards zero as the sample size (n) tends towards infinity.

The second term of Eq. 16 is called a roughness penalty because it penalizes the curvature of our estimated function through the value of the smoothing parameter (λ). This type of regression is referred to as a penalized spline (p-spline) regression and yields the following solution and fitted values:

$$\widehat{\beta}_{\lambda} = (X^{\prime}X + \lambda^{2}D)^{-1}X^{\prime}y $$
$$\widehat{y} = X(X^{\prime}X + \lambda^{2}D)^{-1}X^{\prime}y. $$

To generalize these results to the p th degree spline model (8), we replace \(\lambda^{2}\) by \(\lambda^{2p}\) and transform the D-matrix so that its diagonal contains p + 1 zeros (for the polynomial coefficients) followed by K ones (for the knot coefficients).Footnote 5 A penalized B-spline (PB-spline) would simply include the transformation done to X (i.e., the square invertible matrix Lp) in the penalty term as well:

$$\widehat{y} = X_{B}(X^{\prime}_{B}X_{B} + \lambda^{2p}L^{\prime}_{p}DL_{p})^{-1}X^{\prime}_{B}y.$$
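A sketch of the linear p-spline in base R: build the truncated-power design matrix, form D with zeros for the polynomial coefficients and ones for the knot coefficients, and apply the closed-form solution above. The knot spacing and λ are illustrative choices, and cps is hypothetical.

```r
# Truncated-power design matrix with knots every 5 years of experience
knots <- seq(5, 35, by = 5)
X <- cbind(1, cps$exper,
           sapply(knots, function(kap) pmax(cps$exper - kap, 0)))
y <- cps$lwage

# Penalty matrix: zeros for (intercept, slope), ones for the knot coefficients
D <- diag(c(0, 0, rep(1, length(knots))))

# Penalized least-squares solution for a given lambda
lambda <- 10
beta_l <- solve(t(X) %*% X + lambda^2 * D, t(X) %*% y)
yhat   <- X %*% beta_l
```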

As \(\lambda^{2p} \rightarrow \infty\) (infinite smoothing), the curvature penalty becomes predominant and the estimate converges to OLS. As \(\lambda^{2p} \rightarrow 0\), the curvature penalty becomes insignificant. In this case, the function will become rougher (we will see a similar result with the bandwidth parameter for a LLLS regression). Figure 6 illustrates this effect using linear p-spline estimates for college-educated males working in personal care and service. The knots have been fixed at every five years of experience (0, 5, 10, ...). As the penalty (λ) increases, it is clear that the fit becomes smoother and converges to an OLS estimate.

Fig. 6 Log-wage versus experience for college-educated males working in personal care and service with different penalty (λ) factors

Figure 6 shows an intuitive fit of the data for a value of λ around 10. However, using a more systematic method to select λ would lead to less subjective and more comparable results. If we let \(\widehat {m}(x_{i};\lambda )\) be our nonparametric regression estimate at the point \(x_{i}\) with smoothing parameter λ, we can write a residual sum of squares objective function as

$$ RSS (\lambda) = \sum\limits_{i = 1}^{n} \left[y_{i} - \widehat{m}(x_{i};\lambda) \right]^{2}. $$
(17)

The problem with this approach is that \(\widehat {m}(x_{i};\lambda )\) uses \(y_{i}\) itself, as well as the other observations, to predict \(y_{i}\). This objective function is minimized when λ = 0. This problem can be avoided by using a leave-one-out estimator. Least-Squares Cross-Validation (LSCV) is the technique whereby we minimize Eq. 17, where the fit is replaced by a leave-one-out estimator

$$ CV(\lambda) = \sum\limits_{i = 1}^{n} \left[y_{i} - \widehat{m}_{-i}(x_{i};\lambda) \right]^{2}, $$
(18)

where \(\widehat {m}_{-i}\left (\cdot \right )\) is our leave-one-out estimator, and is defined as our original nonparametric regression estimator \(\widehat {m}\left (\cdot \right )\) applied to the data, but with the point (xi,yi) omitted. We will thus choose a smoothing parameter \(\widehat {\lambda }_{CV}\) that will minimize CV (λ) over λ ≥ 0.
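A brute-force sketch of the leave-one-out criterion in Eq. 18 for the p-spline above (reusing X, y and D from the earlier sketch); each candidate λ requires n refits, which is slow but transparent.

```r
# CV(lambda) from Eq. 18 via explicit leave-one-out refits
cv_pspline <- function(lambda, X, y, D) {
  n <- length(y)
  sum(sapply(1:n, function(i) {
    b_mi <- solve(t(X[-i, ]) %*% X[-i, ] + lambda^2 * D,
                  t(X[-i, ]) %*% y[-i])
    (y[i] - X[i, ] %*% b_mi)^2
  }))
}

lambdas <- seq(0, 30, by = 1)
cv_vals <- sapply(lambdas, cv_pspline, X = X, y = y, D = D)
lambdas[which.min(cv_vals)]  # CV-chosen penalty
```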

Using the same number of knots, the top panel of Fig. 7 shows the corresponding CV and RSS curves at different values of λ. We can see that the RSS curve is strictly increasing, as theory predicts, so minimizing it would choose a λ of zero. The CV curve, on the other hand, decreases at first and reaches a minimum when λ = 7. The resulting fit (bottom panel of Fig. 7) is smoother than what the RSS criterion would provide.Footnote 6

Fig. 7 Objective functions for choosing penalty factors for linear p-splines for college-educated males working in personal care and service

Knots and Degree Selection

Using an “optimal” λ and the CV criterion, we can compare p-spline models that use different numbers (and locations) of knots and different bases (degrees). From experimenting with the number of knots and degrees, the literature finds that (1) adding more knots only improves the fit up to a relatively small number of knots; and (2) when using many knots, the minimum CV values for linear and quadratic fits become indistinguishable. In general, we suggest using quadratic or cubic basis functions.

Though there exist more formal criteria to select the number and location of knots, Ruppert et al. (2003) provide simple default rules which often work well. Their default choice of K is

$$K = \min\left( \frac{1}{4}\times \text{number of unique } x_{i},\ 35\right), $$

where K is the number of knots. For knot locations they suggest the equi-quantile rule

$$\kappa_{k} = \left( \frac{k + 1}{K + 2}\right)\text{th sample quantile of the unique } x_{i}, $$

for k = 1,…,K.
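In R, the default rules above can be implemented in a few lines (a sketch, using the hypothetical cps data):

```r
x_unique <- unique(cps$exper)

# Default number of knots: min(n_unique/4, 35)
K <- min(floor(length(x_unique) / 4), 35)

# Equi-quantile knot locations
knots <- quantile(x_unique, probs = (1 + seq_len(K)) / (K + 2))
```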

Eilers and Marx (1996, 2010) argue that equally spaced knots are always preferred. Eilers and Marx (2010) present an example where equally spaced knots outperform quantile spaced knots. The best type of knot spacings is still under debate and both methods are still commonly used.Footnote 7

While knot location and degree selection usually have little effect on the fit when using a “sufficiently” large number of knots, they may become important when dealing with more complex problems, for example, when trying to smooth regression functions with strongly varying local variability or with sparse data. In these cases, using a more sophisticated selection algorithm may be more appropriate.

Kernel and Bandwidth Selection

Choosing a kernel function is similar to choosing the degree of the piecewise polynomials in spline models, and choosing the size of the bandwidth (h) is similar to choosing the number and location of knots. There exist equivalents to having a direct penalty (λ) incorporated in a kernel model, but those are rarely used in applied kernel estimation. We will therefore focus our discussion on kernel and bandwidth selection.

Similar to adding more knots or decreasing the penalty λ in a spline model, decreasing the bandwidth will lead to less bias, but more variance. Figure 8 illustrates this effect using LLLS and a Gaussian kernel for college-educated males working in personal care and service. As the size of the bandwidth (h) increases, the fit becomes smoother and converges to OLS.

Fig. 8 Log-wage versus experience for college-educated males working in personal care and service when varying the bandwidth (h) parameter

The bandwidth and kernel can be chosen via the asymptotic mean squared error (AMSE) criterion (or, more specifically, via the asymptotic mean integrated squared error, AMISE). In practice, the fit will be more sensitive to a change in bandwidth than a change in the kernel function. Reducing the bandwidth (h) leads to a decrease in the bias at the expense of increasing the variance. In practice, as the sample size (n) tends to infinity, we need to reduce the bandwidth (h) slowly enough so that the amount of “local” information (nh) also tends to infinity. In short, consistency requires that

$$\text{as } n\rightarrow \infty \text{, we need } h\rightarrow 0 \text{ and } nh\rightarrow \infty .$$

The bandwidth is therefore not just some parameter to set, but requires careful consideration. While many may be uncomfortable with an estimator that depends so heavily on the choice of a parameter, remember that this is no worse than pre-selecting a parametric functional form to fit your data.

Cross-Validation Bandwidth Selection

In practice, there exist several methods to obtain the “optimal” bandwidth which differ in the way they calculate the asymptotic mean squared error (or asymptotic mean integrated squared error). Three typical approaches to bandwidth selection are: (1) reference rules-of-thumb, (2) plug-in methods, and (3) cross-validation methods. Each has its distinct strengths and weaknesses in practice, but in this survey we will focus on the data-driven method: cross-validation.Footnote 8 Henderson and Parmeter (2015) provide more details on each of these methods.

LSCV is perhaps the most popular tool for cross-validation in the literature. This criterion is the same as the one described in “Penalty Selection Using Cross Validation” to select the penalty parameter in spline regression. That is, we use a leave-one-out estimator

$$ CV(h) = \sum\limits_{i = 1}^{n} \left[y_{i} - \widehat{m}_{-i}(x_{i}) \right]^{2}, $$
(19)

whereby we minimize the objective function with respect to h instead of λ and the (LCLS) leave-one-out estimator is defined as

$$\widehat{m}_{-i}(x_{i}) = \frac{\underset{j\neq i}{\sum\limits_{j = 1}^{n}} y_{j} K_{h}(x_{j}, x_{i})}{\underset{j\neq i}{\sum\limits_{j = 1}^{n}} K_{h}(x_{j}, x_{i})}. $$
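A sketch of LSCV for the LCLS estimator, reusing the Gaussian kernel from earlier and minimizing Eq. 19 over h with optimize(); the search interval is an illustrative choice.

```r
# Leave-one-out LCLS fit at each x_i, then the LSCV criterion in Eq. 19
lscv <- function(h, x, y, k = k_gauss) {
  sum(sapply(seq_along(y), function(i) {
    w <- k((x[-i] - x[i]) / h)
    (y[i] - sum(w * y[-i]) / sum(w))^2
  }))
}

# Minimize CV(h) over a plausible range of bandwidths
h_cv <- optimize(lscv, interval = c(0.1, 10),
                 x = cps$exper, y = cps$lwage)$minimum
```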

In the top panel of Fig. 9, we show a figure analogous to that presented in “Penalty Selection Using Cross Validation”. It shows the corresponding CV and RSS curves for different bandwidths. When failing to use the leave-one-out estimator, the RSS curve is strictly increasing (i.e., the optimal bandwidth is zero). Using the leave-one-out estimator, the objective function is minimized at h = 1.62. The resulting fit (bottom panel of Fig. 9) shows more variation than the linear p-spline (Fig. 7). This is not surprising as the linear p-spline forces a linear fit between each knot. The two graphs would have looked more similar if we had used a cubic p-spline, allowing for curvature between knots.

Fig. 9 Objective functions for choosing bandwidths for kernel estimators for college-educated males working in personal care and service

Kernel Function Selection

Kernel selection is typically considered to be of secondary importance as it is believed to make only minor differences in practice. The optimal kernel function, in the AMISE sense, is the Epanechnikov kernel function. However, as stated previously, it may not be useful in some situations as it does not possess more than one derivative. Gaussian kernels are often used in economics as they possess derivatives of all orders, but there are losses in efficiency. In the univariate density case, the loss in efficiency is around 5%. However, Table 3.2 of Henderson and Parmeter (2015) shows that this loss in efficiency increases with the dimension of the data (at least in the density estimation case). In practice, it may make sense to see if the results of a study are sensitive to the choice of kernel.

Splines versus Kernels

In these single-dimension cases, our spline and kernel estimates are more or less identical. Spline regressions have the advantage that they are much faster to compute. While it is uncommon to have an economic problem with a single covariate, if that were the case, we likely would suggest splines.

In a multiple variable setting, the differences between the two methods are more pronounced. The computation time for kernels increases exponentially with the number of dimensions. The additional computational time required for splines is minor. On the other hand, kernels handle interactions and discrete regressors well (both common features in economic data; see Ma et al. (2015) for using discrete kernels with splines). It is also relatively easier to extract gradients with kernel methods.

In reality there are camps: those who use kernels and those who use splines. However, the better estimator probably depends upon the problem at hand. Both should be considered in practice.

Instrumental Variables

Nonparametric methods are not immune to the problem of endogeneity. A first thought about how to handle this issue would be to use some type of nonparametric two-stage least-squares procedure. However, this is not feasible as there exists an ill-posed inverse problem (to be discussed below). It turns out that this problem can be avoided by using a control function approach much like that in the parametric literature (e.g., see Cameron and Trivedi (2010)).

To motivate this problem, consider a common omitted-variable problem in labor economics: ability in the basic compensation model. A (potentially) correctly specified wage equation could be described as:

$$ \log(wage) = \beta_{0} + \beta_{1}educ + \beta_{2}z_{1} + \beta_{3}abil + \epsilon, $$
(20)

where educ is years of education, abil is ability, and z1 is a vector of other relevant characteristics (e.g., experience, gender, race, marital status). However, in applied work, ability (abil) cannot be directly measured/observed.

If we ignore ability (abil), it will become part of the error term

$$ \log(wage) = \beta_{0} + \beta_{1}educ + \beta_{2}z_{1} + u, $$
(21)

where u = 𝜖 + β3abil. Because abil is correlated with educ, the regressor educ is correlated with u, and our resulting estimated return to education (β1) will be biased and inconsistent. We can resolve this problem if we can find an instrumental variable (IV) which is uncorrelated with u (and so uncorrelated with ability), but correlated with educ. Several IVs have been considered in the literature for this particular model,Footnote 9 each with their own strengths and weaknesses, but for the purpose of this illustration, we will use spouse’s wage. That is, we assume that spouse’s wage is correlated with education, but not with ability.

In the parametric setting, the control function (CF) approach to IVs is a two-step procedure. In the first step, we regress the endogenous variable on the exogenous vector z:

$$educ = \gamma_{0} + \gamma_{1} z + v ,$$

where z = (z1,spwage) and spwage is the spouse’s wage, and obtain the reduced form residuals \(\widehat {v}\). In the second step, we add \(\widehat {v}\) to Eq. 21 and regress

$$\log(wage) = \beta_{0} + \beta_{1} educ + \beta_{2} z_{1} + \beta_{3} \widehat{v} + u.$$

Once we directly control for v, educ is no longer treated as endogenous.
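A sketch of the parametric two-step control function in R, with hypothetical variable names (lwage, educ, exper, spwage); note that second-stage standard errors would need to be adjusted for the generated regressor.

```r
# Step 1: reduced form for the endogenous regressor
step1 <- lm(educ ~ exper + spwage, data = cps)
cps$vhat <- resid(step1)

# Step 2: add the reduced-form residuals as a control function
step2 <- lm(lwage ~ educ + exper + vhat, data = cps)
coef(step2)["educ"]  # return to education after controlling for vhat
```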

The Ill-Posed Inverse Problem and Control Function Approach

Let us first go back and consider the general nonparametric regression setting

$$ y = m(x) + u, $$
(22)

where E[u|x]≠ 0, but there exists a variable z such that E[u|z] = 0. For the moment, assume that x and z are scalars.

Using the condition

$$E[u|z] = E[y-m(x)|z] = 0,$$

yields the conditional expectation

$$ E[y|z] = E[m(x)|z] = \int m(x)f(x|z)dx. $$
(23)

Although we can estimate both the conditional mean of y given z (E[y|z]) and the conditional density of x given z (f(x|z)), we cannot recover m(x) by inverting the relationship. That is, even though the integral in Eq. 23 is continuous in m(x), inverting it to isolate and estimate m(x) does not represent a continuous mapping. This is the so-called ill-posed inverse problem and it is a major issue when using instrumental variables in nonparametric econometrics.

Luckily, we can avoid this problem by placing further restrictions on the model (analogous to additional moment restrictions in a parametric model). Here we consider a control function approach. Similar to the parametric case above, we consider the triangular framework

$$ x = g(z) + v, $$
(24)
$$ y = m(x) + u $$
(25)

with the conditions E[v|z] = 0 and E[u|z,v] = E[u|v]. The first condition implies that z is a valid instrument for x and the second allows us to estimate m(x) and avoid the ill-posed inverse problem. It does so by restricting u to depend on x only through v. More formally,

$$\begin{array}{@{}rcl@{}} E[y|x,v] & =& m(x) + E[u|x-v, v] \\ & = &m(x) + E[u|z,v] \\ & = &m(x) + E[u|v] \\ & = &m(x) + r(v), \end{array} $$

and hence

$$ m(x) = E[y|x,v] - r(v), $$
(26)

where both E[y|x,v] and r(v) can be estimated nonparametrically. In short, we control for v through the nonparametric estimation of the function r(v).

Spline Regression with Instruments

Now that we have the basic framework, we can discuss nonparametric estimation in practice. Consider our previous compensation model, but without functional form assumptions:

$$ \log(wage) = m(educ, z_{1}) + u $$
(27)

where ability (abil) is unobserved. It is known that abil will be correlated with both the error u and the regressor educ (i.e., E[u|educ]≠ 0). Similar to the parametric setting, if we have an instrument z such that

$$E[u|z] = E[\log(wage)-m(educ, z_{1})|z] = 0,$$

we can avoid the bias due to endogeneity. We again define z = (z1,spwage), where our excluded instrument, spwage, is the spouse’s wage. This yields

$$E[\log(wage)|z] = E[m(educ, z_{1})|z]. $$

Our problem can now be written via the triangular system attributable to Newey et al. (1999):

$$\begin{array}{@{}rcl@{}} educ &=& g(z) + v \end{array} $$
(28)
$$\begin{array}{@{}rcl@{}} \log(wage) &=& m(educ, z_{1}) + u, \end{array} $$
(29)

where E[v|z] = 0 and E[u|z,v] = E[u|v]. Similar to the parametric case, we first estimate the residuals from the reduced-form equation (i.e., \(\widehat {v}\)). We then include the reduced-form residuals nonparametrically as an additional explanatory variable:

$$ \log(wage) = w(educ, z_{1}, \widehat{v}) + u, $$
(30)

where \(w\left (educ,z_{1},\widehat {v}\right )\equiv m\left (educ,z_{1}\right )+r\left (\widehat {v}\right )\) and \(\widehat {m}\left (educ,z_{1}\right )\) can be recovered by only extracting those terms that depend upon educ and z1. Note that we need to use splines that do not allow for interactions between educ or z1 and \(\widehat {v}\) (interactions between educ and z1 are allowed).
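A sketch of the two-step spline control function, keeping \(\widehat{v}\) additively separable from educ and z1 as required above. Cubic B-splines are generated with bs(); variable names and degrees of freedom are hypothetical.

```r
library(splines)

# Step 1: nonparametric (spline) reduced form for education
stage1 <- lm(educ ~ bs(exper, df = 5) + bs(spwage, df = 5), data = cps)
cps$vhat <- resid(stage1)

# Step 2: vhat enters additively (no interactions with educ or exper)
stage2 <- lm(lwage ~ bs(educ, df = 5) + bs(exper, df = 5) + bs(vhat, df = 5),
             data = cps)

# m_hat(educ, exper) is recovered from the educ and exper terms only
```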

In what follows, we will use cubic B-splines (the default for most R packages) with equi-quantile knots (see “Model Selection”) in both stages. Figure 10 shows the fitted results for the first stage. In the left panel, we see that as individuals’ experience increases, the level of education slowly decreases, with a significant drop after 35 years of work experience. In the right panel, we observe a quadratic relationship between education level and spouse’s wage. That is, the higher the spouse’s wage, the higher the individual’s education, but for individuals whose spouses have a high level of income, the relationship becomes negative.

Fig. 10 First-stage estimates for education for college-educated males working in personal care and service versus experience and spousal income

The fitted plots from the second stage are given in Fig. 11. Controlling for education,Footnote 10 men’s log wage seems to increase in the first few years on the job, stabilizes mid-career, and then decreases towards the end of their career in personal care and service. Education seems to affect log wage positively only after 10–12 years of schooling (high-school level).

Fig. 11 Second-stage estimates for log-wage for college-educated males working in personal care and service versus experience and education level

Kernel Regression with Instruments

Estimation via kernels is relatively straightforward given what we have learned above. We again use a control function approach, but with local-polynomial estimators. Kernel estimation of this model was introduced by Su and Ullah (2008) and is outlined in detail in Henderson and Parmeter (2015). In short, the first stage requires running a local-p1th order polynomial regression of the endogenous regressor on z, obtaining the residuals and then running a local-p2th order polynomial regression of y on the endogenous regressor, the included exogenous regressors and the residuals from the first stage.

More formally, our first stage regression model for our example is

$$ educ = g\left( z\right) + v, $$
(31)

and the residuals from this stage are used in the second stage regression

$$ \log(wage) = w(educ, z_{1}, \widehat{v}) + u, $$
(32)

where \(w\left (educ,z_{1},\widehat {v}\right )\equiv m\left (educ,z_{1}\right )+r\left (\widehat {v}\right )\).

In spline regression, we simply took the estimated components not related to \(\widehat {v}\) from \(\widehat {w}\left (\cdot \right )\) in order to obtain the conditional expectation \(\widehat {m}\left (educ,z_{1}\right )\). However, disentangling the residual term is a bit more difficult in kernel regression. While it is feasible to estimate additively separable models, we follow Su and Ullah (2008) and remove it via counterfactual estimates in conjunction with the zero-mean assumption on the errors. Under the assumption that E(u) = 0, we recover the conditional mean estimate via

$$ \widehat{m}\left( educ,z_{1}\right)=\frac{1}{n}\sum\limits_{i = 1}^{n}\widehat{w}\left( educ,z_{1},\widehat{v}_{i}\right), $$
(33)

where \(\widehat {w}\left (educ,z_{1},\widehat {v}_{i}\right )\) is the counterfactual estimator of the unknown function using bandwidths from the local-p2 order polynomial regression in the second step (derivatives can be obtained similarly, by summing over the counterfactual derivatives of \(\widehat {w}\left (\cdot \right )\)).Footnote 11

Bandwidth selection and order of the polynomials (p1 and p2) are a little more complicated. Here we will give a brief discussion, but suggest the serious user consult Chapter 10 (which includes a discussion of weak instruments) of Henderson and Parmeter (2015).

Bandwidth selection is important in both stages. In the first stage, v is not observed and we want to make sure that estimating it does not unduly impact the second stage. If the conditions in Su and Ullah (2008) are satisfied, the following cross-validation criterion can be used

$$ CV\left( h_{2}\right)=\underset{h_{2}}\min \frac{1}{n}\sum\limits_{i = 1}^{n}\left[y_{i}-\widehat{m}_{-i}\left( educ_{i},z_{1i}\right)\right]^{2}, $$
(34)

and the first-stage bandwidths can be constructed as

$$ \widehat{h}_{1}=\widehat{h}_{2} n^{-\gamma}, $$
(35)

where the acceptable values for γ depend upon the order of the polynomials in each stage.Footnote 12

Henderson and Parmeter (2015) give the admissible combinations of polynomial orders for the Su and Ullah (2008) estimator with a single endogenous variable and a single excluded instrument. In practice, they suggest using a local-cubic estimator in the first stage (local-linear in the first stage is never viable) and a local-linear estimator in the second stage for a just-identified model with a single endogenous regressor. For other cases, the conditions of Assumption A5 in Su and Ullah (2008) need to be checked.

Using the methods outlined above, Fig. 12 shows the impact of controlling for endogeneity. The upper-left panel gives density plots for the gradient estimates across the sample for returns to education both with and without using instrumental variables. Most college-educated men working in personal care and service have a wage increase of about 5 to 15% for each additional year they spend in school. However, the distribution is skewed to the left suggesting that a few men have seen their investment in education yield no returns or even negative returns (for a similar result in a nonparametric setting see Henderson et al. (2011)).

Fig. 12 Second-step gradients of log-wage with respect to education for college-educated males working in personal care and service

Comparing those results with the gradients without instruments, we clearly see that failing to control for endogeneity would overestimate the returns to education (as expected). That is, the distribution of gradients without using IV is more concentrated around 10 to 15% returns, with less mass on low returns.

Returning to the examples from before, the upper-right panel of Fig. 12 gives densities of gradient estimates controlling for endogeneity for whites versus non-whites. The figure seems to suggest that non-whites have higher rates of return to education. This is commonly found in the literature, but is often attributed to lower average years of education. To compare apples to apples, in the bottom two panels we plot the densities of returns to education for fixed levels of education (high school and college, respectively). Here we see that while the general shape is similar, whites tend to get more mass on the higher returns and less mass on lower returns, especially for college graduates.Footnote 13