Introduction

This survey aims to (re-)introduce applied labor economists to nonparametric regression techniques. Specifically, we discuss both spline and kernel regression, in an approachable manner. We present an intuitive discussion of estimation and model selection for said methods. We also address the use of nonparametric methods in the presence of endogeneity, a common issue in the labor literature, but seldom accounted for in applied nonparametric work.

Accounting for endogeneity is well understood in the parametric literature once a suitable instrument is obtained. Standard methods have been around for some time, but these methods do not always transfer in a straightforward manner to the nonparametric setting. This has caused many to shy away from nonparametric methods, even with the knowledge that they can lead to additional insight (Henderson and Parmeter 2015).

To showcase these methods, we will look at the relationship between experience, education and earnings. We will begin by ignoring the endogeneity of education and then will discuss how to control for this via a nonparametric control function approach. While nonparametric estimation may seem like merely one modeling choice, it should be stated that the parametric alternative requires strict functional form assumptions, which, if false, likely lead to biased and inconsistent estimators. In practice, the functional relationship between education and earnings as well as between education and its instruments is typically unknown. By using nonparametric regression, we relax these functional form restrictions and are more likely to uncover the causal relationship.

To empirically illustrate these methods, we use individual-level data obtained from the March Current Population Survey (CPS) to highlight each concept discussed. To eliminate additional complications, we primarily focus on a relatively homogeneous sub-group, specifically, working age (20 to 59 years old) males with four-year college degrees.

In what follows, we first slowly introduce the fundamentals of spline and kernel estimators and then discuss how to decide upon the various options for each estimator. This should build the foundation for understanding the more advanced topic of handling endogenous regressors. By illustrating these techniques in the context of labor-specific examples, we hope to encourage widespread use of these methods in labor applications.

Nonparametric Regression

In a parametric regression model, we assume that a particular functional form describes the relationship between the response and explanatory variables. If this form is correct, and the remaining Gauss-Markov assumptions hold, we will have unbiased and efficient estimators. However, if these assumptions do not hold, these estimators are likely biased and inconsistent. Nonlinear parametric models exist, but are often complicated to estimate and still require a priori knowledge of the underlying functional form.

Nonparametric regression offers an alternative. The methods discussed here estimate the unknown conditional mean by using a “local” approach. Specifically, the estimators use data near the point of interest to estimate the function at that point and then use these local estimates to construct the global function. This can be a major advantage over parametric estimators which use all data points to build their estimates (global estimators). In other words, nonparametric estimators can focus on local peculiarities inherent in a data set. Those observations which are more similar to the point of interest carry more weight in the estimation procedure.

This section will introduce two commonly used nonparametric techniques, and will provide the notation and concepts that will be used for the remainder of this review. Specifically, we discuss spline and kernel regression estimation. To help bridge gaps, we make connections to well-known techniques such as ordinary and weighted least-squares.

Spline Regression

Spline regression can be thought of as an extension of ordinary least-squares (OLS). Consider the basic univariate linear model:

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} + \epsilon_{i}, \quad i = 1,2,\ldots,n, $$
(1)

where for a sample of n observations, y is our response variable, x is our explanatory variable, 𝜖 is our usual error term and we have two parameters: a constant and a slope (β0 and β1, respectively). The right-hand side of Eq. 1 can be thought of as a linear combination of 1 and x, which we call the “bases” of the model. One popular way to transform (1) into a nonlinear function is to add higher-order polynomials. A quadratic model would add one extra basis function, \(x^{2}\), to the model, which corresponds to adding the term \(\beta _{2}{x}_{i}^{2}\) to Eq. 1. In matrix form, the number of bases would correspond to the number of columns in the matrix X:

$$ y = X\beta + \epsilon, $$
(2)

where

$$X = \left[\begin{array}{ll} 1 & x_{1} \\ 1 & x_{2} \\ {\vdots} & {\vdots} \\ 1 & x_{n} \end{array}\right] $$

for the linear case (2 bases), and

$$X = \left[\begin{array}{lll} 1 & x_{1} & {x^{2}_{1}} \\ 1 & x_{2} & {x^{2}_{2}} \\ {\vdots} & {\vdots} & {\vdots} \\ 1 & x_{n} & {x^{2}_{n}} \end{array}\right] $$

for the quadratic case (3 bases).
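To make the basis construction concrete, the two design matrices above can be generated and fit by OLS in R. This is only a sketch: the data frame cps and the variable names lwage (log wage) and exper (years of experience) are hypothetical stand-ins for the CPS extract described earlier.

```r
# Hypothetical CPS extract with columns lwage (log wage) and exper (experience)
# cps <- read.csv("cps_extract.csv")

# Linear case: two bases (1, x)
fit_lin <- lm(lwage ~ exper, data = cps)

# Quadratic case: three bases (1, x, x^2)
fit_quad <- lm(lwage ~ exper + I(exper^2), data = cps)

# The model matrices correspond to the X matrices displayed above
head(model.matrix(fit_lin))
head(model.matrix(fit_quad))
```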

These two cases are illustrated in Fig. 1 where x is years of experience and y is the log wage (adjusted for inflation). To highlight a relatively homogeneous group, we restrict our sample to college-educated (16 years of schooling) males working in personal care and service (or related occupations) between 2006 and 2016.Footnote 1 For each panel, the solid line represents white males and the dashed line non-whites. Our linear model (i.e., OLS) shows a strong wage gap between whites and non-whites which seems to remain constant (in percentage terms) as these workers gain experience (i.e., similar slopes). Adding experience squared to the model (quadratic model) allows us to better capture the well-known nonlinear relationship between log wage and experience. As workers gain experience, we expect their log wage to increase, but at a decreasing rate. The quadratic model (bottom-left panel) shows a large increase in log wages early in a career with a slight decline towards the end. Also, this model tends to suggest that the wage gap between white and non-white males working in personal care and service varies with experience. Non-white workers appear to have a more constant and slower increase in their predicted log wages.

Fig. 1 Log-wages versus experience for white versus non-white college-educated males working in personal care and service

Linear Spline Bases

In our example, we could argue that although wages should increase with experience (increase in competence/knowledge), there may be a point where more experience will not increase wages or perhaps even decrease them (slower cognitive ability/decreases in efficiency). Suppose we created a model with the equivalent of two linear regressions: one for the first 20 years of experience, and another for the later years. This would be equivalent to adding the following basis function to our linear model:

$$(x-20)_{+}, $$

where the + sign indicates that the function is set to zero for all values of x where (x − 20) is negative. This model is sometimes called the broken stick model because of its shape, but more generally is referred to as a linear spline base model with 3 knots. The 3 knots are at 0 (minimum value), 20, and 37 (maximum value) years of experience. Note that the maximum and minimum values of x will always be considered to be knots. For example, the linear model in Eq. 1 has two knots. Here we arbitrarily fixed the middle knot at 20 years of experience. We will discuss which knots to select and how many to select in “Model Selection”.

The broken stick model with a break at x = 20 is written as

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} +\beta_{2} {(x_{i}-20)}_{+} + \epsilon_{i} $$
(3)

and is illustrated in the upper-right panel of Fig. 1. We see a similar result to the quadratic model, that is, for white workers, we see a strong increase in wages in the first part of their career followed by a smaller decrease towards the end of their career. That being said, we arbitrarily fixed the middle knot at 20 years of experience. Without strong reasons to do so, it is premature to say anything about when the increase in the log wage stops and when the decrease begins. Noting the aforementioned issue, we also observe the wage gap widen at first with experience, but then converge at higher levels of experience.
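A minimal sketch of the broken stick model in Eq. 3, again using the hypothetical cps data frame; pmax() constructs the truncated basis (x − 20)+.

```r
# Truncated linear basis (x - 20)_+ built with pmax()
fit_break <- lm(lwage ~ exper + pmax(exper - 20, 0), data = cps)
coef(fit_break)  # beta0, beta1 and beta2 in Eq. 3
```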

Figure 2 illustrates how adding knots at different values can change the results. We present a model with 5 knots at x = 0,10,20,30,37, and a model with 20 knots (every 2 years) at x = 0,2,4,…,34,36,37. In the matrix form of Eq. 2, the X matrix with 5 knots is given as

$$X = \left[\begin{array}{llllll} 1 & x_{1} & {(x_{1}-10)}_{+} & {(x_{1}-20)}_{+} & {(x_{1}-30)}_{+} & {(x_{1}-37)}_{+} \\ 1 & x_{2} & {(x_{2}-10)}_{+} & {(x_{2}-20)}_{+} & {(x_{2}-30)}_{+} & {(x_{2}-37)}_{+} \\ {\vdots} & {\vdots} & {\vdots} & & {\vdots} & {\vdots} \\ 1 & x_{n} & {(x_{n}-10)}_{+} & {(x_{n}-20)}_{+} & {(x_{n}-30)}_{+} & {(x_{n}-37)}_{+} \end{array}\right] $$

and with 20 knots,

$$X = \left[\begin{array}{llllll} 1 & x_{1} & {(x_{1}-2)}_{+} & {\ldots} & {(x_{1}-36)}_{+} & {(x_{1}-37)}_{+} \\ 1 & x_{2} & {(x_{2}-2)}_{+} & {\ldots} & {(x_{2}-36)}_{+} & {(x_{2}-37)}_{+} \\ {\vdots} & {\vdots} & {\vdots} & & {\vdots} & {\vdots} \\ 1 & x_{n} & {(x_{n}-2)}_{+} & {\ldots} & {(x_{n}-36)}_{+} & {(x_{n}-37)}_{+} \end{array}\right]. $$
Fig. 2 Log-wage versus experience for white versus non-white college-educated males working in personal care and service

Adding knots at 10 and 30 years of experience allows the model to account for the commonly seen mid-career flattening period. However, the function is still not very smooth and it is hard to tell from this model when log wages start to flatten out. Adding more knots allows for more flexibility, but this can potentially lead to overfitting. For example, in the linear base model with 20 knots (upper-right panel of Fig. 2), the fitted line appears to be modeling noise.

Quadratic Spline Bases

The linear spline base model is a combination of linear bases. The quadratic spline base model is a combination of quadratic bases. In other words, we simply add the corresponding squared function for each of the linear base functions. Consider our previous broken stick model with a middle knot at x = 20; we can transform it into a quadratic spline base model with a knot at x = 20 by replacing (x − 20)+ with the following bases:

$$x^{2} , {(x-20)}_{+}^{2}. $$

This quadratic spline base model is represented by the following equation

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} +\beta_{2} {x}_{i}^{2} +\beta_{3} {(x_{i}-20)}_{+}^{2} + \epsilon_{i}, $$
(4)

and is illustrated in the bottom-right panel of Fig. 1. We can see that the quadratic spline base model suggests a slightly different relationship between experience and log wage. The predicted log wage increases more dramatically for the first 5 years of work experience, but flattens out thereafter. The racial gap seems to be small at first, but widens greatly over the first 5 years. Non-white workers appear to slowly catch up over the course of their careers.

One of the main advantages of the quadratic over the linear spline base model is that it does not have any sharp corners (i.e., undefined gradients). It follows that for any number of knots, the resulting function will have continuous first derivatives. This is both a useful and aesthetically pleasing property. Adding more knots (lower-right panel of Fig. 2) to the model adds more variability. It appears that for this example, 5 knots would be sufficient.

An important concept in economics (typically of secondary importance in statistics textbooks) is recovery of the gradients. In the linear case, the gradient between two particular knots is simply the sum of the estimated slope coefficients whose basis functions are “active” over that interval. In the quadratic (or higher-order) case, we use the same method to get the gradient as in a simple quadratic OLS model. The difference is that we calculate it between each knot. That is, to estimate a particular gradient for any type of spline model, we can simply take the partial derivative with respect to the regressor x. In its general form, our estimated gradient \(\widehat {\beta }(x)\) for a particular regressor x is

$$ \widehat{\beta}(x)= \frac{\partial\widehat{y}(x)}{\partial x}. $$
(5)

For our linear spline base example with 3 knots, this is

$$ \widehat{\beta}(x)= \widehat{\beta}_{1} + \left\{\begin{array}{ll} \widehat{\beta}_{2}, & \text{if } x \in [20, 37) \\ 0, & \text{otherwise} \end{array}\right. $$
(6)

and for our quadratic spline base example with 3 knots

$$ \widehat{\beta}(x)= \widehat{\beta}_{1} + 2 \widehat{\beta}_{2} x + \left\{\begin{array}{ll} 2\widehat{\beta}_{3} (x-20) , & \text{if } x \in [20, 37) \\ 0, & \text{otherwise} \end{array}\right. $$
(7)
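Given the fitted coefficients, the gradients in Eqs. 6 and 7 can be evaluated directly. A sketch for the quadratic spline case (Eq. 4), with the hypothetical cps data and the knot fixed at 20 as above:

```r
# Quadratic spline with one interior knot at 20 (Eq. 4)
fit_q20 <- lm(lwage ~ exper + I(exper^2) + I(pmax(exper - 20, 0)^2), data = cps)
b <- coef(fit_q20)

# Gradient in Eq. 7: beta1 + 2*beta2*x + 2*beta3*(x - 20)_+
grad_q20 <- function(x) unname(b[2] + 2 * b[3] * x + 2 * b[4] * pmax(x - 20, 0))

grad_q20(c(5, 25))  # marginal effect of experience at 5 and 25 years
```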

B-Splines

We introduced linear and quadratic spline models with the truncated power basis function. Using the same truncated power functions, those models can be generalized to

$$ y_{i} = \beta_{0} +\beta_{1} x_{i} + {\ldots} +\beta_{p} {x}_{i}^{p} + \sum\limits_{j = 1}^{K} \beta_{p+j} {(x_{i}-\kappa_{j})}_{+}^{p} + \epsilon_{i}, $$
(8)

where p is the degree of the power basis (truncated power basis of degree p) and κ1,…,κK are the knots. This generalizes our model by allowing for (1) other spline models (using p degrees), and (2) other bases for a given spline model (using knots). This function has p − 1 continuous derivatives and thus higher values of p should lead to “smoother” spline functions. Similar to before, the general form of the gradient is defined as

$$ \widehat{\beta}(x)= \frac{\partial\widehat{y}(x)}{\partial x} = \widehat{\beta}_{1} + 2\widehat{\beta}_{2} x + {\ldots} + p\widehat{\beta}_{p} x^{p-1} + \sum\limits_{j = 1}^{K} p\widehat{\beta}_{p+j} {(x-\kappa_{j})}_{+}^{p-1}. $$
(9)

While this general form seems reasonable, splines computed from the truncated power bases in Eq. 8 may be numerically unstable. The values in the X-matrix may become very large (for large p), and the columns of the X-matrix may be highly correlated. This problem will only become worse with a higher number of knots. Therefore, Eq. 8 is rarely used in practice, but is instead typically transformed into equivalent bases with more stable numerical properties. One of the most popular is the B-spline basis.

This can be relatively difficult to present and code, but luckily there exist regression packages to easily transform the X-matrix into the more numerically stable version. Formally, we can compute the equivalence as

$$X_{b} = XL_{p}, $$

where X is a matrix of the bases (explanatory variables) used in Eq. 8 and Lp is a square invertible matrix. The most commonly used transformation in the linear case is

$$B(x)_{j}= \left\{\begin{array}{ll} \frac{x-\kappa_{j}}{\kappa_{j + 1}-\kappa_{j}}, & \text{if } x \in [\kappa_{j}, \kappa_{j + 1}) \\ \frac{\kappa_{j + 2}-x}{\kappa_{j + 2}-\kappa_{j + 1}}, & \text{if } x \in [\kappa_{j + 1}, \kappa_{j + 2})\\ 0, & \text{otherwise} \end{array}\right. $$

for each knot \(\kappa_{j}\) in the knot sequence.

To better illustrate this, consider our broken stick example from Fig. 1: the linear spline with one middle knot at 20 years of experience. The corresponding bases for this model are 1, x, and (x − 20)+ and are shown in the upper-left panel of Fig. 3. The B-spline transformation of the second knot (20 years of experience) for this example is

$$B(x)_{j = 2}= \left\{\begin{array}{ll} \frac{x-0}{20-0}, & \text{if } x \in [0, 20)\\ \frac{37-x}{37-20}, & \text{if } x \in [20, 37)\\ 0, & \text{otherwise} \end{array}\right. $$

The corresponding bases of this transformation are shown in the upper-right panel of Fig. 3. B(x)j= 2 corresponds to the inverse V-shaped function which equals 1 when experience equals 20. The other two functions can be computed similarly using j = − 1, and 3. Adding a higher degree to our model will change the shape of our basis functions. The two bottom panels of Fig. 3 show the equivalent truncated spline basis and B-spline basis for the cubic case (p = 3).

Fig. 3 Truncated and B-spline corresponding bases with knots at 0, 20, and 37 years of experience

While other basis functions exist (for example, radial basis functions), practitioners may prefer B-splines as they are both numerically more stable and relatively easy to compute. Both R and Stata packages are available. We used the bs(⋅) function in the splines packageFootnote 2 in R. The bspline module is available in Stata for B-splines.
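For example, the B-spline bases can be generated in R with bs() from the splines package. The knot placement below mirrors the broken stick example (interior knot at 20, boundary knots at 0 and 37); the cps data frame remains a hypothetical placeholder.

```r
library(splines)

# Linear B-spline basis with an interior knot at 20 years of experience
fit_b1 <- lm(lwage ~ bs(exper, degree = 1, knots = 20,
                        Boundary.knots = c(0, 37)), data = cps)

# Cubic B-splines (p = 3) use the default degree of bs()
fit_b3 <- lm(lwage ~ bs(exper, knots = 20, Boundary.knots = c(0, 37)),
             data = cps)
```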

Kernel Regression

Instead of assuming that the relationship between y and x comes from a polynomial family, we can state that the conditional mean is an unspecified smooth function m(⋅) and our model will be given as

$$ y_{i}=m(x_{i})+\epsilon_{i}, \quad i = 1,2,\ldots,n, $$
(10)

where the remaining variables are described as before. In much the same way spline regression can be thought of as an extension of OLS, kernel regression can be seen as an extension of weighted least-squares (WLS). That is, we are still minimizing a weighted residual sum of squares, but now we will weight observations by how close they are to the point of interest (i.e., a “local” sample). With spline regression, our local sample is defined as all the points included between two knots, where each point within that sample is weighted equally. Kernel regression goes a step further by estimating each point using a weighted local sample that is centered around the point of interest. The local sample is weighted using a kernel function, which possesses several useful properties.

A kernel function defines a weight for each observation within a (typically) symmetric predetermined bandwidth. Unlike an OLS regression which makes no distinction of where the data are located when estimating the conditional expectation, kernel regression will estimate the point of interest using data within a bandwidth.

Before introducing the kernel estimators, let us first derive a kernel function. Consider x, our point of interest; we can write an indicator function that counts the data falling within a range h (our bandwidth) around x:

$$n_{x} = \sum\limits_{i = 1}^{n} 1\left\{x-\frac{h}{2}\leq x_{i} \leq x+\frac{h}{2}\right\}.$$

The corresponding probability of falling in this box (centered on x) is thus \(n_{x}/n\). This sum of indicators can be rewritten as

$$ n_{x} = \sum\limits_{i = 1}^{n} \left( \frac{1}{2}\right) 1 \left\{\left| \frac{x_{i}-x}{h} \right| \leq 1 \right\}. $$
(11)

This function is better known as the uniform kernel and is more commonly written as

$$k(\psi)= \left\{\begin{array}{ll} 1/2,& \text{if } |\psi|\leq 1\\ 0, & \text{otherwise} \end{array}\right. $$

where we have written k(ψ) for convenience, where ψ is defined as \((x_{i}-x)/h\) and represents how “local” the observation xi is relative to x. Though very simple and intuitive, the uniform kernel is not smooth. It is discontinuous at − 1 and 1 (when the weight switches from 1/2 to zero) and has a derivative of 0 everywhere except at these two points (where it is undefined).

This kernel is rarely used, but it does possess some basic properties that we typically require of kernel functions. More formally, if we let the moments of the kernel be defined as

$$ \kappa_{j} (k)= {\int}_{-\infty}^{\infty} \psi^{j} k(\psi)d\psi, $$
(12)

these properties are

  1. \(\kappa_{0}(k) = 1\) (k(ψ) integrates to one),

  2. \(\kappa_{1}(k) = 0\) (k(ψ) is symmetric), and

  3. \(\kappa_{2}(k) < \infty\) (k(ψ) has a finite variance).

These are known as second-order kernels. In addition to the uniform kernel, several commonly used kernel functions can be found in Table 1 (with their second moments) and Fig. 4. Most of them are derived from the general polynomial family:

$$ k_{s}(\psi)= \frac{(2s + 1)!!}{2^{s + 1}s!} (1-\psi^{2})^{s} \textbf{1}\{|\psi|\leq 1\}, $$
(13)

where !! is the double factorial. The most commonly used kernel function in econometrics is the Gaussian kernel as it has derivatives of all orders. The most commonly used kernel function in statistics is the Epanechnikov kernel function as it has many desirable properties with respect to mean squared error. We will discuss how to choose the kernel function and smoothing parameter (h) in “Model Selection”.
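The kernels in Eq. 13 are simple to code directly. A sketch of the uniform (s = 0) and Epanechnikov (s = 1) kernels, plus the Gaussian kernel, which is not a member of the polynomial family:

```r
# Uniform kernel: s = 0 in Eq. 13
k_uniform <- function(psi) 0.5 * (abs(psi) <= 1)

# Epanechnikov kernel: s = 1 in Eq. 13
k_epan <- function(psi) 0.75 * (1 - psi^2) * (abs(psi) <= 1)

# Gaussian kernel (not in the polynomial family, but has derivatives of all orders)
k_gauss <- function(psi) dnorm(psi)

# Each integrates to one, is symmetric, and has a finite second moment
integrate(k_epan, -1, 1)$value
```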

Table 1 Commonly used second-order kernel functions

Fig. 4 Commonly used second-order kernel functions

Local-Constant Least-Squares

The classic kernel regression estimator is the local-constant least-squares (LCLS) estimator (also known as the Nadaraya-Watson kernel regression estimator, see Nadaraya (1964) and Watson (1964)). While it has fallen out of fashion recently, it remains useful as a teaching tool and in many situations (e.g., binary left-hand-side variables).

To begin, recall how we construct the OLS estimator. Our objective function is

$$\underset{\alpha,\beta}{\min}\sum\limits_{i = 1}^{n}(y_{i}-\alpha-x_{i}\beta)^{2}, $$

which leads to the slope and intercept estimators, \(\widehat {\beta }\) and \(\widehat {\alpha }\).

Suppose instead of a linear function of x, we simply regress y on a constant (a). Our objective function becomes

$$\underset{a}{\min} \sum\limits_{i = 1}^{n}[y_{i}-a]^{2} , $$

which leads to the estimator \(\widehat {a}=(1/n){\sum }^{n}_{i = 1}y_{i}=\bar {y}\). A weighted least-squares version of this objective function can be written as

$$\underset{a}\min\sum\limits_{i = 1}^{n}[y_{i}-a]^{2} W(x_{i}) , $$

where W(xi) is the weighting function, unique to the point xi. If we replace the weighting function with a kernel function, minimizing this objective function yields the LCLS estimator

$$ \widehat{a}=\widehat{m}(x) = \frac{{\sum}^{n}_{i = 1}y_{i} k\left( \frac{x_{i}-x}{h} \right)}{{\sum}^{n}_{i = 1} k\left( \frac{x_{i}-x}{h} \right)}. $$
(14)

This estimator represents a local average. Essentially, we regress y locally, on a constant, weighting observations via their distance to x.
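A base-R sketch of the LCLS estimator in Eq. 14, reusing the Gaussian kernel defined earlier; the bandwidth of 2 is purely illustrative.

```r
# LCLS fit (Eq. 14) at each evaluation point in x0
lcls <- function(x0, x, y, h, k = k_gauss) {
  sapply(x0, function(pt) {
    w <- k((x - pt) / h)   # kernel weights relative to the point of interest
    sum(w * y) / sum(w)    # local (weighted) average
  })
}

# Fitted log-wage profile over the observed range of experience
grid <- seq(min(cps$exper), max(cps$exper), length.out = 100)
mhat <- lcls(grid, cps$exper, cps$lwage, h = 2)
```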

While Eq. 14 gives us the fit, economists are typically interested in the marginal effects (i.e., gradients). To estimate a particular gradient, we simply take the partial derivative of \(\widehat {m}(x)\) with respect to the regressor of interest, x. Our estimated gradient \(\widehat {\beta }(x)\) is thus

$$ \widehat{\beta}(x)= \frac{ \left( {\sum}^{n}_{i = 1} y_{i} \frac{\partial k\left( \frac{x_{i}-x}{h} \right)} {\partial x} \right) \left( {\sum}^{n}_{i = 1} k\left( \frac{x_{i}-x}{h} \right) \right) - \left( {\sum}^{n}_{i = 1} y_{i} k\left( \frac{x_{i}-x}{h} \right) \right) \left( {\sum}^{n}_{i = 1} \frac{\partial k\left( \frac{x_{i}-x}{h} \right)}{\partial x} \right) } { \left( {\sum}^{n}_{i = 1} k\left( \frac{x_{i}-x}{h} \right) \right)^{2} } , $$
(15)

where, for example, \(\frac {\partial k\left (\frac {x_{i}-x}{h} \right )}{\partial x}=\left (\frac {x_{i}-x}{h^{2}}\right )k\left (\frac {x_{i}-x}{h}\right )\) for the Gaussian kernel. Higher-order derivatives can be derived in a similar manner.

Local-Linear Least-Squares

While the LCLS estimator is intuitive, it suffers from biases near the boundary of the support of the data. As an alternative, most applied researchers use the local-linear least-squares (LLLS) estimator. The LLLS estimator locally fits a line as opposed to a constant.

The local-linear estimator is obtained by taking a first-order Taylor approximation of Eq. 10 via

$$y_{i} \approx m(x) + (x_{i}-x)\beta(x)+ \epsilon_{i},$$

where β(x) is the gradient. Similar to the LCLS case, by labeling m(x) and β(x) as the parameters a and b, we get the following minimization problem

$$\underset{a,b}\min \sum\limits_{i = 1}^{n}[y_{i}-a - (x_{i}-x)b]^{2} k\left( \frac{x_{i}-x}{h} \right),$$

which, in matrix notation (with q regressors) is

$$\underset{\delta}\min (y-X\delta)'K(x) (y-X\delta) ,$$

where δ = (a,b), X is an n × (q + 1) matrix with its i th row equal to \((1,(x_{i}-x))\) and K(x) is an n × n diagonal matrix with its i th element equal to \({\prod }_{j = 1}^{q} k\left (\frac {x_{ji}-x_{j}}{h_{j}}\right )\). This leads to the LLLS estimators of the conditional expectation (\(\widehat {m}(x)\)) and gradient (\(\widehat {\beta }(x)\)) as

$$\widehat{\delta}(x) = \begin{pmatrix} \widehat{m}(x) \\ \widehat{\beta}(x) \end{pmatrix} = (X^{\prime}K(x)X)^{-1}X^{\prime}K(x)y.$$

Notice that we can obtain the OLS estimator by replacing K(x) by an identity matrix (giving all observations equal weight, i.e., each bandwidth tending towards infinity), the weighted least-squares (WLS) estimator by replacing it with some other weighting function, and the generalized least-squares (GLS) estimator by replacing it with the inverse of the variance-covariance matrix of the errors (Ω).
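A sketch of the LLLS estimator for a single regressor: solve the weighted least-squares problem at each evaluation point, where the first element of \(\widehat{\delta}(x)\) is the fit and the second is the gradient. The kernel and bandwidth are again illustrative.

```r
# LLLS at a single point x0: delta_hat = (X'K X)^{-1} X'K y
llls_point <- function(x0, x, y, h, k = k_gauss) {
  X <- cbind(1, x - x0)           # i-th row: (1, x_i - x0)
  w <- k((x - x0) / h)            # diagonal elements of K(x0)
  delta <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
  c(fit = delta[1], gradient = delta[2])
}

# Fit and marginal effect of experience at 10 years
llls_point(10, cps$exper, cps$lwage, h = 2)
```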

Figure 5 gives both the LCLS and LLLS estimates for white (solid line) and non-white (dashed line) college-educated males working in personal care and service. The gradients for each level of experience are also shown. Compared to the LCLS model, the LLLS model captures a stronger increase in log wage during the first 5 years of work experience with gradients ranging from 0.10 to 0.17. If taken literally, after only a year of working in personal care and service, white college-educated males’ wages increase by almost 17% on average, while non-white college-educated males’ wages increase by about 7%. The LCLS model, while showing a similar overall shape, shows a much slower increase in those first few years of work experience with less than 4% increases in wages for non-whites and 5% to 8% increases for whites. Both models suggest that while white workers have much higher percentage increases in their wages in the first few years, those year-to-year percentage increases in their wages fall below those of non-white workers after 10 years of experience.

Fig. 5 Log-wage versus experience for white versus non-white college-educated males working in personal care and service

Local-Polynomial Least-Squares

The derivation of the LLLS estimator can be generalized to include higher-order expansions. The resulting family of estimators is called local-polynomial least-squares (LPLS) estimators. For the general case, if we are interested in the p th-order Taylor expansion, and we assume that the (p + 1)th derivative of the conditional mean at the point x exists, we can write our equation as

$$y_{i} \approx m(x) + (x_{i}-x)\frac{\partial m(x)}{\partial x} + (x_{i}-x)^{2}\frac{\partial^{2} m(x)}{\partial x^{2}} \frac{1}{2!}+ {\ldots} + (x_{i}-x)^{p}\frac{\partial^{p} m(x)}{\partial x^{p}} \frac{1}{p!} + \epsilon_{i}.$$

Replacing the parameters by (a0,…,ap), our kernel weighted least-squares problem can be written as

$$\underset{a_{0},\ldots,a_{p}}{\min}\sum\limits_{i = 1}^{n}\left[y_{i}-a_{0} - (x_{i}-x)a_{1} - (x_{i}-x)^{2}a_{2} -\ldots- (x_{i}-x)^{p}a_{p}\right]^{2} k\left( \frac{x_{i}-x}{h} \right) .$$

In matrix notation, our objective function becomes

$$\underset{\delta}{\min} (y-X\delta)'K(x) (y-X\delta), $$

where the only difference from the LLLS case (p = 1) is that the i th row of X is defined as [1,(xix),(xix)2,…,(xix)p] and δ = (a0,a1,…,ap). Minimizing the objective function leads to the local-polynomial least-square estimator

$$\widehat{\delta}(x)= \left( \widehat{m}(x), \frac{\partial \widehat{m}(x)}{\partial x}, \frac{\partial^{2} \widehat{m}(x)}{\partial x^{2}}, \ldots, \frac{\partial^{p} \widehat{m}(x)}{\partial x^{p}} \right)^{\prime}= (X^{\prime}K(x)X)^{-1}X^{\prime}K(x)y.$$

The first question then becomes, how many expansions should we take? More expansions lead to less bias, but increased variability. This becomes a bigger problem when the number of covariates (q) is large and the sample size (n) is small. One promising data driven method to determine the number of expansions is considered in Hall and Racine (2015).

As is the case for splines, there exist options to employ these methods in popular software packages. In R we recommend the np package (Hayfield and Racine 2008) and in Stata we recommend the npregress command.
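A hedged sketch of a typical np workflow in R (local-linear regression with least-squares cross-validated bandwidths); argument names should be checked against the package documentation, and cps is again a hypothetical data frame.

```r
library(np)

# Bandwidths by least-squares cross-validation, local-linear regression
bw  <- npregbw(lwage ~ exper, regtype = "ll", bwmethod = "cv.ls", data = cps)
fit <- npreg(bws = bw, gradients = TRUE)

summary(fit)
head(fitted(fit))     # conditional mean estimates
head(gradients(fit))  # marginal effects of experience
```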

Model Selection

For both spline and kernel regression, many seemingly arbitrary choices can greatly influence fit. The typical trade-off is between bias and variance. We want to make selections such that we avoid overfitting or underfitting. In this section, we first discuss penalty selection, knot selection, and degree selection in spline models; and then, kernel and bandwidth selection in kernel models.

Spline Penalty and Knot Selection

In “Spline Regression”, we saw that the fit is influenced by both our choice of degree of the piecewise polynomials, and by the number and locations of knots we include. However, in spline models, there is a third, more direct way, to influence fit: add an explicit penalty. In short, we want to select the degree of the piecewise polynomials, the knot locations, and the smoothing parameter λ (penalty) which best capture the underlying shape of our data. Though we will briefly discuss the selection of all three, it is easy to show that the choices of degree and knots are much less crucial than the choice of λ, the smoothing parameter (we will see a similar result for kernel regression). That is, when using a high enough number of knots and degrees, the “smoothness” of our fit can be controlled by λ. Hence, we will focus most of our discussion on the choice of λ when the degree and number of knots are fixed. Although there exist several ways to select our parameters in a data-driven manner, we will concentrate on one of the most commonly used approaches: cross-validation (CV).

Penalty Selection Using Cross Validation

There are several ways to impose a penalty, but here we focus on a method that avoids extreme values (and hence too much variability). In a univariate setting using a linear spline, this penalty is

$$\sum\limits_{j = 1}^{K} \beta_{1+j}^{2} \leq C, $$

where \(\beta_{1+j}\) is the coefficient on the j th knot.Footnote 3 In matrix form, our constrained objective function can thus be written as

$$\underset{\beta}{\min} \mid \mid y- X\beta \mid \mid^{2} \textmd{ s.t. } \beta^{\prime} D\beta \leq C, $$

and leads to the LagrangianFootnote 4

$$ \mathcal{L}(\beta,\lambda) = \underset{\beta,\lambda}{\min} \mid \mid y - X\beta \mid \mid^{2} + \lambda^{2} \beta^{\prime} D\beta, $$
(16)

where D is a diagonal matrix with diagonal elements (0, 0, 1, …, 1), that is, zeros for the intercept and slope coefficients and ones for the K knot coefficients. Note that consistency will require that λ tends towards zero as the sample size (n) tends towards infinity.

The second term of Eq. 16 is called a roughness penalty because it penalizes the curvature of our estimated function through the value of the smoothing parameter (λ). This type of regression is referred to as a penalized spline (p-spline) regression and yields the following solution and fitted values:

$$\widehat{\beta}_{\lambda} = (X^{\prime}X + \lambda^{2}D)^{-1}X^{\prime}y $$
$$\widehat{y} = X(X^{\prime}X + \lambda^{2}D)^{-1}X^{\prime}y. $$

To generalize these results to the p th degree spline model (8), we replace \(\lambda^{2}\) by \(\lambda^{2p}\) and transform the D-matrix so that its diagonal contains p + 1 zeros (for the polynomial coefficients) followed by K ones (for the knot coefficients).Footnote 5 A penalized B-spline (PB-spline) would simply include the transformation done to X (i.e., the square invertible matrix Lp) in the penalty term as well:

$$\widehat{y} = X_{B}(X^{\prime}_{B}X_{B} + \lambda^{2p}L^{\prime}_{p}DL_{p})^{-1}X^{\prime}_{B}y.$$
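A sketch of the linear p-spline in base R: build the truncated-power design matrix, form D with zeros for the polynomial coefficients and ones for the knot coefficients, and apply the closed-form solution above. The knot spacing and λ are illustrative choices, and cps is hypothetical.

```r
# Truncated-power design matrix with knots every 5 years of experience
knots <- seq(5, 35, by = 5)
X <- cbind(1, cps$exper,
           sapply(knots, function(kap) pmax(cps$exper - kap, 0)))
y <- cps$lwage

# Penalty matrix: zeros for (intercept, slope), ones for the knot coefficients
D <- diag(c(0, 0, rep(1, length(knots))))

# Penalized least-squares solution for a given lambda
lambda <- 10
beta_l <- solve(t(X) %*% X + lambda^2 * D, t(X) %*% y)
yhat   <- X %*% beta_l
```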

As \(\lambda^{2p} \rightarrow \infty\) (infinite smoothing), the curvature penalty becomes predominant and the estimate converges to OLS. As \(\lambda^{2p} \rightarrow 0\), the curvature penalty becomes insignificant. In this case, the function will become rougher (we will see a similar result with the bandwidth parameter for a LLLS regression). Figure 6 illustrates this effect using linear p-spline estimates for college-educated males working in personal care and service. The knots have been fixed at every five years of experience (0, 5, 10, ...). As the penalty (λ) increases, it is clear that the fit becomes smoother and converges to an OLS estimate.

Fig. 6 Log-wage versus experience for college-educated males working in personal care and service with different penalty (λ) factors

Figure 6 shows an intuitive fit of the data for a value of λ around 10. However, using a more systematic method to select λ would lead to less subjective and more comparable results. If we let \(\widehat {m}(x_{i};\lambda )\) be our nonparametric regression estimate at the point \(x_{i}\) with smoothing parameter λ, we can write a residual sum of squares objective function as

$$ RSS (\lambda) = \sum\limits_{i = 1}^{n} \left[y_{i} - \widehat{m}(x_{i};\lambda) \right]^{2}. $$
(17)

The problem with this approach is that \(\widehat {m}(x_{i};\lambda )\) uses \(y_{i}\) itself, as well as the other observations, to predict \(y_{i}\). This objective function is minimized when λ = 0. This problem can be avoided by using a leave-one-out estimator. Least-Squares Cross-Validation (LSCV) is the technique whereby we minimize Eq. 17, where the fit is replaced by a leave-one-out estimator

$$ CV(\lambda) = \sum\limits_{i = 1}^{n} \left[y_{i} - \widehat{m}_{-i}(x_{i};\lambda) \right]^{2}, $$
(18)

where \(\widehat {m}_{-i}\left (\cdot \right )\) is our leave-one-out estimator, and is defined as our original nonparametric regression estimator \(\widehat {m}\left (\cdot \right )\) applied to the data, but with the point (xi,yi) omitted. We will thus choose a smoothing parameter \(\widehat {\lambda }_{CV}\) that will minimize CV (λ) over λ ≥ 0.
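A brute-force sketch of the leave-one-out criterion in Eq. 18 for the p-spline above (reusing X, y and D from the earlier sketch); each candidate λ requires n refits, which is slow but transparent.

```r
# CV(lambda) from Eq. 18 via explicit leave-one-out refits
cv_pspline <- function(lambda, X, y, D) {
  n <- length(y)
  sum(sapply(1:n, function(i) {
    b_mi <- solve(t(X[-i, ]) %*% X[-i, ] + lambda^2 * D,
                  t(X[-i, ]) %*% y[-i])
    (y[i] - X[i, ] %*% b_mi)^2
  }))
}

lambdas <- seq(0, 30, by = 1)
cv_vals <- sapply(lambdas, cv_pspline, X = X, y = y, D = D)
lambdas[which.min(cv_vals)]  # CV-chosen penalty
```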

Using the same number of knots, the top panel of Fig. 7 shows the corresponding CV and RSS curves at different values of λ. We can see that the RSS curve is strictly increasing, as theory predicts, so minimizing it would choose a λ of zero. The CV curve, on the other hand, decreases at first and reaches a minimum when λ = 7. The resulting fit (bottom panel of Fig. 7) is smoother than what the RSS criterion would provide.Footnote 6

Fig. 7 Objective functions for choosing penalty factors for linear p-splines for college-educated males working in personal care and service

Knots and Degree Selection

Using an “optimal” λ and the CV criterion, we can compare p-spline models that use different numbers (and locations) of knots and different bases (degrees). From experimenting with the number of knots and degrees, the literature finds that (1) adding more knots only improves the fit up to a relatively small number of knots; and (2) when using many knots, the minimum CV values for linear and quadratic fits become indistinguishable. In general, we suggest using quadratic or cubic basis functions.

Though there exist more formal criteria to select the number and location of knots, Ruppert et al. (2003) provide simple default rules which often work well. Their default choice of K is

$$K = \min\left( \frac{1}{4}\times \text{number of unique } x_{i},\ 35\right), $$

where K is the number of knots. For knot locations they suggest the equi-quantile rule

$$\kappa_{k} = \left( \frac{k + 1}{K + 2}\right)\text{th sample quantile of the unique } x_{i}, $$

for k = 1,…,K.
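In R, the default rules above can be implemented in a few lines (a sketch, using the hypothetical cps data):

```r
x_unique <- unique(cps$exper)

# Default number of knots: min(n_unique/4, 35)
K <- min(floor(length(x_unique) / 4), 35)

# Equi-quantile knot locations
knots <- quantile(x_unique, probs = (1 + seq_len(K)) / (K + 2))
```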

Eilers and Marx (1996, 2010) argue that equally spaced knots are always preferred. Eilers and Marx (2010) present an example where equally spaced knots outperform quantile spaced knots. The best type of knot spacings is still under debate and both methods are still commonly used.Footnote 7

While knot location and degree selection usually have little effect on the fit when using a “sufficiently” large number of knots, they may become important when dealing with more complex problems, for example, when trying to smooth regression functions with strongly varying local variability or with sparse data. In these cases, using a more sophisticated selection algorithm may be more appropriate.

Kernel and Bandwidth Selection

Choosing a kernel function is similar to choosing the degree of the piecewise polynomials in spline models, and choosing the size of the bandwidth (h) is similar to choosing the number and location of knots. There exist equivalents to having a direct penalty (λ) incorporated in a kernel model, but those are rarely used in applied kernel estimation. We will therefore focus our discussion on kernel and bandwidth selection.

Similar to adding more knots or decreasing the penalty λ in a spline model, decreasing the bandwidth will lead to less bias, but more variance. Figure 8 illustrates this effect using LLLS and a Gaussian kernel for college-educated males working in personal care and service. As the size of the bandwidth (h) increases, the fit becomes smoother and converges to OLS.

Fig. 8 Log-wage versus experience for college-educated males working in personal care and service when varying the bandwidth (h) parameter

The bandwidth and kernel can be chosen via the asymptotic mean squared error (AMSE) criterion (or, more specifically, via the asymptotic mean integrated squared error, AMISE). In practice, the fit will be more sensitive to a change in bandwidth than a change in the kernel function. Reducing the bandwidth (h) leads to a decrease in the bias at the expense of increasing the variance. In practice, as the sample size (n) tends to infinity, we need to reduce the bandwidth (h) slowly enough so that the amount of “local” information (nh) also tends to infinity. In short, consistency requires that

$$\text{as } n\rightarrow \infty \text{, we need } h\rightarrow 0 \text{ and } nh\rightarrow \infty .$$

The bandwidth is therefore not just some parameter to set, but requires careful consideration. While many may be uncomfortable with an estimator that depends so heavily on the choice of a parameter, remember that this is no worse than pre-selecting a parametric functional form to fit your data.

Cross-Validation Bandwidth Selection

In practice, there exist several methods to obtain the “optimal” bandwidth which differ in the way they calculate the asymptotic mean squared error (or asymptotic mean integrated squared error). Three typical approaches to bandwidth selection are: (1) reference rules-of-thumb, (2) plug-in methods, and (3) cross-validation methods. Each has its distinct strengths and weaknesses in practice, but in this survey we will focus on the data-driven method: cross-validation.Footnote 8 Henderson and Parmeter (2015) provide more details on each of these methods.

LSCV is perhaps the most popular tool for cross-validation in the literature. This criterion is the same as the one described in “Penalty Selection Using Cross Validation” to select the penalty parameter in spline regression. That is, we use a leave-one-out estimator

$$ CV(h) = \sum\limits_{i = 1}^{n} \left[y_{i} - \widehat{m}_{-i}(x_{i}) \right]^{2}, $$
(19)

whereby we minimize the objective function with respect to h instead of λ and the (LCLS) leave-one-out estimator is defined as

$$\widehat{m}_{-i}(x_{i}) = \frac{\underset{j\neq i}{\sum\limits_{j = 1}^{n}} y_{j} K_{h}(x_{j}, x_{i})}{\underset{j\neq i}{\sum\limits_{j = 1}^{n}} K_{h}(x_{j}, x_{i})}. $$
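A sketch of LSCV for the LCLS estimator, reusing the Gaussian kernel from earlier and minimizing Eq. 19 over h with optimize(); the search interval is an illustrative choice.

```r
# Leave-one-out LCLS fit at each x_i, then the LSCV criterion in Eq. 19
lscv <- function(h, x, y, k = k_gauss) {
  sum(sapply(seq_along(y), function(i) {
    w <- k((x[-i] - x[i]) / h)
    (y[i] - sum(w * y[-i]) / sum(w))^2
  }))
}

# Minimize CV(h) over a plausible range of bandwidths
h_cv <- optimize(lscv, interval = c(0.1, 10),
                 x = cps$exper, y = cps$lwage)$minimum
```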

In the top panel of Fig. 9, we show a figure analogous to that presented in “Penalty Selection Using Cross Validation”. It shows the corresponding CV and RSS curves for different bandwidths. When failing to use the leave-one-out estimator, the RSS curve is strictly increasing (i.e., the optimal bandwidth is zero). Using the leave-one-out estimator, the objective function is minimized at h = 1.62. The resulting fit (bottom panel of Fig. 9) shows more variation than the linear p-spline (Fig. 7). This is not surprising as the linear p-spline forces a linear fit between each knot. The two graphs would have looked more similar if we had used a cubic p-spline, allowing for curvature between knots.

Fig. 9 Objective functions for choosing bandwidths for kernel estimators for college-educated males working in personal care and service

Kernel Function Selection

Kernel selection is typically considered to be of secondary importance as it is believed to make only minor differences in practice. The optimal kernel function, in the AMISE sense, is the Epanechnikov kernel function. However, as stated previously, it may not be useful in some situations as it does not possess more than one derivative. Gaussian kernels are often used in economics as they possess derivatives of all orders, but there are losses in efficiency. In the univariate density case, the loss in efficiency is around 5%. However, Table 3.2 of Henderson and Parmeter (2015) shows that this loss in efficiency increases with the dimension of the data (at least in the density estimation case). In practice, it may make sense to see if the results of a study are sensitive to the choice of kernel.

Splines versus Kernels

In these single-dimension cases, our spline and kernel estimates are more or less identical. Spline regressions have the advantage that they are much faster to compute. While it is uncommon to have an economic problem with a single covariate, if that were the case, we likely would suggest splines.

In a multiple variable setting, the differences between the two methods are more pronounced. The computation time for kernels increases exponentially with the number of dimensions. The additional computational time required for splines is minor. On the other hand, kernels handle interactions and discrete regressors well (both common features in economic data; see Ma et al. (2015) for using discrete kernels with splines). It is also relatively easier to extract gradients with kernel methods.

In reality there are camps: those who use kernels and those who use splines. However, the better estimator probably depends upon the problem at hand. Both should be considered in practice.

Instrumental Variables

Nonparametric methods are not immune to the problem of endogeneity. A first thought about how to handle this issue would be to use some type of nonparametric two-stage least-squares procedure. However, this is not feasible as there exists an ill-posed inverse problem (to be discussed below). It turns out that this problem can be avoided by using a control function approach much like that in the parametric literature (e.g., see Cameron and Trivedi (2010)).

To motivate this problem, consider a common omitted-variable problem in labor economics: ability in the basic compensation model. A (potentially) correctly specified wage equation could be described as:

$$ \log(wage) = \beta_{0} + \beta_{1}educ + \beta_{2}z_{1} + \beta_{3}abil + \epsilon, $$
(20)

where educ is years of education, abil is ability, and z1 is a vector of other relevant characteristics (e.g., experience, gender, race, marital status). However, in applied work, ability (abil) cannot be directly measured/observed.

If we ignore ability (abil), it will become part of the error term

$$ \log(wage) = \beta_{0} + \beta_{1}educ + \beta_{2}z_{1} + u, $$
(21)

where u = 𝜖 + β3abil. Because abil is correlated with educ, the regressor educ is correlated with u, and our resulting estimated return to education (β1) will be biased and inconsistent. We can resolve this problem if we can find an instrumental variable (IV) which is uncorrelated with u (and so uncorrelated with ability), but correlated with educ. Several IVs have been considered in the literature for this particular model,Footnote 9 each with their own strengths and weaknesses, but for the purpose of this illustration, we will use spouse’s wage. That is, we assume that spouse’s wage is correlated with education, but not with ability.

In the parametric setting, the control function (CF) approach to IVs is a two-step procedure. In the first step, we regress the endogenous variable on the exogenous vector z:

$$educ = \gamma_{0} + \gamma_{1} z + v ,$$

where z = (z1,spwage) and spwage is the spouse’s wage, and obtain the reduced form residuals \(\widehat {v}\). In the second step, we add \(\widehat {v}\) to Eq. 21 and regress

$$\log(wage) = \beta_{0} + \beta_{1} educ + \beta_{2} z_{1} + \beta_{3} \widehat{v} + u.$$

Once we directly control for v, educ is no longer treated as endogenous.
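A sketch of the parametric two-step control function in R, with hypothetical variable names (lwage, educ, exper, spwage); note that second-stage standard errors would need to be adjusted for the generated regressor.

```r
# Step 1: reduced form for the endogenous regressor
step1 <- lm(educ ~ exper + spwage, data = cps)
cps$vhat <- resid(step1)

# Step 2: add the reduced-form residuals as a control function
step2 <- lm(lwage ~ educ + exper + vhat, data = cps)
coef(step2)["educ"]  # return to education after controlling for vhat
```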

The Ill-Posed Inverse Problem and Control Function Approach

Let us first go back and consider the general nonparametric regression setting

$$ y = m(x) + u, $$
(22)

where E[u|x]≠ 0, but there exists a variable z such that E[u|z] = 0. For the moment, assume that x and z are scalars.

Using the condition

$$E[u|z] = E[y-m(x)|z] = 0,$$

yields the conditional expectation

$$ E[y|z] = E[m(x)|z] = \int m(x)f(x|z)dx. $$
(23)

Although we can estimate both the conditional mean of y given z (E[y|z]) and the conditional density of x given z (f(x|z)), we cannot recover m(x) by inverting the relationship. That is, even though the integral in Eq. 23 is continuous in m(x), inverting it to isolate and estimate m(x) does not represent a continuous mapping. This is the so-called ill-posed inverse problem and it is a major issue when using instrumental variables in nonparametric econometrics.

Luckily, we can avoid this problem by placing further restrictions on the model (analogous to additional moment restrictions in a parametric model). Here we consider a control function approach. Similar to the parametric case above, we consider the triangular framework

$$ x = g(z) + v, $$
(24)
$$ y = m(x) + u $$
(25)

with the conditions E[v|z] = 0 and E[u|z,v] = E[u|v]. The first condition implies that z is a valid instrument for x and the second allows us to estimate m(x) and avoid the ill-posed inverse problem. It does so by restricting u to depend on x only through v. More formally,

$$\begin{array}{@{}rcl@{}} E[y|x,v] & =& m(x) + E[u|x-v, v] \\ & = &m(x) + E[u|z,v] \\ & = &m(x) + E[u|v] \\ & = &m(x) + r(v), \end{array} $$

and hence

$$ m(x) = E[y|x,v] - r(v), $$
(26)

where both E[y|x,v] and r(v) can be estimated nonparametrically. In short, we control for v through the nonparametric estimation of the function r(v).

Spline Regression with Instruments

Now that we have the basic framework, we can discuss nonparametric estimation in practice. Consider our previous compensation model, but without functional form assumptions:

$$ \log(wage) = m(educ, z_{1}) + u $$
(27)

where ability (abil) is unobserved. It is known that abil will be correlated with both the error u and the regressor educ (i.e., E[u|educ]≠ 0). Similar to the parametric setting, if we have an instrument z such that

$$E[u|z] = E[\log(wage)-m(educ, z_{1})|z] = 0,$$

we can avoid the bias due to endogeneity. We again define z = (z1,spwage), where our excluded instrument, spwage, is the spouse’s wage. This yields

$$E[\log(wage)|z] = E[m(educ, z_{1})|z]. $$

Our problem can now be written via the triangular system attributable to Newey et al. (1999):

$$\begin{array}{@{}rcl@{}} educ &=& g(z) + v \end{array} $$
(28)
$$\begin{array}{@{}rcl@{}} \log(wage) &=& m(educ, z_{1}) + u, \end{array} $$
(29)

where E[v|z] = 0 and E[u|z,v] = E[u|v]. Similar to the parametric case, we first estimate the residuals from the reduced-form equation (i.e., \(\widehat {v}\)). We then include the reduced-form residuals nonparametrically as an additional explanatory variable:

$$ \log(wage) = w(educ, z_{1}, \widehat{v}) + u, $$
(30)

where \(w\left (educ,z_{1},\widehat {v}\right )\equiv m\left (educ,z_{1}\right )+r\left (\widehat {v}\right )\) and \(\widehat {m}\left (educ,z_{1}\right )\) can be recovered by only extracting those terms that depend upon educ and z1. Note that we need to use splines that do not allow for interactions between educ or z1 and \(\widehat {v}\) (interactions between educ and z1 are allowed).
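A sketch of the two-step spline control function, keeping \(\widehat{v}\) additively separable from educ and z1 as required above. Cubic B-splines are generated with bs(); variable names and degrees of freedom are hypothetical.

```r
library(splines)

# Step 1: nonparametric (spline) reduced form for education
stage1 <- lm(educ ~ bs(exper, df = 5) + bs(spwage, df = 5), data = cps)
cps$vhat <- resid(stage1)

# Step 2: vhat enters additively (no interactions with educ or exper)
stage2 <- lm(lwage ~ bs(educ, df = 5) + bs(exper, df = 5) + bs(vhat, df = 5),
             data = cps)

# m_hat(educ, exper) is recovered from the educ and exper terms only
```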

In what follows, we will use cubic B-splines (the default for most R packages) with equi-quantile knots (see “Model Selection”) in both stages. Figure 10 shows the fitted results for the first stage. In the left panel, we see that as individuals’ experience increases, the level of education slowly decreases, with a significant drop after 35 years of work experience. In the right panel, we observe a quadratic relationship between education level and spouse’s wage. That is, the higher the spouse’s wage, the higher the individual’s education, but for individuals whose spouses have a high level of income, the relationship becomes negative.

Fig. 10 First-stage estimates for education for college-educated males working in personal care and service versus experience and spousal income

The fitted plots from the second stage are given in Fig. 11. Controlling for education,Footnote 10 men’s log wage seems to increase in the first few years on the job, stabilizes mid-career, and then decreases towards the end of their career in personal care and service. Education seems to affect log wage positively only after 10–12 years of schooling (high-school level).

Fig. 11 Second-stage estimates for log-wage for college-educated males working in personal care and service versus experience and education level

Kernel Regression with Instruments

Estimation via kernels is relatively straightforward given what we have learned above. We again use a control function approach, but with local-polynomial estimators. Kernel estimation of this model was introduced by Su and Ullah (2008) and is outlined in detail in Henderson and Parmeter (2015). In short, the first stage requires running a local-p1th order polynomial regression of the endogenous regressor on z, obtaining the residuals and then running a local-p2th order polynomial regression of y on the endogenous regressor, the included exogenous regressors and the residuals from the first stage.

More formally, our first stage regression model for our example is

$$ educ = g\left( z\right) + v, $$
(31)

and the residuals from this stage are used in the second stage regression

$$ \log(wage) = w(educ, z_{1}, \widehat{v}) + u, $$
(32)

where \(w\left (educ,z_{1},\widehat {v}\right )\equiv m\left (educ,z_{1}\right )+r\left (\widehat {v}\right )\).

In spline regression, we simply took the estimated components not related to \(\widehat {v}\) from \(\widehat {w}\left (\cdot \right )\) in order to obtain the conditional expectation \(\widehat {m}\left (educ,z_{1}\right )\). However, disentangling the residual term is a bit more difficult in kernel regression. While it is feasible to estimate additively separable models, we follow Su and Ullah (2008) and remove it via counterfactual estimates in conjunction with the zero-mean assumption on the errors. Under the assumption that E(u) = 0, we recover the conditional mean estimate via

$$ \widehat{m}\left( educ,z_{1}\right)=\frac{1}{n}\sum\limits_{i = 1}^{n}\widehat{w}\left( educ,z_{1},\widehat{v}_{i}\right), $$
(33)

where \(\widehat {w}\left (educ,z_{1},\widehat {v}_{i}\right )\) is the counterfactual estimator of the unknown function using bandwidths from the local-p2 order polynomial regression in the second step (derivatives can be obtained similarly, by summing over the counterfactual derivatives of \(\widehat {w}\left (\cdot \right )\)).Footnote 11

Bandwidth selection and order of the polynomials (p1 and p2) are a little more complicated. Here we will give a brief discussion, but suggest the serious user consult Chapter 10 (which includes a discussion of weak instruments) of Henderson and Parmeter (2015).

Bandwidth selection is important in both stages. In the first stage, v is not observed and we want to make sure that estimating it does not unduly impact the second stage. If the conditions in Su and Ullah (2008) are satisfied, the following cross-validation criterion can be used

$$ CV\left( h_{2}\right)=\underset{h_{2}}\min \frac{1}{n}\sum\limits_{i = 1}^{n}\left[y_{i}-\widehat{m}_{-i}\left( educ_{i},z_{1i}\right)\right]^{2}, $$
(34)

and the first-stage bandwidths can be constructed as

$$ \widehat{h}_{1}=\widehat{h}_{2} n^{-\gamma}, $$
(35)

where the acceptable values for γ depend upon the order of the polynomials in each stage.Footnote 12

Henderson and Parmeter (2015) give the admissible combinations of polynomial orders for the Su and Ullah (2008) estimator with a single endogenous variable and a single excluded instrument. In practice, they suggest using a local-cubic estimator in the first stage (local-linear in the first stage is never viable) and a local-linear estimator in the second stage for a just-identified model with a single endogenous regressor. For other cases, the conditions of Assumption A5 in Su and Ullah (2008) need to be checked.

Using the methods outlined above, Fig. 12 shows the impact of controlling for endogeneity. The upper-left panel gives density plots for the gradient estimates across the sample for returns to education both with and without using instrumental variables. Most college-educated men working in personal care and service have a wage increase of about 5 to 15% for each additional year they spend in school. However, the distribution is skewed to the left suggesting that a few men have seen their investment in education yield no returns or even negative returns (for a similar result in a nonparametric setting see Henderson et al. (2011)).

Fig. 12 Second-step gradients of log-wage with respect to education for college-educated males working in personal care and service

Comparing those results with the gradients without instruments, we clearly see that failing to control for endogeneity would overestimate the returns to education (as expected). That is, the distribution of gradients without using IV is more concentrated around 10 to 15% returns, with less mass on low returns.

Returning to the examples from before, the upper-right panel of Fig. 12 gives densities of gradient estimates controlling for endogeneity for whites versus non-whites. The figure seems to suggest that non-whites have higher rates of return to education. This is commonly found in the literature, but is often attributed to lower average years of education. To compare apples to apples, in the bottom two panels we plot the densities of returns to education for fixed levels of education (high school and college, respectively). Here we see that while the general shape is similar, whites tend to get more mass on the higher returns and less mass on lower returns, especially for college graduates.Footnote 13