1 Introduction

Randomization has long been the gold standard of inference in statistics. In an experiment with a treatment group (\(D=1\)) and a control group (\(D=0\)), randomization of D ensures that the two groups differ only in treatment status and are balanced in all covariates, observed or not. For instance, persons with different levels of ability or different types of genes are assigned to the two groups, but the distributions of ability levels and gene types become almost the same across the groups. Hence \( E(Y|D=1)-E(Y|D=0)\) for a response variable Y reveals the mean effect of D on Y.

But randomization cannot be done if the treatment is potentially harmful, as with smoking or radiation, and it is unthinkable in most observational studies; this has long been accepted as a fact. More recently, it has been realized that regression discontinuity (RD), initiated long ago by Thistlethwaite and Campbell (1960), offers ‘local randomization’ using an institutional or legal break around a cutoff: \(D=1\) if an underlying continuous variable crosses a cutoff, and \(D=0\) otherwise. Many policy/program/treatment variables take this form; e.g., a test score crossing a cutoff to graduate from a school, a vote proportion crossing 0.5 to win an election, or age crossing a cutoff to retire.

RD has been steadily rising as a main vehicle of inference for observational data in the social sciences whenever D takes the aforementioned form. The goal of this paper is to convey the essentials of RD in a nontechnical and concise fashion. Earlier reviews of RD can be found in Van der Klaauw (2008), Imbens and Lemieux (2008) and Lee and Lemieux (2010); this paper updates them with an emphasis on recent developments. Although intended as a review, this paper provides some extensions as well.

Despite the popularity of RD in social sciences, the RD literature in statistics has been nearly nonexistent as pointed out by Cook (2008), with exceptions being Berk and Rauma (1983), Berk and de Leeuw (1999) and Battistin and Rettore (2002). Robbins and Zhang (1991) examined the RD form of D, calling it a “biased allocation” of treatment, although they did not implement RD, whereas Hansen (2008) mentioned RD only in passing. Nevertheless, RD studies have been appearing slowly in statistics journals in recent years: Cattaneo et al. (2015) for randomized inference when large sample inference is inappropriate, Calonico et al. (2015) for an optimal way to draw RD data plots, and Angrist and Rokkanen (2015) for RD identification away from the cutoff. In addition to these theoretical contributions, several mostly applied works are available as well: Mealli and Rampichini (2012), Crawford et al. (2014), Dickens et al. (2014), Keele et al. (2015), and MacDonald et al. (2016). Undoubtedly, many more will come in the future on both the theoretical and applied fronts in statistics, which makes the review in this paper timely for the statistics community.

Before proceeding further, some words on notation are needed. In RD, although there are exceptions, both identification and estimation are done only locally around a known cutoff c of a ‘running/forcing/assignment variable’ S. Since S can always be re-centered as \( S-c \), we set \(c=0\) unless otherwise necessary, and denote a local neighborhood of 0 as \((-h,h)\) for a small positive bandwidth h; everything in RD will be local unless otherwise mentioned. Various functions of S will appear, and their (dis-)continuity matters only at \(S=0\); we will thus often omit the qualifier ‘at \(S=0\)’. Although the name running/forcing/assignment variable is well established in the RD literature, this variable appears so often in the remainder of this paper that we will call it simply the ‘score’ (S for score). Throughout the paper, \(1[A]=1\) if A holds and 0 otherwise. A realized value of S is denoted s, and \(E(\cdot |S=s)\) is often written simply as \(E(\cdot |s)\).

The rest of this paper is organized as follows. Section 2 introduces RD main ideas and features, the details of which will be seen in the remaining sections. Section 3 discusses RD identification. Section 4 examines RD estimators, and Sect. 5 reviews specification tests. Section 6 collects RD topics; many recent theoretical developments can be found here. Section 7 provides an empirical illustration applying some of the introduced methods. Finally, Sect. 8 concludes.

2 RD main ideas and features

In RD, the main variables are \((D,Y,S)\), and

$$\begin{aligned} Z\equiv 1[0\le S]. \end{aligned}$$

The following examples illustrate what these variables are.

Example 1

Effect of entering a college on income, with the entrance determined solely by a normalized test score equal to or greater than 0; S is the test score, \(D=Z\), and Y is income.

Example 2

Effect of schooling on health, where an educational law dictates that only persons with birth date \(\ge c\) are subject to extra years of education in principle, but not everybody obeys the law; S is birth date minus c, D (\(\ne Z\)) is schooling years, and Y is health level.

To understand RD local randomization in Example 1, consider using \( E(Y|D=1)-E(Y|D=0)\) to find the effect of D on Y. The problem is that the treatment group has higher test score (thus higher ability) individuals than the control group, so that \(E(Y|D=1)\ne E(Y|D=0)\) may occur even if D has no true effect; D is not randomized, and the two groups are systematically different (in ability). But the local treatment group with S at or just above 0 (i.e., \(S\in [0,h)\)) and the local control group with S just below 0 (\(S\in (-h,0)\)) should have almost the same ability levels. For instance, with SAT score, \(c=1200\) (S is SAT score minus c) and \(h=2\), imagine the local treatment group with \(S=1\) and the local control group with \(S=-1\). Since a two-point SAT difference is almost no difference (i.e., getting two points more or less is almost pure luck), the two local groups would differ only in D and be balanced in all covariates, which is a local randomization.

Example 1 has \(Z=1[0\le S]\), but depending on each RD case, Z may take the opposite form \(1[S\le 0]\); e.g., S is family income, and ‘\(S\le 0\)’ is the eligibility condition for an income aid program D. Also, as Example 2 shows, D may not be binary.

Although \(D=Z\) in Example 1, college admission in reality would depend not just on S but also on other variables, say \(\varepsilon \), so that \(D\ne Z\). RD with D determined only by S (thus \(E(D|S)=D\)) is ‘sharp RD (SRD)’, of which \(D=Z\) is a prime example; \(D\ne Z\) is still possible in SRD, as in \(D=ZS\). RD with D determined by \( (S,\varepsilon )\) (thus \(E(D|S)\ne D\)) is ‘fuzzy RD (FRD)’; we may write \(D=D(S,\varepsilon )\). In FRD, D is a “fuzzy version” of Z. Although there are FRDs with a non-binary D as in Example 2, there is little loss of generality in considering only a binary D because there are essentially only two levels of S around 0: the treatment group with \(S=h\) and the control group with \( S=-h\).

In RD, D matters for Y through the ‘degree of sharpness’

$$ \begin{aligned} E(D|0^{+})-E(D|0^{-}) \quad \text {where} \quad E(D|0^{+})\equiv \lim _{s\downarrow 0}E(D|s) \, \& \, E(D|0^{-})\equiv \lim _{s\uparrow 0}E(D|s). \end{aligned}$$

For FRD with a binary D, the degree of sharpness is less than one, whereas it is one in its SRD version. So far, we explained what RD looks like, and how fuzzy RD differs from sharp RD. In the rest of this section, we present the main features of RD.
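To make the degree of sharpness concrete, here is a minimal simulation sketch (the data-generating choices below are hypothetical illustrations, not from any study): a fuzzy design in which crossing the cutoff raises \(P(D=1)\) by 0.6, with the one-sided limits estimated by local sample means on each side of \(S=0\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, h = 20_000, 0.5
S = rng.uniform(-1, 1, N)            # score, already centered at the cutoff 0
Z = (S >= 0).astype(float)           # cutoff-crossing indicator
# fuzzy assignment: P(D=1) jumps from 0.2 to 0.8 at the cutoff
D = (rng.uniform(size=N) < 0.2 + 0.6 * Z).astype(float)

# degree of sharpness E(D|0+) - E(D|0-), estimated by local means
sharpness = D[(S >= 0) & (S < h)].mean() - D[(S < 0) & (S > -h)].mean()
```

In an SRD with \(D=Z\), the same calculation would return exactly one.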

RD local randomization requires a break of E(D|S) at \(S=0\), differently from the usual randomization (e.g., flipping a coin) without S. In Example 1, test score S may affect income Y directly as well as indirectly through D because S has an ability component. Suppose

$$\begin{aligned} E(Y|S)=\beta _{d}E(D|S)+m(S) \end{aligned}$$
(2.1)

where \(\beta _{d}\) is the college entry effect and m(S) is the direct impact of S on Y. But m(S) can be ignored because m(S) should be continuous at \(S=0\) (to change little as S crosses 0) whereas E(D|S) is not: there is no reason for the direct impact of ability on Y to be discontinuous at \(S=0\). It is this contrast between the break (i.e., discontinuity) of E(D|S) and no break (i.e., continuity) of m(S) at \(S=0\) that identifies \(\beta _{d}\) in (2.1).

For covariates W, a parameter \(\beta _{w}\), an error U, and i indexing individuals, suppose

$$\begin{aligned} Y_{i}=\beta _{d}D_{i}+W_{i}^{\prime }\beta _{w}+U_{i}, \quad i=1,\ldots ,N. \end{aligned}$$
(2.2)

(2.1) holds with \(m(S)=E(W^{\prime }|S)\beta _{w}+E(U|S)\) if \(E(W^{\prime }|S)\beta _{w}+E(U|S)\) is continuous at \(S=0\). Since E(U|S) in m(S) is allowed to be a non-trivial continuous function of S, RD is robust to “smooth” endogeneity of D through S, and all smooth effects of covariates (observed or not) on Y can be ignored in RD, as they can be buried in m(S). Hence it is enough in RD to consider only \((S,Z,D,Y)\), with no concern for the functional-form issue of how covariates such as W enter the Y equation.

In FRD with \(D=D(S,\varepsilon )\), if we try to estimate \(\beta _{d}\) by the least squares estimator (LSE) of Y on (D, W) using (2.2), then D may be endogenous due to \(COR(\varepsilon ,U)\ne 0\); this endogeneity of D through \(\varepsilon \) is different from that through S. But there is an “automatic instrument” Z for D in FRD, with which the instrumental variable estimator (IVE) can be applied. If (2.1) instead of (2.2) is used for estimating \(\beta _{d}\), however, then the endogeneity of D due to \( COR(\varepsilon ,U)\ne 0\) is moot, because E(D|S), not D itself, appears in (2.1).

3 Identification

This section addresses RD identification issues. But before discussing identification, we have to make sure that D represents a treatment of interest. Whereas there is in general no problem in calling D a treatment of interest in FRD, multiple treatments can occur together when \( D=Z\) in SRD. For instance, if S is age and \(c=65\), then one may become eligible for several public assistance programs by turning 65. In this case, unless there is an extra variable such as \(\varepsilon \) to characterize D as in FRD, we are bound to find the effect of the combined treatment defined as the interaction of those multiple programs at the cutoff. We will thus proceed from the premise that D represents a single treatment of interest.

3.1 Ratio identification

In an “abstract” formulation, RD refers to D and Y related through (2.1). For SRD, (2.1) becomes \( E(Y|S)=\beta _{d}D+m(S)\), which is a semi-linear model because m(S) is an unknown (i.e., nonparametric) function. Take \(\lim _{s\downarrow 0}\) and \( \lim _{s\uparrow 0}\) in (2.1):

$$\begin{aligned} E(Y|0^{+})=\beta _{d}E(D|0^{+})+m(0^{+})\,\,\text {and}\,E(Y|0^{-})=\beta _{d}E(D|0^{-})+m(0^{-}). \end{aligned}$$

Subtract the latter from the former, and then solve the difference for \( \beta _{d}\):

$$\begin{aligned} \beta _{d}=\frac{E(Y|0^{+})-E(Y|0^{-})}{E(D|0^{+})-E(D|0^{-})}; \end{aligned}$$
(3.1)

the unknown \(m(\cdot )\) drops out due to \(m(0^{+})=m(0^{-})\). The break of E(D|S) at \(S=0\) ensures a non-zero denominator.

The RD ratio identification of \(\beta _{d}\) in (3.1) avoids the unknown m(S) by invoking the continuity of m(S) at \(S=0\). This identification result thus applies only at \(S=0\), and such single-point identification is often viewed as a serious limitation of RD. But if a model such as (2.2) holds, then \(\beta _{d}\) found at a single point is enough. We will examine this single-point identification issue further below. A big question for (3.1) is what happens if the continuity assumption on m(S) is violated, which is discussed in the next subsection.

A better understanding of the ratio (3.1) would come from the following structural forms (SF): for an error term \(U^{\prime }\), a parameter \(\alpha _{z}\) and an unknown function \(\mu (S)\) continuous at \(S=0\) , suppose

$$\begin{aligned} Y=\beta _{d}D+m(S)+U^{\prime }\,\,\text {and}\,\,D=\alpha _{z}Z+\mu (S)+\varepsilon \,\,\text {with}\,\,E(\varepsilon |S)\,\,\text { continuous at }0. \end{aligned}$$
(3.2)

Substituting the D SF into the Y SF gives the Y reduced form (RF):

$$\begin{aligned} Y=\beta _{d}\{\alpha _{z}Z+\mu (S)+\varepsilon \}+m(S)+U^{\prime }=\beta _{d}\alpha _{z}Z+\{\beta _{d}\mu (S)+m(S)\}+(\beta _{d}\varepsilon +U^{\prime }). \end{aligned}$$
(3.3)

The D SF in (3.2) gives \(E(D|0^{+})-E(D|0^{-})=\alpha _{z}\), and the Y RF (3.3) gives \(E(Y|0^{+})-E(Y|0^{-})=\beta _{d}\alpha _{z}\). Therefore (3.1) states nothing but \(\beta _{d}=\beta _{d}\alpha _{z}/\alpha _{z}\), with \(\alpha _{z}\) playing the role of a scaling factor in case \( E(D|0^{+})-E(D|0^{-})\ne 1\). This shows that the ratio identification can be viewed as an indirect identification; in contrast, if \(\beta _{d} \) is identified in the Y SF as in (3.2), it would be a direct identification.
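The ratio logic \(\beta _{d}=\beta _{d}\alpha _{z}/\alpha _{z}\) can be checked in a small simulation sketch of (3.2); the particular \(\mu \), m, and parameter values below are illustrative assumptions. With \(\mu \) and m smooth at 0, local mean differences of Y and D estimate \(\beta _{d}\alpha _{z}\) and \(\alpha _{z}\), and their ratio recovers \(\beta _{d}\) as in (3.1).

```python
import numpy as np

rng = np.random.default_rng(1)
N, h = 100_000, 0.1
beta_d, alpha_z = 2.0, 0.5
S = rng.uniform(-1, 1, N)
Z = (S >= 0).astype(float)
eps = rng.normal(size=N)
U = rng.normal(size=N)

D = alpha_z * Z + 0.3 * S**2 + eps        # D SF: mu(S) = 0.3 S^2, smooth at 0
Y = beta_d * D + np.cos(S) + U            # Y SF: m(S) = cos(S), smooth at 0

right, left = (S >= 0) & (S < h), (S < 0) & (S > -h)
num = Y[right].mean() - Y[left].mean()    # estimates beta_d * alpha_z
den = D[right].mean() - D[left].mean()    # estimates alpha_z
ratio = num / den                         # estimates beta_d, as in (3.1)
```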

3.2 Identified effects when continuity fails

One might wonder what happens if m(S) has a break at \(S=0\), contrary to the continuity assumption. In this case, \(\beta _{d}\) has to be redefined as \(\mathring{\beta }_{d}\) to restore the continuity of m(S):

$$\begin{aligned} \mathring{\beta }_{d}\equiv \beta _{d}+\frac{m(0^{+})-m(0^{-})}{E(D|0^{+})-E(D|0^{-})}\ \big \{=\beta _{d}+m(0^{+})-m(0^{-})\,\,\text {when}\,\,D=Z\big \}; \end{aligned}$$
(3.4)

\(\mathring{\beta }_{d}\) absorbs the m(S) break magnitude. Call \(\beta _{d}\) the ‘net effect’, and \(\mathring{\beta }_{d}\) the ‘gross effect’. The proof for (3.4) in the appendix and much of this subsection are new in the RD literature.

Identifying the gross effect \(\mathring{\beta }_{d}\) would be fine if \(\mathring{\beta }_{d}\) per se is of interest; otherwise, it is a failure of identification. To see this point, consider SRD with \(D=Z\), and imagine a covariate A lurking in m(S) with E(A|S) discontinuous at \(S=0\). In this case, \(\mathring{\beta }_{d}=\beta _{d}+E(A|0^{+})-E(A|0^{-})\). If A is a post-treatment variable, then \(\mathring{\beta }_{d}\) may be viewed as the total effect consisting of the ‘direct effect’ \(\beta _{d}\) and the ‘indirect effect’ \(E(A|0^{+})-E(A|0^{-})\) of D through A; e.g., A may be an interaction with Z, say, \(A=ZA^{\prime }\) for a variable \(A^{\prime } \) so that \(E(A|0^{+})-E(A|0^{-})=E(A^{\prime }|0^{+})\). If A is a pre-treatment variable, however, then the indirect effect interpretation is inappropriate and the RD identification fails. In this case, we should separate \(\beta _{d}\) from \(\mathring{\beta }_{d}\) by explicitly using A as a regressor as in \(Y=\beta _{d}Z+\beta _{a}^{\prime }A+error\).

For a covariate W not causing a break in m(S) because E(W|S) is continuous at 0, there is a trade-off between accounting for W and burying W in m(S). Accounting for W has the advantage of reducing the model error-term variance, but if the W part is misspecified, the variance may not go down because a misspecification error is added; such a misspecification does not, however, make the RD estimator inconsistent.

Differently from an observed covariate, an unobserved covariate (i.e., an error term) is allowed to influence Y only smoothly, through E(U|S) in m(S). This may sound restrictive, but RD’s ability to allow E(U|S) to be a non-trivial function of S is a big advantage over other study designs that require E(U|S) to be constant. Moreover, RD allows D to be endogenous through \(\varepsilon \) in FRD; in (3.2), \( D=D(S,\varepsilon )\) is allowed to be endogenous through \(COR(\varepsilon ,U^{\prime })\ne 0\) as well as through \(COR(S,U^{\prime })\ne 0\), because Z can instrument for D.

4 Estimators

Rewrite (2.1) as

$$\begin{aligned} Y=\beta _{d}D+m(S)+e,\,\,e\equiv Y-E(Y|S)-\beta _{d}\big \{D-E(D|S)\big \}\ \big [\Longrightarrow E(e|S)=0\big ] \end{aligned}$$
(4.1)

where m(S) is to be replaced by a (piecewise) polynomial function continuous at \(S=0\). Since \(E(e|S)=0\), LSE can be applied to (4.1) for SRD. For FRD with \(D=D(S,\varepsilon )\), if D is endogenous through \( \varepsilon \), then IVE can be applied with Z instrumenting for D. Hence there are LSE and IVE for RD estimation, and both are equivalent to sample versions of (3.1), as will be seen shortly.

4.1 LSE for exogenous treatment

Since \(E(e|S)=0\) in (4.1), S and SZ are exogenous regressors in (4.1). Suppose D is exogenous, which always holds for SRD, and holds for FRD under \(COR(\varepsilon ,e|S)=0\). Let \(m(S)=\beta _{0}+\beta _{1}S+\beta _{1z}SZ\) so that

$$\begin{aligned} Y=\beta _{d}D\ +\beta _{0}+\beta _{1}S+\beta _{1z}SZ\ +e; \end{aligned}$$

having \(\beta _{1}S+\beta _{1z}SZ\) in m(S) is equivalent to having \(\beta _{-}S(1-Z)+\beta _{+}SZ\) to allow different slopes around \(S=0\). With the LSE of Y on (D, 1, S, SZ), we can estimate \((\beta _{d},\beta _{0},\beta _{1},\beta _{1z})\).
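As a sketch, this LSE can be computed by ordinary least squares on the four regressors; the data-generating process below is hypothetical and chosen to match the piecewise-linear specification exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
beta_d = 1.5
S = rng.uniform(-1, 1, N)
Z = (S >= 0).astype(float)
D = Z                                              # sharp RD
Y = beta_d * D + 0.2 + 0.4 * S + 0.7 * S * Z + rng.normal(scale=0.5, size=N)

X = np.column_stack([D, np.ones(N), S, S * Z])     # regressors (D, 1, S, SZ)
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
beta_hat = coef[0]                                 # estimate of beta_d
```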

In practice, it is better to use a piecewise cubic (or quartic) m(S) continuous at \(S=0\) because, with m(S) better approximated, the local identification at \(S=0\) can be expanded for a higher ‘external validity’ of RD. For instance, with \(m^{\prime }(S)\) being the derivative of m(S),

$$\begin{aligned} Y= & {} \beta _{d}D\ +\ \beta _{0}+\beta _{1}S+\beta _{2}S^{2}+\beta _{3}S^{3}+\beta _{1z}SZ+\beta _{2z}S^{2}Z+\beta _{3z}S^{3}Z\ +e \nonumber \\&\Longrightarrow \ m^{\prime }(S)=\beta _{1}+2\beta _{2}S+3\beta _{3}S^{2}+\beta _{1z}Z+2\beta _{2z}SZ+3\beta _{3z}S^{2}Z \nonumber \\&\Longrightarrow \ m^{\prime }(0^{-})=\beta _{1}\ \ne \ m^{\prime }(0^{+})=\beta _{1}+\beta _{1z}\,\,\text {whereas}\,\,m(0^{-})=\beta _{0}=m(0^{+}).\nonumber \\ \end{aligned}$$
(4.2)

In practice, LSE is typically done only for SRD, not for FRD. But in FRD with \(D=D(S,\varepsilon )\), so long as \(COR(\varepsilon ,e|S)=0\), LSE can be applied; there is no reason to apply IVE in this case, because IVE is inefficient compared with LSE. The efficiency gain of the LSE can be substantial, because \(\varepsilon \) provides extra variation in D beyond the “default” variation due to S. Hence it is recommended to do a (‘Hausman’) test of equality of the LSE and IVE; if equality is not rejected, use the LSE. The test was also considered in Bertanha and Imbens (2014).

4.2 IVE for endogenous treatment

Suppose D is endogenous. To be an instrument for D in (4.1), Z should meet three conditions: (i) excluded from the Y equation, (ii) \( COR(Z,e)=0\), and (iii) included in the D equation. Condition (i) cannot be tested, but it is plausible because Z (i.e., the cutoff) should have no direct bearing on Y. Condition (ii) holds automatically due to \(E(e|S)=0\). Condition (iii) can be verified by the LSE of D on (1, Z) (or on (1, W, Z) if W is also used as a regressor): a significant slope of Z indicates (iii). Note that, differently from \(COR(Z,e)=0\) holding by construction in RD, for IVE in general, zero correlation between the instrument and the model error term is not verifiable (and thus can only be argued for).

Before discussing IVE, we start with a more intuitive ‘nonparametric ratio estimator’ for \(\beta _{d}\) in (3.1) that replaces the one-sided limits with local sample averages. Define \(N^{+}\equiv \sum _{i}1[S_{i}\in (0,h)]\) and \(N^{-}\equiv \sum _{i}1[S_{i}\in (-h,0)]\), and

$$ \begin{aligned} \hat{E}(Y|0^{+})\equiv \frac{1}{N^{+}}\sum _{i}Y_{i}1\big [S_{i}\in (0,h)\big ] \; \& \; \hat{E}(Y|0^{-})\equiv \frac{1}{N^{-}}\sum _{i}Y_{i}1\big [S_{i} \in (-h,0)\big ]. \end{aligned}$$

Define \(\hat{E}(D|0^{+})\) and \(\hat{E}(D|0^{-})\) analogously with Y replaced by D. The ratio estimator is

$$\begin{aligned} \hat{\beta }_{d}\equiv \frac{\hat{E}(Y|0^{+})-\hat{E}(Y|0^{-})}{\hat{E} (D|0^{+})-\hat{E}(D|0^{-})}. \end{aligned}$$

Call \(\hat{\beta }_{d}\) the ‘local-constant regression (LCR) estimator’, for a reason that will become clear shortly.

The ratio estimator equals the slope IVE applied to the “artificial” linear model

$$\begin{aligned} Y=\beta _{d}D+\beta _{0}+error \end{aligned}$$
(4.3)

with Z instrumenting for D. The equality follows from the ‘Wald estimator form’ of IVE (e.g., Lee 2005, p. 137), as was noted in Hahn et al. (2001, p. 206). In this sense, the IVE is a sample version of (3.1). If we apply LSE to (4.3), then the slope LSE equals \(\hat{E}(Y|0^{+})-\hat{E}(Y|0^{-})\); in this sense, the LSE is also a sample version of (3.1) when the denominator of (3.1) is one.
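The numerical equality of the Wald ratio and the slope IVE for (4.3) with a binary Z can be verified directly; the simulated data below are arbitrary, since the equality is an algebraic identity that holds in any sample.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5_000
S = rng.uniform(-0.5, 0.5, N)                      # local sample, h = 0.5
Z = (S >= 0).astype(float)
D = (rng.uniform(size=N) < 0.3 + 0.5 * Z).astype(float)
Y = 2.0 * D + rng.normal(size=N)

# Wald form: ratio of local mean differences, as in (3.1)
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
# slope IVE in Y = beta_d D + beta_0 + error, with Z instrumenting for D
ive = np.cov(Z, Y)[0, 1] / np.cov(Z, D)[0, 1]
```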

Unfortunately, LCR’s finite-sample bias is large. To see this, suppose \(\beta _{d}=0\) and Y is generated by

$$\begin{aligned} Y=\beta _{0}+\beta _{1}S+U\,\,\text {with}\,\,\beta _{1}>0. \end{aligned}$$

Since Y is linearly increasing at \(S=0\), the ‘left average’ of the Y’s over \( S\in (-h,0)\) is smaller than the ‘right average’ over \(S\in (0,h)\), resulting in \(\hat{\beta }_{d}>0=\beta _{d}\). To overcome this problem, ‘local linear regression (LLR)’ was proposed by Hahn et al. (2001): fit a line over each local region \((-h,0)\) and (0, h) to obtain the two lines’ heights at \(S=0\); the difference of the two heights is the estimated effect. To see the LLR idea better, consider

$$ \begin{aligned}&Y=\beta _{d}D+\beta _{0}+\beta _{1}S+U\,\,\text {with}\,\,\beta _{1}>0 \\&\quad \Longrightarrow Y=\beta _{0}+\beta _{1}S+U\,\,\text {for}\,\,S\in (-h,0)\,\, \& \,\,Y=\beta _{d}+\beta _{0}+\beta _{1}S+U\,\,\text {for}\,\,S\in (0,h). \end{aligned}$$

The height at \(S=0\) from the left model is \(\beta _{0}\) and that from the right model is \(\beta _{d}+\beta _{0}\); the difference is thus \(\beta _{d}\).

Although this may look complicated, there exists a simple IVE that gives the same numerical result, as was noted in Imbens and Lemieux (2008, p. 627): apply IVE to

$$\begin{aligned} Y=\beta _{d}D+\beta _{0}+\beta _{-}S(1-Z)+\beta _{+}SZ+error \end{aligned}$$
(4.4)

where D is instrumented by Z. The usual IVE standard errors can be used for inference.

The artificial linear model (4.4) for LLR is a refined version of (4.3) for LCR: m(S) is approximated by a linear spline in the former while all but ignored with \(m(S)\simeq \beta _{0}\) in the latter. If cubic terms are used in (4.4), then we get a ‘local cubic regression’ that is the same as the IVE for (4.2) with Z as an instrument for D. But local polynomial regressions other than LLR tend to be too variable. Hence there are essentially two practical RD estimators: LSE applied to (4.2) (or its lower/higher-order version), and IVE applied to (4.4).
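A sketch of the IVE for (4.4): since the model is just-identified (instruments Z, 1, \(S(1-Z)\), SZ for regressors D, 1, \(S(1-Z)\), SZ), the estimator solves \(W^{\prime }X\beta =W^{\prime }Y\). The data-generating design below, with D endogenous through a latent error correlated with U, is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20_000
beta_d = 2.0
S = rng.uniform(-0.5, 0.5, N)                       # local sample
Z = (S >= 0).astype(float)
U = rng.normal(size=N)
V = 0.6 * U + 0.8 * rng.normal(size=N)              # correlated with U: D endogenous
D = (0.8 * Z + V > 0.25).astype(float)              # fuzzy: P(D=1) jumps at 0
Y = beta_d * D + 1.0 + 0.5 * S * (1 - Z) + 0.8 * S * Z + U

X = np.column_stack([D, np.ones(N), S * (1 - Z), S * Z])   # regressors
W = np.column_stack([Z, np.ones(N), S * (1 - Z), S * Z])   # instruments
coef = np.linalg.solve(W.T @ X, W.T @ Y)                   # just-identified IVE
beta_ive = coef[0]
```

LSE applied to the same equation would be biased upward here, because D and U are positively correlated by construction.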

4.3 Relationship between LLR and IVE and remarks

For a bandwidth h and a kernel K (e.g., \(K(\cdot )\) is the N(0, 1) density), minimize

$$ \begin{aligned} \sum _{i=1}^{N}(D_{i}-\tau _{0}-\tau _{1}S_{i})^{2}K\Big (\frac{S_{i}}{h}\Big )1[0<S_{i}]\; \& \; \sum _{i=1}^{N}(Y_{i}-\rho _{0}-\rho _{1}S_{i})^{2}K\Big (\frac{S_{i}}{h}\Big )1[0<S_{i}] \end{aligned}$$

for \((\tau _{0},\tau _{1})\) and \((\rho _{0},\rho _{1})\). Let the minimizers for the intercepts \(\tau _{0}\) and \(\rho _{0}\) be \(\hat{\tau }_{0}^{+}\) and \( \hat{\rho }_{0}^{+}\). Define \(\hat{\tau }_{0}^{-}\) and \(\hat{\rho }_{0}^{-}\) analogously with \(1[0<S_{i}]\) replaced by \(1[S_{i}<0]\). Then LLR is

$$\begin{aligned} \hat{\beta }_{d,LLR}\equiv \frac{\hat{\rho }_{0}^{+}-\hat{\rho }_{0}^{-}}{\hat{ \tau }_{0}^{+}-\hat{\tau }_{0}^{-}}. \end{aligned}$$

If the uniform kernel \(K(t)=1[|t|<1]/2\) is used, then \(\hat{\beta }_{d,LLR}\) is the same as the IVE for (4.4). Otherwise (i.e., if another kernel is used), \(\hat{\beta }_{d,LLR}\) differs from the IVE for (4.4). The asymptotic variance of \(\hat{\beta }_{d,LLR}\) with a general kernel K is involved, as can be seen in Hahn et al. (2001) and Imbens and Lemieux (2008). To avoid this problem, Otsu et al. (2015) proposed empirical-likelihood-based RD estimators, where confidence intervals for \(\beta _{d}\) are drawn using pivotal asymptotic \(\chi ^{2}\) distributions of likelihood ratios.
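A sketch of \(\hat{\beta }_{d,LLR}\) with a Gaussian kernel (both the kernel choice and the data-generating process are illustrative assumptions): kernel-weighted least squares on each side of the cutoff yields the boundary intercepts, and the effect estimate is the ratio of the intercept differences.

```python
import numpy as np

def side_intercept(S, V, h, right):
    """Kernel-weighted local linear fit of V on S on one side of 0;
    returns the fitted height (intercept) at S = 0."""
    m = S > 0 if right else S < 0
    s, v = S[m], V[m]
    w = np.sqrt(np.exp(-0.5 * (s / h) ** 2))         # sqrt of Gaussian weights
    X = np.column_stack([np.ones_like(s), s])
    coef, *_ = np.linalg.lstsq(X * w[:, None], v * w, rcond=None)
    return coef[0]

rng = np.random.default_rng(5)
N, h = 20_000, 0.3
S = rng.uniform(-1, 1, N)
D = (S >= 0).astype(float)                           # sharp case for simplicity
Y = 1.2 * D + 0.5 * S + rng.normal(scale=0.4, size=N)

num = side_intercept(S, Y, h, True) - side_intercept(S, D * 0 + Y, h, False) if False else \
      side_intercept(S, Y, h, True) - side_intercept(S, Y, h, False)
den = side_intercept(S, D, h, True) - side_intercept(S, D, h, False)
beta_llr = num / den                                 # den = 1 in this sharp case
```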

While practitioners use almost exclusively the above LSE and IVE/LLR estimators for RD, there are other estimators as well in the literature (Porter 2003; Calonico et al. 2014; Yu 2015). Also, Calonico et al. (2014) and Feir et al. (2015) addressed better RD estimator inference using bias correction and weak instrument approach, respectively.

If observed covariates W are to be controlled, \(\beta _{w}^{\prime }W\) or \(\gamma _{-}^{\prime }W(1-Z)+\gamma _{+}^{\prime }WZ\) can be added to (4.2) and (4.4), respectively, where \(\beta _{w}\) and the \(\gamma \)’s are parameters. Alternatively, residuals such as \(Y-\hat{\beta }_{W}^{\prime }W\) may be used instead of Y, which essentially nullifies the presence of W. Using the residuals keeps the data dimension low so that we can focus on (residual, Z, D, S). Graphical tools to be explained below can then be easily applied, as W no longer appears explicitly.

In nonparametrics, series approximation as in (4.2) is dubbed a ‘global approach’ where the smoothing degree depends on the order of the approximating function. In contrast, a kernel estimator with a smoothing parameter h is dubbed ‘local’. Such a distinction, however, becomes blurry in RD, because RD estimation is done using only local observations around the cutoff. The equivalence of the IVE for (4.4) using a series approximation (of the first order) to the LLR when the uniform kernel is used corroborates this point.

4.4 Bandwidth choice and summary

So far, we have not discussed how to choose the bandwidth h. For this, note that a local-constant nonparametric kernel estimator for \( E(Y|S=S_{i})\) without using \(Y_{i}\) is

$$\begin{aligned} \hat{E}_{-i}(Y|S_{i},h)\equiv \frac{\sum _{j\ne i}K\big ((S_{j}-S_{i})/h\big )Y_{j}}{ \sum _{j\ne i}K\big ((S_{j}-S_{i})/h\big )} \end{aligned}$$

where \(\sum _{j\ne i}\) is the sum over \(j=1,\ldots ,N\) except the ith observation. There is no single best way to choose h, but a good rule of thumb in practice is \(h=SD(S)N^{-1/5}\), and a more systematic way is to use ‘cross-validation (CV)’ as follows.

The usual CV chooses h by minimizing

$$\begin{aligned} \sum _{i}\big \{Y_{i}-\hat{E}_{-i}(Y|S_{i},h)\big \}^{2}. \end{aligned}$$

\(Y_{i}\) can be used to estimate \(E(Y|S_{i})\), but CV is based on the idea of predicting \(Y_{i}\) with an estimator, as this minimand shows, and it would be “silly” to use \(Y_{i}\) in predicting \( Y_{i}\); this is why \(\hat{E}_{-i}(Y|S_{i},h)\) is used in CV. When there might be a break in E(Y|S) at \(S=0\), \(\hat{E}_{-i}(Y|S_{i},h)\) can be replaced by

$$\begin{aligned} \frac{\sum _{j\ne i}K\big ((S_{j}-S_{i})/h\big )1[S_{j}<S_{i}<0]Y_{j}}{\sum _{j\ne i}K\big ((S_{j}-S_{i})/h\big )1[S_{j}<S_{i}<0]}\,\,\text {or}\,\,\frac{\sum _{j\ne i}K\big ((S_{j}-S_{i})/h\big )1[0<S_{i}<S_{j}]Y_{j}}{\sum _{j\ne i}K\big ((S_{j}-S_{i})/h\big )1[0<S_{i}<S_{j}]}\nonumber \\ \end{aligned}$$
(4.5)

depending on whether \(S_{i}<0\) or \(0<S_{i}\). The idea is simple: if \(S_{i}<0\), then only the left observations with \(S_{j}<S_{i}\) are used; if \(0<S_{i}\), only the right observations with \(S_{i}<S_{j}\).
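This one-sided CV can be sketched as follows with the uniform kernel (a direct O(N²) implementation; the candidate bandwidth grid is an arbitrary choice):

```python
import numpy as np

def onesided_cv(S, Y, h_grid):
    """Leave-one-out CV with one-sided uniform-kernel means as in (4.5):
    Y_i is predicted only from same-side observations farther from the cutoff."""
    crits = []
    for h in h_grid:
        sse = 0.0
        for i in range(len(S)):
            use = (S < S[i]) if S[i] < 0 else (S > S[i])   # same side, farther out
            use &= np.abs(S - S[i]) < h                    # uniform kernel window
            if use.any():
                sse += (Y[i] - Y[use].mean()) ** 2
        crits.append(sse)
    return h_grid[int(np.argmin(crits))]

rng = np.random.default_rng(6)
N = 400
S = rng.uniform(-1, 1, N)
Y = 0.7 * (S >= 0) + np.sin(2 * S) + rng.normal(scale=0.3, size=N)

h_star = onesided_cv(S, Y, [0.05, 0.1, 0.2, 0.4, 0.8])
```

A too-small h inflates the variance of each leave-one-out prediction, while a too-large h inflates its one-sided bias; the CV criterion trades the two off.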

In (4.5), a local-constant version is used; the LLR version of the first term of (4.5) is the intercept estimator from minimizing, with respect to \((\rho _{0},\rho _{1})\),

$$\begin{aligned} \sum _{j\ne i}\big \{Y_{j}-\rho _{0}-\rho _{1}(S_{j}-S_{i})\big \}^{2}\cdot K\big (\frac{ S_{j}-S_{i}}{h}\big )1[S_{j}<S_{i}<0]. \end{aligned}$$

This CV scheme was used by Ludwig and Miller (2007). Imbens and Kalyanaraman (2012) considered a variation of the CV scheme (p. 944), and more importantly, they suggested a theoretically optimal choice of h as reviewed in the appendix. Calonico et al. (2014) proposed a bias-correction approach for RD inference, where the extra variability induced by the bias-correcting term is taken into account; they also discussed choosing optimal bandwidths.

In summary, for SRD, apply LSE to (4.2) or its lower/higher-order version, and use the LSE standard errors for inference. For FRD, apply LSE to (4.2) or its lower/higher-order version if \(\varepsilon \) in D is unlikely to make D endogenous; otherwise, do LLR with the uniform kernel, which equals the IVE applied to (4.4); for inference, use the LSE/IVE standard errors. For the bandwidth, use \(h=SD(S)N^{-1/5}\), or the h chosen by the above CV method; do all estimation locally using only the observations with \(S\in (-h,h)\), although we write “\(\sum _{i=1}^{N}\)” for simplicity.

5 Specification tests

There are various specification tests for RD, but the most important ones are those for breaks of E(Y|S) and E(D|S), and those for the continuity of m(S).

5.1 Conditional mean breaks

Most RD studies present a graph plotting E(Y|S) versus S to demonstrate the break of E(Y|S); no break means \(\beta _{d}=0\) or that Z has no explanatory power for D (\(\alpha _{z}=0\) in (3.2)). For FRD, E(D|S) versus S is also shown because \(E(D|0^{+})-E(D|0^{-})\ne 0\) is necessary. These informal graphical presentations can be formalized into the following LSE-based tests.

Consider an artificial linear model analogous to (4.4):

$$\begin{aligned} D=\zeta _{z}Z+\zeta _{0}+\zeta _{-}S(1-Z)+\zeta _{+}SZ+error \end{aligned}$$
(5.1)

where the \(\zeta \)’s are parameters. Applying LSE to this, a non-zero slope of Z indicates a break of E(D|S) at \(S=0\), for which the LSE standard errors can be used. The logic is that \( \zeta _{z}\) equals the intercept break \(E(D|0^{+})-E(D|0^{-})\) due to Z, whereas the base intercept is picked up by 1 and the slopes are accounted for by the last two regressors.
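The LSE-based break test can be sketched as follows (simulated data, classical standard errors for simplicity): the coefficient on Z estimates the break \(\zeta _{z}\), and its t-statistic tests for the break.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 8_000
S = rng.uniform(-1, 1, N)
Z = (S >= 0).astype(float)
# E(D|S) jumps by 0.5 at the cutoff
D = (rng.uniform(size=N) < 0.25 + 0.5 * Z).astype(float)

X = np.column_stack([Z, np.ones(N), S * (1 - Z), S * Z])   # (5.1) regressors
coef, *_ = np.linalg.lstsq(X, D, rcond=None)
resid = D - X @ coef
sigma2 = resid @ resid / (N - X.shape[1])                  # error variance
se_z = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])      # SE of the Z slope
t_z = coef[0] / se_z                                       # test statistic for the break
```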

As for a break in E(Y|S) at \(S=0\), consider an artificial linear model analogous to (5.1):

$$\begin{aligned} Y=\xi _{z}Z+\xi _{0}+\xi _{-}S(1-Z)+\xi _{+}SZ+error \end{aligned}$$
(5.2)

where the \(\xi \)’s are parameters. A non-zero slope LSE of Z indicates a break of E(Y|S) at \(S=0\). The LSE-based tests with (5.1) and (5.2) seem unknown in the literature despite their simplicity.

Although LSE applied to (5.1) and (5.2) can test for a break at the known point 0, it is possible that a break occurs somewhere else. Seeing a break where it is not supposed to be suggests a misspecification, and checking for breaks over a range of S around 0 can be done with the difference of one-sided kernel regression estimators. For E(Y|s), plot

$$\begin{aligned} \tilde{L}_{Y}(s)\equiv \frac{\sum _{i}K\big ((S_{i}-s)/h\big )1[s<S_{i}]Y_{i}}{ \sum _{i}K\big ((S_{i}-s)/h\big )1[s<S_{i}]}-\frac{\sum _{i}K\big ((S_{i}-s)/h\big )1[S_{i}<s]Y_{i} }{\sum _{i}K\big ((S_{i}-s)/h\big )1[S_{i}<s]} \end{aligned}$$
(5.3)

versus s. \(\tilde{L}_{Y}(s)\) has been used to find structural breaks in statistics [e.g., Qiu (2005)], but hardly so in the RD literature; for structural breaks in general, see, e.g., Breitung and Kruse (2013), Ciuperca (2014), and the references therein.
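A sketch of computing \(\tilde{L}_{Y}(s)\) over a grid (Gaussian kernel; the data-generating process, with a single break of 0.8 at \(s=0\), is an illustrative assumption):

```python
import numpy as np

def L_tilde(S, Y, s, h):
    """One-sided kernel-mean difference (5.3) at evaluation point s."""
    K = np.exp(-0.5 * ((S - s) / h) ** 2)
    wr, wl = K * (S > s), K * (S < s)
    return (wr @ Y) / wr.sum() - (wl @ Y) / wl.sum()

rng = np.random.default_rng(8)
N, h = 10_000, 0.1
S = rng.uniform(-1, 1, N)
Y = 0.8 * (S >= 0) + 0.3 * S + rng.normal(scale=0.3, size=N)

grid = np.linspace(-0.5, 0.5, 11)                  # grid[5] is the true cutoff 0
curve = [L_tilde(S, Y, s, h) for s in grid]        # spikes near s = 0 only
```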

Porter and Yu (2015) estimated the unknown cutoff by maximizing \( \tilde{L}_{Y}(s)^{2}\) with respect to s for SRD where the local-constant estimators are replaced by local polynomial estimators; for FRD, \(\tilde{L} _{Y}(s)^{2}\) plus the analogous expression with Y replaced by D is used. They found that the cutoff estimator is super-consistent, converging at the rate N instead of the usual \(\sqrt{N}\). Hence the estimated cutoff is as good as the true cutoff, and the estimation does not affect the asymptotic distribution of the treatment effect estimator.

Since \(\tilde{L}_{Y}(s)\) is computed for different values of s, the appropriate bandwidth may differ from the h chosen for \(s=0\) only. The most practical way to choose h for \(\tilde{L}_{Y}(s)\) is “eye-balling”: choose h so that the \(\tilde{L}_{Y}(s)\) graph is neither too smooth nor too jagged. As for inference, the asymptotic distribution of \(\tilde{L}_{Y}(s)\) may be derived, but confidence bands based on nonparametric bootstrap resampling from the original sample with replacement would be adequate in practice. To detect breaks in E(D|s), replace Y in \(\tilde{L}_{Y}(s)\) with D.

Turning to the continuity of m(S), since m(S) comes from the conditional means of the ignored covariates W and U—we will use only U to denote errors in the rest of this paper—the continuity of E(W|S) and E(U|S) should be checked out. We can test for breaks in E(W|S) by the LSE to (5.1) with D replaced by W; Urquiola and Verhoogen (2009) showed an example where E(W|S) is not continuous at \(S=0\). As for the continuity of E(U|S), it is discussed next.

5.2 Score-density continuity at cutoff

Since U is not observed, the continuity of E(U|s) cannot be seen. Instead, necessary conditions can be tested. Observe

$$\begin{aligned} E(U|s)=\int uf_{U|S}(u|s)\mathrm{d}u=\int u\frac{f_{S|U}(s|u)}{f_{S}(s)}f_{U}(u)\mathrm{d}u \end{aligned}$$

where \(f_{U|S}\) denotes U|S density and \(f_{S}\) denotes S density; \( f_{S|U}\) and \(f_{U}\) are defined analogously. This shows that the continuity of \(f_{S|U}(s|u)/f_{S}(s)\) at \(s=0\) is necessary for the continuity of E(U|s). Since the continuity of \( f_{S|U}(s|U)\) at \(s=0\) cannot be tested, only the continuity of \(f_{S}(s)\) is to be checked out.

To enhance understanding, recall Example 1 with a discrete U taking on \(-1\) and 1. Suppose that U stands for ‘socializing well’ and persons with \(U=-1\) try extra hard to get \(D=1\) to make up for their lower income due to \(U=-1\); let \(\pi \equiv P(U=-1)\). Also suppose

$$\begin{aligned} f_{S|U}(s|-1)=1[0\le s<1],f_{S|U}(s|1)=\phi (s)\,\,\text {and }\,\, \phi \text { is a density continuous at }0 : \end{aligned}$$

those with \(U=-1\) attain the scores in [0, 1] whereas those with \(U=1\) have scores well spread around 0 with \(\phi \). Then

$$\begin{aligned}&f_{S}(s)=f_{S|U}(s|-1)\pi +f_{S|U}(s|1)(1-\pi )=\pi 1[0\le s<1]+(1-\pi )\phi (s) \\&\Longrightarrow \ f_{S}(0^{-})=(1-\pi )\phi (0), f_{S}(0^{+})=\pi +(1-\pi )\phi (0) \Longrightarrow f_{S}(0^{+})-f_{S}(0^{-})=\pi . \end{aligned}$$

The break of \(f_{S}\) occurs because those with \(U=-1\) manipulated their S to “perfection”. Now suppose that \( f_{S|U}(s|-1)\) is continuous at 0 but tilted heavily to the right of 0: those with \(U=-1\) could not perfectly manipulate S although they could to a large extent. Then, \(f_{S}(0^{+})-f_{S}(0^{-})=0\). In principle, RD does not require \(f_{S}\) to be continuous at 0, but it is this concern for manipulating S that prompts testing for the continuity.
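The example can be simulated in a few lines. Below, those with \(U=-1\) (fraction \(\pi=0.3\)) place their score exactly in [0, 1), so \(f_{S}\) jumps by \(\pi\) at 0; one-sided uniform-kernel density estimates on each side of 0 then recover \(\pi\). The sample size and bandwidth are illustrative.

```python
import numpy as np

# U = -1 types (fraction pi) "perfectly manipulate" S into [0,1);
# U = 1 types have S ~ N(0,1), whose density phi is continuous at 0.
rng = np.random.default_rng(1)
N, pi = 200_000, 0.3
U = np.where(rng.uniform(size=N) < pi, -1, 1)
S = np.where(U == -1, rng.uniform(0, 1, N), rng.normal(0, 1, N))

h = 0.05
f_right = np.mean((0 <= S) & (S < h)) / h      # estimate of f_S(0^+)
f_left  = np.mean((-h < S) & (S < 0)) / h      # estimate of f_S(0^-)
pi_hat  = f_right - f_left                     # identifies pi = P(U = -1)
print(round(pi_hat, 2))
```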

The above example, which is a generalized version of an example in Kim and Lee (2016), shows that we can identify \(\pi \) with \( f_{S}(0^{+})-f_{S}(0^{-})\). Gerard et al. (2015) called the individuals with \(U=-1\) “manipulators”, and assuming that all manipulators have \(S\ge 0\), they indeed identified \(\pi \) using \(f_{S}(0^{+})\) and \( f_{S}(0^{-})\). Declaring ‘the treatment effect on the non-manipulators’ as the main parameter of interest, they went on to bound the effect, and proposed how to estimate the bound and conduct inference.

The easiest way to see the break of \(f_{S}\) at 0 is constructing a histogram for \(f_{S}\) such that 0 becomes a boundary point. A smoothed version of the histogram is

$$\begin{aligned} \tilde{f}_{S}(s)\equiv \frac{1}{Nh}\sum _{i}2K\big (\frac{S_{i}-s}{h}\big )\big \{1[S_{i}<s<0]+1[0<s<S_{i}]\big \}; \end{aligned}$$

the part \(\{\cdot \}\) is to use only the left observations (\(S_{i}<s\)) when \( s<0\), and only the right observations (\(s<S_{i}\)) when \(0<s\). With the uniform kernel \(K((S_{i}-s)/h)=1[|S_{i}-s|<h]/2\), \(\tilde{f}_{S}(s)\) becomes

$$ \begin{aligned} \hat{f}_{S}(s)\equiv \frac{1}{Nh}\sum _{i}1[s-h<S_{i}<s] \quad \text {if}\,s<0 \; \& \; \frac{1}{Nh}\sum _{i}1[s<S_{i}<s+h] \quad \text {if}\,0<s. \end{aligned}$$
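The display above is straightforward to implement. As a sanity check on simulated standard-normal scores with no break, the left and right estimates at 0 roughly agree; the bandwidth is an illustrative choice.

```python
import numpy as np

# hat_f_S(s): uniform-kernel density estimate using only same-side
# observations, so that 0 stays a boundary point.
rng = np.random.default_rng(2)
S = rng.normal(0, 1, 50_000)
h = 0.1

def f_hat(S, s, h):
    if s < 0:
        return np.mean((s - h < S) & (S < s)) / h   # left observations only
    return np.mean((s < S) & (S < s + h)) / h       # right observations only

gap = f_hat(S, 1e-9, h) - f_hat(S, -1e-9, h)        # break estimate at 0
print(round(gap, 3))
```

With no true break, `gap` is near 0 and both one-sided estimates are near the N(0,1) density value 0.399 at 0.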

Whereas \(\tilde{f}_{S}(s)\) and \(\hat{f}_{S}(s)\) are constructed with a break at 0 in mind, we may want to explore unknown break locations. This can be done by plotting

$$\begin{aligned} \tilde{J}(s)\equiv \frac{1}{Nh}\sum _{i}2K\Big (\frac{S_{i}-s}{h} \Big )\big \{1[s<S_{i}]-1[S_{i}<s]\big \} \end{aligned}$$
(5.4)

versus s, which is a “density analog” of \(\tilde{L}_{Y}(s)\); like \( \tilde{L}_{Y}(s)\), \(\tilde{J}(s)\) is little known in the literature. Choosing h and doing inference can be done analogously to what was done for \(\tilde{L}_{Y}(s)\).
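A sketch of (5.4) scanning for density breaks at unknown locations: the simulated density jumps by 0.4 at \(s=0.5\), and N, h and the grid are illustrative.

```python
import numpy as np

# Simulated scores: density 0.4 on (-1, 0.5) and 0.8 on (0.5, 1),
# so f_S has a break of 0.4 at the unknown location 0.5.
rng = np.random.default_rng(3)
N, h = 50_000, 0.05
low = rng.uniform(size=N) < 0.6
S = np.where(low, rng.uniform(-1, 0.5, N), rng.uniform(0.5, 1, N))

def tilde_J(S, s, h):
    """(5.4): right-sided minus left-sided Gaussian-kernel density estimates at s."""
    K2 = 2 * np.exp(-0.5 * ((S - s) / h) ** 2) / np.sqrt(2 * np.pi)
    return (K2 * ((S > s).astype(float) - (S < s))).sum() / (len(S) * h)

grid = np.arange(-0.8, 0.851, 0.01)
J = np.array([tilde_J(S, s, h) for s in grid])
b_hat = grid[np.argmax(np.abs(J))]   # estimated break location, near 0.5
print(round(b_hat, 2))
```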

Whereas the test with \(\tilde{f}_{S}(s)\) is of a “local-constant” variety, McCrary (2008) suggested a two-stage LLR-type test for the break of \(f_{S}\) at 0, as reviewed in the appendix. In the first stage, estimate \(f_{S}\) around \(s=0\) over n intervals using \(\hat{f}_{S}\); let \(s_{1},\ldots ,s_{n}\) be the centers of the histogram intervals. In the second stage, taking \(\{s_{j},\hat{f} _{S}(s_{j})\}\), \(j=1,\ldots ,n\) as data (analogous to \((X_{j},Y_{j})\), \( j=1,\ldots ,n \)), do LLR on the negative and positive sides separately to see whether the two estimated lines meet at \(s=0\). Despite its popularity, however, the test requires two bandwidths (one for \(\hat{f}_{S}\) and the other for the LLR), which is a disadvantage. Also, the asymptotic distribution was derived only for the ‘triangular kernel’ \(K(t)=(1-|t|)1[|t|<1]\), although it would certainly be possible to use other kernels for the test.
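The two stages just described can be sketched as follows, where a plain histogram with 0 as a bin boundary plays the role of the first-stage density estimate; the bin width, bandwidth and simulated break of 0.3 are illustrative, and no standard error is computed.

```python
import numpy as np

# Simulated scores: f_S = 0.35 on (-1,0) and 0.65 on (0,1), a break of 0.3 at 0.
rng = np.random.default_rng(9)
N = 100_000
heavy = rng.uniform(size=N) < 0.65
S = np.where(heavy, rng.uniform(0, 1, N), rng.uniform(-1, 0, N))

b = 0.02                                            # first-stage bin width
edges = np.arange(-1, 1 + b / 2, b)                 # 0 is a bin boundary
counts, _ = np.histogram(S, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2
fhat = counts / (N * b)                             # first-stage density heights

def llr_at_zero(c, f, h):
    """Triangular-kernel weighted linear fit of f on c, evaluated at 0."""
    w = np.maximum(1 - np.abs(c) / h, 0)
    W = np.sqrt(w[w > 0])
    M = np.column_stack([np.ones((w > 0).sum()), c[w > 0]])
    coef = np.linalg.lstsq(M * W[:, None], f[w > 0] * W, rcond=None)[0]
    return coef[0]                                  # intercept = fitted value at 0

h = 0.2                                             # second-stage bandwidth
gap = llr_at_zero(centers[centers > 0], fhat[centers > 0], h) \
    - llr_at_zero(centers[centers < 0], fhat[centers < 0], h)
print(round(gap, 2))
```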

Otsu et al. (2013) proposed empirical-likelihood-based methods to construct confidence intervals (CI) for \(f_{S}(0^{+})-f_{S}(0^{-})\). The methods allow popular kernels including the triangular kernel, and by constructing CI’s using the likelihood ratio that follows a \(\chi ^{2}\) distribution asymptotically, the methods obviate the need to estimate the asymptotic variance.

6 RD topics

So far, we examined RD essentials. In practice, there also arise other issues going beyond the “basic” RD. This section reviews those RD topics.

6.1 Multiple scores

In a typical RD, a single score crossing a cutoff affects D non-trivially. There are, however, many RD cases where multiple scores are involved in determining a single treatment. For instance, multiple test scores crossing cutoffs may be required for school graduation, and both age and pension-contribution years must cross cutoffs to receive a pension.

‘Multiple-score RD for a single treatment’ that is examined in this subsection differs from ‘single-score RD with multiple cutoffs’ as in Van der Klaauw (2002) and Angrist and Lavy (1999), which is easily handled by looking at each cutoff one at a time. The treatments at the multiple cutoffs may be ordered, and as such, they fall in the domain of ‘multi-valued (or multiple) treatments’ (see Imbens (2000), Lechner (2001), Uysal (2015), and the references therein), and are either to be taken as such, or weight-averaged to come up with a single representative effect (Bertanha 2015). Multiple-score RD for a single treatment differs also from ‘multiple-score RD for multiple treatments’ (Leuven et al. 2007; Papay et al. 2011) where each score dictates one treatment.

When there are two scores \(S_{1}\) and \(S_{2}\)—the dominating case in multiple-score RD—two cases arise: the ‘OR case’ where either score can cross its cutoff to get treated (Jacob and Lefgren 2004; Matsudaira 2008), and the ‘AND case’ where both scores should cross their cutoffs to get treated (Schmieder et al. 2012; Caliendo et al. 2013). But an OR case can be converted to an AND case simply by “flipping” the treatment, i.e., by relabeling the treatment and control groups. Hence, dealing with the AND case is enough in two-score RD. For more than two scores, mixed cases can arise; e.g., given three scores, crossing only two of the cutoffs may be enough to get treated.

Three ways have appeared to reduce multiple-score RD to single-score RD. First, multiple scores are combined to form a single score/index as in Van der Klaauw (2002). Second, when \(D=1[0\le S_{1},0\le S_{2}]\), D becomes \(1[0\le S_{1}]\) on the subpopulation with \(0\le S_{2}\), in which case the familiar toolkit for single-score RD can be used; this has been the dominant approach (Jacob and Lefgren 2004; Lalive 2008; Schmieder et al. 2012; Caliendo et al. 2013, etc.). This dimension-reduction strategy also works analogously for more than two scores. Third, defining \(S_{m}\equiv \min (S_{1}/\sigma _{1},S_{2}/\sigma _{2})\) for some scale-normalizing constants \(\sigma _{1}\) and \(\sigma _{2}\) (Battistin et al. 2009; Clark and Martorell 2014), D becomes \(1[0\le S_{m}]\) or a fuzzy version of \( 1[0\le S_{m}]\), and we can set up the following to apply LSE or IVE:

$$\begin{aligned} E(Y|S_{m})=\beta _{m}D+\beta _{0}+\beta _{-}S_{m}(1-Z)+\beta _{+}S_{m}Z. \end{aligned}$$

This can be easily generalized to J scores with \(S_{m}=\min (S_{j}/\sigma _{j}\), \(j=1,\ldots ,J)\).
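A sketch of the min-score reduction for a sharp two-score AND case, with \(\sigma_{1}=\sigma_{2}=1\) and LSE of the display above within a window around 0; the DGP (true \(\beta_{m}=2\)), window and sample size are illustrative.

```python
import numpy as np

# Sharp AND-case: D = 1[0 <= S1, 0 <= S2] = 1[0 <= S_m] with S_m = min(S1, S2).
rng = np.random.default_rng(4)
N = 20_000
S1, S2 = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)
D = ((S1 >= 0) & (S2 >= 0)).astype(float)
Y = 2.0 * D + S1 + S2 + rng.normal(0, 0.2, N)        # true effect beta_m = 2

Sm = np.minimum(S1, S2)                              # S_m = min(S1/sigma1, S2/sigma2)
Z = (Sm >= 0).astype(float)                          # Z = 1[0 <= S_m]; D = Z here (sharp)
w = np.abs(Sm) <= 0.2                                # local window around the cutoff
X = np.column_stack([D, np.ones(N), Sm * (1 - Z), Sm * Z])[w]
beta = np.linalg.lstsq(X, Y[w], rcond=None)[0]
b_d = beta[0]                                        # estimate of beta_m
print(round(b_d, 2))
```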

Instead of these, Imbens and Zajonc (2009) dealt with both multiple-score SRD and FRD head-on: they discussed identification and estimation with multidimensional LLR. The main complication with multiple scores is the appearance of a boundary instead of a cutoff. For instance, with B denoting the treatment and control boundary, the treatment effect at \(s\in B\) for FRD is

$$\begin{aligned} \beta _{d}(s)\equiv \frac{\lim _{\nu \rightarrow 0}E\{Y|S\in N_{\nu }^{+}(s)\}-\lim _{\nu \rightarrow 0}E\{Y|S\in N_{\nu }^{-}(s)\}}{\lim _{\nu \rightarrow 0}E\{D|S\in N_{\nu }^{+}(s)\}-\lim _{\nu \rightarrow 0}E\{D|S\in N_{\nu }^{-}(s)\}} \end{aligned}$$

where \(N_{\nu }^{+}(s)\) denotes the ‘\(\nu \)-treated-neighborhood’ of s and \(N_{\nu }^{-}(s)\) the ‘\(\nu \)-control-neighborhood’ of s. Imbens and Zajonc also proposed an integrated version of \(\beta _{d}(s)\):

$$\begin{aligned} \beta _{d}\equiv \int \limits _{s\in B}\beta _{d}(s)f_{S}(s|S\in B)\mathrm{d}s=\frac{ \int _{s\in B}\beta _{d}(s)f_{S}(s)\mathrm{d}s}{\int _{s\in B}f_{S}(s)\mathrm{d}s }. \end{aligned}$$

Tests for the effect heterogeneity along B, the asymptotic distribution of the LLR, and the optimal bandwidth choice are also shown in Imbens and Zajonc (2009).

Wong et al. (2013) examined two-score OR-case SRD, and Keele and Titiunik (2015) two-score AND-case SRD where the two scores are latitude and longitude. These studies were done independently of Imbens and Zajonc (2009), but essentially, they are special cases of Imbens and Zajonc (2009). Choi and Lee (2015) also examined two-score SRD, but their study differs from the others because it allows ‘partial effects’, described next, which were ruled out by Imbens and Zajonc (2009), Wong et al. (2013) and Keele and Titiunik (2015).

Suppose that a student has to pass both the math exam (\(Z_{1}=1\)) and the English exam (\(Z_{2}=1\)) to graduate from high school (\(D=Z_{1}Z_{2}=1\)). Then \(Z_{1}\) and \(Z_{2}\) may affect Y (say, lifetime income) separately from D. In a case like this, one may postulate

$$\begin{aligned} Y=\beta _{0}+\beta _{1}Z_{1}+\beta _{2}Z_{2}+\beta _{d}D+m(S_{1},S_{2})+U \end{aligned}$$

and estimate the partial effects \(\beta _{1}\) and \(\beta _{2}\) along with \( \beta _{d}\). A simple estimator is obtained by approximating \(m(S_{1},S_{2})\) polynomially (or with splines). Choi and Lee (2015) also proposed a nonparametric estimator that takes the form of a local ‘difference in differences’, which is natural because D is the interaction of \(Z_{1}\) and \(Z_{2}\).

6.2 Incompletely observed or discrete score

Sometimes we observe only an error-ridden score S, instead of the genuine score G determining D; e.g., \(S=G+V\) for an error V, where S is the income reported in a survey and G is the true income. This is a measurement-error (or errors-in-variables) problem in the RD score. In terms of (3.1), we have the desired ratio (DR) and the available ratio (AR):

$$ \begin{aligned} DR\equiv \frac{E(Y|G=0^{+})-E(Y|G=0^{-})}{E(D|G=0^{+}) -E(D|G=0^{-})} \quad \& \quad AR\equiv \frac{E(Y|S=0^{+})-E(Y|S=0^{-})}{ E(D|S=0^{+})-E(D|S=0^{-})}. \end{aligned}$$

Bear in mind that only G is observed with error while D and Y are fully observed, and that D is determined by G, not by S; if S itself determines D, then there is no identification problem—a point misunderstood in the literature for a while (see Cappelleri et al. (1991) and the references therein). Yu (2012) examined identification and local polynomial estimation for various measurement-error SRD and FRD cases, depending on whether D is determined by G or S, and on whether the measurement error V shrinks toward zero or not.

First, for a continuously distributed error V, suppose (‘\(\amalg \)’ stands for independence)

$$\begin{aligned} S=G+V\,\,\text {where}\,\,V\amalg G. \end{aligned}$$
(6.1)

This is a classical “full errors-in-variables” setup, as S is a smooth-error-ridden version of G and ‘\( V\amalg G\)’ holds. Lee (2016) showed that E(D|S) has no break at \(S=0\) under (6.1), as V smooths G out, and thus (3.1) fails. This is a “hopeless” case, unless one is willing to impose strong assumptions as in Hullegie and Klein (2010), where D is private health insurance, Y is health care utilization, G is the true income and S is a self-reported income. In contrast to (6.1) with continuous G and V, Pei (2011) assumed that G and V are discrete with bounded supports to nonparametrically identify the distribution of G and the treatment effect.

Yanagi (2014) considered SRD after rewriting (6.1) as \(S=G+\sigma V^{\prime }\) with \(SD(V^{\prime })=1\). Yanagi (2014) (as well as Yu 2012) considered

$$\begin{aligned} \tau _{DS}\equiv E(Y|D=1,S=0^{+})-E(Y|D=0,S=0^{-}) \end{aligned}$$

which is more informative than \(E(Y|S=0^{+})-E(Y|S=0^{-})\) because G is used (although not localized) through D. Yanagi (2014) found, with \(\tau _{G}\equiv E(Y|G=0^{+})-E(Y|G=0^{-})\),

$$\begin{aligned} \tau _{DS}=\tau _{G}+\sigma ^{2}\times (\text {entity depending on }S,D,Y) \end{aligned}$$

which characterizes the identification error \(\tau _{DS}-\tau _{G}\). It is possible to fully identify \(\tau _{G}\) if auxiliary data are available for \( \sigma \). For example, a survey on income gives S whereas \(\sigma \) is announced by the government based on a census.

Second, for an unobserved binary variable R, suppose

$$\begin{aligned} S=RG+(1-R)(G+V): \end{aligned}$$
(6.2)

G is observed when \(R=1\), and an error-ridden score \(G+V\) is observed otherwise; the “part errors-in-variables” model (6.2) occurred in Battistin et al. (2009) and Schanzenbach (2009). Differently from (6.1), there are “truth-tellers” (\(G=S\)) for whom G is observed.

Lee (2016) showed

$$\begin{aligned} AR=\frac{E(Y|G=S=0^{+})-E(Y|G=S=0^{-})}{E(D|G=S=0^{+})-E(D|G=S=0^{-})}: \end{aligned}$$

although DR is not identified, the effect on the “truthful margin” \(G=S=0\) is identified. The condition ‘\(G=S\)’ is reminiscent of ‘compliers’ in Imbens and Angrist (1994). If \( (D,Y)\amalg (R,V)|G\) holds additionally, as assumed by Battistin et al. (2009), then \((D,Y)\amalg S|G\) holds and S thus drops out of the last display to render \(AR=DR\).

Third, suppose \(S=0,1,2,\ldots \) is a grouped transformation of G:

$$\begin{aligned} S=\sum _{j}1[G\ge \gamma _{j}]\ \Longleftrightarrow \ S=j\,\,\text { iff }\,\,G\in [\gamma _{j},\gamma _{j+1})\,\, \text {for some known }\gamma _{j} \,\,\text {'s}. \end{aligned}$$
(6.3)

This occurs with yearly rounded-down age (\(S=\sum _{j\ge 1}1[G\ge j]\)), or with income recorded in groups due to confidentiality. Dong (2015) listed many applied papers for (6.3) while proposing an LSE-based estimator under the uniform distribution for the rounding error.
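The transformation (6.3) with \(\gamma_{j}=j\) (yearly rounding-down of age) is easy to verify numerically: \(S=\sum_{j}1[G\ge j]\) equals the rounded-down G, and \(S=j\) exactly when \(G\in [j,j+1)\). The values of G below are purely illustrative.

```python
import numpy as np

G = np.array([0.2, 17.9, 40.0, 40.7, 65.3])      # illustrative true ages
gammas = np.arange(1, 101)                       # known cutpoints gamma_1, gamma_2, ...
S = (G[:, None] >= gammas[None, :]).sum(axis=1)  # S = sum_j 1[G >= gamma_j]
print(S)                                         # matches the rounded-down ages
```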

Related to (6.3), Lee and Card (2008) addressed the pure discrete score case \(S=G\) with no measurement error. For this, unless one “settles” for nearby support points of the cutoff, extension/interpolation toward the cutoff from those points is inevitable, which requires a parametric assumption on \(m(\cdot )\) in (2.1). Lee and Card (2008) suggested testing the parametric model specification against the saturated nonparametric model, and using a cluster variance estimator for inference because the observations at each support point would share the same specification error.

Fourth, a general score model for both continuous and discrete components is

$$\begin{aligned} S=RG\ +\ (1-R)\sum _{j}1[G\ge \gamma _{j}]. \end{aligned}$$
(6.4)

‘Heaping’ (i.e., probability masses although the score is supposed to be continuous), as in Almond et al. (2010), Almond (2011) and Barreca et al. (2011, 2016), is an example. Heaping can occur for many reasons, so that the part next to \(1-R\) can take a form other than \(\sum _{j}1[G\ge \gamma _{j}]\); e.g., a matter of practice (retiring at 60, working 40 h per week, ...), limited precision in measurement, top coding, etc.

One identification strategy facing (6.4) is conditioning on \(R=1\); this works when there is no selection problem, i.e., \(R\amalg (G,D,Y)\). The opposite strategy is conditioning on \(R=0\) to turn to (6.3), and then the estimates obtained under \(R=1\) and \(R=0\) may be weight-averaged to come up with a single effect. Barreca et al. (2016) recommended plotting disaggregated data so as not to miss heaping features, and estimating a model for a covariate W such as

$$\begin{aligned} W=\alpha _{0}+\alpha _{1}1[S=\gamma ]+\alpha _{2}(S-\gamma )+error\,\,\text {where }\,\,\gamma \,\,\text { is a heaping point} \end{aligned}$$

to test whether \(\alpha _{1}=0\); if \(\alpha _{1}\ne 0\), then W differs systematically at the heaping point \(\gamma \). For instance, with S being birth weight and W income, poor district hospitals may use scales of poor precision, resulting in heaping, in which case \(\alpha _{1}<0\). A systematic difference in W may suggest the same for unobserved variables, resulting in the aforementioned selection problem.

6.3 Regression kink (RK)

Recalling (2.1), in the ‘regression kink (RK)’ design, \(\nabla E(D|S)\) is assumed to be discontinuous at 0 whereas \(\nabla m(S)\) is continuous. For instance, Kim and Lee (2016) made use of the fact that the marginal income tax rate has breaks/jumps at income (S) cutoffs in the income tax schedule, which implies that the average tax rate has slope breaks at those cutoffs because it is based on the integral of the marginal tax rate. Using this, Kim and Lee (2016) estimated the labor supply elasticity with respect to after-tax income.

Simonsen et al. (2015) estimated the price elasticity of prescription drug demand using RK, where the price that each individual faces differs depending on the accumulated prescription drug purchase amount S: if \(S+ \bar{P}\) crosses a threshold c where \(\bar{P}\) is the shelf price for the current purchase, then the out-of-pocket price D changes from \(\bar{P}\) to a subsidized price (e.g., to \(0.5\bar{P}\)). This results in D decreasing linearly as a function of S over the range \(c-\bar{P}\le S<c\), and slope breaks occur, going in and out of this range.

Define the right and left derivative at 0:

$$\begin{aligned} \nabla E(Y|0^{+})\equiv & {} \lim _{\nu \rightarrow 0^{+}}\frac{E(Y|S=\nu )-E(Y|S=0)}{\nu }, \\ \nabla E(Y|0^{-})\equiv & {} \lim _{\nu \rightarrow 0^{+}}\frac{ E(Y|S=0)-E(Y|S=-\nu )}{\nu }. \end{aligned}$$

The difference of the two one-sided derivatives of (2.1) at 0 is, as \( \nabla m(0^{+})=\nabla m(0^{-})\),

$$\begin{aligned}&\nabla E(Y|0^{+})-\nabla E(Y|0^{-})=\beta _{d}\cdot \big \{\nabla E(D|0^{+})-\nabla E(D|0^{-})\big \} \nonumber \\&\quad \Longrightarrow \ \beta _{d}=\frac{\nabla E(Y|0^{+})-\nabla E(Y|0^{-})}{ \nabla E(D|0^{+})-\nabla E(D|0^{-})}. \end{aligned}$$
(6.5)

As in RD, sharp RK (SRK) refers to D determined only by S, and fuzzy RK (FRK) refers to \(D=D(S,\varepsilon )\). The derivation leading to (6.5) for SRK was shown in Nielsen et al. (2010), p. 214, where the denominator in (6.5) becomes a known constant. For instance, \(D=\alpha _{s}ZS \) for a parameter \(\alpha _{s}\ne 0\) makes the denominator of (6.5) \( \alpha _{s}-0=\alpha _{s}\), although E(D|S) is continuous at 0.
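A sketch of (6.5) for SRK on simulated data: the slope break of E(Y|S) at 0 is divided by the slope break of E(D|S), each estimated by one-sided LSE. The DGP takes \(D=2ZS\) (so the denominator is 2) with \(\beta_{d}=1.5\); the window and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20_000
S = rng.uniform(-1, 1, N)
Z = (S >= 0).astype(float)
D = 2.0 * Z * S                                     # E(D|S) continuous, kinked at 0
Y = 1.5 * D + S + rng.normal(0, 0.1, N)             # m(S) = S is smooth at 0

def slope(x, y):
    """LSE slope of y on (1, x)."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

h = 0.3
right, left = (0 < S) & (S < h), (-h < S) & (S < 0)
num = slope(S[right], Y[right]) - slope(S[left], Y[left])   # slope break of E(Y|S)
den = slope(S[right], D[right]) - slope(S[left], D[left])   # slope break of E(D|S) = 2
b_d = num / den                                              # estimate of beta_d
print(round(b_d, 2))
```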

For estimation of FRK, Card et al. (2012) showed that \(\beta _{d}\) can be estimated by the slope \(\hat{\eta }_{1}^{\Delta }\) of D in the IVE applied to (\(\eta \)’s are parameters)

$$\begin{aligned} Y=\eta _{0}+\eta _{1}S+\eta _{1}^{\Delta }D+error \end{aligned}$$
(6.6)

with D instrumented by ZS (not by Z); the published version (Card et al. 2015) of Card et al. (2012) discussed only local polynomial estimators. For sharp RK, apply LSE to (6.6).
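The IVE of (6.6) with ZS as the instrument can be sketched as follows; since the instruments equal the regressors in number, 2SLS reduces to solving \((W'X)b=W'Y\). The DGP (true \(\beta_{d}=2\), D endogenous) is illustrative, and the model is kept exactly linear so a local window is omitted.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 20_000
S = rng.uniform(-1, 1, N)
Z = (S >= 0).astype(float)
e = rng.normal(0, 1, N)
D = Z * S + 0.3 * S + 0.5 * e                 # kinked and endogenous treatment
Y = 1.0 + 0.5 * S + 2.0 * D + 0.8 * e + rng.normal(0, 0.1, N)

X = np.column_stack([np.ones(N), S, D])       # regressors of (6.6)
W = np.column_stack([np.ones(N), S, Z * S])   # instruments: D instrumented by ZS
b_ive = np.linalg.solve(W.T @ X, W.T @ Y)[2]  # IVE of eta_1^Delta = beta_d
b_lse = np.linalg.lstsq(X, Y, rcond=None)[0][2]
print(round(b_ive, 2), round(b_lse, 2))       # LSE is biased upward here
```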

In RK, \(\beta _{d}\) represents the effect of D on Y, not the derivative of the effect; it is just that we identify \(\beta _{d}\) using the derivatives in RK. For instance, \(Y=\beta _{d}D+m(S)+U\) may hold where E(D|S) and m(S) are subject to the above RK conditions. In contrast, letting the cutoff be c and writing the RD effect at c as \(\beta _{d}(c)\) , Dong and Lewbel (2015) looked at the derivative \(\beta _{d}^{\prime }(c)\) for a number of reasons. First, we may be interested in the effect constancy to test \(H_{0}:\beta _{d}^{\prime }(c)=0\). Second, the sign of \(\beta _{d}^{\prime }(c)\) will tell whether the effect will become smaller or larger for those with S a little smaller or larger than c. Third, \(\beta _{d}^{\prime }(c)\) shows the effect of changing the cutoff c. Fourth, knowing derivatives expands the RD external validity by extrapolating the estimated effect away from the cutoff.

An estimator for \(\beta _{d}(c)\) in SRK can be an estimator for \( \beta _{d}^{\prime }(c)\) in SRD, because both are sample versions of \(\nabla E(Y|c^{+})-\nabla E(Y|c^{-})\). An estimator for \(\beta _{d}(c)\) in FRK, however, cannot be an estimator for \(\beta _{d}^{\prime }(c)\) in FRD in general, because the former is a sample version of (6.5) whereas the latter is a sample version of the derivative of (3.1) with cutoff c. A better understanding on this can be gained by the following.

Equation (6.5) and the ensuing estimation in (6.6) are based on the premise that there is no break in E(D|S) at c. What if there is a break in E(D|S) at c and yet we use (6.5)? This question was addressed for binary D by Dong (2014), who showed

$$\begin{aligned} \frac{\nabla E(Y|c^{+})-\nabla E(Y|c^{-})}{\nabla E(D|c^{+})-\nabla E(D|c^{-})}=\beta _{d}(c)+\beta _{d}^{\prime }(c)\frac{E(D|c^{+})-E(D|c^{-}) }{\nabla E(D|c^{+})-\nabla E(D|c^{-})}. \end{aligned}$$

This reveals how the RK estimand (6.5) with cutoff c is related to \(\beta _{d}(c)\) and \(\beta _{d}^{\prime }(c)\). If \(\beta _{d}^{\prime }(c)=0\), then both (6.5) and RD estimate the same parameter \(\beta _{d}(c)\). This opens up the possibility of using both RD and RK in estimating \(\beta _{d}(c)\). Since RD estimators are asymptotically more efficient than RK estimators, there will be no asymptotic gain in using both and combining them, although there would be some gain in finite samples. Dong (2014) suggested IVE with all of S, Z and SZ as instruments for D under \(\beta _{d}^{\prime }(c)=0\). The resulting IVE works if there is a break in either E(D|S) or \(\nabla E(D|S)\).

6.4 Quantile RD and external validity

So far we examined mean-based RD, and one may wonder whether there exists a quantile-based RD. Indeed, Frandsen et al. (2012) proposed one. Although we have not used ‘potential treatments/responses’ up to now because RD can be explained without them, we now introduce them to explicate Frandsen et al. (2012). Consider a fuzzy D, and imagine potential treatments \((D^{0},D^{1})\) corresponding to \(Z=0,1\) and potential responses \((Y^{0},Y^{1})\) corresponding to \(D=0,1\); the relationship between \((D^{0},D^{1})\) and Z is analogous to that between \((Y^{0},Y^{1})\) and D. Given these, Frandsen et al. (2012) proposed ‘complier quantile effects’ as follows.

Observe

$$\begin{aligned} E(Y^{1}|\text {complier})=\frac{E(YD|Z=1)-E(YD|Z=0)}{E(D|Z=1)-E(D|Z=0)} \end{aligned}$$
(6.7)

which was proven by Abadie (2002), where ‘compliers’ are those with \( (D^{0}=0,D^{1}=1)\). Replacing Y in (6.7) with \(1[Y\le y]\) and then localizing at \(S=0\) gives \(F_{Y^{1}|Complier,S=0}(y)\), where \( F_{Y^{1}|Complier,S=0}(\cdot )\) denotes the distribution function of \(Y^{1}\) given \((complier,S=0)\), from which quantiles can be found. Replacing D with \(1-D\) in (6.7) and proceeding analogously gives \( F_{Y^{0}|Complier,S=0}(y)\) and its quantiles. Then the corresponding quantile differences give the complier quantile effects.
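A sketch of the construction: (6.7) with \(1[Y\le y]\) in place of Y gives the complier distribution of \(Y^{1}\), and replacing D with \(1-D\) (which flips the sign of the ratio) gives that of \(Y^{0}\). The localization at \(S=0\) is omitted for brevity, and the DGP (complier effect equal to 1 at every quantile, complier share 0.6) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 40_000
typ = rng.choice(np.array(["at", "nt", "co"]), size=N, p=[0.2, 0.2, 0.6])
Z = rng.integers(0, 2, N)
D0 = typ == "at"                                   # always takers: D0 = D1 = 1
D1 = typ != "nt"                                   # never takers: D0 = D1 = 0
D = np.where(Z == 1, D1, D0)
Y0 = rng.normal(0, 1, N) + 2.0 * (typ == "at") - 1.0 * (typ == "nt")
Y1 = Y0 + 1.0                                      # unit effect at every quantile
Y = np.where(D, Y1, Y0)

pc = D[Z == 1].mean() - D[Z == 0].mean()           # complier share, about 0.6
grid = np.linspace(-4, 5, 901)
ind = Y[:, None] <= grid[None, :]                  # 1[Y <= y] over a y-grid
n1, n0 = (Z == 1).sum(), (Z == 0).sum()
F1 = (ind[(Z == 1) & D].sum(0) / n1 - ind[(Z == 0) & D].sum(0) / n0) / pc
F0 = (ind[(Z == 0) & ~D].sum(0) / n0 - ind[(Z == 1) & ~D].sum(0) / n1) / pc
q1 = grid[np.argmax(F1 >= 0.5)]                    # complier median of Y^1
q0 = grid[np.argmax(F0 >= 0.5)]                    # complier median of Y^0
qte = q1 - q0                                      # complier median effect, about 1
print(round(qte, 2))
```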

For SRD with \(D=Z\), under the continuity of \(E(Y^{0}|S)\) at 0, (3.1) becomes ‘the effect on the just treated’:

$$\begin{aligned} \beta _{d}=E(Y|0^{+})-E(Y|0^{-})=E(Y^{1}|0^{+})-E(Y^{0}|0^{-})=E(Y^{1}-Y^{0}|0^{+}). \end{aligned}$$
(6.8)

Denote a quantile of Y|S as q(Y|S). In quantile RD, what can be identified is not \(q(Y^{1}-Y^{0}|0^{+})\) as in (6.8), but only \( q(Y^{1}|0^{+})-q(Y^{0}|0^{+})\) at best. This restriction should be borne in mind; see Lee (2000), however, for a limited scope to find \( q(Y^{1}-Y^{0}|0^{+})\).

In contrast to (6.8), due to Imbens and Angrist (1994), (3.1) for FRD becomes the effect on ‘the just treated compliers’:

$$\begin{aligned} \beta _{d}=E(Y^{1}-Y^{0}|0^{+},\text {complier}) \end{aligned}$$
(6.9)

Hence, the RD identification only on the cutoff/margin 0 in (6.8) becomes further restricted to the “marginal compliers”.

In an effort to increase the external validity of RD, Bertanha and Imbens (2014) explored whether the identified RD effect is independent of the individual type, such as complier, ‘always taker’ (\(D^{0}=D^{1}=1\)) and ‘never taker’ (\(D^{0}=D^{1}=0\)). They recommended plotting \(E(Y|D=0,S)\) around \(S=0\) to see whether the untreated compliers (\(D=0,S=0^{-}\)) differ from the never takers (\(D=0,S=0^{+})\); also, an analogous plot of \(E(Y|D=1,S)\) will reveal the difference between the always takers (\(D=1,S=0^{-}\)) and the treated compliers (\(D=1,S=0^{+}\)) around \(S=0\). If there is no difference, then the qualifier ‘complier’ may be dropped from (6.9) for external validity across the individual types.

Angrist and Rokkanen (2015) found another way to enhance the external validity of RD. To see the idea, consider SRD with \(D=Z\). They assumed the existence of covariates X such that (we use \(\amalg \) for simplicity, although Angrist and Rokkanen (2015) used mean independence)

$$\begin{aligned} Y^{d}\amalg S|X\ \Longleftrightarrow \ Y^{d}\amalg \eta |X,\quad \text {with}\quad S\text { (hence }D\text {) determined by }(X,\eta )\text { for an error } \eta \text {:} \end{aligned}$$

the potential responses are independent of the unobserved part of S given X, which implies \(Y^{d}\amalg D|X\). Then

$$\begin{aligned} E(Y|S,X,D=d)=E(Y|X,D=d),\quad d=0,1. \end{aligned}$$

This implication can be tested, e.g., by parametrizing \(E(Y|S,X,D=d)=\beta _{ds}S+\beta _{dx}^{\prime }X\) and checking if \(\beta _{ds}=0\). More important is that

$$\begin{aligned}&E\big \{E(Y|X,D=1)-E(Y|X,D=0)\ |S=s\big \}\quad \text { (under }0<E(D|X)<1\text {)} \nonumber \\&\quad =E\big \{E(Y^{1}|X)-E(Y^{0}|X)\ |S=s\big \}\quad \text {(due to }Y^{d}\amalg D|X \text {)} \nonumber \\&\quad =E\big \{E(Y^{1}|S,X)-E(Y^{0}|S,X)\ |S=s\big \}=E(Y^{1}-Y^{0}|S=s). \end{aligned}$$
(6.10)

Hence \(E(Y^{1}-Y^{0}|S=s)\), not just the effect at the cutoff, is identified by (6.10). Angrist and Rokkanen (2015) also considered FRD after strengthening \(Y^{d}\amalg S|X\) to \((Y^{0},Y^{1},D^{0},D^{1})\amalg S|X\); for FRD, in essence, ‘complier with \(S=s\)’ replaces \(S=s\) in the conditioning set.
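The testable implication mentioned above, \(\beta_{ds}=0\) in the parametrization \(E(Y|S,X,D=d)=\beta_{ds}S+\beta_{dx}^{\prime}X\), can be checked by LSE within each D group. The DGP below satisfies \(Y^{d}\amalg S|X\) by construction (\(S=X+\eta\) with \(\eta\) independent of the outcome error); all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 20_000
X = rng.normal(0, 1, N)
S = X + rng.normal(0, 1, N)              # S determined by (X, eta)
D = (S >= 0).astype(float)               # SRD with D = Z = 1[0 <= S]
Y = X + 1.0 * D + rng.normal(0, 0.5, N)  # Y^d = X + d + u, so Y^d _||_ S | X

coefs = {}
for d in (0.0, 1.0):
    g = D == d
    M = np.column_stack([np.ones(g.sum()), S[g], X[g]])
    coefs[d] = np.linalg.lstsq(M, Y[g], rcond=None)[0][1]   # beta_ds in group D = d
print({d: round(c, 3) for d, c in coefs.items()})           # both near 0
```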

7 Empirical illustration

In this section, we apply some of the RD estimators and tests introduced so far to the data set used in Angrist and Lavy (1999) to estimate the effects of class size on students’ achievement; S is the number of enrolled students. This example may not be ideal because S is discrete (integer-valued), but it also reveals problems in RD, which is not necessarily a bad thing.

Fig. 1 Class size versus enrolment

The ‘Maimonides rule’ in Israeli public schools limits class size to 40; if there are 41 students, then there should be two classes; if 81, then there should be three classes, and so on. Hence 40/41, 80/81, 120/121, ... are break points. But actual class sizes did not exactly follow the rule. Figure 1 shows the actual and predicted class sizes by the Maimonides rule around the cutoff 40/41. The actual class size tends to be smaller than the predicted size (thus FRD), as some schools could afford to add classes before reaching the limit.

Table 1 Data description

The data were collected in June 1991 on enrolment S, class size D, proportion of disadvantaged students W and average test score Y for mathematics and reading. Angrist and Lavy (1999) used third to fifth grade samples and several cutoffs, but we will focus only on \(c=40/41\) for fifth grade. The unit of observation is a class, and \(N=1127\); the descriptive statistics are in Table 1. The single covariate ‘proportion disadvantaged’ W reflects family income level. See Angrist and Lavy (1999) for details on the data.

Table 2 Class size effect

The effects of class size on test scores are in Table 2. The left half is for LSE to (4.2) and IVE to (4.4), whereas the right half adds \( \beta _{w}W\) to (4.2) and \(\gamma _{w-}W(1-Z)+\gamma _{w+}WZ\) to (4.4) to explicitly account for W; this is done because E(W|S) has a break at 42, as will be seen shortly. We evaluate the effects over \(S\in [39,42]\) because many breaks occur over these points, which can be due either to the discrete nature of S or to genuine breaks in \(E(\cdot |S)\). The bandwidth was chosen by CV using (4.5).

In Table 2, LSE is significantly positive for all S, with the effect ranging over 0.50–0.89 before W is used; once W is used, however, the effect drops to 0.15–0.49. These positive effects are counter-intuitive, and are likely due to the endogeneity of D through \(\varepsilon \). IVE is immune to this endogeneity problem, and it gives negative effects ranging over \(-\)0.30 to \(-\)2.5 before W is used; about a half of them are significant. But once W is used, IVE shrinks to \(-\)0.21 to \(-\)0.69, and the effects on reading are significant for all S. These changes due to W have two explanations: the omitted W may make D endogenous (this affects only LSE), and W has a break near c (this affects both LSE and IVE). Notice that the SE of IVE is 2–4 times greater than the SE of LSE, and that the SEs of both LSE and IVE drop when W is controlled.

Table 3 LSE-based break tests at known points: estimate (SE)

To test for breaks of \(E(\cdot |S)\), we apply LSE to (5.2) and (5.1) for Y and D to estimate the slope of Z, with the same bandwidth as used for Table 2; we also test for breaks in E(W|S). Due to the discreteness of S and the fuzziness in D shown in Fig. 1, the LSE-based tests indicate many breaks; the test results over \(S\in [37,45]\) are presented in Table 3. There, E(D|S) has breaks at many points, with the largest at 40; E(Math|S) also has many breaks, with the largest at 42 and 43; E(Read|S) has a couple of breaks, with the largest at 40 and 42; finally, E(W|S) has breaks at 42, 43 and 44, which may make both LSE and IVE inconsistent if W is unaccounted for. It is not clear yet how this multiple-break problem should be handled; at the least, breaks at nearby points around c should be checked out.

Figure 2 presents \(\tilde{L}_{\text {Math}}(s)\), \(\tilde{L}_{\text {Read }}(s)\), \(\tilde{L}_{D}(s)\) and \(\tilde{L}_{W}(s)\) in (5.3), where the N(0, 1) density kernel is used with the rule-of-thumb bandwidth \(h=SD(S)N^{-1/5}\); in each graph, the two dashed lines form (point-wise) \(95\,\%\) confidence intervals obtained by the nonparametric bootstrap. Although the nonparametric bootstrap may look ad hoc, judging by the s at which the confidence intervals exclude 0, Fig. 2 agrees well with Table 3 in terms of the break locations.

Fig. 2 Conditional mean break versus enrolment

Fig. 3 Density break versus enrolment

For the continuity of \(f_{S}\), we obtained \(\tilde{J}(s)\) in (5.4). Since the observation unit is a class, not a school, the schools with enrolment larger than 40 appear multiple times in the data, which could generate an artificial break in \(f_{S}\). Hence we picked only one observation randomly from each school to compute \(\tilde{J}(s)\). Figure 3 shows that \(f_{S}\) has multiple breaks around 40, which may be viewed as breaks of E(U|S) that violate the RD premise.

8 Conclusions

We reviewed regression discontinuity (RD) to show its essentials, following the logical order: identification, estimation, and specification tests; some RD topics were also examined and an empirical illustration was provided. Since RD provides local randomization, which is not easily available in other study designs, the applicability of RD to observational data is high so long as the treatment of interest meets (part of) the RD requirement: an underlying continuous variable crosses an institutional/legal cutoff to get treated. This bodes well for RD, as there are many such examples in laws, policies and programs around the world.