Introduction

The “control variable” approach has been used in various nonlinear models to address endogeneity problems.Footnote 1 The purpose of this paper is to examine (i) whether the control variable approach is subject to the bias caused by the many instruments problem, as pointed out by Bekker (1994) for linear models, and if so, (ii) whether the cross-fitting advocated by the modern machine learning type estimatorsFootnote 2 eliminates such bias. The many instruments problem is essentially a problem of bias due to many nuisance parameters, which can be understood through the incidental parameters problem in panel data. Using a pseudo-panel analysis,Footnote 3 we demonstrate that (i) the control variable approach is indeed subject to the many instruments problem; and (ii) cross-fitting does not remove the bias. The negative result arises primarily because the control variable approach essentially uses the fitted value of the endogenous variable, which creates a finite sample bias. The bias and its correction in the control variable approach can in principle be understood from the perspective of the large general literatureFootnote 4 on bias correction, but because our focus is on the situation with a large number of instruments, we use a pseudo-panel analysisFootnote 5 to answer these questions.

The bias of the control variable approach is relatively straightforward to understand. In linear simultaneous equations models, it is well known that 2SLS is equivalent to a control variable estimator; see, e.g., Hausman (1978). It is therefore natural to expect the control variable approach to be subject to the bias problem even in nonlinear models. Whether cross fitting removes the bias is not as immediately obvious, and this is the question we answer using the pseudo-panel analysis in the current paper. In the current section, we explain what cross fitting means in control variable estimation, and why it may be intuitively appealing.

Bias in the 2SLS estimator in finite samples has long been recognized. Nagar (1959) proposed the first estimator to remove this finite sample bias.Footnote 6 As has been recognized more recently, the bias can be especially important under many instruments, a situation that arises often with increased size of data sets, as Bekker (1994) and Hahn and Hausman (2002, 2003) demonstrate and Hansen et al. (2008) explore empirically. The bias problem in 2SLS is important enough that a number of subsequent papers proposed methods to remove the finite sample bias. The Nagar estimator removes bias by analytically adjusting the estimating equation, and it applies only to linear models. For a linear model

$$\begin{aligned} y&=X\theta +\varepsilon \\ X&=Z\pi +\eta \end{aligned}$$

with the instrument matrix Z, the usual 2SLS estimator solves

$$\begin{aligned} 0=\widehat{X}^{\prime }\left( y-X\widehat{b}\right) , \end{aligned}$$
(1)

while Nagar (1959) estimator solves

$$\begin{aligned} 0=\widetilde{X}^{\prime }\left( y-X\widetilde{b}\right), \end{aligned}$$
(2)

where \(\widehat{X}=PX\), \(\widetilde{X}=\left( P-\frac{k}{n-k}Q\right) X\), and \(P=Z\left( Z^{\prime }Z\right) ^{-1}Z^{\prime }\) and \(Q=I-P\) denote the usual projection matrices. Here, n and k denote the number of observations and the number of instruments. Note that Nagar’s bias correction \(\frac{k}{n-k}Q\) is roughly proportional to k/n, which can be understood to be the ratio between the “number of nuisance parameters” and the sample size, where the nuisance parameters here are the first stage OLS coefficients.
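As a quick illustration, the following simulation sketch (ours; the design values n, k, \(\theta \), and the error correlation are arbitrary choices) contrasts the estimator solving (1) with the estimator solving (2) under many instruments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 50          # sample size and number of instruments (assumed values)
theta = 1.0              # true second-stage coefficient
rho = 0.9                # correlation between eps and eta (source of endogeneity)

Z = rng.standard_normal((n, k))
pi = np.full(k, 0.1)
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # P = Z (Z'Z)^{-1} Z'

b_2sls, b_nagar = [], []
for _ in range(500):
    eps = rng.standard_normal(n)
    eta = rho * eps + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    x = Z @ pi + eta
    y = x * theta + eps

    Px = P @ x
    b_2sls.append((Px @ y) / (Px @ x))        # solves (1)

    Qx = x - Px                                # Qx = (I - P)x
    xt = Px - (k / (n - k)) * Qx               # (P - k/(n-k) Q)x
    b_nagar.append((xt @ y) / (xt @ x))        # solves (2)

print("2SLS mean bias :", np.mean(b_2sls) - theta)
print("Nagar mean bias:", np.mean(b_nagar) - theta)
```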

This approach can be motivated by observing that the moment underlying 2SLS is biased

$$\begin{aligned} E\left[ \left( PX\right) ^{\prime }\left( y-X\theta \right) \right] =E\left[ \left( Z\pi +P\eta \right) ^{\prime }\varepsilon \right] =k\sigma _{\varepsilon \eta }, \end{aligned}$$

where \(\sigma _{\varepsilon \eta }\) denotes the covariance between the ith elements of \(\varepsilon \) and \(\eta \), while the moment underlying Nagar’s estimator is unbiased

$$\begin{aligned} E\left[ \left( \left( P-\frac{k}{n-k}Q\right) X\right) ^{\prime }\left( y-X\theta \right) \right] =E\left[ \left( Z\pi +\left( P-\frac{k}{n-k}Q\right) \eta \right) ^{\prime }\varepsilon \right] =0. \end{aligned}$$
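Both calculations follow from the trace identities \(\text{trace}\left( P\right) =k\) and \(\text{trace}\left( Q\right) =n-k\) (a standard step, spelled out here for completeness): for i.i.d. errors with \(E\left[ \varepsilon _{i}\eta _{i}\right] =\sigma _{\varepsilon \eta }\) and exogenous Z,

$$\begin{aligned} E\left[ \eta ^{\prime }P\varepsilon \right] =\sigma _{\varepsilon \eta }\text{trace}\left( P\right) =k\sigma _{\varepsilon \eta },\qquad E\left[ \eta ^{\prime }\left( P-\frac{k}{n-k}Q\right) \varepsilon \right] =\left( k-\frac{k}{n-k}\left( n-k\right) \right) \sigma _{\varepsilon \eta }=0. \end{aligned}$$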

The lack of bias in Nagar’s (1959) moment can be understood from the perspective that the noise in the estimated instrument \(\widehat{X}\) used in the moment (1) for 2SLS is correlated with the error \(\varepsilon \) in the second stage, and that this correlation is eliminated by using the instrument \(\widetilde{X}\). As such, we can understand the cross-fit estimator based on sample splitting as sharing a similar spirit with Nagar’s (1959) estimator. Specifically, we can see that the cross fit estimator, which solves

$$\begin{aligned} 0=\check{X}^{\prime }\left( y-X\check{b}\right) \end{aligned}$$

is based on an unbiased moment, i.e., \(E\left[ \check{X}^{\prime }\left( y-X\theta \right) \right] =0\). Here,

$$\begin{aligned} \check{X}=\left[ \begin{array} [c]{c} \check{x}_{\left( 1\right) ,1}\\ \vdots \\ \check{x}_{\left( 1\right) ,m}\\ \check{x}_{\left( 2\right) ,1}\\ \vdots \\ \check{x}_{\left( 2\right) ,m} \end{array} \right] =\left[ \begin{array} [c]{c} z_{1}^{\prime }\widehat{\pi }_{\left( 2\right) }\\ \vdots \\ z_{m}^{\prime }\widehat{\pi }_{\left( 2\right) }\\ z_{m+1}^{\prime }\widehat{\pi }_{\left( 1\right) }\\ \vdots \\ z_{2m}^{\prime }\widehat{\pi }_{\left( 1\right) } \end{array} \right] , \end{aligned}$$

where \(n=2m\), the sample is split into two equal-sized subsamples, and \(\widehat{\pi }_{\left( 1\right) }\) and \(\widehat{\pi }_{\left( 2\right) }\) are first stage estimators based on the first and second subsamples, respectively.
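In code, constructing \(\check{X}\) amounts to swapping the first stage coefficients across the two halves; a minimal sketch (ours, with arbitrary design values) is:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 500, 40           # half-sample size and number of instruments (assumed)
n = 2 * m
theta, rho = 1.0, 0.9

Z = rng.standard_normal((n, k))
pi = np.full(k, 0.1)
eps = rng.standard_normal(n)
eta = rho * eps + np.sqrt(1 - rho**2) * rng.standard_normal(n)
x = Z @ pi + eta
y = x * theta + eps

# First stage estimated separately on each half ...
pi1, *_ = np.linalg.lstsq(Z[:m], x[:m], rcond=None)
pi2, *_ = np.linalg.lstsq(Z[m:], x[m:], rcond=None)

# ... and the fitted values are swapped across halves, as in the display above
x_check = np.concatenate([Z[:m] @ pi2, Z[m:] @ pi1])

b_cf = (x_check @ y) / (x_check @ x)   # solves the cross-fit moment
print("cross-fit estimate:", b_cf)
```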

We ask whether such an interpretation leads to reasonable inference for nonlinear models. Our conclusion is that it does not. We show that it is in general impossible to remove the bias of the moment equation by manipulating the first stage estimator alone. We do so by considering nonlinear models of endogeneity with many instruments, and showing that the moment equation with the cross fit estimator retains a bias due to nonlinearity and, as a consequence, does not have the desired unbiasedness property.

Pseudo-Panel Model

Our model of interest is a nonlinear model with endogeneity, such as the probit model with endogenous regressors, where

$$\begin{aligned} y_{i}=& {} 1\left( x_{i}\bar{\delta }+\varepsilon _{i}\ge 0\right) ,\\ x_{i}=& {} z_{i}^{\prime }\pi +\eta _{i}, \end{aligned}$$

and \(\left( \varepsilon _{i},\eta _{i}\right) \) have a bivariate normal distribution. The model has a built-in nonlinearity, and therefore, the endogeneity is probably best handled by the control variable approach. In the particular case of probit models, Rivers and Vuong (1988) solved the problem by writing

$$\begin{aligned} y_{i}=& {} 1\left( x_{i}\delta +\rho \left( x_{i}-z_{i}^{\prime }\pi \right) +\zeta _{i}\ge 0\right) ,\\ x_{i}=& {} z_{i}^{\prime }\pi +\eta _{i}, \end{aligned}$$

which generates a consistent estimator as long as \(\left( \varepsilon _{i},\eta _{i}\right) \) have a bivariate normal distribution. (We assume that \(\zeta _{i}\) has a standard normal distribution, i.e., the parameters \(\delta \) and \(\rho \) reflect such normalization.)
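For concreteness, here is a minimal sketch of the Rivers–Vuong two-step procedure (our illustration; the design values are arbitrary, and the probit step is written as an explicit maximum likelihood problem rather than relying on a packaged routine):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5000
delta, rho_err = 1.0, 0.5             # true coefficient; Corr(eps, eta) (assumed)

z = rng.standard_normal((n, 3))
pi = np.array([1.0, 0.5, -0.5])
eta = rng.standard_normal(n)
eps = rho_err * eta + np.sqrt(1 - rho_err**2) * rng.standard_normal(n)
x = z @ pi + eta
y = (x * delta + eps >= 0).astype(float)

# Step 1: the control variable is the first stage OLS residual
pi_hat, *_ = np.linalg.lstsq(z, x, rcond=None)
v_hat = x - z @ pi_hat

# Step 2: probit of y on (x, v_hat) by maximum likelihood
def neg_loglik(b):
    idx = b[0] * x + b[1] * v_hat
    return -np.sum(y * norm.logcdf(idx) + (1 - y) * norm.logcdf(-idx))

res = minimize(neg_loglik, np.zeros(2), method="BFGS")
print("(delta, rho) up to the variance normalization:", res.x)
```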

In order to examine the consequence of many instruments, we adopt the strategy of interpreting the nuisance parameters (due to many instruments) as incidental parameters similar to the fixed effects in panel data. Therefore, we consider a special case that has a panel representation:

$$\begin{aligned} y_{it}=& {} 1\left( x_{it}\delta +\rho \left( x_{it}-\alpha _{i}\right) +\zeta _{it}\ge 0\right) \nonumber \\ x_{it}=& {} \alpha _{i}+\eta _{it}. \end{aligned}$$
(3)

This is a model where the first stage is characterized by n dummy instruments, and \(\alpha _{i}\) denotes the first stage coefficient for the ith dummy instrument, i.e., \(\pi =\left( \alpha _{1},\alpha _{2},\ldots ,\alpha _{n}\right) \); the first stage OLS estimate of \(\alpha _{i}\) is then simply the within-group mean of \(x_{it}\).

The usual two-step estimator can be understood as the method of moments estimator

$$\begin{aligned} 0=& {} E\left[ x_{it}-\alpha _{i}\right] \\ 0=& {} E\left[ \begin{array} [c]{c} m\left( z_{it},\theta ,\alpha _{i}\right) x_{it}\\ m\left( z_{it},\theta ,\alpha _{i}\right) \left( x_{it}-\alpha _{i}\right) \end{array} \right] \end{aligned}$$

where

$$\begin{aligned} m\left( z_{it},\theta ,\alpha _{i}\right) =\frac{y_{it}-\varPhi \left( x_{it}\delta +\rho \left( x_{it}-\alpha _{i}\right) \right) }{\varPhi \left( x_{it}\delta +\rho \left( x_{it}-\alpha _{i}\right) \right) \left[ 1-\varPhi \left( x_{it}\delta +\rho \left( x_{it}-\alpha _{i}\right) \right) \right] } \end{aligned}$$

and \(\varPhi \) denotes the cumulative distribution function of a standard normal distribution.
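In code, the moment function m above is straightforward to evaluate; a minimal sketch (ours):

```python
import numpy as np
from scipy.stats import norm

def m(y_it, x_it, delta, rho, alpha_i):
    """Generalized residual m(z_it, theta, alpha_i) from the display above."""
    index = x_it * delta + rho * (x_it - alpha_i)
    p = norm.cdf(index)
    return (y_it - p) / (p * (1.0 - p))
```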

Bias of Panel Two-Step Estimator

In this section, we review the panel literature and discuss how the bias of the panel data estimator can be interpreted. The framework in this section provides a basis for understanding the problem of the cross-fit estimator presented in Sect. 5.

The two-step estimator in the previous section is a special case of the general nonlinear panel data estimator defined by

$$\begin{aligned} 0=& {} \sum _{t=1}^{M}\underline{v}\left( z_{it},\widehat{\theta },\widehat{\gamma }_{i}\right) \\ 0=& {} \sum _{i=1}^{n}\sum _{t=1}^{M}\underline{u}\left( z_{it},\widehat{\theta },\widehat{\gamma }_{i}\right). \end{aligned}$$

For reasons that will become clearer later, we use the symbol M to denote the time series dimension of the panel data. Hahn and Newey (2004) and Arellano and Hahn (2007, 2016) are among the few who analyzed the finite sample bias from the large n, large T asymptotic approximation point of view. For our purpose, it is useful to make the explicit assumption that the fixed effects \(\gamma \) are multi-dimensional, and that \(\underline{v}\) is of the same dimension as \(\gamma \). We let J denote \(\dim \left( \theta \right) \).

We provide a brief summary of the finite sample bias from the literature. It is convenient to analyze the general panel estimator in terms of the efficient score

$$\begin{aligned} 0=& {} \sum _{t=1}^{M}\underline{v}\left( z_{it},\widehat{\theta },\widehat{\gamma }_{i}\right) \\ 0=& {} \sum _{i=1}^{n}\sum _{t=1}^{M}\underline{U}\left( z_{it},\widehat{\theta },\widehat{\gamma }_{i}\right) \end{aligned}$$

where

$$\begin{aligned} \underline{U}\left( z_{it},\theta ,\gamma _{i}\right)=& {} \underline{u} \left( z_{it},\theta ,\gamma _{i}\right) -\underline{\varDelta }_{i} \underline{v}\left( z_{it},\theta ,\gamma _{i}\right) ,\\ \underline{\varDelta }_{i}=& {} E\left[ \underline{u}_{it}^{\gamma _{i}}\right] E\left[ \underline{v}_{it}^{\gamma _{i}}\right] ^{-1}. \end{aligned}$$

Here, \(E\left[ \underline{v}_{it}^{\gamma _{i}}\right] =E\left[ \partial \underline{v}\left( z_{it},\theta _{0},\gamma _{i0}\right) / \partial \gamma _{i}^{\prime }\right] \) and \(E\left[ \underline{u}_{it} ^{\gamma _{i}}\right] =E\left[ \partial \underline{u}\left( z_{it},\theta _{0},\gamma _{i0}\right) / \partial \gamma _{i}^{\prime }\right] \) are evaluated at the ‘truth’. The distribution of \(\sqrt{nM}\left( \widehat{\theta }-\theta \right) \) is asymptotically normal with variance equal to

$$\begin{aligned} \left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \underline{\mathcal {I}}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} E\left[ \underline{U}_{it}^{2}\right] \right) \left( \left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \underline{\mathcal {I}}_{i}\right) ^{-1}\right) ^{\prime } \end{aligned}$$
(4)

and mean equal to

$$\begin{aligned} \left( \lim _{n\rightarrow \infty }\frac{\sqrt{n}}{\sqrt{M}}\right) \underline{B} \end{aligned}$$
(5)

where

$$\begin{aligned} \underline{B}=\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1} ^{n}\underline{\mathcal {I}}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}\underline{b}_{i}\right) \end{aligned}$$
(6)

for

$$\begin{aligned} \underline{\mathcal {I}}_{i}\equiv & {} -E\left[ \frac{\partial \underline{U}_{it}\left( \theta _{0},\gamma _{i0}\right) }{\partial \theta ^{\prime }}\right] , \end{aligned}$$
(7)
$$\begin{aligned} \underline{b}_{i}=& {} \left( \underline{b}_{i,1},\ldots ,\underline{b} _{i,J}\right) ^{\prime },\nonumber \\ \underline{b}_{i,j}=& {} -\text{trace}\left( \left( E\left[ \underline{v}_{it}^{\gamma _{i}}\right] \right) ^{-1}E\left[ \underline{v} _{it}\underline{U}_{it,j}^{\gamma _{i}}\right] \right) \nonumber \\&+\frac{1}{2}\text{trace}\left( E\left[ \underline{U} _{it,j}^{\gamma _{i}\gamma _{i}}\right] \left( E\left[ \underline{v} _{it}^{\gamma _{i}}\right] \right) ^{-1}E\left[ \underline{v}_{it} \underline{v}_{it}^{\prime }\right] \left( \left( E\left[ \underline{v} _{it}^{\gamma _{i}}\right] \right) ^{-1}\right) ^{\prime }\right) , \end{aligned}$$
(8)

and \(\underline{b}_{i,j}\) and \(\underline{U}_{it,j}\) denote the j-th components of \(\underline{b}_{i}\) and \(\underline{U}_{it}\). In other words, the 1/M bias is given by the formula \(\underline{B}/M\). Here, the 1/M bias denotes the approximate bias of \(\widehat{\theta }\) implied by the asymptotic bias (5) of \(\sqrt{nM}\left( \widehat{\theta }-\theta \right) \). Because the number of fixed effects is equal to n and the sample size is equal to nM, the ratio between the “number of nuisance parameters” and the sample size is 1/M, which is of the same order of magnitude as the bias of 2SLS discussed by Nagar (1959).

Applying this result to the two-step estimation case where \(M=T\) and the fixed effects are scalars,

$$\begin{aligned} 0=& {} \sum _{t=1}^{T}v\left( z_{it},\widehat{\alpha }_{i}\right) \\ 0=& {} \sum _{i=1}^{n}\sum _{t=1}^{T}u\left( z_{it},\widehat{\theta },\widehat{\alpha }_{i}\right) \end{aligned}$$

we have the asymptotic variance of \(\sqrt{nT}\left( \widehat{\theta } -\theta \right) \) equal to

$$\begin{aligned} \left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \mathcal {I}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} E\left[ U_{it}^{2}\right] \right) \left( \left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \mathcal {I}_{i}\right) ^{-1}\right) ^{\prime } \end{aligned}$$

and the approximate bias equal to

$$\begin{aligned} \frac{1}{T}\left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \mathcal {I}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n} \sum _{i=1}^{n}b_{i}\right), \end{aligned}$$
(9)

where

$$\begin{aligned} U\left( z_{it},\theta ,\alpha _{i}\right)=& {} u\left( z_{it},\theta ,\alpha _{i}\right) -\varDelta _{i}v\left( z_{it},\alpha _{i}\right), \end{aligned}$$
(10)
$$\begin{aligned} \varDelta _{i}\equiv & {} \frac{E\left[ u_{it}^{\alpha _{i}}\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }, \end{aligned}$$
(11)
$$\begin{aligned} \mathcal {I}_{i}\equiv & {} -E\left[ \frac{\partial U_{it}}{\partial \theta ^{\prime }}\right], \end{aligned}$$
(12)
$$\begin{aligned} b_{i}=& {} -\frac{E\left[ v_{it}U_{it}^{\alpha _{i}}\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }+\frac{1}{2}\frac{E\left[ U_{it}^{\alpha _{i}\alpha _{i}}\right] E\left[ v_{it}^{2}\right] }{\left( E\left[ v_{it}^{\alpha _{i}}\right] \right) ^{2}}. \end{aligned}$$
(13)

Further Analysis of the Bias Formula

In this section, we analyze the formula (13) in two important models, and show that the bias formula simplifies in linear models but not in the probit model. In Sect. 5, we will use this difference to illustrate why the bias in the probit model cannot be removed by cross fitting.

Because \(E\left[ v_{it}U_{it}^{\alpha _{i}}\right] =E\left[ v_{it} u_{it}^{\alpha _{i}}\right] -\varDelta _{i}E\left[ v_{it}v_{it}^{\alpha _{i} }\right] \), and \(E\left[ U_{it}^{\alpha _{i}\alpha _{i}}\right] =E\left[ u_{it}^{\alpha _{i}\alpha _{i}}\right] -\varDelta _{i}E\left[ v_{it}^{\alpha _{i}\alpha _{i}}\right] \), we can see that the bias formula simplifies if \(\varDelta _{i}=0\) or \(v_{it}^{\alpha _{i}}\) is constant. Under either condition, we can see

$$\begin{aligned} b_{i}=-\frac{E\left[ v_{it}u_{it}^{\alpha _{i}}\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }+\frac{1}{2}\frac{E\left[ u_{it}^{\alpha _{i}\alpha _{i}}\right] E\left[ v_{it}^{2}\right] }{\left( E\left[ v_{it}^{\alpha _{i}}\right] \right) ^{2}}. \end{aligned}$$

The condition \(\varDelta _{i}=0\) is satisfied if \(E\left[ u_{it}^{\alpha _{i} }\right] =0\), i.e., under Neyman orthogonality. The condition that \(v_{it} ^{\alpha _{i}}\) is constant is satisfied if \(v_{it}\) is an affine function of \(\alpha _{i}\).

In order to understand these conditions, consider the panel model with n dummy IV’s

$$\begin{aligned} y_{it}=& {} x_{it}\theta +\varepsilon _{it},\nonumber \\ x_{it}=& {} \alpha _{i}+\eta _{it}. \end{aligned}$$
(14)

If our 2SLS estimator solves

$$\begin{aligned} 0=& {} \sum _{t=1}^{T}\left( x_{it}-\widehat{\alpha }_{i}\right) ,\nonumber \\ 0=& {} \sum _{i=1}^{n}\sum _{t=1}^{T}\widehat{\alpha }_{i}\left( y_{it} -x_{it}\widehat{\theta }\right) , \end{aligned}$$
(15)

we see that \(v_{it}^{\alpha _{i}}=-1\). We also see that \(E\left[ u_{it}^{\alpha _{i}}\right] =E\left[ y_{it}-x_{it}\theta \right] =E\left[ \varepsilon _{it}\right] =0\), so the condition \(\varDelta _{i}=0\) is also satisfied. The 2SLS for the pseudo-panel model is special because \(u_{it}^{\alpha _{i}\alpha _{i}}=0\). This implies that the bias formula is very simple: \(b_{i}= -E\left[ v_{it}u_{it}^{\alpha _{i} }\right] / E\left[ v_{it}^{\alpha _{i}}\right] \). This plays an important role in understanding the properties of split sample cross fitting for 2SLS.
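To spell this out (our own short calculation), (15) has \(u\left( z_{it},\theta ,\alpha _{i}\right) =\alpha _{i}\left( y_{it}-x_{it}\theta \right) \) and \(v\left( z_{it},\alpha _{i}\right) =x_{it}-\alpha _{i}\), so

$$\begin{aligned} u_{it}^{\alpha _{i}}=y_{it}-x_{it}\theta =\varepsilon _{it},\qquad u_{it}^{\alpha _{i}\alpha _{i}}=0,\qquad b_{i}=-\frac{E\left[ \left( x_{it}-\alpha _{i}\right) \varepsilon _{it}\right] }{-1}=E\left[ \eta _{it}\varepsilon _{it}\right] =\sigma _{\varepsilon \eta }, \end{aligned}$$

and the implied 1/T bias is proportional to \(\sigma _{\varepsilon \eta }\), the same covariance that drives the 2SLS bias \(k\sigma _{\varepsilon \eta }\) discussed earlier.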

We should recognize that these conditions are not satisfied for the probit model with endogenous regressors. In fact, the special nature of 2SLS, i.e., \(u_{it}^{\alpha _{i}\alpha _{i}}=0\), can be argued to be an implication of the IV type interpretation of 2SLS. If the 2SLS is instead interpreted as a regression using the fitted value from the first stage as a regressor in the second stage, we see that the 2SLS solves

$$\begin{aligned} 0=& {} \sum _{t=1}^{T}\left( x_{it}-\widehat{\alpha }_{i}\right) ,\nonumber \\ 0=& {} \sum _{i=1}^{n}\sum _{t=1}^{T}\widehat{\alpha }_{i}\left( y_{it} -\widehat{\alpha }_{i}\widehat{\theta }\right) . \end{aligned}$$
(16)

Here, we can easily see that \(u_{it}^{\alpha _{i}\alpha _{i}}\ne 0\) in general. Since the control variable approach in general requires the first stage estimate to be used as a regressor, we should expect \(u_{it}^{\alpha _{i}\alpha _{i}}\ne 0\) to be the typical case.
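Concretely (again our own calculation), the second-stage moment in (16) is

$$\begin{aligned} u\left( z_{it},\theta ,\alpha _{i}\right) =\alpha _{i}\left( y_{it}-\alpha _{i}\theta \right) ,\qquad u_{it}^{\alpha _{i}}=y_{it}-2\alpha _{i}\theta ,\qquad u_{it}^{\alpha _{i}\alpha _{i}}=-2\theta , \end{aligned}$$

which is nonzero whenever \(\theta \ne 0\).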

Split Sample Cross Fit Estimator

In this section, we use the framework of Sect. 3 and analyze the bias of the cross fit estimator after sample splitting. We assume that the sample is split into two, and the estimate of \(\alpha _{i}\) from one subsample is used in the moment u for the other subsample. In other words, in order to understand the issue, we will assume now that the data consists of

$$\begin{aligned} z_{i}=\left( z_{i1},\ldots ,z_{iT}\right) =\left( q_{i1},\ldots ,q_{iM},r_{i1},\ldots ,r_{iM}\right) , \end{aligned}$$

i.e., we will assume that \(T=2M\), and write q and r for the first and second halves of the observations. We will write the split sample cross fit estimator as

$$\begin{aligned} 0=& {} \sum _{t=1}^{M}v\left( q_{it},\widehat{\alpha }_{1,i}\right) \\ 0= & {} \sum _{t=1}^{M}v\left( r_{it},\widehat{\alpha }_{2,i}\right) \\ 0= & {} \sum _{i=1}^{n}\sum _{t=1}^{M}\left( u\left( q_{it},\widehat{\theta },\widehat{\alpha }_{2,i}\right) +u\left( r_{it},\widehat{\theta },\widehat{\alpha }_{1,i}\right) \right) \end{aligned}$$

with the recognition that \(\widehat{\alpha }_{1,i}\) and \(\widehat{\alpha } _{2,i}\) are estimators of \(\alpha _{1,i}=\alpha _{2,i}=\alpha _{i}\). In order to see the resemblance to the panel model, we will write it as

$$\begin{aligned} 0= & {} \sum _{t=1}^{M}\underline{v}_{\left( S\right) }\left( q_{it} ,r_{it},\widehat{\theta },\widehat{\gamma }_{i}\right) =\sum _{t=1}^{M}\left[ \begin{array} [c]{c} v\left( q_{it},\widehat{\alpha }_{1,i}\right) \\ v\left( r_{it},\widehat{\alpha }_{2,i}\right) \end{array} \right], \\ 0= & {} \sum _{i=1}^{n}\sum _{t=1}^{M}\underline{u}_{\left( S\right) }\left( q_{it} ,r_{it},\widehat{\theta },\widehat{\gamma }_{i}\right) =\sum _{i=1}^{n}\sum _{t=1}^{M}\left( u\left( q_{it},\widehat{\theta },\widehat{\alpha }_{2,i}\right) +u\left( r_{it},\widehat{\theta },\widehat{\alpha }_{1,i}\right) \right). \end{aligned}$$

In other words, the split sample cross fit estimator can be analyzed by adopting a perspective that the fixed effects are multidimensional. The result for the multi-dimensional fixed effects is already available from Arellano and Hahn (2016), which we will utilize here.

It can be shownFootnote 7 that the efficient score is

$$\begin{aligned} \underline{U}_{\left( S\right) }\left( q_{it},r_{it},\theta ,\alpha _{1,i},\alpha _{2,i}\right) =\left( u\left( q_{it},\theta ,\alpha _{2,i}\right) -\varDelta _{i}v\left( q_{it},\alpha _{1,i}\right) \right) +\left( u\left( r_{it},\theta ,\alpha _{1,i}\right) -\varDelta _{i}v\left( r_{it},\alpha _{2,i}\right) \right), \end{aligned}$$

where the \(\varDelta _{i}\) is identical to the one in (11). Note that at \(\alpha _{1,i}=\alpha _{2,i}=\alpha _{i}\), we see that the counterparts of \(\underline{U}\) and \(\underline{\mathcal {I}}_{i}\) are

$$\begin{aligned} \underline{U}_{\left( S\right) }\left( q_{it},r_{it},\theta ,\alpha _{1,i},\alpha _{2,i}\right)= & {} U\left( q_{it},\theta ,\alpha _{i}\right) +U\left( r_{it},\theta ,\alpha _{i}\right) ,\\ \underline{\mathcal {I}}_{\left( S\right) ,i}\equiv & {} -E\left[ \frac{\partial \left( U\left( q_{it},\theta ,\alpha _{i}\right) +U\left( r_{it},\theta ,\alpha _{i}\right) \right) }{\partial \theta }\right] =2\mathcal {I}_{i}, \end{aligned}$$

where the U and \(\mathcal {I}_{i}\) on the RHS are identical to the ones in (10) and (12). We therefore see that the asymptotic distribution of \(\sqrt{nM}\left( \widehat{\theta }-\theta \right) \) is normal with variance equal to

$$\begin{aligned} \frac{1}{2}\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1} ^{n}\mathcal {I}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \text{Var}\left( U_{it}\right) \right) \left( \left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}\mathcal {I}_{i}\right) ^{-1}\right) ^{\prime }. \end{aligned}$$

It follows that the asymptotic distribution of \(\sqrt{nT}\left( \widehat{\theta }-\theta \right) =\sqrt{n\left( 2M\right) }\left( \widehat{\theta }-\theta \right) =\sqrt{2}\sqrt{nM}\left( \widehat{\theta }-\theta \right) \) is normal with variance equal to

$$\begin{aligned} \left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}\mathcal {I} _{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \text{Var}\left( U_{it}\right) \right) \left( \left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}\mathcal {I}_{i}\right) ^{-1}\right) ^{\prime } \end{aligned}$$

In other words, the asymptotic variance of \(\sqrt{nT}\left( \widehat{\theta }-\theta \right) \) does not change.

As for the bias, we see that the counterpart of \(\underline{b}_{i}\) is given by

$$\begin{aligned} \underline{b}_{\left( S\right) ,i}=2\varDelta _{i}\frac{E\left[ v_{it} v_{it}^{\alpha _{i}}\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }+E\left[ U_{it}^{\alpha \alpha }\right] \frac{E\left[ v_{it}^{2}\right] }{\left( E\left[ v_{it}^{\alpha _{i}}\right] \right) ^{2}} \end{aligned}$$

so that, using \(U_{it}^{\alpha _{i}}=u_{it}^{\alpha _{i}}-\varDelta _{i}v_{it}^{\alpha _{i}}\),

$$\begin{aligned} \underline{b}_{\left( S\right) ,i}=2\left( b_{i}+\frac{E\left[ v_{it}u_{it}^{\alpha _{i}}\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }\right) \end{aligned}$$

and the implied bias is

$$\begin{aligned}&\frac{1}{M}\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1} ^{n}\underline{\mathcal {I}}_{\left( S\right) ,i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}\underline{b}_{\left( S\right) ,i}\right) \nonumber \\&\quad =\frac{2}{T}\left( 2\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1} ^{n}\mathcal {I}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}\left( 2\varDelta _{i}\frac{E\left[ v_{it}v_{it}^{\alpha _{i} }\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }+\frac{E\left[ U_{it} ^{\alpha \alpha }\right] E\left[ v_{it}^{2}\right] }{\left( E\left[ v_{it}^{\alpha _{i}}\right] \right) ^{2}}\right) \right) \nonumber \\&\quad =\frac{1}{T}\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1} ^{n}\mathcal {I}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^{n}\left( 2\varDelta _{i}\frac{E\left[ v_{it}v_{it}^{\alpha _{i} }\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }+\frac{E\left[ U_{it} ^{\alpha \alpha }\right] E\left[ v_{it}^{2}\right] }{\left( E\left[ v_{it}^{\alpha _{i}}\right] \right) ^{2}}\right) \right) . \end{aligned}$$
(17)

We now compare the bias of the split sample cross fit estimator with that of the full sample estimator. We first rewrite the bias (9) of the full sample plug-in estimator as

$$\begin{aligned} \frac{1}{T}\left( \lim _{n\rightarrow \infty }\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \mathcal {I}_{i}\right) ^{-1}\left( \lim _{n\rightarrow \infty }\frac{1}{n} \sum _{i=1}^{n}\left( -\frac{E\left[ v_{it}u_{it}^{\alpha _{i}}\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }+\varDelta _{i}\frac{E\left[ v_{it} v_{it}^{\alpha _{i}}\right] }{E\left[ v_{it}^{\alpha _{i}}\right] }+\frac{1}{2}\frac{E\left[ U_{it}^{\alpha _{i}\alpha _{i}}\right] E\left[ v_{it} ^{2}\right] }{\left( E\left[ v_{it}^{\alpha _{i}}\right] \right) ^{2} }\right) \right) \end{aligned}$$
(18)

using \(U_{it}^{\alpha _{i}}=u_{it}^{\alpha _{i}}-\varDelta _{i}v_{it}^{\alpha _{i}}\). Comparing (17) with (18), we can see that the split sample cross fit affects the bias in three ways:

1. It eliminates the bias \( -E\left[ v_{it}u_{it}^{\alpha _{i} }\right] / E\left[ v_{it}^{\alpha _{i}}\right] \) due to the correlation between \(v_{it}\) and \(u_{it}^{\alpha _{i}}\).

2. It magnifies the bias \(\varDelta _{i}E\left[ v_{it}v_{it} ^{\alpha _{i}}\right] / E\left[ v_{it}^{\alpha _{i}}\right] \) due to the correlation between \(v_{it}\) and \(v_{it}^{\alpha _{i}}\) by a factor of two.

3. It magnifies the bias \( \frac{1}{2}E\left[ U_{it}^{\alpha _{i}\alpha _{i}}\right] E\left[ v_{it}^{2}\right] / \left( E\left[ v_{it}^{\alpha _{i}}\right] \right) ^{2}\) due to the variance of \(v_{it}\) by a factor of two.

This is all intuitive. The finite sample bias is due to the noise of estimating \(\alpha _{i}\), which may be correlated with the second stage moment u. The split sample cross fit estimator severs this correlation, which explains the first effect. On the other hand, the split sample estimator effectively uses half the sample size for estimation of each \(\alpha _{i}\), which leads to the second and third effects.

We saw that in the pseudo-panel 2SLS (15) with the IV interpretation, \(\varDelta _{i}=0\) and \(u_{it}^{\alpha _{i}\alpha _{i}}=0\). This implies that the bias of the full sample estimator takes the simple form \( -E\left[ v_{it}u_{it}^{\alpha _{i}}\right] / E\left[ v_{it}^{\alpha _{i}}\right] \), and it is completely eliminated by the split sample cross fit.
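The following Monte Carlo sketch (our own illustration, not taken from the paper; the design values n, T, \(\theta \), and the error covariance are arbitrary) makes this point numerically for the linear pseudo-panel model (14): the full sample plug-in estimator is biased, while the split sample cross fit is approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 200, 4            # many groups, short panel: nuisance/sample ratio 1/T
theta, s_ee = 1.0, 0.8   # true coefficient; Cov(eps, eta)

full, cross = [], []
for _ in range(2000):
    alpha = rng.standard_normal((n, 1))
    eps = rng.standard_normal((n, T))
    eta = s_ee * eps + np.sqrt(1 - s_ee**2) * rng.standard_normal((n, T))
    x = alpha + eta
    y = x * theta + eps

    # Full sample plug-in: alpha_hat is the group mean, as in (15)
    a = x.mean(axis=1, keepdims=True)
    full.append((a * y).sum() / (a * x).sum())

    # Split sample cross fit: swap the half-sample group means
    M = T // 2
    a1 = x[:, :M].mean(axis=1, keepdims=True)
    a2 = x[:, M:].mean(axis=1, keepdims=True)
    num = (a2 * y[:, :M]).sum() + (a1 * y[:, M:]).sum()
    den = (a2 * x[:, :M]).sum() + (a1 * x[:, M:]).sum()
    cross.append(num / den)

print("full sample bias:", np.mean(full) - theta)
print("cross fit   bias:", np.mean(cross) - theta)
```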

Note that \(E\left[ v_{it}v_{it}^{\alpha _{i}}\right] =0\) if \(v_{it} ^{\alpha _{i}}\) is constant as in (3). Even then, we should expect that (i) the bias is not removed by the cross fitting in general, although (ii) it is removed in the special case where \(E\left[ u_{it}^{\alpha _{i}\alpha _{i}}\right] =0\).

Getting back to our panel rendition of the probit model with endogenous regressor, we see that

$$\begin{aligned} U_{it}=\left[ \begin{array} [c]{c} x_{it}m\left( z_{it},\theta ,\alpha _{i}\right) \\ \left( x_{it}-\alpha _{i}\right) m\left( z_{it},\theta ,\alpha _{i}\right) \end{array} \right] +\varDelta _{i}\left( x_{it}-\alpha _{i}\right) \end{aligned}$$

where

$$\begin{aligned} \varDelta _{i}= & {} E\left[ \begin{array} [c]{c} \partial \left( x_{it}m\left( z_{it},\theta ,\alpha _{i}\right) \right) / \partial \alpha _{i}\\ \partial \left( \left( x_{it}-\alpha _{i}\right) m\left( z_{it},\theta ,\alpha _{i}\right) \right) / \partial \alpha _{i} \end{array} \right] \\= & {} E\left[ \begin{array} [c]{c} x_{it}\left( \partial m\left( z_{it},\theta ,\alpha _{i}\right) / \partial \alpha _{i}\right) \\ \left( x_{it}-\alpha _{i}\right) \left( \partial m\left( z_{it},\theta ,\alpha _{i}\right) / \partial \alpha _{i}\right) -m\left( z_{it},\theta ,\alpha _{i}\right) \end{array} \right] \\= & {} E\left[ \begin{array} [c]{c} x_{it}\phi \left( x_{it}\theta +\rho \eta _{it}\right) \\ \left( x_{it}-\alpha _{i}\right) \phi \left( x_{it}\theta +\rho \eta _{it}\right) \end{array} \right] \rho , \end{aligned}$$

where we use

$$\begin{aligned} E\left[ \left. \frac{\partial m\left( z_{it},\theta ,\alpha _{i}\right) }{\partial \alpha _{i}}\right| x_{it},\eta _{it}\right] =\phi \left( x_{it}\theta +\rho \eta _{it}\right) \rho. \end{aligned}$$

It can be seen that \(U_{it}^{\alpha _{i}\alpha _{i}}\ne 0\), so we cannot expect the cross fitting estimator to remove the bias.

Note that the probit model is just one example where the control variable is used as part of a nonlinear regression. We should therefore expect that (i) control variable based estimators have the many IV problem, and (ii) the problem is not solved by cross fitting. (In fact, even in the pseudo-panel 2SLS (15) with the regression interpretation, the bias due to \(u_{it}^{\alpha _{i}\alpha _{i}}\) would not be eliminated by the split sample cross fit.)
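The point can be checked numerically. The following Monte Carlo sketch (again our own illustration, not the paper's experiment; n, T, and the parameter values are arbitrary) estimates the pseudo-panel probit (3) by the two-step control variable method, constructing the control variable either from full sample group means or by swapping half-sample means:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
n, T, M = 500, 4, 2                  # many groups, short panel, half length M
delta, rho = 1.0, 0.5                # true (normalized) parameters

def probit(y, X):
    """Probit MLE by direct likelihood maximization."""
    def nll(b):
        idx = X @ b
        return -np.sum(y * norm.logcdf(idx) + (1 - y) * norm.logcdf(-idx))
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

full, cross = [], []
for _ in range(200):                 # few replications, to keep the sketch quick
    alpha = rng.standard_normal((n, 1))
    eta = rng.standard_normal((n, T))
    x = alpha + eta
    zeta = rng.standard_normal((n, T))
    y = (x * delta + rho * eta + zeta >= 0).astype(float)

    # Full sample control variable: residual against the group mean
    v = x - x.mean(axis=1, keepdims=True)
    full.append(probit(y.ravel(), np.column_stack([x.ravel(), v.ravel()]))[0])

    # Cross fit: residual against the other half's group mean
    a1 = x[:, :M].mean(axis=1, keepdims=True)
    a2 = x[:, M:].mean(axis=1, keepdims=True)
    v_cf = np.hstack([x[:, :M] - a2, x[:, M:] - a1])
    cross.append(probit(y.ravel(), np.column_stack([x.ravel(), v_cf.ravel()]))[0])

print("full sample bias in delta:", np.mean(full) - delta)
print("cross fit   bias in delta:", np.mean(cross) - delta)
```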

Modified Objective Function for the Second Step

In this section, we review the panel literature and discuss a method of bias removal in the context of control variable estimation, which suggests how the bias can be corrected in principle. One can conjecture with high confidence that the bias can be removed by traditional methods of bias correction such as the jackknife,Footnote 8 but it may be useful to find a simple alternative to these computationally intensive procedures. The panel literature has discussed various methods of bias correction in the recent past, so one can imagine that they would work, with some modifications, even in non-panel settings. It is not clear how to frame an asymptotic sequence of models such that the biases in non-panel models with a large number of nuisance parameters can be easily understood and corrected, and we leave this for future research.

Consider the moment (16) for the linear model, with the twist that the fitted value from the first stage is used as a regressor in the second stage. In particular, assume that \(x_{it}=\alpha _{i}+\eta _{it}\), which implies that

$$\begin{aligned} v\left( z_{it},\alpha _{i}\right) =x_{it}-\alpha _{i} \end{aligned}$$

and that the bias formula (13) takes the form

$$\begin{aligned} b_{i}=E\left[ \eta _{it}u_{it}^{\alpha _{i}}\right] +\frac{1}{2}E\left[ u_{it}^{\alpha _{i}\alpha _{i}}\right] E\left[ \eta _{it}^{2}\right] . \end{aligned}$$

If we further assume that \(\eta _{it}\) is i.i.d. over i and t, the formula further simplifies to

$$\begin{aligned} b_{i}=E\left[ \eta _{it}u_{it}^{\alpha _{i}}\right] +\frac{\sigma _{\eta }^{2} }{2}E\left[ u_{it}^{\alpha _{i}\alpha _{i}}\right] . \end{aligned}$$

In Sect. 5, we saw that the term \(E\left[ \eta _{it} u_{it}^{\alpha _{i}}\right] \) can be eliminated by sample split cross fit, but the second term actually gets magnified.

We consider changing the moment equation altogether, adopting the proposal of Arellano and Hahn (2007) to correct the bias of the moment equation. For this purpose, we assume that the moment u is obtained by maximizing some objective function

$$\begin{aligned} \sum _{i}\sum _{t}\psi \left( z_{it},\theta ,\widehat{\alpha }_{i}\right) \end{aligned}$$

with respect to \(\theta \), i.e., assume that

$$\begin{aligned} u\left( z_{it},\theta ,\alpha _{i}\right) =\frac{\partial \psi \left( z_{it},\theta ,\alpha _{i}\right) }{\partial \theta }. \end{aligned}$$

We then have

$$\begin{aligned} \frac{\partial u\left( z_{it},\theta ,\alpha _{i}\right) }{\partial \alpha _{i}}= & {} \frac{\partial }{\partial \theta }\left( \frac{\partial \psi \left( z_{it},\theta ,\alpha _{i}\right) }{\partial \alpha _{i}}\right) \\ \frac{\partial ^{2}u\left( z_{it},\theta ,\alpha _{i}\right) }{\partial \alpha _{i}^{2}}= & {} \frac{\partial }{\partial \theta }\left( \frac{\partial ^{2}\psi \left( z_{it},\theta ,\alpha _{i}\right) }{\partial \alpha _{i}^{2} }\right). \end{aligned}$$

This suggests that we can adopt the proposal in Arellano and Hahn (2007), and consider maximizing

$$\begin{aligned} \sum _{i}\left( \sum _{t}\psi \left( z_{it},\theta ,\widehat{\alpha }_{i}\right) -\frac{1}{T}\sum _{t}\left( v\left( z_{it},\widehat{\alpha }_{i}\right) \frac{\partial \psi \left( z_{it},\theta ,\widehat{\alpha }_{i}\right) }{\partial \alpha _{i}}+\frac{\widehat{\sigma }_{\eta }^{2}}{2}\frac{\partial ^{2}\psi \left( z_{it},\theta ,\widehat{\alpha }_{i}\right) }{\partial \alpha _{i}^{2}}\right) \right) , \end{aligned}$$

where \(\widehat{\sigma }_{\eta }^{2}=\frac{1}{nT}\sum _{i=1}^{n}\sum _{t=1}^{T}\left( x_{it}-\widehat{\alpha }_{i}\right) ^{2}\) with the corresponding moment equationFootnote 9

$$\begin{aligned} 0=\sum _{i}\left( \sum _{t}u\left( z_{it},\widehat{\theta },\widehat{\alpha }_{i}\right) -\frac{1}{T}\sum _{t}\left( v\left( z_{it},\widehat{\alpha } _{i}\right) u^{\alpha _{i}}\left( z_{it},\widehat{\theta },\widehat{\alpha }_{i}\right) +\frac{\widehat{\sigma }_{\eta }^{2}}{2}u^{\alpha _{i}\alpha _{i} }\left( z_{it},\widehat{\theta },\widehat{\alpha }_{i}\right) \right) \right) . \end{aligned}$$
(19)
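As a rough illustration of how the corrected objective above might be evaluated, here is a schematic implementation (ours; the callables psi and v and the toy usage at the end are hypothetical placeholders, and the derivatives in \(\alpha \) are taken by finite differences purely as a computational shortcut):

```python
import numpy as np

def corrected_objective(theta, z, alpha_hat, sigma2_eta, psi, v, h=1e-4):
    """Bias-corrected objective in the display above, evaluated numerically.

    psi(z_i, theta, a) and v(z_i, a) are user-supplied callables returning
    length-T arrays; central finite differences approximate the alpha
    derivatives of psi (a computational shortcut, not part of the proposal).
    """
    n, T = z.shape
    total = 0.0
    for i in range(n):
        a = alpha_hat[i]
        psi0 = psi(z[i], theta, a)
        psi_p = psi(z[i], theta, a + h)
        psi_m = psi(z[i], theta, a - h)
        dpsi = (psi_p - psi_m) / (2 * h)              # d psi / d alpha
        d2psi = (psi_p - 2 * psi0 + psi_m) / h**2     # d^2 psi / d alpha^2
        corr = (v(z[i], a) * dpsi + 0.5 * sigma2_eta * d2psi).sum() / T
        total += psi0.sum() - corr
    return total

# Toy usage with an illustrative quadratic psi (hypothetical, for shape only)
rng = np.random.default_rng(5)
z = rng.standard_normal((50, 8))
a_hat = z.mean(axis=1)
val = corrected_objective(0.5, z, a_hat, 1.0,
                          psi=lambda zi, th, a: -(zi - a * th) ** 2 / 2,
                          v=lambda zi, a: zi - a)
```

One would then maximize the returned value over \(\theta \), which is equivalent to solving the moment equation (19).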

Summary

Using a pseudo-panel model, we have demonstrated that the control variable approach is subject to the many instrument problem, since it uses the predicted value of the endogenous variable. It is essentially the same bias problem analyzed by Cattaneo et al. (2019), who advocated the use of the jackknife to remove the higher order bias. It would be interesting to develop a method of analytic bias correction in the non-panel setting, which we leave for future research.