1 Introduction

We consider the homoscedastic regression model in which the response variable Y is linked to a covariate X by the formula

$$\begin{aligned} Y= \beta X + \varepsilon . \end{aligned}$$
(1)

For reasons of clarity we focus on the case where X is one-dimensional and \(\beta \) an unknown real number. We will assume throughout that \(\varepsilon \) and X are independent and that X has a finite positive variance. Our goal is to make inferences about the slope \(\beta \), treating the density f of the error \(\varepsilon \) and the distribution of the covariate X as nuisance parameters. We shall do so by using an empirical likelihood approach based on independent copies \((X_1,Y_1),\dots ,(X_n,Y_n)\) of the base observation \((X,Y)\).

Model (1) is the usual linear regression model with a nonzero intercept, even though it is written without an explicit intercept parameter. Since we do not assume that the error variable is centered, the mean \(E[\varepsilon ]\) plays the role of the intercept parameter. Working with this model and notation simplifies the explanation of the method and the presentation of the proofs. The generalization to the multivariate case is straightforward; see Remark 1 in Sect. 2.

The linear regression model is one of the most useful statistical models, and many simple estimators for the slope are available, such as the ordinary least squares estimator (OLSE), which takes the form

$$\begin{aligned} \dfrac{\sum _{j=1}^n \left( X_j -\bar{X}\right) Y_j}{\sum _{j=1}^n \left( X_j-\bar{X}\right) ^2} \end{aligned}$$
(2)

rather than \(\sum _{j=1}^n X_jY_j/ \sum _{j=1}^n X_j^2\), because we do not assume that the errors are centered. However, these estimators are usually inefficient. The construction of efficient (least dispersed) estimators is in fact quite involved. The reason for this is the assumed independence between covariates and errors, which is a structural assumption that has to be taken into account by the estimator to obtain efficiency. Efficient estimators for \(\beta \) in model (1) were first introduced by Bickel (1982), who used sample splitting to estimate the efficient influence function. To establish efficiency we must assume that f has finite Fisher information for location. This means that f is absolutely continuous and the integral \(J_f= \int \ell _f^2(y)f(y)\,dy\) is finite, where \(\ell _f=-f'/f\) denotes the score function for location. It follows from Bickel (1982) that an efficient estimator \(\hat{\beta }\) of \(\beta \) is characterized by the stochastic expansion

$$\begin{aligned} \hat{\beta }= \beta + \frac{1}{n}\sum _{j=1}^n\dfrac{(X_j-E[X])\ell _f(Y_j-\beta X_j)}{J_f \mathrm{Var}(X)} + o_P\left( n^{-1/2}\right) . \end{aligned}$$
(3)

Further efficient estimators of the slope which require estimating the influence function were proposed by Schick (1987) and Jin (1992). Koul and Susarla (1983) studied the case when f is also symmetric about zero. See also Schick (1993) and Forrester et al. (2003), who achieved efficiency without sample splitting and instead used a conditioning argument. Efficient estimation in the corresponding (heteroscedastic) model without the independence assumption (defined by \(E(\varepsilon |X)=0\)) is much easier: Müller and Van Keilegom (2012), for example, proposed weighted versions of the OLSE to efficiently estimate \(\beta \) in the model with fully observed data and in a model with missing responses. See also Schick (2013), who proposed an efficient estimator using maximum empirical likelihood with infinitely many constraints.

Like Müller and Van Keilegom (2012), we are interested in the common case that responses are missing at random (MAR). This means that we observe copies of the triplet \((\delta , X, \delta Y)\), where \(\delta \) is an indicator variable with \(\delta = 1\) if Y is observed, and where the probability \(\pi \) that Y is observed depends only on the covariate,

$$\begin{aligned} P(\delta =1|X,Y) = P(\delta =1|X) = \pi (X), \end{aligned}$$

with \(E[\pi (X)] = E[\delta ] >0\); we refer to the monographs by Little and Rubin (2002) and Tsiatis (2006) for further reading. Note that the ‘MAR model’ we have just described covers the ‘full model’ (in which all data are completely observed) as a special case with \(\pi (X)=1\). To estimate \(\beta \) in the MAR model we propose a complete case analysis, i.e., only the \(N = \sum _{j=1}^n\delta _j\) observations \((X_{i_1},Y_{i_1}), \ldots , (X_{i_N},Y_{i_N})\) with observed responses will be considered.

Complete case analysis is the simplest approach to dealing with missing data and is frequently dismissed as naive and wasteful. In our application, however, the contrary is true: Müller and Schick (2017) showed that general functionals of the conditional distribution of Y given X can be estimated efficiently (in the sense of Hájek and Le Cam) by a complete case analysis. Since the slope \(\beta \) is covered as a special case, this means that an estimator of \(\beta \) that is efficient in the full model is also efficient in the MAR model if we simply omit the incomplete cases. This property is called ‘efficiency transfer’. To construct efficient maximum empirical likelihood estimators for \(\beta \), it therefore suffices to consider the model with completely observed data. We write \(\hat{\beta }_c\) for the complete case version of \(\hat{\beta }\) from (3). It follows from the transfer principle for asymptotically linear statistics by Koul et al. (2012) that \(\hat{\beta }_c\) satisfies

$$\begin{aligned} \hat{\beta }_c = \beta + \frac{1}{N} \sum _{j=1}^n\dfrac{\delta _j\left( X_j-E[X|\delta =1] \right) \ell _f\left( Y_j-\beta X_j\right) }{J_f \mathrm{Var}(X|\delta =1)} + o_P\left( n^{-1/2}\right) \end{aligned}$$
(4)

and is therefore consistent for \(\beta \). That \(\hat{\beta }_c\) is also efficient follows from Müller and Schick (2017, Sect. 5.1). The efficiency property can alternatively be deduced from arguments in Müller (2009), who gave the efficient influence function for \(\beta \) in the MAR model, but with the additional assumption that the errors have mean zero; see Lemma 5.1 in that paper.

In this paper we use an empirical likelihood approach with an increasing number of estimated constraints to derive various inferential procedures about the slope. Our approach is similar to Schick (2013), but our model requires different constraints. We obtain a suitable Wilks’ theorem (see Theorem 1) to derive confidence sets for \(\beta \) and tests about a specific value of \(\beta \), and a point estimator of \(\beta \) via maximum empirical likelihood, i.e., by maximizing the empirical likelihood. This estimator is shown to be semiparametrically efficient.

Empirical likelihood was introduced by Owen (1988, 2001) for a fixed number of known linear constraints to construct confidence intervals in a nonparametric setting. More recently, his results have been generalized to a fixed number of estimated constraints by Hjort et al. (2009), who further studied the case of an increasing number of known constraints; see also Chen et al. (2009). Peng and Schick (2013) generalized the approach to the case of an increasing number of estimated constraints. The idea of maximum empirical likelihood goes back to Qin and Lawless (1994), who treated the case with a fixed number of known constraints. Peng and Schick (2017) generalized their result to the case with estimated constraints. Schick (2013) and Peng and Schick (2016) treated examples with an increasing number of estimated constraints and showed efficiency of the maximum empirical likelihood estimators.

The empirical likelihood is similar to the one considered for the symmetric location model in Peng and Schick (2016). We shall derive results that are analogous to those in that paper. In Sect. 3 we provide the asymptotic Chi-square distribution of the empirical log-likelihood for both the full model and the MAR model. This facilitates the construction of confidence intervals and tests about the slope \(\beta \). In Sect. 4 we propose a new method for estimating \(\beta \) efficiently, namely a guided maximum empirical likelihood estimator, as suggested by Peng and Schick (2017) for the general model with estimated constraints. Efficiency of this estimator is entailed by a uniform expansion for the local empirical likelihood (see Theorem 2), which follows from a local asymptotic normality condition. Section 5 contains a simulation study. The proofs are in Sect. 6.

2 Empirical likelihood approach

The construction of the empirical likelihood is crucial since we need to incorporate the independence between the covariates and the errors to obtain efficiency. Let us explain it for the full model. The corresponding approach for the missing data model is then straightforward: in that case we will proceed in the same way, now with the analysis based on the N complete cases, and with the random sample size N treated like n.

Our empirical likelihood \(\mathscr {R}_{n}(b)\), which we want to maximize with respect to \(b \in \mathbb R\), is of the form

$$\begin{aligned} \mathscr {R}_{n}(b)= \sup \left\{ \prod _{j=1}^n n\pi _j: \pi \in \mathscr {P}_n,\ \sum _{j=1}^n \pi _j(X_j-\bar{X}) v_n\left( \mathbb F_{b}\left( Y_j-bX_j\right) \right) =0 \right\} . \end{aligned}$$

Here \(\mathscr {P}_n\) is the probability simplex in dimension n, defined by

$$\begin{aligned} \mathscr {P}_n= \left\{ \,\pi =(\pi _1,\dots ,\pi _n)^{\top }\in [0,1]^n: \sum _{j=1}^n\pi _j =1 \,\right\} , \end{aligned}$$

\(\bar{X}\) is the sample mean of the covariates \(X_1,\dots ,X_n\), \(\mathbb F_b\) is the empirical distribution function constructed from ‘residuals’ \(Y_1-bX_1,\dots ,Y_n-bX_n\), i.e.,

$$\begin{aligned} \mathbb F_b(t)= \frac{1}{n}\sum _{j=1}^n\mathbf 1\left[ Y_j-bX_j\le t\right] , \quad t\in \mathbb R, \end{aligned}$$

which serves as a surrogate for the unknown error distribution F. The function \(v_n\) maps from [0, 1] into \(\mathbb R^{r_n}\) and will be described in (6) below. The constraint \(\sum _{j=1}^n \pi _j(X_j-\bar{X}) v_n(\mathbb F_{b}(Y_j-bX_j))=0\) in the definition of \(\mathscr {R}_{n}(b)\) is therefore a vector of \(r_n\) one-dimensional constraints, where the integer \(r_n\) tends to infinity slowly as the sample size n increases. These constraints emerge from the independence assumption as follows. Independence of X and \(\varepsilon \) is equivalent to \(E[c(X)a(\varepsilon )]=0\) for all square-integrable centered functions c and a under the distributions of X and \(\varepsilon \), respectively. This leads to the empirical likelihood in Peng and Schick (2013). We do not work with these constraints. Instead we use constraints in the subspace

$$\begin{aligned} \{\left( X-E[X]\right) a(\varepsilon ): a\in L_{2,0}(F)\} \end{aligned}$$
(5)

with \(L_{2,0}(F)=\{a\in L_2(F): \int a\,dF=0\}\), which suffices since it contains the efficient influence function; see (3). By our assumptions, F is continuous and \(F(\varepsilon )\) is uniformly distributed on the interval [0, 1], i.e., \(F(\varepsilon ) \sim \mathscr {U}\). An orthonormal basis of \(L_{2,0}(F)\) is \(\varphi _1 \circ F, \varphi _2 \circ F, \ldots \), where \(\varphi _1, \varphi _2, \ldots \) is an orthonormal basis of \(L_{2,0}(\mathscr {U})\). This suggests the constraints

$$\begin{aligned} \sum _{j=1}^n\pi _j \{X_j - E(X)\} \varphi _k \{F (Y_j-b X_j)\} =0, \quad k=1,\ldots ,r_n, \end{aligned}$$

which, however, cannot be used since neither F nor the mean of X is known. So we replace them by empirical estimators. In this article we will work with the trigonometric basis

$$\begin{aligned} \varphi _k(x)= \sqrt{2} \cos (k\pi x), \quad 0\le x\le 1, k=1,2,\dots , \end{aligned}$$

and take

$$\begin{aligned} v_n =\left( \varphi _1,\dots ,\varphi _{r_n}\right) ^{\top }. \end{aligned}$$
(6)

This yields our empirical likelihood \(\mathscr {R}_{n}(b)\) from above.
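
To make the construction concrete, the following R sketch evaluates \(-2\log \mathscr {R}_{n}(b)\) for a trial slope b in the full model. It is an illustration only, not part of the formal development: the helper names v_n and neg2logR are ours, and we assume that the CRAN package emplik (function el.test) is available to profile the Lagrange multiplier of the mean-zero constraints.

```r
## Illustrative sketch: -2 log R_n(b) for a trial slope b in the full model.
## Assumes the CRAN package 'emplik'; v_n() and neg2logR() are our own names.
library(emplik)

## trigonometric basis v_n(u) = (phi_1(u), ..., phi_r(u)) with phi_k(u) = sqrt(2) cos(k*pi*u)
v_n <- function(u, r) sqrt(2) * cos(outer(u, 1:r) * pi)   # returns an n x r matrix

neg2logR <- function(b, X, Y, r) {
  res <- Y - b * X                      # 'residuals' Y_j - b X_j
  Fb  <- ecdf(res)(res)                 # empirical distribution function F_b at the residuals
  G   <- (X - mean(X)) * v_n(Fb, r)     # constraint vectors (X_j - Xbar) v_n(F_b(Y_j - b X_j))
  el.test(G, mu = rep(0, r))$"-2LLR"    # empirical likelihood with r_n mean-zero constraints
}
```

The dual optimization in el.test can fail when the zero vector is not contained in the convex hull of the constraint vectors; in practice this only happens for values of b far from the true slope, and such b may simply be assigned a large value.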

Let us briefly discuss the complete case approach that we propose for the MAR model. In the following, as before when we introduced \(\hat{\beta }_c\), a subscript ‘c’ indicates that a complete case statistic is used. For example, \(\mathbb F_{b,c}\) is the complete case version of \(\mathbb F_b\), i.e.,

$$\begin{aligned} \mathbb F_{b,c}(t) = \frac{1}{N} \sum _{j=1}^n\delta _j \mathbf 1\left[ Y_j-bX_j \le t\right] = \frac{1}{N} \sum _{j=1}^N \mathbf 1\left[ Y_{i_j}-bX_{i_j} \le t\right] , \quad t\in \mathbb R. \end{aligned}$$

The complete case empirical likelihood is

$$\begin{aligned} \mathscr {R}_{n,c}(b)= \sup \left\{ \prod _{j=1}^N N\pi _j: \pi \in \mathscr {P}_N, \sum _{j=1}^N \pi _j (X_{i_j}-\bar{X}_c) v_N\left( \mathbb F_{b,c}(Y_{i_j}-bX_{i_j})\right) = 0 \right\} , \end{aligned}$$

with \(\mathscr {P}_N\) and \(v_N\) defined as above, with N in place of n. Note that we perform a complete case analysis, so the above formula must involve \(\bar{X}_c = N^{-1} \sum _{j=1}^n\delta _j X_j\), which is a consistent estimator of the conditional expectation \(E[X|\delta =1]\), as given in (4); see also Sect. 3 in Müller and Schick (2017) for the general case. When switching from the full model to the complete case analysis, moments of the covariate distribution are thus replaced by moments of the conditional covariate distribution given \(\delta =1\).
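
The complete case empirical likelihood requires no new code: the sketch above is simply applied to the N complete cases, with N playing the role of n. For instance, with the hypothetical helper neg2logR from above,

```r
## Illustrative sketch: complete case version -2 log R_{n,c}(b).
neg2logR_cc <- function(b, X, Y, delta, r) {
  cc <- delta == 1                 # indices of the N complete cases
  neg2logR(b, X[cc], Y[cc], r)     # N = sum(delta) plays the role of n
}
```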

Remark 1

If the covariate X is a p-dimensional vector we have

$$\begin{aligned} Y_j=\beta ^{\top }X_j+\varepsilon _j, \quad j=1,\ldots ,n, \end{aligned}$$

and construct \(\mathbb {F}_b\) using the ‘residuals’ \(Y_j-b^{\top }X_j\). Now we need to interpret (5) with X being p-dimensional. The empirical likelihood \(\mathscr {R}_{n}(b)\) is then

$$\begin{aligned} \begin{aligned} \sup \left\{ \prod _{j=1}^n n\pi _j: \pi \in \mathscr {P}_n,\sum _{j=1}^n \pi _j\left( X_j-\bar{X}\right) \otimes v_n\left( \mathbb F_{b}\left( Y_j-b^\top X_j\right) \right) = 0 \right\} , \end{aligned} \end{aligned}$$

where \(\otimes \) denotes the Kronecker product. Since the Kronecker product of two vectors with dimensions p and q is a vector of dimension pq, there are \(pr_n\) random constraints in the above empirical likelihood. Working with this likelihood is notationally more cumbersome, but the proofs are essentially the same. The complete case empirical likelihood \(\mathscr {R}_{n,c}(b)\) changes analogously. It equals

$$\begin{aligned} \sup \left\{ \prod _{j=1}^N N\pi _j: \pi \in \mathscr {P}_N, \sum _{j=1}^N \pi _j \left( X_{i_j}-\bar{X}_c\right) \otimes v_N\left( \mathbb F_{b,c}\left( Y_{i_j}-b^{\top }X_{i_j}\right) \right) = 0 \right\} . \end{aligned}$$
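
For illustration, the \(pr_n\) constraint vectors can be formed with the base R function kronecker; the sketch below (our own names, building on the hypothetical helper v_n from above) returns them as the rows of an \(n\times pr_n\) matrix, which can then be passed to the empirical likelihood routine as before.

```r
## Illustrative sketch: constraint vectors for p-dimensional covariates.
## X is an n x p matrix, b a vector of length p; v_n() as defined above.
constraints_multi <- function(b, X, Y, r) {
  res <- Y - as.vector(X %*% b)                    # residuals Y_j - b'X_j
  Vb  <- v_n(ecdf(res)(res), r)                    # n x r matrix of basis values
  Xc  <- sweep(X, 2, colMeans(X))                  # centered covariates X_j - Xbar
  t(sapply(seq_len(nrow(X)),                       # row j equals (X_j - Xbar) kronecker v_n(...)
           function(j) kronecker(Xc[j, ], Vb[j, ])))
}
```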

3 A Wilks’ theorem

Wilks’ original theorem states that the classical log-likelihood ratio test statistic is asymptotically Chi-square distributed. Our first result is a version of that theorem for the empirical log-likelihood. It is given in Theorem 1 below and proved in the first subsection of Sect. 6. As in the previous section we write \(\mathscr {R}_{n}(b)\) for the empirical likelihood and \(\mathscr {R}_{n,c}(b)\) for the complete case empirical likelihood. Further let \(\chi _{\gamma }(d)\) denote the \(\gamma \)-quantile of the Chi-square distribution with d degrees of freedom.

Theorem 1

Consider the full model and suppose that X also has a finite fourth moment and that the number of basis functions \(r_n\) satisfies \(r_n \rightarrow \infty \) and \(r_n^4=o(n)\) as \(n \rightarrow \infty \). Then, we have

$$\begin{aligned} P(-2 \log \mathscr {R}_{n}(\beta ) \le \chi _{u}(r_n)) \rightarrow u, \quad 0<u<1. \end{aligned}$$

The conclusion of this theorem is equivalent to \((-2 \log \mathscr {R}_{n}(\beta )-r_n)/\sqrt{2 r_n}\) being asymptotically standard normal. This implies that the complete case version \((-2 \log \mathscr {R}_{n,c}(\beta )-r_N)/\sqrt{2 r_N}\) is also asymptotically standard normal. This is a consequence of the transfer principle for complete case statistics; see Remark 2.4 in the article by Koul et al. (2012). More precisely, these authors showed that if the limiting distribution of a statistic is \(\mathcal {L}(Q)\), then the limiting distribution of its complete case version is \(\mathcal {L}(\tilde{Q})\), where Q is the joint distribution of \((X,Y)\), belonging to some model, and \(\tilde{Q}\) is the distribution of \((X,Y)\) given \(\delta =1\). One only needs to assume that \(\tilde{Q}\) belongs to the same model as Q, i.e., it satisfies the same assumptions. Here we assume that the responses are missing at random, i.e., \(\delta \) and Y are conditionally independent given X. Therefore, we only need to require that the conditional covariate distribution given \(\delta =1\) and the unconditional covariate distribution belong to the same model. Here the limiting distribution is not affected as it does not depend on Q.

Although the result for the MAR model is more general than the result for the full model (which is covered as a special case), we can now, thanks to the transfer principle, formulate it as a corollary, i.e., we only need to take the modified assumptions for the conditional covariate distribution into account, and prove Theorem 1 for the full model.

Corollary 1

Consider the MAR model and suppose that the distribution of X given \(\delta =1\) has a finite fourth moment and a positive variance. Let the number of basis functions \(r_N\) satisfy \(1/r_N = o_P(1)\) and \(r_N^4=o_P(N)\) as \(n \rightarrow \infty \). Then, we have

$$\begin{aligned} P(-2 \log \mathscr {R}_{n,c}(\beta ) \le \chi _{u}(r_N)) \rightarrow u, \quad 0<u<1. \end{aligned}$$

Note that the conditions on the number of basis functions \(r_n\) and \(r_N\) in the full model and the MAR model are equivalent since n and N increase proportionally,

$$\begin{aligned} \frac{N}{n} = \frac{1}{n} \sum _{i=1}^n\delta _i \; \rightarrow \; E[\delta ] \quad \text{ almost } \text{ surely }, \end{aligned}$$

with \(E[\delta ]>0\) by assumption.

The distribution of X given \(\delta =1\) has density \(\pi /E[\delta ]\) with respect to the distribution of X. Thus, the variance of the former distribution is positive unless X is constant almost surely on the event \(\{\pi (X)>0\}\).

Remark 2

The above result shows that

$$\begin{aligned} \{ b\in \mathbb R: - 2 \log \mathscr {R}_{n,c}(b) < \chi _{1-\alpha } (r_N)\} \end{aligned}$$

is a \(1-\alpha \) confidence region for \(\beta \) and that

$$\begin{aligned} \mathbf 1\left[ -2 \log \mathscr {R}_{n,c}(\beta _0) \ge \chi _{1-\alpha } (r_N)\right] \end{aligned}$$

is a test of asymptotic size \(\alpha \) for testing the null hypothesis \(H_0: \beta =\beta _0\). Note that both the confidence region and the test about the slope also apply to the special case of a full model with \(N=n\) and \(\mathscr {R}_{n}\) in place of \(\mathscr {R}_{n,c}\). The asymptotic confidence interval for the slope, for example, is

$$\begin{aligned} \{ b\in \mathbb R: - 2 \log \mathscr {R}_{n}(b) < \chi _{1-\alpha } (r_n)\}. \end{aligned}$$
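
For illustration, this confidence set can be computed by a grid search over candidate slopes, comparing \(-2 \log \mathscr {R}_{n}(b)\) with the Chi-square quantile. A minimal sketch for the full model, using the hypothetical helper neg2logR from Sect. 2, is given below; the complete case version works in the same way with N and \(r_N\).

```r
## Illustrative sketch: asymptotic (1 - alpha) confidence set for the slope by grid search.
el_confidence_set <- function(X, Y, r, grid, alpha = 0.05) {
  crit <- qchisq(1 - alpha, df = r)                            # chi_{1-alpha}(r_n)
  keep <- sapply(grid, function(b) neg2logR(b, X, Y, r) < crit)
  grid[keep]                                                   # retained slopes; an interval in practice
}
## e.g. grid = seq(ols - 1, ols + 1, length.out = 401) around the OLSE
```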

4 Efficient estimation

Our next result gives a strengthened version of the uniform local asymptotic normality (ULAN) condition for the local empirical likelihood ratio

$$\begin{aligned} \mathscr {L}_n(t) = \log \left( \dfrac{\mathscr {R}_{n}(\beta + n^{-1/2}t)}{\mathscr {R}_{n}(\beta )}\right) , \quad t\in \mathbb R\end{aligned}$$

in the full model. The usual ULAN condition is established for fixed compact intervals for the local parameter t. Here we allow the intervals to grow with the sample size.

Theorem 2

Suppose X has a finite fourth moment, f has finite Fisher information for location, and \(r_n\) satisfies \((\log n)/r_n =O(1)\) and \(r_n^5 \log n =o(n)\). Then, for every sequence \(C_n\) satisfying \(C_n\ge 1\) and \(C_n^2 =O(\log n)\), the uniform expansion

$$\begin{aligned} \sup _{|t|\le C_n}\dfrac{|\mathscr {L}_n(t)-t\varGamma _n+ J_f \mathrm{Var}(X) t^2/2|}{(1+|t|)^2} =o_P(1) \end{aligned}$$
(7)

holds with

$$\begin{aligned} \varGamma _n= \dfrac{1}{\sqrt{n}} \sum _{j=1}^n(X_j -E[X]) \ell _f(Y_j-\beta X_j), \end{aligned}$$

which is asymptotically normal with mean zero and variance \(J_f \mathrm{Var}(X)\).

The proof of Theorem 2 is quite elaborate and carried out in Sect. 6. Expansion (7) is critical to obtain the asymptotic distribution of the maximum empirical likelihood estimator. We shall follow Peng and Schick (2017) and work with a guided maximum empirical likelihood estimator (GMELE). This requires a preliminary \(n^{1/2}\)-consistent estimator \(\tilde{\beta }_n\) of \(\beta \). One possibility is the OLSE, see (2), which requires the additional assumption that the error has a finite second moment. Another possibility which avoids this assumption is the solution \(\tilde{\beta }_n\) to the equation

$$\begin{aligned} \frac{1}{n}\sum _{j=1}^n(X_j-\bar{X}) \psi (Y_j-b X_j) =0, \end{aligned}$$

where \(\psi \) is a bounded function with a positive and bounded first derivative \(\psi '\) and a bounded second derivative, such as the arctangent. Then,

$$\begin{aligned} \tilde{\beta }_n = \beta + \frac{1}{n}\sum _{j=1}^n\dfrac{(X_j-\mu ) (\psi (\varepsilon _j)- E[\psi (\varepsilon )])}{\mathrm{Var}(X) E[\psi '(\varepsilon )]} + o_P\left( n^{-1/2}\right) \end{aligned}$$

and \(n^{1/2} (\tilde{\beta }_n - \beta )\) is asymptotically normal with mean zero and variance

$$\begin{aligned} \dfrac{\mathrm{Var}(\psi (\varepsilon ))}{(E[\psi '(\varepsilon )])^2 \mathrm{Var}(X)}. \end{aligned}$$
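
For illustration, this preliminary estimator can be computed by a univariate root search; the sketch below (our own names) uses \(\psi = \arctan \) and assumes that the supplied interval brackets a sign change of the estimating function.

```r
## Illustrative sketch: root of (1/n) sum_j (X_j - Xbar) psi(Y_j - b X_j) = 0 with psi = atan.
prelim_est <- function(X, Y, interval) {
  g <- function(b) mean((X - mean(X)) * atan(Y - b * X))   # estimating function in b
  uniroot(g, interval = interval)$root                     # assumes g changes sign on 'interval'
}
```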

The GMELE associated with an \(n^{1/2}\)-consistent preliminary estimator \(\tilde{\beta }_n\) is defined by

$$\begin{aligned} \hat{\beta }_n = \mathop {\arg \max }\limits _{n^{1/2} |b-\tilde{\beta }_n| \le C_n} \mathscr {R}_{n}(b), \end{aligned}$$
(8)

where \(C_n\) is proportional to \((\log n)^{1/2}\). By the results in Peng and Schick (2017) the expansion (7) implies

$$\begin{aligned} n^{1/2}(\hat{\beta }_n - \beta ) = {\varGamma }_n/(J_f \mathrm{Var}(X)) + o_P\left( 1\right) . \end{aligned}$$

Thus, under the assumptions of Theorem 2, the GMELE \(\hat{\beta }_n\) satisfies (3) and is therefore efficient. The complete case estimator

$$\begin{aligned} \hat{\beta }_{n,c} = \mathop {\arg \max }\limits _{N^{1/2} |b-\tilde{\beta }_{n,c}| \le C_N} \mathscr {R}_{n,c}(b) \end{aligned}$$

is then efficient in the MAR model, provided the conditional distribution of X given \(\delta =1\) has a finite fourth moment and a positive variance. Let us summarize our finding in the following theorem.

Theorem 3

Suppose that the error density f has finite Fisher information for location and that \(r_n\) satisfies \( (\log n)/r_n =O(1)\) and \(r_n^5 \log n =o(n)\).

  (a)

    Assume that the covariate X has a finite fourth moment and a positive variance. Then, the GMELE \(\hat{\beta }_n\) satisfies expansion (3) and is therefore efficient in the full model.

  (b)

    Consider the MAR model and assume that given \(\delta =1\) the covariate X has a finite conditional fourth moment and a positive conditional variance. Then, the complete case version \(\hat{\beta }_{n,c}\) of the GMELE satisfies expansion (4) and is efficient in the MAR model.

The choice of \(r_n\) (and \(r_N\)) is addressed in Remark 4 in Sect. 5.

Remark 3

A referee suggested the following. ‘An alternative (but asymptotically equivalent) procedure to compute the maximum empirical likelihood estimator can be based on the set of generalized estimating equations \( g_j(b) = (X_j - \bar{X}) v_n (\mathbb F_b(Y_j - bX_j)) \) (with \(r_n > 1\)) and the following program, i.e.,

$$\begin{aligned} \hat{\beta }_n^{EE}= \mathop {\arg \min }\limits _{n^{1/2} |b - \tilde{\beta }_n|\le C_n} \frac{1}{n}\sum _{j=1}^ng_j(b)^\top \left( \frac{1}{n}\sum _{j=1}^ng_j(\overline{\beta }_n) g_j(\overline{\beta }_n)^\top \right) ^{-1} \frac{1}{n}\sum _{j=1}^ng_j(b), \end{aligned}$$

where \(\overline{\beta }_n\) is a preliminary estimator defined as

$$\begin{aligned} \overline{\beta }_n= \mathop {\arg \min }\limits _{n^{1/2} |b - \tilde{\beta }_n|\le C_n} \frac{1}{n}\sum _{j=1}^ng_j(b)^\top \widehat{W} \frac{1}{n}\sum _{j=1}^ng_j(b) \end{aligned}$$

for any positive semidefinite matrix \(\widehat{W}\) (and similarly for the complete case analysis). This estimator is computationally simpler than the maximum empirical likelihood estimator, especially if the dimension of \(\beta \) is larger than one.’

An even simpler estimator which avoids the preliminary step is the estimator \(\overline{\beta }_n\) with \(\widehat{W}= (\hat{\tau }_n^2 I_{r_n})^{-1}\), where \(I_{r_n}\) is the \(r_n \times r_n\) identity matrix and \(\hat{\tau }_n^2 = \frac{1}{n}\sum _{j=1}^n(X_j-\bar{X})^2\). This estimator reduces to

$$\begin{aligned} \hat{\beta }_n^{S}= \mathop {\arg \min }\limits _{n^{1/2}|b-\tilde{\beta }_n|\le C_n}\left\| \dfrac{1}{\sqrt{n}} \sum _{j=1}^ng_j(b)\right\| ^2 / \hat{\tau }_n^2 = \mathop {\arg \min }\limits _{n^{1/2}|b-\tilde{\beta }_n|\le C_n}\left\| \dfrac{1}{\sqrt{n}} \sum _{j=1}^ng_j(b)\right\| ^2. \end{aligned}$$

Using arguments from the proof of Theorem 2, both estimators, \(\hat{\beta }_n^{EE}\) and \(\hat{\beta }_n^{S}\), can be shown to be efficient. In simulations the GMELE outperformed the alternative estimators \(\hat{\beta }_n^{EE}\) and \(\hat{\beta }_n^{S}\); see Table 1 in Sect. 5.
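
For illustration, the simple estimator \(\hat{\beta }_n^{S}\) can be computed with a one-dimensional search over the interval centered at the preliminary estimate; a sketch using our hypothetical helper v_n from Sect. 2:

```r
## Illustrative sketch: the simple estimator from this remark, minimizing the squared
## norm of the averaged constraints over the search interval (center = preliminary estimate).
beta_S <- function(X, Y, r, center, halfwidth) {
  crit_fn <- function(b) {
    res <- Y - b * X
    G   <- (X - mean(X)) * v_n(ecdf(res)(res), r)   # g_j(b), j = 1, ..., n, as rows
    sum(colMeans(G)^2)                              # proportional to |n^{-1} sum_j g_j(b)|^2
  }
  optimize(crit_fn, interval = c(center - halfwidth, center + halfwidth))$minimum
}
```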

5 Simulations

Here we report the results of a small simulation study carried out to investigate the finite sample behavior of the GMELE (8) and the test from Remark 2. The simulations were carried out in R, and the R function optimize was used to locate the maximizers.

5.1 Comparing GMELE with the competing estimators from Remark 3

For this study we used the full model with \(\beta =1\) and sample size \(n=100\). We worked with two error distributions and two covariate distributions. As error distributions we picked the mixture normal distribution \(.25\mathscr {N}(-10, 1)+.5\mathscr {N}(0,1)+.25\mathscr {N}(10,1)\) and the skew normal distribution with location parameter zero, scale parameter 1 and skewness parameter 4. As covariate distributions we chose the standard normal distribution and the uniform distribution on \((-1,3)\). Table 1 reports simulated mean-squared errors of the estimators, \(\hat{\beta }_n^{S}\), \(\hat{\beta }_n^{EE}\) and the GMELE, based on 2000 repetitions, and for the choices \(r_n=1,\dots ,10\). We used the OLSE as preliminary estimator for the GMELE and \(\hat{\beta }_n^{S}\), to specify the location of the search interval. As preliminary estimator for \(\hat{\beta }_n^{EE}\) we used \(\hat{\beta }_n^{S}\). We chose \(2c_n\sqrt{\log (n)/n}\) as the length of the interval, with \(c_n=1\) for skew normal errors and \(c_n=10\) for the mixture normal errors. As can be seen from Table 1, the GMELE clearly outperforms the two competing approaches.
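
A minimal sketch of this computation (our own names, reusing the hypothetical helper neg2logR from Sect. 2) is the following; maximizing \(\mathscr {R}_{n}(b)\) is the same as minimizing \(-2\log \mathscr {R}_{n}(b)\).

```r
## Illustrative sketch: GMELE over the search interval around the OLSE, located with optimize().
gmele <- function(X, Y, r, c_n = 1) {
  n   <- length(Y)
  ols <- sum((X - mean(X)) * Y) / sum((X - mean(X))^2)   # OLSE, see (2)
  hw  <- c_n * sqrt(log(n) / n)                          # half-length of the search interval
  optimize(function(b) neg2logR(b, X, Y, r),
           interval = c(ols - hw, ols + hw))$minimum
}
```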

Table 1 Comparing the GMELE \(\hat{\beta }_n\) (M) with \(\hat{\beta }_n^{S}\) (S) and \(\hat{\beta }_n^{EE}\) (EE) from Remark 3

5.2 Performance with missing data

Here we report on the performance of the GMELE and the OLSE with missing data. We again used the model \(Y = \beta X + \varepsilon \) with \(\beta =1\) and chose

$$\begin{aligned} \pi (X) = P(\delta =1|X) = 1/(1 + d \, \exp (X)) \end{aligned}$$

with \(d=0\), .1 and .5 to produce different missingness rates. Note that \(d=0\) corresponds to the full model.
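
For reference, a single data set of this design can be generated as follows (a sketch; standard normal covariates and mixture normal errors are shown, the other distributions are analogous):

```r
## Illustrative sketch: one simulated data set with responses missing at random.
gen_data <- function(n, beta = 1, d = 0.1) {
  X     <- rnorm(n)                                        # standard normal covariates
  comp  <- sample(1:3, n, replace = TRUE, prob = c(.25, .5, .25))
  eps   <- rnorm(n, mean = c(-10, 0, 10)[comp], sd = 1)    # mixture normal errors
  Y     <- beta * X + eps
  delta <- rbinom(n, 1, 1 / (1 + d * exp(X)))              # P(delta = 1 | X) = 1/(1 + d exp(X))
  data.frame(X = X, Y = Y, delta = delta)                  # we observe (delta, X, delta * Y)
}
```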

Table 2 Simulated MSEs for OLSE and GMELE with missing data

We used the same error and covariate distributions as before and worked with the search interval \(\tilde{\beta }_{N,c}\pm c_N \sqrt{\log (N)/N}\) based on the complete case version of the OLSE. We chose \(c_N=1\) for the skew normal errors and \(c_N=10\) for the mixture normal errors. The reported results are based on samples of size \(n=70\) and 140, \(r_n=1,\dots ,10\) basis functions and 2000 repetitions.

Table 2 reports simulated mean-squared errors of the OLSE and GMELE for \(r_n=1,\ldots , 10\). The mean-squared errors are multiplied by 10 for skew normal errors. We also list the average missingness rates (MR).

In most cases the GMELE performs much better (smaller MSEs) than the OLSE, except in some of the small samples. The results for the scenario with uniform covariates are better than the corresponding figures for standard normal covariates, and the mean-squared errors for skew normal errors are smaller than those for mixture normal errors.

5.3 Behavior for errors without finite Fisher information

A different scenario is considered in Table 3, namely when the errors are from an exponential distribution. Since the exponential distribution does not have finite Fisher information for location, this scenario is not covered by our theory, but the results still demonstrate the superior performance of the GMELE over the OLSE.

Table 3 Simulated MSEs for exponential error

Remark 4

The choice of the number of basis functions \(r_n\) (and \(r_N\)) does affect the performance of the GMELE. This suggests using a data-driven choice. One possibility is the approach of Peng and Schick (2005, Sect. 5.1), who used the bootstrap to select \(r_n\) in a related setting, with convincing results. The idea is to compute the bootstrap mean-squared errors of the estimator (the GMELE in our case) for different values of \(r_n\), say for \(r_n = 1,\ldots , 10\), and then select the \(r_n\) with the minimum bootstrap mean-squared error.
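
A sketch of this selection rule (our own illustration; it reuses the hypothetical gmele helper from Sect. 5.1 and centers the bootstrap mean-squared error at the full-sample estimate):

```r
## Illustrative sketch: bootstrap choice of r_n by minimizing the bootstrap MSE of the GMELE.
select_r <- function(X, Y, r_max = 10, B = 200) {
  bmse <- sapply(1:r_max, function(r) {
    bhat <- gmele(X, Y, r)                            # full-sample estimate for this r
    boot <- replicate(B, {
      idx <- sample(seq_along(Y), replace = TRUE)     # resample the pairs (X_j, Y_j)
      gmele(X[idx], Y[idx], r)
    })
    mean((boot - bhat)^2)                             # bootstrap mean-squared error
  })
  which.min(bmse)                                     # selected number of basis functions
}
```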

5.4 Comparison of two tests

We performed a small study comparing the empirical likelihood test about the slope from Remark 2 and the corresponding bootstrap test, which uses resampling instead of the \(\chi ^2\) approximation to obtain critical values. The null hypothesis is \(\beta = \beta _0 =1\), and the nominal level is .05. As in Table 1 we consider only the full model and the sample size \(n=100\). Table 4 reports the simulated significance level and power of the two tests, using \(r_n=1, 2, \dots , 5\) basis functions. The covariates X and the errors \(\varepsilon \) were generated from the same distributions as before. The bootstrap resample size was taken to be the same as the sample size (i.e., \(n=100\)), while we used more repetitions than before: in order to stabilize the results obtained by the bootstrap method we worked with 10,000 repetitions. Our simulations indicate that the results based on the \(\chi ^2\) approximation (denoted by \(\chi ^2\)) are much more reliable than the results of the bootstrap approach (denoted by \(\mathscr {B}\)). For \(r_n \ge 3\) the bootstrapped significance levels are far away from the nominal level 5%: they are between 11 and 60%, i.e., the test is far too liberal, which is in contrast to the \(\chi ^2\) approach. The significance levels for \(r_n=1,2\) are reasonable for both tests. In terms of power the bootstrap test is better than the \(\chi ^2\) test in the upper part of Table 4 (normal covariates); for uniform covariates it is the other way round.
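
Since the exact resampling scheme is not spelled out above, the following sketch should be read as one generic possibility only: pairs are resampled with replacement and the statistic is recentered at an estimate of \(\beta \) so that the null hypothesis holds in the bootstrap world (hypothetical helpers gmele and neg2logR from the earlier sketches).

```r
## Illustrative sketch only: bootstrap critical value for the empirical likelihood test.
boot_crit <- function(X, Y, r, alpha = 0.05, B = 500) {
  bhat <- gmele(X, Y, r)                         # recentering point (one possible choice)
  stat <- replicate(B, {
    idx <- sample(seq_along(Y), replace = TRUE)  # resample the pairs (X_j, Y_j)
    neg2logR(bhat, X[idx], Y[idx], r)            # bootstrap version of -2 log R_n
  })
  quantile(stat, 1 - alpha)                      # bootstrap critical value
}
## Reject H_0: beta = beta_0 when neg2logR(beta_0, X, Y, r) exceeds this critical value.
```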

Table 4 Simulated significance level and power of the empirical likelihood test about the slope using \(\chi ^2\) and bootstrap quantiles

6 Proofs

This section contains the proofs of Theorem 1 (given in the first subsection) and of Theorem 2. The proof of the uniform expansion that is provided in Theorem 2 is split into three parts. In Sect. 6.2 we give six conditions and show that they are sufficient for the expansion. That the conditions are indeed satisfied is shown separately in Sects. 6.3 and 6.4. Section 6.5 contains an auxiliary result. As explained in the introduction, we only need to prove the results for the full model, i.e., the case when \(\pi (X)\) equals one.

6.1 Proof of Theorem 1

Let \(\mu \) denote the mean and \(\tau \) denote the standard deviation of X. We should point out that \(\mathscr {R}_{n}(b)\) does not change if we replace \((X_j-\bar{X})\) by \((X_j-\bar{X})/\tau = V_j - \bar{V}\), where

$$\begin{aligned} V_j=\dfrac{X_j-\mu }{\tau } \quad \text{ and } \quad \bar{V} = \frac{1}{n}\sum _{j=1}^nV_j. \end{aligned}$$

Thus, for the purpose of our proofs, we may assume that \(\mathscr {R}_{n}(b)\) is given by

$$\begin{aligned} \mathscr {R}_{n}(b)= \sup \left\{ \prod _{j=1}^n n\pi _j: \pi \in \mathscr {P}_n,\ \sum _{j=1}^n \pi _j\left( V_j-\bar{V}\right) v_n\left( \mathbb F_{b}\left( Y_j-b X_j\right) \right) =0 \right\} . \end{aligned}$$

In what follows we shall repeatedly use the bounds

$$\begin{aligned} |v_n(y)|^2 \le 2 r_n, \quad |v_n'(y)|^2 \le 2\pi ^2 r_n^3, \quad \text{ and } \quad |v_n''(y)|^2 \le 2\pi ^4 r_n^5 \end{aligned}$$

for all real y.
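
For completeness, these bounds are immediate from the basis (6): since \(\varphi _k'(x)=-\sqrt{2}\, k\pi \sin (k\pi x)\) and \(\varphi _k''(x)=-\sqrt{2}\, (k\pi )^2\cos (k\pi x)\), we have

$$\begin{aligned} |v_n(y)|^2 = \sum _{k=1}^{r_n} 2\cos ^2(k\pi y)\le 2r_n, \quad |v_n'(y)|^2 \le 2\pi ^2 \sum _{k=1}^{r_n} k^2 \le 2\pi ^2 r_n^3, \quad |v_n''(y)|^2 \le 2\pi ^4 \sum _{k=1}^{r_n} k^4 \le 2\pi ^4 r_n^5. \end{aligned}$$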

Let us set \(Z_j= V_j v_n(F(\varepsilon _j))\) and \(\hat{Z}_j = (V_j - \bar{V}) v_n(\mathbb F_{\beta }(\varepsilon _j))\), \(j=1,\dots ,n\). With \(Z=Z_1\), we find the identities \(E[Z]=0\) and \(E[ZZ^{\top }]= I_{r_n}\), where \(I_{r_n}\) is the \(r_n\times r_n\) identity matrix, and the bound \(E[|Z|^4] \le (2r_n)^2 E[V^4] =O\left( r_n^2\right) \). As shown in Peng and Schick (2013), these results yield

$$\begin{aligned} \tilde{Z}_n = \dfrac{1}{\sqrt{n}} \sum _{j=1}^nZ_j = O_P\left( r_n^{1/2}\right) \end{aligned}$$
(9)

and

$$\begin{aligned} \sup _{|u|=1}\left| \frac{1}{n}\sum _{j=1}^n\left( u^{\top }Z_j\right) ^2 -1\right| \le \left| \frac{1}{n}\sum _{j=1}^nZ_j Z_j^{\top }-I_{r_n} \right| = O_P(r_nn^{-1/2}). \end{aligned}$$
(10)

From Corollary 7.6 in Peng and Schick (2013) and \(r_n^4=o(n)\), the desired result follows if we verify

$$\begin{aligned} \dfrac{1}{\sqrt{n}} \sum _{j=1}^n\left( \hat{Z}_j-Z_j\right) = o_P(1) \quad \text{ and } \quad \frac{1}{n}\sum _{j=1}^n|\hat{Z}_j-Z_j|^2 = o_P\left( r_n^3/n\right) . \end{aligned}$$

Let

$$\begin{aligned} {\varDelta }_j= v_n\left( \mathbb F_{\beta }(\varepsilon _j)\right) -v_n\left( F(\varepsilon _j)\right) , \quad j=1,\dots ,n. \end{aligned}$$

In view of the identity \(\hat{Z}_j-Z_j= V_j {\varDelta }_j-\bar{V} {\varDelta }_j -\bar{V} v_n(F(\varepsilon _j)),\) the bound \(|v_n|^2 \le 2r_n\), and the fact \(n^{1/2} \bar{V}=O_P(1)\), it is easy to see that the desired results follow from the following rates:

$$\begin{aligned} \begin{aligned} S_1&=\dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j {\varDelta }_j = O_P\left( r_n^{3/2} n^{-1/2}\right) , \\ S_2&= \frac{1}{n}\sum _{j=1}^n{\varDelta }_j = o_P\left( r_n^{3/2} n^{-1/2}\right) , \\ S_3&= \frac{1}{n}\sum _{j=1}^nv_n(F(\varepsilon _j)) = O_P\left( r_n^{1/2} n^{-1/2}\right) , \\ S_4&=\frac{1}{n}\sum _{j=1}^nV_j^2 |{\varDelta }_j|^2 = O_P\left( r_n^{3} n^{-1}\right) . \end{aligned} \end{aligned}$$

Note that \({\varDelta }_1,\dots ,{\varDelta }_n\) are functions of the errors \(\varepsilon _1,\dots ,\varepsilon _n\) only and satisfy

$$\begin{aligned} M_n= \max _{1\le j \le n}|{\varDelta }_j|^2 \le 2\pi ^2 r_n^3 \sup _{t\in \mathbb R} |\mathbb F_{\beta }(t)-F(t)|^2 = O_P\left( r_n^3/n\right) . \end{aligned}$$

Conditioning on the errors thus yields

$$\begin{aligned} E[|S_1|^2 | \varepsilon _1,\dots ,\varepsilon _n] = E[S_4|\varepsilon _1,\dots ,\varepsilon _n] \le M_n. \end{aligned}$$

This establishes the rates for \(S_1\) and \(S_4\). The other rates follow from \(|S_2|^2 \le M_n\) and \(n E[|S_3|^2] = E[|v_n(F(\varepsilon ))|^2]= r_n\).

6.2 Proof of Theorem 2

For \(t \in \mathbb R\), we let \(\hat{F}_{nt}=\mathbb F_{\beta +n^{-1/2}t}\) and note that \(\hat{F}_{nt}\) is the empirical distribution function of the random variables

$$\begin{aligned} \varepsilon _{jt}= \varepsilon _j - n^{-1/2} t X_j, \quad j=1,\dots ,n. \end{aligned}$$

These random variables are independent with common distribution function \(F_{nt}\) given by

$$\begin{aligned} F_{nt}(y)= E\left[ \hat{F}_{nt}(y)\right] = E\left[ F\left( y+n^{-1/2}tX\right) \right] , \quad y\in \mathbb R. \end{aligned}$$

To simplify notation we introduce

$$\begin{aligned} \hat{R}_{jt}= \hat{F}_{nt}(\varepsilon _{jt}), \quad R_{jt}= F_{nt}(\varepsilon _{jt}), \quad R_j = F(\varepsilon _j), \end{aligned}$$

and

$$\begin{aligned} \hat{Z}_{jt}= (V_j-\bar{V}) v_n(\hat{R}_{jt}), \quad Z_{jt} = V_j v_n(R_{jt}), \quad Z_j = V_j v_n(R_j). \end{aligned}$$

Since we are working with the form of the empirical likelihood given in the previous section, we have

$$\begin{aligned} \mathscr {R}_{n}(\beta +n^{-1/2}t)= \sup \left\{ \prod _{j=1}^n n\pi _j: \pi \in \mathscr {P}_n,\ \sum _{j=1}^n \pi _j\hat{Z}_{jt}=0 \right\} , \quad t\in \mathbb R. \end{aligned}$$

Fix a sequence \(C_n\) such that \(C_n \ge 1\) and \(C_n =O((\log n)^{1/2})\). The desired result follows if we verify the uniform expansion

$$\begin{aligned} \sup _{|t|\le C_n}\dfrac{|-2 \log \mathscr {R}_{n}(\beta +n^{-1/2}t) -|\tilde{Z}_n|^2 + 2t \varGamma _n- t^2 \tau ^2 J_f |}{(1+|t|)^2} = o_P(1) \end{aligned}$$
(11)

with \(\tilde{Z}_n\) as in (9). To verify (11) we introduce

$$\begin{aligned} \nu _n = E[ X \ell _f(\varepsilon ) V v_n(F(\varepsilon ))]. \end{aligned}$$

We shall establish (11) by verifying the following six conditions.

$$\begin{aligned}&\sup _{|t|\le C_n}\sup _{|u|=1}\left| \frac{1}{n}\sum _{j=1}^n\left( u^{\top }\hat{Z}_{jt}\right) ^2 -1\right| = o_P\left( 1/r_n\right) , \end{aligned}$$
(12)
$$\begin{aligned}&\sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^n\left( \hat{Z}_{jt} - Z_{jt}\right) \right| =o_P\left( r_n^{-1/2}\right) , \end{aligned}$$
(13)
$$\begin{aligned}&\sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^n\left( Z_{jt} - Z_{j} - E[Z_{jt}-Z_j]\right) \right| =o_P\left( r_n^{-1/2}\right) , \end{aligned}$$
(14)
$$\begin{aligned}&\sup _{|t|\le C_n}| n^{1/2} E\left[ Z_{1t}-Z_1\right] + t \nu _n| = o\left( r_n^{-1/2}\right) ,\end{aligned}$$
(15)
$$\begin{aligned}&|\nu _n|^2 \rightarrow \tau ^2 J_f,\end{aligned}$$
(16)
$$\begin{aligned}&\nu _n^{\top } \tilde{Z}_n - \varGamma _n= \dfrac{1}{\sqrt{n}} \sum _{j=1}^n\left[ \nu _n^{\top } Z_j - (X_j-\mu )\ell _f(\varepsilon _j)\right] = o_P(1). \end{aligned}$$
(17)

These six conditions are proved in the next two subsections. We first establish their sufficiency.

Lemma 1

The conditions (12)–(17) imply (11).

To prove this lemma, we use the following result which is a special case of Lemma 5.2 in Peng and Schick (2013). This version was used in Schick (2013).

Lemma 2

Let \(x_1, \dots , x_n\) be m-dimensional vectors. Set

$$\begin{aligned} \bar{x}=\frac{1}{n}\sum _{j=1}^nx_j, \quad x^*=\max _{1\le j \le n} |x_j|, \quad \nu _4 = \frac{1}{n}\sum _{j=1}^n|x_j|^4, \quad S=\frac{1}{n}\sum _{j=1}^nx_j x_j^\top , \end{aligned}$$

and let \(\lambda \) and \({\varLambda }\) denote the smallest and largest eigenvalue of the matrix S. Then, the inequality \(\lambda > 5|\bar{x}| x^*\) implies

$$\begin{aligned} \Big | -2\log (\mathscr {R})-n\bar{x}^{\top }S^{-1}\bar{x} \Big | \le \dfrac{n |\bar{x}|^3 \left( {\varLambda }\nu _4\right) ^{1/2}}{\left( \lambda - |\bar{x}|x^*\right) ^3} +\dfrac{4n {\varLambda }^2 |\bar{x}|^4 \nu _4 }{\lambda ^2 \left( \lambda - |\bar{x}|x^*\right) ^4} \end{aligned}$$

with

$$\begin{aligned} \mathscr {R} = \sup \left\{ \prod _{j=1}^n n\pi _j: \pi \in \mathscr {P}_n,\ \sum _{j=1}^n \pi _jx_j=0 \right\} . \end{aligned}$$

Proof of Lemma 1 We introduce

$$\begin{aligned} \mathbb T(t)= \frac{1}{n}\sum _{j=1}^n\hat{Z}_{jt} \quad \text{ and } \quad \mathbb S(t)= \frac{1}{n}\sum _{j=1}^n\hat{Z}_{jt} \hat{Z}_{jt}^{\top }, \end{aligned}$$

and let \(\lambda _n(t)\) and \({\varLambda }_n(t)\) denote the smallest and largest eigenvalues of \(\mathbb S(t)\), i.e.,

$$\begin{aligned} \lambda _n(t)= \inf _{|u|=1} u^{\top }\mathbb S(t) u = \inf _{|u|=1} \frac{1}{n}\sum _{j=1}^n\left( u^{\top } \hat{Z}_{jt}\right) ^2 \end{aligned}$$

and

$$\begin{aligned} {\varLambda }_n(t)= \sup _{|u|=1}u^{\top }\mathbb S(t) u = \sup _{|u|=1}\frac{1}{n}\sum _{j=1}^n\left( u^{\top } \hat{Z}_{jt}\right) ^2. \end{aligned}$$

By (12), we have

$$\begin{aligned} \sup _{|t|\le C_n}|\lambda _n(t)-1| = o_P(1) \quad \text{ and } \quad \sup _{|t|\le C_n}|{\varLambda }_n(t)-1| = o_P(1). \end{aligned}$$

The conditions (13)–(15) imply

$$\begin{aligned} \sup _{|t|\le C_n}|n^{1/2} \mathbb T(t) - \tilde{Z}_n + t\nu _n | = o_P\left( r_n^{-1/2}\right) . \end{aligned}$$
(18)

This, together with (9) and (16) yields

$$\begin{aligned} \sup _{|t|\le C_n}n |\mathbb T(t)|^2 = O_P(r_n). \end{aligned}$$
(19)

Next, we find

$$\begin{aligned} \sup _{|t|\le C_n}\max _{1\le j \le n}| \hat{Z}_{jt}| \le \left( 2r_n\right) ^{1/2} \max _{1\le j \le n}|V_j -\bar{V}| = o_P\left( r_n^{1/2} n^{1/4}\right) \end{aligned}$$

and

$$\begin{aligned} \sup _{|t|\le C_n}\frac{1}{n}\sum _{j=1}^n|\hat{Z}_{jt}|^4 \le (2r_n)^2 \frac{1}{n}\sum _{j=1}^n|V_j -\bar{V}|^4 = O_P\left( r_n^2\right) . \end{aligned}$$

Thus, we derive

$$\begin{aligned} \sup _{|t|\le C_n}\Big | -2 \log \mathscr {R}_{n}\left( \beta +n^{-1/2}t\right) - n \mathbb T(t)^{\top } \left( \mathbb S(t)\right) ^{-1} \mathbb T(t) \Big | = o_P(1), \end{aligned}$$
(20)

since by Lemma 2 the left-hand side is of order \(O_P( r_n^{5/2} n^{-1/2} + r_n^4/n)\). For a positive definite matrix A and a compatible vector x, we have

$$\begin{aligned} |x^{\top } A^{-1} x - x^{\top }x| \le x^{\top }A^{-1} x \sup _{|u|=1}|1-u^{\top }Au| \le \dfrac{|x|^2}{\lambda } \sup _{|u|=1}|1- u^{\top }Au| \end{aligned}$$

with \(\lambda \) the smallest eigenvalue of A. This, together with (12) and (19) yields

$$\begin{aligned} \sup _{|t|\le C_n}n |\mathbb T(t)^{\top }(\mathbb S(t))^{-1} \mathbb T(t)- \mathbb T(t)^{\top }\mathbb T(t)| = o_P(1). \end{aligned}$$
(21)

With the help of (9), (16) and (18) we verify

$$\begin{aligned} \sup _{|t|\le C_n}\Big | n|\mathbb T(t)|^2 -|\tilde{Z}_n|^2 + 2t \nu _n^{\top }\tilde{Z}_n - t^2 |\nu _n|^2\Big | = o_P(1). \end{aligned}$$
(22)

The results (20)–(22) yield the expansion

$$\begin{aligned} \sup _{|t|\le C_n}\Big | - 2\log \mathscr {R}_{n}\left( \beta + n^{-1/2}t\right) - |\tilde{Z}_n|^2 + 2t \nu _n^{\top } \tilde{Z}_n -t^2 |\nu _n|^2\Big | = o_P(1). \end{aligned}$$

From (16) and (17) we derive the expansion

$$\begin{aligned} \sup _{|t|\le C_n}\dfrac{ \left| 2t \left( \nu _n^{\top }\tilde{Z}_n - \varGamma _n\right) - t^2 \left( |\nu _n|^2 - \tau ^2 J_f\right) \right| }{(1+|t|)^2} = o_P(1). \end{aligned}$$

The desired result (11) follows from the last two expansions. \(\square \)

6.3 Proofs of (14)–(17)

We begin by mentioning properties of f and F that are crucial to the proofs. Since f has finite Fisher information for location, we have

$$\begin{aligned}&\int |f(y+t)-f(y+s)|\,dy \le B_1 |t-s|, \end{aligned}$$
(23)
$$\begin{aligned}&|F(t)-F(s)| \le B_1 |t-s|, \end{aligned}$$
(24)
$$\begin{aligned}&|F(t+s)-F(t)- s f(t)| \le B_2 |s|^{3/2}, \end{aligned}$$
(25)
$$\begin{aligned}&\int |F(y+s)-F(y) -sf(y)|\,dy \le B_1 s^2 \end{aligned}$$
(26)

for all real s and t and some constants \(B_1\) and \(B_2\); see, for example, Peng and Schick (2016).

Next, we look at the process

$$\begin{aligned} H_n(t)= \dfrac{1}{\sqrt{n}} \sum _{j=1}^n\left( h_{nt}(X_j,Y_j)- E\left[ h_{nt}(X,Y)\right] \right) , \quad t\in \mathbb R, \end{aligned}$$

where \(h_{nt}\) are measurable functions from \(\mathbb R^2\) to \(\mathbb R^{m_n}\) such that \(h_{n0}=0\). We are interested in the cases \(m_n=1\) and \(m_n=r_n\). A version of the following lemma was used in Peng and Schick (2016).

Lemma 3

Suppose that the map \(t\mapsto h_{nt}(x,y)\) is continuous for all \(x, y \in \mathbb R\) and

$$\begin{aligned} E\left[ |h_{nt}(X,Y) -h_{ns}(X,Y)|^2 \right] \le K_n |t-s|^2, \quad s,t\in \mathbb R\end{aligned}$$
(27)

for some positive constants \(K_n\). Then, we have the rate

$$\begin{aligned} \sup _{|t|\le C_n}|H_n(t)| = O_P\left( C_n K_n^{1/2}\right) . \end{aligned}$$

Proof of (14) The desired result follows from Lemma 3 applied with

$$\begin{aligned} h_{nt}(X,Y)= V \left[ v_n\left( F_{nt} \left( \varepsilon - n^{-1/2} t X\right) \right) - v_n\left( F\left( \varepsilon \right) \right) \right] , \quad t\in \mathbb R, \end{aligned}$$

and \(K_n = 2\pi ^2 r_n^3 B_1^2 E[V^2 (X_1-X)^2]/n\). Indeed, we have \(h_{n0}=0\) and (27) in view of (24). Note also that \(r_n C_n^2 K_n \rightarrow 0\). \(\square \)

Proof of (15) Since V and \(\varepsilon \) are independent and V has mean zero, we obtain the identity

$$\begin{aligned} n^{1/2} E\left[ Z_{1t}-Z_1\right] +t\nu _n = n^{1/2}E\left[ V_1v_n\left( F_{nt}(\varepsilon _{1t})\right) \right] +t\nu _n = n^{1/2} \left( {\varDelta }_1(t)+{\varDelta }_2(t)\right) \end{aligned}$$

with

$$\begin{aligned} {\varDelta }_1(t)=E\left[ V \int \left[ v_n(F_{nt}(y))-v_n(F(y))\right] \left[ f(y+n^{-1/2}tX)-f(y)\right] \,dy \right] \end{aligned}$$

and

$$\begin{aligned} {\varDelta }_2(t) =E \left[ V\int v_n(F(y)) \left[ f(y+n^{-1/2}tX)-f(y)-n^{-1/2}t X f'(y)\right] \,dy\right] . \end{aligned}$$

It follows from (23) and (24) that

$$\begin{aligned} |{\varDelta }_1(t)| \le \left( 2\pi ^2 r_n^3\right) ^{1/2} B_1 E[|X|] B_1 E[|VX|] t^2/n. \end{aligned}$$

Integration by parts shows that

$$\begin{aligned} {\varDelta }_2(t)= - E\left[ V \int v_n'(F(y))f(y) \left[ F(y+n^{-1/2}tX)-F(y)-n^{-1/2}t X f(y)\right] \,dy\right] . \end{aligned}$$

It follows from (24) that f is bounded by \(B_1\). This, together with (26), yields the bound

$$\begin{aligned} |{\varDelta }_2(t)| \le \left( 2\pi ^2 r_n^3\right) ^{1/2} B_1^2 E\left[ |VX^2|\right] t^2/n. \end{aligned}$$

From these bounds we conclude

$$\begin{aligned} \sup _{|t|\le C_n}\Big | n^{1/2} E\left[ Z_{1t}-Z_1\right] +t\nu _n\Big |= O\left( r_n^{3/2}(\log n) n^{-1/2}\right) =o\left( r_n^{-1/2}\right) , \end{aligned}$$

which is the desired (15). \(\square \)

Proof of (16) and (17) Note that \(\nu _n\) can be written as

$$\begin{aligned} \nu _n = E\left[ X\ell _f(\varepsilon ) V v_n(F(\varepsilon ))\right] = \tau E\left[ V \ell _f(\varepsilon ) V v_n(F(\varepsilon ))\right] . \end{aligned}$$

The functions \(V\varphi _1(F(\varepsilon )),V\varphi _2(F(\varepsilon )),\dots \) form an orthonormal basis of the space \(\mathscr {V}=\{V a(\varepsilon ): a\in L_{2,0}(F)\}\). Thus, \(\nu _n\) is the vector consisting of the first \(r_n\) Fourier coefficients of \((X-\mu )\ell _f(\varepsilon ) = \tau V\ell _f(\varepsilon )\) with respect to this basis. Because \((X-\mu )\ell _f(\varepsilon )\) is a member of \(\mathscr {V}\), Parseval’s theorem yields

$$\begin{aligned} |\nu _n|^2 \rightarrow E\left[ \left( (X-\mu )\ell _f(\varepsilon )\right) ^2\right] =\tau ^2 J_f \end{aligned}$$

and

$$\begin{aligned} E\left[ \left( \nu _n^{\top } V v_n(F(\varepsilon )) - (X-\mu )\ell _f(\varepsilon )\right) ^2\right] \rightarrow 0. \end{aligned}$$

The former is (16) and the latter implies (17). \(\square \)

6.4 Proofs of (12) and (13)

We begin by deriving properties of \(\hat{R}_{jt}\) and \(R_{jt}\) which we need in the proofs of (12) and (13). For this we introduce the leave-one-out version \(\tilde{R}_{jt}\) of \(\hat{R}_{jt}\) defined by

$$\begin{aligned} \tilde{R}_{jt}= \dfrac{1}{n-1} \sum _{i: i\ne j} \mathbf 1[ \varepsilon _{it} \le \varepsilon _{jt}] = \dfrac{n}{n-1} \hat{R}_{jt}- \dfrac{1}{n-1} \mathbf 1[\varepsilon _{jt}\le \varepsilon _{jt}], \end{aligned}$$

which satisfies

$$\begin{aligned} |\hat{R}_{jt}-\tilde{R}_{jt}|\le \dfrac{2}{n-1}. \end{aligned}$$
(28)

We abbreviate \(\tilde{R}_{j0}\) by \(\tilde{R}_j\). In the ensuing arguments we rely on the following properties of these quantities, where \(B_1\) and \(B_2\) are the constants appearing in (24) and (25):

$$\begin{aligned}&\max _{1\le j \le n}\sup _{|t|\le C_n}|\tilde{R}_{jt}-R_{jt}-\tilde{R}_j +R_j| = O_P\left( n^{-5/8} \left( C_n\log n\right) ^{1/2}\right) , \end{aligned}$$
(29)
$$\begin{aligned}&\max _{1\le j \le n}|\tilde{R}_j - R_j| = O_P\left( n^{-1/2}\right) , \end{aligned}$$
(30)
$$\begin{aligned}&\sup _{|t|\le C_n}|R_{jt}- R_j| \le B_1 C_n n^{-1/2} \left( |X_j|+E[|X|]\right) , \end{aligned}$$
(31)
$$\begin{aligned}&\sup _{|t|\le C_n}| R_{jt}- R_j + n^{-1/2}t(X_j-\mu ) f(\varepsilon _j)|\nonumber \\&\quad \le B_2 C_n^{3/2} n^{-3/4} \sqrt{2} \left( |X_j|^{3/2} + E\left[ |X|^{3/2}\right] \right) . \end{aligned}$$
(32)

The second statement follows from properties of the empirical distribution function and the last two statements from (24) and (25), respectively. To prove (29) we use Lemma 4 from Sect. 6.5. Let \(\zeta _j(t)=\tilde{R}_{jt}- R_{jt} - \tilde{R}_j +R_j\) and \(m=n-1\). These random variables are identically distributed, and \((n-1)\zeta _n(t)\) equals \(\tilde{N}(n^{-1/2} t,X_n,\varepsilon _n)\) from the beginning of Sect. 6.5, with the role of \(Y_i\) played by \(\varepsilon _i\). Lemma 4 gives

$$\begin{aligned} \begin{aligned}&P\left( \max _{1\le j \le n}\sup _{|t|\le C_n}|\zeta _j(t)|> 4KC_n^{1/2} \left( n-1\right) ^{-5/8}(\log (n-1))^{1/2}\right) \\&\quad \le n P\left( \sup _{|t|\le C_n}|\zeta _n(t)|> 4 KC_n^{1/2} m^{-5/8} (\log m )^{1/2}\right) \\&\quad \le n P\left( |X_n|>m^{1/4}\right) + n E\left[ \mathbf 1\left[ |X_n|\le m^{1/4}\right] p_m\left( \varepsilon _n,C_n,K\right) \right] \\&\quad \le 2 E\left[ |X|^4\mathbf 1\left[ |X|> m^{1/4}\right] \right] + C n^2 \exp (-K\log (m)) \end{aligned} \end{aligned}$$

for \(m>2\) and \(K> 6 B_1(1+E[|X|])\) and some constant C. The desired (29) is now immediate.

Note that statements (28)–(31) yield the bounds

$$\begin{aligned} \sup _{|t|\le C_n}|\hat{R}_{jt}-R_j| \le B_1 C_n n^{-1/2} \left( |X_j|+E[|X|]\right) + n^{-1/2}\xi _{n}, \quad j=1,\dots ,n,\qquad \end{aligned}$$
(33)

which we need for the next proof. Here \(\xi _{n}\) is a positive random variable which satisfies \(\xi _{n}= O_P(1)\).

Proof of (12) Given (10) and the properties of \(r_n\), it suffices to verify

$$\begin{aligned} \sup _{|u|=1}\sup _{|t|\le C_n}\left| \frac{1}{n}\sum _{j=1}^n\left( u^{\top }\hat{Z}_{jt}\right) ^2 - \frac{1}{n}\sum _{j=1}^n\left( u^{\top }Z_j\right) ^2 \right| = o_P\left( 1/r_n\right) . \end{aligned}$$
(34)

Using the Cauchy–Schwarz inequality we bound the left-hand side of (34) by \(2 (D_n {\varLambda }_n)^{1/2}+D_n\) with

$$\begin{aligned} {\varLambda }_n = \sup _{|u|=1}\frac{1}{n}\sum _{j=1}^n\left( u^{\top }Z_j\right) ^2 \quad \text{ and } \quad D_n = \sup _{|t|\le C_n}\frac{1}{n}\sum _{j=1}^n|\hat{Z}_{jt}-Z_j|^2. \end{aligned}$$

Given (10), it therefore suffices to prove \(D_n = o_P(1/r_n^2)\). This follows from (33), the inequality

$$\begin{aligned} \begin{aligned} D_n&\le \sup _{|t|\le C_n}\frac{1}{n}\sum _{j=1}^n\left( 2 \bar{V}^2 |v_n\left( \hat{R}_{jt}\right) |^2 + 2 V_j^2 |v_n\left( \hat{R}_{jt}\right) -v_n\left( R_j\right) |^2\right) \\&\le 4r_n \bar{V}^2 + 4\pi ^2 r_n^3 \frac{1}{n}\sum _{j=1}^nV_j^2 \sup _{|t|\le C_n}|\hat{R}_{jt}-R_j|^2= O_P\left( r_n^3C_n^2/n\right) , \end{aligned} \end{aligned}$$

and the rate \(r_n^5 \log n = o(n)\). \(\square \)

Proof of (13) In view of the rate \(\bar{V}= O_P(n^{-1/2})\) and the identity

$$\begin{aligned} \hat{Z}_{jt}-Z_{jt}= V_j \left( v_n(\hat{R}_{jt})-v_n(R_{jt})\right) -\bar{V} \left( v_n(\hat{R}_{jt})-v_n(R_j)\right) -\bar{V} v_n(R_j), \end{aligned}$$

the desired (13) is implied by the following three statements:

$$\begin{aligned}&\dfrac{1}{\sqrt{n}} \sum _{j=1}^nv_n(R_j)= O_P\left( r_n^{1/2}\right) , \end{aligned}$$
(35)
$$\begin{aligned}&\sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^n\left( v_n(\hat{R}_{jt})-v_n(R_j)\right) \right| = O_P\left( C_n r_n^{3/2}\right) , \end{aligned}$$
(36)
$$\begin{aligned}&\sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j \left[ v_n(\hat{R}_{jt})- v_n(R_{jt})\right] \right| =o_P\left( r_n^{-1/2}\right) . \end{aligned}$$
(37)

We obtain (35) from \(E[v_n(F(\varepsilon ))]=0\) and \(E[|v_n(F(\varepsilon ))|^2]=r_n\). Also, (36) follows from (33) and the fact that its left-hand side is bounded by

$$\begin{aligned} (2\pi ^2 r_n^{3})^{1/2} \dfrac{1}{\sqrt{n}} \sum _{j=1}^n\sup _{|t|\le C_n}|\hat{R}_{jt}- R_j|. \end{aligned}$$

Using (28) we find

$$\begin{aligned} \sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j \left[ v_n(\hat{R}_{jt})- v_n(\tilde{R}_{jt})\right] \right| = O_P\left( r_n^{3/2} n^{-1/2}\right) . \end{aligned}$$

Taylor expansions, the bound \(|v_n'''|^2 \le 2\pi ^6 r_n^7\) and equations (28), (31) and (33) show that

$$\begin{aligned}&\sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j \left[ v_n(\tilde{R}_{jt})-v_n(R_j)-v'_n(R_j) (\tilde{R}_{jt}-R_j)\right. \right. \\&\quad \left. \left. - \frac{1}{2}v_n''(R_j) \left( \tilde{R}_{jt}-R_j\right) ^2\right] \right| \end{aligned}$$

and

$$\begin{aligned}&\sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j \left[ v_n(R_{jt})-v_n(R_j)-v'_n(R_j) (R_{jt}-R_j) \right. \right. \\&\quad \left. \left. - \frac{1}{2}v_n''(R_j) \left( R_{jt}-R_j\right) ^2 \right] \right| \end{aligned}$$

are of order \(r_n^{7/2}C_n^3 n^{-1}\). Using the identity

$$\begin{aligned} (a+b+c)^2 -a^2 -b^2 +2db= c^2 +2(a+d)b+ 2(a+b)c \end{aligned}$$

with \(a= R_{jt}-R_j\), \(b= \tilde{R}_j-R_j\), \(c=\tilde{R}_{jt}-R_{jt}-\tilde{R}_j+R_j\) and \(d=n^{-1/2}t(X_j-\mu )f(\varepsilon _j)= n^{-1/2} t\tau V_j f(\varepsilon _j)\), together with properties (29)–(32), we derive the bounds

$$\begin{aligned} \begin{aligned}&\sup _{|t|\le C_n}|(\tilde{R}_{jt}-R_j)^2 - (R_{jt}-R_j)^2 - (\tilde{R}_j-R_j)^2 + 2 n^{-1/2}t\tau V_jf(\varepsilon _j)(\tilde{R}_j-R_j)| \\&\quad \le \zeta _n (1+|X_j|)^{3/2}, \quad j=1,\dots ,n, \end{aligned} \end{aligned}$$

with \(\zeta _n = O_P(n^{-9/8} C_n^{3/2} (\log n)^{1/2})\). It follows that the left-hand side of (37) is bounded by \(|T_1|/2 + C_n\tau |T_2| + T_3 + T_4\), where

$$\begin{aligned} T_1= & {} \dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j v_n''(R_j)\left( \tilde{R}_j-R_j\right) ^2, \\ T_2= & {} \frac{1}{n}\sum _{j=1}^nV_j^2 v_n''(R_j)f(\varepsilon _j)\left( \tilde{R}_j-R_j\right) ,\\ T_3= & {} \sup _{|t|\le C_n}\Big |\dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j v_n'(R_j)(\tilde{R}_{jt}-R_{jt})\Big |, \end{aligned}$$

and

$$\begin{aligned} T_4= O_P\left( r_n^{3/2}n^{-1/2}+ r_n^{7/2}C_n^3 n^{-1}+ r_n^{5/2} n^{-5/8}C_n^{3/2} (\log n)^{1/2}\right) =o_P\left( r_n^{-1/2}\right) . \end{aligned}$$

We calculate

$$\begin{aligned} E\left[ |T_1|^2|\varepsilon _1,\dots ,\varepsilon _n\right] = \frac{1}{n}\sum _{j=1}^n|v_n''(R_j)|^2 \left( \tilde{R}_j-R_j\right) ^4 = O_P\left( r_n^5 n^{-2}\right) . \end{aligned}$$

Thus, \(|T_1|=o_P(r_n^{-1/2})\). Next, we write \(T_2\) as the vector U-statistic

$$\begin{aligned} T_2= \dfrac{1}{n(n-1)} \sum _{i\ne j} V_j^2 v_n''(F(\varepsilon _j)) f(\varepsilon _j) \left( \mathbf 1[\varepsilon _i\le \varepsilon _j]-F(\varepsilon _j)\right) \end{aligned}$$

and obtain

$$\begin{aligned} E[|T_2|^2] \le \dfrac{E[|k(\varepsilon )|^2]}{n} + \dfrac{2E[V_2^4 |v_n''(F(\varepsilon _2))|^2 f^2(\varepsilon _2)(\mathbf 1[\varepsilon _1\le \varepsilon _2]-F(\varepsilon _2))^2]}{n(n-1)} \end{aligned}$$

with \(k(x)= E[v_n''(F(\varepsilon )) f(\varepsilon ) ( \mathbf 1[x\le \varepsilon ]-F(\varepsilon ))]\). Using the representation \(f(y)= \int _y^{\infty } \ell _f(z) f(z)\,dz\) and Fubini’s theorem, we calculate

$$\begin{aligned} \begin{aligned} k(x)&=\int _{-\infty }^{\infty } v_n''(F(y))f(y) (\mathbf 1[x\le y]-F(y)) f(y)\,dy \\&= \int _{x}^{\infty } \left( v_n'(F(z))-v_n'(F(x))\right) \ell _f(z)f(z)\,dz \\&\quad -\int _{-\infty }^{\infty } \left[ v_n'(F(z))F(z)- v_n(F(z))\right] \ell _f(z)f(z)\,dz. \end{aligned} \end{aligned}$$

Thus, |k| is bounded by a constant times \(r_n^{3/2}\) and we see that \(E[|T_2|^2]= O(r_n^3/n + r_n^5/n^2)\). This proves \(C_n|T_2|= o_P(r_n^{-1/2})\).

We bound \(T_3\) by the sum \(T_{31}+T_{32}+T_{33}\), where

$$\begin{aligned} T_{31}= & {} \sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^nW_j v_n'(R_j)(\tilde{R}_{jt}-R_{jt})\right| ,\\ T_{32}= & {} \sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^nV_j\mathbf 1[|V_j|>n^{1/4}] v_n'(R_j)(\tilde{R}_{jt}-R_{jt})\right| ,\\ T_{33}= & {} \sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^nE[V\mathbf 1[|V|>n^{1/4}]] v_n'(R_j)(\tilde{R}_{jt}-R_{jt})\right| , \end{aligned}$$

and \(W_j= V_j\mathbf 1[|V_j|\le n^{1/4}]- E[V\mathbf 1[|V|\le n^{1/4}]]\). Since V has a finite fourth moment, we obtain the rates \(\max _{1\le j \le n}|V_j| = o_P(n^{1/4})\) and \(E[|V|\mathbf 1[|V|>n^{1/4}]]\le n^{-3/4}E[V^4 \mathbf 1[|V|>n^{1/4}]]= o(n^{-3/4})\). Thus, we find \(P(T_{32}>0) \le P(\max _{1\le j \le n}|V_j|> n^{1/4})\rightarrow 0\) and \(T_{33} = o_P(n^{-3/4} r_n^{3/2})\), using (29) and (30). This shows \(T_{32}+T_{33}= o_P(r_n^{-1/2})\).

To deal with \(T_{31}\) we express it as

$$\begin{aligned} T_{31}= \sup _{|t|\le C_n}n^{1/2} \left| \dfrac{1}{n(n-1)} \sum _{i\ne j} W_j v_n'(F(\varepsilon _j)) \Big (\mathbf 1[\varepsilon _{it}\le \varepsilon _{jt}]-F_{nt}(\varepsilon _{jt})\Big )\right| . \end{aligned}$$

Let us set

$$\begin{aligned} k_{nt}(z)= E\left[ W v_n\left( F\left( z+n^{-1/2} tX\right) \right) \right] , \quad z\in \mathbb R. \end{aligned}$$

Using (24) we obtain the bound

$$\begin{aligned} E\left[ |k_{nt}(\varepsilon _{jt})- k_{ns}(\varepsilon _{js})|^2\right] \le 2\pi ^2r_n^3 B_1^2 E\left[ W^2(X_j-X)^2\right] |t-s|^2/n \end{aligned}$$

and derive with the help of Lemma 3

$$\begin{aligned} \sup _{|t|\le C_n}\left| \dfrac{1}{\sqrt{n}} \sum _{j=1}^n(k_{nt}(\varepsilon _{jt})-E[k_{nt}(\varepsilon _{jt})]) \right| = O_P\left( r_n^{3/2}C_n n^{-1/2}\right) . \end{aligned}$$

We therefore obtain the rate \(T_{31}= o_P(r_n^{-1/2})\), if we verify

$$\begin{aligned} \sup _{|t|\le C_n}|U(t)| = O_P\left( r_n^{3/2}n^{-1}\log n\right) , \end{aligned}$$
(38)

where U(t) is the vector U-statistic equaling

$$\begin{aligned} \dfrac{1}{n(n-1)} \sum _{i\ne j} \left[ W_j v_n'(F(\varepsilon _j)) \Big (\mathbf 1[\varepsilon _{it}\le \varepsilon _{jt}]-F_{nt}(\varepsilon _{jt})\Big ) + k_{nt}(\varepsilon _{it})-E[k_{nt}(\varepsilon _{it})]\right] . \end{aligned}$$

It is easy to verify that U(t) is degenerate. Let \(t_k=-C_n+2kC_n/n\), \(k=0,\dots ,n\). Then, we have

$$\begin{aligned} \sup _{|t|\le C_n}|U(t)| \le \max _{1\le k \le n}\left( |U(t_k)| + \sup _{t_{k-1}\le t\le t_k} |U(t)-U(t_k)|\right) . \end{aligned}$$
(39)

For \(t\in [t_{k-1},t_k]\), we find

$$\begin{aligned} |U(t)-U(t_k)| \le (2\pi ^2 r_n^3)^{1/2} \Big (2 n^{1/4}(N^+_k +N_k^-) + 2B_1C_n n^{-3/2} S\Big ) \end{aligned}$$
(40)

with

$$\begin{aligned} S= & {} \frac{1}{n}\sum _{j=1}^n\big (|W_j|(|X_j|+E[|X|]) +E[|W|]|X_j| +2E[|WX|] +E[|W|]E[|X|]\big ),\\ N^+_k= & {} \dfrac{1}{n(n-1)} \sum _{i\ne j} \mathbf 1\left[ t_{k-1}D_{ij}< \varepsilon _i-\varepsilon _j \le t_k D_{ij}\right] \mathbf 1[D_{ij} \ge 0],\\ N^-_k= & {} \dfrac{1}{n(n-1)} \sum _{i\ne j} \mathbf 1\left[ t_{k}D_{ij}< \varepsilon _i-\varepsilon _j\le t_{k-1} D_{ij}\right] \mathbf 1[D_{ij}<0], \end{aligned}$$

and \(D_{ij}=n^{-1/2}(X_i-X_j)\). We write \(U_l(t)\) for the lth component of the vector U(t). Then, we have

$$\begin{aligned} P\left( \max _{1\le k \le n}|U(t_k)|>\eta \right) \le \sum _{k=1}^n \sum _{l=1}^{r_n} P\left( |U_l(t_k)|>\eta r_n^{-1/2}\right) , \quad \eta >0. \end{aligned}$$

Since \(U_l(t)\) is a degenerate U-statistic whose kernel is bounded by

$$\begin{aligned} b_l=2n^{1/4}\left( \sqrt{2}\pi l+\sqrt{2}\right) \le 27 n^{1/4}l \end{aligned}$$

and has second moment bounded by \(2(\pi l)^2\), we derive from part (c) of Proposition 2.3 of Arcones and Giné (1993) that

$$\begin{aligned} \sup _{|t|\le C_n} P((n-1)|U_l(t)| > \eta ) \le c_1 \exp \left( -\frac{c_2\eta }{\sqrt{2} \pi l+b_l^{2/3}\eta ^{1/3} n^{-1/3}}\right) \end{aligned}$$

for universal constants \(c_1\) and \(c_2\). Using the above we obtain

$$\begin{aligned} \begin{aligned}&P\left( \max _{1\le k \le n}|U(t_k)|> \frac{K^3 r_n^{3/2} \log n}{n-1}\right) \\&\quad \le \sum _{k=1}^n \sum _{l=1}^{r_n} P\left( (n-1) |U_l(t_k)|> K^3 r_n \log n\right) \\ {}&\quad \le n r_n c_1 \exp \left( \frac{-c_2 K^3 \log (n)}{\sqrt{2}\pi + 9 K(\log n)^{1/3}n^{-1/6}}\right) , \qquad K>0. \end{aligned} \end{aligned}$$

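To see that this bound suffices, note that \((\log n)^{1/3}n^{-1/6}\le 1\) for all \(n\ge 1\) and that \(r_n\le n\) (as is the case here), so that for fixed K,

$$\begin{aligned} n r_n c_1 \exp \left( \frac{-c_2 K^3 \log n}{\sqrt{2}\pi + 9 K(\log n)^{1/3}n^{-1/6}}\right) \le c_1\exp \left( \left( 2-\frac{c_2 K^3}{\sqrt{2}\pi +9K}\right) \log n\right) \rightarrow 0 \end{aligned}$$

as soon as \(c_2K^3>2(\sqrt{2}\pi +9K)\), which can be arranged by taking K sufficiently large.
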
This shows that

$$\begin{aligned} \max _{1\le k \le n}|U(t_k)| = O_P( r_n^{3/2} n^{-1} \log n). \end{aligned}$$
(41)

To deal with \(N_k^+\) we introduce the degenerate U-statistic

$$\begin{aligned} \tilde{N}_k^+ = \dfrac{1}{n(n-1)} \sum _{i\ne j} \mathbf 1[D_{ij} \ge 0] \xi _{k}(i,j) \end{aligned}$$

with

$$\begin{aligned} \begin{aligned} \xi _{k}(i,j)=&\mathbf 1[ t_{k-1}D_{ij} < \varepsilon _i-\varepsilon _j \le t_k D_{ij}] - F(\varepsilon _j+t_kD_{ij})+F(\varepsilon _j+t_{k-1}D_{ij}) \\&\quad - F(\varepsilon _i-t_{k-1}D_{ij})+F(\varepsilon _i-t_kD_{ij}) + F_2(t_kD_{ij})-F_2(t_{k-1}D_{ij}) \end{aligned} \end{aligned}$$

and \(F_2\) the distribution function of \(\varepsilon _1-\varepsilon _2\). It is easy to see that

$$\begin{aligned} |N_k^+ - \tilde{N}_k^+| \le 6 B_1 C_n n^{-3/2} \dfrac{1}{n(n-1)} \sum _{i\ne j} |X_i-X_j|. \end{aligned}$$
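Indeed, with \(t_k-t_{k-1}=2C_n/n\) and \(B_1\) read as a Lipschitz constant for F, each of the three paired difference terms in \(\xi _{k}(i,j)\) satisfies

$$\begin{aligned} \left| F(\varepsilon _j+t_kD_{ij})-F(\varepsilon _j+t_{k-1}D_{ij})\right| \le B_1(t_k-t_{k-1})|D_{ij}| = 2B_1C_n n^{-3/2}|X_i-X_j|, \end{aligned}$$

and averaging the three such bounds over the pairs \(i\ne j\) yields the displayed inequality.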

The kernel of the U-statistic \(\tilde{N}_k^+\) is bounded by 8 and has second moment bounded by \(D_n n^{-3/2}\) with \(D_n=2B_1C_n E[|X_1-X_2|]\). Thus, by part (c) of Proposition 2.3 in Arcones and Giné (1993), we see that the corresponding degenerate U-statistic \(\tilde{N}_k^+\) satisfies

$$\begin{aligned} \sum _{k=1}^n P\left( |\tilde{N}_k^+| > \frac{K^3 (\log n)^{3/2}n^{-1/2}}{n-1}\right) \le nc_1 \exp \left( -\frac{c_2 K^3 (\log n)^{3/2}}{D_n^{1/2}n^{-1/4} +4 K (\log n)^{1/2}}\right) . \end{aligned}$$

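For fixed K this tail bound is again summable: since \(r_n^4C_n^3=o(n)\) is implicit in the bound for \(T_4\) above (and \(r_n\ge 1\)), we have \(C_n=o(n^{1/3})\) and hence \(D_n^{1/2}n^{-1/4}\rightarrow 0\), so that for all large n

$$\begin{aligned} nc_1 \exp \left( -\frac{c_2 K^3 (\log n)^{3/2}}{D_n^{1/2}n^{-1/4} +4 K (\log n)^{1/2}}\right) \le nc_1 \exp \left( -\frac{c_2 K^2}{8}\log n\right) = c_1 n^{1-c_2K^2/8}\rightarrow 0 \end{aligned}$$

for K sufficiently large, giving \(\max _{1\le k \le n}|\tilde{N}_k^+| = O_P\left( n^{-3/2}(\log n)^{3/2}\right) \).
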
The above shows that

$$\begin{aligned} \max _{1\le k \le n}N_k^+ = O_P\left( n^{-3/2}\left( \log n\right) ^{3/2}\right) . \end{aligned}$$
(42)

Similarly one obtains

$$\begin{aligned} \max _{1\le k \le n}N_k^- = O_P\left( n^{-3/2} \left( \log n\right) ^{3/2}\right) . \end{aligned}$$
(43)

The desired (38) follows from (39)–(43) and \(S = O_P(1)\). This concludes the proof of (13). \(\square \)

6.5 Auxiliary results

Let X and Y be independent random variables. Let \((X_1,Y_1),\dots ,(X_m,Y_m)\) be independent copies of \((X,Y)\). For reals t, x and y, set

$$\begin{aligned} N(t,x,y)= \sum _{i=1}^m \left( \mathbf 1[Y_i-tX_i \le y-tx]-\mathbf 1[Y_i\le y]\right) \end{aligned}$$

and

$$\begin{aligned} \tilde{N}(t,x,y)= N(t,x,y)-E[N(t,x,y)]. \end{aligned}$$

Lemma 4

Suppose X has finite expectation and the distribution function F of Y is Lipschitz: \(|F(y)-F(x)|\le {\varLambda }|y-x|\) for all \(x, y\) and some finite constant \({\varLambda }\). Then, the inequality

$$\begin{aligned} P\left( \sup _{|t|\le \delta } |\tilde{N}(t,x,y)| > 4\eta \right) \le (8M+4)\exp \left( \dfrac{-\eta ^2}{2m{\varLambda }\delta E[|X-x|] + 2\eta /3}\right) \end{aligned}$$

holds for \(\eta >0\), \(\delta >0\), real x and y and every integer \(M\ge m{\varLambda }\delta E[|X-x|]/\eta \). In particular, for \(C\ge 1\) and \(K\ge 6 {\varLambda }(1+E[|X|])\), we have

$$\begin{aligned} \begin{aligned} p_m(y,C,K)&= \sup _{|x| \le m^{1/4}} P\left( \sup _{|t|\le C/m^{1/2}}|\tilde{N}(t,x,y)| > 4 KC^{1/2} m^{3/8}(\log m)^{1/2}\right) \\&\le \left( 12+\dfrac{8 m^{3/8}C^{1/2}}{6(\log m)^{1/2}}\right) \exp (-K\log (m)), \quad y\in \mathbb R. \end{aligned} \end{aligned}$$

Proof

Fix x and y and set \(\nu =E[|X-x|]\). Abbreviate \(N(t,x,y)\) by \(N(t)\) and \(\tilde{N}(t,x,y)\) by \(\tilde{N}(t)\), and set

$$\begin{aligned} \begin{aligned} N_+(t)&= \sum _{i=1}^m (\mathbf 1[Y_i-t(X_i-x)\le y] - \mathbf 1[Y_i\le y])\mathbf 1[X_i-x\ge 0], \\ N_-(t)&= \sum _{i=1}^m (\mathbf 1[Y_i-t(X_i-x)\le y] - \mathbf 1[Y_i\le y])\mathbf 1[X_i-x<0] \end{aligned} \end{aligned}$$

and let \(\tilde{N}_+(t)=N_+(t)-E[N_+(t)]\) and \(\tilde{N}_-(t)=N_-(t)-E[N_-(t)]\). Since F is Lipschitz, we obtain

$$\begin{aligned} |E[N_+(t_1)]-E[N_+(t_2)]| \le m {\varLambda }|t_1-t_2| \nu . \end{aligned}$$

For \(s \le t \le u\), we have

$$\begin{aligned} N_+(s)-E[N_+(u)] \le N_+(t)-E[N_+(t)]\le N_+(u)-E[N_+(s)] \end{aligned}$$

and thus

$$\begin{aligned} \tilde{N}_+(s)-m{\varLambda }|u-s| \nu \le \tilde{N}_+(t) \le \tilde{N}_+(u) + m{\varLambda }|u-s| \nu . \end{aligned}$$

It is now easy to see that

$$\begin{aligned} \sup _{|t|\le \delta } |\tilde{N}_+(t)| \le \max _{k=-M,\dots ,M} |\tilde{N}_+(k\delta /M)| + m {\varLambda }\delta \nu /M \end{aligned}$$

for every integer M. From this we obtain the bound

$$\begin{aligned} P\left( \sup _{|t|\le \delta } |\tilde{N}_+(t)| \ge 2\eta \right) \le \sum _{k=-M}^M P\left( |\tilde{N}_+(k\delta /M)|>\eta \right) + P(m{\varLambda }\delta \nu /M >\eta ). \end{aligned}$$

The Bernstein inequality and the fact that the variance of

$$\begin{aligned} (\mathbf 1[Y-t(X-x)\le y]-\mathbf 1[Y\le y])\mathbf 1[X\ge x] \end{aligned}$$

is bounded by \({\varLambda }|t| \nu \) yield

$$\begin{aligned} P(|\tilde{N}_+(k\delta /M)|>\eta ) \le 2 \exp \left( -\frac{\eta ^2}{2m{\varLambda }\delta \nu + 2\eta /3}\right) . \end{aligned}$$
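For the record, the variance bound invoked here follows from the independence of X and Y and the Lipschitz property of F:

$$\begin{aligned} \begin{aligned} \mathrm{Var}\big( (\mathbf 1[Y-t(X-x)\le y]-\mathbf 1[Y\le y])\mathbf 1[X\ge x]\big)&\le E\big[ \big| \mathbf 1[Y\le y+t(X-x)]-\mathbf 1[Y\le y]\big| \mathbf 1[X\ge x]\big] \\&= E\big[ |F(y+t(X-x))-F(y)|\mathbf 1[X\ge x]\big] \le {\varLambda }|t| \nu . \end{aligned} \end{aligned}$$

The Bernstein inequality is applied with the m summands of \(\tilde{N}_+(k\delta /M)\): each summand takes values in \(\{0,1\}\) (for \(t\ge 0\)) or \(\{-1,0\}\) (for \(t<0\)), so the centered summands are bounded by one in absolute value and have total variance at most \(m{\varLambda }\delta \nu \) for \(|t|\le \delta \).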

Thus, we have

$$\begin{aligned} P(\sup _{|t|\le \delta } |\tilde{N}_+(t)| > 2\eta ) \le 2(2M+1) \exp \left( -\frac{\eta ^2}{2m{\varLambda }\delta \nu + 2\eta /3}\right) \end{aligned}$$

for \(M \ge m {\varLambda }\delta \nu /\eta \). Similarly, one verifies for such M,

$$\begin{aligned} P(\sup _{|t|\le \delta } |\tilde{N}_-(t)| > 2\eta ) \le 2(2M+1) \exp \left( -\frac{\eta ^2}{2m{\varLambda }\delta \nu + 2\eta /3}\right) . \end{aligned}$$

Since \(\tilde{N}(t)=\tilde{N}_+(t)+\tilde{N}_-(t)\), we obtain the first result. The second result follows from the first one by taking \(\delta =Cm^{-1/2}\), \(\eta = KC^{1/2} m^{3/8} (\log m)^{1/2}\) and observing the inequality \((\log m)^{1/2}m^{-3/8} \le 1\). \(\square \)
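
For completeness, the bookkeeping behind the second result is as follows. With \(\delta =Cm^{-1/2}\), \(\eta = KC^{1/2}m^{3/8}(\log m)^{1/2}\), \(|x|\le m^{1/4}\) and \(K\ge 6{\varLambda }(1+E[|X|])\), we have \(\nu \le |x|+E[|X|]\le m^{1/4}(1+E[|X|])\), so that

$$\begin{aligned} \frac{m{\varLambda }\delta \nu }{\eta } = \frac{{\varLambda }C^{1/2}m^{1/8}\nu }{K(\log m)^{1/2}} \le \frac{C^{1/2}m^{3/8}}{6(\log m)^{1/2}} \quad \text{and hence}\quad 8M+4 \le 12+\frac{8C^{1/2}m^{3/8}}{6(\log m)^{1/2}} \end{aligned}$$

for the smallest admissible integer M, while

$$\begin{aligned} 2m{\varLambda }\delta \nu + \frac{2\eta }{3} \le \frac{K}{3}Cm^{3/4} + \frac{2K}{3}C^{1/2}m^{3/8}(\log m)^{1/2} \le KCm^{3/4} \quad \text{and}\quad \frac{\eta ^2}{KCm^{3/4}} = K\log m, \end{aligned}$$

using \(C\ge 1\) and \((\log m)^{1/2}m^{-3/8}\le 1\). Inserting these bounds into the first result gives the second.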