1 Introduction

It is well known that the ordinary least squares estimator (LSE) is the most efficient estimator of the regression coefficient in linear regression models when the noise follows a normal distribution. However, departures of the error distribution from normality may severely reduce the efficiency of the LSE, particularly when the errors are heavy-tailed or the data contain outliers. One remedy is to remove influential observations from the least-squares fit. Another approach, termed robust regression, replaces the least squares loss criterion with an outlier-resistant loss criterion in the estimation procedure. Since outliers are often genuine data in certain circumstances, such as income analysis, procedures like robust regression, which accommodate rather than remove the outliers, are often preferable.

Suppose we have a simple random sample \(\{(y_i, {\mathbf x_i}): i=1,2 ,\ldots , n\}\) from the following classical linear regression model

$$\begin{aligned} y_i={\mathbf x}_i^T\varvec{\beta }+\epsilon _i, \end{aligned}$$
(1)

where \({\mathbf x}_i=(x_{i1},\ldots ,x_{ip})^T\), \(\varvec{\beta }=(\beta _1, \ldots , \beta _p)^T\in \mathbb {R}^p\), and the noise terms \(\epsilon _i\), independent of \({\mathbf x}_i\), are i.i.d. random variables with mean zero. Robust regression estimators, introduced by Huber (1981), are obtained by minimizing \(\sum ^n_{i=1} \rho (y_i-{\mathbf x}_i^T\varvec{\beta })\) with respect to \(\varvec{\beta }\), where \(\rho \) is a loss function. Three choices of loss function are popular in the literature. The loss criterion \(\rho (x)=|x|\) leads to median regression, a special case of quantile regression (Koenker and Bassett 1978). The other two choices, the Huber loss and the Tukey bisquare loss, correspond to Huber’s two robust estimators (Huber 1981). In particular, if the loss function is chosen to be the negative log-likelihood function, we obtain the usual maximum likelihood estimator. These estimators are referred to as M-type robust estimators. Unfortunately, the median estimator may lose efficiency when there are no outliers or the error distribution is normal; it may also fail to be unique, since the loss function \(\rho (x)=|x|\) is not strictly convex. Huber’s robust estimators have high efficiency if an optimal transition point is available, but it is rather difficult to choose such a point adaptively in practice (Rousseeuw and Leroy 1987). Robust regression estimation has also seen many developments in recent years, including composite quantile regression (Zou and Yuan 2008), convex combinations of the \(L_1\) and \(L_2\) loss criteria with flexible weights (Chen et al. 2010), rank-based estimation methods (Johnson and Peng 2008), and modal regression (Yao and Li 2013; Yao et al. 2012).
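For concreteness, the three classical loss criteria can be written out directly. The sketch below is illustrative only; the tuning constants 1.345 and 4.685 are the conventional choices giving roughly 95 % efficiency under normal errors, not values prescribed by this paper.

```python
import numpy as np

def absolute_loss(r):
    # rho(x) = |x|: leads to median regression.
    return np.abs(r)

def huber_loss(r, c=1.345):
    # Quadratic near zero, linear in the tails; c is the transition point.
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def tukey_bisquare_loss(r, c=4.685):
    # Redescending: constant beyond c, so extreme outliers lose all influence.
    return np.where(np.abs(r) <= c,
                    (c**2 / 6) * (1 - (1 - (r / c)**2)**3),
                    c**2 / 6)
```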

We have two main goals in this paper, both motivated by Yao and Li’s (2013) modal regression. The first goal is to propose a new modal regression estimator after investigating the properties of the modal regression estimation method. Yao and Li (2013) showed that the convergence rate of the modal regression coefficient estimator is slower than root-\(n\), where \(n\) is the sample size. Under different conditions, we find that this rate remains root-\(n\) if the involved bandwidth \(h\) is taken to be a constant. In that case, the asymptotic variance of the modal regression coefficient estimator depends on \(h\), which can then be regarded as a tuning parameter. A data-driven method is provided to estimate the optimal bandwidth, namely the one minimizing the asymptotic variance of the modal regression coefficient estimator. Since the resulting estimation procedure has the same form as Yao and Li’s (2013) method, we still call it modal regression estimation (MRE), although the two methods differ in essence. Our simulation results indicate that the MRE not only is very robust for data sets containing outliers or having a heavy-tailed error distribution, but also is asymptotically as efficient as the least-squares-based method when there are no outliers or the error distribution is normal.

As the second goal of the paper, we propose an empirical likelihood (EL; Owen 1991) based modal regression method to construct confidence regions/intervals and test hypotheses for the regression coefficients. The aforementioned regression methods usually focus on point estimation. Beyond point estimation, confidence regions or intervals for regression coefficients are also important for evaluating estimation methods. The EL is an efficient nonparametric likelihood tool with a number of nice properties (Owen 1988, 1990, 1991): it is flexible in incorporating auxiliary information; the EL ratio statistic usually has a chi-square limiting distribution; and EL-based confidence regions have data-driven shapes. For a more thorough review of EL, we refer the reader to Owen (2001), Chen and Keilegom (2009), Wei et al. (2012), Zi et al. (2012) and references therein. In this paper, we show that the EL ratio based on the modal regression estimating equation still has a chi-square limiting distribution. Given the robustness to outliers of modal regression and the estimation efficiency of the EL, we expect the resulting EL-based modal regression to be both robust and efficient when used to test hypotheses and construct confidence regions. In simulation studies, we find that the confidence intervals (regions) based on the proposed method are shorter (smaller) than those based on least squares methods when the error follows a non-normal distribution.

The rest of the paper is organized as follows. In Sect. 2, we review modal regression and establish the asymptotic normality of the modal regression estimator when the bandwidth \(h\) is taken to be a constant. An adaptive optimal bandwidth is presented for practical purposes. In Sect. 3, we propose the EL-based modal regression estimation method for the regression coefficient. A nonparametric Wilks theorem for the resulting EL ratio statistic is proved. Simulation studies and a real data analysis are provided in Sects. 4 and 5, respectively. Section 6 concludes. For clarity, all technical proofs are deferred to the Appendix.

2 Modal regression

2.1 Modal regression estimation

We begin by briefly reviewing the background and mathematical foundation of modal regression. The mean, median and mode are three important numerical characteristics of a distribution. The mode, the most likely value of a distribution, has wide applications in astronomy, biology and finance, where the data are often skewed or contain outliers. Compared with the mean, the mode has the advantage of robustness: it is resistant to outliers. Moreover, since modal regression focuses on the relationship for the majority of the data and summarizes the “most likely” conditional values, it can provide more meaningful point prediction and larger coverage probability for prediction than mean or median regression when the data are skewed or contain outliers.

For model (1), modal regression (Yao and Li 2013) estimates the regression parameter \(\varvec{\beta }\) by maximizing

$$\begin{aligned} Q_h({\varvec{\beta }})\equiv \frac{1}{n}\sum _{i=1}^n \phi _{h}\left( y_i-{\mathbf x}_i^T\varvec{\beta }\right) , \end{aligned}$$
(2)

where \(\phi _{h}(t)=h^{-1}\phi (t/h)\), \(\phi (t)\) is a kernel density function and \(h\) is a bandwidth that determines the degree of robustness of the estimator. For example, we may choose \(\phi (t)\) to be the standard normal density function, i.e. the Gaussian kernel. As noted by Yao and Li (2013) and Yao et al. (2012), the MRE method usually produces robust estimates due to the nature of the mode. When the error distribution is symmetric with a single mode at the center, mean regression, median regression and modal regression all estimate the same regression coefficient.
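As a minimal sketch (assuming the Gaussian kernel), the objective (2) and a direct numerical maximization might look as follows; the optimizer and the reliance on a robust starting value are our illustrative choices, not part of the method itself.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def Q_h(beta, X, y, h):
    # Objective (2): average of phi_h at the residuals; phi = standard normal density.
    return np.mean(norm.pdf((y - X @ beta) / h)) / h

def modal_regression(X, y, h, beta_init):
    # Maximize Q_h; (2) is non-concave, so a robust starting value matters.
    res = minimize(lambda b: -Q_h(b, X, y, h), beta_init, method="Nelder-Mead")
    return res.x
```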

Here is the justification for the claim that the objective function (2) can be used to estimate the modal regression. Consider the case where only the intercept \(\beta =\beta _c\) is involved in the linear regression (1). Then the objective function \(Q_h({\varvec{\beta }})\) defined in (2) reduces to

$$\begin{aligned} Q_h(\beta _c)\equiv \frac{1}{n}\sum _{i=1}^n \phi _{h}\left( y_i-\beta _c\right) , \end{aligned}$$
(3)

which can be regarded as a kernel estimate of the density function of \(y\) at \(y=\beta _c\). Therefore, the maximizer of (3) is the mode of the kernel density estimate based on \(y_1,\ldots , y_n\). As \(n\rightarrow \infty \) and \(h\rightarrow 0\), the mode of the kernel density estimate converges to the mode of the distribution of \(y\) under certain conditions (Parzen 1962).
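This correspondence is easy to see numerically; a toy sketch (the simulated data and grid resolution are our own choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.standard_normal(200) + 1.5            # sample whose mode is near 1.5

h = 0.5
grid = np.linspace(y.min(), y.max(), 1000)
# Q_h(beta_c) in (3) evaluated on a grid: a kernel density estimate of y.
kde = norm.pdf((y[None, :] - grid[:, None]) / h).mean(axis=1) / h
beta_c_hat = grid[np.argmax(kde)]             # maximizer of (3) = mode of the KDE
```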

In contrast to other estimation methods, modal regression treats \(-\phi _{h}(\cdot )\) as a loss function, a special case of the M-type robust regression mentioned in Sect. 1. Since modal regression estimates the “most likely” conditional values, it can provide more robust and efficient estimation than many existing methods. Lee (1989) used the uniform kernel and the Epanechnikov kernel for \(\phi (\cdot )\) to estimate the modal regression. However, the resulting estimators are of little practical use because the objective function is non-differentiable and its distribution is intractable. Scott (1992) mentioned modal regression, but gave little methodology on how to implement it in practice. Recently, Yao and Li (2013) suggested using the Gaussian kernel for \(\phi (\cdot )\) and developed a modal expectation–maximization (MEM) algorithm to compute modal estimators for linear models. Yao et al. (2012) investigated the estimation problem in nonparametric regression using modal regression and obtained a robust and efficient estimator of the nonparametric regression function. Their estimation procedure is very convenient for practitioners to implement, and the results are encouraging for many non-normal error distributions. In addition, Yu and Aristodemou (2012) studied modal regression from a Bayesian perspective.
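With the Gaussian kernel, the MEM iteration of Yao and Li (2013) amounts to iteratively reweighted least squares; here is a minimal sketch (the convergence tolerance and stopping rule are our illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def mem_modal_regression(X, y, h, beta_init, max_iter=200, tol=1e-8):
    beta = np.asarray(beta_init, dtype=float)
    for _ in range(max_iter):
        # E-step: weights proportional to phi_h at the current residuals.
        w = norm.pdf((y - X @ beta) / h)
        w /= w.sum()
        # M-step: weighted least squares with the current weights.
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta
```

Each iteration does not decrease \(Q_h({\varvec{\beta }})\), which makes the algorithm stable in practice.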

2.2 Theoretical property

In this subsection, we take the bandwidth to be a constant and establish the asymptotic normality of the proposed modal regression estimator (MRE). The limiting variance of the MRE is found to depend on \(h\). We then recommend an optimal bandwidth that minimizes this limiting variance.

The desirable properties of the MRE are achieved under certain assumptions on both the error and the kernel function \(\phi \). Here we assume that the errors \(\epsilon _i\) in model (1) are independent and identically distributed (iid), and that the kernel function \(\phi (\cdot )\) together with the error distribution satisfies

(C1):

\(\hbox {E}(\phi ^{\prime }_h(\epsilon ))=0\), \(F(h)\equiv \hbox {E}(\phi ^{\prime \prime }_h(\epsilon ))<0\) and \(G(h)\equiv \hbox {E}(\phi ^{\prime }_h(\epsilon )^2)\) is finite for any \(h>0\);

(C2):

There exists \(c>0\) such that \(\hbox {E}\{\rho _{h,c}(\epsilon )\}<\infty \), where \(\rho _{h,c}(\epsilon )=\sup _{y:\, |y-\epsilon |<c}|\phi ^{\prime \prime \prime }_h(y)|\).

Remark 1

Assumption (C1) is a general assumption for modal regression; see Yao and Li (2013) and Yao et al. (2012). Condition (C2) is used to control the magnitude of the remainder in a third-order Taylor expansion of \(Q_h({\varvec{\beta }})\); see Eq. (18). The condition \(F(h)<0\) ensures that a local maximizer of \(Q_h({\varvec{\beta }})\) exists, while the condition \(\hbox {E}\{ \phi ^{\prime }_h(\epsilon ) \}=0\) guarantees the consistency of this local maximizer, the proposed estimator of \({\varvec{\beta }}\). Conditions (C1) and (C2) are satisfied if both the error density function and \(\phi (\cdot )\) are symmetric and the error has a unique mode. More specifically, when conditions (C1) and (C2) hold, modal regression generally estimates the same regression function as mean regression, although the MRE is more robust to outliers. In applications, these conditions will roughly be satisfied if the residual histogram of the modal regression is roughly bell-shaped or has only one mode. One may first apply the MRE method and then check whether the residuals have this property.

Theorem 1

Suppose \(\{(y_i, {\mathbf x}_i): i=1,2, \ldots , n\}\) are iid observations from model (1) with \({\varvec{\beta }}={\varvec{\beta }}_0\), where the error \(\epsilon _i\) and the covariate \({\mathbf x}_i\) are independent, and \((\epsilon _i, {\mathbf x}_i^{T})\)’s are iid with finite covariance matrix. For fixed bandwidth \(h>0\), if the error distribution and \(\phi \) satisfy conditions (C1) and (C2), then there exists a local maximizer \(\hat{\varvec{\beta }}\) of \(Q_h ({\varvec{\beta }})\) in (2) such that \(\sqrt{n}(\hat{\varvec{\beta }}-{\varvec{\beta }}_0) \mathop {\longrightarrow }\limits ^{\hbox {d}}N(0, {\varvec{\Omega }})\), where \(\mathop {\longrightarrow }\limits ^{\hbox {d}}\) stands for convergence in distribution and \({\varvec{\Omega }}= \{ G(h)/F^2(h)\}{\varvec{\Sigma }}^{-1}\) with \({\varvec{\Sigma }} = {\hbox {Cov}}({\mathbf x}_i)\) positive definite.

A proof of Theorem 1 is given in the Appendix. If \(\hbox {Var}(\epsilon _i)=\sigma ^2\), the asymptotic variance of the least squares estimator (LSE) equals \(\sigma ^2{\varvec{\Sigma }}^{-1}\). Together with Theorem 1, this implies that the asymptotic relative efficiency of the MRE over the LSE is \(r(h)= \sigma ^2 F^2(h) /G(h)\): the larger this quantity, the better the MRE performs relative to the LSE. If we regard \(h\) as a tuning parameter for choosing a good MRE, an ideal choice of this tuning parameter is

$$\begin{aligned} h_{\hbox {opt}}=\hbox {arg}\,{\max _h} \ r(h)=\hbox {arg}\,{\max _h} F^2(h)/G(h). \end{aligned}$$
(4)

This bandwidth gives the best MRE from the viewpoint of asymptotic variance relative to the LSE. A property that distinguishes \(h_{\hbox {opt}}\) from the usual bandwidth in nonparametric regression is that \(h_{\hbox {opt}}\) depends not on the sample size \(n\) but only on the error distribution and the first two derivatives of \(\phi \).
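Once \(\phi \) and the error distribution are fixed, \(r(h)\) is easy to approximate by Monte Carlo. A sketch under our own settings (Gaussian kernel, standard normal errors, so \(\sigma ^2=1\)):

```python
import numpy as np
from scipy.stats import norm

def phi1_h(t, h):
    # First derivative of phi_h for the Gaussian kernel.
    return -(t / h**3) * norm.pdf(t / h)

def phi2_h(t, h):
    # Second derivative of phi_h for the Gaussian kernel.
    return ((t / h)**2 - 1) / h**3 * norm.pdf(t / h)

rng = np.random.default_rng(1)
eps = rng.standard_normal(100_000)
for h in [0.5, 1.0, 2.0, 5.0]:
    F = phi2_h(eps, h).mean()          # Monte Carlo estimate of F(h)
    G = (phi1_h(eps, h) ** 2).mean()   # Monte Carlo estimate of G(h)
    print(h, F**2 / G)                 # r(h) with sigma^2 = 1; approaches 1
                                       # for large h under normal errors
```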

2.3 Bandwidth selection

The bandwidth plays an important role in obtaining a robust estimate. We now provide a bandwidth selection method for the practical use of the MRE. Following the idea of Yao et al. (2012), we first estimate \(F(h)\) and \(G(h)\) by

$$\begin{aligned} \hat{F}(h)=\frac{1}{n}\sum _{i=1}^n\phi ^{\prime \prime }_h(\hat{\epsilon }_i) \quad \hbox {and} \quad \hat{G}(h)=\frac{1}{n}\sum _{i=1}^n\left\{ \phi ^{\prime }_h(\hat{\epsilon }_i)\right\} ^2, \end{aligned}$$
(5)

respectively, where \(\hat{\epsilon }_i=y_i-{\varvec{x}}_i^T\tilde{\varvec{\beta }}\) and \(\tilde{\varvec{\beta }}\) is some robust pilot estimate, such as the least absolute deviation (LAD) estimator or the rank-based estimator (Johnson and Peng 2008). We recommend choosing the bandwidth to be

$$\begin{aligned} {\tilde{h}_{\hbox {opt}}} = \hbox {arg}\,{\max _h} \{ \hat{F}(h)\}^2/\hat{G}(h). \end{aligned}$$
(6)

A quick way to solve this maximization problem is a grid search. As in Yao et al. (2012), we may choose the grid points to be \(h=0.5\hat{\sigma }\times 1.02^j\) (\(0\le j\le k\)) for \(k=50\) or 100, where \(\hat{\sigma }^2=\frac{1}{n}\sum _{i=1}^n\hat{\epsilon }_i^2\).
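A sketch of this grid search (Gaussian kernel; the pilot residuals eps_hat would come from, e.g., an LAD fit):

```python
import numpy as np
from scipy.stats import norm

def select_bandwidth(eps_hat, k=100):
    # Grid search for (6) over h = 0.5 * sigma_hat * 1.02^j, j = 0, ..., k.
    sigma_hat = np.sqrt(np.mean(eps_hat**2))
    best_h, best_ratio = None, -np.inf
    for j in range(k + 1):
        h = 0.5 * sigma_hat * 1.02**j
        u = eps_hat / h
        F_hat = np.mean((u**2 - 1) / h**3 * norm.pdf(u))  # (5), second derivative
        G_hat = np.mean((u / h**2 * norm.pdf(u)) ** 2)    # (5), first derivative squared
        if F_hat**2 / G_hat > best_ratio:
            best_h, best_ratio = h, F_hat**2 / G_hat
    return best_h
```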

3 Empirical likelihood based modal regression

In this section, we propose empirical likelihood based modal regression to construct confidence regions for the regression coefficients.

From (2), we define the auxiliary random vectors (Qin and Lawless 1994)

$$\begin{aligned} \xi _i({\varvec{\beta }})={\varvec{x}}_i\phi _h^{\prime }(y_i-{\varvec{x}}_i^T{\varvec{\beta }}),\quad i=1,\ldots ,n. \end{aligned}$$
(7)

Note that \(\hbox {E}\{ \xi _i({\varvec{\beta }}_0) \} =0\) where \({\varvec{\beta }}_0\) is the true parameter value. According to the empirical likelihood principle, we define the empirical likelihood ratio function of \({\varvec{\beta }}\) to be

$$\begin{aligned} {\mathcal {L}_{n}}({\varvec{\beta }})=\sup \left\{ \prod _{i=1}^n ( np_i) \Big | p_i \ge 0, \; \sum _{i=1}^np_i=1,\; \sum _{i=1}^np_i\xi _i({\varvec{\beta }})=0\right\} . \end{aligned}$$
(8)

Given \({\varvec{\beta }}\), if \(\{(p_1, \ldots , p_n):p_i \ge 0, \; \sum _{i=1}^np_i=1,\; \sum _{i=1}^np_i\xi _i({\varvec{\beta }})=0 \}\) is empty, the likelihood ratio \(\mathcal {L}_n({\varvec{\beta }})\) is undefined. In this situation, the adjusted empirical likelihood of Chen et al. (2008) is probably the most straightforward and natural remedy, although the convention is to define \(\mathcal {L}_n({\varvec{\beta }})\) to be zero.

Otherwise \(\mathcal {L}_n({\varvec{\beta }})\) is well-defined and can be re-expressed as

$$\begin{aligned} \mathcal {L}_n({\varvec{\beta }})=\prod _{i=1}^n \left\{ 1+{\varvec{\lambda }}_\beta ^T\xi _i({\varvec{\beta }})\right\} ^{-1}, \end{aligned}$$
(9)

where \({\varvec{\lambda }}_\beta \) is the solution to

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n\frac{\xi _i({\varvec{\beta }})}{1+{\varvec{\lambda }}_\beta ^T\xi _i({\varvec{\beta }})}=0. \end{aligned}$$
(10)

Accordingly the empirical log-likelihood ratio function is defined as

$$\begin{aligned} l_n({\varvec{\beta }}):= \hbox {log}\{\mathcal {L}_n({\varvec{\beta }})\}=-\sum _{i=1}^n \hbox {log}\left\{ 1+{\varvec{\lambda }}_\beta ^T\xi _i({\varvec{\beta }})\right\} . \end{aligned}$$
(11)

A feasible and efficient algorithm is needed to compute \(l_n({\varvec{\beta }})\) if one intends to apply the empirical likelihood method. The convex duality method given in Owen (2001, pp. 60–63) serves this purpose, and we adopt it in our simulation study.
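A bare-bones sketch of this computation: a damped Newton iteration solving (10) for \({\varvec{\lambda }}_\beta \), without the boundary safeguards of Owen's full convex-dual implementation.

```python
import numpy as np

def el_log_ratio(xi, max_iter=100, tol=1e-10):
    # xi: n x p array with rows xi_i(beta). Returns l_n(beta) as in (11).
    n, p = xi.shape
    lam = np.zeros(p)
    for _ in range(max_iter):
        denom = 1.0 + xi @ lam
        grad = (xi / denom[:, None]).mean(axis=0)          # left-hand side of (10)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(xi[:, :, None] * xi[:, None, :]
                 / (denom**2)[:, None, None]).mean(axis=0)
        step = np.linalg.solve(hess, -grad)                # Newton direction
        t = 1.0
        while np.any(1.0 + xi @ (lam + t * step) <= 0):    # keep 1 + lam^T xi_i > 0
            t /= 2.0
        lam += t * step
    return -np.sum(np.log1p(xi @ lam))
```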

As expected, we find that when \({\varvec{\beta }}\) takes its true value \({\varvec{\beta }}_0\), the empirical log-likelihood ratio \(-2l_n({\varvec{\beta }}_0)\) still follows a limiting chi-square distribution. This result is summarized in the following theorem.

Theorem 2

Assume the same conditions as in Theorem 1. As \(n\) tends to infinity, we have

$$\begin{aligned} -2l_n({\varvec{\beta }}_0) \mathop {\longrightarrow }\limits ^{\hbox {d}} \chi ^2_p, \end{aligned}$$
(12)

where \(\chi ^2_p\) is the chi-square distribution with \(p\) degrees of freedom.

According to Theorem 2, the empirical likelihood ratio \(-2 l_n({\varvec{\beta }}_0)\) is asymptotically pivotal; it can be used not only to test the hypothesis \(H_0: {\varvec{\beta }}={\varvec{\beta }}_0\), but also to construct confidence regions for \({\varvec{\beta }}\). Specifically, a modal-regression-empirical-likelihood (MREL) based confidence region with confidence level \((1-\alpha )\) is given by

$$\begin{aligned} {{\mathcal {C}}_{\hbox {MREL}}}({\varvec{\beta }})=\left\{ {\varvec{\beta }}: -2l_n({\varvec{\beta }}) \le \chi ^2_{p,1-\alpha }\right\} , \end{aligned}$$

where \(\chi ^2_{p,1-\alpha }\) is the \((1-\alpha )\)-quantile of the \(\chi _p^2\) distribution. Theorem 2 implies that \({{\mathcal {C}}_{\hbox {MREL}}}({\varvec{\beta }})\) constitutes a confidence region for \({\varvec{\beta }}\) with asymptotically correct coverage probability \(1-\alpha \).

4 Simulation study

In this section, we present simulation results on the finite-sample properties of the proposed MRE and MREL methods and compare them with existing methods. The MREL method is convenient for constructing confidence intervals/regions, while it reduces to the MRE method when point estimation of the regression coefficient is of interest and the bandwidth is fixed.

We generated data sets from two models, focusing on point estimation under the first and interval/region estimation under the second. Simulation results are based on 1000 random samples with sample sizes 50, 100 and 150. The confidence level is set to 95 % whenever confidence intervals/regions are of interest.

4.1 Example 1

The main goal of this example is to examine the robustness and efficiency of the proposed modal regression estimator (MRE). Let the true regression model be

$$\begin{aligned} y_i=\beta _0+x_{i1}\beta _1+x_{i2}\beta _2+x_{i3}\beta _3+\epsilon _i, \quad i=1,\ldots ,n, \end{aligned}$$

where the covariates \({\varvec{x}}_i=(x_{i1},x_{i2},x_{i3})^T\) follow a three-dimensional normal distribution \(N(0,\Sigma )\) with unit marginal variances and pairwise correlation 0.5. The true value of the regression coefficient is \({\varvec{\beta }}=(\beta _0,\ldots ,\beta _3)^T=(1.5, 2,-1.2, 0)^T\). The error \(\epsilon _i\) is independent of \({\varvec{x}}_i\). We consider six error distributions: (1) the standard normal distribution, \(N(0,1)\); (2) the \(t\)-distribution with 3 degrees of freedom, \(t(3)\); (3) the standard Laplace distribution, \(Lp(0,1)\); (4) a mixture of two normal distributions, \(0.9N(0,1)+0.1N(0,10^2)\); (5) a normal–\(\chi ^2(5)\) mixture, \(0.9N(0,1)+0.1\chi ^2(5)\); (6) a mixture of three normal distributions, \(0.8N(0,1)+0.1N(-10,1)+0.1N(10,1)\). Throughout the paper, we choose and recommend the kernel function \(\phi \) in our MRE method to be the standard normal density function. It can be verified that the conditions of Theorem 1 are satisfied by all the above error distributions except case (5). We include case (5) in our simulation to investigate the robustness of the proposed MRE method.
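A sketch of the data generation for this example (the seed and the error case shown are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2024)
n, beta = 100, np.array([1.5, 2.0, -1.2, 0.0])

# Covariates: N(0, Sigma) with unit variances and pairwise correlation 0.5.
Sigma = 0.5 * np.eye(3) + 0.5 * np.ones((3, 3))
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
X = np.column_stack([np.ones(n), X])        # prepend the intercept column

# Error case (4): the mixture 0.9 N(0,1) + 0.1 N(0, 10^2).
is_clean = rng.random(n) < 0.9
eps = np.where(is_clean, rng.normal(0.0, 1.0, n), rng.normal(0.0, 10.0, n))
y = X @ beta + eps
```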

For comparison, we also consider the following methods: the least squares estimate (LSE), the least absolute deviation estimate (LAD), the composite quantile regression with 9 quantiles (CQR; Zou and Yuan 2008) and the rank regression estimate (RRE; Johnson and Peng 2008). For each method, we report the mean squared error (MSE) of the estimate \(\hat{\varvec{\beta }}\), i.e., \(\hbox {MSE}=(\hat{\varvec{\beta }}-{\varvec{\beta }})^T(\hat{\varvec{\beta }}-{\varvec{\beta }})/p\). To evaluate the prediction performance of the fitted model, we generated a test sample \(\{({y_i^{\hbox {test}}}, {\varvec{ {x}}_i^{\hbox {test}}}): i=1,\ldots ,200\}\) in each simulation and computed the mean absolute prediction error (MAPE), \( \sum _{i=1}^{200}|{y_i^{\hbox {test}}}-\hat{y}^{\hbox {test}}_i|/200\) with \(\hat{y}^{\hbox {test}}_i=({\varvec{ x}_i^{\hbox {test}}})^T\hat{\varvec{\beta }}\). The mean and standard error of the MSE and MAPE over 1000 replications are reported in Table 1.
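Both performance measures are immediate to compute; assuming a fitted beta_hat and a generated test sample:

```python
import numpy as np

def mse(beta_hat, beta_true):
    # MSE = (beta_hat - beta)^T (beta_hat - beta) / p.
    d = beta_hat - beta_true
    return d @ d / d.size

def mape(beta_hat, X_test, y_test):
    # Mean absolute prediction error over the test sample.
    return np.mean(np.abs(y_test - X_test @ beta_hat))
```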

Table 1 Mean and standard error of MSE and MAPE

Table 1 suggests the following observations. For a given error distribution, the performance of the MRE improves as the sample size increases. In the case of normal errors, as long as the sample size is not too small, the MRE is better than the other three robust methods and performs almost as well as the LSE. For the Laplace distribution, the LAD is well known to be the best estimator; nevertheless, the performance of the MRE is very close to that of the LAD. For the other four error distributions, the MRE clearly outperforms the other four methods, even in case (5) where the conditions of Theorem 1 are not satisfied.

Furthermore, it is worth mentioning that the MRE performs significantly better than the other methods for the three mixture error distributions. A possible reason is that the mixtures can be viewed as populations containing outliers. When the data contain extreme outliers, modal regression puts more weight on the “most likely” data around the true value, which leads to the robustness and efficiency of the proposed MRE.

Overall, the performance of the MRE is desirable, and its efficiency gain is most prominent when the data set contains outliers.

4.2 Example 2

We now study the performance of the MREL confidence intervals/regions. The usual normality-based least squares method (LS) and the least squares based empirical likelihood method (LSEL; Owen 1991) are included for comparison.

Consider the following model

$$\begin{aligned} y_i=x_{i1}\beta _1+x_{i2}\beta _2+0.5\epsilon _i, \end{aligned}$$

where \({\varvec{\beta }}=(\beta _1,\beta _2)^T=(2,1)^T\). The covariates \((x_{i1}, x_{i2})\) follow a bivariate normal distribution with mean zero; both \(x_{i1}\) and \( x_{i2}\) have unit variance and their correlation coefficient is 0.8. We generated errors from four distributions: \(N(0,1)\), \(t(3)\), \(Lp(0,1)\) and \(0.9N(0,1)+0.1N(0,10^2)\). The simulation results are summarized in Table 2 and Fig. 1.

Fig. 1

95 % confidence regions for three different methods in one simulation with sample size \(n=50\): LS (blue dot-dash line); LSEL (black dash line); MREL (red solid line). The asterisk stands for the true value of \((\beta _1,\beta _2)^T\); the circle and diamond denote the least squares estimate and the modal regression estimate, respectively. (Color figure online)

Table 2 Simulated coverage probabilities (CP) of confidence intervals (regions) for \(\beta _1, \beta _2\) and \((\beta _1,\beta _2)^T\), and the average lengths (AL) of confidence intervals, from three different approaches at nominal level 0.95, where LS denotes the confidence intervals (regions) obtained using the least squares normal asymptotic method

Remark 2

When only \(\beta _1\) is of interest, the MREL confidence interval for \(\beta _1\) can be constructed through the profile empirical log-likelihood function \(l_n(\beta _1)= \sup _{\beta _2} l_n(\beta _1,\beta _2)\). As with the usual parametric likelihood, if \(\beta _{10}\) is the true value of \(\beta _1\), then \(-2l_n(\beta _{10})\) has a \(\chi _1^2\) limiting distribution as \(n\rightarrow \infty \). Accordingly, a natural MREL confidence interval is given by

$$\begin{aligned} \mathcal {C}_{\hbox {MREL}}(\beta _1) =\left\{ \beta _1: -2l_n(\beta _1) \le \chi ^2_{1,1-\alpha }\right\} . \end{aligned}$$

The construction of \({\mathcal {C}}_{\hbox {MREL}}(\beta _2)\) is similar. In Table 2, we also report the marginal confidence intervals \({\mathcal {C}}_{\hbox {MREL}}(\beta _1)\) and \({\mathcal {C}}_{\hbox {MREL}}(\beta _2)\).
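A sketch of the profile construction, reusing phi1_h and el_log_ratio from the earlier sketches (the grid, the unconstrained scalar optimizer, and the neglect of the empty-constraint-set issue of Sect. 3 are our simplifications):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

def neg2_profile_el(beta1, X, y, h):
    # -2 l_n(beta1): maximize l_n over beta2 with beta1 held fixed.
    def neg_l(beta2):
        resid = y - X @ np.array([beta1, beta2])
        xi = X * phi1_h(resid, h)[:, None]     # xi_i(beta) as in (7)
        return -el_log_ratio(xi)
    return 2.0 * minimize_scalar(neg_l).fun

def profile_el_interval(X, y, h, beta1_grid, alpha=0.05):
    cutoff = chi2.ppf(1 - alpha, df=1)
    keep = [b for b in beta1_grid if neg2_profile_el(b, X, y, h) <= cutoff]
    return min(keep), max(keep)
```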

For a given error distribution, the coverage probability of the MREL approaches the nominal level as \(n\) increases; meanwhile the average lengths of the confidence intervals for single parameters shrink.

In the case of normal errors, the differences among the three methods are small. In particular, the MREL performs as well as the least squares based methods when the sample size \(n\) is large. For non-normal distributions, the average lengths of the MREL confidence intervals (sizes of the regions) are clearly shorter (smaller) than those of the other two methods. It is worth mentioning that the interval length of the MREL is only about one third of that of the LS and LSEL when the error follows the mixture normal distribution.

In addition, the coverage probability of the LSEL deviates significantly from the nominal level for the three non-normal error distributions when the sample size is small, and it improves only slowly as the sample size increases.

In summary, the MREL is preferable to the LS and LSEL methods when the sample size is large, in terms of both coverage probability and interval length or region size. The coverage precision of the MREL confidence interval/region needs improvement, particularly for small sample sizes. The adjusted empirical likelihood of Chen et al. (2008) and Liu and Chen (2010) or the bootstrap method can serve this purpose.

Remark 3

We take the MREL confidence regions in Fig. 2 as an example to illustrate how we compute the confidence boundary given a data set. The first step is to compute the center, the MRE \(\hat{\varvec{\beta }}\) of \(\varvec{\beta }\). Along any line through the center, the two points where the line meets the confidence boundary can then be found. Sweeping over all such lines traces out the entire confidence boundary, which is convenient to do in polar coordinates. In principle, this method applies to confidence regions of any dimension.
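A sketch of this boundary tracing (bisection along each ray from the center; the number of rays, maximum radius and iteration count are our illustrative choices):

```python
import numpy as np
from scipy.stats import chi2

def confidence_boundary(neg2_el, center, alpha=0.05, n_rays=180, r_max=5.0):
    # neg2_el: callable returning -2 l_n(beta); center: the MRE beta_hat.
    cutoff = chi2.ppf(1 - alpha, df=len(center))
    boundary = []
    for theta in np.linspace(0.0, 2 * np.pi, n_rays, endpoint=False):
        direction = np.array([np.cos(theta), np.sin(theta)])
        lo, hi = 0.0, r_max                     # bisect for the crossing radius
        for _ in range(50):
            mid = 0.5 * (lo + hi)
            if neg2_el(center + mid * direction) <= cutoff:
                lo = mid
            else:
                hi = mid
        boundary.append(center + lo * direction)
    return np.array(boundary)
```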

5 Real data analysis

In this section, we apply the proposed method to the Education Expenditure Data (Chatterjee and Price 1977). This data set consists of one observation for each of the 50 US states. It has been analyzed by Yao et al. (2012) using nonparametric modal regression. We take the per capita expenditure on public education in a state as the response variable \(y_i\), and the number of residents per thousand residing in urban areas in 1970 as the covariate \(x_i\). We fit the data with the following linear model

$$\begin{aligned} y_i=\beta _0+\beta _1x_i+\epsilon _i,\quad i=1,\ldots ,50. \end{aligned}$$
(13)
Fig. 2

95 % confidence regions for three different methods: LS (blue dot-dash line); LSEL (black dash line); MREL (red solid line). The circle and asterisk denote the point estimates of the LSE and MRE, respectively. (Color figure online)

In this example, the observation from Hawaii is an obvious outlier, with a very high per capita expenditure on public education compared with the other states. The confidence intervals for \(\beta _0\) and \(\beta _1\) based on the LS, LSEL and MREL methods are presented in Table 3, and the corresponding confidence regions are displayed in Fig. 2. (To alleviate the difference in magnitude between the two estimates \(\hat{\beta }_0\) and \(\hat{\beta }_1\) on the original scale, we divide both the response variable \(y_i\) and the covariate \(x_i\) by 100 for each observation, and then fit the transformed data with (13).)

Table 3 95 % interval estimates for education expenditure data

As we can clearly see from Table 3 and Fig. 2, the confidence interval (region) obtained by the MREL is shorter (smaller) than those of the least squares based methods, which shows that the confidence region obtained by modal regression empirical likelihood not only enjoys the advantages of a data-driven nonparametric approach but is also robust to outliers.

Fig. 3

Confidence regions based on 2000 bootstrap resamples for three different methods: LS (blue dot-dash line); LSEL (black dash line); MREL (red solid line). The circle and asterisk denote the point estimates of the LSE and MRE, respectively. (Color figure online)

To further assess the credibility of the confidence regions (intervals), we also calculated their coverage probabilities (given in Table 4) based on 2000 bootstrap resamples. As we can see from Table 4, compared with the nominal coverage of 95 %, neither of the two empirical likelihood based methods is fully satisfactory. The bootstrap method and the adjusted empirical likelihood mentioned in Sect. 4.2 can be used to improve the coverage precision.

Table 4 Coverage probability of the confidence region (interval) based on 2000 bootstrap resamples

The comparison of confidence region sizes is not fair if the regions under comparison have rather different coverage probabilities. For a fair comparison, we calibrate the LS, LSEL and MREL not with their limiting distributions but with the empirical distributions of the 2000 bootstrap statistics. Take the MREL for example: let \( \hat{\varvec{\beta }}\) denote the MRE estimate based on the original data set, and let \(l_j^*(\hat{\varvec{\beta }}) \) (\(j=1,2,\ldots , 2000\)) be the 2000 bootstrap MREL ratio statistics. We take the 1900th order statistic \(l_{(1900)}^*(\hat{\varvec{\beta }}) \) as the 95 % quantile for the MREL method.
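A sketch of this calibration for the MREL, resampling pairs \((y_i, {\varvec{x}}_i)\) and reusing phi1_h and el_log_ratio from the earlier sketches (B and the seed are our choices):

```python
import numpy as np

def bootstrap_cutoff(X, y, h, beta_hat, B=2000, level=0.95, seed=0):
    # Recompute -2 l_n*(beta_hat) on B bootstrap resamples and take the
    # empirical 95% quantile, e.g. the 1900th of 2000 ordered statistics.
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        xi = X[idx] * phi1_h(y[idx] - X[idx] @ beta_hat, h)[:, None]
        stats[b] = -2.0 * el_log_ratio(xi)
    return np.quantile(stats, level)
```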

All confidence regions/intervals are re-computed and presented in Fig. 3 and Table 5. It is clear from Fig. 3 that the MREL confidence region for \((\beta _0, \beta _1)\) is significantly smaller than those based on the LS and LSEL. When only one component of \((\beta _0, \beta _1)\) is of interest, we find from Table 5 that all the MREL confidence intervals are much shorter than those based on the LS and LSEL. These observations provide strong evidence for the superiority of the MREL.

Table 5 Confidence intervals based on 2000 bootstrap resamples

6 Concluding remarks

In this paper, in order to make inference about the regression coefficient of a linear regression model, we first investigated the properties of modal regression with a fixed bandwidth and then proposed an empirical likelihood approach based on the modal regression estimating equation. The proposed estimator has been shown to be more robust and efficient than the least squares based methods for many non-normal error distributions and for data containing outliers. Although our current research focuses on linear regression, the framework can be extended to nonparametric or semi-parametric models, such as single-index models, partially linear models and semi-varying coefficient models. In addition, with high-dimensional covariates in regression models, sparse modeling is often considered superior; it would therefore be interesting to consider robust penalized empirical likelihood based on modal regression, which we leave as a future research topic.