Abstract
In this paper, we consider how to obtain robust empirical likelihood estimation for regression models. After introducing modal regression, we propose a novel empirical likelihood method based on modal regression estimation equations, which has the merits of both robustness and high inference efficiency compared with least-square-based methods. Under some mild conditions, we show that Wilks’ theorem of the proposed empirical likelihood approach continues to hold. Advantages of empirical likelihood modal regression as a nonparametric approach are illustrated by constructing confidence intervals/regions. Two simulation studies and a real data analysis confirm our theoretical findings.
1 Introduction
It is well known that the ordinary least squares estimator (LSE) is the most efficient estimator of the regression coefficient in linear regression models when the noise follows a normal distribution. However, departure of the error distribution from normality may severely reduce the efficiency of the LSE, particularly when the errors are heavy-tailed and/or contain outliers. One remedy is to remove influential observations from the least-square fit. Another approach, termed robust regression, is to replace the least square loss criterion by outlier-resistant loss criteria in the estimation procedure. Considering that outliers are often genuine data in certain circumstances such as income analysis, procedures like robust regression, which accommodate rather than directly remove the outliers, will be more appropriate.
Suppose we have a simple random sample \(\{(y_i, {\mathbf x_i}): i=1,2 ,\ldots , n\}\) from the following classical linear regression model
where \({\mathbf x}_i=(x_{i1},\ldots ,x_{ip})^T\), \(\varvec{\beta }=(\beta _1, \ldots , \beta _p)^T\in \mathbb {R}^p\) and the noise \(\epsilon _i\), independent of \({\mathbf x}_i\), are i.i.d. random variables with mean zero. Robust regression estimators, introduced by Huber (1981), are obtained by minimizing \(\sum ^n_{i=1} \rho (\theta ; {\mathbf x}_i)\) with respect to \(\theta \), where \(\rho \) is a loss function. There are three popular robust regression estimators in the literature with different choices of loss function. The loss criterion \(\rho (x)=|x|\) leads to the median regression estimator, a special case of quantile regression (Koenker and Bassett 1978). The other two choices of \(\rho (\cdot )\), i.e. the Huber loss and the Tukey bisquare loss, correspond to Huber’s two robust estimators (Huber 1981). In particular, if the loss function is chosen to be the negative log-likelihood function, we obtain the usual maximum likelihood estimator. The above three estimators are also referred to as M-type robust estimators. Unfortunately, the median estimator may lose efficiency when there are no outliers or the error distribution is normal; also, it may not be unique since the loss function \(\rho (x)=|x|\) is not strictly convex. Huber’s robust estimators have high efficiency if an optimal transitional point is available; however, it is rather difficult to choose such an optimal transitional point adaptively in practice (Rousseeuw and Leroy 1987). Robust regression estimation has also seen many developments in recent years, including composite quantile regression (Zou and Yuan 2008), convex combinations of the \(L_1\) and \(L_2\) loss criteria with flexible weights (Chen et al. 2010), rank-based estimation methods (Johnson and Peng 2008), and modal regression (Yao and Li 2013; Yao et al. 2012).
We have two main goals in this paper, both of which are motivated by Yao and Li’s (2013) modal regression. The first goal is to propose a new modal regression after investigating the properties of the modal regression estimation method. Yao and Li (2013) showed that the convergence rate of the modal regression coefficient estimator is slower than root-\(n\), where \(n\) is the sample size. Under different conditions, we find that this rate can still be root-\(n\) if we take the involved bandwidth \(h\) as a constant. In doing so, the asymptotic variance of the modal regression coefficient estimator will depend on \(h\), which can be further regarded as a tuning parameter. A data-driven method is provided to estimate the optimal bandwidth which minimizes the asymptotic variance of the modal regression coefficient estimator. Since the resulting estimation procedure has the same form as Yao and Li’s (2013) method, we still call it modal regression estimation (MRE), although the two methods are different in essence. Our simulation results indicate that the MRE not only has very good robustness for data sets containing outliers or having a heavy-tailed error distribution, but also is asymptotically as efficient as the least-square-based method when there are no outliers or the error distribution is normal.
As the second goal of the paper, we propose an empirical likelihood (EL; Owen 1991) based modal regression method to construct confidence regions/intervals or test hypotheses for the regression coefficients. The aforementioned regression methods usually focus on point estimation. Apart from point estimation, confidence regions or intervals of regression coefficients are also important for evaluating the goodness of estimation methods. The EL is an efficient nonparametric likelihood tool that has a number of nice properties (Owen 1988, 1990, 1991). For example, it is flexible in incorporating auxiliary information; the EL ratio statistic usually has a chi-square limiting distribution; and the EL based confidence regions have data-driven shapes, etc. For a more thorough review on EL, we refer the reader to Owen (2001), Chen and Keilegom (2009), Wei et al. (2012), Zi et al. (2012) and references therein. In this paper, we show that the EL ratio based on the modal regression estimation equation still follows a chi-square limiting distribution. Given the robustness to outliers of the modal regression and the estimation efficiency of the EL, we expect the resulting EL based modal regression to be robust and efficient when applied to test hypotheses and construct confidence regions. Through simulation studies, we find that the confidence intervals (regions) based on the proposed method are shorter (smaller) than those based on least square methods when the error follows non-normal distributions.
The rest of the paper is organized as follows. In Sect. 2, we review the modal regression, and study the asymptotic normality of the modal regression estimator taking the bandwidth \(h\) as a constant. An adaptive optimal bandwidth is presented for practical purposes. In Sect. 3, we propose the EL based modal regression estimation method for the regression coefficient. A nonparametric Wilks theorem for such an EL ratio statistic is proved. Simulation studies and a real data analysis are provided in Sects. 4 and 5, respectively. Section 6 concludes. For clarity, all technical proofs are deferred to the Appendix.
2 Modal regression
2.1 Modal regression estimation
We begin by briefly reviewing the background and mathematical foundation of modal regression. Mean, median and mode are three important numerical characteristics of a distribution. The mode, the most likely value of a distribution, has wide applications in astronomy, biology and finance, where the data is often skewed or contains outliers. Compared with the mean, the mode has the advantage of robustness, which means that it is resistant to outliers. Moreover, since modal regression focuses on the relationship for the majority of the data and summarizes the “most likely” conditional values, it can provide more meaningful point prediction and larger coverage probability for prediction than other methods when the data is skewed or contains outliers.
For model (1), modal regression (Yao and Li 2013) estimates the modal regression parameter \(\varvec{\beta }\) by maximizing
where \(\phi _{h}(t)=h^{-1}\phi (t/h)\), \(\phi (t)\) is a kernel density function and \(h\) is a bandwidth determining the degree of robustness of the estimator. For example, we may choose \(\phi (t)\) to be the standard normal density function, i.e. the Gaussian kernel. As noted by Yao and Li (2013) and Yao et al. (2012), the MRE method usually produces robust estimates due to the nature of the mode. When the error distribution is symmetric and has only one mode at the center, mean regression, median regression and modal regression all estimate the same regression coefficient.
Here is the justification for the claim that the objective function (2) can be used to estimate the modal regression. Consider the case where only the intercept \(\beta =\beta _c\) is involved in linear regression (1). Then the objective function \(Q_h({\varvec{\beta }})\) defined in (2) reduces to
which can be regarded as a kernel estimate of the density function of \(y\) at \(y=\beta _c\). Therefore, the maximizer of (2) is the mode of the kernel density function based on \(y_1,\ldots , y_n\). As \(n\rightarrow \infty \) and \(h\rightarrow 0\), the mode of kernel density function will converge to the mode of the distribution of \(y\) under certain conditions (Parzen 1962).
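To make this concrete, the intercept-only case can be sketched numerically: the maximizer of (2) is the mode of a kernel density estimate of \(y\). The function names and the toy data below are our own illustrative choices, not the paper's code; we use the Gaussian kernel throughout, as the paper recommends.

```python
import numpy as np

def normal_kernel(t):
    """Standard normal density, used as the kernel phi."""
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

def kernel_mode(y, h, grid_size=2001):
    """Maximize Q_h(beta_c) = (1/n) sum_i phi_h(y_i - beta_c) over a grid.

    In the intercept-only case the maximizer is exactly the mode of the
    kernel density estimate of y with bandwidth h."""
    grid = np.linspace(y.min(), y.max(), grid_size)
    # phi_h(t) = phi(t/h)/h, averaged over the sample at every grid point
    q = normal_kernel((y[:, None] - grid[None, :]) / h).mean(axis=0) / h
    return grid[np.argmax(q)]

rng = np.random.default_rng(0)
# Skewed, contaminated sample: the mode sits near 0 while the mean is pulled up
y = np.concatenate([rng.normal(0, 1, 900), rng.normal(8, 1, 100)])
print(kernel_mode(y, h=0.5))   # close to 0, unlike the sample mean
```

This illustrates the robustness of the mode: the 10 % contamination at 8 shifts the sample mean noticeably but barely moves the kernel mode.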
In contrast to other estimation methods, modal regression treats \(-\phi _{h}(\cdot )\) as a loss function, a special case of the M-type robust regression mentioned in Sect. 1. Since modal regression estimates the “most likely” conditional values, it can provide more robust and efficient estimation than other existing methods. Lee (1989) used the uniform kernel and the Epanechnikov kernel for \(\phi (\cdot )\) to estimate the modal regression. However, these estimators are of little practical use because the objective function is non-differentiable and its distribution is intractable. Scott (1992) mentioned modal regression, but gave little methodology on how to implement it in practice. Recently, Yao and Li (2013) suggested using the Gaussian kernel for \(\phi (\cdot )\) and developed the MEM algorithm to compute modal estimators for linear models. Yao et al. (2012) investigated the estimation problem in nonparametric regression using modal regression, and obtained a robust and efficient estimator of the nonparametric regression function. Their estimation procedure is very convenient for practitioners to implement, and the results are encouraging for many non-normal error distributions. In addition, Yu and Aristodemou (2012) studied modal regression from a Bayesian perspective.
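For linear models with the Gaussian kernel, the MEM iteration of Yao and Li (2013) alternates a weighting step (weights proportional to \(\phi _h\) of the current residuals) with a weighted least squares step. The following is a minimal sketch under those assumptions; the function name and the simulated data are ours.

```python
import numpy as np

def mem_modal_regression(X, y, h, n_iter=200, tol=1e-8):
    """Modal EM (MEM) sketch for linear modal regression, Gaussian kernel.

    E-step: weights proportional to phi_h(residual_i).
    M-step: weighted least squares with those weights.
    Converges to a local maximizer of Q_h(beta)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # LSE as a starting value
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.exp(-0.5 * (resid / h) ** 2)       # unnormalized Gaussian weights
        w /= w.sum()
        WX = X * w[:, None]                       # weighted design matrix
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# 10% contamination with a large-variance normal component
eps = np.where(rng.random(n) < 0.9, rng.normal(0, 1, n), rng.normal(0, 10, n))
y = X @ np.array([1.5, 2.0]) + eps
print(mem_modal_regression(X, y, h=1.0))   # close to (1.5, 2.0)
```

Each iteration downweights observations with large residuals, which is how the outliers lose their influence on the fit.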
2.2 Theoretical property
In this subsection, we first take the bandwidth as a constant and establish the asymptotic normality of the proposed modal regression estimator (MRE). The limiting variance of the MRE is found to depend on \(h\). We recommend an optimal bandwidth obtained by minimizing this limiting variance.
The desirable property of the MRE estimator is achieved under certain assumptions on both the error and the kernel function \(\phi \). Here we assume that the errors \(\epsilon _i\)’s in model (1) are independent and identically distributed (iid), and that the underlying kernel function \(\phi (\cdot )\) together with the error distribution satisfies
- (C1): \(\hbox {E}(\phi ^{\prime }_h(\epsilon ))=0\), \(F(h)\equiv \hbox {E}(\phi ^{\prime \prime }_h(\epsilon ))<0\) and \(G(h)\equiv \hbox {E}(\phi ^{\prime }_h(\epsilon )^2)\) is finite for any \(h>0\);
- (C2): There exists \(c>0\) such that \(\hbox {E} \{\rho _{h,c}(\epsilon )\}<\infty \), where \(\rho _{h,c}(\epsilon ) = \sup _{y: |y-\epsilon |<c} |\phi ^{\prime \prime \prime }_h(y)|\).
Remark 1
Assumption (C1) is a general assumption for modal regression; see Yao and Li (2013) and Yao et al. (2012). Condition (C2) is used to control the magnitude of the remainder in a third-order Taylor expansion of \(Q_h({\varvec{\beta }})\); see Eq. (18). The condition \(F(h)<0\) ensures that there exists a local maximizer of \(Q_h({\varvec{\beta }})\), while the condition \(\hbox {E}\{ \phi ^{\prime }_h(\epsilon ) \}=0\) guarantees the consistency of this local maximizer, the proposed estimator of \({\varvec{\beta }}\). Conditions (C1) and (C2) are satisfied if both the error density function and \(\phi (\cdot )\) are symmetric and the error has a unique mode. More specifically, when conditions (C1) and (C2) hold, modal regression generally estimates the same regression function as mean regression, although the MRE is more robust to outliers. In applications, these conditions will roughly be satisfied if the residual histogram of our modal regression is roughly bell-shaped or has only one mode. We may first apply the MRE method, and then check whether the residuals have this property.
Theorem 1
Suppose \(\{(y_i, {\varvec{x}}_i): i=1,2, \ldots , n\}\) are iid observations from model (1) where \({\varvec{\beta }}={\varvec{\beta }}_0\), the error \(\epsilon _i\) and the covariate \({\mathbf x}_i \) are independent, and \((\epsilon _i, {\mathbf x}_i^{\tau })\)’s are iid with finite covariance matrix. For fixed bandwidth \(h>0\), if the error distribution and \(\phi \) satisfy conditions (C1) and (C2), then there exists a local maximizer \(\hat{\varvec{\beta }}\) of \(Q_h ({\varvec{\beta }})\) in (2) such that \(\sqrt{n}(\hat{\varvec{\beta }}-{\varvec{\beta }}_0) \mathop {\longrightarrow }\limits ^{\hbox {d}}N(0, {\varvec{\Omega }}),\) where \(\mathop {\longrightarrow }\limits ^{\hbox {d}}\) stands for convergence in distribution and \({\varvec{\Omega }}= \{ G(h)/F^2(h)\}{\varvec{\Sigma }}^{-1}\) with \(\Sigma = {\hbox {Cov}}({\mathbf x}_i)\) positive definite.
A proof of Theorem 1 is given in the Appendix. If \(\hbox {Var}(\epsilon _i)=\sigma ^2\), the asymptotic variance of the least square estimator (LSE) equals \(\sigma ^2{\varvec{\Sigma }}^{-1}\). Together with Theorem 1, this implies that the asymptotic relative efficiency of the MRE over the LSE is \(r(h)= \sigma ^2 F^2(h) /G(h)\). The larger the asymptotic relative efficiency, the better the MRE performs relative to the LSE. If we take \(h\) as a tuning parameter for choosing a good MRE, an ideal choice of this tuning parameter is
This bandwidth gives the best MRE estimator relative to the LSE from the viewpoint of asymptotic variance. A property that distinguishes \(h_{\hbox {opt}}\) from the usual bandwidth in nonparametric regression is that \(h_{\hbox {opt}}\) depends not on the sample size \(n\) but only on the error distribution and the first two derivatives of \(\phi \).
2.3 Bandwidth selection
The bandwidth plays an important role in obtaining a robust estimate. We provide a bandwidth selection method for the practical use of the MRE. Following the idea of Yao et al. (2012), we first estimate \(F(h)\) and \(G(h)\) by
respectively, where \(\hat{\epsilon }_i=y_i-{\varvec{x}}_i^T\tilde{\varvec{\beta }}\) and \(\tilde{\varvec{\beta }}\) is a robust pilot estimate, such as the least absolute deviation (LAD) estimator or the rank-based estimator (Johnson and Peng 2008). We recommend choosing the bandwidth to be
A quick method of solving this minimization problem is the grid search method. As done by Yao et al. (2012), we may choose the possible grid points to be \(h=0.5\hat{\sigma }\times 1.02^j\) (\(0\le j\le k\)) for \(k=50\) or 100, where \(\hat{\sigma }^2=\frac{1}{n}\sum _{i=1}^n\hat{\epsilon }_i^2\).
3 Empirical likelihood based modal regression
In this section, we propose empirical likelihood based modal regression to construct confidence regions for the regression coefficients.
From (2), we can define auxiliary random vectors (Qin and Lawless 1994)
Note that \(\hbox {E}\{ \xi _i({\varvec{\beta }}_0) \} =0\) where \({\varvec{\beta }}_0\) is the true parameter value. According to the empirical likelihood principle, we define the empirical likelihood ratio function of \({\varvec{\beta }}\) to be
Given \({\varvec{\beta }}\), if \(\{(p_1, \ldots , p_n):p_i \ge 0, \; \sum _{i=1}^np_i=1,\; \sum _{i=1}^np_i\xi _i({\varvec{\beta }})=0 \}\) is an empty set, the likelihood ratio \(\mathcal {L}_n({\varvec{\beta }})\) is undefined. In this situation, Chen et al.’s (2008) adjusted empirical likelihood is likely the most straightforward and natural remedy, although the convention is to define \(\mathcal {L}_n({\varvec{\beta }})\) to be zero.
Otherwise \(\mathcal {L}_n({\varvec{\beta }})\) is well-defined and can be re-expressed as
where \({\varvec{\lambda }}_\beta \) is the solution to
Accordingly the empirical log-likelihood ratio function is defined as
A feasible and efficient algorithm is needed for the computation of \(l_n({\varvec{\beta }})\) if one intends to apply the empirical likelihood method. The convex duality method given in Owen (2001, pp. 60–63) can serve this purpose, and it is also adopted in our simulation study.
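The convex duality computation can be sketched as follows: the Lagrange multiplier \({\varvec{\lambda }}_\beta \) solves \(\sum _i \xi _i/(1+{\varvec{\lambda }}^T\xi _i)=0\), which is the stationarity condition of the convex dual \(f({\varvec{\lambda }})=-\sum _i \log (1+{\varvec{\lambda }}^T\xi _i)\), and then \(l_n=f({\varvec{\lambda }}_\beta )\). The function name and the damped Newton scheme below are our own minimal sketch of the method in Owen (2001, pp. 60–63), without the safeguards a production implementation would need.

```python
import numpy as np

def el_log_ratio(xi, n_iter=100):
    """Empirical log-likelihood ratio l_n = sum_i log(n * p_i) for the
    estimating-function matrix xi (n x p), via Newton on the dual in lambda."""
    n, p = xi.shape
    lam = np.zeros(p)
    for _ in range(n_iter):
        denom = 1.0 + xi @ lam
        # gradient and Hessian of the convex dual f(lam) = -sum log(1 + lam'xi)
        grad = -(xi / denom[:, None]).sum(axis=0)
        hess = (xi.T / denom**2) @ xi
        step = np.linalg.solve(hess, grad)
        # halve the Newton step until every 1 + lam'xi_i stays positive
        while np.any(1.0 + xi @ (lam - step) <= 0):
            step *= 0.5
        lam_new = lam - step
        if np.max(np.abs(lam_new - lam)) < 1e-10:
            lam = lam_new
            break
        lam = lam_new
    return -np.sum(np.log1p(xi @ lam))

# Mean-zero estimating functions: -2 * l_n is approximately chi-square(p)
rng = np.random.default_rng(3)
xi = rng.standard_normal((200, 2))
stat = -2 * el_log_ratio(xi)
print(stat)
```

For the MREL, one would build `xi` from the modal regression estimating functions of (2) evaluated at a candidate \({\varvec{\beta }}\); the dual solver itself is unchanged.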
As expected, we find that when \({\varvec{\beta }}\) takes its true value \({\varvec{\beta }}_0\), the empirical log-likelihood ratio \(-2l_n({\varvec{\beta }}_0)\) still follows a limiting chi-square distribution. This result is summarized in the following theorem.
Theorem 2
Assume the same conditions as Theorem 1. As \(n\) tends to infinity, we have
where \(\chi ^2_p\) is the chi-square distribution with \(p\) degrees of freedom.
According to Theorem 2, the empirical likelihood ratio \(-2 l_n({\varvec{\beta }}_0)\) is asymptotically pivotal; it can be used not only to test the hypothesis \(H_0: {\varvec{\beta }}={\varvec{\beta }}_0\), but also to construct confidence regions for \({\varvec{\beta }}\). Specifically, a modal-regression-empirical-likelihood (MREL) based confidence region with confidence level \((1-\alpha )\) is given by
where \(\chi ^2_{p,1-\alpha }\) is the \((1-\alpha )\)-quantile of the \(\chi _p^2\) distribution. Theorem 2 implies that \({{\mathcal {C}}_{\hbox {MREL}}}({\varvec{\beta }})\) constitutes a confidence region for \({\varvec{\beta }}\) with asymptotically correct coverage probability \(1-\alpha \).
4 Simulation study
In this section, we provide simulation results to study the finite-sample properties of the proposed MRE and MREL methods and compare them with existing methods. The proposed MREL method is convenient for confidence interval/region construction, while it reduces to the MRE method when point estimation of the regression coefficient is of interest and the bandwidth is fixed.
We generated data-sets from two models, under which point estimation and interval/region estimation are the respective focuses. Simulation results are computed based on 1000 random samples with sample sizes 50, 100 and 150, respectively. The confidence level is set to 95 % when a confidence interval/region is of interest.
4.1 Example 1
The main goal of this example is to examine the robustness and efficiency of the proposed modal regression estimator (MRE). Let the true regression model be
where the covariates \({\varvec{x}}_i=(x_{i1},x_{i2},x_{i3})^T\) follow a three-dimensional normal distribution \(N(0,\Sigma )\) with unit marginal variance and correlation 0.5. The true value of the regression coefficient is \({\varvec{\beta }}=(\beta _0,\ldots ,\beta _3)^T=(1.5, 2,-1.2, 0)^T\). The error \(\epsilon _i\) is independent of \({\varvec{x}}_i\). We consider six different error distributions: (1) standard normal distribution, \(N(0,1)\); (2) \(t\)-distribution with 3 degrees of freedom, \(t(3)\); (3) standard Laplace distribution, \(Lp(0,1)\); (4) mixture of two normal distributions, \(0.9N(0,1)+0.1N(0,10^2)\); (5) mixture of normal-\(\chi ^2(5)\) distribution, \(0.9N(0,1)+0.1\chi ^2(5)\); (6) mixture of three normal distributions, \(0.8N(0,1)+0.1N(-10,1)+0.1N(10,1)\). Throughout the paper, we choose and recommend the kernel function \(\phi \) to be the standard normal density function in our MRE method. It can be verified that the conditions of Theorem 1 are satisfied by all the above error distributions except case (5). We include case (5) in our simulation to investigate the robustness of the proposed MRE method.
For illustration and comparison, we also take the following methods into consideration: the least square estimate (LSE), the least absolute deviation estimate (LAD), the composite quantile regression with 9 quantiles (CQR, Zou and Yuan 2008) and the rank regression estimate (RRE, Johnson and Peng 2008). For each method, we report the mean square error (MSE) of the estimate \(\hat{\varvec{\beta }}\), i.e., \(\hbox {MSE}=(\hat{\varvec{\beta }}-{\varvec{\beta }})^T(\hat{\varvec{\beta }}-{\varvec{\beta }})/p\). In order to evaluate the prediction performance of the fitted model, we generated a test sample \(\{({y_i^{\hbox {test}}}, {\varvec{ {x}}_i^{\hbox {test}}}): i=1,\ldots ,200\}\) in each simulation, and computed the mean absolute prediction error (MAPE), \( \sum _{i=1}^{200}|{y_i^{\hbox {test}}}-\hat{y}^{\hbox {test}}_i|/200\) with \(\hat{y}^{\hbox {test}}_i=({\varvec{ x}_i^{\hbox {test}}})^T\hat{\varvec{\beta }}\). The mean and standard error of MSE and MAPE over 1000 replications are reported in Table 1.
From Table 1, we have the following observations. For a given error distribution, the performance of the MRE improves as the sample size increases. In the case of normal error, as long as the sample size is not too small, the MRE is better than the other three robust methods, and it performs almost as well as the LSE. For the Laplace distribution, the LAD is well known to be the best estimator; nevertheless, the performance of the MRE is very close to that of the LAD. For the other four error distributions, the MRE clearly outperforms the other four methods, even in case (5) where the conditions of Theorem 1 are not satisfied.
Furthermore, it is worth mentioning that the performance of the MRE is significantly better than the others for the three mixture error distributions. A possible reason is the following. The mixtures can be viewed as populations containing outliers. When the data contain severe outliers, modal regression puts more weight on the “most likely” data around the true value, which leads to the robustness and efficiency of the proposed MRE.
Overall, the performance of MRE is desirable and its efficiency gain is more prominent when the data set contains outliers.
4.2 Example 2
We now study the performance of the MREL confidence interval/regions. The usual normality-based least square method (LS) and least square based empirical likelihood method (LSEL; Owen 1991) are also taken into consideration for comparison.
Consider the following model
where \({\varvec{\beta }}=(\beta _1,\beta _2)^T=(2,1)^T\). The covariates \((x_{i1}, x_{i2})\) follow a bivariate normal distribution with mean zero. Both \(x_{i1}\) and \( x_{i2}\) have unit variance and their correlation coefficient is 0.8. We generated errors from four distributions: \(N(0,1)\), \(t(3)\), \(Lp(0,1)\) and \(0.9N(0,1)+0.1N(0,10^2)\). The simulation results are summarized in Table 2 and Fig. 1.
Remark 2
When only \(\beta _1\) is of interest, the MREL confidence interval of \(\beta _1\) can be constructed through the profile empirical log-likelihood function \(l_n(\beta _1)= \sup _{\beta _2} l_n(\beta _1,\beta _2)\). As with the usual parametric likelihood, if \(\beta _{10}\) is the true value of \(\beta _1\), then \(-2l_n(\beta _{10})\) has a \(\chi _1^2\) limiting distribution as \(n\rightarrow \infty \). Accordingly a natural MREL confidence interval is given by
The construction of \({\mathcal {C}}_{\hbox {MREL}}(\beta _2)\) is similar. In Table 2, we also report the marginal confidence intervals \({\mathcal {C}}_{\hbox {MREL}}(\beta _1)\) and \({\mathcal {C}}_{\hbox {MREL}}(\beta _2)\).
For a given error distribution, we see that the coverage probability of MREL gets closer to the nominal level as \(n\) increases; meanwhile the average lengths of the confidence intervals for single parameters become shorter.
In the case of normal error, the differences among the three methods are small. In particular, the performance of MREL is as good as that of the least square based methods when the sample size \(n\) is large. For non-normal distributions, the average lengths of confidence intervals (regions) for MREL are clearly shorter (smaller) than those for the other two. It is worth mentioning that the interval length of MREL is only about one third of that of the LS and LSEL when the error follows a mixture normal distribution.
In addition, the coverage probability of LSEL deviates significantly from the nominal level for the three non-normal error distributions when the sample size is small, and it grows very slowly as the sample size increases.
In summary, the MREL is preferable to the LS and LSEL methods when the sample size is large, in terms of both coverage probability and interval length or region volume. The coverage precision of the MREL confidence interval/region needs improvement, particularly for small sample sizes. The adjusted empirical likelihood of Chen et al. (2008) and Liu and Chen (2010) or the bootstrap method can serve this purpose.
Remark 3
Take the MREL confidence regions in Fig. 2 as an example to illustrate how we compute the confidence boundary given a data-set. The first step is to compute the center, the MRE \(\hat{\varvec{\beta }}\) of \(\varvec{\beta }\). Then, along any line through the center, two points meeting the confidence boundary can be found. All points on the confidence boundary are obtained after repeating this for all lines, which can be done conveniently in polar coordinates. In principle, this method applies to confidence regions of any dimension.
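The ray-by-ray search above can be sketched as follows. Along each direction we bisect for the radius at which the region membership test fails; in practice `inside(beta)` would check \(-2l_n({\varvec{\beta }})\le \chi ^2_{p,1-\alpha }\), but here we use a hypothetical elliptical stand-in so the sketch is self-contained. All names are our own.

```python
import numpy as np

def trace_boundary(center, inside, n_angles=90, r_max=10.0, tol=1e-6):
    """Trace a 2-d confidence-region boundary in polar coordinates.

    `inside(beta)` returns True iff beta lies in the confidence region.
    Along each ray from `center` we bisect for the radius where the
    region ends, assuming the region is star-shaped about the center."""
    boundary = []
    for theta in np.linspace(0, 2 * np.pi, n_angles, endpoint=False):
        d = np.array([np.cos(theta), np.sin(theta)])
        lo, hi = 0.0, r_max
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if inside(center + mid * d):
                lo = mid
            else:
                hi = mid
        boundary.append(center + lo * d)
    return np.array(boundary)

# Toy elliptical region standing in for {beta : -2 l_n(beta) <= cutoff}
center = np.array([1.0, 2.0])
A = np.array([[2.0, 0.5], [0.5, 1.0]])
inside = lambda b: (b - center) @ A @ (b - center) <= 1.0
pts = trace_boundary(center, inside)
print(pts.shape)   # (90, 2)
```

Joining the returned points in angular order gives the plotted boundary curve.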
5 Real data analysis
In this section, we apply the proposed method to the analysis of the Education Expenditure Data (Chatterjee and Price 1977). This data set consists of 50 observations, one for each state. It has been analyzed by Yao et al. (2012) using nonparametric modal regression. We take the per capita expenditure on public education in a state as the response variable \(y_i\), and the number of residents per thousand residing in urban areas in 1970 as the covariate \(x_i\). We consider fitting the data by the following linear model
In this example, an obvious outlier is the observation from Hawaii, which has a very high per capita expenditure on public education compared with the other states. The confidence intervals for \(\beta _0\) and \(\beta _1\) based on the LS, LSEL and MREL methods were computed and are presented in Table 3. The confidence regions based on the three methods are displayed in Fig. 2. (Here, to alleviate the magnitude difference between the two estimates \(\hat{\beta }_0\) and \(\hat{\beta }_1\) under the original data, we divide both the response variable \(y_i\) and the covariate \(x_i\) by 100 for each observation, and then use (13) to fit the transformed data.)
As we can clearly see from Table 3 and Fig. 2, the confidence interval (region) obtained by MREL is shorter (smaller) than those of the least square based methods, which shows that the confidence region obtained by modal regression empirical likelihood not only has the advantage of a data-driven nonparametric approach but also is robust to outliers.
To further assess the credibility of the confidence region (interval), we also calculated the coverage probability (given in Table 4) of the confidence region/interval based on 2000 bootstrap resamples. As we can see from Table 4, compared with the nominal coverage 95 %, neither of the two empirical likelihood based methods is fully satisfactory. The bootstrap method and the adjusted empirical likelihood mentioned in Sect. 4.2 can be used to improve the coverage precision.
The comparison of confidence region volumes is not fair if the confidence regions under comparison have rather different coverage probabilities. For a fair comparison, we calibrate the LS, LSEL and MREL not with their limiting distributions but with the empirical distributions based on the 2000 bootstrap statistics. Take the MREL for example. Let \( \hat{\varvec{\beta }}\) denote the MREL estimate based on the original data-set, and \(l_j^*(\hat{\varvec{\beta }}) \) (\(j=1,2,\ldots , 2000\)) be the 2000 bootstrap MREL ratio statistics. We take the 1900th order statistic \(l_{(1900)}^*(\hat{\varvec{\beta }}) \) as the 95 % quantile for the MREL method.
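The calibration step amounts to an order statistic of the bootstrap statistics. The function name and the χ²-distributed stand-in statistics below are our own illustrative choices.

```python
import numpy as np

def bootstrap_cutoff(stats, level=0.95):
    """Bootstrap-calibrated cutoff: the ceil(B * level)-th order statistic
    of the B bootstrap likelihood-ratio statistics (the 1900th of 2000)."""
    stats = np.sort(np.asarray(stats))
    k = int(np.ceil(level * len(stats)))   # k = 1900 when B = 2000
    return stats[k - 1]                    # adjust for 0-based indexing

rng = np.random.default_rng(4)
boot = rng.chisquare(df=2, size=2000)      # stand-in bootstrap statistics
print(bootstrap_cutoff(boot))              # near the chi-square(2) 95% quantile
```

Using the same empirical cutoff rule for LS, LSEL and MREL puts the three regions on an equal coverage footing before their volumes are compared.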
All confidence regions/intervals were re-computed and are presented in Fig. 3 and Table 5. It is clear from Fig. 3 that the MREL confidence region for \((\beta _0, \beta _1)\) is significantly smaller than those based on the LS and LSEL. When only one component of \((\beta _0, \beta _1)\) is of interest, we find from Table 5 that all the MREL confidence intervals are much shorter than those based on the LS and LSEL. These observations provide strong evidence for the superiority of the MREL.
6 Concluding remarks
In this paper, in order to make inference about the regression coefficient of a linear regression model, we first investigate the properties of modal regression with a fixed bandwidth, and then propose an empirical likelihood estimation approach based on the modal regression estimation equation. It has been shown that the proposed estimator is more robust and efficient than the least square based methods for many non-normal error distributions or data containing outliers. Although our current research focuses on linear regression, the framework can be extended to nonparametric or semi-parametric models, such as single-index models, partially linear models and semi-varying coefficient models. In addition, with high-dimensional covariates in regression models, sparse modeling is often considered superior; it is thus also interesting to consider robust penalized empirical likelihood based on modal regression, which we leave as a future research topic.
References
Chatterjee S, Price B (1977) Regression analysis by example. Wiley, New York
Chen J, Variyath AM, Abraham B (2008) Adjusted empirical likelihood and its properties. J. Comput. Graph. Stat. 17:426–443
Chen S, Keilegom I (2009) A review on empirical likelihood methods for regression (with discussions). Test 18:415–447
Chen X, Wang Z, Martin J (2010) Asymptotic analysis of robust lassos in the presence of noise with large variance. IEEE Trans. Inf. Theory 56:5131–5149
Huber P (1981) Robust Statistics. Wiley, New York
Johnson B, Peng L (2008) Rank-based variable selection. J. Nonparametr. Stat. 20:241–252
Koenker R, Bassett G (1978) Regression quantiles. Econometrica 46:33–50
Lee M (1989) Mode regression. J. Econom. 42:337–349
Liu Y, Chen J (2010) Adjusted empirical likelihood with high-order precision. Ann. Stat. 38:1341–1362
Parzen E (1962) On estimation of a probability density function and mode. Ann. Math. Stat. 33:1065–1076
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75:237–249
Owen A (1990) Empirical likelihood ratio confidence regions. Ann. Stat. 18:90–120
Owen A (1991) Empirical likelihood for linear models. Ann. Stat. 19:1725–1747
Owen A (2001) Empirical Likelihood. Chapman and Hall, New York
Qin J, Lawless J (1994) Empirical likelihood and general estimating equations. Ann. Stat. 22:300–325
Rousseeuw P, Leroy A (1987) Robust Regression and Outlier Detection. Wiley, New York
Scott D (1992) Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York
Wei C, Luo Y, Wu X (2012) Empirical likelihood for partially linear additive errors-in-variables models. Stat. Pap. 53:48–496
Yao W, Li L (2013) A new regression model: modal linear regression. Scand. J. Stat. doi:10.1111/sjos.12054
Yao W, Lindsay B, Li R (2012) Local modal regression. J. Nonparametr. Stat. 24:647–663
Yu K, Aristodemou K (2012) Bayesian mode regression. Technical report. arXiv:1208.0579v1
Zi X, Zou C, Liu Y (2012) Two-sample empirical likelihood method for difference between coefficients in linear regression model. Stat. Pap. 53:83–93
Zou H, Yuan M (2008) Composite quantile regression and the oracle model selection theory. Ann. Stat. 36:1108–1126
Acknowledgments
The research was supported in part by the National Natural Science Foundation of China (11171112, 11001083, 11371142), the 111 Project of the Chinese Ministry of Education (B14019), the Doctoral Fund of the Ministry of Education of China (20130076110004), the Natural Science Project of the Jiangsu Province Education Department (13KJB110024) and the Natural Science Fund of Nantong University (13ZY001).
Appendix
1.1 Proof of Theorem 1
Proof
We first prove the root-\(n\) consistency of \(\hat{\varvec{\beta }}\), i.e., \(\Vert \hat{\varvec{\beta }}-{\varvec{\beta }}_0\Vert =O_p(n^{-1/2})\). It is sufficient to show that for any given \(\varrho > 0\), there exists a large constant \(C\) such that
\[ P\Big \{ \sup _{\Vert {\varvec{v}}\Vert =C} Q_h({\varvec{\beta }}_{0}+ n^{-1/2} {\varvec{v}}) < Q_h({\varvec{\beta }}_{0}) \Big \} \ge 1-\varrho , \tag{14} \]
where the function \(Q_h(\cdot )\) is defined in (2).
For any vector \({\varvec{v}}\) with length \(C\), by the second-order Taylor expansion, we have
where \(\xi _i\) lies between \(\epsilon _i\) and \(\epsilon _i-n^{-1/2} {\varvec{x}}_i^T{\varvec{v}}\).
We study the magnitudes of \(I_1, I_2\) and \(I_3\) in turn. Let \(A_n = \sum _{i=1}^n \phi _h^{\prime }(\epsilon _i)n^{-1/2} {\varvec{x}}_i\). It follows from condition (C1) and \(\mathrm{E}(\phi '_h(\epsilon )) = 0\) that,
The finiteness of \(\mathrm{Var}( {\varvec{x}}_i )\) and \(G(h) = \mathrm{E}(\phi '_h(\epsilon )^2)\) implies that
Then by the central limit theorem, we have for fixed \(C\) that \( A_n \overset{d}{\longrightarrow }N( 0, G(h) \Sigma ) \), and therefore \(I_1 \overset{d}{\longrightarrow }N( 0, G(h) {\varvec{v}}^T\Sigma {\varvec{v}}) \).
For \(I_2\), by the strong law of large numbers, we have \( I_2 = \frac{1}{2}F(h){\varvec{v}}^T\Sigma {\varvec{v}} + o(1) \), where \(F(h)\) is defined in condition (C1).
For \(I_3\), we find that
Condition (C2) implies that \( \frac{1}{6n} \sum _{i=1}^n \rho _{h,c}(\epsilon _i)( {\varvec{x}}_i^T{\varvec{v}})^2 = O_p(1). \) It then follows from the fact that \(\max _{1\le i\le n} (\Vert {\varvec{x}}_i\Vert /\sqrt{n}) = o_p(1)\) that
Overall, we obtain that for any \({\varvec{v}}\) with \(\Vert {\varvec{v}}\Vert = C\),
with \(\delta _n = o_p(1)\). The fact \(- A_n^T{\varvec{v}} \overset{d}{\longrightarrow }N( 0, G(h) {\varvec{v}}^T\Sigma {\varvec{v}}) \) implies that for any \(\varrho >0\) and any nonzero \({\varvec{v}}\), there exists \(K>0\) such that
Thus with probability \(1-\varrho \), it holds that
Note that \(F(h)<0\). Clearly, when \(n\) and \(C\) are both large enough,
In summary, for any \(\varrho >0\), there exists \(C>0\) such that for any \({\varvec{v}}\) with \(\Vert {\varvec{v}}\Vert =C\), \(nQ_h({\varvec{\beta }}_{0}+ n^{-1/2} {\varvec{v}})-n Q_h({\varvec{\beta }}_{0})\) is negative with probability at least \(1-\varrho \). Thus, (14) holds. That is, with probability approaching 1, there exists a local maximizer \(\hat{\varvec{\beta }}\) such that \(\Vert \hat{\varvec{\beta }}-{\varvec{\beta }}_0\Vert =O_p(1/\sqrt{n})\).
We now turn to the asymptotic normality of \( \hat{\varvec{\beta }}\). Denote \(\hat{\varvec{\gamma }}=\hat{\varvec{\beta }}-{\varvec{\beta }}_0\); then \(\hat{\varvec{\gamma }}\) satisfies the following equation
where \(\epsilon _i^*\) lies between \(\epsilon _i\) and \(\epsilon _i-{\varvec{x}}_i^T\hat{\varvec{\gamma }}\). We have shown that
Meanwhile, the fact \(\hat{\varvec{\gamma }}= O_p(n^{-1/2})\) and condition (C2) imply that \(J_3 = o_p(1)\). Thus Eq. (19) gives \( {\hat{\varvec{\gamma }}} = - J_2^{-1} J_1 + o_p(1) \). Since the bandwidth \(h\) is a constant not depending on \(n\), by Slutsky’s theorem, we have
\(\square \)
The following lemma is needed to prove Theorem 2.
Lemma 1
Under the conditions of Theorem 1, \({\varvec{\lambda }}_{\beta _0}\) in (10) satisfies \(\Vert {\varvec{\lambda }}_{\beta _0}\Vert =O_p(n^{-1/2})\).
Proof
Denote \({\varvec{\lambda }}_{\beta _0}=\zeta {\varvec{u}}_0\) with \({\varvec{u}}_0\) a unit vector and \(\zeta =\Vert {\varvec{\lambda }}_{\beta _0}\Vert \). Define the matrix \({\varvec{\Phi }}_n({\varvec{\beta }})=n^{-1} \sum _{i=1}^n \xi _i({\varvec{\beta }})\xi _i^T({\varvec{\beta }})\) and \(Z=\max _{1\le i\le n}\Vert \xi _i({\varvec{\beta }}_0)\Vert \). It follows from the definition of \({\varvec{\lambda }}_{\beta _0}\) that
which implies
By the Cauchy–Schwarz inequality and the law of large numbers, we have
This together with Eq. (17) gives
Condition (C1) and the law of large numbers imply \({\varvec{\Phi }}_n({\varvec{\beta }}_0) \mathop {\longrightarrow }\limits ^{\hbox { p}} G(h){\varvec{\Sigma }}\), which means that there exists \(c>0\) such that \(P({\varvec{u}}_0^T{\varvec{\Phi }}_n({\varvec{\beta }}_0){\varvec{u}}_0>c)\rightarrow 1\) as \(n\rightarrow \infty \).
Furthermore, since \( {n}^{-1/2}\sum _{i=1}^n\xi _i({\varvec{\beta }}_0) \mathop {\longrightarrow }\limits ^{\hbox { d}} N(0, {\varvec{\Phi }}) \), we find that \(\Vert {\varvec{\lambda }}_{\beta _0}\Vert =O_p(n^{-1/2})\). \(\square \)
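For readers who want to reproduce Lemma 1 numerically, recall that \({\varvec{\lambda }}_{\beta _0}\) solves the usual empirical likelihood dual equation \(n^{-1}\sum _{i=1}^n \xi _i({\varvec{\beta }}_0)/(1+{\varvec{\lambda }}^T\xi _i({\varvec{\beta }}_0))=0\). The following damped Newton solver is our own minimal sketch, not part of the paper; fed mean-zero inputs, its solution exhibits the small \(O_p(n^{-1/2})\) norm that the lemma predicts.

```python
import numpy as np

def solve_lambda(xi, n_iter=50, tol=1e-10):
    """Damped Newton solver for the empirical likelihood multiplier:
    find lam with (1/n) * sum_i xi_i / (1 + lam^T xi_i) = 0."""
    n, p = xi.shape
    lam = np.zeros(p)
    for _ in range(n_iter):
        denom = 1.0 + xi @ lam
        grad = (xi / denom[:, None]).mean(axis=0)      # should shrink to 0
        if np.linalg.norm(grad) < tol:
            break
        jac = -(xi.T * (1.0 / denom ** 2)) @ xi / n    # Jacobian of grad in lam
        step = np.linalg.solve(jac, grad)
        lam_new = lam - step
        while np.any(1.0 + xi @ lam_new <= 1.0 / n):   # keep all weights positive
            step *= 0.5
            lam_new = lam - step
        lam = lam_new
    return lam

# Mean-zero "estimating function" values: lambda should be O(n^{-1/2}).
rng = np.random.default_rng(0)
xi = rng.normal(size=(400, 2))
lam = solve_lambda(xi)
```

The step-halving guard mirrors the constraint that every implied empirical weight \(p_i = n^{-1}(1+{\varvec{\lambda }}^T\xi _i)^{-1}\) stays positive.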
1.2 Proof of Theorem 2
Proof
Let \(y_i = {\varvec{\lambda }}^T_{\beta _0}\xi _i({\varvec{\beta }}_0)\). It follows from Lemma 1 and \(\max _{1\le i\le n}\Vert \xi _i({\varvec{\beta }}_0)\Vert = o_p(n^{1/2})\) that \(\max _{1\le i\le n}|y_i| = o_p(1)\),
which implies that the upcoming Taylor expansion is valid. Applying a second-order Taylor expansion to \((1+y_i)^{-1}\) for \(i\) from 1 to \(n\), we obtain from Eq. (10) that
where \( r_n({\varvec{\beta }}_0) =(1/n)\sum _{i=1}^n\xi _i({\varvec{\beta }}_0) (1+\delta ^*_i)^{-1}\{{\varvec{\lambda }}^T_{\beta _0}\xi _i({\varvec{\beta }}_0)\}^2 \) and \(\delta ^*_i\) lies between \(0\) and \(y_i\). Clearly \(\max _{1\le i\le n}|\delta _i^*| = o_p(1)\). Therefore
Thus we have
Similarly, by a third-order Taylor expansion of \(\log (1+y_i)\) for all \(i\), we have
where \(\eta _i\) lies between \(0\) and \(y_i\). It can be verified that
Furthermore, by incorporating Eq. (24), we have
Since \( \xi _i({\varvec{\beta }}_0)={\varvec{x}}_i\phi _h^{\prime }(\epsilon _i) \), it follows from the conclusion of Lemma 1 that as \(n\rightarrow \infty \),
which immediately implies \(-2 l({\varvec{\beta }}_0) \mathop {\longrightarrow }\limits ^{\mathrm{d}} \chi ^2_p\). This completes the proof. \(\square \)
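As an illustrative numerical check of Theorem 2 (not part of the proof), the following Monte Carlo sketch simulates data from a linear model, evaluates \(\xi _i({\varvec{\beta }}_0)={\varvec{x}}_i\phi _h^{\prime }(\epsilon _i)\) with a Gaussian kernel, computes \(-2l({\varvec{\beta }}_0)=2\sum _{i=1}^n\log (1+{\varvec{\lambda }}^T_{\beta _0}\xi _i({\varvec{\beta }}_0))\), and compares its distribution with the \(\chi ^2_p\) reference; all numerical settings (sample size, bandwidth, solver) are our own choices.

```python
import numpy as np

def phi_h_prime(t, h):
    # Derivative of the Gaussian kernel phi_h(t) = exp(-t^2/(2h^2)) / (sqrt(2*pi)*h)
    return -t / h**3 * np.exp(-0.5 * (t / h) ** 2) / np.sqrt(2.0 * np.pi)

def el_ratio(xi, n_iter=40):
    """-2 * log empirical likelihood ratio for E[xi_i] = 0 (Newton on lambda)."""
    n, p = xi.shape
    lam = np.zeros(p)
    for _ in range(n_iter):
        denom = 1.0 + xi @ lam
        grad = (xi / denom[:, None]).mean(axis=0)
        jac = -(xi.T * (1.0 / denom ** 2)) @ xi / n
        step = np.linalg.solve(jac, grad)
        lam_new = lam - step
        while np.any(1.0 + xi @ lam_new <= 1.0 / n):   # keep weights positive
            step *= 0.5
            lam_new = lam - step
        lam = lam_new
    return 2.0 * np.sum(np.log(1.0 + xi @ lam))

rng = np.random.default_rng(0)
n, h, p = 300, 1.0, 2
stats = []
for _ in range(300):
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = rng.normal(size=n)                  # beta_0 known, so residuals = eps
    xi = x * phi_h_prime(eps, h)[:, None]     # xi_i(beta_0) = x_i * phi_h'(eps_i)
    stats.append(el_ratio(xi))
stats = np.array(stats)
print(stats.mean())             # should be close to p = 2
print((stats > 5.991).mean())   # roughly 0.05 (5.991 is the chi^2_2 0.95 quantile)
```

The empirical mean near \(p\) and the roughly 5% exceedance of the \(\chi ^2_2\) 0.95 quantile are exactly the calibration that makes the confidence regions of Sect. 1 asymptotically valid.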
Zhao, W., Zhang, R., Liu, Y. et al. Empirical likelihood based modal regression. Stat Papers 56, 411–430 (2015). https://doi.org/10.1007/s00362-014-0588-4