Abstract
We propose and investigate the statistical properties of shrinkage M-estimators based on Stein-rule estimation for partially linear models under the assumption of sparsity. We are mainly interested in estimating the sub-vector of regression coefficients with strong signals when the sparsity assumption may or may not hold. Thus, we consider two models: one including all the predictors, leading to a full (unrestricted, or over-fitted) model estimation problem; and the other with only a few influential predictors, resulting in a submodel (restricted, or under-fitted model) estimation problem. Generally speaking, submodel estimators perform better than full model estimators when the assumption of sparsity is nearly correct. However, a small departure from this assumption makes submodel estimators biased and inefficient, questioning their applicability for practical purposes. On the other hand, the full model estimators may not be desirable due to poor interpretability and higher estimation errors, especially when a large number of predictors are included in the model. For this reason, we propose shrinkage strategies which combine the full model and submodel estimators in an optimal way. The asymptotic properties of the suggested estimators are studied both analytically and numerically. The asymptotic bias and risk of the estimators are derived in closed form. In addition, a simulation study is conducted to examine the performance of the estimators in practical settings when the sparsity assumption may or may not hold. Our simulation results consolidate the theoretical properties of the estimators.
1 Introduction
We study robust shrinkage M-estimation in partially linear models (PLM) with a scaled error term. Ahmed et al. (2006) considered robust shrinkage estimation of regression parameters when it is a priori suspected that the regression parameters could be restricted to a linear subspace. They studied the asymptotic properties of variants of Stein-rule M-estimators. For some insights on shrinkage estimation strategies, we refer to Ahmed and Fallahpour (2012), Ahmed and Raheem (2012), Raheem et al. (2012), Ahmed (2014), Ma et al. (2014), Ahmed et al. (2016), Sun et al. (2020) and Opoku et al. (2021). Recently, Ahmed et al. (2023) extended shrinkage strategies to high-dimensional settings where the sparsity assumption cannot be judiciously justified. Maruyama et al. (2023) presented both classical and recent developments in shrinkage estimation, including results on the admissibility of generalized Bayes estimators in the presence of a nuisance scale parameter.
Applications of robust statistical techniques, including M-estimation and related approaches, to modeling and prediction challenges are found in various domains, e.g., industrial data modeling (Zhou et al., 2020), disease incidence prediction (Susanti et al., 2020), and ratio-type estimation (Rather et al., 2022). These techniques are valuable for handling data imperfections, outliers, and non-normality.
Related works also include Arashi et al. (2014), who discussed improved preliminary test and Stein-rule Liu estimators specifically tailored for the ill-conditioned elliptical linear regression model. The authors focused on addressing the challenges posed by data that exhibit multicollinearity or non-normality, providing more accurate estimation in such scenarios. Norouzirad and Arashi (2019) discussed the use of preliminary test and Stein-type shrinkage ridge estimators in the context of robust regression. They explored methods for improving the robustness and accuracy of regression models, particularly when dealing with outliers or influential data points. Recently, Shih et al. (2021) proposed robust ridge M-estimators incorporating pretests and Stein-rule shrinkage techniques for estimating the intercept term in regression models. They aimed to enhance the robustness of ridge regression when dealing with outliers and influential observations. In addition to the original results on Huber-type M-estimation, a review of recent results and applications was given by, e.g., Farcomeni and Ventura (2012).
Generally speaking, a PLM is more flexible than a linear model since it includes a nonlinear component along with the linear components. A PLM may provide a better alternative to the classical linear regression model when one or more predictors have a nonlinear relationship with the response variable. Robust regression models are designed to overcome some of the limitations of classical linear regression in a host of scenarios. For example, least squares regression is highly sensitive to outliers and rests on strong underlying assumptions; any violation of these assumptions may have a serious impact on the validity of the fitted model.
In this paper, we extend the available work to a PLM and develop shrinkage M-estimators. We construct shrinkage M-estimators of the regression parameters under the sparsity assumption. Our analytical and numerical results establish the superiority of shrinkage M-estimators over the full model and submodel M-estimators. We focus on shrinkage M-estimation of regression coefficients in a PLM when the sparsity assumption may or may not hold. In our setup, the nonparametric part is estimated by a kernel-based method.
1.1 Statement of the problem
Consider a PLM of the form
$$\begin{aligned} \varvec{Y}= \varvec{X}\varvec{\beta }+ g(T) + \varvec{e}, \end{aligned}$$
where \(\varvec{Y}= (y_1, y_2, \dots , y_n)^\top \) is the n-vector of responses, \(\varvec{X}= (\varvec{x}_1^\top , \varvec{x}_2^\top , \dots , \varvec{x}_n^\top )^\top \) is the \(n \times p\) design matrix with the \(\varvec{x}_i\)'s known row p-vectors, \(\varvec{\beta }=(\beta _1, \beta _2, \dots , \beta _p)^\top \) is the p-vector of regression parameters, \(g(T)=(g(t_1), g(t_2), \dots , g(t_n))^\top \) with \(g(\cdot )\) an unknown real-valued function, and \(\varvec{e} =(e_1, e_2, \dots , e_n)^\top \) is the n-vector of random errors with mean \(E(\varvec{e})=\varvec{0}\); the \(e_i\)'s are independent and identically distributed (iid) random variables having a continuous distribution \(F\), free from any unknown scale parameter \(\sigma >0\). Here \((\cdot )^\top \) denotes the transpose of a vector or a matrix. In passing, we remark that, without loss of generality, the intercept term is not included in establishing the asymptotic properties of the estimators.
Under the assumption of sparsity, the data matrix \(\varvec{X}\) can be partitioned as \(\varvec{X}=(\varvec{X}_1: \varvec{X}_2)\) with \(\varvec{\beta }=(\varvec{\beta }_1^\top , \varvec{\beta }_2^\top )^\top \), where \(\varvec{X}_1\) and \(\varvec{X}_2\) are \(n \times p_1\) and \(n \times p_2\) submatrices of predictors with strong signals and no signals, respectively. Thus, the model can be rewritten as
$$\begin{aligned} \varvec{Y}= \varvec{X}_1\varvec{\beta }_1 + \varvec{X}_2\varvec{\beta }_2 + g(T) + \varvec{e}. \end{aligned}$$
Under the assumption of sparsity, that is, \(\varvec{\beta }_2 = \varvec{0}\), we have the submodel (under-fitted, or restricted) model
$$\begin{aligned} \varvec{Y}= \varvec{X}_1\varvec{\beta }_1 + g(T) + \varvec{e}, \end{aligned}$$
and the remaining discussion follows.
In practice, the submodel can be readily obtained by applying a suitable variable selection method to the full model. We are primarily interested in estimating \(\varvec{\beta }_1\) when \(\varvec{\beta }_2\) may or may not be a null vector; in other words, when practitioners cannot be certain whether the model is fully sparse. In an effort to help data analysts, we propose shrinkage M-estimators based on the Stein rule to improve the performance of the under-fitted model estimators.
The remainder of the paper is organized as follows. In Sect. 2, we define kernel-based least-squares (LS) estimators and discuss a two-step procedure to estimate the nonparametric function in a PLM. In Sect. 3, we propose Stein-rule shrinkage M-estimators. Asymptotic properties of the estimators are presented in Sect. 4. The expressions for the asymptotic bias and risk of the estimators are derived in Sect. 5. Monte Carlo simulation experiments are reported in Sect. 6. Our concluding remarks are made in Sect. 7. Finally, our derivations of the theoretical results are included in the appendix.
2 Proposed LS estimation
In this section, we propose our robust LS estimation method with a two-step procedure.
Again, consider a PLM of the form
$$\begin{aligned} \varvec{Y}= \varvec{X}\varvec{\beta }+ g(T) + \varvec{e}. \end{aligned}$$
We first linearize (2.1) by estimating \(g(\cdot )\) using kernel smoothing. We then confine ourselves to the estimation of \(\varvec{\beta }\) based on the partial residuals which attains the usual parametric convergence rate \(n^{-1/2}\) without under-smoothing the nonparametric component \(g(\cdot )\); see e.g. Speckman (1988).
Now, we describe the estimation process. We assume \(\left\{ y_i, \varvec{x}_i^\top , t_i; i=1, 2, \dots , n \right\} \) satisfy (2.1). If \(\varvec{\beta }\) is the true parameter, then since \(E(e_i)=0\), we have
A natural nonparametric estimator of \(g(\cdot )\) given \(\varvec{\beta }\) is
where
with \(K(\cdot )\) being a kernel function which is a non-negative function integrable on \({\mathfrak {R}}\), and h being a bandwidth parameter. We need to make the assumptions as outlined in Appendix B.
Now, we define the conditional expectations
where \(\gamma _j(t) = E(\varvec{x}_j|T=t)\).
We estimate \(\varvec{\beta }\) using
with
where \({\widetilde{\varvec{Y}}} = ({\tilde{y}}_1, {\tilde{y}}_2, \dots , \tilde{y}_n)^\top \), \({\widetilde{\varvec{X}}} = ({\tilde{\varvec{x}}}_1, {\tilde{\varvec{x}}}_2, \dots , \tilde{\varvec{x}}_n)^\top \), \({\tilde{y}}_i = y_i - \gamma _0(t_i)\), and \({\tilde{\varvec{x}}}_i = \varvec{x}_i - \varvec{\gamma }(t_i)\) for \(i =1, 2, \dots , n\).
The conditional expectations \(\gamma _0(t)\) and \(\varvec{\gamma }(t)\) are obtained using a classical nonparametric approach through
where \(W_{ni}(t)\) is defined in (2.2). Clearly, once we obtain the estimates \(\hat{\gamma }_0(t)\) and \(\hat{\gamma }_j(t)\), they can be plugged into (2.4) prior to the estimation of \(\varvec{\beta }\).
The above procedure was independently proposed by Denby (1986) and Speckman (1988). A similar approach was taken by Ahmed et al. (2007) in estimating the nonparametric component in a PLM.
We obtain the robust M-estimators of the parameters of a PLM using a two-step procedure as follows:
-
Step 1
We first estimate \(\gamma _0(t)\) and \(\gamma _j(t)\) through kernel smoothing as described above. We denote the estimates by \(\hat{\gamma }_0(t)\) and \(\hat{\gamma }_j(t)\), respectively.
-
Step 2
The estimates in Step 1 are then plugged into (2.4). The estimator \({\hat{\varvec{\beta }}}\) of \(\varvec{\beta }\) is then obtained by regressing the residuals \({\hat{r}}_i = y_i -\hat{\gamma }_0(t_i)\) on \(\varvec{u}_i = \varvec{x}_i - \hat{\varvec{\gamma }}(t_i)\) using a robust procedure.
Consistency and asymptotic normality of the estimators can be found in Appendix Section B.1 and the references therein.
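As an illustration, the two-step procedure above can be sketched in a few lines (a minimal sketch assuming a Gaussian kernel and an ordinary least-squares second step; the function names are ours, not the authors'):

```python
import numpy as np

def nw_weights(t0, t, h):
    """Nadaraya-Watson weights W_ni(t0) with a Gaussian kernel (illustrative choice)."""
    k = np.exp(-0.5 * ((t0 - t) / h) ** 2)
    return k / k.sum()

def two_step_pl_fit(y, X, t, h=0.1):
    """Speckman-type two-step fit: smooth out g(.) via partial residuals,
    then regress the residualized response on the residualized covariates."""
    n = len(y)
    y_tilde = np.empty(n)
    X_tilde = np.empty_like(X)
    for i in range(n):
        w = nw_weights(t[i], t, h)
        y_tilde[i] = y[i] - w @ y      # y_i - gamma0_hat(t_i)
        X_tilde[i] = X[i] - w @ X      # x_i - gamma_hat(t_i)
    beta_hat, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return beta_hat
```

In Step 2 the paper uses a robust procedure; here plain least squares stands in to keep the sketch short.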
3 Proposed shrinkage M-estimation strategies
In this section, we propose our full model and submodel estimators, and formulate a test statistic which has asymptotically a non-central \(\chi ^2\) distribution.
Let \({\hat{\varvec{\beta }}}_1^{\text {RM}}\) be the restricted estimator of \(\varvec{\beta }_1\) under the restriction \(\varvec{\beta }_2=\varvec{0}\), and \({\hat{\varvec{\beta }}}_1^{\text {UM}}\) be the unrestricted estimator of \(\varvec{\beta }_1\) when \(\varvec{\beta }_2\) may not be a null vector. Following Ahmed (2014), a Stein-type M-estimator (SM), \(\hat{\varvec{\beta }}^{\text {SM}}_1\), of \(\varvec{\beta }_1\) can be defined as
where \(\psi _n\) is a distance statistic defined later in (3.7). To avoid the over-shrinkage problem, the positive-rule Stein-type M-estimator (SM+) has the form
where \(z^+=\max (0,z)\).
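Concretely, given the two base estimators and the distance statistic, the SM and SM+ rules are simple closed-form combinations (a sketch assuming the usual Stein constant \(\kappa = p_2 - 2\); the function name is ours):

```python
import numpy as np

def stein_shrinkage(beta_um, beta_rm, psi_n, p2):
    """Stein-type (SM) and positive-rule (SM+) combinations of the
    unrestricted (UM) and restricted (RM) estimators; kappa = p2 - 2."""
    kappa = p2 - 2
    shrink = 1.0 - kappa / psi_n           # data-driven shrinkage factor
    beta_sm = beta_rm + shrink * (beta_um - beta_rm)
    # SM+ truncates the factor at zero to avoid over-shrinkage
    beta_smp = beta_rm + max(0.0, shrink) * (beta_um - beta_rm)
    return beta_sm, beta_smp
```

A large \(\psi_n\) (strong evidence against sparsity) leaves the unrestricted estimator essentially untouched, while a small \(\psi_n\) pulls the estimate toward (for SM+, exactly onto) the submodel estimator.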
3.1 Full model and submodel estimation strategies for \(\hat{\varvec{\beta }}_1\)
For a suitable absolutely continuous function \(\rho : {\mathfrak {R}} \rightarrow {\mathfrak {R}}\), with derivative \(\phi \), an M-estimator of \(\varvec{\beta }\) is defined as a solution of the minimization
Generally, an M-estimator is regression-equivariant, i.e.,
and robustness depends on the choice of \(\rho (\cdot )\). But it is generally not scale-equivariant. That is, it may not satisfy
To make the estimators scale and regression equivariant, we need to studentize them. The studentized M-estimator is defined as a solution of the minimization
where \(S_n =S_n(\varvec{Y})\ge 0\) is an appropriate scale statistic that is regression equivariant and scale equivariant, i.e.,
According to Jurečková and Sen (1996), the minimization in (3.2) should be supplemented by a rule for defining \(\varvec{M}_n\) when \(S_n(\varvec{Y})=0\). However, in general, this happens with probability zero, and the specific rule does not affect the asymptotic properties of \(\varvec{M}_n\). There are additional regularity conditions needed with (3.2), which we present in Appendix A. Further details may be found in Jurečková and Sen (1996, page 217).
Now, in an effort to define the M-estimator of \(\varvec{\beta }_1\), we define \(\varvec{C}=\varvec{X}^\top \varvec{X}\) with \(\varvec{X}=(\varvec{X}_1:\varvec{X}_2)\) as follows:
Also, we define
Note that, if \(\varvec{C}_{21}=\varvec{0}\), then \(\varvec{C}_{22.1}= \varvec{C}_{22}\). Otherwise, \(\varvec{C}_{22}- \varvec{C}_{22.1}\) is positive semi-definite. We assume that \(\varvec{C}\) and \(\varvec{C}_{22.1}\) are positive definite.
A studentized unrestricted M-estimator of \(\varvec{\beta }\) is defined as a solution of (3.2). Let us denote it by
A studentized restricted M-estimator of \(\varvec{\beta }_1\) is obtained by minimizing
where \(S_n\) is regression-invariant and so is not affected by the restriction. Since \(\rho (\cdot )\) is assumed to have derivative \(\phi (\cdot )\), we rewrite \(\hat{\varvec{\beta }}^{\text {UM}}\) as a solution of
In other words,
Similarly, \(\hat{\varvec{\beta }}_1^{\text {RM}}\) is a solution of
Now, let
Note that \({\varvec{M}}_{n}\) is a \((p_1+p_2)\)-vector, \({\varvec{M}}_{n_1}\) is a \(p_1\)-vector and \(\hat{\varvec{M}}_{n_2}^{\text {RM}}\) is a \(p_2\)-vector.
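The studentized M-estimation problem above is typically solved numerically. A minimal sketch of a Huber M-estimator computed by iteratively reweighted least squares, with the MAD playing the role of the scale statistic \(S_n\) (our illustrative implementation, not the authors' exact algorithm):

```python
import numpy as np

def huber_psi(u, c=1.345):
    """Huber score function: identity in the middle, clipped in the tails."""
    return np.clip(u, -c, c)

def m_estimate(X, y, c=1.345, n_iter=50):
    """Huber M-estimate of beta via IRLS, studentized by the MAD scale."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # LS starting value
    for _ in range(n_iter):
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r)))  # MAD scale S_n
        u = r / s
        w = np.ones_like(u)
        big = np.abs(u) > 1e-8
        w[big] = huber_psi(u[big], c) / u[big]            # IRLS weights psi(u)/u
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))   # weighted LS step
    return beta
```

Because the scale is re-estimated robustly at each step, the fit is both regression and scale equivariant, as required of the studentized estimator.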
3.2 Test statistic
Following Jurečková and Sen (1996, Sect. 10.2), a suitable test statistic can be formulated as follows:
where
Directly applying Lemma 5.5.1 in Jurečková and Sen (1996, page 220), it can be shown that, under the sparsity assumption, that is, when \(\varvec{\beta }_2\) is a null vector,
Further, under the (local) alternative hypothesis, \(\psi _n\) has a non-central \(\chi ^2\) distribution.
It is to be mentioned here that, unlike LS estimators, M-estimators are not linear. Even if the distribution function \(F\) is normal, the finite-sample distribution theory of M-estimators is not simple. Asymptotic methods (Jurečková & Sen, 1996) have been used to overcome this difficulty.
4 Asymptotic properties of the estimators
In this section, we establish the asymptotic properties of the estimators. This facilitates finding the asymptotic distributional bias (ADB), asymptotic distributional quadratic bias (ADQB), and asymptotic distributional quadratic risk (ADQR) of the estimators of the regression parameter vector \(\varvec{\beta }_1\).
Under the assumed regularity conditions
where
it is known that, under the non-sparsity assumption, that is, under a fixed alternative \(\varvec{\beta }_2 \ne \varvec{0}\),
such that the shrinkage factor \(\kappa \psi ^{-1}_n = {\mathcal {O}}_p (n^{-1})\). This implies that, asymptotically, there is no shrinkage effect. Therefore, to obtain meaningful asymptotics, we consider a class of local alternatives, \(\{K_n\}\), given by
where \(\varvec{\omega }= (\omega _1, \omega _2, \dots , \omega _{p_2})^\top \in {\mathfrak {R}}^{p_2}\) is a fixed vector and \(||\varvec{\omega }|| < \infty \), so that the null hypothesis \(H_0: \varvec{\beta }_2 = \varvec{0}\) reduces to \(H_0: \varvec{\omega }= \varvec{0}\).
For an estimator \(\varvec{\beta }^{*}_1\) and a positive-definite matrix \(\varvec{W}\), we define the loss function of the form
Thus, the risk function is defined as follows:
where tr denotes the trace operator and \(\varvec{\Omega }^{*}\) is the covariance matrix of \(\sqrt{n} (\varvec{\beta }_1^*-\varvec{\beta }_1)\). Whenever \( \lim _{n \rightarrow \infty }\hat{\varvec{\Omega }}^*_n = \hat{\varvec{\Omega }}^* \) exists, the asymptotic risk is defined by
Suppose the asymptotic cumulative distribution function (cdf) of \(\sqrt{n}(\varvec{\beta }^{*}_{1n} - \varvec{\beta }_1)\) under \(\{ K_n \}\) exists, and define it as
This is known as the asymptotic distribution function (ADF) of \(\varvec{\beta }^{*}_1\). Suppose that \(G_n \rightarrow G\) at all points of continuity as \(n \rightarrow \infty \), and let \(\hat{\varvec{\Omega }}^*\) be the covariance matrix of G. Then the ADR is defined as
As noted in Ahmed et al. (2006), if \(G_n \rightarrow G\) in second moment, then the ADR is the asymptotic risk. However, this is a stronger mode of convergence and is hard to prove analytically for shrinkage M-estimators. Therefore, they suggested using the asymptotic distributional risk.
Now let
be the dispersion matrix which is obtained from ADF. The asymptotic distributional quadratic risk (ADQR) may be defined as
where \(\varvec{\Gamma }\) is the asymptotic distributional mean squared error (ADMSE) of the estimators.
To establish the asymptotic properties of the estimators, we present two important theorems.
Theorem 1
Consider an absolutely continuous function \(f(\cdot )\) with derivative \(f'(\cdot )\) which exists everywhere, and finite Fisher information
Under \(\{K_n\}\) and the assumed regularity conditions, \(\psi _n\) has asymptotically a non-central chi-square distribution with non-centrality parameter \(\Delta = \varvec{\omega }^\top \varvec{Q}_{22.1}\varvec{\omega }\gamma ^{-2}\), where
and \(\phi (\cdot )\) is defined in (3.4) or Appendix A.
Theorem 2
We have, under the assumed regularity conditions, as \(n \rightarrow \infty \)
Proofs of these theorems are available in Jurečková and Sen (1996).
5 Asymptotic bias and risk of the estimators
In this section, we present the asymptotic distribution, bias and risk results for each of our estimators. We also compare their risk performances.
Theorem 3
Under the local alternative \(K_n\) and the assumed regularity conditions, we have as \(n\rightarrow \infty \)
-
(i)
\(\eta _1 = \sqrt{n}(\varvec{{\hat{\beta }}}^{\text {UM}}_1 - \varvec{\beta }_1) {\mathop {\rightarrow }\limits ^{d}} N(\varvec{0}, \gamma ^2\varvec{Q}^{-1}_{11.2}),\)
-
(ii)
\(\eta _2 = \sqrt{n}(\varvec{{\hat{\beta }}}^{\text {UM}}_1 - \varvec{{\hat{\beta }}}^{\text {RM}}_1) {\mathop {\rightarrow }\limits ^{d}} N(\varvec{\delta }, \varvec{\Sigma }^*)\), \(\quad \varvec{\delta }=-\varvec{Q}^{-1}_{11}\varvec{Q}_{12}\varvec{\omega },\)
-
(iii)
\(\eta _3 = \sqrt{n}(\varvec{{\hat{\beta }}}^{\text {RM}}_1-\varvec{\beta }_1) {\mathop {\rightarrow }\limits ^{d}} N(-\varvec{\delta }, \varvec{\Omega }^*), \quad \varvec{\Omega }^*=\gamma ^2 \varvec{Q}^{-1}_{11}.\)
We have, under \(\{K_{n}\}\)
where \(\varvec{Q}\) is partitioned as in (4.1).
Also, we have the joint distributions as follows:
The proof for this theorem is given in Appendix C.
5.1 Asymptotic bias of the estimators
The asymptotic distributional bias (ADB) of an estimator \(\varvec{\beta }^*\) is defined as
Theorem 4
Under the assumed regularity conditions and the stated theorems above, and under \(\{K_n\}\), the ADB of the estimators are as follows:
where \({E\left\{ \chi ^{-2}_{a}(\Delta )\right\} }\) is the expected value of the inverse of a non-central \(\chi ^2\) random variable with \(a\) degrees of freedom and non-centrality parameter \(\Delta \), and \(H_{a}(y, \Delta )\) is the cdf of a non-central \(\chi ^2\) random variable with \(a\) degrees of freedom and non-centrality parameter \(\Delta \).
The proof for this theorem is given in Appendix D.
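The quantity \(E\left\{ \chi ^{-2}_{a}(\Delta )\right\} \) appearing in the bias expressions has no elementary closed form, but it is easy to approximate by Monte Carlo; for \(\Delta =0\) it reduces to the known value \(1/(a-2)\) for \(a>2\), which gives a quick sanity check (a sketch; the function name is ours):

```python
import numpy as np

def e_inv_ncx2(df, nc, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E[chi^{-2}_df(nc)], the expected inverse
    of a non-central chi-square variate driving the bias expressions."""
    rng = np.random.default_rng(seed)
    x = rng.noncentral_chisquare(df, nc, size=n_draws)
    return np.mean(1.0 / x)
```

As the non-centrality \(\Delta \) grows, the expectation decreases toward zero, which is exactly why the shrinkage bias terms remain bounded in \(\Delta \).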
Let us define the asymptotic distributional quadratic bias (ADQB) of an estimator \(\varvec{\beta ^*}\) of \(\varvec{\beta }_1\) by
where \(\varvec{\Sigma }\) is the dispersion matrix of \(\varvec{{\hat{\beta }}}^{\text {UM}}_1\) as \( n \rightarrow \infty \). In our case, the dispersion matrix is \(\varvec{Q}_{11}\). Thus, the ADQBs of the estimators are given below.
The above expressions reveal that, as expected, the unrestricted estimator of \(\varvec{\beta }_1\) is asymptotically unbiased. On the other hand, the bias of the restricted estimator is a function of the sparsity parameter (the non-centrality parameter \(\Delta \)), so under the sparsity assumption the estimator is asymptotically unbiased. However, it is an unbounded function of \(\Delta \), which is not a desirable property.
It can be seen that both shrinkage estimators are also functions of \(\Delta \); more importantly, they are bounded functions of the non-centrality parameter. The magnitude of the bias increases as \(\Delta \) increases and then converges to zero as \(\Delta \rightarrow \infty \). As expected, the bias curve of the positive-rule shrinkage estimator lies below or coincides with that of the shrinkage estimator.
Since the bias enters the MSE (risk), from here onward we focus on the risk properties of the estimators.
5.2 Asymptotic risk and risk performance of the estimators
In Appendix E, we present the derivation of the expressions for asymptotic distributional mean square error (ADMSE), and consequentially the risk expressions of the respective estimators.
From the ADMSE and ADQR results in Appendix E, we clearly see that the risk of the classical unrestricted estimator does not depend on the sparsity assumption, so its risk takes the constant value \(\text {tr}(\varvec{W} \varvec{\Gamma }(\varvec{{\hat{\beta }}}^{\text {UM}}_1))\). On the other hand, the risk of the restricted estimator depends on the sparsity assumption: when the assumption is nearly correct, \(R(\varvec{{\hat{\beta }}}^{\text {RM}}_1)\le R(\varvec{{\hat{\beta }}}^{\text {UM}}_1)\), with strict inequality for some values in the parameter space induced by the sparsity parameter. Beyond this small interval of the parameter space, however, the unrestricted estimator dominates the restricted estimator. In fact, the risk of the restricted estimator is an unbounded function of the sparsity parameter, an undesirable property.
Interestingly, but not surprisingly, both shrinkage estimators are superior to the benchmark estimators over the entire parameter space. For a suitable choice of \(\varvec{W}\), it can be verified that \(R(\varvec{{\hat{\beta }}}^{\text {SM+}}_1) \le R(\varvec{{\hat{\beta }}}^{\text {SM}}_1) \le R(\varvec{{\hat{\beta }}}^{\text {UM}}_1)\), with strict inequality for some values in the parameter space. Thus, the shrinkage estimators dominate the classical M-estimator. Further, the shrinkage estimators outperform the restricted estimator except in a small interval where the sparsity assumption may hold. Thus, we recommend the use of the shrinkage estimators, as they are available in closed form and free from any tuning parameter.
6 Simulation studies
In this section, we conduct a simulation study to appraise the performance of the estimators in practical settings and to quantify their relative behavior. We perform Monte Carlo simulation experiments to examine the quadratic risk performance of the proposed estimators. We simulate the response from the following model:
where \(\beta _l\) is a \(p_1 \times 1\) vector and \(\beta _m\) is a \(p_2 \times 1\) vector of parameters with \(p=p_1+p_2\), and \(\varepsilon _i \sim N(0,1)\), \(i=1, \ldots , n\). Furthermore, \(x_{i1}=(\zeta ^{(1)}_{i1})^2+\zeta ^{(1)}_{i}+ \xi _{i1}\), \(x_{i2}=(\zeta ^{(1)}_{i2})^2+\zeta ^{(1)}_{i}+ 2\xi _{i2}\), and \(x_{is}=(\zeta ^{(1)}_{is})^2+\zeta ^{(1)}_{i}\) for all \(s=3,\ldots , p\), with \(\zeta ^{(1)}_{is}\sim N(0,1)\), \(\zeta ^{(1)}_{i}\sim N(0,1)\), \(\xi _{i1}\sim \) Bernoulli(0.35) and \(\xi _{i2}\sim \) Bernoulli(0.35).
We are interested in testing the sparsity assumption in the form of the statistical hypothesis \(H_0: (\beta _{p_1+1}, \beta _{p_1+2}, \ldots , \beta _{p_1+p_2})=\varvec{0}\). Our aim is to estimate \(\varvec{\beta }_1\) when the sparsity assumption may or may not be true. We partition the regression coefficients as \(\varvec{\beta }= (\varvec{\beta }_1^\top , \varvec{\beta }_2^\top )^\top \). Each realization was repeated 5000 times to obtain stable results. For each realization, we calculated the MSE of the estimators.
We define \(\Delta ^* = ||\varvec{\beta } - \varvec{\beta }^{(0)}||\), where \(\varvec{\beta }^{(0)}= (\varvec{\beta }_1^\top , \varvec{0}^\top )^\top \) and \(||\cdot ||\) is the Euclidean norm. In addition, \(\Delta ^*\) and \(S_n\) were estimated using the median absolute deviation (MAD). To examine the behavior of the estimators for \(\Delta ^* >0\), further data sets were generated from those distributions under the alternative hypothesis.
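For reproducibility, the covariate design described above can be generated as follows (a sketch under our reading of the design; the response and nonparametric component are omitted, and the function name is ours):

```python
import numpy as np

def simulate_design(n, p, rng=None):
    """Covariates per the simulation design: x_is = zeta_is^2 + zeta_i,
    with Bernoulli(0.35) shifts added to the first two columns."""
    rng = np.random.default_rng(rng)
    zeta = rng.normal(size=(n, p))          # zeta_is, iid N(0,1)
    zeta_common = rng.normal(size=n)        # shared zeta_i, N(0,1)
    X = zeta ** 2 + zeta_common[:, None]
    X[:, 0] += rng.binomial(1, 0.35, size=n)       # xi_i1
    X[:, 1] += 2 * rng.binomial(1, 0.35, size=n)   # 2 * xi_i2
    return X
```

The shared term \(\zeta ^{(1)}_{i}\) induces dependence across the columns, so the design is correlated rather than orthogonal.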
6.1 Error distributions
In an effort to evaluate the performance of the proposed estimators numerically, we perform a simulation study. We generate data from four different error distributions, namely the standard normal, contaminated normal, standard logistic, and standard Laplace distributions.
The cumulative distribution function
was used to generate the standard normal and contaminated normal errors, where \(\lambda \) is the parameter indicating whether the standard normal or its contaminated version is returned. We consider \(\lambda =0\) and \(\lambda =0.9\), respectively. Indeed, for \(\lambda =0\) we get the standard normal errors, while for \(\lambda =0.9\), with \(\omega ^2 \ne 1\), we obtain scale-contaminated normal errors.
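The contaminated normal errors can be drawn as a two-component scale mixture (a sketch; we read \(F\) as \((1-\lambda )\Phi (x) + \lambda \Phi (x/\omega )\), which reduces to the standard normal at \(\lambda =0\); the function name and default \(\omega \) are ours):

```python
import numpy as np

def contaminated_normal(n, lam=0.0, omega=3.0, rng=None):
    """Draws from the scale mixture (1-lam)*N(0,1) + lam*N(0, omega^2);
    lam=0 recovers standard normal errors."""
    rng = np.random.default_rng(rng)
    # with probability lam, inflate the scale to omega
    scale = np.where(rng.uniform(size=n) < lam, omega, 1.0)
    return scale * rng.normal(size=n)
```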
The standard logistic distribution has cdf
The standard Laplace distribution has cdf
6.2 Relative risk comparison
The risk performance of an estimator of \(\varvec{\beta }_1\) was measured by comparing its MSE with that of the unrestricted M-estimator. We numerically calculated the relative MSE (RMSE) of the proposed estimators \(\varvec{{\hat{\beta }}}^{\text {RM}}_1,\) \(\varvec{{\hat{\beta }}}^{\text {SM}}_1\), and \(\varvec{{\hat{\beta }}}^{\text {SM+}}_1\) relative to the unrestricted estimator \(\varvec{{\hat{\beta }}}^{\text {UM}}_1\), given by
where \(\hat{\varvec{\beta }}_1^\text {*}\) is one of the proposed estimators. The amount by which an RMSE exceeds unity indicates the degree of superiority of the estimator \(\hat{\varvec{\beta }}_1^\text {*}\) over \(\varvec{{\hat{\beta }}}^{\text {UM}}_1\); see also Fig. 1.
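In code, the simulated RMSE is simply a ratio of empirical risks over the Monte Carlo replications (a sketch; arrays are replications \(\times \, p_1\), and the function name is ours):

```python
import numpy as np

def rmse(beta_true, draws_um, draws_star):
    """Simulated relative MSE: MSE(UM) / MSE(candidate); values above 1
    favour the candidate estimator, as in the tables."""
    mse_um = np.mean(np.sum((draws_um - beta_true) ** 2, axis=1))
    mse_star = np.mean(np.sum((draws_star - beta_true) ** 2, axis=1))
    return mse_um / mse_star
```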
We compute the RMSE values for \(n=30, 50\) and various configurations of \((p_1, p_2)\) based on Huber's \(\rho \)-function. Our results are presented in Fig. 1 and Tables 1–4.
Figure 1 shows the RMSE values of the various M-estimators. Here, \(\Delta ^*\) indicates the correctness of the submodel under the sparsity assumption; thus, \(\Delta ^* > 0\) quantifies the degree of deviation from the assumed model. Figure 1 clearly shows that the restricted estimator is the best when \(\Delta ^*\) is close to the origin. However, the restricted estimator becomes inefficient, and its RMSE drops below 1 very quickly, as \(\Delta ^*\) moves away from zero. The RMSE of the restricted estimator is depicted by the dashed line in Fig. 1. In the simulation study, the restricted estimator shows similar behaviour for all the error distributions considered.
Tables 1–4 portray similar characteristics of the estimators. Both shrinkage estimators dominate the classical M-estimator, and the positive-rule shrinkage estimator (SM+) dominates the shrinkage estimator. For example, Table 1 presents the RMSEs for \((p_1, p_2) = (3, 5)\) and \(n=30\). For the standard normal error, the gain in risk for the positive-rule shrinkage M-estimator is 3.161 times that of the classical M-estimator, provided that the model specification is correct (i.e., \(\Delta ^*=0\)). For the same configuration, when the error distribution is the standard Laplace, the gain in risk for SM+ is 2.273 times that of the unrestricted estimator. Interestingly, for the larger-dimensional case \((p_1, p_2) = (5, 20)\) in Table 3, the gains are much higher, with values of 7.325 and 4.200, respectively, demonstrating the applicability and power of the Stein-rule estimators in high-dimensional cases.
In closing, our numerical results strongly corroborate the theoretical properties of the suggested estimators.
7 Concluding remarks
In this paper, shrinkage M-estimation strategies in the context of a partially linear regression model are developed. The statistical properties of the shrinkage and positive-rule shrinkage M-estimators are investigated when the sparsity assumption may or may not hold. The expressions for the bias and risk of the estimators are presented in closed form. The relative performance of the estimators is critically examined: the positive-rule shrinkage estimator is found to perform better than the unrestricted estimator. Further, it outshines the restricted estimator except in a small interval where the submodel at hand is assumed to be nearly the true model.
In the simulation study, we numerically compute the relative mean squared errors of the restricted, shrinkage, and positive-rule shrinkage M-estimators compared to the unrestricted M-estimator. Four different error distributions are considered to study the performance of the proposed estimators. Our numerical results also provide support for the positive-rule shrinkage estimator under varying degrees of model misidentification. The submodel (restricted) M-estimator outperforms all other estimators when sparsity holds. However, a small departure from this condition makes the restricted estimator very inefficient, questioning its applicability for practical purposes. We suggest using the positive-rule shrinkage M-estimator due to its performance over the entire parameter space.
More importantly, the performance of the positive-rule shrinkage M-estimator is most noticeable when \(p_2\) is large. This work can be extended to high-dimensional cases; we refer to Ahmed et al. (2023). We plan to study such extensions in a separate communication.
References
Ahmed, S. E. (2014). Penalty, Shrinkage and Pretest Strategies: Variable Selection and Estimation. New York, USA: Springer.
Ahmed, S. E., Ahmed, F., & Yüzbaşı, B. (2023). Post-Shrinkage Strategies in Statistical and Machine Learning for High Dimensional Data. Boca Raton, USA: CRC Press.
Ahmed, S. E., Doksum, K. A., Hossain, S., & You, J. (2007). Shrinkage, pretest and absolute penalty estimators in partially linear models. Australian & New Zealand Journal of Statistics, 49, 435–454.
Ahmed, S. E., & Fallahpour, S. (2012). Shrinkage estimation strategy in quasi-likelihood models. Statistics & Probability Letters, 82(12), 2170–2179.
Ahmed, S. E., Hussein, A. A., & Sen, P. K. (2006). Risk comparison of some shrinkage M-estimators in linear models. Nonparametric Statistics, 18, 401–415.
Ahmed, S. E., & Raheem, S. M. E. (2012). Shrinkage and absolute penalty estimation in linear regression models. Wiley Interdisciplinary Reviews: Computational Statistics, 4(6), 541–553.
Ahmed, S. E., & Yüzbaşı, B. (2016). Big data analytics: integrating penalty strategies. International Journal of Management Science and Engineering Management, 11(2), 105–115.
Arashi, M., Kibria, B. G., Norouzirad, M., & Nadarajah, S. (2014). Improved preliminary test and Stein-rule Liu estimators for the ill-conditioned elliptical linear regression model. Journal of Multivariate Analysis, 126, 53–74.
Bianco, A., & Boente, G. (2004). Robust estimators in semiparametric partly linear regression models. Journal of Statistical Planning and Inference, 122(1–2), 229–252.
Denby, L. (1986). Smooth regression functions. Statistical Research Report, 26. AT &T Bell Laboratories, Murray Hill.
Farcomeni, A., & Ventura, L. (2012). An overview of robust methods in medical research. Statistical Methods in Medical Research, 21(2), 111–133.
Jurečková, J., & Sen, P. K. (1996). Robust Statistical Procedures: Asymptotics and Interrelations. New York: Wiley.
Ma, T., Liu, S., & Ahmed, S. E. (2014). Shrinkage estimation for the mean of the inverse Gaussian population. Metrika, 77, 733–752.
Maruyama, Y., Kubokawa, T., & Strawderman, W. E. (2023). Stein Estimation. Singapore: Springer.
Norouzirad, M., & Arashi, M. (2019). Preliminary test and Stein-type shrinkage ridge estimators in robust regression. Statistical Papers, 60, 1849–1882.
Opoku, E. A., Ahmed, S. E., & Nathoo, F. S. (2021). Sparse estimation strategies in linear mixed effect models for high-dimensional data application. Entropy, 23(10), 1348.
Raheem, S. M. E., Ahmed, S. E., & Doksum, K. A. (2012). Absolute penalty and shrinkage estimation in partially linear models. Computational Statistics & Data Analysis, 56, 874–891.
Rather, K. U. I., Koçyiğit, E. G., Onyango, R., & Kadilar, C. (2022). Improved regression in ratio type estimators based on robust M-estimation. PLoS ONE, 17(12), e0278868.
Robinson, P. (1988). Root-n-consistent semiparametric regression. Econometrica, 56, 931–954.
Shih, J. H., Lin, T. Y., Jimichi, M., & Emura, T. (2021). Robust ridge M-estimators with pretest and Stein-rule shrinkage for an intercept term. Japanese Journal of Statistics and Data Science, 4, 107–150.
Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series. B, 50, 413–437.
Sun, R., Ma, T., & Liu, S. (2020). Portfolio selection: shrinking the time-varying inverse conditional covariance matrix. Statistical Papers, 61, 2583–2604.
Susanti, Y., Qona'ah, N., Ferawati, K., & Qumillaila, C. (2020). Prediction modeling of annual parasite incidence (API) of Malaria in Indonesia using robust regression of M-estimation and S-estimation. AIP Conference Proceedings, 2296, 020100. https://doi.org/10.1063/5.0037417
Zhou, P., Xie, J., Li, W., Wang, H., & Chai, T. (2020). Robust neural networks with random weights based on generalized M-estimation and PLS for imperfect industrial data modeling. Control Engineering Practice, 105, 104633.
Acknowledgements
We would like to express our sincere gratitude to the Reviewers and Editors for their constructive comments and valuable feedback, which greatly contributed to the enhancement of this manuscript. Their meticulous review and thoughtful suggestions played a pivotal role in improving the quality and clarity of our work. Furthermore, S. Ejaz Ahmed would like to thank the colleagues at the University of Canberra for their hospitality and support. The support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) has been invaluable in conducting the research presented in this manuscript and is gratefully acknowledged.
Ethics declarations
Conflicts of interest
The authors declare no conflict of interest.
Appendices
Appendix
A. Regularity conditions
Here, we list the regularity conditions needed for the minimization problem in (3.2). Detailed discussions about these conditions can be found in Jurečková and Sen (1996, pp. 217–218).
For the studentized M-estimators, consider that \(\phi = \rho '\) can be decomposed as
$$\begin{aligned} \phi = \phi _1 + \phi _2 + \phi _3, \end{aligned}$$
where \(\phi _1\) is an absolutely continuous function with absolutely continuous derivative, \(\phi _2\) is a continuous piecewise linear function that is constant in a neighbourhood of \(\pm \infty \), and \(\phi _3\) is a non-decreasing step function.
The following conditions are imposed on (3.2).
-
RC1
\(S_n(Y)\) is regression invariant and scale equivariant, \(S_n >0\) a.s., and
$$\begin{aligned} \sqrt{n} (S_n - S) = O_p(1) \end{aligned}$$for some functional \(S= S(F) >0\).
-
RC2
The function \(h(t) = \int \rho ((z-t)/S) \textrm{d}F(z)\) has the unique minimum at \(t=0\).
-
RC3
For some \(\delta > 0\) and \(\eta > 1\),
$$\begin{aligned} { \int _{-\infty }^{\infty }} \left\{ |z| \sup _{|u| \le \delta } \sup _{|v| \le \delta } \bigg | \phi _1''\left( \frac{e^{-v}(z + u)}{S}\right) \bigg | \right\} ^\eta \textrm{d}F(z) < \infty \end{aligned}$$and
$$\begin{aligned} { \int _{-\infty }^{\infty }} \left\{ |z|^2 \sup _{|u| \le \delta } \bigg | \frac{\phi _1'' (z+u)}{S} \bigg | \right\} ^\eta \textrm{d}F(z) < \infty , \end{aligned}$$where \(\phi _1'(z) = \frac{d}{dz} \phi _1(z)\) and \(\phi _1''(z) = \frac{d^2}{dz^2}\phi _1(z)\).
-
RC4
\(\phi _2\) is a continuous, piecewise linear function with knots at \(\mu _1, \dots , \mu _k\), which is constant in a neighbourhood of \(\pm \infty \). Hence the derivative \(\phi _2'\) of \(\phi _2\) is a step function
$$\begin{aligned} \phi _2'(z) = \alpha _\nu \quad \text{ for } \mu _\nu< z < \mu _{\nu +1}, \nu = 0, 1, \dots , k, \end{aligned}$$where \(\alpha _0, \alpha _1, \dots , \alpha _k \in {\mathfrak {R}}_1\), \(\alpha _0 = \alpha _k = 0\) and \(-\infty = \mu _0< \mu _1< \dots< \mu _k < \mu _{k+1} = \infty \). Further, we assume that \(f(z) = \frac{\textrm{d}F(z)}{\textrm{d}z}\) is bounded in a neighbourhood of \(S_{\mu _j}, j = 1, 2, \dots , k\).
-
RC5
\(\phi _3(z) = \lambda _{\nu }\) for \(q_\nu < z \le q_{\nu +1}\), \(\nu = 0, 1, \dots , m\), where \(-\infty = q_0< q_1< \dots< q_m < q_{m+1}= \infty \) and \(-\infty< \lambda _0< \lambda _1< \dots< \lambda _m < \infty \). We further assume that \(f'(z)\) and \(f''(z)\) are bounded in a neighbourhood of \(S_{q_j}, j = 1, 2, \dots , m\).
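As a concrete illustration of the decomposition above (our own toy example, not taken from the paper), the following sketch builds a score \(\phi = \phi _1 + \phi _2 + \phi _3\) from a smooth part, a Huber-type piecewise linear part, and a step part, and checks the qualitative properties the conditions require:

```python
import numpy as np

# Illustrative decomposition phi = phi_1 + phi_2 + phi_3 (a toy example):
#   phi_1: absolutely continuous with absolutely continuous derivative (tanh)
#   phi_2: Huber's psi -- continuous, piecewise linear, constant near +/- inf
#   phi_3: a non-decreasing step function (a scaled sign function)

def phi1(z):               # smooth component
    return np.tanh(z)

def phi2(z, c=1.345):      # Huber psi: linear on [-c, c], constant beyond
    return np.clip(z, -c, c)

def phi3(z):               # non-decreasing step function with a knot at 0
    return 0.5 * np.sign(z)

def phi(z):
    return phi1(z) + phi2(z) + phi3(z)

z = np.linspace(-5, 5, 11)
# phi2 is constant in a neighbourhood of +/- infinity:
assert phi2(4.0) == phi2(5.0) == 1.345
# phi3 is non-decreasing:
assert np.all(np.diff(phi3(z)) >= 0)
# phi is odd, as each component is odd:
assert np.allclose(phi(z), -phi(-z))
```

The decomposition is not unique; the point is only that each component can be handled by a different asymptotic argument (smoothness for \(\phi _1\), piecewise linearity for \(\phi _2\), monotonicity for \(\phi _3\)).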
B. Assumptions
Assumption B.1
The function \(g(\cdot )\) satisfies the Lipschitz condition of order 1 on [0, 1].
Assumption B.2
The probability weight functions \(W_{ni}(\cdot )\) satisfy
-
a)
\(\max _{1\le i\le n}\sum _{j=1}^{n} W_{ni}(t_j) = {\mathcal {O}}(1)\),
-
b)
\(\max _{1\le i, \, j \le n} W_{ni}(t_j) = {\mathcal {O}}(n^{-2/3})\),
-
c)
\(\max _{1\le j\le n}\sum _{i=1}^{n} W_{ni}(t_j) I(|t_i - t_j| > c_n) = {\mathcal {O}}(d_n)\), where I is the indicator function, \(c_n\) satisfies \(\limsup _{n\rightarrow \infty } nc^{3}_n < \infty \), and \(d_n\) satisfies \(\limsup _{n \rightarrow \infty } n d^3_n < \infty \).
Remark 1
The usual polynomial and trigonometric functions satisfy Assumption B.1.
Remark 2
Under regular conditions, the Nadaraya-Watson kernel weights, Priestley and Chao kernel weights, locally linear weights and Gasser-Müller kernel weights satisfy Assumption B.2. If we take the pdf of \(U[-1, 1]\) as the kernel function, i.e.,
$$\begin{aligned} K(u) = \frac{1}{2} I(|u| \le 1), \end{aligned}$$
with \(t_i = \frac{i}{n}\) and bandwidth \(h_n = cn^{-1/3}\), where c is a constant, then the Priestley and Chao kernel weights satisfy Assumption B.2, and the weights are
$$\begin{aligned} W_{ni}(t) = \frac{1}{nh_n} K\left( \frac{t_i - t}{h_n}\right) . \end{aligned}$$
For a detailed discussion on the assumptions above, see Ahmed et al. (2007).
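The order conditions in Assumption B.2 can be checked numerically for the uniform-kernel Priestley–Chao weights of Remark 2. A minimal sketch, assuming \(t_i = i/n\), \(K(u) = \frac{1}{2} I(|u| \le 1)\) and \(h_n = cn^{-1/3}\) (the function and variable names are ours):

```python
import numpy as np

# Numerical check that Priestley-Chao weights with the uniform kernel,
# design t_i = i/n and bandwidth h_n = c * n^(-1/3) behave as Assumption B.2
# requires: row sums bounded, O(1), and individual weights O(n^(-2/3)).

def pc_weights(n, c=1.0):
    t = np.arange(1, n + 1) / n
    h = c * n ** (-1.0 / 3.0)
    u = (t[:, None] - t[None, :]) / h      # (t_i - t_j) / h_n
    K = 0.5 * (np.abs(u) <= 1.0)           # uniform kernel on [-1, 1]
    return K / (n * h)                     # W_{ni}(t_j)

for n in (100, 500, 2000):
    W = pc_weights(n)
    # row sums stay bounded as n grows ...
    assert W.sum(axis=1).max() < 1.5
    # ... and the largest single weight equals 0.5 * n^(-2/3)
    assert abs(W.max() - 0.5 * n ** (-2.0 / 3.0)) < 1e-12
```

With this bandwidth each weight is exactly \(\tfrac{1}{2}n^{-2/3}/c\) on its support, which is what makes condition b) hold with equality up to the constant.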
1.1 B.1 Consistency and Asymptotic Normality
We now denote by \((R(T), \varvec{U}(T)^\top )^\top \) a random vector with the same distribution as \((r_i, \varvec{u}_i^\top )^\top \).
Consistency of the regression parameters in a semi-parametric model has been proved in great detail in Bianco and Boente (2004). We omit the details but present only the set of assumptions, lemma, and theorem which are needed for proving asymptotic normality and consistency of the estimators.
Let \({\tilde{\rho }}\) and \({\widetilde{W}}\) be score and weight functions, respectively. The estimator \(\hat{\varvec{\beta }}\) is defined as a solution of
$$\begin{aligned} \sum _{i=1}^{n} {\tilde{\rho }} \left( \frac{{\hat{r}}_i - \varvec{u}_i^\top \varvec{\beta }}{s_n}\right) {\widetilde{W}} (||\varvec{u}_i||)\, \varvec{u}_i = \varvec{0}, \end{aligned}$$
where \({\hat{r}}_i = y_i -\hat{\gamma }_0(t_i)\), \(\varvec{u}_i = \varvec{x}_i - \hat{\varvec{\gamma }}(t_i)\), and \(s_n\) is an estimate of the residual scale.
To derive the asymptotic distribution of \(\hat{\varvec{\beta }}\), we must have \(t_i\) in a compact set; without loss of generality, we assume that \(t_i \in [0,1]\). We need the following set of assumptions; see Bianco and Boente (2004) for details.
-
A1
\({\tilde{\rho }}\) is odd, bounded, continuous, and twice differentiable with bounded derivatives \({\tilde{\rho }}'\) and \({\tilde{\rho }}''\), such that \(\phi _1(t) = t\tilde{\rho }'(t)\) and \(\phi _2(t) = t\tilde{\rho }''(t)\) are bounded.
-
A2
\(E({\tilde{W}}(||\varvec{U}(T)||) ||\varvec{U}(T)||^2) < \infty \) and the matrix
$$\begin{aligned} \varvec{A} = E \left( {\tilde{\rho }}' \left( \frac{R(T) - \varvec{U}(T)^\top \varvec{\beta }}{\sigma }\right) {\widetilde{W}} (||\varvec{U}(T)||) \varvec{U}(T)\varvec{U}(T)^\top \right) \end{aligned}$$is nonsingular.
-
A3
\(\widetilde{W}(u) = {\tilde{\rho }}_1(u)u^{-1} >0\) is a bounded function which satisfies the Lipschitz condition of order 1. Further, \({\tilde{\rho }}_1\) is bounded with bounded derivative.
-
A4
\(E(\widetilde{W} (||\varvec{U}(T)||)\varvec{U}(T)|T=t) =0\) for almost all t.
-
A5
The functions \(\varvec{x}_j(t), 0 \le j \le p\) are continuous in [0, 1] with continuous first derivative.
Remark 3
According to Robinson (1988), condition A2 is needed so that no element of \(\varvec{X}\) can be perfectly predicted from T. A2 guarantees that there is no multicollinearity among the columns of \(\varvec{X}- {\tilde{\varvec{X}}}_j(T)\); in other words, \(\varvec{X}\) has to be free from multicollinearity. Condition A5 is a standard requirement in kernel estimation for semi-parametric models, needed to guarantee asymptotic normality.
Lemma B.1
Let \((y_i, \varvec{x}_i^\top , t_i)^\top , 1 \le i \le n\) be independent random vectors satisfying (2.1) with \(e_i\) independent of \((\varvec{x}_i^\top , t_i)^\top \). Assume that the \(t_i\) are random variables with \(t_i \in [0, 1]\). Denote by \((R(T), \varvec{U}(T)^\top )^\top \) a random vector with the same distribution as \((r_i, \varvec{u}_i^\top )^\top \).
Further, let \(\hat{\varvec{\gamma }}_j(t_i), \, 0 \le j \le p\) be estimates of \(\gamma _j(t_i)\) satisfying the uniform consistency conditions of Bianco and Boente (2004).
If \({\tilde{\varvec{\beta }}} {\mathop {\longrightarrow }\limits ^{p}} \varvec{\beta }\) and \(s_n {\mathop {\longrightarrow }\limits ^{p}} \sigma \), then under assumptions A1-A3, \(\varvec{A}_n {\mathop {\longrightarrow }\limits ^{p}} \varvec{A}\), where \(\varvec{A}\) is defined in A2 and \({\mathop {\longrightarrow }\limits ^{p}}\) denotes convergence in probability.
Proof
The proof is available in the appendix of Bianco and Boente (2004).
Theorem B.1
Let \((y_i, \varvec{x}_i^\top , t_i)^\top , 1 \le i \le n\) be independent random vectors satisfying (2.1) with \(e_i\) independent of \((\varvec{x}_i^\top , t_i)^\top \). Assume that the \(t_i\) are random variables with \(t_i \in [0, 1]\). Denote by \((R(T), \varvec{U}(T)^\top )^\top \) a random vector with the same distribution as \((r_i, \varvec{u}_i^\top )^\top \).
Further, let \(\hat{\gamma }_j(t),\, 0 \le j \le p\) be estimates of \(\gamma _j(t)\) such that the first derivative of \(\hat{\gamma }_j(t)\) exists and is continuous, and the \(\hat{\gamma }_j\) satisfy the uniform consistency conditions of Bianco and Boente (2004).
Then, if \(s_n {\mathop {\longrightarrow }\limits ^{p}} \sigma \), under A1-A5,
$$\begin{aligned} \sqrt{n}\, (\hat{\varvec{\beta }} - \varvec{\beta }) {\mathop {\longrightarrow }\limits ^{d}} \mathcal {N}_p(\varvec{0}, \sigma ^2 \varvec{Q}), \end{aligned}$$
with \(\varvec{Q} = \varvec{A}^{-1}\varvec{\Sigma } (\varvec{A}^{-1})^\top \), where \(\varvec{A}\) is defined in A2 and
$$\begin{aligned} \varvec{\Sigma } = E \left( {\tilde{\rho }}^2 \left( \frac{R(T) - \varvec{U}(T)^\top \varvec{\beta }}{\sigma }\right) {\widetilde{W}}^2 (||\varvec{U}(T)||)\, \varvec{U}(T)\varvec{U}(T)^\top \right) . \end{aligned}$$
Proof
The proof is available in Bianco and Boente (2004).
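The two-step construction behind Lemma B.1 and Theorem B.1 — smooth out \(t\), then M-estimate \(\varvec{\beta }\) from the partial residuals — can be sketched numerically. This is a simplified illustration with standard substitutes of our choosing (Nadaraya-Watson smoothing, Huber score via IRLS, MAD scale), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- simulate a partially linear model y = x'beta + g(t) + e ---------------
n, p = 500, 3
beta_true = np.array([2.0, 0.0, -1.0])
t = rng.uniform(0, 1, n)
x = rng.normal(size=(n, p)) + np.sin(2 * np.pi * t)[:, None]   # x depends on t
y = x @ beta_true + np.cos(2 * np.pi * t) + rng.standard_t(3, n)  # heavy tails

# --- step 1: kernel-smooth y and x on t (Nadaraya-Watson) ------------------
def nw_smooth(v, t, h=0.05):
    K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    W = K / K.sum(axis=1, keepdims=True)
    return W @ v

r_hat = y - nw_smooth(y, t)     # partial residuals of y:  r_i - gamma_0(t_i)
u = x - nw_smooth(x, t)         # partial residuals of x:  x_i - gamma(t_i)

# --- step 2: Huber M-estimation of beta via IRLS ---------------------------
def huber_irls(u, r, c=1.345, n_iter=50):
    beta = np.linalg.lstsq(u, r, rcond=None)[0]
    for _ in range(n_iter):
        res = r - u @ beta
        s = np.median(np.abs(res)) / 0.6745          # robust (MAD) scale
        w = np.clip(c * s / np.maximum(np.abs(res), 1e-12), None, 1.0)
        sw = np.sqrt(w)                              # sqrt-weights for WLS
        beta = np.linalg.lstsq(u * sw[:, None], r * sw, rcond=None)[0]
    return beta

beta_hat = huber_irls(u, r_hat)
assert np.max(np.abs(beta_hat - beta_true)) < 0.3    # recovers beta closely
```

Despite the \(t_3\) errors, the Huber step keeps the estimate close to \(\varvec{\beta }\), illustrating the robustness that Theorem B.1's normality result quantifies asymptotically.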
C. Proof for Theorem 5.1
For Theorem 5.1, we derive \(\varvec{\Sigma }_{12}\) as follows:
where
Therefore,
and
D. Proof for Theorem 5.2
We present our proof as follows: Obviously, ADB\((\varvec{{\hat{\beta }}}^{\text {UM}}_1)=0\) and
where \(I(\cdot )\) denotes an indicator function.
E. Derivation of Asymptotic Risk of the Estimators
Let us denote the ADMSE by \(\varvec{\Gamma }\), and then the expressions are listed as follows:
Proof
Now
By substituting \(E\{\psi _n^{-1}\eta _2\eta _2^\top \}\) in (A), we get
Using the law of iterated expectations, we obtain
Substituting the above in (B), we get
Using the definition in (4.4), we have the ADQR expressions as follows:
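The shrinkage estimators whose ADB and ADQR are derived above combine the full-model and submodel estimates through a data-driven weight. A minimal generic sketch of the usual Stein-rule and positive-part recipe follows; the function, the form of \(T_n\), and all names here are our assumptions, not reproduced from the paper's equations:

```python
import numpy as np

# Generic Stein-rule and positive-part shrinkage of a full-model estimate
# toward a submodel (sparsity-restricted) estimate. T_n is a test statistic
# for the restriction and p2 the number of restricted coefficients (p2 >= 3).

def shrinkage(beta_full, beta_sub, T_n, p2):
    shrink = (p2 - 2) / T_n
    beta_s  = beta_full - shrink * (beta_full - beta_sub)            # Stein-rule
    beta_ps = beta_full - min(shrink, 1.0) * (beta_full - beta_sub)  # positive part
    return beta_s, beta_ps

beta_full = np.array([1.0, 2.0, 0.3, -0.2, 0.1])
beta_sub  = np.array([1.0, 2.0, 0.0,  0.0, 0.0])   # sparsity-restricted fit

# large T_n (restriction doubtful): stay close to the full model
beta_s, beta_ps = shrinkage(beta_full, beta_sub, T_n=10.0, p2=5)
assert np.allclose(beta_s, beta_full - 0.3 * (beta_full - beta_sub))

# small T_n (restriction plausible): Stein-rule over-shoots past the
# submodel, while the positive-part version stops at it
beta_s2, beta_ps2 = shrinkage(beta_full, beta_sub, T_n=1.0, p2=5)
assert np.allclose(beta_ps2, beta_sub)
```

The second case shows why the positive-part truncation matters: it prevents the over-shrinking that makes the plain Stein-rule estimator inadmissible, which is exactly what the ADQR comparison above formalizes.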
About this article
Cite this article
Raheem, E., Ahmed, S.E. & Liu, S. Stein-rule M-estimation in sparse partially linear models. Jpn J Stat Data Sci 7, 507–535 (2024). https://doi.org/10.1007/s42081-023-00231-0