1 Introduction

We study robust shrinkage M-estimation in partially linear models (PLM) with a scaled error term. Ahmed et al. (2006) considered robust shrinkage estimation of regression parameters when it is a priori suspected that the regression parameters could be restricted to a linear subspace. They studied the asymptotic properties of variants of Stein-rule M-estimators. For some insights on shrinkage estimation strategies, we refer to Ahmed and Fallahpour (2012), Ahmed and Raheem (2012), Raheem et al. (2012), Ahmed (2014), Ma et al. (2014), Ahmed et al. (2016), Sun et al. (2020) and Opoku et al. (2021). Recently, Ahmed et al. (2023) extended shrinkage strategies to high-dimensional settings where the sparsity assumption cannot be judiciously justified. Maruyama et al. (2023) presented both classical and recent shrinkage estimation developments, including results on the admissibility of generalized Bayes estimators in the presence of a nuisance scale parameter.

Applications of robust statistical techniques, including M-estimation and related approaches, to modeling and prediction challenges are found in various domains, e.g. industrial data modeling (Zhou et al., 2020), disease incidence prediction (Susanti et al., 2020), and ratio-type estimation (Rather et al., 2022). These techniques are valuable for handling data imperfections, outliers, and non-normality.

Related works also include those by Arashi et al. (2014), who discussed improved preliminary test and Stein-rule Liu estimators tailored for the ill-conditioned elliptical linear regression model. The authors focused on addressing the challenges posed by data that exhibit multicollinearity or non-normality, providing more accurate estimation in such scenarios. Norouzirad and Arashi (2019) discussed the use of preliminary test and Stein-type shrinkage ridge estimators in the context of robust regression. They explored methods for improving the robustness and accuracy of regression models, particularly when dealing with outliers or influential data points. Recently, Shih et al. (2021) proposed robust ridge M-estimators, incorporating pretest and Stein-rule shrinkage techniques for estimating the intercept term in regression models. They aimed to enhance the robustness of ridge regression when dealing with outliers and influential observations. On the other hand, in addition to the original results on Huber-type M-estimation, a review of recent results and applications was given by Farcomeni and Ventura (2012).

Generally speaking, a PLM is more flexible than a linear model since it includes a nonlinear component along with the linear components. A PLM may provide a better alternative to the classical linear regression model in a situation where one or more predictors have a nonlinear relationship with the response variable. Robust regression models are designed to overcome some of the limitations of classical linear regression in a host of scenarios. For example, least squares regression is highly sensitive to outliers and relies on strict underlying assumptions; any violation of these assumptions may have a serious impact on the validity of the fitted model.

In this paper, we extend the available works to a PLM and develop shrinkage M-estimators. We construct shrinkage M-estimators of the regression coefficients when the sparsity assumption may or may not hold, and our analytical and numerical results establish the superiority of the shrinkage M-estimators over the full model and submodel M-estimators. In our setup, the nonparametric part is estimated by a kernel-based method.

1.1 Statement of the problem

Consider a PLM of the form

$$\begin{aligned} \varvec{Y}= \varvec{X}\varvec{\beta }+ g(T) + \sigma \varvec{e}, \end{aligned}$$
(1.1)

where \(\varvec{Y}= (y_1, y_2, \dots , y_n)^\top \) is the n-vector of responses, \(\varvec{X}= (\varvec{x}_1^\top , \varvec{x}_2^\top , \dots , \varvec{x}_n^\top )^\top \) is the \(n \times p\) design matrix with known row p-vectors \(\varvec{x}_i\), \(\varvec{\beta }=(\beta _1, \beta _2, \dots , \beta _p)^\top \) is the p-vector of regression parameters, \(g(T)=(g(t_1), g(t_2), \dots , g(t_n))^\top \) with \(g(\cdot )\) an unknown real-valued function, and \(\varvec{e} =(e_1, e_2, \dots , e_n)^\top \) is the n-vector of random errors with mean \(E(\varvec{e})=\varvec{0}\). The \(e_i\)'s are independent and identically distributed (iid) random variables having a continuous distribution, F, free of the unknown scale parameter \(\sigma >0\), and \(^\top \) denotes the transpose of a vector or matrix. In passing, we remark that, without loss of generality, the intercept term is not included in establishing the asymptotic properties of the estimators.

Under the sparsity assumption, the data matrix \(\varvec{X}\) can be partitioned as \(\varvec{X}=(\varvec{X}_1: \varvec{X}_2)\) with \(\varvec{\beta }=(\varvec{\beta }_1^\top , \varvec{\beta }_2^\top )^\top \), where \(\varvec{X}_1\) and \(\varvec{X}_2\) are \(n \times p_1\) and \(n \times p_2\) submatrices of predictors with strong signals and no signals, respectively. Thus, the model can be rewritten as

$$\begin{aligned} \varvec{Y}=\varvec{X}_1 \varvec{\beta }_1 + \varvec{X}_2 \varvec{\beta }_2 + g(T) + \sigma \varvec{e}, \quad p = p_1+p_2 < n. \end{aligned}$$

Under the assumption of sparsity, that is, \(\varvec{\beta }_2 = \varvec{0}\), we have the submodel (also called the under-fitted or restricted model):

$$\begin{aligned} \varvec{Y}=\varvec{X}_1 \varvec{\beta }_1 + g(T) + \sigma \varvec{e}, \quad p_1 < n, \end{aligned}$$

on which the remaining discussion is based.

In practice, the submodel can be readily obtained by applying a suitable variable selection method to the full model. We are primarily interested in estimating \(\varvec{\beta }_1\) when \(\varvec{\beta }_2\) may or may not be a null vector, that is, when practitioners are not certain whether the model is fully sparse. In an effort to help data analysts, we propose Stein-rule shrinkage M-estimators to improve upon the performance of the under-fitted model estimators.

The remainder of the paper is organized as follows. In Sect. 2, we define kernel-based least-squares (LS) estimators and discuss a two-step procedure to estimate the nonparametric function in a PLM. In Sect. 3, we propose Stein-rule shrinkage M-estimators. Asymptotic properties of the estimators are presented in Sect. 4. The expressions for the asymptotic bias and risk of the estimators are derived in Sect. 5. Monte Carlo simulation results are presented in Sect. 6. Our concluding remarks are made in Sect. 7. Finally, our derivations of the theoretical results are included in the appendix.

2 Proposed LS estimation

In this section, we propose our robust LS estimation method with a two-step procedure.

Again, consider a PLM of the form

$$\begin{aligned} \varvec{Y}= \varvec{X}\varvec{\beta }+ g(T) + \sigma \varvec{e}. \end{aligned}$$
(2.1)

We first linearize (2.1) by estimating \(g(\cdot )\) using kernel smoothing. We then confine ourselves to the estimation of \(\varvec{\beta }\) based on the partial residuals, an approach that attains the usual parametric convergence rate \(n^{-1/2}\) without under-smoothing the nonparametric component \(g(\cdot )\); see e.g. Speckman (1988).

Now, we describe the estimation process. We assume that \(\left\{ (y_i, \varvec{x}_i^\top , t_i); i=1, 2, \dots , n \right\} \) satisfy (2.1). If \(\varvec{\beta }\) is the true parameter, then by \(E(e_i)=0\), we have

$$\begin{aligned} g(t_i) = E(y_i - \varvec{x}_i^\top \varvec{\beta }), \quad i =1, 2, \dots , n. \end{aligned}$$

A natural nonparametric estimator of \(g(\cdot )\) given \(\varvec{\beta }\) is

$$\begin{aligned} g^*(t, \varvec{\beta }) = \sum _{i=1}^{n} W_{ni}(t) (y_i - \varvec{x}_i^\top \varvec{\beta }), \end{aligned}$$

where

$$\begin{aligned} W_{ni}(t) = \frac{K((t_i - t)/h)}{\sum _{j=1}^n K((t_j - t)/h)}, \end{aligned}$$
(2.2)

with \(K(\cdot )\) being a kernel function which is a non-negative function integrable on \({\mathfrak {R}}\), and h being a bandwidth parameter. We need to make the assumptions as outlined in Appendix B.
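For concreteness, the weights in (2.2) are simple to compute; the following is a minimal sketch, assuming a Gaussian kernel for illustration (any non-negative integrable kernel may be substituted), with a function name of our own choosing.

```python
import numpy as np

def kernel_weights(t_grid, t, h, K=lambda u: np.exp(-0.5 * u**2)):
    """Weights W_ni(t) of (2.2) at a single point t.

    t_grid : observed t_1, ..., t_n;  h : bandwidth;
    K : kernel (Gaussian here; its normalizing constant cancels in the ratio).
    """
    k = K((t_grid - t) / h)   # K((t_i - t)/h), i = 1, ..., n
    return k / k.sum()        # normalize so the weights sum to one
```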

Now, we define the conditional expectations

$$\begin{aligned} \gamma _0(t)&= E(y | T=t), \text{ and } \\ \varvec{\gamma }(t)&= (\gamma _1(t), \gamma _2(t), \ldots , \gamma _p(t))^\top , \end{aligned}$$

where \(\gamma _j(t) = E(x_{ij}|T=t)\) for \(j = 1, 2, \dots , p\).

We estimate \(\varvec{\beta }\) using

$$\begin{aligned} \hat{\varvec{\beta }}= \text{ arg }\min SS(\varvec{\beta }) = ({\widetilde{\varvec{X}}}^\top {\widetilde{\varvec{X}}} )^{-1}{\widetilde{\varvec{X}}}^\top \widetilde{\varvec{Y}}, \end{aligned}$$
(2.3)

with

$$\begin{aligned} SS(\varvec{\beta }) = \sum _{i=1}^n \left( y_i - \varvec{x}_i^\top \varvec{\beta }- g^*(t_i, \varvec{\beta })\right) ^2 = \sum _{i=1}^n ({\tilde{y}}_i - \tilde{\varvec{x}}_i^\top \varvec{\beta })^2, \end{aligned}$$
(2.4)

where \({\widetilde{\varvec{Y}}} = ({\tilde{y}}_1, {\tilde{y}}_2, \dots , \tilde{y}_n)^\top \), \({\widetilde{\varvec{X}}} = ({\tilde{\varvec{x}}}_1, {\tilde{\varvec{x}}}_2, \dots , \tilde{\varvec{x}}_n)^\top \), \({\tilde{y}}_i = y_i - \gamma _0(t_i)\), and \({\tilde{\varvec{x}}}_i = \varvec{x}_i - \varvec{\gamma }(t_i)\) for \(i =1, 2, \dots , n\).

The conditional expectations \(\gamma _0(t)\) and \(\varvec{\gamma }(t)\) are obtained using a classical nonparametric approach through

$$\begin{aligned} \hat{\gamma }_0(t)&= \sum _{i=1}^n W_{ni}(t)y_i, \text{ and } \\ \hat{\gamma }_j(t)&= \sum _{i=1}^n W_{ni}(t) x_{ij}, \end{aligned}$$

where \(W_{ni}(t)\) is defined in (2.2). Clearly, once we obtain the estimates \(\hat{\gamma }_0(t)\) and \(\hat{\gamma }_j(t)\), they can be plugged into (2.4) prior to the estimation of \(\varvec{\beta }\).

The above procedure was independently proposed by Denby (1986) and Speckman (1988). A similar approach was taken by Ahmed et al. (2007) in estimating the nonparametric component in a PLM.
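A minimal sketch of the resulting partial-residual LS estimator (2.3), reusing the `kernel_weights` helper above; a common bandwidth h for the response and all covariates is assumed here for simplicity.

```python
import numpy as np

def partial_residual_ls(y, X, t, h):
    """Speckman-type estimator (2.3): smooth y and each column of X on t,
    then run ordinary least squares on the partial residuals."""
    n = len(y)
    # n x n smoother matrix: row i holds the weights W_nj(t_i), j = 1..n
    W = np.vstack([kernel_weights(t, t[i], h) for i in range(n)])
    y_tilde = y - W @ y      # y_i - gamma0_hat(t_i)
    X_tilde = X - W @ X      # x_i - gamma_hat(t_i), column-wise smoothing
    beta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y_tilde)
    return beta_hat, y_tilde, X_tilde
```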

We obtain the robust M-estimators of the parameters of a PLM using a two-step procedure as follows:

  1. Step 1

    We first estimate \(\gamma _0(t)\) and \(\gamma _j(t)\) through kernel smoothing as described above. We denote the estimates by \(\hat{\gamma }_0(t)\) and \(\hat{\gamma }_j(t)\), respectively.

  2. Step 2

    The estimates in Step 1 are then plugged into (2.4). The estimator \({\hat{\varvec{\beta }}}\) of \(\varvec{\beta }\) is obtained by regressing the residuals \(y_i -\hat{\gamma }_0(t_i)\) on \(\varvec{x}_i - \hat{\varvec{\gamma }}(t_i)\) using a robust procedure. We denote the residuals as \({\hat{r}}_i = y_i -\hat{\gamma }_0(t_i)\) and \(\varvec{u}_i = \varvec{x}_i - \hat{\varvec{\gamma }}(t_i)\).

Consistency and asymptotic normality of the estimators can be found in Appendix Section B.1 and the references therein.
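To make Step 2 concrete, the following sketch fits the robust regression on the partial residuals using Huber's \(\rho \)-function (as in the simulations of Sect. 6) and a MAD-based scale statistic; it is an illustration of one concrete robust choice, not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def huber_rho(u, c=1.345):
    """Huber's rho-function with the usual tuning constant c = 1.345."""
    return np.where(np.abs(u) <= c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)

def m_estimate(y_tilde, X_tilde, c=1.345):
    """Robust Step 2: minimize sum rho((y_tilde_i - x_tilde_i' b) / S_n)."""
    # Scale statistic S_n: MAD of preliminary LS residuals (regression
    # invariant and scale equivariant up to the usual consistency factor).
    b0 = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]
    r0 = y_tilde - X_tilde @ b0
    S_n = np.median(np.abs(r0 - np.median(r0))) / 0.6745
    obj = lambda b: huber_rho((y_tilde - X_tilde @ b) / S_n, c).sum()
    return minimize(obj, b0, method="BFGS").x, S_n
```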

3 Proposed shrinkage M-estimation strategies

In this section, we propose our full model and submodel estimators, and formulate a test statistic which has asymptotically a non-central \(\chi ^2\) distribution.

Let \({\hat{\varvec{\beta }}}_1^{\text {RM}}\) be the restricted estimator of \(\varvec{\beta }_1\) under the restriction \(\varvec{\beta }_2=\varvec{0}\), and \(\hat{\varvec{\beta }}_1^{\text {UM}}\) be the unrestricted estimator of \(\varvec{\beta }_1\) when \(\varvec{\beta }_2\) may not be a null vector. Following Ahmed (2014), a Stein-type M-estimator (SM), \(\hat{\varvec{\beta }}^{\text {SM}}_1\), of \(\varvec{\beta }_1\) can be defined as

$$\begin{aligned} \hat{\varvec{\beta }}_1^{\text {SM}}= \hat{\varvec{\beta }}_1^{\text {RM}} + (\hat{\varvec{\beta }}_1^{\text {UM}} - \hat{\varvec{\beta }}_1^{\text {RM}}) \left\{ 1- \kappa \psi _n^{-1}\right\} , \quad \text{ for } \kappa =p_2-2 \text{ and } p_2 \ge 3, \end{aligned}$$

where \(\psi _n\) is a distance statistic defined later in (3.7). To avoid the over-shrinkage problem, the positive-rule Stein-type M-estimator (SM+) has the form

$$\begin{aligned} \hat{\varvec{\beta }}_1^{\text {SM+}}= \hat{\varvec{\beta }}_1^{\text {RM}} + (\hat{\varvec{\beta }}_1^{\text {UM}} - \hat{\varvec{\beta }}_1^{\text {RM}}) \left\{ 1- \kappa \psi _n^{-1}\right\} ^{+}, \quad \text{ for } p_2 \ge 3, \end{aligned}$$

where \(z^+=\max (0,z)\).
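Both estimators are closed-form combinations of the two fits; a short sketch (our function names), assuming \(\hat{\varvec{\beta }}_1^{\text {UM}}\), \(\hat{\varvec{\beta }}_1^{\text {RM}}\), and \(\psi _n\) have already been computed:

```python
import numpy as np

def stein_m_estimators(beta_um1, beta_rm1, psi_n, p2):
    """SM and SM+ estimators of beta_1 from the unrestricted and restricted
    fits and the distance statistic psi_n of (3.7); requires p2 >= 3."""
    kappa = p2 - 2
    shrink = 1.0 - kappa / psi_n
    beta_sm  = beta_rm1 + (beta_um1 - beta_rm1) * shrink
    beta_smp = beta_rm1 + (beta_um1 - beta_rm1) * max(shrink, 0.0)  # z^+ = max(0, z)
    return beta_sm, beta_smp
```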

3.1 Full model and submodel estimation strategies for \(\hat{\varvec{\beta }}_1\)

For a suitable absolutely continuous function \(\rho : {\mathfrak {R}} \rightarrow {\mathfrak {R}}\), with derivative \(\phi \), an M-estimator of \(\varvec{\beta }\) is defined as a solution of the minimization

$$\begin{aligned} \min _{\varvec{\beta }} \sum _{i=1}^{n} \rho (\tilde{y}_i - \tilde{\varvec{x}}_i^\top \varvec{\beta }). \end{aligned}$$
(3.1)

Generally, an M-estimator is regression-equivariant, i.e.,

$$\begin{aligned} \varvec{M}_n(\varvec{Y}+ \varvec{X}\varvec{a}) = \varvec{M}_n(\varvec{Y}) + \varvec{a}, \quad \text{ for } \varvec{a} \in {\mathfrak {R}}^p, \end{aligned}$$

and robustness depends on the choice of \(\rho (\cdot )\). However, it is generally not scale-equivariant; that is, it may not satisfy

$$\begin{aligned} \varvec{M}_n(c\varvec{Y}) = c\varvec{M}_n(\varvec{Y}), \quad \text{ for } c>0. \end{aligned}$$

To make the estimators scale and regression equivariant, we need to studentize them. The studentized M-estimator is defined as a solution of the minimization

$$\begin{aligned} \min _{\varvec{\beta } \in {\mathfrak {R}}^p}\sum _{i=1}^{n} \rho \left( \frac{\tilde{y}_i - \tilde{\varvec{x}}_i^\top \varvec{\beta }}{S_n}\right) , \end{aligned}$$
(3.2)

where \(S_n =S_n(\varvec{Y})\ge 0\) is an appropriate scale statistic that is regression invariant and scale equivariant, i.e.,

$$\begin{aligned} S_n(c(\varvec{Y}+ \varvec{X}\varvec{a})) = cS_n(\varvec{Y}), \quad \text{ for } \varvec{a} \in {\mathfrak {R}}^p \text{ and } c>0. \end{aligned}$$

According to Jurečková and Sen (1996), the minimization in (3.2) should be supplemented by a rule for defining \(\varvec{M}_n\) in the case \(S_n(\varvec{Y})=0\). However, this happens with probability zero in general, and the specific rule does not affect the asymptotic properties of \(\varvec{M}_n\). There are additional regularity conditions needed with (3.2), which we present in Appendix A. Further details may be found in Jurečková and Sen (1996, page 217).

Now, in an effort to define the M-estimator for \(\varvec{\beta }_1\), we define \(\varvec{C}=\varvec{X}^\top \varvec{X}\) with \(\varvec{X}=(\varvec{X}_1: \varvec{X}_2)\) as follows:

$$\begin{aligned} \varvec{C} = \left( \begin{array}{cc} \varvec{C}_{11} &{} \varvec{C}_{12} \\ \varvec{C}_{21} &{} \varvec{C}_{22} \end{array} \right) = \left( \begin{array}{cc} \varvec{X}_1^\top \varvec{X}_1 &{} \varvec{X}_1^\top \varvec{X}_2 \\ \varvec{X}_2^\top \varvec{X}_1 &{} \varvec{X}_2^\top \varvec{X}_2 \end{array} \right) . \end{aligned}$$

Also, we define

$$\begin{aligned} \varvec{C}_{22.1} = \varvec{C}_{22} - \varvec{C}_{21}\varvec{C}_{11}^{-1}\varvec{C}_{12}. \end{aligned}$$

Note that, if \(\varvec{C}_{21}=\varvec{0}\), then \(\varvec{C}_{22.1}= \varvec{C}_{22}\). Otherwise, \(\varvec{C}_{22}- \varvec{C}_{22.1}\) is positive semi-definite. We assume that \(\varvec{C}\) and \(\varvec{C}_{22.1}\) are positive definite.
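Numerically, \(\varvec{C}_{22.1}\) is the Schur complement of \(\varvec{C}_{11}\) in \(\varvec{C}\) and can be formed without explicitly inverting \(\varvec{C}_{11}\); a sketch (our function name), valid under the positive-definiteness assumption above:

```python
import numpy as np

def schur_c221(X1, X2):
    """C_{22.1} = C_22 - C_21 C_11^{-1} C_12 for X = (X1 : X2)."""
    C11, C12 = X1.T @ X1, X1.T @ X2
    C21, C22 = X2.T @ X1, X2.T @ X2
    return C22 - C21 @ np.linalg.solve(C11, C12)  # solve instead of inverse
```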

A studentized unrestricted M-estimator of \(\varvec{\beta }\) is defined as a solution of (3.2). Let us denote it by

$$\begin{aligned} \hat{\varvec{\beta }}^{\text {UM}} = \left( \left( \hat{\varvec{\beta }}_1^{\text {UM}}\right) ^\top , \left( \hat{\varvec{\beta }}_2^{\text {UM}}\right) ^\top \right) ^\top . \end{aligned}$$

A studentized restricted M-estimator of \(\varvec{\beta }_1\) is obtained by minimizing

$$\begin{aligned} \min _{{\varvec{\beta }}_{1} \in {\mathfrak {R}}^{p_1}} \sum _{i=1}^{n}\rho \left( \frac{\tilde{y}_i- \tilde{\varvec{x}}_{i1}^\top {\varvec{\beta }}_{1}}{S_n}\right) , \end{aligned}$$
(3.3)

where \(S_n\) is regression invariant and is therefore not affected by the restriction. Since \(\rho (\cdot )\) is assumed to have derivative \(\phi (\cdot )\), we may rewrite \(\hat{\varvec{\beta }}^{\text {UM}}\) as a solution of

$$\begin{aligned} \varvec{M}_n({\varvec{\theta }}) = \sum _{i=1}^n \tilde{\varvec{x}}_i \, \phi \left( \frac{\tilde{y}_i - \tilde{\varvec{x}}_i^\top \varvec{\theta }}{S_n} \right) = \varvec{0}. \end{aligned}$$
(3.4)

In other words,

$$\begin{aligned} \varvec{M}_n(\hat{\varvec{\beta }}^{\text {UM}}) = \varvec{0}. \end{aligned}$$

Similarly, \(\hat{\varvec{\beta }}_1^{\text {RM}}\) is a solution of

$$\begin{aligned} {{\varvec{M}}_{n_1}} ({\varvec{\theta }_1}) = \sum _{i=1}^n \tilde{\varvec{x}}_{i1} \, \phi \left( \frac{\tilde{y}_i - \tilde{\varvec{x}}_{i1}^\top \varvec{\theta }_1}{S_n} \right) = \varvec{0}. \end{aligned}$$
(3.5)

Now, let

$$\begin{aligned} { \hat{\varvec{M}}_{n_2}^{\text {RM}} } = \sum _{i=1}^n \tilde{\varvec{x}}_{i2} \, \phi \left( \frac{\tilde{y}_i - \tilde{\varvec{x}}_{i1}^\top \hat{\varvec{\beta }}_1^{\text {RM}}}{S_n}\right) . \end{aligned}$$
(3.6)

Note that \({\varvec{M}}_{n}\) is a \((p_1+p_2)\)-vector, \({\varvec{M}}_{n_1}\) is a \(p_1\)-vector and \(\hat{\varvec{M}}_{n_2}^{\text {RM}}\) is a \(p_2\)-vector.

3.2 Test statistic

Following Jurečková and Sen (1996, Sect. 10.2), a suitable test statistic can be formulated as follows:

$$\begin{aligned} \psi _n = \frac{\left[ \hat{\varvec{M}}_{n_2}^{\text {RM}}\right] ^\top \varvec{C}_{22.1}^{-1}\left[ \hat{\varvec{M}}_{n_2}^{\text {RM}}\right] }{\hat{\sigma }_{\Phi _n}^{2}}, \end{aligned}$$
(3.7)

where

$$\begin{aligned} \hat{\sigma }_{\Phi _n}^{2} = (n-p_2)^{-1}\sum _{i=1}^n \phi ^2 \left( \frac{\tilde{y}_i - \tilde{\varvec{x}}_{i1}^\top \hat{\varvec{\beta }}_1^{\text {RM}}}{S_n}\right) . \end{aligned}$$
(3.8)

Directly applying Lemma 5.5.1 of Jurečková and Sen (1996, page 220), it can be shown that under the sparsity assumption, that is, when \(\varvec{\beta }_2\) is a null vector,

$$\begin{aligned} \psi _n {\mathop {\longrightarrow }\limits ^{d}} \chi ^2_{p_2}. \end{aligned}$$

Further, under the (local) alternative hypothesis, \(\psi _n\) has asymptotically a non-central \(\chi ^2\) distribution.

It is to be mentioned here that, unlike LS estimators, M-estimators are not linear. Even if the distribution function F is normal, the finite sample distribution theory of M-estimators is not simple. Asymptotic methods (Jurečková and Sen, 1996) have been used to overcome this difficulty.
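A sketch of (3.6)–(3.8) for Huber's \(\rho \) (so that \(\phi \) is the clipping function), reusing the `schur_c221` helper from Sect. 3.1; forming \(\varvec{C}_{22.1}\) from the aligned covariates \(\tilde{\varvec{x}}_i\) is our assumption here, and the function names are ours.

```python
import numpy as np

def huber_phi(u, c=1.345):
    """phi = rho': the Huber psi-function (clipping at +/- c)."""
    return np.clip(u, -c, c)

def psi_n_statistic(y_tilde, X1_tilde, X2_tilde, beta_rm1, S_n, c=1.345):
    """Distance statistic (3.7) built from the aligned scores (3.6)."""
    scores = huber_phi((y_tilde - X1_tilde @ beta_rm1) / S_n, c)
    M2 = X2_tilde.T @ scores                 # p2-vector: M_hat_{n2}^{RM}
    C221 = schur_c221(X1_tilde, X2_tilde)    # helper from Sect. 3.1
    n, p2 = X2_tilde.shape
    sigma2 = (scores**2).sum() / (n - p2)    # variance estimate (3.8)
    return M2 @ np.linalg.solve(C221, M2) / sigma2
```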

4 Asymptotic properties of the estimators

In this section, we establish the asymptotic properties of the estimators. This facilitates finding the asymptotic distributional bias (ADB), asymptotic distributional quadratic bias (ADQB), and asymptotic distributional quadratic risk (ADQR) of the estimators of the regression parameter vector \(\varvec{\beta }_1\).

Under the assumed regularity conditions

$$\begin{aligned} \lim _{n \rightarrow \infty }\frac{C_n}{n} = \varvec{Q}, \end{aligned}$$
(4.1)

where

$$\begin{aligned} \varvec{Q} = \left( \begin{array}{cc} \varvec{Q}_{11} &{} \varvec{Q}_{12} \\ \varvec{Q}_{21} &{} \varvec{Q}_{22} \end{array} \right) , \end{aligned}$$

it is known that under the non-sparsity assumption, that is, under a fixed alternative \(\varvec{\beta }_2 \ne \varvec{0}\),

$$\begin{aligned} \frac{\psi _n}{n} \rightarrow \gamma ({\varvec{\beta }}_1, {\varvec{\beta }}_2; \varvec{Q}) >0, \quad \text{ as } n \rightarrow \infty , \end{aligned}$$

so that the shrinkage factor \(\kappa \psi ^{-1}_n = {\mathcal {O}}_p (n^{-1})\). This implies that, asymptotically, there is no shrinkage effect. Therefore, to obtain meaningful asymptotics, we consider a class of local alternatives, \(\{K_n\}\), given by

$$\begin{aligned} K_n: \varvec{\beta }_2 = \varvec{\beta }_{2n} = \frac{\varvec{\omega }}{\sqrt{n}}, \end{aligned}$$
(4.2)

where \(\varvec{\omega }= (\omega _1, \omega _2, \dots , \omega _{p_2})^\top \in {\mathfrak {R}}^{p_2}\) is a fixed vector and \(||\varvec{\omega }|| < \infty \), so that the null hypothesis \(H_0: \varvec{\beta }_2 = \varvec{0}\) reduces to \(H_0: \varvec{\omega }= \varvec{0}\).

For an estimator \(\varvec{\beta }^{*}_1\) and a positive-definite matrix \(\varvec{W}\), we define the loss function of the form

$$\begin{aligned} L(\varvec{\beta }^{*}_1; \varvec{\beta }_1) = n(\varvec{\beta }^{*}_1- \varvec{\beta }_1)^\top \varvec{W}(\varvec{\beta }^{*}_1- \varvec{\beta }_1). \end{aligned}$$

Thus, the risk function is defined as follows:

$$\begin{aligned} R[(\varvec{\beta }^{*}_1, \varvec{\beta }_1); \varvec{W}]&= n E[(\varvec{\beta }^{*}_1-\varvec{\beta }_1)^\top \varvec{W} (\varvec{\beta }^{*}_1 - \varvec{\beta }_1)] \nonumber \\&= n \, \text {tr} [\varvec{W}\{E(\varvec{\beta }^{*}_1-\varvec{\beta }_1)(\varvec{\beta }^{*}_1-\varvec{\beta }_1)^\top \}] \nonumber \\&= \text {tr} (\varvec{W} \varvec{\Omega }^{*}), \end{aligned}$$
(4.3)

where tr denotes the trace operator and \(\varvec{\Omega }^{*}\) is the covariance matrix of \(\sqrt{n} (\varvec{\beta }_1^*-\varvec{\beta }_1)\). Whenever \( \lim _{n \rightarrow \infty }{\varvec{\Omega }}^*_n = {\varvec{\Omega }}^* \) exists, the asymptotic risk is defined by

$$\begin{aligned} R_n(\varvec{\beta }_{1n}^*, \varvec{\beta }_1; \varvec{W}) \rightarrow R(\varvec{\beta }^*_1, \varvec{\beta }_1; \varvec{W}) = \text{ tr }(\varvec{W}{\varvec{\Omega }}^*). \end{aligned}$$

Suppose the asymptotic cumulative distribution function (cdf) of \(\sqrt{n}(\varvec{\beta }^{*}_{1n} - \varvec{\beta }_1)\) under \(\{ K_n \}\) exists, and is given by

$$\begin{aligned} G(\varvec{y}) = P\left[ \lim _{n \rightarrow \infty } \sqrt{n}(\varvec{\beta }^{*}_{1n} - \varvec{\beta }_1) \le \varvec{y} \right] . \end{aligned}$$

This is known as the asymptotic distribution function (ADF) of \(\varvec{\beta }^{*}_1\). Suppose that \(G_n \rightarrow G\) at all points of continuity as \(n \rightarrow \infty \), and let \(\varvec{\Omega }^*_{G}\) be the covariance matrix of G. Then the asymptotic distributional risk (ADR) is defined as

$$\begin{aligned} R(\varvec{\beta }^*_{1}, \varvec{\beta }_1; \varvec{W}) = \text{ tr }(\varvec{W} \varvec{\Omega }^*_{G}). \end{aligned}$$

As noted in Ahmed et al. (2006), if \(G_n \rightarrow G\) in second moment, then the ADR is the asymptotic risk. However, this is a stronger mode of convergence and is hard to prove analytically for shrinkage M-estimators. Therefore, they suggested using the asymptotic distributional risk.

Now let

$$\begin{aligned} {\varvec{\Gamma }} = \int \int \cdots \int \varvec{y} \varvec{y}^\top dG(\varvec{y}) \end{aligned}$$

be the dispersion matrix obtained from the ADF. The asymptotic distributional quadratic risk (ADQR) may then be defined as

$$\begin{aligned} R(\varvec{\beta }^{*}_1; \varvec{\beta }_1) = \text {tr}(\varvec{W} \varvec{\Gamma }), \end{aligned}$$
(4.4)

where \(\varvec{\Gamma }\) is the asymptotic distributional mean squared error (ADMSE) of the estimators.

To establish the asymptotic properties of the estimators, we present two important theorems.

Theorem 1

Consider an absolutely continuous density \(f(\cdot )\) of the error distribution F, with derivative \(f'(\cdot )\) existing everywhere, and finite Fisher information

$$\begin{aligned} I(f) = \int _{R}\left( \frac{-f'(x)}{f(x)}\right) ^2 dF(x) < \infty . \end{aligned}$$

Under \(\{K_n\}\) and the assumed regularity conditions, \(\psi _n\) has asymptotically a non-central chi-square distribution with non-centrality parameter \(\Delta = \varvec{\omega }^\top \varvec{Q}_{22.1}\varvec{\omega }\gamma ^{-2}\), where

$$\begin{aligned} \gamma ^2 = \frac{\int _{R}\phi ^2(x) \textrm{d}F(x)}{\left( \int _{R}\phi (x)\left[ -f'(x)/f(x)\right] \textrm{d}F(x)\right) ^2}, \end{aligned}$$
(4.5)

and \(\phi (\cdot )\) is defined in (3.4) or Appendix A.

Theorem 2

We have, under the assumed regularity conditions, as \(n \rightarrow \infty \)

$$\begin{aligned} \sqrt{n}( \varvec{{\hat{\beta }}}^{\text {UM}}- \varvec{\beta }) {\mathop {\rightarrow }\limits ^{d}} N_p(\varvec{0}, \gamma ^2 \varvec{Q}^{-1}). \end{aligned}$$
(4.6)

Proofs of these theorems are available in Jurečková and Sen (1996).

5 Asymptotic bias and risk of the estimators

In this section, we present the asymptotic distribution, bias and risk results for each of our estimators. We also compare their risk performances.

Theorem 3

Under the local alternative \(K_n\) and the assumed regularity conditions, we have as \(n\rightarrow \infty \)

  1. (i)

    \(\eta _1 = \sqrt{n}(\varvec{{\hat{\beta }}}^{\text {UM}}_1 - \varvec{\beta }_1) {\mathop {\rightarrow }\limits ^{d}} N(\varvec{0}, \gamma ^2\varvec{Q}^{-1}_{11.2}),\)

  2. (ii)

    \(\eta _2 = \sqrt{n}(\varvec{{\hat{\beta }}}^{\text {UM}}_1 - \varvec{{\hat{\beta }}}^{\text {RM}}_1) {\mathop {\rightarrow }\limits ^{d}} N(\varvec{\delta }, \varvec{\Sigma }^*)\), \(\quad \varvec{\delta }=-\varvec{Q}^{-1}_{11}\varvec{Q}_{12}\varvec{\omega },\)

  3. (iii)

    \(\eta _3 = \sqrt{n}(\varvec{{\hat{\beta }}}^{\text {RM}}_1-\varvec{\beta }_1) {\mathop {\rightarrow }\limits ^{d}} N(-\varvec{\delta }, \varvec{\Omega }^*), \quad \varvec{\Omega }^*=\gamma ^2 \varvec{Q}^{-1}_{11}.\)

We have, under \(\{K_{n}\}\)

$$\begin{aligned} \sqrt{n} \left( (\varvec{{\hat{\beta }}}^{\text {UM}}_1 - \varvec{\beta }_1)^\top , (\varvec{{\hat{\beta }}}^{\text {UM}}_2 - n^{-\frac{1}{2}}\varvec{\omega })^\top \right) ^\top {\mathop {\rightarrow }\limits ^{d}} N(\varvec{0}, \gamma ^2\varvec{Q^{-1}}), \end{aligned}$$

where \(\varvec{Q}\) is partitioned as in (4.1).

Also, we have the joint distributions as follows:

$$\begin{aligned} \left( \begin{array}{c} \eta _1 \\ \eta _2 \end{array} \right) \sim N \left[ \left( \begin{array}{c} \varvec{0}\\ \varvec{\delta } \end{array} \right) ,\, \left( \begin{array}{cc} \gamma ^2 \varvec{Q}^{-1}_{11.2} &{} \varvec{\Sigma }_{12} \\ \varvec{\Sigma }_{21} &{} \varvec{\Sigma }^* \end{array} \right) \right] \\ \left( \begin{array}{c} \eta _2 \\ \eta _3 \end{array} \right) \sim N \left[ \left( \begin{array}{c} \varvec{\delta }\\ -\varvec{\delta } \end{array} \right) ,\, \left( \begin{array}{cc} \varvec{\Sigma }^* &{} \varvec{\Omega }_{12} \\ \varvec{\Omega }_{21} &{} \varvec{\Omega }^* \end{array} \right) \right] . \end{aligned}$$

The proof for this theorem is given in Appendix C.

5.1 Asymptotic bias of the estimators

The asymptotic distributional bias (ADB) of an estimator \(\varvec{\beta }^*\) is defined as

$$\begin{aligned} \text{ ADB }(\varvec{\beta }^*)=E\left\{ \lim _{n\rightarrow \infty } n^{\frac{1}{2}}(\varvec{\beta }^*-\varvec{\beta })\right\} . \end{aligned}$$

Theorem 4

Under the assumed regularity conditions, the stated theorems above, and \(\{K_n\}\), the ADBs of the estimators are as follows:

$$\begin{aligned} \text{ ADB }(\varvec{{\hat{\beta }}}^{\text {UM}}_1)&={\varvec{0}}\\ \text{ ADB }(\varvec{{\hat{\beta }}}^{\text {RM}}_1)&=-\varvec{\delta }\\ \text{ ADB }(\varvec{{\hat{\beta }}}^{\text {SM}}_1)&= \kappa \varvec{\delta }E\left\{ \chi ^{-2}_{p_2+2}(\Delta )\right\} \\ \text{ ADB }(\varvec{{\hat{\beta }}}^{\text {SM+}}_1)&= ADB(\varvec{{\hat{\beta }}}^{\text {SM}}_1) - \varvec{\delta }\left[ H_{p_2+2}(\kappa , \Delta ) - E\left\{ \kappa \chi ^{-2}_{p_2+2}(\Delta )I(\chi ^{2}_{p_2+2}(\Delta )< \kappa )\right\} \right] , \end{aligned}$$

where \({E\left\{ \chi ^{-2}_{a}(\Delta )\right\} }\) is the expected value of the inverse of a non-central \(\chi ^2\) random variable with a degrees of freedom and non-centrality parameter \(\Delta \), and \(H_{a}(y, \Delta )\) is the cdf of a non-central \(\chi ^2\) random variable with a degrees of freedom and non-centrality parameter \(\Delta \).

The proof for this theorem is given in Appendix D.
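The quantities \(E\left\{ \chi ^{-2}_{a}(\Delta )\right\} \) and \(H_{a}(\kappa , \Delta )\) appearing in Theorem 4 are straightforward to evaluate numerically; a Monte Carlo sketch using scipy (the helper names are ours):

```python
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(0)

def e_inv_ncx2(df, delta, size=10**6):
    """Monte Carlo evaluation of E{chi^{-2}_df(Delta)}; finite for df > 2."""
    return (1.0 / ncx2.rvs(df, delta, size=size, random_state=rng)).mean()

def adb_sm_factor(kappa, p2, delta):
    """Scalar factor kappa * E{chi^{-2}_{p2+2}(Delta)} multiplying delta
    in ADB(SM) of Theorem 4."""
    return kappa * e_inv_ncx2(p2 + 2, delta)

# The cdf H_{p2+2}(kappa, Delta) entering ADB(SM+) is directly available:
# ncx2.cdf(kappa, p2 + 2, delta)
```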

Let us define the asymptotic distributional quadratic bias (ADQB) of an estimator \(\varvec{\beta }^*\) of \(\varvec{\beta }_1\) by

$$\begin{aligned} ADQB(\varvec{\beta }^*)&= [ADB(\varvec{\beta }^*)]^\top \varvec{\Sigma }^{-1} [ADB(\varvec{\beta }^*)], \end{aligned}$$

where \(\varvec{\Sigma }\) is the dispersion matrix of \(\varvec{{\hat{\beta }}}^{\text {UM}}_1\) as \( n \rightarrow \infty \). In our case, the dispersion matrix is \(\varvec{Q}_{11}\). Thus, the ADQBs of the estimators are given below.

$$\begin{aligned} \text{ ADQB }(\varvec{{\hat{\beta }}}^{\text {UM}}_1)&={\varvec{0}},\\ \text{ ADQB }(\varvec{{\hat{\beta }}}^{\text {RM}}_1)&= \varvec{\delta }^\top \varvec{Q}^{-1}_{11}\varvec{\delta },\\ \text{ ADQB }(\varvec{{\hat{\beta }}}^{\text {SM}}_1)&= \kappa ^2 \varvec{\delta }^\top \varvec{Q}^{-1}_{11}\varvec{\delta }\left[ E\left\{ \chi ^{-2}_{p_2+2}(\Delta )\right\} \right] ^2,\\ \text{ ADQB }(\varvec{{\hat{\beta }}}^{\text {SM+}}_1)&= \varvec{\delta }^\top \varvec{Q}^{-1}_{11}\varvec{\delta }\left[ H_{p_2+2}(\kappa , \Delta ) - E\left\{ \kappa \chi ^{-2}_{p_2+2}(\Delta )I(\chi ^{2}_{p_2+2}(\Delta )< \kappa )\right\} \right] . \end{aligned}$$

The above expressions reveal that, as expected, the unrestricted estimator of \(\varvec{\beta }_1\) is asymptotically unbiased. On the other hand, the bias of the restricted estimator is a function of the sparsity parameter (non-centrality parameter \(\Delta \)), so under the sparsity assumption the estimator is asymptotically unbiased. However, its bias is an unbounded function of \(\Delta \), which is not a desirable property.

It can be seen that both shrinkage estimators are also functions of \(\Delta \); more importantly, they are bounded functions of the non-centrality parameter. The magnitude of the bias increases as \(\Delta \) increases, reaches a maximum, and then converges to zero as \(\Delta \rightarrow \infty \). As expected, the bias curve of the positive-rule shrinkage estimator lies below or on the curve of the shrinkage estimator.

Since the bias is a component of the MSE (risk), we focus hereafter on the risk properties of the estimators.

5.2 Asymptotic risk and risk performance of the estimators

In Appendix E, we present the derivation of the expressions for the asymptotic distributional mean squared error (ADMSE) and, consequently, the risk expressions of the respective estimators.

From the ADMSE and ADQR results in Appendix E, we see clearly that the risk of the classical unrestricted estimator does not depend on the sparsity assumption, so its risk takes the constant value \(\text {tr}(\varvec{W} \varvec{\Gamma }(\varvec{{\hat{\beta }}}^{\text {UM}}_1))\). On the other hand, the risk of the restricted estimator depends on the sparsity assumption: when the assumption is nearly correct, \(R(\varvec{{\hat{\beta }}}^{\text {RM}}_1)\le R(\varvec{{\hat{\beta }}}^{\text {UM}}_1)\), and a strict inequality holds for some values in the parameter space induced by the sparsity parameter. However, beyond this small interval of the parameter space, the unrestricted estimator dominates the restricted estimator. As a matter of fact, the risk of the restricted estimator is an unbounded function of the sparsity parameter, an undesirable property.

Interestingly, but not surprisingly, both shrinkage estimators are superior to the benchmark full model estimator over the entire parameter space. For a suitable choice of \(\varvec{W}\), it can be verified that \(R(\varvec{{\hat{\beta }}}^{\text {SM+}}_1) \le R(\varvec{{\hat{\beta }}}^{\text {SM}}_1) \le R(\varvec{{\hat{\beta }}}^{\text {UM}}_1)\), with strict inequality for some values in the parameter space. Thus, the shrinkage estimators dominate the classical M-estimator. Further, the shrinkage estimators outperform the restricted estimator except in a small interval where the sparsity assumption may hold. Thus, we recommend the use of the shrinkage estimators; they are available in closed form and free of any tuning parameter.

6 Simulation studies

In this section, we conduct a simulation study to appraise the performance of the estimators in practical settings and to quantify their relative behavior. We perform Monte Carlo simulation experiments to examine the quadratic risk performance of the proposed estimators. We simulate the response from the following model:

$$\begin{aligned} y_i = \sum _{l=1}^{p_1}x_{il}\beta _l + \sum _{m=p_1+1}^{p}x_{im}\beta _m + \sin (4 \pi t_i) + \varepsilon _i \end{aligned}$$
(6.1)

where \((\beta _1, \ldots , \beta _{p_1})^\top \) is a \(p_1 \times 1\) vector and \((\beta _{p_1+1}, \ldots , \beta _{p})^\top \) is a \(p_2 \times 1\) vector of parameters with \(p=p_1+p_2\), and \(\varepsilon _i \sim N(0,1)\), \(i=1, \ldots , n\). Furthermore, \(x_{i1}=(\zeta ^{(1)}_{i1})^2+\zeta ^{(1)}_{i}+ \xi _{i1}\), \(x_{i2}=(\zeta ^{(1)}_{i2})^2+\zeta ^{(1)}_{i}+ 2\xi _{i2}\), and \(x_{is}=(\zeta ^{(1)}_{is})^2+\zeta ^{(1)}_{i}\) for \(s=3,\ldots , p\), with \(\zeta ^{(1)}_{is}\sim N(0,1)\), \(\zeta ^{(1)}_{i}\sim N(0,1)\), \(\xi _{i1}\sim \) Bernoulli(0.35) and \(\xi _{i2}\sim \) Bernoulli(0.35).

We are interested in testing the sparsity assumption in the form of the statistical hypothesis \(H_0: (\beta _{p_1+1}, \beta _{p_1+2}, \ldots , \beta _{p_1+p_2})^\top =\varvec{0}\). Our aim is to estimate \(\varvec{\beta }_1\) when the sparsity assumption may or may not be true. We partition the regression coefficients as \(\varvec{\beta }= (\varvec{\beta }_1^\top , \varvec{\beta }_2^\top )^\top \). Each realization was repeated 5000 times to obtain stable results. For each realization, we calculated the MSE of the estimators.

We define \(\Delta ^* = ||\varvec{\beta } - \varvec{\beta }^{(0)}||,\) where \(\varvec{\beta }^{(0)}= (\varvec{\beta }_1^\top , \varvec{0}^\top )^\top \) and \(||\cdot ||\) is the Euclidean norm. In addition, the scale statistic \(S_n\) was estimated by the median absolute deviation (MAD). To determine the behavior of the estimators for \(\Delta ^* >0,\) further data sets were generated from those distributions under the alternative hypothesis.
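A sketch of one simulated data set from (6.1) with the covariate scheme above, under standard normal errors; the paper does not state how the \(t_i\) are generated, so equally spaced points on (0, 1) are assumed here for illustration (errors from the distributions of Sect. 6.1 can be swapped in).

```python
import numpy as np

def simulate_plm(n, p1, p2, beta1, beta2, rng):
    """One draw from model (6.1) under the covariate scheme of Sect. 6."""
    p = p1 + p2
    zeta  = rng.standard_normal((n, p))      # zeta_{is}^{(1)}
    zeta0 = rng.standard_normal(n)           # zeta_i^{(1)}, shared across columns
    X = zeta**2 + zeta0[:, None]
    X[:, 0] += rng.binomial(1, 0.35, n)      # xi_{i1} ~ Bernoulli(0.35)
    X[:, 1] += 2 * rng.binomial(1, 0.35, n)  # 2 * xi_{i2}
    t = (np.arange(1, n + 1) - 0.5) / n      # assumed design for t_i
    y = X @ np.concatenate([beta1, beta2]) + np.sin(4 * np.pi * t) \
        + rng.standard_normal(n)             # epsilon_i ~ N(0, 1)
    return y, X, t
```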

6.1 Error distributions

To evaluate the performance of the proposed estimators numerically, we generate data from four different error distributions, namely the standard normal, contaminated normal, standard logistic, and standard Laplace distributions.

The cumulative distribution function

$$\begin{aligned} F(x) = \lambda N(0, \omega ^2) + (1- \lambda )N(0, 1) \end{aligned}$$
(6.2)

was used to generate the standard normal and contaminated normal errors, where \(\lambda \) is the parameter indicating whether the standard normal or its contaminated version is returned. We consider \(\lambda =0\) and \(\lambda =0.9\), respectively. Indeed, for \(\lambda =0\) we obtain standard normal errors, while for \(\lambda =0.9\), with \(\omega ^2 \ne 1\), we obtain scale-contaminated normal errors.

The standard logistic distribution has cdf

$$\begin{aligned} F(x) = \frac{1}{1+ e^{-x}}, \quad x \in {\mathfrak {R}}. \end{aligned}$$
(6.3)

The standard Laplace distribution has cdf

$$\begin{aligned} F(x) = \frac{1}{2} \left[ 1+ \text {sign}(x)(1- e^{-|x|})\right] , \quad x \in {\mathfrak {R}}. \end{aligned}$$
(6.4)

6.2 Relative risk comparison

The risk performance of an estimator of \(\varvec{\beta }_1\) was measured by comparing its MSE with that of the unrestricted M-estimator. We numerically calculated the relative MSE (RMSE) of the proposed estimators \(\varvec{{\hat{\beta }}}^{\text {RM}}_1,\) \(\varvec{{\hat{\beta }}}^{\text {SM}}_1\), \(\varvec{{\hat{\beta }}}^{\text {SM+}}_1\) to the unrestricted estimator \(\varvec{{\hat{\beta }}}^{\text {UM}}_1\), given by

$$\begin{aligned} \text {RMSE}(\varvec{{\hat{\beta }}}^{\text {UM}}_1: \hat{\varvec{\beta }}_1^{*})=\frac{\text {MSE}(\varvec{{\hat{\beta }}}^{\text {UM}}_1)}{\text {MSE}(\hat{\varvec{\beta }}_1^{*})}, \end{aligned}$$
(6.5)

where \(\hat{\varvec{\beta }}_1^{*}\) is one of the proposed estimators. The amount by which an RMSE exceeds unity indicates the degree of superiority of the estimator \(\hat{\varvec{\beta }}_1^{*}\) over \(\varvec{{\hat{\beta }}}^{\text {UM}}_1\); see also Fig. 1.
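Computing (6.5) from Monte Carlo output is direct; a sketch, assuming each replication's estimate of \(\varvec{\beta }_1\) is stored as one row of a matrix:

```python
import numpy as np

def rmse(beta_um_reps, beta_star_reps, beta1_true):
    """Relative MSE (6.5): rows of each input are Monte Carlo replicates."""
    mse = lambda B: ((B - beta1_true)**2).sum(axis=1).mean()
    return mse(beta_um_reps) / mse(beta_star_reps)  # > 1 favors beta_star
```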

We compute the RMSE values for \(n=30, 50\) and various configurations of \((p_1, p_2)\) based on Huber’s \(\rho \)-function. Our results are presented in Fig. 1 and Tables 1–4.

Fig. 1 RMSE values for RM, SM, and SM+ estimators with respect to the unrestricted M-estimator for \((p_1, p_2) = (3, 4)\), \(n=50\), when Huber’s \(\rho \)-function is considered. Specifically, RMSE = 1 when the estimators perform equivalently

Figure 1 shows the RMSE values of the various M-estimators. Here, \(\Delta ^*\) indicates the correctness of the submodel under the sparsity assumption; thus, \(\Delta ^* > 0\) quantifies the degree of deviation from the assumed model. Figure 1 clearly shows that the restricted estimator is the best when \(\Delta ^*\) is close to the origin. However, the restricted estimator becomes inefficient, and its RMSE drops below 1 very quickly, as \(\Delta ^*\) moves away from zero. The RMSE of the restricted estimator is depicted by the dashed line in Fig. 1. In the simulation study, the restricted estimator shows similar behaviour for all the error distributions considered.

Tables 1–4 portray similar characteristics of the estimators. Both shrinkage estimators dominate the classical M-estimator, and the positive-rule shrinkage estimator (SM+) dominates the shrinkage estimator. For example, Table 1 presents the RMSEs for \((p_1, p_2) = (3, 5)\) and \(n=30\). For the standard normal error, the gain in risk for the positive-rule shrinkage M-estimator is 3.161 times that of the classical M-estimator provided that the model specification is correct (i.e., \(\Delta ^*=0\)). For the same configuration, when the error distribution is the standard Laplace, the gain in risk for SM+ is 2.273 times that of the unrestricted estimator. Interestingly, for the larger-dimensional case \((p_1, p_2) = (5, 20)\) in Table 4, the gains are much higher, with values of 7.325 and 4.200, respectively, demonstrating the applicability, power, and beauty of the Stein-rule estimators in higher-dimensional cases.

In closing, our numerical results strongly corroborate the theoretical properties of the suggested estimators.

Table 1 RMSE values for restricted, shrinkage, and positive shrinkage M-estimators for (\(p_1, p_2\)) = (3, 5), \(n=30\), based on Huber’s \(\rho \)-function for different error distributions
Table 2 RMSE values for restricted, shrinkage, and positive shrinkage M-estimators for (\(p_1, p_2\)) = (3, 9), \(n=50\), based on Huber’s \(\rho \)-function for different error distributions
Table 3 RMSE values for restricted, shrinkage, and positive shrinkage M-estimators for (\(p_1, p_2\)) = (5, 9), \(n=50\), based on Huber’s \(\rho \)-function for different error distributions
Table 4 RMSE values for restricted, shrinkage, and positive shrinkage M-estimators for (\(p_1, p_2\)) = (5, 20), \(n=50\), based on Huber’s \(\rho \)-function for different error distributions

7 Concluding remarks

In this paper, shrinkage M-estimation strategies in the context of a partially linear regression model are developed. The statistical properties of the shrinkage and positive-rule shrinkage M-estimators are investigated when the sparsity assumption may or may not hold. The expressions for the bias and risk of the estimators are presented in closed form. The relative performance of the estimators is critically examined; the positive-rule shrinkage estimator is found to perform better than the unrestricted estimator. Further, it outshines the restricted estimator except in a small interval where the submodel at hand is assumed to be nearly the true model.

In the simulation study, we numerically compute the relative mean squared errors of the restricted, shrinkage, and positive-rule shrinkage M-estimators with respect to the unrestricted M-estimator. Four different error distributions are considered to study the performance of the proposed estimators. Our numerical results also provide support for the positive-rule shrinkage estimator under varying degrees of model misidentification. The submodel restricted M-estimator outperforms all other estimators when there is sparsity. However, a small departure from this condition makes the restricted estimator very inefficient, questioning its applicability for practical purposes. We suggest using the positive-rule shrinkage M-estimator due to its performance over the entire parameter space.

More importantly, the performance of the positive-rule shrinkage M-estimator is noticeable when \(p_2\) is large. This work can be extended to high-dimensional cases; we refer to Ahmed et al. (2023). We plan to study such extensions in a separate communication.