1 Introduction

Sparse regression models assume that the number of truly relevant predictors, k, is smaller than the number of measured covariates. Hastie et al. (2015) describe a sparse statistical model as one in which only a relatively small number of parameters (or predictors) play an important role, leading to models that are much easier to interpret than dense ones. This type of model has led to a paradigm shift in statistics, since the traditional approach to classical problems such as regression or classification imposes no restrictions when estimating the parameters. In these circumstances, penalized regression estimators are a useful tool when the practitioner is interested in automatic variable selection. We refer to Efron and Hastie (2016) for an overview of adapted inference methods. For instance, the \(\ell _1\) regularization, which is related to the LASSO estimators introduced in Tibshirani (1996), relies on the sparsity principle and is effective for variable selection, but tends to choose too many features. Zou and Hastie (2005) considered an alternative regularization, namely the Elastic Net penalty, which combines both the \(\ell _1\) and \(\ell _2\) norms. Elastic Net preserves the sparsity of LASSO and maintains some of the desirable predictive properties of Ridge regression. Fan and Li (2001) and Zhang (2010) proposed alternative penalties which also lead to sparse estimators.

Logistic regression is a widely studied problem in statistics and has proved useful to classify data. It is well known that in the non-sparse scenario the maximum likelihood estimator (MLE) of the regression coefficients is very sensitive to outliers, meaning that, based on these estimates, we can neither accurately classify a new observation nor identify the covariates that carry important information for the assignment. Robust methods for logistic regression bounding the deviance have been proposed in Bianco and Yohai (1996). In particular, for the family of estimators defined therein, Croux and Haesbroeck (2003) introduced a loss function that guarantees the existence of the resulting robust estimator whenever the maximum likelihood estimator exists. The proposal of Basu et al. (2017), based on minimum divergence, can also be seen as a particular case of the Bianco and Yohai (1996) estimator with a properly defined loss function. Other approaches were given in Cantoni and Ronchetti (2001) and Bondell (2005, 2008). However, none of these methods is reliable under collinearity, and they do not allow for automatic variable selection when only a few covariates are relevant. The previous ideas on regularization can be directly extended to logistic regression.

In the last decade, some robust estimators for logistic regression in the sparse regressors framework have been proposed in the literature. Among others, we can mention Chi and Scott (2014), who considered a least squares estimator with Ridge and Elastic Net penalties, and Kurnaz et al. (2018), who proposed estimators based on a trimmed sum of the deviances with an Elastic Net penalty. It is worth noticing that the least squares estimator in logistic regression corresponds to a particular choice of the loss function considered in Bianco and Yohai (1996). Finally, Tibshirani and Manning (2013) introduced a real-valued shift factor to protect against the possibility of mislabelling, while Park and Konishi (2016) considered a weighted deviance approach with weights based on the Mahalanobis distance computed over a lower-dimensional principal component space and included an Elastic Net penalty. Most of the asymptotic results for robust sparse estimators have been given under the linear regression model (see, for example, Smucler and Yohai 2017) or when considering a convex loss function (see, for instance, van de Geer and Müller 2012). More recently, Avella-Medina and Ronchetti (2018) treated the situation of general penalized M-estimators in shrinking neighbourhoods, when the parameter dimension p is fixed. In this setting, they considered penalties that are a deterministic sum of univariate functions and showed that penalized M-estimators based on loss functions with a bounded derivative behave better in a neighbourhood of the model than the classical oracle estimator. Moreover, they showed that the asymptotic bias of penalized M-estimators is of order \(O(\epsilon )\) in \(\epsilon \)-contamination neighbourhoods.

In this paper, we introduce a general family of robust estimators for sparse logistic regression models that involves both a loss and a weight function to control influential points, together with a general penalty term to produce sparse estimators. In contrast to Avella-Medina and Ronchetti (2018), our approach allows for penalties which may be random and not necessarily a deterministic sum of univariate functions. Random penalties give a more realistic scenario than deterministic ones, since the practitioner usually selects the penalty parameter using a data-driven procedure. Furthermore, they provide a general framework that includes the adaptive LASSO (ADALASSO). At this point, the choice of the penalty does matter. It is worth noticing that, in the objective function defining our estimators, the loss function keeps the terms related to the deviance bounded. For this reason, it seems wise to consider a bounded penalty; otherwise, the regularization term may tend to dominate the minimization problem. In this sense, SCAD or MCP, due to Fan and Li (2001) and Zhang (2010), respectively, are appealing choices. We also consider the Sign penalty as regularization, which is bounded and, unlike SCAD and MCP, does not depend on an extra parameter. This penalty acts like LASSO applied to the direction of the regression vector, which is why it does not shrink the estimated coefficients to 0 as LASSO does. In the framework of sparse representations in signal analysis, the Sign is known as the \(\ell _1/\ell _2\) penalty and some of its algorithmic aspects have been discussed, among others, in Esser et al. (2013), Rahimi et al. (2019) and Wang et al. (2020). Unlike the present paper, these works focus on signal analysis and thus do not study the statistical properties of the related estimators. It is worth mentioning that the Sign penalty cannot be written as a sum of univariate deterministic functions, so the asymptotic properties of the penalized estimators cannot be derived from Theorem 2 in Avella-Medina and Ronchetti (2018). In this sense, our results fill that gap.

A primary focus of this paper is to provide a rigorous theoretical foundation for our approach to robust sparse logistic regression when the dimension of the covariates is fixed. It should be highlighted that a strategy similar to the one proposed herein could be followed in the high-dimensional scenario, as done for robust quasi-likelihood-type estimators in Avella-Medina and Ronchetti (2018). However, when the dimension p increases with the sample size n, particular considerations and developments are required to obtain theoretical properties. This interesting topic is beyond the scope of the present paper and will be the object of future research.

The rest of this paper is organized as follows. In Sect. 2, the robust penalized logistic regression estimators are introduced. In particular, Sect. 2.1 introduces a robust procedure to select the penalty parameter and discusses the importance of considering a bounded loss in the cross-validation criterion. Sections 3 and 4 summarize the asymptotic properties of the proposal. Section 5 reports the results of a Monte Carlo study. In Sect. 6, we present the analysis of a real dataset related to breast cancer diagnosis, while Sect. 7 contains some concluding remarks. Proofs are relegated to the Supplementary file, where we also describe an algorithm to effectively compute the estimators and report some complementary simulation results. The analysis of a dataset related to tomography images is also presented in the online supplement.

2 Robust penalized estimators

Throughout this paper, we consider a logistic regression model, that is, we have a sample of i.i.d. observations \(\left( y_i, {\mathbf {x}}_i \right) \), \(1\le i \le n\), where \({\mathbf {x}}_i \in {\mathbb {R}}^p\) and \(y_i\in \{0,1\}\) is a binary variable such that \(y_i|{\mathbf {x}}_i \sim Bi(1, F({\mathbf {x}}_i ^{\small {\textsc {t}}}{\varvec{\beta }}_0))\), where Bi(1, p) stands for the Bernoulli distribution with success probability p, \(F(t)=\exp (t)\left[ 1+\exp (t)\right] ^{-1}\) is the logistic function and \({\varvec{\beta }}_0 \in {\mathbb {R}}^p\) is the true logistic regression vector.
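For illustration only, a minimal sketch of data generation under this model is given below (in Python with NumPy; the variable names are ours and not part of the paper). It produces a sample in which only the first three coordinates of \({\varvec{\beta }}_0\) are non-null.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 200, 10, 3                      # sample size, number of covariates, active coefficients
beta0 = np.zeros(p)
beta0[:k] = [2.0, -1.5, 1.0]              # true sparse regression vector

F = lambda t: 1.0 / (1.0 + np.exp(-t))    # logistic function F(t) = exp(t) / (1 + exp(t))
X = rng.standard_normal((n, p))           # covariates x_i
y = rng.binomial(1, F(X @ beta0))         # y_i | x_i ~ Bi(1, F(x_i' beta0))
```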

In the non-sparse setting, M-estimators were defined in Bianco and Yohai (1996) and Basu et al. (2017), while in order to obtain bounded influence estimators a weighted version was introduced in Croux and Haesbroeck (2003). For the sake of completeness, we briefly recall their definition. Let \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) be a bounded, differentiable and non-decreasing function with derivative \(\psi = \rho ^{\prime }\) and define

$$\begin{aligned} L_n({\varvec{\beta }}) = \frac{1}{n} \sum _{i = 1}^n \phi (y_i, {\mathbf {x}}_i^{\small {\textsc {t}}}{\varvec{\beta }}) w({\mathbf {x}}_i) \,, \end{aligned}$$
(1)

with

$$\begin{aligned} \phi (y, t)= & {} \rho (d(y, t)) + G(F(t)) + G(1 - F(t)) \,, \end{aligned}$$
(2)

where \(d(y,t) = - \log (F(t)) y - \log (1 - F(t)) (1-y)\) is the deviance function and \(G(t) = \int _0^t \psi (-\log u) \, du\) is the correction factor needed to guarantee Fisher-consistency. The weights \( w({\mathbf {x}}_i)\) are usually based on a robust Mahalanobis distance of the explanatory variables, that is, they depend on the distance between \({\mathbf {x}}_i^{\star }\) and a robust centre of the data, where \({\mathbf {x}}=(1,{\mathbf {x}}^{\star \small {\textsc {t}}})^{\small {\textsc {t}}}\) when an intercept is included in the model and \({\mathbf {x}}={\mathbf {x}}^{\star }\) when no intercept is considered. The weighted M-estimators are then defined as

$$\begin{aligned} {\widehat{{\varvec{\beta }}}}= \mathop {\mathrm{argmin}}_{{{{\varvec{\beta }}}}\in {\mathbb {R}}^p} L_n({\varvec{\beta }})\,. \end{aligned}$$
(3)
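To illustrate the construction of (1)-(3), the following sketch codes the loss \(\phi \) in (2) with \(\rho =\rho _{{\textsc {div}}}\) (the divergence loss recalled below), computing the correction factor \(G\) by numerical integration and using, as a simple example, hard-rejection weights based on a Mahalanobis distance; in practice a robust centre and scatter (e.g. the MCD) would be used, and all function names below are ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import chi2

c = 0.5
rho = lambda t: (1 + 1 / c) * (1 - np.exp(-c * t))   # bounded loss rho_div
psi = lambda t: (1 + c) * np.exp(-c * t)             # psi = rho'
F   = lambda t: 1.0 / (1.0 + np.exp(-t))             # logistic function

def G(u):
    # correction factor G(u) = int_0^u psi(-log s) ds, ensuring Fisher-consistency
    return quad(lambda s: psi(-np.log(s)), 0.0, u)[0] if u > 0 else 0.0

def phi(y, t):
    # phi(y, t) = rho(d(y, t)) + G(F(t)) + G(1 - F(t)), with d the deviance
    d = -y * np.log(F(t)) - (1 - y) * np.log(1 - F(t))
    return rho(d) + G(F(t)) + G(1 - F(t))

def weights(X):
    # hard-rejection weights from a Mahalanobis distance of the covariates
    mu, S = X.mean(axis=0), np.cov(X, rowvar=False)
    diff = X - mu
    md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    return (md2 <= chi2.ppf(0.975, X.shape[1])).astype(float)

def L_n(beta, X, y, w):
    # objective (1): average of phi(y_i, x_i' beta) * w(x_i)
    t = X @ beta
    return np.mean([phi(yi, ti) * wi for yi, ti, wi in zip(y, t, w)])
```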

As with the maximum likelihood estimator, the weighted M-estimators do not lead to sparse estimates. This entails that they do not allow for variable selection and may perform poorly in terms of robustness and efficiency. In this setting, a usual way to improve the behaviour of existing estimators is to include a regularization term that penalizes candidates with many nonzero components. The penalized estimators are defined as

$$\begin{aligned} {\widehat{{\varvec{\beta }}}}_n = \mathop {\mathrm{argmin}}_{{{{\varvec{\beta }}}}\in {\mathbb {R}}^p} \frac{1}{n} \sum _{i = 1}^n \phi (y_i, {\mathbf {x}}_i^{\small {\textsc {t}}}{\varvec{\beta }})\, w({\mathbf {x}}_i) + I_{\lambda _n}({\varvec{\beta }}) = \mathop {\mathrm{argmin}}_{{{{\varvec{\beta }}}}\in {\mathbb {R}}^p} L_n({\varvec{\beta }}) + I_{\lambda _n}({\varvec{\beta }}) , \end{aligned}$$
(4)

where \( L_n({\varvec{\beta }})\) is given in (1), \(\phi \) is defined in (2) and \(I_{\lambda _n}({\varvec{\beta }})\) is a penalty function, chosen by the user, depending on a tuning parameter \(\lambda _n\) which controls the complexity of the estimated logistic regression model. When the model contains an intercept, it is usually not penalized. For that reason and for the sake of simplicity, when deriving the asymptotic properties of the estimators, we will assume that the model has no intercept. If the penalty function is properly chosen, the penalized M-estimator defined in (4) will lead to sparse models.
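For small p, a naive way to compute (4) is to minimize the penalized objective directly; the algorithm actually used in the paper is described in the Supplementary file, so the following derivative-free sketch (relying on `L_n` and `weights` from the previous snippet, and on an arbitrary penalty function) is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_fit(X, y, lam, penalty, beta_init=None):
    # minimizes L_n(beta) + I_lambda(beta) by a derivative-free search (small p only)
    w = weights(X)
    b0 = np.zeros(X.shape[1]) if beta_init is None else beta_init
    obj = lambda b: L_n(b, X, y, w) + penalty(b, lam)
    res = minimize(obj, b0, method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-8})
    return res.x

# example penalty: LASSO, I_lambda(beta) = lambda * ||beta||_1
lasso = lambda b, lam: lam * np.sum(np.abs(b))
```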

It is worth noticing that the estimators introduced in (4) represent a wide family which includes the M-estimators defined in Bianco and Yohai (1996), obtained by taking \(w({\mathbf {x}})=1\) and \(I_{\lambda _n}({\varvec{\beta }})=0\). In particular, the penalized maximum likelihood estimator corresponds to \(\rho (t)=t\), which is not bounded, while a penalized version of the minimum divergence estimators defined in Basu et al. (2017) is obtained by taking \(\rho (t)= \rho _{{\textsc {div}}}(t)= (1+1/c)\{1-\exp (-ct)\}\). From now on, we denote \(\Vert {\varvec{\beta }}\Vert _q^q=\sum _{j=1}^p |\beta _j|^q\), for \(q>0\). The estimators defined in Chi and Scott (2014) belong to the family (4) just by choosing \(\rho (t)= 1-\exp (-t)\) and \(I_{\lambda }({\varvec{\beta }})=\lambda \left( \theta \Vert {\varvec{\beta }}\Vert _1+[({1-\theta })/{2}]\Vert {\varvec{\beta }}\Vert _2^2\right) \), with \(\theta \in [0,1]\), i.e. the Elastic Net penalty. Note that Elastic Net reduces to the LASSO penalty for \(\theta =1\) and to the Ridge penalty for \(\theta =0\). The main drawbacks of this penalization are that it introduces an extra parameter that must be chosen in addition to the penalty factor \(\lambda \) and that it produces estimators of the non-null components with a large bias.

Another penalty considered in the linear regression model is the Bridge penalty, introduced in Frank and Friedman (1993) and defined as \(I_{\lambda }({\varvec{\beta }})=\lambda \Vert {\varvec{\beta }}\Vert _q^q\). For linear models, the Bridge penalty leads to sparse estimates when \(0< q < 1\). Zou (2006) has shown that LASSO may not be an oracle procedure for linear regression models and introduced the adaptive LASSO, built from an initial consistent estimator \({\widetilde{{\varvec{\beta }}}}\). The penalty function for the ADALASSO estimator is chosen as \(I_{\lambda }({\varvec{\beta }})=\lambda I^{\star }({\varvec{\beta }})\), where \(I^{\star }({\varvec{\beta }})\) is a random function defined as

$$\begin{aligned} I^{\star }({\varvec{\beta }}) = \sum _{j=1}^p \frac{|\beta _j|}{|{\widetilde{\beta }}_j|^{\gamma }}\,, \end{aligned}$$
(5)

for some \(\gamma >0\), where we understand that \(|\beta _j|/|{\widetilde{\beta }}_j|^{\gamma }=\infty \) if \(|{\widetilde{\beta }}_j|=0\) but \(|\beta _j|\ne 0\), while \(|\beta _j|/|{\widetilde{\beta }}_j|^{\gamma }=0\) if \(|{\widetilde{\beta }}_j|=|\beta _j|= 0\). If a robust penalized procedure using ADALASSO is sought, then, in order to preserve the robustness of the final estimator, \({\widetilde{{\varvec{\beta }}}}\) can be chosen as the non-penalized robust estimator, that is, the minimizer of \(L_n({\varvec{\beta }})\).
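A direct transcription of (5), together with the conventions above for null coordinates of \({\widetilde{{\varvec{\beta }}}}\), might look as follows (a sketch; `beta_tilde` stands for the initial robust estimator).

```python
import numpy as np

def adalasso_penalty(beta, beta_tilde, lam, gamma=1.0):
    # I_lambda(beta) = lam * sum_j |beta_j| / |tilde beta_j|^gamma,
    # with |beta_j|/0 = infinity when beta_j != 0 and 0/0 = 0
    pen = 0.0
    for bj, tj in zip(beta, beta_tilde):
        if tj == 0.0:
            if bj != 0.0:
                return np.inf
            continue
        pen += abs(bj) / abs(tj) ** gamma
    return lam * pen
```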

A distinguishing feature of logistic regression is that the response variable is bounded. This implies that, when considering the penalized least squares estimator, the first term in (4) is always bounded; hence, the penalty term may dominate the behaviour of the objective function unless the regularization function is also bounded.

This is the reason why we will also consider bounded penalties such as the SCAD penalty defined in Fan and Li (2001) as

$$\begin{aligned} I_{\lambda }({\varvec{\beta }}) =&\sum _{j = 1}^p \lambda |\beta _j|\; \mathbf{1 }_{\{|\beta _j| \le \lambda \}}+\sum _{j = 1}^p \frac{a \lambda |\beta _j| - 0.5(\beta _j^2 + \lambda ^2)}{a- 1}\; \mathbf{1 }_{\{\lambda < |\beta _j| \le a\lambda \}}\, \\&+ \sum _{j = 1}^p \frac{\lambda ^2(a^2 - 1)}{2(a- 1)}\; \mathbf{1 }_{\{|\beta _j| >a \lambda \}}\,, \end{aligned}$$

for \(a>2\), where \(\mathbf{1 }_A\) is the indicator function of the set A, and the MCP penalty proposed by Zhang (2010) in the linear regression model, which is given by

$$\begin{aligned} I_{\lambda }({\varvec{\beta }}) = \sum _{j = 1}^p \left[ \left( \lambda |\beta _j| - \frac{\beta _j^2}{2 \, a}\right) \, \mathbf{1 }_{\{|\beta _j| \le a \, \lambda \}} + \frac{1}{2} \, a \, \lambda ^2\, \mathbf{1 }_{\{|\beta _j| > a \, \lambda \}}\right] . \end{aligned}$$
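Coordinate-wise, both bounded penalties can be coded directly from the displays above; the sketch below (with `a` the extra tuning constant of each penalty and the default values only illustrative) is a straightforward transcription.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    # SCAD penalty of Fan and Li (2001), a > 2, applied coordinate-wise and summed
    b = np.abs(np.asarray(beta, dtype=float))
    small  = b <= lam
    middle = (b > lam) & (b <= a * lam)
    large  = b > a * lam
    return np.sum(lam * b * small
                  + (a * lam * b - 0.5 * (b ** 2 + lam ** 2)) / (a - 1) * middle
                  + lam ** 2 * (a ** 2 - 1) / (2 * (a - 1)) * large)

def mcp_penalty(beta, lam, a=3.0):
    # MCP of Zhang (2010): quadratic spline up to a*lam, constant a*lam^2/2 afterwards
    b = np.abs(np.asarray(beta, dtype=float))
    inner = b <= a * lam
    return np.sum((lam * b - b ** 2 / (2 * a)) * inner + 0.5 * a * lam ** 2 * (~inner))
```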

Furthermore, a main objective in a sparse setting is variable selection, that is, to identify the variables associated with non-null coefficients. Hence, determining which coefficients \(\beta _j \) are non-null is more relevant than determining their size. For that purpose, we also consider a penalty that shrinks the coefficients by pulling the vector \({\varvec{\beta }}\) to the unit Euclidean ball before applying a LASSO penalty. This results in the so-called Sign penalty, also known as the \(\ell _1/\ell _2\) penalization in signal analysis, which is defined as

$$\begin{aligned} I_{\lambda }({\varvec{\beta }})=\lambda \frac{\Vert {\varvec{\beta }}\Vert _1}{\Vert {\varvec{\beta }}\Vert _2}\mathbf{1 }_{{{{\varvec{\beta }}}}\ne \mathbf{{0}}}=\lambda \Vert s({\varvec{\beta }})\Vert _1\mathbf{1 }_{{{{\varvec{\beta }}}}\ne \mathbf{{0}}}\,, \end{aligned}$$

where \(s({\varvec{\beta }})={\varvec{\beta }}/\Vert {\varvec{\beta }}\Vert _2\) is the sign function. In multivariate analysis, the sign function has been extensively considered to construct robust estimators. To the best of our knowledge, this paper is the first to derive the asymptotic properties of penalized estimators based on \(s({\varvec{\beta }})\). Note that the Sign penalty works like LASSO over all unit vectors and, in this sense, it enables the selection of a direction, more than raw variable selection. The Sign penalty produces a thresholding rule, that is, it estimates some coefficients as exactly zero. It attains its minimum when only one component of \({\varvec{\beta }}\) is nonzero and its maximum when all its components are equal and different from zero. Two important features of this penalty are that it is scale invariant, so it does not shrink the estimated coefficients as the Elastic Net penalty does, and that it does not require the selection of an extra parameter, as SCAD and MCP do.
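The Sign penalty and its scale invariance can be checked numerically in a few lines (a sketch with illustrative values).

```python
import numpy as np

def sign_penalty(beta, lam):
    # I_lambda(beta) = lam * ||beta||_1 / ||beta||_2 = lam * ||s(beta)||_1, for beta != 0
    beta = np.asarray(beta, dtype=float)
    norm2 = np.linalg.norm(beta)
    return 0.0 if norm2 == 0.0 else lam * np.sum(np.abs(beta)) / norm2

beta = np.array([2.0, -1.0, 0.0, 0.5])
print(sign_penalty(beta, lam=0.1))        # some value between lam and lam * sqrt(p)
print(sign_penalty(10 * beta, lam=0.1))   # identical value: the penalty is scale invariant
```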

2.1 Selection of the penalty parameter

As it is well known, the selection of the penalty parameter is an important practical issue when fitting sparse models, since in some sense it tunes the complexity of the model. This problem has been discussed, among others, in Efron et al. (2004), Meinshausen (2007) and Chi and Scott (2014). In this paper, a robust K-fold criterion is used to select the penalty parameter.

As usual, first randomly split the dataset into K disjoint subsets of approximately equal sizes, with indices \({\mathcal {C}}_j\), \(1 \le j \le K\), the j-th subset having size \(n_j\ge 2\), so that \(\bigcup _{j=1}^K {\mathcal {C}}_j = \{ 1, \ldots , n \}\) and \(\sum _{j=1}^K n_j=n\). Let \({\widetilde{\varLambda }}\subset {\mathbb {R}}\) be the set of possible values for \(\lambda \) to be considered, and let \({\widehat{{\varvec{\beta }}}}_{\lambda }^{(j)}\) be an estimator of \({\varvec{\beta }}_0\), computed with penalty parameter \(\lambda \in {\widetilde{\varLambda }}\) and without using the observations with indices in \({\mathcal {C}}_j\). For each \(i=1,\dots , n\), the prediction residuals \({\widehat{d}}_{i,\lambda }\) are \( {\widehat{d}}_{i,\lambda } \, = \, d(y_i, {\mathbf {x}}_i^{\small {\textsc {t}}}{\widehat{{\varvec{\beta }}}}_{\lambda }^{(j)} )\), for \( i \in {\mathcal {C}}_j \) and \( j = 1, \, \ldots , K\). The classical cross-validation criterion constructs adaptive data-driven estimators by minimizing

$$\begin{aligned} CV(\lambda )=\frac{1}{n} \sum _{i=1}^n{\widehat{d}}_{i,\lambda }\,, \end{aligned}$$
(6)

an objective function that is usually employed for the classical estimators which minimize the deviance. However, this criterion is very sensitive to the presence of outliers. In fact, even when \({\varvec{\beta }}_0\) is estimated by means of a robust method, the traditional cross-validation criterion may lead to poor variable selection results since atypical data may have large prediction residuals that could be very influential on \(CV(\lambda )\). To overcome this problem, when using robust estimators, it seems natural to use the same loss function \(\phi \) as in (4). Hence, the robust cross-validation criterion selects the penalty parameter by minimizing over \({\widetilde{\varLambda }}\)

$$\begin{aligned} RCV(\lambda )=\frac{1}{n} \sum _{1\le j\le K} \sum _{i \in {\mathcal {C}}_j} \phi (y_i, {\mathbf {x}}_i^{\small {\textsc {t}}}{\widehat{{\varvec{\beta }}}}_{\lambda }^{(j)} )\,w({\mathbf {x}}_i)\,. \end{aligned}$$
(7)

The particular case \(K=n\) leads to leave-one-out cross-validation, which is a popular choice with a more expensive computational cost. In Section S.8.1 of the supplementary material, we illustrate, through a numerical example, the importance of considering a bounded loss in the cross-validation criterion when selecting the penalty parameter in order to achieve reliable predictions.
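A sketch of the criterion (7) is given below, assuming a fitting routine such as `penalized_fit` and the functions `phi` and `weights` from the sketches above; it returns the value of \(\lambda \) in the grid minimizing \(RCV(\lambda )\).

```python
import numpy as np

def robust_cv(X, y, lambdas, penalty, K=5, seed=0):
    # selects lambda by minimizing RCV(lambda) in (7), based on the bounded loss phi
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % K   # K folds of (almost) equal size
    w = weights(X)
    scores = []
    for lam in lambdas:
        total = 0.0
        for j in range(K):
            test, train = folds == j, folds != j
            beta_hat = penalized_fit(X[train], y[train], lam, penalty)  # fit without fold j
            t = X[test] @ beta_hat
            total += np.sum([phi(yi, ti) * wi
                             for yi, ti, wi in zip(y[test], t, w[test])])
        scores.append(total / n)
    best = int(np.argmin(scores))
    return lambdas[best], np.array(scores)
```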

3 Consistency and order of convergence

In this section, we study the asymptotic behaviour of the estimators defined in (4) when p is fixed. Even though we are mainly concerned with bounded penalties, our results are general and include among others the Bridge and Elastic Net penalties.

3.1 Assumptions

When considering the function \(\phi \) given in (2), the following assumptions on the loss function \(\rho \) are needed.

R1:

\(\rho : {\mathbb {R}}_{\ge 0} \rightarrow {\mathbb {R}}\) is a bounded, continuously differentiable function with bounded derivative \(\psi \) and \(\rho (0) = 0\).

R2:

\(\psi (t) \ge 0\) and there exists some \(c \ge \log 2\) such that \(\psi (t) > 0\) for all \(0< t < c\).

R3:

\(\rho \) is twice continuously differentiable with bounded derivatives, i.e. \(\psi \) and \(\psi ^{\prime } = \rho ^{\prime \,\prime }\) are bounded.

Remark 1

Note that for the function \(\phi (y,t)\) defined in (2), \(\varPsi (y,t) = {\partial } \phi (y,t)/{\partial t}= -[y-F(t)] \nu (t)\) with \(\nu (t)\) given by

$$\begin{aligned} \nu (t)= \psi \left( -\log F(t)\right) \left[ 1- F(t)\right] + \psi \left( -\log \left[ 1- F(t)\right] \right) F(t) \,. \end{aligned}$$
(8)

Further, under R1 and R2, the function \(\varPsi (y, \cdot )\) is continuous and strictly positive.

Denote as \(\chi (y,t)= \partial \varPsi (y,t)/\partial t= F(t)(1-F(t))\nu (t) -(y-F(t))\nu ^{\prime }(t)\) and note that \(\chi (0,s) = \chi (1,-s)\). The function \(\chi (y,t)\) always exists for the minimum divergence estimators and is well defined for any function \(\rho \) satisfying  R3.

It is worth noticing that when \(\psi (t) > 0\) for all t, the constant c in R2 may be taken as \(\infty \). For instance, this happens when choosing the loss function \(\rho =\rho _{{\textsc {div}}}\) related to the divergence estimators or the function \(\rho =\rho _c\), with \(c>0\), defined as

$$\begin{aligned} \rho _c\left( t\right) =\left\{ \begin{array}{ll} te^{-\sqrt{c}} &{} \hbox {if } \,\, t\le c\\ -2e^{-\sqrt{t}}\left( 1+\sqrt{t}\right) +e^{-\sqrt{c}}\left( 2\left( 1+\sqrt{c}\right) + c \right) &{} \hbox {if } \,\, t>c \, , \end{array} \right. \end{aligned}$$
(9)

which has been introduced in Croux and Haesbroeck (2003) to ensure the existence of the M-estimators under the same conditions that guarantee existence of the maximum likelihood estimators. Moreover, when considering the penalized minimum divergence estimators, \(\rho \) automatically satisfies conditions R1, R2 and R3.
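For reference, a sketch of the loss (9) and of its derivative is given below; the derivative for \(t>c\) follows by differentiating the second branch and simplifies to \(e^{-\sqrt{t}}\), so that \(\psi \) is continuous and strictly positive.

```python
import numpy as np

def rho_c(t, c=0.5):
    # Croux and Haesbroeck (2003) loss, eq. (9): linear up to c, bounded afterwards
    t = np.asarray(t, dtype=float)
    tail = (-2 * np.exp(-np.sqrt(t)) * (1 + np.sqrt(t))
            + np.exp(-np.sqrt(c)) * (2 * (1 + np.sqrt(c)) + c))
    return np.where(t <= c, t * np.exp(-np.sqrt(c)), tail)

def psi_c(t, c=0.5):
    # derivative of rho_c: exp(-sqrt(c)) on [0, c] and exp(-sqrt(t)) for t > c
    t = np.asarray(t, dtype=float)
    return np.where(t <= c, np.exp(-np.sqrt(c)), np.exp(-np.sqrt(t)))
```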

For the results in this section, the following assumptions regarding the distribution of \({\mathbf {x}}\) are needed.

H1:

For all \({\varvec{\alpha }}\in {\mathbb {R}}^p\), \({\varvec{\alpha }}\ne \mathbf{{0}}\), we have \({\mathbb {P}}({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\alpha }}= 0) =0\).

H2:

w is a non-negative bounded function with support \({\mathcal {C}}_w\) such that \({\mathbb {P}}({\mathbf {x}}\in {\mathcal {C}}_w)>0\). Without loss of generality, we assume that \(\Vert w\Vert _{\infty }=1\).

H3:

\({\mathbb {E}}[w({\mathbf {x}})\Vert {\mathbf {x}}\Vert ^2] < \infty \).

H4:

The matrix \({\mathbf {A}}={\mathbb {E}}\left( F({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0)\left[ 1-F({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0)\right] \nu ({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0) \, w({\mathbf {x}}) \,{\mathbf {x}}{\mathbf {x}}^{\small {\textsc {t}}}\right) \), where \(\nu (t)\) is defined in (8), is non-singular.

Remark 2

Assumptions H1 and H2 entail that the estimators defined in (3) are Fisher-consistent and will allow us to derive consistency results for the estimators defined in (4). H1 holds, for instance, when \({\mathbf {x}}\) has a density with support \({\mathcal {S}}\) such that \({\mathcal {S}}\cap {\mathcal {C}}_w\ne \emptyset \). In fact, the weaker assumption \({\mathbb {P}}(\{{\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\alpha }}= 0\} \cup \{w({\mathbf {x}})=0\}) < 1\) for any \({\varvec{\alpha }}\ne \mathbf{{0}}\) is enough to obtain Fisher-consistency. However, in order to ensure consistency, a stronger requirement is needed to guarantee that the infimum is not attained at infinity. It is worth noticing that H1 and H2 entail that \({\mathbb {E}}[w({\mathbf {x}})\,{\mathbf {x}}{\mathbf {x}}^{\small {\textsc {t}}}]\) is a positive definite matrix. Furthermore, when considering the minimum divergence estimators the matrix \({\mathbf {A}}\) is non-singular, since \({\mathbb {P}}(\nu ({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0) >0)=1 \), so H4 holds. Similarly, when \({\mathbb {P}}({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\alpha }}= 0) < 1\) for any \({\varvec{\alpha }}\ne \mathbf{{0}}\), and \(\phi \) is given by (2) with \(\psi (t)>0\) for all t, as is the case with the loss function introduced in Croux and Haesbroeck (2003), \({\mathbf {A}}\) is non-singular. On the other hand, when R2 holds for some finite positive constant \(c \ge \log 2\), \({\mathbf {A}}\) is positive definite when H1 holds. Moreover, defining \(\varUpsilon (t)= F(t) (1-F(t))\nu (t)\), straightforward arguments show that \({\mathbf {A}}\) is also non-singular when \({\mathbb {P}}({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\alpha }}= 0) < 1\) holds, for any \({\varvec{\alpha }}\ne \mathbf{{0}}\), and at least one of the following conditions is fulfilled: a) the function \({\mathbb {E}}[w({\mathbf {x}}){\mathbf {x}}{\mathbf {x}}^{\small {\textsc {t}}}\mathbf{1 }_{\varUpsilon ({\mathbf {x}}^{\small {\textsc {t}}}{{{\varvec{\beta }}}}_0) \ge \eta }]\) is continuous in \(\eta \) or b) there exists some \(c > 0\) such that \({\mathbb {P}}(\varUpsilon ({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0) > c) = 1\).

Remark 3

It is worth mentioning that assumption H3 is weaker than Condition 3 in Avella-Medina and Ronchetti (2018), while condition H4 is equivalent to the non-singularity requirement in Condition 2 therein. Regarding Condition 1 of Avella-Medina and Ronchetti (2018), the Fisher-consistency is automatically fulfilled due to the correction factor \(G(\cdot )\). Furthermore, instead of the uniformity condition required by those authors, we only require continuity and boundedness of the function \(\psi \). Note that when \(w\equiv 1\) and the covariates are bounded, or when considering hard rejection weights, their Condition 1 is satisfied.

3.2 Consistency and rate of convergence

The next theorem states the strong consistency of the estimators defined in (4), when considering as function \(\phi \) the function controlling large values of the deviance residuals given in (2).

Theorem 1

Let \(\phi :{\mathbb {R}}^2\rightarrow {\mathbb {R}}\) be the function given in (2), where the function \(\rho \) satisfies  R1 and R2. Then, if \(I_{\lambda _n}({\varvec{\beta }}_0) \buildrel {a.s.}\over \longrightarrow 0\) when \(n \rightarrow \infty \) and  H1 and H2 hold, we have that the estimator \({\widehat{{\varvec{\beta }}}}_n\) defined in (4) is strongly consistent for \({\varvec{\beta }}_0\).

It is worth noticing that, in Theorem  1, the penalty function \(I_{\lambda _n}\) may be deterministic or random, since the only requirement is that \(I_{\lambda _n}({\varvec{\beta }}_0) \buildrel {a.s.}\over \longrightarrow 0 \). In particular, for the penalties LASSO, Sign, Ridge, Bridge, SCAD and MCP described in Sect. 2 this condition holds when \(\lambda _n \buildrel {a.s.}\over \longrightarrow 0\). Moreover, for the ADALASSO penalty, the condition \(I_{\lambda _n}({\varvec{\beta }}_0) \buildrel {a.s.}\over \longrightarrow 0 \) is fulfilled when the initial estimator \({\widetilde{{\varvec{\beta }}}}\) is consistent and \(\lambda _n \buildrel {a.s.}\over \longrightarrow 0\).

In order to prove the \(\sqrt{n}\)-consistency of the proposed estimators, we need the following assumption on the penalty function. From now on, \({\mathcal {B}}({\varvec{\beta }},\epsilon )\) stands for the closed ball, with respect to the usual \(\Vert \cdot \Vert _2\) norm, centred at \({\varvec{\beta }}\) with radius \(\epsilon \), i.e. \({\mathcal {B}}({\varvec{\beta }},\epsilon )=\{{\mathbf {b}}\in {\mathbb {R}}^p: \Vert {\mathbf {b}}-{\varvec{\beta }}\Vert _2\le \epsilon \}\).

P1:

\(I_{\lambda }({\varvec{\beta }})/\lambda \) is Lipschitz in a neighbourhood of \({\varvec{\beta }}_0\), that is, there exist \(\epsilon > 0\) and a constant K, which does not depend on \(\lambda \), such that if \({\varvec{\beta }}_1, {\varvec{\beta }}_2 \in {\mathcal {B}}({\varvec{\beta }}_0,\epsilon )\) then \(|I_{\lambda }({\varvec{\beta }}_1) - I_{\lambda }({\varvec{\beta }}_2)| \le \lambda K\Vert {\varvec{\beta }}_1 - {\varvec{\beta }}_2\Vert _1 \).

Remark 4

Note that penalties Ridge, Elastic Net, SCAD and MCP satisfy   P1, since \(\Vert {\varvec{\beta }}\Vert _2\le \Vert {\varvec{\beta }}\Vert _1\le \sqrt{p}\, \Vert {\varvec{\beta }}\Vert _2\). Furthermore, the Sign penalty also satisfies  P1 if \(\Vert {\varvec{\beta }}_0\Vert _2\ne 0\). Moreover, if \(I_{\lambda }({\varvec{\beta }}) = \lambda \, \sum _{\ell =1}^p J_{\ell }(|\beta _{\ell }|) \), where \( J_{\ell }(\cdot )\) is a continuously differentiable function, then \(I_{\lambda }\) satisfies  P1, which implies that the Bridge penalty satisfies  P1 for \(q\ge 1\).

Theorem 2

Let \({\widehat{{\varvec{\beta }}}}_n\) be the estimator defined in (4) with \(\phi (y,t)\) given in (2), where the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfies  R3. Furthermore, assume that \({\widehat{{\varvec{\beta }}}}_n \buildrel {p}\over \longrightarrow {\varvec{\beta }}_0\) and that assumptions H2 to H4 hold.

  1. (a)

    If assumption  P1 holds, \(\Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}(\lambda _n\,+\,1/\sqrt{n})\). Hence, if \(\lambda _n = O_{\mathbb {P}}(1/\sqrt{n})\), we have that \(\Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}(1/\sqrt{n})\), while if \(\lambda _n \sqrt{n}\rightarrow \infty \), \(\Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}(\lambda _n)\).

  2. (b)

    Suppose \(I_{\lambda _n}({\varvec{\beta }}) = \sum _{\ell = 1}^p J_{\ell ,\lambda _n}(|\beta _{\ell }|)\) where the functions \(J_{\ell ,\lambda _n}(\cdot )\) are twice continuously differentiable in \((0, \infty )\), take non-negative values, \(J^{\prime }_{\ell ,\lambda _n}(|\beta _{0,\ell }|)\ge 0 \) and \(J_{\ell ,\lambda _n}(0) = 0\), for all \(1\le \ell \le p\). Let

    $$\begin{aligned} a_n = \max \, \left\{ J^{\prime }_{\ell ,\lambda _n}(|\beta _{0,\ell }|) : 1 \le \ell \le p \;\; \text {and} \;\; \beta _{0, \ell } \ne 0 \right\} \quad \text {and} \quad \alpha _n = \frac{1}{\sqrt{n}} + a_n. \end{aligned}$$

    In addition, assume that there exists some \(\delta > 0\) such that

    $$\begin{aligned} \sup \{|J_{\ell ,\lambda _n}^{\prime \,\prime }(|\beta _{0,\ell }| + \tau \delta )| : \tau \in [-1,1] \;, \; 1 \le \ell \le p \;\; \text {and} \;\; \beta _{0,\ell } \ne 0 \} \buildrel {p}\over \longrightarrow 0. \end{aligned}$$

    Then, \(\Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2 = O_{{\mathbb {P}}}(\alpha _n)\).

Remark 5

Theorem 2(a) shows that, when the penalty satisfies assumption P1, the estimator rate of convergence depends on the convergence rate of \(\lambda _n\) to 0. In particular, if \(\lambda _n \sqrt{n}\) is bounded in probability, then the robust penalized consistent estimator has rate \(\sqrt{n}\), while if \(\lambda _n \sqrt{n}\rightarrow \infty \), the convergence rate of \({\widehat{{\varvec{\beta }}}}_n\) is slower than \(\sqrt{n}\). This result is analogous to the one obtained, under a linear regression model, in Zou (2006) for the penalized least squares estimator when a LASSO penalty is considered. Note that, for the LASSO penalty, the convergence rates obtained in (a) and (b) are equal since \(J_{\ell ,\lambda _n}(v)=\lambda _n\, v\), for any \(1\le \ell \le p\), which entails that \(a_n=\lambda _n\) and, for any \(\beta _{0,\ell } \ne 0\) and \(\tau \in [-1,1]\), \(J_{\ell ,\lambda _n}''(|\beta _{0,\ell }| + \tau \delta )=0\) for a small enough \(\delta >0\).

Penalties SCAD and MCP are not only Lipschitz, but also based on univariate twice continuously differentiable functions \(J_{\ell ,\lambda _n}(t)=J_{\lambda _n}(t)\), for all \(1\le \ell \le p\), satisfying the requirements of Theorem 2(b) when \(\lambda _n \rightarrow 0\). Indeed, for these penalties \(J'_{\lambda _n}(t)\) and \(J_{\lambda _n}''(t)\) are 0 if \(t>\,a\,\lambda _n\), where a is their second tuning constant, which is assumed to be fixed. Hence, if \(\lambda _n \buildrel {p}\over \longrightarrow 0\), for any \(\delta >0\) there exists \(n_0\) such that, for any \(n\ge n_0\), we have that \({\mathbb {P}}(a \lambda _n < m_{0})> 1-\delta \) with \(m_0=\min \{|\beta _{0,\ell }| : 1 \le \ell \le p \;\; \text {and} \;\; \beta _{0, \ell } \ne 0 \}\). Thus, for \(n\ge n_0\), with probability larger than \(1-\delta \), \(a_n = 0\) and the supremum in Theorem 2(b) equals 0; therefore, \(\alpha _n=O_{{\mathbb {P}}}(1/{\sqrt{n}})\), implying that the root-n rate may be achieved assuming only that \(\lambda _n \buildrel {p}\over \longrightarrow 0\). It is worth noticing that, even though the Ridge penalty is Lipschitz and also based on univariate twice continuously differentiable functions, \(J^{\prime }_{\lambda _n}(|\beta _{0,\ell }|)=\lambda _n |\beta _{0,\ell }|\), so that \(a_n= O(\lambda _n)\) and \(\alpha _n = O(1/\sqrt{n} +\lambda _n)\), leading to a root-n consistency rate only under the additional requirement \(\lambda _n = O_{\mathbb {P}}(1/\sqrt{n})\). The different behaviour of estimators based on Lipschitz penalties and of those based on twice continuously differentiable penalties whose first derivative vanishes for n large enough plays an important role regarding the variable selection properties of the procedure.

Furthermore, when considering the ADALASSO estimators, root-n estimators are obtained when the initial estimator \({\widetilde{{\varvec{\beta }}}}\) is consistent and \(\sqrt{n}\lambda _n =O_{{\mathbb {P}}}(1)\), since in this case \(a_n=\lambda _n \max _{j\in {\mathcal {A}}} |{\widetilde{\beta }}_j|^{-\gamma }\), with \({\mathcal {A}}=\{j: \beta _{0,j}\ne 0\}\). In particular, for deterministic penalty parameters, this result holds if \(\sqrt{n}\lambda _n\rightarrow 0\), in concordance with Theorem 2 of Zou (2006).

4 Asymptotic distribution results

The first result in this section concerns the variable selection properties for our estimator. As shown below, the result depends on the behaviour of the penalty function. Without loss of generality, assume that \({\varvec{\beta }}_0 = ({\varvec{\beta }}_{0,A}^{\small {\textsc {t}}}, \mathbf{{0}}_{p-k}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\) and \({\varvec{\beta }}_{0,A} \in {\mathbb {R}}^k\), \(k\ge 1\), is the subvector with active coordinates of \({\varvec{\beta }}_0\) (i.e. the subvector of nonzero elements of \({\varvec{\beta }}_0\)). We will make use of the notation \({\varvec{\beta }}= ({\varvec{\beta }}_A^{\small {\textsc {t}}},{\varvec{\beta }}_B^{\small {\textsc {t}}})^{\small {\textsc {t}}}\), where \({\varvec{\beta }}_A \in {\mathbb {R}}^k\) with \(k\ge 1\) and \({\varvec{\beta }}_B \in {\mathbb {R}}^{p-k}\).

When the estimator automatically selects variables, we will be able to show an oracle property, that is, that the penalized M-estimator of the non-null components of \({\varvec{\beta }}_0\), \({\widehat{{\varvec{\beta }}}}_{n,A}\), has the same asymptotic distribution as that of the estimator obtained assuming that the last components of \({\varvec{\beta }}_0\) are equal to 0 and using this restriction in the logistic regression model. It is worth noticing that in the non-sparse scenario, the asymptotic behaviour of the estimators \({\widehat{{\varvec{\beta }}}}\) defined in (3) has been studied in Bianco and Martínez (2009), while Basu et al. (2017) consider the particular case of the minimum divergence estimators and \(w({\mathbf {x}}) \equiv 1\). More precisely, the above-mentioned authors have shown that \(\sqrt{n}({\widehat{{\varvec{\beta }}}}-{\varvec{\beta }}_0) \buildrel {D}\over \longrightarrow N_p(\mathbf{{0}}, {\varvec{\varSigma }})\) with \({\varvec{\varSigma }}={\mathbf {A}}^{-1}{\mathbf {B}}{\mathbf {A}}^{-1}\), where

$$\begin{aligned} {\mathbf {B}}= & {} {\mathbb {E}}\left( F({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0)\left[ 1-F({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0)\right] \nu ^2({\mathbf {x}}^{\small {\textsc {t}}}{\varvec{\beta }}_0) \, w^2({\mathbf {x}})\, {\mathbf {x}}{\mathbf {x}}^{\small {\textsc {t}}}\right) \,. \end{aligned}$$
(10)

with \(\nu (t)\) defined in (8) and the matrix \({\mathbf {A}}\) given in assumption H4.
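Plug-in estimates of \({\mathbf {A}}\), \({\mathbf {B}}\) and of the sandwich covariance \({\varvec{\varSigma }}\) can be obtained by replacing the expectations with sample averages evaluated at a fitted \({\widehat{{\varvec{\beta }}}}\); a sketch, reusing `F`, `psi` and `weights` from the snippets in Sect. 2, is given below.

```python
import numpy as np

def nu(t):
    # nu(t) in (8), with psi the derivative of the chosen bounded loss
    Ft = F(t)
    return psi(-np.log(Ft)) * (1 - Ft) + psi(-np.log(1 - Ft)) * Ft

def sandwich_cov(X, beta_hat):
    # plug-in estimates of A (assumption H4) and B in (10), and Sigma = A^{-1} B A^{-1}
    w = weights(X)
    t = X @ beta_hat
    Ft = F(t)
    a_i = Ft * (1 - Ft) * nu(t) * w            # per-observation weights entering A
    b_i = Ft * (1 - Ft) * nu(t) ** 2 * w ** 2  # per-observation weights entering B
    A = (X * a_i[:, None]).T @ X / len(t)
    B = (X * b_i[:, None]).T @ X / len(t)
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```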

For the sake of simplicity, throughout this section, we will assume that the parameter \(\lambda _n\) is deterministic. Similar results may be obtained when the penalty parameter is random. However, we also admit \(I_{\lambda }({\varvec{\beta }})\) to be random, so in Sect. 4.1, we will treat separately the case in which \(I_{\lambda } ({\varvec{\beta }})\) is a deterministic or random function, leading to Theorems 3 and 4, respectively.

4.1 Variable selection property

Theorem 3

Let \({\widehat{{\varvec{\beta }}}}_n = ({\widehat{{\varvec{\beta }}}}_{n,A}^{\small {\textsc {t}}}, {\widehat{{\varvec{\beta }}}}_{n,B}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\) be the estimator defined in (4), where \(\phi (y,t)\) is given in (2) and the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfies  R3. Furthermore, assume that H2 and H3 hold and that \(\sqrt{n}\Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}(1)\). Moreover, assume that for every \(C > 0\) and \(\ell \in \{k+1, \dots , p\}\), there exist a constant \(K_{C, \ell }\) and \(N_{C, \ell } \in {\mathbb {N}}\) such that if \(\Vert {\mathbf {u}}\Vert _2 \le C\) and \(n \ge N_{C,\ell }\), then

$$\begin{aligned} I_{\lambda _n}\left( {\varvec{\beta }}_0 + \frac{{\mathbf {u}}}{\sqrt{n}}\right) - I_{\lambda _n}\left( {\varvec{\beta }}_0 + \frac{{\mathbf {u}}^{(-\ell )}}{\sqrt{n}}\right) \ge K_{C,\ell }\, \frac{\lambda _n}{\sqrt{n}}\, |u_\ell |, \end{aligned}$$
(11)

where \({\mathbf {u}}^{(-\ell )}\) is obtained by replacing the \(\ell \)-th coordinate of \({\mathbf {u}}\) with zero and \(u_\ell \) is the \(\ell \)-th coordinate of \({\mathbf {u}}\).

  1. (a)

    For every \(\tau > 0\), there exists \(b > 0\) and \(n_0\in {\mathbb {N}}\) such that if \(\lambda _n = b/\sqrt{n}\), we have that, for any \(n \ge n_0\), \( {\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \ge 1-\tau \).

  2. (b)

    If \(\lambda _n \, \sqrt{n} \rightarrow \infty \), then \( {\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \rightarrow 1 \).

To prove variable selection properties for our estimators, it only remains to show that condition (11) holds for the different penalties mentioned above. First note that (11) is clearly satisfied for the LASSO penalty. In the proof of Corollary 1, we show that SCAD, MCP and the Sign penalty also verify (11).

Corollary 1

Let \({\widehat{{\varvec{\beta }}}}_n = ({\widehat{{\varvec{\beta }}}}_{n,A}^{\small {\textsc {t}}}, {\widehat{{\varvec{\beta }}}}_{n,B}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\) be the estimator defined in (4) with \(\phi (y,t)\) given by (2) where the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfies  R3. Assume that H2 and H3 hold and \(\sqrt{n} \Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}(1)\).

  1. (a)

    If \(I_{\lambda _n}({\varvec{\beta }})\) is the Sign penalty, then for every \(\tau > 0\) there exist \(b>0\) and \(n_0\in {\mathbb {N}}\) such that if \(\lambda _n = b/\sqrt{n}\), we have that, for any \(n\ge n_0\), \( {\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \ge 1-\tau \).

  2. (b)

    If \(I_{\lambda _n}({\varvec{\beta }})\) is taken as the SCAD or MCP penalties and \(\sqrt{n}\lambda _n \rightarrow \infty \), then \( {\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \rightarrow 1\).

Remark 6

It is noteworthy that when the penalty function \(I_{\lambda }({\varvec{\beta }})\) is deterministic and can be written as a sum of continuously differentiable univariate functions, inequality (11) is equivalent to Condition 4 in Avella-Medina and Ronchetti (2018).

A consequence of Corollary 1 is that the penalties SCAD and MCP have the property of automatically selecting variables when \(\sqrt{n}\lambda _n \rightarrow \infty \). This marks a difference with Avella-Medina and Ronchetti (2018), who require stronger conditions on the rate of \(\lambda _n\); see Remark 9. In contrast, when using the LASSO and Sign penalties, we cannot ensure the variable selection property when the estimator is root-n consistent. Recall that, for these two penalties, Theorem 2 entails that the estimator converges at a rate slower than \(\sqrt{n}\) when \(\lambda _n \sqrt{n}\rightarrow \infty \). For that reason, we can only guarantee that, for a given \(0<\tau <1\), a sequence of penalty parameters \(\lambda _n=b/\sqrt{n}\) can be chosen (so that the estimator attains the root-n rate) such that the penalized M-estimator selects variables with probability larger than \(1-\tau \).

The asymptotic distribution results given below will allow us to conclude that, for the LASSO and Sign penalties, when the estimator has convergence rate \(\sqrt{n}\), then \(\limsup _n{\mathbb {P}}({\mathcal {A}}_n={\mathcal {A}})<1\), where \({\mathcal {A}}=\{j: \beta _{0,j}\ne 0\}=\{1,\dots , k\}\) and \({\mathcal {A}}_n=\{j: {\widehat{\beta }}_{n,j}\ne 0\}\) are the sets of indexes related to the active components of \({\varvec{\beta }}_0\) and to the non-null coordinates of \({\widehat{{\varvec{\beta }}}}_n\), respectively. This result is analogous to Proposition 1 in Zou (2006), which shows that the LASSO estimator leads to inconsistent variable selection in the linear regression model when \(\lambda _n=O(1/\sqrt{n})\).

It is worth noticing that \( {\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k} \) if and only if \({\mathcal {A}}_n\subset {\mathcal {A}}\); hence, if \( {\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \rightarrow 1\), we have that \( {\mathbb {P}}({\mathcal {A}}_n\subset {\mathcal {A}})\rightarrow 1\). Note that when \({\mathcal {A}}_n \subsetneq {\mathcal {A}}\), the penalized M-estimator may select a submodel with fewer predictors than the true one, shrinking the estimates of some of the active coefficients to 0; however, the oracle property of the estimators based on SCAD or MCP given in Theorem 8 will allow us to conclude that \( {\mathbb {P}}({\mathcal {A}}_n={\mathcal {A}})\rightarrow 1\).

To derive the variable selection property for random penalties such as the ADALASSO constructed from a root-n consistent initial estimator, we state the following result, whose proof is omitted since it follows using arguments similar to those considered in the proof of Theorem 3. As mentioned above, this property is crucial to obtain the asymptotic distribution of \({\widehat{{\varvec{\beta }}}}_{n,A}\) in Sect. 4.2. Note that for the ADALASSO penalty, the constant \(\gamma >0\) in Theorem 4 corresponds to the value of \(\gamma \) involved in its definition in (5).

Theorem 4

Let \({\widehat{{\varvec{\beta }}}}_n = ({\widehat{{\varvec{\beta }}}}_{n,A}^{\small {\textsc {t}}}, {\widehat{{\varvec{\beta }}}}_{n,B}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\) be the estimator defined in (4), where \(\phi (y,t)\) is given in (2) and the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfies  R3. Assume that \(\sqrt{n}\Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}(1)\) and that for some \(\gamma >0\), \( n^{(1+\gamma )/2} \, \lambda _n \rightarrow \infty \). Furthermore, assume that for every \(C > 0\), \(\ell \in \{k+1, \dots , p\}\) and \(\tau > 0\), there exist a constant \(K_{C, \ell }\) and \(N_{C, \ell } \in {\mathbb {N}}\) such that if \(\Vert {\mathbf {u}}\Vert _2 \le C\) and \(n \ge N_{C,\ell }\), we have that

$$\begin{aligned} {\mathbb {P}}\left( I_{\lambda _n}\left( {\varvec{\beta }}_0 + \frac{{\mathbf {u}}}{\sqrt{n}}\right) - I_{\lambda _n}\left( {\varvec{\beta }}_0 + \frac{{\mathbf {u}}^{(-\ell )}}{\sqrt{n}}\right) \ge K_{C,\ell }\, \frac{\lambda _n\;}{\sqrt{n^{1-\gamma }}}\, |u_\ell |\right) >1-\tau , \end{aligned}$$
(12)

where \({\mathbf {u}}^{(-\ell )}\) and \(u_\ell \) are defined as in Theorem 3. Then, under  H2 and H3, \( {\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \rightarrow 1 \).

4.2 Asymptotic distribution

In this section, we derive separately the asymptotic distribution of our estimator depending on the choice of the penalty. As the rate of convergence to 0 of \(\lambda _n\) required to obtain root-n estimators for the Sign penalty differs from that required for the SCAD or MCP penalties, we will study these two situations separately. Even though most results on penalized estimators assume that the sequence of penalty parameters is deterministic, in this section, as in Theorem 2, we will allow random penalty parameters \(\lambda _n\), adopting in this sense a more realistic point of view.

It is worth noticing that, under H4, the matrix \({\mathbf {A}}\) is positive definite, so the submatrix corresponding to the active coordinates of \({\varvec{\beta }}_0\) is also positive definite.

From now on, \({\mathbf {e}}_\ell \) stands for the \(\ell \)-th canonical vector and \({\mathrm{sign}}(z)\) is the univariate sign function, that is, \({\mathrm{sign}}(z)=z/|z|\) when \(z\ne 0\) and \({\mathrm{sign}}(0)=0\).

Theorem 5

Let \({\widehat{{\varvec{\beta }}}}_n\) be the estimator defined in (4) with \(\phi (y,t)\) given in (2), where the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfies  R3. Assume that H2 to H4 hold, \(\sqrt{n}({\widehat{{\varvec{\beta }}}}_n-{\varvec{\beta }}_0)=O_{{\mathbb {P}}}(1)\) and \(\sqrt{n} \, \lambda _n \buildrel {p}\over \longrightarrow b\). Consider the Sign penalty given by \(I_{\lambda }({\varvec{\beta }}) = \lambda \, {\Vert {\varvec{\beta }}\Vert _1}/{\Vert {\varvec{\beta }}\Vert _2}\). Then, if \(\Vert {\varvec{\beta }}_0\Vert \ne 0\), \(\sqrt{n}({\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0) \buildrel {D}\over \longrightarrow \mathop {\mathrm{argmin}}_{{\mathbf {z}}} R({\mathbf {z}})\), where the process \(R:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) is defined as \( R({\mathbf {z}}) = {\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {w}}+ (1/2) {\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {A}}{\mathbf {z}}+ b \; {\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {q}}({\mathbf {z}}) \), with \({\mathbf {w}}\sim N_p(\mathbf{{0}}, {\mathbf {B}})\), the matrices \({\mathbf {A}}\) and \({\mathbf {B}}\) given in assumption H4 and in equation (10), respectively, \({\mathbf {q}}({\mathbf {z}}) = \sum _{\ell = 1}^p \left[ \nabla _\ell ({\varvec{\beta }}_0)\, \mathbf{1 }_{\{\beta _{0,\ell } \ne 0\}} + \left( {{\mathrm{sign}}(z_\ell )}/{\Vert {\varvec{\beta }}_0\Vert _2} \right) \mathbf{1 }_{\{\beta _{0,\ell } = 0\}}\, {\mathbf {e}}_\ell \right] \) and \( \nabla _\ell ({\varvec{\beta }}) = \,-\,\left( {|\beta _\ell |}/{\Vert {\varvec{\beta }}\Vert _2^3}\right) \, {\varvec{\beta }}\,+\, \left( {{\mathrm{sign}}(\beta _\ell )}/{\Vert {\varvec{\beta }}\Vert _2}\right) \,{\mathbf {e}}_\ell \,.\)
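The vectors \(\nabla _\ell ({\varvec{\beta }})\) appearing in \({\mathbf {q}}({\mathbf {z}})\) are simply the gradients of the coordinate terms \(|\beta _\ell |/\Vert {\varvec{\beta }}\Vert _2\) of the Sign penalty; a quick finite-difference check (a sketch with arbitrary values) confirms the expression.

```python
import numpy as np

def nabla_ell(beta, ell):
    # gradient of |beta_ell| / ||beta||_2 with respect to beta (Theorem 5)
    norm2 = np.linalg.norm(beta)
    e = np.zeros_like(beta)
    e[ell] = 1.0
    return -(abs(beta[ell]) / norm2 ** 3) * beta + (np.sign(beta[ell]) / norm2) * e

beta, ell, h = np.array([1.5, -0.7, 0.3]), 1, 1e-6
f = lambda b: abs(b[ell]) / np.linalg.norm(b)
numerical = np.array([(f(beta + h * np.eye(3)[j]) - f(beta)) / h for j in range(3)])
print(np.allclose(numerical, nabla_ell(beta, ell), atol=1e-5))   # True
```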

The following result generalizes Theorem 5 to differentiable penalties and includes, among others, the LASSO and Ridge penalties, and any convex combination of them, in particular the Elastic Net.

Theorem 6

Let \({\widehat{{\varvec{\beta }}}}_n\) be the estimator defined in (4) with \(\phi (y,t)\) given by (2), where the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfies  R3 and let \({\mathbf {A}}\) and \({\mathbf {B}}\) be the matrices defined in assumption H4 and in equation (10), respectively. Let us consider the penalty

$$\begin{aligned} I_{\lambda }({\varvec{\beta }}) = \lambda \,\left\{ (1-\alpha )\sum _{\ell =1}^p J_{\ell }(|\beta _{\ell }|)+\alpha \sum _{\ell =1}^p |\beta _{\ell }|\right\} \,, \end{aligned}$$
(13)

where \( J_{\ell }(\cdot )\) is a continuously differentiable function such that \(J_\ell ^{\prime }(0)=0\). Assume that H2 to H4 hold, \(\sqrt{n}({\widehat{{\varvec{\beta }}}}_n-{\varvec{\beta }}_0)=O_{{\mathbb {P}}}(1)\) and that \(\sqrt{n} \, \lambda _n \buildrel {p}\over \longrightarrow b\). Then, if \(\Vert {\varvec{\beta }}_0\Vert \ne 0\), \(\sqrt{n}({\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0) \buildrel {D}\over \longrightarrow \mathop {\mathrm{argmin}}_{{\mathbf {z}}} R({\mathbf {z}})\) where the process \(R:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) is defined as \( R({\mathbf {z}}) = {\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {w}}+ (1/2)\, {\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {A}}{\mathbf {z}}+ b \; {\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {q}}({\mathbf {z}}) \), with \({\mathbf {w}}\sim N_p(\mathbf{{0}}, {\mathbf {B}})\) and \({\mathbf {q}}({\mathbf {z}}) = (q_1({\mathbf {z}}),\dots , q_p({\mathbf {z}}))^{\small {\textsc {t}}}\) being \( q_{\ell }({\mathbf {z}})= (1-\alpha ) J_\ell ^{\prime }(|\beta _{0, \ell }|)\; {\mathrm{sign}}(\beta _{0,\ell }) + \alpha \left\{ {\mathrm{sign}}(\beta _{0,\ell }) \mathbf{1 }_{\{\beta _{0,\ell } \ne 0\}} + {\mathrm{sign}}(z_{\ell }) \mathbf{1 }_{\{\beta _{0,\ell } = 0\}}\right\} \).

Remark 7

Note that when \(\sqrt{n} \lambda _n \buildrel {p}\over \longrightarrow 0\) (\(b=0\)), the penalized estimators based on the Sign penalty or on a penalty of the form (13) have the same asymptotic distribution as the M-estimators defined through (3). If \(b>0\) and \(\alpha >0\) in (13), analogous arguments to those considered in linear regression by Knight and Fu (2000), allow to show that the asymptotic distribution of the coordinates of \({\widehat{{\varvec{\beta }}}}_n\) corresponding to null coefficients of \({\varvec{\beta }}_0\), that is, the asymptotic distribution of \({\widehat{{\varvec{\beta }}}}_{n,B}\) puts positive probability at zero. On the other hand, if \(\alpha =0\) and \(b>0\), the amount of shrinkage of the estimated regression coefficients increases with the magnitude of the true regression coefficients. Hence, for “large” parameters, the bias introduced by the differentiable penalty \(J_{\ell }(\cdot )\) may be large.

It is worth noticing that Theorem 6 implies that, when \(I_{\lambda }({\varvec{\beta }}) = \lambda \,\sum _{\ell =1}^p J_{\ell }(|\beta _{\ell }|)\) and \(\sqrt{n} \lambda _n \buildrel {p}\over \longrightarrow b\), \(\sqrt{n}({\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0) \buildrel {D}\over \longrightarrow {\mathbf {A}}^{-1} \left( {\mathbf {w}}+ b {\mathbf {a}}\right) \), where \({\mathbf {a}}=(a_1,\dots ,a_p)^{\small {\textsc {t}}}\) is such that \(a_\ell =J_\ell ^{\prime }(|\beta _{0, \ell }|)\; {\mathrm{sign}}(\beta _{0,\ell })\), which shows the existing asymptotic bias introduced in the limiting distribution, unless \(b=0\). In particular, the robust Ridge M-estimator, that provides a robust alternative under collinearity, is asymptotically distributed as \(N_p(2\,b\,{\mathbf {A}}^{-1}{\varvec{\beta }}_0,{\mathbf {A}}^{-1} {\mathbf {B}}{\mathbf {A}}^{-1})\).

When considering the Sign and LASSO penalties, arguments analogous to those considered in the proof of Proposition 1 in Zou (2006), together with Theorems 5 and 6, show that if the penalized M-estimator has a root-n rate of convergence, then it is inconsistent for variable selection (see Corollary 2). Furthermore, from the proof we may conclude that if \(\sqrt{n} \lambda _n \buildrel {p}\over \longrightarrow 0\), then \({\mathbb {P}}({\mathcal {A}}_n={\mathcal {A}})\rightarrow 0\), that is, we need regularization parameters that converge to 0, but not too fast, in order to select variables with non-null probability.

Corollary 2

Let \({\widehat{{\varvec{\beta }}}}_n = ({\widehat{{\varvec{\beta }}}}_{n,A}^{\small {\textsc {t}}}, {\widehat{{\varvec{\beta }}}}_{n,B}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\) be the estimator defined in (4), where \(\phi (y,t)\) is given through (2) with the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfying R3. Assume that \(\Vert {\varvec{\beta }}_0\Vert \ne 0\), \(\sqrt{n} \lambda _n \buildrel {p}\over \longrightarrow b\), \(\sqrt{n}\Vert {\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}(1)\) and that H2 to H4 hold. Then, for the Sign or LASSO penalties, there exists \(c<1\) such that \(\limsup _n{\mathbb {P}}({\mathcal {A}}_n={\mathcal {A}})\le c<1\), where \({\mathcal {A}}=\{j: \beta _{0,j}\ne 0\}\) is the set of indexes corresponding to the active coordinates of \({\varvec{\beta }}_0\) and \({\mathcal {A}}_n=\{j: {\widehat{\beta }}_{n,j}\ne 0\}\).

Similar arguments to those used in the proof of Theorem 5 allow us to obtain the asymptotic distribution of the penalized M-estimator with the Sign penalty when \(\sqrt{n} \lambda _n \rightarrow \infty \). A similar result holds for penalties of the form (13), such as LASSO.

Theorem 7

Let \({\widehat{{\varvec{\beta }}}}_n \) be the estimator defined in (4), where \(\phi (y,t)\) is given through (2) with the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfying R3. Assume that \(\Vert {\varvec{\beta }}_0\Vert \ne 0\), \(\sqrt{n} \lambda _n \buildrel {p}\over \longrightarrow \infty \), \({\widehat{{\varvec{\beta }}}}_n-{\varvec{\beta }}_0 =O_{{\mathbb {P}}}(\lambda _n)\) and that H2 to H4 hold. Let \({\mathbf {A}}\) be the matrix defined in assumption H4 and consider the Sign penalty \(I_{\lambda }({\varvec{\beta }}) = \lambda \, {\Vert {\varvec{\beta }}\Vert _1}/{\Vert {\varvec{\beta }}\Vert _2}\). Then, \((1/\lambda _n)\;({\widehat{{\varvec{\beta }}}}_n - {\varvec{\beta }}_0) \buildrel {p}\over \longrightarrow \mathop {\mathrm{argmin}}_{{\mathbf {z}}} R({\mathbf {z}})\), where the function \(R:{\mathbb {R}}^p \rightarrow {\mathbb {R}}\) is defined through \( R({\mathbf {z}}) = (1/2)\,{\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {A}}{\mathbf {z}}+ {\mathbf {z}}^{\small {\textsc {t}}}{\mathbf {q}}({\mathbf {z}}) \), with \({\mathbf {q}}({\mathbf {z}}) \) the function defined in Theorem 5.

Remark 8

Under a linear regression model, Lemma 3 in Zou (2006) provides a result analogous to Theorem 7 for the LASSO least squares estimator. As in that result, the rate of convergence of \({\widehat{{\varvec{\beta }}}}_n\) is slower than \(\sqrt{n}\) and the limit is a non-random quantity. As noted in Zou (2006), the optimal rate for \({\widehat{{\varvec{\beta }}}}_n\) is obtained when \(\lambda _n=O_{{\mathbb {P}}}(1/\sqrt{n})\), but at the expense of not selecting variables.

Finally, the following theorem gives the asymptotic distribution of \({\widehat{{\varvec{\beta }}}}_{n,A}\) when the penalty is consistent for variable selection, that is, when \({\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \rightarrow 1\). For that purpose, recall that \({\varvec{\beta }}_0 = ({\varvec{\beta }}_{0,A}^{\small {\textsc {t}}}, \mathbf{{0}}_{p-k}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\) where \({\varvec{\beta }}_{0,A} \in {\mathbb {R}}^k\), \(k\ge 1\), is the vector of active coordinates of \({\varvec{\beta }}_0\) and for \({\mathbf {b}}\in {\mathbb {R}}^k\), define

$$\begin{aligned} \nabla I_{\lambda }({\mathbf {b}})=\frac{\partial I_{\lambda }\left( ({\mathbf {b}}^{\small {\textsc {t}}}, \mathbf{{0}}_{p-k}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\right) }{\partial {\mathbf {b}}}\,. \end{aligned}$$

Theorem 8

Let \({\widehat{{\varvec{\beta }}}}_n\) be the estimator defined in (4) with \(\phi (y,t)\) given in (2), where the function \(\rho :{\mathbb {R}}_{\ge 0}\rightarrow {\mathbb {R}}\) satisfies  R3 and assume that  H2 and H3 hold. Suppose that there exists some \(\delta > 0\) such that

$$\begin{aligned} \sup _{\Vert {{{\varvec{\beta }}}}_A - {{{\varvec{\beta }}}}_{0,A}\Vert _2 \le \delta }\Vert \nabla I_{\lambda _n}({\varvec{\beta }}_{A})\Vert _2 = o_{{\mathbb {P}}}\left( \frac{1}{\sqrt{n}} \right) , \end{aligned}$$
(14)

\({\mathbb {P}}({\widehat{{\varvec{\beta }}}}_{n,B} = \mathbf{{0}}_{p-k}) \rightarrow 1\) and \({\widehat{{\varvec{\beta }}}}_n \buildrel {p}\over \longrightarrow {\varvec{\beta }}_0\). Let \({\widetilde{{\mathbf {A}}}}\) and \({\widetilde{{\mathbf {B}}}}\) be the \(k \times k\) submatrices of \({\mathbf {A}}\) and \({\mathbf {B}}\), respectively, corresponding to the first k coordinates of \({\varvec{\beta }}_0\), where \({\mathbf {A}}\) and \({\mathbf {B}}\) were defined in assumption H4 and in equation (10), respectively. Then, if \({\widetilde{{\mathbf {A}}}}\) is invertible, \( \sqrt{n} ({\widehat{{\varvec{\beta }}}}_{n,A} - {\varvec{\beta }}_{0,A}) \buildrel {D}\over \longrightarrow N_k(\mathbf{{0}}, {\widetilde{{\mathbf {A}}}}^{-1} {\widetilde{{\mathbf {B}}}}{\widetilde{{\mathbf {A}}}}^{-1})\).

Remark 9

Penalties SCAD and MCP fulfil (14) when \(\lambda _n \rightarrow 0\). Indeed, recall that each of them may be written as \(I_{\lambda }({\varvec{\beta }}) = \sum _{j = 1}^p J_{\lambda }(|\beta _j|)\), where \(J_{\lambda }(t)\) is constant on \([a \lambda , \infty )\), with \(a > 0\) the second tuning constant of these penalties. Using that \(J_{\lambda }(0)=0\), we obtain that, for any \({\mathbf {b}}\in {\mathbb {R}}^k\), \(I_{\lambda }\left( ({\mathbf {b}}^{\small {\textsc {t}}}, \mathbf{{0}}_{p-k}^{\small {\textsc {t}}})^{\small {\textsc {t}}}\right) =\sum _{j = 1}^k J_{\lambda }(|b_j|)\), so the j-th component of \(\nabla I_{\lambda }({\mathbf {b}})\) equals \(J^{\prime }_{\lambda }(|b_j|)\,\mathrm{sign}(b_j)\). Since \( \Vert {\widehat{{\varvec{\beta }}}}- {\varvec{\beta }}_0\Vert _2=O_{{\mathbb {P}}}\left( 1/\sqrt{n}\right) \), given \(\delta > 0\) there exist \(C_1> 0\) and \(n_0\) such that \({\mathbb {P}}({\mathcal {D}}_n)>1 - \delta \) for \(n \ge n_0\), with \({\mathcal {D}}_n = \{\Vert {\widehat{{\varvec{\beta }}}}- {\varvec{\beta }}_0\Vert _2 \le C_1/\sqrt{n}\}\).

Let \(n_1\) be such that \(C_1/\sqrt{n}\le m_0/2\) for all \(n \ge n_1\). Then, for any \(\omega \in {\mathcal {D}}_n\), \(n \ge n_1\) and \(1 \le j \le k\), we have that \( |{\widehat{\beta }}_j| \ge |\beta _{0,j}| - |{\widehat{\beta }}_j - \beta _{0,j}| \ge m_{0} - {C_1} {n}^{-1/2} \ge {m_{0}}/{2} \). Moreover, since \(\lambda _n \rightarrow 0\), there exists \(n_2\) such that \(a \lambda _n < m_0/2\) for \(n \ge n_2\). Hence, for \(n\ge \max \{n_0,n_1,n_2\}\) and \(j=1,\ldots , k\), we get \(|{\widehat{\beta }}_j| > a \lambda _n\) on \({\mathcal {D}}_n\), implying that \({\mathcal {D}}_n\subset \{\Vert \nabla I_{\lambda _n} ({\widehat{{\varvec{\beta }}}}_A)\Vert _2 = 0\}\), as desired. Therefore, using Corollary 1, we get that the penalized M-estimators defined through (4) have the oracle property when using SCAD or MCP with \(\lambda _n \rightarrow 0\) and \(\sqrt{n}\;\lambda _n \rightarrow \infty \), which are the same convergence rates required in Fan and Li (2001).
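To make this argument concrete, the following minimal sketch evaluates the MCP penalty and its derivative using the standard parametrization of Zhang (2010); the function names and the illustrative values of \(a\) and \(\lambda_n\) are ours and not taken from the paper.

```python
import numpy as np

def mcp(t, lam, a=3.0):
    """MCP penalty J_lambda(t), t >= 0: quadratic up to a*lam, constant afterwards."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)

def mcp_derivative(t, lam, a=3.0):
    """Derivative J'_lambda(t); it vanishes on [a*lam, infinity)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam - t / a, 0.0)

# Once a*lambda_n < m_0/2, every active coordinate satisfies |beta_hat_j| > a*lambda_n,
# so the gradient of the penalty over the active block is exactly zero, which gives (14).
lam_n, m0 = 0.01, 1.0
print(mcp_derivative(m0 / 2, lam_n))  # 0.0, since m0/2 = 0.5 > a*lam_n = 0.03
```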

In contrast, when considering the ADALASSO regularization, the penalized M-estimators have the oracle property when \(\sqrt{n}\, \lambda _n \rightarrow 0\) and \( {n}^{(1+\gamma )/2}\; \lambda _n \rightarrow \infty \), which coincide with the penalty parameter rates required in Zou (2006).

Summarizing, in our results the rates of convergence of the penalty parameter are in concordance with those required in Zou (2006) or Fan and Li (2001), when considering ADALASSO or MCP, respectively. In particular, for the SCAD and MCP penalties we only require \(\lambda _n\rightarrow 0\) to obtain rates of convergence and \(\sqrt{n}\lambda _n \rightarrow \infty \) to derive variable selection results and the asymptotic distribution (see Corollary 1 and Theorem 8), while Avella-Medina and Ronchetti (2018) require the penalty parameter to converge to 0 faster (\(\sqrt{n}\lambda _n \rightarrow 0\) and \( {n}\;\lambda _n \rightarrow \infty \)), mainly because they obtain results in shrinking neighbourhoods of the true model.

5 Monte Carlo study

In this section, we present the results of a Monte Carlo study designed to compare the small sample performance of classical and robust penalized estimators. Section S.6 of the supplementary file describes the algorithm used to compute the estimators. Complementary results of the numerical experiment presented here are given in Section S.8 of the supplementary file.

To compare the different proposals, throughout our numerical study we considered a training sample \({\mathcal {M}}\) of i.i.d. observations \((y_i, {\mathbf {x}}_i)\), \(1\le i\le n\), with \({\mathbf {x}}_i\in {\mathbb {R}}^p\) and \(y_i|{\mathbf {x}}_i \sim Bi(1, F(\gamma _0+{\mathbf {x}}_i ^{\small {\textsc {t}}}{\varvec{\beta }}_0))\), where the intercept is \(\gamma _0=0\) and we vary the values of n, p and \({\varvec{\beta }}_0\). For clean samples, the covariates distribution is \(N_p(\mathbf{{0}},{\varvec{\varSigma }})\), where two choices for \({\varvec{\varSigma }}\) are taken. For brevity, we report here the situation where \({\varvec{\varSigma }}=\text{ I}_p\), while the case of correlated covariates is described in Sect. S.8.3. This last case is of particular interest since correlation among predictors may affect the variable selection performance of a given penalized estimator, see, for instance, Wang et al. (2020).

5.1 Numerical settings

To confront our estimators with some challenging situations, we considered cases where the ratio p/n is large. More precisely, we chose the pairs (n, p) with \(n \in \{150,300\}\) and \(p \in \{40,80,120\}\). In order to generate a sparse scenario, we chose the true regression parameter with only a few nonzero components. Herein, we present the results corresponding to \({\varvec{\beta }}_0 = (1,1,1,1,1,0, \dots , 0)^{\small {\textsc {t}}}\in {\mathbb {R}}^p\), i.e. the regression parameter has only five nonzero components. In Section S.8.3, we consider a regression parameter with coordinates of different sizes combined with a non-diagonal matrix \({\varvec{\varSigma }}\). Note that with this choice of simulation parameters \({\mathbb {E}}(y_i)\) equals 0.50. In all cases, the number of Monte Carlo replications was \(NR=500\).
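For concreteness, a minimal sketch of the data-generating step for the clean scenario C0 could look as follows; all function and variable names are ours and not part of the original implementation.

```python
import numpy as np

def generate_clean_sample(n, p, rng):
    """One C0 training sample: x_i ~ N_p(0, I_p), y_i | x_i ~ Bi(1, F(gamma_0 + x_i' beta_0))."""
    beta0 = np.zeros(p)
    beta0[:5] = 1.0                    # five active coordinates, as in the reported setting
    gamma0 = 0.0                       # intercept
    x = rng.standard_normal((n, p))    # Sigma = I_p
    prob = 1.0 / (1.0 + np.exp(-(gamma0 + x @ beta0)))  # F is the logistic link
    y = rng.binomial(1, prob)
    return y, x, beta0

rng = np.random.default_rng(0)
y, x, beta0 = generate_clean_sample(n=150, p=40, rng=rng)
```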

Henceforth, the clean samples setting is denoted C0. To study the impact of contamination, we explored two settings obtained by adding a proportion \(\varepsilon \) of atypical points. In the first contamination scheme, namely outliers of class A, we generated misclassified points \(({\widetilde{y}}, {\widetilde{{\mathbf {x}}}})\), where \({\widetilde{{\mathbf {x}}}}\sim N_p(0, 20 \, {\mathbf {I}})\) and \({\widetilde{y}}= 1 \) when \(\gamma _0+{\widetilde{{\mathbf {x}}}}^{\small {\textsc {t}}}{\varvec{\beta }}_0 < 0\) and \({\widetilde{y}}=0\), otherwise. Outliers of class B were obtained as in Croux and Haesbroeck (2003). This means that, given \(m >0\), we fixed \({\widetilde{{\mathbf {w}}}}= {m} \sqrt{p} \, {\varvec{\beta }}_0 /{5} \) and set \({\widetilde{{\mathbf {x}}}}= {\widetilde{{\mathbf {w}}}}+ {\widetilde{{\mathbf {u}}}}\), where \({\widetilde{{\mathbf {u}}}}\sim N_p(\mathbf{{0}}, {\mathbf {I}}/{100})\) is introduced so as to get distinct covariate values. The response \({\widetilde{y}}\) associated with \({\widetilde{{\mathbf {x}}}}\) is always taken equal to 0. It is worth noticing that \({\widetilde{{\mathbf {w}}}}^{\small {\textsc {t}}}{\varvec{\beta }}_0 \approx m \sqrt{p}\), so the leverage of the added points increases with m. The selected values of m are 0.5, 1, 2, 3, 4 and 5.
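The two contamination mechanisms can be sketched as below; the function names are ours and the construction simply follows the description in the text.

```python
import numpy as np

def outliers_class_A(n_out, p, beta0, gamma0, rng):
    """Misclassified points: x ~ N_p(0, 20 I); y = 1 if gamma_0 + x' beta_0 < 0, else 0."""
    x = np.sqrt(20.0) * rng.standard_normal((n_out, p))
    y = (gamma0 + x @ beta0 < 0).astype(int)
    return y, x

def outliers_class_B(n_out, p, beta0, m, rng):
    """Leverage points as in Croux and Haesbroeck (2003): x = m*sqrt(p)*beta0/5 + u, y = 0."""
    w = m * np.sqrt(p) * beta0 / 5.0
    u = rng.standard_normal((n_out, p)) / 10.0   # u ~ N_p(0, I/100)
    return np.zeros(n_out, dtype=int), w + u
```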

Summarizing, we consider the scenarios CA1 and CA2, which correspond to adding a proportion \(\varepsilon = 0.05\) and 0.10, respectively, of outliers of class A, and CB, where we add only \(5\%\) of outliers of class B, as in Croux and Haesbroeck (2003).

We compare the performance of the estimators based on the deviance, that is, when \(\rho (t)=t\), labelled ml in all tables, with those obtained by bounding the deviance and also with their robust weighted versions constructed to control the leverage. The three bounded loss functions considered are \(\rho (t)=1-\exp (-t)\), which leads to the least squares estimator, the loss function \(\rho _c\) introduced by Croux and Haesbroeck (2003) and given in (9), and \(\rho (t)=(c + 1)(1 - \exp (-ct))\), related to the divergence estimators. For the last two loss functions, the tuning constant equals \(c=0.5\). These estimators are indicated with the subscripts ls, m and div, respectively. To define weighted versions of these estimators, let \(D^2({\mathbf {x}},{\varvec{\mu }},{\varvec{\varSigma }}^{-1})=({\mathbf {x}}-{\varvec{\mu }})^{\small {\textsc {t}}}{\varvec{\varSigma }}^{-1}({\mathbf {x}}-{\varvec{\mu }})\) denote the squared Mahalanobis distance. We take weights \(w({\mathbf {x}})=W(D^2({\mathbf {x}},{\widehat{{\varvec{\mu }}}}, {\widehat{{\varvec{\varSigma }}}}^{-1}))\), where, to ensure robustness, \({\widehat{{\varvec{\mu }}}}\) is the \(\ell _1\)-median, \({\widehat{{\varvec{\varSigma }}}}^{-1}\) is an estimator of \({\varvec{\varSigma }}^{-1}\) computed using a robust graphical LASSO and W is the hard rejection weight function \(W(t)=\mathbf{1 }_{[0,c_w]}(t)\). The tuning constant \(c_w\) is adaptive and based on the quantiles of \(D^2({\mathbf {x}}_i,{\widehat{{\varvec{\mu }}}},{\widehat{{\varvec{\varSigma }}}}^{-1})\). To compute \({\widehat{{\varvec{\varSigma }}}}^{-1}\), we used the procedure defined in Öllerer and Croux (2015) and Tarr et al. (2016). More precisely, write \({\varvec{\varSigma }}_{ij} = \sigma _i \sigma _j \rho _{ij}\), with \(\rho _{ii}=1\). To estimate \(\sigma _j\), we used the median of the absolute deviations with respect to the median (mad) of the j-th component, that is, the mad of \(\{x_{1j}, \dots , x_{nj}\}\), where \({\mathbf {x}}_i=(x_{i1},\dots , x_{ip})^{\small {\textsc {t}}}\); to estimate \(\rho _{ij}\), we used the Spearman correlation. The matrix \({\widehat{{\varvec{\varSigma }}}}\) is then defined element-wise as \({\widehat{{\varvec{\varSigma }}}}_{ij} = {\widehat{\sigma }}_i {\widehat{\sigma }}_j {\widehat{\rho }}_{ij}\), and we apply the graphical LASSO (Friedman et al. 2008) to \({\widehat{{\varvec{\varSigma }}}}\) in order to obtain \({\widehat{{\varvec{\varSigma }}}}^{-1}\). These weighted estimators are labelled with the subscript wls, wm or wdiv, according to the loss function considered.
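A possible implementation of these hard rejection weights is sketched below, assuming p is at least moderate so that the Spearman correlation matrix is well defined. The Weiszfeld iteration for the \(\ell_1\)-median, the mad consistency constant 1.4826, the graphical LASSO regularization parameter and the 0.975 empirical quantile used as the cut-off \(c_w\) are our own illustrative choices; the tuning actually used in the paper may differ.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.covariance import graphical_lasso

def l1_median(x, n_iter=100, tol=1e-8):
    """Spatial (l1) median computed by Weiszfeld iterations (a simple stand-in)."""
    mu = np.median(x, axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(x - mu, axis=1), tol)
        mu_new = (x / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

def hard_rejection_weights(x, glasso_alpha=0.1, quantile=0.975):
    """w(x_i) = 1{ D^2(x_i, mu_hat, Sigma_hat^{-1}) <= c_w } with an empirical-quantile cut-off."""
    mu = l1_median(x)
    sigma = 1.4826 * np.median(np.abs(x - np.median(x, axis=0)), axis=0)  # mad of each column
    ranks = np.apply_along_axis(rankdata, 0, x)
    rho = np.corrcoef(ranks, rowvar=False)                                # Spearman correlations
    sigma_hat = np.outer(sigma, sigma) * rho                              # Sigma_ij = sigma_i sigma_j rho_ij
    _, precision = graphical_lasso(sigma_hat, alpha=glasso_alpha)         # graphical LASSO step
    diff = x - mu
    d2 = np.einsum('ij,jk,ik->i', diff, precision, diff)                  # squared Mahalanobis distances
    c_w = np.quantile(d2, quantile)                                       # adaptive cut-off c_w (our choice)
    return (d2 <= c_w).astype(float)
```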

For each loss function, different penalties are considered: LASSO, Sign and MCP, labelled with the superscripts l, s and mcp, respectively. The non-sparse estimators without any penalization term are indicated with no superscript. In Sect. S.8.2, we include a comparison between the SCAD and MCP penalties since, in some regression settings, the former outperforms the latter when considering the classical estimators. However, in our framework, as shown in the supplementary file, the results obtained for both penalties are similar. For that reason, we do not report here the results obtained with the SCAD penalty.

Under C0 and scenarios CA1 and CA2, we compare all described estimators. However, in view of the results obtained for these three situations and for the sake of brevity, under CB we only report the results for \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}\), \({\widehat{{\varvec{\beta }}}}_{{\textsc {m}}}\) and \({\widehat{{\varvec{\beta }}}}_{{\textsc {wm}}}\) with penalties s and mcp.

To evaluate the performance of a given estimator \({\widehat{{\varvec{\beta }}}}\), we consider three summary measures. In the following, let \({\mathcal {T}}=\{ (y_{i,{\mathcal {T}}}, {\mathbf {x}}_{i,{\mathcal {T}}}), i = 1, \dots , n_{{\mathcal {T}}}\}\), \(n_{{\mathcal {T}}}=100\), be a new sample generated independently from the training sample \({\mathcal {M}}\) and distributed as C0. Given estimates \({\widehat{{\varvec{\beta }}}}\) of the slope and \({\widehat{\gamma }}\) of the intercept computed from \({\mathcal {M}}\), we compute the probability mean squared error (PMSE), the true positive proportion (TPP) and the true null proportion (TNP), defined, respectively, as

$$\begin{aligned} \text {PMSE}&= \frac{1}{n_{{\mathcal {T}}}} \sum _{i = 1}^{n_{{\mathcal {T}}}} \left( F({\mathbf {x}}_{i,{\mathcal {T}}} ^{\small {\textsc {t}}}{\varvec{\beta }}_0+\gamma _0) - F({\mathbf {x}}_{i,{\mathcal {T}}} ^{\small {\textsc {t}}}{\widehat{{\varvec{\beta }}}}+{\widehat{\gamma }})\right) ^2\\ \text {TPP}&= \frac{\#\{j : 1 \le j \le p,\; \beta _{0,j} \ne 0,\; {\widehat{\beta }}_j \ne 0 \}}{\#\{j : 1 \le j \le p,\; \beta _{0,j} \ne 0 \}} \quad \text{ and }\quad \\ \text {TNP}&= \frac{\#\{j : 1 \le j \le p,\; \beta _{0,j} = 0,\; {\widehat{\beta }}_j = 0 \}}{\#\{j : 1 \le j \le p,\; \beta _{0,j} = 0 \}} \,. \end{aligned}$$
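These three summary measures are straightforward to compute; a minimal sketch (argument names are ours) is:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def summary_measures(beta_hat, gamma_hat, beta0, gamma0, x_test):
    """PMSE, TPP and TNP as defined above, for a test matrix x_test of size n_T x p."""
    pmse = np.mean((logistic(x_test @ beta0 + gamma0)
                    - logistic(x_test @ beta_hat + gamma_hat)) ** 2)
    active = beta0 != 0
    tpp = np.mean(beta_hat[active] != 0)    # true positive proportion
    tnp = np.mean(beta_hat[~active] == 0)   # true null proportion
    return pmse, tpp, tnp
```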

In all tables, we report the mean of the summary measures over 500 replications.

5.2 Results of the numerical study

Tables 1 and 2 sum up the results corresponding to C0, Tables 3, 4 and 5 summarize contaminations CA1 and CA2, while Tables 6, 7 and 8 present the results obtained under scenario CB.

Table 1 Mean over replications of PMSE under C0
Table 2 True positive proportion/true null proportion. No contamination model: scenario C0. Means over 500 replications
Table 3 Means over replications of PMSE under CA1 and CA2
Table 4 True positive proportion/true null proportion. 5% contamination model: scenario CA1. Means over replications
Table 5 True positive proportion/true null proportion. 10% contamination model: scenario CA2. Means over replications

Table 1 shows that, for samples without contamination, the estimators penalized with MCP tend to achieve lower PMSE values than with the other penalties. In particular, for samples of size \(n = 300\), the maximum likelihood estimators using the MCP penalty attain PMSE values that are less than half of those obtained with the LASSO penalty. That difference is even greater for the least squares estimator and for the M-estimators computed with the function \(\rho = \rho _c \) given in (9). Under C0, the robust weighted estimators give similar results to the unweighted ones with respect to all the considered measures (see Tables 1 and 2), showing that the weights do not affect the performance of the procedure when samples are not contaminated.

As Table 1 reveals, the M-estimator penalized with LASSO loses more prediction efficiency than with the other penalties, reaching PMSE values that at least double those obtained with \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}^{{\textsc {l}}}\). Indeed, when \(n = 300\) and the sample is clean, the Sign and MCP penalties give lower PMSE values than the LASSO penalty. This fact can be explained by the non-negligible bias, already discussed in this paper, introduced by the LASSO penalty even when the ratio n/p is large. For both bounded penalties, all loss functions give very similar results.

As expected, in all situations the non-penalized estimators give worse results than those obtained by regularizing the estimation procedure. In addition, the PMSE values grow when the dimension increases. In particular, this growth is greater when using the Sign penalty for \( n = 150\) and \(p = 120\), where the PMSE values for most estimators, in particular the M-estimators, almost double those obtained with \( n = 150\) and \(p = 40\). As mentioned above, the case \((n, p) = (150, 120) \) poses a great challenge to the estimation of the regression parameter and to the selection of variables as well.

Table 6 Means over replications of PMSE under CB
Table 7 True positive proportions/true null proportions for scenario CB with \(n = 150\). Means over replications
Table 8 True positive proportions/true null proportions for scenario CB with \(n = 300\). Means over replications

Regarding the proportion of correct classifications and the proportions of true positive and null coefficients, all penalized estimators give similar results. It should be mentioned that, when the LASSO penalty is used, lower TNP values are obtained than with the other penalties, giving rise to less sparse estimators. This penalty seems to be less able than MCP to identify as 0 those coefficients associated with explanatory variables that are not involved in the model. This drawback is also observed, although to a lesser extent, when considering the divergence estimator or the maximum likelihood one, both combined with the Sign penalty (see Table 2).

The sensitivity to atypical data of the estimators based on \(\rho (t) = t \) and \( w \equiv 1\), combined with any of the considered penalties, becomes evident throughout the tables. Table 3 shows that, when outliers following schemes CA1 or CA2 are introduced, the obtained PMSE values are at least three times those obtained for uncontaminated samples. Note that, for instance, when \(n=300\) the reported PMSE values may be even 10 times larger under this contamination scheme than for clean samples. The only exception is when \(n=150\) and \(p=120\), where, as mentioned above, the maximum likelihood estimator combined with the Sign penalty already leads to large PMSE values under C0. Table 3 also reveals that, under contamination patterns CA1 and CA2, the best behaviour, in terms of stability, is attained by the penalized weighted M-estimators. In fact, their probability mean squared errors (PMSE) are close to those obtained for clean samples with the bounded penalties Sign and MCP. The benefits of using weighted estimators are also reflected in the proportions of true positives and zeros, as illustrated in Tables 4 and 5. For these latter measures, the LASSO penalty gives the highest TPP values, generally to the detriment of the TNP values since, as mentioned, this penalty has more difficulty identifying non-active explanatory variables.

It is worth noticing that, under CA1 and CA2, the unweighted estimators have higher PMSE values than their weighted versions, especially when \(n = 150\) (see Table 3). Under CA2, these values can double those obtained with the estimators that control the leverage of the covariates. Among the estimators with \( w \equiv 1\), those that give lower PMSE values are the procedures corresponding to \(\rho = \rho _{{\textsc {div}}}\) and those based on the least squares method when combined with the Sign and MCP penalties, in particular when \( n = 300\).

In scenario CA1, the most stable estimators are those based on bounded loss functions. For example, Table 4 shows that the procedure based on \(\rho (t) = t\) is the only one having problems with this level of contamination. On the other hand, the loss function introduced by Croux and Haesbroeck (2003) leads to sparser estimators than those obtained with \(\rho = \rho _{{\textsc {div}}}\) and \(\rho (t) = 1- \exp (-t)\).

Table 5 shows that, as the level of contamination increases (scheme CA2), all estimators seem to become sparser. This effect directly impacts the TPP measure, which decreases by almost half for the unweighted estimators. As expected, this behaviour is more pronounced when using the Sign and MCP penalties combined with \(\rho (t) = t\). Although to a lesser extent, the M-estimators with \(\rho = \rho _c\) given in (9) are also affected by this contamination scheme. With respect to the ability to detect active variables, in most cases the weighted estimators achieve results similar to those obtained under C0.

With respect to the effect of contamination CB, Table 6, which reports the PMSE under this scheme, shows that the PMSE of the penalized maximum likelihood estimators is much larger than that obtained for the weighted or unweighted M-estimators. This effect is more evident when m is larger than 3. In contrast, the weighted M-estimators are more stable. As expected, for mild outliers (\(m=1\), 2) the PMSE of the weighted M-estimators increases and then decreases for larger values of the slope, attaining values similar to those reported for clean samples. In all cases, Table 6 also shows the advantage of combining weighted M-estimators with the MCP penalty. For \(n=300\), the performance of the weighted M-estimators is very similar when combined with either the MCP or the Sign penalty.

Regarding the performance under CB in terms of the measures TPP and TNP, as observed in Tables 7 and 8, the true positive proportions are reduced compared to those obtained for clean samples, attaining, in some cases, proportions smaller than 0.5. Similar conclusions are valid for the M-estimators \({\widehat{{\varvec{\beta }}}}_{{\textsc {m}}}^{{\textsc {mcp}}}\) and \({\widehat{{\varvec{\beta }}}}_{{\textsc {m}}}^{{\textsc {s}}}\). It is worth noticing that the effect of adding outliers on the non-penalized maximum likelihood estimator, \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}\), has been studied in Croux et al. (2002), who observed that \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}\) never explodes to infinity, but rather breaks down to zero when severe outliers are added to a dataset. This fact may explain the TPP behaviour observed in Tables 7 and 8. Indeed, arguments similar to those considered in the proof of Theorem 2 in Croux et al. (2002) allow us to show that the penalized maximum likelihood estimator also shrinks to zero when outliers are added, which explains the behaviour of the TPP measure.

With respect to the weighted M-estimators \({\widehat{{\varvec{\beta }}}}_{{\textsc {wm}}}^{{\textsc {s}}}\) and \({\widehat{{\varvec{\beta }}}}_{{\textsc {wm}}}^{{\textsc {mcp}}}\), the TPP measure shows some sensitivity for small values of the slope m (\(m = 1, 2\)) when \(n=150\), but recovers values close to 1 as the slope m increases. Notice that the intermediate values \(m = 1, 2\) correspond to mild outliers, which are the most difficult to detect. It is worth mentioning that the TNP values obtained under CB are similar to those obtained for uncontaminated samples, except for the estimator \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}^{{\textsc {mcp}}}\), which seems to be the most affected by this type of contamination.

Summarizing, for the studied contaminations, the weighted M-estimators based on the function \(\rho = \rho _c\) given in (9), combined with the MCP and Sign penalties, turn out to be the most stable and reliable among the considered procedures.

6 Real data analysis

In this section, we study a dataset corresponding to the Diagnostic Wisconsin Breast Cancer Database, which is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. Based on the results obtained in the numerical experiments reported in Sect. 5, we only illustrate the performance of the M-estimators computed with the (Croux and Haesbroeck 2003) loss function and of the classical ones, using different penalties. For the robust estimators, the tuning constants are equal to those considered in Sect. 5.1.

Ten real-valued features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image. The measured attributes are: radius (mean of distances from the centre to points on the perimeter), texture (standard deviation of grey-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (\(perimeter^2 / area - 1.0\)), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry and fractal dimension. For each of these features, the mean, the standard deviation and the maximum among all the nuclei of the image were computed, generating a total of \(p = 30\) covariates for each image. Of the \(n = 569\) tumours, 357 were benign and 212 malignant, and the goal is to predict the type of tumour from the \(p = 30\) covariates.
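For reference, the same data matrix can also be obtained from the copy of the WDBC dataset bundled with scikit-learn; the snippet below assumes this copy coincides with the UCI file described above, up to the ordering of the covariates and the encoding of the labels.

```python
from sklearn.datasets import load_breast_cancer

# Convenient access to the n = 569 x p = 30 matrix described above.
data = load_breast_cancer()
x = data.data                        # shape (569, 30)
y = (data.target == 0).astype(int)   # recode so that 1 = malignant, 0 = benign
print(x.shape, y.sum())              # (569, 30) 212
```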

From this dataset, we want to assess the impact of artificial outliers on the variable selection capability of the different methods. For this purpose, we artificially add \(n_0\) atypical observations. Each outlier \(({\widetilde{y}}, {\widetilde{{\mathbf {x}}}})\) was generated as follows. In a first step, we compute the weighted M-estimator with MCP penalty, \(({\widehat{{\varvec{\beta }}}}_{{\textsc {wm}}}^{{\textsc {mcp}}}, {\widehat{\gamma }}_{{\textsc {wm}}}^{{\textsc {mcp}}})\), from the original points; then, we generate \({\widetilde{{\mathbf {x}}}}\sim N_p(\mathbf{{0}}, 100\, {\mathbf {I}})\) and define a misclassified observation as \( {\widetilde{y}}= 1\) when \( {\widetilde{{\mathbf {x}}}}^{\small {\textsc {t}}}\, {\widehat{{\varvec{\beta }}}}_{{\textsc {wm}}}^{{\textsc {mcp}}} +{\widehat{\gamma }}_{{\textsc {wm}}}^{{\textsc {mcp}}}< 0 \) and 0, otherwise. We add \(n_0= 0,20,40\) and 80 outliers. Given each contaminated set, we split the data into 10 folds of approximately the same size. For each estimation method and each subset i (\(1 \le i \le 10\)), we obtain \({\widehat{{\varvec{\beta }}}}^{(-i)} \) and \( {\widehat{\gamma }}^{(-i)}\), the slope and intercept estimates computed without the observations that lie in the i-th subset. Then, for each variable, we evaluate the fraction of times that it is detected as active among the 10 folds as \(\varPi _{a,j} = {\#\{i: {\widehat{{\varvec{\beta }}}}^{(-i)}_j \ne 0\}}/{10}\) for \( 1 \le j \le 30 \), as sketched below. Note that this quantity depends on the estimator that is used and on \(n_0\); regarding variable selection, it attempts to capture the stability of each method against outliers. In each row of the plots of Fig. 1, for each estimator and each value of \(n_0\), we show a grey-scale representation of the measures \(\varPi _{a,1}, \dots , \varPi _{a,30}\).
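A sketch of the computation of \(\varPi _{a,j}\) follows; `fit_estimator` is a hypothetical placeholder for any of the penalized fits considered, returning the slope and intercept estimates.

```python
import numpy as np

def active_fractions(y, x, fit_estimator, n_folds=10, seed=None):
    """Pi_{a,j}: fraction of folds in which variable j is selected when the estimator
    is refitted leaving that fold out. `fit_estimator(y, x)` returns (beta_hat, gamma_hat)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    folds = rng.permutation(n) % n_folds     # random folds of roughly equal size
    counts = np.zeros(p)
    for i in range(n_folds):
        keep = folds != i
        beta_hat, _ = fit_estimator(y[keep], x[keep])
        counts += (beta_hat != 0)
    return counts / n_folds
```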

Fig. 1 Grey-scale representation of measures \(\varPi _{a,j}, 1\le j\le 30\), for each method and number of atypical points introduced artificially

As illustrated in Fig. 1, for the considered contamination, the non-robust estimators \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}^{{\textsc {l}}}\) and \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}^{{\textsc {mcp}}}\) show a very unstable and erratic variable selection, making evident their sensitivity to outliers. The results regarding \({\widehat{{\varvec{\beta }}}}_{{\textsc {ml}}}^{{\textsc {s}}}\) are not included, for brevity, since they lead to similar conclusions. In contrast, the robust procedures based on the (Croux and Haesbroeck 2003) loss function select approximately the same subset of covariates, regardless of the number \(n_{0}\) of added outliers, showing a stable identification of active variables. In particular, the hard rejection weighted estimators are more stable than their unweighted counterparts when using the Sign penalty. The robust estimators with the MCP penalty are sparser than those with the Sign penalty, which can be explained by means of the theoretical properties studied in Sect. 4.1.

7 Concluding remarks

The logistic regression model may be used for classification purposes when covariates with predictive capability are observed. When the regression coefficients are assumed to be sparse, i.e. when only a few explanatory variables are active, the problem of joint estimation and automatic variable selection needs to be considered. In these circumstances, the statistical challenge of obtaining sparse and robust estimators that are computationally feasible and provide variable selection should be complemented with the study of their asymptotic properties. For this reason, under a logistic regression model, we accomplished the goal of obtaining estimators that are more reliable in the presence of atypical data and automatically select variables, using weighted penalized M-procedures. The results obtained are derived for a broad family of penalty functions, which includes the LASSO, ADALASSO, Ridge, SCAD and MCP penalties. Besides these known penalties, we also consider the Sign penalization, which has an intuitive motivation and a simple expression, and has not been exploited in the framework of robust variable selection.

An in-depth study of the theoretical properties of the proposed methods is presented. In particular, under very general conditions, we establish consistency results for a wide family of penalty functions. Besides, to study variable selection and oracle properties, we distinguish the case of Lipschitz penalties, such as the Sign, from that of penalties that can be written as a sum of twice differentiable univariate functions, possibly random. These two points make a difference with respect to Sect. 2 in Avella-Medina and Ronchetti (2018), where the conditions needed to obtain general results regarding sparsity and asymptotic normality are more restrictive than those given herein for the logistic regression model.

In addition to obtaining variable selection properties of the proposed estimators, we derive expressions for their asymptotic distribution. In particular, it is shown that the choice of the penalty function plays a fundamental role in this case. Specifically, we obtain that, by using the random penalty ADALASSO or penalties which are constant from one point onwards (such as SCAD or MCP), the estimators have the desired oracle property. The assumptions required to derive these results are quite mild, which shows that these methods can be applied in very diverse contexts.

We also proposed a robust cross-validation procedure and numerically showed its advantage over the classical one. Through an extensive simulation study, we compared the behaviour of classical and robust estimators for different choices of loss function and penalty. The results obtained illustrate that the robust methods perform similarly to the classical ones for clean samples and behave much better in contaminated scenarios, showing greater reliability. On the other hand, we showed that the results obtained when using bounded penalties such as the Sign or MCP were remarkably better than those obtained when using convex penalties such as LASSO. The penalized weighted M-estimators based on the function \( \rho = \rho _c \) defined in Croux and Haesbroeck (2003), combined with the MCP and Sign penalties, were the most stable and reliable among the considered procedures. Finally, the proposed methods are applied to two datasets, where the robust estimators combined with bounded penalties showed their advantages over the classical ones.