2.1 Introduction to Three Papers on Robustness

2.1.1 General Introduction

This is a short introduction to three papers on robustness, published by Peter Bickel as single author in the period 1975–1984: “One-step Huber estimates in the linear model” (Bickel 1975), “Parametric robustness: small biases can be worthwhile” (Bickel 1984a), and “Robust regression based on infinitesimal neighbourhoods” (Bickel 1984b). This was a time when fundamental developments in the understanding of robustness took place, and Peter Bickel made deep contributions to the area. I try to place the results of the three papers in the new context of contemporary statistics.

2.1.2 One-Step Huber Estimates in the Linear Model

The paper by Bickel (1975) is about the following procedure. Given a \(\sqrt{n}\)-consistent initial estimator \(\tilde{\theta }\) for an unknown parameter θ, performing one Gauss-Newton iteration with respect to the objective function to be optimized leads to an asymptotically efficient estimator. Interestingly, this result holds even when the MLE is not efficient, and the one-step estimator is equivalent to the MLE if the latter is efficient. Such a result was known for the case where the loss function corresponds to the maximum likelihood estimator (Le Cam 1956). Bickel (1975) extends this result to much more general loss functions and models.
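In a generic form (the notation here is mine, not Bickel's, who works in the linear model with a Gauss-Newton rather than a full Newton step), the construction reads as follows. If \(\psi\) denotes the derivative of the loss with respect to the parameter, the one-step estimator performs a single Newton-type iteration for the estimating equation \(\sum _{i=1}^{n}\psi (X_{i};\theta ) = 0\), started at \(\tilde{\theta }\):
\[ \hat{\theta } =\tilde{\theta } -{\Bigl (\sum _{i=1}^{n} \tfrac{\partial } {\partial \theta }\psi (X_{i};\tilde{\theta })\Bigr )}^{-1}\sum _{i=1}^{n}\psi (X_{i};\tilde{\theta }). \]
Under regularity conditions, \(\hat{\theta }\) inherits the first-order asymptotic behaviour of the fully iterated M-estimator while requiring only one pass of computation.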

The idea of a computational short-cut without sacrificing statistical efficiency was relevant more than 30 years ago (summary point 5 in Sect. 3 of Bickel 1975). Yet the idea is still very important in large-scale and high-dimensional applications nowadays. Two issues emerge.

In some large-scale problems, one is willing to pay a price in terms of statistical accuracy while gaining substantially in terms of computing power. Peter Bickel has recently co-authored a paper on this subject (Meinshausen et al. 2009): having some sort of guarantee on statistical accuracy is then highly desirable. Results as in Bickel (1975), probably in a weaker form that does not touch on the concept of efficiency, are underdeveloped for large-scale problems.

The other issue concerns the fact that iterations in algorithms correspond to a form of (algorithmic) regularization, which is often very effective for large datasets. A prominent example is boosting: instead of a Gauss-Newton step, boosting proceeds with Gauss-Southwell iterations, which are coordinatewise updates based on an n-dimensional approximate gradient vector (where n denotes the sample size). It is known, at least in some cases, that boosting with such Gauss-Southwell iterations achieves minimax convergence rate optimality (Bühlmann and Yu 2003; Bissantz et al. 2007) while being computationally attractive. Furthermore, in view of robustness, boosting can easily be modified such that each Gauss-Southwell update is performed in a robust way and hence the overall procedure has desirable robustness properties (Lutz et al. 2008). As discussed in Sect. 3 of Bickel (1975), the starting value (i.e., the initial estimator) also matters in robustified boosting.
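To make the Gauss-Southwell idea concrete, the following small Python sketch (my own simplified illustration with squared error loss and a Huber-type clipping of the residuals; it is not the procedure of Lutz et al. (2008)) selects in each iteration the single covariate that best explains the current clipped residual vector and takes a small step along that coordinate:

import numpy as np

def robust_componentwise_boosting(X, y, n_iter=100, nu=0.1, delta=1.345):
    # Componentwise (Gauss-Southwell) boosting with Huber-type clipping of the
    # residuals; a minimal sketch, not the method of Lutz et al. (2008).
    n, p = X.shape
    beta = np.zeros(p)
    fit = np.zeros(n)
    for _ in range(n_iter):
        resid = np.clip(y - fit, -delta, delta)   # robustified residuals
        num = X.T @ resid                          # componentwise least-squares fits
        den = (X ** 2).sum(axis=0)
        coef = num / den
        j = np.argmax(num ** 2 / den)              # coordinate reducing the loss most
        beta[j] += nu * coef[j]                    # small step size nu (shrinkage)
        fit += nu * coef[j] * X[:, j]
    return beta

Each update touches only one coordinate, which is the algorithmic regularization alluded to above; the clipping makes the updates resistant to outliers in the response.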

2.1.3 Parametric Robustness: Small Biases Can Be Worthwhile

The following problem is studied in Bickel (1984a): construct an estimator that performs well for a particular parametric model \({\mathcal{M}}_{0}\) while its risk remains upper-bounded for another, larger parametric model \({\mathcal{M}}_{1} \supset {\mathcal{M}}_{0}\). The interpretation is that one believes \({\mathcal{M}}_{0}\) to be adequate but wants to guard against deviations coming from \({\mathcal{M}}_{1}\). It is shown in the paper that the corresponding optimality problem does not have an explicit solution; however, approximate answers are presented, and interesting connections are developed to the Efron-Morris (Efron and Morris 1971) family of translation estimates, i.e., adding a soft-thresholded additional correction term to the optimal estimator under \({\mathcal{M}}_{0}\). (The reference Efron and Morris (1971) appears in the text but is missing from the list of references in Bickel's paper.)
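Schematically (in my notation, not Bickel's), such a translation estimate takes the form
\[ \hat{\theta } = \hat{\theta }_{{\mathcal{M}}_{0}} + \mathrm{sign}(T)\,{\bigl (\vert T\vert -\lambda \bigr )}_{+}, \]
where \(\hat{\theta }_{{\mathcal{M}}_{0}}\) is the optimal estimator under \({\mathcal{M}}_{0}\), T estimates the deviation towards \({\mathcal{M}}_{1}\), \({(x)}_{+} =\max (x,0)\) denotes the positive part, and λ tunes the trade-off between efficiency under \({\mathcal{M}}_{0}\) and the risk bound under \({\mathcal{M}}_{1}\): small deviations are ignored (incurring a small bias), while large deviations are corrected for.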

The notion of parametric robustness could be interesting in high-dimensional problems. Guarding against specific deviations (which may be easier to specify in some applications than in others) can be more powerful than trying to protect nonparametrically against point-mass distributions in any direction. In this sense, this paper is a key reference for developing effective high-dimensional robust inference.

2.1.4 Robust Regression Based on Infinitesimal Neighbourhoods

Robust regression is analyzed in Bickel (1984b) using a nice mathematical framework where the perturbation lies within a \(1/\sqrt{n}\)-neighbourhood of the uncontaminated ideal model. The results presented in Bickel (1984b) give a clear (mathematical) interpretation of various procedures and suggest new robust methods for regression.
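One common formalization of such an infinitesimal neighbourhood (written here in its gross-error version; Bickel (1984b) considers more general forms of perturbation in the regression setting) is
\[ {\mathcal{P}}_{n}(c) = \Bigl \{P:\; P = \bigl (1 - \tfrac{c} {\sqrt{n}}\bigr )P_{0} + \tfrac{c} {\sqrt{n}}\,Q,\ Q\ \mbox{arbitrary}\Bigr \}, \]
where \(P_{0}\) denotes the ideal model. The contamination proportion shrinks at the rate \(1/\sqrt{n}\), so that bias and stochastic error remain of the same order and a meaningful asymptotic trade-off between the two can be studied.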

A major issue in robust regression is guarding against contamination in X-space. Bickel (1984b) gives nice insights for the classical case where the dimension of X is relatively small; a new challenge is to deal with robustness in high-dimensional regression problems where the dimension of X can be much larger than the sample size. One attempt has been to robustify high-dimensional estimators such as the Lasso (Khan et al. 2007) or \(L_{2}\)Boosting (Lutz et al. 2008), in particular with respect to contamination in X-space. An interesting and different path has been initiated by Friedman (2001) with tree-based procedures which are robust in X-space (in connection with a robust loss function for the error). There is clearly a need for a unifying theory, in the spirit of Bickel (1984b), for robust regression when the dimension of X is large.