
1 Introduction

We review some methods for assigning significance to (co-)variables or for constructing confidence intervals for a parameter in a high-dimensional regression-type model. Our main focus is on the high-dimensional linear model

$$\displaystyle\begin{array}{rcl} Y = \mathbf{X}\beta ^{0}+\varepsilon & &{}\end{array}$$
(1)

with n × 1 response vector Y, n × p design matrix X, p × 1 regression vector \(\beta ^{0}\) and n × 1 error vector \(\varepsilon\) having i.i.d. components with \(\mathbb{E}[\varepsilon _{i}] = 0\), \(\mbox{ Var}(\varepsilon _{i}) =\sigma _{ \varepsilon }^{2}\) and \(\varepsilon _{i}\) uncorrelated with \(X_{i}\). We also discuss some extensions, including generalized linear models. While there is much literature on convergence rates for parameter estimation and prediction (cf. [6]), only recent work addresses the problem of constructing confidence intervals or tests. Recent reviews on this topic include Bühlmann et al. [5], with a focus on applications in biology, and Dezeure et al. [8], who present a much more detailed and broader treatment. The present work aims to provide a very compact and “fast to read” introduction to the topic, while still conveying the main ideas and pointers to software.

2 High-Dimensional Linear Model and Some Methods for Inference

Consider the high-dimensional linear model in (1). The goal is to test null-hypotheses \(H_{0,j}:\ \beta _{ j}^{0} = 0\) versus \(H_{A,j}:\ \beta _{ j}^{0}\neq 0\) (or a one-sided alternative) for individual variables with index \(j \in \{ 1,\ldots,p\}\), or to construct a confidence interval for \(\beta _{j}^{0}\). In the high-dimensional setting, these tasks are non-trivial since standard least squares methodology cannot be used.

2.1 De-sparsified Lasso

Zhang and Zhang [26] propose a method based on low-dimensional regularized projection using the Lasso. A motivation can be derived from standard least squares: in the low-dimensional setting with p < n and X having full rank, it is well-known that the ordinary least squares estimator satisfies:

$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\mathrm{OLS},j}\ \mbox{ is the projection of $Y $ onto the residual vector $Z_{\mathrm{OLS},j}$},& & {}\\ \end{array}$$

where the n × 1 residual vector \(Z_{\mathrm{OLS},j}\) arises from the OLS regression of \(X_{j}\) versus all other co-variables \(\mathbf{X}_{-j}\) (the design matrix without the jth column). In the high-dimensional setting, this projection is ill-defined since the residual vector \(Z_{\mathrm{OLS},j} \equiv 0\). The idea is to replace the residuals by a regularized version: we regress \(X_{j}\) versus \(\mathbf{X}_{-j}\) with the Lasso and denote the corresponding residuals by \(Z_{j}\) (doing this for all j is the nodewise Lasso from Meinshausen and Bühlmann [18]). We then look at the projection

$$\displaystyle\begin{array}{rcl} Z_{j}^{T}Y/Z_{ j}^{T}X_{ j} =\beta _{ j}^{0} +\sum _{ k\neq j}\beta _{k}^{0}Z_{ j}^{T}X_{ k}/Z_{j}^{T}X_{ j} + Z_{j}^{T}\varepsilon /Z_{ j}^{T}X_{ j}.& & {}\\ \end{array}$$

The first term on the right-hand side is what we aim for, the second one is a bias, and the third one is the noise component with mean zero. To get rid of the bias, we employ a bias correction using (again) the Lasso: this leads to a new estimator

$$\displaystyle\begin{array}{rcl} \hat{b}_{j} = Z_{j}^{T}Y/Z_{ j}^{T}X_{ j} -\sum _{k\neq j}\hat{\beta }_{k}Z_{j}^{T}X_{ k}/Z_{j}^{T}X_{ j}\ \ \ \ (\,j = 1,\ldots,p),& &{}\end{array}$$
(2)

where \(\hat{\beta }\) denotes the Lasso estimator for the regression of Y versus X. A typical choice for the regularization parameters involved in \(Z_{j}\) and in \(\hat{\beta }\) is based on cross-validation of the corresponding Lasso fits. The estimator \(\hat{b}\) is not sparse anymore, hence the name “de-sparsified Lasso”. One can show that the error in the bias estimation is asymptotically negligible [10, 24, 26] on the \(1/\sqrt{n}\)-scale, and one then obtains

$$\displaystyle\begin{array}{rcl} \sqrt{ n}(\hat{b}_{j} -\beta _{j}^{0}) \Rightarrow \mathcal{N}(0,\sigma _{\varepsilon }^{2}\Omega _{ jj})\ (n \rightarrow \infty ),\ \ \Omega _{jj} = \frac{\|Z_{j}\|_{2}^{2}/n} {(Z_{j}^{T}X_{j}/n)^{2}}.& &{}\end{array}$$
(3)

The convergence as \(n \rightarrow \infty\) allows the dimension p = p(n) ≫ n to tend to infinity as well, at a potentially much faster rate than the sample size. We thus have an asymptotic pivot and can construct p-values for \(H_{0,j}\) or confidence intervals by plugging in an estimate of \(\sigma _{\varepsilon }^{2}\), see Sect. 2.3. In fact, the asymptotic variance is the smallest possible (among regular estimators), reaching the Cramér-Rao lower bound [24]: thus, statistical tests and confidence intervals derived from (3) are asymptotically optimal. Furthermore, the convergence in (3) to a Gaussian limit is uniform over a large part of the parameter space, and thus we obtain honest confidence intervals [11].
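To make the construction concrete, the following Python sketch (illustrative only, not the hdi implementation) computes the de-sparsified Lasso (2) and p-values/confidence intervals based on the pivot (3). It assumes centered data (no intercept), uses cross-validated Lasso fits from scikit-learn for the nodewise regressions and for \(\hat{\beta }\), and takes a plug-in estimate of \(\sigma _{\varepsilon }\) as discussed in Sect. 2.3.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV

def desparsified_lasso(X, y):
    """Return the de-sparsified Lasso estimates b_hat and the factors Omega_jj of (3)."""
    n, p = X.shape
    beta_hat = LassoCV(cv=10, fit_intercept=False).fit(X, y).coef_   # Lasso of y versus X
    b_hat, omega = np.empty(p), np.empty(p)
    for j in range(p):
        X_mj = np.delete(X, j, axis=1)
        # nodewise Lasso: residuals Z_j of X_j regressed on all other columns
        gamma_j = LassoCV(cv=10, fit_intercept=False).fit(X_mj, X[:, j]).coef_
        Z_j = X[:, j] - X_mj @ gamma_j
        denom = Z_j @ X[:, j]
        bias = Z_j @ (X_mj @ np.delete(beta_hat, j))                 # bias correction in (2)
        b_hat[j] = (Z_j @ y - bias) / denom
        omega[j] = (Z_j @ Z_j / n) / (denom / n) ** 2                # Omega_jj in (3)
    return b_hat, omega

def pvalue_and_ci(b_j, omega_jj, sigma_hat, n, alpha=0.05):
    """Two-sided p-value for H_{0,j} and (1-alpha) confidence interval from (3)."""
    se = sigma_hat * np.sqrt(omega_jj / n)
    pval = 2 * (1 - norm.cdf(abs(b_j) / se))
    half = norm.ppf(1 - alpha / 2) * se
    return pval, (b_j - half, b_j + half)
```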

It is important to outline the assumptions which are used to establish the result in (3). Assume that the design X consists of (possibly fixed realizations of) i.i.d. rows whose distribution has a p × p covariance matrix \(\Sigma\). The main conditions are as follows:

(A1):

The rows of X have a (sub-)Gaussian distribution and the smallest eigenvalue of \(\Sigma\) is bounded away from zero.

(A2):

The matrix \(\Sigma ^{-1}\) is row-sparse: the maximal number of non-zero entries in each row is bounded by \(o(\sqrt{n/\log (\,p)})\).

(A3):

The linear model is sparse: the number of non-zero entries of \(\beta ^{0}\) is \(o(\sqrt{n}/\log (\,p))\).

(A4):

The error \(\varepsilon\) has a (sub-) Gaussian distribution.

We note that these assumptions imply the ones in van de Geer et al. [24]. The most restrictive conditions are (A2) regarding the design and (A3) saying that the linear model needs to be rather sparse.

2.2 Ridge Projection

The estimator in (2) has a linear part and a non-linear bias correction. A similar construction can be made based on the Ridge estimator:

$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\mathrm{Ridge}} = (n^{-1}\mathbf{X}^{T}\mathbf{X} +\lambda I)^{-1}n^{-1}\mathbf{X}^{T}Y.& &{}\end{array}$$
(4)

A main message is that the Ridge estimator has substantial bias when \(p \gg n\): in fact, it estimates a projected parameter

$$\displaystyle\begin{array}{rcl} \theta ^{0} = P\beta ^{0},\ \ P = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-}\mathbf{X},& & {}\\ \end{array}$$

where \((\mathbf{X}\mathbf{X}^{T})^{-}\) denotes a generalized inverse of \(\mathbf{X}\mathbf{X}^{T}\) [22].

The bias for \(\theta ^{0}\) can be made arbitrarily small by choosing λ sufficiently small, and a quantitative bound is given in Bühlmann [3]. A potentially substantial bias occurs, however, due to the difference between \(\theta ^{0}\) and the target \(\beta ^{0}\). Since

$$\displaystyle\begin{array}{rcl} \frac{\theta _{j}^{0}} {P_{jj}} =\beta _{ j}^{0} +\sum _{ k\neq j}\frac{P_{jk}} {P_{jj}}\beta _{k}^{0},& & {}\\ \end{array}$$

this bias can be estimated and corrected with

$$\displaystyle\begin{array}{rcl} \sum _{k\neq j}\frac{P_{jk}} {P_{jj}}\hat{\beta }_{k},& & {}\\ \end{array}$$

where \(\hat{\beta }\) is the Lasso estimator. Thus, we construct a bias-corrected Ridge estimator

$$\displaystyle\begin{array}{rcl} \hat{b}_{R;j} = \frac{\hat{\beta }_{\mathrm{Ridge};j}} {P_{jj}} -\sum _{k\neq j}\frac{P_{jk}} {P_{jj}}\hat{\beta }_{k},\ j = 1,\ldots,p.& &{}\end{array}$$
(5)

A typical choice of the regularization parameter in (4) for \(\hat{\beta }_{\mathrm{Ridge}}\) is \(\lambda =\lambda _{n} = n^{-1}\) and we can use cross-validation for the regularization parameter in the Lasso \(\hat{\beta }\). This estimator has the following property [3]:

$$\displaystyle\begin{array}{rcl} & & \sigma _{\varepsilon }^{-1}\Omega _{ R;jj}^{-1/2}(\hat{b}_{ R;j} -\beta _{j}^{0}) \approx Z + \Delta _{ j},\ \ Z \sim \mathcal{N}(0,1), \\ & & \Omega _{R} = (\hat{\Sigma }+\lambda )^{-1}\hat{\Sigma }(\hat{\Sigma }+\lambda )^{-1},\ \hat{\Sigma } = n^{-1}\mathbf{X}^{T}\mathbf{X}, \\ & & \vert \Delta _{j}\vert \leq \sigma _{\varepsilon }^{-1}\max _{ k\neq j}\Omega _{R;jj}^{-1/2}\left \vert \frac{P_{jk}} {P_{jj}}\right \vert \|\hat{\beta }-\beta ^{0}\|_{ 1}. {}\end{array}$$
(6)

Here, the “ ≈ ” symbol represents an approximation which becomes exact as \(\lambda \searrow 0^{+}\). The problem is that the behavior of \(\vert P_{jk}/P_{jj}\vert\) and of the diagonal elements \(\Omega _{R;jj}\) is not easily controlled; however, these quantities are observable for the given (fixed) design X, so that an upper bound can be constructed, as discussed next.
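The following sketch (again illustrative, under the same assumptions as the de-sparsified Lasso sketch above) computes the bias-corrected Ridge estimator (5) with \(\lambda = 1/n\) and the matrix \(\Omega _{R}\) from (6):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def ridge_projection(X, y, lam=None):
    n, p = X.shape
    lam = 1.0 / n if lam is None else lam                    # lambda_n = 1/n as in the text
    Sigma_hat = X.T @ X / n
    A = np.linalg.inv(Sigma_hat + lam * np.eye(p))
    beta_ridge = A @ (X.T @ y / n)                           # Ridge estimator (4)
    P = X.T @ np.linalg.pinv(X @ X.T) @ X                    # projection P = X^T (X X^T)^- X
    beta_hat = LassoCV(cv=10, fit_intercept=False).fit(X, y).coef_
    # sum_{k != j} (P_jk / P_jj) * beta_hat_k, vectorized over j
    correction = (P @ beta_hat - np.diag(P) * beta_hat) / np.diag(P)
    b_R = beta_ridge / np.diag(P) - correction               # bias-corrected estimator (5)
    Omega_R = A @ Sigma_hat @ A                              # Omega_R in (6)
    return b_R, Omega_R, P
```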

2.2.1 Inference Based on an Upper Bound

Assuming the so-called compatibility condition on the design X [6, Ch.6.2], we obtain that

$$\displaystyle\begin{array}{rcl} \vert \Delta _{j}\vert \leq \Omega _{R;jj}^{-1/2}\max _{ k\neq j}\left \vert \frac{P_{jk}} {P_{jj}}\right \vert O_{P}(s_{0}\sqrt{\log (\,p)/n}),& & {}\\ \end{array}$$

and in practice, we use an upper bound of the form

$$\displaystyle\begin{array}{rcl} \Delta _{j}^{\mathrm{bound}}:= \Omega _{ R;jj}^{-1/2}\max _{ k\neq j}\left \vert \frac{P_{jk}} {P_{jj}}\right \vert (\log (\,p)/n)^{1/2-\xi },& &{}\end{array}$$
(7)

for some small 0 < ξ < 1∕2, typically ξ = 0.05; this bound is motivated via an implicit assumption that \(s_{0} \leq (n/\log (\,p))^{\xi }\).

Inference can then be based on (6) with the upper bound in (7). For example, for testing \(H_{0,j}:\beta _{ j}^{0} = 0\) against the two-sided alternative \(H_{A,j}:\ \beta _{ j}^{0}\neq 0\) we use the upper bound for the p-value

$$\displaystyle\begin{array}{rcl} 2(1 - \Phi ((\sigma _{\varepsilon }^{-1}\Omega _{ R;jj}^{-1/2}\vert \hat{b}_{ R;j}\vert - \Delta _{j}^{\mathrm{bound}})_{ +})),& & {}\\ \end{array}$$

and an analogous construction can be used for a two-sided 1 −α confidence interval for \(\beta _{j}^{0}\):

$$\displaystyle\begin{array}{rcl} & & [\hat{b}_{R;j} - a,\hat{b}_{R;j} + a], {}\\ & & a = (\Phi ^{-1}(1 -\alpha /2) + \Delta _{ j}^{\mathrm{bound}})\sigma _{\varepsilon }\Omega _{ R;jj}^{1/2}. {}\\ \end{array}$$
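A sketch of the resulting p-value upper bound, with \(\Delta _{j}^{\mathrm{bound}}\) as in (7) and ξ = 0.05, following the standardization in (6); b_R, Omega_R and P are as returned by the Ridge projection sketch above, and sigma_hat is an estimate of \(\sigma _{\varepsilon }\) (Sect. 2.3):

```python
import numpy as np
from scipy.stats import norm

def ridge_pvalue(j, b_R, Omega_R, P, sigma_hat, n, xi=0.05):
    p = len(b_R)
    ratio = np.abs(P[j, :] / P[j, j])
    ratio[j] = 0.0                                           # maximum taken over k != j
    delta = ratio.max() * (np.log(p) / n) ** (0.5 - xi) / np.sqrt(Omega_R[j, j])   # (7)
    z = np.abs(b_R[j]) / (sigma_hat * np.sqrt(Omega_R[j, j]))
    return 2 * (1 - norm.cdf(max(z - delta, 0.0)))           # upper bound for the p-value
```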

The main conditions used for proving consistency of the Ridge-based inference method are as follows:

(B1):

As assumption (A1).

(B2):

The linear model is sparse: for the value 0 < ξ < 1∕2 used in (7), the number of non-zero entries of \(\beta ^{0}\) is \(O((n/\log (\,p))^{\xi })\).

(B3):

The error \(\varepsilon\) has a Gaussian distribution.

It is expected that assumption (B3) could be relaxed to sub-Gaussian distributions as in (A4). No condition on the sparsity of \(\Sigma ^{-1}\) as in (A2) is required, but the method typically does not attain the optimality of the de-sparsified Lasso estimator from Sect. 2.1.

2.3 Estimation of the Error Variance

The de-sparsified Lasso and the Ridge projection method in Sects. 2.1 and 2.2 require an estimate of \(\sigma _{\varepsilon }\) for construction of tests or confidence intervals.

The scaled Lasso [23] leads to a consistent estimate of the error variance: it is a fully automatic method which does not need a user-specific choice of a tuning parameter. Reid et al. [21] present an empirical comparison of various estimators which suggests that the alternative scheme based on the residual sum of squares of a cross-validated Lasso fit has good finite-sample performance.
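As an illustration of the latter scheme, a minimal sketch of the error variance estimate based on the residual sum of squares of a cross-validated Lasso fit; the degrees-of-freedom correction by the number of selected variables follows the convention discussed in Reid et al. [21]:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def sigma_hat_cv_lasso(X, y):
    fit = LassoCV(cv=10, fit_intercept=False).fit(X, y)   # cross-validated Lasso
    residuals = y - X @ fit.coef_
    s_hat = np.count_nonzero(fit.coef_)                   # number of selected variables
    n = len(y)
    return np.sqrt(np.sum(residuals ** 2) / max(n - s_hat, 1))
```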

2.4 Multi Sample Splitting

Sample splitting is a generic method for the construction of p-values. The sample is randomly split into two halves with corresponding disjoint index sets \(I_{1},I_{2} \subset \{ 1,\ldots,n\}\), \(I_{1} \cup I_{2} =\{ 1,\ldots,n\}\), \(\vert I_{1}\vert = \lfloor n/2\rfloor\) and \(\vert I_{2}\vert = n -\lfloor n/2\rfloor\). A variable selection technique is applied to the first half \(I_{1}\), yielding the selected set \(\hat{S}(I_{1}) \subseteq \{ 1,\ldots,p\}\): a prime example is the Lasso, where \(\hat{S} =\{\, j;\ \hat{\beta }_{j}\neq 0\}\), and other selectors \(\hat{S}\) can be derived from a sparse estimator in the same way. With the (typically few) variables in \(\hat{S}\), we can obtain p-values based on the second half \(I_{2}\) using classical t-tests from ordinary least squares: that is, we only use the subsample \((Y _{I_{2}},\mathbf{X}_{I_{2},\hat{S}})\) of the data, with obvious notational meaning of the sub-indices. Such a procedure is implicitly contained in Wasserman and Roeder [25]. Sample splitting avoids using the data twice for selection and inference, which would lead to over-optimistic p-values.
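A minimal sketch of a single sample split (the building block, not the full multi sample splitting procedure): Lasso selection on the first half and classical OLS t-tests on the second half, with p-value 1 for unselected variables; the cross-validated Lasso and the intercept handling are choices of this sketch.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def single_split_pvalues(X, y, rng):
    n, p = X.shape
    idx = rng.permutation(n)
    I1, I2 = idx[: n // 2], idx[n // 2:]
    selected = np.flatnonzero(LassoCV(cv=10).fit(X[I1], y[I1]).coef_)   # S_hat(I_1)
    pvals = np.ones(p)
    if 0 < len(selected) < len(I2):              # need |S_hat(I_1)| < n/2 for OLS on I_2
        ols = sm.OLS(y[I2], sm.add_constant(X[I2][:, selected])).fit()
        pvals[selected] = np.asarray(ols.pvalues)[1:]   # drop the intercept's p-value
    return pvals

# rng = np.random.default_rng(0); repeating over B splits yields P_j^(1),...,P_j^(B)
```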

It is rather straightforward to see that such a principle works if

$$\displaystyle\begin{array}{rcl} & & \hat{S}(I_{1}) \supseteq S_{0} =\{\, j;\ \beta _{j}^{0}\neq 0\}, \\ & & \vert \hat{S}(I_{1})\vert <n/2, {}\end{array}$$
(8)

where \(\hat{S}(I_{1})\) denotes the selected set based on the subsample with indices \(I_{1}\). Furthermore, multiple testing adjustment over all components \(j = 1,\ldots,p\) (see Sect. 3.2) can be done in a powerful way: e.g., Bonferroni correction only needs an adjustment by the factor \(\vert \hat{S}(I_{1})\vert\), which is often much smaller than p. A drawback of the method is its severe sensitivity to how the sample is split: Meinshausen et al. [20] propose repeated splitting of the sample (multi sample splitting) and show how to combine the corresponding dependent p-values. The latter is of independent interest and the procedure is described in Sect. 2.4.1 below.

Such a multi sample splitting method leads to p-values which are already adjusted for multiple testing, either for the familywise error rate or the false discovery rate. The main conditions which are required for the method are (8): when using the Lasso as a screening method (typically with either a cross-validated choice of λ or taking a fixed fraction of the variables entering the Lasso path first), they are implied by the following:

(C1):

As assumption (A1).

(C2):

beta-min assumption:

$$\displaystyle\begin{array}{rcl} \min _{j\in S_{0}}\vert \beta _{j}^{0}\vert \gg \sqrt{s_{ 0}\log (\,p)/n},& & {}\\ \end{array}$$

and \(s_{0} = o(n/\log (\,p))\), where \(s_{0} = \vert S_{0}\vert\) denotes the number of non-zero entries of \(\beta ^{0}\).

(C3):

As assumption (A4).

The beta-min assumption in (C2) is rather unpleasant: with significance testing we would like to find out whether a regression coefficient is large, small, or zero, so an a-priori assumption excluding small non-zero coefficients is undesirable. The condition can be somewhat relaxed to “zonal assumptions” which still require a gap between large and small coefficients and restrict the number of small coefficients [4].

2.4.1 Aggregation of p-Values

With the multi sample splitting approach described above we obtain the following: for testing the null-hypothesis \(H_{0,j}:\ \beta _{ j}^{0} = 0\), when repeating the sample splitting B times, we obtain p-values

$$\displaystyle\begin{array}{rcl} P_{j}^{(1)},\ldots,P_{ j}^{(B)}.& & {}\\ \end{array}$$

The general problem is how to aggregate many, possibly arbitrarily dependent, p-values into a single p-value \(P_{j}\). The following lemma is very general and might be of interest in other problems.

Lemma 1 ([20])

Assume that we have B p-values \(P^{(1)},\ldots,P^{(B)}\) for testing a null-hypothesis H 0 , i.e., for every b ∈{ 1,…,B} and any 0 < α < 1, \(\mathbb{P}_{H_{0}}[P^{(b)} \leq \alpha ] \leq \alpha\) . Consider for any 0 < γ < 1 the empirical γ-quantile of the values \(\{P^{(b)}/\gamma;\ b = 1,\ldots,B\}\) :

$$\displaystyle\begin{array}{rcl} Q(\gamma ) =\ \min \left (\mbox{ empirical $\gamma $-quantile}\ \{P^{(1)}/\gamma,\ldots,P^{(B)}/\gamma \},1\right ).& & {}\\ \end{array}$$

Furthermore, consider a suitably corrected minimum value of Q(γ) over a range which is lower bounded by a positive constant γ min :

$$\displaystyle\begin{array}{rcl} P =\min \left ((1 -\log (\gamma _{\mathrm{min}}))\min _{\gamma \in (\gamma _{\mathrm{min}},1)}Q(\gamma ),1\right ).& &{}\end{array}$$
(9)

Then, both Q(γ) (for any γ ∈ (0,1)) and P are conservative p-values satisfying for any 0 < α < 1: \(\mathbb{P}_{H_{0}}[Q(\gamma ) \leq \alpha ] \leq \alpha\) or \(\mathbb{P}_{H_{0}}[P \leq \alpha ] \leq \alpha\) , respectively.

A simple generic aggregation rule uses \(\gamma = 1/2\): multiply the raw p-values by the factor 2 and take their sample median. A potential power improvement is possible with an adaptive version searching for the best γ as in (9), at the price of the factor \((1 -\log (\gamma _{\mathrm{min}}))\) (which, e.g., is ≈ 3.996 for \(\gamma _{\mathrm{min}} = 0.05\)).
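A sketch of these aggregation rules; note that the empirical-quantile convention of np.quantile may differ slightly from the one used in [20]:

```python
import numpy as np

def Q(pvals, gamma):
    # empirical gamma-quantile of {P^(b)/gamma; b = 1,...,B}, capped at 1
    return min(np.quantile(np.asarray(pvals) / gamma, gamma), 1.0)

def aggregate(pvals, gamma_min=0.05):
    # adaptive rule (9): search over gamma in (gamma_min, 1), corrected by (1 - log(gamma_min))
    gammas = np.linspace(gamma_min, 1.0, 100)
    return min((1 - np.log(gamma_min)) * min(Q(pvals, g) for g in gammas), 1.0)
```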

2.5 Stability Selection

Stability Selection [19] is an even (much) more generic method than the multi sample splitting from Sect. 2.4. It can be applied to any structure estimation problem, such as estimating the edges in a graph; variable selection in a regression problem is a special case, which we now discuss a bit further.

As with multi sample splitting, we randomly split the sample into two halves with indices \(I_{1}\) and \(I_{2}\), respectively, and we consider a variable selection method \(\hat{S} \subseteq \{ 1,\ldots,p\}\). The idea is to analyze the stability of \(\hat{S}(I_{1})\), based on the half-sample \(I_{1}\), under subsampling of the data; in fact, the other half \(I_{2}\) is not used at all. Thus, denote by \(I^{{\ast}}\) a random subsample of size \(\lfloor n/2\rfloor\). We consider the event that a single variable j is selected by \(\hat{S}(I^{{\ast}})\) based on the subsample \(I^{{\ast}}\), i.e., \(j \in \hat{ S}(I^{{\ast}})\), and we compute its probability

$$\displaystyle\begin{array}{rcl} \pi (\,j) = \mathbb{P}^{{\ast}}[\,j \in \hat{ S}(I^{{\ast}})].& & {}\\ \end{array}$$

In practice, this probability is approximated by the empirical relative frequency over B ≈ 100 random subsamples.

The main problem is to determine a threshold \(1/2 <\tau _{\mathrm{thr}} \leq 1\) such that \(\pi (\,j) \geq \tau _{\mathrm{thr}}\) implies that variable j is selected in a “stable way”. This can be formalized as follows: denote by \(V = \vert \{\,j \in S_{0}^{c};\ \pi (\,j) \geq \tau _{\mathrm{thr}}\}\vert\) the number of false positive selections. Then, under conditions outlined below, the following formula holds [19]:

$$\displaystyle\begin{array}{rcl} \mathbb{E}[V ] \leq \frac{1} {2\tau _{\mathrm{thr}} - 1} \frac{q^{2}} {p},& &{}\end{array}$$
(10)

where \(q \geq \vert \hat{S}(I^{{\ast}})\vert\) (almost surely). For example, \(\hat{S}\) can be specified as the top q variables of a ranking (or selection) scheme, e.g., the q variables having the largest regression coefficients in absolute value (if fewer than q coefficients are non-zero, just take all of them). For the Lasso based on a half-sample, which selects at most \(\lfloor n/2\rfloor\) variables, a good value of q might be in the range of n∕10 to n∕3.

The formula (10) can then be inverted to determine a threshold \(\tau _{\mathrm{thr}}\) for a given bound on \(\mathbb{E}[V ]\) and a given q (which specifies the selection method \(\hat{S}\)). For example, when tolerating \(\mathbb{E}[V ] \leq 5\), with q = 30 and p = 1,000, we choose

$$\displaystyle\begin{array}{rcl} \tau _{\mathrm{thr}} = (1 + \frac{q^{2}} {p} \frac{1} {5})/2 = (1 + \frac{30^{2}} {1,000} \frac{1} {5})/2 = 0.59& & {}\\ \end{array}$$

and such a choice then satisfies \(\mathbb{E}[V ] \leq 5\). When using the tolerance bound \(\mathbb{E}[V ] \leq \alpha\), the corresponding threshold \(\tau _{\mathrm{thr}}\) leads to a procedure where

$$\displaystyle\begin{array}{rcl} \mathbb{P}[V> 0] \leq \mathbb{E}[V ] \leq \alpha,& & {}\\ \end{array}$$

and hence to control of the familywise error rate.
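A sketch of the whole procedure with a top-q Lasso selector: the selection probabilities π( j) are approximated over B = 100 random subsamples and the threshold \(\tau _{\mathrm{thr}}\) is obtained by inverting (10) for a tolerated bound on \(\mathbb{E}[V ]\); the top-q rule based on absolute Lasso coefficients is one possible choice.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def stability_selection(X, y, q=30, B=100, EV_bound=5.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        I_star = rng.choice(n, size=n // 2, replace=False)      # random subsample of size n/2
        coef = LassoCV(cv=10).fit(X[I_star], y[I_star]).coef_
        top = np.argsort(-np.abs(coef))[:q]                     # top-q variables by |coefficient|
        counts[top[np.abs(coef[top]) > 0]] += 1                 # keep only the non-zero ones
    pi = counts / B                                             # estimated pi(j)
    tau_thr = (1 + q ** 2 / (p * EV_bound)) / 2                 # inversion of (10)
    return np.flatnonzero(pi >= tau_thr), pi, tau_thr
```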

The main assumptions for validity of (10) are here sketched only:

(D1):

The selector \(\hat{S}\) performs better than random guessing.

(D2):

An exchangeability condition holds, implying that each noise variable is equally likely to be selected by \(\hat{S}\).

The formal assumptions are given in Meinshausen and Bühlmann [19]. In fact, assumption (D1) is a mild condition while (D2) is rather restrictive: however, it was shown empirically that formula (10) approximately holds even for scenarios where (D2) does not hold. Interestingly, a beta-min assumption as in (C2) is not required for Stability Selection.

2.6 A Summary of an Empirical Study

We briefly summarize the results from a fairly large empirical study in Dezeure et al. [8]. An overall conclusion is that the multi sample splitting and the Ridge projection method are often somewhat more reliable for familywise error control (type I error control) than the de-sparsified Lasso procedure; on the other hand, the de-sparsified Lasso has often (a bit) more power in comparison to multi sample splitting and Ridge projection. However, these findings depend on the particular case and they are not consistent among all considered settings. Figure 1 illustrates the familywise error control and power of various methods for 96 different scenarios, varying over different covariate designs, sparsity degrees and structure of active sets, and signal to noise ratios.

Fig. 1

Ninety-six different simulation scenarios, all with p = 500 and n = 100, with varying covariate design, sparsity and structure of the active set, and signal-to-noise ratio. Each dot represents a scenario, shown with jittered plotting. Five methods: de-sparsified Lasso (Despars-Lasso, as in Sect. 2.1), Ridge projection (Ridge, as in Sect. 2.2), multi sample splitting (MS-Split, as in Sect. 2.4), the method from Javanmard and Montanari [10] (JM), and the covariance test from Lockhart et al. [13] (Covtest). Left panel: familywise error rate (FWER) with the nominal level 0.05 indicated by the dotted line; right panel: power (Power), i.e., the fraction of correctly identified active variables with non-zero regression coefficients. The figure is similar to some graphical representations in Dezeure et al. [8]

From a practical point of view, if one is primarily concerned about false positive statements, the multi sample splitting method might be preferable: especially for logistic regression models (see Sect. 3.1), the adapted version of multi sample splitting was found to be the most “robust” for reliable error control.

2.6.1 Supporting Theoretical Evidence and Discussion of Various Assumptions

Supporting theoretical evidence for the performance results in the empirical study can be obtained by discussing the main assumptions underlying the different methods. The de-sparsified Lasso method is expected to work well, and is most powerful, if the design is sparse in terms of the row-sparsity of \(\Sigma ^{-1}\) (assumption (A2)) and if the linear model is rather sparse as well (assumption (A3)). The Ridge projection method does allow for designs with non-sparse rows of \(\Sigma ^{-1}\); however, the less restrictive assumptions come with a price, namely the lack of optimality results in terms of power. The multi sample splitting method, which performs empirically quite reliably, has a theoretical drawback as it requires a zonal or the stronger beta-min assumption on the underlying regression coefficients (assumption (C2)); in terms of sparsity of the linear model, the multi sample splitting method is justified in a broader regime, allowing for \(s_{0} = o(n/\log (\,p))\) (assumption (C2)), than the \(s_{0} = o(\sqrt{n}/\log (\,p))\) required in assumption (A3) for the de-sparsified Lasso.

Stability Selection controls the expected number of false positives \(\mathbb{E}[V ]\) and not, e.g., the familywise error rate (except when controlling \(\mathbb{E}[V ]\) at a very low level α, which implies familywise error control at level α). The restrictive theoretical assumption is the exchangeability condition (D2); however, this condition seems far from necessary. Stability Selection does not require a beta-min assumption as in (C2).

2.7 Other Methods

Very much related to the de-sparsified Lasso in Sect. 2.1 is a proposal by Javanmard and Montanari [10]. Their method is proved to be asymptotically optimal without requiring sparsity of the design as in condition (A2). Empirical evidence suggests though that the error control is not very reliable, see Fig. 1.

Bootstrap methods have been suggested to construct confidence intervals and p-values [7, 12]. They seem to work well for the components where the true parameter value equals zero, but they are often poor for the other components with non-zero parameters. Furthermore, multiple testing adjustment often requires a huge number of bootstrap replicates for reasonable computational approximation of tail events.

The covariance test [13] has recently been proposed as an “adaptive” method for assigning significance with the Lasso. Asymptotic validity of the test was shown under rather restrictive assumptions, in particular a restrictive beta-min assumption in the spirit of condition (C2). Empirical results of the covariance test are illustrated in Fig. 1, indicating that its power is comparably poor and its error control is less reliable than, for example, for the Ridge projection or the multi sample splitting method.

Another interesting proposal is due to Meinshausen [17]: we outline more details in Sect. 3.4.

3 Extensions and Further Topics

We briefly discuss here important extensions and additional issues.

3.1 Generalized Linear Models

Generalized linear models can be immediately treated with the multi sample splitting method or Stability Selection. Instead of, e.g., the Lasso, we use \(\ell _{1}\)-norm regularized maximum likelihood estimation for the selector \(\hat{S}\), and the low-dimensional inference (for the multi sample splitting method) is then based on classical maximum likelihood methodology.

The de-sparsified Lasso and the Ridge projection method are most easily adapted via additional weights, as in iteratively reweighted least squares estimation [15]. The weights \(w_{i} = w_{i}(\beta ^{0})\ (i = 1,\ldots,n)\) can be estimated by plugging in the \(\ell _{1}\)-norm regularized maximum likelihood estimate; we can then proceed with the new weighted data

$$\displaystyle\begin{array}{rcl} \tilde{Y } = WY,\ \ \tilde{\mathbf{X}} = W\mathbf{X},\ \ W = \mbox{ diag}(w_{1},\ldots,w_{n}),& & {}\\ \end{array}$$

and apply the procedures from Sects. 2.1 and 2.2.
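A minimal sketch for a logistic model: the weights are estimated by plugging in an \(\ell _{1}\)-penalized logistic fit and the weighted data are formed as in the display above; the specific weight form \(\sqrt{\hat{\mu }_{i}(1 -\hat{\mu }_{i})}\) is the usual IRLS choice and is an assumption of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_data_logistic(X, y, C=1.0):
    # l1-regularized logistic fit, used to plug in estimated weights w_i
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    mu = fit.predict_proba(X)[:, 1]                  # fitted probabilities
    w = np.sqrt(mu * (1.0 - mu))                     # IRLS-type weights (one common choice)
    return w[:, None] * X, w * y                     # X-tilde = W X, Y-tilde = W Y
```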

3.2 Multiple Testing Correction

Adjustment for multiple testing can be based on standard procedures which require valid p-values for the individual tests as input: even under arbitrary dependence among the p-values, we can use, e.g., the Bonferroni-Holm method for controlling the familywise error rate or the procedure from Benjamini and Yekutieli [1] to control the false discovery rate.
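Both adjustments are readily available, e.g., in statsmodels; a small illustration with hypothetical p-values:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.02, 0.04, 0.30, 0.70])   # illustrative raw p-values
# Bonferroni-Holm for familywise error rate control
reject_fwer, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
# Benjamini-Yekutieli for false discovery rate control under arbitrary dependence
reject_fdr, p_by, _, _ = multipletests(pvals, alpha=0.05, method="fdr_by")
```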

For the de-sparsified Lasso or Ridge projection method, one can use a simulation-based method which is less conservative than Bonferroni-Holm in presence of dependence: the details are given in Bühlmann [3].

We note that the multi sample splitting method from Sect. 2.4, as implemented in the software package hdi (see Sect. 3.3), yields p-values which are already adjusted for controlling the familywise error rate or the false discovery rate.

3.3 R-Package hdi

The R-package hdi [16] contains implementations of various methods, namely the de-sparsified Lasso, the Ridge projection, the multi sample splitting method and Stability Selection. We refer to Dezeure et al. [8] for a description of how to use the procedures and of what the various R-functions can do.

3.4 Testing Groups of Parameters

There might be considerable interest in testing the null-hypothesis \(H_{0,G}:\ \beta _{ j}^{0} = 0\) for all \(j \in G\), where \(G \subseteq \{ 1,\ldots,p\}\) corresponds to a group of variables. The alternative is \(H_{A,G}:\ \mbox{ there exists $j \in G$ with}\ \beta _{j}^{0}\neq 0\).

Based on the de-sparsified Lasso or the Ridge projection method, one can use a simulation-based procedure to obtain an approximate distribution of \(\max _{j\in G}\vert \hat{b}_{j}\vert\) under the null-hypothesis \(H_{0,G}\). We refer to Bühlmann [3] for the details. The multi sample splitting method can also be modified for testing \(H_{0,G}\), as described in Mandozzi and Bühlmann [14].
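One way to approximate the null distribution of \(\max _{j\in G}\vert \hat{b}_{j}\vert\) is to simulate from the estimated joint Gaussian limit of the de-sparsified Lasso; the covariance used in the sketch below follows from the representation behind (3) and is an assumption of this illustration, not necessarily the exact construction in Bühlmann [3].

```python
import numpy as np

def group_test_pvalue(b_hat, Z, X, sigma_hat, G, n_sim=10_000, seed=0):
    # Z[:, j] are the nodewise-Lasso residuals Z_j from Sect. 2.1; G is an array of group indices
    rng = np.random.default_rng(seed)
    denom = np.einsum("ij,ij->j", Z[:, G], X[:, G])              # Z_j^T X_j for j in G
    cov = sigma_hat ** 2 * (Z[:, G].T @ Z[:, G]) / np.outer(denom, denom)
    sims = rng.multivariate_normal(np.zeros(len(G)), cov, size=n_sim)
    max_null = np.abs(sims).max(axis=1)                          # simulated max statistic under H_{0,G}
    observed = np.abs(b_hat[G]).max()
    return np.mean(max_null >= observed)
```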

An interesting and very different proposal is given by Meinshausen [17] which can be used for testing individual but also groups of variables (and the latter is the main motivation in that work): the procedure does not even require an identifiability condition in terms of the design matrix X as it automatically determines whether a parameter or a group of parameters is identifiable.

3.5 Selective Inference

Especially with confidence intervals, one would typically report results only for a few selected variables. An interesting approach to account for this selection effect, in terms of the false coverage rate, is presented in Benjamini and Yekutieli [2]. Their procedure can be applied to confidence intervals from, e.g., the de-sparsified Lasso or the Ridge projection method from Sects. 2.1 and 2.2.

3.6 Some Thoughts on Bayesian Methods

For expository simplicity, consider a Gaussian linear model with Gaussian prior for the regression coefficients \(\beta = (\beta _{1},\ldots,\beta _{p})\):

$$\displaystyle\begin{array}{rcl} & & \beta _{1},\ldots,\beta _{p}\ \mbox{ i.i.d.}\ \sim \ \mathcal{N}(0,\tau ^{2}), \\ & & Y \vert \beta \sim \mathcal{N}_{n}(\mathbf{X}\beta,\sigma ^{2}I). {}\end{array}$$
(11)

The maximum a-posteriori estimator is then the Ridge estimator

$$\displaystyle\begin{array}{rcl} \hat{\beta }_{\mathrm{MAP}} =\mathrm{ argmin}_{\beta }\|Y -\mathbf{X}\beta \|_{2}^{2}/n + \frac{\sigma ^{2}} {\tau ^{2}n}\|\beta \|_{2}^{2}.& & {}\\ \end{array}$$

For \(\tau ^{2}\) large, this is the Ridge estimator with a small regularization parameter λ as in Sect. 2.2.

Denote by \(\beta ^{{\ast}}\) a realization from the prior distribution; we are interested in constructing an interval which contains \(\beta ^{{\ast}}\) with high probability. Alternatively, when adopting the frequentist Bayesian viewpoint (cf. [9]), we assume that the data are generated from a true parameter \(\beta ^{0}\), and we are interested in constructing an interval which covers \(\beta ^{0}\) with high probability, based on the Bayesian model in (11). As discussed in Sect. 2.2, we know that for \(\tau ^{2}\) large or \(\sigma ^{2}\) very small, \(\hat{\beta }_{\mathrm{MAP}}\) is essentially unbiased for \(\theta ^{{\ast}} = P\beta ^{{\ast}}\) (or \(\theta ^{0} = P\beta ^{0}\)), where P is as in Sect. 2.2, but it can be severely biased for \(\beta ^{{\ast}}\) (or \(\beta ^{0}\)) in the high-dimensional scenario with p ≫ n. Thus, the standard (Gaussian prior) Bayesian credible region centered around \(\hat{\beta }_{\mathrm{MAP}}\) seems rather flawed for covering \(\beta ^{{\ast}}\) or \(\beta ^{0}\) in the frequentist Bayesian paradigm.

Of course, in the classical Bayesian inference paradigm, such a bias does not occur, even when p ≫ n, since the distribution of β | Y is Gaussian with mean \(\mathbb{E}[\beta \vert Y ] =\hat{\beta } _{\mathrm{MAP}}\).
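For completeness, the standard conjugate Gaussian computation gives the posterior in (11) explicitly:

$$\displaystyle\begin{array}{rcl} \beta \vert Y \sim \mathcal{N}_{p}\left ((\mathbf{X}^{T}\mathbf{X} + \frac{\sigma ^{2}} {\tau ^{2}} I)^{-1}\mathbf{X}^{T}Y,\ \sigma ^{2}(\mathbf{X}^{T}\mathbf{X} + \frac{\sigma ^{2}} {\tau ^{2}} I)^{-1}\right ),& & {}\\ \end{array}$$

whose mean coincides with \(\hat{\beta }_{\mathrm{MAP}}\), i.e., the Ridge estimator with \(\lambda =\sigma ^{2}/(\tau ^{2}n)\) in the notation of (4).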

4 Conclusions

We provide a compact review of some methods for constructing tests and confidence intervals in high-dimensional models. The main assumptions underlying each method, as well as a summary of empirical results, are presented: this helps to understand, also from a comparative perspective, the strengths and weaknesses of the different approaches. Furthermore, a link to the R-package hdi is made. Thus, users and practitioners obtain a “quick” but sufficiently deep overview for applying the procedures.