Model Selection With Mixed Variables on the Lasso Path

Jeng, X. Jessie; Peng, Huimin; Lu, Wenbin

doi:10.1007/s13571-019-00219-5

Model Selection With Mixed Variables on the Lasso Path

Published: 04 May 2020

Volume 83, pages 170–184, (2021)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Sankhya B Aims and scope Submit manuscript

Model Selection With Mixed Variables on the Lasso Path

Download PDF

X. Jessie Jeng¹,
Huimin Peng¹ &
Wenbin Lu¹

298 Accesses
Explore all metrics

Abstract

Among the most popular model selection procedures in high-dimensional regression, Lasso provides a solution path to rank the variables and determines a cut-off position on the path to select variables and estimate coefficients. In this paper, we consider variable selection from a new perspective motivated by the frequently occurred phenomenon that relevant variables are often mixed with noise variables on the solution path. We propose to characterize the positions of the first noise variable and the last relevant variable on the path. We then develop a new variable selection procedure to control over-selection of the noise variables ranking after the last relevant variable, and, at the same time, retain a high proportion of relevant variables ranking before the first noise variable. Our procedure utilizes the recently developed covariance test statistic and Q statistic in post-selection inference. In numerical examples, our method compares favorably with existing methods in selection accuracy and the ability to interpret its results.

On the RODEO Method for Variable Selection

High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Article Open access 19 December 2019

Robust Signed-Rank Variable Selection in Linear Regression

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The authors of the paper deeply mourn the loss of Dr. Jayanta K. Ghosh, whose dedication to research and mentoring have benefited generations of statisticians and who has set an eminent example of excellence as a scholar and role model. Dr. Ghosh has made substantial contributions to a wide range of research areas such as higher order asymptotics, Bayesian analysis, and high-dimensional inference. One of the authors, X.J. Jeng, was fortunate to have Dr. Ghosh as her PhD advisor in Purdue University. Dr. Ghosh’s work in model selection, multiple testing, and their biomedical applications (e.g. Wilbur et al., 2002; Chakrabarti and Ghosh, 2007, 2011; Bogdan et al., 2008, 2011) has inspired the research in this paper.

We consider the linear regression model

$$ \begin{array}{@{}rcl@{}} \mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{*}+\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim N(0,\sigma^{2} \mathbf{I}), \end{array} $$

(1.1)

where y is a vector of response with length n, X is the n × p design matrix of standardized predictors, and β^∗ a sparse vector of coefficients. In high-dimensional regression, p can be greater than n. Among the most popular methods for model selection and estimation in the high-dimensional regression, Lasso (Tibshirani, 1996) solves the following optimization problem

$$ \begin{array}{@{}rcl@{}} \hat{\boldsymbol{\beta}}(\lambda) =\underset{\beta\in\mathcal{R}^{p}}{\arg\min} \frac{1}{2} \Vert \mathbf{y}-\mathbf{X}\boldsymbol{\beta}{\Vert_{2}^{2}}+ \lambda \Vert\boldsymbol{\beta}\Vert_{1}, \end{array} $$

(1.2)

where λ ≥ 0 is a tuning parameter. Lasso provides a solution path, which is the plot of the estimate $\hat {\beta }(\lambda )$ versus the tuning parameter λ.

Lasso solution path is piecewise linear with each knot corresponding to the entry of a variable into the selected set. Knots are denoted by λ₁ ≥ λ₂ ≥⋯ ≥ λ_m ≥ 0, where $m=\min \limits (n-1,p)$ is the length of the solution path (Efron et al. 2004).

Recent developments in high-dimensional regression focus on hypothesis testing for variable selection. Impressive progress has been made in Zhang and Zhang (2014), Van De Geer et al. (2014), Lockhart et al. (2014), Barber and Candès (2015), Bogdan et al. (2015), Lee et al. (2016), G’sell et al. (2016), and Jeng and Chen (2019a), etc. Specifically, innovative test statistics based on Lasso solution path have been proposed. For example, Lockhart et al. (2014) construct the covariance test statistic as follows. Along the solution path, the variable indexed by j_k enters the selected model at knot λ_k. Define active set right before knot λ_k as A_k = {j₁,j₂,⋯ ,j_k− 1}. In addition, define true active set to be $A^{*}=\{j:~\beta ^{*}_{j}\neq 0\}$ and the size of true active set as s = |A^∗|. Lockhart et al. (2014) considers to test the null hypothesis H_0k : A^∗⊂ A_k conditional upon the active set A_k at knot λ_k and propose the covariance test statistic as

$$ \begin{array}{@{}rcl@{}} T_{k}=\left( \langle \mathbf{y}, \mathbf{X}\hat{\boldsymbol{\beta}}(\lambda_{k+1})\rangle- \langle \mathbf{y}, \mathbf{X}_{A_{k}}\tilde{\boldsymbol{\beta}}_{A_{k}}(\lambda_{k+1})\rangle\right)/\sigma^{2}, \end{array} $$

(1.3)

where $\hat {\boldsymbol {\beta }}(\lambda _{k+1})=\underset {\beta \in \mathcal {R}^{p}}{\arg \min \limits } \frac {1}{2}\Vert \mathbf {y}-\mathbf {X} \boldsymbol {\beta } {\Vert _{2}^{2}}+\lambda _{k+1}\Vert \boldsymbol {\beta }\Vert _{1}$ and $ \tilde {\boldsymbol {\beta }}_{A_{k}}(\lambda _{k+1}) = \underset {\beta \in \mathcal {R}^{\vert A_{k}\vert }}{\arg \min \limits } \frac {1}{2}\Vert $ $\mathbf {y}-\mathbf {X}_{A_{k}}\boldsymbol {\beta }_{A_{k}}{\Vert _{2}^{2}}+\lambda _{k+1}\Vert \boldsymbol {\beta }_{A_{k}}\Vert _{1}$.

Lockhart et al. (2014) derived that under orthogonal design, if all s relevant variables rank ahead of noise variables with probability tending to 1, then for any fixed d,

$$ \begin{array}{@{}rcl@{}} (T_{s+1},T_{s+2},\cdots,T_{s+d}) \overset{d}{\to} (\text{Exp}(1),\text{Exp}(1/2),\cdots,\text{Exp}(1/d)), \end{array} $$

(1.4)

as $p\to \infty $, and that T₁,T₂,⋯ ,T_d are asymptotically independent.

Later, G’sell et al. (2016) proposed to perform sequential test on H_0k : A^∗⊂ A_k for k increasing from 0 to m and developed the Q statistics for a stopping rule. The Q statistics are defined as

$$ \begin{array}{@{}rcl@{}} q_{k}=\exp \left( -\sum\limits_{j=k}^{m} T_{j} \right) \end{array} $$

(1.5)

for k = 1,…,m. It has been proved that in the case of perfect separation where all s relevant variables rank ahead of noise variables, if T_s+ 1,⋯ ,T_m are independently distributed as $(T_{s+1},\cdots ,T_{m}) \sim (\text {Exp}(1),\text {Exp}(1/2), \cdots $ Exp(1/(m − s))), then

$$ \begin{array}{@{}rcl@{}} q_{k} \overset{d}{=} (k-s)\text{th order statistic of~} m-s \text{~independent standard uniform }\\ \text{random variables} \end{array} $$

(1.6)

for s + 1 ≤ k ≤ m. G’sell et al. (2016) developed a stopping rule (TailStop) implementing the Q statistics in the procedure of Benjamini and Hochberg (1995). Given the distribution of q_k in Eq. 1.6, TailStop controls false discovery rate at a pre-specified level.

In this paper, we consider more general scenarios where relevant variables and noise variables are not perfectly separated and some (or all) relevant variables intertwine with noise variables on the Lasso path. Such scenarios would occur when the effect sizes of relevant variables are not large enough. In fact, even with infinitely large effect size, perfect separation on solution path is still not guaranteed when the number of relevant variables is relatively large (Wainwright, 2009; Su et al. 2017). Studies in theory and method are limited in such general scenarios because the inseparability among relevant and noise variables make it difficult to estimate Type I and/or Type II errors. In order to perform variable selection in the more general and realistic settings, we propose a new theoretical framework to characterize the region on the solution path where relevant and noise variables are not distinguishable.

Figure 1 illustrates the indistinguishable region on solution path. Denote m₀ as the position right before the first noise variable on the path such that all entries up to m₀ correspond to relevant variables and m₁ as the position of the last relevant variable where all entries afterwards correspond to noise variables. Given a solution path, both m₀ and m₁ are realized but unknown, and the region between m₀ and m₁ is referred to as the indistinguishable region.

Given the solution path, a sensible variable selection procedure would select all variables up to m₀ but no variables after m₁. Since both m₀ and m₁ are unknown stochastic quantities, the selection procedure should automatically adapt to the unknown m₀ and m₁.

We develop a new variable selection procedure utilizing the Q statistic in Eq. 1.5. We refer to the new procedure as Q-statistic Variable Selection (QVS). QVS searches through the solution path and determines a cut-off position that is likely between m₀ and m₁ under certain conditions that are more general than Eq. 1.6 on the distribution of the Q statistic.

2 Method and Theory

QVS is inspired by earlier works on estimating the proportion of non-null component in a mixture model of p-value (Meinshausen and Buhlmann, 2005, 2006). We extend the technique to high-dimensional regression considering the general scenarios with indistinguishable relevant and noise variables.

Given a Lasso path, QVS searches through the empirical process $k/m -q_{k}-c_{m}\sqrt {q_{k}(1-q_{k})}, 1\le k\le m$, where q_k is defined in Eq. 1.5, and determines the cut-off position as

$$ \begin{array}{@{}rcl@{}} \hat{k}_{qvs}=m \cdot \max_{1\leq k\leq m/2} \left\{ \frac{k}{m}-q_{k}-c_{m}\sqrt{q_{k}(1-q_{k})}\right\}, \end{array} $$

(2.1)

where c_m is a bounding sequence to control over selection of noise variables after m₁ and constructed as follows. For 0 < t < 1, let

$$ U_{m}(t)=\frac{1}{m}\sum\limits_{i=1}^{m} 1(U_{i}\leq t), $$

where U₁,⋯ ,U_m are i.i.d. standard uniform random variables. Further, let

$$ V_{m} = \sup\limits_{t \in (0,1)} \frac{U_{m}(t)-t}{\sqrt{t(1-t)}}. $$

(2.2)

Then determine c_m as an upper bound of V_m so that P(V_m > c_m) < α_m → 0 as $m \to \infty $.

We consider the setting where some relevant variables intertwine with noise variables on the Lasso path. In order to gain theoretical insights for the performance of QVS, we adopt a similar strategy as in Section 4.2.1 of G’sell et al. (2016), and simplify the problem by considering a sequence of arbitrary statistics q₁,…,q_m corresponding to the m ranked variables.

Define U_(1),m−s,…,U_{(m−s),m−s} as increasingly ordered statistics of m − s independent standard uniform random variables, independent of q₁…,q_m. Assume that

$$ q_{m_{0}+1} \le U_{(1), m-s} $$

(2.3)

with probability tending to 1 as $m \to \infty $. It is easy to see that in the special case of perfect separation with m₀ = m₁ = s, Eq. 2.3 is implied by condition (1.6) from G’sell et al. (2016). In the more general case with m₀≠m₁, we show that QVS provides an asymptotic upper bound for the unknown m₀ under (2.3).

Theorem 1.

Consider the stopping rule in Eq. 2.1 under condition (2.3). Assume that the number of relevant variables s = o(m) and that $\sqrt {\log m}/ m_{0} $ → 0 in probability. Then

$$ P\left( \hat{k}_{qvs} \geq(1-\epsilon) m_{0}\right) \to 1 $$

(2.4)

as $m\to \infty $ for any small constant 𝜖 > 0.

The proof of Theorem 1 is provided in Appendix A. The condition, $\sqrt {\log m}/ $ m₀ → 0 in probability, holds when m₀ is large enough, which may be satisfied if the number of relevant variables and their effect sizes are large enough. The result in Theorem 1 implies that QVS can asymptotically retain a high proportion of relevant variables ranked up to m₀.

Next, we show the property of QVS to provide a lower bound for m₁ in Fig. 1. Recall the definition of U_(j),m−s as the j th ordered statistic of m − s independent standard uniform random variables. Assume that for any t ∈ (0,1),

$$ \sum\limits_{k=m_{1}+1}^{m} 1(q_{k} \le t) \le \sum\limits_{k=m_{1}-s+1}^{m-s} 1(U_{(k), m-s} \le t), $$

(2.5)

with probability tending to 1 as $m\to \infty $, where 1(⋅) denotes an indicator function. Note that in the special case with m₀ = m₁ = s, condition (2.5)) is implied by Eq. 1.6 because the latter assumes that q_k has the same distribution as that of U_{(k−s),m−s} for k = s + 1,…,m. In the more general setting when m₀≠m₁, we have the following result.

Theorem 2.

Consider the stopping rule in Eq. 2.1 under condition (2.5). As $m\to \infty $,

$$ P(\hat{k}_{qvs} \leq m_{1}) \to 1. $$

The proof of Theorem 2 is presented in Appendix B. Theorem 2 implies that QVS provides a parsimonious variable selection such that noise variables ranked after m₁ are not likely to be over-selected. Combining Theorem 1 and 2, the following corollary is straightforward.

Corollary 1.

Consider the stopping rule in Eq. 2.1 under conditions (2.3) and (2.5). If s = o(m) and $\sqrt {\log m}/ m_{0} \to 0$ in probability, then

$$ P\left( (1-\epsilon) m_{0} \le \hat{k}_{qvs} \le m_{1}\right) \to 1 $$

as $m\to \infty $ for any small constant 𝜖 > 0.

3 Simulation

In our simulation study, design matrix X_n×p is a Gaussian random matrix with each row generated from N_p(0,Σ). Response y is generated from $N_{n}(\textbf {X} \boldsymbol {\beta }^{*}, \textbf {I})$, where β^∗ is the vector of true coefficients. The locations of non-zero entries of β^∗ are randomly simulated.

For the QVS procedure, we simulate the bounding sequence c_m by the following steps. We generate X_n×p and y_n×1 under the null model and compute the Lasso solution path using the lars package in R. Covariance test statistics and Q statistics $\{q_{i}\}_{i=1}^{m}$ are calculated by Eqs. 1.3 and 1.5, respectively. Then, we compute V_m using $V_{m}=\max \limits _{1\leq i\leq m/2} (i/m-q_{i})/\sqrt {q_{i}(1-q_{i})}$. We repeat the above steps for 1000 times and obtain ${V_{m}^{1}},{V_{m}^{2}},\cdots ,V_{m}^{1000}$. The bounding sequence c_m is computed as the upper α_m percentile of ${V_{m}^{1}},{V_{m}^{2}},\cdots ,$ $V_{m}^{1000}$. We set $\alpha _{m}=1/\sqrt {\log m}$ as recommended in Jeng et al. (2019b) to bound the exceedance probability of V_m at a degenerating level. For each combination of sample size n and dimension p, we only need to simulate the bounding sequence once.

3.1 Positions of $\hat {k}_{qvs}$, m ₀, and m ₁

Recall the definitions of the m₀ and m₁ on the Lasso path and Fig. 1. Table 1 reports the realized values of $\hat {k}_{qvs}$, m₀, m₁. It can be seen that the distance between m₀ and m₁ increases as the number of relevant variables s increases. In all the cases, $\hat {k}_{qvs}$ is greater than m₀, which agrees with the theoretical property of QVS in Theorem 1. On the other hand, $\hat {k}_{qvs}$ is less than m₁ with high frequency when n = 200. When n = 300, $\hat {k}_{qvs}$ is mostly less than m₁ with relatively large s but greater than m₁ with smaller s. We suspect that in the latter case, condition (2.5) in Theorem 2 may not hold.

Table 1 Mean values of the QVS cut-off positions ($\hat {k}_{qvs}$), the positions of m₀, and the positions of m₁ from 1000 replications

Full size table

Recall that condition (2.3) is imposed in Theorem 1 for QVS to retain relevant variables before m₀, and condition (2.5) is imposed in Theorem 2 to avoid over-selecting noise variables after m₁. Consider the settings in Table 1. Figure 2 shows the empirical distributions of $q_{m_{0}+1}$ and U_(1),m−s, respectively, from 1000 replications. The top row has n = 200 and s = 30, and the bottom row increases the sample size to n = 300. The results seem to support condition (2.3) that $q_{m_{0}+1} \le U_{(1), m-s}$ with high probability. Condition (2.5) is more difficult to check in simulation because it is supposed to hold for arbitrary t ∈ (0,1) with high probability. We would defer the verification of this condition to future research.

3.2 Comparisons with other methods

We compare the performance of QVS with other variable selection methods, such as Lasso with BIC (“BIC”), Lasso with 10-fold cross-validation (“LCV”), Bonferroni procedure applied to the Q statistics (“Q-BON”), and Benjamini-Hochberg procedure applied to the Q statistics (“Q-FDR”). Specifically, Q-BON and Q-FDR select the top-ranked variables on the solution path with sizes equal to $\underset {k}{\arg \max \limits } \left \{k: ~ q_{k}\leq 0.05/m\right \}$ and $\underset {k}{\arg \max \limits } \left \{k: ~ q_{k}\leq 0.05k/m\right \}$, respectively. The nominal levels for both Q-BON and Q-FDR are set at 0.05. We note that Q-FDR is equivalent to the TailStop method introduced in G’sell et al. (2016).

We demonstrate the performances of these methods by presenting the true positive proportion (TPP), false discovery proportion (FDP), and g-measure of these methods. TPP is the ratio of true positives to the number of relevant variables entered the solution path. FDP is the ratio of false positives to the number of selected variables. TPP and FDP measure the power and type I error of a selection method, respectively. We also compute the g-measure, which is the geometric mean of specificity and sensitivity, i.e. g-measure $=\sqrt {\text {specificity}\times \text {sensitivity}}$, where specificity is the ratio of true negatives to the number of noise variables in the solution path and sensitivity is equivalent to TPP. G-measure can be used to evaluate the overall performance of a variable selection method. Higher value of g-measure indicates better performance (Powers, 2011).

Figure 3 summarizes the mean values of TPP, FDP, and g-measure for different methods under various model settings with p = 2000, n = 200 and Cov(X) = Σ = (0.5^|i−j|)_{i= 1,⋯ ,p; j= 1,⋯ ,p}. The number of non-zero coefficients s = 10,20,40, and the non-zero effect vary from 0.3 to 2. It can be seen that the Lasso-based BIC and LCV tend to select fewer variables, which results in lower TPP and FDP. On the other hand, the Q Statistic-based methods, Q-BON, Q-FDR, and QVS, all have higher TPP and FDP. However, in these examples, Q-BON does not control family-wise error at the nominal level of 0.05, and Q-FDR does not control FDR at the nominal of 0.05. The reason is because relevant and noise variables are not perfectly separated in these examples. As illustrated in Table 1, m₀ is much smaller than m₁, and the results of Q-BON and Q-FDR cannot be interpreted presumably. In terms of g-measure, QVS generally outperforms other methods. We suspect that the relatively better performance of QVS is related to its control of over-selecting noise variables and under-selecting relevant variables to certain degrees in the challenging scenarios when m₀ and m₁ are far apart.

4 Real Application

We obtain a dataset for expression quantitative trait loci (eQTL) analysis related to Down Syndrome. Down Syndrome is one of the most common gene-associated diseases. Our dataset includes the expression levels of gene CCT8, which contains a critical region of Down syndrome, and genome-wide single-nucleotide polymorphism (SNP) data from three different populations (Bradic et al. 2011): Asia (Japan and China) with sample size n = 90, Yoruba with n = 60, and Europe with n = 60. We perform eQTL mapping to identify SNPs that are potentially associated with the expression level of gene CCT8 for each population. Due to the limited sample size, we randomly select subsets of SNPs with p = 6000,2000,4000 for the three populations, respectively.

For the sample of each population, we first compute the covariance test statistics by Eq. 1.3 and Q statistics by (1.5) based on Lasso solution path. Table 2 presents these statistics for the top 10 knots on the path.

Table 2 Covariance test statistics and Q statistics along Lasso solution path for samples from Asian, Yoruba, and European populations, respectively

Full size table

We apply QVS as well as all the other methods analyzed in Section 3.2 to the datasets. Table 3 presents the number of selected variables along the solution path for different methods. It can be seen that QVS generally selects more variables than the other methods for these datasets. Particularly, when there exists a gap in the list of Q-statistics, such as for Asian and Yoruba samples, QVS tends to stop right after the gap. This is because such gap is likely to occur in the indistinguishable region between m₀ and m₁. Stopping right after the gap would include relevant variables ranked before m₀ and, at the same time, not over-select the noise variables ranked after m₁.

Table 3 The number of selected variables long the Lasso solution path for different methods

Full size table

We further verify the result of QVS by comparing with the findings in literature. Bradic et al. (2011) studied the same samples for eQTL mapping but only focused on cis-eQTLs. Therefore, the numbers of SNPs included in their analysis are much smaller with p = 1955,1978,2146 for the three populations, respectively. More SNP variables are identified in Bradic et al. (2011) for each population due to larger ratio of sample size to dimension. Table 4 reports the locations on the solution path of the variables identified in Bradic et al. (2011). Note that the Lasso solution path is computed using our datasets with lower ratio of sample size to dimension. We utilize this result as a reference to evaluation the results of QVS.

Table 4 Locations on the Lasso solution paths of the reference variables identified in Bradic et al. (2011)

Full size table

For the solution path of Asian population, according to (Bradic et al. 2011), the first noise variable enters after two relevant variables and the last relevant variable enters at the position 46. Therefore, m₀ = 2 and m₁ = 46. QVS selects the top 3 variables on the path, which is in-between m₀ and m₁. This result supports the theoretical property of QVS as a sensible variable selection procedure. Similar results are observed in the other two cases.

5 Conclusion and Discussion

We develop a new variable selection procedure whose result is interpretable in the scenarios where relevant variables may be mixed indistinguishably with noise variables on the Lasso solution path. Our theoretical findings are very different from the existing results which consider variable selection properties in the ideal setting where all relevant variables rank ahead of noise variables on the solution path. The new analytic framework is unconventional but highly relevant to Big Data applications.

The proposed QVS procedure utilizes the Q statistic (G’sell et al. 2016) that is built upon the limiting distribution of the covariance test statistic developed in Lockhart et al. (2014) under orthogonal design. In a more general setting where design matrix is in general position as described in Lockhart et al. (2014), the theoretical analysis on covariance test statistic is much more complicated and its limiting distribution has not be fully derived. Lockhart et al. (2014) provides a control on the tail probability of the covariance test statistic. It will be interesting to characterize the indistinguishable region on the Lasso solution path and interpret the result of the proposed QVS method in the more general setting. We note that the simulation and real data analyses in the paper have design matrices that are not orthogonal. Compared with other popular methods, QVS shows advantages in selection accuracy and the ability to interpret its results.

References

Barber, RF and Candès, EJ (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43, 5, 2055–2085.
Article MathSciNet Google Scholar
Benjamini, Y and Hochberg, Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B 57, 1, 289–300.
MATH MathSciNet Google Scholar
Bogdan, M, Ghosh, J and Żak-szatkowska, M (2008). Selecting explanatory variables with the modified version of the bayesian information criterion. Quality and Reliability Engineering International 24, 6, 627–641.
Article Google Scholar
Bogdan, M, Chakrabarti, A, Frommlet, F and Ghosh, J (2011). Asymptotic bayes-optimality under sparsity of some multiple testing procedures. The Annals of Statistics 39, 3, 1551–1579.
Article MathSciNet Google Scholar
Bogdan, M, van den Berg, E, Sabatti, C, Su, W and Candés, E (2015). SLOPE - adaptive variable selection via convex optimization. The Annals of Applied Statistics 9, 3, 1103–1140.
Article MathSciNet Google Scholar
Bradic, J, Fan, J and Wang, W (2011). Penalized composite quasi-likelihood for unltahigh-dimensional variable selection. Journal of the Royal Statistical Society: Series B 73, 3, 325–349.
Article MathSciNet Google Scholar
Chakrabarti, A and Ghosh, J (2007). Some aspects of bayesian model selection for prediction. Bayesian Statistics 8, 51–90.
MATH MathSciNet Google Scholar
Chakrabarti, A and Ghosh, J (2011). Aic, bic, and recent advances in model selection. Handbook of the Philosophy of Science 7, 583–605.
Google Scholar
Efron, B, Hastie, T, Johnstone, I and Tibshirani, R (2004). Least angle regression. The Annals of Statistics 32, 2, 407–499.
Article MathSciNet Google Scholar
G’sell, M, Wager, S, Chouldechova, A and Tibshirani, R (2016). Sequential selection procedures and false discovery rate control. Journal of the Royal Statistical Society: Series B 78, 2, 423–444.
Article MathSciNet Google Scholar
Jeng, XJ and Chen, X (2019a). Predictor ranking and false discovery proportion control in high-dimensional regression. Journal of Multivariate Analysis 171, 163–175.
Article MathSciNet Google Scholar
Jeng, XJ, Zhang, T and Tzeng, JY (2019b). Efficient signal inclusion with genomic applications. Journal of the American Statistical Association 117, 1787–1799. https://doi.org/10.1080/01621459.2018.1518236
Article MathSciNet Google Scholar
Lee, J, Sun, D, Sun, Y and Taylor, J (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics 44, 3, 907–927.
Article MathSciNet Google Scholar
Lockhart, R, Taylor, J, Tibshirani, R and Tibshirani, R (2014). A significance test for the lasso. The Annals of Statistics 42, 2, 413–468.
MATH MathSciNet Google Scholar
Meinshausen, N and Buhlmann, P (2005). Lower bounds for the number of false null hypotheses for multiple testing of associations under general dependence structures. Biometrika 92, 4, 893–907.
Article MathSciNet Google Scholar
Meinshausen, N and Rice, J (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. The Annals of Statistics 34, 1, 373–393.
Article MathSciNet Google Scholar
Powers, D (2011). Evaluation: from precision, recall and f-measure to roc, informedness, markedness & correlation. Journal of Machine Learning Technologies 2, 37–63.
Google Scholar
Su, W, Bogdan, M and Candes, E (2017). False discoveries occur early on the lasso path. The Annals of Statistics 45, 5, 2133–2150.
Article MathSciNet Google Scholar
Tibshirani, R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58, 267–288.
MATH MathSciNet Google Scholar
Van De Geer, S, Buhlmann, P, Ritov, Y and Dezeure, R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42, 3, 1166–1202.
MATH MathSciNet Google Scholar
Wainwright, M (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ₁-constrained quadratic programming (lasso). IEEE Transactions on Information Theory 55, 5, 2183–2202.
Article MathSciNet Google Scholar
Wilbur, JD, Ghosh, J, Nakatsu, C, Brouder, S and Doerge, R (2002). Variable selection in high-dimensional multivariate binary data with application to the analysis of microbial community dna fingerprints. Biometrics 58, 2, 378–386.
Article MathSciNet Google Scholar
Zhang, C and Zhang, SS (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B 76, 1, 217–242.
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Department of Statistics, North Carolina State University, Raleigh, NC, USA
X. Jessie Jeng, Huimin Peng & Wenbin Lu

Authors

X. Jessie Jeng
View author publications
You can also search for this author in PubMed Google Scholar
Huimin Peng
View author publications
You can also search for this author in PubMed Google Scholar
Wenbin Lu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to X. Jessie Jeng.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supported in part by NSF Grant DMS-1811360.

Appendix

1.1 A Proof of Theorem 1

By the construction of $\hat {k}_{qvs}$ in Eq. 2.1,

$$ \begin{array}{@{}rcl@{}} \frac{\hat{k}_{qvs}}{m_{0}} - 1 & = & \max_{1\leq k\leq m/2} \left\{ \frac{k}{m_{0}}- 1 - \frac{m}{m_{0}} q_{k}- \frac{m}{m_{0}}c_{m}\sqrt{q_{k}(1-q_{k})}\right\} \\ & \ge & \frac{m_{0} + 1}{m_{0}}- 1 - \frac{m}{m_{0}} q_{m_{0}+1}- \frac{m}{m_{0}}c_{m}\sqrt{q_{m_{0}+1}} \\ & > & - \frac{m}{m_{0}} q_{m_{0}+1} - \frac{m}{m_{0}}c_{m}\sqrt{q_{m_{0}+1}}, \end{array} $$

(A.1)

where the second step above is by taking k = m₀ + 1.

By condition (2.3), $P(q_{m_{0}+1} \!\le \! U_{(1), m-s}) \!\to \! 1$, where $U_{(1), m-s} \!\sim \! Beta(1, m-s)$. Therefore, given s = o(m) and $\sqrt {\log m}/ m_{0} \to 0$ in probability,

$$ \frac{m}{m_{0}} q_{m_{0}+1} = O_{p}(\frac{1}{m_{0}}) = o_{p}(1). $$

(A.2)

On the other hand, it has been shown in Meinshausen and Rice (2006) that $c_{m} = O(\sqrt {\log m}/ \sqrt {m})$. Then, by s = o(m) and $\sqrt {\log m}/ m_{0} \to 0$ in probability, we have

$$ \frac{m}{m_{0}}c_{m}\sqrt{q_{m_{0}+1}} = O_{p}(\frac{\sqrt{\log m}}{m_{0}}) = o_{p}(1). $$

(A.3)

Combining (A.1) - (A.3) gives (2.4).

1.2 B Proof of Theorem 2

Define $F_{m}(t) = \frac {1}{m}{\sum }_{k=1}^{m} 1(q_{k}\leq t)$. Re-write $\hat {k}_{qvs}$ in Eq. 2.1 as

$$ \begin{array}{@{}rcl@{}} \hat{k}_{qvs}=m\max_{0<t<1} \{F_{m}(t)-t-c_{m}\sqrt{t(1-t)}\}. \end{array} $$

(A.4)

For notation convenience, define π₁ = m₁/m. Then

$$ \begin{array}{@{}rcl@{}} F_{m}(t)& = & \frac{m_{1}}{m} \frac{1}{m_{1}} \sum\limits_{j=1}^{m_{1}} 1(q_{j}\leq t) +\frac{1}{m} \sum\limits_{j=m_{1}+1}^{m} 1(q_{j}\leq t) \le \pi_{1} + \frac{1}{m} \sum\limits_{j=m_{1}+1}^{m} 1(q_{j}\leq t). \end{array} $$

Therefore,

$$ \begin{array}{@{}rcl@{}} & & P(\hat{k}_{qvs}>m_{1}) \\ & \le & P(\sup\limits_{0<t<1} \{F_{m}(t)-t-c_{m}\sqrt{t(1-t)}\}>\pi_{1}) \\ &\le & P(\sup\limits_{0<t<1} \{ \pi_{1} + \frac{1}{m} \sum\limits_{j=m_{1}+1}^{m} 1(q_{j}\leq t) -t -c_{m}\sqrt{t(1-t)}\}>\pi_{1}) \\ & = & P(\sup\limits_{0<t<1} \{\frac{1}{m} \sum\limits_{j=m_{1}+1}^{m} 1(q_{j}\leq t) -t -c_{m}\sqrt{t(1-t)}\}>0) \\ & \leq & P(\sup\limits_{0<t<1} \{\frac{1}{m} \sum\limits_{j=m_{1}-s+1}^{m-s} 1(U_{(j), m-s}\leq t) -t -c_{m}\sqrt{t(1-t)}\} >0 ) + o(1),\\ \end{array} $$

(A.5)

where the last step above is by condition (2.5). Note that $ {\sum }_{j=m_{1}-s+1}^{m-s} 1$ (U_(j),m−s ≤ t) is stochastically dominated by ${\sum }_{j=1}^{m-s} 1(U_{(j), m-s}\leq t)$, which has the save distribution as that of ${\sum }_{j=1}^{m-s} 1(U^{\prime }_{j} \leq t)$, where $\{U^{\prime }_{j}\}_{j=1}^{m-s}$ is a sequences of independent standard uniform random variables, independent of $\{U_{j}\}_{j=1}^{m}$. Further, ${\sum }_{j=1}^{m-s} 1(U^{\prime }_{j} \leq t)$ is stochastically dominated by ${\sum }_{j=1}^{m} 1(U^{\prime }_{j} \leq t)$, which has the same distribution as that of mU_m(t). Sum up the above, we have

$$ \begin{array}{@{}rcl@{}} & & P(\sup\limits_{0<t<1} \{\frac{1}{m} \sum\limits_{j=m_{1}-s+1}^{m-s} 1(U_{(j), m-s}\leq t) -t -c_{m}\sqrt{t(1-t)}\} >0 ) \\ & \le & P(\sup\limits_{0<t<1} \{U_{m}(t)-t-c_{m}\sqrt{t(1-t)}\}>0). \end{array} $$

(A.6)

By the definition of the bounding sequence c_m,

$$ \begin{array}{@{}rcl@{}} P\left( \sup\limits_{0<t<1} \frac{U_{m}(t)-t}{\sqrt{t(1-t)}} >c_{m}\right)=\alpha_{m}. \end{array} $$

Further,

$$ P\left( \sup\limits_{0<t<1} \frac{U_{m}(t)-t-c_{m}\sqrt{t(1-t)}}{\sqrt{t(1-t)}} >0\right) \le P\left( \sup\limits_{0<t<1} \frac{U_{m}(t)-t}{\sqrt{t(1-t)}} >c_{m}\right) =\alpha_{m}. $$

And for every t ∈ (0,1),

$$ \frac{U_{m}(t)-t-c_{m}\sqrt{t(1-t)}}{\sqrt{t(1-t)}}>U_{m}(t)-t-c_{m}\sqrt{t(1-t)} $$

almost surely. The above implies that

$$ P\left( \sup\limits_{0<t<1} \{U_{m}(t)-t-c_{m}\sqrt{t(1-t)}\} >0\right)\leq \alpha_{m} = o(1). $$

(A.7)

Combining (A.5) with (A.6) and (A.7) gives $P(\hat {k}_{qvs}>m_{1}) \to 0$ as $m\to \infty $.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Jeng, X.J., Peng, H. & Lu, W. Model Selection With Mixed Variables on the Lasso Path. Sankhya B 83, 170–184 (2021). https://doi.org/10.1007/s13571-019-00219-5

Download citation

Published: 04 May 2020
Issue Date: May 2021
DOI: https://doi.org/10.1007/s13571-019-00219-5

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Model Selection With Mixed Variables on the Lasso Path

Abstract

Similar content being viewed by others

On the RODEO Method for Variable Selection

High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Robust Signed-Rank Variable Selection in Linear Regression

1 Introduction