1 Introduction

An important task in modern data analysis is identifying a subset of explanatory features from a larger pool that are associated with a response of interest. For instance, suppose a nutritionist is interested in identifying which nutrient intakes are associated with body mass index. This information, in turn, could be leveraged to develop diets that improve human health (Chen & Li, 2013). Another example is a geneticist who is interested in finding which genetic mutations of a virus confer resistance to a specific antiviral drug. To quote Rhee et al. (2006): “[This understanding] is essential for designing new antiviral drugs and for using genotypic drug resistance testing to select optimal therapy.” In such applications, there are often a large number of explanatory features with complex dependencies and a limited number of observations, which makes the selection problem a particularly difficult task (Shah & Peters, 2020). To stress this point, consider the genetic example from above and notice that a feature selection procedure can make erroneous selections of two kinds: (i) false positives—it may select mutations that do not have any effect on drug resistance; and (ii) false negatives—it may fail to discover genetic mutations that in fact confer resistance to the antiviral drug. This paper presents a novel cost function and learning scheme whose goal is to increase the number of discoveries reported by statistical methods for feature selection, while keeping the rate of false positives under control.

To formalize the problem, we denote by \(Y \in \mathbb {R}\) the response variable and by \(X = (X_1,X_2,\dots ,X_d) \in \mathbb {R}^d\) the feature vector. For example, imagine that Y is a measure of drug resistance, and \(X_j \in \mathbb {R}\)—the jth entry in the vector X—indicates the presence or absence of a genetic mutation in a specific location. Conditional independence testing (Candès et al., 2018; Tansey et al., 2021; Liu et al., 2020; Zhang et al., 2012) deals with the question of whether a specific feature \(X_j\) is independent of the response variable \(Y\in \mathbb {R}\) after accounting for the effect of all the other features \(X_{-j} = (X_1,\dots ,X_{j-1}, X_{j+1},\dots ,X_d) \in \mathbb {R}^{d-1}\). That is, the null hypothesis we seek to test is

$$\begin{aligned} H_{0,j}: X_j \perp \!\!\! \perp Y \mid X_{-j} \end{aligned}$$

against the alternative \(H_{1,j}: X_j \not \!\perp \!\!\! \perp Y \mid X_{-j}\). In plain words, \(H_{0,j}\) asks whether knowledge of \(X_j\) provides information about the response Y beyond what is already contained in \(X_{-j}\). Consequently, we say that the jth feature is null (i.e., unimportant) if \(H_{0,j}\) is true, and non-null (i.e., important) if \(H_{0,j}\) is false.

Imagine we are given n data points \(\{(X^i,Y^i)\}_{i=1}^n\), where \(X^i \in \mathbb {R}^d\) and \(Y^i\in \mathbb {R}\) are drawn i.i.d. from a joint distribution \(P_{XY}\). Statistical tests for conditional independence leverage the observed data to formulate a decision rule on whether to reject \(H_{0,j}\) and report that the feature \(X_j\) is likely to be important, or to accept \(H_{0,j}\) in cases where there is not enough evidence to reject the null. A valid test for conditional independence should control the probability of making a type I error (i.e., rejecting \(H_{0,j}\) when it is in fact true) by returning a p-value, a random variable that is super-uniform under the null. This p-value is then used to rigorously control the type I error rate at any user-specified level \(\alpha\), e.g., 5%. The power of the test (higher is better) is the true positive rate, i.e., the probability of correctly reporting that a non-null \(X_j\) is indeed important.

Thus far we have formulated the problem of testing for a specific feature; however, in many scientific applications (e.g., genome-wide association studies; Sesia et al., 2020b; Benner et al., 2016) we are interested in selecting a subset of features that are associated with the response. In other words, we wish to test all \(H_{0,j},\) \(j=1,\dots ,d\) simultaneously while controlling some statistical notion of type I error. To this end, imagine we have at hand d p-values, one for each \(H_{0,j}\), which we denote by \(p_j\), \(j=1, \dots ,d\). Recall that each p-value allows us to control the type I error for a specific hypothesis at a desired rate \(\alpha\). However, when testing all the hypotheses simultaneously we must account for the multiplicity of the tests, as otherwise the probability that some of the true null hypotheses are rejected by chance alone may be excessively high; see the survey by Bender and Lange (2001) for details. Hence, we follow the seminal work by Benjamini and Hochberg (1995) and define our task as follows: identify the largest possible subset \(\hat{\mathcal {S}}\subseteq \{1,\dots ,d\}\) of non-null features (Footnote 1) while ensuring that the false discovery rate (FDR), defined as

$$\begin{aligned} \text {FDR} = \mathbb {E}\left[ \frac{|\hat{\mathcal {S}} \cap \mathcal {H}_0 |}{\text {max}\{|\hat{\mathcal {S}}|,1\}}\right] , \end{aligned}$$

falls below a nominal level of choice \(q\in (0,1)\), e.g., \(q=0.1\) or \(q=0.2\). Above, \(\mathcal {H}_0 = \{1,\dots ,d\} {\setminus } \mathcal {H}_1\) contains the indices of the null features for which \(H_{0,j}\) is true, \(\mathcal {H}_1\) is the set of non-null features, and \(|\cdot |\) denotes the set size. In words, the FDR is the expected proportion of true nulls among the rejected hypotheses. In our genetic example, the ability to control the FDR allows us to form a list of discoveries such that, on average, the majority of the selected mutations are expected to be conditionally associated with drug resistance. This information may be valuable when allocating the costly resources required to further study the implications of the reported discoveries (Hawinkel et al., 2019; Manolio et al., 2009). The power of the selection procedure (larger is better) is defined as \(\mathbb {E}\left[ {| \hat{\mathcal {S}} \cap \mathcal {H}_1 |} / {|\mathcal {H}_1|}\right]\), i.e., the expected proportion of non-nulls that are correctly selected among all the true non-nulls. Controlling the FDR can be achieved by plugging the list of p-values \(p_j,\) \(j=1,\dots , d\) into the Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995), as described in Sect. 2.2. Importantly, the power of this feature selection procedure is affected by the ability of the underlying conditional independence test to generate small p-values for the non-null features, indicating strong evidence against the corresponding null hypotheses. Our goal in this work is to increase the power of the above controlled feature selection pipeline by forming data-driven test statistics that powerfully separate non-nulls from nulls.
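To fix ideas, here is a minimal sketch (plain Python, with illustrative names) of how the realized false discovery proportion and true positive proportion are computed for a single selection set; the FDR and power are the averages of these quantities over repeated experiments.

```python
def fdp_and_power(selected, nonnull, d):
    """Realized false discovery proportion and true positive proportion for one
    selection set; the FDR and power are their averages over repeated experiments."""
    selected, nonnull = set(selected), set(nonnull)
    null = set(range(d)) - nonnull
    fdp = len(selected & null) / max(len(selected), 1)
    tpp = len(selected & nonnull) / max(len(nonnull), 1)
    return fdp, tpp

# Toy example: d = 10 features, features 0-2 are non-null, and we selected {0, 1, 5}.
print(fdp_and_power([0, 1, 5], [0, 1, 2], d=10))  # -> (0.333..., 0.666...)
```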

The model-X conditional randomization test (Candès et al., 2018) and its computationally efficient version—the holdout randomization test (HRT) (Tansey et al., 2021) that we study in this paper—are methods for conditional independence testing that have gained considerable attention in recent years. These tools are attractive since they can leverage any powerful predictive model to rigorously test for \(H_{0,j}\), under the assumption that the distribution of the explanatory features \(P_X\) is known. As such, model-X tests are especially beneficial in situations where copious amounts of unlabeled data are available compared to labeled data, as happens, e.g., in genome-wide association studies (Sesia et al., 2020a, b; Bates et al., 2020). At a high level, the HRT is carried out by first splitting the data into a training set and a test set, and fitting an arbitrary predictive model on the training set. Next, a dummy copy \(\tilde{X}_j\) of the original feature \(X_j\) is sampled for each sample in the test set, such that \(\tilde{X}_j\) is independent of Y given \(X_{-j}\) by construction. In our running example, one can think of \(\tilde{X}_j\) as a fake genetic mutation (sampled by a computer program) that does not carry any new information about Y if we already know all the other mutations. The test proceeds by comparing, for each observation in the test set, the accuracy of the trained predictive model applied to (i) the original feature vector X, and (ii) its modified version in which only \(X_j\) is replaced by \(\tilde{X}_j\). Here, a decrease in the overall accuracy can serve as evidence that \(X_j\) is associated with Y after accounting for the effect of \(X_{-j}\).

In practice, the HRT is deployed by treating the machine learning model as a black box, fitted to maximize predictive accuracy with the hope of attaining good power. However, the ultimate objective here is to maximize the power of the test and not merely the model’s accuracy, and, currently, there is no method to balance the two. To stress this point further, there is growing evidence that modern machine learning algorithms tend to rely excessively on spurious features—null features that are merely correlated with the non-null ones—to make accurate predictions (Sagawa et al., 2019; McCoy et al., 2019; Arjovsky et al., 2019; Peters et al., 2015; Khani & Liang, 2021). Loosely speaking, in such cases the model tends to up-weight the importance of null features at the cost of down-weighting the importance of the non-nulls. The consequence of this undesired phenomenon is that the predictive model may show only a mild drop in accuracy when an important feature is replaced with its dummy copy. This causes the test to lose power, since the test, by design, accounts for the effect of such spurious correlations.

1.1 Our contribution

We propose novel learning schemes to fit models that are designed to explicitly increase the power of the HRT. This is done by ‘looking ahead’ (during training) to what will happen when the predictive model is integrated within the HRT procedure. This stands in striking contrast with the common practice, in which the model is merely trained to maximize accuracy to the extent possible. Concretely, the core of our contribution is the formulation of a novel loss function that encourages the model’s accuracy to drop when replacing the original feature \(X_j\) with the dummy one \(\tilde{X}_j\), seeking maximum risk discrepancy (MRD). We introduce in Sect. 3 a general, theoretically motivated, stochastic optimization procedure that enables us to augment existing cost functions used in the variable selection literature with our MRD loss, e.g., sparsity-promoting penalties for deep neural networks. That section also includes a specialized learning scheme to fit sparse linear regression models (such as the lasso) with our proposed loss. Extensive experiments on synthetic and real data sets are presented in Sect. 5, showing that the combination of the proposed MRD approach with existing regression algorithms consistently improves the power of the underlying controlled feature selection procedure. In particular, in cases where the signal is weak (i.e., in low-power regimes), our method may achieve more than 90% relative improvement compared to existing methods. Importantly, the MRD scheme preserves the validity guarantee of the p-values generated by the HRT. Furthermore, we demonstrate that our new proposal is fairly robust to situations where the distribution \(P_X\)—required to sample the dummy features—is estimated from data; such situations are of course ubiquitous in real-world applications. Lastly, a software package that implements our methods, with code for reproducing the experiments, is attached to the Supplementary Material.

2 Background

2.1 Model-X randomization test

The model-X randomization test provides a valid p-value for \(H_{0,j}\) in finite samples without making modeling assumptions on the conditional distribution of \(Y \mid X\). However, this approach assumes that the conditional distribution of \(X_j \mid X_{-j}\) is known. The original conditional randomization test (CRT), developed by Candès et al. (2018), requires fitting a regression model a large number of times in order to be carried out exactly. The HRT, which is extensively used in this paper, bypasses this computational issue at the expense of data splitting. Specifically, this test starts by dividing the data into disjoint training and testing sets. The training set is used to fit an arbitrary regression model \(\hat{f}\), formulating a test statistic function \(T(X_j, X_{-j},Y;\hat{f}) \in \mathbb {R}\), e.g., the model’s prediction error. Let \(\{(X^i,Y^i)\}_{i\in \mathcal {I}}\) denote the samples of the testing set. The HRT proceeds by repeating the following two steps for each \(k=1,\dots ,K\) (treating \(\hat{f}\) as a fixed function):

  • Sample a dummy feature \(\tilde{X}^i_j \sim P_{X_j|X_{-j}}(X_j^i|X_{-j}^i)\), for all samples in the test set \(i\in \mathcal {I}\).

  • Compute the test statistic using the dummy variables

    $$\begin{aligned} \tilde{t}_j^{(k)} \leftarrow \frac{1}{|\mathcal {I}|} \sum _{i\in \mathcal {I}} T(\tilde{X}_{j}^i,X_{-j}^i,Y^i;\hat{f}). \end{aligned}$$

Finally, a p-value \(\hat{p}_j\) for \(H_{0,j}\) is constructed by computing the empirical quantile level of the true test statistic \(t^* \leftarrow \frac{1}{|\mathcal {I}|} \sum _{i\in \mathcal {I}} T(X_j^i, X_{-j}^i,Y^i;\hat{f})\) among the dummy statistics \(\tilde{t}_j^{(1)},\dots ,\tilde{t}_j^{(K)}\):

$$\begin{aligned} \hat{p}_j=\frac{1+ \left| \{ k : t^* \ge \tilde{t}_j^{(k)} \}\right| }{{K+1}}. \end{aligned}$$
(1)

Importantly, the p-value \(\hat{p}_j\) in (1) is valid, i.e., \(\mathbb {P}[\hat{p}_j \le \alpha \mid H_{0,j} \ \text {is true}] \le \alpha\). This is because (i) \(\tilde{X}_j \perp \!\!\! \perp Y \mid X_{-j}\) by construction, and (ii) under the null, the random variables \(t^*, \tilde{t}_j^{(1)}, \dots , \tilde{t}_j^{(K)}\) are exchangeable (Candès et al., 2018).
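For concreteness, the two steps above and the p-value in (1) can be sketched in a few lines of Python. This is an illustrative sketch (not the authors' released code), assuming a fitted model with a scikit-learn-style `predict` method, the squared-error statistic adopted later in (2), and a user-supplied sampler for \(P_{X_j \mid X_{-j}}\).

```python
import numpy as np

def hrt_pvalue(model, X_test, Y_test, j, sample_dummy, K=1000, seed=0):
    """HRT p-value (1) for feature j, with the squared-error test statistic.
    `sample_dummy(X, j, rng)` should return one draw of X_j | X_{-j} per row of X."""
    rng = np.random.default_rng(seed)
    t_star = np.mean((Y_test - model.predict(X_test)) ** 2)   # statistic on the real data
    t_dummy = np.empty(K)
    for k in range(K):
        X_tilde = X_test.copy()
        X_tilde[:, j] = sample_dummy(X_test, j, rng)          # replace X_j by a dummy copy
        t_dummy[k] = np.mean((Y_test - model.predict(X_tilde)) ** 2)
    # Empirical quantile level of t_star among the dummy statistics, cf. Eq. (1).
    return (1.0 + np.sum(t_star >= t_dummy)) / (K + 1.0)
```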

The validity of the p-value holds true for any choice of predictive model \(\hat{f}\), and for any choice of test statistic \(T(\cdot )\) used to compare the distributions of \((Y, {X}_j, X_{-j})\) and \((Y, \tilde{X}_j, X_{-j})\). Katsevich and Ramdas (2020) proved that the likelihood ratio is the uniformly most powerful statistic; however, it requires knowledge of the true conditional distribution of \(Y \mid X\), which is unavailable in practice. A common choice for a test statistic—which we use in our proposal as well—is the squared error (Tansey et al., 2021; Bates et al., 2020), for which

$$\begin{aligned} T(X_j, X_{-j},Y;\hat{f})=(Y-\hat{f}(X_j, X_{-j}) )^2, \end{aligned}$$
(2)

where we use the notation \(\hat{f}(X_j, X_{-j}) = \hat{f}(X)\) to stress that \(\hat{f}\) is applied on the vector X with its original jth feature \(X_j\), whereas \(\hat{f}(\tilde{X}_j, X_{-j})\) is applied to the same feature vector except that the original \(X_j\) is replaced by its dummy copy \(\tilde{X}_j\). To handle the controlled feature selection problem, the HRT p-values \(p_j\), \(j=1,\dots ,d\) are most often plugged into the Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995), as discussed next.

2.2 FDR control via the BH procedure

Armed with a list of d p-values \(p_j\), one for each \(H_{0,j}\), we can apply the BH procedure (Benjamini & Hochberg, 1995) to control the FDR. This method operates as follows. First, sort the d p-values in non-decreasing order; we denote by \(p_{(k)}\) and \(H_{0,(k)}\) the kth smallest p-value and its corresponding null hypothesis, respectively. Then, for a given FDR target level q, find the largest \(k_0\) such that \(p_{(k_0)} \le \frac{k_0}{d}q\). Lastly, reject the null hypotheses \(H_{0,(1)}, H_{0,(2)}, \dots , H_{0,(k_0)}\). Importantly, the procedure described above is guaranteed to control the FDR under the assumption that the p-values are independent. While the p-values we construct in this work do not necessarily satisfy the independence assumption, the BH procedure often controls the FDR empirically even when the p-values are dependent (except in adversarial cases). This empirical observation is also supported by our experiments, in line with many other methods that plug the dependent p-values generated by the CRT/HRT into BH (Tansey et al., 2021; Candès et al., 2018). With that said, it is worth noting that there exists a variant of BH that provably controls the FDR under any arbitrary dependency structure of \(p_j,\) \(j=1,\dots ,d\); however, this comes at the cost of reduced power (Benjamini & Yekutieli, 2001).
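A minimal NumPy sketch of the BH step-up rule described above (illustrative code, not tied to any particular implementation); feeding it the HRT p-values yields the selected set \(\hat{\mathcal {S}}\).

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.2):
    """Indices of the hypotheses rejected by the BH step-up procedure at level q."""
    pvals = np.asarray(pvals)
    d = pvals.size
    order = np.argsort(pvals)                        # sort p-values in increasing order
    passed = np.nonzero(pvals[order] <= q * np.arange(1, d + 1) / d)[0]
    if passed.size == 0:
        return np.array([], dtype=int)               # no rejections
    k0 = passed[-1] + 1                              # largest k with p_(k) <= (k/d) * q
    return order[:k0]                                # reject the k0 smallest p-values
```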

3 The proposed method

In this section, we introduce our MRD approach to fit predictive models with the goal of explicitly improving the power of model-X tests. We begin by describing the ideas that form the theoretical motivation. Then, we present a general learning scheme to fit an arbitrary predictive model (e.g., a neural network) with the MRD objective (5). Lastly, we design a specialized learning scheme for sparse linear regression models (e.g., the lasso).

3.1 Main idea

A central definition in our proposal is that of the risk discrepancy, expressed as

$$\begin{aligned} \text {RD}_j = \mathbb {E}[T(Y,\tilde{X}_j,X_{-j}; \hat{f}) - T(Y,X_j,X_{-j}; \hat{f})], \end{aligned}$$
(3)

where \(\hat{f}\) is a fixed (pre-trained) predictive model, and \(T(\cdot )\) is the squared error (2). In the context of the HRT (Sect. 2.1), a test with good power would have a large and positive empirical risk discrepancy for the non-null features \({j \in \mathcal {H}_1}\). Therefore, during training, we wish to encourage the model \(\hat{f}\) to maximize \(\text {RD}_j\) only for that group of features. While this is infeasible because we do not have a priori knowledge of which features are in \(\mathcal {H}_1\), we do have access to the sampling distribution of the conditional null through the sampling of the dummy copies \(\tilde{X}_j\). This property is formally presented in the following well-known result in the model-X literature (Candès et al., 2018), which we will use later to formulate our loss.
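In practice, \(\text {RD}_j\) is estimated by its empirical counterpart on held-out data; a minimal sketch with the squared-error statistic (function and argument names are illustrative):

```python
import numpy as np

def empirical_rd(model, X, Y, dummy_col, j):
    """Empirical counterpart of RD_j in (3), with the squared-error statistic:
    the mean squared error with the dummy column minus that with the original one."""
    X_swapped = X.copy()
    X_swapped[:, j] = dummy_col                     # replace X_j by its dummy copy
    mse_dummy = np.mean((Y - model.predict(X_swapped)) ** 2)
    mse_real = np.mean((Y - model.predict(X)) ** 2)
    return mse_dummy - mse_real                     # large positive value: evidence that j is non-null
```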

Proposition 1

(Candès et al., 2018) Take \((X,Y) \sim P_{XY}\), and let \(\tilde{X}_j\) be drawn independently from \(P_{X_j \mid X_{-j}}\) without looking at Y. If \(Y \perp \!\!\! \perp X_j \mid X_{-j}\), then \((Y,\tilde{X}_j,X_{-j}) \overset{d}{=}\ (Y,{X}_j,X_{-j})\).

Above, the notation \(\overset{d}{=}\) stands for equality in distribution. For completeness, the proof of the above statement is given in Supplementary Section A. A consequence of Proposition 1, important to our purposes, is given in the next corollary; see also König et al. (2021).
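Before stating the corollary, it may help to spell out the key step behind Proposition 1 (written with densities for simplicity; the full proof is in Supplementary Section A):

$$\begin{aligned} P_{Y,\tilde{X}_j,X_{-j}}(y,x_j,x_{-j}) = P_{Y,X_{-j}}(y,x_{-j})\, P_{X_j \mid X_{-j}}(x_j \mid x_{-j}) = P_{Y,X_{-j}}(y,x_{-j})\, P_{X_j \mid X_{-j},Y}(x_j \mid x_{-j},y) = P_{Y,X_j,X_{-j}}(y,x_j,x_{-j}), \end{aligned}$$

where the first equality holds because \(\tilde{X}_j\) is drawn from \(P_{X_j \mid X_{-j}}\) without looking at Y, and the second equality is exactly the null hypothesis \(X_j \perp \!\!\! \perp Y \mid X_{-j}\).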

Corollary 1

Let \(\hat{f}\) be a fixed predictive model, and \(T(\cdot ; \hat{f})\) be any test statistic. Under the setting of Proposition 1, if \((Y,\tilde{X}_j,X_{-j}) \overset{d}{=}\ (Y,{X}_j,X_{-j})\), then \(\mathbb {E}[T(Y,\tilde{X}_j,X_{-j}; \hat{f}) - T(Y,X_j,X_{-j}; \hat{f})]=0\).

This motivates our formulation of a training procedure that drives the model toward a solution in which the empirical risk discrepancy (evaluated on the training data) is maximized for all features, knowing that this quantity is guaranteed to be small for the null features at test time. In fact, in Proposition 2, which is presented formally in Sect. 3.3, we prove that for a linear model with uncorrelated features, the solution that minimizes the population version of our objective function (described next) has estimated regression coefficients that are exactly zero for the null variables.

3.2 A general learning scheme

Algorithm 1

We now formalize the idea presented above. Given a training set of m i.i.d. samples \(\{(X^i,Y^i)\}_{i=1}^m\), our procedure begins by generating a dummy copy \(\tilde{X}^i_j\) for each sample and feature:

$$\begin{aligned} \tilde{X}^i_j \sim P_{X_j \mid X_{-j}}(X^i_j \mid X^i_{-j}), \quad i=1,\dots ,m, \ \ j = 1 \dots d. \end{aligned}$$

Denote the collection of training features by \({\textbf {X}} \in \mathbb {R}^{m \times d}\), where \(X^i\) is the ith row in that matrix, and by \({\textbf {Y}} \in \mathbb {R}^{m}\) a vector that contains the response variables \(Y^i\) as its entries. Let \(\mathcal {J}_{\text {base}}({\textbf {Y}},{\textbf {X}}; f)\) be the objective function of the base predictive model, expressed as

$$\begin{aligned} \mathcal {J}_{\text {base}}({\textbf {Y}},{\textbf {X}}; f) = \frac{1}{m} \sum _{i=1}^m \ell (f(X^i),Y^i) + \alpha \mathcal {R}(f). \end{aligned}$$
(4)

Above, the loss function \(\ell\) measures the prediction error (MSE in our experiments), \(\mathcal {R}\) is a regularization term, and the hyperparameter \(\alpha\) trades off the importance of the two. With this notation in place, our proposal aims to minimize the following cost function:

$$\begin{aligned} \hat{f}(x) = \underset{f\in \mathcal {F}}{\text {argmin}} \ (1-\lambda ) \mathcal {J}_{\text {base}}({\textbf {Y}},{\textbf {X}}; f) + \frac{\lambda }{d} \sum _{j=1}^{d} {\mathcal {D}( z, \tilde{z}_j)}, \end{aligned}$$
(5)

where \(z=\frac{1}{m} \sum _{i=1}^m {T(Y^i,X_j^i,X_{-j}^i;f)}\) and \(\tilde{z}_j=\frac{1}{m} \sum _{i=1}^m {T(Y^i,\tilde{X}_j^i,X_{-j}^i;f)}\) are the empirical risks —MSE in our context since \(T(\cdot )\) is the squared error (2)—of the model evaluated on the original and synthetic triplets, respectively. The function \(\mathcal {D}(z, \tilde{z}_j)\in \mathbb {R}\) measures the difference between z and \(\tilde{z}_j\), such that \(\mathcal {D}(z, \tilde{z}_j)\) gets smaller as \(\tilde{z}_j\) becomes larger than z. Therefore, (5) seeks a model \(\hat{f}\) that aligns with the base objective \(\mathcal {J}_{\text {base}}\) while creating a better separation between z and \(\tilde{z}_j\). The parameter \(\lambda \in [0,1]\) sets the relative weight between the two functionals; we present a simple, automatic approach for tuning this parameter in Supplementary Section D. In practice, we define \(\mathcal {D}\) as follows: \(\mathcal {D}(z,\tilde{z}_j)=\sigma (z-\tilde{z}_j),\) where \(\sigma (\cdot )\) is the Sigmoid function, i.e., \(\sigma (s)=1/(1+e^{-s})\). Of course, other forms of this function, such as the ratio \(z / \tilde{z}_j\), can be considered.

For ease of reference, Algorithm 1 presents a general approach for minimizing (5). In short, we use stochastic optimization and back-propagation, where each gradient step updates the model parameters after resampling \(\tilde{X}_j\) for a subset of the features in \(\{1,\dots ,d\}\). We sample new dummies at each iteration to reduce algorithmic randomness, which, in turn, increases the stability of the procedure. In Supplementary Section B we provide a weak form of convergence guarantee for this algorithm, where the analysis requires \(\mathcal {D}(\cdot )\) to be bounded from below, a requirement satisfied by our choice of the sigmoid function. Algorithm 1 is used in our experiments, implemented with a neural network as the base predictive model.
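To illustrate Algorithm 1, here is a minimal PyTorch sketch of a single stochastic gradient step on (5) with \(\mathcal {D}(z,\tilde{z}_j)=\sigma (z-\tilde{z}_j)\); the function name, the dummy sampler, the subset size, and the omission of the regularizer \(\mathcal {R}(f)\) are simplifications for illustration and do not reflect the released package.

```python
import torch

def mrd_step(model, optimizer, X, Y, sample_dummy, lam=0.5, n_feats=10):
    """One stochastic gradient step on (5) with D(z, z_j~) = sigmoid(z - z_j~).
    `sample_dummy(X, j)` returns a freshly drawn dummy column for feature j."""
    optimizer.zero_grad()
    z = torch.mean((model(X).squeeze(-1) - Y) ** 2)        # empirical risk, real features
    subset = torch.randperm(X.shape[1])[:n_feats]          # random subset of features
    mrd = 0.0
    for j in subset.tolist():
        X_tilde = X.clone()
        X_tilde[:, j] = sample_dummy(X, j)                 # resample the dummy at each step
        z_j = torch.mean((model(X_tilde).squeeze(-1) - Y) ** 2)
        mrd = mrd + torch.sigmoid(z - z_j)                 # smaller when z_j~ exceeds z
    loss = (1 - lam) * z + (lam / len(subset)) * mrd       # regularizer R(f) omitted for brevity
    loss.backward()
    optimizer.step()
    return float(loss)
```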

Remark 1

Algorithm 1 implements the RD penalty for all features simultaneously, thereby avoiding the need to fit a different model for each feature separately. This algorithm can be modified to fit a model with an RD penalty defined for any subset of features, and in particular to a specific feature \(X_j\), as we demonstrate in Supplementary Section G.2.

3.3 A scheme for sparse linear regression

Next, we offer an optimization procedure tailored for the choice of sparse linear regression as the base objective, since it is widely used in the feature selection literature and often results in a powerful test (Barber & Candès, 2015; Tansey et al., 2021). To this end, consider the combination of the elastic net objective (Zou & Hastie, 2005) with our MRD loss:

$$\begin{aligned} \begin{array}{ll} \hat{\beta } = \underset{\beta }{\text {argmin}} \ (1-\lambda )\bigg (\overbrace{\frac{1}{2m} || {\textbf {X}}\beta - {\textbf {Y}}||^2_2 + \alpha _1 ||\beta ||_1 + \frac{\alpha _2}{2}||\beta ||_2^2}^{\mathcal {J}_{\text {base}}({\textbf {Y}},{\textbf {X}}; f)}\bigg ) + \frac{\lambda }{d} \sum _{j=1}^{d} {\mathcal {D}( z(\beta ), \tilde{z}_j(\beta ))}, \end{array} \end{aligned}$$

where \(\beta \in \mathbb {R}^d\) is the regression coefficient vector, and \(\alpha _1,\alpha _2\) are the elastic net penalty parameters. Above, we use the notations \(z(\beta )\) and \(\tilde{z}_j(\beta )\) to stress that z and \(\tilde{z}_j\) are functions of \(\beta\). Using variable splitting, we can separate the sparsity-promoting penalty from the remaining terms, and solve the resulting optimization problem using the alternating direction method of multipliers (ADMM; Boyd et al., 2011); see Supplementary Section C for more details. Following Supplementary Algorithm 2, the advantage of the above splitting strategy is that the minimization with respect to \(\beta\) has a closed-form solution, being a simple projection onto the elastic net ball (Parikh & Boyd, 2014; Yu, 2013). In practice, we run this iterative algorithm until the stopping criterion suggested by Boyd et al. (2011, Section 3.3) is met.
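As a rough illustration of the splitting idea (the exact subproblems of Supplementary Algorithm 2 may be arranged differently), one may introduce a copy \(\gamma = \beta\), isolate the non-smooth \(\ell _1\) term in \(\gamma\), and iterate the scaled-form ADMM updates

$$\begin{aligned} \beta ^{t+1}&= \underset{\beta }{\text {argmin}} \ (1-\lambda )\Big (\frac{1}{2m} || {\textbf {X}}\beta - {\textbf {Y}}||^2_2 + \frac{\alpha _2}{2}||\beta ||_2^2\Big ) + \frac{\lambda }{d} \sum _{j=1}^{d} {\mathcal {D}( z(\beta ), \tilde{z}_j(\beta ))} + \frac{\rho }{2}||\beta - \gamma ^{t} + u^{t}||_2^2, \\ \gamma ^{t+1}&= \underset{\gamma }{\text {argmin}} \ (1-\lambda )\alpha _1 ||\gamma ||_1 + \frac{\rho }{2}||\beta ^{t+1} - \gamma + u^{t}||_2^2, \qquad u^{t+1} = u^{t} + \beta ^{t+1} - \gamma ^{t+1}, \end{aligned}$$

where the \(\gamma\)-step reduces to coordinate-wise soft-thresholding and the dual variable u enforces the consensus \(\beta = \gamma\).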

We conclude this section by showing that under the assumptions of Proposition 2, the estimated coefficients of the null features \(\hat{\beta }_j, j \in \mathcal {H}_0\) obtained by minimizing the MRD objective will be equal to zero. The proof is provided in Supplementary Section A.

Proposition 2

Suppose that the population model is linear \(Y=X^T\beta +\epsilon\), where \(\mathbb {E}[X_jX_k]=0\) for \(1 \le k \ne j \le d\), and \(\mathbb {E}[\epsilon X_j] = 0\) for all \(1\le j \le d\). Denote by \(\hat{\beta }\) a solution to the infinite-data version of the MRD:

$$\begin{aligned}\hat{\beta } = \,& \underset{\beta }{\text {argmin}} \ (1-\lambda )\mathbb {E}{[(X^T\beta - Y)^2]}\, + \\ & \quad \frac{\lambda }{d} \sum _{j=1}^d {\sigma (\mathbb {E}[T(Y,\tilde{X}_j,X_{-j}; \beta ) - T(Y,X_j,X_{-j}; \beta )])}. \end{aligned}$$

Under the above assumptions, for any \(0 \le \lambda < 1\), the solution \(\hat{\beta }\) satisfies \(\hat{\beta }_j = 0\) for all \(j \in \mathcal {H}_0\), i.e., for all j with \(\beta _j=0\).

4 Related work

4.1 Handling unknown conditionals

Our proposed method inherits the model-X assumption that the distribution of the features \(P_X\)—required to sample \(\tilde{X}_j\)—is known: this assumption is needed to guarantee that (1) is a valid p-value (Candès et al., 2018). However, in most real-world applications \(P_X\) is unknown and thus must be estimated from the data. Here, dummy features of poor quality may break the validity of model-X randomization tests (Romano et al., 2019; Tansey et al., 2021; Sudarshan et al., 2021), including our proposal. To alleviate this concern, model-X tests are equipped with various methodologies to estimate \(P_X\), including a second-moment multivariate Gaussian estimator (Candès et al., 2018), mixture density networks (Tansey et al., 2021), generative adversarial networks (Bellot & van der Schaar, 2019), moment matching networks (Romano et al., 2019), and more (Sesia et al., 2019; Sudarshan et al., 2020; Gimenez et al., 2019; Sudarshan et al., 2021). Furthermore, model-X tests are also equipped with diagnostic tools to improve the confidence in the results obtained when the conditionals are unknown. These tools include goodness-of-fit tests, designed to quantify the quality of the generated dummies, as well as controlled semi-synthetic experiments with simulated response variables that allow the user to compare the obtained empirical FDR with the desired level (Romano et al., 2019, Section 5).

Importantly, empirical evidence shows that model-X tests are robust in the sense that the FDR is often controlled in practice, especially when the testing procedure is applied with well-approximated dummy features. This is also corroborated by our experiments on real and simulated data: we obtain FDR control in practice even when using the simplest Gaussian approximation to the unknown \(P_X\), highlighting the robustness to model misspecification of the HRT in general, and of our proposal in particular.

4.2 Relevant prior art

Our proposal is connected to the two-stage “learn-then-test” approach, initiated by the influential work of Gretton et al. (2012) in the context of kernel-based tests. The original idea here is to optimize the kernel parameters in order to maximize the power of the underlying test, and ample evidence shows that this approach is extremely fruitful (Gretton et al., 2012; Liu et al., 2016; Chwialkowski et al., 2016). These contributions are in line with our proposal in that they suggest learning techniques to improve power; however, ours is different as it is tailored to work in combination with model-X tests.

Recently, new deep learning methods for feature selection have been proposed to improve data analysis with non-linear interactions. In particular, Borisov et al. (2019), Lemhadri et al. (2021), Yamada et al. (2020) offer sparsity-promoting layers that can be combined with arbitrary neural network architectures. These techniques can be combined with our MRD framework, as demonstrated in Sect. 5, where we utilize the CancelOut regularizer developed by Borisov et al. (2019). A different but related line of work is that of instance-wise interpretability methods, which aim to identify a subset of relevant features for a single data point (Chen et al., 2018; Yoon et al., 2019). While this line of work is outside the scope of this paper, we believe these methods may benefit from our MRD approach, for instance by combining the MRD with the test proposed by Burns et al. (2020), which is capable of finding instance-wise features that a given predictive model considers important.

5 Experiments

In this section, we study the performance of the proposed MRD approach in a wide variety of settings. These include experiments with linear and nonlinear simulated data of varying signal strength and sample size, model misspecification experiments, and similar experiments with semi-synthetic and real data. The conclusion of this comprehensive study is that the MRD approach consistently improves the power of HRT, more significantly in low power regimes.

5.1 Synthetic experiments

To evaluate the performance of our methods, we implement variable selection in a fully controlled synthetic setting in which the distributions of X and \(Y \mid X\) are known. This experimental setting allows us to study the validity and statistical efficiency of a wide variety of predictive models \(\hat{f}\) integrated with the HRT. Following Candès et al. (2018), Romano et al. (2019), Lu et al. (2018), we generate \(X\in \mathbb {R}^d\) of dimension \(d=100\) from a multivariate Gaussian distribution \(\mathcal {N}(0,\Sigma )\) with \(\Sigma _{i,j}=\rho ^{|i-j|}\) and an auto-correlation parameter \(\rho\). The response \(Y \in \mathbb {R}\) is sampled from a sparse regression model of the form

$$\begin{aligned} Y = g(X) + \epsilon , \end{aligned}$$
(6)

where \(\epsilon \sim \mathcal {N}(0,1)\) is the noise component and g is a link function. Throughout the synthetic experiments we consider the following conditional models of \(Y \mid X\):

  • M1: linear model, with \(g(X)= X^{\top }\beta\), where \(\beta \in \mathbb {R}^d\) has \(30\%\) randomly chosen non-zero entries of a random sign and a magnitude equal to c.

  • M2: polynomial model; following Lu et al. (2018), we set a nonlinear \(g(X)= (X^{\top }\beta )^3/2\), where \(\beta\) is chosen as described in M1.

  • M3: sum of sines is another nonlinear model we explore, with \(g(X) = \sum _{j \in \mathcal {H}_1} {\sin {(X_j\beta _j)}}\). Here, \(X_j\) and \(\beta _j\) are the jth entries of X and \(\beta\), respectively; \(\beta\) is defined as in M1.

  • M4: interaction model, a challenging nonlinear model with \(g(X) = \sum _{j=1}^{15} { X_{2j}X_{2j-1} }\).

Note that for data models M1-M3, a variable \(X_j\) is non-null, i.e., \(j \in \mathcal {H}_1\), if and only if \(\beta _j \ne 0\). In the same vein, M4 has 30 non-null variables: those that appear in the function g(X).
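For reference, a minimal NumPy sketch of this data-generating process for the AR(1) Gaussian design and models M1-M3 (function and variable names are illustrative):

```python
import numpy as np

def sample_synthetic(m, d=100, rho=0.25, c=0.14, model="M2", seed=0):
    """Draw (X, Y, beta) for the AR(1) Gaussian design and models M1-M3:
    Sigma_ij = rho**|i-j|, beta has 30% non-zero entries of random sign and magnitude c."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=m)
    beta = np.zeros(d)
    nonnull = rng.choice(d, size=int(0.3 * d), replace=False)
    beta[nonnull] = c * rng.choice([-1.0, 1.0], size=nonnull.size)
    if model == "M1":                                  # linear
        g = X @ beta
    elif model == "M2":                                # polynomial
        g = (X @ beta) ** 3 / 2.0
    else:                                              # M3: sum of sines over the non-nulls
        g = np.sin(X[:, nonnull] * beta[nonnull]).sum(axis=1)
    Y = g + rng.standard_normal(m)                     # epsilon ~ N(0, 1), cf. (6)
    return X, Y, beta
```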

To apply the test, we first draw m training examples from \(P_{XY}\), which are used to fit a predictive model of choice. We then generate an additional independent set of m test points, which are used to compute a p-value for each of the hypotheses \(H_{0,j}, j=1,\dots ,d\). We use the lasso, the elastic net, and a neural network as baseline models, where the latter is deployed only in the non-linear settings. See Supplementary Section D for additional information on the training strategy and choice of hyperparameters. We apply our MRD algorithm in combination with the above methods using the same choice of hyperparameters. In addition, we set the RD penalty parameter \(\lambda\) to be proportional to the validation error of the base predictive model; Supplementary Section D describes how we evaluate this error for each base model. All models are fit after normalizing the features and response variables to have zero mean and unit variance.

5.1.1 An illustrative example

We begin with a synthetic experiment that demonstrates the effect of our proposed training scheme on the power of the selection procedure. To this end, we use \(m=400\) samples for training, fix the auto-correlation parameter \(\rho\) to be 0.1, and construct \(Y \mid X\) that follows a polynomial data generating function (M2) with a signal amplitude \(c=0.14\).

As a baseline for reference, we fit a lasso regression model on the training set using ADMM (Boyd et al., 2011), where the lasso penalty parameter is tuned via cross-validation. To analyze the influence of the model on the power of the HRT, we measure the extent to which the test mean squared error (MSE) increases after replacing \(X_j\) by its dummy copy \(\tilde{X}_j\); this difference serves as an empirical estimate of \(\text {RD}_j\) (3), which we denote by \(\widehat{\text {RD}}_j\). Recall that in the context of the HRT, a positive \(\widehat{\text {RD}}_j\) serves as evidence against the null, since \(\tilde{t}_j > t^*\) in (1). The left panel of Fig. 1 presents selected quantiles of \(\{\widehat{\text {RD}}_j: j\in \mathcal {H}_1\}\) and \(\{\widehat{\text {RD}}_j: j\in \mathcal {H}_0\}\) as a function of the ADMM iterations, where the \(\widehat{\text {RD}}_j\) scores are evaluated on an independent test set. To reduce random variation, the presented \(\widehat{\text {RD}}_j\) are averaged over 100 independent train and test sets. As can be seen, the quantiles that represent the non-null set increase as the training progresses, demonstrating an improvement in the sensitivity of the model to the replacement of these features by their dummy copies. Notice that the lower 25th percentile reaches a final value that is relatively close to the scores that correspond to the null set \(\{\widehat{\text {RD}}_j: j \in \mathcal {H}_0\}\). Another important observation is that most of the null scores are approximately equal to zero, implying that swapping these features with their dummy copies does not lead to significant variations in the test statistics.

Next, we fit the proposed MRD lasso model on the same 100 independent training data sets. The test error obtained by our model is similar to that of the baseline approach; both result in an average root mean squared error (RMSE) of 0.94. Also, similarly to the baseline model, the quantiles of the null \(\widehat{\text {RD}}_j\) scores are close to zero, as shown in the right panel of Fig. 1. This phenomenon exemplifies Corollary 1. Turning to the non-null set, we can see that all the quantiles of \(\{\widehat{\text {RD}}_j: j\in \mathcal {H}_1\}\) obtained by our MRD model increase with the training iterations, reaching final values that are higher than those of the baseline approach. This advantage is corroborated by comparing the power of the selection procedure, applied with a target FDR level of \(q=0.2\). The empirical power (averaged over 100 independent experiments) obtained by the MRD model is 0.521, higher than that of the baseline model, which equals 0.453. In addition, each method results in an empirical FDR that falls below the nominal rate. Here, the MRD lasso is less conservative, obtaining an FDR of 0.097, whereas the base lasso model results in 0.054.

We also present the regression coefficients obtained by lasso and MRD lasso in Fig. 6 in Supplementary Section F. In essence, the regression coefficients that correspond to the non-null features obtained by MRD lasso have higher absolute values than those obtained by lasso. By contrast, the regression coefficients that correspond to the null features remain close to zero for both lasso and MRD lasso.

Fig. 1

Empirical risk (MSE) difference evaluated on test data as a function of the training iterations. Left: baseline lasso model. Right: the proposed MRD method. The red (lasso) and blue (proposed method) curves represent the 25th, 50th, and 75th percentiles of the \(\{\widehat{\text {RD}}_j: j \in \mathcal {H}_1\}\). The green (lasso) and orange (proposed method) color-shaded curves represent the 75th percentile of \(\{\widehat{\text {RD}}_j: j \in \mathcal {H}_0\}\). Each point is evaluated by averaging \(\widehat{\text {RD}}_j\) over 100 independent data sets

5.1.2 Experiments with varying signal strength

In the illustrative example presented above we focused on a fixed signal amplitude and sample size, as well as a single base predictive model (i.e., the lasso). We now study the effect of the MRD scheme more systematically, by varying the signal-to-noise ratio and including additional predictive models. We set \(m=400\) as before, but increase the auto-correlation parameter to \(\rho =0.25\).

We begin with the polynomial model M2 for \(Y \mid X\), where we control the signal-to-noise ratio by varying the signal magnitude c. The empirical power (\(q=0.2\)) evaluated for each predictive model is summarized in Table 1. As displayed, our MRD approach consistently improves the power of all base models, where the gain is more significant in the low-power regime. Notice that the largest improvement is achieved when using the lasso as a base model, and the smallest gain is obtained for the neural network model (referred to as NNet). In terms of the actual power obtained by each method, we typically observe the following ordering in this experiment: MRD elastic net > MRD lasso > elastic net > lasso > MRD neural network > neural network. Supplementary Table 5 presents the FDR and root mean squared error (RMSE) obtained for each machine learning model, showing that the MRD loss tends to achieve an RMSE similar to that of the base models. This table also includes comparisons to kernel ridge regression and random forests, concluding that our MRD elastic net is the most powerful method in this experiment. Turning to the validity of the selection procedure, following Supplementary Table 5, we observe that all methods obtain an FDR below the nominal 20% level, across all data sets. Here, the MRD versions of the lasso and the elastic net are less conservative in the sense that they obtain a higher empirical FDR compared to their base models, whereas the MRD neural network model has an FDR comparable to that of its base model. An in-depth analysis of the above results is presented in Supplementary Section G.1. There, we present the empirical power and FDR as a function of the FDR target level q, as well as Q-Q plots which compare the quantiles of the produced p-values to those of a uniform distribution U[0, 1].

We also perform a similar analysis to the one presented above for the sum of sines model M3 as well as for the linear model M1. We summarize the results in Supplementary Section G.1 for the linear setting and in Supplementary Section G.4 for the non-linear one. The overall trend in both settings is similar to the one presented above: the MRD approach improves the power of the base models, and the FDR is controlled empirically. Lastly, Supplementary Section G.2 demonstrates that the MRD approach outperforms the baseline methods when testing for a single non-null hypothesis for a wide range of signal strength values. This experiment isolates the effect of the MRD penalty on the resulting power compared to the FDR experiments that account for multiplicity by design.

Table 1 Synthetic non-linear data with varying signal strength c

5.1.3 Experiments with varying sample size

Thus far we only considered data with a fixed number of samples. In what follows, we study the performance of the proposed MRD approach as a function of the sample size n. One can view this as another way to modify the signal-to-noise ratio in the data, as the power is expected to increase with the sample size. We focus on the polynomial model M2 for \(Y \mid X\) with a correlated design in which the auto-correlation parameter is \(\rho =0.25\). We generate data with a fixed signal strength \(c=0.1\), so that the power is relatively low for the smallest sample size n. Table 2 summarizes the results, where we deploy only lasso and MRD lasso to ease the computational load. The table shows that the MRD approach consistently improves the power of the baseline model, while keeping the FDR under control.

Table 2 Synthetic experiments with non-linear data (M2) with varying number of samples n

5.1.4 Experiments with interaction model

In the experiments presented thus far, the lasso and the elastic net were more powerful than neural networks, even when applied to nonlinear data. In fact, this phenomenon is in line with the experiments reported in Tansey et al. (2021, Section 4). We now conduct experiments with a sparse interaction model for the data (M4), for which linear predictive models result in zero power, in contrast to neural networks. The results are summarized in Fig. 2; we do not include linear predictive models as they did not make any discovery (Footnote 2). Notice that the MRD version of the neural network model achieves higher power than the base one, while controlling the FDR.

Fig. 2

Synthetic experiments on a sparse interaction model (M4) with varying signal amplitude. Power and FDR (\(q=0.2\)) are evaluated on 50 independent trials

5.1.5 Experiments with cross-validation HRT

A limitation of the HRT is the reliance on data splitting, which can lead to a power loss especially when the number of samples is limited. When data is scarce, it is more sensible to deploy the cross-validation version of the HRT (CV-HRT) proposed by Tansey et al. (2021, Section 3.1). The CV-HRT leverages the whole data for training and testing, resulting in improved power compared to a single split.

In what follows, we study the effect of the MRD approach on the CV-HRT in a small sample size regime, where \(n<d\). We explore both linear (M1) and nonlinear (M2) data-generating functions, in which we set \(c=1.5\) and the auto-correlation parameter \(\rho =0.25\). We implement the CV-HRT with \(K=8\) folds and use only the lasso as the base predictive model to ease the computational load. The results are summarized in Table 3, showing that (i) the power increases with n; (ii) the MRD lasso consistently outperforms the baseline method; and (iii) the FDR is controlled (\(q=0.2\)).

In Supplementary Section G.7 we also compare our method to the model-X knockoffs (Candès et al., 2018) and to the state-of-the-art dCRT (Liu et al., 2020) framework. Model-X knockoffs is a multiple testing framework for conditional independence that provides finite-sample FDR control under an arbitrary structure of (X, Y). The dCRT, a test for conditional independence, is a clever technique to reduce the computational cost of the original CRT while avoiding data splitting. In a nutshell, this is done by distilling the high-dimensional information that \(X_{-j}\) contains on Y into a low-dimensional representation before applying the CRT. Yet, the complexity of the dCRT is often higher than that of the CV-HRT, as the former requires fitting d different predictive models—one for each leave-one-covariate-out version of the data—compared to K leave-fold-out models in the CV-HRT. Section G.7 shows that, once the CV-HRT is combined with our MRD approach, we achieve power competitive with that of the dCRT and higher power than the model-X knockoffs.

Table 3 Synthetic experiments with linear (M1) and non-linear (M2) data with \(n < d\)

5.1.6 Experiments under model misspecification

In the experiments presented thus far we sample the dummy features \(\tilde{X}_j\) from the true distribution of \(X_j \mid X_{-j}\). In practice, however, we often only have access to an approximation of the unknown conditionals \(X_j \mid X_{-j}\). In Supplementary Section G.5, we include experiments under such model misspecification. There, we fit our MRD models and apply the HRT by sampling approximated dummy features from a data-driven estimate of \(P_X\), rather than from the true one. We commence by conducting an experiment in which we control the accuracy of the estimated conditionals. This experiment demonstrates that an inflation of the empirical FDR occurs only when the sampled dummies are of poor quality, as measured by the covariance goodness-of-fit diagnostic proposed in Romano et al. (2019). Next, we study three different settings of increasing difficulty. In the first, X is generated from a correlated multivariate Gaussian, and we sample the approximated dummy features from a multivariate Gaussian whose parameters are estimated from the data. In the second, X is sampled from a Gaussian mixture model (GMM), whereas the dummies are sampled from a fitted multivariate Gaussian as described above. In the third, we sample X from a correlated multivariate Student t-distribution. When using a naive multivariate Gaussian fit for \(P_X\), we report a violation of FDR control since the estimated conditionals are of low quality. However, when applying a more flexible density estimation method—a Gaussian mixture model (Reynolds, 2009)—our method is shown to control the FDR in practice, supporting the discussion in Sect. 4.1.

5.1.7 The effect of the MRD penalty parameter

In the interest of space, we refer the reader to Supplementary Section E, which illustrates the power of the test as a function of the MRD parameter \(\lambda\) in low, medium, and high power regimes. This section shows that our automatic approach for selecting \(\lambda\) achieves power that is relatively close to the best achievable. It also shows that, in the high power regime (\(\approx 0.9\)), an inappropriate choice of \(\lambda\) can reduce power compared to the base model. By contrast, in the low (\(\approx 0.5\)) and medium (\(\approx 0.7\)) power regimes, the power of the MRD approach is higher than the power of the base model (i.e., \(\lambda =0)\) for all presented values of \(\lambda >0\).

5.2 Real-world application

We now apply our methods to a real-world application that already appears in the knockoff literature (Barber & Candès, 2015; Romano et al., 2019). Here, the task is to reliably detect genetic mutations associated with changes in drug resistance among human immunodeficiency viruses (HIV) of type I (Rhee et al., 2006); we study the resistance to the Lopinavir protease inhibitor drug. The response variable Y represents the log-fold increase in drug resistance measured for a given virus sample, and each of the features \({X}_j \in \{0,1\}\) indicates the presence or absence of a specific mutation. After applying the pre-processing steps described by Romano et al. (2019), the data contains \(n=1555\) samples with \(d=150\) features. Supplementary Section H provides additional details about this data set.

Importantly, in contrast to the fully controlled setup of Sect. 5.1, here \(P_X\) is unknown and thus must be estimated. (Recall that in Supplementary Section G.5 we also study the robustness of MRD to model misspecification.) Here, we approximate \(P_X\) by fitting a multivariate Gaussian on all samples \(\{X^i\}_{i=1}^n\). While this estimate is most likely inaccurate because the features are binary, we follow Romano et al. (2019) and show via semi-synthetic experiments (for which \({Y \mid X}\) is known) that the selection procedure empirically controls the FDR when sampling \(\tilde{X}_j\) from the estimated distribution. Only then do we apply the selection procedure to the real data.
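For completeness, the multivariate Gaussian approximation and the induced conditional sampler can be sketched as follows (illustrative code, not the authors' implementation; it is compatible, e.g., with the `sample_dummy` argument sketched in Sect. 2.1):

```python
import numpy as np

def fit_gaussian_dummy_sampler(X_all):
    """Fit N(mu, Sigma) to the feature rows and return a sampler of X_j | X_{-j},
    using the standard conditional-Gaussian formula."""
    mu, Sigma = X_all.mean(axis=0), np.cov(X_all, rowvar=False)

    def sample(X, j, rng):
        idx = np.delete(np.arange(X.shape[1]), j)
        S_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
        w = Sigma[j, idx] @ S_inv                                  # regression weights
        cond_mean = mu[j] + (X[:, idx] - mu[idx]) @ w              # E[X_j | X_{-j}] per row
        cond_var = Sigma[j, j] - Sigma[j, idx] @ S_inv @ Sigma[idx, j]
        return cond_mean + np.sqrt(max(cond_var, 0.0)) * rng.standard_normal(X.shape[0])

    return sample
```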

Fig. 3

Semi-synthetic experiments with real HIV mutation features and a simulated non-linear response using the generating function M2. Empirical power and FDR with \(q=0.2\) are evaluated over 20 random train/test splits

Experiments with semi-synthetic data

We generate a semi-synthetic data set by simulating Y as described in (6) while treating the real X as fixed, using the polynomial non-linear generating function M2. Next, we randomly split the data into training and testing sets of equal size, and repeat the same experimental protocol, data normalization, and training strategy used in the synthetic experiments of Sect. 5.1. (For the neural network base model, we slightly modified the number of training epochs due to overfitting; further details are in Supplementary Section D.) Figure 3 displays the FDR and power obtained by lasso and elastic net as a function of the signal amplitude; a similar figure that describes the results obtained by the base and MRD neural network models is presented in Supplementary Section H.2. Following Fig. 3, we can see that all methods control the FDR (\(q=0.2\)), demonstrating the robustness of the selection procedure to model misspecification. Regarding efficiency, our proposal consistently improves the power of the base models. In contrast to the fully synthetic experiments (see Table 1), here the gap between the power of the base model and its MRD version increases as a function of the signal amplitude. Additional semi-synthetic experiments with a linear model, which also explore the effect of the sample size, are presented in Supplementary Sections H.3 and H.4; both demonstrate that a similar gain in performance is consistently obtained by our MRD method.

Fig. 4

The average number of HIV drug-resistance mutations discovered by the base and MRD models

Experiments with full real data

Next, we perform variable selection on the real data, i.e., with the real (X, Y) pairs. Here, we only use the lasso and the elastic net as base models, and deploy the same training strategy and data-normalization step described in the semi-synthetic experiments. We repeat the variable selection pipeline for 1000 independent train/test splits and present the results in Fig. 4. As portrayed, our MRD models lead to more findings than the baseline methods across various FDR target levels, in line with the semi-synthetic experiment results. Here, the gap increases with the target FDR level.

6 Discussion

This paper presents a novel training framework to increase the power of the model-X randomization test. Our method has an advantage over the common practice, in which the model-fitting step is not tailored to maximize the power of the HRT. Focusing on FDR control, we demonstrate through experiments that our method consistently improves upon a wide range of base predictive models in various settings.

This paper advocates conditional independence testing as a statistical approach to avoid the selection of “spurious features”, i.e., null features that are only correlated with the non-null ones. Yet, conditional independence testing has an inherent limitation: it is fundamentally impossible to distinguish between nearly identical features. Put differently, it is inevitable that we would suffer from low power when analyzing data with extremely correlated features. Figure 19 illustrates this phenomenon, where we apply the HRT for multivariate Gaussian X with increasing correlation among the features, controlled by the auto-correlation coefficient \(\rho\); see Supplementary Section G.6 for more details. As portrayed, the power of the test decreases as \(\rho\) increases, making the selection procedure fundamentally harder. Observe that MRD lasso and MRD elastic net yield higher power than their vanilla counterparts, although the gap between them shrinks as the problem becomes fundamentally harder.

One way to deal with the hardness of this problem is to group (or even prune) features that are nearly identical in a given data set. For example, Sesia et al. (2019) apply hierarchical clustering to identify groups of features in such a way that the correlation between features from different clusters is not too high. Then, one can apply the selection procedure only on the cluster representatives, which are substantially less correlated by design. This strategy increases power, however, at the cost of reducing the resolution of the findings (Sesia et al., 2019): rather than testing at the level of individual features, the test is applied to groups of features among which there may be one or more important variables. We believe it will be of great importance to continue this line of work in combination with our MRD approach, and to offer practical tools that allow the use of conditional independence testing even for highly correlated data.

In a broader view of the controlled variable selection problem, the model-X knockoff filter (Candès et al., 2018) provides rigorous FDR control in finite samples under an arbitrary dependency structure of (X, Y). This stands in contrast with the BH procedure, which provably controls the FDR under independence of the p-values constructed by the HRT; see the discussion in Sect. 2.2. With that said, sampling the knockoff dummies is more challenging, and the model must be fit on the original features augmented with their knockoff copies, which increases the dimension of the problem. Therefore, extending our proposal to work with the knockoffs framework is not straightforward (to say the least) but, at the same time, of great interest.