
1 Introduction

In recent years the classical problem of variable selection has enjoyed renewed attention thanks to the massive growth of high-dimensional data in many scientific disciplines. In modern statistical applications, the number of variables often exceeds the number of observations. In such contexts, the true model is often assumed to be sparse, meaning that only a small fraction of the variables are actually related to the response. The selection of the relevant variables is therefore of fundamental importance in the analysis of high-dimensional data.

Survival analysis deals with the time until one or more events occur. It is frequently used in economics, where the event of interest is the failure of companies (mainly due to bankruptcy) or the decision of customers to end their relationship with a company. In regression analysis of survival data, the Cox proportional hazards model, proposed by Cox in 1972 [2], is the most widely used tool for exploring the relationship between subjects’ survival and a set of explanatory variables.

As in linear regression models, traditional variable selection methods such as subset selection, forward selection, backward elimination, and stepwise combinations of the two are among the most commonly applied for choosing the set of relevant variables in the survival framework. However, these methods run into computational difficulties in the presence of high-dimensional data, and other methods have therefore been proposed to overcome this problem. The Lasso, first proposed for linear regression models [5], was later extended to the Cox model [6]. Subsequently, penalized shrinkage techniques, such as the SCAD introduced by [3] specifically for Cox models, have been developed. On one hand, the above variable selection methods have proved successful both in their theoretical properties and in numerous experiments. On the other hand, their performance depends heavily on the correct choice of the tuning parameter, and these approaches can be unstable, especially in the high-dimensional setting.

Among the problems encountered in identifying relevant variables, the choice of the best selector among those available is the most relevant. Unfortunately, the set of covariates selected by one method may differ from that selected by another. Although this might be seen as a disadvantage, analysing the differences and similarities among the various methods can provide useful information: for example, a covariate chosen by all methods can be considered truly relevant, while one selected by a single method may not be related to the response. To exploit this insight, following the idea of [7] for the linear model, we propose a method called Combined Variable Selector with Subsample (CVSS) that combines different variable selection procedures by means of subsampling. We record the percentage of times a covariate is selected across procedures and obtain the final set by identifying as relevant those covariates that are selected most frequently. The main difference between our procedure and [7] lies in the choice of the tuning parameter within the methods used: while in [7] the authors consider, for each method, several vectors of covariates selected under different penalty coefficients, we consider only the single coefficient vector corresponding to the best tuning parameter. We thus extract only one set of variables for each approach, with the advantage that the procedure becomes very fast.

The paper is organized as follows. In Sect. 2, we introduce our proposed approach. In Sect. 3, we show the simulation results. We conclude this work with a discussion in Sect. 4.

2 The Proposed Procedure

Suppose there are n observations \(\{(y_i, \mathbf {x}_i,\delta _i)\}_{i=1}^n\) of survival data. For individual i, \(y_i\) denotes the observed survival time and \(\mathbf {x}_i = (x_{i1}, x_{i2}, \dots , x_{ip})^T\) represents the observed values of the p covariates. The variable \(\delta _i \in \{0,1\}\) is a censoring indicator, where \(\delta _i=0\) means that \(y_i\) is right-censored. We also assume that the censoring mechanism is non-informative and independent of the event process. Let h(t) be the hazard rate at time t; the generic form of the Cox proportional hazards model can be expressed as

$$ h (t \mid \mathbf {x}) = h_0 (t) \exp (\mathbf {x}^T \boldsymbol{\beta }) $$

where \(\boldsymbol{\beta } = (\beta _1, \beta _2, \dots , \beta _p)^T\) denotes the p-dimensional vector of unknown regression coefficients and \(h_0(t)\) is the baseline hazard function, that is, the hazard function at time t when all covariates equal zero. In general, \(\boldsymbol{\beta }\) can be estimated by maximizing the partial likelihood function [2].
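For completeness, under the standard assumption of no tied event times, the partial likelihood has the well-known form

$$ L(\boldsymbol{\beta }) = \prod _{i:\, \delta _i = 1} \frac{\exp (\mathbf {x}_i^T \boldsymbol{\beta })}{\sum _{j \in R(y_i)} \exp (\mathbf {x}_j^T \boldsymbol{\beta })} $$

where \(R(y_i) = \{j : y_j \ge y_i\}\) is the risk set at time \(y_i\); since \(h_0(t)\) cancels out of this ratio, \(\boldsymbol{\beta }\) can be estimated without specifying the baseline hazard.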

In order to identify the set of truly relevant variables, one can use a penalized variable selection method among those proposed in recent years. For example, the Lasso is able to select the non-zero components in settings with large p, is computationally efficient, and uses an \(L_1\)-type penalty, while the SCAD is a regularized regression method with a non-convex penalty designed to reduce estimation bias. Although the literature offers several approaches for selecting variables in the presence of censored data, there is no unanimous consensus on which method outperforms the others, so how to select a method remains an open question. Since choosing one method rather than another influences the selection of relevant variables, it is very important to identify the best variable selection method for the data under analysis.
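As an illustration, such a fit takes only a few lines in Python; the sketch below assumes the lifelines package (any implementation of penalized Cox regression would do) and a hypothetical data frame with columns time, event and the covariates.

```python
# A minimal sketch of Lasso-penalized Cox regression, assuming the
# `lifelines` package; the column names "time" and "event" are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

def fit_lasso_cox(df: pd.DataFrame, penalizer: float) -> set:
    """Fit an L1-penalized Cox model and return the selected covariate names."""
    cph = CoxPHFitter(penalizer=penalizer, l1_ratio=1.0)  # l1_ratio=1 gives the Lasso
    cph.fit(df, duration_col="time", event_col="event")
    # Coefficients shrunk (numerically) to zero are treated as unselected.
    return {name for name, b in cph.params_.items() if abs(b) > 1e-8}
```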

To address this open question, we propose to apply different variable selection methods to subsampled data and to check for agreement among the different variable selectors. Combining models through subsampling has already been used to improve the variable selection performance of a single method. For example, RBVS proposed by [1] uses subsampling to identify the set of highly-ranked covariates, while Stability Selection proposed by [4] repeatedly subsamples observations, fits the subsampled data with a variable selection method (e.g. the Lasso), and keeps the covariates whose selection frequency exceeds a certain threshold.

Similarly to the methods above, our proposal fits variable selection methods to the subsampled data and identifies as non-zero components those covariates that appear most frequently. Unlike these approaches, however, our procedure employs several variable selection methods. Indeed, no single method outperforms all the others in every setting, since different variable selection methods optimize different objective functions; in the case of regularized regression, the difference among methods usually lies in the penalty term. If a covariate is selected by the majority of methods, it is chosen as a minimizer of many different objective functions, and we expect a truly relevant covariate to be chosen frequently regardless of the objective function used. We repeat the fitting on subsampled data to incorporate the variability in selection due to the variability in the data.

The proposed variable selection procedure can be summarized as follows. First, for each \(b = 1, \dots , B\), where B is the number of replicates, we draw mutually exclusive subsets \(I_{b1}, \dots , I_{br}\) of size m uniformly from \(\{1, \dots , n\}\) without replacement, where \(r = \lfloor n/m\rfloor \); the subsamples are drawn independently for each b. Second, we fit the different variable selection methods on the sets \(I_{b1}, \dots , I_{br}\) and collect the estimated models in \(\mathcal {M}\), where \(|\mathcal {M}|= r\times B \times k\) and k is the number of variable selectors used; for each subset and each procedure, we obtain a vector of estimated coefficients \(\hat{\boldsymbol{\beta }}\). Third, we measure the relative frequency with which the jth covariate is selected, given by

$$\hat{\tau }_j = \frac{1}{|\mathcal {M}|} \left( \sum _{M_i\in \mathcal {M} }I_{(\hat{\beta }_j^{M_i}\ne 0)}\right) $$

where \(\hat{\beta }_j^{M_i}\) is the estimated coefficient of the jth covariate in the fitted model \(M_i \in \mathcal {M}\), and \(I_{(\cdot )}\) is the indicator function. Fourth, we identify as relevant the variables in the set

$$\hat{S} = \{j : \hat{\tau }_j\ge q\} $$

where q is a fixed threshold. In practice, the number of replicates B should be large enough to stabilize the values of \(\hat{\tau }_j\) and, at the same time, small enough not to inflate the computational time. Following [1], we set \(r=2\) and \(B=50\), so we obtain 100 subsets, each with n/2 observations. In this paper we set \(q=1/2\), which means that covariates with \(\hat{\tau }_j\ge 1/2\) are selected.
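To make the four steps concrete, the following is a minimal sketch of CVSS, not a reference implementation: it assumes that each selector is a function mapping a subsample (a pandas data frame with the hypothetical time and event columns of the earlier sketch) to the set of selected covariate names, and the defaults mirror the choices above (\(B=50\), \(r=2\), \(q=1/2\)).

```python
import numpy as np
import pandas as pd
from typing import Callable, List, Set

Selector = Callable[[pd.DataFrame], Set[str]]

def cvss(df: pd.DataFrame, selectors: List[Selector],
         B: int = 50, r: int = 2, q: float = 0.5, seed: int = 0) -> Set[str]:
    """Combined Variable Selector with Subsample (sketch).

    For b = 1, ..., B, the n observations are split uniformly at random
    into r disjoint subsets of size m = n // r; every selector is fitted
    on every subset, and a covariate is kept if its overall selection
    frequency tau_j over all |M| = r*B*k fits is at least q.
    """
    rng = np.random.default_rng(seed)
    n, m = len(df), len(df) // r
    covariates = [c for c in df.columns if c not in ("time", "event")]
    counts = {c: 0 for c in covariates}
    n_models = 0
    for _ in range(B):
        perm = rng.permutation(n)
        for s in range(r):
            subsample = df.iloc[perm[s * m:(s + 1) * m]]
            for select in selectors:
                for c in select(subsample):
                    counts[c] += 1
                n_models += 1
    return {c for c in covariates if counts[c] / n_models >= q}
```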

The choice of the methods to be used within our procedure is based on the following considerations: each method must have good variable selection performance, and some variability among methods is required. In this article we choose the Lasso, MCP, SCAD, Elastic Net and Ridge, since they optimize different objective functions through different penalty terms; a sketch of how such selectors plug into the procedure follows below. Furthermore, these methods are computationally feasible in the high-dimensional setting.
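Continuing under the same assumptions, the selector list passed to cvss could be built from lifelines penalties, where l1_ratio equal to 1, an intermediate value, and 0 correspond to the Lasso, the Elastic Net and Ridge, respectively; MCP and SCAD require a dedicated non-convex solver and are omitted from the sketch, and the fixed penalizer is illustrative only, since in our procedure each method uses its best tuning parameter.

```python
# Illustrative only: the penalizer values are placeholders for the tuning
# parameter each method would actually choose (e.g. by cross-validation).
from functools import partial
import pandas as pd
from lifelines import CoxPHFitter

def fit_penalized_cox(df: pd.DataFrame, penalizer: float, l1_ratio: float) -> set:
    """Wrap a penalized Cox fit as a selector: data frame -> selected names."""
    cph = CoxPHFitter(penalizer=penalizer, l1_ratio=l1_ratio)
    cph.fit(df, duration_col="time", event_col="event")
    return {name for name, b in cph.params_.items() if abs(b) > 1e-8}

selectors = [
    partial(fit_penalized_cox, penalizer=0.1, l1_ratio=1.0),  # Lasso
    partial(fit_penalized_cox, penalizer=0.1, l1_ratio=0.5),  # Elastic Net
    # Ridge rarely yields exact zeros, so a coarser threshold may be needed.
    partial(fit_penalized_cox, penalizer=0.1, l1_ratio=0.0),  # Ridge
]
# selected = cvss(df, selectors)  # df as in the simulation sketch below
```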

3 Simulation Study

We compare variable selection performance across methods in terms of the number of false positives (FP), the number of false negatives (FN), the total number of variable selection errors (FP+FN), and the size of the selected set; a helper computing these quantities is sketched below. For comparison, we also consider other variable selection methods applied to the whole dataset: the Lasso, the Elastic Net, Ridge regression, the SCAD and the MCP.
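Given the true support and a selected set, these quantities are immediate to compute; the small helper below uses our own naming.

```python
from typing import Set, Tuple

def selection_errors(selected: Set[str], true_support: Set[str]) -> Tuple[int, int, int, int]:
    """Return (FP, FN, FP+FN, size of the selected set)."""
    fp = len(selected - true_support)  # selected but truly irrelevant
    fn = len(true_support - selected)  # relevant but missed
    return fp, fn, fp + fn, len(selected)
```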

Table 1. Simulation results for different combinations of \(\rho \), c and p. A dark grey cell represents the best result, while a grey one represents the worst. Standard errors are shown in parentheses

In our simulation study we generate survival times \(t_i\), \(i = 1, 2, \dots , n\), from exponential distributions with subject-specific rates \(h_i =h_0(t_i) \exp (\boldsymbol{\beta }^T \mathbf {x}_i)\), baseline \(h_0(t_i)=1\) and \(\boldsymbol{\beta } = (2_{5} , 0_{p-5})\), so that the true model size is \(s=5\). The covariates \(X_1, \dots , X_p\) are sampled from a multivariate normal distribution \(N(0, \varSigma )\), where the entries of \(\varSigma \) are fixed to \(corr(X_j, X_k) = \rho ^{|j-k|}\) with \(\rho \in \{0, 0.3, 0.6\}\). The censoring percentage c is set to \(20\%\) or \(40\%\). We set \(n = 150\) and \(p \in \{100, 200\}\). The results are shown in Table 1.
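A sketch of this design follows; since the censoring mechanism is not spelled out above, the independent exponential censoring (whose rate would be tuned to reach the target censoring percentage) is our assumption.

```python
import numpy as np
import pandas as pd

def simulate(n: int = 150, p: int = 100, rho: float = 0.3,
             censor_rate: float = 0.5, seed: int = 0) -> pd.DataFrame:
    """Simulate Cox data with AR(1)-correlated covariates (sketch)."""
    rng = np.random.default_rng(seed)
    beta = np.concatenate([np.full(5, 2.0), np.zeros(p - 5)])  # beta = (2_5, 0_{p-5})
    # AR(1) covariance: Sigma[j, k] = rho ** |j - k|
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    # Exponential event times with rate h_i = exp(beta' x_i), since h_0 = 1
    event_time = rng.exponential(1.0 / np.exp(X @ beta))
    # Assumed mechanism: independent exponential censoring times
    censor_time = rng.exponential(1.0 / censor_rate, size=n)
    time = np.minimum(event_time, censor_time)
    event = (event_time <= censor_time).astype(int)  # delta_i
    df = pd.DataFrame(X, columns=[f"X{j + 1}" for j in range(p)])
    df["time"], df["event"] = time, event
    return df
```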

In all scenarios our procedure achieves the best performance in terms of both the total error FP+FN and FP. When \(p=100\), the highest FP value for CVSS is 1.46, meaning that at most 1.46 of the variables identified as relevant are, on average, unrelated to the response. MCP is the only selector for which, in Setting 3, FN is not equal to zero: in that case its final set misses variables that are actually relevant in the model. Looking at the size, our procedure selects a number of covariates very close to the true size 5. As expected, the procedures with the highest FP (the Lasso, the Elastic Net and the Ridge) are also those that select a number of covariates well above s: as the total error increases, the size increases as well. While the other approaches suffer as the correlation increases, CVSS, the Lasso and the Elastic Net give better results in terms of selection performance. On the other hand, increasing the censoring percentage worsens the selection for all methods.

When \(p=200\), our procedure is still the best. Comparing the total error across the two values of p, FP+FN is actually lower when \(p=200\), a characteristic not shared by the competitors: other approaches, such as the Lasso and Ridge, suffer from the increase in the number of variables in the dataset. The size of the set selected by CVSS is the closest to the true size \(s=5\) in all scenarios, and its best performance occurs at the highest correlation value.

4 Conclusion

In this work we proposed a new method to choose the relevant covariates in high-dimensional survival data. Although survival analysis was initially used to study death as a specific event in medical studies, these statistical techniques have increasingly been applied in economics and the social sciences. Given the relevance of the topic, it is important to find a method that selects the variables truly related to the response as accurately as possible. In particular, we proposed to combine several variable selectors available in the literature through a subsampling technique. The simulation study has shown that our approach outperforms its competitors. In future work we will study this approach from a theoretical point of view and apply it to real data.