1 Introduction

Variable selection is an important topic in statistics since it can improve the accuracy and interpretability of the predictions given by estimated models. With large amounts of high-dimensional data emerging in many research and application areas, it has become an indispensable tool for solving various problems, so it is particularly important to perform variable selection effectively and efficiently. Popular methods include subset selection [1, 19], coefficient shrinkage [6, 7, 9, 25, 36], variable screening [8], Bayesian methods [12, 24] and so on. However, to be effective, variable selection generally requires careful tuning of some parameters of the analysis, and the correct specification of these parameters is a difficult task even for expert statisticians. For this reason, variable ranking (i.e., sorting the variables in terms of their importance in the prediction of the outcome) is often carried out first. Once the variables are properly ranked [10, 28, 34], selection can be achieved with a thresholding rule. In the present work, we follow the latter practice to perform variable selection.

Many scholars [20, 23, 28] have argued that variable selection serves two different objectives depending on whether the modelling purpose is prediction or interpretation. The former aims at a parsimonious model so that future data can be well forecast, i.e., at maximizing prediction accuracy. For the latter, analysts would like to identify the truly important variables (i.e., those having an actual influence on the outcome) among the numerous candidates, i.e., to maximize selection accuracy. Because predictive and explanatory models differ significantly, the corresponding variable selection approaches also differ. With selection accuracy as the target, we address variable ranking and selection in linear regression models by using ensemble learning techniques.

Ensemble learning, a widely used technique to enhance the performance of single learning machines, has been shown to be effective on a large range of prediction tasks [13, 18, 22]. Most existing ensemble methods, such as bagging and boosting, were developed to handle prediction problems better, and the resulting models can be called prediction ensembles (PEs). So far, some methods have been specially designed to construct variable selection ensembles (VSEs). With traditional methods such as the lasso [25] or a genetic algorithm [34], more variables than necessary are often picked out (i.e., the false positive rate is high). However, if the selection process is repeated over a number of trials using slightly different data, the frequency with which the truly important variables are chosen will be high, while the frequency with which unimportant ones are falsely deemed important will be low. As a result, the important variables can easily be distinguished from the remaining ones. This explains why ensemble learning is effective in the context of variable selection.

Like that of PEs, the process of creating a VSE generally consists of two steps: ensemble generation and ensemble integration. The first step addresses how to generate a series of accurate and diverse members; the second aims to fuse them in a suitable way so that selection accuracy is maximized. In the generation process, the usual practice is to apply a base learner (i.e., a variable selection method) to multiple different training sets or to inject some randomness into the learner. When fusing the results produced by the members, a simple averaging rule is commonly employed. Existing VSE approaches include the parallel genetic algorithm (PGA) [34], stability selection [17], random lasso [27], bagged stepwise search (BSS) [35], stochastic stepwise ensembles (ST2E) [28], PBoostGA [30], AddNoiseGA [31] and stochastic correlation coefficient ensembles (SCCE) [3].

In the study of PEs, it has become well known that it is usually beneficial to select only some members, instead of keeping all of them, to construct a subensemble, after Zhou et al. [33] proved the “many-could-be-better-than-all” theorem. Since then, a large variety of ensemble pruning techniques [5, 15, 16, 18, 22] have been proposed within the “overproduce and choose” framework. Compared with full ensembles, pruned ensembles need less storage, make predictions faster and, more importantly, achieve higher prediction accuracy. Borrowing this idea for VSEs, we surmise that VSEs can also benefit from selective fusion. As far as we know, however, there is little literature adopting ensemble pruning to build a VSE. So far, only Zhang et al. [32] have attempted to address this issue, putting forward one heuristic algorithm. Different from them, we propose a more effective algorithm to obtain better selection results. Considering that ST2E [28] performs very well in various situations, the novel algorithm improves it by inserting a pruning step.

The rest of the paper is organized as follows. Section 2 presents the novel ensemble pruning method, Pruned-ST2E, for variable ranking and selection in detail. In Sect. 3, a novel way to analyze the working mechanism of VSEs is provided. Section 4 is devoted to examining the performance of Pruned-ST2E and comparing it with several existing techniques through simulations. Some real-world examples are analyzed in Sect. 5. Finally, Sect. 6 offers conclusions, together with some directions for future work.

2 Pruning in ordered stochastic stepwise ensembles for variable ranking and selection

2.1 Brief introduction of stochastic stepwise ensembles (ST2Es)

Since the novel method is developed on the basis of ST2E [28], we first give a brief introduction to it here. In the generation phase, ST2E executes the stochastic stepwise (abbreviated as ST2) algorithm multiple times to create its constituent members. Subsequently, a simple averaging rule is used to combine the selection results of the members. Instead of adding (deleting) one variable in the forward (backward) step as traditional stepwise selection does, ST2 adds (deletes) a group of variables at a time, with the group size decided randomly. Meanwhile, only a randomly selected few groups, rather than all possible ones, are assessed, and the best of them is chosen. The forward and backward selection steps are iterated until no improvement of the objective function, such as AIC (Akaike information criterion) or BIC (Bayesian information criterion), can be made. In doing so, ST2 can always select some important variables. On the other hand, being a stochastic search algorithm, ST2 produces different selection results even when run repeatedly on the same data. It is therefore easy to see that ST2E constructs a VSE containing a series of diverse but accurate variable selectors. Owing to page limitations, interested readers are referred to Xin and Zhu [28] for more details of the ST2E algorithm.
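To make the search concrete, below is a minimal Python sketch of one ST2 run as we read it from the description above. The exact sampling rules (a group size drawn uniformly up to \(\lceil \lambda q\rceil \) of the q eligible variables, and \(\lceil \binom{q}{s}^{1/\kappa }\rceil \) candidate groups per step) are our assumptions modeled on [28], not a verbatim transcription of that paper.

```python
import numpy as np
from math import ceil, comb, exp, log

def aic_linear(X, y, subset):
    """AIC of an OLS fit on the variables in `subset` (Gaussian likelihood,
    up to an additive constant, as in standard stepwise selection)."""
    n = len(y)
    Z = np.column_stack([np.ones(n)] + ([X[:, sorted(subset)]] if subset else []))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = max(np.sum((y - Z @ beta) ** 2), 1e-12)
    return n * log(rss / n) + 2 * Z.shape[1]

def st2(X, y, lam=0.5, kappa=exp(1.5), seed=None):
    """One stochastic stepwise (ST2) run; returns the selected index set."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    selected, best = set(), aic_linear(X, y, set())
    improved = True
    while improved:
        improved = False
        for add in (True, False):  # one forward step, then one backward step
            pool = sorted(set(range(p)) - selected) if add else sorted(selected)
            q = len(pool)
            if q == 0:
                continue
            s = int(rng.integers(1, ceil(lam * q) + 1))  # random group size
            t = ceil(comb(q, s) ** (1.0 / kappa))        # groups to assess
            cands = []
            for _ in range(t):
                g = set(rng.choice(pool, size=s, replace=False).tolist())
                cands.append(selected | g if add else selected - g)
            scores = [aic_linear(X, y, c) for c in cands]
            k = int(np.argmin(scores))                   # best of the t groups
            if scores[k] < best:
                best, selected, improved = scores[k], cands[k], True
    return selected
```

Running `st2` repeatedly with different seeds yields the B binary membership vectors \(\mathbf{E}(b,\cdot )\) that ST2E averages.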

It is noteworthy that ST2E has two critical tuning parameters, \(\lambda \) and \(\kappa \), which determine the group size and the number of candidate groups to evaluate when executing the ST2 algorithm to produce an ensemble member. Xin and Zhu [28] suggested taking \(\lambda =1/2\) and choosing \(\kappa \) (\(\kappa >1\)) by looking for the start of an “elbow” in the diversity plot. In later experimental studies, we carry out some simulations to justify this choice.

2.2 Pruned-ST2E: a novel technique to prune ST2E

The existing VSE approaches [3, 17, 27–29, 34, 35] usually integrate all the generated members into an ensemble. Some evidence [28, 29, 34, 35] has shown that strength and diversity are two critical ingredients for the good performance of a VSE. Nevertheless, these two terms are not simultaneously considered when generating the ensemble members. Hence, there inevitably exist some individuals whose role is trivial or even negative. Moreover, once the variables are ranked, a thresholding rule has to be adopted in an unsupervised manner to reach the final selection results, so it is helpful to enlarge the difference between the average importance measure of the important variable group and that of the unimportant group. For these reasons, removing the redundant members of a VSE is worthwhile.

In any existing approach to pruning PEs, the essential step is to identify unnecessary members. To this end, ranking-based and search-based strategies [15, 16, 18] are the two commonly used approaches to select an ensemble subset. Here, we apply the idea of ranking-based strategies to VSEs. The core idea is to first sort all the individuals according to a certain criterion; then, the top individuals whose rank is above a given threshold are kept to compose a subensemble. Because of the good performance of ST2E [28], we apply this selective learning to its ensemble members. To ease presentation, the novel algorithm is denoted by Pruned-ST2E, and its detailed steps are listed in Algorithm 1 below.

[Algorithm 1: the Pruned-ST2E procedure]

It is worth mentioning that ST2E differs from Pruned-ST2E only in steps 4–5 of Algorithm 1. Specifically, ST2E computes the average importance measure of each variable \(j\ (j=1,2,\dots ,p)\) by averaging over all the ensemble members, that is, \(R(j)=(1/B)\sum _{r=1}^B\mathbf{{E}}(r,j)\). In Pruned-ST2E, the aggregation order of the ensemble members is first rearranged and only the top-ranked \(U\ (U<B)\) members are incorporated to estimate R(j) by (1). To guide the aggregation process, the number of variables selected in each trial (i.e., \(N_b\) defined in (3)) is used. The main reason for choosing this measure is explained below.
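As a sketch, steps 4–5 can be written in a few lines of Python; the membership matrix \(\mathbf{E}\) and the fraction of members to keep are the only inputs, and the 1/3 default mirrors the choice used in our experiments.

```python
import numpy as np

def pruned_st2e_importance(E, frac=1/3):
    """Steps 4-5 of Algorithm 1 as described above: sort the members of the
    B x p binary selection matrix E by N_b (the number of variables each
    member selects) in ascending order, keep the first U of them, and
    average those rows to obtain the importance measures R(j)."""
    B = E.shape[0]
    U = max(1, int(round(frac * B)))      # members to retain (U < B)
    N = E.sum(axis=1)                     # N_b, as in (3)
    order = np.argsort(N, kind="stable")  # ascending aggregation order
    return E[order[:U]].mean(axis=0)      # R(j), as in (1)
```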

Roughly speaking, some ensemble members correctly choose a part of, or all of, the really important variables. The other members deem not only all the truly important variables but also some additional unimportant ones as important. Note that Pruned-ST2E sorts the ensemble members in ascending order of the number of selected variables. As the sorted members are gradually integrated into the ensemble, the selection frequency of the truly important variables increases quickly while that of the remaining ones stays at a low level, since the latter are chosen only by chance. For some medium-sized value of U, the difference between the selection frequencies of the important and unimportant groups (i.e., the gap value) reaches a maximum. After this point, because the noise variables are chosen by more members, the gap value declines gradually as more individuals are included in the ensemble. To judge which variables are important based on the ranked list, it is critical to choose an appropriate threshold to divide the candidate variables into two groups. By changing the aggregation order of individuals and stopping the fusion early, the gap between the two sets of variables is enlarged. Therefore, it is beneficial to inject an additional selective learning step before the fusion is implemented.

Based on the output of Algorithm 1 (i.e., \(R(j),\ j=1,2,\dots ,p\)), the p variables can be ranked according to their importance to the response. To reach the final selection results, a thresholding rule such as the mean rule or searching for the largest gap on the scree plot [26] can then be executed. The former means selecting the variables which satisfy \(R(j)>(1/p)\sum _{k=1}^pR(k)\). The latter can be implemented as follows: sort \(R(1),R(2),\dots ,R(p)\) in descending order; search for the largest gap between any two consecutive entries; and select the variables located above the gap. Following Xin and Zhu [28], the former scheme is adopted in later experiments unless otherwise specified.
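Both thresholding rules are straightforward to implement; the following Python sketch shows one possible coding of them, operating on the importance vector \(R\).

```python
import numpy as np

def mean_rule(R):
    """Keep the variables whose importance exceeds the average importance."""
    return np.flatnonzero(R > R.mean())

def largest_gap_rule(R):
    """Sort R in descending order, locate the largest drop between two
    consecutive entries, and keep the variables above that gap."""
    order = np.argsort(R)[::-1]      # variable indices, most important first
    if len(R) < 2:
        return order
    drops = np.diff(R[order])        # non-positive consecutive steps
    cut = int(np.argmin(drops)) + 1  # position right after the biggest drop
    return order[:cut]
```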

Ideally, the number of members to keep (i.e., U) should be determined automatically to maximize selection accuracy. Unfortunately, selection accuracy cannot be computed as in the prediction case, since the truly important variables are unknown in practice. The easiest method is to prescribe a desired number. According to our experiments (refer to Sect. 4) as well as the evidence in the study of PEs [15, 16], keeping the top 1/3 to 2/3 of the members on the ranked list appears reasonable. In comparison with ST2E, Pruned-ST2E simply includes an additional sorting phase before fusion. Therefore, Pruned-ST2E is easy to implement and its time complexity is roughly equal to that of ST2E.

3 Analysis of accuracy-diversity trade-off

In the ensemble learning field, it is commonly believed that more insight into the working mechanism of an ensemble can be gained by analyzing the trade-off between the accuracy (also known as strength) and the diversity of its members. Due to significant differences between PEs and VSEs, the strategies used to investigate the accuracy-diversity pattern in PEs [13] cannot be directly employed in the context of VSEs. Zhu and Fan [35] offered a method to estimate the strength and diversity of a VSE and related these measures to its variable-selection performance. Later on, Xin and Zhu [28] proposed using the strength-diversity trade-off to specify an appropriate tuning parameter of ST2E. Here, we provide a new approach to evaluate the strength and diversity of a VSE, and then utilize them to analyze the working mechanisms of ST2E and Pruned-ST2E.

According to formula (2), the output of each ensemble member, say \(\mathbf{E}(b,\cdot )\ (b=1,2,\dots ,B)\), is a binary vector with each entry being 1 or 0. Suppose the true importance vector is \(\mathbf{{r}}^{*}\), which takes the value 1 at the positions corresponding to truly important variables and 0 otherwise. The accuracy of the bth member can be assessed as

$$\begin{aligned} \displaystyle \mathrm{Acc}_b = \frac{1}{p}\sum _{j=1}^p\mathbb {I}[\mathbf{E}(b,j)=\mathbf{{r}}^{*}(j)],\quad b = 1,2,\dots ,B, \end{aligned}$$
(4)

in which \(\mathbb {I}(\cdot )\) stands for an indicator function taking value 1 if its condition holds and 0 otherwise. Accordingly, the overall accuracy of the VSE is

$$\begin{aligned} \displaystyle \mathrm{Acc}=\frac{1}{B}\sum _{b=1}^B\mathrm{Acc}_b=\frac{1}{pB}\sum _{b=1}^B\sum _{j=1}^p\mathbb {I}[\mathbf{E}(b,j)=\mathbf{{r}}^{*}(j)]. \end{aligned}$$
(5)

Following the practice of Xin and Zhu [28], the diversity of a VSE can be estimated as

$$\begin{aligned} \displaystyle \mathrm{Div} = \frac{1}{p}\sum _{j=1}^pv(j),\quad \mathrm{with}\quad v(j) = \frac{1}{B-1}\sum _{b=1}^B\left[ \mathbf{{E}}(b,j)-\frac{1}{B}\sum _{b'=1}^B\mathbf{{E}}(b',j)\right] ^2. \end{aligned}$$
(6)

Actually, the quantity \(\mathrm{Div}\) is a measure of the average within-ensemble variation. In general, a good VSE is expected to achieve \(\mathrm{Acc}\) as high as possible. Since \(\mathrm{Div}\) measures how much each member deviates from the central tendency estimated by the VSE, it becomes small when the constituent members become more accurate. It is noteworthy that our \(\mathrm{Acc}\) directly evaluates the selection accuracy of the considered VSE, whereas Xin and Zhu [28] used the improvement in AIC to evaluate the mean strength of a VSE implicitly. Because our final purpose is variable selection, the \(\mathrm{Acc}\) defined in (5) can be deemed more appropriate for analyzing VSEs.
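For completeness, here is a short Python sketch of (4)–(6) applied to a \(B\times p\) selection matrix \(\mathbf{E}\) and a true importance vector \(\mathbf{r}^{*}\).

```python
import numpy as np

def acc_div(E, r_star):
    """Acc and Div of a VSE: per-member accuracies as in (4), their mean
    as in (5), and the average within-ensemble variation as in (6)."""
    acc_b = (E == r_star).mean(axis=1)  # Acc_b, b = 1, ..., B
    acc = acc_b.mean()                  # Acc in (5)
    div = E.var(axis=0, ddof=1).mean()  # Div in (6): mean of v(j) over j
    return acc, div
```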

4 Simulation studies

This section is devoted to examining Pruned-ST2E as well as some other procedures on simulated data. First, the effect of modifying the aggregation order of ST2E is studied. Then, the performance of Pruned-ST2E is examined and compared with that of ST2E [28] and some other popular techniques, including PGA [34], the lasso [25], the adaptive lasso [36], SCAD [7], random forest [11] and SCCE [3]. As the variable selection criterion, we used AIC in the experiments below. There is evidence [19, 34] that AIC tends to select more variables than necessary whereas BIC tends to select fewer. Notice that the main principle of VSEs is to filter out noise variables by executing a base learner multiple times: if a truly informative variable is missed very often, it will be considered unimportant by the VSE. In view of this fact, AIC is more suitable than BIC as the selection criterion in a VSE. All the following experiments were conducted in Matlab (version 8.1).

For all the experiments conducted in this section, every simulation was repeated 100 times. The ensemble size B was taken to be 300. The parameters \(\lambda \) and \(\kappa \) in ST2E were chosen to be identical to those adopted by Xin and Zhu [28]; for the examples they did not consider, \(\lambda \) and \(\kappa \) were set to 0.5 and \(\exp (1.5)\), respectively. For Pruned-ST2E, one third of the sorted members were kept. The parameter N involved in PGA, i.e., the number of generations for each SGA (single-path genetic algorithm) to evolve, was determined by the strategy proposed by Zhu and Chipman [34]. For the lasso, fivefold cross-validation was adopted to select the optimal regularization parameter. For the adaptive lasso, the adaptive weights were first calculated from ordinary least squares estimates with \(\gamma =1\); the lasso was then applied to the rescaled data. In SCAD, we took \(a=3.7\) and used generalized cross-validation to estimate its tuning parameter. For random forest, the variables were first sorted by their permutation importance measures and a subset was then selected from the top 1/3 of the variables so that the smallest OOB (out-of-bag) error was achieved. Moreover, the parameter \(\lambda \) in SCCE was set to 0.5.

4.1 Simulated data

The data used in the following examples were generated by

$$\begin{aligned} \mathbf{{y}} = \mathbf{{x}}_1\beta _1+\mathbf{{x}}_2\beta _2+\cdots +\mathbf{x}_p\beta _p+\varvec{\epsilon } = \mathbf{X}\varvec{\beta }+\varvec{\epsilon },\quad \varvec{\epsilon }\sim N(\mathbf{{0}},\sigma ^2\mathbf{{I}}), \end{aligned}$$
(7)

where \(\varvec{\epsilon }\) is an error term, normally distributed with mean zero and variance \(\sigma ^2\).

Example 1

There are \(p=20\) variables and \(n=40\) observations [34] in this example. Only variables 5, 10 and 15 have an actual influence on the response \(\mathbf{{y}}\), with true coefficients 1, 2 and 3, respectively. The rest of the variables are uninformative and their coefficients are all zero. We set \(\sigma =1\) for scenarios 1–3 and \(\sigma =2\) for scenario 4. For the explanatory variables, the following four scenarios were considered (a code sketch of the generator follows the list):

Scenario 1: \(\mathbf{{x}}_1,\mathbf{{x}}_2,\dots ,\mathbf{{x}}_{20}\sim N(\mathbf{{0}},\mathbf{{I}})\);

Scenario 2: \(\mathbf{{x}}_1,\mathbf{{x}}_2,\dots ,\mathbf{{x}}_{19}\sim N(\mathbf{{0}},\mathbf{{I}})\), \(\mathbf{{x}}_{20}=\mathbf{{x}}_{10}+0.25\mathbf{{z}}\), \(\mathbf{{z}}\sim N(\mathbf{{0}},\mathbf{{I}})\);

Scenario 3: \(\mathbf{{x}}_1,\mathbf{{x}}_2,\dots ,\mathbf{{x}}_{19}\sim N(\mathbf{{0}},\mathbf{{I}})\), \(\mathbf{{x}}_{20}=\mathbf{{x}}_{15}+0.25\mathbf{{z}}\), \(\mathbf{{z}}\sim N(\mathbf{{0}},\mathbf{{I}})\);

Scenario 4: \(\mathbf{{x}}_j=\mathbf{{z}}+\varvec{\epsilon }_j, j=1,2,\dots ,20\), \(\varvec{\epsilon }_j\sim N(\mathbf{{0}},\mathbf{{I}})\), \(\mathbf{{z}}\sim N(\mathbf{{0}},\mathbf{{I}})\).
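A minimal Python sketch of a generator for these four scenarios (the helper name `make_example1` is ours; indices are 0-based, so variables 5, 10 and 15 become columns 4, 9 and 14):

```python
import numpy as np

def make_example1(scenario, n=40, p=20, seed=None):
    """Generate one data set for Example 1; beta has 1, 2, 3 on variables
    5, 10 and 15 (1-based) and zeros elsewhere."""
    rng = np.random.default_rng(seed)
    sigma = 2.0 if scenario == 4 else 1.0
    X = rng.standard_normal((n, p))
    if scenario == 2:                      # x20 nearly collinear with x10
        X[:, 19] = X[:, 9] + 0.25 * rng.standard_normal(n)
    elif scenario == 3:                    # x20 nearly collinear with x15
        X[:, 19] = X[:, 14] + 0.25 * rng.standard_normal(n)
    elif scenario == 4:                    # all variables share a factor z
        z = rng.standard_normal((n, 1))
        X = z + rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[[4, 9, 14]] = [1.0, 2.0, 3.0]
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```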

Example 2

This experiment [28] is specifically designed to examine the capability of a method to correctly detect a weak signal (\(\mathbf{{x}}_1\)), strong signals (\(\mathbf{{x}}_2, \mathbf{{x}}_3\)) and noise variables (\(\mathbf{{x}}_j,\ j=4,\dots ,p\)). There are \(p=20\) variables, each generated from the standard normal distribution N(0, 1). The true coefficient vector is \(\varvec{\beta }=(\alpha ,2,3,0,\dots ,0)^T\), where \(\alpha \) takes values from 0.1 to 1.5 in increments of 0.1. The three variables used to generate \(\mathbf{{y}}\) are correlated, with \(corr(\mathbf{x}_i,\mathbf{{x}}_j)=0.7\) for \(i,j\in \{1,2,3\}\) and \(i\ne j\). The other variables \(\mathbf{{x}}_{4},\dots ,\mathbf{{x}}_{20}\) and the noise term \(\varvec{\epsilon }\) are independent of each other. Here, we considered the situation \(n=100,\ \sigma =3\).

Example 3

The data in this example are modified from a widely used benchmark data set [25, 27, 28]. To increase the problem complexity, we suppose that there are \(p=50\) variables, each generated from N(0, 1). The pairwise correlation between any two variables is \(corr(\mathbf{{x}}_i,\mathbf{{x}}_j)=0.5^{|i-j|}\) for all \(i\ne j\). The true coefficient vector is \(\varvec{\beta }=(3,1.5,0,0,2,\mathbf{{0}}_{p-5})^T\), where \(\mathbf{0}_{p-5}\) indicates a \((p-5)\)-dimensional zero vector.
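This correlated design can be sampled with a Cholesky factor of the \(0.5^{|i-j|}\) correlation matrix, as in the Python sketch below; the sample size n and noise level \(\sigma \) are placeholders here, since the text does not restate them for this example.

```python
import numpy as np

def make_example3(n=100, p=50, sigma=1.0, seed=None):
    """Example 3 design: corr(x_i, x_j) = 0.5**|i-j|, sampled via the
    Cholesky factor of the correlation matrix (n and sigma are assumed
    values, not taken from the text)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    corr = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(corr).T
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]      # beta = (3, 1.5, 0, 0, 2, 0, ...)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```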

Example 4

Finally, we considered a high-dimensional, small-sample-size (large p, small n) problem with \(p=120\) variables and \(n=100\) observations [27, 28]. The first 60 variables are truly important; their true coefficients are drawn from a normal distribution N(3, 0.5) and then fixed for all simulation runs. The explanatory variables are created from a multivariate normal distribution with zero mean and covariance matrix

$$\begin{aligned} \left[ \begin{array}{cccc} \Sigma _0 & 0 & 0 & 0\\ 0 & \Sigma _0 & 0.2J & 0\\ 0 & 0.2J & \Sigma _0 & 0 \\ 0 & 0 & 0 & \Sigma _0 \end{array} \right] , \end{aligned}$$
(8)

where \(\Sigma _0\) is a \(30\times 30\) matrix with unit diagonal elements and off-diagonal elements equal to 0.7, and J is a \(30\times 30\) matrix of ones. The noise level was taken as \(\sigma =50\), similar to that used by Wang et al. [27] and Xin and Zhu [28].
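As an illustration, the covariance matrix in (8) can be assembled blockwise; a Python sketch:

```python
import numpy as np

def example4_cov(block=30, rho=0.7, cross=0.2):
    """Assemble the covariance matrix in (8): Sigma0 blocks on the diagonal
    and 0.2*J coupling the second and third groups of variables."""
    Sigma0 = np.full((block, block), rho)
    np.fill_diagonal(Sigma0, 1.0)
    J = np.ones((block, block))
    cov = np.zeros((4 * block, 4 * block))
    for k in range(4):
        cov[k*block:(k+1)*block, k*block:(k+1)*block] = Sigma0
    cov[block:2*block, 2*block:3*block] = cross * J
    cov[2*block:3*block, block:2*block] = cross * J
    return cov

# e.g., X = np.random.default_rng(0).multivariate_normal(
#            np.zeros(120), example4_cov(), size=100)
```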

4.2 Evaluation metrics

Because our focus in variable selection is on obtaining a parsimonious model for interpretability, selection accuracy is the most natural criterion for evaluating an algorithm. Nevertheless, it is infeasible in general since the ground truth is unknown except for simulated data. For this reason, a large variety of metrics have been developed in the related literature [27, 28, 32, 34, 36] to assess a variable selection method from different perspectives. To study the behavior of each compared method extensively, we adopt the evaluation metrics defined below.

Let \(\varvec{\beta }=(\beta _1,\beta _2,\dots ,\beta _p)^T\) stand for the true coefficient vector and \(\mathcal {T}=\{j:\beta _{j}\ne 0\}\) indicate the true model. Let IV and UIV represent the index sets of the truly important and unimportant variables, respectively; |IV| indicates the number of variables in IV, and |UIV| is defined accordingly. For a given algorithm, each metric is estimated from 100 replications of the simulation. In the tth replication, denote by \(\hat{r}_t=(\hat{r}_{1,t},\hat{r}_{2,t},\dots ,\hat{r}_{p,t})^T\) the importance measures assigned to the variables, by \(\hat{\varvec{\beta }}_t=(\hat{\beta }_{1,t},\hat{\beta }_{2,t},\dots ,\hat{\beta }_{p,t})^T\) the estimated coefficients, and by \(\hat{\mathcal {S}}_{t}=\{j:\hat{\beta }_{j,t}\ne 0\}\) the identified model. In what follows, \(\mathbb {I}(\cdot )\) represents an indicator function. Then, we define

$$\begin{aligned} \left\{ \begin{array}{l} \mathrm{soft\ metric} = \displaystyle \frac{1}{100}\sum _{t=1}^{100}\mathbb {I}\left( \min _{j\in IV}\hat{r}_{j,t}\ge \max _{j\in UIV}\hat{r}_{j,t}\right) ,\\ \mathrm{hard\ metric} = \displaystyle \frac{1}{100}\sum _{t=1}^{100}\mathbb {I}\left( \hat{\mathcal {S}}_{t}=\mathcal {T}\right) , \end{array} \right. \end{aligned}$$
(9)
$$\begin{aligned} \left\{ \begin{array}{l} \mathrm{\%\ of\ correct\ zeros}=\displaystyle \frac{1}{|UIV|}\left\{ \frac{1}{100}\sum _{t=1}^{100}\sum _{j\in UIV}\mathbb {I}\left( \hat{\beta }_{j,t}=0\right) \right\} \times 100\%,\\ \mathrm{\%\ of\ incorrect\ zeros}=\displaystyle \frac{1}{|IV|}\left\{ \frac{1}{100}\sum _{t=1}^{100}\sum _{j\in IV}\mathbb {I}\left( \hat{\beta }_{j,t}=0\right) \right\} \times 100\%. \end{array} \right. \end{aligned}$$
(10)

Let

$$\begin{aligned} T_j=\sum _{t=1}^{100}\mathbb {I}\left( \hat{\beta }_{j,t}\ne 0\right) , \end{aligned}$$
$$\begin{aligned} \left\{ \begin{array}{l} \mathrm{sel.\ freq.\ of\ } IV=\displaystyle \left( \min _{j\in IV} T_j,\ \underset{j\in IV}{\mathrm{median}}\ T_j,\ \max _{j\in IV} T_j\right) ,\\ \mathrm{sel.\ freq.\ of\ } UIV=\displaystyle \left( \min _{j\in UIV} T_j,\ \underset{j\in UIV}{\mathrm{median}}\ T_j,\ \max _{j\in UIV} T_j\right) , \end{array} \right. \end{aligned}$$
(11)
$$\begin{aligned} \mathrm{model\ size}=\displaystyle \frac{1}{100}\sum _{j=1}^pT_j=\frac{1}{100}\sum _{j=1}^p\sum _{t=1}^{100}\mathbb {I}\left( \hat{\beta }_{j,t}\ne 0\right) , \end{aligned}$$
(12)
$$\begin{aligned} \mathrm{RPE} = \displaystyle \frac{1}{\sigma ^2}E[(\hat{y} - \mathbf{x}^T\varvec{\beta })^2]=\frac{1}{\sigma ^2}(\hat{\varvec{\beta }} - \varvec{\beta })^TE(\mathbf{{x}}\mathbf{{x}}^T)(\hat{\varvec{\beta }} - \varvec{\beta }). \end{aligned}$$
(13)

In (9), the soft metric evaluates the ranking performance of an algorithm, i.e., how well it ranks variables in line with their importance. The hard metric assesses how well an algorithm detects the true model (i.e., accurately categorizes all variables into important and unimportant ones). The percentages (%) of correct and incorrect zeros defined in (10) are equivalent to the specificity and the false negative rate, respectively; together, these two metrics evaluate the overall capacity of a method to correctly identify IV and UIV. The purpose of computing the selection frequencies [27–29] for IV and UIV as defined in (11) is to examine the selection behavior of an approach more closely: we recorded the minimum, median and maximum number of times, out of 100 simulations, that the variables in IV and in UIV were selected. The model size (MS) defined in (12) measures the estimated sparsity. Since variable selection and prediction are closely related, a good selection approach is usually also expected to exhibit good prediction performance. To evaluate the prediction ability of a method, we utilized the relative prediction error (RPE) [36] defined in (13). To estimate the RPE of an algorithm, a linear regression model was first built using the selected variables; the RPE was then estimated on a test set of 10,000 instances.
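To make the definitions concrete, here is a Python sketch of the metrics in (9), (10), (12) and (13) over the 100 replications (the argument names are ours):

```python
import numpy as np

def soft_hard(r_hat, S_hat, IV, UIV, T_true):
    """Soft and hard metrics in (9): r_hat is a T x p array of importance
    scores, S_hat a list of T selected index sets, T_true the true model."""
    soft = np.mean([r[IV].min() >= r[UIV].max() for r in r_hat])
    hard = np.mean([set(s) == set(T_true) for s in S_hat])
    return soft, hard

def zero_percentages(B_hat, IV, UIV):
    """% of correct and incorrect zeros in (10); B_hat is a T x p array of
    estimated coefficients over the replications."""
    return (100.0 * (B_hat[:, UIV] == 0).mean(),
            100.0 * (B_hat[:, IV] == 0).mean())

def model_size(B_hat):
    """Average number of selected variables per replication, as in (12)."""
    return (B_hat != 0).sum(axis=1).mean()

def rpe(beta_hat, beta, cov_x, sigma2):
    """Relative prediction error in (13), with E(x x^T) supplied as cov_x
    (a zero-mean design is assumed)."""
    d = beta_hat - beta
    return float(d @ cov_x @ d) / sigma2
```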

4.3 Effect of rearranging aggregation order

First, Example 1 was used to investigate how the performance of ST2E varies when the aggregation order of its constituent members is rearranged. Both the hard and soft metrics were employed to evaluate a VSE; when calculating the hard metric, the largest-gap rule [26] was applied to the ranked list. For the ST2E ensembles, the members were aggregated gradually in the same random order in which they were generated by ST2. For the pruned ST2E ensembles, the members were first sorted by Algorithm 1 and then fused. We also assessed the performance of the subensembles constructed by sorting the members in the optimal order: since \(\mathbf{{x}}_{5}, \mathbf{{x}}_{10}\) and \(\mathbf{{x}}_{15}\) are known to be the truly important variables, for each ensemble size the members were chosen so that the difference between the mean importance measure of the signal variable group and that of the noise variable group was maximized. Figure 1 depicts the computed hard and soft metrics for ST2E, Pruned-ST2E and the optimal subensembles as a function of ensemble size.

By adopting the accuracy and diversity defined in (5) and (6), respectively, we tried to pin down the difference between ST2E and Pruned-ST2E. In each simulation, we computed the \(\mathrm{Acc}\) and \(\mathrm{Div}\) of ST2E from all of its members; here, \(\mathbf{{r}}^{*}\) was a 20-dimensional binary column vector with its 5th, 10th and 15th entries being 1 and the others 0. For Pruned-ST2E, the members of ST2E were first sorted with Algorithm 1 and only the top-ranked third were kept and fused into a VSE; in other words, the \(\mathrm{Acc}\) and \(\mathrm{Div}\) of Pruned-ST2E were estimated from the retained members. The subplots in the third column of Fig. 1 illustrate the accuracy-diversity patterns of ST2E and Pruned-ST2E; each point for ST2E (Pruned-ST2E) corresponds to the result obtained in one simulation.

Fig. 1 For scenarios 1–4 in Example 1, the performance of ST2E, Pruned-ST2E and the optimal subensembles in terms of the hard and soft metrics (the first two columns) as a function of ensemble size. The subplots in the third column illustrate the accuracy-diversity patterns of the full ST2E ensemble and a pruned subensemble

Figure 1 shows that the accuracy of ST2E, evaluated with both the hard and soft metrics, increases sharply to a nearly optimal value as the ensemble incorporates more individuals; further improvement from additional members remains small. In terms of the hard metric, however, the accuracy curves of the pruned ensembles exhibit a maximum at intermediate numbers of individuals, while the ranking accuracy curve of Pruned-ST2E has a shape similar to that of ST2E. After the individuals of ST2E are sorted by Algorithm 1, they can be roughly categorized into three types. The members entering the fusion early often identify only some, rather than all, truly important variables. Those lying in the middle of the ranked list deem all signal variables important. The members ranked at the bottom often include more noise variables in addition to all the important ones. Hence, all signal variables are quickly ranked ahead of the noise ones as the sorted members gradually enter the fusion process, and, thanks to the enlarged gap between the two groups, the selection accuracy benefits even more from this process. Additionally, the subplots in the third column of Fig. 1 illustrate that the pruning phase improves the selection accuracy of ST2E to a large degree. In comparison with the full ST2E ensemble, some less accurate members are filtered out and thus not fused into the pruned subensembles. It is therefore natural that Pruned-ST2E has smaller diversity, since its members are more concentrated around the true importance vector.

Moreover, it can be observed in Fig. 1 that the selection accuracy of all but very small subensembles lies above that of the full ensemble consisting of all 300 individuals (i.e., the rightmost point in each subplot). This makes it easy to select a subensemble that outperforms the original ensemble constructed by ST2E. The plots also illustrate that simply keeping the first 1/3 to 2/3 of the sorted individuals to compose a VSE is reasonable. Although the accuracy of the optimal subensembles is much higher than that of ST2E and Pruned-ST2E, this is only an ideal case, since we cannot know which variables are truly important in practice.

4.4 Effect of \(\lambda \) and \(\kappa \)

In ST2E, two tuning parameters, \(\lambda \) and \(\kappa \), are involved in its base learner ST2 (see Algorithm 1). These two parameters (especially \(\kappa \)) play a key role in generating a collection of accurate and diverse members. Because \(\lambda \) controls how many variables are added (deleted) in one forward (backward) step, its impact is relatively small; as suggested by Xin and Zhu [28], we can simply set \(\lambda =0.5\). The parameter \(\kappa \) needs to be set more carefully because it directly affects how diverse the created members are. The larger \(\kappa \) is, the fewer models are evaluated in each search step; on the one hand this reduces the strength of each member, and on the other hand it is also harmful to the diversity of ST2E. Conversely, the computational cost of ST2E becomes very high if \(\kappa \) is taken too small. For these reasons, we speculate that \(\lambda \) and \(\kappa \) may influence the performance of Pruned-ST2E, and we conducted the following experiments to clarify this issue.

First, we fixed \(\lambda =0.5\) and varied \(\kappa \) from \(\exp (0.5)\) to \(\exp (3.0)\) by increasing the exponent in steps of 0.2. For each value of \(\kappa \), we estimated the accuracy of ST2E and Pruned-ST2E with the hard and soft metrics over 100 simulations. To study the influence of \(\lambda \), \(\kappa \) was then set to \(\exp (1.5)\) and \(\lambda \) was varied from 0.10 to 0.95 in increments of 0.05. Figure 2 depicts representative results obtained on scenarios 2 and 4 of Example 1.

Fig. 2 On scenarios 2 and 4 in Example 1, how the selection frequency of ST2E and Pruned-ST2E changes with \(\lambda \) and \(\kappa \)

The top two subplots of Fig. 2 indicate that Pruned-ST2E generally improves on ST2E when \(\kappa \le \exp (2)\). Note that the smaller \(\kappa \) is, the more accurate each individual is, so it is plausible that changing the aggregation order is helpful in this situation. The bottom two subplots of Fig. 2, which show how the selection frequency of ST2E and Pruned-ST2E depends on \(\lambda \), reveal that both methods perform well as long as \(\lambda \) is not too small (i.e., \(\lambda \ge 0.4\)). The behavior of ST2E and Pruned-ST2E deteriorates, however, if \(\lambda \) is too large, which means that adding or deleting too many variables in a single step of the ST2 algorithm is unhelpful. This is consistent with the recommendation of Xin and Zhu [28]. Based on these observations, \(\lambda \) and \(\kappa \) were taken as \(\lambda =0.5\) and \(\kappa =\exp (1.5)\) in most experiments.

4.5 Performance comparison

Example 2 was utilized to examine the ability of each algorithm to identify the three different types of variables. Subplots (a-c) of Fig. 3 show how the average frequencies with which each type of variable is detected as important change as \(\alpha \) varies. Subplots (d, e) plot the accuracy of each method, assessed by the hard and soft metrics, as a function of \(\alpha \). Subplot (f) reports the CPU time needed by each method to complete one simulation; the computing environment was a personal computer with an Intel Core i7-6600 CPU @2.60 GHz and 16.0 GB RAM.

Fig. 3 The experimental results of each method in Example 2. Subplots a, b and c show the average selection frequencies for the three types of variables (i.e., the weak signal, strong signal and noise variables), respectively. Subplots d and e plot the selection accuracies in terms of the hard and soft metrics as \(\alpha \) takes different values. Subplot f compares the time complexity of all the methods

Figure 3a demonstrates that SCCE performs best at identifying the weak signal variable \(\mathbf{{x}}_1\). However, this is achieved at the price of its ability to exclude noise variables (see Fig. 3c); subplot (d) also shows that it cannot accurately discern the right signal variable group even though its ranking performance is excellent. Meanwhile, ST2E and Pruned-ST2E perform equally well at correctly detecting \(\mathbf{{x}}_1\), and they significantly outperform the other methods in this situation, especially when \(\alpha \) is small. In terms of catching the strong signal variables \(\mathbf{{x}}_2\) and \(\mathbf{{x}}_3\), all the algorithms but SCAD do a good job. As for guarding against the noise variables (\(\mathbf{{x}}_j\) for \(j>3\)), the compared methods rank, from best to worst, as random forest, Pruned-ST2E, ST2E, PGA, adaptive lasso, lasso, SCCE and SCAD. At the same time, Pruned-ST2E considerably surpasses its rivals, except for the lasso, in terms of selection accuracy, as revealed in subplot (d). Although subplot (e) shows that random forest and SCCE achieve better ranking accuracy than Pruned-ST2E, Pruned-ST2E beats them at excluding noise variables and correctly identifying the signal group, as shown in subplots (c) and (d).

As far as time complexity is concerned, Fig. 3f indicates that random forest is the most time-consuming. Since the strategy proposed by Genuer et al. [11] was executed, all variables need to be sorted first by permutation importance measures; based on the top 1/3 of the ranked variables, the subset achieving the smallest OOB error is then selected. Among the remaining algorithms, PGA and SCCE cost slightly more time. Compared with ST2E, Pruned-ST2E is a little faster since it fuses only the 1/3 top-ranked members to obtain the average importance measure of each variable.

With regard to Example 3, two sets of experiments were conducted. First, we evaluated the performance of all algorithms on a relatively easy problem (i.e., only three important variables); Table 1 reports the results. We can observe that Pruned-ST2E behaves best in almost all cases, except that it may sometimes miss some important variables when the noise level is high. Under this circumstance, PGA, lasso, adaptive lasso and SCCE seem to achieve slightly better performance than ST2E and Pruned-ST2E at catching signal variables; however, they lose this advantage when it comes to excluding noise variables.

Table 1 The performance on the widely used benchmark data set with \(p=50\) and 3 important variables in Example 3

For the second set of experiments, besides the important variables \(\mathbf{{x}}_1,\mathbf{{x}}_2,\mathbf{{x}}_5\), another five important variables with small coefficients (i.e., \(\mathbf{{x}}_6,\dots ,\mathbf{{x}}_{10}\), with coefficients \(\sim N(\mathbf{{0}},0.5^2\mathbf{{I}})\)) were inserted into Example 3. Table 2 summarizes the results of each algorithm on this more difficult problem. Due to the existence of some weakly important variables (i.e., those having small non-zero coefficients), the performance of each algorithm decreases somewhat, especially in terms of identifying important variables. Nevertheless, Pruned-ST2E still maintains its superiority over the others in excluding noise variables.

Table 2 The performance on the widely used benchmark data set with \(p=50\) and eight important variables in Example 3

The results for the problem in Example 4, in which p is large and n is small, are reported in Table 3. Similar to the strategy used by Xin and Zhu [28], we inserted an SIS (sure independence screening [8]) pre-selection step when generating each ensemble member of ST2E, PGA and SCCE; the value of q (i.e., the number of variables to pre-pick) in SIS was taken as 50. The lasso-type methods and random forest were applied directly. Table 3 shows that ST2E and Pruned-ST2E perform much better than their counterparts. Here, Pruned-ST2E exhibits a stronger ability to exclude noise variables but has a higher false negative rate. Notice that there are 60 important variables in this example, and the small coefficients of some of them cause ST2E to need more reserved members to attain better performance. This example also reveals that the advantage of Pruned-ST2E over ST2E is more significant when the true model is sparse.

Table 3 The performance for the large p small n problem in Example 4

5 Analysis of real data

5.1 Diabetes data

We first analyzed the diabetes data, composed of \(n=442\) diabetes patients and \(p=10\) variables. The task was to interpret how the variables (such as age, sex and body mass index) affect the progression of diabetes. In this example, the LARS algorithm [6] was taken as a baseline against which to compare the different methods. The parameters of each algorithm were set up in a way similar to that used in the simulation studies. Table 4 reports the order in which the variables are ranked by each method. All methods but SCCE deem the variables “bmi”, “ltg” and “map” the most important ones and “age” the least important one; for the intermediate variables, they hold different views. As for the different orders of “tc” and “sex” produced by ST2E and Pruned-ST2E, one possible explanation is that Pruned-ST2E retains the members that put more emphasis on the estimated coefficients than on prediction accuracy; notice that “tc” has a relatively large coefficient in the solution path of LARS [28]. The situation for “ldl” and “hdl” can be interpreted similarly.

Table 4 Ranking of variables in diabetes data set

5.2 Some classification data

Here, we studied the behavior of the compared methods on some classification problems by extending them to logistic regression models. Specifically, the regression coefficients were estimated by the maximum likelihood estimation (MLE) method and AIC was computed as \(AIC=2k-2l(\hat{\varvec{\beta }};\mathbf{{y}})\), where k indicates the model size and \(l(\hat{\varvec{\beta }};\mathbf{{y}})\) is the estimated log-likelihood of the considered model. Moreover, we considered mRMR [21] instead of SCCE [3], since the former is designed for feature selection in classification tasks.
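A self-contained Python sketch of this extension (Newton-Raphson MLE for the logistic model and the AIC formula above; counting the intercept in k is our assumption):

```python
import numpy as np

def logistic_aic(X, y, subset, iters=50):
    """AIC = 2k - 2*loglik for a logistic model on the variables in
    `subset`, fitted by Newton-Raphson MLE (no regularization)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X[:, sorted(subset)]])
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p_hat = 1.0 / (1.0 + np.exp(-Z @ beta))
        W = p_hat * (1.0 - p_hat)                      # IRLS weights
        H = Z.T @ (W[:, None] * Z) + 1e-8 * np.eye(Z.shape[1])
        step = np.linalg.solve(H, Z.T @ (y - p_hat))   # Newton update
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    p_hat = np.clip(1.0 / (1.0 + np.exp(-Z @ beta)), 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    return 2 * Z.shape[1] - 2 * loglik
```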

Table 5 first summarizes the main characteristics of five UCI [14] binary classification data sets (i.e., the sample sizes of the training and test sets and the input dimensionality). For each data set, every method was applied to the training set to detect important variables, a logistic regression model was built using the selected variables, and its prediction error was estimated on the test set. The whole process was repeated 10 times, and Table 5 reports the mean and standard deviation of the test errors, together with the average number of selected variables, for each method. The model including all the variables was also considered. For each data set, the result typed in boldface indicates the approach achieving the best prediction.

Table 5 The test error rates and number of selected variables for each method on some classification data sets

Table 5 reveals that the prediction ability of Pruned-ST2E is satisfactory. Across all the data sets used, it has a lower prediction error than ST2E and PGA, and it can also outperform the other, prediction-oriented methods in some cases. Moreover, the number of selected variables is much lower than the original dimensionality, which facilitates interpretation. Therefore, Pruned-ST2E also has great potential for variable selection tasks in classification problems.

6 Conclusions and future work

In this paper, we proposed a novel method, Pruned-ST2E, to construct a more effective VSE by using an ensemble pruning strategy. In Pruned-ST2E, the ensemble members are first sorted and only some top-ranked members are included in the fusion process to produce a subensemble. Based on the resulting average importance measures, the variables are then ranked and further selected by a thresholding rule. Although the idea of Pruned-ST2E is simple, a large batch of experiments conducted on both simulated and real-world data demonstrates its better or competitive performance in comparison with several counterparts. With the aid of the accuracy-diversity analysis of ST2E and Pruned-ST2E, we found that the inserted pruning step excludes less accurate members and makes the remaining members more concentrated around the true importance vector.

In essence, the pruning technique used in Pruned-ST2E can easily be extended to other techniques for building VSEs (such as PGA and stability selection), and it would be interesting to study how the corresponding subensembles perform on different variable selection tasks. On the other hand, Pruned-ST2E retains a prescribed number of ensemble members to integrate into a VSE; ideally, the number of members to keep would be determined automatically so that selection accuracy is as high as possible. Because the ground truth cannot be known in practice and the selection performance of a VSE depends on many factors, this is a difficult and challenging problem that deserves further study. In recent years, data-driven feature selection techniques [2, 4] have received increasing attention due to their good performance and versatile applications; it is also worth exploring whether selective ensemble learning can further enhance their performance.