1 Introduction

Feature selection aims to select a subset of features from high-dimensional data according to a predefined selection criterion. It can bring many benefits, such as removing irrelevant and redundant features, reducing the chance of over-fitting, saving computational cost, improving prediction accuracy, and enhancing result comprehensibility (Mohsenzadeh et al. 2013; Nie et al. 2010; Liu et al. 2015). Many feature selection algorithms have been proposed in the past years. Among them, algorithms based on sparse representation have become very popular in recent years, as they compute the feature selection scores globally (Mohsenzadeh et al. 2013; Nie et al. 2010; Liu et al. 2015; Zhao et al. 2013, 2012; Nie et al. 2011; Wang et al. 2014; Sun et al. 2010; Xiang et al. 2012; Liang et al. 2011; Zhang et al. 2015a; Chen et al. 2015; Li and Tang 2015; Zhao et al. 2016; Jiawei et al. 2016; Wang et al. 2013; Shi et al. 2015; Wang et al. 2016; Han et al. 2016; Du et al. 2016; Zhou et al. 2016; Zhang et al. 2015b; Tsimpiris et al. 2014; Wanga et al. 2013).

These algorithms differ mainly in the constrained problems they design. After optimizing the constrained problem, the sparse coefficients of the features in reconstructing the feature representation goals are obtained, where the feature representation goals can be set by the labels of samples (Nie et al. 2010) or by the relationships among samples (Liu et al. 2015). The feature selection scores are then calculated from these sparse coefficients, which can be computed simultaneously by the optimization method. However, most of these algorithms suffer from two defects:

Firstly, the redundancy among the selected features is not considered in these algorithms. In our earlier work on sparse representation (Sun and Wen 2015), we found that the reconstruction ability of a basis for a testing sample is related to the cosine distance between this basis and the testing sample. As a result, two similar features with small cosine distances to the goal could both obtain large sparse coefficients, and then the feature selection scores of these two similar features could both be very large, which means that the redundancy among the selected features would be very large.

Although many feature selection algorithms have taken the problem of feature redundancy into account, several problems remain in these algorithms. Li and Tang (2015) proposed an unsupervised feature selection method via nonnegative spectral analysis and redundancy control, which calculates the coefficients of features and the redundancy constraint at the same time. However, this method cannot utilize label information. Zhao et al. (2016) proposed an efficient spectral feature selection method with minimum redundancy; however, the redundancy problem is only solved by a greedy algorithm. De Wang et al. (2015), Benabdeslem and Hindawi (2014) and Wanga et al. (2013) also considered the redundancy among features; however, the redundancy among the features and the structure preservation are measured separately in these algorithms, so the relative importance of structure preservation and redundancy cannot be controlled.

Secondly, the sparse coefficients of features in reconstructing different goals are given the same importance in these algorithms. However, the importance of the goals should differ in many cases. For example, the global and local structure preserving feature selection (GLSPFS) algorithm sets the goals by the eigenvectors obtained by solving the eigenproblem of the pairwise similarity matrix. Obviously, the importance of these eigenvectors should be different. For example, in principal component analysis (PCA) (Martinez and Kak 2001), the eigenvectors with larger eigenvalues are more important than those with smaller eigenvalues, as using these eigenvectors preserves more information of the data. In locality preserving projections (LPP) (He and Niyogi 2003), the eigenvectors with smaller eigenvalues are often used for the transformation matrix, as using these eigenvectors preserves more local structure of the data.

To overcome the above two defects and the defects of the feature selection algorithms that have taken feature redundancy into account, a new feature selection algorithm based on redundancy minimization and weighted structure preservation (WSPMRFS) is proposed. Firstly, in WSPMRFS, to minimize the redundancy and preserve the structure at the same time, a term calculated from the cosine distance between two features together with the L2,1 norms of these two features is added to the constrained problem, which suppresses two similar features from both taking large sparse coefficients. Secondly, in WSPMRFS, to account for the different importance of the goals, the coefficients used to reconstruct the eigenvectors are weighted by the corresponding eigenvalues. Thirdly, in WSPMRFS, to utilize the information offered by unlabeled data, the feature representation goals are set by both labeled and unlabeled data. As a result, our algorithm has the following three advantages:

  1. It minimizes the redundancy and preserves the structure at the same time.

  2. It considers the difference of importance among the feature representation goals.

  3. It utilizes the information offered by the unlabeled data.

The remainder of this paper is organized as follows. Section 2 reviews recent work. Section 3 introduces the proposed feature selection method. Experimental results are presented and discussed in Sect. 4. Finally, Sect. 5 gives concluding remarks.

2 Recent work

WSPMRFS is a feature selection algorithm based on sparse representation. In this section, we briefly review other feature selection algorithms based on sparse representation. According to the availability of class label information, these algorithms can be categorized as unsupervised, supervised and semi-supervised feature selection.

2.1 Unsupervised feature selection algorithms

Unsupervised feature selection algorithms can utilize the information offered by the unlabeled data.

Many researchers have designed new constrained problems to utilize different information from data. Du et al. (2016) proposed a multiple graph unsupervised feature selection method to leverage the information from both local and global graphs. Zhou et al. (2016) proposed an iterative approach to unsupervised feature selection, which is based on global and local structure preserving sparse subspace learning. Fang et al. (2014) proposed a novel unsupervised feature selection method in which locality and similarity preserving embedding are both utilized. Wu et al. (2016) proposed an unsupervised feature selection method combining group sparsity regularization with a locality-based learning algorithm. Chen et al. (2016) proposed high-level feature selection via a block structure dictionary so as to describe the structures and semantics in SAR imagery.

Other researchers have combined sparse representation with other methods to solve various problems. Wang et al. (2014) integrated feature selection and multiple kernel learning into sparse coding on the manifold, which can cope with noisy features and the nonlinear distribution of the data samples. Hou et al. (2014) proposed joint embedding learning with sparse regression to perform feature selection, which can learn the reconstruction goals and the sparse coefficients at the same time.

The above algorithms cannot utilize the information offered by labeled samples. Furthermore, most of them ignore the redundancy among the selected features and the difference in importance among the reconstruction goals.

2.2 Supervised feature selection algorithms

Supervised feature selection algorithms can utilize the information offered by the labeled data.

Many researchers have designed different sparse methods to obtain the sparse coefficients of features. L21RFS uses the L2,1 norm instead of the L2 norm to remove outliers (Nie et al. 2010). A structured sparse model has also been used to select features, which can exploit prior knowledge about the atoms of the dictionary (Nie et al. 2011). Wang et al. (2013) selected the important features from which the LDA will be constructed by an L1 minimization method. In DSML-FS, joint structured sparsity regularizations are used to exploit the intrinsic data structure and the relationships among different features, and discriminative least squares regression is applied to enlarge the distance between classes (Zhang et al. 2015a). Zhang et al. (2015b) proposed an optimized feature selection method which incorporates the L2,1 norm regularization into the original Fisher criterion. Shi et al. (2014) applied the L2,1/2 matrix norm in a sparse feature selection algorithm to select the most sparse and discriminative features. Xie and Xu (2014) used the sparse group least absolute shrinkage and selection operator technique to construct a feature selection algorithm for uncertain data. Zhang et al. (2015c) proposed a supervised feature selection method in which joint structured sparsity regularizations are used to exploit the intrinsic data structure and the relationships among different features. Chen et al. (2015) proposed a binary feature selector, which can handle the feature cardinality in the least squares formulation.

Other researchers have designed new feature selection strategies to utilize more information or solve specific problems. Subspace sparsity collaborated feature selection uncovers the shared subspace of the original features (Zhao et al. 2012). LLFS decomposes an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning and then learns feature relevance globally within the large margin framework (Sun et al. 2010). Xiang et al. (2012) introduced a new item called \(\upvarepsilon \)-dragging to force the regression targets of different classes to move along opposite directions so that the distances between classes can be enlarged. Liang et al. (2011) reformulated the generalization of the single-task sparsity-enforced feature selection method to multi-task cases as a simultaneous sparse approximation problem. Han et al. (2016) proposed a feature selection method with forward and backward steps, where the features with the strongest correlation to the classes are selected in the forward steps, and redundant features which contribute little to the improvement of the objective function are removed in the backward steps. Zhu et al. (2016) proposed a new feature selection method which utilizes both the sample–sample relation and the feature–feature relation.

Obviously, the above methods ignore the redundancy among the selected features and the difference in importance among the reconstruction goals. Furthermore, they cannot utilize the information offered by unlabeled samples.

2.3 Semi-supervised feature selection algorithms

Semi-supervised feature selection algorithms can utilize the information offered by both labeled and unlabeled data. SPFS selects k features that best preserve the pairwise sample similarity specified by a predefined similarity matrix, i.e., the global pairwise sample similarity (Zhao et al. 2013). GLSPFS integrates both the global pairwise sample similarity and the local geometric data structure to conduct feature selection (Liu et al. 2015). Shi et al. (2015) utilized multi-view Laplacian regularization to boost the performance of semi-supervised sparse feature selection.

Because labeled samples are hard to obtain in most applications, we also focus on semi-supervised feature selection in this paper. Compared with the above semi-supervised feature selection methods, our method minimizes the redundancy and considers the difference in importance among the reconstruction goals.

3 Weighted structure preservation and redundancy minimization for feature selection (WSPMRFS)

3.1 An example used to introduce WSPMRFS

WSPMRFS is based on sparse representation, which has recently been applied to face recognition. The principle of face recognition based on sparse representation is shown in Fig. 1. It can be seen from Fig. 1 that a test face can be well linearly represented by the corresponding training samples. This success is attributed to the fact that the intrinsic dimensionality of signals such as natural images is often much lower than the observed dimensionality, and thus sparse representation offers a more compact yet better description of natural signals.

Fig. 1 The principle of face recognition based on sparse representation

The above phenomenon also exists in feature selection, as illustrated in Fig. 2. As can be seen from Fig. 2, if the feature representation goal is reasonably defined, the features with the top sparse coefficients can well reach the goal of feature selection, where the sparse coefficient of each feature can be seen as the contribution of this feature to reaching the goal of feature selection. As a result, the sparse coefficients can be used to compute the feature selection scores.

Fig. 2 The principle of feature selection based on sparse representation

In general, defining different feature representation goals leads to different feature selection goals. For example, if the feature representation goal is defined by labeled samples, the feature selection result can be used for classification (Liu et al. 2015). If the feature representation goal is defined by unlabeled samples, the feature selection result can be used for clustering (Du et al. 2016). Furthermore, defining different optimization functions for the sparse representation endows the sparse coefficients with different properties. For example, L21RFS uses the L2,1 norm instead of the L2 norm to remove outliers (Nie et al. 2010). GLSPFS adds a new item to the optimization function to utilize the local geometric data structure (Liu et al. 2015).

In this paper, to utilize the information offered by both unlabeled and labeled samples, a new method is used to define the feature representation goal. To utilize the local geometric data structure and reduce the redundancy among the selected features, a new constrained problem for the sparse representation is defined.

3.2 The constrained problem of WSPMRFS

Given labeled training data \(X_L=[x_1,x_2,\ldots ,x_n]\in R^{d\times n}\) with the associated class labels \(Y=[y_1,y_2,\ldots ,y_n]\in R^{n}\), unlabeled training data \(X_U=[x_1,x_2,\ldots ,x_m]\in R^{d\times m}\), and the total training data \(X=[X_L,X_U]\in R^{d\times (n+m)}\), where n is the number of labeled training samples, m is the number of unlabeled training samples, and d is the dimension of the training samples, the constrained problem of WSPMRFS can be described as follows:

$$\begin{aligned}&\mathop {\min }\limits _W \left\| {X^{T}W-V} \right\| _2^2 +\alpha \,{\textit{tr}}\left( W^{T}{\textit{XLX}}^{T}W\right) \nonumber \\&\quad +\,\beta \left\| W \right\| _{2,1} +\frac{\gamma }{d(d-1)}\sum _{j,k=1}^d \left\| w_j \right\| _2 \left\| w_k \right\| _2 C_{\textit{jk}} \end{aligned}$$
(1)

where \(\alpha ,\beta ,\gamma \) are three regularization parameters; the other terms are described in the following.

The term \(\left\| {X^{T}W-V} \right\| _2^2\) exploits the global structure by preserving the pairwise sample similarity. \(W\in R^{d\times r}\) is the transformation matrix, and \(w_j\) is the j-th row of W. \(V\in R^{(n+m)\times r}\) is obtained by decomposing G as \(G=V *V^{T}\), where \(G\in R^{(n+m)\times (n+m)}\) is a pairwise similarity matrix computed with the original data X, which is defined as follows:

$$\begin{aligned} G(x_i,x_j)=\left\{ {\begin{array}{ll} 1 &{} y_i=y_j \\ 0 &{} y_i\ne y_j \\ \exp \left( -\frac{\left\| x_i-x_j \right\| _2 }{2\sigma _G^2 }\right) &{} x_i\in N_{K}(x_j)~\mathrm{or}~x_j\in N_{K}(x_i) \\ 0 &{} \hbox {otherwise} \\ \end{array} } \right. \end{aligned}$$
(2)

where \(\sigma _G =\sum \limits _{i=1}^n \sum \limits _{j=1}^n \left\| x_i-x_j \right\| _2 /n^{2}\), \(N_K(x_j)\) denotes the K nearest neighbors of \(x_{j}\) in X, and \(\left\| \cdot \right\| _2\) denotes the L2 norm. The first two cases of Eq. (2) apply when both \(x_i\) and \(x_j\) are labeled, and the last two cases apply otherwise; hence, G can be constructed using both labeled and unlabeled samples.
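For illustration, the following is a minimal sketch of one reading of Eq. (2), assuming numpy; the function and variable names (X_all as a d x (n+m) matrix, labels for the labeled part, n_labeled, K) are illustrative and not taken from the paper.

```python
import numpy as np

def similarity_matrix(X_all, labels, n_labeled, K=5):
    """X_all: d x (n+m) data matrix; labels: length-n array for the labeled samples."""
    N = X_all.shape[1]
    D = np.linalg.norm(X_all[:, :, None] - X_all[:, None, :], axis=0)  # pairwise L2 distances
    sigma_G = D.sum() / (N ** 2)                                       # mean pairwise distance
    knn = np.argsort(D, axis=1)[:, 1:K + 1]                            # K nearest neighbours, self excluded
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i < n_labeled and j < n_labeled:          # both samples labeled: use labels
                G[i, j] = 1.0 if labels[i] == labels[j] else 0.0
            elif j in knn[i] or i in knn[j]:             # otherwise: heat kernel for neighbours
                G[i, j] = np.exp(-D[i, j] / (2 * sigma_G ** 2))
            # all remaining entries stay 0
    return G
```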

The idea of the term \(\frac{\gamma }{d(d-1)}\sum \limits _{j,k=1}^d \left\| w_j \right\| _2 \left\| w_k \right\| _2 C_{\textit{jk}}\) is from Li and Tang (2015), and it is used to control the redundancy among features. \(d(d-1)\) is a normalization factor, which makes this term independent of the number of features. \(C_{\textit{jk}}\) is the similarity between the j-th and the k-th features, where the similarity between any two features is computed as follows:

$$\begin{aligned} C_{\textit{ij}}=\frac{x_{.i}{*} x_{.j}^T}{\left\| {x_{.i}} \right\| _2 \left\| {x_{.j}} \right\| _2 } \end{aligned}$$
(3)

where \(x_{.i}\) and \(x_{.j}\) are the i-th and j-th features, respectively. It can be seen from Eq. (3) that \(C_{\textit{ij}}\) is the cosine similarity. The cosine similarity is used instead of the mutual information used in Li and Tang (2015), as we found that the sparse coefficients are related to the cosine distances in our previous work (Sun and Wen 2015).
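A minimal sketch of computing the feature similarity matrix C of Eq. (3), assuming numpy; the small constant guarding against zero-norm features is an added assumption.

```python
import numpy as np

def feature_cosine_similarity(X):
    """X: d x (n+m) data matrix whose rows are the features; returns the d x d matrix C."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)   # L2 norm of every feature
    Xn = X / np.maximum(norms, 1e-12)                  # guard against all-zero features
    return Xn @ Xn.T                                   # C[i, j] = cosine similarity of features i and j
```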

The term \(\alpha {\textit{tr}}(W^{T}{\textit{XLX}}^{T}W)\) is used to capture the local geometric structure of the data. \(L\in R^{(n+m)\times (n+m)}\) is a matrix that characterizes the local geometric structure of all the data. L is computed as follows (Roweis and Saul 2000):

$$\begin{aligned} L=(I-S)^{T}(I-S) \end{aligned}$$
(4)

S can be obtained by optimizing the following equation:

$$\begin{aligned} \mathop {\min }\limits _S \sum _{i=1}^n {\left\| {x_{i}-\sum _{j=1}^n {S_{\textit{ij}}x_j}} \right\| }^{2} \end{aligned}$$
(5)

where \(S_{\textit{ij}}=0\) if \(x_{j}\) is not among the k nearest neighbors of \(x_{i}\), and \(\sum \limits _{j=1}^n {S_{\textit{ij}}} =1\).
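A minimal sketch of computing S (Eq. 5) and L (Eq. 4) in the LLE manner of Roweis and Saul (2000), assuming numpy; the regularizer `reg` added to the local Gram matrix is a common numerical safeguard and is not specified in the paper.

```python
import numpy as np

def lle_structure(X, k=5, reg=1e-3):
    """X: d x N data matrix; returns the reconstruction weights S and L = (I - S)^T (I - S)."""
    N = X.shape[1]
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    S = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[1:k + 1]               # k nearest neighbours of x_i
        Z = X[:, nbrs] - X[:, [i]]                     # centre the neighbours on x_i
        Gram = Z.T @ Z
        Gram += reg * np.trace(Gram) * np.eye(k)       # regularize the local Gram matrix
        w = np.linalg.solve(Gram, np.ones(k))
        S[i, nbrs] = w / w.sum()                       # weights sum to one (Eq. 5 constraint)
    I = np.eye(N)
    L = (I - S).T @ (I - S)                            # Eq. (4)
    return S, L
```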

The term \(\left\| W \right\| _{2,1}\) is an L2,1 norm regularization, which induces row sparsity in W so that only a few features receive non-zero coefficients.

3.3 An efficient algorithm to solve the constrained problem

The Lagrangian function of the problem in Eq. (1) is:

$$\begin{aligned} F(W)= & {} \left\| {X^{T}W-V} \right\| _2^2 +\alpha \,{\textit{tr}}\left( W^{T}{\textit{XLX}}^{T}W\right) \nonumber \\&+\,\beta \left\| W \right\| _{2,1} +\frac{\gamma }{d(d-1)}\sum _{j,k=1}^d \left\| w_j \right\| _2 \left\| w_k \right\| _2 C_{\textit{jk}} \end{aligned}$$
(6)

Taking the derivative of F(W) with respect to W and setting it to zero, we have:

$$\begin{aligned} W=\left( X(I+\alpha L)X^{T}+\beta U+\gamma H\right) ^{-1}XV \end{aligned}$$
(7)

where U is a diagonal matrix whose entries \(U_{\textit{ii}}\) are calculated by Eq. (8), and \(W_{i.}\) is the i-th row of W.

$$\begin{aligned} U_{\textit{ii}}=\frac{1}{2\left\| {W_{i.}} \right\| _2 } \end{aligned}$$
(8)

H is also a diagonal matrix, whose entries \(H_{\textit{ii}}\) are calculated as follows:

$$\begin{aligned} H_{\textit{ii}}=\frac{\sum \limits _{j=1}^d {\left\| {w_j} \right\| _2 C_{\textit{ij}}} }{2\left\| {W_{i.}} \right\| _2 } \end{aligned}$$
(9)

It can be seen from Eqs. (8) and (9) that U and H are uniquely determined once W is obtained at each iteration of the optimization procedure. Based on the result in Eq. (7), Eq. (1) can therefore be solved by updating W and the pair (U, H) alternately. At the t-th iteration, \(W^{t}\) is updated with \(U^{t-1}\) and \(H^{t-1}\), and then \(U^{t}\) and \(H^{t}\) are updated with \(W^{t}\). This procedure is repeated until convergence. The whole procedure of WSPMRFS is summarized in Algorithm 1.

Algorithm 1 The procedure of WSPMRFS

The computational complexity of WSPMRFS is mainly determined by step 2 and by the iterative process comprising steps 5 to 9. In step 2, the most expensive operation is the eigendecomposition, which costs \(\hbox {O}(n^{3})\). In the iterative process, step 6 has the highest computational complexity, which is also \(\hbox {O}(n^{3})\). As a result, the computational complexity of WSPMRFS is \(\hbox {O}({\textit{tn}}^{3})\), where t is the number of iterations.
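For illustration, a minimal sketch of the alternating updates of Eqs. (7)-(9) is given below, assuming numpy; it is a sketch of the procedure rather than the authors' implementation, the random initialization and the iteration count are assumptions, and the factor \(1/(d(d-1))\) of Eq. (1) is assumed to be absorbed into \(\gamma \).

```python
import numpy as np

def solve_wspmrfs(X, V, L, C, alpha, beta, gamma, n_iter=30, eps=1e-8):
    """X: d x N data, V: N x r targets, L: N x N local-structure matrix,
    C: d x d feature similarity; returns the d x r coefficient matrix W."""
    d, N = X.shape
    r = V.shape[1]
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((d, r))                    # assumed initialization
    A = X @ (np.eye(N) + alpha * L) @ X.T                     # fixed part of Eq. (7)
    for _ in range(n_iter):
        row_norms = np.maximum(np.linalg.norm(W, axis=1), eps)
        U = np.diag(1.0 / (2.0 * row_norms))                  # Eq. (8)
        H = np.diag((C @ row_norms) / (2.0 * row_norms))      # Eq. (9)
        W = np.linalg.solve(A + beta * U + gamma * H, X @ V)  # Eq. (7)
    return W
```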

3.4 Feature selection on sparse coefficient vectors

The output W of Algorithm 1 is composed of the sparse coefficient vectors. To analyze the physical meaning of these sparse coefficient vectors, we rewrite the term \(\left\| {X^{T}W-V} \right\| _2^2\) in Eq. (1) as follows:

$$\begin{aligned} \left\| {X^{T}W-V} \right\| _2^2 =\sum _{k=1}^r {\left\| {X^{T}w_{.k}-v_{.k}} \right\| _2^2 } \end{aligned}$$
(10)

It can be seen from Eq. (10) that \(w_{.k}\) contains the sparse coefficients of \(X^{T}\) in reconstructing \(v_{.k}\); specifically, \(w_{\textit{jk}}\) is the contribution of the j-th feature in reconstructing \(v_{.k}\), so W can be used to calculate the feature selection scores, where \(v_{.k}\) is the eigenvector corresponding to the k-th smallest eigenvalue obtained in step 2 of Algorithm 1.

Many references have used W to calculate the feature selection score (Xiang et al. 2012; Chen et al. 2015; Hou et al. 2014; Fang et al. 2014; Xie and Xu 2014; Zhang et al. 2015c; Cai et al. 2010); however, in these references the weights of \(w_{.k},k=1,2,\ldots ,r\) are all the same when calculating the feature selection score. The weights of \(w_{.k},k=1,2,\ldots ,r\) should instead be set to different values, as the importance of \(v_{.k},k=1,2,\ldots ,r\) is different.

It can be seen from step 2 of Algorithm 1 that this step is the LPP procedure (He and Niyogi 2003), where \(V=[v_{1},\ldots ,v_{r}]\) can be seen as the transformation matrix of LPP. As is known, in LPP the \(v_{.k}\) with smaller eigenvalues \(z_{k}\) are more important, as using these eigenvectors preserves more of the local structure of the data. Consequently, the corresponding \(w_{.k}\) should also be more important when calculating the feature selection score, as the ability of the features to reconstruct the more important \(v_{.k}\) matters more.

According to the above analysis, a new feature score for the j-th feature is defined by a two-step method. Firstly, the importance of each column of \(V=[v_{1},\ldots ,v_{r}]\) is defined as follows, where \(z_{k}\) is the eigenvalue associated with \(v_{.k}\) in step 2 of Algorithm 1 and \(z_{r+1}\) is the (r+1)-th smallest eigenvalue:

$$\begin{aligned} s_k=z_{r+1}-z_k,k=1,2,\ldots ,r \end{aligned}$$
(11)

Then, the feature score of the j-th feature is defined as follows:

$$\begin{aligned} {\textit{FS}}(j)=\sum _{k=1}^r s_{k} *w_{\textit{jk}} \end{aligned}$$
(12)

It can be seen from Eq. (11) that \(s_k\) is larger when \(z_k\) is smaller, so that \(w_{\textit{jk}}\) has a larger impact on \({\textit{FS}}(j)\).

After obtaining \({\textit{FS}}(j),j=1,2,\ldots ,d\), we sort all the feature selection scores in descending order and select the top features, where the number of finally selected features is determined by cross-validation on the training samples.
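A minimal sketch of this scoring and selection step (Eqs. 11-12), assuming numpy; `eigvals` is assumed to contain the r+1 smallest eigenvalues from step 2 of Algorithm 1 in ascending order, and `n_select` is the number of features chosen by cross-validation as described above.

```python
import numpy as np

def wspmrfs_scores(W, eigvals):
    """W: d x r coefficient matrix; eigvals: the r+1 smallest eigenvalues, ascending."""
    r = W.shape[1]
    s = eigvals[r] - eigvals[:r]                 # Eq. (11): s_k = z_{r+1} - z_k
    return W @ s                                 # Eq. (12): FS(j) = sum_k s_k * w_jk

def select_features(W, eigvals, n_select):
    scores = wspmrfs_scores(W, eigvals)
    return np.argsort(scores)[::-1][:n_select]   # indices of the top-scoring features
```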

4 Experiments

In this section, we conduct extensive experiments using several databases to verify the effectiveness of the proposed WSPMRFS.

4.1 Experiment setting

The proposed method is tested on three types of data: image data, speech data and UCI data. Detailed information on the data sets is listed in Table 1. The features of Berlin (Burkhardt et al. 2005), SAVEE (Haq and Jackson 2009) and CASIA are extracted with the openSMILE toolkit (Eyben and Wöllmer 2010). ORL, YaleB (Lee et al. 2005), PIE (Sim et al. 2013), Yale, COIL20 (Nene et al. 1996) and USPS are publicly available from the web page of Cai. The other databases can be downloaded from the UCI website.

Table 1 Detailed information on the data sets

The following feature selection algorithms are compared with our method: global and local structure preserving feature selection with locally linear embedding (LLESPFS) (Liu et al. 2015), multi-cluster feature selection (MCFS) (Cai et al. 2010), similarity preserving feature selection (SPFS) (Zhao et al. 2013), minimum redundancy maximum relevance (mRMR) (Peng et al. 2005), sparse feature selection based on graph Laplacian (FSLG) (Shi et al. 2014), Fisher score (FScore) (Zhou et al. 2016), and graph feature selection for dementia diagnosis (GSFDD) (Zhu et al. 2016). To fairly compare the different feature selection algorithms, we tune the parameters of all methods by a “grid-search” strategy over \(\{10^{-6}, 10^{-4}, \ldots , 10^{6}\}\); the nearest neighbor parameters of these methods are set to the number of unlabeled training samples, and the \(\sigma \) of these methods is set to the mean of the Euclidean distances among all training samples.

Table 2 PRs of compared methods on image databases with threefold cross-validation, where each entry is reported as average ± standard deviation

In our experiments, threefold and sixfold cross-validation are used. In threefold cross-validation, the three folds are used, respectively, for labeled training samples, unlabeled training samples and testing samples. In sixfold cross-validation, one fold is used for testing samples, the next fold is used for unlabeled training samples, and the other four folds are used for labeled training samples. The experiments were repeated 30 times to calculate the average accuracy (PR) and the corresponding standard deviation. KNN (k nearest neighbor) (Yihua Liao 2002) is used for classification, where k is set by cross-validation on the training samples.
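A minimal sketch of this threefold protocol, assuming numpy and scikit-learn; the helper `run_feature_selection` stands in for the whole WSPMRFS pipeline and, like the other names, is illustrative rather than the authors' evaluation code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def threefold_run(X, y, run_feature_selection, k_nn=3, seed=0):
    """X: N x d samples, y: labels; run_feature_selection returns indices of selected features."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 3)
    lab, unlab, test = folds                                  # labeled / unlabeled / test folds
    feats = run_feature_selection(X[lab], y[lab], X[unlab])   # selection uses both training folds
    clf = KNeighborsClassifier(n_neighbors=k_nn)
    clf.fit(X[lab][:, feats], y[lab])                         # KNN trained on labeled samples only
    pred = clf.predict(X[test][:, feats])
    return accuracy_score(y[test], pred)                      # PR for this run
```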

4.2 Experiments on image data

Six image databases are utilized to verify our method in this subsection, namely COIL20 (Nene et al. 1996), ORL, USPS, YaleB (Lee et al. 2005), Yale and PIE (Sim et al. 2013).

PRs of the compared methods on the image databases with threefold cross-validation are listed in Table 2, where the mean PR of each method over these databases is listed in the last row. It can be seen from Table 2 that the PRs of WSPMRFS are 1.87, 1.47, 0.08, 0.66, 1.47 and 1.29%, respectively, higher than the second best PRs on these databases, which demonstrates the effectiveness of WSPMRFS. It can also be seen from the last row of Table 2 that the mean PR of WSPMRFS is 2.39, 1.73, 2.03 and 3.11% higher than the mean PRs of LLESPFS, SPFS, MCFS and mRMR, respectively, which shows that WSPMRFS is a significant improvement over these methods, where LLESPFS, MCFS and SPFS are three sparse representation-based feature selection methods and mRMR is an information-based feature selection method.

Fig. 3 PRs against the dimensions of compared methods on image databases with threefold cross-validation

PRs against the dimensions of the compared methods on the image databases with threefold cross-validation are given in Fig. 3. It can be seen from Fig. 3 that the PRs of WSPMRFS are higher than those of the other compared methods in most dimensions on COIL20, ORL, Yale and PIE. Furthermore, it can be seen from the performance on USPS in Fig. 3 that our method achieves the second best PRs in most dimensions. These results indicate that WSPMRFS is stable and that the number of selected features can be easily set.

PRs of the compared methods on the image databases with sixfold cross-validation are listed in Table 3, where the mean PR of each method over the databases is also listed in the last row. It can be seen from Table 3 that the PRs of WSPMRFS are the highest on most of the databases and the second highest on USPS, which indicates that the effectiveness of WSPMRFS is not sensitive to the number of labeled training samples on the image databases. It can also be seen from the last row of Table 3 that the mean PR of WSPMRFS is the highest, so WSPMRFS is a significant improvement over the compared methods regardless of how many labeled training samples are used.

PRs against the dimensions of the compared methods on the image databases with sixfold cross-validation are given in Fig. 4. It can be seen from Fig. 4 that the PRs of WSPMRFS are the best or the second best in most dimensions on all testing databases, which confirms that the effectiveness of WSPMRFS against the dimensions is not sensitive to the number of labeled training samples on the image databases.

Table 3 PRs of compared methods on image databases with sixfold cross-validation
Fig. 4 PRs against the dimensions of compared methods on image databases with sixfold cross-validation

4.3 Experiments on speech data

Three speech databases are utilized to verify our method, namely Berlin (Burkhardt et al. 2005), SAVEE (Haq and Jackson 2009) and CASIA.

PRs of the compared methods on these databases with threefold cross-validation are listed in Table 4, where the mean PR of each method over these databases is listed in the last row. It can be seen from Table 4 that the PRs of WSPMRFS are 2.98, 2.39 and 1.90%, respectively, higher than the second best PRs on Berlin, CASIA and SAVEE, which shows that WSPMRFS also performs very well on speech databases. It can also be seen from the last row of Table 4 that the mean PR of WSPMRFS is 7.06, 3.84, 3.35 and 6.95% higher than the mean PRs of LLESPFS, SPFS, MCFS and mRMR, respectively, which shows that WSPMRFS is also a significant improvement over the above methods on the speech databases.

PRs against the dimensions of the compared methods on the speech databases with threefold cross-validation are given in Fig. 5. It can be seen from Fig. 5 that the PRs of WSPMRFS are higher than those of the other compared methods in most dimensions on the three databases, which shows that WSPMRFS has good effectiveness against the dimensions on these speech databases.

PRs of the compared methods on the speech databases with sixfold cross-validation are listed in Table 5, where the mean PR of each method over the databases is listed in the last row. It can be seen from Table 5 that the PRs of WSPMRFS are also the highest, which indicates that WSPMRFS is not sensitive to the number of labeled training samples on the speech databases. Furthermore, it can be seen from the last row of Table 5 that the mean PR of WSPMRFS is 3.67, 4.15, 1.88 and 4.67% higher than the mean PRs of LLESPFS, SPFS, MCFS and mRMR, respectively, and is thus the highest, which shows that WSPMRFS is also a significant improvement over the compared methods with sixfold cross-validation.

PRs against the dimensions of the compared methods on the speech databases with sixfold cross-validation are given in Fig. 6. It can be seen from Fig. 6 that the PRs of WSPMRFS are higher than those of the other compared methods in most dimensions on Berlin and SAVEE, which shows that WSPMRFS also has good effectiveness against the dimensions in these experiments.

Table 4 PRs of compared methods on speech databases with threefold cross-validation
Fig. 5 PRs against the dimensions of compared methods on speech databases with threefold cross-validation

Table 5 PRs of compared methods on speech databases with sixfold cross-validation
Fig. 6 PRs against the dimensions of compared methods on speech databases with sixfold cross-validation

Table 6 PRs of compared methods on UCI databases with threefold cross-validation

4.4 Experiments on UCI data

Nine UCI databases are utilized to verify our method, namely SPECTF, credit_g, cancer, Pcaovarian50, diabetes, glass, Heart scale, ionosphere and Opt digits.

PRs of the compared methods on these databases with threefold cross-validation are listed in Table 6, where the mean PR of each method over these databases is listed in the last row. It can be seen from Table 6 that the PRs of WSPMRFS are 0.42, 0.59, 1.22, 1.76, 2.15, 2.09, 0.45 and 0.15%, respectively, higher than the second best PRs on SPECTF, cancer, Pcaovarian50, diabetes, glass, Heart scale, ionosphere and Opt digits, which shows that WSPMRFS also performs very well on the UCI databases. It can also be seen from the last row of Table 6 that the mean PR of WSPMRFS is 2.93, 2.73, 1.69 and 3.08% higher than the mean PRs of LLESPFS, SPFS, MCFS and mRMR, respectively, which shows that WSPMRFS is also a significant improvement over the above methods on the UCI databases.

PRs against the dimensions of the compared methods on the UCI databases with threefold cross-validation are given in Fig. 7. It can be seen from Fig. 7 that WSPMRFS also achieves the best PRs in most dimensions on most databases.

PRs of the compared methods on the UCI databases with sixfold cross-validation are listed in Table 7, where the mean PR of each method over the databases is listed in the last row. It can be seen from Table 7 that the PRs of WSPMRFS are also the highest on all databases. Furthermore, it can be seen from the last row of Table 7 that the mean PR of WSPMRFS is 2.77, 2.38, 1.37 and 2.35% higher than the mean PRs of LLESPFS, SPFS, MCFS and mRMR, respectively, and is thus the highest, which shows that WSPMRFS is also a significant improvement over the compared methods with sixfold cross-validation.

PRs against the dimensions of the compared methods on the UCI databases with sixfold cross-validation are given in Fig. 8. It can be seen from Fig. 8 that the PRs of WSPMRFS are higher than those of the other compared methods in most dimensions on most databases, which shows that WSPMRFS also has good effectiveness in these experiments.

Fig. 7 PRs against the dimensions of compared methods on UCI databases with threefold cross-validation

Table 7 PRs of compared methods on UCI databases with sixfold cross-validation
Fig. 8 PRs against the dimensions of compared methods on UCI databases with sixfold cross-validation

4.5 Running time

To compare the execution speed of WSPMRFS with that of the other algorithms, the running times (in seconds) of all compared algorithms on the image and speech databases with threefold cross-validation are presented in Table 8, where all algorithms are implemented in MATLAB 2016a and run on an Intel i5-6500 CPU with 8 GB RAM. The running times on the UCI databases are not reported, as they are too small to be meaningful.

It can be seen from Table 8 that the running time of WSPMRFS is similar to those of LLESPFS and SPFS, as the optimization method of WSPMRFS is similar to theirs. Furthermore, the mean running time of each algorithm over all databases is presented in the last row of Table 8. It can be seen from Table 8 that WSPMRFS has a running time similar to most of the compared algorithms. Moreover, although FScore has the smallest running time, its feature selection effectiveness is much worse than that of WSPMRFS.

Table 8 The running seconds of all compared algorithms on image databases and speech databases

5 Conclusion

This paper presents a feature selection method based on weighted structure preservation and redundancy minimization (WSPMRFS). It has the following advantages. Firstly, WSPMRFS considers the global pairwise similarity preservation, the local structure preservation and the redundancy minimization at the same time, where the redundancy minimization is ignored in most structure preservation based feature selection methods. Secondly, the eigenvectors obtained by solving the eigenproblem of the pairwise similarity matrix are assigned different weights according to the corresponding eigenvalues, so that the features that are more capable of preserving the pairwise similarity are preferentially selected. The results demonstrate the improved performance of the proposed approach over existing methods.

There are several interesting problems to be investigated in our future work: (1) given two useful features related to each other, the term used for redundancy minimization may suppress the coefficients of both features, so that neither of them is selected; a new term should therefore be designed for redundancy minimization. (2) In this paper, the coefficients are weighted by the corresponding eigenvalues. However, the weight may not change linearly with the eigenvalue, so the weighting method should be further studied. (3) The local structure is constructed by the idea of LLE (Roweis and Saul 2000); however, many new local structure construction methods have been proposed in recent years, so we will seek or design a more suitable local structure construction method for feature selection.