1 Introduction

Feature selection is a research hotspot in the fields of pattern analysis, machine learning, and data mining. The earliest feature selection studies mainly focused on statistical and signal processing problems. Since large-scale machine learning emerged in the 1990s, existing feature selection algorithms have struggled to meet the challenge [26]. When the number of features exceeds a certain size, the accuracy of a classifier declines, a phenomenon known as the curse of dimensionality [1, 19, 36]. Therefore, there is an urgent need to develop better feature selection algorithms that improve both accuracy and efficiency on large-scale data.

Feature selection is a method of selecting a subset of features with high discriminative ability from the full feature set in order to reduce the dimension of the feature space. It is an important component of a classification system [21, 27]. For a classification system, good training samples are the key to training a classifier. The quality of the data, for example whether the samples contain irrelevant or redundant features, directly affects the performance of the classifier [29, 30]. Therefore, it is important to develop an effective feature selection method.

In general, based on how the subset evaluation criterion is combined with the follow-up learning algorithm, feature selection approaches can be categorized into three groups: the filter approach, the wrapper approach and the embedded approach. The filter approach [10] is independent of the follow-up learning algorithm and uses the statistical properties of all training samples to evaluate features [17]. Its time cost is low, but the evaluation may deviate from the behavior of the follow-up learning algorithm. The wrapper approach [10, 18] instead uses the follow-up learning algorithm to evaluate the accuracy obtained with candidate features; the deviation is thus small, but the computational cost is large, making it unsuitable for large-scale data sets [31]. In the embedded approach [7], feature selection is embedded as a component of the learning algorithm itself; the most typical embedded method is the decision tree [30]. In all cases, the key to feature selection is the efficient identification of a useful subset of features: the features in the selected subset are kept while the remaining features are discarded. However, the discarded features may be related to the retained ones, so discarding them may lose useful relevant information. It is therefore helpful for feature selection to effectively exploit the relations within the data.

To exploit the relations within the data, self-representation has been widely used for feature selection [33]. It is based on self-similarity, i.e., each feature can be represented by all the other features. Subspace learning has also been introduced to preserve the relevance between data [9, 12, 35]. Subspace learning methods, such as Principal Component Analysis (PCA) [37] and Linear Discriminant Analysis (LDA), are designed to maintain specific statistical properties when performing the feature space transformation. They can effectively mitigate the curse of dimensionality and preserve the inherent relevance of the data. These two observations motivate this paper to combine the merits of self-representation and subspace learning in a single framework.

Therefore, this paper proposes a new Unsupervised Feature selection method that joins Self-representation and Subspace learning, called UFSS for short. First, because of the correlation between features, we use all features to represent each feature, that is, each feature is reconstructed by all the features, and we use a least squares loss function to evaluate the reconstruction error. Then, to overcome over-fitting and to select the most discriminative features, we constrain the reconstruction coefficient matrix with a Frobenius norm regularizer. In addition, we introduce Locality Preserving Projection (LPP) as a regularization term to maintain the local adjacency relations of the data during the feature space transformation in the reconstruction process [11]. We further apply a low-rank constraint to uncover the effective low-dimensional structures of the data, which reduces redundancy [34]. Finally, we propose an effective optimization method to solve the objective function efficiently.

In summary, the core of feature selection is to select the most effective features from the original features so as to reduce the feature dimension, which is a key data preprocessing step in pattern recognition. Based on the selected optimal feature subset, we use a classic classifier, i.e., the Support Vector Machine (SVM), to classify the test samples.

The rest of this paper is organized as follows. We briefly review previous feature selection and subspace learning methods in Section 2. In Section 3, we give the details of the proposed unsupervised feature selection method UFSS. We then present the experimental results in Section 4. Finally, we summarize our work and discuss future work in Section 5.

2 Related work

In this section, we briefly review three topics on which the proposed algorithm builds and which play different roles during the reconstruction: unsupervised feature selection, subspace learning and self-representation.

2.1 Unsupervised feature selection

Supervised feature selection methods use class labels as a guide to achieve feature selection. However, in practical applications a lot of data may be unlabeled. Unsupervised feature selection methods are therefore useful but also more difficult, because they cannot rely on class labels. Tabakhi et al. [19] proposed an ant colony algorithm, which provides a good approximate solution based on previous iterations at an acceptable time cost. Liu et al. [13] drew on the Laplacian Score (LS) method: they replaced the k-means clustering step with a distance-based entropy measure to automatically select the optimal subset of features. Qian and Zhai [16] took advantage of local learning and nonnegative matrix factorization; their method selects the most discriminative subset of features by combining robust clustering and robust feature selection. Based on manifold learning and a sparse learning model, Cai et al. [3] proposed Multi-Cluster Feature Selection (MCFS), which uses spectral analysis based on the preserved multi-cluster structure of the data to measure the relevance between different features.

2.2 Subspace learning

Subspace learning has been applied in many kinds of models for dimension reduction. Zou et al. [39] proposed an improved principal component analysis based on sparse coding, called sparse principal component analysis (SPCA), which uses a Lasso penalty to produce sparse principal components. Yan et al. [24] used graph embedding to represent the geometric structure or properties of the sample space; based on this, they characterized intraclass compactness and interclass separability simultaneously, which better addresses the limited number of available projection directions in LDA. More recently, Nikitidis et al. [15] proposed maximum margin projection pursuit, which exploits the maximum margin to discriminate samples when performing the feature space transformation. Cai et al. [2] proposed to learn sparse projections using both graph embedding and regression, and their unified framework can solve different graph-based subspace learning methods.

2.3 Self-representation

Self-representation stems from the natural self-similarity phenomenon [33], which means that a part of an object is similar to other parts of the object, as seen in coastlines, stock market movements and so on. Just as sparsity leads to sparse representation, self-similarity leads to self-representation. Self-representation has been widely used for high-dimensional data. Zhu et al. [33] proposed to use self-representation with an ℓ2,1-norm constraint on the coefficient matrix to remove outliers and to select the most representative features for reconstructing the other features. Zhang et al. [28] proposed an improved kNN method based on self-representation, which also uses the training samples to reconstruct themselves and imposes an ℓ1-norm penalty to make the representation coefficient matrix sparse.

3 Approach

In this section, we first give the notations used in this paper in Section 3.1 and some preliminary background in Section 3.2. The details of the proposed UFSS method are described in Section 3.3. We present an optimization method for solving the objective function in Section 3.4. Finally, we analyze the convergence of the objective function in Section 3.5.

3.1 Notations

In this paper, we denote scalars as normal italic letters, vectors as bold lowercase letters and matrices as bold uppercase letters, respectively [29, 30]. Given a matrix X = [x ij ], we denote the i th row of X by x i and the j th column of X by x j . The Frobenius norm, ℓ1-norm and ℓ2,1-norm of X (all entry-wise matrix norms) are defined as \(||{\mathbf {X}}|{|_{F}} = \sqrt {\sum \nolimits _{i} {||{{\mathbf {x}}^{i}}||_{2}^{2}}} = \sqrt {\sum \nolimits _{j} {||{{\mathbf {x}}_{j}}||_{2}^{2}}}\), \(||{\mathbf {X}}|{|_{1}} = {\sum \nolimits _{i} {\sum \nolimits _{j} {|{x_{ij}}|}} }\) and \(||{\mathbf {X}}|{|_{2,1}} = \sum \nolimits _{i} {\sqrt {\sum \nolimits _{j} {{\mathbf {x}}_{ij}^{2}}} } \). The trace, the transpose and the inverse of X are denoted as t r(X), X T and X − 1, respectively.
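For concreteness, the following is a minimal NumPy sketch (with made-up data) of how these entry-wise matrix norms can be computed; it is an illustration, not code from the paper.

```python
import numpy as np

X = np.array([[1.0, -2.0],
              [3.0,  4.0]])

fro = np.sqrt((X ** 2).sum())               # ||X||_F: square root of the sum of squared entries
l1  = np.abs(X).sum()                       # ||X||_1: sum of absolute values of all entries
l21 = np.sqrt((X ** 2).sum(axis=1)).sum()   # ||X||_{2,1}: sum of the l2-norms of the rows

print(fro, l1, l21)                         # 5.477..., 10.0, 7.236...
```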

3.2 Preliminary

Let \({\mathbf {X}} \in {\mathbb {R}^{n \times d}}\) denote a sample matrix, where n and d denote the numbers of samples and features, respectively. We also use x 1, x 2, ... , x n to denote the n samples, where x i corresponds to the i th row of X, thus X = [x 1; x 2; ... ; x n ]. On the other hand, we use f 1, f 2, ... , f d to denote the d features, where each feature vector \({{\mathbf {f}}_{i}} \in {\mathbb {R}^{n}}\) and X = [f 1, f 2, ... , f d ].

The key of unsupervised feature selection methods is to select an optimal feature subset for all the samples. Drawing lessons from the regression problem [28, 33], we can regard feature selection as a regression problem [33]:

$$ \underset{\mathbf{W}}{\min} l({\mathbf{XW}} - {\mathbf{Y}}) + \lambda R({\mathbf{W}}) $$
(1)

where W denotes a coefficient matrix that measures the weights of the features, Y is usually a response matrix, l(X W − Y) denotes a loss function, R(W) is usually a regularization term imposed on W, and λ is a positive constant.
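As a point of reference (our illustration, not a step of the paper): if l(·) is taken to be the squared Frobenius loss and R(W) = ||W|| F squared, model (1) reduces to ridge regression, which has the closed-form solution W = (X T X + λI) − 1 X T Y. A minimal NumPy sketch with hypothetical sizes and data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 20, 5, 3                       # samples, features, response dimension (arbitrary)
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, c))
lam = 0.1

# Closed-form minimizer of ||XW - Y||_F^2 + lam * ||W||_F^2
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# One common heuristic: score feature i by the l2-norm of the i-th row of W
scores = np.linalg.norm(W, axis=1)
print(scores)
```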

3.3 UFSS method

Many feature selection methods are derived from the model in (1); however, even when the correlation between features is considered, it is still hard to choose a proper response matrix Y. Drawing on the merits of both self-representation and subspace learning, in this section we propose a new combined method for unsupervised feature selection. Self-representation means that the proposed method uses the sample matrix X itself as the response matrix, i.e., Y = X, so that each feature can be represented by all the features. Therefore, we can represent each feature f i as follows:

$$ {{\mathbf{f}}_{i}} = \sum\limits_{j = 1}^{d} {{{\mathbf{f}}_{j}}{w_{ji}}} + {{\mathbf{e}}_{i}} $$
(2)

Applying (2) to all the features, we can write it in matrix form as follows:

$$ {\mathbf{X}} = {\mathbf{XW}} + {\mathbf{eb}} $$
(3)

where \({\mathbf {W}} = [{w_{ji}}] \in {\mathbb {R}^{d \times d}}\) is the self-representation coefficient matrix, \({\mathbf {e}} \in {\mathbb {R}^{n \times 1}}\) is a column vector with all elements equal to 1, and \({\mathbf {b}} \in {\mathbb {R}^{1 \times d}}\) is a bias term. Obviously, for X W to represent X sufficiently well, the residual X − X W − e b should be as small as possible. The Frobenius norm can be adopted to measure this residual, i.e., \(\underset {{\mathbf {W}},{\mathbf {b}}}{\min } ||{\mathbf {X}} - {\mathbf {XW}} - {\mathbf {eb}}|{|_{F}}\). The matrix W reflects the importance of the different features. To avoid over-fitting and to select the most discriminative features, we absorb the core idea of ridge regression [29, 30] and introduce a shrinkage regularization term, i.e., ||W|| F . In addition, a basic assumption of manifold learning, a classic branch of subspace learning, is that data presented in a high-dimensional space may actually lie on a very low-dimensional manifold; that is, the data can largely be represented by a low-dimensional structure if we can map it back into the low-dimensional space and reveal its essence. Taking advantage of manifold learning, we introduce locality preserving projection (LPP) as a regularization term to maintain the local adjacency relations of the data after the feature space transformation during the self-representation process [32, 38]. We then obtain the following objective function:

$$ \underset{{\mathbf{W}},{\mathbf{b}}} {\min}||{\mathbf{X}} - {\mathbf{XW}} - {\mathbf{eb}}|{|_{F}} + {\lambda_{1}}||{\mathbf{W}}|{|_{F}} + {\lambda_{2}}tr({{\mathbf{W}}^{T}}{{\mathbf{X}}^{T}}{\mathbf{LXW}}) $$
(4)

where λ 1 and λ 2 are control parameters. The penalty term ||W|| F penalizes all coefficients of W together; \({\textbf {L}} = {\textbf {D}} - {\textbf {S}} \in {\mathbb {R}^{n \times n}}\) is the graph Laplacian, where \({\textbf {S}} \in {\mathbb {R}^{n \times n}}\) is a similarity matrix over the samples and \({\textbf {D}} \in {\mathbb {R}^{n \times n}}\) is the corresponding diagonal degree matrix. On the other hand, in order to further remove the large amount of redundancy in the data, a low-rank constraint is introduced to find the effective low-dimensional structures of the data, which helps reduce redundancy. Thus, the low-rank constraint can be applied to the rank of W, i.e.,

$$ rank({\mathbf{W}}) = r,\quad r \le \min (n,d) $$
(5)

Further, (5) can be re-expressed as the product of two rank-r matrices as follows:

$$ {\mathbf{W}} = {\mathbf{AB}} $$
(6)

where \({\mathbf {A}} \in {\mathbb {R}^{d \times r}}\) and \({\mathbf {B}} \in {\mathbb {R}^{r \times d}}\). Therefore, the final objective function becomes:

$$\begin{array}{@{}rcl@{}} &&\underset{{\mathbf{A}},{\mathbf{B}},{\mathbf{b}}}{\min} ||{\mathbf{X}} - {\mathbf{XAB}} - {\mathbf{eb}}|{|_F} + {\lambda _1}||{\mathbf{W}}|{|_F} + {\lambda _2}tr({{\mathbf{B}}^T}{{\mathbf{A}}^T}{{\mathbf{X}}^T}{\textbf{LXAB}})\\ && s.t.,\; rank({\mathbf{AB}}) \le \min (n,d) \end{array} $$
(7)
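To make the ingredients of (7) concrete, the sketch below builds a kNN similarity graph over the samples with a heat kernel (one common choice for S; the paper does not specify the construction here), forms L = D − S, and evaluates the three terms of the objective for given A, B and b. All data, sizes and parameter values are hypothetical.

```python
import numpy as np

def knn_laplacian(X, k=5, sigma=1.0):
    """Laplacian L = D - S of a symmetrized kNN graph over the rows (samples) of X."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                  # k nearest neighbors (skip self)
        S[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    S = np.maximum(S, S.T)                                 # symmetrize the graph
    return np.diag(S.sum(1)) - S

def ufss_objective(X, A, B, b, L, lam1, lam2):
    """Value of objective (7) for given A, B, b, with W = AB."""
    W = A @ B
    e = np.ones((X.shape[0], 1))
    resid = X - X @ W - e @ b                              # reconstruction residual
    return (np.linalg.norm(resid, 'fro')
            + lam1 * np.linalg.norm(W, 'fro')
            + lam2 * np.trace(W.T @ X.T @ L @ X @ W))

rng = np.random.default_rng(0)
n, d, r = 30, 8, 3
X = rng.standard_normal((n, d))
A, B = rng.standard_normal((d, r)), rng.standard_normal((r, d))
b = rng.standard_normal((1, d))
print(ufss_objective(X, A, B, b, knn_laplacian(X), lam1=0.1, lam2=0.1))
```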

In brief, we use the proposed UFSS method to select the optimal feature subset from the original feature space, which is a key data preprocessing step for dimension reduction. Based on the selected optimal feature subset, we then conduct classification with a classic classifier, i.e., the Support Vector Machine (SVM). The pseudocode of the UFSS algorithm is given in Algorithm 1 below.

[Algorithm 1: pseudocode of the proposed UFSS method]
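Algorithm 1 itself is not reproduced here. As one plausible reading of this step (our assumption; the exact ranking rule in Algorithm 1 may differ), once W = AB has been learned, features can be scored by the ℓ2-norms of the corresponding rows of W, the top-k features kept, and an SVM trained on the reduced data. scikit-learn's SVC is used below as a generic SVM stand-in.

```python
import numpy as np
from sklearn.svm import SVC

def select_features(W, k):
    """Indices of the k features with the largest l2 row norms of W (assumed scoring rule)."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(scores)[::-1][:k]

def fit_svm_on_selected(X_train, y_train, W, k):
    """Train a linear SVM on the k selected feature columns."""
    idx = select_features(W, k)
    clf = SVC(kernel='linear').fit(X_train[:, idx], y_train)
    return clf, idx
```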

3.4 Optimization

The objective function (7) is not jointly convex with respect to the three variables A, B and b. We therefore propose an iterative algorithm that optimizes A, B and b alternately. Concretely, we iteratively execute the following two steps until a pre-set stopping condition is met: (i) update b with A and B fixed; (ii) update A and B with b fixed.

  (i) Fix A and B, then update b.

    We set the derivative of the objective function (7) with respect to b equal to 0:

    $$ 2{{\mathbf{e}}^{T}}{\mathbf{eb}} + 2{{\mathbf{e}}^{T}}{\mathbf{XAB}} - 2{{\mathbf{e}}^{T}}{\mathbf{X}} = 0 $$
    (8)

    Solving for b, we obtain:

    $$ {\mathbf{b}} = (1/n)({{\mathbf{e}}^{T}}{\mathbf{X}} - {{\mathbf{e}}^{T}}{\mathbf{XAB}}) $$
    (9)
  (ii) Fix b, then update A and B.

    Substituting (9) into the objective function (7), we have:

    $$\begin{array}{@{}rcl@{}} &&\underset{{\mathbf{A}},{\mathbf{B}}}{\min} ||{\mathbf{X}} - {\mathbf{XAB}} - {\mathbf{e}((1/n)({{\mathbf{e}}^T}{\mathbf{X}} - {{\mathbf{e}}^T}{\mathbf{XAB}}))}|{|_F} + {\lambda _1}||{\mathbf{W}}|{|_F} \\ && + {\lambda _2}tr({{\mathbf{B}}^T}{{\mathbf{A}}^T}{{\mathbf{X}}^T}{\textbf{LXAB}}) \end{array} $$
    (10)

Let \({\mathbf{H}} = {\mathbf{I}_n} - (1/n){\mathbf{e}}{{\mathbf{e}}^T} \in {\mathbb{R}^{n \times n}}\), where \({\mathbf{I}_n}\) is the identity matrix; then (10) can be rewritten as

$$ \underset{{\mathbf{A,B}}}{\min} ||{\mathbf{HX}} - {\mathbf{HXAB}}|{|_{F}} + {\lambda_{1}}||{\mathbf{AB}}|{|_{F}} + {\lambda_{2}}tr({{\mathbf{B}}^{T}}{{\mathbf{A}}^{T}}{{\mathbf{X}}^{T}}{\textbf{LXAB}}) $$
(11)

which is equivalent to

$$\begin{array}{@{}rcl@{}} &&\underset{{\mathbf{A,B}}}{\min} tr({({\mathbf{HX}} - {\mathbf{HXAB}})^T}{\mathbf{P}}({\mathbf{HX}} - {\mathbf{HXAB}})) + {\lambda _1}tr({{\mathbf{B}}^T}{{\mathbf{A}}^T}{\textbf{QAB}}) \\ && + {\lambda _2}tr({{\mathbf{B}}^T}{{\mathbf{A}}^T}{{\mathbf{X}}^T}{\textbf{LXAB}}) \end{array} $$
(12)

where \({\mathbf{P}} \in {\mathbb{R}^{n \times n}}\) and \({\mathbf{Q}} \in {\mathbb{R}^{d \times d}}\) are diagonal matrices with \({{\mathbf {P}}_{ii}} = \frac {1}{2||{({\mathbf {HX}} - {\mathbf {HXAB}})^{i}}|{|_{2}}}\), i = 1, ... , n, and \({{\mathbf {Q}}_{jj}} = \frac {1}{2||{({\mathbf {AB}})^{j}}|{|_{2}}}\), j = 1, ... , d.

Setting the derivative of (12) with respect to B to zero, we obtain:

$$ {\mathbf{B}} = {({{\mathbf{A}}^{T}}({{\mathbf{X}}^{T}}{{\mathbf{H}}^{T}}{\textbf{PHX}} + {\lambda_{1}}{\mathbf{Q}} + {\lambda_{2}}{{\mathbf{X}}^{T}}{\mathbf{LX}}){\mathbf{A}})^{- 1}}{{\mathbf{A}}^{T}}{{\mathbf{X}}^{T}}{{\mathbf{H}}^{T}}{\textbf{PHX}} $$
(13)

Substituting (13) into (12), we have:

$$\begin{array}{@{}rcl@{}} &&\underset{\mathbf{A}}{\max}\, tr[{({{\mathbf{A}}^T}({{\mathbf{X}}^T}{{\mathbf{H}}^T}{\textbf{PHX}} + {\lambda _1}{\mathbf{Q}} + {\lambda _2}{{\mathbf{X}}^T}{\mathbf{LX}}){\mathbf{A}})^{- 1}} \\ &&{{\mathbf{A}}^T}{{\mathbf{X}}^T}{{\mathbf{H}}^T}{\textbf{PHX}}{{\mathbf{X}}^T}{{\mathbf{H}}^T}{{\mathbf{P}}^T}{\textbf{HXA}}] \end{array} $$
(14)

Note that

$$\begin{array}{@{}rcl@{}} &&{{\textbf{S}}_{\textbf{t}}} = {{\mathbf{X}}^T}{{\mathbf{H}}^T}{\textbf{PHX}} + {\lambda _1}{\mathbf{Q}} + {\lambda _2}{{\mathbf{X}}^T}{\mathbf{LX}} \\ && {{\textbf{S}}_{\mathbf{b}}} = {{\mathbf{X}}^T}{{\mathbf{H}}^T}{\textbf{PHX}}{{\mathbf{X}}^T}{{\mathbf{H}}^T}{{\mathbf{P}}^T}{\mathbf{HX}} \end{array} $$
(15)

where S t and S b denote the total-class scatter matrix and the between-class scatter matrix, respectively. Then the solution of (14) can be represented as below:

$$ {{\mathbf{A}}^{*}} = \underset{\mathbf{A}}{\arg \max}\, tr[{({{\mathbf{A}}^{T}}{{\textbf{S}}_{\textbf{t}}}{\mathbf{A}})^{- 1}}{{\mathbf{A}}^{T}}{{\textbf{S}}_{b}}{\mathbf{A}}] $$
(16)

The global optimal solution of (16) is given by the top r eigenvectors of \({\textbf {S}}_{\textbf {t}}^{- 1}{{\textbf {S}}_{\mathbf {b}}}\). The above analysis leads to Algorithm 2 below.

[Algorithm 2: iterative optimization of the objective function (7)]
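For illustration, a minimal NumPy sketch of the alternating updates derived above (the b update (9), the B update (13) and the A update (16)) is given below. Algorithm 2 is not reproduced verbatim; the random initialization, the small eps terms and the fixed iteration count are our own assumptions for numerical stability.

```python
import numpy as np

def ufss_solve(X, r, lam1, lam2, L, n_iter=30, eps=1e-8):
    """Alternating optimization of objective (7); returns W = AB and the bias b."""
    n, d = X.shape
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix H = I - (1/n) e e^T
    rng = np.random.default_rng(0)
    A = np.linalg.qr(rng.standard_normal((d, r)))[0]    # random orthonormal init (assumption)
    B = rng.standard_normal((r, d))

    for _ in range(n_iter):
        W = A @ B
        Z = H @ X - H @ X @ W
        # Diagonal reweighting matrices as in (12); eps avoids division by zero
        P = np.diag(1.0 / (2.0 * np.linalg.norm(Z, axis=1) + eps))
        Q = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))

        M  = X.T @ H.T @ P @ H @ X
        St = M + lam1 * Q + lam2 * X.T @ L @ X          # S_t in (15)
        Sb = M @ X.T @ H.T @ P.T @ H @ X                # S_b in (15)

        # A: top-r eigenvectors of St^{-1} Sb, as in (16) (real parts kept)
        evals, evecs = np.linalg.eig(np.linalg.solve(St, Sb))
        A = np.real(evecs[:, np.argsort(-np.real(evals))[:r]])

        # B: closed form (13)
        B = np.linalg.solve(A.T @ St @ A, A.T @ M)

    W = A @ B
    b = (np.ones((1, n)) @ X - np.ones((1, n)) @ X @ W) / n   # bias from (9)
    return W, b
```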

3.5 Proof of convergence

It can be proved that the value of the objective function (7) decreases monotonically in each iteration [4]. Note that the objective functions (11) and (12) are equivalent. Thus, we have

$$\begin{array}{@{}rcl@{}} & \Rightarrow& tr[{({\mathbf{HX}} - {\mathbf{HX}}{{\mathbf{A}}^{(t + 1)}}{{\mathbf{B}}^{(t + 1)}})^T}{{\mathbf{P}}^{(t)}}({\mathbf{HX}} - {\mathbf{HX}}{{\mathbf{A}}^{(t + 1)}}{{\mathbf{B}}^{(t + 1)}})] \\ &&+{\lambda _1}tr({{\mathbf{B}}^{{{(t + 1)}^T}}}{{\mathbf{A}}^{{{(t + 1)}^T}}}{{\mathbf{Q}}^{(t)}}{{\mathbf{A}}^{(t + 1)}}{{\mathbf{B}}^{(t + 1)}}) \\ &&+{\lambda _2}tr({{\mathbf{B}}^{{{(t + 1)}^T}}}{{\mathbf{A}}^{{{(t + 1)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{A}}^{(t + 1)}}{{\mathbf{B}}^{(t + 1)}}) \\ &\le& tr[{({\mathbf{HX}} - {\mathbf{HX}}{{\mathbf{A}}^{(t)}}{{\mathbf{B}}^{(t)}})^T}{{\mathbf{P}}^{(t)}}({\mathbf{HX}} - {\mathbf{HX}}{{\mathbf{A}}^{(t)}}{{\mathbf{B}}^{(t)}}) \\ &&+{\lambda _1}tr({{\mathbf{B}}^{{{(t)}^T}}}{{\mathbf{A}}^{{{(t)}^T}}}{{\mathbf{Q}}^{(t)}}{{\mathbf{A}}^{(t)}}{{\mathbf{B}}^{(t)}}) + {\lambda _2}tr({{\mathbf{B}}^{{{(t)}^T}}}{{\mathbf{A}}^{{{(t)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{A}}^{(t)}}{{\mathbf{B}}^{(t)}}) \end{array} $$
(17)

Let Z = HX − HXAB; then we have [4]:

$$\begin{array}{@{}rcl@{}} &\Rightarrow& \sum\limits_{i = 1}^n {\frac{{||{z^{i(t + 1)}}||_2^2}}{{2||{z^{i(t)}}|{|_2}}}} + {\lambda _1}\sum\limits_{j = 1}^d {\frac{{||{w^{j(t + 1)}}||_2^2}}{{2||{w^{j(t)}}|{|_2}}}} + {\lambda _2}tr({{\mathbf{W}}^{{{(t + 1)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{W}}^{(t + 1)}}) \\ &\le& \sum\limits_{i = 1}^n {\frac{{||{z^{i(t)}}||_2^2}}{{2||{z^{i(t)}}|{|_2}}} + {\lambda _1}} \sum\limits_{j = 1}^d {\frac{{||{w^{j(t)}}||_2^2}}{{2||{w^{j(t)}}|{|_2}}} + {\lambda _2}tr({{\mathbf{W}}^{{{(t)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{W}}^{(t)}})} \end{array} $$
(18)

For each i, we have:

$$ ||{z^{i(t + 1)}}|{|_{2}} - \frac{{||{z^{i(t + 1)}}||_{2}^{2}}}{{2||{z^{i(t)}}|{|_{2}}}} \le ||{z^{i(t)}}|{|_{2}} - \frac{{||{z^{i(t)}}||_{2}^{2}}}{{2||{z^{i(t)}}|{|_{2}}}} $$
(19)

Similarly, for each j, we have:

$$ ||{w^{j(t + 1)}}|{|_{2}} - \frac{{||{w^{j(t + 1)}}||_{2}^{2}}}{{2||{w^{j(t)}}|{|_{2}}}} \le ||{w^{j(t)}}|{|_{2}} - \frac{{||{w^{j(t)}}||_{2}^{2}}}{{2||{w^{j(t)}}|{|_{2}}}} $$
(20)

Combining (19) and (20), we have:

$$\begin{array}{@{}rcl@{}} &&\sum\limits_{i = 1}^n {||{z^{i(t + 1)}}|{|_2}} - \frac{{||{z^{i(t + 1)}}||_2^2}}{{2||{z^{i(t)}}|{|_2}}} + {\lambda _1}\sum\limits_{j = 1}^d {||{w^{j(t + 1)}}|{|_2}} - \frac{{||{w^{j(t + 1)}}||_2^2}}{{2||{w^{j(t)}}|{|_2}}} \\ &\le& \sum\limits_{i = 1}^n {||{z^{i(t)}}|{|_2}} - \frac{{||{z^{i(t)}}||_2^2}}{{2||{z^{i(t)}}|{|_2}}} + {\lambda _1}\sum\limits_{j = 1}^d {||{w^{j(t)}}|{|_2}} - \frac{{||{w^{j(t)}}||_2^2}}{{2||{w^{j(t)}}|{|_2}}} \end{array} $$
(21)

Combining (18) and (21), we have:

$$\begin{array}{@{}rcl@{}} &&\sum\limits_{i = 1}^n {||{z^{i(t + 1)}}||_2^2} + {\lambda _1}\sum\limits_{j = 1}^d {||{w^{j(t + 1)}}||_2^2} + {\lambda _2}tr({{\mathbf{W}}^{{{(t + 1)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{W}}^{(t + 1)}}) \\ &\le& \sum\limits_{i = 1}^n {||{z^{i(t)}}||_2^2} + {\lambda _1}\sum\limits_{j = 1}^d {||{w^{j(t)}}||_2^2} + {\lambda _2}tr({{\mathbf{W}}^{{{(t)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{W}}^{(t)}}) \end{array} $$
(22)

From the above analysis, and combining (9) with \({\mathbf{H}} = {\mathbf{I}_n} - (1/n){\mathbf{e}}{{\mathbf{e}}^T} \in {\mathbb{R}^{n \times n}}\), we have:

$$\begin{array}{@{}rcl@{}} && ||{\mathbf{X}} - {\mathbf{X}}{{\mathbf{W}}^{(t + 1)}} - {\mathbf{eb}}|{|_F} + {\lambda _1}||{{\mathbf{W}}^{(t + 1)}}|{|_F} + {\lambda _2}tr({{\mathbf{W}}^{{{(t + 1)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{W}}^{(t + 1)}}) \\ &\le& ||{\mathbf{X}} - {\mathbf{X}}{{\mathbf{W}}^{(t)}} - {\mathbf{eb}}|{|_F} + {\lambda _1}||{{\mathbf{W}}^{(t)}}|{|_F} + {\lambda _2}tr({{\mathbf{W}}^{{{(t)}^T}}}{{\mathbf{X}}^T}{\mathbf{LX}}{{\mathbf{W}}^{(t)}})\end{array} $$
(23)

Obviously, the value of the objective function (7) does not increase in each iteration [14, 20, 23]. Further, because the objective value is bounded below by zero, the sequence of objective values converges [22].
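As an empirical sanity check of this monotone decrease (our illustration, not part of the proof), one can record the objective value (7) after each iteration of the solver sketched earlier and verify that it never increases:

```python
import numpy as np

def check_monotone(objective_values, tol=1e-6):
    """True if the recorded objective values never increase by more than tol."""
    return bool(np.all(np.diff(np.asarray(objective_values)) <= tol))

# Usage (hypothetical): vals[t] = ufss_objective(X, A_t, B_t, b_t, L, lam1, lam2)
# recorded inside the iteration loop of ufss_solve; then assert check_monotone(vals).
```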

4 Experiment

In this section, we first introduce the experimental setting in Section 4.1 and briefly describe the methods compared with ours in Section 4.2. We then summarize and analyze the experimental results in Section 4.3.

4.1 Experimental setting

The experimental environment is a Windows XP system, and Matlab 7.11.0 is used to implement all the algorithms. In our experiments, we conduct 10-fold cross-validation for all methods, and the final result is computed by averaging the results of all runs. We apply the proposed UFSS method and the comparison methods to the classification task and evaluate them on eight datasets in terms of three evaluation metrics, i.e., classification accuracy, STandard Deviation (STD) and coefficient of variation. Specifically, we compare our method with the other methods for dimension reduction via feature selection, and then use the Support Vector Machine (SVM) [25] via the LIBSVM toolbox to conduct classification. The datasets include binary and multi-class datasets: SPECTF_Heart, LungCancer, Sonar, Movements, Arrhythmia and Yeast are downloaded from the UCI Machine Learning Repository, the USPS dataset is downloaded from the Feature Selection Datasets website, and the FERET dataset is downloaded from the CSDN website. We summarize the datasets in Table 1.

Table 1 Datasets summarization

We use three evaluation metrics for the classification task: classification accuracy, STandard Deviation (STD for short) and Coefficient of Variation (CV for short). The higher the accuracy of an algorithm, the better its classification performance; the smaller its STD and CV, the more stable and robust it is.
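The sketch below illustrates this protocol: 10-fold cross-validation with a linear SVM trained on the selected features, reporting mean accuracy, STD and CV. scikit-learn is used here as a stand-in for the Matlab/LIBSVM setup of the paper, and ufss_solve, knn_laplacian and select_features are the hypothetical helpers sketched in Section 3.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def evaluate_ufss(X, y, k, r=10, lam1=0.1, lam2=0.1, n_splits=10, seed=0):
    """Mean accuracy, STD and coefficient of variation over stratified 10-fold CV."""
    accs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        X_tr, X_te = X[train_idx], X[test_idx]
        L = knn_laplacian(X_tr)                      # sample graph built on training data only
        W, _ = ufss_solve(X_tr, r, lam1, lam2, L)    # unsupervised: labels are not used here
        idx = select_features(W, k)                  # keep the k highest-scoring features
        clf = SVC(kernel='linear').fit(X_tr[:, idx], y[train_idx])
        accs.append(clf.score(X_te[:, idx], y[test_idx]))
    accs = np.array(accs)
    return accs.mean(), accs.std(), accs.std() / accs.mean()
```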

4.2 Comparison methods

The comparison methods are introduced as follows:

  • PCA: The method is a common dimensionality reduction method, which is used for extracting the important feature components from data [8].

  • TRACK: The method mainly takes advantage of the trace ratio formulation and K-means clustering to select the most discriminative features [33].

  • RSR: The method combines sparse regularization and semi-supervised learning to select the most informative features, which makes the classifier robust to outliers [6].

  • FSR_ALM: The method directly uses an ℓ2,0-norm constraint for exact top-k feature selection, and the augmented Lagrangian method is used to tackle the constrained optimization problem [5].

4.3 Experimental results

We present the mean classification accuracy and the corresponding STandard Deviation (STD) of all algorithms on the eight datasets in Table 2. Table 3 shows the Coefficient of Variation (CV) of all algorithms on the eight datasets. Figure 1 shows the classification accuracy of all algorithms on the eight datasets over ten iterations, where the horizontal axis denotes the iterations and the vertical axis denotes the classification accuracy.

Figure 1 Classification accuracy on eight datasets over ten iterations: a SPECTF_Heart; b LungCancer; c Sonar; d Movements; e USPS; f Arrhythmia; g Yeast; h FERET

Table 2 The results of Classification Accuracy (mean ±STD)
Table 3 The result of Coefficient of Variation (mean)

With regard to the classification accuracy and STandard Deviation (STD) in Table 2 and Figure 1, we have the following observations:

  • The proposed method UFSS improves the classification accuracy on average over the eight datasets by about 4.2% (vs. PCA), 3.3% (vs. TRACK), 2.9% (vs. RSR) and 2.8% (vs. FSR-ALM). In addition, from Figure 1 we can easily see that the proposed UFSS achieves higher accuracy than the four comparison algorithms in almost every iteration. The reason is that the four comparison algorithms, i.e., PCA, TRACK, RSR and FSR-ALM, are subspace learning methods and mainly consider only one kind of correlation inherent in the data, while the proposed UFSS joins self-representation and subspace learning for unsupervised feature selection and thus captures two kinds of correlation inherent in the data. At the same time, the proposed UFSS introduces a low-rank constraint to find the effective low-dimensional structures of the data, which helps reduce redundancy. Therefore, UFSS achieves better performance than the other comparison algorithms.

  • The proposed method UFSS outperforms PCA, TRACK, RSR and FSR-ALM by large margins on the LungCancer dataset, i.e., by about 8.5%, 8.9%, 10.6% and 6.8%, respectively. The UFSS method absorbs the merits of both self-representation and LPP and integrates them into a unified framework, so it can select the most discriminative features and achieve the best classification accuracy.

  • The proposed method UFSS has the smallest STandard Deviation (STD) among all methods: on average over the eight datasets it is lower by about 0.28 (vs. PCA), 0.34 (vs. TRACK), 0.36 (vs. RSR) and 0.05 (vs. FSR-ALM). This shows that our proposed method is more stable than the comparison methods.

With regard to the Coefficient of Variation (CV) in Table 3, we observe that the UFSS method does not always achieve the best performance on every dataset; for example, on SPECTF_Heart and FERET, UFSS does not have the lowest CV. But in general, the proposed method UFSS achieves the smallest coefficient of variation, i.e., 0.97, while the comparison methods PCA, TRACK, RSR and FSR-ALM achieve 1.39, 1.62, 1.47 and 0.99, respectively. This shows that the proposed UFSS algorithm is more robust than the other comparison algorithms on the whole.

5 Conclusion

In this paper, we have proposed an unsupervised feature selection method based on self-representation and subspace learning. We use all features to represent each feature; in other words, each feature is reconstructed by all the features. A Frobenius norm regularizer on the reconstruction coefficient matrix is used to overcome over-fitting and to select the most discriminative features. We also introduce Locality Preserving Projection (LPP) as a regularization term to maintain the local adjacency relations of the data when performing the feature space transformation. Further, we apply a low-rank constraint to find the effective low-dimensional structures of the data. Experiments on real datasets have been conducted to compare the performance of the proposed method with other state-of-the-art methods. The experimental results show that the proposed method UFSS outperforms the other methods in terms of classification accuracy, standard deviation and coefficient of variation.

In future work, we plan to extend the UFSS method to supervised feature selection.