1 Introduction

With the rapid development of information technology and database technology, people can easily access and store huge amounts of data. Traditional data analysis tools can no longer satisfy these growing needs [10, 42, 66]. To efficiently extract useful information from massive data, many data mining techniques have been developed. In real applications, however, data is usually high-dimensional and may contain a lot of redundant and useless information, which seriously undermines the effectiveness of data mining methods [13, 17, 19]. Therefore, it is crucial to conduct dimensionality reduction on the data before applying any data mining algorithm [15, 16].

Dimensionality reduction is one of the most important research areas of machine learning [9, 21, 41]. In real applications, data is often high-dimensional and therefore difficult to understand, represent and process [20, 37, 39]. By employing dimensionality reduction methods to break the curse of dimensionality, people can process the data more easily and thus understand it more fully [11, 29, 49]. Methods for reducing the dimensionality of data can be divided into two categories. The first category contains linear methods, such as Principal Component Analysis (PCA) [50] and Classical Multidimensional Scaling (CMDS) [52]. Low-dimensional data obtained by these linear methods can usually preserve the linear relationships between high-dimensional data points, but such methods cannot reveal the complex nonlinear manifold structure of the data. The second category, based on manifold learning theory, consists of nonlinear methods, such as Isometric Mapping (Isomap) [64] and Locally Linear Embedding (LLE) [63]. This type of algorithm can learn the inherent geometric structure of the data, facilitating data processing and analysis [16, 24].

Dimensionality reduction methods based on manifold learning can usually be divided into two categories. The first is based on the local manifold structure of the data. The most classical method is Locality Preserving Projections (LPP) [62], which preserves neighborhood relations between data points after dimensionality reduction. The second category consists of methods that consider the global structure of the data. Linear Discriminant Analysis (LDA) [7] is the most commonly used in many areas and produces the best projection in the sense that, after projection, samples with the same class label have the smallest distance while those with different class labels have the largest distance [18, 40, 61, 63]. That is, LDA yields the best separability for data samples. However, most feature selection algorithms take only one of the two manifold structures of the data into account, i.e., either global or local, so their performance is not satisfactory [12, 28, 30].

Because of good properties such as insensitivity to noisy data, feature self-representation has been widely used in machine learning and computer vision [60, 65]. However, since the self-representation coefficient matrix obtained by these methods is fixed, feature self-representation based methods do not adapt well to data with complex structure. To this end, this paper presents a feature selection algorithm that combines structure learning with sparse learning (LSS_FS). The main contributions of this paper include:

  • Since our method uses feature self-representation to build the model, it is not very sensitive to noise and outliers, exhibiting good robustness.

  • The model imposes a low-rank constraint on the self-representation coefficient matrix to explore the global manifold structure of the data, which has been proven to effectively exploit the correlation among features.

  • The proposed model can dynamically adjust the structure of the data according to its similarity matrix, making the learned structure more accurate and thereby improving classification accuracy.

  • We propose a novel method to solve the objective function. We perform low-rank feature selection and dynamically adjust the graph matrix according to the similarity matrix to optimize the objective function iteratively. An optimal solution is obtained for each subproblem in every iteration, and the objective value decreases monotonically until convergence.

2 Related work

Sparse learning was originally developed to compress and represent signals, since it allows a sampling rate below that required by the Shannon theorem. It has gradually become popular and has been successfully used to solve many practical problems in signal and image processing, such as signal encoding, representation and compression, image denoising, and image super-resolution [34, 48, 57].

With the continuous innovation and practice of many scientists and researchers around the world, sparse learning theory has been successfully applied in signal processing, data mining, machine learning, information retrieval, pattern recognition, biological computing and other fields, and it has become a research hotspot in these areas. A large number of experts and scholars are still carrying out in-depth studies of the theory and applications of sparse learning. Because sparse learning has a naturally discriminative nature, it is particularly suitable for face recognition [27, 36]. Specifically, based on the theory of Compressive Sensing, sparse learning reconstructs an approximation of the original signal to find a dimensionality reduction matrix, so that the projection of the original high-dimensional signal into the low-dimensional space retains its original features as much as possible [32, 35]. Sparse learning methods have achieved very high recognition accuracy in face recognition. Although they achieve good performance, they need to store all training images for recognition, which significantly increases the storage overhead [43, 46]. Moreover, when sparse learning deals with independent signals, it only considers the inherent relevance of each signal, ignoring the correlation between signals of the same type.

Sparse learning evaluates the importance of an element in terms of the weight of its coefficient: the weight of an unimportant element is zero and that of an important one is nonzero [31]. The more important an element, the greater the corresponding weight. Therefore, by using the coefficient weights between data samples or features, sparse learning can naturally incorporate discriminative information into the model [55]. This enables sparse learning to reduce the impact of noise on the model and to improve the efficiency of the learning model. As a result, sparse learning has been widely used in data mining and machine learning. For example, Zhu et al. proposed a joint sparse learning method that achieves block sparsity by considering both the correlation between data and the global information of the data, and it has been used to classify multi-label data. Gao et al. used the histogram intersection kernel (HIK) to implement kernel sparse learning, which conducts sparse coding with the HIK basis via a soft-assignment feature quantization [8]. Xia et al. proposed a sparse projection algorithm to conduct binary coding for high-dimensional data [27]. Sparse Subspace Learning (SSL) can also be seen as a special dimensionality reduction method [2]. Researchers have proposed a series of sparse subspace learning algorithms based on sparse learning theory, including Nonnegative Sparse Principal Component Analysis (NSPCA) [45], Sparse Nonnegative Matrix Factorization (SNMF) [23], and so on.

The essence of sparse learning is to introduce the coefficient weights between samples or features as identification information into the model. By using a sparsity constraint to penalize the model, the coefficient weights of irrelevant data approach zero while useful information is preserved, which makes sparse learning robust [52]. However, sparse learning does not take the internal structure of the data into account.

To overcome this shortcoming of sparse learning, manifold learning has been proposed. Linear discriminant analysis is the most common technique used in manifold learning. Zhong et al. proposed a method that uses the ℓ1-norm to maximize the objective function (LDA-ℓ1) [47], where LDA-ℓ1 captures the global manifold structure via discriminant analysis and then selects the most critical features. Unsupervised Feature Selection via Unified Trace Ratio Formulation and K-means Clustering (TRACK) was proposed as a unified framework for feature selection that selects discriminative features according to a trace ratio formulation and K-means clustering [25]. Cai et al. proposed Multi-Cluster Feature Selection (MCFS), which explores the local manifold structure of the data via spectral analysis and then searches for the features that best preserve the clustering structure [3]. However, these algorithms are sensitive to outliers, and if the data contains too much noise, their performance degrades significantly.

Many robust feature selection methods exist. For example, unsupervised feature selection by regularized self-representation (RSR) uses all the features to linearly represent each feature, thus weakening the impact of outliers [56]. Sun et al. put forward an unsupervised robust Bayesian feature selection method, where feature selection is realized by estimating the saliencies associated with the features [22]. However, these methods do not consider the manifold structure of the data, so their performance is not satisfactory. To obtain better stability and performance, this paper combines manifold learning with feature self-representation, and then introduces a sparse regularization term to conduct feature selection.

3 Our method

In this section, we first introduce the notations used in this paper and then explain the details of the proposed LSS_FS method, in Sections 3.1 and 3.2, respectively. We then elaborate the proposed optimization method in Section 3.3. Finally, we analyze the convergence of the objective function in Section 3.4.

3.1 Notations

For a matrix \(\textit {X} \in {\mathbb {R}^{n \times d}}\), its i-th row and j-th column are denoted as \(x^{i}\) and \(x_{j}\), respectively. The trace of a matrix X is denoted as tr(X); \(X^{T}\) and \(X^{-1}\) denote the transpose and the inverse of X, respectively. We denote the Frobenius norm and the ℓ2,1-norm of X as \(||\textit {X}|{|_{F}} = \sqrt {{\sum \nolimits _{i}^{n}} {||{\textit {x}^{i}}||}_{2}^{2}} = \sqrt {{\sum \nolimits _{j}^{d}} {||{\textit {x}_{j}}||}_{2}^{2}} \) and \( ||X|{|_{2,1}} = {\sum \nolimits _{i}^{n}} {\sqrt {{\sum \nolimits _{j}^{d}} {x_{i,j}^{2}} } } \), respectively.
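For concreteness, the two matrix norms above can be checked numerically; the following NumPy sketch (our own illustration, with arbitrary variable names) computes both:

```python
import numpy as np

# Illustrative check of the norms defined above (not part of the original paper).
X = np.random.randn(5, 3)                        # n = 5 samples, d = 3 features

fro_norm = np.sqrt((X ** 2).sum())               # ||X||_F
l21_norm = np.sqrt((X ** 2).sum(axis=1)).sum()   # ||X||_{2,1}: sum of row-wise l2-norms

# ||X||_F can equivalently be computed from rows, columns, or NumPy's built-in routine.
assert np.isclose(fro_norm, np.linalg.norm(X, 'fro'))
```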

3.2 Structure learning for feature selection

Many previous studies have indicated that the local structures of the samples may provide complementary information that boosts dimensionality reduction [44, 54]. Hence, this paper proposes to utilize the local structures of the samples by learning a graph matrix \(S \in {\mathbb {R}^{n \times n}}\) on a low-dimensional space of the original data. Given the feature matrix X, we obtain the following formula according to [51, 59]:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{Z} \sum\nolimits_{i,j}^{n} {||{x^{i}}Z - {x^{j}}Z||_{2}^{2}{s_{i,j}}} \end{array} \end{array} $$
(1)

where ||⋅||2 is the ℓ2-norm of a vector, \(Z \in {\mathbb {R}^{d \times d}}\) is a transformation matrix of the high-dimensional data X into its low-dimensional space, and the element \(s_{i,j}\) of the graph matrix S represents the similarity between samples \(x^{i}\) and \(x^{j}\). If sample \(x^{j}\) is one of the k-nearest neighbors of sample \(x^{i}\), then the value of the heat kernel, i.e., \(f({x^{i}},{x^{j}}) = exp(- \frac {{||{x^{i}} - {x^{j}}||_{2}^{2}}}{{2{\sigma ^{2}}}})\) where σ is a tuning parameter, is assigned to \(s_{i,j}\); otherwise \(s_{i,j} = 0\).
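As an illustration of how this initial graph might be built, the following sketch (our own code, with k and σ as the tuning parameters mentioned above) assigns heat-kernel weights to the k nearest neighbors of each sample:

```python
import numpy as np

def heat_kernel_knn_graph(X, k=5, sigma=1.0):
    """Initial similarity matrix S: s_{i,j} = exp(-||x^i - x^j||^2 / (2 sigma^2))
    if x^j is among the k nearest neighbors of x^i, and 0 otherwise."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(sq_dist[i])
        neighbors = [j for j in order if j != i][:k]     # exclude the sample itself
        S[i, neighbors] = np.exp(-sq_dist[i, neighbors] / (2 * sigma ** 2))
    return S
```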

In fact, the quality of S has been shown to be very sensitive to σ [49, 53]. This inspires us to learn a relatively correct graph matrix from the 'clean' data and to decrease the number of parameters that need to be adjusted. By clean, we mean a low-dimensional space with as little noise and redundancy as possible [33]. However, in real applications, neither the graph matrix nor the low-dimensional space is known in advance [38]. To deal with this, we combine graph matrix learning with low-dimensional space learning and optimize them iteratively to achieve the optimal results. In this way, we learn the graph matrix by following the distribution of the samples. We then have the following objective function:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,Z} \sum\nolimits_{i,j}^{n} {(||{x^{i}}Z - {x^{j}}Z||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}),} \\ &&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0, {s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{otherwise~0}} \end{array} $$
(2)

where λ1 is a control parameter and \(s_{i}\) denotes the vector whose j-th element is \(s_{i,j}\), \(||{s_{i}}||_{2}^{2}\) is used to avoid the trivial solution, 1 and \(\mathcal {N}(i)\) represent an all-one vector and the set of the nearest neighbors of the i-th sample, respectively, and the constraint \({s_{i}^{T}}{\textbf {1}} = 1\) is applied to gain shift-invariant similarity. Clearly, (2) assigns a small value (i.e., similarity) to \(s_{i,j}\) if samples i and j are far apart, and a large value otherwise.

Previous methods learn the graph matrix via (1) to obtain an optimal similarity measurement. In contrast, (2) aims to achieve both an optimal similarity measurement (i.e., S) and feature selection results (i.e., Z). Hence, (2) may yield better feature selection results (i.e., Z) than (1).

To reduce the adverse impact of outliers and noisy samples on the model, we propose to use the self-representation method to build an unsupervised model, yielding the following function:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,Z} \sum\nolimits_{i,j}^{n} {||{x^{i}}Z - {x^{j}}Z||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}{{ + }}} {\lambda_{2}}{{||}}X - XZ{{||}}_{F}^{2}\\ &&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0, {s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{otherwise~0}} \end{array} $$
(3)

To improve the effectiveness of the algorithm and to take the relationships among features into consideration, we impose a constraint on the rank of Z [58]. A low-rank constraint on Z can naturally be expressed by representing Z as the product of two rank-r matrices as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} Z = AB \end{array} \end{array} $$
(4)

where \(A \in {\mathbb {R}^{d \times r}}, B \in {\mathbb {R}^{r \times d}}, r \le \min (n,d),\) and our function becomes

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,A,B} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}{{ + }}} {\lambda_{2}}{{||}}X - XAB{{||}}_{F}^{2} \\&&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0, {s_{i,j}} \ge 0,{\text{ if }}j \in \mathcal{N}(i),{\mathrm{otherwise~0}} \end{array} $$
(5)

According to [56], the feature selection issue can be converted to the following form:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{W} l(X - XW) + \lambda R(W) \end{array} \end{array} $$
(6)

where W is the feature weight matrix, l(X − XW) is the loss function, R(W) is the regularization on W, and λ is a positive constant. In the term \({{||}}X - XAB{{||}}_F^2\), if we treat XA as the X of (6), then B can naturally be regarded as the W of (6). Inspired by (6), we therefore place a regularization on B to further improve the performance of the algorithm. Our ultimate objective function is:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,A,B} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}{{ + }}} {\lambda_{2}}{{||}}X - XAB{{||}}_{F}^{2} + {\lambda_{3}}{{||}}B{{|}}{{{|}}_{2,1}}\\ &&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{ otherwise 0, }}{A^{T}}A = I \end{array} $$
(7)

where λ1, λ2 and λ3 are tuning parameters, and the constraint \(A^{T}A = I\) (where \(A \in {\mathbb {R}^{d \times r}}\) and \(I \in {\mathbb {R}^{r \times r}}\)) is introduced for identifiability.
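For reference, a direct (unoptimized) evaluation of the objective in (7) for given A, B and S can be written as follows; this is our own helper sketch, reused in the optimization sketch later:

```python
import numpy as np

def objective(X, A, B, S, lam1, lam2, lam3):
    """Value of the objective in (7) for fixed A, B, S (illustrative helper)."""
    Y = X @ A @ B                                   # reconstruction X A B
    n = X.shape[0]
    smooth = sum(S[i, j] * ((Y[i] - Y[j]) ** 2).sum()
                 for i in range(n) for j in range(n))
    return (smooth
            + lam1 * (S ** 2).sum()                    # sum_i ||s_i||_2^2
            + lam2 * np.linalg.norm(X - Y, 'fro') ** 2
            + lam3 * np.linalg.norm(B, axis=1).sum())  # ||B||_{2,1}
```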

3.3 Optimization

Equation (7) is not jointly convex with respect to all the variables (i.e., A, B, and S), but it is convex in each variable when the others are fixed. In this paper, we utilize an alternating optimization method to optimize (7), i.e., we iteratively optimize each variable in turn.

Update A by fixing B and S

When B and S are fixed, the second and fourth terms of (7) can be viewed as constants, thus we can get:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{A} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}} + {\lambda_{2}}||X - XAB||_{F}^{2},s.t.,{A^{T}}A = I \end{array} \end{array} $$
(8)

By following the IRLS framework, we rewrite (8) as:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{A} tr({B^{T}}{A^{T}}{X^{T}}LXAB) + {\lambda_{2}}tr({X^{T}}X - {X^{T}}XAB \\&-& {B^{T}}{A^{T}}{X^{T}}X + {B^{T}}{A^{T}}{X^{T}}XAB),s.t.,{A^{T}}A = I \end{array} $$
(9)

where tr(⋅) is the trace operator, \(L = Q - S \in {\mathbb {R}^{n \times n}}\) is a Laplacian matrix, and Q is a diagonal matrix with its i-th diagonal element \({q_{i,i}} = \sum \nolimits _{j = 1}^{n} {{s_{i,j}}} \). By fixing B, we obtain the derivative of (9) with respect to A as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} 2{X^{T}}LXAB{B^{T}} - 2{\lambda_{2}}{X^{T}}X{B^{T}} + 2{\lambda_{2}}{X^{T}}XAB{B^{T}} \end{array} \end{array} $$
(10)

Since A is orthogonal, we can employ an existing method in [26] to optimize it.

Update B by fixing A and S

When A and S are fixed, the second term of (7) can be viewed as a constant, and we get:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{B} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}} + {\lambda_{2}}||X - XAB||_{F}^{2} + {\lambda_{3}}||B|{|_{2,1}} \end{array} \end{array} $$
(11)

which is equivalent to

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{B} tr({B^{T}}{A^{T}}{X^{T}}LXAB) + {\lambda_{2}}tr({X^{T}}X - {X^{T}}XAB\\ &-& {B^{T}}{A^{T}}{X^{T}}X + {B^{T}}{A^{T}}{X^{T}}XAB) + {\lambda_{3}}tr({B^{T}}PB) \end{array} $$
(12)

where \(P \in {\mathbb {R}^{r \times r}}\) is a diagonal matrix with \({p_{i,i}} = \frac {1}{{2||{B^i}|{|_2}}}\ (i = 1, {\ldots } ,r)\). Setting the derivative of (12) with respect to B to zero, we obtain:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} B = {\lambda_{2}}({A^{T}}{X^{T}}LXA + {\lambda_{2}}{A^{T}}{X^{T}}XA + {\lambda_{3}}P)^{- 1}{A^{T}}{X^{T}}X \end{array} \end{array} $$
(13)
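In code, this closed-form update amounts to a single linear solve; a possible implementation (our own naming, assuming L and P have been computed as described above) is:

```python
import numpy as np

def update_B(X, A, L, P, lam2, lam3):
    """Closed-form B update of (13):
    B = lam2 (A^T X^T L X A + lam2 A^T X^T X A + lam3 P)^{-1} A^T X^T X."""
    XA = X @ A                                       # n x r
    M = XA.T @ L @ XA + lam2 * (XA.T @ XA) + lam3 * P
    return lam2 * np.linalg.solve(M, XA.T @ X)       # r x d
```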

Update S by fixing A and B

Given that A and B are fixed, (7) is changed to:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}} \\ && s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{ otherwise~0}} \end{array} $$
(14)

We first compute the Euclidean distance between each pair of samples to determine the k nearest neighbors of every sample, and then set the value of \(s_{i,j}\) to 0 if the j-th sample is not one of the k nearest neighbors of the i-th sample; otherwise, we use (17) to determine the value of \(s_{i,j}\).

Since optimizing S is equivalent to independently optimizing each vector \(s_{i}\) (i = 1,…,n), we further convert the optimization problem in (14) into individually optimizing each \(s_{i}\) (i = 1,…,n) as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{{s_{i}^{T}}1 = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0} {\sum\nolimits_{j}^{n}} (||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}s_{i,j}^{2} ) \end{array} \end{array} $$
(15)

By denoting \(F \in {\mathbb {R}^{n \times n}}\) where \({f_{i,j}} = ||{x^i}AB - {x^j}AB||_2^2\), we rewrite (15) as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{{s_{i}^{T}}1 = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0} ||{s_{i}} + \frac{1}{{2{\lambda_{1}}}}{f_{i}}||_{2}^{2} \end{array} \end{array} $$
(16)

On the basis of the Karush-Kuhn-Tucker (KKT) conditions [1], we obtain the closed-form solution for \(s_{i,j}\) (j = 1,…,n) as:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} {s_{i,j}} = {(- \frac{1}{{2{\lambda_{1}}}}{f_{i,j}} + \tau )_ + } \end{array} \end{array} $$
(17)

We suppose that each sample has k nearest neighbors. By denoting \({\hat f_i} = \{ {\hat f_{i,1}}, {\ldots } ,{\hat f_{i,n}}\} \) as \(f_{i}\) (i = 1,…,n) sorted in ascending order, (17) implies that \(s_{i,k+1} = 0\) and \(s_{i,k} > 0\), where k is the number of nearest neighbors of the i-th sample and can be tuned by cross-validation. That is,

$$\begin{array}{@{}rcl@{}} \begin{array}{l} - \frac{1}{{2{\lambda_{1}}}}{\hat f_{i,k + 1}} + \tau \le 0 \end{array} \end{array} $$
(18)

Since \(s_i^T{\textbf {1}} = 1\), we can get

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \sum\nolimits_{j = 1}^{k} {(- \frac{1}{{2{\lambda_{1}}}}{{\hat f}_{i,j}} + \tau )} = 1 \Rightarrow \tau = \frac{1}{k} + \frac{1}{{2k{\lambda_{1}}}}\sum\nolimits_{j = 1}^{k} {{{\hat f}_{i,j}}} \end{array} \end{array} $$
(19)
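Putting (17)–(19) together, the S-step can be implemented row by row; the sketch below is our own reading of these formulas:

```python
import numpy as np

def update_S(Y, k, lam1):
    """Row-wise closed-form update of S via (17)-(19); Y = X A B."""
    n = Y.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        f = ((Y[i] - Y) ** 2).sum(axis=1)            # f_{i,j} = ||x^i AB - x^j AB||_2^2
        f[i] = np.inf                                # enforce s_{i,i} = 0
        knn = np.argsort(f)[:k]                      # k nearest neighbors of sample i
        tau = 1.0 / k + f[knn].sum() / (2 * k * lam1)          # eq. (19)
        S[i, knn] = np.maximum(-f[knn] / (2 * lam1) + tau, 0)  # eq. (17)
    return S
```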

Based on the above discussion, we summarize the process for solving (7) in Algorithm 1.

Algorithm 1 The iterative optimization procedure for solving (7)
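The following sketch strings the three updates together in the spirit of Algorithm 1. It reuses the helper sketches given earlier (heat_kernel_knn_graph, objective, update_B, update_S). Note that the A-step here replaces the orthogonality-preserving solver of [26] with a simple gradient step followed by an SVD-based retraction onto A^T A = I, and the final feature ranking by the row norms of Z = AB is an RSR-style choice of ours, so the whole block should be read as an approximation of the procedure rather than a faithful reimplementation:

```python
import numpy as np

def lss_fs(X, k=5, r=10, lam1=1.0, lam2=1.0, lam3=1.0,
           step=0.01, max_iter=50, tol=1e-5, seed=0):
    """Approximate sketch of Algorithm 1 (alternating updates of A, B, S)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = np.linalg.qr(rng.standard_normal((d, r)))[0]   # random start with A^T A = I
    B = rng.standard_normal((r, d))
    S = heat_kernel_knn_graph(X, k=k)                  # initial graph (earlier sketch)
    prev = np.inf
    for _ in range(max_iter):
        L = np.diag(S.sum(axis=1)) - S                 # Laplacian L = Q - S
        P = np.diag(1.0 / (2 * np.linalg.norm(B, axis=1) + 1e-12))
        # A-step: gradient (10), then SVD retraction to the orthogonal constraint
        G = (2 * X.T @ L @ X @ A @ B @ B.T
             - 2 * lam2 * X.T @ X @ B.T
             + 2 * lam2 * X.T @ X @ A @ B @ B.T)
        U, _, Vt = np.linalg.svd(A - step * G, full_matrices=False)
        A = U @ Vt
        B = update_B(X, A, L, P, lam2, lam3)           # closed form (13)
        S = update_S(X @ A @ B, k, lam1)               # closed form (17)-(19)
        obj = objective(X, A, B, S, lam1, lam2, lam3)
        # simple relative-change stopping test (cf. the criterion used in Section 4)
        if abs(prev - obj) / max(abs(obj), 1e-12) <= tol:
            break
        prev = obj
    Z = A @ B
    return np.argsort(-np.linalg.norm(Z, axis=1))      # features ranked by ||z^i||_2
```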

3.4 Convergence analysis

The IRLS framework has been proven to converge in [6], so we only need to prove the convergence of Algorithm 1 via Theorem 1:

Theorem 1

The value of the objective function in (7) monotonically decreases until Algorithm 1 converges.

Proof

After the t-th iteration, we have obtained the optimal \(A^{(t)}\), \(B^{(t)}\) and \(S^{(t)}\). In the (t + 1)-th iteration, we optimize \(S^{(t+1)}\) by fixing \(A^{(t)}\) and \(B^{(t)}\).

According to (17), \(s_{i,j}^{(t + 1)}\) has a closed-form solution for all i,j = 1,…,n, thus we have:

$$\begin{array}{@{}rcl@{}} &&\sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)} + {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}}\\ &&+ {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}}\\ &\le& \sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t)} + {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t)}||_{2}^{2}}\\ &&+ {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}} \end{array} $$
(20)

While fixing \(S^{(t+1)}\) to update \(A^{(t+1)}\) and \(B^{(t+1)}\), we follow [6] to obtain

$$\begin{array}{@{}rcl@{}} &&\sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t + 1)}}{B^{(t + 1)}} - {x^{j}}{A^{(t + 1)}}{B^{(t + 1)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t + 1)}}{B^{(t + 1)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t + 1)}}|{|_{2,1}}\\ &\le& \sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}} \end{array} $$
(21)

By integrating (20) with (21), we obtain:

$$\begin{array}{@{}rcl@{}} &&\sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t + 1)}}{B^{(t + 1)}} - {x^{j}}{A^{(t + 1)}}{B^{(t + 1)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t + 1)}}{B^{(t + 1)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t + 1)}}|{|_{2,1}}\\ &\le& \sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}} \end{array} $$
(22)

From (22), we know that the objective function value of (7) decreases after each iteration of Algorithm 1. Hence, Theorem 1 has been proven. □

4 Experimental results

We use 5 binary-class and 7 multi-class benchmark datasets to verify the performance of our method and the competing dimensionality reduction methods. The selected datasets include glass, wine, Parkinsons, Ionosphere, LungCancer, Sonar, Movements, Arrhythmia, LVST, ecoli and Yeast, all downloaded from the UCI Machine Learning Repository, and Isolet, downloaded from the Feature Selection Datasets website. We summarize detailed information about the datasets in Table 1.

Table 1 Statistics of the datasets used

To demonstrate the performance of LSS_FS, we compare it with six state-of-the-art feature selection methods, whose details are listed below:

  • TRACK: selects discriminative features via a unified trace ratio formulation and k-means clustering [25].

  • RSR: Regularized Self-Representation selects representative features via the ℓ2,1-norm, which characterizes the self-representation coefficient matrix and ensures robustness to outliers [56].

  • CSFS: Convex Semi-supervised multi-label Feature Selection conducts feature selection via ℓ2,1-norm regularization [5].

  • FSR_ALM: Exact Top-k Feature Selection uses an ℓ2,0-norm constraint and the Augmented Lagrangian Multiplier (ALM) method to avoid the heavy burden of tuning regularization parameters and to make feature selection more practical [4].

  • LDA: Linear Discriminant Analysis minimizes the within-class distance while maximizing the between-class distance when conducting feature selection; hence, LDA is a global subspace learning method [7].

  • PCA: Principal Component Analysis maps high-dimensional data into a low-dimensional space by preserving the covariance of the data matrix [50].

We set \(\{\lambda_{2},\lambda_{3}\} \in \{10^{-2}, \ldots ,10^{2}\}\), while the value of λ1 is automatically adjusted according to [14], and the rank of the self-representation coefficient matrix \(r \in \{1, \ldots ,\min(n,d)\}\). Moreover, the SVM parameters \(\{c,g\} \in \{2^{-5}, \ldots ,2^{5}\}\) are selected via a 5-fold inner cross-validation used to distinguish different types of samples; the cross-validation procedure chooses the best parameters for the experiments. To reduce the impact of randomness, a 10-fold outer cross-validation is used to obtain the average results. For fairness, we apply the same strategy to all competing methods.
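As a concrete (hedged) illustration of this protocol, the nested cross-validation could be set up with scikit-learn roughly as follows; the function name and the grids are ours, mirroring the ranges quoted above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def evaluate_selected_features(X_selected, y, seed=0):
    """10-fold outer CV for accuracy; 5-fold inner CV tunes the SVM (c, g)."""
    grid = {"C": [2.0 ** p for p in range(-5, 6)],
            "gamma": [2.0 ** p for p in range(-5, 6)]}
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    svm = GridSearchCV(SVC(kernel="rbf"), grid, cv=inner)
    scores = cross_val_score(svm, X_selected, y, cv=outer)
    return scores.mean(), scores.std()
```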

We use three evaluation metrics, i.e., classification accuracy, standard deviation and coefficient of variation, to evaluate the classification performance of all methods. We define classification accuracy (ACC) as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} ACC = {{{N_{correct}}} / N} \end{array} \end{array} $$
(23)

where N is the number of samples and \(N_{correct}\) is the number of correctly classified samples.

We utilize the Standard Deviation (STD) to reflect the stability of the classification accuracy results, defined as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} STD = \sqrt {({1 / {{n_{run}}}})\sum\limits_{i = 1}^{{n_{run}}} {{{(AC{C_{i}} - u)}^{2}}} } \end{array} \end{array} $$
(24)

where \(n_{run}\) denotes the number of runs of the experiments, i.e., \(n_{run} = 10\) in our experiments, and u stands for the average value of ACC. A large ACC means good performance, and a small STD shows excellent stability.

We also use Coefficient of Variation (CV) as the evaluation index, which is defined as

$$\begin{array}{@{}rcl@{}} \begin{array}{l} CV = {{STD} /{ACC}} \end{array} \end{array} $$
(25)

where STD is the standard deviation defined in (24). A small CV indicates better robustness.
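The three metrics of (23)–(25) can be computed from the per-run accuracies in a few lines; this is a minimal sketch with our own function name:

```python
import numpy as np

def summarize(acc_per_run):
    """ACC, STD and CV over the n_run repetitions, as in (23)-(25)."""
    acc_per_run = np.asarray(acc_per_run, dtype=float)
    acc = acc_per_run.mean()                           # average accuracy
    std = np.sqrt(((acc_per_run - acc) ** 2).mean())   # eq. (24)
    return acc, std, std / acc                         # CV = STD / ACC, eq. (25)
```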

4.1 Experiment results and analysis

Table 2 displays the classification accuracy of all methods on the twelve datasets of Table 1. In addition, we report the classification accuracy results for all datasets in Fig. 1.

Fig. 1 Average classification accuracy of all methods on all the tested datasets

Table 2 Classification accuracy (ACC±STD (%))

The proposed LSS_FS method outperforms all competing methods in all classification tasks. For instance, the ACC of our method increases on average by 10.97% compared with TRACK, which does not learn a relatively correct graph matrix for the high-dimensional data. The ACC of our method rises on average by 9.45% compared with PCA, which cannot take enough information into account for feature selection. The ACC of our method climbs on average by 8.63% compared with LDA, which only considers the global structure of the data. Meanwhile, the ACC of our method increases on average by 8.46% compared with FSR_ALM, which fails to take the relationships between features into account. Compared with CSFS, which does not construct a graph matrix to capture the local correlations among the features, the ACC of our method increases on average by 8.34%. Compared with RSR, which only considers the relationships between features, the improvement in ACC of our method is 6.85% on average. The reason is that LSS_FS considers the following while conducting feature selection: 1) two kinds of correlations inherent in the data, i.e., sample-level and feature-level; and 2) iterative adjustment of the transformation matrix until it is optimal. Furthermore, the proposed method obtains the minimum standard deviation on average compared with the other methods. Based on the above observations, our proposed method has the best robustness among all methods for classification tasks.

We show the coefficient of variation results of all methods in Table 3. Our algorithm obtains the minimum value on five datasets, specifically 1.19 on glass, 0.31 on wine, 0.55 on Ionosphere, 0.87 on Movements and 0.80 on Arrhythmia. Although the proposed method does not achieve the minimum variation on every dataset, it reaches the minimum average variation over all datasets. This verifies that our method achieves the best stability.

Table 3 Coefficient of variation (%)
Fig. 2 Convergence rate of Algorithm 1 on all tested datasets

Figure 2 shows the objective values of Algorithm 1 on six datasets at each iteration, where we set the stopping criterion of Algorithm 1 as \(\frac {{||obj(t + 1) - obj(t)||_2^2}}{{obj(t)}} \le {10^{- 5}}\), with obj(t) representing the value of the objective function in (7) at the t-th iteration. From Fig. 2, we can see that: 1) the proposed Algorithm 1 monotonically decreases the objective function value of (7) until it converges; and 2) Algorithm 1 converges within twenty iterations on all datasets, revealing that the proposed algorithm has a fast convergence rate.

From Fig. 3, we can see that the classification accuracy with a low-rank constraint is, for the most part, better than the classification accuracy with a full-rank constraint. For instance, the average classification accuracy of the LSS_FS method with the low-rank constraint increases by 0.74%, 0.22%, 0.21%, 0.45%, 0.41%, 0.54%, 0.56%, 0.39%, 0.28%, 0.91%, 0.14% and 0.15%, respectively, compared with the results of the full-rank constraint on the datasets wine, Parkinsons, Ionosphere, LungCancer, Sonar, Movements, Arrhythmia, LVST, ecoli, Isolet and Yeast. Hence, it is clear that analyzing high-dimensional data with a low-rank constraint is meaningful for feature selection, because the low-rank constraint conducts subspace learning, which helps find the low-dimensional space of high-dimensional data by considering the global feature correlation.

Fig. 3 Average classification accuracy of our method on all tested datasets for different numbers of ranks

We vary the parameters λ2 and λ3 within the range \(\{10^{-2}, \ldots ,10^{2}\}\) and show the results in Fig. 4. Parameter λ2 controls the balance between the local representation term \(\sum \nolimits _{i,j}^n {||{x^i}AB - {x^j}AB||_2^2{s_{i,j}}}\) and the global representation term \({{||}}X - XAB{{||}}_F^2\), while λ3 in (7) adjusts the sparsity of B. In Fig. 4, our method achieves the best performance on the Sonar and ecoli datasets with λ2 = 10 and λ3 = 0.01, and it produces its best ACC on LVST, i.e., 84.38%, with λ2 = 1 and λ3 = 1. This indicates that parameter tuning benefits our method.

Fig. 4 ACC results of our proposed method for different λ2 and λ3

5 Conclusion

In this work, we proposed a novel feature selection method, called Unsupervised Feature Selection via Local Structure Learning and Sparse Learning (LSS_FS). LSS_FS uses the similarity matrix to fine-tune the self-representation coefficient matrix, producing a high-quality self-representation coefficient matrix. As a result, LSS_FS has better discriminative power than traditional feature selection methods. Experimental results on real datasets show that LSS_FS provides better feature selection performance than the competing methods.

In future work, we will extend the proposed method to semi-supervised feature selection tasks.