1 Introduction

With the rapid development of information technology and database technology, people can easily access and store huge amounts of data. Traditional data analysis tools can no longer satisfy these growing needs [10, 42, 66]. To efficiently extract useful information from massive data, many data mining techniques have been developed. In real applications, however, data is usually high-dimensional and may contain a lot of redundant and useless information, which seriously undermines the effectiveness of data mining methods [13, 17, 19]. Therefore, it is crucial to conduct dimensionality reduction on the data before applying any data mining algorithm [15, 16].

Dimensionality reduction is one of the most important research areas of machine learning [9, 21, 41]. In real applications, data is often high-dimensional and therefore difficult to understand, represent and process [20, 37, 39]. By employing dimensionality reduction methods to break the curse of dimensionality, people can process the data more easily and thus understand it more fully [11, 29, 49]. Methods for reducing the dimensionality of data can be divided into two categories. The first category contains linear methods, such as Principal Component Analysis (PCA) [50] and Classical Multidimensional Scaling (CMDS) [52]. Low-dimensional data obtained by these linear methods can usually preserve the linear relationships between high-dimensional data points, but such methods cannot reveal the complex nonlinear manifold structure of the data. The second category, based on manifold learning theory, consists of nonlinear methods, such as Isometric Mapping (Isomap) [64] and Locally Linear Embedding (LLE) [63]. This type of algorithm can learn the inherent geometric structure of the data, facilitating data processing and analysis [16, 24].

Dimensionality reduction methods based on manifold learning can usually be divided into two categories. The first is based on the local manifold structure of the data. The most classical method is Locality Preserving Projections (LPP) [62], which preserves neighborhood relations between data points after dimensionality reduction. The second category consists of methods that consider the global structure of the data. Linear Discriminant Analysis (LDA) [7] is the most commonly used in many areas and produces the best projection in the sense that, after projection, samples with the same class label have the smallest distance while those with different class labels have the largest distance [18, 40, 61, 63]. That is, LDA yields the best separability for data samples. However, most feature selection algorithms take only one of the two manifold structures of the data into account, i.e., either global or local, so their performance is not satisfactory [12, 28, 30].

Because of good properties such as insensitivity to noisy data, feature self-representation has been widely used in machine learning and computer vision [60, 65]. However, since the self-representation coefficient matrix obtained by these methods is fixed, feature self-representation based methods do not adapt well to data with complex structure. To this end, this paper presents a feature selection algorithm that combines structure learning with sparse learning (LSS_FS). The main contributions of this paper include:

  • Since our method uses feature self-representation to build the model, it is not very sensitive to noise and outliers, exhibiting good robustness.

  • The model imposes a low-rank constraint on the self-representation coefficient matrix to explore the global manifold structure of the data, which has been proven to effectively exploit the correlation among features.

  • The proposed model can dynamically adjust the structure of the data according to its similarity matrix, making the learned structure more accurate and thereby improving classification accuracy.

  • We propose a novel method to solve the objective function. We perform low-rank feature selection and dynamically adjust the graph matrix according to the similarity matrix to optimize the objective function iteratively. An optimal solution is obtained for each subproblem in every iteration, and the objective value decreases monotonically until convergence.

2 Related work

Sparse learning was originally developed to compress and represent signals, since it allows a sampling rate below that required by the Shannon theorem. It has gradually become popular and has been successfully used to solve many practical problems in signal and image processing, such as signal encoding, representation and compression, image denoising, and image super-resolution [34, 48, 57].

With the continuous innovation and practice of many scientists and researchers around the world, sparse learning theory has been successfully applied in signal processing, data mining, machine learning, information retrieval, pattern recognition, biological computing and other fields, and it has become a research hotspot in these areas. A large number of experts and scholars are still carrying out in-depth studies of the theory and applications of sparse learning. Because sparse learning has a naturally discriminative nature, it is particularly suitable for face recognition [27, 36]. Specifically, based on the theory of Compressive Sensing, sparse learning reconstructs an approximation of the original signal to find a dimensionality reduction matrix, so that the projection of the original high-dimensional signal into the low-dimensional space retains its original features as much as possible [32, 35]. Sparse learning methods have achieved very high recognition accuracy in face recognition. Although they achieve good performance, they need to store all training images for recognition, which significantly increases the storage overhead [43, 46]. Moreover, when sparse learning deals with independent signals, it only considers the inherent relevance of each signal, ignoring the correlation between signals of the same type.

Sparse learning evaluates the importance of an element in terms of the weight of its coefficient: the weight of an unimportant element is zero and that of an important one is nonzero [31]. The more important an element, the greater the corresponding weight. Therefore, by using the coefficient weights between data samples or features, sparse learning can naturally incorporate discriminative information into the model [55]. This enables sparse learning to reduce the impact of noise on the model and to improve the efficiency of the learning model. As a result, sparse learning has been widely used in data mining and machine learning. For example, Zhu et al. proposed a joint sparse learning method that achieves block sparsity by considering both the correlation between data and the global information of the data, and it has been used to classify multi-label data. Gao et al. used the histogram intersection kernel (HIK) to implement kernel sparse learning, which conducts sparse coding with the HIK basis via a soft-assignment feature quantization [8]. Xia et al. proposed a sparse projection algorithm to conduct binary coding for high-dimensional data [27]. Sparse Subspace Learning (SSL) can also be seen as a special dimensionality reduction method [2]. Researchers have proposed a series of sparse subspace learning algorithms based on sparse learning theory, including Nonnegative Sparse Principal Component Analysis (NSPCA) [45], Sparse Nonnegative Matrix Factorization (SNMF) [23], and so on.

The essence of sparse learning is to introduce the coefficient weights between samples or features as identification information into the model. By using a sparsity constraint to penalize the model, the coefficient weights of irrelevant data approach zero while useful information is preserved, which makes sparse learning robust [52]. However, sparse learning does not take the internal structure of the data into account.

To overcome this shortcoming of sparse learning, manifold learning has been proposed. Linear discriminant analysis is the most common technique used in manifold learning. Zhong et al. proposed a method that uses the ℓ1-norm to maximize the objective function (LDA-ℓ1) [47], where LDA-ℓ1 captures the global manifold structure via discriminant analysis and then selects the most critical features. Unsupervised Feature Selection via Unified Trace Ratio Formulation and K-means Clustering (TRACK) was proposed as a unified framework for feature selection that selects discriminative features according to a trace ratio formulation and K-means clustering [25]. Cai et al. proposed Multi-Cluster Feature Selection (MCFS), which explores the local manifold structure of the data via spectral analysis and then searches for the features that best preserve the clustering structure [3]. However, these algorithms are sensitive to outliers, and if the data contains too much noise, their performance degrades significantly.

Many robust feature selection methods exist. For example, unsupervised feature selection by regularized self-representation (RSR) uses all the features to linearly represent each feature, thus weakening the impact of outliers [56]. Sun et al. put forward an unsupervised robust Bayesian feature selection method, where feature selection is realized by estimating the saliencies associated with the features [22]. However, these methods do not consider the manifold structure of the data, so their performance is not satisfactory. To obtain better stability and performance, this paper combines manifold learning with feature self-representation, and then introduces a sparse regularization term to conduct feature selection.

3 Our method

In this section, we first introduce the notations used in this paper and then explain the details of the proposed LSS_FS method, in Sections 3.1 and 3.2, respectively. We then elaborate the proposed optimization method in Section 3.3. Finally, we analyze the convergence of the objective function in Section 3.4.

3.1 Notations

For a matrix \(\textit {X} \in {\mathbb {R}^{n \times d}}\), its i-th row and j-th column are denoted as \(x^{i}\) and \(x_{j}\), respectively. The trace of a matrix X is denoted as tr(X); \(X^{T}\) and \(X^{-1}\) denote the transpose and the inverse of X, respectively. We denote the Frobenius norm and the ℓ2,1-norm of X as \(||\textit {X}|{|_{F}} = \sqrt {{\sum \nolimits _{i}^{n}} {||{\textit {x}^{i}}||}_{2}^{2}} = \sqrt {{\sum \nolimits _{j}^{d}} {||{\textit {x}_{j}}||}_{2}^{2}} \) and \( ||X|{|_{2,1}} = {\sum \nolimits _{i}^{n}} {\sqrt {{\sum \nolimits _{j}^{d}} {x_{i,j}^{2}} } } \), respectively.
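For concreteness, the two matrix norms above can be checked numerically; the following NumPy sketch (our own illustration, with arbitrary variable names) computes both:

```python
import numpy as np

# Illustrative check of the norms defined above (not part of the original paper).
X = np.random.randn(5, 3)                        # n = 5 samples, d = 3 features

fro_norm = np.sqrt((X ** 2).sum())               # ||X||_F
l21_norm = np.sqrt((X ** 2).sum(axis=1)).sum()   # ||X||_{2,1}: sum of row-wise l2-norms

# ||X||_F can equivalently be computed from rows, columns, or NumPy's built-in routine.
assert np.isclose(fro_norm, np.linalg.norm(X, 'fro'))
```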

3.2 Structure learning for feature selection

Many previous studies have indicated that the local structures of the samples may provide complementary information that boosts dimensionality reduction [44, 54]. Hence, this paper proposes to utilize the local structures of the samples by learning a graph matrix \(S \in {\mathbb {R}^{n \times n}}\) on a low-dimensional space of the original data. Given the feature matrix X, we obtain the following formula according to [51, 59]:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{Z} \sum\nolimits_{i,j}^{n} {||{x^{i}}Z - {x^{j}}Z||_{2}^{2}{s_{i,j}}} \end{array} \end{array} $$
(1)

where ||⋅||2 is the ℓ2-norm of a vector, \(Z \in {\mathbb {R}^{d \times d}}\) is a transformation matrix of the high-dimensional data X into its low-dimensional space, and the element \(s_{i,j}\) of the graph matrix S represents the similarity between samples \(x^{i}\) and \(x^{j}\). If sample \(x^{j}\) is one of the k-nearest neighbors of sample \(x^{i}\), then the value of the heat kernel, i.e., \(f({x^{i}},{x^{j}}) = exp(- \frac {{||{x^{i}} - {x^{j}}||_{2}^{2}}}{{2{\sigma ^{2}}}})\) where σ is a tuning parameter, is assigned to \(s_{i,j}\); otherwise \(s_{i,j} = 0\).
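As an illustration of how this initial graph might be built, the following sketch (our own code, with k and σ as the tuning parameters mentioned above) assigns heat-kernel weights to the k nearest neighbors of each sample:

```python
import numpy as np

def heat_kernel_knn_graph(X, k=5, sigma=1.0):
    """Initial similarity matrix S: s_{i,j} = exp(-||x^i - x^j||^2 / (2 sigma^2))
    if x^j is among the k nearest neighbors of x^i, and 0 otherwise."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(sq_dist[i])
        neighbors = [j for j in order if j != i][:k]     # exclude the sample itself
        S[i, neighbors] = np.exp(-sq_dist[i, neighbors] / (2 * sigma ** 2))
    return S
```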

In fact, the quality of S has been shown to be very sensitive to σ [49, 53]. This inspires us to learn a relatively correct graph matrix from the 'clean' data and to decrease the number of parameters that need to be adjusted. By clean, we mean a low-dimensional space with as little noise and redundancy as possible [33]. However, in real applications, neither the graph matrix nor the low-dimensional space is known in advance [38]. To deal with this, we combine graph matrix learning with low-dimensional space learning and optimize them iteratively to achieve the optimal results. In this way, we learn the graph matrix by following the distribution of the samples. We then have the following objective function:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,Z} \sum\nolimits_{i,j}^{n} {(||{x^{i}}Z - {x^{j}}Z||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}),} \\ &&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0, {s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{otherwise~0}} \end{array} $$
(2)

where λ1 is a control parameter and \(s_{i}\) denotes the vector whose j-th element is \(s_{i,j}\), \(||{s_{i}}||_{2}^{2}\) is used to avoid the trivial solution, 1 and \(\mathcal {N}(i)\) represent an all-one vector and the set of the nearest neighbors of the i-th sample, respectively, and the constraint \({s_{i}^{T}}{\textbf {1}} = 1\) is applied to gain shift-invariant similarity. Clearly, (2) assigns a small value (i.e., similarity) to \(s_{i,j}\) if samples i and j are far apart, and a large value otherwise.

Previous methods learn the graph matrix via (1) to obtain an optimal similarity measurement. In contrast, (2) aims to achieve both an optimal similarity measurement (i.e., S) and feature selection results (i.e., Z). Hence, (2) may yield better feature selection results (i.e., Z) than (1).

To reduce the adverse impact of outliers and noisy samples on the model, we propose to use the self-representation method to build an unsupervised model, yielding the following function:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,Z} \sum\nolimits_{i,j}^{n} {||{x^{i}}Z - {x^{j}}Z||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}{{ + }}} {\lambda_{2}}{{||}}X - XZ{{||}}_{F}^{2}\\ &&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0, {s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{otherwise~0}} \end{array} $$
(3)

To improve the effectiveness of the algorithm and to take the relationships among features into consideration, we impose a constraint on the rank of Z [58]. A low-rank constraint on Z can naturally be expressed by representing Z as the product of two rank-r matrices as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} Z = AB \end{array} \end{array} $$
(4)

where \(A \in {\mathbb {R}^{d \times r}}, B \in {\mathbb {R}^{r \times d}}, r \le \min (n,d),\) and our function becomes

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,A,B} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}{{ + }}} {\lambda_{2}}{{||}}X - XAB{{||}}_{F}^{2} \\&&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0, {s_{i,j}} \ge 0,{\text{ if }}j \in \mathcal{N}(i),{\mathrm{otherwise~0}} \end{array} $$
(5)

According to [56], the feature selection issue can be converted to the following form:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{W} l(X - XW) + \lambda R(W) \end{array} \end{array} $$
(6)

where W is the feature weight matrix, l(X − XW) is the loss function, R(W) is the regularization on W, and λ is a positive constant. In the term \({{||}}X - XAB{{||}}_F^2\), if we treat XA as the X of (6), then B can naturally be regarded as the W of (6). Inspired by (6), we therefore place a regularization on B to further improve the performance of the algorithm. Our ultimate objective function is:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S,A,B} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}{{ + }}} {\lambda_{2}}{{||}}X - XAB{{||}}_{F}^{2} + {\lambda_{3}}{{||}}B{{|}}{{{|}}_{2,1}}\\ &&s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{ otherwise 0, }}{A^{T}}A = I \end{array} $$
(7)

where λ1, λ2 and λ3 are tuning parameters, and the constraint \(A^{T}A = I\) (where \(A \in {\mathbb {R}^{d \times r}}\) and \(I \in {\mathbb {R}^{r \times r}}\)) is introduced for identifiability.
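For reference, a direct (unoptimized) evaluation of the objective in (7) for given A, B and S can be written as follows; this is our own helper sketch, reused in the optimization sketch later:

```python
import numpy as np

def objective(X, A, B, S, lam1, lam2, lam3):
    """Value of the objective in (7) for fixed A, B, S (illustrative helper)."""
    Y = X @ A @ B                                   # reconstruction X A B
    n = X.shape[0]
    smooth = sum(S[i, j] * ((Y[i] - Y[j]) ** 2).sum()
                 for i in range(n) for j in range(n))
    return (smooth
            + lam1 * (S ** 2).sum()                    # sum_i ||s_i||_2^2
            + lam2 * np.linalg.norm(X - Y, 'fro') ** 2
            + lam3 * np.linalg.norm(B, axis=1).sum())  # ||B||_{2,1}
```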

3.3 Optimization

Equation (7) is not jointly convex with respect to all the variables (i.e., A, B, and S), but it is convex in each variable when the others are fixed. In this paper, we utilize an alternating optimization method to optimize (7), i.e., we iteratively optimize each variable in turn.

Update A by fixing B and S

When B and S are fixed, the second and fourth terms of (7) can be viewed as constants, thus we can get:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{A} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}} + {\lambda_{2}}||X - XAB||_{F}^{2},s.t.,{A^{T}}A = I \end{array} \end{array} $$
(8)

By following the IRLS framework, we rewrite (8) as:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{A} tr({B^{T}}{A^{T}}{X^{T}}LXAB) + {\lambda_{2}}tr({X^{T}}X - {X^{T}}XAB \\&-& {B^{T}}{A^{T}}{X^{T}}X + {B^{T}}{A^{T}}{X^{T}}XAB),s.t.,{A^{T}}A = I \end{array} $$
(9)

where tr(⋅) is the trace operator, \(L = Q - S \in {\mathbb {R}^{n \times n}}\) is a Laplacian matrix, and Q is a diagonal matrix with its i-th diagonal element \({q_{i,i}} = \sum \nolimits _{j = 1}^{n} {{s_{i,j}}} \). By fixing B, we obtain the derivative of (9) with respect to A as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} 2{X^{T}}LXAB{B^{T}} - 2{\lambda_{2}}{X^{T}}X{B^{T}} + 2{\lambda_{2}}{X^{T}}XAB{B^{T}} \end{array} \end{array} $$
(10)

Since A is orthogonal, we can employ an existing method in [26] to optimize it.

Update B by fixing A and S

When A and S are fixed, the second term of (7) can be viewed as a constant, and we get:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{B} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}} + {\lambda_{2}}||X - XAB||_{F}^{2} + {\lambda_{3}}||B|{|_{2,1}} \end{array} \end{array} $$
(11)

which is equivalent to

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{B} tr({B^{T}}{A^{T}}{X^{T}}LXAB) + {\lambda_{2}}tr({X^{T}}X - {X^{T}}XAB\\ &-& {B^{T}}{A^{T}}{X^{T}}X + {B^{T}}{A^{T}}{X^{T}}XAB) + {\lambda_{3}}tr({B^{T}}PB) \end{array} $$
(12)

where \(P \in {\mathbb {R}^{r \times r}}\) is a diagonal matrix with \({p_{i,i}} = \frac {1}{{2||{B^i}|{|_2}}}\ (i = 1, {\ldots } ,r)\). Setting the derivative of (12) with respect to B to zero, we obtain:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} B = {\lambda_{2}}({A^{T}}{X^{T}}LXA + {\lambda_{2}}{A^{T}}{X^{T}}XA + {\lambda_{3}}P)^{- 1}{A^{T}}{X^{T}}X \end{array} \end{array} $$
(13)
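In code, this closed-form update amounts to a single linear solve; a possible implementation (our own naming, assuming L and P have been computed as described above) is:

```python
import numpy as np

def update_B(X, A, L, P, lam2, lam3):
    """Closed-form B update of (13):
    B = lam2 (A^T X^T L X A + lam2 A^T X^T X A + lam3 P)^{-1} A^T X^T X."""
    XA = X @ A                                       # n x r
    M = XA.T @ L @ XA + lam2 * (XA.T @ XA) + lam3 * P
    return lam2 * np.linalg.solve(M, XA.T @ X)       # r x d
```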

Update S by fixing A and B

Given that A and B are fixed, (7) is changed to:

$$\begin{array}{@{}rcl@{}} &&\mathop {\min }\limits_{S} \sum\nolimits_{i,j}^{n} {||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}{{||}}{s_{i}}{{||}}_{2}^{2}} \\ && s.t.,\forall i,{s_{i}^{T}}{\textbf{1}} = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0,{\mathrm{ if }}j \in \mathcal{N}(i),{\mathrm{ otherwise~0}} \end{array} $$
(14)

We first compute the Euclidean distance between each pair of samples to determine the k nearest neighbors of every sample, and then set the value of \(s_{i,j}\) to 0 if the j-th sample is not one of the k nearest neighbors of the i-th sample; otherwise, we use (17) to determine the value of \(s_{i,j}\).

Since optimizing S is equivalent to independently optimizing each vector \(s_{i}\) (i = 1,…,n), we further convert the optimization problem in (14) into individually optimizing each \(s_{i}\) (i = 1,…,n) as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{{s_{i}^{T}}1 = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0} {\sum\nolimits_{j}^{n}} (||{x^{i}}AB - {x^{j}}AB||_{2}^{2}{s_{i,j}}{{ + }}{\lambda_{1}}s_{i,j}^{2} ) \end{array} \end{array} $$
(15)

By denoting \(F \in {\mathbb {R}^{n \times n}}\) where \({f_{i,j}} = ||{x^i}AB - {x^j}AB||_2^2\), we rewrite (15) as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \mathop {\min }\limits_{{s_{i}^{T}}1 = 1,{s_{i,i}} = 0,{s_{i,j}} \ge 0} ||{s_{i}} + \frac{1}{{2{\lambda_{1}}}}{f_{i}}||_{2}^{2} \end{array} \end{array} $$
(16)

On the basis of the Karush-Kuhn-Tucker (KKT) conditions [1], we obtain the closed-form solution for \(s_{i,j}\) (j = 1,…,n) as:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} {s_{i,j}} = {(- \frac{1}{{2{\lambda_{1}}}}{f_{i,j}} + \tau )_ + } \end{array} \end{array} $$
(17)

We suppose that each sample has k nearest neighbors. By denoting \({\hat f_i} = \{ {\hat f_{i,1}}, {\ldots } ,{\hat f_{i,n}}\} \) as \(f_{i}\) (i = 1,…,n) sorted in ascending order, (17) implies that \(s_{i,k+1} = 0\) and \(s_{i,k} > 0\), where k is the number of nearest neighbors of the i-th sample and can be tuned by cross-validation. That is,

$$\begin{array}{@{}rcl@{}} \begin{array}{l} - \frac{1}{{2{\lambda_{1}}}}{\hat f_{i,k + 1}} + \tau \le 0 \end{array} \end{array} $$
(18)

Since \(s_i^T{\textbf {1}} = 1\), we can get

$$\begin{array}{@{}rcl@{}} \begin{array}{l} \sum\nolimits_{j = 1}^{k} {(- \frac{1}{{2{\lambda_{1}}}}{{\hat f}_{i,j}} + \tau )} = 1 \Rightarrow \tau = \frac{1}{k} + \frac{1}{{2k{\lambda_{1}}}}\sum\nolimits_{j = 1}^{k} {{{\hat f}_{i,j}}} \end{array} \end{array} $$
(19)
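Putting (17)–(19) together, the S-step can be implemented row by row; the sketch below is our own reading of these formulas:

```python
import numpy as np

def update_S(Y, k, lam1):
    """Row-wise closed-form update of S via (17)-(19); Y = X A B."""
    n = Y.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        f = ((Y[i] - Y) ** 2).sum(axis=1)            # f_{i,j} = ||x^i AB - x^j AB||_2^2
        f[i] = np.inf                                # enforce s_{i,i} = 0
        knn = np.argsort(f)[:k]                      # k nearest neighbors of sample i
        tau = 1.0 / k + f[knn].sum() / (2 * k * lam1)          # eq. (19)
        S[i, knn] = np.maximum(-f[knn] / (2 * lam1) + tau, 0)  # eq. (17)
    return S
```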

Based on the above discussion, we summarize the process for solving (7) in Algorithm 1.

Algorithm 1 The iterative optimization procedure for solving (7)
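The following sketch strings the three updates together in the spirit of Algorithm 1. It reuses the helper sketches given earlier (heat_kernel_knn_graph, objective, update_B, update_S). Note that the A-step here replaces the orthogonality-preserving solver of [26] with a simple gradient step followed by an SVD-based retraction onto A^T A = I, and the final feature ranking by the row norms of Z = AB is an RSR-style choice of ours, so the whole block should be read as an approximation of the procedure rather than a faithful reimplementation:

```python
import numpy as np

def lss_fs(X, k=5, r=10, lam1=1.0, lam2=1.0, lam3=1.0,
           step=0.01, max_iter=50, tol=1e-5, seed=0):
    """Approximate sketch of Algorithm 1 (alternating updates of A, B, S)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = np.linalg.qr(rng.standard_normal((d, r)))[0]   # random start with A^T A = I
    B = rng.standard_normal((r, d))
    S = heat_kernel_knn_graph(X, k=k)                  # initial graph (earlier sketch)
    prev = np.inf
    for _ in range(max_iter):
        L = np.diag(S.sum(axis=1)) - S                 # Laplacian L = Q - S
        P = np.diag(1.0 / (2 * np.linalg.norm(B, axis=1) + 1e-12))
        # A-step: gradient (10), then SVD retraction to the orthogonal constraint
        G = (2 * X.T @ L @ X @ A @ B @ B.T
             - 2 * lam2 * X.T @ X @ B.T
             + 2 * lam2 * X.T @ X @ A @ B @ B.T)
        U, _, Vt = np.linalg.svd(A - step * G, full_matrices=False)
        A = U @ Vt
        B = update_B(X, A, L, P, lam2, lam3)           # closed form (13)
        S = update_S(X @ A @ B, k, lam1)               # closed form (17)-(19)
        obj = objective(X, A, B, S, lam1, lam2, lam3)
        # simple relative-change stopping test (cf. the criterion used in Section 4)
        if abs(prev - obj) / max(abs(obj), 1e-12) <= tol:
            break
        prev = obj
    Z = A @ B
    return np.argsort(-np.linalg.norm(Z, axis=1))      # features ranked by ||z^i||_2
```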

3.4 Convergence analysis

The IRLS framework has been proven to converge in [6], so we only need to prove the convergence of Algorithm 1 via Theorem 1:

Theorem 1

The value of the objective function in (7) monotonically decreases until Algorithm 1 converges.

Proof

After the t-th iteration, we have obtained the optimal \(A^{(t)}\), \(B^{(t)}\) and \(S^{(t)}\). In the (t + 1)-th iteration, we optimize \(S^{(t+1)}\) by fixing \(A^{(t)}\) and \(B^{(t)}\).

According to (17), \(s_{i,j}^{(t + 1)}\) has a closed-form solution for all i,j = 1,…,n, thus we have:

$$\begin{array}{@{}rcl@{}} &&\sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)} + {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}}\\ &&+ {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}}\\ &\le& \sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t)} + {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t)}||_{2}^{2}}\\ &&+ {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}} \end{array} $$
(20)

While fixing \(S^{(t+1)}\) to update \(A^{(t+1)}\) and \(B^{(t+1)}\), we follow [6] to obtain

$$\begin{array}{@{}rcl@{}} &&\sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t + 1)}}{B^{(t + 1)}} - {x^{j}}{A^{(t + 1)}}{B^{(t + 1)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t + 1)}}{B^{(t + 1)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t + 1)}}|{|_{2,1}}\\ &\le& \sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}} \end{array} $$
(21)

By integrating (20) with (21), we obtain:

$$\begin{array}{@{}rcl@{}} &&\sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t + 1)}}{B^{(t + 1)}} - {x^{j}}{A^{(t + 1)}}{B^{(t + 1)}}} {{||}}_{2}^{2}s_{i,j}^{(t + 1)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t + 1)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t + 1)}}{B^{(t + 1)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t + 1)}}|{|_{2,1}}\\ &\le& \sum\nolimits_{i,j}^{n} {||{x^{i}}{A^{(t)}}{B^{(t)}} - {x^{j}}{A^{(t)}}{B^{(t)}}} {{||}}_{2}^{2}s_{i,j}^{(t)}\\ &&+ {\lambda_{1}}{\sum\nolimits_{i}^{n}} {||s_{i}^{(t)}||_{2}^{2}} + {\lambda_{2}}||X - X{A^{(t)}}{B^{(t)}}||_{F}^{2} + {\lambda_{3}}||{B^{(t)}}|{|_{2,1}} \end{array} $$
(22)

From (22), we know that the objective function value of (7) decreases after each iteration of Algorithm 1. Hence, Theorem 1 has been proven. □

4 Experimental results

We use 5 binary-class and 7 multi-class benchmark datasets to verify the performance of our method and the competing dimensionality reduction methods. The selected datasets include glass, wine, Parkinsons, Ionosphere, LungCancer, Sonar, Movements, Arrhythmia, LVST, ecoli and Yeast, all downloaded from the UCI Machine Learning Repository, and Isolet, downloaded from the Feature Selection Datasets website. We summarize detailed information about the datasets in Table 1.

Table 1 Statistics of the datasets used

To demonstrate the performance of LSS_FS, we compare it with six state-of-the-art feature selection methods, whose details are listed below:

  • TRACK: selects discriminative features via a unified trace ratio formulation and k-means clustering [25].

  • RSR: Regularized Self-Representation selects representative features via the ℓ2,1-norm, which characterizes the self-representation coefficient matrix and ensures robustness to outliers [56].

  • CSFS: Convex Semi-supervised multi-label Feature Selection conducts feature selection via ℓ2,1-norm regularization [5].

  • FSR_ALM: Exact Top-k Feature Selection uses an ℓ2,0-norm constraint and the Augmented Lagrangian Multiplier (ALM) method to avoid the heavy burden of tuning regularization parameters and to make feature selection more practical [4].

  • LDA: Linear Discriminant Analysis minimizes the within-class distance while maximizing the between-class distance when conducting feature selection; hence, LDA is a global subspace learning method [7].

  • PCA: Principal Component Analysis maps high-dimensional data into a low-dimensional space by preserving the covariance of the data matrix [50].

We set \(\{\lambda_{2},\lambda_{3}\} \in \{10^{-2}, \ldots ,10^{2}\}\), while the value of λ1 is automatically adjusted according to [14], and the rank of the self-representation coefficient matrix \(r \in \{1, \ldots ,\min(n,d)\}\). Moreover, the SVM parameters \(\{c,g\} \in \{2^{-5}, \ldots ,2^{5}\}\) are selected via a 5-fold inner cross-validation used to distinguish different types of samples; the cross-validation procedure chooses the best parameters for the experiments. To reduce the impact of randomness, a 10-fold outer cross-validation is used to obtain the average results. For fairness, we apply the same strategy to all competing methods.
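As a concrete (hedged) illustration of this protocol, the nested cross-validation could be set up with scikit-learn roughly as follows; the function name and the grids are ours, mirroring the ranges quoted above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def evaluate_selected_features(X_selected, y, seed=0):
    """10-fold outer CV for accuracy; 5-fold inner CV tunes the SVM (c, g)."""
    grid = {"C": [2.0 ** p for p in range(-5, 6)],
            "gamma": [2.0 ** p for p in range(-5, 6)]}
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    svm = GridSearchCV(SVC(kernel="rbf"), grid, cv=inner)
    scores = cross_val_score(svm, X_selected, y, cv=outer)
    return scores.mean(), scores.std()
```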

We use three evaluation metrics, i.e., classification accuracy, standard deviation and coefficient of variation, to evaluate the classification performance of all methods. We define classification accuracy (ACC) as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} ACC = {{{N_{correct}}} / N} \end{array} \end{array} $$
(23)

where N is the number of samples and \(N_{correct}\) is the number of correctly classified samples.

We utilize the Standard Deviation (STD) to reflect the stability of the classification accuracy results, defined as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{l} STD = \sqrt {({1 / {{n_{run}}}})\sum\limits_{i = 1}^{{n_{run}}} {{{(AC{C_{i}} - u)}^{2}}} } \end{array} \end{array} $$
(24)

where \(n_{run}\) denotes the number of runs of the experiments, i.e., \(n_{run} = 10\) in our experiments, and u stands for the average value of ACC. A large ACC means good performance, and a small STD shows excellent stability.

We also use Coefficient of Variation (CV) as the evaluation index, which is defined as

$$\begin{array}{@{}rcl@{}} \begin{array}{l} CV = {{STD} /{ACC}} \end{array} \end{array} $$
(25)

where STD is the standard deviation defined in (24). A small CV indicates better robustness.
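The three metrics of (23)–(25) can be computed from the per-run accuracies in a few lines; this is a minimal sketch with our own function name:

```python
import numpy as np

def summarize(acc_per_run):
    """ACC, STD and CV over the n_run repetitions, as in (23)-(25)."""
    acc_per_run = np.asarray(acc_per_run, dtype=float)
    acc = acc_per_run.mean()                           # average accuracy
    std = np.sqrt(((acc_per_run - acc) ** 2).mean())   # eq. (24)
    return acc, std, std / acc                         # CV = STD / ACC, eq. (25)
```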

4.1 Experiment results and analysis

Table 2 displays the classification accuracy of all methods on the twelve datasets of Table 1. In addition, we report the classification accuracy results for all datasets in Fig. 1.

Fig. 1 Average classification accuracy of all methods on all the tested datasets

Table 2 Classification accuracy (ACC±STD (%))

The proposed LSS_FS method outperforms all competing methods in all classification tasks. For instance, the ACC of our method increases on average by 10.97% compared with TRACK, which does not learn a relatively correct graph matrix for the high-dimensional data. The ACC of our method rises on average by 9.45% compared with PCA, which cannot take enough information into account for feature selection. The ACC of our method climbs on average by 8.63% compared with LDA, which only considers the global structure of the data. Meanwhile, the ACC of our method increases on average by 8.46% compared with FSR_ALM, which fails to take the relationships between features into account. Compared with CSFS, which does not construct a graph matrix to capture the local correlations among the features, the ACC of our method increases on average by 8.34%. Compared with RSR, which only considers the relationships between features, the improvement in ACC of our method is 6.85% on average. The reason is that LSS_FS considers the following while conducting feature selection: 1) two kinds of correlations inherent in the data, i.e., sample-level and feature-level; and 2) iterative adjustment of the transformation matrix until it is optimal. Furthermore, the proposed method obtains the minimum standard deviation on average compared with the other methods. Based on the above observations, our proposed method has the best robustness among all methods for classification tasks.

We show the coefficient of variation results of all methods in Table 3. Our algorithm obtains the minimum value on five datasets, specifically 1.19 on glass, 0.31 on wine, 0.55 on Ionosphere, 0.87 on Movements and 0.80 on Arrhythmia. Although the proposed method does not achieve the minimum variation on every dataset, it reaches the minimum average variation over all datasets. This verifies that our method achieves the best stability.

Table 3 Coefficient of variation (%)
Fig. 2 Convergence rate of Algorithm 1 on all tested datasets

Figure 2 shows the objective values of Algorithm 1 on six datasets at each iteration, where we set the stopping criterion of Algorithm 1 as \(\frac {{||obj(t + 1) - obj(t)||_2^2}}{{obj(t)}} \le {10^{- 5}}\), with obj(t) representing the value of the objective function in (7) at the t-th iteration. From Fig. 2, we can see that: 1) the proposed Algorithm 1 monotonically decreases the objective function value of (7) until it converges; and 2) Algorithm 1 converges within twenty iterations on all datasets, revealing that the proposed algorithm has a fast convergence rate.

From Fig. 3, we can see that the classification accuracy with a low-rank constraint is, for the most part, better than the classification accuracy with a full-rank constraint. For instance, the average classification accuracy of the LSS_FS method with the low-rank constraint increases by 0.74%, 0.22%, 0.21%, 0.45%, 0.41%, 0.54%, 0.56%, 0.39%, 0.28%, 0.91%, 0.14% and 0.15%, respectively, compared with the results of the full-rank constraint on the datasets wine, Parkinsons, Ionosphere, LungCancer, Sonar, Movements, Arrhythmia, LVST, ecoli, Isolet and Yeast. Hence, it is clear that analyzing high-dimensional data with a low-rank constraint is meaningful for feature selection, because the low-rank constraint conducts subspace learning, which helps find the low-dimensional space of high-dimensional data by considering the global feature correlation.

Fig. 3 Average classification accuracy of our method on all tested datasets for different numbers of ranks

We vary the parameters λ2 and λ3 within the range \(\{10^{-2}, \ldots ,10^{2}\}\) and show the results in Fig. 4. Parameter λ2 controls the balance between the local representation term \(\sum \nolimits _{i,j}^n {||{x^i}AB - {x^j}AB||_2^2{s_{i,j}}}\) and the global representation term \({{||}}X - XAB{{||}}_F^2\), while λ3 in (7) adjusts the sparsity of B. In Fig. 4, our method achieves the best performance on the Sonar and ecoli datasets with λ2 = 10 and λ3 = 0.01, and it produces its best ACC on LVST, i.e., 84.38%, with λ2 = 1 and λ3 = 1. This indicates that parameter tuning benefits our method.

Fig. 4 ACC results of our proposed method for different λ2 and λ3

5 Conclusion

In this work, we proposed a novel feature selection method, called Unsupervised Feature Selection via Local Structure Learning and Sparse Learning (LSS_FS). LSS_FS uses the similarity matrix to fine-tune the self-representation coefficient matrix, producing a high-quality self-representation coefficient matrix. As a result, LSS_FS has better discriminative power than traditional feature selection methods. Experimental results on real datasets show that LSS_FS provides better feature selection performance than the competing methods.

In future work, we will extend the proposed method to semi-supervised feature selection tasks.