Abstract
Feature self-representation has become the backbone of unsupervised feature selection, since it is almost insensitive to noise data. However, feature selection methods based on feature self-representation have the following drawbacks: 1) The self-representation coefficient matrix is fixed and can not be fine-tuned according to the structure of data. 2) they do not consider the manifold structure of data, thus unable to further increase the performance of feature selection. To solve the above problems, this paper proposes an unsupervised feature selection algorithm that combines feature self-representation and manifold learning. Specifically, we first utilize feature self-representation to construct the model. After that, the self-representation coefficient matrix is dynamically adjusted to the optimal state based on the similarity matrix. Then, we use low-rank representation to explore the global manifold structure of the data. Finally, we combine sparse learning with feature selection. The experimental results on twelve datasets show that the proposed method outperforms all the competing methods.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
With the rapid development of information technology and database technology, people can easily access and store huge amount of data. Traditional data analysis tools can not satisfy people’s increasing needs any more [10, 42, 66]. In order to efficiently extract useful information from massive data, many data mining technologies have been around. In real applications, however, data usually is high-dimensional and may contain lots of redundant and useless information, which seriously undermines effectiveness of the data mining methods [13, 17, 19]. Therefore, it is crucial to conduct dimensionality reduction on the data before applying any data mining algorithm [15, 16].
Dimensionality reduction is one of the most important research areas of machine learning [9, 21, 41]. In real applications, data is often high-dimensional, and it is difficult to be understood, represented and processed [20, 37, 39]. By employing dimensionality reduction methods to break the curse of dimensionality, people can easily process and thus fully understand the data [11, 29, 49]. There are a number of ways to reduce dimensions of the data, which can be divided into two categories. The first one contains linear methods, such as Principal Component Analysis (PCA) [50] and Classical Multidimensional Scaling (CMDS) [52]. Low-dimensional data obtained by these linear methods can usually maintain the linear relationship between high-dimensional data points, but this kind of methods can not reveal complex nonlinear manifold structure of the data. The second one, based on manifold learning theory, consists of non-linear methods, such as Isometric Mapping (Isomap) [64], Local Linear Embedding (LLE) [63]. This type of algorithms can learn inherent geometry structure of the data, facilitating data processing and analysis [16, 24].
Dimensionality reduction methods based on manifold learning can usually be divided into two categories. One is based on local manifold structure of data. The most classical method is Locality Preserving Projections (LPP) [62], which can preserve neighborhood relations between data after dimensionality reduction. Another consists of methods that consider global structure of the data. Linear Discriminant Analysis (LDA) [7] is the most commonly used in many areas and it can produce the best projection result, meaning that after projection samples with the same class label have the largest distance while those with different class labels have the smallest distance [18, 40, 61, 63]. That is, LDA produces the best capability of separability for data samples. However, most feature selection algorithms only take into account one of the two manifold structures of the data, i.e., either global or local, so the performance is not satisfactory [12, 28, 30].
Because of some good properties, such as insensitive to noise data, feature self-representation has been widely used in machine learning and computer vision [60, 65]. However, since the self-representation coefficient matrix obtained by these methods is fixed, feature self-representation based methods do not adapt well to data with complex structure. To this end, this paper presents a feature selection algorithm that combines structure learning with sparse learning (LSS_FS). The main contributions of this paper include
-
Due to the fact that our method uses feature self-representation to build the model, this method is not very sensitive to noise and outliers, exhibiting good robustness.
-
This model utilizes a low-rank constraint on self-representation coefficient matrix to explore global manifold structure of the data, which had been proven that it has the ability to take advantage of the correlation among features effectively.
-
The model proposed in this paper can dynamically adjust the structure of data according to the similarity matrix of the data, and can make it more accurate, so as to improve the classification accuracy.
-
In this paper, we propose a novel method to solve the objective function. We perform low-rank feature selection and dynamically adjust the graph matrix to optimize the objective function according to the similarity matrix iteratively. The optimal results are achieved in each iteration, and finally the global optimal solution is thus obtained.
2 Related work
The original purpose of sparse learning is to compress and express a signal, because it has a lower sampling rate than Shannon theorem. And it gradually becomes popular, and successfully used to solve many practical problems. For example, in signal and image processing, signal encoding, representation and compression, image denoising, image super-resolution, etc [34, 48, 57].
With the continuous innovation and practice of many scientists and researchers around the world, sparse learning theory has been successfully applied in signal processing, data mining, machine learning, information retrieval, pattern recognition, biological computing and other fields. And it has become the research hotspot of these areas. So far, a large number of experts and scholars are still carrying out an in-depth study of sparse learning’s theory and application. For sparse learning itself, because it has a more natural discriminant nature, it is more suitable for face recognition [27, 36]. Specifically, based on the theory of Compressive Sensing, by using the sparse learning to reconstruct an approximate signal for the original one to find the dimensionality reduce matrix, so that the original high-dimensional signal projection to low-dimensional space can also maintain its original features as much as possible [32, 35]. Sparse learning methods have achieved very high recognition accuracy in face recognition. Although this method has achieved good performance, it needs to store all training images for recognition, significantly increasing the storage overhead [43, 46]. Moreover, when the sparse learning deals with independent signals, it only considers the inherent relevance of the signal, ignoring the correlation between the same type of signals.
Sparse learning evaluates the importance of an element in terms of the weight of its coefficient, that is, the weight of an unimportant element is zero and the important one is nonzero [31]. The more important an element, the greater the corresponding weight. Therefore, sparse learning can be used as a natural identification information into the model [55] by using the coefficient weights between the data samples or features. This enables sparse learning to reduce the impact of noise on the model and improve the efficiency of the learning model. Therefore, sparse learning has been widely used in the field of data mining and machine learning. For example, Zhu et al. proposed a joint sparse learning method to achieve block sparse of data by considering the correlation between data and global information of the data, and it has been used to deal with multi-label data for classification. Gao et al. used the histogram intersection kernel (HIk) to implement kernel sparse learning, and it utilizes the HIk basis to consider a feature quantization of soft-allocated extended to conduct sparse coding [8]. Xia et al. proposed a sparse projection algorithm to conduct binary coding for high-dimensional data [27]. Sparse Subspace Learning (SSL) can also be seen as a special dimensionality reduction method [2]. Some researchers proposed a series of sparse subspace learning algorithms based on sparse learning theory, including Nonnegative Sparse Principal Component Analysis (NSPCA) [45], Sparse Nomegative Matrix Factorization (SNMF) [23], and so on.
The essence of sparse learning is to introduce the coefficient weight between samples or features as the authentication information into the model. By using the sparse constraint to punish the model, the coefficient weight of some irrelevant data approaches zero, and useful information is preserved, so the sparse learning is robust [52]. However, sparse learning does not take into account the internal structure of the data.
To overcome the shortcomings of sparse learning, manifold learning has been proposed. Linear discriminant analysis is the most common technique used in manifold learning. Zhong et al. proposed a method to utilize ℓ 1-norm to maximize the objective function (LDA-ℓ 1) [47], where LDA-ℓ 1captures the global manifold structure via discriminant analysis, and then selects the most critical features. Unsupervised Feature Selection via Unified Trace Ratio Formulation and K-means Clustering (TRACK) was proposed as a unified framework for feature selection and selected discriminative features according to trace ratio formulation and K-means clustering [25]. Cai et al. proposed Multi-Cluster Feature Selection (MCFS), which explores local manifold structure of the data via spectral analysis and then searches the features that can better preserve the clustering structure [3]. However, these algorithms are sensitive to outliers, and if the data contains too much noise, performance of the algorithm will degenerate significantly.
There are many robust feature selection methods. For example, Unsupervised feature selection by regularized self-representation (RSR), which uses all the features to linearly represent each feature, thus weakening the impact of outliers [56]. Sun et al. put forward an Unsupervised Robust Bayesian Feature Selection method, where the feature selection capability is realized by estimating the feature saliencies associated with the features [22]. But these methods do not consider the manifold structure of data, so their performance are not satisfactory. To obtain better stability and performance, this paper combines manifold learning with feature self-representation, and then introduces the sparse regularization factor to conduct feature selection.
3 Our method
In this section, we first introduce some notations that are used in this paper, and then explain the detail of the proposed LSS_FS method, in Sections 3.1 and 3.2, respectively, and then elaborate the proposed optimization method in Section 3.3. Finally, we analysis the convergence of the objective function in Section 3.4.
3.1 Notations
For a matrix \(\textit {X} \in {\mathbb {R}^{n \times d}}\), its i-th row and j-th column are denoted as X i and X j , respectively. The trace of a matrix X is denoted as tr(X). X T means the transpose of X and X − 1means the inverse of X, respectively. Also we denote the ℓ F -norm and ℓ 2,1-norm of X respectively as \(||\textit {X}|{|_{F}} = \sqrt {{\sum \nolimits _{i}^{n}} {||{\textit {x}^{i}}||}_{2}^{2}} = \sqrt {{\sum \nolimits _{j}^{d}} {||{\textit {x}_{j}}||}_{2}^{2}} \) , \( ||X|{|_{2,1}} = {\sum \nolimits _{i}^{n}} {\sqrt {{\sum \nolimits _{j}^{d}} {x_{i,j}^{2}} } } \).
3.2 Structure learning for feature selection
Many previous literatures have indicated that the local structures of the samples may provide complementary information to boost the ability of dimensionality reduction [44, 54]. So, this paper proposes to utilize the local structures of the samples by learning a graph matrix \(S \in {\mathbb {R}^{n \times n}}\) on a low-dimensional space of the original data. Given the feature matrix X, we can get the following formula according to [51, 59]
where ||⋅||2 is the ℓ 2-norm of a vector, \(Z \in {\mathbb {R}^{d \times d}}\) is a transformation matrix of the high-dimensional data X in its low-dimensional space, and the element s i,j of the graph matrix S represents the similarity between sample X i and sample X j. If sample X i is one of the k-nearest neighbors of sample X j, then the value of the heat kernel, i.e., \(f({x^{i}},{x^{j}}) = exp(- \frac {{||{x^{i}} - {x^{j}}||_{2}^{2}}}{{2{\sigma ^{2}}}})\) where σ is a tuning parameter, is regarded as the value of s i,j ; otherwise s i,j = 0.
Actually, the character of S has been demonstrated very sensitive to σ [49, 53]. This inspires us to learn the relatively correct graph matrix from the ’clean’ data and to decrease number of parameters that need to adjust. By clean, we mean that a low-dimensional space with as less noise and redundancy as possible [33]. However, in real applications, we cannot know the graph matrix and low-dimensional space in advance [38]. In order to deal with these problems, we combine graph matrix learning with low-dimensional space learning to iteratively optimize them so as to achieve the optimal results. Therefore, we may learn the graph matrix by following the distribution of the samples. Then we have the following objective function:
where λ 1 is a control parameter and s i is the i-th column of S, \(||{s_{i}}||_{2}^{2}\) is used to avoid the trivial solution, 1 and \(\mathcal {N}(i)\) represent an all-one-element vector and the set of the nearest neighbors of the i-th sample, respectively, and the constraint \({s_{i}^{T}}{\textbf {1}} = 1\) is applied to gain the shift invariant similarity. Clearly, (2) assigns small value (i.e., similarity) to s i,j if sample i and j are far apart, and large value to s i,j otherwise.
Previous methods learn the graph matrix via (1) to generate an optimal similarity measurement. Different from that, (2) aims to achieve both optimal similarity measurement (i.e., S) and feature selection results (i.e., Z). It becomes obvious that (2) may receive better feature selection results (i.e., Z) than (1).
In order to reduce the adverse impacts of outliers and noise samples on the model, we propose to use the self-representation method to build an unsupervised model, then we can get the following functions:
In order to improve effectiveness of the algorithm and to consider relationship among features, we may impose a constrain on the rank of Z [58]. Through this process, a low rank constraint on Z can naturally be represented as the product of two r-rank matrices as follows:
where \(A \in {\mathbb {R}^{d \times r}}, B \in {\mathbb {R}^{r \times d}}, r \le \min (n,d),\) and our function becomes
According to [56], the feature selection issue can be converted to the following form:
where W is the feature weight matrix, l(X − X W) is the loss function, R(W) is the regularization on W and λ is a positive constant. In the item \({{||}}X - XAB{{||}}_F^2\), by treating XA as X of (6), then B can be naturally regarded as W of (6). Inspired by (6), we naturally make the regularization on matrix of B to further improve the performance of the algorithm. Therefore, our ultimate objective function is:
The constraint A T A = I(where \(A \in {\mathbb {R}^{d \times r}}\) and \(I \in {\mathbb {R}^{r \times r}}\)) is introduced for identifiability purpose, where λ 1,λ 2and λ 3are tuning parameters.
3.3 Optimization
Equation (7) is not jointly convex with respect to all the variables (i.e., A, B, and S), but it is convex for each variable while fixing the rest. In this paper, we utilize the alternative optimization method to optimize (7), i.e., iteratively optimizing each variable respectively.
Update A by fixing B and S
When B and S are fixed, the second and fourth terms of (7) can be viewed as constants, thus we can get:
By following the IRLS framework, we rewrite (8) as:
where t r(⋅) is a trace operator, \(L = Q - S \in {\mathbb {R}^{n \times n}}\) is a Laplacian matrix and Q is a diagonal matrix with its i-th element \({q_{i,i}} = \sum \nolimits _{j = 1}^{n} {{s_{i,j}}} \). By fixing B, we get the derivative of (9) as follows:
Since A is orthogonal, we can employ an existing method in [26] to optimize it.
Update B by fixing A and S
When A and S are fixed, the second term of (7) can be viewed as constants, we can get:
Which is equivalent to
Where \(P \in {\mathbb {R}^{r \times r}}\) is diagonal matrix with \({p_{i,i}} = \frac {1}{{2||{B^i}|{|_2}}},(i = 1, {\ldots } ,r)\). Then setting the derivative of B in (12) to zero, we obtain:
Update S by fixing A and B
Given that A and B are fixed, (7) is changed to:
We first calculate the Euclidean distance between each sample to produce k nearest neighbors of all samples, and then set the value of s i,j to 0 if the j-th sample does not belong to one of the k nearest neighbors of the i-th sample, otherwise, we utilize (17) to determine the values of s i,j .
Since optimizing S is equal to solely optimizing each vector s i (i = 1,…,n), we further change the optimization problem in (14) to individually optimize s i (i = 1,…,n) as follows:
By denoting \(F \in {\mathbb {R}^{n \times n}}\) where \({f_{i,j}} = ||{x^i}AB - {x^j}AB||_2^2\), we rewrite (15) as follows:
On the basis of the Karush-Kuhn-Tucker (KKT) conditions [1], we are able to get the closed-form solution of s i,j (j = 1,…,n)as:
We suppose that there are k nearest neighbors for each sample. By denoting \({\hat f_i} = \{ {\hat f_{i,1}}, {\ldots } ,{\hat f_{i,n}}\} \) as a descend order of f i ,(i = 1,…n), (17) reveals that s i,k+ 1 = 0and s i,k > 0, where k is the number of nearest neighbors of the i-th samples and can be tuned by cross-validation methods. That is,
Since \(s_i^T{\textbf {1}} = 1\), we can get
Based on the above discussion, we summarize the process for solving (7) in Algorithm 1.
3.4 Convergence analysis
The framework IRLS has been proven to converge in [6], so we only need to prove the convergence of Algorithm 1 via Theorem 1:
Theorem 1
The value of objective function in(7)monotonically decreases until Algorithm 1 converges.
Proof
After the t-th iteration, we have obtained the optimal A (t), B (t)and S (t). In the (t + 1)-th iteration, we need to optimize S (t+ 1)by fixing A (t)and B (t).
According to (17), \(s_{i,j}^{(t + 1)}\) has a closed-form solution for all i,j = 1,…,n, thus we have:
While fixing S (t+ 1)to update A (t+ 1)and B (t+ 1), we follow [6] to get
By integrating (20) with (21), we obtain:
From (22), we know that the objective function value of (7) decreases after each iteration of Algorithm 1. Hence, Theorem 1 has been proven. □
4 Experimental results
We use 5 binary-class and 7 multi-class benchmark datasets to verify the performance of our method and the competing dimensionality reduction methods. The selected datasets include glass, wine, Parkinsons, Ionosphere, LungCancer, Sonar, Movements, Arrhythmia, LVST, ecoli and Yeast, all downloaded from the UCI Machine Learning Repository.Footnote 1 The Isolet from the website of Feature Selection Datasets.Footnote 2 We summarize detailed information of the datasets in Table 1.
To prove the performance of LSS_FS, we compare it with six state-of-art feature selection methods. We list details of the competitor methods as follows:
TRACK: It selects discriminative features via unified trace ratio formulation and k-means clustering [25]. RSR: Regularized self-representation selects representative features via the ℓ 2,1-norm to characterize the self-representation coefficient matrix and ensure the robustness to outilers [56]. CSFS: Convex Semi-supervised multi-label Feature Selection can conduct the feature selection via the ℓ 2,1-norm regularization [5]. FSR_ALM : Exact Top-k Feature Selection via ℓ 2,0-norm constraint and Augmented Lagrangian Multiplier (ALM) to avoid the heavy burden of tuning regularization parameters and make it more practically [4]. LDA: Liner Discriminant Analysis (LDA) aims at minimizing the within class distance while maximizing the between class distance when conducting feature selection. Hence, LDA is a global subspace learning method [7]. PCA: It maps high-dimensional data into low-dimensional space by preserving the covariance of the data matrix [50].
We set {λ 2,λ 3}∈{10− 2,…,102}, while the value of λ 1is automatically adjusted according to [14], and the rank of the self-representation coefficient matrix r ∈{1,…,min(n,d)}. Moreover, {c,g}∈{2− 5,…,25}in SVM a 5-fold inner cross-validation is used to distinguish different types of samples. According to the cross-validation method, we can choose the best parameters for the experiments. In order to reduce impact of randomness, a 10-fold outer cross-validation has been used to get the average results. For the sake of fairness, we use the same strategy for all the competing methods.
We use three kinds of evaluation metrics, such as classification accuracy, standard deviation and coefficient of variation, to evaluate the classification performance of all methods. We define classification accuracy (ACC) as follows:
where N is the number of samples and N c o r r e c t is the number of correctly classified samples.
We utilize Standard Deviation (STD) to reflect the classification accuracy results and give the definition as follows:
where n r u n denotes the runs of experiments, i.e., n r u n = 10 in our experiments, and u stands for the average value of ACC. A large ACC means good performance, and a small STD shows excellent stability.
We also use Coefficient of Variation (CV) as the evaluation index, which is defined as
where STD is the square root of an unbiased estimate of each sample, i.e., standard deviation. A small CV means better robustness.
4.1 Experiment results and analysis
In Table 2, we display the classification accuracy of all methods on twelve datasets in Table 1. Besides, we report the results of classification accuracy for all datasets in Fig. 1.
The proposed LSS_FS method outperforms all competing methods in all classification tasks. For instance, the ACC of our method increases on average by 10.97%, compared with TRACK which does not learn the relatively correct graph matrix of the high-dimensional data. ACC of out method raises on average by 9.45%, compared with the PCA method which is not able to take into account enough information for feature selection. On the other hand, ACC of our method climbs on average by 8.63%, compared with the LDA method which only considers the global structure of the data. Meahwhile, ACC of our method increases on average by 8.46%, compared with the FSR_ALM method which fails to take the relationship between features into account. Also, compared with CSFS which does not construct graph matrix to consider the local correlations among the features, ACC of our method increases on average by 8.34%. Compared with the RSR method that only considers the relationship between the features, the improvement on ACC of our method is 6.85% on average. The reason is that, LSS_FS method considers the following constraint while conducting feature selection: 1) two kinds of correlations, i.e., sample-level and feature-level, inherent in data; 2) iteratively adjusting the transformation matrix until it is optimal. Furthermore, the proposed method obtains the minimum standard deviation on average, compared with the other methods. Based on the above observations, our proposed method has the best robustness compared with all the other methods for classification tasks.
We show the results of coefficient of variation of all methods in Table 3. Specifically, our algorithm obtains the minimum value on the five datasets, to be specific, 1.19 on glass, 0.31 on wine, 0.55 on Ionosphere, 0.87 on Movements and 0.80 on Arrhythmia, respectively. Returning to the performance of CV in Table 3, though the proposed method does not achieve the minimum variation on each of the datasets, it reaches the minimum average variation on all the datasets. This verifies that our method achieves the best stability.
Figure 2 reveals the characteristic of the proposed objective values of Algorithm 1 on six datasets at each iteration, where we set the stop criteria of Algorithm 1 as \(\frac {{||obj(t + 1) - obj(t)||_2^2}}{{obj(t)}} \le {10^{- 5}}\) where obj(t) represents the value of objective function in (7) at the t-th iteration. From Fig. 2, we can know that: 1) the proposed Algorithm 1 to optimize the objective function in (7) monotonically decreases the objective function value until Algorithm 1 converges; 2) the proposed Algorithm 1 on all the datasets converges within twenty iterations, revealing that the proposed algorithm has a fast convergence rate.
From Fig. 3, we can know that the classification accuracy with a low-rank constraint for the most part is better than the classification accuracy with full-rank. For instance, the average classification accuracy of LSS_FS method with low rank constraint increases by 0.74%, 0.22%, 0.21%, 0.45%, 0.41%, 0.54%, 0.56%, 0.39%, 0.28%, 0.91%, 0.14% and 0.15%, respectively, compared with the results of full-rank constraint on dataset wine, Parkinsons, Ionosphere, LungCancer, Sonar, Movements, Arrhythmia, LVST, ecoli, Isolet and Yeast. Hence, it is obvious that analyzing high-dimensional data with a low rank constraint in feature selection is meaningful. Due to the fact that the low rank constraint conducting subspace learning helps search the low-dimensional space of high-dimensional data by considering the global feature correlation.
We vary parameter λ 2and λ 3 within the range of {10− 2,…,102} and list the results in Fig. 4. Parameter λ 2is used to control the magnitude between the local representation term \(\sum \nolimits _{i,j}^n {||{x^i}AB - {x^j}AB||_2^2{s_{i,j}}}\) and the global representation term \({{||}}X - XAB{{||}}_F^2\), while λ 3in (7) is used to adjust the sparsity of AB. In Fig. 4, our method achieves the best performance on dataset Sonar and ecoli while setting λ 2 = 10, and λ 3= 0.01. Clearly, our method produces the best ACC, i.e., 84.38% with λ 2 = 1, and λ 3= 1 for dataset LVST. This indicates that tuning of parameter benefits our method.
5 Conclusion
In this work, we proposed a novel feature selection method, called Unsupervised Feature Selection via Local Structure Learning and Sparse Learning (LSS_FS). LSS_FS method utilizes the similarity matrix to fine tune the self-representation coefficient matrix to output a high quality self-representation coefficient matrix. As a result, LSS_FS can have better discriminative power than traditional feature selection methods. Experimental results on real datasets show that LSS_FS method provides better feature selection performance than the competitor methods.
In the future work, we will extend the proposed method for semi-surprised feature selection tasks.
References
Boyd S, Vandenberghe L (2013) Convex optimization
Cai D, He X, Han J (2007) Spectral regression: a unified approach for sparse subspace learning. In: ICDM, pp 73–82
Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: KDD, pp 333–342
Cai X, Nie F, Huang H (2013) Exact top-k feature selection via l 2,0 -norm constraint. In: IJCAI, pp 1240–1246
Chang X, Nie F, Yi Y, Huang H (2014) A convex formulation for semi-supervised multi-label feature selection. In: AAAI, pp 1171–1177
Daubechies I, Devore R, Fornasier M, SiNan Gntk C (2008) Iteratively reweighted least squares minimization for sparse recovery. Commun Pure Appl Math 63(1):1–38
Fan Z, Yong X, Zhang D (2011) Local linear discriminant analysis framework using sample neighbors. IEEE Trans Neural Netw 22(7):1119–1132
Gao S, Tsang I W, Chia L-T (2013) Sparse representation with kernels. IEEE Trans Image Process 22(2):423–434
Gao L, Song J, Liu X, Shao J, Liu J, Shao J (2017) Learning in high-dimensional multimedia data: the state of the art. Multimed Syst 23(3):303–313
Gao L, Wang Y, Li D, Shao J, Song J (2017) Real-time social media retrieval with spatial, temporal and social constraints. Neurocomputing 253:77–88
Hu R, Zhu X, Cheng D, He W, Yan Y, Song J, Zhang S (2017) Graph self-representation method for unsupervised feature selection. Neurocomputing 220:130–137
Jayasena K P N, Li L, Xie Q (2017) Multi-modal multimedia big data analyzing architecture and resource allocation on cloud platform. Neurocomputing
Ling C X, Yang Q, Wang J, Zhang S (2004) Decision trees with minimal costs. In: ICML, pp 69
Nie F, Zhu W, Li X (2016) Unsupervised feature selection with structured graph optimization. In: AAAI, pp 1302–1308
Qian B, Wang X, Cao N, Gang Jiang Y, Davidson I (2014) Learning multiple relative attributes with humans in the loop. IEEE Trans Image Process 23 (12):5573–5585
Qian B, Wang X, Cao N, Li H, Gang Jiang Y (2015) A relative similarity based method for interactive patient risk prediction. Data Mining Knowl Discov 29 (4):1070–1093
Qin Y, Zhang S, Zhu X, Zhang J, Zhang C (2007) Semi-parametric optimization for missing data imputation. Appl Intell 27(1):79–88
Song J, Yi Y, Zi H, Shen H T, Luo J (2013) Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Trans Multimed 15 (8):1997–2008
Song J, Gao L, Nie F, Shen H T, Yan Y, Sebe N (2016) Optimized graph learning using partial tags and multiple features for image and video annotation. IEEE Trans Image Process 25(11):4999–5011
Song J, Gao L, Zou F, Yan Y, Sebe N (2016) Deep and fast: deep learning hashing with semi-supervised graph construction. Image Vis Comput 55:101–108
Song J, Shen H T, Wang J, Zi H, Sebe N, Wang J (2016) A distance-computation-free search scheme for binary code databases. IEEE Trans Multimed 18(3):484–495
Sun J, Zhou A (2014) Unsupervised robust bayesian feature selection, pp 558–564
Wang T, Qin Z, Zhang S, Zhang C (2012) Cost-sensitive classification with inadequate labeled data. Inf Syst 37(5):508–516
Wang X, Qian B, Davidson I (2012) On constrained spectral clustering and its applications. Data Mining Knowl Discov 28(1):1–30
Wang D, Nie F, Huang H (2014) Unsupervised feature selection via unified trace ratio formulation and k-means clustering (track). In: Ecml/pkdd, pp 306–321
Wen Z, Yin W (2013) A feasible method for optimization with orthogonality constraints. Math Program 142(1):397–434
Xia Y, He K, Kohli P, Sun J (2015) Sparse projections for high-dimensional binary codes. In: Computer vision and pattern recognition, pp 3332–3339
Xie Q, Pang C, Zhou X, Zhang X, Ke D (2014) Maximum error-bounded piecewise linear representation for online stream approximation. Vldb J 23(6):915–937
Xie QS, Wang JZ, Zhang X (2016) Modeling and predicting ad progression by regression analysis of sequential clinical data. Neurocomputing 195(C):50–55
Xie Q, Zhang X, Li Z, Zhou X (2016) Optimizing cost of continuous overlapping queries over data streams by filter adaption. IEEE Trans Knowl Data Eng 28(5):1258–1271
Xindong W, Zhang S (2003) Synthesizing high-frequency rules from different data sources. IEEE Trans Knowl Data Eng 15(2):353–367
Xindong W, Zhang C, Zhang S (2004) Efficient mining of both positive and negative association rules. Acm Trans Inf Syst 22(3):381–405
Xindong W, Zhang C, Zhang S (2005) Database classification for multi-database mining. Inf Syst 30(1):71–88
Yan X, Zhang C, Zhang S (2009) Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support. Expert Syst Appl 36(2):3066–3076
Zhang S (2011) Shell-neighbor method and its application in missing data imputation. Appl Intell 35(1):123–133
Zhang S (2012) Nearest neighbor selection for iteratively knn imputation. J Syst Softw 85(11):2541–2552
Zhang C, Zhang S (2002) Association rule mining: models and algorithms 2307
Zhang S, Zhang C (2002) Anytime mining for multiuser applications. IEEE Trans Syst Man Cybern-Part Syst Humans 32(4):515–521
Zhang S, Zhang C, Yang Q (1999) Data preparation for data mining. Academic Press
Zhang S, Zhang C, Yan X (2003) Post-mining: maintenance of association rules by weighting. Inf Syst 28(7):691–707
Zhang S, Wu X, Zhang C (2003) Multi-database mining 2:5–13
Zhang S, Qin Z, Ling C X, Sheng S (2005) Missing is useful?: missing values in cost-sensitive decision trees. IEEE Trans Knowl Data Eng 17(12):1689–1693
Zhang S, Jin Z, Zhu X (2011) Missing data imputation by utilizing information within incomplete instances. J Syst Softw 84(3):452–459
Zhang S, Li X, Zong M, Zhu X, Cheng D (2017) Learning k for knn classification. ACM Trans Intell Syst Technol 8(3):43
Zhang S, Li X, Zong M, Zhu X, Wang R (2017) Efficient knn classification with different numbers of nearest neighbors. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2017.2673241
Zhao Y, Zhang S (2005) Generalized dimension-reduction framework for recent-biased time series analysis. IEEE Trans Knowl& Data Eng 18(2):231–244
Zhong F, Zhang J (2013) Linear discriminant analysis based on l1-norm maximization. IEEE Trans Image Process 22(8):3018–3027
Zhu Y, Lucey S (2015) Convolutional sparse coding for trajectory reconstruction. IEEE Trans Pattern Anal Mach Intell 37(3):529–540
Zhu X, Zhang S, Jin Z, Zhang Z (2011) Missing value estimation for mixed-attribute data sets. IEEE Trans Knowl Data Eng 23(1):110–121
Zhu X, Zi H, Shen H T, Cheng J, Changsheng X (2012) Dimensionality reduction by mixed kernel canonical correlation analysis. Pattern Recogn 45(8):3003–3016
Zhu X, Zi H, Shen H T, Zhao X (2013) Linear cross-modal hashing for efficient multimedia search. In: ACM International conference on multimedia, pp 143–152
Zhu X, Zi H, Cheng H, Cui J, Shen H T (2013) Sparse hashing for fast multimedia search. ACM Trans Inf Syst 31(2):9
Zhu X, Zi H, Yang Y, Shen H T, Changsheng X, Luo J (2013) Self-taught dimensionality reduction on the high-dimensional small-sized data. Pattern Recogn 46(1):215–229
Zhu X, Zhang L, Huang Z (2014) A sparse embedding and least variance encoding approach to hashing. IEEE Trans Image Process 23(9):3737
Zhu X, Suk H I, Shen D (2014) A novel matrix-similarity based loss function for joint regression and classification in ad diagnosis. Neuroimage 100:91–105
Zhu P, Zuo W, Zhang L, Qinghua H, Shiu S C (2015) Unsupervised feature selection by regularized self-representation. Pattern Recogn 48(2):438–446
Zhu X, Xie Q, Zhu Y, Liu X, Zhang S (2015) Multi-view multi-sparsity kernel reconstruction for multi-class image classification. Neurocomputing 169:43–49
Zhu X, Suk H I, Lee S W, Shen D (2015) Canonical feature selection for joint regression and multi-class identification in alzheimer’s disease diagnosis. Brain Imag Behav 10(3):1–11
Zhu X, Li X, Zhang S (2016) Block-row sparse multiview multilabel learning for image classification. IEEE Trans Cybern 46(2):450
Zhu X, Suk H-I, Lee S-W, Shen D (2016) Subspace regularized sparse multitask learning for multiclass neurodegenerative disease identification. IEEE Trans Biomed En. 63(3):607–618
Zhu Y, Zhu X, Kim M, Shen D, Guorong W (2016) Early diagnosis of alzheimers disease by joint feature selection and classification on temporally structured support vector machine. In: MICCAI, pp 264–272
Zhu X, He W, Li Y, Yang Y, Zhang S, Rongyao H, Zhu Y (2017) One-step spectral clustering via dynamically learning affinity matrix and subspace. In: AAAI, pp 2963–2969
Zhu X, Li X, Zhang S, Chunhua J, Xindong W (2017) Robust joint graph sparse coding for unsupervised spectral feature selection. IEEE Trans Neural Netw Learn Syst 28(6):1263–1275
Zhu X, Li X, Zhang S, Xu Z, Yu L, Wang C (2017) Graph PCA hashing for similarity search. IEEE Multimed Multimed 19(9):2033–2044
Zhu X, Suk HII, Huang H, Shen D (2017) Low-rank graph-regularized structured sparse regression for identifying genetic biomarkers. IEEE Transact Big Data. https://doi.org/10.1109/TBDATA.2017.2735991
Zhu X, Suk H-I, Wang L, Lee S-W, Shen D (2017) A novel relational regularization feature selection method for joint regression and classification in AD diagnosis. Med Image Anal 38:205–214
Acknowledgments
This work was supported in part by the China Key Research Program (Grant No: 2016YFB1000905), the China 1000-Plan National Distinguished Professorship, the Nation Natural Science Foundation of China (Grants No: 61573270, 61672177 and 61363009), the Guangxi Natural Science Foundation (Grant No: 2015GXNSFCB139-011), the Guangxi High Institutions Program of Introducing 100 High-Level Overseas Talents, the Guangxi Collaborative Innovation Center of Multi-Source Information Integration and Intelligent Processing, the Guangxi Bagui Teams for Innovation and Research, the Research Fund of Guangxi Key Lab of MIMS (16-A-01-01 and 16-A-01-02), the Guangxi Bagui Teams for Innovation and Research, and Innovation Project of Guangxi Graduate Education under grant XYCSZ2017064, XYCSZ2017067 and YCSW2017065.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Lei, C., Zhu, X. Unsupervised feature selection via local structure learning and sparse learning. Multimed Tools Appl 77, 29605–29622 (2018). https://doi.org/10.1007/s11042-017-5381-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-017-5381-7