1 Introduction

With the development of science and technology, big data have emerged in various fields, such as knowledge discovery and pattern recognition, causing a surge in database size in terms of both the number of samples and the number of features. The issue of a large number of samples can be handled by sampling methods or other techniques [4, 13, 22, 43, 49, 52]. Advanced techniques are still needed, however, to remove the harmful effect of redundant and noisy features in high-dimensional data, with the expectation of accelerating execution, reducing storage cost, and enhancing the performance of learning models. As a consequence, dimensionality reduction is widely used to reduce the number of features by removing uninformative features from high-dimensional data [16, 36, 41, 48, 50, 53].

Dimensionality reduction techniques can be divided into two categories, i.e., feature selection [14, 20, 40] and subspace learning [9, 10, 19, 38]. Feature selection methods, which retain important features from the original feature space, output interpretable models. Subspace learning methods, which project the original feature space into a new subspace, output robust dimensionality reduction models [44]. Spectral feature selection methods construct a framework that combines feature selection and subspace learning to obtain interpretable and robust models, and have therefore drawn a lot of attention in machine learning and data mining [24, 30].

Recently, a large number of spectral feature selection methods have been proposed [34, 35, 45, 51, 54, 55]. Cai et al. [2] first conduct eigenvalue decomposition on the graph matrix constructed from the original data to obtain a graph representation of the data, and then minimize the error between the derived graph representation and the original data. Gu et al. [5] and Shi et al. [25] first construct the graph matrix from the neighbor relationships among the data points, and then select useful features through a sparse regression model.

It has been shown that the local structure may achieve better performance than the global structure [8]. Thus, many spectral feature selection methods that discover the optimal local structure have been proposed [17, 23]. Previous spectral feature selection methods consist of two separate steps. Specifically, the first step explores the local structure to construct the graph matrix, while the second step selects the important features by sparse regression. However, there are at least two main issues in previous methods. Firstly, traditional spectral feature selection methods usually build the graph matrix in the original high-dimensional space, which usually yields unsatisfactory neighbor relationships. Secondly, previous spectral feature selection methods select the features and build the graph matrix separately. In this way, the graph matrix is obtained from the original data and remains unchanged for succeeding processes. However, the redundancy and noise in the original data may make the graph matrix low-quality. A low-quality graph matrix preserves an imperfect local structure and ultimately cannot output the optimal result.

In this paper, we propose a new spectral feature selection model that jointly learns a dynamic graph and performs sparse feature selection, so as to select the important features from the optimal subspace of the high-dimensional original data. To do this, our method takes the dynamic neighborhood correlation among the samples into account to preserve the optimal local structure of the data, with the aim of obtaining a robust spectral feature selection model. More specifically, we obtain the feature weight matrix by a least square regression between the original data and its labels, and use an ℓ2,1-norm regularization to penalize the weight matrix. We also impose an orthogonal constraint on the weight matrix to improve the accuracy of feature selection by conducting subspace learning. We further propose to build a dynamic graph matrix, i.e., to dynamically capture the nearest neighbor relationships among the data points. Moreover, we integrate the least square regression, the ℓ2,1-norm regularization, the dynamic graph matrix learning, and the orthogonal constraint into a unified framework. As a result, redundancy and noise can be removed from the original data, which reduces their influence on the constructed graph matrix. The proposed spectral feature selection model can easily select the important features because it contains an interpretable and robust low-dimensional subspace.

Compared with previous spectral feature selection methods, the main contributions of our proposed method are as follows:

  • Our proposed method performs the sparse regression and the dynamic graph learning simultaneously. In this way, the feature weight matrix derived from the regression can remove the effect of redundancy and noise to learn a more reliable similarity matrix. Meanwhile, a reliable graph matrix can improve the performance of the sparse regression. As a consequence, our proposed method learns the optimal local structure of the low-dimensional data through an alternating optimization method to ultimately select important features.

  • Our proposed method introduces a reasonable constraint: the feature weight matrix derived from the local structure and the sparse regression becomes more accurate when an orthogonal constraint is imposed on it.

2 Related work

Subspace learning methods project high-dimensional data into a low-dimensional space to reduce the dimensionality of the high-dimensional data. Popular subspace learning methods include linear projection methods (such as Principal Component Analysis (PCA) [32] and Locality Preserving Projection (LPP)) and nonlinear projection methods (such as kernel LDA [37] and kernel CCA [47]). Feature selection methods, in contrast, aim to find the subset of the original features that best describes the samples.

In this section, we first review recent studies of feature selection methods and then provide a brief analysis of previous spectral feature selection methods.

2.1 Feature selection

Feature selection is widely applied in real applications because of its interpretability [33, 35, 39, 42, 46]. Previous feature selection methods can be partitioned into three subgroups, i.e., filter methods, wrapper methods, and embedded methods. Filter methods [6, 12, 21], which are independent of the learning models, use proxy measures to rank the features and select the valuable ones. Wrapper methods [7, 11] search for the best feature subset guided by the performance of a given learning model, so that they achieve better feature selection results but higher computational cost than filter methods. Embedded methods [18, 28] differ from the previous two kinds of feature selection methods by integrating the process of feature selection into the learning models, and thus improve the performance while reducing the computational cost. In particular, sparsity-regularization-based embedded feature selection methods may achieve outstanding feature selection performance. More specifically, the ℓ1-norm regularization is widely applied in different feature selection models, such as LASSO [27] and sparse SVM [15], aiming at preventing over-fitting and achieving sparse results. On the other hand, the ℓ2,1-norm regularization further enhances the capability of sparsity and has been applied in multi-task and multi-class learning [16].

2.2 Spectral feature selection

Spectral feature selection methods usually consist of two main parts, i.e., graph matrix learning and sparse feature selection. According to the pattern of graph matrix learning, we divide previous spectral feature selection methods into two categories, i.e., fixed learning methods and dynamic learning methods.

Fixed learning methods first construct the graph matrix from the original data, and then use a sparsity regularization to conduct sparse feature selection. For example, the Unsupervised Feature Selection for Multi-Cluster Data (MCFS) method [2] and the Joint Feature Selection and Subspace Learning (FSSL) method [5] use the ℓ1-norm regularization and the ℓ2,1-norm regularization, respectively, to conduct spectral feature selection by first learning the graph matrix and then applying the sparse feature selection framework. The FSSL method utilizes group sparsity, i.e., the ℓ2,1-norm regularization, to consider the global correlation among the data, and thus achieves better performance than MCFS. Other fixed learning methods integrate the graph matrix learning into the feature selection to improve the reliability of the feature weight matrix. For example, the Robust Spectral Learning for Unsupervised Feature Selection (RSLUFS) method utilizes local kernel regression as a predictor to capture the structure of the data [24]. The Non-Negative Spectral Learning and Sparse Regression-Based Dual-Graph Regularized Feature Selection (NSSRD) method utilizes the Gaussian function to construct dual graphs from the data space and the feature space [23].

Dynamic learning methods learn the graph matrix in dependence on the feature weight matrix, i.e., in the low-dimensional space of the high-dimensional data, and obtain the optimal solution by iteratively updating the graph matrix and the feature weight matrix. For example, the Structured Optimal Graph Feature Selection (SOGFS) method [18] utilizes dynamic graph learning to capture the neighbor relationships and uses the ℓ2,1-norm regularization to penalize the projection matrix.

3 Methodology

In this section, we first list the notations and the definitions of the norms used in this paper in Section 3.1, and then describe the details of our method in Sections 3.2 and 3.3. Finally, the optimization of the proposed method is elaborated in Sections 3.4 and 3.5.

3.1 Notations

In this paper, we use boldface uppercase letters, boldface lowercase letters, and normal italic letters to denote matrices, vectors, and scalars, respectively. We denote \(\mathbf {X} = \{x_{i}\}_{i = 1}^{n} \in \mathbb {R}^{n \times d}\) as the feature matrix of the samples, where the i-th row and the j-th column of X are denoted as x^i and x_j, respectively. The Frobenius norm of X is denoted as \(||\mathbf {X}||_{F} = \sqrt {{\sum }_{ij} |x_{ij}|^{2}}\) and the ℓ2,1-norm of X is denoted as \(||\mathbf {X}||_{2,1} = {\sum }_{i} \sqrt {{\sum }_{j} {x_{ij}}^{2}}\). Furthermore, we denote tr(X) as the trace of X, X^T as the transpose of X, and X^{-1} as the inverse of X, respectively.
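As a quick illustration of these two norms (our addition, not part of the original paper), the following NumPy snippet computes ||X||_F and ||X||_{2,1} for an arbitrary example matrix:

```python
import numpy as np

def frobenius_norm(X):
    # ||X||_F = square root of the sum of squared entries
    return np.sqrt(np.sum(X ** 2))

def l21_norm(X):
    # ||X||_{2,1} = sum over rows of the l2-norm of each row
    return np.sum(np.sqrt(np.sum(X ** 2, axis=1)))

X = np.arange(6, dtype=float).reshape(3, 2)
print(frobenius_norm(X), np.linalg.norm(X, 'fro'))  # both give the same value
print(l21_norm(X))
```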

3.2 Dynamic graph learning

The previous literature has shown that graph matrix learning can capture the geometrical information of the samples to enhance the performance of dimensionality reduction. Nevertheless, redundancy and noise from samples or features can make the graph matrix unreliable and inaccurate. Thus, in this paper, we utilize a dynamic graph learning model to preserve the optimal local structure. Given a data matrix X, the similarity graph matrix S is initialized as follows:

$$ s_{ij} = \left\{ \begin{array}{*{20}{c}} f(x^{i}, x^{j})&\begin{array}{l} {\mathit{ if }~~} {x^{i}} \in \mathbf{N}(x^{j})\\ {\mathit{ or }~~} {x^{j}} \in \mathbf{N}(x^{i}) \end{array}\\ 0 & \mathit{otherwise} \end{array} \right. $$
(1)

where N(x) represents the set of k-nearest neighbors of sample x. If the i-th sample x^i is contained in the k-nearest neighbor set of sample x^j (or vice versa), then the value of s_{i,j} (i.e., the element of matrix S) is determined by the heat kernel (i.e., \(f(x^{i}, x^{j}) = \exp (- \frac {||x^{i} - x^{j}||_{2}^{2}} {2\sigma ^{2}})\), where σ is a tuning parameter); otherwise s_{i,j} = 0. Intuitively, closer data points in the sample space have greater similarity, so s_{i,j} can be determined by the squared Euclidean distance between x^i and x^j (i.e., ||x^i - x^j||_2^2). Therefore, determining the values of s_{i,j} from the original data space can be cast as solving:

$$ \underset{\mathbf{S}}{\min} {\sum}_{i,j}^{n} ||x^{i} - x^{j}||_{2}^{2} s_{ij} $$
(2)
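For concreteness, a minimal NumPy sketch (our illustration, not the authors' code) of the fixed similarity graph in (1)-(2) is given below; the neighborhood size k and kernel width sigma are assumed tuning parameters.

```python
import numpy as np

def initial_graph(X, k=5, sigma=1.0):
    """Heat-kernel similarity restricted to k-nearest neighbors, as in (1)."""
    n = X.shape[0]
    # squared Euclidean distances between all pairs of samples
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    S = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of sample i (excluding the sample itself)
        neighbors = np.argsort(sq_dist[i])[1:k + 1]
        S[i, neighbors] = np.exp(-sq_dist[i, neighbors] / (2 * sigma ** 2))
    # s_ij > 0 if x^i is a neighbor of x^j or x^j is a neighbor of x^i
    return np.maximum(S, S.T)
```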

Although (2) learns a fixed graph matrix S from the high-dimensional space of the original data to preserve the local structure of the data, it ignores the negative effect of redundancy and noise. If the original data contain noise and redundancy (which is almost always true for real-world data), the graph matrix learned from the original data becomes unreliable and inaccurate. Moreover, the quality of S has been shown to be very sensitive to the tuning parameter σ. This motivates us to learn the graph matrix from the low-dimensional space (where redundancy and noise are less likely to remain) and to decrease the dependency on parameters (i.e., to reduce the number of tuning parameters). To address this, we utilize a weight matrix to project the original data into its low-dimensional space and learn the graph matrix by iterative optimization instead of using the heat kernel function, which depends on the tuning parameter σ. We rewrite (2) as follows:

$$\begin{array}{@{}rcl@{}} &&\underset{\mathbf{S,W}}{\min} {\sum}_{i,j}^{n} {\left( ||x^{i}\mathbf{W} - x^{j}\mathbf{W}||_{2}^{2} s_{ij} + \alpha ||s_{i}||_{2}^{2}\right),} \\ && s.t.,\forall i,{s_{i}^{T}}\mathbf{1} = 1,s_{i,i} = 0,\\ && s_{i,j} \ge 0~\mathit { if }~i \in \mathbf{N}(j),\mathit{ otherwise }~0 \end{array} $$
(3)

where α is a regularization parameter, the squared ℓ2-norm of s_i (i.e., \(||s_{i}||_{2}^{2}\)) is used to avoid the trivial solution and to add a prior of uniform distribution, and 1 represents a vector of all ones. By solving problem (3), a reliable graph matrix is learned, so the obtained local structure information is more accurate.

3.3 Dynamic graph learning for feature selection

Let \(\mathbf {Y} =\{ y_{i} \}_{i = 1}^{n} \in \mathbb {R}^{n \times c}\) be the response matrix, where y_i represents the label of the i-th sample. Motivated by commonly used supervised learning methods, we utilize a regression model to fit each sample to its label and add a regularization term, which yields the following sparse regression model:

$$ \underset{\mathbf{W}}{\min}~ \beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}||_{\rho^{*}} + \gamma ||\mathbf{W}||_{\omega^{*}} $$
(4)

where \(\mathbf {W} = [w_{1},w_{2}, {\ldots } ,w_{d}] \in \mathbb {R}^{d \times c}\) is the feature weight matrix. Although there are many possible choices for ρ* and ω*, in this paper we select \(|| \cdot ||_{F}^{2}\) as ρ* and ||⋅||2,1 as ω*, respectively. The reason is that the Frobenius-norm loss has stable fitting performance and the ℓ2,1-norm regularization induces row sparsity. In particular, ||W||2,1 can be easily optimized. The sparse regression model is rewritten as follows:

$$ \underset{\mathbf{W}}{\min} ~\beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}||_{F}^{2} + \gamma ||\mathbf{W}||_{2,1} $$
(5)
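As a small illustration (ours, not the authors'), the objective of (5) for a candidate weight matrix W can be evaluated as follows, where beta and gamma stand for β and γ:

```python
import numpy as np

def sparse_regression_objective(W, X, Y, beta, gamma):
    """Objective of (5): beta * ||Y - XW||_F^2 + gamma * ||W||_{2,1}."""
    fit = np.sum((Y - X @ W) ** 2)                           # squared Frobenius loss
    row_sparsity = np.sum(np.sqrt(np.sum(W ** 2, axis=1)))   # l2,1-norm of W
    return beta * fit + gamma * row_sparsity
```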

When graph matrix learning and sparse feature selection are jointly performed, the feature weight matrix will affect the graph matrix. That is, the feature weight matrix not only selects the important features but also guides the graph matrix learning. We combine the objective function (3) with (5) as:

$$\begin{array}{@{}rcl@{}} &\underset{\mathbf{S,W}}{\min}& {\sum}_{i,j}^{n} ||x^{i}\mathbf{W} - x^{j}\mathbf{W}||_{2}^{2} s_{ij} + \alpha ||\mathbf{S}||_{F}^{2} \\ &&+ \beta ||\mathbf{Y} - \mathbf{XW}||_{F}^{2} + \gamma ||\mathbf{W}||_{2,1}\\ &&s.t.,\forall i,{s_{i}^{T}}\mathbf{1} = 1,{s_{i,i}} = 0,\\ && {s_{i,j}} \ge 0~ \mathit{ if }~i \in\mathbf{N}(j),~\mathit{ otherwise }~0 \end{array} $$
(6)

In feature selection tasks, PCA cannot be used to preprocess the high-dimensional data, because the transformed features would no longer correspond to the original ones. Therefore, we use the constraint W^T W = I to promote the performance of feature selection [18]. Our final objective function is as follows:

$$\begin{array}{@{}rcl@{}} &\underset{\mathbf{S,W}}{\min} &{\sum}_{i,j}^{n} ||{x^{i}}\mathbf{W} - {x^{j}}\mathbf{W}||_{2}^{2} s_{ij} + \alpha ||\mathbf{S}||_{F}^{2}\\ &&+ \beta ||\mathbf{Y} - \mathbf{XW}||_{F}^{2} + \gamma ||\mathbf{W}|{|_{2,1}}\\ & &s.t.,\forall i,{s_{i}^{T}}\mathbf{1} = 1,s_{i,i} = 0,s_{i,j} \ge 0~\mathit{ if }~i \in \mathbf{N}(j)\\ &&\mathit{ otherwise }~0, \mathbf{W}^{T} \mathbf{W} = \mathbf{I} \end{array} $$
(7)

Equation (7) iteratively updates the graph matrix S and the feature weight matrix W until each achieves its optimal result. In this way, the optimal local structure can be preserved by the iteratively updated graph matrix, and the feature selection model ultimately obtains excellent performance. As a consequence, given the optimal W, we sort the values of ||w^i||_2, i = 1, 2, ..., d, in descending order and select the features corresponding to the top r ranked ℓ2-norm values as the final result of our proposed method.
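The feature-ranking step just described can be sketched as follows (an illustration assuming W has already been learned; r is the desired number of features):

```python
import numpy as np

def select_features(W, r):
    """Rank features by the l2-norm of the corresponding row of W and keep the top r."""
    row_norms = np.sqrt(np.sum(W ** 2, axis=1))   # ||w^i||_2 for each feature i
    ranked = np.argsort(-row_norms)               # indices sorted in descending order
    return ranked[:r]                             # indices of the selected features
```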

3.4 Optimization

In this paper, we employ an alternating optimization strategy that fixes one variable while optimizing the other, and iterates until the algorithm converges. We list the pseudo code in Algorithm 1; a code sketch of the whole alternation is given after Algorithm 2 below.

  1) Update S with W fixed

    With W fixed and treated as a constant, (7) becomes:

    $$\begin{array}{@{}rcl@{}} &\underset{\mathbf{S}}{\min}& {\sum}_{i,j}^{n} ||x^{i}\mathbf{W} - x^{j}\mathbf{W} ||_{2}^{2} s_{ij} + \alpha {\sum}_{i,j}^{n} s_{i,j}^{2} \\ &&\quad s.t.,\forall i,{s_{i}^{T}}\mathbf{1} = 1, s_{i,i} = 0,\\ && \qquad s_{i,j} \ge 0~\mathit{ if }~i \in \mathbf{N}(j),~\mathit{ otherwise }~0 \end{array} $$
    (8)

    To ease the optimization of (8), we optimize each vector s_i, i = 1, …, n, independently instead of optimizing S as a whole; the optimization problem for s_i becomes:

    $$ \underset{{s_{i}^{T}}\mathbf{1} = 1, s_{i,i} = 0, s_{i,j} \ge 0}{\min} {{\sum}_{j}^{n}} \left( ||x^{i}\mathbf{W} - x^{j}\mathbf{W}||_{2}^{2} s_{i,j} + \alpha s_{i,j}^{2}\right) $$
    (9)

    We calculate the Euclidean distances between all samples to obtain the k nearest neighbors, and then set the value of s_{i,j} by optimizing (9) if the j-th sample is one of the nearest neighbors of the i-th sample; otherwise, s_{i,j} = 0. By denoting \(\mathbf {G} \in \mathbb {R}^{n \times n}\) with \(g_{i,j} = ||x^{i}\mathbf {W} - x^{j}\mathbf {W}||_{2}^{2}\) and letting g_i be the i-th row of G, we rewrite (9) as follows:

    $$ \underset{{s_{i}^{T}}\mathbf{1} = 1, s_{i,i} = 0, s_{i,j} \ge 0}{\min} {{\sum}_{i}^{n}} {\left|\left|s_{i} + \frac{1}{{2\alpha} }g_{i}\right|\right|_{2}^{2}} $$
    (10)

    The Lagrangian function of (10) is:

    $$ \underset{s_{i},\lambda ,\upsilon}{\min} \left|\left|s_{i} + \frac{1}{{2\alpha} }g_{i}\right|\right|_{2}^{2} - \lambda \left( {s_{i}^{T}}\mathbf{1} - 1\right) - {\upsilon^{T}}s_{i} $$
    (11)

    where λ and υ are the Lagrangian multipliers. The optimal solution of s_i yielded by the Karush-Kuhn-Tucker (KKT) conditions [1] is:

    $$ s_{i,j} = {\left( - \frac{1}{{2\alpha} }g_{i,j} + \lambda \right)_ +} $$
    (12)
  2) Update W with S fixed

    With S fixed, we have the following objective function:

    $$\begin{array}{@{}rcl@{}} &\underset{\mathbf{W}}{\min}& {\sum}_{i,j}^{n} ||x^{i}\mathbf{W} - x^{j}\mathbf{W}||_{2}^{2} s_{ij} \\ && + \beta ||\mathbf{Y} - \mathbf{XW}||_{F}^{2} + \gamma ||\mathbf{W}||_{2,1}\\ && s.t.,{\mathbf{W}^{T}}\mathbf{W} = \mathbf{I} \end{array} $$
    (13)

The optimization of (13) with respect to W is convex but nonsmooth due to the regularization term ||W||2,1. Thus we employ the Iteratively Reweighted Least Squares (IRLS) method [3] to optimize (13) with respect to W, iteratively updating W until (13) converges. We rewrite (13) as:

$$\begin{array}{@{}rcl@{}} &\underset{\mathbf{W}}{\min}& tr(\mathbf{W}^{T} \mathbf{X}^{T}\mathbf{LXW}) \\ && +\beta ||\mathbf{Y} - \mathbf{XW}||_{F}^{2} + \gamma tr(\mathbf{W}^{T} \mathbf{QW})\\ && s.t.{\mathbf{W}^{T}}\mathbf{W} = \mathbf{I} \end{array} $$
(14)

where \(\mathbf{L} = \mathbf{P} - \mathbf{S}\) is the Laplacian matrix and P is a diagonal matrix whose i-th diagonal element is \(p_{ii} = {\sum }_{j = 1}^{n} s_{i,j} \). \(\mathbf {Q} \in {\mathbb {R}^{d \times d}}\) is a diagonal matrix with its i-th diagonal element defined as:

$$ q_{ii} = \frac{1}{{2||w^{i}||_{2}}}, i = 1, {\ldots} ,d $$
(15)

where w^i is the i-th row of W. Equation (14) contains an orthogonal constraint, so we adopt the solver of [31] and list the pseudo code of the optimization algorithm of W in Algorithm 2, where the derivative of the smooth part of (14) is ∇F = X^T L X W + β(X^T X W − X^T Y) + γ Q W.
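To make the W subproblem concrete, the sketch below alternates the IRLS reweighting of Q with a fixed-step Cayley-transform update that keeps W^T W = I. It is a simplified stand-in for the curvilinear-search solver of [31] (Algorithm 2), under assumed choices of the step size tau, the inner iteration count, and a small eps added for numerical stability.

```python
import numpy as np

def update_W(X, Y, L, W, beta, gamma, tau=1e-2, inner_iters=20, eps=1e-8):
    """Approximate solver for subproblem (14): refresh the IRLS weights Q, then take
    a Cayley-transform step that preserves the orthogonality constraint W^T W = I."""
    d = W.shape[0]
    XtLX = X.T @ L @ X            # X^T L X, fixed within this subproblem
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(inner_iters):
        # IRLS reweighting for the l2,1-norm: q_ii = 1 / (2 ||w^i||_2)
        q = 1.0 / (2.0 * np.sqrt(np.sum(W ** 2, axis=1)) + eps)
        # gradient of the smooth surrogate, as given after (15)
        G = XtLX @ W + beta * (XtX @ W - XtY) + gamma * (q[:, None] * W)
        # skew-symmetric matrix and Cayley-transform retraction (in the spirit of [31])
        A = G @ W.T - W @ G.T
        W = np.linalg.solve(np.eye(d) + (tau / 2) * A,
                            (np.eye(d) - (tau / 2) * A) @ W)
    return W
```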

Algorithm 1 The pseudo code of the proposed method
Algorithm 2 The pseudo code of the optimization of W
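Putting the pieces together, the following sketch mirrors the outer alternation of Algorithm 1: it applies the closed-form row-wise update of S from (12), using the per-row α_i of (23) rather than the averaged α of (24), rebuilds a symmetrized graph Laplacian, and calls the update_W sketch above. The stopping rule is the relative-change criterion used later in Section 4.5; for simplicity the tracked objective omits the α||S||_F^2 term and uses the trace form of the graph term. This is our hedged reading of the procedure, not the authors' implementation.

```python
import numpy as np

def update_S(X, W, k):
    """Row-wise closed-form update of S following (12) with the per-row alpha_i of (23)."""
    Z = X @ W                                        # samples in the learned subspace
    n = Z.shape[0]
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    S = np.zeros((n, n))
    for i in range(n):
        g = sq_dist[i].copy()
        g[i] = np.inf                                # enforce s_ii = 0
        order = np.argsort(g)                        # ascending g_{i,1} <= ... <= g_{i,n}
        top_k, kp1 = order[:k], order[k]
        denom = k * g[kp1] - np.sum(g[top_k]) + 1e-12
        S[i, top_k] = (g[kp1] - g[top_k]) / denom    # nonnegative, sums to one
    return S

def dynamic_graph_feature_selection(X, Y, k=15, beta=1.0, gamma=1.0,
                                    max_iters=30, tol=1e-3):
    """Outer alternation of Algorithm 1: update S with W fixed, then W with S fixed."""
    d, c = X.shape[1], Y.shape[1]
    W = np.linalg.qr(np.random.randn(d, c))[0]       # random orthonormal initialization
    prev_obj = None
    for _ in range(max_iters):
        S = update_S(X, W, k)
        S_sym = (S + S.T) / 2                        # standard symmetrization
        L = np.diag(S_sym.sum(axis=1)) - S_sym       # graph Laplacian used in (14)
        W = update_W(X, Y, L, W, beta, gamma)        # subproblem (14), sketched above
        obj = (np.trace(W.T @ X.T @ L @ X @ W)       # surrogate of (7) for the stop rule
               + beta * np.sum((Y - X @ W) ** 2)
               + gamma * np.sum(np.sqrt(np.sum(W ** 2, axis=1))))
        if prev_obj is not None and abs(prev_obj - obj) / max(abs(prev_obj), 1e-12) <= tol:
            break                                    # relative-change stopping rule
        prev_obj = obj
    return W, S
```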

3.5 Convergence analysis, complexity, and parameters’ determination

3.5.1 Proof of convergence

Wen and Yin [31] proved the convergence of Algorithm 2, so we only need to prove the convergence of Algorithm 1. We first give the following lemma:

Lemma 1

The inequality

$$ \sqrt u - \frac{u}{2\sqrt v} \le \sqrt v - \frac{v}{2\sqrt v} $$
(16)

always holds for all positive real numbers u and v, according to the literature [31].

Theorem 1

The objective function value of (7) monotonically decreases until Algorithm 1 converges.

Proof

After the t-th iteration, we have obtained the current optimal W^{(t)} and S^{(t)}. In the (t + 1)-th iteration, we fix W^{(t)} to update S. According to (12), we know that s_{i,j}^{(t + 1)} is the globally optimal solution for all i, j = 1, ..., n. Thus we have the following inequality:

$$\begin{array}{@{}rcl@{}} &&{\sum}_{i,j}^{n} ||x^{i} \mathbf{W}^{(t)} - x^{j}\mathbf{W}^{(t)}||_{2}^{2} s_{i,j}^{(t + 1)} \\ &&+ \alpha {{\sum}_{i}^{n}} ||s_{i}^{(t + 1)}||_{2}^{2} + \beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}^{(t)}||_{F}^{2}\\ &&+ \gamma ||{\mathbf{W}^{(t)}}||_{2,1}\\ &&\le {\sum}_{i,j}^{n} ||x^{i}\mathbf{W}^{(t)} - x^{j}\mathbf{W}^{(t)}||_{2}^{2} s_{i,j}^{(t)} \\ &&+ \alpha {{\sum}_{i}^{n}} ||s_{i}^{(t)} ||_{2}^{2} + \beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}^{(t)}||_{F}^{2}\\ &&+ \gamma ||\mathbf{W}^{(t)}||_{2,1} \end{array} $$
(17)

When updating W^{(t+1)} with S^{(t+1)} fixed, we obtain the following inequality according to Lemma 1 and the convergence of Algorithm 2:

$$\begin{array}{@{}rcl@{}} &&{\sum}_{i,j}^{n} ||x^{i}\mathbf{W}^{(t+1)} - x^{j}\mathbf{W}^{(t+1)}||_{2}^{2} s_{i,j}^{(t + 1)} \\ &&+ \alpha {{\sum}_{i}^{n}} {||s_{i}^{(t + 1)}} ||_{2}^{2} + \beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}^{(t+1)}||_{F}^{2}\\ &&+ \gamma ||W^{(t+1)}||_{2,1}\\ &&\le {\sum}_{i,j}^{n} ||x^{i}\mathbf{W}^{(t)} - x^{j}\mathbf{W}^{(t)}||_{2}^{2} s_{i,j}^{(t+1)} \\ &&+ \alpha {{\sum}_{i}^{n}} ||s_{i}^{(t+1)}||_{2}^{2} + \beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}^{(t)}||_{F}^{2}\\ &&+ \gamma ||\mathbf{W}^{(t)}||_{2,1} \end{array} $$
(18)

By combining (17) and (18), we obtain:

$$\begin{array}{@{}rcl@{}} &&{\sum}_{i,j}^{n} ||x^{i}\mathbf{W}^{(t+1)} - x^{j}\mathbf{W}^{(t+1)}||_{2}^{2} s_{i,j}^{(t + 1)} \\ &&+ \alpha {{\sum}_{i}^{n}} ||s_{i}^{(t + 1)} ||_{2}^{2} + \beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}^{(t+1)}||_{F}^{2}\\ &&+ \gamma ||\mathbf{W}^{(t+1)}||_{2,1}\\ &&\le {\sum}_{i,j}^{n} ||x^{i}\mathbf{W}^{(t)} - x^{j}\mathbf{W}^{(t)}||_{2}^{2} s_{i,j}^{(t)}\\ &&+ \alpha {{\sum}_{i}^{n}} ||s_{i}^{(t)}||_{2}^{2} + \beta ||\mathbf{Y} - \mathbf{X}\mathbf{W}^{(t)}||_{F}^{2}\\ &&+ \gamma ||\mathbf{W}^{(t)}||_{2,1} \end{array} $$
(19)

According to (19), the objective function value of (7) decreases after each iteration of Algorithm 1. Thus, Theorem 1 is proved. □

3.5.2 Complexity analysis

In each iteration, the time cost of Algorithm 1 is dominated by the computation of X^T L X W + β(X^T X W − X^T Y) + γ Q W in Algorithm 2 and of g_{i,j} in (12), whose complexities are O(n^2 d) and O(n d^2), respectively, where n and d are the numbers of data points and features. In our experiments, our method usually converges within 30 iterations, so the time complexity of Algorithm 1 is max{O(n^2 d), O(n d^2)}.

3.5.3 Parameters’ determination

The value of the parameter α determines the number of nearest neighbors of each sample. Specifically, α = 0 means the number of nearest neighbors k is 1, while α → ∞ means the number of nearest neighbors k is n (n is the number of samples). Without loss of generality, suppose g_{i,1}, ..., g_{i,n} are sorted in ascending order for each i = 1, ..., n. To obtain exactly k nonzero elements in s_i, we require s_{i,k} > 0 and s_{i,k+1} = 0; according to (12), we therefore have

$$ \left\{ {\begin{array}{*{20}{l}} - \frac{1}{2\alpha} g_{i,k + 1} + \lambda \le 0\\ \\- \frac{1}{2\alpha} g_{i,k} + \lambda > 0 \end{array}} \right. $$
(20)

Based on the constraint \({s_{i}^{T}}\mathbf {1} = 1\), we have

$$\begin{array}{@{}rcl@{}} &&{\sum}_{j = 1}^{k} \left( - \frac{1}{2\alpha} g_{i,j} + \lambda \right) = 1 \\ &&\Rightarrow \lambda = \frac{1}{k} + \frac{1}{2k\alpha}{\sum}_{j = 1}^{k} g_{i,j} \end{array} $$
(21)

Combining (20) and (21), we have the following inequality:

$$ \frac{k}{2}{g_{i,k}} - \frac{1}{2}{\sum}_{j = 1}^{k} g_{i,j} < \alpha < \frac{k}{2}g_{i,k + 1} - \frac{1}{2}{\sum}_{j = 1}^{k} g_{i,j} $$
(22)

In order to obtain an optimal solution s_i that has exactly k nonzero elements, we set α as:

$$ \alpha_{i} = \frac{k}{2} g_{i,k + 1} - \frac{1}{2}{\sum}_{j = 1}^{k} g_{i,j} $$
(23)

The mean of α_1, ..., α_n can be set as the final α. We have

$$ \alpha = \frac{1}{n}{\sum}_{i = 1}^{n} \left( \frac{k}{2} g_{i,k + 1} - \frac{1}{2}{\sum}_{j = 1}^{k} g_{i,j} \right) $$
(24)
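A small sketch of this rule (our illustration): the per-sample α_i of (23) are computed from the sorted squared distances in the projected space and averaged as in (24).

```python
import numpy as np

def estimate_alpha(X, W, k):
    """Average of the per-sample alpha_i in (23), as suggested by (24)."""
    Z = X @ W
    n = Z.shape[0]
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    alphas = np.empty(n)
    for i in range(n):
        g = np.sort(np.delete(sq_dist[i], i))          # g_{i,1} <= ... <= g_{i,n-1}
        # (23): alpha_i = (k/2) * g_{i,k+1} - (1/2) * sum_{j<=k} g_{i,j}
        alphas[i] = 0.5 * (k * g[k] - np.sum(g[:k]))
    return alphas.mean()                               # (24)
```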

After fixing α, the parameters k, β, and γ in (7) still need to be tuned. In this paper, we follow the literature [29] to determine the value of k, and employ a cross-validation method to estimate the others.

4 Experiments

In this section, we evaluate our proposed method in terms of classification performance and compare it with a baseline and five comparison methods on eight data sets.

4.1 Data sets

The data sets Umist, Ecoli, Cane, and Isolet come from the UCI Machine Learning Repository, while Colon, Coil, and Orl come from the Feature Selection Datasets website. We use the Lung data set from [26]. The data sets used in this paper can be divided into text data (Cane, Isolet), biological data (Colon, Ecoli, Lung), and image data (Coil, Orl, Umist). Every data set has more than 343 features, and half of the data sets (Colon, Lung, Coil, Orl) exceed 1000 dimensions. The details of the data sets are listed in Table 1.

Table 1 Summary of the used data sets

4.2 Comparison methods and experimental setting

We list the details of the comparison methods as follows:

  • Baseline method uses all features to conduct classification tasks via SVM.

  • Convex Semi-supervised multi-label Feature Selection (CSFS) uses least square regression to select features and employs an ℓ2,1-norm regularization to enforce row sparsity.

  • ℓ2,0-norm ALM (FSRobustALM) uses an ℓ2,1-norm to formulate the loss minimization problem and tackles the sparsity problem with an ℓ2,0-norm constraint.

  • Regularized Self-Representation (RSR) combines a feature-level self-representation loss function and an ℓ2,1-norm regularization to filter out the unimportant features.

  • Robust Unsupervised Feature Selection (RUFS) uses local learning regularization to learn pseudo cluster labels and minimizes the fitting errors of both feature learning and label learning by employing an ℓ2,1-norm regularization.

  • Structured Optimal Graph Feature Selection (SOGFS) selects important features by learning the local structure among the samples in a low-dimensional feature space.

To avoid the generalization error of the classifier, we compare all methods using 10-fold cross-validation. We repeat the whole process 10 times to avoid possible bias in the data partitioning for cross-validation, and the final result is computed by averaging the results of all runs. We conduct 5-fold cross-validation on the training data for model selection. That is, we further separate the training data into five parts, where one part is used to validate the model built on the remaining four parts. In the validation step, we search for the best parameter combination by grid search within the given parameter ranges, i.e., the combination achieving the best classification performance on the validation data. We set k = 15 and search β and γ in {10^{-3}, ..., 10^{3}}.
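The evaluation protocol can be sketched as follows with scikit-learn, reusing the hypothetical dynamic_graph_feature_selection and select_features sketches from Section 3; Y_onehot denotes a one-hot encoding of the labels used as Y in (7) and r is the number of selected features. This is an illustration of the nested cross-validation described above, not the authors' experimental code.

```python
import itertools
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def nested_cv_aca(X, y, Y_onehot, r, k=15,
                  grid=(1e-3, 1e-2, 1e-1, 1, 10, 100, 1e3)):
    """10-fold outer CV; on each training split, a 5-fold grid search over (beta, gamma)
    picks the parameters, then an SVM is trained on the r selected features."""
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for tr, te in outer.split(X, y):
        best_acc, best = -1.0, None
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        for beta, gamma in itertools.product(grid, grid):
            accs = []
            for itr, iva in inner.split(X[tr], y[tr]):
                W, _ = dynamic_graph_feature_selection(X[tr][itr], Y_onehot[tr][itr],
                                                       k=k, beta=beta, gamma=gamma)
                feats = select_features(W, r)
                clf = SVC().fit(X[tr][itr][:, feats], y[tr][itr])
                accs.append(accuracy_score(y[tr][iva], clf.predict(X[tr][iva][:, feats])))
            if np.mean(accs) > best_acc:
                best_acc, best = np.mean(accs), (beta, gamma)
        # retrain on the full training split with the selected (beta, gamma)
        W, _ = dynamic_graph_feature_selection(X[tr], Y_onehot[tr], k=k,
                                               beta=best[0], gamma=best[1])
        feats = select_features(W, r)
        clf = SVC().fit(X[tr][:, feats], y[tr])
        scores.append(accuracy_score(y[te], clf.predict(X[te][:, feats])))
    return np.mean(scores)     # average classification accuracy (ACA)
```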

In our experiments, we use the Average Classification Accuracy (ACA) as the evaluation metric. We also investigate the robustness of our proposed method in terms of parameter sensitivity and the convergence of Algorithm 1.

4.3 Experimental result

The ACA results of all methods are shown in Fig. 1, where the horizontal axis represents the number of selected features.

Fig. 1 ACA of all methods on all data sets at different numbers of selected features

Obviously, compared with the other methods (SOGFS, RUFS, RSR, FSRobustALM, CSFS, and Baseline), our method achieves the best performance. For example, our proposed method improves the ACA by 12.1% and 17.5% over CSFS and RSR, respectively, on the Lung data set. Further observations are as follows.

First, as the number of selected features increases, the classification performance of all feature selection methods first rises to an optimum and then starts to fall. For example, when selecting 20% and 60% of the features, the ACA on the Colon data set is about 83.7% and 88.6%, respectively, but when keeping 80% of the features, the ACA drops to 83.6%. High-dimensional data containing redundancy and noise may degrade classification performance, so conducting feature selection is necessary.

Second, most feature selection methods outperform Baseline, which conducts classification using all features; our proposed method improves the ACA by 15.2% on average compared to Baseline. This again verifies that feature selection is necessary for dealing with high-dimensional data.

4.4 Parameters’ sensitivity

We set the range of the parameters β and γ as {10^{-3}, 10^{-2}, ..., 10^{3}}, and report the results in Fig. 2.

Fig. 2 ACA of our proposed method under different parameter settings

From Fig. 2 we can see that our proposed method is sensitive to the parameter settings. Namely, the best classification results rely on suitable parameter combinations; hence, tuning the parameters is necessary for our method. More specifically, β controls the trade-off between the local structure learning term \({\sum }_{i,j}^{n} ||x^{i}\mathbf {W} - x^{j}\mathbf {W}||_{2}^{2} s_{ij} \) and the least square regression term \(||\mathbf {Y} - \mathbf {XW}||_{F}^{2}\), while γ adjusts the sparsity term ||W||2,1. When setting β = 10 and γ = 10, our method obtains the best performance on the Colon and Ecoli data sets. However, on the Coil data set, our method achieves the best ACA of 99.1% with β = 100 and γ = 100.

4.5 Convergence of algorithm

Figure 3 illustrates the variation of the objective value of our proposed method (i.e., Algorithm 1) as the number of iterations increases. In the experiments, we set the stopping criterion of both Algorithms 1 and 2 as \(\frac {||obj(t + 1) - obj(t)||_{2}^{2}}{obj(t)} \le 10^{-3}\), where obj(t) indicates the objective function value of (7) at the t-th iteration.

Fig. 3 The convergence of our proposed Algorithm 1

From Fig. 3 we can observe that 1) the objective function value decreases monotonically with the iterations of Algorithm 1 until it converges; and 2) Algorithm 1 reaches convergence within only a few iterations (i.e., fewer than 20), which demonstrates the efficiency of our method. Moreover, Algorithm 2 also converges within 30 iterations on all data sets.

5 Conclusion

This paper has proposed a novel spectral feature selection method that learns the graph matrix dynamically and selects the features simultaneously to obtain more reliable similarities among the data; we further impose an orthogonal constraint on W. The experimental results on real data sets demonstrate that our proposed method achieves the best performance compared with the other feature selection methods.

In future work, we will extend our proposed framework to conduct unsupervised learning on high-dimensional data, since data sets with missing labels are common in the real world.