1 Introduction

Feature learning is an important step in machine learning and data mining and has been widely applied in many big data analysis domains, such as gene microarray data, text data and image sequences in video processing. There are two ways to generate features from data samples: feature transformation and feature selection. Features produced by feature transformation are combinations of the original features, whereas features obtained by feature selection are simply a subset of the original features. As a dimensionality reduction method, feature selection can effectively remove redundant features that are irrelevant to the classification task and retain a small number of key features, which not only reduces the computational complexity of classification or clustering, but also improves the accuracy of machine learning algorithms. Compared with feature transformation, the selected features are easier to interpret, since they are a subset of the original high-dimensional features and thus have specific physical meanings. Feature selection has therefore attracted much attention from researchers.

From the point of view of combinatorial optimization, feature selection is an NP-hard problem [1]. Essentially, feature selection aims to rank features by their importance and then select the most important features for subsequent data analysis. Therefore, feature selection methods are derived from different feature importance evaluation criteria. According to whether the feature selection process is independent of the subsequent training of the learning model, feature selection methods can be divided into two categories: wrapper methods and filter methods. According to whether sample labels are used in the feature selection process, they can be divided into supervised and unsupervised methods. Generally, wrapper methods are commonly used in the design of supervised feature selection algorithms, and filter methods are commonly used in the design of unsupervised feature selection algorithms.

Supervised methods usually select features based on correlations between features and labels. For example, Relief, proposed by Kira et al. [2], first assigns a weight to each feature according to such correlations and then deletes irrelevant features using a preset threshold. The original Relief algorithm only applies to binary classification problems; in order to deal with multi-class problems, Kononenko [3] proposed an improved method named Relief-F. FOCUS-SF [4] finds the minimum subset of features that is consistent with the labels by exhaustively searching all feature subsets. Due to its high computational complexity, FOCUS-SF is not suitable for high-dimensional feature selection. Correlation-based feature selection (CFS) [5] takes the feature-feature and feature-class correlations as the evaluation index of feature importance, and the optimal feature subset is then found under a specific search strategy. Fast correlation-based filter (FCBF) [6] introduces the concept of dominant correlation to evaluate feature importance. Since feature selection can be seen as a special kind of subspace learning, Nie et al. [7] formulate the feature selection process as a trace ratio criterion-based graph embedding optimization problem, in which each column of the projection matrix from the high-dimensional data to the low-dimensional representations is constrained to be a one-hot code.

In many applications, a large amount of data is unlabeled, since obtaining sample labels is time-consuming, so unsupervised feature selection methods have a wider range of applications. For example, Wei et al. [8] propose a maximum overall dependence-based forward orthogonal search algorithm (FOS-MOD) for feature selection, which takes the ability of one feature to represent the others as the feature importance evaluation index. In FOS-MOD, the similarity between two features is measured by the squared correlation coefficient, and the optimal features are obtained by selecting the features with the largest squared correlation coefficients step by step. Li et al. [9] take the sample margin and the hypothesis margin of each feature as feature importance evaluation indices, respectively, and then select features with a sequential backward method; the selected features are classified by a support vector machine (SVM). Recently, hybrid feature selection algorithms have gained great importance in terms of timeliness. For example, Brahim et al. [29] proposed a feature selection method to design an intelligent sleep scoring assistance system. Based on instance learning, Sen et al. [30] proposed a filter-wrapper method for feature selection by cooperative subset search.

Information theory is a powerful mathematical tool for describing the interaction between variables. Based on concepts from information theory, such as mutual information and entropy, many unsupervised feature selection algorithms have been developed. For example, Dash et al. [10] proposed a consistency measure of feature subsets for arbitrary search strategies, which assumes that samples with the same values on the selected feature subset should belong to the same category. In addition, the concept of entropy can be used to measure whether a dataset has obvious clusters, which can also serve as a feature importance measure [11]. Mitra et al. [12] proposed an unsupervised feature selection method based on the maximum information compression coefficient, which is defined as the minimum eigenvalue of the covariance matrix of two random variables. Since the projection direction corresponding to the minimum eigenvalue is orthogonal to the directions of the principal components, it can be used to measure the dissimilarity between two feature vectors, and redundant features can then be eliminated. Peng et al. [13] proposed a mutual information-based maximum statistical dependency criterion for incremental feature selection. For computational efficiency, maximum statistical dependency can be transformed into the minimum redundancy maximum relevance (mRMR) model, which has been successfully used for feature selection on handwritten digit images [14]. Based on mutual information, Xu et al. [15] use minimum redundancy and maximum relevance to evaluate feature importance, where relevance is the degree of dependence between a feature vector and its potential class, and redundancy is the degree of dependence between two features; both can be measured by mutual information. Bandyopadhyay et al. [16] proposed an unsupervised feature selection method based on dense subgraph discovery, in which each feature vector is viewed as a vertex of a graph and the mutual information between two feature vectors as the weight of the corresponding edge. After the dense subgraphs are found, the optimal features are selected by clustering.

Since manifolds can model the low-dimensional structure of datasets, manifold learning has been used in unsupervised feature selection. Because intra-class samples are also locally close to each other, the Laplacian Score [17] uses the locality preserving ability of a feature to describe its importance. Based on spectral graph theory, Zhao and Liu [18] unify supervised and unsupervised feature selection into one framework: after the similarity between two samples is defined, the structure of the dataset can be described as a graph, and the normalized graph cut can then be used as the feature importance measure. In order to fully exploit the discriminant structure of datasets, Cai et al. [19] proposed the multi-cluster feature selection method (MCFS), in which the original high-dimensional dataset is first projected into a low-dimensional space by spectral embedding; the linear dependence between a high-dimensional sample and its low-dimensional representation is then obtained by solving an \(L_{1}\)-norm regularized regression problem, and the features corresponding to the largest sparse representation coefficients are selected as the optimal features. Yang et al. [20] propose an unsupervised discriminant feature selection method (UDFS), in which samples are assumed to be linearly separable; according to the linear relationship between a sample and its local label, the optimal discriminant feature subset is obtained by maximizing the local interclass scatter and minimizing the local intra-class scatter, while simultaneously minimizing the \(L_{2,1}\) norm of the linear classification coefficient matrix. In addition, since the label information of samples can be obtained through clustering and then used to guide discriminant feature extraction, Li et al. [21] proposed a nonnegative discriminant feature selection method (NDFS), which unifies spectral clustering and feature selection into one optimization objective, with the cluster indicator matrix constrained to be nonnegative. Similarly, Du et al. [22] proposed a local and global discrimination learning-based feature selection method (LGDFS), in which weighted \(L_{2}\)-norm regularized regression models are optimized locally and globally simultaneously, and the optimal feature subset is then obtained from the feature indicator matrix. Many manifold learning-based feature selection methods can be unified into a similarity preserving feature selection (SPFS) framework [23], which is essentially equivalent to a multivariate multi-output regression problem; different constraints, regularization conditions and optimization strategies result in different feature selection methods.

The integration of clustering and unsupervised feature selection can improve the discriminant ability of the selected feature subset. Liu et al. [24] proposed a K-means-based feature selection method (KFS) for text clustering: after different clustering results are obtained by varying K and the initialization samples, the importance of each feature is computed by the \(\chi^{2}\) statistic on these clustering results, and the optimal feature subset is obtained by ranking the sum of the feature importance values. Similar to principal component analysis, principal feature analysis (PFA) [25] projects each feature vector into the subspace with maximum variance, where all features are clustered by K-means, and the optimal feature subset is then obtained from the distance between each feature vector and its corresponding cluster center. Song et al. [26] proposed a clustering-based fast feature selection algorithm (FAST), which divides the feature set into clusters by a minimum spanning tree-based clustering method and then selects from each cluster the most representative features related to classification. Yan and Yang [27] proposed a sparse discriminant feature selection method (SDFS) that minimizes intra-class reconstruction residuals and maximizes interclass reconstruction residuals, in which \(L_{2,1}\)-norm minimization removes redundant features effectively.

However, manifold learning methods rely heavily on the construction of the data graph and are very sensitive to noise and corruption. On the other hand, the features related to a classification or clustering task are often correlated with each other, so most feature selection methods cannot effectively reduce the redundancy of the selected feature subset. This paper proposes a decision graph-based feature selection method (DGFS). The decision graph is a powerful tool for discovering the clustering structure of a feature set: the feature at the center of each cluster is the most representative and has minimum redundancy with respect to the others. Compared with other methods, DGFS has an intuitive principle and simple computation. Classification experiments on face datasets and UCI datasets show that DGFS can effectively reduce the redundant information contained in the feature set and that the selected features have better discriminant ability.

2 Decision graph-based feature selection

2.1 Problem formulation

For a concise description, the observed samples are represented as a data matrix, i.e., n m-dimensional samples are denoted as \(X = (x_{1} ,x_{2} , \ldots ,x_{n} ) \in R^{m \times n}\), where each column of X is an observed sample and each row of X is a feature vector, i.e., one attribute of the observed samples. If the ith feature vector of dataset X is denoted as \(f_{i}\), then the data matrix can also be written as \(X = (f_{1} ;f_{2} ; \ldots ;f_{m} )\). Feature selection aims to select r features from the m features according to a specific feature importance evaluation criterion, such that the redundancy or correlation among these r features is small while they preserve most of the information contained in the original dataset.
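As a minimal illustration of this notation (the matrix sizes and selected indices below are purely hypothetical), the following sketch stores one sample per column and keeps r of the m rows, i.e., r feature vectors:

```python
# Sketch of the notation: X holds n samples as columns, so each of the m rows
# of X is a feature vector f_i; selecting r features keeps r rows of X.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 10, 3                # number of features, samples, features to keep
X = rng.standard_normal((m, n))   # data matrix, one observed sample per column

selected = [0, 2, 5]              # hypothetical indices of the r chosen features
Y = X[selected, :]                # reduced feature matrix of shape (r, n)
print(Y.shape)                    # (3, 10)
```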

2.2 Decision graph

In order to recognize the most discriminant features in the original high-dimensional data samples, the concept of the decision graph [28] is introduced as follows.

Definition 1

The local density \(\rho_{i}\) of feature vector \(f_{i}\) is defined as

$$\rho_{i} = \sum\limits_{j = 1}^{m} {\theta \left( {d_{c} - d_{ij} } \right)}$$
(1)

where \(d_{ij}\) is the Euclidean distance between feature vectors \(f_{i}\) and \(f_{j}\), \(d_{c} > 0\) is a preset threshold or truncated distance, and \(\theta \left( \cdot \right)\) is an indicator function defined as:

$$\theta \left( t \right) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {t \ge 0} \hfill \\ 0 \hfill & {t < 0} \hfill \\ \end{array} } \right.$$

According to Definition 1, the local density of each feature vector is the number of features contained in a neighborhood ball centered at that feature vector with radius \(d_{c}\). A large \(\rho_{i}\) indicates a dense feature distribution in the neighborhood of feature vector \(f_{i}\). Therefore, it is reasonable to use \(\rho_{i}\) as a descriptor of the distribution of the feature set. In applications, other similar definitions of local density can also be used, such as the Gaussian kernel form

$$\rho_{i} = \sum\limits_{j = 1}^{m} {e^{{ - \left( {\frac{{d_{ij} }}{{d_{c} }}} \right)^{2} }} }$$
(2)
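As a rough sketch of Definition 1 (the function name and the removal of the \(j = i\) term are choices of this sketch, not prescribed above), the local densities of all feature vectors can be computed from their pairwise Euclidean distance matrix as follows:

```python
# Local density from an m x m distance matrix D and truncated distance d_c > 0.
import numpy as np

def local_density(D, d_c, kernel="cutoff"):
    """Eq. (1): count of features within d_c of f_i; or the smooth Gaussian
    variant of Eq. (2): sum_j exp(-(d_ij / d_c)**2)."""
    if kernel == "cutoff":
        rho = (D <= d_c).sum(axis=1).astype(float)  # indicator theta(d_c - d_ij)
    else:
        rho = np.exp(-(D / d_c) ** 2).sum(axis=1)   # Gaussian kernel form
    return rho - 1.0  # drop the j = i contribution (a constant shift for all i)
```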

Definition 2

The discriminant distance \(\delta_{i}\) of feature vector \(f_{i}\) is defined as:

$$\delta_{i} = \left\{ {\begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{{j:\rho_{j} > \rho_{i} }} \left( {d_{ij} } \right)} \hfill & {\rho_{i} \ne \mathop {\hbox{max} }\limits_{j} \left( {\rho_{j} } \right)} \hfill \\ {\mathop {\hbox{max} }\limits_{j} \left( {d_{ij} } \right)} \hfill & {\rho_{i} = \mathop {\hbox{max} }\limits_{j} \left( {\rho_{j} } \right)} \hfill \\ \end{array} } \right.$$
(3)

According to Definition 2, the discriminant distance of each feature vector is its distance to the nearest feature vector with higher local density. In particular, the discriminant distance of the feature vector with the maximum local density is the maximum distance between it and the other feature vectors. Feature vectors with large local density may lie in the same cluster or in different clusters; if their discriminant distances are also large, the probability that they lie in different clusters is high. For this reason, the discriminant distance of each feature vector characterizes the separability of the different clusters in the dataset.
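A corresponding sketch of Definition 2 is given below; it returns, for each feature, the distance to its nearest feature with higher local density, falling back to the largest distance for the densest feature (the function name and the handling of density ties are assumptions of this sketch):

```python
# Discriminant distance delta_i from the distance matrix D and densities rho.
import numpy as np

def discriminant_distance(D, rho):
    m = len(rho)
    delta = np.zeros(m)
    for i in range(m):
        denser = np.where(rho > rho[i])[0]   # features denser than f_i
        if denser.size > 0:
            delta[i] = D[i, denser].min()    # nearest feature with higher density
        else:
            delta[i] = D[i].max()            # f_i has the maximum local density
    return delta
```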

Definition 3

The decision graph of feature set X is a scatter plot of the pairs \(\left( {\rho_{i} ,\delta_{i} } \right)\), in which the horizontal coordinate \(\rho_{i}\) represents the local density of feature vector \(f_{i}\), and the vertical coordinate \(\delta_{i}\) represents the discriminant distance of feature vector \(f_{i}\).

According to Definition 3, the points in the top-right corner of the decision graph have both large local densities and large discriminant distances, and they have a high probability of being cluster centers of the feature set. Therefore, it is intuitive to evaluate feature importance with the decision graph. For example, 40 points are generated randomly in the two-dimensional plane from two Gaussian distributions with different means and covariance matrices; the resulting two clusters of 20 normally distributed points each are numbered and shown in different colors in Fig. 1. From Fig. 1, we can intuitively see that points with small distances have large similarities. The decision graph of the points in Fig. 1 is shown in Fig. 2, in which point 7 and point 26 are clearly separated from the others, and both are likely cluster centers. This agrees with the real case: compared with Fig. 1, point 7 is the cluster center of the first 20 points marked in red, and point 26 is the cluster center of the last 20 points marked in blue. Therefore, it is reasonable to identify cluster centers from the decision graph.
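The construction of the decision graph can be reproduced in a few lines; in the sketch below, the random seed, means and covariances are illustrative and do not match Figs. 1 and 2 exactly, but the two points with the largest \(\rho_{i} \delta_{i}\) products should again lie near the two cluster centers:

```python
# Two Gaussian clusters of 20 points each; find the likely cluster centers
# from the (rho, delta) pairs of Eqs. (2) and (3).
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                 rng.normal([4, 4], 0.5, (20, 2))])
D = squareform(pdist(pts))                       # 40 x 40 Euclidean distances
d_c = np.percentile(pdist(pts), 2)               # small truncated distance

rho = np.exp(-(D / d_c) ** 2).sum(axis=1) - 1.0  # Gaussian local density
delta = np.array([D[i, rho > rho[i]].min() if (rho > rho[i]).any() else D[i].max()
                  for i in range(len(rho))])     # discriminant distance

print(np.argsort(rho * delta)[::-1][:2])         # indices of the two likely centers
```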

Fig. 1 Sample points in two-dimensional plane

Fig. 2 Examples for decision graph

2.3 Feature importance criteria

First, suppose that the feature set has a typical cluster structure, i.e., feature vectors with the same or similar descriptive abilities are clustered together. Then, the feature vector located at each cluster center can be viewed as the most representative feature of that cluster. Since the distance between feature vectors with low correlation or different descriptive abilities should be large, once the clusters of the feature set are recognized, the subset formed by the cluster centers can sufficiently characterize the discriminant ability of the original feature set.

Based on the above idea, a feature importance evaluation criterion called the decision graph score is presented as follows.

Definition 4

The decision graph score \(\gamma_{i}\) of feature vector \(f_{i}\) is defined as:

$$\gamma_{i} = \rho_{i} \cdot \delta_{i}$$
(4)

According to Definition 4, the larger the local density and discriminant distance of a feature vector are, the higher its decision graph score is; such features correspond to the top-right area of the decision graph and are more important. By computing the decision graph score of each feature vector, the features corresponding to the r largest scores can be selected as the optimal feature subset, in which the correlation between features is small and which serves as a good representative of the information in the original feature set. Figure 3 shows the decision graph scores of the 40 sample points in Fig. 1 in descending order, where the two sample points with the highest decision graph scores are No. 37 and No. 4; they deviate from the true cluster centers (No. 7 and No. 26) but are very close to them.

Fig. 3 Examples for decision graph score

Based on the above definitions and analysis, the procedure of the DGFS algorithm is summarized as follows:

Input: feature matrix X, number of selected features r;

Output: feature subset Y.

Procedure:

Step 1: Compute local densities and discriminant distances of feature matrix X according to Eqs. (1) and (3).

Step 2: Compute decision graph score \(\gamma_{i} = \rho_{i} \cdot \delta_{i}\) according to Eq. (4).

Step 3: Sort the decision graph scores in descending order, and select the features with the r largest scores as the optimal feature subset Y.
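A minimal end-to-end sketch of this procedure is given below, assuming the Gaussian density of Eq. (2) and the 2% truncated-distance rule used later in Sect. 3.1; the function name dgfs and its default choice of \(d_{c}\) are conveniences of the sketch rather than part of the formal algorithm:

```python
# DGFS sketch: rows of X are feature vectors; return indices of the top-r features.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dgfs(X, r, d_c=None):
    D = squareform(pdist(X))                       # distances between feature rows
    if d_c is None:
        d_c = np.percentile(pdist(X), 2)           # 2% rule from the experiments
    rho = np.exp(-(D / d_c) ** 2).sum(axis=1) - 1  # Step 1a: local density, Eq. (2)
    delta = np.empty(len(rho))
    for i in range(len(rho)):                      # Step 1b: discriminant distance
        denser = rho > rho[i]
        delta[i] = D[i, denser].min() if denser.any() else D[i].max()
    gamma = rho * delta                            # Step 2: decision graph score
    return np.argsort(gamma)[::-1][:r]             # Step 3: r largest scores

# Illustrative call on random data: keep 3 of 6 features.
X = np.random.default_rng(0).standard_normal((6, 100))
print(dgfs(X, r=3))
```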

3 Experimental results

3.1 Experimental setting

In order to verify the effectiveness of the proposed method, we compare DGFS with other unsupervised feature selection methods, such as the Laplacian Score, multi-cluster feature selection (MCFS), unsupervised discriminant feature selection (UDFS) and nonnegative discriminant feature selection (NDFS). Before feature selection, duplicate rows and columns of the data matrix are removed, and all feature vectors are then normalized to unit norm. A good feature selection method should allow the selected feature subset to achieve good classification results even with a simple classifier, such as the k-nearest neighbor classifier. In the experiments, fivefold cross-validation is used to evaluate each method. First, the original high-dimensional dataset is divided randomly into five parts; four parts are used as the training set and the remaining part as the test set in turn. Then, each feature selection method is run on the training set of each partition, and the indices of the selected features are applied to the test set. Finally, the average classification accuracy of the nearest neighbor classifier on the selected features is reported. To explore the statistical significance of the results, we performed t tests to statistically compare the methods on multiple datasets.
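The evaluation protocol can be sketched as follows; here select_features stands for any of the compared selectors (for example the dgfs sketch above) and is an assumption of this example, not code from the compared papers. Feature indices are chosen on the training folds only and reused on the held-out fold, which is scored with a 1-nearest-neighbor classifier:

```python
# Fivefold cross-validated accuracy of a 1-NN classifier on selected features.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def cv_accuracy(X, y, select_features, r, seed=0):
    """X: (n_samples, n_features) array; y: class labels; r: number of features."""
    accs = []
    for tr, te in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, y):
        idx = select_features(X[tr].T, r)          # select on training folds only
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[tr][:, idx], y[tr])
        accs.append(clf.score(X[te][:, idx], y[te]))
    return float(np.mean(accs))
```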

In the experiments, the parameters of each algorithm are set empirically as follows. In DGFS, the local density of each feature vector is computed by Eq. (2), where the truncated distance \(d_{c}\) is set to the distance at the 2% position of all pairwise distances between feature vectors sorted in ascending order. In the Laplacian Score, the neighborhood parameter k is set to 5, and the similarity between feature vectors is computed by the cosine metric. In MCFS, the neighborhood parameter k is set to 5; in UDFS, the regularization parameter is set to 0.01; and in NDFS, the neighborhood parameter is set to 5, the similarity between feature vectors is computed by the cosine metric, the maximum number of iterations is set to 30, and the regularization parameter is set to 0.1.

3.2 Classification results on face datasets

A grayscale image dataset is typically high-dimensional after the columns of each image are stacked into a vector, and it contains a large number of redundant, irrelevant and noisy pixels, so it is necessary to select features before image analysis. The classification results on six benchmark face image datasets, namely ORL, YaleB, Altkom, PIE, AR and MPEG7, are reported in this section.

The face datasets used in the experiments are described in Table 1. The ORL dataset (Footnote 1) was created by AT&T Laboratories Cambridge (formerly the Olivetti Research Laboratory) and contains 400 images of 40 individuals, with variations in facial expression (open or closed eyes, smiling or not), occlusion (with or without glasses) and slight changes of pose. The YaleB dataset (Footnote 2) was created by the Center for Computational Vision and Control at Yale University and contains images of 38 individuals under strictly controlled illumination and pose conditions. The Altkom dataset (Footnote 3) contains 80 individuals with 15 images per individual. The PIE dataset (Footnote 4) was created at Carnegie Mellon University and contains 41,368 face images of 68 individuals under strictly controlled pose, illumination and expression conditions. The AR dataset (Footnote 5) contains more than 4000 frontal images of 126 persons (70 men and 56 women) with different facial expressions, lighting conditions and occlusions; in the experiments, we use a subset containing 50 males and 50 females, and for each person, 14 images with only illumination and expression changes are selected. The MPEG-7 face image content set (Footnote 6) was provided by the Heinrich Hertz Institute in Germany and contains 3175 face images of 635 persons.

Table 1 Face datasets description

The classification results of each algorithm on the different datasets are given in Table 2, which reports the average classification accuracy (%), standard deviation (%) and the corresponding number of features that achieved the highest average classification accuracy over the fivefold cross-validation experiments. From Table 2, we can see that the average classification accuracy of the Laplacian Score is much lower than that of the other methods, while in most cases the proposed DGFS method achieves the highest average classification accuracy. To identify the pairwise significance of the differences between the proposed DGFS method and the compared methods, a t test is used to test the null hypothesis that the pairwise difference of the optimal average accuracies between two methods has zero mean. The t test results at the 5% significance level on the six face datasets are reported in Table 3, where the symbol '+' denotes that DGFS significantly outperforms the compared method according to the t test, '−' denotes that the compared method outperforms DGFS, and '=' denotes that there is no statistically significant difference between the results of DGFS and the compared method. The p values obtained in the t tests are reported in parentheses. Table 3 shows that, in most cases, the proposed DGFS method achieves statistically higher accuracy than the compared methods on the six face datasets: compared with the Laplacian Score and mRMR, DGFS is always better, while there is only one case in which each of MCFS, UDFS and Relief-F performs better than DGFS.

Table 2 Experimental result comparisons on face datasets
Table 3 Statistical test results on face datasets

In order to analyze the physical meaning of the features selected by each method intuitively, the optimal selected features, marked with red stars, are shown for the six face datasets in Figs. 4, 5, 6, 7, 8 and 9. The features selected by the Laplacian Score are located in continuous local areas of the face images; the pixels concentrated in these locations change little and have a typical manifold structure, but they cannot distinguish between different faces. The pixel features selected by MCFS, UDFS and NDFS are more dispersed, while the pixels selected by DGFS are mostly concentrated on the eyes, nose and mouth, which shows that these parts play a key role in distinguishing different facial images and is consistent with human cognitive experience.

Fig. 4 Feature selection on ORL dataset

Fig. 5 Feature selection on YaleB dataset

Fig. 6 Feature selection on Altkom dataset

Fig. 7 Feature selection on PIE dataset

Fig. 8 Feature selection on AR dataset

Fig. 9 Feature selection on MPEG7 dataset

In addition, Fig. 10 shows how the average classification accuracy changes with the number of selected features on each face dataset. In the experiments, the maximum number of selected features is set to 100. As shown in Fig. 10, the classification accuracy increases with the number of features: within the first 20 features, the accuracy increases quickly; after that, it increases slowly, and once the best result is reached, the accuracy of most methods remains stable. From Table 2 and Fig. 10, we can see that, in most cases, the proposed DGFS method not only selects a much smaller number of features but also achieves better classification performance.

Fig. 10 Average accuracy versus different numbers of selected features on face dataset

Table 4 lists the CPU time in seconds of the different algorithms on the six datasets. UDFS and NDFS perform the worst in terms of CPU time, while the computational costs of the other methods are comparable. This is because the iterative optimization processes in the UDFS and NDFS models are time-consuming when the dimensionality of the samples is large.

Table 4 Computational time costs comparisons on face datasets (s)

3.3 Classification results on UCI datasets

To further compare the feature selection algorithms, ten commonly used UCI datasets (Footnote 7) from real-world applications, such as the health, political and economic fields, are used in the experiments; they are listed in Table 5. The Heart dataset contains diagnosis data of patients with heart disease; the Vote dataset contains voting records of Republican and Democratic members of Congress; the Dermatology dataset contains clinical and histopathological data for six kinds of skin diseases; the Australian dataset contains Australian credit approval data; the Wine dataset is the result of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars; the Credit dataset concerns credit card applications; the Car dataset is a car evaluation database; the E. coli dataset contains protein localization site information; the Seeds dataset contains measurements of geometrical properties of kernels belonging to three different varieties of wheat; and the WDBC dataset is the Wisconsin diagnostic breast cancer database. In the preprocessing step, missing values in the original datasets are set to 0, and non-numeric categorical attributes are encoded as integers.

Table 5 UCI datasets description

The classification results of each algorithm on the different UCI datasets are given in Table 6, which reports the average classification accuracy (%), standard deviation (%) and the corresponding number of selected features that achieved the highest average classification accuracy over the fivefold cross-validation experiments. The significance of the differences in average accuracy is assessed by the t test, in which the null hypothesis is that the pairwise difference of the optimal average accuracies between any two methods comes from a normal distribution with zero mean and unknown variance, at the 5% significance level. As shown in Table 7, DGFS performs better than the other methods.

Table 6 Experimental result comparisons on UCI datasets
Table 7 Statistical test results on UCI datasets

Figure 11 shows the average classification accuracies with different numbers of selected features. As shown in Fig. 11, on the Heart and Wine datasets, DGFS performs significantly better than the other methods and reaches high accuracy with fewer features. On the Vote, Dermatology and Seeds datasets, the accuracies of the different methods tend to converge as the number of features increases. On the Australian dataset, the impact of the number of features on the results is quite unstable; for instance, for DGFS, MCFS and Relief-F, the classification performance tends to decrease despite the increasing number of features. On the Credit dataset, DGFS achieves a high accuracy, and NDFS obtains a value close to it using 7 features. On the Car dataset, DGFS and MCFS have comparable performance and outperform the others. On the E. coli dataset, the proposed method achieves high accuracy using only one feature. On the WDBC dataset, DGFS obtains high accuracy using 20 features, while the other methods perform comparably as the number of features increases.

Fig. 11 Average accuracy versus different numbers of selected features on UCI dataset

4 Discussion and conclusion

In this paper, an unsupervised feature selection method based on the concept of the decision graph is proposed. Since the process of feature selection can be viewed as feature clustering, the features at cluster centers are not only representative of the other features in the same cluster but also discriminative with respect to the features in other clusters. Therefore, the selected features have less redundancy and preserve the inherent information contained in the original feature set. To identify the cluster centers of the feature set, we introduced the definitions of local density and discriminant distance, which are used to construct the decision graph for identifying cluster centers. Then, an index named the decision graph score is proposed to evaluate feature importance, and feature selection is achieved by ranking the decision graph scores.

In the experiments, the performance of the DGFS method is evaluated on 16 publicly available real-life datasets, including 6 face datasets and 10 UCI datasets. The number of features of these datasets varies from 6 to 4096, and the number of samples ranges from 178 to 11,554. From Tables 3 and 7, we can see that, at the 0.05 significance level, the proposed DGFS is statistically superior to state-of-the-art feature selection methods for data classification, regardless of the dimensionality and distributional shape of the data samples.