
1 Introduction

Nowadays, the Internet of things (IoT) has developed rapidly, and enterprises and industries collect a large amount of data [1]. In the information age, people’s daily lives and devices connected to the Internet also produce a large amount of data. These data meet the needs of machine learning (ML), but the volume is too large to label every piece of data, not all features are useful, and some data even contain errors [2]. Another significant problem is that meaningless data, which may have a negative impact on learning tasks, need to be removed. Therefore, a method is needed that can extract useful subsets from the original data and reduce the impact of redundant data, so as to improve ML accuracy and stability. Feature selection (FS) aims to reduce the data dimensionality by removing irrelevant features, so it is very suitable for solving these problems. Because deep neural networks have proved to work well in processing massive data, FS methods based on deep neural networks have been proposed to deal with these problems [3,4,5]. In recent years, many constructive methods integrating leading ML research have been proposed, such as the Knockoff generative adversarial network (GAN) and the FS-guided auto-encoder.

In order to have a better understanding of FS, this study raises three general questions (GQ), which focus on FS as a whole and on domain-specific methods. Table 1 lists the general questions. Answering these questions gives a general understanding of FS.

Table 1. General questions.

GQ1: What is the main purpose of FS?
GQ2: What are the search strategies for FS?
GQ3: What are the evaluation methods for FS?

The remainder of the paper is organized as follows. Section 2 presents search strategies and sources. In Sect. 3, review results are discussed for these general questions. The conclusions and future work are given in Sect. 4.

2 Sources

The articles involved in this study are from journals and conferences, found by searching online academic search engines (e.g., Web of Science). The keywords used for searching include FS, search strategy, filter, wrapper, and evaluation methods. When selecting papers, the quality, number of citations, and contribution of the algorithms are considered.

3 Results and Discussions

In this section, the relevant papers have been reviewed and discussed to answer the three general questions in Table 1.

3.1 GQ1-What Is the Main Purpose of FS?

Due to the rapid development of the IoT and information technology, the data from various collection devices usually contain a large amount of redundant and erroneous information. This kind of data may have an adverse impact on big data analysis. To solve this problem, the data are usually pre-processed by dimension reduction. There are two popular dimension reduction methods: feature extraction (FE) and FS. The result of FS is a subset of the original features, while the result of FE is a mapping of the original data into a new space. Figure 1 shows the process of FS. FS aims to approximate the original data by selecting a set of essential features, so an FS method can make a model more reliable and more interpretable. In FS, search strategies and evaluation methods are critical to the quality of the selection results.

Fig. 1. The process of FS.

3.2 GQ2-What Are the Search Strategies for FS?

The current research generally divides the search strategy of FS into three parts: (1) global search strategy, (2) random search strategy, and (3) heuristic search strategy [6].

The global search strategy analyzes and evaluates every candidate feature subset to decide whether each feature should be selected. Generally, there are the enumeration method and the branch-and-bound method [7]. The enumeration method is easy to implement but consumes a lot of time. The branch-and-bound method is preferred when the amount of data is large; it defines an evaluation function and the number of features to be selected before the algorithm runs [8].
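
As a rough illustration of the enumeration strategy, the following Python sketch exhaustively scores every k-feature subset with a cross-validated classifier; the dataset, the classifier, and the subset size k are illustrative assumptions, not taken from the surveyed work.

```python
# A minimal sketch of the enumeration (exhaustive) global search strategy,
# assuming a generic scikit-learn classifier as the evaluation function.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
k = 2  # number of features to select, fixed before the search runs

best_score, best_subset = -np.inf, None
for subset in combinations(range(n_features), k):   # enumerate every k-subset
    score = cross_val_score(LogisticRegression(max_iter=200),
                            X[:, subset], y, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, subset

print("best subset:", best_subset, "score:", round(best_score, 3))
```

The cost grows combinatorially with the number of features, which is why the branch-and-bound method, with its predefined evaluation function and subset size, is preferred for larger data.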

The random search strategy weights each feature according to its validity and selects features according to a defined threshold. It usually combines the FS problem with other artificial intelligence algorithms such as the simulated annealing algorithm (SAA) and the genetic algorithm (GA) [9]. Consequently, if the number of epochs in these algorithms is too small, the accuracy of the results may be unstable; when the number of epochs is large enough and the parameter settings are appropriate, random search methods can give better results.
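
The following sketch illustrates one possible random search strategy based on simulated annealing, using a cross-validated classifier as the fitness function; the dataset, cooling schedule, and number of epochs are illustrative assumptions.

```python
# A rough simulated-annealing sketch of a random search strategy for FS.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
d = X.shape[1]

def fitness(mask):
    # Cross-validated accuracy of the subset encoded by the boolean mask.
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=500),
                           X[:, mask], y, cv=3).mean()

mask = rng.random(d) < 0.5            # random initial feature mask
score, T = fitness(mask), 1.0
for epoch in range(100):              # too few epochs gives unstable results
    cand = mask.copy()
    cand[rng.integers(d)] ^= True     # flip one feature in or out
    cand_score = fitness(cand)
    # Accept better subsets, and worse ones with a temperature-dependent chance.
    if cand_score > score or rng.random() < np.exp((cand_score - score) / T):
        mask, score = cand, cand_score
    T *= 0.95                         # cool down

print("selected features:", np.flatnonzero(mask), "cv accuracy:", round(score, 3))
```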

The heuristic search strategy builds the feature subset sequentially: a forward selection step adds one feature at a time and a backward selection step removes one feature at a time, until the optimal feature subset is reached [10]. Four types of heuristic search methods are sequential forward selection (SFS), generalized sequential forward selection (GSFS), sequential backward selection (SBS), and generalized sequential backward selection (GSBS).
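
A minimal sketch of SFS and SBS, assuming scikit-learn's SequentialFeatureSelector; the classifier and the target subset size are illustrative assumptions.

```python
# Heuristic search strategy: sequential forward and backward selection.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# SFS: start from an empty set and add one feature at a time.
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward").fit(X, y)
# SBS: start from the full set and remove one feature at a time.
sbs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="backward").fit(X, y)

print("SFS picks:", sfs.get_support(indices=True))
print("SBS picks:", sbs.get_support(indices=True))
```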

3.3 GQ3-What Are the Evaluation Methods for FS?

The evaluation methods of FS are generally divided into filter methods, wrapper methods, and embedded methods [11, 12].

Filter Methods.

The filter method generally uses the statistical properties of the training data directly to evaluate features, so it is independent of the subsequent learning algorithm. It can quickly remove redundant features from large data sets [13]. The variance threshold method (VTM) and the mutual information method (MIM) are introduced in this subsection.

Variance Threshold Method.

The VTM can calculate the variance of each feature and select the features whose variance is greater than the defined threshold [14].

The variance of a feature is defined as Eq. (1).

$$ V\left( t \right) = \frac{1}{m}\sum\limits_{i = 1}^{m} {\left( {f_{i} \left( t \right) - \bar{f}\left( t \right)} \right)^{2} } $$
(1)

In Eq. (1), m is the number of categories, and \( f_{i} \left( t \right) \) denotes the frequency of feature t in the i-th category. \( \bar{f}\left( t \right) \) denotes the average frequency of feature t, as defined in Eq. (2).

$$ \bar{f}\left( t \right) = \frac{1}{m}\sum\limits_{i = 1}^{m} {f_{i} \left( t \right)} $$
(2)

The variance \( V\left( t \right) \) describes the dispersion of feature t: a larger variance \( V\left( t \right) \) means the distribution of feature t over the m categories is more uneven, i.e., the feature is concentrated in a few categories. Equation (3) can be used to normalize the variances \( V\left( t \right) \) to prevent high-frequency features from suppressing low-frequency ones.

$$ N\left( t \right) = \frac{{\frac{1}{m}\sum\limits_{i = 1}^{m} {\left( {f_{i} \left( t \right) - \bar{f}\left( t \right)} \right)^{2} } }}{{\frac{1}{m}\sum\limits_{i = 1}^{m} {\left( {f_{i} \left( t \right)} \right)^{2} } }} = \frac{V\left( t \right)}{{\frac{1}{m}\sum\limits_{i = 1}^{m} {\left( {f_{i} \left( t \right)} \right)^{2} } }} $$
(3)

The distribution probability of feature t in category \( c_{j} \) is defined as Eq. (4).

$$ \theta_{j} = \frac{{f_{{c_{j} }} \left( t \right)}}{{\left| {c_{j} } \right|}} $$
(4)

In Eq. (4), \( f_{{c_{j} }} \left( t \right) \) denotes the frequency of feature t in category \( c_{j} \), and \( \left| {c_{j} } \right| \) denotes the total number of samples belonging to category \( c_{j} \) in the data. Therefore, the feature evaluation function is defined as Eq. (5).

$$ T\left( t \right) = V\left( t \right) \times \mathop {\hbox{max} }\limits_{1 \le j \le m} \theta_{j} $$
(5)

After calculating the evaluation score \( T(t) \) of all features, the features whose scores are greater than the defined threshold are selected.
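
As a worked illustration of Eqs. (1)-(5), the following sketch computes the evaluation score T(t) on a tiny, made-up frequency matrix; the toy data and the threshold are illustrative assumptions.

```python
# Variance threshold method (Eqs. (1)-(5)) on a toy frequency matrix F,
# where F[i, t] is the frequency of feature t in category i.
import numpy as np

F = np.array([[8., 0., 3.],       # m = 2 categories, 3 features
              [1., 7., 3.]])
cat_sizes = np.array([10., 10.])  # |c_j|: number of samples in each category
m = F.shape[0]

f_bar = F.mean(axis=0)                      # Eq. (2): mean frequency per feature
V = ((F - f_bar) ** 2).mean(axis=0)         # Eq. (1): variance per feature
N = V / ((F ** 2).mean(axis=0) + 1e-12)     # Eq. (3): normalized variance
theta = F / cat_sizes[:, None]              # Eq. (4): distribution probability
T = V * theta.max(axis=0)                   # Eq. (5): evaluation score

threshold = 1.0
selected = np.flatnonzero(T > threshold)    # keep features scoring above threshold
print("scores:", np.round(T, 3), "selected feature indices:", selected)
```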

Mutual Information Method.

The MIM represents the correlation between features. The information entropy of X can be estimated by Eq. (6), and the conditional information entropy of X given the known variable Y is calculated by Eq. (7) [15].

$$ H\left( {\mathbf{X}} \right) = - \sum\limits_{{x \in {\mathbf{X}}}} {\Pr \left( x \right) \times \log \Pr \left( x \right)} $$
(6)
$$ H\left( {{\mathbf{X}}\left| {\mathbf{Y}} \right.} \right) = - \sum\limits_{{y \in {\mathbf{Y}}}} {\Pr \left( y \right) \times \sum\limits_{{x \in {\mathbf{X}}}} {\Pr \left( {x\left| y \right.} \right) \times \log \Pr \left( {x\left| y \right.} \right)} } $$
(7)

In Eqs. (6) and (7), \( \Pr \left( x \right) \) denotes the probability that feature X takes the value x, and \( \Pr \left( {x\left| y \right.} \right) \) denotes the probability that feature X takes the value x when feature Y takes the value y. The mutual information \( I\left( {{\mathbf{X}};{\mathbf{Y}}} \right) \) between feature X and feature Y can be calculated by Eq. (8).

$$ \begin{aligned} I\left( {{\mathbf{X}};{\mathbf{Y}}} \right) = H\left( {\mathbf{X}} \right) - H\left( {{\mathbf{X}}\left| {\mathbf{Y}} \right.} \right) = H\left( {\mathbf{Y}} \right) - H\left( {{\mathbf{Y}}\left| {\mathbf{X}} \right.} \right) \hfill \\ { = }\sum\limits_{{y \in {\mathbf{Y}}}} {\sum\limits_{{x \in {\mathbf{X}}}} {\Pr \left( {x,y} \right) \times \log \frac{{\Pr \left( {x,y} \right)}}{\Pr \left( x \right)\Pr \left( y \right)}} } \hfill \\ \end{aligned} $$
(8)

Furthermore, the correlation between feature X and feature Y can be calculated by Eq. (9).

$$ R\left( {{\mathbf{X}};{\mathbf{Y}}} \right) = \frac{{I\left( {{\mathbf{X}};{\mathbf{Y}}} \right)}}{{H\left( {\mathbf{X}} \right) + H\left( {\mathbf{Y}} \right)}} $$
(9)

The value of \( R\left( {{\mathbf{X}};{\mathbf{Y}}} \right) \) lies in [0, 1]. If \( R\left( {{\mathbf{X}};{\mathbf{Y}}} \right) \) is 0, the two features are unrelated; if \( R\left( {{\mathbf{X}};{\mathbf{Y}}} \right) \) is 1, the two features are completely related.
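
A small sketch of the MIM quantities in Eqs. (6)-(9) for two discrete features; the toy feature vectors are illustrative assumptions.

```python
# Entropy, mutual information, and the correlation score of Eqs. (6)-(9).
import numpy as np

def entropy(x):
    # Eq. (6): H(X) = -sum Pr(x) log Pr(x)
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(x, y):
    # Eq. (8): I(X;Y) = sum_{x,y} Pr(x,y) log( Pr(x,y) / (Pr(x)Pr(y)) )
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum()

def correlation(x, y):
    # Eq. (9): R(X;Y) = I(X;Y) / (H(X) + H(Y))
    return mutual_information(x, y) / (entropy(x) + entropy(y))

x = np.array([0, 0, 1, 1, 2, 2, 0, 1])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])   # strongly related to x
print("R(X;Y) =", round(correlation(x, y), 3))
```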

Wrapper Methods.

Because the evaluation in wrapper methods depends on the training accuracy of the subsequent learning algorithm, they make fewer errors than filter methods. However, the computational cost of wrapper methods is high, so they are not suitable for large data sets.

In related research, Guyon et al. used the support vector machine (SVM) classification algorithm to measure the importance levels of features, and constructed a classifier with better classification performance [16]. Gui et al. proposed a method based on attention mechanism with high accuracy and stability for noisy and small data sets [17]. Ye et al. proposed a fast wrapper FS method called fast feature subset ranking (FFSR) [18].
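
As a hedged illustration of the wrapper idea in the spirit of Guyon et al.'s SVM-based approach, the following sketch uses scikit-learn's recursive feature elimination (RFE) with a linear SVM; the dataset and the number of retained features are illustrative assumptions.

```python
# Wrapper-style selection: rank features with a linear SVM and eliminate
# the least important ones recursively.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
svm = SVC(kernel="linear")   # linear SVM weights serve as feature importances

rfe = RFE(estimator=svm, n_features_to_select=10, step=1).fit(X, y)
print("selected features:", rfe.get_support(indices=True))
print("accuracy on selected subset:", round(rfe.score(X, y), 3))
```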

Embedded Methods.

Unlike the filter methods and the wrapper methods, the embedded methods combine variable selection with the training process, which gives them high efficiency and better integration ability. The embedded methods also consider the relationship between each pair of features and usually incorporate this relationship into the learning phase of another algorithm.

Several embedded unsupervised FS methods use regularization to select discrete features. However, by using regularization, the embedded methods usually require a higher computational cost to obtain the final results. An embedded end-to-end FS method offers a good solution to this problem: it introduces a concrete auto-encoder for differentiable FS and reconstruction [19], where the concrete auto-encoder consists of a concrete selector layer as the encoder and a standard neural network as the decoder.

Han et al. proposed an auto-encoder feature selector (AEFS) method which combines a group lasso task with an auto-encoder [20]. By exploiting both the linear and nonlinear information among features, AEFS can select the optimal features. Most traditional embedded methods (e.g., the least absolute shrinkage and selection operator (LASSO) [21]) only model linear relationships and ignore the nonlinear relationships between pairs of features. Therefore, neural networks are used to learn flexible nonlinear relationships between features, which allows various tasks to be handled flexibly (e.g., image segmentation [22] and image recognition [23, 24]). These studies used an auto-encoder to improve unsupervised learning and to select highly discriminative features.
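
As a simple illustration of an embedded method, the following sketch uses LASSO, where the L1 penalty performs feature selection during model training; the dataset and the value of alpha are illustrative assumptions.

```python
# Embedded selection with LASSO: the L1 penalty drives some weights to zero,
# so feature selection happens inside the training process itself.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.5).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
print("nonzero-weight features:", np.flatnonzero(lasso.coef_))
print("reduced data shape:", selector.transform(X).shape)
```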

Auto-Encoder Feature Selector.

The goal of AEFS is to select t optimal features from X. The matrix \( {\mathbf{X}} = [x_{1} ,x_{2} , \ldots ,x_{m} ]^{T} \) is the unlabeled sample matrix; d is the feature dimension of each sample, and m is the number of samples.

The auto-encoder in AEFS has two fully connected layers with an h-dimensional hidden layer; its encoding function \( g({\mathbf{X}}) \) and decoding function \( h\left( {g\left( {\mathbf{X}} \right)} \right) \) are given in Eqs. (10) and (11).

$$ g\left( {\mathbf{X}} \right) = \sigma_{1} \left( {{\mathbf{XW}}^{\left( 1 \right)} } \right) $$
(10)
$$ {\hat{\mathbf{X}}} = h(g({\mathbf{X}})) = \sigma_{2} (g({\mathbf{X}}){\mathbf{W}}^{(2)} ) $$
(11)

In Eqs. (10) and (11), \( \sigma_{1} \) and \( \sigma_{2} \) represent the activation functions of the hidden layer and the output layer; \( \varTheta = \left\{ {{\mathbf{W}}^{(1)} ,{\mathbf{W}}^{(2)} } \right\} \) is the set of weight parameters, and \( h(g({\mathbf{X}})) \) is the overall function of the auto-encoder. In the learning process, the least squares loss is used for the auto-encoder, expressed as Eq. (12).

$$ \Im \left( \varTheta \right) = \frac{1}{ 2m}\left\| {{\mathbf{X}} - h\left( {g\left( {\mathbf{X}} \right)} \right)} \right\|_{F}^{2} $$
(12)

The goal is to optimize the loss function. After the learning process, the auto-encoder reduces the dimension of the matrix X and outputs the decoded matrix \( {\hat{\mathbf{X}}} = h(g({\mathbf{X}})) \).

In the FS stage, the reconstructed data \( {\hat{\mathbf{X}}} \) are compared with the original data X after some features are discarded; if the two matrices are similar, the discarded features are redundant and can be removed. The matrix \( {\mathbf{W}}^{\left( 1 \right)} = \left[ {w_{1} ,w_{2} , \ldots ,w_{d} } \right]^{T} \) represents the weights between the neurons in the input layer and the neurons in the hidden layer. If \( \left\| {w_{i} } \right\|_{2} \approx 0 \), the i-th feature is only sparsely connected; otherwise, the connection is strong. In order to select the more important features from X, row-sparse regularization is applied to \( {\mathbf{W}}^{(1)} \). It is computed by Eq. (13), which takes the L2 norm of each row of \( {\mathbf{W}}^{(1)} \) and then the L1 norm of the resulting vector (\( \left\| {{\mathbf{W}}^{(1)} } \right\|_{2,1} \)). In Eq. (13), h is the dimension of the hidden layer, d is the feature dimension, and \( w_{ij}^{(1)} \) represents the weight from the i-th neuron in the input layer to the j-th neuron in the hidden layer.

$$ \left\| {{\mathbf{W}}^{(1)} } \right\|_{2,1} = \sum\limits_{i}^{d} {\sqrt {\sum\limits_{j}^{h} {(w_{ij}^{(1)} )^{2} } } } $$
(13)

Therefore, the Eq. (12) can be reformulated as Eq. (14).

$$ \Im \left( \varTheta \right) = \frac{1}{ 2m}\left\| {{\mathbf{X}} - h\left( {g\left( {\mathbf{X}} \right)} \right)} \right\|_{F}^{2} + \alpha \left\| {{\mathbf{W}}^{(1)} } \right\|_{2,1} $$
(14)

In the training process, in order to avoid the problem of overfitting, a weight decay term is necessary. The final loss function is defined as Eq. (15).

$$ \Im \left( \varTheta \right) = \frac{1}{ 2m}\left\| {{\mathbf{X}} - h\left( {g\left( {\mathbf{X}} \right)} \right)} \right\|_{F}^{2} + \alpha \left\| {{\mathbf{W}}^{(1)} } \right\|_{2,1} + \frac{\beta }{2}\sum\limits_{i = 1}^{2} {\left\| {{\mathbf{W}}^{(i)} } \right\|_{F}^{2} } $$
(15)

In Eq. (15), \( \alpha \) and \( \beta \) are penalty parameters that control the row-sparse regularization and the weight decay, respectively. AEFS [20] selects features by optimizing Eq. (15).
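
The following sketch implements the AEFS objective in Eq. (15) with a one-hidden-layer auto-encoder in PyTorch and ranks features by the row norms of W(1); the layer sizes, activations, penalty values, and optimizer are illustrative assumptions, not the settings used in [20].

```python
# A compact sketch of the AEFS objective (Eq. (15)) on random stand-in data.
import torch

m, d, h = 64, 20, 8                     # samples, feature dimension, hidden units
X = torch.randn(m, d)                   # stand-in for the unlabeled sample matrix
W1 = torch.randn(d, h, requires_grad=True)
W2 = torch.randn(h, d, requires_grad=True)
alpha, beta = 0.01, 0.001               # penalty parameters of Eq. (15)

optimizer = torch.optim.Adam([W1, W2], lr=1e-2)
for step in range(500):
    optimizer.zero_grad()
    X_hat = torch.sigmoid(X @ W1) @ W2                   # Eqs. (10)-(11), sigma_2 = identity
    recon = ((X - X_hat) ** 2).sum() / (2 * m)           # reconstruction term
    row_sparse = W1.pow(2).sum(dim=1).sqrt().sum()       # Eq. (13): ||W1||_{2,1}
    decay = (W1.pow(2).sum() + W2.pow(2).sum()) / 2      # weight decay term
    loss = recon + alpha * row_sparse + beta * decay     # Eq. (15)
    loss.backward()
    optimizer.step()

# Rank features by the row norms of W1: a larger norm means a more important feature.
scores = W1.detach().pow(2).sum(dim=1).sqrt()
print("top features:", torch.argsort(scores, descending=True)[:5].tolist())
```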

Summary and Discussions.

In summary, there are three kinds of evaluation methods in FS: the filter method, the wrapper method, and the embedded method.

The filter method mainly focuses on the correlation among the features in the original data. It selects the optimal features and passes them to the subsequent learning algorithm. Therefore, the filter method is more versatile and is not affected by the subsequent learning process. If the original data set is huge, the filter method is a good solution because it can quickly remove a large number of redundant features. However, because it is independent of the learning algorithm, the FS accuracy of the filter method is usually lower than that of the wrapper method.

The wrapper method relies on the learning performance of the subsequent learning algorithm to evaluate a feature subset. Therefore, to perform FS and evaluate its performance on a data set, it is necessary to train a classifier and assess the feature subset based on the classifier's performance. The wrapper method performs better than the filter method and is more accurate and reliable. Its disadvantage is that it requires training and evaluating a model for each generated subset, so it is not suitable for large-scale data sets.

The embedded method is similar to the wrapper method, but unlike the first two methods, it makes the FS algorithm itself a part of the learning algorithm and selects features during the learning process. The embedded method requires the learning algorithm to be redesigned to incorporate FS, so the complexity and difficulty of implementation increase; however, it saves a lot of time in the model training stage. As a hotspot of current research, the embedded method has the advantages of high efficiency and tight integration with machine learning.

4 Conclusions and Future Work

This paper first analyzes the main goals and general architecture of FS, and then classifies FS methods according to search strategies and evaluation methods. To solve high-dimensional problems, FS can remove redundant features using a global search strategy, a random search strategy, or a heuristic search strategy. The filter method can be chosen to deal with massive data, or the wrapper method to improve the quality of the generated subsets. The embedded method can also be combined with other ML algorithms to achieve better performance. However, there are still shortcomings; for example, a single algorithm cannot support both high-dimensional and low-dimensional data. In the future, the development of FS is expected to focus on combining FS with other ML algorithms, such as adversarial neural networks and convolutional neural networks, to make it more suitable for practical applications. Multi-view FS would also be a hot topic.