
1 Introduction

In many fields such as data mining and machine learning, we usually need to deal with high-dimensional data that may contain a large number of irrelevant and redundant features. These features make the data distribution sparse in the feature space and hinder data analysis tasks. The rapid growth of data dimension not only increases computational cost and memory consumption, but also degrades the classification performance of classifiers. In order to improve learning performance, a variety of dimensionality reduction methods have been developed, among which feature selection is one of the most effective techniques for processing high-dimensional data [1, 2].

The main goal of feature selection is to select an optimal feature subset that contains the most useful information in the original features and has the greatest correlation with the classification task. With this optimal feature subset, the training time of classifiers can be effectively shortened, and the learning performance can be further enhanced [3,4,5]. In recent years, feature selection techniques have become increasingly diverse, and many feature selection methods have been proposed [9,10,11,15,17,18]. Here, we consider RELIEF-based methods.

Kira and Rendell proposed the famous Relief algorithm in 1992, which uses the Euclidean distance as a metric to select features with large weights [6]. In 1994, Kononenko presented the extended algorithm RELIEF-F to solve the problem of multi-class classification [7]. On the basis of RELIEF, Sun et al. proposed an iterative RELIEF (I-RELIEF) algorithm to alleviate the deficiencies of RELIEF by exploring the framework of the expectation-maximization algorithm [8]. In order to better estimate feature weights, Sun et al. also proposed a logistic I-RELIEF (LIR) algorithm, which optimizes I-RELIEF in the form of logistic regression [16]. All of the above RELIEF-based algorithms are supervised feature selection methods, which can only use data with class labels. However, these methods cannot perform well when only a few labeled samples are available together with a large number of unlabeled ones. To remedy this, a semi-supervised logistic I-RELIEF (SLIR) method was presented [16]. In SLIR, both labeled and unlabeled data are used to calculate margin vectors of samples. However, SLIR was designed only for binary classification tasks. Tang et al. developed a multi-class semi-supervised logistic I-RELIEF (MSLIR) algorithm [19]. MSLIR designs a novel scheme to find margin vectors of unlabeled samples by calculating all possible candidate margin vectors and picking an optimal one under the current feature weights. However, although MSLIR can perform multi-class feature selection in semi-supervised settings and achieves better classification performance than the supervised method LIR, its prediction performance on unlabeled samples is unsatisfactory.

In order to solve the above issue, we propose a multi-class semi-supervised logistic I-RELIEF method based on nearest neighbors (MSLIR-NN) for multi-class feature selection. In MSLIR-NN, the nearest neighbor scheme is adopted to assign pseudo labels to unlabeled samples according to the labeled data in each iteration. In this way, unlabeled samples with pseudo labels can be treated as labeled ones, so the margin vectors of both labeled and unlabeled samples can be calculated easily. MSLIR-NN has a smaller computational complexity than MSLIR when calculating margin vectors of unlabeled samples. In experiments, both support vector machine (SVM) and nearest neighbor (NN) classifiers are used to ensure the fairness of the classification results. Experimental results show that MSLIR-NN greatly improves the prediction performance on unlabeled data and enhances the performance of classifiers.

The rest of this paper is organized as follows. The proposed method is described in detail in Sect. 2, where its connections to other related work are also discussed. Section 3 gives and analyzes experimental results. Section 4 concludes this paper.

2 Proposed Method: MSLIR-NN

In this section, we design a novel multi-class semi-supervised feature selection method, MSLIR-NN, which adopts the nearest neighbor scheme to assign pseudo labels to unlabeled samples in each iteration. In doing so, unlabeled samples with pseudo labels can be treated as labeled ones, so the margin vectors of labeled and unlabeled samples can be calculated easily. In the following, we describe MSLIR-NN in detail and discuss its connections to MSLIR and SLIR.

2.1 Margin Vectors

Assume that there is a labeled sample set \(D _{l}=\{(\varvec{\mathrm {x}}^{l}_{i},y^{l}_{i})\}^{L}_{i=1}\) and an unlabeled sample set \(D _{u}=\{\varvec{\mathrm {x}}^{u}_{i}\}^{U}_{i=1}\), where \(\mathbf x _i^l \in \mathbb {R}^I\), \(y^{l}_{i}\in \{1,2,\dots ,c\}\), \(\mathbf x _i^u \in \mathbb {R}^I\), I is the number of original features, c is the number of classes, and L and U are the numbers of labeled and unlabeled samples, respectively. Generally, \(L\ll U\).

It is well known that one of the main differences among RELIEF-based methods is the way margin vectors of samples are calculated. Without loss of generality, let \(\varvec{\mathrm {z}}_i^{l}\) and \(\varvec{\mathrm {z}}_i^{u}\) be the margin vectors of the labeled sample \(\varvec{\mathrm {x}}_i^{l}\) and the unlabeled sample \(\varvec{\mathrm {x}}_i^{u}\), respectively. It is easy to generate the margin vectors of labeled samples in semi-supervised RELIEF-based methods. Similar to MSLIR [19], the margin vector \(\mathbf z ^{l}_{i}\) of the labeled sample \(\varvec{\mathrm {x}}^{l}_{i}\) can be expressed as follows:

$$\begin{aligned} \varvec{\mathrm {z}}^l_{i}&=\sum _{\varvec{\mathrm {x}}^{l}_{k}\in M _{i}}{} P (\varvec{\mathrm {x}}^{l}_{k}=NM (\varvec{\mathrm {x}}^{l}_{i})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}^{l}_{i}-\varvec{\mathrm {x}}^{l}_{k}|\nonumber \\&-\sum _{\varvec{\mathrm {x}}^{l}_{k}\in H _{i}}{} P (\varvec{\mathrm {x}}^{l}_{k}=NH (\varvec{\mathrm {x}}^{l}_{i})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}^{l}_{i}-\varvec{\mathrm {x}}^{l}_{k}| \end{aligned}$$
(1)

where \(\mathbf w \) is the feature weight vector. The set \(M _{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{i}\ne y^{l}_{k},k=1,\cdots ,L, y^{l}_{k}\in \{1,\cdots ,c\}\}\) contains all labeled samples that have different labels from \(\varvec{\mathrm {x}}^{l}_{i}\), and the set \(H _{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{i}=y^{l}_{k},k=1,\cdots ,L, y^{l}_{k}\in \{1,\cdots ,c\}\}\) contains all labeled samples that have the same label as \(\varvec{\mathrm {x}}^{l}_{i}\). \(P (\varvec{\mathrm {x}}_{k}^{l}=NM (\varvec{\mathrm {x}}_{i}^l)|\varvec{\mathrm {w}})\) and \(P (\varvec{\mathrm {x}}_{k}^l=NH (\varvec{\mathrm {x}}_{i}^l)|\varvec{\mathrm {w}})\) denote the probabilities that the sample \(\varvec{\mathrm {x}}_{k}^l\) is the nearest miss and the nearest hit of \(\varvec{\mathrm {x}}_{i}^l\), respectively, where \(NM (\varvec{\mathrm {x}}_{i}^l)\) represents the nearest miss of \(\varvec{\mathrm {x}}_{i}^l\) (its nearest neighbor from a different class) and \(NH (\varvec{\mathrm {x}}_{i}^l)\) represents its nearest hit (its nearest neighbor from the same class).

For semi-supervised RELIEF-based methods, the key difficulty is how to define the margin vectors of unlabeled samples. Before calculating them, we first predict the pseudo labels of unlabeled data. Using the information contained in the labeled set \(D _{l}\), we apply the nearest neighbor scheme to predict the pseudo labels of the unlabeled data in \(D _{u}\). Note that \(\mathbf w \) changes over iterations, so the neighborhood of an unlabeled sample may change under different weights. In other words, \(\mathbf w \) affects the search for nearest neighbors. Thus, we search for nearest neighbors of unlabeled samples in the weighted feature space instead of the original input space. Then, we extend \(D_u\) to the set \(\hat{D}_{u}=\{(\mathbf x _i^u,\hat{y}_i^u)\}^{U}_{i=1}\), where \(\hat{y}_i^u\) is the pseudo label of \(\mathbf x _i^u\). Similar to (1), we define the margin vector \(\varvec{\mathrm {z}}_i^u\) of the unlabeled sample \(\mathbf x _{i}^u\):

$$\begin{aligned} \varvec{\mathrm {z}}^u_{i}&=\sum _{\varvec{\mathrm {x}}^{l}_{k}\in M '_{i}}{} P (\varvec{\mathrm {x}}^{l}_{k}=NM (\varvec{\mathrm {x}}^{u}_{i})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}^{u}_{i}-\varvec{\mathrm {x}}^{l}_{k}|\nonumber \\&-\sum _{\varvec{\mathrm {x}}^{l}_{k}\in H '_{i}}{} P (\varvec{\mathrm {x}}^{l}_{k}=NH (\varvec{\mathrm {x}}^{u}_{i})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}^{u}_{i}-\varvec{\mathrm {x}}^{l}_{k}| \end{aligned}$$
(2)

where the set \(M '_{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},\hat{y}^{u}_{i}\ne y^{l}_{k},k=1,\cdots ,L, y^{l}_{k}\in \{1,\cdots ,c\}\}\) contains all labeled samples whose labels differ from the pseudo label of \(\varvec{\mathrm {x}}^{u}_{i}\), and the set \(H '_{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},\hat{y}^{u}_{i}=y^{l}_{k},k=1,\cdots ,L, y^{l}_{k}\in \{1,\cdots ,c\}\}\) contains all labeled samples whose labels coincide with the pseudo label of \(\varvec{\mathrm {x}}^{u}_{i}\).
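To make the construction of (1) and (2) concrete, the following sketch (in NumPy) assigns pseudo labels by the nearest neighbor rule in the weighted feature space and then computes a margin vector. The exponential-kernel form of the probabilities \(P (\cdot |\varvec{\mathrm {w}})\) and the kernel width sigma are assumptions in the spirit of I-RELIEF, since this section does not define them explicitly; the function names nn_pseudo_labels and margin_vector are illustrative.

```python
import numpy as np

def nn_pseudo_labels(X_l, y_l, X_u, w):
    """Pseudo labels for unlabeled samples: 1-NN search in the weighted feature space."""
    # weighted l1 distances between every unlabeled and every labeled sample, shape (U, L)
    d = np.abs(X_u[:, None, :] * w - X_l[None, :, :] * w).sum(axis=2)
    return y_l[np.argmin(d, axis=1)]

def margin_vector(x, label, X_l, y_l, w, sigma=1.0):
    """Margin vector of one sample: Eq. (1) with its true label, or Eq. (2) with a pseudo label.
    For a labeled sample, the sample itself should be excluded from the hit set beforehand."""
    diff = np.abs(x - X_l)                              # |x - x_k| for every labeled sample, (L, I)
    dist = diff @ w                                     # weighted l1 distances, (L,)
    k = np.exp(-dist / sigma)                           # assumed I-RELIEF-style kernel
    miss = (y_l != label).astype(float)                 # indicator of M_i
    hit = (y_l == label).astype(float)                  # indicator of H_i
    p_miss = k * miss / max((k * miss).sum(), 1e-12)    # P(x_k = NM(x) | w) over M_i
    p_hit = k * hit / max((k * hit).sum(), 1e-12)       # P(x_k = NH(x) | w) over H_i
    return p_miss @ diff - p_hit @ diff                 # Eq. (1) / Eq. (2)
```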

2.2 Optimization Problem

After obtaining margin vectors of all samples, the optimization of MSLIR-NN can be described as:

$$\begin{aligned} \min _{\varvec{\mathrm {w}}}&\Vert \varvec{\mathrm {w}}\Vert _{1}+\alpha \sum ^{L}_{i=1}\log (1+\exp (-\varvec{\mathrm {w}}^{T}\varvec{\mathrm {z}}^{l}_{i}))+ \beta \sum ^{U}_{i=1}\log (1+\exp (-\varvec{\mathrm {w}}^{T}\varvec{\mathrm {z}}^{u}_{i}))\\ \nonumber s.t.&~~\varvec{\mathrm {w}}\ge 0 \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _{1}\) is the 1-norm, and the regularization parameters \(\alpha \ge 0\) and \(\beta \ge 0\) control the importance of labeled and unlabeled samples, respectively.

To eliminate the constraint \(\varvec{\mathrm {w}}\ge 0\), let \(\varvec{\mathrm {w}}=[v^{2}_{1},\cdots ,v^{2}_{I}]^{T}\) and \(\varvec{\mathrm {v}}=[v_{1},\cdots ,v_{I}]\). Substituting \(\mathbf v \) into (3), we convert it into the following unconstrained optimization problem:

$$\begin{aligned} \min _{\varvec{\mathrm {v}}} J=\Vert \varvec{\mathrm {v}}\Vert ^{2}_{2}&+\alpha \sum ^{L}_{i=1}\log (1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id}))\nonumber \\&+\beta \sum ^{U}_{i=1}\log (1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})) \end{aligned}$$
(4)

where \(\varvec{\mathrm {z}}^{l}_{i}=[z^{l}_{i1},\cdots ,z^{l}_{iI}]\) and \(\varvec{\mathrm {z}}^{u}_{i}=[z^{u}_{i1},\cdots ,z^{u}_{iI}]\). Problem (4) can be solved by the gradient descent method. The derivative of J with respect to \(\varvec{\mathrm {v}}\) can be written componentwise as follows:

$$\begin{aligned} \frac{\partial J}{\partial v_{k}}= 2v_{k}&-\alpha \sum ^{L}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})(2v_{k}z^{l}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})}\nonumber \\&-\beta \sum ^{U}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})(2v_{k}z^{u}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})} \end{aligned}$$
(5)

Let

$$\begin{aligned} Q =\alpha \sum ^{L}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})(v_{k}z^{l}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})}+\beta \sum ^{U}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})(v_{k}z^{u}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})} \end{aligned}$$

Then we can update \(v_k\) by

$$\begin{aligned} v_{k}\leftarrow v_{k}-\eta (v_{k}-Q) \end{aligned}$$
(6)

where \(\eta >0\) is the learning rate.
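As a minimal illustration of (4)-(6), the sketch below performs one gradient step on \(\varvec{\mathrm {v}}\), assuming the margin matrices Z_l (of size \(L\times I\)) and Z_u (of size \(U\times I\)) have already been computed; the function name and variable layout are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def update_v(v, Z_l, Z_u, alpha, beta, eta):
    """One gradient-descent step on Eq. (4), using the update rule in Eq. (6)."""
    w = v ** 2                                    # w_d = v_d^2 guarantees w >= 0
    def part(Z, coeff):
        s = np.exp(-Z @ w)                        # exp(-sum_d v_d^2 z_id), one value per sample
        return coeff * (((s / (1.0 + s))[:, None] * Z).sum(axis=0))
    Q = v * (part(Z_l, alpha) + part(Z_u, beta))  # the vector Q defined after Eq. (5)
    return v - eta * (v - Q)                      # Eq. (6)
```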

2.3 Algorithm and Complexity Analysis

The description of MSLIR-NN is shown in Algorithm 1. Given a labeled dataset \(D_{l}\) and an unlabeled dataset \(D_{u}\), the weight vector \(\mathbf {w}\) is updated iteratively. Let \(\varvec{\mathrm {w}}_{(t-1)}\) denote the weight vector obtained in the \((t-1)\)-th iteration. In the t-th iteration, MSLIR-NN first computes the weighted labeled samples \(\varvec{\mathrm {x}}^{l*}=\varvec{\mathrm {x}}^l\circ \varvec{\mathrm {w}}_{(t-1)}\) and the weighted unlabeled samples \(\varvec{\mathrm {x}}^{u*} = \varvec{\mathrm {x}}^u\circ \varvec{\mathrm {w}}_{(t-1)}\), where \(\circ \) denotes element-wise multiplication. The pseudo labels of unlabeled samples are then determined in this weighted sample space using the NN scheme. Next, the margin vectors of \(\varvec{\mathrm {x}}^{l}\) and \(\varvec{\mathrm {x}}^{u}\) are calculated by (1) and (2), respectively. Finally, \(\varvec{\mathrm {w}}_{(t)}\) is obtained by solving (3). MSLIR-NN alternately updates the weight vector until convergence.

The computational complexity of MSLIR-NN mainly comprises three parts: the calculation of margin vectors for labeled samples, the calculation of margin vectors for unlabeled samples, and the solution of the optimization problem (3). The complexity of calculating margin vectors for labeled samples is identical to that of MSLIR and SLIR, namely \(O(dL^2)\) without considering the calculation of probability terms, where d is the dimension of samples and L is the number of labeled samples. For the calculation of margin vectors for unlabeled samples, the complexity of MSLIR-NN is about O(dUL), where U is the number of unlabeled samples. For the last part, MSLIR-NN has the same computational complexity as MSLIR and SLIR.

Algorithm 1. The MSLIR-NN algorithm.
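For readers who prefer code to pseudocode, a compact sketch of the loop in Algorithm 1 is given below. It reuses the hypothetical helpers nn_pseudo_labels, margin_vector and update_v sketched in Sects. 2.1 and 2.2, runs a fixed number of iterations T instead of an explicit convergence test, and performs a single gradient step per iteration; these simplifications are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def mslir_nn(X_l, y_l, X_u, alpha=10.0, beta=0.1, eta=0.03, T=50):
    """Sketch of Algorithm 1: alternately assign pseudo labels and update feature weights."""
    I = X_l.shape[1]
    v = np.ones(I)                                   # w = v^2 starts from all-ones weights
    for _ in range(T):
        w = v ** 2
        y_u = nn_pseudo_labels(X_l, y_l, X_u, w)     # pseudo labels in the weighted space
        Z_l = np.vstack([margin_vector(x, y, X_l, y_l, w) for x, y in zip(X_l, y_l)])
        Z_u = np.vstack([margin_vector(x, y, X_l, y_l, w) for x, y in zip(X_u, y_u)])
        v = update_v(v, Z_l, Z_u, alpha, beta, eta)  # one step of (6) toward solving (3)
    return v ** 2                                    # final feature weights

# Features can then be ranked by the returned weights and the top-ranked ones fed to a classifier.
```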

2.4 Connections to Related Work

Four RELIEF-based methods, LIR, SLIR, MSLIR and MSLIR-NN, use the logistic regression formulation to optimize the feature weight vector \(\mathbf w \). We discuss the connections of MSLIR-NN to LIR, SLIR and MSLIR in the following.

MSLIR-NN is a semi-supervised learning method, as are SLIR and MSLIR, whereas LIR is designed for supervised learning. Based on LIR, SLIR introduces a term about unlabeled samples into the objective function. MSLIR modifies the objective function of SLIR to balance the contributions of labeled and unlabeled samples. Although MSLIR-NN has the same optimization function as MSLIR, MSLIR-NN computes margin vectors of unlabeled samples in a different way.

SLIR, MSLIR and MSLIR-NN all adopt LIR's way of calculating margin vectors of labeled samples. It is straightforward for SLIR to obtain margin vectors of unlabeled samples since SLIR deals only with binary classification tasks. For a given unlabeled sample, MSLIR first calculates all possible candidate margin vectors and takes an optimal one in the weighted feature space as its margin vector; the computational complexity of MSLIR for calculating margin vectors of unlabeled samples is therefore O(cdLU). MSLIR-NN instead first assigns a pseudo label to the unlabeled sample and then directly calculates its margin vector in the same way as for labeled samples. Compared to MSLIR, MSLIR-NN has a lower complexity, O(dLU), which is independent of the class number c and identical to that of SLIR.

3 Experiments

We conduct extensive experiments to demonstrate the efficiency and effectiveness of MSLIR-NN. Ten UCI datasets [20], including Pendigits, Satimage, Waveform, Wine, Vehicle, Iris, Breast, Heart, Wdbc and Pima, are adopted, where the first six datasets are multi-class and the remaining four are binary. Each dataset is randomly divided into training and test subsets, and the training subsets contain both labeled and unlabeled samples. A brief description of the datasets is listed in Table 1, where "#Training" and "#Test" represent the numbers of training and test samples, respectively, "#Labeled" and "#Unlabeled" are the numbers of labeled and unlabeled samples in a training set, "#Feature" represents the dimension of samples, and "#Class" indicates the number of categories. For each dataset, we add 100 additional noise features that are independently Gaussian distributed, and then normalize all features together with the original data.
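As an illustration of this preprocessing, the sketch below appends 100 independent Gaussian noise features and normalizes every feature column; the zero-mean, unit-variance scaling is an assumption, since the exact normalization scheme is not spelled out here, and the function name is illustrative.

```python
import numpy as np

def add_noise_and_normalize(X, n_noise=100, seed=0):
    """Append independent Gaussian noise features, then standardize all columns."""
    rng = np.random.default_rng(seed)
    X = np.hstack([X, rng.standard_normal((X.shape[0], n_noise))])  # irrelevant features
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)           # assumed z-score scaling
```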

Table 1. Description of ten UCI datasets.

In our experiments, the compared methods include RELIEF-F, LIR, SLIR, MSLIR and MSLIR-NN. Both RELIEF-F and LIR use only labeled data, while SLIR, MSLIR and MSLIR-NN use both labeled and unlabeled data. The classification performance is tested on the same test subsets. In LIR, the regularization parameter \(\lambda \) and the learning rate \(\eta \) are set to 10 and 0.03, respectively. In MSLIR-NN, MSLIR and SLIR, the parameters \(\alpha \) and \(\beta \) are set to 10 and 0.1, respectively, and the learning rate is the same as that of LIR.

To reduce statistical error, each algorithm is run 10 times on each dataset, and the average result is taken as the final one. To ensure the reliability of experimental results, both the nearest neighbor (NN) and support vector machine (SVM) classifiers are used in the experiments. The Gaussian kernel parameter and the regularization parameter of SVM are selected by grid search, where both vary from \(2^{-10}\) to \(2^{10}\).
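A possible realization of this classifier setup with scikit-learn is sketched below; the 5-fold cross-validation used to score each grid point is an assumption, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Gaussian (RBF) kernel SVM with C and gamma each searched over 2^-10, ..., 2^10.
param_grid = {"C": 2.0 ** np.arange(-10, 11), "gamma": 2.0 ** np.arange(-10, 11)}
svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# Example use on the top-ranked features returned by a feature selection method:
# svm.fit(X_train[:, top_features], y_train); acc = svm.score(X_test[:, top_features], y_test)
```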

3.1 Experiments on Multi-class Datasets

For multi-class classification tasks, we compare the proposed algorithm with the other three methods, MSLIR, LIR and RELIEF-F. Experiments are implemented on the Pendigits, Satimage, Waveform, Wine, Vehicle and Iris datasets. The performance of these feature selection algorithms is evaluated by the classification accuracy with the selected features, and the experimental results obtained by SVM are shown in Fig. 1. From Fig. 1, we can observe that the classification accuracy of MSLIR-NN is the best among the compared methods, which indicates that features selected by MSLIR-NN have a greater correlation with the label information. In Figs. 1(a) and (c), the classification accuracy of the four algorithms increases as the number of selected features increases, which demonstrates that the chosen features are all useful for classification and the noisy features are excluded. In Figs. 1(b), (d), (e) and (f), the classification accuracy tends to level off or decrease, which indicates that feature selection is useful; in other words, not all original features are related to the classification task. For example, on the Wine dataset, when the number of selected features is eight, MSLIR-NN achieves the highest classification accuracy of 98%, and the last four selected features are likely to be irrelevant.

Fig. 1. Classification accuracy of SVM using four feature selection methods on multi-class datasets: (a) Pendigits, (b) Satimage, (c) Waveform, (d) Wine, (e) Vehicle and (f) Iris.

Table 2. Classification accuracy and standard deviations (%) of NN using four feature selection methods on multi-class datasets

The performance curves obtained by NN are shown in Fig. 2, from which we can draw similar conclusions to those from Fig. 1. We give the best average accuracies and the corresponding standard deviations of the four methods in Table 2, where the best results are in bold. Compared with the other three methods, MSLIR-NN achieves higher classification accuracy and smaller standard deviations, reflecting that our method is more stable than the previous MSLIR.

Fig. 2. Classification accuracy of NN using four feature selection methods on multi-class datasets: (a) Pendigits, (b) Satimage, (c) Waveform, (d) Wine, (e) Vehicle and (f) Iris.

3.2 Experiments on Binary Datasets

Similar to MSLIR, MSLIR-NN can also be applied to binary classification tasks. Since SLIR is only applicable to binary classification, we compare MSLIR-NN and MSLIR with it on the Breast, Heart, Wdbc and Pima datasets.

The experimental results obtained by SVM are given in Fig. 3. We can see that MSLIR-NN is much better than both MSLIR and SLIR, especially in Figs. 3(a), (c) and (d).

The performance curves obtained by NN are shown in Fig. 4. MSLIR-NN still has an advantage over the other two methods. We list the best average classification accuracies and the corresponding standard deviations in Table 3. Obviously, MSLIR-NN performs the best among the three methods on all four binary datasets, followed by MSLIR. Compared to MSLIR, MSLIR-NN improves the accuracy by 2.03% on Heart and by 5.73% on Pima.

Table 3. Classification accuracy and standard deviations (%) of NN using three feature selection methods on binary datasets
Fig. 3. Classification accuracy of SVM using three feature selection methods on binary datasets: (a) Breast, (b) Heart, (c) Wdbc, (d) Pima.

Fig. 4. Classification accuracy of NN using three feature selection methods on binary datasets: (a) Breast, (b) Heart, (c) Wdbc, (d) Pima.

3.3 Comparison of MSLIR-NN and MSLIR

MSLIR-NN and MSLIR have the same objective function but different ways of constructing margin vectors of unlabeled samples. Here, we compare them in two aspects: the prediction accuracy on unlabeled samples in the training subsets and the running time of feature selection.

Table 4 lists the accuracy on unlabeled samples and the running time of feature selection for the two methods. We can see that MSLIR-NN is significantly better than MSLIR in prediction performance, which indicates that the proposed algorithm can use only a few labeled samples to predict the labels of unlabeled data. Thus the margin vectors of unlabeled samples can be calculated more accurately, which improves the stability of the algorithm. The running time of feature selection reflects the computational complexity of the algorithms. Obviously, MSLIR-NN is much faster than MSLIR on all ten datasets, which supports our analysis of computational complexity in Sect. 2.3.

Table 4. Comparison of MSLIR-NN and MSLIR algorithms

4 Conclusions

In this paper, we propose MSLIR-NN, a multi-class semi-supervised feature selection method built on MSLIR by introducing the nearest neighbor scheme. MSLIR-NN has a lower computational complexity than MSLIR and improves the accuracy of label prediction for unlabeled data, which in turn benefits the calculation of margin vectors of unlabeled samples. Extensive experiments are performed on binary and multi-class classification tasks, where two classical classifiers, NN and SVM, are used to perform classification after feature selection. On multi-class datasets, MSLIR-NN is superior to the supervised methods LIR and RELIEF-F as well as the semi-supervised method MSLIR. On binary datasets, MSLIR-NN performs the best among the three semi-supervised methods. In the comparison with MSLIR, MSLIR-NN demonstrates both its ability to predict labels of unlabeled samples and its speed advantage. In general, MSLIR-NN can extract useful features and achieve better performance.