
1 Introduction

High-dimensional data, such as DNA microarray data, medical data, and satellite remote sensing data, are common in machine learning and pattern recognition and often contain irrelevant information. It is therefore meaningful to use feature selection or feature extraction as a pre-processing step to reduce the negative influence of irrelevant data. Feature selection, as a dimensionality reduction technique, makes a great contribution to saving storage space and reducing computational cost. Therefore, feature selection is widely applied in many learning tasks.

Feature selection methods can be categorized into three groups: supervised, unsupervised and semi-supervised methods. Supervised feature selection methods include RELIEF-based methods [1,2,3], Fisher criterion-based methods [4, 5], etc. RELIEF is one of the most effective feature selection algorithms; it finds a weight vector for features by maximizing the margin defined from the differences between given samples and their nearest neighbors in the two classes. The features with larger weights can then be selected to perform classification tasks. Variants of RELIEF have been proposed, such as Logistic I-RELIEF (LIR) [1] and RELIEF-F [3]. Supervised feature selection methods require a large amount of labeled data, which is hard to obtain, and they struggle to identify discriminative features when the training dataset contains only a few labeled samples. Therefore, many unsupervised feature selection methods have been proposed [6,7,8], which can effectively utilize unlabeled data. However, these methods ignore the information contained in labeled data, so they cannot identify discriminative features well [7].

Semi-supervised feature selection methods can achieve better performance since they make full use of both labeled and unlabeled data. Common methods include the clustering-based method [9], the locality sensitive-based method [10], the local discriminative information-based method [11], the semi-supervised Logistic I-RELIEF method (SLIR) [12], forward methods [13, 14], the multi-objective optimization method [15], the spectral analysis method [16], and so on. This paper focuses on RELIEF-based methods. In SLIR, labeled data are used to maximize the distance between samples in different classes, while unlabeled samples are used to extract the geometric structure of the data space [12]. SLIR is the first successful variant of RELIEF in semi-supervised learning. However, SLIR can only deal with binary problems and cannot effectively solve multi-class problems.

To solve this issue, we propose a multi-classification semi-supervised Logistic I-RELIEF (MSLIR) algorithm for feature selection based on SLIR. Under the semi-supervised learning framework, RELIEF-based methods need to consider how to compute the margin vector of an unlabeled sample. MSLIR designs a novel scheme to calculate the margin vectors of unlabeled samples for multi-classification problems.

The rest of the paper is arranged as follows. Section 2 briefly introduces related work, including Logistic I-RELIEF and semi-supervised Logistic I-RELIEF. We propose the multi-classification semi-supervised Logistic I-RELIEF algorithm in Sect. 3. Experimental results are presented in Sect. 4. The paper is concluded in Sect. 5.

2 Related Work

This section mainly introduces two algorithms: LIR and SLIR, where SLIR is an extended algorithm of LIR.

2.1 Logistic I-RELIEF

In I-RELIEF, the nearest neighbors of a given sample in the original feature space may be inconsistent with its nearest neighbors in the weighted feature space, so LIR proposes a probabilistic model to calculate the distances between samples and their neighbors.

Assume that the input dataset is \(D =\{(\varvec{\mathrm {x}}_{n},y_{n})\}^{N}_{n=1}\subset R ^{I}\times \{+1,-1\}\), where \(\varvec{\mathrm {x}}_{n}\) is a labeled sample and \(y_{n}\) is its label, I is the number of original features, and N is the number of samples. Here, we discuss the binary problem: +1 and −1 represent the positive and negative classes, respectively.

The optimization problem of LIR can be described as:

$$\begin{aligned}&\min _{\varvec{\mathrm {w}}}~\sum ^{N}_{n=1}\log (1+\exp (-\varvec{\mathrm {w}}^{T}\varvec{\mathrm {z}}_{n}))+\lambda \Vert \varvec{\mathrm {w}}\Vert _{1}\nonumber \\&s.t. \quad \varvec{\mathrm {w}}\ge 0 \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _{1}\) is the 1-norm, \(\varvec{\mathrm {w}}\) is the feature weight vector which represents the importance of features, \(\lambda \ge 0\) is a regularization parameter to avoid overfitting, and \(\varvec{\mathrm {z}}_n\) is the margin vector of the sample \(\varvec{\mathrm {x}}_{n}\) which can be written as follows:

$$\begin{aligned} \varvec{\mathrm {z}}_n&=\sum _{\varvec{\mathrm {x}}_i\in M _{n}}{} P (\varvec{\mathrm {x}}_{i}=NM (\varvec{\mathrm {x}}_{n})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}_n-\varvec{\mathrm {x}}_i|\nonumber \\&-\sum _{\varvec{\mathrm {x}}_i\in H _{n}}{} P (\varvec{\mathrm {x}}_{i}=NH (\varvec{\mathrm {x}}_{n})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}_n-\varvec{\mathrm {x}}_i| \end{aligned}$$
(2)

where \(M _{n}=\{\varvec{\mathrm {x}}_{i}|(\varvec{\mathrm {x}}_{i},y_{i})\in D ,y_{i}\ne y_{n},i=1,\dots ,N\}, H _{n}=\{\varvec{\mathrm {x}}_{i}|(\varvec{\mathrm {x}}_{i},y_{i})\in D ,y_{i}=y_{n},i=1,\dots ,N\}, P (\varvec{\mathrm {x}}_{i}=NM (\varvec{\mathrm {x}}_{n})|\varvec{\mathrm {w}})\) and \(P (\varvec{\mathrm {x}}_{i}=NH (\varvec{\mathrm {x}}_{n})|\varvec{\mathrm {w}})\) are the probabilities that the sample \(\varvec{\mathrm {x}}_{i}\) is the nearest miss and the nearest hit of \(\varvec{\mathrm {x}}_{n}\), respectively, \(NM (\varvec{\mathrm {x}}_{n})\) represents the nearest miss of sample \(\varvec{\mathrm {x}}_{n}\) and \(NH (\varvec{\mathrm {x}}_{n})\) represents the nearest hit of sample \(\varvec{\mathrm {x}}_{n}\).
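For concreteness, the following is a minimal Python sketch of how the margin vector in Eq. (2) could be computed. The paper only states that a probabilistic model over neighbors is used, so the exponential kernel over weighted L1 distances below is an assumption, and all function and variable names (`nn_probabilities`, `margin_vector`, `sigma`) are illustrative rather than the authors' implementation.

```python
import numpy as np

def nn_probabilities(x, candidates, w, sigma=1.0):
    """Soft nearest-neighbour probabilities such as P(x_i = NM(x_n) | w).

    Assumed model: probabilities proportional to exp(-d_w(x, x_i) / sigma),
    where d_w is the w-weighted L1 distance (a guess; the paper does not fix
    the kernel form).
    """
    dists = np.abs(candidates - x) @ w        # w-weighted L1 distances, shape (m,)
    scores = np.exp(-dists / sigma)
    return scores / scores.sum()

def margin_vector(x, hits, misses, w, sigma=1.0):
    """Margin vector z_n of Eq. (2): expected |x - miss| minus expected |x - hit|."""
    p_miss = nn_probabilities(x, misses, w, sigma)   # over the miss set M_n
    p_hit = nn_probabilities(x, hits, w, sigma)      # over the hit set H_n
    return p_miss @ np.abs(misses - x) - p_hit @ np.abs(hits - x)
```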

2.2 Semi-supervised Logistic I-RELIEF

On the basis of Logistic I-RELIEF algorithm, Sun et al. [12] extended Logistic I-RELIEF to semi-supervised learning. Due to the introduction of unlabeled samples, the calculation for the margin vectors of unlabeled samples needs to be revised.

We are given a labeled sample set \(D _{l}=\{(\varvec{\mathrm {x}}^{l}_{i},y^{l}_{i})\}^{L}_{i=1} (y_i\in \{\pm 1\})\) and an unlabeled sample set \(D _{u}=\{\varvec{\mathrm {x}}^{u}_{i}\}^{U}_{i=1}\). For a given labeled sample \(\varvec{\mathrm {x}}^{l}\) and unlabeled sample \(\varvec{\mathrm {x}}^{u}\), let their margin vectors be \(\varvec{\mathrm {z}}^{l}\) and \(\varvec{\mathrm {z}}^{u}\), respectively. Since \(\varvec{\mathrm {x}}^{l}\) is a labeled sample, its margin vector \(\varvec{\mathrm {z}}^{l}\) can be computed as in Eq. (2). Fortunately, \(\varvec{\mathrm {z}}^{u}\) has the same absolute value regardless of which class the sample \(\varvec{\mathrm {x}}^{u}\) belongs to, so \(\varvec{\mathrm {z}}^{u}\) can be computed in the same way as \(\varvec{\mathrm {z}}_{n}\). SLIR can be cast into the following optimization problem:

$$\begin{aligned}&\min _{\varvec{\mathrm {w}}}~ \Vert \varvec{\mathrm {w}}\Vert _{1}+\alpha \sum ^{L}_{i=1}\log (1+\exp (-\varvec{\mathrm {w}}^{T}\varvec{\mathrm {z}}^{l}_{i}))+\beta \sum ^{U}_{i=1}\exp (-(\varvec{\mathrm {w}}^{T}\varvec{\mathrm {z}}^{u}_{i})^{2}/\delta )\nonumber \\&s.t. \quad \varvec{\mathrm {w}}\ge 0 \end{aligned}$$
(3)

where \(\delta \) is the kernel width, \(\varvec{\mathrm {z}}^{l}_{i}\) is the margin vector of the sample \(\varvec{\mathrm {x}}^{l}_{i}\), \(\varvec{\mathrm {z}}^{u}_{i}\) is the margin vector of the sample \(\varvec{\mathrm {x}}^{u}_{i}\), and \(\alpha \) and \(\beta \) control the contributions of the labeled and unlabeled samples to the cost function (3), respectively. Problem (3) can then be solved by the gradient descent method.
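As a hedged illustration, the SLIR cost in (3) could be evaluated as follows, assuming the margin vectors have already been computed and stacked row-wise into matrices; the names `Z_l`, `Z_u` and `slir_objective` are illustrative only.

```python
import numpy as np

def slir_objective(w, Z_l, Z_u, alpha, beta, delta):
    """SLIR cost of Eq. (3): L1 penalty, logistic loss on labeled margin
    vectors (rows of Z_l), and a Gaussian-kernel term on unlabeled ones (Z_u)."""
    labeled = np.sum(np.log1p(np.exp(-Z_l @ w)))
    unlabeled = np.sum(np.exp(-(Z_u @ w) ** 2 / delta))
    return np.sum(np.abs(w)) + alpha * labeled + beta * unlabeled
```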

3 Semi-supervised Logistic I-RELIEF for Multi-classification

SLIR can only deal with binary problems. To solve this issue, we propose the MSLIR algorithm for multi-classification based on SLIR. Given a labeled sample set \(D _{l}=\{(\varvec{\mathrm {x}}^{l}_{i},y^{l}_{i})\}^{L}_{i=1}\) with \(y^{l}_{i}\in \{1,2,\dots ,c\}\), where c is the number of classes, and an unlabeled sample set \(D _{u}=\{\varvec{\mathrm {x}}^{u}_{i}\}^{U}_{i=1}\), the goal of MSLIR is to find feature weights for multi-classification tasks in the semi-supervised setting. There are \(L+U\) samples in total \((L \ll U)\). MSLIR designs a novel scheme to calculate the margin vectors of unlabeled samples, based on which MSLIR can be formulated as an optimization problem similar to SLIR. In the following, we discuss MSLIR in detail.

3.1 Calculation of Margin Vectors

MSLIR needs to calculate a margin vector for each sample, whether labeled or unlabeled. The margin vector \(\varvec{\mathrm {z}}^{l}_{i}\) of the labeled sample \(\varvec{\mathrm {x}}^{l}_{i}\) can be expressed as follows:

$$\begin{aligned} \varvec{\mathrm {z}}^l_{i}&=\sum _{\varvec{\mathrm {x}}^{l}_{k}\in M _{i}}{} P (\varvec{\mathrm {x}}^{l}_{k}=NM (\varvec{\mathrm {x}}^{l}_{i})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}^{l}_{i}-\varvec{\mathrm {x}}^{l}_{k}|\nonumber \\&-\sum _{\varvec{\mathrm {x}}^{l}_{k}\in H _{i}}{} P (\varvec{\mathrm {x}}^{l}_{k}=NH (\varvec{\mathrm {x}}^{l}_{i})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}^{l}_{i}-\varvec{\mathrm {x}}^{l}_{k}| \end{aligned}$$
(4)

where the set \(M _{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}\ne y^{l}_{i},k=1,\dots ,L\}\) contains all labeled samples whose labels differ from that of \(\varvec{\mathrm {x}}^{l}_{i}\), and the set \(H _{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}=y^{l}_{i},k=1,\dots ,L\}\) contains all labeled samples with the same label as \(\varvec{\mathrm {x}}^{l}_{i}\), with \(y^{l}_{i}\in \{1,\dots ,c\}\).

For the multi-classification problem, the sample \(\varvec{\mathrm {x}}^{u}_{i}\) in the data set \(D _{u}\) may belong to any class; hence, the calculation of the margin vectors of unlabeled samples needs to be redefined. Suppose that we assign a temporary label to the unlabeled sample \(\varvec{\mathrm {x}}^{u}_{i}\). Then we can calculate the candidate margin vector \((\varvec{\mathrm {z}}^{u}_{i})^{j}\) of the unlabeled sample \(\varvec{\mathrm {x}}^{u}_{i}\) with the assigned label j as follows:

$$\begin{aligned}&(\varvec{\mathrm {z}}^{u}_{i})^{j}= \sum _{\varvec{\mathrm {x}}^{l}_{k}\in \overline{M }^{j}_{i},y^{l}_{k}\ne j }{} P (\varvec{\mathrm {x}}^{l}_{k}=NM ^{j}(\varvec{\mathrm {x}}_{i}^{u})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}_{i}^{u}-\varvec{\mathrm {x}}^{l}_{k}|\nonumber \\&\quad \quad -\sum _{\varvec{\mathrm {x}}^{l}_{k}\in \overline{H }^{j}_{i},y^{l}_{k}=j}{} P (\varvec{\mathrm {x}}^{l}_{k}=NH ^{j}(\varvec{\mathrm {x}}^{u}_{i})|\varvec{\mathrm {w}})|\varvec{\mathrm {x}}^{u}_{i}-\varvec{\mathrm {x}}^{l}_{k}| \end{aligned}$$
(5)

where \(\overline{M }^{j}_{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}\ne j,k=1,\dots ,L, j=1,\dots ,c\}\) is the sample set that contains all labeled samples whose label is not equal to j, and \(\overline{H }^{j}_{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}=j,k=1,\dots ,L\}\) is the sample set that contains all labeled samples whose label is equal to j.

Which of the c candidate margin vectors is the actual margin vector of the sample \(\varvec{\mathrm {x}}^{u}_{i}\)? Without any a priori information, we take the candidate margin vector that has the greatest inner product with the feature weight vector \(\varvec{\mathrm {w}}\) as the most plausible one. Then \(\varvec{\mathrm {z}}^{u}_{i}\) is determined by:

$$\begin{aligned} \varvec{\mathrm {z}}^{u}_{i}=(\varvec{\mathrm {z}}^{u}_{i})^{j^{*}},\quad j^{*}=\arg \max _{j=1,\dots ,c}\varvec{\mathrm {w}}^{T}(\varvec{\mathrm {z}}^{u}_{i})^{j} \end{aligned}$$
(6)
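The following sketch illustrates the unlabeled-margin computation of Eqs. (5)-(6): one candidate margin per temporary label, after which the candidate with the largest inner product with \(\varvec{\mathrm {w}}\) is kept. It reuses the hypothetical `margin_vector` helper sketched in Sect. 2.1, and all names are illustrative.

```python
import numpy as np

def unlabeled_margin(x_u, X_l, y_l, w, classes, sigma=1.0):
    """Margin vector of an unlabeled sample x_u via Eqs. (5)-(6).

    For each temporary label j the candidate margin (z_u)^j is computed from
    the labeled data only; the candidate maximizing w^T (z_u)^j is kept.
    """
    candidates = []
    for j in classes:
        hits = X_l[y_l == j]      # labeled samples with label j    (set H-bar^j)
        misses = X_l[y_l != j]    # labeled samples with label != j (set M-bar^j)
        candidates.append(margin_vector(x_u, hits, misses, w, sigma))
    scores = np.array([w @ z for z in candidates])
    return candidates[int(np.argmax(scores))]
```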

3.2 Optimization Problem and Algorithm Description

After the margin vectors of labeled and unlabeled samples are defined, MSLIR solves the following optimization problem:

$$\begin{aligned} \min _{\varvec{\mathrm {w}}} \Vert \varvec{\mathrm {w}}\Vert _{1}&+\alpha \sum ^{L}_{i=1}\log (1+\exp (-\varvec{\mathrm {w}}^{T}\varvec{\mathrm {z}}^{l}_{i}))\nonumber \\ {}&+ \beta \sum ^{U}_{i=1}\log (1+\exp (-\varvec{\mathrm {w}}^{T}\varvec{\mathrm {z}}^{u}_{i}))\nonumber \\ s.t. \quad \varvec{\mathrm {w}}\ge 0 \end{aligned}$$
(7)

where \(\alpha \ge 0\) and \(\beta \ge 0\) are regularization parameters that control the importance of labeled and unlabeled samples, respectively. Comparing (3) and (7), there are two differences between them. First, \(\varvec{\mathrm {z}}^{u}_{i}\) is calculated differently: in (3), the calculation of \(\varvec{\mathrm {z}}^{u}_{i}\) does not consider the label of \(\varvec{\mathrm {x}}^{u}_{i}\), whereas in (7) it takes the possible label of \(\varvec{\mathrm {x}}^{u}_{i}\) into account. Second, (7) uses the same logistic regression form for both labeled and unlabeled samples, which keeps the treatment of labeled and unlabeled samples symmetric.

To facilitate the calculation, we convert the constrained optimization problem (7) into an unconstrained one. Let \(\varvec{\mathrm {w}}=[v^{2}_{1},\dots ,v^{2}_{I}]^{T}\) and \(\varvec{\mathrm {v}}=[v_{1},\dots ,v_{I}]^{T}\). The optimization problem (7) can then be rewritten as:

$$\begin{aligned} \min _{\varvec{\mathrm {v}}} J=\Vert \varvec{\mathrm {v}}\Vert ^{2}_{2}&+\alpha \sum ^{L}_{i=1}\log (1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id}))\nonumber \\&+\beta \sum ^{U}_{i=1}\log (1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})) \end{aligned}$$
(8)

where \(\varvec{\mathrm {z}}^{l}_{i}=[z^{l}_{i1},\dots ,z^{l}_{iI}]\) and \(\varvec{\mathrm {z}}^{u}_{i}=[z^{u}_{i1},\dots ,z^{u}_{iI}]\). To solve (8), we use the gradient descent method. The derivative of J with respect to \(\varvec{\mathrm {v}}\) can be written as follows:

$$\begin{aligned} \frac{\partial J}{\partial v_{k}}= 2v_{k}&-\alpha \sum ^{L}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})(2v_{k}z^{l}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})}\nonumber \\&-\beta \sum ^{U}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})(2v_{k}z^{u}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})} \end{aligned}$$
(9)

Let \(Q _{1}=\sum ^{L}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})(v_{k}z^{l}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})}\) and \(Q _{2}=\sum ^{U}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})(v_{k}z^{u}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})}\). The update formula for each dimension can then be described as:

$$\begin{aligned} v_{k}\leftarrow v_{k}-\eta (v_{k}-\alpha Q _{1}-\beta Q _{2}) \end{aligned}$$
(10)
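A minimal sketch of one such gradient step on (8) via the update (10), using the parameterization \(\varvec{\mathrm {w}}=\varvec{\mathrm {v}}\circ \varvec{\mathrm {v}}\); the function and argument names are illustrative.

```python
import numpy as np

def update_v(v, Z_l, Z_u, alpha, beta, eta):
    """One gradient-descent step of Eq. (10) on the unconstrained problem (8).

    Z_l (L x I) and Z_u (U x I) hold the labeled and unlabeled margin vectors;
    v is the square-root parameterization of the feature weights, w = v**2.
    """
    w = v ** 2

    def q(Z):
        # per-sample factor exp(-w^T z) / (1 + exp(-w^T z)), then Q per dimension
        s = np.exp(-Z @ w) / (1.0 + np.exp(-Z @ w))
        return (s[:, None] * Z).sum(axis=0) * v

    return v - eta * (v - alpha * q(Z_l) - beta * q(Z_u))
```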

where \(\eta \) is the learning rate. The pseudo-code of MSLIR is shown in Algorithm 1. The algorithm alternately recomputes the margin vectors and updates the weight vector until convergence.

The computational complexity of MSLIR consists of three parts: the calculation of the margin vectors for labeled samples, the calculation of the margin vectors for unlabeled samples, and the solution of the optimization problem (8). In MSLIR, the complexity of solving (8) and of calculating the margin vectors for labeled samples is identical to that of SLIR. The complexity of calculating the margin vectors for unlabeled samples is O(cdLU) in MSLIR, whereas SLIR needs O(dLU) for the same task. In a nutshell, MSLIR has a complexity comparable to that of SLIR.

Algorithm 1. The pseudo-code of MSLIR.
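Since the original pseudo-code figure is not reproduced here, the following rough sketch summarizes the alternating procedure described above: recompute all margin vectors under the current weights, take one gradient step on \(\varvec{\mathrm {v}}\), and repeat until convergence. It relies on the hypothetical helpers `margin_vector`, `unlabeled_margin` and `update_v` sketched earlier and is not the authors' implementation.

```python
import numpy as np

def mslir(X_l, y_l, X_u, classes, alpha, beta, eta=0.03, sigma=1.0,
          max_iter=100, tol=1e-4):
    """Sketch of MSLIR training (Algorithm 1).  For brevity a labeled sample
    is not excluded from its own hit set, a simplification."""
    v = np.ones(X_l.shape[1])
    for _ in range(max_iter):
        w = v ** 2
        Z_l = np.array([margin_vector(x, X_l[y_l == y], X_l[y_l != y], w, sigma)
                        for x, y in zip(X_l, y_l)])                     # Eq. (4)
        Z_u = np.array([unlabeled_margin(x, X_l, y_l, w, classes, sigma)
                        for x in X_u])                                  # Eqs. (5)-(6)
        v_new = update_v(v, Z_l, Z_u, alpha, beta, eta)                 # Eq. (10)
        if np.linalg.norm(v_new - v) < tol:
            return v_new ** 2
        v = v_new
    return v ** 2   # learned feature weights w
```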

4 Experiments

In this section, we evaluate the performance of the proposed MSLIR algorithm on one artificial dataset and eight UCI datasets [17]. The compared methods include LIR, SLIR and RELIEF-F. For each dataset, we added 100 irrelevant features to each sample, sampled independently from a zero-mean, unit-variance Gaussian distribution, and then normalized these irrelevant features together with the original data. All the experiments are implemented in MATLAB R2013a on a PC with an Intel Core i5 processor and 4 GB RAM.
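A short sketch of this data preparation step (appending 100 standard-Gaussian irrelevant features and normalizing all features jointly); z-score normalization is an assumption, since the paper does not specify the normalization used.

```python
import numpy as np

def add_irrelevant_features(X, n_noise=100, seed=0):
    """Append n_noise zero-mean, unit-variance Gaussian features and normalize
    all features jointly (z-score normalization is assumed here)."""
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([X, rng.standard_normal((X.shape[0], n_noise))])
    return (X_aug - X_aug.mean(axis=0)) / X_aug.std(axis=0)
```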

Fig. 1. "ThreeCircles" dataset: (a) training dataset for LIR, (b) training dataset for MSLIR, feature weights estimated by (c) LIR and (d) MSLIR.

4.1 Artificial Dataset

We conduct experiments on the "ThreeCircles" dataset to verify the ability of the proposed algorithm in feature selection. There are 3 classes in the "ThreeCircles" dataset, as shown in Fig. 1(a) and (b). In Fig. 1(a), each class has 51 labeled samples. In Fig. 1(b), each class also has 51 labeled samples as in Fig. 1(a), plus 3450 unlabeled ones. LIR and MSLIR are compared in terms of feature selection performance. In MSLIR, the regularization parameters are set to \(\alpha =6\) and \(\beta =3\); the kernel width is \(\delta =8\) for both MSLIR and LIR.

In Fig. 1(c), LIR fails to identify the useful features since the weight of the first feature is equal to zero. Figure 1(d) shows that the first two features selected by MSLIR have the highest weights, while the weights of the other noisy features are nearly zero. The results in Fig. 1(c) and (d) indicate that MSLIR can successfully identify the first two useful features using labeled and unlabeled data.

4.2 UCI Datasets

In this section, we use eight UCI datasets to verify the performance of the algorithms. The eight UCI datasets are the Wine, Vehicle, Yeast, Iris, Wdbc, Heart, Pima and Sonar datasets, of which the first four are multi-class and the rest are two-class. The datasets are randomly divided into independent training and test subsets, where the training subsets contain labeled and unlabeled data. Semi-supervised methods use both labeled and unlabeled data, while supervised methods use only labeled data. The data information is summarized in Table 1, where "#Training" and "#Test" represent the number of training and test samples, respectively. The total number of "#Labeled" and "#Unlabeled" data is equal to the number of training samples. "#Feature" represents the dimension of the dataset.

The classifier used in our experiments is the support vector machine (SVM). The Gaussian kernel parameter and the regularization parameter of SVM are determined by grid search, where both parameters vary from \(2^{-10}\) to \(2^{10}\).
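A hedged sketch of this classifier setup: a Gaussian-kernel SVM whose regularization and kernel parameters are grid-searched over \(2^{-10},\dots ,2^{10}\). Scikit-learn is used purely for illustration (the paper's experiments were run in MATLAB), and the 5-fold cross-validation inside the search is an assumption.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Both SVM parameters are searched over 2^-10 ... 2^10 as stated above.
param_grid = {"C": 2.0 ** np.arange(-10, 11),
              "gamma": 2.0 ** np.arange(-10, 11)}
svm_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# svm_search.fit(X_train_selected, y_train)   # features chosen by MSLIR
```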

Table 1. Information of the eight UCI datasets.
Fig. 2. Classification accuracy of MSLIR, RELIEF-F and LIR on UCI datasets: (a) Wine, (b) Vehicle, (c) Yeast and (d) Iris.

Multi-classification Datasets. We compare the proposed feature selection algorithm with RELIEF-F and LIR, both of which are supervised algorithms for multi-classification problems. We use cross-validation to select the parameters \(\alpha \) and \(\beta \) in MSLIR. The regularization parameter and learning rate in LIR follow the settings in [1]: \(\lambda =10\) and \(\eta =0.03\). Experiments are conducted on the Wine, Vehicle, Yeast and Iris datasets, each of which is randomly partitioned 10 times, and we report the average results. The curves of the average classification accuracy vs. the n top-ranked features are shown in Fig. 2, where n ranges up to the original feature dimension of each dataset. In Fig. 2(a), (b) and (c), the classification accuracy of the three algorithms gradually increases as the number of selected features increases, which indicates that the useful features are gradually found. We can see that MSLIR is always better than LIR and RELIEF-F, which may indicate that the features selected by LIR and RELIEF-F contain noise features. In Fig. 2(d), MSLIR achieves the best classification accuracy with only one feature, whereas LIR and RELIEF-F need two features. Obviously, MSLIR has significantly higher performance than the other two algorithms. The best average accuracy and the corresponding standard deviation are given in Table 2. We can observe that MSLIR yields a great improvement on the Wine and Vehicle datasets.
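The accuracy-vs-number-of-features curves in Fig. 2 could be produced along the following lines: rank the features by their learned weights and retrain the classifier on the top-n features for each n. The helper names (`accuracy_curve`, `fit_and_score`) are illustrative, not the authors' code.

```python
import numpy as np

def accuracy_curve(w, X_tr, y_tr, X_te, y_te, fit_and_score, max_n=None):
    """Test accuracy of a classifier retrained on the top-n ranked features,
    for n = 1 .. max_n (e.g. the original feature dimension).  fit_and_score
    is any routine returning accuracy, such as the grid-searched SVM above."""
    order = np.argsort(w)[::-1]                 # features by decreasing weight
    max_n = max_n or len(w)
    return [fit_and_score(X_tr[:, order[:n]], y_tr, X_te[:, order[:n]], y_te)
            for n in range(1, max_n + 1)]
```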

Table 2. Classification accuracy and standard deviations (%) of SVM using features selected by MSLIR, LIR and RELIEF-F
Fig. 3. Feature weights obtained by (a) MSLIR, (b) LIR and (c) RELIEF-F on the Wine dataset.

In Fig. 3, we show the feature weights obtained by the three algorithms on the Wine dataset. We can observe that MSLIR correctly selects the relevant features, while LIR assigns higher weights to some irrelevant features, and RELIEF-F fails to identify the noise features. In summary, MSLIR can handle feature selection for multi-classification under the semi-supervised framework. The supervised approaches do not remove the noise features well with only a few labeled data. Thanks to the introduction of unlabeled data, MSLIR can effectively eliminate noise features.

Table 3. Classification accuracy and standard deviations (%) of SVM using features selected by MSLIR and SLIR
Fig. 4. Classification accuracy of MSLIR and SLIR on UCI datasets: (a) Heart, (b) Wdbc, (c) Pima and (d) Sonar.

Fig. 5. Feature weights obtained by (a) MSLIR and (b) SLIR on the Sonar dataset.

Binary Classification Datasets. We conduct experiments to compare MSLIR with SLIR and to show that MSLIR is also suitable for binary problems. SLIR is a semi-supervised feature selection method for binary problems. In MSLIR, the parameters are set in the same way as in the multi-classification case; in SLIR, the parameters \(\alpha \) and \(\beta \) are determined by cross-validation. Experiments are conducted on the Heart, Wdbc, Pima and Sonar datasets, each of which is randomly divided 10 times, and the average accuracy is taken as the final result.

The results are shown in Fig. 4. In Fig. 4(a) and (c), we can observe that the classification accuracies of MSLIR and SLIR are steady. However, MSLIR is much better than SLIR, which indicates that MSLIR does not choose noise features. Figure 4(b) shows that MSLIR reaches its best accuracy with 6 features, whereas SLIR needs 25, which implies that MSLIR chooses fewer features to reach better accuracy. The best average classification accuracy and the corresponding standard deviations are listed in Table 3. Compared to SLIR, the accuracy of MSLIR is improved by 3.2%, 3.85%, 2.27% and 2.80% on the Heart, Wdbc, Pima and Sonar datasets, respectively.

Figure 5 shows the distribution of feature weights on the Sonar dataset. We can see that MSLIR selects the relevant features, while SLIR assigns higher weights to noise features and fails to distinguish the relevant ones.

Compared to SLIR, MSLIR can effectively solve binary problems and improve the classification accuracy.

5 Conclusion

In this paper, we propose MSLIR based on SLIR. MSLIR overcomes the drawback of SLIR in that it can make full use of unlabeled information to select relevant features on multi-class data. The results on one artificial dataset demonstrate that MSLIR can effectively handle noise data. When dealing with multi-class classification tasks, MSLIR can extract effective features under the semi-supervised framework compared to supervised methods. For binary classification problems, MSLIR achieves better performance than SLIR even though both methods are semi-supervised.