Abstract
The semi-supervised Logistic I-RELIEF (SLIR) algorithm has been proposed for feature selection, which can handle both labeled and unlabeled data to perform feature selection. However, SLIR can only deal with binary problems. To remedy it, this paper presents a multi-classification semi-supervised Logistic I-RELIEF (MSLIR) algorithm for feature selection. Based on SLIR, MSLIR designs a novel scheme to calculate the margin vectors of unlabeled samples. Experimental results demonstrate the efficiency and effectiveness of our algorithm.
Access provided by CONRICYT-eBooks. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
High-dimensional data, such as DNA microarray data, medical data, and satellite remote sensing data, may contain irrelevant information, which generally exists in machine learning and pattern recognition. It is meaningful to use feature selection or feature extraction as a pre-processing way for reducing the negative influence of irrelevant data. Feature selection, as a dimensionality reduction technology makes great contributions to saving storage space and reducing the computational cost. Therefore, feature selection is widely applied to many learning tasks.
Feature selection methods can be categorized into three groups: supervised, unsupervised and semi-supervised methods. Supervised feature selection methods include RELIEF-based methods [1,2,3], Fisher criterion-based methods [4, 5], etc. RELIEF is one of the most effective feature selection algorithms, which finds a weight vector for features by maximizing the margin between the differences of given samples and their nearest neighbors in two classes. The features with larger weights could be selected to perform classification tasks. The variants of RELIEF have been proposed, such as Logistic I-RELIEF (LIR) [1] and RELIEF-F [3]. Supervised feature selection methods require a large amount of labeled data which is hard to obtain. It is difficult for supervised feature selection methods to choose features that are distinguishable from few labeled data in the training dataset. Therefore, many unsupervised feature selection methods have been proposed [6,7,8], which can effectively utilize unlabeled data. However, these methods ignore the information contained in labeled data, so they cannot identify the discriminative features well [7].
The semi-supervised feature selection methods can achieve better performance since they make full use of the labeled and unlabeled data. Common methods include the clustering-based method [9], locality sensitive-based method [10], local discriminative information-based method [11], semi-supervised Logistic I-RELIEF method (SLIR) [12], forward method [13, 14], multi-objective optimization method [15], spectral analysis method [16], and so on. This paper focuses on the RELIEF-based methods. In SLIR, labeled data are used to maximize the distance between samples in different classes, while the unlabeled data samples are used to extract the geometric structure in a data space [12]. SLIR is the first and successful variant of RELIEF in semi-supervised learning. However, SLIR can only deal with binary problems and cannot effectively solve the multi-class problems.
To solve this issue, we propose a multi-classification semi-supervised Logistic I-RELIEF (MSLIR) algorithm for feature selection based on SLIR. Under the semi-supervised learning framework, RELIEF-based methods need to consider how to compute the margin vector of an unlabeled sample. MSLIR designs a novel scheme to calculate the margin vectors of unlabeled samples for multi-classification problems.
The rest of paper is arranged as follows. Section 2 briefly introduces related work include Logistic I-RELIEF and semi-supervised Logistic I-RELIEF. We propose multi-classification semi-supervised Logistic I-RELIEF algorithm in Sect. 3. Experimental results are presented in Sect. 4. The paper is concluded in Sect. 5.
2 Related Work
This section mainly introduces two algorithms: LIR and SLIR, where SLIR is an extended algorithm of LIR.
2.1 Logistic I-RELIEF
In I-RELIEF, neighbors of a given sample in the original feature space is inconsistent with the nearest neighbors in the weighted feature space, so LIR proposes a new probabilistic model to calculate the distance between samples and their neighbors.
A assume that the input dataset is \(D =\{(\varvec{\mathrm {x}}_{n},y_{n})\}^{N}_{n=1}\subset R ^{I}\times \{+1,-1\}\), where \(\varvec{\mathrm {x}}_{n}\) is a labeled sample and \(y_{n}\) is its label, I is the number of original features, and N is the number of samples. Here, we discuss the binary problem: +1 and −1 represent positive and negative classes, respectively.
The optimization problem of LIR can be described as:
where \(\Vert \cdot \Vert _{1}\) is the 1-norm, \(\varvec{\mathrm {w}}\) is the feature weight vector which represents the importance of features, \(\lambda \ge 0\) is a regularization parameter to avoid overfitting, and \(\varvec{\mathrm {z}}_n\) is the margin vector of the sample \(\varvec{\mathrm {x}}_{n}\) which can be written as follows:
where \(M _{n}=\{\varvec{\mathrm {x}}_{i}|(\varvec{\mathrm {x}}_{i},y_{i})\in D ,y_{i}\ne y_{n},i=1,\dots ,N\}, H _{n}=\{\varvec{\mathrm {x}}_{i}|(\varvec{\mathrm {x}}_{i},y_{i})\in D ,y_{i}=y_{n},i=1,\dots ,N\}, P (\varvec{\mathrm {x}}_{i}=NM (\varvec{\mathrm {x}}_{n})|\varvec{\mathrm {w}})\) and \(P (\varvec{\mathrm {x}}_{i}=NH (\varvec{\mathrm {x}}_{n})|\varvec{\mathrm {w}})\) are the probabilities that the sample \(\varvec{\mathrm {x}}_{i}\) is the nearest miss and the nearest hit of \(\varvec{\mathrm {x}}_{n}\), respectively, \(NM (\varvec{\mathrm {x}}_{n})\) represents the nearest miss of sample \(\varvec{\mathrm {x}}_{n}\) and \(NH (\varvec{\mathrm {x}}_{n})\) represents the nearest hit of sample \(\varvec{\mathrm {x}}_{n}\).
2.2 Semi-supervised Logistic I-RELIEF
On the basis of Logistic I-RELIEF algorithm, Sun et al. [12] extended Logistic I-RELIEF to semi-supervised learning. Due to the introduction of unlabeled samples, the calculation for the margin vectors of unlabeled samples needs to be revised.
We are given a labeled sample set \(D _{l}=\{(\varvec{\mathrm {x}}^{l}_{i},y^{l}_{i})\}^{L}_{i=1} (y_i\in \{\pm 1\})\) and an unlabeled sample set \(D _{u}=\{\varvec{\mathrm {x}}^{u}_{i}\}^{U}_{i=1}\). For a given labeled sample \(\varvec{\mathrm {x}}^{l}\) and unlabeled sample \(\varvec{\mathrm {x}}^{u}\), let their margin vector be \(\varvec{\mathrm {z}}^{l}\) and \(\varvec{\mathrm {z}}^{u}\), respectively. Since \(\varvec{\mathrm {x}}^{l}\) is a labeled sample, its margin vector \(\varvec{\mathrm {z}}^{l}\) can be computed as Eq. (2). Fortunately, \(\varvec{\mathrm {z}}^{u}\) has the same absolute value, regardless of which class the sample \(\varvec{\mathrm {x}}^{u}\) belongs to. Therefore, \(\varvec{\mathrm {z}}^{u}\) can use the same way of computing \(\varvec{\mathrm {z}}_{n}\). SLIR can be cast into the following optimization problem:
where \(\delta \) is the kernel width, \(\varvec{\mathrm {z}}^{l}_{i}\) is margin vector of sample \(\varvec{\mathrm {x}}^{l}_{i}\), \(\varvec{\mathrm {z}}^{u}_{i}\) is margin vector of sample \(\varvec{\mathrm {x}}^{u}_{i}\), \(\alpha \) and \(\beta \) represent the contribution of labeled and unlabeled samples to the cost function (3), respectively. Then (3) can be solved by using the gradient descent method.
3 Semi-supervised Logistic I-RELIEF for Multi-classification
SLIR can only deal with bianry problems. To solve this issue, we propose the MSLIR algorithm for multi-classification based on SLIR. Given a labeled sample set \(D _{l}=\{(\varvec{\mathrm {x}}^{l}_{i},y^{l}_{i})\}^{L}_{i=1}\) with \(y^{l}_{i}\in \{1,2,\dots ,c\}\) and the class number c, and an unlabeled sample set \(D _{u}=\{\varvec{\mathrm {x}}^{u}_{i}\}^{U}_{i=1}\), the goal of MSLIR is to find feature weights for multi-classification tasks under the semi-supervised situation. There are \(L+U\) samples in total \((L \ll U)\). MSLIR designs a novel scheme to calculate the margin vectors of unlabeled samples, based on which MSLIR can be formulated as an optimization problem similar to SLIR. In the following, we discuss MSLIR in detail.
3.1 Calculation of Margin Vectors
MSLIR requires to calculate margin vector for each sample including labeled and unlabeled one. The margin vector \(\varvec{\mathrm {z}}^{l}_{i}\) of the labeled sample \(\varvec{\mathrm {x}}^{l}_{i}\) can be expressed as follows:
where the set \(M _{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}\ne y^{l}_{i},i=1,\dots ,L, y^{l}_{i}=\{1,\dots ,c\}\}\) contains all labeled samples that have different labels from \(\varvec{\mathrm {x}}^{l}_{i}\), the set \(H _{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}=y^{l}_{i},i=1,\dots ,L\}\) contains all labeled samples that have the same labels as \(\varvec{\mathrm {x}}^{l}_{i}\).
For the multi-classification problem, the sample \(\varvec{\mathrm {x}}^{u}_{i}\) in the data set \(D _{u}\) may belong to any class, hence, the calculation of the margin vectors of unlabeled samples needs to be redefined. Suppose that we assign a temporary label to the unlabeled sample \(\varvec{\mathrm {x}}^{u}_{i}\). Then we can calculate the candidate margin vector \((\varvec{\mathrm {z}}^{u}_{i})^{j}\) of the unlabeled sample \(\varvec{\mathrm {x}}^{u}_{i}\) with the assigned label j as follows:
where \(\overline{M }^{j}_{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}\ne j,k=1,\dots ,L, j=1,\dots ,c\}\) is the sample set that contains all labeled samples whose label is not equal to j, and \(\overline{H }^{j}_{i}=\{\varvec{\mathrm {x}}^{l}_{k}|(\varvec{\mathrm {x}}^{l}_{k},y^{l}_{k})\in D _{l},y^{l}_{k}=j,k=1,\dots ,L\}\) is the sample set that contains all labeled samples whose label is equal to j.
Which is the possible margin vector of the sample \(\varvec{\mathrm {x}}^{u}_{i}\) among the c candidate margin vectors? Without any apriori information, we consider the candidate margin vector which has the greatest inner product with the feature weight vector \(\varvec{\mathrm {w}}\) as the possible one. Then \(\varvec{\mathrm {z}}^{u}_{i}\) can be determined by:
3.2 Optimization Problem and Algorithm Description
After the margin vectors of labeled and unlabeled samples are defined, MSLIR is to solve the following optimization problem:
where \(\alpha \ge 0\) and \(\beta \ge 0\) are the regularization parameters that control the importance of labeled and unlabeled samples, respectively. By comparing (3) and (7), we find that there are two differences between them. First, the calculation way of \(\varvec{\mathrm {z}}^{u}_{i}\) is different. In (3), calculating \(\varvec{\mathrm {z}}^{u}_{i}\) does not consider the label of \(\varvec{\mathrm {x}}^{u}_{i}\). Contrary, the calculation of \(\varvec{\mathrm {z}}^{u}_{i}\) takes into account the possible label of \(\varvec{\mathrm {x}}^{u}_{i}\). Second, (7) uses the same logical regression form for both labeled and unlabeled samples to ensure the symmetry of labeled and unlabeled samples.
In order to facilitate calculation, we convert the optimization problem (7) to an unconstraint optimization problem. Let \(\varvec{\mathrm {w}}=[v^{2}_{1},\dots ,v^{2}_{I}]^{T}\) and \(\varvec{\mathrm {v}}=[v_{1},\dots ,v_{I}]\). The optimization formula (7) can be rewritten as:
where \(\varvec{\mathrm {z}}^{l}_{i}=[z^{l}_{i1},\dots ,z^{l}_{iI}]\). In order to solve formula (8), we use the gradient descent method. The derivation of J to \(\varvec{\mathrm {v}}\) can be written as follows:
Let \(Q _{1}=\sum ^{L}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})(v_{k}z^{l}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{l}_{id})}\) and \(Q _{2}=\sum ^{U}_{i=1}\frac{\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})(v_{k}z^{u}_{ik})}{1+\exp (-\sum ^{I}_{d=1}v^{2}_{d}z^{u}_{id})}\), the update formulation for each dimension can be described as:
where \(\eta \) is the learning rate. The pseudo-code of MSLIR is shown in Algorithm 1. The algorithm alternatively modifies the weight vector until convergence.
The computational complexity of MSLIR includes three parts: the calculation of the margin vectors for both labeled and unlabeled samples, and the solution to the optimization problem (8). In MSLIR, the computational complexity of solving (8) and calculating the margin vectors for labeled samples are identical to SLIR. The computational complexity of calculating the margin vectors for unlabeled samples is O(cdLU) in MSLIR. For the same task, SLIR has the computational complexity of O(dLU). In a nutshell, MSLIR has a compared complexity with SLIR.
4 Experiments
In this section, we evaluate the performance of the proposed MSLIR algorithm on one artificial dataset and eight UCI datasets [17]. The compared methods include LIR, SLIR and RELEIF-F. In each dataset, we added 100 irrelevant features to each sample, which are independently sampled from zero-mean, unit-variance Gaussian distribution, then we normalize these irrelevant features with the original data. All the experiments are implement in MATLAB R2013a on a PC with an Inter Core I5 processor with 4 GB RAM.
4.1 Artificial Dataset
We conduct experiments on the “ThreeCircles” dataset to verify the ability of the proposed algorithm in feature selection. There are 3 classes in the “ThreeCircles” dataset, as shown in Fig. 1(a) and (b). In Fig. 1(a), each class has 51 labeled samples. In Fig. 1(b), each class also has 51 labeled samples as Fig. 1(a), and additionally 3450 unlabeled ones. LIR and MSLIR are conducted to compare the performance in feature selection. In MSLIR, let regularization parameters \(\alpha =6\) and \(\beta =3\). Let kernel width \(\delta =8\) for MSLIR and LIR.
In Fig. 1(c), LIR fails to identify the useful feature since the weight of the first feature is equal to zero. Figure 1(d) shows that the first two features selected by MSLIR has the highest weights and the weights of other noisy features are nearly zero. The results in Fig. 1(c) and (d) indicate that MSLIR can successfully identify the first two useful features using labeled and unlabeled data.
4.2 UCI Datasets
In this section, we use eight UCI datasets to verify the performance of algorithms. The eight UCI datasets include Wine, Vehicle, Yeast, Iris, Wdbc, Heart, Pima and Sonar datasets, of which the first four datasets are multi-class ones, and the rest are two-class ones. The datasets are randomly divided into independent training and test subsets, where training subsets contain labeled and unlabeled data. Semi-supervised methods use both labeled and unlabeled data, and supervised methods only use labeled data. The data information is summarized in Table 1, where “#Training” and “#Test” represent the number of training samples and test samples, respectively. The total number of “#Labeled” and “#Unlabeled” data is equal to the number of training samples. “#Feature” represents the dimension of dataset.
The classifier utilized in our experiments is support vector machine (SVM). The Gaussian kernel parameter and regularization parameter of SVM are determined by the grid search method, where both parameters vary from \(2^{-10}\) to \(2^{10}\).
Multi-classification Datasets. We compare the proposed feature selection algorithm with RELIEF-F and LIR, which both are supervised algorithms for solving multi-classification problems. We use the cross-validation method to select parameter \(\alpha \) and \(\beta \) in MSLIR. The regularization parameter and learning rate in LIR follow the setting in [1]: \(\lambda =10\) and \(\eta =0.03\). Experiments are implemented on the Wine, Vehicle, Yeast and Iris datasets, each of which is randomly partitioned 10 times. We report the average results. The curves of the average classification accuracy vs. the n top-ranked features are shown in Fig. 2, where n is the original feature dimension of datasets. In Fig. 2(a), (b) and (c), the classification accuracy of the three algorithms also gradually increases as the number of selected features increases, which indicates that the useful features are gradually found. We can see that SMLIR is always better than LIR and RELIEF-F, which may indicate LIR and RELIEF-F algorithms contain noise features. In Fig. 2(d), MSLIR achieves the best classification accuracy when one feature is used, and LIR and RELIEF-F need two features. Obviously, MSLIR has a significantly higher performance than the other two algorithms. The best average accuracy and the corresponding standard deviation are given in Table 2. We can observe that MSLIR has a great improvement on Wine and Vehicle datasets.
In Fig. 3, we give the feature weights of the three algorithms on the Wine dataset. We can observe that MSLIR correctly selects relevant features, while LIR assigns higher weights to some irrelevant features, and RELEIF-F fails to identify noise data. In summary, MSLIR can handle feature selection problems for multi-classification under the semi-supervised framework. The supervised approaches do not remove the noise features well with few labeled data. Because of the introduction of unlabeled data, MSLIR can effectively eliminate noise features.
Binary Classification Datasets. We conduct experiments to compare MSLIR with SLIR and prove that MSLIR is also suitable for binary problems. SLIR is a semi-supervised feature selection method for solving binary problems. In MSLIR, the way of parameter setting is the same as the multi-classification case. The parameters \(\alpha \) and \(\beta \) are determined by the cross-validation method in SLIR. Experiments are implemented on the Heart, Wdbc, Pima and Sonar datasets, each of which is randomly divided 10 times and the average accuracy is taken as the final results.
The results are shown in Fig. 4. In Figs. 4(a) and (c), we can observe that classification accuracy of MSLIR and SLIR are steady. However, MSLIR is much better than SLIR, which indicates MSLIR does not choose noise features. Figure 4(b) shows that SMLIR reaches the best accuracy when the number of features is 6, and SLIR when the number of features is 25, which implies that MSLIR chooses few features to reach the better accuracy. The best average classification accuracy and corresponding standard deviations are listed in Table 3. Compared to SLIR, the accuracy of MSLIR is improved 3.2%, 3.85%, 2.27% and 2.80% on Heart, Wdbc, Pima and Sonar datasets, respectively.
Figure 5 shows the distribution of feature weights on the Sonar dataset, we can see that MSLIR selects relevant features, while SLIR evaluates noise features higher weights and fails to distinguish relevant features.
Compared to SLIR, MSLIR can effectively solve the binary problems, and improve the classification accuracy.
5 Conclusion
In this paper, we propose MSLIR based on SLIR. MSLIR overcomes the drawback of SLIR, which can make full use of unlabeled information to select relevant features on multi-class of data. The results on one artificial dataset demonstrate that MSLIR can effectively handle noise data. When dealing with multi-class classification tasks, MSLIR can extract effective features under the semi-supervised framework compared to supervised methods. For binary classification problems, MSLIR can achieve better performance than SLIR despite the fact that both methods are semi-supervised learning ones.
References
Sun, Y.: Iterative RELIEF for feature weighting: algorithms, theories, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1035–1051 (2007)
Kira, K., Rendell, L.A.: The feature selection problem: traditional methods and a new algorithm. In: Tenth National Conference on Artificial Intelligence, pp. 129–134 (1992)
Kononenko, I.: Estimating attributes: analysis and extensions of RELIEF. In: Bergadano, F., De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784, pp. 171–182. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-57868-4_57
Cheng, Z., Zhang, Y., Fan, X., Zhu, B.: Study on discriminant matrices of commonly used fisher discriminant functions. Acta Automatica Sin. 36(10), 1361–1370 (2010)
Chen, L.F., Liao, H.Y.M., Ko, M.T., Lin, J.C., Yu, G.J.: A new LDA based face recognition system which can solve the small sample size problem. Pattern Recognit. 33(10), 1713–1726 (2000)
Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 24(3), 301–312 (2002)
He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: International Conference on Neural Information Processing Systems, vol. 18, pp. 507–514 (2005)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Quinzn, I., Sotoca, J.M., Pla, F.: Clustering-based feature selection in semi-supervised problems. In: International Conference on Intelligent Systems Design and Application, pp. 535–540 (2009)
Zhao, J., Lu, K., He, X.: Locality sensitive semi-supervised feature selection. Neurocomputing 71(10), 1842–1849 (2008)
Zeng, Z., Wang, X.D., Zhang, J., Wu, Q.: Semi-supervised feature selection based on local discriminative information. Neurocomputing 173(P1), 102–109 (2016)
Cheng, Y., Cai, Y., Sun, Y., Li, J.: Semi-supervised feature selection under the Logistic I-RELIEF framework. In: International Conference on Pattern Recognition, pp. 1–4 (2008)
Ren, J., Qiu, Z., Fan, W., Cheng, H., Yu, P.S.: Forward semi-supervised feature selection. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 970–976. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68125-0_101
Wang, B., Jia, Y., Yang, S.: Forward semi-supervised feature selection based on relevant set correlation. Int. Conf. Comput. Sci. Softw. Eng. 4, 210–213 (2008)
Handl, J., Knowles, J.: Semi-supervised feature selection via multi-objective optimization. In: International Joint Conference on Neural Networks, pp. 3319–3326 (2006)
Zhao, Z., Liu, H.: Semi-supervised feature selection via spectral analysis. In: SIAM International Conference on Data Mining, SIAM-2007, SIAM, Minneapolis, Minnesota, USA, pp. 641–646 (2007)
UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets.html
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grants No. 61373093, No. 61402310, No. 61672364 and No. 61672365, by the Soochow Scholar Project of Soochow University, by the Six Talent Peak Project of Jiangsu Province of China.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Tang, B., Zhang, L. (2018). Semi-supervised Feature Selection Based on Logistic I-RELIEF for Multi-classification. In: Geng, X., Kang, BH. (eds) PRICAI 2018: Trends in Artificial Intelligence. PRICAI 2018. Lecture Notes in Computer Science(), vol 11012. Springer, Cham. https://doi.org/10.1007/978-3-319-97304-3_55
Download citation
DOI: https://doi.org/10.1007/978-3-319-97304-3_55
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97303-6
Online ISBN: 978-3-319-97304-3
eBook Packages: Computer ScienceComputer Science (R0)