
1 Introduction

Big data has become pervasive in fields such as pattern recognition and machine learning [1, 2]. A common problem in data processing is that the data often contain unimportant features [3, 4], which increase the computational cost and degrade the effectiveness of model training [5, 6]. Therefore, feature selection has become an important research topic in machine learning in recent years.

Feature selection removes redundant features to achieve dimensionality reduction [1], which facilitates model training and mitigates the "curse of dimensionality" [7]. Depending on the availability of sample labels, feature selection is divided into supervised, semi-supervised and unsupervised methods. Supervised feature selection [8] trains the model only on labeled samples and exploits the structural relationship between labels and features to choose important features, seeking the feature subset with the highest relevance to the labels. Unsupervised feature selection [9,10,11] trains the model on unlabeled samples and selects the most representative features from the original feature set according to certain evaluation criteria. Semi-supervised feature selection [12, 13] uses a small number of labeled samples together with a large number of unlabeled samples to obtain the optimal feature subset. These methods are efficient because they mine the global and local structure of all samples while also exploiting the few labels that provide category information. Therefore, this paper focuses on semi-supervised feature selection.

Various semi-supervised feature selection methods have been proposed recently. A typical classifier-based method is the semi-supervised support vector machine (S3VM) [14], which uses a support vector machine (SVM) to tag unlabeled samples and then fuses the resulting “soft”-labeled samples into model training. Zhao and Liu [15] proposed a semi-supervised regularized feature selection framework based on spectral learning to evaluate the relevance of features. In addition, Ren et al. [16] proposed a forward semi-supervised feature selection framework of the wrapper type, which combines forward selection with a wrapper to obtain the optimal feature subset. Chen et al. [17] combined the traditional Fisher score with label propagation, using the “soft” labels of unlabeled samples to obtain a globally optimal feature subset.

However, existing semi-supervised feature selection methods have some shortcomings. First, much of the research focuses on minimizing the classification error rate without considering the misclassification cost, implicitly assuming that different misclassifications carry equal costs [18]. This may keep the model from paying enough attention to samples that incur high misclassification losses, biasing the features selected by the learning model. Second, some advanced semi-supervised feature selection algorithms ignore the structural information of paired samples in each feature dimension, even though such information can improve feature selection performance [19]. In addition, many methods treat the relevance of a single candidate feature as equivalent to the relevance of the selected features, without considering the joint relevance of pairs of features; as a result, some low-relevance features are mistakenly regarded as salient features.

To solve the above problems, we propose a semi-supervised feature selection method based on cost sensitivity and structural information (SF_CSSI). The contributions of this paper are as follows:

  • In practical applications, misclassification always exists, yet its cost is usually ignored by researchers. The proposed method takes the misclassification cost into account and assigns different penalty costs to samples of different categories. In contrast to conventional feature selection methods, we minimize the total cost rather than the total error rate, aiming to prevent the disasters that high-cost mistakes can cause.

  • The proposed method converts each original feature vector into a structure-based feature graph representation that captures the structural information between sample pairs in each feature dimension, so as to preserve more meaningful information. Furthermore, it constructs a feature information matrix to simultaneously maximize the joint relevance of different pairwise feature combinations with respect to the target feature graphs and minimize the redundancy among selected features, yielding a feature subset with high relevance and low redundancy.

  • The proposed combination of misclassification cost, structural information and information-theoretic measurement of paired features has rarely been studied. Experiments show that the proposed method achieves better feature selection results on real datasets.

2 Approach

2.1 Notations

In this paper, matrices are written as boldface uppercase letters, vectors as boldface lowercase letters and scalars as normal italic letters. For a matrix \(\mathbf{X}\), \(x_{i,j}\) denotes the element in the i-th row and j-th column of \(\mathbf{X}\). The Frobenius norm of a matrix \(\mathbf{X} \in \mathbb{R}^{n \times d}\) is defined as \({\left\| \mathbf{X} \right\| _F} = \sqrt{\sum \nolimits _{i,j} {x_{i,j}^2} }\). The \(l_{2,1}\)-norm of \(\mathbf{X}\) is defined as \({\left\| \mathbf{X} \right\| _{2,1}} = \sum \nolimits _{i = 1}^n {\sqrt{\sum \nolimits _{j = 1}^d {x_{i,j}^2} } }\). For a vector \(\mathbf{x}\), its \(l_1\)-norm is defined as \({\left\| \mathbf{x} \right\| _1} = \sum \nolimits _{i = 1}^n {\left| {x_i} \right| }\). The symbol \(\odot\) denotes element-wise multiplication and \(tr\left( \mathbf{X} \right)\) denotes the trace of matrix \(\mathbf{X}\).
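As a quick illustration of these norms, here is a minimal numpy sketch (not part of the original paper):

```python
import numpy as np

X = np.array([[1.0, -2.0], [3.0, 4.0]])

# Frobenius norm: square root of the sum of all squared entries.
fro = np.sqrt((X ** 2).sum())

# l_{2,1}-norm: sum over rows of each row's l_2-norm.
l21 = np.sqrt((X ** 2).sum(axis=1)).sum()

# l_1-norm of a vector: sum of absolute values.
x = np.array([1.0, -2.0, 3.0])
l1 = np.abs(x).sum()

print(fro, l21, l1)
```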

In semi-supervised learning, the data set consists of two parts: labeled data \({\mathbf{X}_L} = \left( {x_1, x_2, \ldots, x_l} \right)\) and unlabeled data \({\mathbf{X}_U} = \left( {x_{l+1}, x_{l+2}, \ldots, x_{l+u}} \right)\), where \(u = n - l\), n is the total number of samples, l is the number of labeled samples and u is the number of unlabeled samples. The corresponding labels are \({\mathbf{Y}_L} = {\left( {y_1, y_2, \ldots, y_l} \right) ^T}\), while the labels \({\mathbf{Y}_U} = {\left( {y_{l+1}, y_{l+2}, \ldots, y_{l+u}} \right) ^T}\) are unknown.

2.2 Cost-Sensitive Feature Selection

Given a data set \(\mathbf{X} = \left[ {\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n} \right] \in \mathbb{R}^{n \times d}\), where n is the number of samples and d is the number of features per sample, traditional feature selection imposes a sparsity penalty in the objective function, which makes the selected features sparser and more discriminative. The objective function of traditional feature selection [9] is defined as:

$$\begin{aligned} \mathop {\mathrm{{min}}}\limits _\mathbf{{W}} \left\| {\mathbf{{Y}} - \mathbf{{XW}}} \right\| _F^2 + \lambda {\left\| \mathbf{{W}} \right\| _{2,1}} \end{aligned}$$
(1)

Because misclassification frequently occurs in practical applications, cost-sensitive learning is embedded into the feature selection framework. Cost-sensitive learning assigns different cost parameters to different types of samples; without loss of generality, a specified cost matrix is introduced into the feature selection framework. The objective function of traditional cost-sensitive feature selection [20] is defined as:

$$\begin{aligned} \mathop {\mathrm{min}}\limits _\mathbf{W} {\left\| {\left( \mathbf{XW} - \mathbf{Y} \right) \odot \mathbf{C}} \right\| _{2,1}} + \lambda {\left\| \mathbf{W} \right\| _{2,1}}, \end{aligned}$$
(2)

where \(\mathbf{W} \in {\mathbb{R}^{d \times m}}\) is the feature weight matrix, \(\mathbf{Y} \in {\mathbb{R}^{n \times m}}\) is the label matrix, \(\mathbf{C} \in {\mathbb{R}^{n \times m}}\) is the cost matrix and \(\lambda\) is the penalty coefficient.
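To make the role of the cost matrix concrete, the following sketch evaluates the objective of Eq. 2 on hypothetical toy data (an illustrative numpy sketch; the function name and the cost assignment are ours, not the authors'):

```python
import numpy as np

def cost_sensitive_objective(X, Y, W, C, lam):
    """Evaluate Eq. 2: ||(XW - Y) . C||_{2,1} + lam * ||W||_{2,1}.

    X: (n, d) data, Y: (n, m) labels, W: (d, m) weights, C: (n, m) costs.
    """
    R = (X @ W - Y) * C                            # cost-weighted residuals
    loss = np.sqrt((R ** 2).sum(axis=1)).sum()     # l_{2,1} over rows
    reg = np.sqrt((W ** 2).sum(axis=1)).sum()      # l_{2,1} of W
    return loss + lam * reg

# Toy example where misclassifying class-1 samples is penalised more heavily.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = np.eye(2)[rng.integers(0, 2, size=6)]          # one-hot labels
C = np.where(Y[:, [1]] == 1, 5.0, 1.0) * np.ones((6, 2))
W = rng.normal(size=(4, 2))
print(cost_sensitive_objective(X, Y, W, C, lam=0.1))
```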

2.3 Feature Selection with Graph Structural Information

Structural information between pairs of samples in each feature dimension can provide a richer representation, but few researchers have paid attention to it.

Therefore, each feature vector is transformed into a feature graph structure that encapsulates the pairwise relationships between samples. In addition, the information-theoretic criterion of Jensen-Shannon divergence is used to measure the joint relevance between different paired feature combinations and the target labels. The specific process is as follows.

Let \(\mathbf{X} = \left\{ {\mathbf{f}_1, \ldots, \mathbf{f}_i, \ldots, \mathbf{f}_N} \right\} \in {\mathbb{R}^{M \times N}}\) denote a data set of M samples and N features. Each original feature vector \(\mathbf{f}_i = {\left( {f_{i1}, \ldots, f_{ia}, \ldots, f_{ib}, \ldots, f_{iM}} \right) ^T}\) is transformed into a feature graph \(\mathbf{G}_i\left( {V_i, E_i} \right)\), where vertex \(v_{ia} \in V_i\) represents the a-th sample \(f_{ia}\) of feature \(\mathbf{f}_i\) (i.e., each vertex represents a sample) and edge \(\left( {v_{ia}, v_{ib}} \right) \in E_i\) carries the weight between the a-th and b-th samples (i.e., each edge represents the correlation between a pair of samples in the corresponding feature dimension). In addition, we also construct a graph structure for the target feature \(\mathbf{Y}\). For classification problems, \(\mathbf{Y}\) takes discrete values \(c \in \left\{ {1, 2, \ldots, m} \right\}\). Therefore, we compute a continuous-valued target version of each feature \(\mathbf{f}_i\) as \(\hat{\mathbf{f}}_i = {\left( {\hat{f}_{i1}, \ldots, \hat{f}_{ia}, \ldots, \hat{f}_{ib}, \ldots, \hat{f}_{iM}} \right) ^T}\), where \(\hat{f}_{ia}\) denotes the a-th sample of \(\hat{\mathbf{f}}_i\). When \(f_{ia}\) in \(\mathbf{f}_i\) belongs to class c, \(\hat{f}_{ia}\) is the mean value of all class-c samples in \(\mathbf{f}_i\). Similarly, we construct the graph structure of the target feature \(\hat{\mathbf{f}}_i\) as \(\hat{\mathbf{G}}_i\left( {\hat{V}_i, \hat{E}_i} \right)\), where \(\hat{v}_{ia} \in \hat{V}_i\) represents the a-th sample of the target feature \(\hat{\mathbf{f}}_i\) and \(\left( {\hat{v}_{ia}, \hat{v}_{ib}} \right) \in \hat{E}_i\) is the weighted edge connecting the a-th and b-th samples of \(\hat{\mathbf{f}}_i\). This paper uses the Euclidean distance to measure the relationship between pairs of feature samples, so the weight between \(f_{ia}\) and \(f_{ib}\) is expressed as:

$$\begin{aligned} \omega \left( {{v_{ia}}\mathrm{{,}}{v_{ib}}} \right) = \sqrt{{{\left( {{f_{ia}} - {f_{ib}}} \right) }^2}} \end{aligned}$$
(3)

Similarly, the weight of edge \(\left( {\hat{v}_{ia}, \hat{v}_{ib}} \right) \in \hat{E}_i\) in \(\hat{\mathbf{G}}_i\left( {\hat{V}_i, \hat{E}_i} \right)\) is expressed as follows:

$$\begin{aligned} \omega \left( {\hat{v}_{ia}, \hat{v}_{ib}} \right) = \sqrt{{{\left( {{\mu _{ia}} - {\mu _{ib}}} \right) }^2}}, \end{aligned}$$
(4)

where \({\mu _{ia}}\) is the mean value of all samples in \(\mathbf{f}_i\) belonging to the same class as \(f_{ia}\).
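The construction of these feature graphs can be sketched as follows (a minimal numpy version of the Euclidean weights in Eq. 3 and Eq. 4; the helper names are ours):

```python
import numpy as np

def feature_graph(f):
    """Weighted adjacency of the graph G_i for one feature vector f (length M).
    Edge (a, b) carries |f_a - f_b|, as in Eq. 3."""
    return np.abs(f[:, None] - f[None, :])

def target_feature_graph(f, y):
    """Graph of the target version of f (Eq. 4): each sample is replaced by the
    mean of f over all samples sharing its class label y."""
    f_hat = np.array([f[y == c].mean() for c in y])
    return np.abs(f_hat[:, None] - f_hat[None, :])

# Toy feature over 5 samples with binary labels.
f = np.array([0.2, 0.9, 0.4, 0.8, 0.1])
y = np.array([0, 1, 0, 1, 0])
G_i = feature_graph(f)
G_i_hat = target_feature_graph(f, y)
```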

The Jensen-Shannon divergence (JSD) measures the divergence between two probability distributions [21]. Given two (discrete) probability distributions \({\mathcal{P}} = \left( {p_1, \ldots, p_a, \ldots, p_A} \right)\) and \({\mathcal{K}} = \left( {k_1, \ldots, k_b, \ldots, k_B} \right)\), the JSD between \({\mathcal{P}}\) and \({\mathcal{K}}\) is defined as:

$$\begin{aligned} {D_{\mathrm{{JS}}}}\left( {{{\mathcal{P}}},{{\mathcal{K}}}} \right) = {H_S}\left( {\frac{{{\mathcal{P}} + {\mathcal{K}}}}{2}} \right) - \frac{1}{2}{H_S}\left( {\mathcal{P}} \right) - \frac{1}{2}{H_S}\left( {\mathcal{K}} \right) , \end{aligned}$$
(5)

where \({H_S}\left( {\mathcal{P}} \right) = -\sum \nolimits _{i = 1}^A {p_i \log p_i}\) is the Shannon entropy of the probability distribution \({\mathcal{P}}\). In the literature [22], the JSD has been used to measure the information-theoretic dissimilarity between graphs through their associated probability distributions. In this paper, we focus on the similarity between graph-based feature representations, so we use the negative exponential of \({D_{\mathrm{JS}}}\left( {{\mathcal{P}},{\mathcal{K}}} \right)\) to compute the similarity \(I_S\) between the probability distributions \({\mathcal{P}}\) and \({\mathcal{K}}\):

$$\begin{aligned} {I_S}\left( {{\mathcal{P}},{\mathcal{K}}} \right) = \mathrm{{exp}}\left\{ { - {D_{\mathrm{{JS}}}}\left( {{\mathcal{P}},{\mathcal{K}}} \right) } \right\} \end{aligned}$$
(6)
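The JSD-based similarity of Eq. 5 and Eq. 6 can be sketched as follows. Since the text does not detail how a feature graph is mapped to a probability distribution, the graph_distribution helper below assumes normalized vertex degrees, which is only one possible choice:

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    """H_S(p) = -sum_i p_i log p_i (small eps avoids log 0)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def js_divergence(p, k):
    """Eq. 5: D_JS(P, K) = H_S((P + K) / 2) - H_S(P) / 2 - H_S(K) / 2."""
    p, k = np.asarray(p, float), np.asarray(k, float)
    return shannon_entropy((p + k) / 2) - 0.5 * shannon_entropy(p) - 0.5 * shannon_entropy(k)

def js_similarity(p, k):
    """Eq. 6: I_S(P, K) = exp(-D_JS(P, K))."""
    return np.exp(-js_divergence(p, k))

def graph_distribution(G):
    """One possible graph-to-distribution mapping (our assumption):
    normalized vertex degrees of the weighted adjacency matrix G."""
    deg = G.sum(axis=1)
    return deg / deg.sum()
```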

The information-theoretic function is used to evaluate the relevance between different feature combinations and the target labels, so as to satisfy the maximum-relevance and minimum-redundancy criteria. For a set of N features \(\mathbf{f}_1, \ldots, \mathbf{f}_i, \ldots, \mathbf{f}_N\) and the associated continuous target feature \(\mathbf{Y}\), the relevance of the feature pair \(\left\{ {\mathbf{f}_i, \mathbf{f}_j} \right\}\) is expressed as follows:

$$\begin{aligned} {U_{{f_i},{f_j}}} = \frac{{{I_s}\left( {\mathbf{G}_i, \hat{\mathbf{G}}} \right) + {I_s}\left( {\mathbf{G}_j, \hat{\mathbf{G}}} \right) }}{{{I_s}\left( {\mathbf{G}_i, \mathbf{G}_j} \right) }}, \end{aligned}$$
(7)

where \(I_s\) is the JSD-based information-theoretic similarity measure defined in Eq. 6. \({I_s}\left( {\mathbf{G}_i, \hat{\mathbf{G}}} \right)\) measures the relevance of feature \(\mathbf{f}_i\) to the target feature \(\mathbf{Y}\), and \({I_s}\left( {\mathbf{G}_j, \hat{\mathbf{G}}} \right)\) measures the relevance of feature \(\mathbf{f}_j\) to \(\mathbf{Y}\). \({I_s}\left( {\mathbf{G}_i, \mathbf{G}_j} \right)\) denotes the redundancy of the feature pair \(\left\{ {\mathbf{f}_i, \mathbf{f}_j} \right\}\). Therefore, \({U_{{f_i},{f_j}}}\) is large if and only if \({I_s}\left( {\mathbf{G}_i, \hat{\mathbf{G}}} \right) + {I_s}\left( {\mathbf{G}_j, \hat{\mathbf{G}}} \right)\) is large and \({I_s}\left( {\mathbf{G}_i, \mathbf{G}_j} \right)\) is small, indicating that the pairwise feature combination \(\left\{ {\mathbf{f}_i, \mathbf{f}_j} \right\}\) is informative and less redundant.

Given the feature information matrix \(\mathbf{U}\) and a d-dimensional feature weight vector \(\mathbf{w}\), the feature subset is identified by solving the following maximization problem:

$$\begin{aligned} \mathrm{{max }}~f\mathrm{{(}}{} \mathbf{{w}}\mathrm{{) = }}\mathop {\mathrm{{max}}}\limits _{\mathbf{{w}} \in \mathbb {R}^d} {\mathbf{{w}}^T}{} \mathbf{{Uw}}, \end{aligned}$$
(8)

where \(\mathbf{w} \in \mathbb{R}^d\), \(\mathbf{w} = {\left( {w_1, w_2, \cdots, w_i, \cdots, w_d} \right) ^T}\) with \(w_i > 0\), and \(w_i\) represents the correlation coefficient of the i-th feature.
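Putting the pieces together, the feature information matrix U of Eq. 7 and the quadratic score of Eq. 8 could be assembled as in the sketch below, which reuses the hypothetical helpers from the earlier sketches and is an illustration rather than the authors' implementation:

```python
import numpy as np

def feature_information_matrix(X, y):
    """Build U with U[i, j] given by Eq. 7 for i != j.

    X: (M, N) data with M samples and N features; y: class labels.
    Relies on feature_graph, target_feature_graph, graph_distribution and
    js_similarity from the earlier sketches.
    """
    M, N = X.shape
    dists = [graph_distribution(feature_graph(X[:, i])) for i in range(N)]
    t_dists = [graph_distribution(target_feature_graph(X[:, i], y)) for i in range(N)]
    U = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            relevance = js_similarity(dists[i], t_dists[i]) + js_similarity(dists[j], t_dists[j])
            redundancy = js_similarity(dists[i], dists[j])
            U[i, j] = relevance / redundancy
    return U

def quadratic_score(w, U):
    """f(w) = w^T U w from Eq. 8."""
    return w @ U @ w
```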

2.4 Mathematical Formulation

The purpose of the proposed method is to improve feature selection performance through structural information and misclassification costs when only a few labels are available. Therefore, we combine cost sensitivity with Eq. 8 to formulate semi-supervised feature selection based on cost-sensitive and structural information. The mathematical formulation is as follows:

$$\begin{aligned} \mathop {\min }\limits _\mathbf{{w}} {\alpha _1}tr({\mathbf{{w}}^T}{\mathbf{{X}}^T}{} \mathbf{{LXw}}) + \sum \limits _{i = 1}^l {\left\| {{x_i}{} \mathbf{{w}} - {y_i}} \right\| _\mathrm{{2}}^2} {c_i} + {\alpha _2}||\mathbf{{w}}|{|_1} - {\alpha _3}{\mathbf{{w}}^T}{} \mathbf{{Uw}} \end{aligned}$$
(9)

The first term models the local proximity structure, which helps the model select a representative feature subset by preserving the local structure of the samples. \(\mathbf{w}\) is the feature coefficient vector and \(\mathbf{L}\) is the Laplacian matrix, \(\mathbf{L} = \mathbf{D} - \mathbf{A}\), where \(\mathbf{D}\) is a diagonal matrix whose diagonal elements satisfy \({D_{ii}} = \sum \nolimits _{j = 1}^n {A_{ij}}\) and \(\mathbf{A}\) is the affinity matrix with \({A_{ij}} = \exp ( - \frac{{||x_i - x_j||_2^2}}{{2{\sigma ^2}}})\) if \(i \ne j\) and \({A_{ij}} = 0\) otherwise. In the second term, \(c_i\) is the misclassification cost of sample \(x_i\), so this term combines the regression loss with the cost to obtain the misclassification cost loss; since misclassification can only be judged against known labels, only the labeled samples are used to compute this loss. The third term \({\left\| \mathbf{w} \right\| _1}\) is the sparsity regularizer, which uses the \(l_1\)-norm to shrink some coefficients to zero. The fourth term encourages the selected features to be jointly more relevant to the target while maintaining less redundancy among them. \(\alpha _1\), \(\alpha _2\) and \(\alpha _3\) are the penalty coefficients.
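As a concrete illustration of the first term, the sketch below builds the Gaussian affinity matrix A and the Laplacian L described above, and evaluates the full objective of Eq. 9 for a given weight vector (a minimal numpy sketch; the function names are ours, and the per-sample costs of the labeled data are collected in a hypothetical vector c_L):

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Affinity matrix A with Gaussian weights and Laplacian L = D - A,
    as used in the first term of Eq. 9."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)               # A_ij = 0 when i == j
    D = np.diag(A.sum(axis=1))
    return D - A

def sf_cssi_objective(w, X, X_L, y_L, c_L, U, L, a1, a2, a3):
    """Eq. 9 in vector form: local-structure smoothness + cost-weighted loss
    on labeled samples + l_1 sparsity - joint relevance term."""
    smooth = a1 * (w @ X.T @ L @ X @ w)
    cost_loss = np.sum(c_L * (X_L @ w - y_L) ** 2)
    return smooth + cost_loss + a2 * np.abs(w).sum() - a3 * (w @ U @ w)
```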

2.5 Optimization

To facilitate optimization, Eq. 9 can be rewritten as follows:

$$\begin{aligned} \begin{aligned} \mathop {\mathrm{{min}}}\limits _\mathbf{{w},\mathbf{{Q}}} {\alpha _1}tr\left( {{\mathbf{{w}}^T}{\mathbf{{X}}^T}{} \mathbf{{LXw}}} \right) + tr\left( {{{\left( {{\mathbf{{X}}_L}{} \mathbf{{w}} - {\mathbf{{Y}}_L}} \right) }^T}{} \mathbf{{C}}\left( {{\mathbf{{X}}_L}{} \mathbf{{w}} - {\mathbf{{Y}}_L}} \right) } \right) \\ +\,\,{\alpha _2}tr\left( {{\mathbf{{w}}^T}{} \mathbf{{Qw}}} \right) - {\alpha _3}{\mathbf{{w}}^T}{} \mathbf{{Uw}}, \end{aligned} \end{aligned}$$
(10)

where \(\mathbf{Q}\) is a diagonal matrix (defined in Eq. 15). We adopt an alternating optimization strategy: update \(\mathbf{w}\) with \(\mathbf{Q}\) fixed, then update \(\mathbf{Q}\) with \(\mathbf{w}\) fixed, until Eq. 9 converges, so that the optimal weight vector \(\mathbf{w}\) is obtained.

  • Update \(\mathbf{{w}}\) by fixing \(\mathbf{{Q}}\)

When \(\mathbf{{Q}}\) is fixed, Eq. 10 can be regarded as a function of \(\mathbf{{w}}\):

$$\begin{aligned} \begin{aligned} L\left( \mathbf{{w}} \right) = {\alpha _1}tr\left( {{\mathbf{{w}}^T}{\mathbf{{X}}^T}{} \mathbf{{LXw}}} \right) + tr\left( {{{\left( {{\mathbf{{X}}_L}{} \mathbf{{w}} - {\mathbf{{Y}}_L}} \right) }^T}{} \mathbf{{C}}\left( {{\mathbf{{X}}_L}{} \mathbf{{w}} - {\mathbf{{Y}}_L}} \right) } \right) \\+\,\,{\alpha _2}tr\left( {{\mathbf{{w}}^T}{} \mathbf{{Qw}}} \right) - {\alpha _3}{\mathbf{{w}}^T}{} \mathbf{{Uw}} \end{aligned} \end{aligned}$$
(11)

We take the derivative of Eq. 11 with respect to \(\mathbf{w}\) and set it to zero:

$$\begin{aligned} \frac{{\partial L}}{{\partial \mathbf{{w}}}} = 2{\alpha _1}{\mathbf{{X}}^T}{} \mathbf{{LXw}} + 2{\mathbf{{X}}_L}^T\mathbf{{C}}{\mathbf{{X}}_L}{} \mathbf{{w}} - 2{\mathbf{{X}}_L}^T\mathbf{{C}}{\mathbf{{Y}}_L} + 2{\alpha _2}{} \mathbf{{Qw}} - 2{\alpha _3}{} \mathbf{{Uw}} = 0 \end{aligned}$$
(12)

Solving Eq. 12 for \(\mathbf{w}\) yields:

$$\begin{aligned} \mathbf{{w}} = {\left( {2{\alpha _1}{\mathbf{{X}}^T}{} \mathbf{{LX}} + 2{\mathbf{{X}}_L}^T\mathbf{{C}}{\mathbf{{X}}_L} + 2{\alpha _2}{} \mathbf{{Q}} - 2{\alpha _3}{} \mathbf{{U}}} \right) ^{ - 1}}2{\mathbf{{X}}_L}^T\mathbf{{C}}{\mathbf{{Y}}_L} \end{aligned}$$
(13)
  • Update \(\mathbf{{Q}}\) by fixing \(\mathbf{{w}}\)

When \(\mathbf{{w}}\) is fixed, Eq. 10 can be regarded as:

$$\begin{aligned} \mathop {\mathrm{{min}}}\limits _\mathbf{{Q}} {\alpha _2}tr\left( {{\mathbf{{w}}^T}{} \mathbf{{Qw}}} \right) \end{aligned}$$
(14)

According to [23], with \(\mathbf{w}\) fixed, \(\mathbf{Q}\) is updated as:

$$\begin{aligned} {Q_{ii}} = \frac{1}{{2\left| {{w_i}} \right| }}, \end{aligned}$$
(15)

where \(\mathbf{{Q}}\) is a diagonal matrix and \({Q_{ii}} = \frac{1}{{2\left| {{w_i}} \right| }}\) is the diagonal element.

[Algorithm 1: the optimization procedure of SF_CSSI, alternating the update of \(\mathbf{w}\) (Eq. 13) and the update of \(\mathbf{Q}\) (Eq. 15) until convergence.]
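The alternating procedure summarized in Algorithm 1 could be sketched as follows (an illustrative numpy implementation of the updates in Eq. 13 and Eq. 15, not the authors' MATLAB code; C_L denotes the diagonal cost matrix of the labeled samples, and eps guards against division by zero):

```python
import numpy as np

def sf_cssi_solve(X, X_L, y_L, C_L, U, L, a1, a2, a3,
                  max_iter=50, tol=1e-6, eps=1e-8):
    """Alternating optimization of Eq. 10: update w via Eq. 13 with Q fixed,
    then update Q via Eq. 15 with w fixed, until w stabilises.

    X: (n, d) all samples, X_L: (l, d) labeled samples, y_L: (l,) labels,
    C_L: (l, l) diagonal cost matrix, U: (d, d), L: (n, n) Laplacian."""
    d = X.shape[1]
    Q = np.eye(d)
    w = np.zeros(d)
    for _ in range(max_iter):
        # Eq. 13: closed-form update of w.
        A = 2 * a1 * X.T @ L @ X + 2 * X_L.T @ C_L @ X_L + 2 * a2 * Q - 2 * a3 * U
        w_new = np.linalg.solve(A, 2 * X_L.T @ C_L @ y_L)
        # Eq. 15: Q is diagonal with Q_ii = 1 / (2 |w_i|).
        Q = np.diag(1.0 / (2 * np.abs(w_new) + eps))
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w          # features can then be ranked by |w_i|
```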

2.6 Convergence Analysis

Let \(\mathbf{w}^{(t)}\) and \(\mathbf{Q}^{(t)}\) denote \(\mathbf{w}\) and \(\mathbf{Q}\) at the t-th iteration; then Eq. 10 can be rewritten as:

$$\begin{aligned} \begin{aligned} {E}\left( {{\mathbf{{w}}^{\left( t \right) }}{} \mathbf{{,}}{\mathbf{{Q}}^{\left( t \right) }}} \right) \,=\,{\alpha _1}tr\left( {{{\left( {{\mathbf{{w}}^{\left( t \right) }}} \right) }^T}{\mathbf{{X}}^T}{} \mathbf{{LX}}{\mathbf{{w}}^{\left( t \right) }}} \right) + tr\left( {{{\left( {{\mathbf{{X}}_L}{\mathbf{{w}}^{\left( t \right) }} - {\mathbf{{Y}}_L}} \right) }^T}{} \mathbf{{C}}\left( {{\mathbf{{X}}_L}{\mathbf{{w}}^{\left( t \right) }} - {\mathbf{{Y}}_L}} \right) } \right) \\+\,\,{\alpha _2}tr\left( {{{\left( {{\mathbf{{w}}^{\left( t \right) }}} \right) }^{T}}{\mathbf{{Q}}^{\left( t \right) }}{\mathbf{{w}}^{\left( t \right) }}} \right) - {\alpha _3}{\left( {{\mathbf{{w}}^{\left( t \right) }}} \right) ^{T}}{} \mathbf{{U}}{\mathbf{{w}}^{\left( t \right) }} \end{aligned} \end{aligned}$$
(16)

Because minimizing the objective function \(E\left( \mathbf{w}^{(t)}, \mathbf{Q}^{(t)} \right)\) with respect to \(\mathbf{w}\) is a convex problem that Eq. 13 solves exactly, we have the following inequality:

$$\begin{aligned} {E}\left( {{\mathbf{{w}}^{\left( t+1 \right) }}{} \mathbf{{,}}{\mathbf{{Q}}^{\left( {t} \right) }}} \right) \le {E}\left( {{\mathbf{{w}}^{\left( t \right) }}{} \mathbf{{,}}{\mathbf{{Q}}^{\left( t \right) }}} \right) \end{aligned}$$
(17)

According to [23], the update of Eq. 15 does not increase Eq. 14, so Eq. 10 is non-increasing with respect to \(\mathbf{Q}\); we express this as the following inequality:

$$\begin{aligned} {E}\left( {{\mathbf{{w}}^{\left( {t + 1} \right) }}{} \mathbf{{,}}{\mathbf{{Q}}^{\left( {t+1 } \right) }}} \right) \le {E}\left( {{\mathbf{{w}}^{\left( t+1 \right) }}{} \mathbf{{,}}{\mathbf{{Q}}^{\left( {t} \right) }}} \right) \end{aligned}$$
(18)

Combining Eq. 17 and Eq. 18, we can get the inequality:

$$\begin{aligned} {E}\left( {{\mathbf{{w}}^{\left( {t + 1} \right) }}{} \mathbf{{,}}{\mathbf{{Q}}^{\left( {t + 1} \right) }}} \right) \le {E}\left( {{\mathbf{{w}}^{\left( t \right) }}{} \mathbf{{,}}{\mathbf{{Q}}^{\left( t \right) }}} \right) \end{aligned}$$
(19)

According to Eq. 19, Eq. 16 is non-increasing at each iteration. Therefore, the proposed Algorithm 1 is convergent.

3 Experiments

In this section, we evaluate our proposed SF_CSSI and six comparison methods on eight data sets. Specifically, we first employ each feature selection method to choose a new feature subset from the original data sets, and then use a support vector machine classifier to evaluate the selected subsets.

3.1 Datasets and Comparison Methods

The data sets (i.e., madelon, SECOM, chess, isolet, Hill-with, Hill-without, musk and sonar) are from the UCI Machine Learning Repository (Footnote 1). We summarize the details of all data sets in Table 1.

Table 1. Summarization of data sets.

We compared our proposed method with six comparison methods, the details of which are listed as follows:

  • Cost-Sensitive Laplacian Score (CSLS [24]) uses Laplacian graphs and the cost of misclassification between classes to score each feature individually.

  • Semi-supervised feature selection based on joint mutual information (Semi-JMI [25]) uses the redundancy between features and the correlation between features and labels to complete feature selection.

  • Semi-supervised feature selection based on information theory method (Semi-IMIM [25]) only uses the correlation between features and labels to complete feature selection.

  • Cost-Sensitive Feature Selection via F-Measure Optimization Reduction (CSFS [20]) introduces cost sensitivity into feature selection and optimizes the F-measure instead of accuracy to take the class-imbalance issue into account.

  • Cost-sensitive feature selection via the \({l_{2,1}}\)-norm (CSEFS [26]) combines \({l_{2,1}}\)-norm minimization regularization with a loss term that embeds misclassification costs to select the feature subset.

  • Semi-supervised Feature Selection via Rescaled Linear Regression (RLSR [17]) uses a set of scale factors to adjust regression coefficients, then uses regression coefficients to rank features.

Table 2. Total cost (cost ± std) of misclassification on eight data sets. Bold numbers indicate the best results.

3.2 Experimental Settings

The experiments in this paper were implemented in MATLAB 2018a under the Windows 10 system. Following the method in [27], we divide each data set into three parts: a labeled sample set (L), an unlabeled sample set (U) and a test sample set (T). For each data set, labeled samples were randomly selected with given ratios of \(\left\{ 10\%, 20\%, 30\% \right\}\).

Table 3. The value of specificity on eight data sets. Bold numbers indicate the best results.

We use 10-fold cross-validation to generate the training and test sample sets, then randomly select \(\mathrm{L}\) and \(\mathrm{U}\) from the training sample set for training, and finally use \(\mathrm{T}\) to test the performance of the different methods. Each algorithm performs 10 rounds of 10-fold cross-validation, and the average of the 10 experimental results is taken as the final total cost, which reduces accidental effects. We set the parameters \(\alpha _1\), \(\alpha _2\) and \(\alpha _3\) in Eq. 9 in the range of \(\left\{ {{{10}^{ - 3}},{{10}^{ - 1}},\ldots,{{10}^1},{{10}^3}} \right\}\). The parameters of the comparison methods are set according to their corresponding literature.
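A minimal sketch of this splitting protocol (a hypothetical helper assuming a simple, unstratified 10-fold partition):

```python
import numpy as np

def make_splits(y, labeled_ratio, n_folds=10, seed=0):
    """Sketch of the protocol above: each CV fold serves as the test set T;
    within the remaining training part, labeled_ratio of the samples keep
    their labels (L) and the rest are treated as unlabeled (U)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        T = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        n_l = int(labeled_ratio * len(train))
        L, U = train[:n_l], train[n_l:]
        yield L, U, T
```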

Table 4. The value of sensitivity on eight data sets. Bold numbers indicate the best results.

The total cost, specificity and sensitivity are used as evaluation indicators to evaluate the performance of all methods on eight data sets.

The total cost is calculated as follows:

$$\begin{aligned} {Total~~Cost} = \mathrm{{sum}}\left( {c_{i}} \right) , \end{aligned}$$
(20)
Fig. 1. The total cost of different methods under different ratios of labeled samples on the eight data sets, with \(cost_1 = 10\) and \(cost_2 = 25\).

$$\begin{aligned} c_i = \left\{ \begin{array}{ll} cost_1~\mathrm{or}~cost_2, &{} \mathrm{predicted~label} \ne \mathrm{true~label}\\ 0, &{} \mathrm{otherwise} \end{array} \right. , \end{aligned}$$
(21)

where \(c_i\) denotes the misclassification cost of a sample. If the predicted label equals the true label, \(c_i\) is 0; otherwise, \(c_i\) equals \(cost_1\) or \(cost_2\) (\(cost_1\) and \(cost_2\) are the costs of being wrongly judged as a positive or a negative sample, respectively).

For binary classification, there are four possible results: TP (True Positive) denotes positive instances correctly classified, TN (True Negative) denotes negative instances correctly classified, FP (False Positive) denotes negative instances incorrectly classified as positive, and FN (False Negative) denotes positive instances misclassified as negative.

Specificity is the proportion of actually negative samples that are judged to be negative. It is calculated by the following formula:

$$\begin{aligned} {specificity} = \frac{{{{TN}}}}{{{{FP}} + {{TN}}}} \end{aligned}$$
(22)

Sensitivity is the proportion of actually positive samples that are judged to be positive. It is calculated by the following formula:

$$\begin{aligned} {sensitivity} = \frac{{{{TP}}}}{{{{TP}} + {{FN}}}} \end{aligned}$$
(23)
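The three evaluation indicators can be computed from the confusion matrix as follows (a small sketch for binary labels in {0, 1}, assuming cost_1 applies to false positives and cost_2 to false negatives, as described after Eq. 21):

```python
import numpy as np

def evaluate(y_true, y_pred, cost1, cost2):
    """Total cost (Eqs. 20-21), specificity (Eq. 22) and sensitivity (Eq. 23)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    total_cost = cost1 * fp + cost2 * fn      # each misclassified sample pays its cost
    specificity = tn / (fp + tn)
    sensitivity = tp / (tp + fn)
    return total_cost, specificity, sensitivity
```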

3.3 Experiment Results and Analysis

In this experiment, we report the total cost, specificity and sensitivity of all methods on the eight UCI data sets in Table 2, Table 3 and Table 4 under different cost settings, and list our observations below. In addition, we use a line chart (Fig. 1) to show how the total cost changes under different proportions of labeled samples.

From Table 2, we can see that the proposed SF_CSSI method outperforms the other methods in most cases. In particular, on the chess data set, the total cost of SF_CSSI is reduced by 75% compared with the second-best approach, Semi-JMI, when \(cost_1\) = 10 and \(cost_2\) = 25. When \(cost_1\) = 25 and \(cost_2\) = 10, SF_CSSI achieves a 31% reduction on the Hill-without data set compared to the second-best approach, CSFS.

From Table 3 and Table 4, the proposed model achieves high specificity and sensitivity: the highest specificity is obtained on the SECOM and Hill-without data sets, and the highest sensitivity on the isolet and Hill-without data sets. Specificity and sensitivity are commonly used diagnostic measures in clinical practice; the higher their values, the more reliable and practical the diagnostic result.

From Fig. 1, in most cases the more labeled data we have, the lower the cost we can achieve. We also notice that SF_CSSI outperforms the CSLS, CSFS and CSEFS methods in almost all cases, which indicates that these methods can be improved by exploiting unlabeled data and verifies the effectiveness of the semi-supervised feature selection approach. In addition, the proposed method attains the minimum total cost in most cases, especially on the Hill-without data set.

3.4 Conclusion

This paper considers the misclassification cost and the structural information of paired samples in each feature dimension. In addition, an information-theoretic method is used to build a feature information matrix that simultaneously maximizes the joint relevance of different pairwise feature combinations with respect to the target feature graphs and minimizes the redundancy among selected features. Compared with previous research on semi-supervised feature selection, this paper comprehensively considers the cost of misclassification, the structural information of paired samples in the feature dimension, and the information relationship between paired features, which makes our method more interpretable and generalizable than the methods compared in this paper. Experiments on eight real data sets show that the proposed method achieves better feature selection results.

In future work, we will try to extend our method to cost-sensitive multi-class classification.