1 Introduction

Recent advances in microarray techniques offer insight into particular cancers through the classification of their gene expression data. A variety of methods and mathematical models have been presented to manage and interpret these high-density microarray data for tumor genotype prediction, which makes computer-aided cancer diagnosis possible [1].

Generally speaking, the algorithms adopted for cancer prediction fall into two categories: tumor feature extraction and feature selection. Because tumor microarray data are represented by a large number of variables (genes), feature extraction methods are often put forward, including conventional linear models such as linear discriminant analysis (LDA) [2], principal component analysis (PCA) [3], and independent component analysis (ICA) [4]. To extract features from gene microarray data with nonlinear techniques, kernel approaches [5] and nonlinear manifold learning models [6, 7] have also been constructed, and they likewise contribute to predicting the class of tumor gene expression data. Most of these methods perform well at low computational expense.

The other category aims to identify key tumor pathogenic genes, on which the final cancer diagnosis can be based and from which treatments can be accurately targeted. As a result, feature selection algorithms for tumor gene expression data have received much attention. Tumor gene selection, also called feature subset selection, seeks the key genes most closely related to the corresponding tumors, which makes the learning models simpler and improves their generalization.

To evaluate the performance of the selected tumor genes, a classifier must be combined with the selection procedure; this combination is called a wrapper. Wrapper methods based on the support vector machine (SVM) were developed for binary classification [8], and SVM-RFE was later extended to multi-class classification [9]. However, wrapper methods easily incur high computational expense. Another shortcoming is that the selected gene subset depends heavily on the classifier used. In other words, the same gene subset may perform differently under different classifiers.

Unlike traditional wrapper algorithms, this paper focuses on key tumor gene selection and puts forward a new tumor gene selection approach. First, the F-score is introduced as an evaluation metric to sort the genes in descending order, from which the genes with high scores are chosen as candidates. Then, supervised correlation analysis is applied to reduce the redundancy among the candidates, yielding key genes with low mutual correlation. Finally, SVM is used to classify the samples described by the selected gene subsets.

2 Method

In this section, we first introduce the F-score metric, which weights the contribution of the \( i \)-th gene to discriminant learning. To reduce the redundancy in the subsets selected by F-score, supervised correlation analysis is then described in detail, after which the proposed method is outlined.

2.1 F-Score

F-score is commonly used to characterize the discriminative power of features for binary classification. For the training samples of tumor gene expression data \( X = [X_{1}, X_{2}, \ldots, X_{n}] \in {\mathbb{R}}^{D \times n} \), there are \( n_{+} \) positive samples and \( n_{-} \) negative samples, where \( n_{+} + n_{-} = n \). The F-score of the \( i \)-th gene is then defined as:

$$ F_{i}\text{-score} = \frac{\left( \bar{X}_{i}^{+} - \bar{X}_{i} \right)^{2} + \left( \bar{X}_{i}^{-} - \bar{X}_{i} \right)^{2}}{\frac{1}{n_{+} - 1}\sum\limits_{k = 1}^{n_{+}} \left( X_{ki}^{+} - \bar{X}_{i}^{+} \right)^{2} + \frac{1}{n_{-} - 1}\sum\limits_{k = 1}^{n_{-}} \left( X_{ki}^{-} - \bar{X}_{i}^{-} \right)^{2}} $$
(1)

where \( \bar{X}_{i} \), \( \bar{X}_{i}^{+} \) and \( \bar{X}_{i}^{-} \) denote the mean of the \( i \)-th gene over all training samples, over the positive samples and over the negative samples, respectively. Moreover, \( X_{ki}^{+} \) and \( X_{ki}^{-} \) are the values of the \( i \)-th gene on the \( k \)-th positive sample and the \( k \)-th negative sample, respectively. The numerator measures the separation between the two classes and the denominator the scatter within each class, so a larger F-score indicates a more discriminative gene.
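As an illustration of Eq. (1), the following minimal sketch computes the F-score of each gene and ranks the genes in descending order. The function names `f_score` and `rank_genes` and the list-of-lists data layout (one inner list per sample) are assumptions made for this example and do not come from the original work.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """F-score of a single gene following Eq. (1).

    pos, neg: lists with the gene's values on the positive / negative samples.
    Raises ZeroDivisionError if the gene is constant within both classes.
    """
    all_mean = mean(pos + neg)
    pos_mean, neg_mean = mean(pos), mean(neg)
    # Numerator: squared distance of each class mean from the overall mean.
    between = (pos_mean - all_mean) ** 2 + (neg_mean - all_mean) ** 2
    # Denominator: statistics.variance uses the 1/(n-1) normalisation of Eq. (1).
    within = variance(pos) + variance(neg)
    return between / within


def rank_genes(X_pos, X_neg):
    """Return (gene indices sorted by descending F-score, list of scores).

    X_pos, X_neg: lists of samples; each sample is a list of D gene values.
    """
    n_genes = len(X_pos[0])
    scores = [
        f_score([s[i] for s in X_pos], [s[i] for s in X_neg])
        for i in range(n_genes)
    ]
    order = sorted(range(n_genes), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    # Toy data: gene 0 separates the classes, gene 1 does not.
    X_pos = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0]]
    X_neg = [[7.0, 5.0], [8.0, 6.0], [9.0, 4.0]]
    print(rank_genes(X_pos, X_neg))
```

On the toy data, gene 0 has class means 2 and 8 around an overall mean of 5 with unit variance in each class, giving an F-score of 18/2 = 9, whereas gene 1 has identical class means and an F-score of 0.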

2.2 Supervised Correlation Analysis

Correlation analysis is often used to measure the correlation among objects. In tumor microarray data composed of a large number of genes, some genes may be highly correlated, which harms the final prediction of tumor classes in two ways. First, highly correlated genes increase the computational burden of prediction because more genes are involved. Second, it is difficult to select key genes from among highly correlated ones, so the final accuracy is also affected. In contrast, gene subsets with reduced redundancy are more helpful for classifying tumor gene expression data. Removing redundant genes from the selected subsets is therefore desirable to improve the final accuracy at low computational expense. In this subsection, a supervised correlation analysis model is constructed for selecting gene subsets with low correlation.

First, a correlation matrix is introduced, each element of which measures the similarity between two genes \( f_{i} \) and \( f_{j} \) using the correlation coefficient \( \rho_{ij} \), defined as follows:

$$ \rho_{ij} = \frac{E\left( \left( f_{i} - E(f_{i}) \right) \cdot \left( f_{j} - E(f_{j}) \right) \right)}{\sqrt{D(f_{i})} \cdot \sqrt{D(f_{j})}} $$
(2)

where \( E\left( \cdot \right) \) and \( D\left( \cdot \right) \) represent the expectation and the variance of features or genes, respectively.

Thus, for all the gene expression data \( X = [X_{1}, X_{2}, \ldots, X_{n}] \), a correlation coefficient exists between any two of them, from which the following correlation matrix can be constructed:

$$ C = \left[ {\begin{array}{*{20}c} {\rho_{11} } & {\rho_{12} } & \ldots & {\rho_{1n} } \\ {\rho_{21} } & {\rho_{22} } & \ldots & {\rho_{2n} } \\ \ldots & \ldots & \ldots & \ldots \\ {\rho_{n1} } & {\rho_{n2} } & \ldots & {\rho_{nn} } \\ \end{array} } \right] $$
(3)

Moreover, the label information of the original tumor gene expression data is taken into account: the sample matrix is divided into \( A \) and \( B \), with labels 1 and −1, containing the positive and the negative samples, respectively. From the labeled sample matrices \( A \) and \( B \), we obtain the corresponding correlation matrices \( C_{A} \) and \( C_{B} \).
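The construction of the class-wise correlation matrices can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficient of Eq. (2) is computed between genes, samples are stored as a list of lists (one inner list per sample), and the helper names `pearson`, `correlation_matrix` and `classwise_correlation` are hypothetical.

```python
from math import sqrt
from statistics import fmean


def pearson(u, v):
    """Correlation coefficient of Eq. (2) between two gene profiles.

    E(.) is the sample mean and D(.) the (population) variance.
    Raises ZeroDivisionError if either profile is constant.
    """
    mu, mv = fmean(u), fmean(v)
    cov = fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])
    var_u = fmean([(a - mu) ** 2 for a in u])
    var_v = fmean([(b - mv) ** 2 for b in v])
    return cov / sqrt(var_u * var_v)


def correlation_matrix(samples):
    """Gene-by-gene correlation matrix, as in Eq. (3), for one set of samples."""
    genes = list(zip(*samples))  # transpose: one tuple of values per gene
    m = len(genes)
    return [[pearson(genes[i], genes[j]) for j in range(m)] for i in range(m)]


def classwise_correlation(samples, labels):
    """Split the samples by label (1 / -1) and return (C_A, C_B)."""
    A = [s for s, y in zip(samples, labels) if y == 1]
    B = [s for s, y in zip(samples, labels) if y == -1]
    return correlation_matrix(A), correlation_matrix(B)


if __name__ == "__main__":
    samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],   # positive samples
               [1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]   # negative samples
    labels = [1, 1, 1, -1, -1, -1]
    C_A, C_B = classwise_correlation(samples, labels)
    print(C_A)
    print(C_B)
```

In the toy data the two genes rise together in the positive class and move in opposite directions in the negative class, which shows why the two class-wise matrices can differ even though they are computed over the same genes.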

Obviously, the correlation matrix is symmetric, and by the definition in Eq. (2) each of its elements lies in the interval [−1, 1]. All diagonal elements equal 1 because they reflect the correlation of each gene with itself. In general, the larger \( \rho_{ij} \) is in both correlation matrices \( C_{A} \) and \( C_{B} \), the more correlated the two genes \( f_{i} \) and \( f_{j} \) are, which raises the problem of how to set the coefficient threshold \( \rho \) that determines which genes are correlated. Specifically, genes whose correlation coefficient is larger than \( \rho \) are regarded as highly correlated; otherwise, they are regarded as uncorrelated.
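One possible way to apply the threshold \( \rho \) is a greedy filter over the candidates ranked by F-score. The text does not spell out the exact removal rule, so the sketch below rests on assumptions: a candidate is treated as redundant when its coefficient with an already kept gene exceeds \( \rho \) in both \( C_{A} \) and \( C_{B} \), and the function name `select_low_correlation` is hypothetical.

```python
def select_low_correlation(ranked, C_A, C_B, rho):
    """Greedy redundancy filter (an assumed rule, not the paper's exact one).

    ranked: candidate gene indices, best F-score first.
    C_A, C_B: class-wise correlation matrices indexed by gene.
    rho: coefficient threshold; a pair counts as highly correlated when its
         coefficient is larger than rho in both matrices.
    """
    kept = []
    for g in ranked:
        redundant = any(C_A[g][k] > rho and C_B[g][k] > rho for k in kept)
        if not redundant:
            kept.append(g)  # keep g: it is not redundant with any kept gene
    return kept


if __name__ == "__main__":
    C_A = [[1.0, 0.9, 0.1],
           [0.9, 1.0, 0.2],
           [0.1, 0.2, 1.0]]
    C_B = [[1.0, 0.8, 0.3],
           [0.8, 1.0, 0.1],
           [0.3, 0.1, 1.0]]
    print(select_low_correlation([0, 1, 2], C_A, C_B, rho=0.5))
```

Processing the candidates in ranked order means that, within a group of mutually correlated genes, the one with the highest F-score is the one retained. With the toy matrices and a threshold of 0.5, gene 1 is dropped because it is highly correlated with gene 0 in both classes.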

2.3 Outline of the Proposed Method

Based on the above analysis, the proposed method is outlined in Table 1.

Table 1. Outline of the proposed method

3 Experiments

In this section, experiments are conducted on three benchmark tumor gene expression data sets: Colon, DLBCL and Leukemia. The proposed method is compared with the F-score method, Laplacian Score [10] and Fisher Score [11]. SVM is also taken as a further comparison method for tumor gene expression data classification.

In addition, supervised correlation analysis on the Colon, DLBCL and Leukemia data involves one parameter, the correlation coefficient threshold \( \rho \). In the experiments, leave-one-out cross-validation (LOO-CV) is used to tune \( \rho \).
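The LOO-CV tuning of \( \rho \) can be sketched as a grid search in which each sample is held out once. This is an illustration under assumptions, not the experimental code: a nearest-centroid rule stands in for the SVM used in the experiments so that the example needs only the standard library, `select_genes` is a hypothetical callable that returns the gene indices chosen on a training fold for a given \( \rho \), and the grid values are arbitrary.

```python
from statistics import fmean


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT the SVM of the paper): closest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [s for s, y in zip(train_X, train_y) if y == label]
        centroid = [fmean(col) for col in zip(*members)]
        dist = sum((a - c) ** 2 for a, c in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def project(sample, genes):
    """Keep only the selected genes of one sample."""
    return [sample[g] for g in genes]


def loo_accuracy(X, y, select_genes, rho, predict=nearest_centroid_predict):
    """Leave-one-out accuracy for one threshold rho."""
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Gene selection sees the training fold only, never the held-out sample.
        genes = select_genes(train_X, train_y, rho)
        reduced = [project(s, genes) for s in train_X]
        correct += predict(reduced, train_y, project(X[i], genes)) == y[i]
    return correct / len(X)


def tune_rho(X, y, select_genes, grid):
    """Return (best rho, {rho: LOO accuracy}); ties go to the earliest grid value."""
    scores = {rho: loo_accuracy(X, y, select_genes, rho) for rho in grid}
    return max(grid, key=lambda r: scores[r]), scores


if __name__ == "__main__":
    X = [[0.0, 5.0], [1.0, 0.0], [0.5, 9.0],
         [10.0, 5.1], [11.0, 0.1], [10.5, 9.1]]
    y = [1, 1, 1, -1, -1, -1]
    # Toy selector: a low threshold keeps the informative gene 0 only.
    toy_selector = lambda train_X, train_y, rho: [0] if rho < 0.5 else [1]
    print(tune_rho(X, y, toy_selector, [0.3, 0.7]))
```

Running the selection step inside each fold keeps the held-out sample from influencing which genes are chosen, so the accuracy used to pick \( \rho \) is not optimistically biased.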

Figure 1 shows the performance curves for different values of the coefficient threshold \( \rho \) when supervised correlation analysis is applied to the training sets, from which the optimal \( \rho \) is chosen. From Fig. 1, the threshold \( \rho \) is tuned to 0.48, 0.3 and 0.14 for the Colon, DLBCL and Leukemia data sets, respectively.

Fig. 1. LOO-CV performance with varied \( \rho \) for Colon, DLBCL and Acute leukemia data sets

After tuning the coefficient threshold \( \rho \), experiments were conducted on the three data sets using the proposed method, Laplacian score, Fisher score, the F-score method and SVM. The experimental results on these tumor gene expression data sets are listed in Table 2, where the proposed method is superior to the other comparison methods.

Table 2. LOO-CV Performance comparisons on three data sets

4 Conclusions

To select key genes from tumor gene expression data, a feature selection method is proposed for tumor genotype prediction. In the proposed method, F-score is first used to sort the genes in descending order, and those with high scores are selected as the candidate gene subset. Supervised correlation analysis is then applied to reduce the redundancy in the candidate gene subset, and the remaining genes form the optimal gene subset. Finally, SVM is adopted to predict the class from the selected genes. Experiments on benchmark tumor gene expression data sets, with comparisons against related tumor gene selection methods, validate the performance of the proposed method.