Abstract
Selecting key genes from tumor gene expression data is essential for genotype prediction. This paper proposes a supervised gene selection method. First, all genes are sorted in descending order of their F-scores; supervised correlation analysis is then applied to reduce the redundancy among the selected genes. Finally, an SVM is used to classify the resulting gene subsets. Experiments on benchmark tumor gene expression data sets demonstrate the performance of the proposed method.
1 Introduction
Recent advances in microarray techniques provide insight into particular cancers through the classification of their gene expression data. A variety of methods and mathematical models have been presented to manage and interpret these high-density microarray data for genotype prediction, which makes computer-aided cancer diagnosis possible [1].
Generally speaking, the algorithms adopted for cancer prediction fall into two categories: tumor feature extraction and feature selection. Because tumor microarray data involve a large number of variables (genes), feature extraction methods are often applied, including conventional linear models such as linear discriminant analysis (LDA) [2], principal component analysis (PCA) [3], and independent component analysis (ICA) [4]. To extract features from gene microarray data with nonlinear techniques, kernel approaches [5] and nonlinear manifold learning models [6, 7] have also been constructed, and they likewise contribute to the final class prediction of tumor gene expression data. Most of these methods achieve their performance at low computational expense.
The other category aims to identify key pathogenic tumor genes, on which the final cancer diagnosis can be based and appropriate treatments offered. As a result, feature selection algorithms for tumor gene expression data have received considerable attention. Tumor gene selection, also called feature subset selection, seeks the key genes most strongly related to the corresponding tumors, which makes the learning models simpler and improves their generalization.
To evaluate the selected tumor genes, classifiers are combined with the selection procedure, an approach known as the wrapper. Accordingly, wrapper methods based on the support vector machine (SVM) were developed for binary classification [8]. SVM-RFE was further extended to multi-class classification [9]. However, wrapper methods easily incur high computational expense. Another shortcoming is that the selected gene subset depends heavily on the classifier used; in other words, the same gene subset may perform differently under different classifiers.
Unlike traditional wrapper algorithms, this paper focuses on key tumor gene selection and puts forward a new tumor gene selection approach. First, the F-score is used as an evaluation metric to sort the genes in descending order, and the genes with high scores are chosen as candidates. Then, supervised correlation analysis is applied to reduce the redundancy among the candidates, which yields key genes with low mutual correlation. Finally, an SVM is used to predict the classes from the selected gene subsets.
2 Method
This section first introduces the F-score metric, which weights the contribution of the \( i \)-th gene to discriminant learning. Supervised correlation analysis, which reduces the redundancy in the subsets selected by F-score, is then described in detail, after which the proposed method is outlined.
2.1 F-Score
The F-score is commonly used to characterize the discriminative power of selected features for binary classification. For the training samples of tumor gene expression data \( X = [X_{1}, X_{2}, \ldots, X_{n}] \in {\mathbb{R}}^{D \times n} \), there are \( n_{+} \) positive samples and \( n_{-} \) negative samples, where \( n_{+} + n_{-} = n \). The F-score of the \( i \)-th gene can then be formulated as follows:
\[ F(i) = \frac{\left( \bar{X}_{i}^{+} - \bar{X}_{i} \right)^{2} + \left( \bar{X}_{i}^{-} - \bar{X}_{i} \right)^{2}}{\frac{1}{n_{+} - 1}\sum\limits_{k = 1}^{n_{+}} \left( X_{ki}^{+} - \bar{X}_{i}^{+} \right)^{2} + \frac{1}{n_{-} - 1}\sum\limits_{k = 1}^{n_{-}} \left( X_{ki}^{-} - \bar{X}_{i}^{-} \right)^{2}} \]
where \( \bar{X}_{i} \), \( \bar{X}_{i}^{+} \) and \( \bar{X}_{i}^{-} \) denote the mean of the \( i \)-th gene over all the training samples, over the positive samples and over the negative samples, respectively. Moreover, \( X_{ki}^{+} \) and \( X_{ki}^{-} \) are the values of the \( i \)-th gene on the \( k \)-th positive sample and on the \( k \)-th negative sample, respectively.
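As an illustration of the ranking step, the following minimal sketch computes a two-class F-score per gene and sorts the genes in descending order. It assumes the standard form of the F-score (squared deviations of the two class means from the overall mean, divided by the sum of the two within-class sample variances). The function names `f_score` and `rank_genes`, the row-per-sample data layout, and the handling of zero variance are assumptions made for this example only.

```python
from statistics import mean


def f_score(pos, neg):
    """F-score of one gene from its values on positive and negative samples."""
    all_mean = mean(pos + neg)
    pos_mean, neg_mean = mean(pos), mean(neg)
    # Numerator: squared deviation of each class mean from the overall mean.
    between = (pos_mean - all_mean) ** 2 + (neg_mean - all_mean) ** 2
    # Denominator: sum of the two within-class sample variances.
    within = (sum((x - pos_mean) ** 2 for x in pos) / (len(pos) - 1)
              + sum((x - neg_mean) ** 2 for x in neg) / (len(neg) - 1))
    if within == 0:
        # Assumption: a gene that is constant within both classes is scored
        # as perfectly discriminative if the class means differ, else as 0.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_genes(X, y):
    """Rank genes by descending F-score.

    X: list of samples, each a list of D gene values (one row per sample).
    y: list of labels, +1 for positive and -1 for negative samples.
    Returns (gene indices in descending F-score order, list of scores).
    """
    n_genes = len(X[0])
    scores = []
    for i in range(n_genes):
        pos = [row[i] for row, label in zip(X, y) if label == 1]
        neg = [row[i] for row, label in zip(X, y) if label == -1]
        scores.append(f_score(pos, neg))
    order = sorted(range(n_genes), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Each class needs at least two samples, because the within-class variances divide by \( n_{+} - 1 \) and \( n_{-} - 1 \). The top-ranked indices in `order` would serve as the candidate gene subset.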
2.2 Supervised Correlation Analysis
Correlation analysis is often used to measure the correlation among objects. Tumor microarray data comprise a large number of genes, some of which may be highly correlated, and this harms the final prediction of tumor classes in two ways. First, highly correlated genes increase the computational burden of prediction because more genes are involved. Second, it is difficult to select key genes from among highly correlated ones, so the final accuracy also suffers. By contrast, gene subsets with reduced redundancy are more helpful for classifying tumor gene expression data. Removing redundant genes from the selected subsets is therefore desirable for improving the final accuracy at low computational expense. This subsection constructs a supervised correlation analysis model for selecting gene subsets with low correlation.
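First, a correlation matrix is introduced, each element of which measures the similarity between two genes \( f_{i} \) and \( f_{j} \) by the correlation coefficient \( \rho_{ij} \), defined as follows:
\[ \rho_{ij} = \frac{E\left( \left( f_{i} - E\left( f_{i} \right) \right)\left( f_{j} - E\left( f_{j} \right) \right) \right)}{\sqrt{D\left( f_{i} \right) D\left( f_{j} \right)}} \]
where \( E\left( \cdot \right) \) and \( D\left( \cdot \right) \) represent the expectation and the variance of features or genes, respectively.
Thus, for the gene expression data \( X = [X_{1} ,X_{2} , \ldots ,X_{n} ] \), a correlation coefficient exists between any two of the \( D \) genes, from which a correlation matrix can be constructed:
\[ C = \begin{bmatrix} \rho_{11} & \rho_{12} & \cdots & \rho_{1D} \\ \rho_{21} & \rho_{22} & \cdots & \rho_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{D1} & \rho_{D2} & \cdots & \rho_{DD} \end{bmatrix} \]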
Moreover, the label information of the original tumor gene expression data is taken into account: the sample matrix is divided into \( A \) and \( B \), with the labels 1 and −1 characterizing the positive and the negative samples, respectively. From the labeled sample matrices \( A \) and \( B \), the corresponding correlation matrices \( C_{A} \) and \( C_{B} \) are obtained.
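A minimal sketch of the class-wise correlation matrices follows, assuming that \( \rho_{ij} \) is the Pearson correlation coefficient between two genes computed over the samples of one class. The helper names (`pearson`, `correlation_matrix`, `classwise_correlation`), the row-per-sample layout, and the choice to return 0 for a constant gene are assumptions made for this example.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        # Assumption: a constant gene is treated as uncorrelated with others.
        return 0.0
    return cov / sqrt(var_u * var_v)


def correlation_matrix(samples):
    """D x D gene-gene correlation matrix for a list of samples (rows)."""
    n_genes = len(samples[0])
    genes = [[row[i] for row in samples] for i in range(n_genes)]
    return [[1.0 if i == j else pearson(genes[i], genes[j])
             for j in range(n_genes)]
            for i in range(n_genes)]


def classwise_correlation(X, y):
    """Split samples by label (+1 / -1) and return (C_A, C_B)."""
    A = [row for row, label in zip(X, y) if label == 1]
    B = [row for row, label in zip(X, y) if label == -1]
    return correlation_matrix(A), correlation_matrix(B)
```

Computing the matrix only over the candidate genes kept after F-score ranking, rather than over all \( D \) genes, would keep the cost of this step manageable on microarray data.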
Obviously, the correlation matrix is symmetric. The absolute value of each element lies in the interval [0, 1], and all the diagonal elements equal 1 because they reflect the correlation of each gene with itself. In general, the larger \( \rho_{ij} \) is in both correlation matrices \( C_{A} \) and \( C_{B} \), the more correlated the two genes \( f_{i} \) and \( f_{j} \) are. This raises the problem of how to set the coefficient threshold \( \rho \) that determines which genes are correlated: genes whose correlation coefficient is larger than \( \rho \) are regarded as highly correlated; otherwise, they are regarded as uncorrelated.
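One way the threshold could be applied is sketched below. The sketch makes two assumptions that go beyond the text: a pair of genes counts as redundant only when its absolute correlation exceeds \( \rho \) in both \( C_{A} \) and \( C_{B} \), and genes are visited greedily in descending F-score order so that the higher-scoring gene of a redundant pair is the one retained. The function name `remove_redundant` is hypothetical.

```python
def remove_redundant(ranked, C_A, C_B, rho):
    """Greedy redundancy filter over candidate genes.

    ranked: gene indices sorted by descending F-score.
    C_A, C_B: gene-gene correlation matrices of the positive and negative
              classes, indexed by gene index.
    rho: correlation coefficient threshold.
    Returns the retained gene indices, still in ranked order.
    """
    kept = []
    for g in ranked:
        # Assumption: g is redundant if it is highly correlated with an
        # already retained gene in BOTH class-wise matrices.
        redundant = any(abs(C_A[g][k]) > rho and abs(C_B[g][k]) > rho
                        for k in kept)
        if not redundant:
            kept.append(g)
    return kept
```

With this rule, a smaller \( \rho \) removes more genes and a larger \( \rho \) removes fewer, which is why the threshold needs to be tuned per data set.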
2.3 Outline of the Proposed Method
Based on the above analysis, the proposed method is outlined in Table 1.
3 Experiments
In this section, experiments are conducted on three benchmark tumor gene expression data sets: Colon, DLBCL and Leukemia. The proposed method is compared with the F-score method, Laplacian Score [10] and Fisher Score [11]. SVM is also included as a further comparison method for classifying tumor gene expression data.
In addition, supervised correlation analysis on the Colon, DLBCL and Leukemia data involves one parameter, the correlation coefficient threshold \( \rho \). In the experiments, leave-one-out cross-validation (LOO-CV) is used to tune \( \rho \).
Figure 1 shows the performance curves for different values of the coefficient threshold \( \rho \) when supervised correlation analysis is applied to the training sets, from which the optimal \( \rho \) is chosen. As Fig. 1 shows, \( \rho \) is tuned to 0.48, 0.3 and 0.14 for the Colon, DLBCL and Leukemia data sets, respectively.
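The tuning of \( \rho \) by LOO-CV can be sketched as follows. This is an illustration under stated assumptions, not the experimental code: `select_genes` is a hypothetical callback standing for the gene selection step (F-score ranking followed by redundancy removal at a given threshold), and a simple nearest-centroid rule is swapped in for the SVM so that the example stays self-contained.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT the SVM used in the experiments)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, select_genes, rho):
    """Leave-one-out accuracy for one threshold value.

    Genes are re-selected on each training fold so that the held-out
    sample never influences the selection.
    """
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        genes = select_genes(train_X, train_y, rho)  # hypothetical callback
        reduced_train = [[row[g] for g in genes] for row in train_X]
        reduced_test = [X[i][g] for g in genes]
        pred = nearest_centroid_predict(reduced_train, train_y, reduced_test)
        correct += pred == y[i]
    return correct / len(X)


def tune_rho(X, y, select_genes, candidates):
    """Return the candidate threshold with the best LOO accuracy and all scores."""
    scores = {rho: loo_accuracy(X, y, select_genes, rho) for rho in candidates}
    best = max(candidates, key=lambda rho: scores[rho])
    return best, scores
```

Ties are resolved in favor of the earliest candidate in `candidates`, so the order of the grid matters when several thresholds reach the same accuracy.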
After tuning the coefficient threshold \( \rho \), experiments were conducted on the three data sets using the proposed method, Laplacian Score, Fisher Score, the F-score method and the SVM method. The experimental results on these tumor gene expression data sets are listed in Table 2, which shows that the proposed method is superior to the other comparison methods.
4 Conclusions
To select key genes from tumor gene expression data, a feature selection method is proposed that can be used for tumor genotype prediction. In the proposed method, the F-score is first used to sort the genes in descending order, and those with high scores are selected as the candidate gene subset. Supervised correlation analysis is then applied to reduce the redundancy in the candidate subset, and the remaining genes form the optimal gene subset. Finally, an SVM is adopted to predict the classes. Experiments on benchmark tumor gene expression data sets, with comparisons against related tumor gene selection methods, validate the performance of the proposed method.
References
1. Liu, X.H., Cai, C.Z., Yuan, Q.F., Xiao, H.G., Kong, C.Y.: Computer-aided diagnosis of breast cancer based on support vector machine. J. Chongqing Univ. (Nat. Sci. Ed.) 30(6), 140–144 (2007)
2. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 23(2), 228–233 (2001)
3. Jolliffe, I.T.: Principal component analysis (2002)
4. Huang, D.S., Zheng, C.H.: Independent component analysis based penalized discriminate method for tumor classification using gene expression data. Bioinformatics 22, 1855–1862 (2006)
5. Ha, V.S., Nguyen, H.N.: Machine Learning and Data Mining in Pattern Recognition (2016)
6. Li, B., Tian, B.B., Zhang, X.L., Zhang, X.P.: Locally linear representation fisher criterion based tumor gene expressive data classification. Comput. Biol. Med. 44(10), 48–54 (2014)
7. Pillati, M., Viroli, C.: Supervised locally linear embedding for classification: an application to gene expression data analysis. In: 29th Annual Conference of the German Classification Society (GfKl 2005), pp. 15–18 (2005)
8. Juan, M.G.G., Juan, G.S., Pablo, E.M., Elies, F.G., Emilio, S.O.: Sparse manifold clustering and embedding to discriminant gene expression profiles of glioblastoma and meningioma tumors. Comput. Biol. Med. 43(11), 1863–1869 (2013)
9. Mundra, P.A., Rajapakse, J.C.: SVM-RFE with MRMR filter for gene selection. IEEE Trans. Nanobiosci. 9(1), 31–37 (2010)
10. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in Neural Information Processing Systems, pp. 507–514 (2005)
11. Sun, L., Zhang, X.-Y., Qian, Y.-H., Xu, J.-C., Zhang, S.-G., Tian, Y.: Joint neighborhood entropy-based gene selection method with fisher score for tumor classification. Appl. Intell. 49(4), 1245–1259 (2018). https://doi.org/10.1007/s10489-018-1320-1
Acknowledgments
This work was partly supported by China Post-doctoral Science Foundation (2016M601646).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cheng, JJ., Li, B. (2020). Tumor Gene Selection and Prediction via Supervised Correlation Analysis Based F-Score Method. In: Huang, DS., Jo, KH. (eds) Intelligent Computing Theories and Application. ICIC 2020. Lecture Notes in Computer Science(), vol 12464. Springer, Cham. https://doi.org/10.1007/978-3-030-60802-6_2
DOI: https://doi.org/10.1007/978-3-030-60802-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60801-9
Online ISBN: 978-3-030-60802-6
eBook Packages: Computer Science (R0)