Tumor Gene Selection and Prediction via Supervised Correlation Analysis Based F-Score Method

Cheng, Jia-Jun; Li, Bo

doi:10.1007/978-3-030-60802-6_2


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12464)

Included in the following conference series:

International Conference on Intelligent Computing


Abstract

Selecting key genes from tumor gene expression data is essential for genotype prediction. In this paper, a supervised gene selection method is proposed. First, all genes are sorted in descending order of their F-scores; then supervised correlation analysis is applied to reduce the redundancy among the selected genes. Finally, an SVM is used to classify the resulting gene subsets. Experiments on benchmark tumor gene expression data sets demonstrate the performance of the proposed method.


1 Introduction

Recent advances in microarray techniques have brought insights into particular cancers through the classification of their gene expression data. A variety of methods and mathematical models have been presented to manage and interpret these high-density microarray data for genotype prediction, which makes computer-aided cancer diagnosis possible [1].

Generally speaking, the algorithms adopted for cancer prediction fall into two categories: tumor feature extraction and feature selection. Because tumor microarray data are represented by a large number of variables (genes), feature extraction methods are often used, including conventional linear models such as linear discriminant analysis (LDA) [2], principal component analysis (PCA) [3], and independent component analysis (ICA) [4]. To extract features from gene microarray data with nonlinear techniques, kernel approaches [5] and nonlinear manifold learning models [6, 7] have also been constructed, and they too contribute to the final class prediction for tumor gene expression data. Most of these methods perform well at low computational expense.

The other category aims to identify key tumor pathogenic genes, from which the final cancer diagnosis can be determined and treatments can be accurately targeted. As a result, feature selection algorithms for tumor gene expression data have received considerable attention. Tumor gene selection, also called feature subset selection, seeks the key genes most strongly related to the corresponding tumors, which makes the learning models simpler and improves their generalization.

To evaluate the selected tumor genes, a classifier must be combined with the selection procedure; such a combination is called a wrapper. Wrapper methods with support vector machines (SVM) were developed for binary classification [8], and SVM-RFE was further extended to multi-class classification [9]. However, wrapper methods easily incur high computational expense. Another shortcoming is that the selected gene subset depends heavily on the classifier used. In other words, the same gene subset may perform differently under different classifiers.

Unlike traditional wrapper algorithms, this paper focuses on key tumor gene selection itself and puts forward a new tumor gene selection approach. First, the F-score is used as an evaluation metric to sort the genes in descending order, and the genes with high scores are chosen as candidates. Then, supervised correlation analysis is applied to reduce the redundancy among the candidates, yielding key genes with low mutual correlation. Finally, an SVM is used to predict the tumor classes from the selected gene subsets.

2 Method

In this section, we first introduce the F-score metric, which weights the contribution of the $ i $-th gene to discriminant learning. To reduce the redundancy remaining in the subsets selected by F-score, supervised correlation analysis is then described in detail, after which the proposed method is outlined.

2.1 F-Score

The F-score is commonly used to characterize the discriminative power of features for binary classification. Among the training samples of the tumor gene expression data $ X = [X_{1} ,X_{2} , \ldots ,X_{n} ] \in {\mathbb{R}}^{D \times n} $, there are $ n_{+} $ positive samples and $ n_{-} $ negative samples, where $ n_{+} + n_{-} = n $. The F-score of the $ i $-th gene is then defined as:

$$ F_{i}\text{-score} = \frac{\left( \overline{X_{i}^{+}} - \overline{X_{i}} \right)^{2} + \left( \overline{X_{i}^{-}} - \overline{X_{i}} \right)^{2}}{\frac{1}{n_{+} - 1}\sum\limits_{k = 1}^{n_{+}} \left( X_{ki}^{+} - \overline{X_{i}^{+}} \right)^{2} + \frac{1}{n_{-} - 1}\sum\limits_{k = 1}^{n_{-}} \left( X_{ki}^{-} - \overline{X_{i}^{-}} \right)^{2}} $$

(1)

where $ \overline{X_{i}} $, $ \overline{X_{i}^{+}} $ and $ \overline{X_{i}^{-}} $ denote the mean of the $ i $-th gene over all training samples, over the positive samples and over the negative samples, respectively. Moreover, $ X_{ki}^{+} $ and $ X_{ki}^{-} $ are the values of the $ i $-th gene in the $ k $-th positive sample and the $ k $-th negative sample, respectively. The numerator measures how far the two class means lie from the overall mean, and the denominator is the sum of the within-class sample variances, so a larger F-score indicates a more discriminative gene.
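As an illustration of Eq. (1), the following is a minimal sketch in plain Python, not the authors' implementation. The function names `f_scores` and `rank_genes`, the row-per-sample data layout and the handling of zero within-class variance are assumptions made for this example.

```python
from statistics import mean, variance


def f_scores(samples, labels):
    """Return the F-score of Eq. (1) for every gene.

    samples: list of n samples, each a list of D expression values.
    labels:  list of n class labels, +1 (positive) or -1 (negative).
    Each class needs at least two samples so the sample variance exists.
    """
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == -1]
    scores = []
    for i in range(len(samples[0])):
        all_i = [s[i] for s in samples]
        pos_i = [s[i] for s in pos]
        neg_i = [s[i] for s in neg]
        overall = mean(all_i)
        # Numerator: squared distance of each class mean from the overall mean.
        between = (mean(pos_i) - overall) ** 2 + (mean(neg_i) - overall) ** 2
        # Denominator: sum of the unbiased (n - 1) within-class variances.
        within = variance(pos_i) + variance(neg_i)
        if within > 0:
            scores.append(between / within)
        else:
            # Assumption: a gene with zero within-class variance scores
            # infinity if the class means differ, and 0 if it is constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def rank_genes(samples, labels):
    """Gene indices sorted by descending F-score."""
    scores = f_scores(samples, labels)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

On a toy set with two positive samples `[1.0, 5.0]`, `[2.0, 3.0]` and two negative samples `[8.0, 4.0]`, `[9.0, 6.0]`, the first gene separates the classes far better than the second, so it is ranked first.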

2.2 Supervised Correlation Analysis

Correlation analysis is often used to measure the relationship among variables. In tumor microarray data composed of a large number of genes, some genes may be highly correlated, which harms the final prediction of tumor classes in two ways. First, highly correlated genes increase the computational burden of prediction, because more genes are involved. Second, it is difficult to pick out the key genes from among highly correlated ones, so the final accuracy also suffers. By contrast, gene subsets with the redundancy removed are more helpful for classifying tumor gene expression data. Removing redundant genes from the selected subsets is therefore desirable to improve the final accuracy at low computational expense. In this subsection, a supervised correlation analysis model is constructed to select gene subsets with low correlation.

First, a correlation matrix is introduced, each entry of which measures the similarity between two genes $ f_{i} $ and $ f_{j} $ by the correlation coefficient $ \rho_{ij} $, defined as follows:

$$ \rho_{ij} = \frac{E\left( \left( f_{i} - E(f_{i}) \right) \cdot \left( f_{j} - E(f_{j}) \right) \right)}{\sqrt{D(f_{i})} \cdot \sqrt{D(f_{j})}} $$

(2)

where $ E\left( \cdot \right) $ and $ D\left( \cdot \right) $ represent the expectation and the variance of features or genes, respectively.

Thus, for the gene expression data $ X = [X_{1} ,X_{2} , \ldots ,X_{n} ] $, a correlation coefficient exists between any pair, and these coefficients form the correlation matrix below:

$$ C = \left[ {\begin{array}{*{20}c} {\rho_{11} } & {\rho_{12} } & \ldots & {\rho_{1n} } \\ {\rho_{21} } & {\rho_{22} } & \ldots & {\rho_{2n} } \\ \ldots & \ldots & \ldots & \ldots \\ {\rho_{n1} } & {\rho_{n2} } & \ldots & {\rho_{nn} } \\ \end{array} } \right] $$

(3)

Moreover, the label information of the original tumor gene expression data is taken into account: the sample matrix is divided into $ A $ and $ B $, with labels 1 and −1, containing the positive and the negative samples, respectively. From the labeled sample matrices $ A $ and $ B $, we obtain the corresponding correlation matrices $ C_{A} $ and $ C_{B} $.

Obviously, each correlation matrix is symmetric. The coefficient defined in Eq. (2) lies in the interval $ [-1, 1] $, so its magnitude lies in $ [0, 1] $. All diagonal elements equal 1, because they measure the correlation of each gene with itself. In general, the larger $ \rho_{ij} $ is in both correlation matrices $ C_{A} $ and $ C_{B} $, the more correlated the two genes $ f_{i} $ and $ f_{j} $ are. This raises the problem of how to set the coefficient threshold $ \rho $ that determines which genes are correlated. Specifically, two genes whose correlation coefficient is larger than $ \rho $ are regarded as highly correlated; otherwise, they are regarded as uncorrelated.
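The thresholding idea can be sketched as follows. This is a hedged illustration rather than the procedure of the paper: the text does not state how $ C_{A} $ and $ C_{B} $ are combined, so the sketch assumes a gene is redundant only when the magnitude of its correlation with an already kept gene exceeds $ \rho $ in both classes, and it assumes a greedy pass over genes in ranked order. The names `pearson`, `correlation_matrix` and `prune_redundant` are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Correlation coefficient of Eq. (2) for two equal-length value lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    # Assumption: a constant gene has no defined correlation; return 0.
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def correlation_matrix(samples, genes):
    """Gene-by-gene correlation matrix over the samples of one class."""
    cols = [[s[g] for s in samples] for g in genes]
    return [[pearson(a, b) for b in cols] for a in cols]


def prune_redundant(samples, labels, ranked_genes, rho):
    """Keep genes, in ranked order, that are not redundant with a kept gene."""
    A = [s for s, y in zip(samples, labels) if y == 1]   # positive samples
    B = [s for s, y in zip(samples, labels) if y == -1]  # negative samples
    C_A = correlation_matrix(A, ranked_genes)
    C_B = correlation_matrix(B, ranked_genes)
    kept = []  # positions into ranked_genes
    for j in range(len(ranked_genes)):
        # Assumption: redundant only if |rho_ij| > rho in BOTH classes.
        redundant = any(
            abs(C_A[k][j]) > rho and abs(C_B[k][j]) > rho for k in kept
        )
        if not redundant:
            kept.append(j)
    return [ranked_genes[j] for j in kept]
```

Because the genes are visited in ranked order, the higher-scoring member of each correlated pair is the one retained. Requiring the threshold to be exceeded in either class instead of both would prune more aggressively; which rule the paper intends is not stated.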

2.3 Outline of the Proposed Method

Based on the above analysis, the outline of the proposed method is summarized in Table 1.

Table 1. Outline of the proposed method


3 Experiments

In this section, experiments are conducted on three benchmark tumor gene expression data sets: Colon, DLBCL and Leukemia. The proposed method is compared with the F-score method, Laplacian Score [10] and Fisher Score [11]. SVM is also included as a further comparison method for tumor gene expression data classification.

In addition, supervised correlation analysis on the Colon, DLBCL and Leukemia data requires a value for the correlation coefficient threshold $ \rho $. In the experiments, leave-one-out cross validation (LOO-CV) is used to tune $ \rho $.

Figure 1 shows the performance curves for different values of the coefficient threshold $ \rho $ when supervised correlation analysis is applied to the training sets, from which the optimal $ \rho $ is chosen. From Fig. 1, the threshold $ \rho $ is tuned to 0.48, 0.3 and 0.14 for the Colon, DLBCL and Leukemia data sets, respectively.
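Tuning a threshold by LOO-CV can be sketched as below. This is an illustrative sketch under stated assumptions, not the experimental code of the paper: `select` and `classify` are hypothetical callables supplied by the caller, gene selection is assumed to be repeated inside every fold, and `nearest_centroid` is a simple stand-in classifier used here in place of the SVM so that the example needs only the standard library.

```python
def loo_accuracy(samples, labels, select, classify, rho):
    """Leave-one-out accuracy for one threshold value rho."""
    correct = 0
    for t in range(len(samples)):
        train_X = samples[:t] + samples[t + 1:]
        train_y = labels[:t] + labels[t + 1:]
        # Assumption: genes are selected on the training fold only,
        # so the held-out sample never influences the selection.
        genes = select(train_X, train_y, rho)
        reduced = [[x[g] for g in genes] for x in train_X]
        probe = [samples[t][g] for g in genes]
        if classify(reduced, train_y, probe) == labels[t]:
            correct += 1
    return correct / len(samples)


def tune_rho(samples, labels, select, classify, grid):
    """Return (best_rho, best_accuracy); ties go to the earliest grid value."""
    best_rho, best_acc = None, -1.0
    for rho in grid:
        acc = loo_accuracy(samples, labels, select, classify, rho)
        if acc > best_acc:
            best_rho, best_acc = rho, acc
    return best_rho, best_acc


def nearest_centroid(train_X, train_y, x):
    """Stand-in classifier (NOT the paper's SVM): label of the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

In use, `select` would wrap the F-score ranking and the correlation pruning with the given `rho`, and `grid` would be a list of candidate thresholds between 0 and 1. Selecting genes inside each fold costs more computation but avoids an optimistic bias in the estimated accuracy.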

After tuning the coefficient threshold $ \rho $, experiments were conducted on the three data sets using the proposed method, Laplacian score, Fisher score, the F-score method and the SVM method. The experimental results are listed in Table 2, which shows that the proposed method is superior to the other comparison methods.

Table 2. LOO-CV Performance comparisons on three data sets


4 Conclusions

To select key genes from tumor gene expression data, a feature selection method is proposed for tumor genotype prediction. In the proposed method, the F-score is first used to sort the genes in descending order, and those with high scores are selected as the candidate gene subset. Supervised correlation analysis is then applied to reduce the redundancy in the candidate subset, and the genes that remain form the optimal gene subset. Finally, an SVM is adopted to predict the tumor classes from this subset. Experiments on benchmark tumor gene expression data sets, with comparisons against related tumor gene selection methods, validate the performance of the proposed method.

References

1. Liu, X.H., Cai, C.Z., Yuan, Q.F., Xiao, H.G., Kong, C.Y.: Computer-aided diagnosis of breast cancer based on support vector machine. J. Chongqing Univ. (Nat. Sci. Ed.) 30(6), 140–144 (2007)
2. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 23(2), 228–233 (2001)
3. Jolliffe, I.T.: Principal component analysis (2002)
4. Huang, D.S., Zheng, C.H.: Independent component analysis based penalized discriminate method for tumor classification using gene expression data. Bioinformatics 22, 1855–1862 (2006)
5. Ha, V.S., Nguyen, H.N.: Machine Learning and Data Mining in Pattern Recognition (2016)
6. Li, B., Tian, B.B., Zhang, X.L., Zhang, X.P.: Locally linear representation fisher criterion based tumor gene expressive data classification. Comput. Biol. Med. 44(10), 48–54 (2014)
7. Pillati, M., Viroli, C.: Supervised locally linear embedding for classification: an application to gene expression data analysis. In: 29th Annual Conference of the German Classification Society (GfKl 2005), pp. 15–18 (2005)
8. Juan, M.G.G., Juan, G.S., Pablo, E.M., Elies, F.G., Emilio, S.O.: Sparse manifold clustering and embedding to discriminant gene expression profiles of glioblastoma and meningioma tumors. Comput. Biol. Med. 43(11), 1863–1869 (2013)
9. Mundra, P.A., Rajapakse, J.C.: SVM-RFE with MRMR filter for gene selection. IEEE Trans. Nanobiosci. 9(1), 31–37 (2010)
10. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in Neural Information Processing Systems, pp. 507–514 (2005)
11. Sun, L., Zhang, X.-Y., Qian, Y.-H., Xu, J.-C., Zhang, S.-G., Tian, Y.: Joint neighborhood entropy-based gene selection method with fisher score for tumor classification. Appl. Intell. 49(4), 1245–1259 (2018). https://doi.org/10.1007/s10489-018-1320-1


Acknowledgments

This work was partly supported by China Post-doctoral Science Foundation (2016M601646).

Author information

Authors and Affiliations

School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, Hubei, China
Jia-Jun Cheng & Bo Li
Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, 430065, Hubei, China
Jia-Jun Cheng & Bo Li


Corresponding author

Correspondence to Bo Li.

Editor information

Editors and Affiliations

Institute of Machine Learning and Systems Biology, Tongji University, Shanghai, China
De-Shuang Huang
School of Electrical Engineering, University of Ulsan, Ulsan, Korea (Republic of)
Kang-Hyun Jo


Cite this paper

Cheng, JJ., Li, B. (2020). Tumor Gene Selection and Prediction via Supervised Correlation Analysis Based F-Score Method. In: Huang, DS., Jo, KH. (eds) Intelligent Computing Theories and Application. ICIC 2020. Lecture Notes in Computer Science, vol 12464. Springer, Cham. https://doi.org/10.1007/978-3-030-60802-6_2


DOI: https://doi.org/10.1007/978-3-030-60802-6_2
Published: 05 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60801-9
Online ISBN: 978-3-030-60802-6
eBook Packages: Computer Science, Computer Science (R0)
