Abstract
Objective: Breast cancer microarray data form a repository of thousands of gene expression values, of differing strengths, for each cancer cell. The genes responsible for cancer growth need to be detected. The proposed work aims to identify a statistical test for extracting differentially expressed genes from microarray gene expression data, and a suitable classifier for classifying genes as diseased or control. Method: Cancerous genes are identified from the p-values of six statistical tests, namely the Welch test, the analysis of variance (ANOVA) test, the Wilcoxon signed rank sum test, the Kruskal–Wallis test, linear models for microarray data (LIMMA), and the F-test. The identified cancer genes are used to classify cancer patients using seven classifiers, namely linear discriminant analysis (LDA), K-nearest neighbor (KNN), Naïve Bayesian, linear support vector machine, support vector machine with radial basis function, C5.0, and C5.0 with boosting. Performance is evaluated using accuracy, sensitivity, and specificity. Result: The experiment uses a microarray breast cancer dataset of 32 cancer patients and 28 non-cancer patients, with 25,575 genes for each patient. The maximum classification accuracy of 100% is obtained when the LIMMA test is used to extract differentially expressed cancer genes and KNN is used for classification.
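The two-stage workflow described in the abstract (a per-gene statistical filter, then a classifier scored by accuracy, sensitivity, and specificity) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy expression matrix, the function names (`welch_t`, `select_genes`, `knn_predict`, `evaluate`), and the leave-one-out protocol are all assumptions made here. The chapter ranks genes by p-value; this sketch ranks by the absolute Welch t statistic as a stand-in, because turning t into a p-value needs the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def select_genes(X, y, k):
    """Indices of the k genes with the largest |Welch t| between the classes."""
    cancer = [row for row, label in zip(X, y) if label == 1]
    control = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for g in range(len(X[0])):
        t = welch_t([r[g] for r in cancer], [r[g] for r in control])
        scores.append((abs(t), g))
    return sorted(g for _, g in sorted(scores, reverse=True)[:k])

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def evaluate(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return (tp + tn) / len(pairs), tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy data: 6 samples x 4 genes (the real dataset has 60 x 25,575).
# Genes 0 and 2 differ between classes; genes 1 and 3 are noise.
X = [
    [9.1, 5.0, 2.1, 4.0],
    [8.7, 4.2, 1.8, 5.1],
    [9.4, 5.5, 2.4, 4.6],
    [3.2, 4.9, 7.9, 4.4],
    [2.8, 5.3, 8.3, 5.0],
    [3.5, 4.4, 7.6, 4.2],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = cancer, 0 = control

# Leave-one-out evaluation; genes are re-selected inside each fold so the
# held-out sample never influences which genes are kept.
predictions = []
for i in range(len(X)):
    train_X = X[:i] + X[i + 1:]
    train_y = y[:i] + y[i + 1:]
    genes = select_genes(train_X, train_y, k=2)
    reduced = [[row[g] for g in genes] for row in train_X]
    predictions.append(knn_predict(reduced, train_y, [X[i][g] for g in genes]))

accuracy, sensitivity, specificity = evaluate(y, predictions)
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Swapping the filter for another test, or KNN for another classifier, changes only `select_genes` or `knn_predict`; the evaluation loop stays the same, which is what makes a test-by-classifier comparison of the kind reported in the abstract straightforward to set up.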
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Alagukumar, S., Kathirvalavakumar, T. (2022). Classifying Microarray Gene Expression Cancer Data Using Statistical Feature Selection and Machine Learning Methods. In: Saraswat, M., Sharma, H., Balachandran, K., Kim, J.H., Bansal, J.C. (eds) Congress on Intelligent Systems. Lecture Notes on Data Engineering and Communications Technologies, vol 114. Springer, Singapore. https://doi.org/10.1007/978-981-16-9416-5_5
DOI: https://doi.org/10.1007/978-981-16-9416-5_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-9415-8
Online ISBN: 978-981-16-9416-5
eBook Packages: Intelligent Technologies and Robotics (R0)