Dimension Reduction of Gene Expression Data for Designing Optimized Rule Base Classifier

Paul, Amit; Sil, Jaya; Das Mukhopadhyay, Chitrangada

doi:10.1007/978-81-322-1856-2_15

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 266)

Abstract

The paper highlights the need for dimension reduction of voluminous gene expression microarray data in order to develop a robust classifier that predicts patients with cancerous genes. The proposed algorithm builds a fuzzy rule-based classifier with an optimized rule set without sacrificing much classification accuracy. The gene expression matrix is first discretized using linguistic values. The importance factor of each gene is then evaluated; it represents the degree of presence of a unique linguistic value of the gene in both the disease and non-disease classes. The initial fuzzy rule base consists of the highest-ranking genes, and other genes are gradually included in the rule base in order to achieve maximum classification accuracy. An optimum rule set is thus built from the important genes for classifying the test data set. The proposed methodology has been demonstrated on the lung cancer classification problem, using gene expression data from 97 smokers with lung cancer and 90 without. The results are promising even though most of the genes are removed from the original data.

1 Introduction

Machine learning and data mining techniques have long been applied successfully [1–4] in biomedical data analysis to extract knowledge [5–7]. Dimensionality reduction is highly relevant in bioinformatics research, particularly for microarray data, which is characterized by relatively few samples in a high-dimensional gene (feature) space. Irrelevant genes (features) lead to insufficient classification accuracy and make it harder to find potentially useful knowledge [8, 9]. Gene selection has become one of the main sub-fields of bioinformatics data mining [10–12]. In the context of classification, the main goal of gene selection is to search for an optimal gene subset that leads to improved classification performance. During the past decades, extensive research has been conducted in multidisciplinary fields including statistics, pattern recognition, machine learning and data mining [10, 13, 14].

Continuous-valued attributes generate a large rule set, and most classifiers with such a large number of rules are unstable under even slight changes to the training data set. Fuzzy rule-based systems have been successfully applied in areas such as control, decision making, classification and many more [15, 16]. Some machine learning based works that generate fuzzy if-then rules are demonstrated in [17–19]. A version of the fuzzy-ID3 algorithm, which induces fuzzy decision trees, is proposed in [17]. While the main objective in designing fuzzy rule-based systems is to maximize performance, the comprehensibility of the system has also been taken into account in some recent studies [20–26]. The comprehensibility of a fuzzy rule-based system is related to the linguistic interpretability of each fuzzy set in the rules, the separation of neighboring fuzzy sets and the number of fuzzy sets for each linguistic variable. The simplicity of the system (e.g. the number of input variables and the number of fuzzy if-then rules) and its fuzzy reasoning power are the other important factors that determine comprehensibility.

In this paper, fuzzy if-then rules are optimized using selected genes to design a robust, comprehensible fuzzy rule base for the gene pattern classification problem with continuous attribute values. In the proposed method, the continuous values of the microarray gene expression data are first discretized using linguistic values determined by the variation of each sample value for the corresponding gene. The linguistic values are relevant for discriminating alternative phenotypes and represent the activation level of the gene. The gene expression matrix is then rebuilt with the linguistic values. Next, fuzzy logic is applied to the linguistic gene expression matrix to evaluate the importance of each gene. Based on this importance factor, genes are ranked according to their significance in sample classification. Finally, fuzzy rules are framed with the highest-ranking genes, and the next highest-ranking genes are gradually included in the fuzzy rule base with the objective of achieving maximum classification accuracy. An optimized rule set is thus obtained to classify patients with cancerous genes using the test microarray gene expression data set.

The rest of the paper is organized as follows. The proposed method is described in Sect. 2. Experimental results are presented in Sect. 3. Concluding remarks and suggestions for future work are given in Sect. 4.

2 Methodology

Classification is a supervised learning technique that aims to find the proper class of given objects based on the similarity among the objects. In the disease classification problem, it has been noticed that although there are thousands of genes for each observation, a few underlying genes may account for much of the variation in the data. For instance, many of the genes may not be relevant to the tumour metabolic process, so they are potentially noise features. Removing such noise features may help to obtain higher classification accuracy (better diagnosis) and to identify marker genes. The selected genes increase classification accuracy with fewer comparisons, which is the ultimate goal of gene research.

2.1 Optimized Fuzzy Rule Generation (OFRG) Algorithm

Feature selection process refers to choosing a subset of attributes from the set of original attributes. The purpose of feature selection is to identify the significant features, eliminate the irrelevant features and build a robust learning model.

The proposed gene selection algorithm evaluates the fuzzy importance factor of each gene, which signifies the relevance of that gene in classifying diseased patients from microarray gene expression data.

2.2 Terminologies for the Proposed OFRG Algorithm

Linguistic Gene Expression $ \left( {LGE_{(i,j)} } \right) $: To obtain the linguistic gene expression value, the mean of each gene $ i $ over all samples, called $ gene\_mean_{(i)} $, is calculated first. The difference between the $ j{\text{th}} $ sample value of the $ i{\text{th}} $ gene and $ gene\_mean_{(i)} $, as defined in (1), is used to assign the linguistic value at the $ (i,j){\text{th}} $ position of the matrix. The discretized gene expression matrix is thus constructed from the linguistic gene expression values, which represent the activity level of gene $ i $ for sample $ j $.

$$ LGE_{(i,j)} = gene\_expression_{(i,j)} - gene\_mean_{(i)} \qquad (1) $$
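The discretization step can be illustrated with a minimal Python sketch. It computes the deviation of Eq. (1) for each sample and maps it to one of five linguistic values. The label names, the scaling by the gene's standard deviation and the cut points are assumptions made only for this illustration; the text states that five linguistic values are used but does not give their names or boundaries.

```python
from statistics import mean, pstdev

# Hypothetical label set: the paper uses five linguistic values but does
# not name them or give their cut points.
LABELS = ["very_low", "low", "medium", "high", "very_high"]


def linguistic_row(expression_row, cuts=(-1.0, -0.5, 0.5, 1.0)):
    """Discretize one gene (one row of the matrix) into linguistic values.

    The deviation of Eq. (1) is computed first; it is then scaled by the
    gene's standard deviation and binned with the assumed cut points.
    """
    gene_mean = mean(expression_row)
    spread = pstdev(expression_row) or 1.0  # flat genes: avoid dividing by zero
    labels = []
    for value in expression_row:
        deviation = value - gene_mean       # Eq. (1)
        z = deviation / spread              # assumed normalisation
        index = sum(z > c for c in cuts)    # number of cut points exceeded
        labels.append(LABELS[index])
    return labels


def linguistic_matrix(expression_matrix):
    """Apply the row-wise discretization to a genes x samples matrix."""
    return [linguistic_row(row) for row in expression_matrix]
```

Each row of the input is one gene and each column one sample, so the output has the same shape as the original expression matrix with every number replaced by a label.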

Gene Importance Factor of gene $ i $: The maximum number of samples in the disease classes that share a single (unique) Linguistic Gene Expression value for the $ i{\text{th}} $ gene.

Unconstructive Impact Factor of gene $ i $: The number of samples in the non-cancer classes in which that Linguistic Gene Expression value is present for gene $ i $. The Unconstructive Impact Factor of gene $ i $ measures the part of its Gene Importance Factor that leads to misclassification of the data.

Fuzzy Importance Factor ($ FIF_{(i)} $): The distribution of gene expression data in two-dimensional space reveals that several genes are far more important than others for disease classification. The maximum number of samples with the same Linguistic Gene Expression value in the $ i{\text{th}} $ gene is used to evaluate the importance of that gene. However, for gene $ i $ the same linguistic value may also appear in the non-cancer classes (the Unconstructive Impact Factor), which affects classification accuracy. The Fuzzy Importance Factor is therefore proposed to isolate the Unconstructive Impact Factor from the Gene Importance Factor using relation (2).

$$ FIF_{(i)} = \begin{cases} \dfrac{k - m}{n} & \text{if } k > m \\ 0 & \text{otherwise} \end{cases} \qquad (2) $$

where $ k $, $ m $ and $ n $ represent the Gene Importance Factor of the $ i{\text{th}} $ gene, the Unconstructive Impact Factor of the $ i{\text{th}} $ gene and the total number of samples in the gene expression matrix, respectively. $ FIF_{(i)} $ is zero when the Gene Importance Factor of the $ i{\text{th}} $ gene is less than or equal to its Unconstructive Impact Factor, signifying that a gene whose Unconstructive Impact Factor is not lower than its Gene Importance Factor has no effect in identifying the proper class.
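Relation (2) can be illustrated for a single gene with the short sketch below. It assumes one reading of the definitions above: $ k $ is the count of the most frequent linguistic value among the disease samples, and $ m $ is the count of that same value among the non-disease samples. The function name and argument layout are hypothetical.

```python
from collections import Counter


def fuzzy_importance_factor(labels, is_disease):
    """Return (FIF, dominant linguistic value) for one gene.

    labels     -- linguistic value of the gene in each sample
    is_disease -- parallel list of booleans, True for disease samples
    """
    n = len(labels)  # total number of samples
    disease = Counter(l for l, d in zip(labels, is_disease) if d)
    normal = Counter(l for l, d in zip(labels, is_disease) if not d)
    if not disease:
        return 0.0, None
    value, k = disease.most_common(1)[0]  # k: Gene Importance Factor (assumed reading)
    m = normal[value]                     # m: Unconstructive Impact Factor (assumed reading)
    fif = (k - m) / n if k > m else 0.0   # relation (2)
    return fif, value
```

For example, a gene that is "high" in three of four disease samples and in one of two non-disease samples has $ k = 3 $, $ m = 1 $, $ n = 6 $, giving a factor of $ 2/6 $.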

As the first step of the proposed method, linguistic gene expression values are generated by applying Algorithm 1. Five different linguistic values are assigned to the samples of each gene by evaluating Eq. (1). Algorithm 2 is then applied to the discretized gene expression matrix to calculate the fuzzy importance factor of each gene. Genes are clustered based on their similarity, as determined by the Fuzzy Importance Factor. Finally, the optimized fuzzy rule base is built from the linguistic values of the genes with maximum fuzzy importance factor, as described in Algorithm 3.
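The incremental construction of the rule base can be sketched roughly as follows. This is not the paper's Algorithm 3, whose listing is not reproduced here. It is a hypothetical greedy loop that assumes each rule has the form "if gene g has linguistic value v then the sample belongs to the disease class", that rules arrive already ranked by decreasing fuzzy importance factor, and that a sample is labelled as diseased when at least half of the rules fire. All function names and the voting scheme are assumptions.

```python
def classify(sample_labels, rules):
    """Predict True (disease) when at least half of the rules fire.

    rules is a list of (gene_index, linguistic_value) pairs; the voting
    scheme is an assumption made for this sketch.
    """
    fired = sum(sample_labels[g] == v for g, v in rules)
    return 2 * fired >= len(rules)


def accuracy(samples, targets, rules):
    """Fraction of samples whose predicted class matches the target."""
    hits = sum(classify(s, rules) == t for s, t in zip(samples, targets))
    return hits / len(samples)


def build_rule_base(samples, targets, ranked_rules):
    """Greedily grow the rule base from rules ranked by decreasing FIF.

    A candidate rule is kept only if it strictly improves accuracy.
    """
    rules, best = [], 0.0
    for rule in ranked_rules:
        score = accuracy(samples, targets, rules + [rule])
        if score > best:
            rules.append(rule)
            best = score
    return rules, best
```

Here `samples` holds one list of linguistic values per sample, indexed by gene, and `targets` holds the known class of each sample. Keeping a rule only when it improves accuracy is what keeps the final rule base small.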

3 Results and Comparisons

3.1 Data Set

Genes selected from microarray data sets of smokers with and without lung cancer are compared. The database record series GSE4115 [27, 28], obtained with the Affymetrix GeneChip U133A and consisting of 22215 genes with 192 samples per gene, is used to validate the scheme.

3.2 Performance Measurements

The OFRG algorithm has been applied to the microarray data set to classify the samples by developing a fuzzy rule-based classifier. The initial fuzzy rule base is built from the genes with the highest fuzzy importance factor, and the corresponding linguistic values are assigned to a specific disease class. Genes are subsequently included in the fuzzy rule base according to their importance, with the objective of improving classification accuracy.

A ten-fold cross-validation strategy is used with the BayesNet classifier in Weka [29] to measure the generalization accuracy of the correlation-based feature selection (CFS) algorithm; the results are shown in Table 1.
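The ten-fold protocol itself can be illustrated independently of Weka. The sketch below only shows how sample indices might be split into ten folds and how fold scores are averaged; it is a plain, unstratified split written for illustration and does not reproduce Weka's implementation or the BayesNet classifier. The callback `fit_and_score` is a hypothetical placeholder for training on one index set and scoring on the other.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validated_accuracy(n_samples, fit_and_score, k=10):
    """Average the score returned by fit_and_score(train, test) over the folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Every sample appears in exactly one test fold, so each sample contributes once to the averaged accuracy.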

Table 1 Comparison between OFRG and CFS algorithms on lung cancer data

The CFS algorithm [30] couples an evaluation formula with an appropriate correlation measure and a heuristic search strategy. In [30], CFS was evaluated by experiments on artificial and natural data sets using three machine learning algorithms: a decision tree learner, an instance-based learner and Naive Bayes. Experiments on artificial data sets show that CFS quickly identifies and screens out irrelevant, redundant and noisy features, and selects relevant features as long as their relevance does not strongly depend on other features. On natural domains, CFS typically eliminates well over half of the features.
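For orientation, the evaluation formula used by CFS is commonly written as the subset merit below, following [30]. The symbol $ s $ for the subset size is chosen here only to avoid a clash with $ k $ in relation (2).

$$ Merit_{S} = \frac{s\,\overline{r_{cf}}}{\sqrt{s + s(s - 1)\,\overline{r_{ff}}}} $$

where $ s $ is the number of features in the subset $ S $, $ \overline{r_{cf}} $ is the mean feature-class correlation and $ \overline{r_{ff}} $ is the mean feature-feature inter-correlation. The merit increases when features correlate strongly with the class and decreases when they are redundant with one another.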

The OFRG algorithm extracts 4 (four) genes out of 22215 from the GSE4115 data set [27, 28], a more promising result than that of the CFS algorithm, which selects 23 genes. The classification percentages of the CFS and OFRG algorithms are compared in Table 1 and Fig. 1. The results of McNemar’s test, using chi-squared with continuity correction, are shown in Table 2.
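McNemar’s statistic with continuity correction can be computed as in the sketch below. The arguments `b` and `c` are the two discordant counts, that is, the numbers of samples classified correctly by only one of the two methods being compared. The function name is hypothetical, and the counts in the usage note are illustrative values, not results from the paper.

```python
import math


def mcnemar_corrected(b, c):
    """McNemar's chi-squared statistic with continuity correction.

    b -- samples classified correctly by method A only
    c -- samples classified correctly by method B only
    Returns (statistic, p_value) for one degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # For one degree of freedom the chi-squared tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

With illustrative counts `b = 10` and `c = 2`, the statistic is $ 49/12 \approx 4.08 $, which exceeds the 5% critical value of 3.84 for one degree of freedom.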

Table 2 McNemar’s test using Chi squared and P value of OFRG algorithm applied on lung cancer data

4 Conclusions

Dimension reduction is one of the main issues in microarray data classification, and appropriate gene selection yields promising outcomes that enhance knowledge discovery and model interpretation.

Since biologists are often interested in identifying a collection of genes involved in a biological function or pathway rather than individual genes, there has been considerable interest in recent years in developing statistical methods for identifying significant sets of genes. This paper exploits structural information and proposes a three-stage strategy for selecting a significant set of genes and classifying disease and normal experimental conditions. The proposed OFRG algorithm selects an optimized fuzzy rule base using the maximum fuzzy importance factor of genes for disease samples, which helps to assign samples to their proper classes with fewer comparisons. Moreover, varying the number of samples does not affect the classification accuracy of the OFRG algorithm, whereas the accuracy of the CFS algorithm decreases.

References

1. Kononenko, I.: Inductive and Bayesian learning in medical diagnosis. Appl. Artif. Intell. 7(4), 317–337 (1993)
2. Wolberg, W., Street, W., Mangasarian, O.L.: Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Lett. 77, 163–171 (1994)
3. Wolberg, W., Street, W., Mangasarian, O.L.: Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Anal. Quant. Cytol. Histol. 17(2), 77–87 (1995)
4. Kurgan, L., Cios, K., Tadeusiewicz, R., Ogiela, M., Goodenday, L.S.: Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif. Intell. Med. 23(2), 149–169 (2001)
5. Antoniadis, A., Lambert-Lacroix, S., Leblanc, F.: Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 19(5), 563–570 (2003)
6. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)
7. Yu, J., Ongarello, S., Fiedler, R., Chen, X., Toffolo, G., Cobelli, C., et al.: Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics 21, 2200–2209 (2005)
8. Oh, I., Lee, J., Moon, B.: Hybrid genetic algorithms for feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1424–1437 (2004)
9. Saeys, Y., Inza, I., Larranaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
10. Liu, H., Motoda, H.: Feature extraction, construction and selection: a data mining perspective, 1st edn. Kluwer, Norwell (1998)
11. Conilione, P., Wang, D.: A comparative study on feature selection for E. coli promoter recognition. Int. J. Inf. Technol. 11, 54–66 (2005)
12. Degroeve, S., Baets, B., de Peer, Y., Rouzé, P.: Feature subset selection for splice site prediction. Bioinformatics 18(Suppl 2), 75–83 (2002)
13. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
14. Liu, H., Yu, L.: Toward integrated feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 17(4), 491–502 (2005)
15. Kuncheva, L.: Fuzzy Classifier Design. Springer, Heidelberg (2000)
16. Leondes, C. (ed.): Fuzzy Theory Systems: Techniques and Applications, vol. 1–4. Academic Press, San Diego (1999)
17. Yuan, Y., Shaw, M.: Induction of fuzzy decision trees. Fuzzy Sets Syst. 25, 125–139 (1995)
18. Ichihashi, H., Shirai, T., Nagasaka, K., Miyoshi, T.: Neuro-fuzzy ID3: a method of inducing fuzzy decision trees with linear programming for maximizing entropy and an algebraic method for incremental learning. Fuzzy Sets Syst. 84, 1–19 (1996)
19. Yuan, Y., Zhuang, H.: A genetic algorithm for generating fuzzy classification rules. Fuzzy Sets Syst. 84, 1–19 (1996)
20. Castillo, L., Gonzalez, A., Perez, P.: Including a simplicity criterion in the selection of the best rule in a genetic fuzzy learning algorithm. Fuzzy Sets Syst. 120(2), 309–321 (2001)
21. Castro, J., Castro-Schez, J., Zurita, J.: Use of a fuzzy machine learning technique in the knowledge acquisition process. Fuzzy Sets Syst. 123(3), 307–320 (2001)
22. Jin, Y.: Fuzzy modeling of high-dimensional systems: complexity reduction and interpretability improvement. IEEE Trans. Fuzzy Syst. 8(2), 212–221 (2000)
23. de Oliveira, V.: Semantic constraints for membership function optimization. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 29(1), 128–138 (1999)
24. Pedrycz, W., de Oliveira, V.: Optimization of fuzzy models. IEEE Trans. Syst. Man Cybern. Part B Cybern. 26(4), 627–637 (1996)
25. Setnes, M., Babuska, R., Verbruggen, B.: Rule-based modeling: precision and transparency. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 28(1), 165–169 (1998)
26. Setnes, M., Roubos, H.: GA-fuzzy based modeling and classification: complexity and performance. IEEE Trans. Fuzzy Syst. 8(5), 509–522 (2000)
27. Spira, A., Beane, J., Shah, V., Steiling, K., Liu, G., Schembri, F., Gilman, S., Dumas, Y., Calner, P., Sebastiani, P., Sridhar, S., Beamis, J., Lamb, C., Anderson, T., Gerry, N., Keane, J., Lenburg, M., Brody, J.: Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nat. Med. 13(3), 361–366 (2007)
28. Gustafson, A., Soldi, R., Anderlind, C., Scholand, M., Qian, J., Zhang, X., Cooper, K., Walker, D., McWilliams, A., Liu, G., Szabo, E., Brody, J., Massion, P., Lenburg, M., Lam, S., Bild, A., Spira, A.: Airway PI3K pathway activation is an early and reversible event in lung cancer development. Sci. Transl. Med. 2(26), 26–25 (2010)
29. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: an update. SIGKDD Explor. 11, 10–18 (2009)
30. Hall, M.: Correlation-based feature selection for machine learning. Thesis for the degree of Doctor of Philosophy (1999)

Author information

Authors and Affiliations

Amit Paul: Computer Science and Engineering, St. Thomas College of Engineering and Technology, Khidirpore, India
Jaya Sil: Computer Science and Technology, Bengal Engineering and Science University, Shibpur, India
Chitrangada Das Mukhopadhyay: Health Care Science and Technology, Bengal Engineering and Science University, Shibpur, India

Corresponding author

Correspondence to Amit Paul.

Editor information

Editors and Affiliations

G. P. Biswas: Computer Science and Engineering, Indian School of Mines, Dhanbad, Jharkhand, India
Sushanta Mukhopadhyay: Computer Science and Engineering, Indian School of Mines, Dhanbad, Jharkhand, India

About this paper

Cite this paper

Paul, A., Sil, J., Das Mukhopadhyay, C. (2014). Dimension Reduction of Gene Expression Data for Designing Optimized Rule Base Classifier. In: Biswas, G., Mukhopadhyay, S. (eds) Recent Advances in Information Technology. Advances in Intelligent Systems and Computing, vol 266. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1856-2_15

DOI: https://doi.org/10.1007/978-81-322-1856-2_15
Published: 12 March 2014
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1855-5
Online ISBN: 978-81-322-1856-2
eBook Packages: Engineering, Engineering (R0)
