1 Introduction

Machine learning and data mining techniques have long been applied successfully [1–4] in biomedical data analysis to extract knowledge [5–7]. Dimensionality reduction is highly relevant in bio-informatics research, particularly for microarray data, which are characterized by relatively few samples in a high-dimensional gene (feature) space. Irrelevant genes (features) reduce classification accuracy and make it harder to find potentially useful knowledge [8, 9]. Gene selection has therefore become one of the main sub-fields of bio-informatics data mining [10–12]. In the context of classification, the main goal of gene selection is to search for an optimal gene subset that leads to improved classification performance. During the past decades, extensive research on this problem has been conducted in multiple disciplines, including statistics, pattern recognition, machine learning and data mining [10, 13, 14].

Continuous-valued attributes generate large rule sets, and most classifiers with such a large number of rules are unstable even under slight changes to the training data set. Fuzzy rule-based systems have been successfully applied in various areas such as control, decision making and classification [15, 16]. Some machine learning based works that generate fuzzy if-then rules are demonstrated in [17–19]. A version of the fuzzy-ID3 algorithm, which induces fuzzy decision trees, is proposed in [17]. While the main objective in designing fuzzy rule-based systems is maximization of performance, comprehensibility of the system has also been taken into account in some recent studies [20–26]. The comprehensibility of a fuzzy rule-based system is related to the linguistic interpretability of each fuzzy set in the rules, the separation of neighboring fuzzy sets and the number of fuzzy sets for each linguistic variable. The simplicity of the system (e.g. the number of input variables and the number of fuzzy if-then rules) and its fuzzy reasoning power are the other important factors that determine comprehensibility.

In this paper, fuzzy if-then rules are optimized using selected genes to design a robust, comprehensible fuzzy rule base for the gene pattern classification problem with continuous attribute values. In the proposed method, the continuous values of the microarray gene expression data are first discretized into linguistic values, determined by the deviation of each sample value from the mean of the corresponding gene. The linguistic values are relevant for discriminating alternative phenotypes and represent the activation level of the gene. The gene expression matrix is then rebuilt with these linguistic values. As the next step, fuzzy logic is applied to the linguistic gene expression matrix to evaluate the importance of each gene. Based on this importance factor, genes are ranked according to their significance in sample classification. Finally, fuzzy rules are framed with the highest-ranking genes, and the next highest-ranking genes are gradually included in the fuzzy rule base with the objective of achieving maximum classification accuracy. In this way an optimized rule set is obtained to classify patients with cancerous genes using a test microarray gene expression data set.

The rest of the paper is organized as follows. The proposed method is described in Sect. 2. Experimental results are presented in Sect. 3. Concluding remarks and suggestions for future work are given in Sect. 4.

2 Methodology

Classification is a supervised learning technique that aims to find the proper class of given objects based on the similarity among the objects. In the disease classification problem, it has been noticed that although there are thousands of genes for each observation, a few underlying genes may account for much of the variation in the data. For instance, many of the genes may not be relevant to the tumour metabolic process, so they are potentially noise features. Removing such noise features may help to obtain higher classification accuracy (better diagnosis) and results in the identification of marker genes. The selected genes increase classification accuracy with fewer comparisons, which is the ultimate goal of gene research.

2.1 Optimized Fuzzy Rule Generation (OFRG) Algorithm

Feature selection refers to choosing a subset of attributes from the set of original attributes. Its purpose is to identify the significant features, eliminate the irrelevant features and build a robust learning model.

The proposed gene selection algorithm evaluates a fuzzy importance factor for each gene, which signifies the relevance of the gene in classifying diseased patients using microarray gene expression data.

2.2 Terminologies for the Proposed OFRG Algorithm

Linguistic Gene Expression \( \left( {LGE_{(i,j)} } \right) \): To obtain the linguistic gene expression value, the mean of each gene i over all samples is first calculated, called gene_mean (i). The difference between the \( j{\text{th}} \) sample value of the \( i{\text{th}} \) gene and the mean of the sample values of that gene (gene_mean (i)), as defined in (1), is used to assign the linguistic value at the \( (i,j){\text{th}} \) position of the matrix. The discretized gene expression matrix is thus constructed from the linguistic gene expression values, which represent the activity level of gene i for sample j.

$$ LGE_{(i,j)} = gene\_expression_{(i,j)} - gene\_mean_{(i)} $$
(1)
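The discretization step built on Eq. (1) can be illustrated with a minimal Python sketch. The paper states only that five linguistic values are assigned; the label names, the scaling of the deviation by the gene's standard deviation, and the cut points at ±0.5 and ±1.5 are assumptions made here for illustration, not the thresholds of the original Algorithm 1.

```python
from statistics import mean, pstdev

# Hypothetical names for the five linguistic values (not given in the text).
LABELS = ["very_low", "low", "medium", "high", "very_high"]


def linguistic_row(expression_row):
    """Discretize the samples of one gene from their deviation from the gene mean."""
    gene_mean = mean(expression_row)
    # Assumption: deviations are scaled by the gene's standard deviation so that
    # one set of cut points can be shared by all genes.
    spread = pstdev(expression_row) or 1.0  # constant gene: avoid division by zero
    labels = []
    for value in expression_row:
        deviation = value - gene_mean  # Eq. (1)
        z = deviation / spread
        if z < -1.5:
            index = 0
        elif z < -0.5:
            index = 1
        elif z <= 0.5:
            index = 2
        elif z <= 1.5:
            index = 3
        else:
            index = 4
        labels.append(LABELS[index])
    return labels


def linguistic_matrix(expression_matrix):
    """Rebuild the gene expression matrix (genes x samples) with linguistic values."""
    return [linguistic_row(row) for row in expression_matrix]
```

For example, `linguistic_row([1, 2, 3, 4, 10])` has gene mean 4, so the first two samples fall below the mean band and the last one far above it.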

Gene Importance Factor (i): The maximum number of samples in the disease classes that share the same Linguistic Gene Expression value for the \( i{\text{th}} \) gene.

Unconstructive Impact Factor (i): The number of samples in the non-cancer classes for which gene (i) takes that same Linguistic Gene Expression value. The Unconstructive Impact Factor of gene (i) measures the part of its Gene Importance Factor that leads to misclassification of data.

Fuzzy Importance Factor (FIF (i)): The distribution of gene expression data in two-dimensional space reveals that several genes are far more important than others for disease classification. The maximum number of samples with the same Linguistic Gene Expression value in the \( i{\text{th}} \) gene is used to evaluate the importance of that gene. However, for gene (i) the same linguistic value may also appear in the non-cancer classes (Unconstructive Impact Factor), which affects classification accuracy. Therefore, the Fuzzy Importance Factor is proposed to separate the Unconstructive Impact Factor from the Gene Importance Factor using relation (2).

$$ FIF_{(i)} = \begin{cases} \dfrac{k - m}{n}, & \text{if } k > m \\ 0, & \text{otherwise} \end{cases} $$
(2)

where k, m and n represent the Gene Importance Factor of the \( i{\text{th}} \) gene, the Unconstructive Impact Factor of the \( i{\text{th}} \) gene and the total number of samples in the gene expression matrix, respectively. FIF (i) is zero when the Gene Importance Factor of the \( i{\text{th}} \) gene is less than or equal to its Unconstructive Impact Factor, signifying that such a gene has no effect in identifying the proper class.
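Relation (2) can be sketched in Python as follows. This is one reading of the definitions above, under the assumption that k counts the disease samples sharing the most frequent linguistic value of the gene and m counts the non-disease samples carrying that same value; the function and argument names are hypothetical.

```python
from collections import Counter


def fuzzy_importance_factor(labels, classes, disease_class):
    """Compute FIF for one gene.

    labels  -- linguistic value of the gene in each sample
    classes -- class label of each sample, in the same order
    """
    n = len(labels)  # total number of samples
    disease_counts = Counter(
        label for label, cls in zip(labels, classes) if cls == disease_class
    )
    if not disease_counts:
        return 0.0
    # k: size of the largest group of disease samples sharing one linguistic value.
    dominant_label, k = disease_counts.most_common(1)[0]
    # m: non-disease samples that carry the same linguistic value (assumed reading).
    m = sum(
        1
        for label, cls in zip(labels, classes)
        if cls != disease_class and label == dominant_label
    )
    return (k - m) / n if k > m else 0.0
```

With eight samples, three disease samples labelled "high" (k = 3) and one non-disease sample also labelled "high" (m = 1), the factor is (3 - 1) / 8 = 0.25.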

As the first step of the proposed method, linguistic gene expression values are generated by applying Algorithm 1. Five different linguistic values are assigned to the samples of each gene by evaluating Eq. (1). Algorithm 2 is then applied to the discretized gene expression matrix to calculate the fuzzy importance factor of each gene. Genes are clustered based on their similarity as determined by the Fuzzy Importance Factor. Finally, the optimized fuzzy rule base is built from the linguistic values of the genes with maximum fuzzy importance factor, as described in Algorithm 3.
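The ranking and incremental rule inclusion described above might look like the following Python sketch. Algorithm 3 is not reproduced in this text, so the rule form (one gene paired with one linguistic value), the majority-match vote in `predict`, and the rule of keeping a gene only when accuracy improves are all assumptions chosen for illustration.

```python
def rank_genes(fif_scores):
    """Return gene indices sorted by decreasing Fuzzy Importance Factor."""
    return sorted(range(len(fif_scores)), key=lambda i: fif_scores[i], reverse=True)


def predict(rules, sample_labels):
    """Assumed voting scheme: disease if at least half of the rules match.

    rules -- list of (gene_index, linguistic_value) pairs
    """
    matches = sum(1 for gene, value in rules if sample_labels[gene] == value)
    return 2 * matches >= len(rules)


def accuracy(rules, lge, is_disease):
    """lge[i][j] is the linguistic value of gene i in sample j."""
    n_samples = len(is_disease)
    correct = sum(
        predict(rules, [row[j] for row in lge]) == is_disease[j]
        for j in range(n_samples)
    )
    return correct / n_samples


def build_rule_base(ranked_genes, rule_values, lge, is_disease):
    """Greedy forward inclusion: keep a gene only if accuracy strictly improves."""
    rules, best = [], 0.0
    for gene in ranked_genes:
        candidate = rules + [(gene, rule_values[gene])]
        score = accuracy(candidate, lge, is_disease)
        if score > best:
            rules, best = candidate, score
    return rules, best
```

Here `rule_values[i]` is the linguistic value associated with the disease class for gene i, for instance the dominant value found while computing the Fuzzy Importance Factor.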

3 Results and Comparisons

3.1 Data Set

Genes selected from microarray data sets of smokers with and without lung cancer are compared. The database record series GSE4115 [27, 28], consisting of 22215 genes with 192 samples per gene and measured on the Affymetrix GeneChip U133A, is used to validate the scheme.

3.2 Performance Measurements

The OFRG algorithm has been applied to the microarray data set to classify the samples by developing a fuzzy rule-based classifier. The initial fuzzy rule base is built from the gene with the highest fuzzy importance factor, and the corresponding linguistic value is assigned to a specific disease class. Subsequently, genes are included in the fuzzy rule base according to their importance, with the objective of improving classification accuracy.

A ten-fold cross-validation strategy is used in Weka [29], with the BayesNet classifier applied to measure the generalization accuracy of the Correlation-based Feature Selection (CFS) algorithm; the results are shown in Table 1.

Table 1 Comparison between OFRG and CFS algorithms on lung cancer data

The CFS [30] algorithm couples an evaluation formula with an appropriate correlation measure and a heuristic search strategy. CFS was evaluated by experiments on artificial and natural data sets using three machine learning algorithms: a decision tree learner, an instance-based learner, and Naive Bayes. Experiments on artificial data sets show that CFS quickly identifies and screens out irrelevant, redundant and noisy features, and selects relevant features as long as their relevance does not strongly depend on other features. On natural domains, CFS typically eliminates well over half the features.

The OFRG algorithm extracts four genes out of 22215 from the GSE4115 [27, 28] data set, a more promising result than that of the CFS algorithm, which selects 23 genes. The classification accuracies of the CFS and OFRG algorithms are compared in Table 1 and Fig. 1. The results of McNemar’s test, using the chi-squared statistic corrected for discontinuity, are shown in Table 2.
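For reference, the chi-squared form of McNemar’s test with the correction for discontinuity can be sketched in Python as below. The counts b and c are the numbers of samples that exactly one of the two classifiers labels correctly; the values in the usage note are illustrative only and are not taken from Table 2.

```python
import math


def mcnemar_corrected(b, c):
    """McNemar's chi-squared statistic with continuity correction, and its P value.

    b -- samples classified correctly by the first method only
    c -- samples classified correctly by the second method only
    """
    if b + c == 0:
        # No discordant samples: the two classifiers never disagree.
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

As a usage example, `mcnemar_corrected(10, 2)` gives a statistic of 49/12 (about 4.08), which exceeds the 3.84 critical value at the 5% level for one degree of freedom.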

Fig. 1 Comparison of classification accuracy in percentage using OFRG and CFS algorithms on lung cancer data

Table 2 McNemar’s test using Chi squared and P value of OFRG algorithm applied on lung cancer data

4 Conclusions

Dimension reduction is one of the main issues in microarray data classification, and appropriate gene selection shows promising outcomes that enhance knowledge discovery and model interpretation.

Since biologists are often interested in identifying a collection of genes involved in a biological function or pathway rather than individual genes, there has been considerable interest in recent years in developing statistical methods for identifying significant sets of genes. This paper exploits structural information and proposes a three-stage strategy for selecting a significant set of genes and classifying disease and normal experimental conditions. The proposed OFRG algorithm builds an optimized fuzzy rule base from the genes with maximum fuzzy importance factor for the disease samples, which helps to assign samples to the proper classes with fewer comparisons. Moreover, varying the number of samples does not affect the classification accuracy of the OFRG algorithm, whereas the accuracy of the CFS algorithm decreases.