Keywords

1 Introduction

Conventional methods for cancer classification rely on a variety of morphological, clinical and molecular variables, but they exhibit several limitations that make difficult an accurate diagnosis. The rapid development of high-throughput biotechnologies such as DNA microarray analysis allow to record and monitor the expression levels of thousands of genes simultaneously from a few samples [8], which has attracted the attention of scientists for its application in basic and translational cancer research [5, 14, 15, 18]. Many studies utilizing DNA microarrays have been directed to (i) distinguish between cancerous and non-cancerous tissue samples, (ii) classify different types or subtypes of tumors, and (iii) predict the response to a particular therapeutic drug and/or the risk of relapse.

Cancer classification using microarrays, which focuses on predicting the class of a new sample based on its expression profile, poses two major challenges. First, the gene-expression data are characterized by the so-called ‘large G, small n’ problem, that is, the number of genes (G) heavily exceeds the sample size (n). And second, most genes are irrelevant to discriminate samples of different types [6]. These issues may increase the complexity of the prediction problem, degrade the generalization ability of classifiers and hinder the understanding of the relationships between the genes and the tissue samples [4, 19]. Under these circumstances, feature selection plays a very important role in cancer classification because it can alleviate (minimize) the effects of both those problems.

A particularly popular approach to feature selection using DNA microarrays is gene ranking [9, 13, 17, 20]. Gene ranking methods are filters that encompass some scoring function to quantify how much more statistically significant each gene is than the others [7], and as a result they rank genes in decreasing order of the estimated scores under the assumption that the top-ranked genes correspond to the most informative (or differentially expressed) ones.

The question the present study intends to answer is how the ‘large G, small n’ problem affects the classification performance using gene-expression microarrays. In particular, this paper examines the impact of high-dimensional biological data on several standard classifiers. To this end, two feature ranking algorithms are applied to select a percentage of the top-ranked genes, which are further used to classify new tissue samples and record the performance of classifiers in terms of both overall accuracy and false-negative rate.

2 Gene Ranking Algorithms

Some well-established gene ranking strategies include t-test, information-theoretic measures, symmetric uncertainty, correlation coefficient, \(\chi ^2\)-statistic and ReliefF, among others. In this section, the two feature ranking methods used in the experiments are briefly described.

2.1 ReliefF

The basic idea of the ReliefF algorithm [12, 16] lies on adjusting the weights of a vector \(W=[w(1), w(2), \ldots , w(G)]\) to give more relevance to features that better discriminate the samples from neighbors of different class.

It randomly picks out a sample x and searches for k nearest neighbors of the same class (hits, \(h_i\)) and k nearest neighbors from each of the different classes (misses, \(m_i\)). If x and \(h_i\) have different values on feature f, then the weight w(f) is decreased because it is interpreted as a bad property of this feature. In contrast, if x and \(m_i\) have different values on the feature f, then w(f) is increased. This process is repeated t times, updating the values of the weight vector W as follows

$$\begin{aligned} w(f)= & {} w(f) - \frac{\sum _{i=1}^k dist(f,x,h_i)}{t \cdot k} \\ \nonumber+ & {} \sum _{c \ne class(x)} \frac{P(c)}{1-P(class(x))} \cdot \frac{\sum _{i=1}^k dist(f,x,m_i)}{t \cdot k} \end{aligned}$$
(1)

where P(c) is the prior probability of class c, P(class(x)) denotes the probability for the class of x, and \(dist(f,x,m_i)\) represents the absolute distance between samples x and \(m_i\) in the feature f.

2.2 Gain Ratio

The Gain ratio is an extension of information gain in order to overcome the biased behavior of selecting the features with the largest number of values. Let X be a set of n samples that belong to C distinct classes and let \(n_i\) be the number of samples in class i. The entropy of any subset can be calculated using the following formula

$$\begin{aligned} H(X)=-\sum _{i=1}^{C}((n_{i}/n) \cdot \log (n_{i}/n)) \end{aligned}$$
(2)

To find the information gain of feature f, one has to sum the entropy for each value \(f_j\) \((j=1, \ldots , v)\) of the feature:

$$\begin{aligned} H(X \vert f)=\sum _{j=1}^{v}((\vert f_{j}\vert /n) \cdot H(X \vert f=f_{j})) \end{aligned}$$
(3)

where \(H(X \vert f=f_{j})\) is the entropy calculated relative to the subset of instances that have a value of \(f_j\) for feature f.

The information gain of a feature is measured by the reduction in entropy as \(IG(f)=H(X) - H(X \vert f)\). The greater the decrease in entropy when considering feature f individually, the more significant this is for prediction.

In general, a feature will be most useful when maximizing the information gain while simultaneously minimizing the number of feature values. Then the intrinsic value of a feature f can be computed as:

$$\begin{aligned} {IV} (f)=-\sum _{i=1}^{v}((\vert f_{i}\vert /n) \cdot \log (\vert f_{i}\vert /n)) \end{aligned}$$
(4)

Thus the Gain ratio of f is defined as

$$\begin{aligned} Gain~ratio(f)= {IG(f) \over IV(f)} = {H(X) - H(X \vert f) \over H(f)} \end{aligned}$$
(5)

3 Databases and Experimental Setting

We conducted a series of experiments on a collection of publicly available microarray cancer data sets taken from the Kent Ridge Biomedical Data Set Repository (http://datam.i2r.a-star.edu.sg/datasets/krbd). Table 1 summarizes the main characteristics of these data sets, including the number of genes (features), the number of tissue samples, and the size of the positive and negative classes.

Table 1. Characteristics of the gene-expression microarray data sets.

The 5-fold cross-validation method was adopted for the experimental design because it appears to be the best estimator of classification performance compared to other methods, such as bootstrap with a high computational cost or re-substitution with a biased behavior [1].

We focused our study on the ReliefF and Gain ratio feature ranking algorithms and five classification models: the nearest neighbor rule (1-NN), a support vector machine (SVM) using a linear kernel function with the soft-margin constant \(C=1.0\) and a tolerance of 0.001, the C4.5 decision tree, the naive Bayes (NBayes) classifier, the radial basis function neural network (RBF) with the K-means clustering to provide the basis functions, and a hybrid associative memory (HAM) with translation of the coordinate axes.

The experiments aim to analyze the classification accuracy when varying the percentage of genes selected by ReliefF and Gain ratio from 5% to 100% with a step size of 5%. For the purpose of this paper, the key question is how many genes should be selected to perform the best with microarray gene-expression data. Besides, we are interested in investigating whether or not the optimal percentage of genes depends on the characteristics of each classifier.

Note that the classification accuracy is just the number of samples being correctly classified, but this is not the most appropriate in the case of cancer classification problems. To discriminate between normal and cancerous data, it is especially important to take care of the false-positives and the false-negatives in order to perform a thorough comparison on the performance of different methods. False-positives are tolerable since further clinical experiments will be done to confirm the initial cancer diagnosis, but false-negatives are extremely detrimental because an ill patient might be misclassified as healthy.

Fig. 1.
figure 1

Plots of the classification accuracy rates when varying the percentage of genes selected by the ReliefF (left) and Gain ratio (right) ranking algorithms

4 Results and Discussion

Figure 1 shows the plot between accuracy rates and the percentage of the top-ranked genes for each database. It is found that all classifiers provide the highest accuracy using less than 20% of genes, irrespective of the feature selection algorithm. Examination of this figure reveals that in general, the RBF neural network and the naive Bayes classifier are the models most affected by the use of a large number of genes. For instance, in the Breast database the accuracy of RBF with the 5% top-ranked genes selected by ReliefF is 83.51%, but it significantly drops down to 52.58% when using the whole set of genes. Similarly, in the Colon database the NBayes accuracy goes down from 77.42% with the 5% top-ranked genes selected by the Gain ratio to 58.06% with the total number of genes. It is also interesting to remark that the SVM has shown superior performance in most cancer classification problems, probably because of its ability to deal with high-dimensional data and its robustness to noise [3, 11], and also because all these data sets are linearly separable [2].

At this point, it could be especially interesting to show the relationship between the number of genes and the amount of samples in order to better understand how the ‘large G, small n’ problem affects the classification results of gene-expression microarrays. To this end, the average number of samples per dimension (genes) for each database has been plotted in Fig. 2. This corresponds to the T2 data complexity measure [10], which describes the density of spatial distributions of samples by comparing the number of samples in the data set to the number of genes, (n / G). As can be seen, there exists a negative correlation between the percentage of genes and the T2 measure, that is, higher values of X (% genes) are associated with lower values of Y (T2). This shows that, although the values of T2 are extremely small in all cases, the underlying difficulty of gene-expression microarray classification increases as the number of genes increases, which explains the decreasing tendency of accuracies presented in Fig. 1.

Fig. 2.
figure 2

Values of T2 when varying the percentage of genes

As already pointed out in Sect. 3, the false-negatives are even more relevant than the classification accuracy when assessing the performance of models for cancer classification based on gene-expression microarrays. Accordingly, Tables 2 and 3 report the false-negative rates given by each classifier both with the whole set of genes (100% of genes available) and the subset of genes that performed the best in terms of accuracy. The best result for each pair (database, classifier) is highlighted in bold. It is observed that the false-negative rate achieved with the best subset of genes is lower than that using 100% of genes in most cases: 22 out of 24 (4 data sets \(\times \) 6 classifiers) with the ReliefF algorithm and 19 out of 24 with the Gain ratio feature ranking approach. These results corroborate the initial hypothesis that the removal of irrelevant (and redundant) genes leads to very significant gains in performance when the number of samples is large in comparison to the number of features, and it also produces a considerable decrease in computational requirements.

Table 2. False-negative rates with the ReliefF algorithm.
Table 3. False-negative rates with the Gain ratio.

5 Concluding Remarks

The present paper has analyzed the effect of the high-dimensional, low-sample size problem for the classification of gene-expression microarrays. To this end, two feature ranking methods and six classifiers have been applied over four biomedical databases.

The experimental results have shown that the highest performance (as measured by the accuracy rate) was achieved by using a very small number of genes (in general, less than 20% of the total amount of genes), independently of both the gene ranking algorithm and the classifier. In addition, the T2 measure has shown that the complexity of classifying gene-expression microarrays increases as the amount of genes increases.

It has also been observed that RBF and naive Bayes appear to be the models most affected by (sensitive to) the ‘large G, small n’ problem. On the other hand, the SVM with a linear kernel has performed the best in nearly all cases, probably because the experimental data sets are linearly separable. Finally, the false-negative rates have highlighted the benefits of using a subset with the top-ranked genes instead of the whole set because the presence of irrelevant genes may distort the classification problem in hand.