1 Introduction

Microarray datasets are commonly used for cancer diagnosis, following two approaches: binary and multiclass. The binary approach tries to differentiate patients with cancer from healthy persons, whereas the multiclass approach tries to distinguish different variants of the same type of cancer. This paper focuses on the first approach and, since unhealthy patients are less common, these datasets are usually unbalanced. The intrinsic characteristics of microarray datasets – a large feature space (usually several thousand genes) and a small number of available samples (often fewer than a hundred) – restrict the application of classical machine learning techniques. To date, two-class classification methods are mainly used, with Support Vector Machines (SVMs) among the most notable classifiers for this task. However, in the context of microarray classification, some authors have proposed one-class classification (OCC) due to its ability to deal with unbalanced and noisy data [1]. In OCC, only instances from one of the classes are available or considered; these are known as target objects, whereas the remaining ones are outliers. Using OCC, models are constructed from objects belonging to a single class distribution and are robust when handling inherent data difficulties. In a previous work [2], we compared the behavior of a two-class classifier (specifically, SVM) versus OCC over microarray datasets whilst analyzing the effect of feature selection (FS). That experimental study showed the superiority of the one-class approach, which achieved both a fine performance and a good trade-off between evaluation measures. However, a criticism of that work is that the success of SVM was limited by the class imbalance problem, which could be partially alleviated by sampling techniques [3]. Therefore, in this paper we present the results of a study in which some of these sampling techniques are applied to improve the SVM behavior when classifying microarray datasets, showing that, even so, OCC remains superior.

This paper is structured as follows. In Sect. 2 a brief introduction to sampling techniques is given and the oversampling methods used in this experimental study are presented. In Sect. 3 the conditions of the experimental study are established. In Sect. 4 we compare the behavior of one-class classifiers and two-class methods with sampling techniques for classifying different benchmark microarray datasets, and the results are discussed. Finally, Sect. 5 is devoted to conclusions.

2 Sampling Techniques

In the literature we can find different methods to deal with imbalanced datasets. Among them, the most commonly employed are: oversampling the minority class, undersampling the majority class, ensemble methods, cost-sensitive learning and asymmetric classification [4]. Undersampling and oversampling are the simplest approaches. The former consists of randomly selecting a portion of the instances of the majority class, whereas the latter randomly duplicates samples belonging to the minority class. Taking into account that microarray datasets contain a reduced number of samples, undersampling does not seem a viable alternative, as it may lead to a loss of useful information. Thus, for this preliminary experimental study we focus on oversampling techniques to overcome the limitations associated with unbalanced sets. Specifically, we have selected three widely applied algorithms to deal with imbalanced distributions:

  1. Resampling consists of the random duplication of instances belonging to the minority class [5].

  2. The Synthetic Minority Oversampling Technique (SMOTE) generates synthetic (artificial) samples by means of the nearest neighbor rule, interpolating new instances instead of duplicating them as the resampling method does [6] (a minimal sketch of this interpolation step is given after this list). SMOTE does not consider the distribution of the minority class or latent noise in the dataset when it generates synthetic examples. To overcome this limitation, the Modified SMOTE (MSMOTE) algorithm [7] categorizes the instances belonging to the minority class into three groups according to the labels of their nearest neighbors: noise (all of them belong to other classes), safe (all neighbors belong to the minority class) and, otherwise, border. MSMOTE then chooses one of the k nearest neighbors for safe samples and the nearest neighbor for border ones, whereas for noise samples the algorithm does nothing.

  3. The Critical SMOTE (CSMOTE) algorithm [4] is an improved version of the MSMOTE method that follows the idea of generating artificial samples employing only a subset of the minority class. In a first phase, the algorithm extracts two subsets of patterns from the class: edge and border samples. This categorization is based on the method proposed in [8]. Edge samples define the boundary of the class and are enough to represent the original dataset when all classes in the dataset are separated. Border samples are carefully picked in the overlapping region between adjacent classes so as to obtain the best possible decision surface. After this categorization, new patterns are generated following MSMOTE: for each border sample CSMOTE randomly chooses one of its nearest neighbors, whilst for each edge sample the nearest neighbor is picked.
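
As a concrete illustration of the interpolation step shared by this family of methods, the following is a minimal SMOTE-style sketch written from scratch in Python with NumPy. The function name `smote_oversample` and its parameters are our own illustrative choices; this is not the implementation used in the experiments nor the code of any particular library.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, seed=None):
    """Create n_synthetic samples by interpolating randomly chosen minority
    samples towards one of their k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X_min.shape
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a sample is not its own neighbor
    k = min(k, n_samples - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per sample
    synthetic = np.empty((n_synthetic, n_features))
    for i in range(n_synthetic):
        base = rng.integers(n_samples)             # pick a minority sample at random
        nb = X_min[rng.choice(neighbors[base])]    # and one of its k neighbors
        gap = rng.random()                         # interpolation factor in [0, 1]
        synthetic[i] = X_min[base] + gap * (nb - X_min[base])
    return synthetic
```

MSMOTE and CSMOTE keep this interpolation but restrict which base samples and which neighbors are allowed, according to the noise/safe/border (or edge/border) categorization described above.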

3 Experimental Setup

The aim is to check the suitability of oversampling techniques for improving two-class classification on microarray datasets. These results are compared with those reached by the one-class approach. Two of the most up-to-date classifiers are selected: SVMs for two-class classification [9] and Support Vector Data Description (SVDD) [10] as the one-class classifier. It is worth mentioning that OCC is addressed by using both the minority and the majority class (each in turn) as the target concept, and oversampling is not applied to it in any case because it is unnecessary. A minimal illustration of this one-class training setup is sketched below.
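
As an illustration of how such a one-class model is trained on target objects only, below is a minimal sketch using scikit-learn's OneClassSVM with an RBF kernel as a stand-in for SVDD (with a Gaussian kernel the two formulations are closely related). This is an assumption made for illustration purposes only; the experiments reported here were run in Matlab with the toolboxes listed below, and the toy data does not correspond to any of the studied datasets.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy stand-in for microarray data: 60 target samples and a mixed test set.
rng = np.random.default_rng(0)
X_target = rng.normal(size=(60, 50))                       # class used as target concept
X_test = np.vstack([rng.normal(size=(20, 50)),             # target-like test samples
                    rng.normal(loc=3.0, size=(10, 50))])   # outlier-like test samples

# gamma plays the role of the RBF width parameter tuned by cross-validation.
occ = OneClassSVM(kernel="rbf", gamma=0.01, nu=0.1)
occ.fit(X_target)                    # trained on the target class only
pred = occ.predict(X_test)           # +1 = accepted as target, -1 = rejected as outlier
```

Next, we establish certain considerations which have been taken into account in the experimental study.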

  • In order to obtain statistically significant results, 30 simulations were run; the cross-validation technique was used to tune the parameters of each method, specifically the width parameter of the radial basis function kernel for SVDD and the kernel function (linear, radial basis or polynomial) for SVM.

  • For the implementation of the classifiers, two different Matlab toolboxes were used: the Data Description toolbox (DDtools) [11] for SVDD and the Statistics and Machine Learning Toolbox for SVM.

  • Similarly to our previous study [2], we applied feature selection methods as a preprocessing step with the aim of discarding irrelevant features/genes while retaining the relevant ones. All these techniques are available in the well-known Weka tool [12], except for the mRMR filter, whose implementation is available for Matlab.

  • To evaluate the goodness of the selected set of genes in terms of classifier accuracy, it is necessary to have an independent test set containing data seen by neither the feature selection method nor the classifier. The selected datasets come originally distributed into training and test sets, so the training set was employed to perform the feature selection process and subsequent classification, while the test set was used to evaluate the appropriateness of the selection and of the subsequent classification.

  • For the sake of fair comparison, only the training set is oversampled when using SVM, whereas the test dataset remains the same.

  • Finally, a statistical study was conducted to determine whether the results are statistically different. First of all, the normality conditions of each distribution were checked by means of the Kolmogorov-Smirnov test. Since normality was not verified in any case, the non-parametric Kruskal-Wallis test was then applied (a minimal sketch of this procedure is given right after this list).
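
As an illustration of this two-step statistical procedure, the following is a minimal sketch using SciPy. The arrays in `groups` are toy values standing in for the per-run performance scores of two methods; they are not the actual experimental results.

```python
import numpy as np
from scipy import stats

# One array of performance scores (e.g. accuracy over the repeated runs) per method.
groups = [
    np.array([0.71, 0.68, 0.74, 0.70, 0.69, 0.72]),   # toy scores, method A
    np.array([0.80, 0.83, 0.79, 0.82, 0.81, 0.84]),   # toy scores, method B
]

# Step 1: check the normality of each distribution with the Kolmogorov-Smirnov test
# (scores are standardized and compared against a standard normal distribution).
for scores in groups:
    z = (scores - scores.mean()) / scores.std(ddof=1)
    print("KS p-value:", stats.kstest(z, "norm").pvalue)

# Step 2: when normality is not verified, compare the methods with the
# non-parametric Kruskal-Wallis test instead.
print("Kruskal-Wallis p-value:", stats.kruskal(*groups).pvalue)
```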

The datasets, FS methods and evaluation measures employed in the experimental study are briefly introduced below.

Dataset characteristics. The Breast and Prostate datasets are widely used due to two main properties: (1) they come originally separated into training and test sets and (2) they present more imbalance in the test set. Both datasets are available for download at [13, 14]. Table 1 provides, for the training and test sets, the number of attributes (# Atts.), the number of examples (# Ex.) and the percentage of examples of the majority (% Ma) and minority (% Min) classes. The last column corresponds to the imbalance ratio (IR), the ratio between the majority and minority class sizes: a value of 1 indicates balance, whereas a large value denotes high imbalance. As can be seen in Table 1, both datasets present more imbalance in the test set, especially in the case of the Prostate dataset. The dataset shift problem [15] occurs when the joint distribution of inputs and outputs differs between the training and test stages, hampering the classification process and possibly leading to poor performance results. This problem may be caused by different situations, as in the Prostate dataset, where the test set was extracted from a different experiment. Accordingly, this dataset poses a challenge for machine learning methods; for this reason, some classifiers, whose features are selected according to the training set, assign all samples to the majority class.

Table 1. Description of the train and test binary datasets.

FS methods. Seven classical FS methods widely used in this field are selected: Correlation-based FS (CFS) [16], Fast Correlation-Based Filter (FCBF) [17], the INTERACT algorithm [18], Information Gain (IG) [19], ReliefF [20], minimum Redundancy Maximum Relevance (mRMR) [21] and Support Vector Machine based on Recursive Feature Elimination (SVM-RFE) [22]. All of them, with the exception of the last one, are filter methods, which rely on the general characteristics of the training data to select features independently of any predictor. The first three (CFS, FCBF and INTERACT) return a subset of features: from the original 24,481 attributes of the Breast dataset, 130, 99 and 102 features are selected respectively, while in the case of Prostate, 89, 77 and 73 are chosen from the 12,600 initial features. The last four (IG, ReliefF, mRMR and SVM-RFE) return an ordered ranking of the features; for simplicity we report the performance keeping the top 10 and the top 50 features. Finally, SVM-RFE is the best-known embedded method specifically designed for gene selection in cancer classification. It iteratively trains an SVM classifier with the current set of features and, based on the classifier's internal parameters (the feature weights), removes the least important ones.
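
As an illustration of this recursive elimination scheme, below is a minimal sketch using scikit-learn's RFE wrapper around a linear SVM. The settings shown (keeping 10 genes, dropping 10 % of the remaining features per round) and the toy data are illustrative assumptions, not the configuration of the original SVM-RFE experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Toy stand-in for an expression matrix: 40 samples x 500 genes, binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 500))
y_train = rng.integers(0, 2, size=40)

# A linear SVM exposes per-feature weights, which RFE uses to rank and
# recursively eliminate the least important features.
selector = RFE(estimator=SVC(kernel="linear", C=1.0),
               n_features_to_select=10,   # e.g. keep the top 10 genes
               step=0.1)                  # drop 10% of the remaining features per round
selector.fit(X_train, y_train)

selected_mask = selector.support_         # boolean mask of the retained genes
ranking = selector.ranking_               # 1 = selected, larger = eliminated earlier
X_train_reduced = selector.transform(X_train)
```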

Evaluation measures. For a binary classification problem, accuracy indicates how well the system predicts both categories. However, accuracy is inappropriate when the prior class probabilities are very different, since it does not consider misclassification costs; it is therefore sensitive to class skew and biased in favor of the majority class. Hence, alternative measures should be considered. The true positive rate (recall or sensitivity) is the percentage of correctly classified positive instances (e.g. the rate of cancer patients who are correctly identified as having cancer). The true negative rate (specificity) is the percentage of correctly classified negative examples (e.g. the rate of healthy patients who are correctly classified as not having cancer). The ideal predictor should be 100 % specific and 100 % sensitive. Regarding OCC, it should be mentioned that sensitivity and specificity are always calculated considering the healthy samples as negative and the cancer ones as positive.
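
For reference, the following is a minimal sketch of how sensitivity and specificity can be computed from true and predicted labels, assuming cancer is encoded as the positive class (1) and healthy as the negative class (0); the function name `sens_spec` and the toy labels are our own illustrative choices.

```python
import numpy as np

def sens_spec(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels with 1 = cancer
    (positive) and 0 = healthy (negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # cancer correctly detected
    fn = np.sum((y_true == 1) & (y_pred == 0))   # cancer missed
    tn = np.sum((y_true == 0) & (y_pred == 0))   # healthy correctly recognized
    fp = np.sum((y_true == 0) & (y_pred == 1))   # healthy flagged as cancer
    return tp / (tp + fn), tn / (tn + fp)

# Example with 4 cancer patients and 4 healthy controls (toy labels).
print(sens_spec([1, 1, 1, 1, 0, 0, 0, 0],
                [1, 1, 0, 1, 0, 0, 1, 0]))   # -> (0.75, 0.75)
```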

Table 2. Results for SVM (with oversampling techniques) and SVDD classifiers on Breast and Prostate datasets.

4 Experimental Results

In this section the results achieved on the Breast and Prostate datasets are presented. Table 2 shows the results obtained by the SVM and SVDD classifiers; specifically, Accuracy (Acc), Sensitivity (Se) and Specificity (Sp) are used to assess their performance. In the case of SVDD we report the results reached by using each class (majority and minority) as the target concept in the training process. Regarding SVM, we include the results obtained by using resampling, SMOTE and CSMOTE as oversampling techniques. Each column represents one of the three performance measures, while the rows indicate the FS methods; the last row provides the results when no FS method is applied. To facilitate the analysis of the results, the best values (statistically speaking) of each performance measure for each dataset are marked in bold.

Firstly, we focus on SVM with oversampling methods. At first glance, the behavior of SVM seems similar regardless of the oversampling technique. An ideal predictor should be 100 % sensitive and 100 % specific, but Table 2 shows that SVM tends towards one of the classes. Compared with the original results (without oversampling) reported in [2], it can be seen that the inclusion of oversampling methods leads to particular performance improvements without an outstanding enhancement in the trade-off between Se and Sp.

Regarding OCC, SVDD overcomes the results obtained by SVM, showing important differences. In order to know whether such differences are significant, a statistical study was conducted. As previously mentioned, for each performance measure, FS method and dataset, the best values are marked in bold face. Only for the Breast set does SVM obtain (in some cases) a higher value of the Sp measure; however, in all cases SVDD achieves the best values of Acc and Se, as well as balanced values of Se and Sp. Finally, two issues should be pointed out. On the one hand, FS not only may lead to better performance results, especially in the case of Breast (for instance, see the differences between SVM-RFE-10 and the last row for this dataset), but it also significantly reduces the computational and time requirements. On the other hand, as previously remarked, SVDD allows using either the minority or the majority class as the target class in the training process, and both exhibit a good performance. Even when the results they provide are not statistically different, SVDD can still offer the best results depending on the specific application. Since the aim of this work was to compare SVM and SVDD, no statistical study was performed to compare the application or not of FS methods. However, with or without FS, and using either the minority or the majority class, SVDD achieves the best performance results.

5 Conclusions

Imbalanced datasets are very common in the real world, for example in the diagnosis of a disease such as cancer, and they constitute an important challenge for the machine learning field. In this context, classifiers tend towards the majority class, achieving poor performance results. In a previous work we compared the results obtained by one-class and two-class classifiers, SVDD and SVM respectively, on two microarray datasets; SVDD significantly overcame SVM, achieving a fine global performance. In this paper we include oversampling techniques to counteract the effects associated with imbalanced distributions and to improve the performance of the SVM classifier. Contrary to our initial expectations, the experimental results show that such a modification does not significantly enhance the behavior of SVM, which still remains below SVDD. It is possible that this is caused by the peculiarities of the selected datasets. For this reason, we plan to extend this study with more imbalanced datasets (with higher IR) and more complex oversampling techniques in order to confirm the supremacy shown by OCC in this preliminary study.