1 Introduction

Over the last decade, there has been an increase in the use of data mining techniques across the different areas of medicine (bioinformatics, medical imaging, clinical informatics, and public health informatics). This is due to the impact data mining has had on other domains, such as banking, marketing, and e-commerce, which raised high hopes for similar achievements in medicine by extracting the untapped knowledge contained in available medical data.

The aim of clinical data mining is to search for useful patterns and information within patients' data, and to develop prediction models that can support clinical decision making [1, 2]. Data mining can be used to build predictive models for prognosis, diagnosis, and treatment planning. Even when the data are collected for purposes other than directly diagnosing a disease or predicting treatment outcome, useful medical information can still be retrieved. Nakamura et al. [3], for example, used data mining to predict the development of pressure ulcers in hospitals from patient data that were originally collected for the purpose of calculating nursing costs.

Decision making in the medical field is more sensitive than in many other fields because of its direct relation to life-and-death consequences and the well-being of patients. Therefore, a decision should be made with strong confidence, supported by thorough evaluation and clear explanation. This makes clinical data mining distinct from other data mining applications in various ways. For example, it is common in clinical data mining to use white-box classifiers such as rule-based learners or decision trees because the resulting model is represented in a readable format. This enables physicians to interpret the model output based on their medical knowledge, and increases their confidence when making their final decisions. While models built using black-box methods such as artificial neural networks or support vector machines may provide better prediction accuracy and are welcomed in many other fields, our experience shows that physicians often hesitate to accept their results due to the lack of model understandability: it is unclear how the involved factors are related, and how to link them to their medical experience and knowledge. Although researchers have investigated the use of many different machine learning algorithms on clinical data and reported interesting findings [1, 2], we believe that in practice, the need for model introspection will limit us to only a few of the many algorithms used in other fields and applications, even if this comes at the expense of prediction accuracy.

Another point to consider is that in many data mining applications, it is desirable to have a prediction model with high overall accuracy. In clinical data mining, however, it is important to distinguish between false positive errors and false negative ones. A false negative error has a bigger impact than a false positive one because it can cause an unhealthy patient to miss proper treatment, which might be fatal. A false positive error, on the other hand, can be detected and corrected at a later stage by further investigations and tests.

Clinical datasets are usually highly heterogeneous: the data are collected from various sources such as images, laboratory tests, patient interviews, and physicians' observations and interpretations, which leads to a poor mathematical characterization. In addition, many clinical datasets are noisy, incomplete, and suffer from the problem of data imbalance, in which the data contain a large number of patients (cases/instances) of one class (type/category) and a small number of patients of the other class.

In this study, we consider using the C4.5 decision tree, which is widely used in clinical data mining, together with different sampling methods, in order to identify the best solutions for tackling the imbalanced data problem commonly faced in medical data mining.

The rest of this paper is organized as follows. In the next section, we explain decision tree classification models. We then discuss the data imbalance problem and describe common methods to overcome it in Sect. 3. Section 4 presents the clinical datasets considered in this study. Section 5 discusses the methodology we follow to conduct our experiments. We analyze the results and report our findings in Sect. 6. Finally, Sect. 7 concludes the paper and gives directions for future work.

2 Decision Trees

A decision tree model [4] is a data structure capable of representing knowledge in a human-understandable way. It consists of a set of internal nodes, each representing a test condition on the values of one data attribute. The tree emerges from one common root node and ends with many leaf nodes, where each leaf represents a final classification decision.

Being able to understand how the built model classifies the data, and to interpret that into useful domain knowledge, are the main reasons why decision trees are preferred over methods like SVMs or neural networks in clinical data mining [5]. A new data instance is classified by starting from the root of the decision tree and moving down its branches according to the instance's attribute test results until a leaf node is reached. The class of the leaf node represents the predicted class of the instance. The attribute tested at each node is usually determined using some splitting criterion. However, popular splitting criteria such as the information gain ratio [4, 6] and the Gini measure are skew sensitive.
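The following minimal sketch illustrates this white-box property. It assumes scikit-learn is available; note that scikit-learn implements CART with an entropy criterion, which only approximates C4.5's gain-ratio splitting, and the bundled breast cancer data (derived from the UCI Wisconsin dataset) stands in for the clinical datasets used later in this paper.

```python
# A hedged sketch of white-box decision tree classification (not the
# RapidMiner C4.5 setup used in our experiments).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target

# Train a shallow tree so the extracted rules stay readable for clinicians.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the tree as human-readable if/else rules; each leaf is a decision.
print(export_text(tree, feature_names=list(data.feature_names)))

# Classifying a new instance walks from the root down to a leaf node.
print(tree.predict(X[:1]))
```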

3 Data Imbalance

Data imbalance is a very common problem in clinical data mining. A dataset is considered imbalanced if the number of instances of one class is considerably smaller than the number of instances of the others. In clinical data, the majority class is usually the negative class and the minority class is the positive class, which is the class of our main interest. Multi-class problems might also suffer from data imbalance; however, they can easily be converted into several one-versus-others problems. Many learning algorithms tend to get overwhelmed by the large number of majority class instances and ignore the minority class; they thus achieve a high total accuracy but also a high error rate on the minority class, which is usually our concern. Assuming a 90 % imbalance ratio, a classifier that classifies all instances as negative will achieve 90 % accuracy while misclassifying all positive instances of the important class. Obviously, this is not the desired result, and some alteration is required to overcome this problem.
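This accuracy paradox is easy to reproduce. A brief sketch, assuming scikit-learn: a trivial majority-class predictor reaches 90 % accuracy on 90/10 data while catching zero positive cases.

```python
# Illustration of the accuracy paradox described above.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 90 + [1] * 10)   # 90 % negatives, 10 % positives
X = np.zeros((100, 1))              # features are irrelevant here

# Always predict the majority (negative) class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))      # 0.9 -> looks good
print(recall_score(y, pred))        # 0.0 -> sensitivity is zero
```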

Japkowicz and Stephen [7] showed that different learning algorithms have different levels of sensitivity to the data imbalance problem. They also showed that decision trees are the most sensitive classifiers compared to multilayer perceptrons and support vector machines. In clinical data mining, decision trees are nevertheless preferable because they provide an explanation of the classification decision.

3.1 Undersampling

Undersampling achieves data balance by removing instances from the majority class. Random undersampling is the simplest form of undersampling: as its name indicates, the size of the majority class is reduced by removing instances at random. Random undersampling is simple and easy to implement; however, a main disadvantage of undersampling methods is the possibility of losing information contained in important majority class instances removed by the undersampling process. A good informed-undersampling method reduces this possibility.
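A minimal sketch of random undersampling, assuming plain NumPy; the helper name random_undersample is ours, not part of the RapidMiner setup used in the experiments.

```python
import numpy as np

def random_undersample(X, y, majority_label, rng=None):
    """Drop random majority instances until both classes have equal size."""
    rng = np.random.default_rng(rng)
    majority = np.flatnonzero(y == majority_label)
    minority = np.flatnonzero(y != majority_label)
    # Keep only as many majority instances as there are minority instances.
    keep_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([keep_majority, minority])
    return X[keep], y[keep]
```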

Informed undersampling reduces the size of the majority class in a controlled fashion in order to keep important instances from the majority class. Examples of informed undersampling are EasyEnsemble and BalanceCascade, reported in [8]. Both methods use ensemble learning in order to explore the majority class space and select useful instances; however, ensemble models are usually difficult to explain and fall into the black-box learner zone.

Zhang and Mani [9] proposed four sampling methods, called NearMiss-1, NearMiss-2, NearMiss-3, and Most Distance, that use K-nearest neighbors in order to reduce the size of the majority class. The K-nearest neighbors of an instance are defined as the K elements whose distances to that instance are the smallest. Here we provide a description of the four algorithms (a rough sketch of NearMiss-1 follows the list):

  • NearMiss-1 selects from the majority class the instances whose average distances to the three closest minority instances are the smallest. Thus the instances selected by NearMiss-1 are close to some of the minority class instances.

  • NearMiss-2 selects from the majority class the instances with the smallest average distance to the three farthest minority class instances. In other words, NearMiss-2 selects the majority instances that are close to all of the minority instances.

  • NearMiss-3 surrounds each instance from the minority class with k instances from the majority class. It selects a predetermined number of the closest majority instances for each minority instance.

  • Most Distance selects the instances from the majority class that have the largest average distance to the three closest instances from the minority class.
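As referenced above, here is a rough NumPy sketch of NearMiss-1 following the description in [9]; the paper's own implementation may differ in details, and Euclidean distance is assumed for simplicity.

```python
import numpy as np

def near_miss_1(X_maj, X_min, n_keep, n_neighbors=3):
    """Keep the n_keep majority instances whose mean distance to their
    n_neighbors closest minority instances is smallest."""
    # Pairwise Euclidean distances: rows = majority, columns = minority.
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Mean distance to the three closest minority instances per majority row.
    nearest = np.sort(d, axis=1)[:, :n_neighbors]
    score = nearest.mean(axis=1)
    # Smallest scores first: majority instances nearest the minority class.
    return X_maj[np.argsort(score)[:n_keep]]
```

NearMiss-2 and Most Distance follow the same pattern with the sort direction or neighbor selection changed.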

3.2 Oversampling

As its name indicates, oversampling works by sampling more data from the minority class. Random oversampling randomly selects a set of minority class instances S_r, duplicates its members, and appends them to the original minority class set. This increases the size of the minority class by the size of S_r and reduces the original data imbalance; the process is repeated until the desired data balance is reached. The problem with oversampling is that it may make the classifier susceptible to overfitting, because repeating the same instances causes the classifier to become more specific in covering them.

Another method of increasing the size of the minority class is synthetic sampling, in which artificial data are synthesized from the original minority class. A powerful method that has shown good results in many applications is the synthetic minority oversampling technique (SMOTE) [10]. SMOTE uses feature space similarities between minority class instances in order to generate the synthesized artificial data. To create a synthesized instance for an instance in the minority class, SMOTE randomly selects one of its K-nearest neighbors for some specified K, calculates the feature vector difference between the two instances, multiplies it by a random number in the range [0, 1], and adds the resulting vector to the original minority instance to generate the new artificial instance.
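A compact sketch of this interpolation step, assuming NumPy; the experiments in this paper used the Weka SMOTE implementation instead.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic artificial minority instances as in SMOTE [10]."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # k nearest minority neighbors of instance i (excluding itself).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        # Interpolate between the instance and a random neighbor.
        gap = rng.random()  # random number in [0, 1]
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(out)
```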

3.3 Model Evaluation

It is important to validate the model performance. Accuracy is usually the evaluation metric used to assess classification models. However, accuracy assumes a similar cost for false positive and false negative errors. In clinical data mining, as discussed above, false negative errors are more costly than false positive ones, and an evaluation method that reflects this fact is required.

Evaluation metrics are usually derived from the confusion matrix shown in Table 1.

Table 1. Confusion matrix

From the confusion matrix, accuracy can be calculated as the ratio of correctly classified instances: Accuracy = (TP + TN) / (Pc + Nc), where Pc = TP + FN and Nc = TN + FP are the total numbers of positive and negative instances, respectively. The classification error equals 1 - Accuracy, i.e. Error = (FP + FN) / (Pc + Nc).

Sensitivity and specificity can provide better metrics in the case of imbalanced datasets. Sensitivity, defined as TP / (TP + FN) = TP / Pc, measures the proportion of positive instances that are correctly classified.

On the other hand, specificity, defined as TN / (TN + FP) = TN / Nc, measures the proportion of negative instances that are correctly classified.

A good classifier should have high values for both sensitivity and specificity. In the case of imbalanced data, a classifier that classifies all instances as negative will have high accuracy, and high specificity, but zero sensitivity.

The area under the curve (AUC) [11] is widely used for measuring performance in the case of imbalanced data. The AUC returns the area under the receiver operating characteristic (ROC) curve, which provides a visual representation of the performance in terms of the true positive rate (i.e. sensitivity) and the false positive rate (i.e. 1 - specificity). The visual representation is useful for showing the tradeoff between true positive and false positive rates; however, it is difficult to use for direct quantitative comparison. The AUC provides a single quantitative summary of the ROC curve.
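The three metrics can be computed directly from predictions; a minimal sketch, assuming scikit-learn, where y_true are gold labels, y_pred are hard predictions, and y_score are positive-class scores:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Return (sensitivity, specificity, AUC) for a binary problem."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # TP / Pc
    specificity = tn / (tn + fp)   # TN / Nc
    auc = roc_auc_score(y_true, y_score)
    return sensitivity, specificity, auc
```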

4 Experimental Datasets

Several earlier experimental studies on learning from imbalanced data have been conducted. Reference [12] discussed the use of several sampling techniques versus different machine learners and performance metrics, and reported partial results of applying combinations of these choices to 35 datasets from a variety of application domains. In another study [13], the researchers investigated the class imbalance problem in medical datasets by considering different undersampling and oversampling techniques applied to one cardiovascular dataset. In this paper, we investigate the effect of a group of undersampling and oversampling techniques applied to multiple clinical datasets, under constraints suitable for data mining in the medical domain, where white-box learners and suitable metrics are of concern.

In our empirical study, we have considered seven nonproprietary clinical datasets publicly available from the following sources:

  • UCI: the data repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine, commonly known as the UCI Machine Learning Repository (archive.ics.uci.edu/ml).

  • OML: an open collaborative machine learning platform (www.openml.org).

We have considered only datasets with binary classification problems. Table 2 lists these datasets with a brief description of each one, and an identifier to refer to it later in our analysis.

Table 2. Description of used clinical datasets

The imbalance ratio, defined here as the percentage of minority class instances to majority class instances, varies from 9 % (highly imbalanced) to almost 54 % (only slightly imbalanced). The datasets are also diverse in the number of attributes, their types (continuous and categorical), and the number of instances.

A few datasets contain missing values in one or more of their attributes. In our study, we did not apply any method to fill in these values, and decided to work on complete data by removing the instances with missing values, since there were only a few of them. The TYR dataset was the only one with some attributes completely empty or redundant (these attributes were removed), and it had a rather large number of instances with missing values in some other attributes. The instances in the latter case were removed, which we consider acceptable given the total number of instances in this dataset. Table 3 summarizes these details.

Table 3. Datasets pre-processing and summaries

5 Experiment Design

We have used RapidMiner (6.5) [16] to conduct our experiments. We have also used the SMOTE implementation in the Weka (3.6.13) software [17] to perform oversampling.

We have systematically applied each of the sampling methods, including “No Sampling”, on each of the seven datasets. After data pre-processing, stratified 10-fold cross-validation was used to evaluate each method. In each fold, the data balancing method was applied to the training subset only, while the test subset was left imbalanced. Figure 1 shows a snapshot of the cross-validation design in RapidMiner, which limits the sampling to the training set; a sketch of this fold-internal design is given after the figure. AUC, sensitivity, and specificity were recorded for each sampling method.

Fig. 1. Cross-validation process to evaluate SMOTE oversampling using RapidMiner.
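The following sketch captures the design point from Fig. 1, assuming scikit-learn and any resampler with the signature of the helpers sketched in Sect. 3 (e.g. random_undersample): resampling is applied inside each training fold, and the test fold is scored as-is.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def evaluate_with_sampling(X, y, sampler, n_splits=10, seed=0):
    """Stratified CV where balancing touches the training subset only."""
    aucs = []
    skf = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        # Balance the training subset only; the test subset stays imbalanced.
        X_tr, y_tr = sampler(X[train], y[train])
        clf = DecisionTreeClassifier(criterion="entropy", random_state=seed)
        clf.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return float(np.mean(aucs))

# Example use with the undersampling helper from Sect. 3.1:
# auc = evaluate_with_sampling(X, y,
#           lambda X, y: random_undersample(X, y, majority_label=0))
```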

The four sampling methods NearMiss-1, NearMiss-2, NearMiss-3, and Most Distance depend on calculating the distance between instances. In the case of numeric attributes, the Euclidean distance is used. When the attributes are of mixed types (numerical and categorical), the mixed Euclidean distance is used, where for nominal attributes a distance of one is counted if the corresponding values differ. These algorithms are not part of the RapidMiner components, and have been implemented by the authors.
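A sketch of this mixed Euclidean distance, assuming NumPy; squared differences for numeric attributes and a 0/1 mismatch penalty for nominal ones. The function name and the nominal_mask parameter are ours, introduced for illustration.

```python
import numpy as np

def mixed_euclidean(a, b, nominal_mask):
    """Distance between instances a and b; nominal_mask flags nominal attrs."""
    numeric = ~nominal_mask
    # Squared differences over numeric attributes.
    d_num = np.sum((a[numeric].astype(float) - b[numeric].astype(float)) ** 2)
    # Distance of one per mismatching nominal attribute.
    d_nom = np.sum(a[nominal_mask] != b[nominal_mask])
    return np.sqrt(d_num + d_nom)
```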

For all sampling methods, we chose the parameters that rebalance the datasets to an almost equal ratio for both classes. As for the k parameter (the number of nearest neighbors) for the SMOTE and NearMiss-3 algorithms, a fixed value of 5 was chosen.

6 Results and Analysis

The results of our experiments are shown in Tables 4, 5, 6 and 7. For each dataset, the area under the curve (AUC), sensitivity, and specificity of each of the methods used, rounded to two decimal places, are reported.

Table 4. Performance ranks and results on BRC (left) and BCW (right) datasets.
Table 5. Performance ranks and results on DIB (left) and SAH (right) datasets.
Table 6. Performance ranks and results on SPF (left) and SPT (right) datasets.
Table 7. Performance ranks and results on TYR dataset.

For the Breast Cancer (BRC) dataset, Table 4 (left) shows that the Most Distance method scored the highest AUC, with a corresponding sensitivity of 0.54 and specificity of 0.81. This is a good improvement in sensitivity over the results obtained on the original data (indicated by the No Sampling method), with a relatively small reduction in specificity. Table 4 (right) shows the results for the BCW dataset, and the results for the remaining datasets are summarized in Tables 5, 6 and 7.

In Table 8, we have summarized the ranking counts for each method. For example, the Random Undersampling method was ranked first in only one dataset, and similarly for the second, third, and fourth ranks. It also ranked fifth in three datasets.

Table 8. Methods Rankings

We can see from the table that the oversampling methods ranked first and second more often than the undersampling ones, with random oversampling ranked first more often than SMOTE. Among the undersampling methods, NearMiss-1 and NearMiss-2 often scored low ranks compared to the other methods.

7 Conclusion

In this work, we have evaluated the performance of different sampling methods on clinical data classification problems using the C4.5 decision tree, due to its wide usage in clinical data mining. Random oversampling, random undersampling, SMOTE oversampling, and the NearMiss-1, NearMiss-2, NearMiss-3, and Most Distance undersampling methods were investigated. The results showed that, from the AUC point of view, the random oversampling and SMOTE methods were superior to the undersampling methods.