1 Introduction

Over the last decade, there has been an increase in the use of data mining techniques across the different areas of medicine (bioinformatics, medical imaging, clinical informatics, and public health informatics). This is due to the impact data mining has had on other domains, such as banking, marketing, and e-commerce, which raised high hopes for similar achievements in medicine by extracting the untapped knowledge contained in available medical data.

The aim of clinical data mining is to search for useful patterns and information within patients' data, and to develop prediction models that can support clinical decision making [1, 2]. Data mining can be used to build predictive models for prognosis, diagnosis, and treatment planning. Even when the data are collected for purposes other than directly diagnosing a disease or predicting treatment outcome, useful medical information can still be retrieved. Nakamura et al. [3], for example, used data mining to predict the development of pressure ulcers in hospitals from patient data that were originally collected for the purpose of calculating nursing costs.

Decision making in the medical field is more sensitive than in many other fields because of its direct relation to life-and-death consequences and the well-being of patients. Therefore, a decision should be made with strong confidence, supported by thorough evaluation and clear explanation. This makes clinical data mining distinct from other data mining applications in various ways. For example, it is common in clinical data mining to use white-box classifiers such as rule-based learners or decision trees because the resulting model is represented in a readable format. This enables physicians to interpret the model output based on their medical knowledge, and increases their confidence when making their final decisions. While models built using black-box methods such as artificial neural networks or support vector machines may provide better prediction accuracy and are welcomed in many other fields, our experience shows that physicians often hesitate to accept their results due to the lack of model understandability: it is unclear how the involved factors are related, and how to link them to their medical experience and knowledge. Although researchers have investigated the use of many different machine learning algorithms on clinical data and reported interesting findings [1, 2], we believe that in practice, the need for model introspection will limit us to only a few of the many algorithms used in other fields and applications, even if this comes at the expense of prediction accuracy.

Another point to consider is that in many data mining applications, it is desirable to have a prediction model with high overall accuracy. In clinical data mining, however, it is important to distinguish between false positive errors and false negative ones. A false negative error has a bigger impact than a false positive one because it can cause an unhealthy patient to miss proper treatment, which might be fatal. A false positive error, on the other hand, can be detected and corrected at a later stage by further investigations and tests.

Clinical datasets are usually highly heterogeneous: the data are collected from various sources such as images, laboratory tests, patient interviews, and physicians' observations and interpretations, which leads to a poor mathematical characterization. In addition, many clinical datasets are noisy, incomplete, and suffer from the problem of data imbalance, in which the data contain a large number of patients (cases/instances) of one class (type/category) and a small number of patients of the other class.

In this study, we consider using the C4.5 decision tree, which is widely used in clinical data mining, together with different sampling methods, in order to identify the best solutions for tackling the imbalanced data problem commonly faced in medical data mining.

The rest of this paper is organized as follows. In the next section, we explain decision tree classification models. We then discuss the data imbalance problem and describe common methods to overcome it in Sect. 3. Section 4 presents the clinical datasets considered in this study. Section 5 discusses the methodology we follow to conduct our experiments. We analyze the results and report our findings in Sect. 6. Finally, Sect. 7 concludes the paper and gives directions for future work.

2 Decision Trees

A decision tree model [4] is a data structure capable of representing knowledge in a human-understandable way. It consists of a set of internal nodes, each representing a test condition on the values of one data attribute. The tree emerges from one common root node and ends with many leaf nodes, where each leaf represents a final classification decision.

Being able to understand how the built model classifies the data, and to interpret that into useful domain knowledge, are the main reasons why decision trees are preferred over methods like SVMs or neural networks in clinical data mining [5]. A new data instance is classified by starting from the root of the decision tree and moving down its branches according to the instance's attribute test results until a leaf node is reached. The class of the leaf node represents the predicted class of the instance. The attribute tested at each node is usually determined using some splitting criterion. However, popular splitting criteria such as the information gain ratio [4, 6] and the Gini measure are skew sensitive.
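The following minimal sketch illustrates this white-box property. It assumes scikit-learn is available; note that scikit-learn implements CART with an entropy criterion, which only approximates C4.5's gain-ratio splitting, and the bundled breast cancer data (derived from the UCI Wisconsin dataset) stands in for the clinical datasets used later in this paper.

```python
# A hedged sketch of white-box decision tree classification (not the
# RapidMiner C4.5 setup used in our experiments).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target

# Train a shallow tree so the extracted rules stay readable for clinicians.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the tree as human-readable if/else rules; each leaf is a decision.
print(export_text(tree, feature_names=list(data.feature_names)))

# Classifying a new instance walks from the root down to a leaf node.
print(tree.predict(X[:1]))
```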

3 Data Imbalance

Data imbalance is a very common problem in clinical data mining. A dataset is considered imbalanced if the number of instances of one class is considerably smaller than the number of instances of the others. In clinical data, the majority class is usually the negative class and the minority class is the positive class, which is the class of our main interest. Multi-class problems might also suffer from data imbalance; however, they can easily be converted into several one-versus-others problems. Many learning algorithms tend to get overwhelmed by the large number of majority class instances and ignore the minority class; they thus achieve a high total accuracy but also a high error rate on the minority class, which is usually our concern. Assuming a 90 % imbalance ratio, a classifier that classifies all instances as negative will achieve 90 % accuracy while misclassifying all positive instances of the important class. Obviously, this is not the desired result, and some alteration is required to overcome this problem.
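This accuracy paradox is easy to reproduce. A brief sketch, assuming scikit-learn: a trivial majority-class predictor reaches 90 % accuracy on 90/10 data while catching zero positive cases.

```python
# Illustration of the accuracy paradox described above.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 90 + [1] * 10)   # 90 % negatives, 10 % positives
X = np.zeros((100, 1))              # features are irrelevant here

# Always predict the majority (negative) class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))      # 0.9 -> looks good
print(recall_score(y, pred))        # 0.0 -> sensitivity is zero
```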

Japkowicz and Stephen [7] showed that different learning algorithms have different levels of sensitivity to the data imbalance problem. They also showed that decision trees are the most sensitive classifiers compared to multilayer perceptrons and support vector machines. In clinical data mining, decision trees are nevertheless preferable because they provide an explanation of the classification decision.

3.1 Undersampling

Undersampling achieves data balance by removing instances from the majority class. Random undersampling is the simplest form of undersampling: as its name indicates, the size of the majority class is reduced by removing instances at random. Random undersampling is simple and easy to implement; however, a main disadvantage of undersampling methods is the possibility of losing information contained in important majority class instances removed by the undersampling process. A good informed-undersampling method reduces this possibility.
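A minimal sketch of random undersampling, assuming plain NumPy; the helper name random_undersample is ours, not part of the RapidMiner setup used in the experiments.

```python
import numpy as np

def random_undersample(X, y, majority_label, rng=None):
    """Drop random majority instances until both classes have equal size."""
    rng = np.random.default_rng(rng)
    majority = np.flatnonzero(y == majority_label)
    minority = np.flatnonzero(y != majority_label)
    # Keep only as many majority instances as there are minority instances.
    keep_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([keep_majority, minority])
    return X[keep], y[keep]
```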

Informed undersampling reduces the size of the majority class in a controlled fashion in order to keep important instances from the majority class. Examples of informed undersampling are EasyEnsemble and BalanceCascade, reported in [8]. Both methods use ensemble learning in order to explore the majority class space and select useful instances; however, ensemble models are usually difficult to explain and fall into the black-box learner zone.

Zhang and Mani [9] proposed four sampling methods, called NearMiss-1, NearMiss-2, NearMiss-3, and Most Distance, that use K-nearest neighbors in order to reduce the size of the majority class. The K-nearest neighbors of an instance are defined as the K elements whose distances to that instance are the smallest. Here we provide a description of the four algorithms (a rough sketch of NearMiss-1 follows the list):

  • NearMiss-1 selects from the majority class the instances whose average distances to the three closest minority instances are the smallest. Thus the instances selected by NearMiss-1 are close to some of the minority class instances.

  • NearMiss-2 selects from the majority class the instances with the smallest average distance to the three farthest minority class instances. In other words, NearMiss-2 selects the majority instances that are close to all of the minority instances.

  • NearMiss-3 surrounds each instance from the minority class with k instances from the majority class. It selects a predetermined number of the closest majority instances for each minority instance.

  • Most Distance selects the instances from the majority class that have the largest average distance to the three closest instances from the minority class.
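As referenced above, here is a rough NumPy sketch of NearMiss-1 following the description in [9]; the paper's own implementation may differ in details, and Euclidean distance is assumed for simplicity.

```python
import numpy as np

def near_miss_1(X_maj, X_min, n_keep, n_neighbors=3):
    """Keep the n_keep majority instances whose mean distance to their
    n_neighbors closest minority instances is smallest."""
    # Pairwise Euclidean distances: rows = majority, columns = minority.
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Mean distance to the three closest minority instances per majority row.
    nearest = np.sort(d, axis=1)[:, :n_neighbors]
    score = nearest.mean(axis=1)
    # Smallest scores first: majority instances nearest the minority class.
    return X_maj[np.argsort(score)[:n_keep]]
```

NearMiss-2 and Most Distance follow the same pattern with the sort direction or neighbor selection changed.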

3.2 Oversampling

As its name indicates, oversampling works by sampling more data from the minority class. Random oversampling randomly selects a set of minority class instances S_r, duplicates its members, and appends them to the original minority class set. This increases the size of the minority class by the size of S_r and reduces the original data imbalance; the process is repeated until the desired data balance is reached. The problem with oversampling is that it may make the classifier susceptible to overfitting, because repeating the same instances causes the classifier to become more specific in covering them.

Another method of increasing the size of the minority class is synthetic sampling, in which artificial data are synthesized from the original minority class. A powerful method that has shown good results in many applications is the synthetic minority oversampling technique (SMOTE) [10]. SMOTE uses feature space similarities between minority class instances in order to generate the synthesized artificial data. To create a synthesized instance for an instance in the minority class, SMOTE randomly selects one of its K-nearest neighbors for some specified K, calculates the feature vector difference between the two instances, multiplies it by a random number in the range [0, 1], and adds the resulting vector to the original minority instance to generate the new artificial instance.
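A compact sketch of this interpolation step, assuming NumPy; the experiments in this paper used the Weka SMOTE implementation instead.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic artificial minority instances as in SMOTE [10]."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # k nearest minority neighbors of instance i (excluding itself).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        # Interpolate between the instance and a random neighbor.
        gap = rng.random()  # random number in [0, 1]
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(out)
```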

3.3 Model Evaluation

It is important to validate the model performance. Accuracy is usually the evaluation metric used to assess classification models. However, accuracy assumes a similar cost for false positive and false negative errors. In clinical data mining, as discussed above, false negative errors are more costly than false positive ones, and an evaluation method that reflects this fact is required.

Evaluation metrics are usually derived from the confusion matrix shown in Table 1.

Table 1. Confusion matrix

From the confusion matrix, accuracy can be calculated as the ratio of correctly classified instances: Accuracy = (TP + TN) / (Pc + Nc), where Pc = TP + FN and Nc = TN + FP are the total numbers of positive and negative instances, respectively. The classification error equals 1 - Accuracy, i.e. Error = (FP + FN) / (Pc + Nc).

Sensitivity and specificity can provide better metrics in the case of imbalanced datasets. Sensitivity, defined as TP / (TP + FN) = TP / Pc, measures the proportion of positive instances that are correctly classified.

On the other hand, specificity, defined as TN / (TN + FP) = TN / Nc, measures the proportion of negative instances that are correctly classified.

A good classifier should have high values for both sensitivity and specificity. In the case of imbalanced data, a classifier that classifies all instances as negative will have high accuracy, and high specificity, but zero sensitivity.

The area under the curve (AUC) [11] is widely used for measuring performance in the case of imbalanced data. The AUC returns the area under the receiver operating characteristic (ROC) curve, which provides a visual representation of the performance in terms of the true positive rate (i.e. sensitivity) and the false positive rate (i.e. 1 - specificity). The visual representation is useful for showing the tradeoff between true positive and false positive rates; however, it is difficult to use for direct quantitative comparison. The AUC provides a single quantitative summary of the ROC curve.
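The three metrics can be computed directly from predictions; a minimal sketch, assuming scikit-learn, where y_true are gold labels, y_pred are hard predictions, and y_score are positive-class scores:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Return (sensitivity, specificity, AUC) for a binary problem."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # TP / Pc
    specificity = tn / (tn + fp)   # TN / Nc
    auc = roc_auc_score(y_true, y_score)
    return sensitivity, specificity, auc
```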

4 Experimental Datasets

Several earlier experimental studies on learning from imbalanced data have been conducted. Reference [12] discussed the use of several sampling techniques versus different machine learners and performance metrics, and reported partial results of applying combinations of these choices to 35 datasets from a variety of application domains. In another study [13], the researchers investigated the class imbalance problem in medical datasets by considering different undersampling and oversampling techniques applied to one cardiovascular dataset. In this paper, we investigate the effect of a group of undersampling and oversampling techniques applied to multiple clinical datasets, under constraints suitable for data mining in the medical domain, where white-box learners and suitable metrics are of concern.

In our empirical study, we have considered seven nonproprietary clinical datasets publicly available from the following sources:

  • UCI: the data repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine, commonly known as the UCI Machine Learning Repository (archive.ics.uci.edu/ml).

  • OML: an open collaborative machine learning platform (www.openml.org).

We have considered only datasets with binary classification problems. Table 2 lists these datasets with a brief description of each one, and an identifier to refer to it later in our analysis.

Table 2. Description of used clinical datasets

The imbalance ratio, defined here as the percentage of minority class instances to majority class instances, varies from 9 % (highly imbalanced) to almost 54 % (only slightly imbalanced). The datasets are also diverse in the number of attributes, their types (continuous and categorical), and the number of instances.

A few datasets contain missing values in one or more of their attributes. In our study, we did not apply any method to fill in these values, and decided to work on complete data by removing the instances with missing values, since there were only a few of them. The TYR dataset was the only one with some attributes completely empty or redundant (these attributes were removed), and it had a rather large number of instances with missing values in some other attributes. The instances in the latter case were removed, which we consider acceptable given the total number of instances in this dataset. Table 3 summarizes these details.

Table 3. Datasets pre-processing and summaries

5 Experiment Design

We have used RapidMiner (6.5) [16] to conduct our experiments. We have also used the SMOTE implementation in the Weka (3.6.13) software [17] to perform oversampling.

We have systematically applied each of the sampling methods, including “No Sampling”, on each of the seven datasets. After data pre-processing, stratified 10-fold cross-validation was used to evaluate each method. In each fold, the data balancing method was applied to the training subset only, while the test subset was left imbalanced. Figure 1 shows a snapshot of the cross-validation design in RapidMiner, which limits the sampling to the training set; a sketch of this fold-internal design is given after the figure. AUC, sensitivity, and specificity were recorded for each sampling method.

Fig. 1. Cross-validation process to evaluate SMOTE oversampling using RapidMiner.
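The following sketch captures the design point from Fig. 1, assuming scikit-learn and any resampler with the signature of the helpers sketched in Sect. 3 (e.g. random_undersample): resampling is applied inside each training fold, and the test fold is scored as-is.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def evaluate_with_sampling(X, y, sampler, n_splits=10, seed=0):
    """Stratified CV where balancing touches the training subset only."""
    aucs = []
    skf = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        # Balance the training subset only; the test subset stays imbalanced.
        X_tr, y_tr = sampler(X[train], y[train])
        clf = DecisionTreeClassifier(criterion="entropy", random_state=seed)
        clf.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return float(np.mean(aucs))

# Example use with the undersampling helper from Sect. 3.1:
# auc = evaluate_with_sampling(X, y,
#           lambda X, y: random_undersample(X, y, majority_label=0))
```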

The four sampling methods NearMiss-1, NearMiss-2, NearMiss-3, and Most Distance depend on calculating the distance between instances. In the case of numeric attributes, the Euclidean distance is used. When the attributes are of mixed types (numerical and categorical), the mixed Euclidean distance is used, where for nominal attributes a distance of one is counted if the corresponding values differ. These algorithms are not part of the RapidMiner components, and have been implemented by the authors.
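A sketch of this mixed Euclidean distance, assuming NumPy; squared differences for numeric attributes and a 0/1 mismatch penalty for nominal ones. The function name and the nominal_mask parameter are ours, introduced for illustration.

```python
import numpy as np

def mixed_euclidean(a, b, nominal_mask):
    """Distance between instances a and b; nominal_mask flags nominal attrs."""
    numeric = ~nominal_mask
    # Squared differences over numeric attributes.
    d_num = np.sum((a[numeric].astype(float) - b[numeric].astype(float)) ** 2)
    # Distance of one per mismatching nominal attribute.
    d_nom = np.sum(a[nominal_mask] != b[nominal_mask])
    return np.sqrt(d_num + d_nom)
```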

For all sampling methods, we chose the parameters that rebalance the datasets to an almost equal ratio for both classes. As for the k parameter (the number of nearest neighbors) for the SMOTE and NearMiss-3 algorithms, a fixed value of 5 was chosen.

6 Results and Analysis

The results of our experiments are shown in Tables 4, 5, 6 and 7. For each dataset, the area under the curve (AUC), sensitivity, and specificity of each of the methods used, rounded to two decimal places, are reported.

Table 4. Performance ranks and results on BRC (left) and BCW (right) datasets.
Table 5. Performance ranks and results on DIB (left) and SAH (right) datasets.
Table 6. Performance ranks and results on SPF (left) and SPT (right) datasets.
Table 7. Performance ranks and results on TYR dataset.

For the Breast Cancer (BRC) dataset, Table 4 (left) shows that the Most Distance method scored the highest AUC, with a corresponding sensitivity of 0.54 and specificity of 0.81. This is a good improvement in sensitivity over the results obtained on the original data (indicated by the No Sampling method), with a relatively small reduction in specificity. Table 4 (right) shows the results for the BCW dataset, and the results for the remaining datasets are summarized in Tables 5, 6 and 7.

In Table 8, we have summarized the ranking counts for each method. For example, the Random Undersampling method was ranked first in only one dataset, and similarly for the second, third, and fourth ranks. It also ranked fifth in three datasets.

Table 8. Methods Rankings

We can see from the table that the oversampling methods ranked first and second more often than the undersampling ones, with random oversampling ranked first more often than SMOTE. Among the undersampling methods, NearMiss-1 and NearMiss-2 often scored low ranks compared to the other methods.

7 Conclusion

In this work, we have evaluated the performance of different sampling methods on clinical data classification problems using the C4.5 decision tree, due to its wide usage in clinical data mining. Random oversampling, random undersampling, SMOTE oversampling, and the NearMiss-1, NearMiss-2, NearMiss-3, and Most Distance undersampling methods were investigated. The results showed that, from the AUC point of view, the random oversampling and SMOTE methods were superior to the undersampling methods.