
1 Introduction

Imbalanced data refers to datasets in which the target class has an unequal distribution of observations; i.e., one class label has a large number of observations while the other has a small number. A large number of experiments on the behavior of standard classifiers in the imbalanced-data domain have revealed that the class imbalance, measured as the ratio of the number of examples in the majority class to the number of examples in the minority class, can produce misclassification because the majority class dominates the minority class, resulting in a loss of accuracy [1]. With imbalanced datasets, standard classification learning algorithms are frequently biased toward the majority class (known as the “negative” class), resulting in a higher misclassification rate for minority class instances (known as the “positive” class) [2]. One state-of-the-art family of methods for creating a new dataset is sampling (or resampling) techniques. Undersampling removes some observations of the majority class, whereas oversampling adds more data samples to the minority class. The advantages of combining both techniques are an improved run time owing to the reduced number of training samples, fewer storage problems when the training dataset is huge, and reduced data redundancy leading to a well-balanced dataset. The most commonly used oversampling technique is the synthetic minority oversampling technique (SMOTE) [3,4,5], and the neighborhood cleaning rule (NCL) is a commonly used undersampling technique [6,7,8]. In this study, we suggest a few modifications to the original SMOTE and NCL to improve their performance. We then apply the combination of SMOTE and NCL to different classification problems. The performance of the sampling approaches has been validated using a linear support vector classifier and K-nearest neighbor. The results show the superior performance of the algorithms on the sampled/balanced datasets as compared to the imbalanced datasets.

The paper is organized as follows: Sect. 2 briefly reviews the related literature. Sect. 3 describes the datasets used in this study. Sect. 4 elaborates the proposed work, and Sects. 5 and 6 present the result evaluation and empirical validation. Finally, Sect. 7 concludes the work.

2 Literature Review

Cluster-based sampling approaches were created to reduce the number of data samples in the majority class [9]. In general, they work by partitioning the given majority class dataset into a number of clusters and then selecting a number of representative data samples from each cluster. However, cluster-based sampling approaches have significant drawbacks that directly affect the reduced majority class dataset and the final classification performance. Before performing any synthetic sampling, [10] used the K-means clustering approach to cope with noisy data. They also employed SMOTE to oversample the clusters. On the sampled dataset, LR and ANN classifiers were employed to assess classification performance. The influence of overlapping was investigated in conjunction with other factors in [11], such as the decomposition of the minority class into small sub-classes. The studies were conducted on two artificial linear datasets with more intricate nonlinear boundaries, and the findings revealed that class decomposition combined with overlapping makes learning extremely difficult. Evaluating methods for recognizing noisy samples is likewise extremely complicated [12]. The most often misclassified samples are treated as probable noise and deleted from the learning data progressively until a particular level of accuracy is reached. These approaches depend on a number of parameters as well as on the type of base classifier used. Furthermore, deleting instances may be controversial, particularly in the minority class.

3 Empirical Data Collection

In this study, we have used four open-source datasets from different medical fields. The independent variables in these datasets differ depending on the type of dataset. These are listed in Table 1. The dependent variable is a binary variable whose value is 0 if the disease is not present and 1 if the disease is present. All these datasets are imbalanced; i.e., the number of samples belonging to one of the classes is much larger than the number belonging to the other class. The amount of imbalance can be quantified using the imbalance ratio, defined here as the ratio of the size of the minority class to the size of the majority class. The details of the datasets, including the name, number of attributes, total number of samples, and imbalance ratio, are specified in Table 1.
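As a quick illustration of this measure, the imbalance ratio can be computed directly from the class counts (a minimal Python sketch; the label vector below is hypothetical, using counts similar to the thyroid training data rather than one of the actual datasets):

```python
import numpy as np

# Hypothetical binary label vector: 139 majority (0) and 22 minority (1) samples.
y = np.array([0] * 139 + [1] * 22)

counts = np.bincount(y)                        # samples per class
imbalance_ratio = counts.min() / counts.max()  # minority size / majority size
print(round(imbalance_ratio, 3))               # -> 0.158
```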

Table 1 Datasets used in the study

4 Proposed Work

The work in this paper has been carried out in three broad steps. Each of the steps has been explained in this section.

Step 1: The authors have proposed modifications to the original SMOTE [5] and NCL [7] sampling algorithms. These modified algorithms are termed SMOTEMODIFIED and NCLMODIFIED from here on. The proposed algorithms are applied to all the datasets to balance them. The algorithms are explained below.

SMOTE MODIFIED

Synthetic minority oversampling technique (SMOTE) is a technique in which new synthetic data is added to the minority class. To do this, a data point and its four neighboring points in the minority class are considered. These are connected by lines, and a point lying on one of these lines is chosen as the new data point:

$$ Z_{\mathrm{new}} = Z_{m} + \lambda \left( Z_{n} - Z_{m} \right), \qquad Z_{\mathrm{new}} \ne Z_{m} $$

where $Z_{\mathrm{new}}$ is the newly created data point, $Z_{m}$ is a data sample of the minority class, $Z_{n}$ is a data sample chosen from the 4 nearest neighbors (kNN) of $Z_{m}$ within the minority class, and $\lambda$ is an arbitrary constant lying in the range 0 to 1.
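For concreteness, here is a worked instance of the interpolation step with purely illustrative values (not drawn from the study's datasets):

$$ Z_{m} = (2, 3),\quad Z_{n} = (6, 7),\quad \lambda = 0.25 \;\Rightarrow\; Z_{\mathrm{new}} = (2, 3) + 0.25\big((6, 7) - (2, 3)\big) = (3, 4). $$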

Originally, in SMOTE, each new data point becomes part of the population right away and can immediately be chosen as a nearest neighbor. In SMOTEMODIFIED, only when the number of new points generated equals the population of the minority class are the new points added to the data and allowed to act as nearest-neighbor candidates. To do this, a data point and its four (or n) neighboring points in the minority class are considered.
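A minimal sketch of this batched generation scheme is given below (illustrative Python using NumPy and scikit-learn; the function name smote_modified, its defaults, and the single-round doubling of the minority class are our own reading of the description above, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_modified(X_min, n_neighbors=4, random_state=0):
    """One round of the modified SMOTE over the minority samples X_min.

    Synthetic points are kept in a separate buffer and merged into the
    population only once their number equals the size of the original
    minority class, so they never act as nearest neighbors while they
    are being generated.
    """
    rng = np.random.default_rng(random_state)
    synthetic = []

    # Nearest neighbors are computed on the original minority samples only.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)

    while len(synthetic) < len(X_min):
        m = rng.integers(len(X_min))
        z_m = X_min[m]
        z_n = X_min[rng.choice(idx[m][1:])]   # one of the n nearest neighbors
        lam = rng.uniform(0.0, 1.0)
        z_new = z_m + lam * (z_n - z_m)       # Z_new = Z_m + lambda * (Z_n - Z_m)
        if not np.array_equal(z_new, z_m):    # enforce Z_new != Z_m
            synthetic.append(z_new)

    # Only now do the synthetic points join the dataset.
    return np.vstack([X_min, np.asarray(synthetic)])
```

This round doubles the minority class; it would be repeated on the enlarged set until the two classes are balanced.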

NCL MODIFIED

Neighborhood cleaning rule (NCL) is a technique used to remove noisy or redundant data from the majority class. If the selected data sample (Zm) belongs to the majority class and the classification given by the 4 nearest neighbors of Zm contradicts the class of Zm, then Zm is removed from the dataset. If Zm is a member of the minority class and the classification given by its four closest neighbors conflicts with Zm, those neighbors are eliminated. In NCLMODIFIED, during the neighborhood cleaning process, a majority class sample is eliminated even when only a simple majority of its neighbors belong to the minority class.
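A minimal sketch of this cleaning rule (illustrative Python; the function name ncl_modified and the simple-majority threshold are our own, and for the minority-sample case only the offending majority-class neighbors are dropped, in line with the original NCL):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ncl_modified(X, y, majority_label=0, n_neighbors=4):
    """Neighborhood cleaning with the extra rule that a majority-class
    sample is removed whenever most of its neighbors are minority samples."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)

    to_remove = set()
    for i, neighbors in enumerate(idx):
        neigh = neighbors[1:]                 # drop the point itself
        neigh_labels = y[neigh]
        if y[i] == majority_label:
            # Modified rule: drop the majority sample if most of its
            # neighbors belong to the minority class.
            if np.sum(neigh_labels != majority_label) > n_neighbors / 2:
                to_remove.add(i)
        else:
            # Minority sample misclassified by its neighborhood: drop the
            # offending majority-class neighbors instead.
            if np.sum(neigh_labels == majority_label) > n_neighbors / 2:
                to_remove.update(int(n) for n in neigh if y[n] == majority_label)

    keep = np.array([i for i in range(len(X)) if i not in to_remove])
    return X[keep], y[keep]
```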

Step 2: The authors have used a combination of SMOTEMODIFIED and NCLMODIFIED (SMOTEMODIFIED + NCLMODIFIED) to improve the imbalance ratio.

Imbalanced datasets are first resampled by the NCLMODIFIED technique, which eliminates outlier data from the majority class. After this, the dataset is fed into the SMOTEMODIFIED algorithm, which boosts the minority class by synthesizing new data from existing data. Two data points (Zm, Zn) are chosen for generating each new minority class sample, and the distance between Zm and Zn is calculated. This operation is performed until all the existing data points are exhausted and the number of samples is balanced, without including the newly created data points as neighbors. We repeat this procedure until balanced datasets are generated.
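A minimal end-to-end sketch of this two-stage resampling, using the standard (unmodified) NCL and SMOTE implementations from imbalanced-learn as stand-ins for NCLMODIFIED and SMOTEMODIFIED, and a synthetic toy dataset in place of the medical datasets:

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset (roughly 9:1) standing in for one of the medical datasets.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Stage 1: neighborhood cleaning removes noisy majority-class samples.
X_clean, y_clean = NeighbourhoodCleaningRule(n_neighbors=4).fit_resample(X, y)

# Stage 2: SMOTE oversamples the minority class of the cleaned data until balance.
X_bal, y_bal = SMOTE(k_neighbors=4, random_state=0).fit_resample(X_clean, y_clean)
```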

Step 3: The performance of the sampling algorithms has been evaluated using two machine learning classifiers, the linear support vector classifier (linear SVC) and K-nearest neighbor (kNN). Linear SVC classifies the data by fitting a hyperplane that divides the feature space into two or more categories. In the kNN algorithm, the k nearest neighbors of a data sample are considered, and the majority class among these k samples is chosen as the classification of the sample.
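A minimal sketch of this evaluation step with scikit-learn (X_bal and y_bal continue the resampling sketch above; the 4:1 hold-out split described later is assumed, and the kNN classifier is left at its library defaults since the paper does not fix k for classification):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Hold-out split in the ratio 4:1 (train:test).
X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.2,
                                           stratify=y_bal, random_state=0)

for clf in (LinearSVC(max_iter=10000), KNeighborsClassifier()):
    clf.fit(X_tr, y_tr)                              # train on the balanced data
    print(type(clf).__name__, clf.score(X_te, y_te))
```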

5 Result Evaluation

In this section, we explain the results obtained in this study.

Analysis of the Datasets:

Table 2 summarizes the results of the dataset resampling phase of the project. The table lists the number of samples of both classes for each of the four datasets. As shown in the table, the NCLMODIFIED + SMOTEMODIFIED method produced a smaller training dataset, with fewer samples, than the SMOTEMODIFIED technique alone.

Table 2 Training results for NCLMODIFIED, SMOTEMODIFIED and NCLMODIFIED + SMOTEMODIFIED

The original dataset classification is compared with the results of the kNN and linear SVC classifications. Table 3 lists the number of samples of both classes for each of the four datasets according to the two classifiers.

Table 3 Testing results for actual classification versus KNN and linear SVC classifiers

Figure 1 shows the original “thyroid” dataset. Figures 2 and 3 show the results of individually undersampling and oversampling the original dataset using the NCLMODIFIED and SMOTEMODIFIED techniques, respectively. Finally, in Fig. 4, we have oversampled the result of undersampling by applying SMOTEMODIFIED, thus balancing the cleaned dataset. From these figures, we can observe that NCLMODIFIED has removed the samples of the majority class that are surrounded by those of the minority class, while SMOTEMODIFIED has essentially balanced the classes. Combining the techniques combines their benefits as well. Similar figures and interpretations were obtained for all the other datasets; due to space constraints, these figures are not shown.

Fig. 1 Thyroid training data: scatter plot of the original training dataset (majority class 0: 139 samples; minority class 1: 22 samples)

Fig. 2 Thyroid data after undersampling: scatter plot of the dataset undersampled using NCLMODIFIED (majority class 0: 129 samples; minority class 1: 22 samples)

Fig. 3 Thyroid data after oversampling: scatter plot of the dataset oversampled using SMOTEMODIFIED (majority class 0: 139 samples; minority class 1: 139 samples)

Fig. 4 Thyroid data after combination: scatter plot of the dataset undersampled and oversampled using NCLMODIFIED + SMOTEMODIFIED (majority class 0: 129 samples; minority class 1: 129 samples)

6 Results of Empirical Validation

The results have been evaluated using two performance metrics, recall measure (sensitivity of minority class) and geometric mean. The mathematical formulae of recall measure and geometric mean are as follows:

$$ \mathrm{Recall\ measure} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Geometric\ mean} = \sqrt{\prod_{i} \mathrm{sensitivity}_{i}} $$

where TP = true positives, FP = false positives, TN = true negatives, FN = false negatives, and $\mathrm{sensitivity}_{i}$ is the recall measure of the $i$th class.

The geometric mean measure aims to maximize the accuracy on each class while keeping these accuracies balanced. The best value is 1, and the worst value is 0. In most cases, the G-mean resolves to zero if the classifier entirely misses one or more classes. The validation used is hold-out validation, in which the training and testing data are divided in the ratio 4:1.
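Continuing the classifier sketch from Sect. 4 (the names clf, X_te, and y_te are the hypothetical ones used there), both metrics can be computed, for example, with scikit-learn and imbalanced-learn:

```python
from sklearn.metrics import recall_score
from imblearn.metrics import geometric_mean_score

y_pred = clf.predict(X_te)   # predictions on the hold-out set
print("Recall (minority class):", recall_score(y_te, y_pred, pos_label=1))
print("Geometric mean:", geometric_mean_score(y_te, y_pred))
```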

Tables 4 and 5 compare the recall and performance measures of the kNN and linear SVC classifiers. Here, we have calculated the recall measure and geometric mean score of the classification of each individual dataset. We can observe from the tables that both recall and geometric mean values have increased after the datasets were balanced using NCLMODIFIED + SMOTEMODIFIED. The percentage increase for each algorithm is shown in Table 5. We can observe that the percentage increase ranges from 2.89% to 425.42%, which is significantly high. Thus, the authors in this study promote the use of NCLMODIFIED + SMOTEMODIFIED for sampling the datasets.

Table 4 Recall measure for original datasets
Table 5 Recall measure for NCLMODIFIED + SMOTEMODIFIED technique

7 Conclusion

In this research, we used a combination of undersampling and oversampling techniques, namely SMOTE and NCL, as they are the most commonly used in the literature. We eradicated the class imbalance problem in the datasets by combining both techniques, stacking SMOTEMODIFIED on top of NCLMODIFIED. We described how the neighborhood cleaning rule (NCLMODIFIED) undersamples the datasets by combining the ENN algorithm to clean the datasets and CNN to remove redundant samples. We also described how SMOTEMODIFIED oversamples the dataset by creating new data points from samples of the original data. We evaluated the balanced datasets using two classification algorithms, namely kNN and linear SVC. We used two performance metrics to measure the effectiveness of our resampling technique, namely the recall measure and the geometric mean score. Both performance measures showed a significant percentage increase, in the range of 2.89–425.42%, when the datasets were sampled, compared to their values on the imbalanced datasets.

There are also some limitations to our approach. The method is not suitable for one-dimensional data with a medium to high level of imbalance. Since the number of minority samples is small, important data points may be ignored. The resampling time increases on larger training datasets, and some information loss may also occur.