
4.1 Introduction

Binary classification is a problem that involves two classes in a dataset. For a machine learning model to produce accurate outcomes, the two classes must be equally represented; otherwise the model becomes biased towards the majority class [1,2,3]. Oversampling and under-sampling are therefore two techniques used to balance the class distribution of a dataset whose class distribution is skewed [1,2,3,4]. In this study, the original shape of the dataset represents the true context of the problem, which means that oversampling of the minority class and down-sampling of the majority class must be done in a way that preserves this original shape.

In general, randomly selecting data observations from a population that contains duplicates can produce a sample that also contains duplicates. The implication is that random sampling of a majority class with duplicate observations will carry those duplicates into the sample. Random sampling of the majority class alone may therefore not solve the problem of a skewed class distribution in a binary classification problem. This chapter therefore suggests that, when down-sampling the majority class, random sampling should be performed only after duplicate data instances have been removed from the majority class.
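As an illustration, this down-sampling step can be sketched as follows. The sketch assumes a pandas DataFrame with a hypothetical Class column in which 0 marks the majority (non-fraudulent) class; it is not the exact implementation used in this chapter.

```python
import pandas as pd

def downsample_majority(df, class_col="Class", majority_label=0, n_samples=1000, seed=42):
    """Remove duplicate majority-class rows, then randomly sample from what remains."""
    majority = df[df[class_col] == majority_label].drop_duplicates()
    minority = df[df[class_col] != majority_label]
    sampled_majority = majority.sample(n=min(n_samples, len(majority)), random_state=seed)
    # Recombine the sampled majority with the untouched minority class
    return pd.concat([sampled_majority, minority], ignore_index=True)
```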

For a highly skewed distribution, oversampling of the minority class in a way that distorts the original shape of the minority class distribution must be avoided. Oversampling should introduce new, plausible instances of the minority class rather than duplicating the existing ones. SMOTE is an oversampling method known to generate new (synthetic) instances of the minority class rather than duplicating the existing data instances [5].

SMOTE oversamples the minority class by interpolating between a minority feature vector and one of its nearest neighbours, found by Euclidean distance in feature space [5]. Duplicating minority training samples to reduce the bias of the data distribution introduces high variance, since the classifier overfits to the replicated points. SMOTE is therefore an important oversampling method because it reduces this variance: the synthetic instances lead a classifier to create larger, less specific decision regions rather than smaller, more specific ones.
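In symbols, the interpolation described in [5] can be written as

\[ x_{\mathrm{new}} = x + \delta \cdot \left( x_{nn} - x \right), \qquad \delta \in [0, 1], \]

where \( x \) is the minority instance of interest, \( x_{nn} \) is one of its K nearest minority neighbours, and \( \delta \) (the gap) is drawn uniformly at random from the interval [0, 1].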

Furthermore, positive instances are better learned in general regions than when they are subsumed among negative instances. The SMOTE method, however, suffers from a generalization problem: the minority class region is blindly generalized without considering the majority class [6, 7]. This problem is particularly visible for highly skewed class distributions, since the minority class is thinly scattered relative to the majority class, so the probability of class mixture (overlap) is very high. To keep the SMOTE method efficient and effective, an improvement of the algorithm is required.

Borderline SMOTE is an oversampling method designed to improve the performance of the SMOTE method [1, 6]. The improvement comes from separating the positive instances into three regions, namely borderline \( \left(\frac{1}{2}K\le n<K\right) \), noise \( \left(n=K\right) \) and safe \( \left(0\le n<\frac{1}{2}K\right) \) [1, 6], where n is the number of negative instances among the K nearest neighbours of a positive instance. Borderline SMOTE then uses the same oversampling (interpolation) method as SMOTE.
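A minimal sketch of this region assignment is given below, assuming scikit-learn is available and using hypothetical arrays X (features) and y (labels, with 1 marking the minority class); it illustrates the rule above rather than reproducing the published algorithm.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def borderline_regions(X, y, K=5):
    """Label each minority instance as 'safe', 'borderline' or 'noise'
    from the number of majority neighbours among its K nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    regions = {}
    for i in np.where(y == 1)[0]:
        # Skip the first neighbour, which is the instance itself
        neighbours = nn.kneighbors(X[i].reshape(1, -1), return_distance=False)[0][1:]
        n_majority = np.sum(y[neighbours] == 0)
        if n_majority == K:
            regions[i] = "noise"
        elif n_majority >= K / 2:
            regions[i] = "borderline"
        else:
            regions[i] = "safe"
    return regions
```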

Unlike SMOTE, Borderline SMOTE oversamples only the borderline instances of the minority class rather than all of its instances. A weakness of this separation is that two instances whose neighbourhoods differ by a single negative neighbour (n = K − 1 versus n = K) are practically indistinguishable, yet they fall into different regions (borderline and noise), so that one is selected for oversampling while the other is rejected.

Another method is Safe-Level SMOTE, an oversampling method that generates synthetic instances of the minority class at safe levels [1, 6, 7]. The safe level of an instance is defined as the number of positive instances among its K nearest neighbours [6]: an instance whose safe level is close to 0 is nearly noise, while an instance whose safe level is close to K is considered safe. Synthetic instances are positioned closer to the instance with the larger safe level. Safe-Level SMOTE is therefore considered a promising algorithm with a positive impact in this study.
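For illustration, the safe-level computation can be sketched as follows, under the same assumptions as the previous snippet (hypothetical X and y, scikit-learn available):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def safe_level(X, y, i, K=5):
    """Return the number of positive (minority) instances among the
    K nearest neighbours of instance i, i.e. its safe level."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    neighbours = nn.kneighbors(X[i].reshape(1, -1), return_distance=False)[0][1:]
    return int(np.sum(y[neighbours] == 1))
```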

The rest of the chapter is organized as follows. Section 4.2 presents the research questions addressed by this work and their objectives. Section 4.3 presents the research design and the modified SMOTE methods. Section 4.4 presents the simulation results, and Sect. 4.5 concludes the chapter.

4.2 Research Questions and Objectives

4.2.1 Research Questions

Since the banking sector is migrating from data analytics to data insights in South Africa [8], the study therefore sets out to explore the following main research question: Will the oversampling of the minority class that generates new data instances at the safe level of the minority class and under-sampling of the majority class that randomly selects non-duplicate data instances preserve the nature of the dataset and the original context of the problem for credit card fraud data when dealing with highly skewed class distributions? This main research question is broken down into the following sub-questions:

  (a) Does the Safe-Level SMOTE oversampling method (on the minority class), used with an under-sampling method that eliminates duplicate data samples from the majority class, have a greater positive impact on reducing the high skewness of the class distribution than the SMOTE oversampling method used with the same under-sampling method?

  (b) Do the Safe-Level SMOTE oversampling method and the under-sampling method (that eliminates duplicate data samples from the majority class) reduce or eliminate the problem of overlapping data samples between the fraudulent and non-fraudulent classes?

This study therefore sought to answer these questions by carrying out the objectives presented in the next section.

4.2.2 Research Objectives

The main objective of this study is to obtain a balanced class distribution of credit card fraud data that preserves the nature of the dataset and the original context of the problem. The main objective is further broken down into two sub-objectives presented below:

  (a) To determine whether the modified Safe-Level SMOTE oversampling method used with the under-sampling method has a greater positive impact on reducing the high skewness of the class distribution than the modified SMOTE oversampling method used with the under-sampling method. This objective will be implemented by developing and running a modified SMOTE algorithm and a modified Safe-Level SMOTE algorithm on the minority class data, and an under-sampling algorithm on the majority class data.

  (b) To investigate whether the modified Safe-Level SMOTE oversampling method and the under-sampling method reduce or eliminate the problem of overlapping data samples between the fraudulent and non-fraudulent classes. This objective will be implemented by running the modified Safe-Level SMOTE on the minority class data and the under-sampling algorithm on the majority class data.

The next section discusses the research design followed to achieve the research objectives of this study.

4.3 Research Design

The modified SMOTE method in this study is compared to the SMOTE method, and the modified Safe-Level SMOTE method is compared to Safe-Level SMOTE. The modified SMOTE and the modified Safe-Level SMOTE were used to oversample the minority class of the dataset, together with a down-sampling method that removes duplicates and then randomly selects non-duplicate data samples from the majority class, so that the decision boundary is placed where fraudulent and non-fraudulent transactions are represented equally. A dataset with equal representation of fraudulent and non-fraudulent transactions makes it easier for a classification model to learn both classes well. Thus, the modified SMOTE and modified Safe-Level SMOTE were tested by running Artificial Neural Network, Support Vector Machine, Naïve Bayesian and k-Nearest Neighbours algorithms on the resampled data.
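For orientation, this classifier comparison can be sketched as follows; the sketch is an illustrative outline assuming scikit-learn and hypothetical resampled arrays X_res and y_res, not the study's exact code. The resampling algorithms themselves are presented next.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def evaluate_classifiers(X_res, y_res, seed=42):
    """Train the four classifiers used in this chapter on the resampled data
    and report their accuracy on a held-out test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X_res, y_res, test_size=0.3, random_state=seed, stratify=y_res)
    models = {
        "Artificial Neural Network": MLPClassifier(max_iter=500, random_state=seed),
        "Support Vector Machine": SVC(random_state=seed),
        "Naïve Bayesian": GaussianNB(),
        "k-Nearest Neighbours": KNeighborsClassifier(),
    }
    return {name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in models.items()}
```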

Algorithm 1: SMOTE (figure a)

While the objective of the SMOTE algorithm is to generate a new data instance between two existing data instances of the minority class [5], this chapter argues that the logic SMOTE uses to generate new data instances does not always place the new instance between the two given instances.

In the attribute for loop of SMOTE Algorithm 1 above, the algorithm iterates over the attributes of the dataset. For each attribute, a synthetic value is generated by taking the difference between the attribute value of the data instance of interest and that of its chosen nearest neighbour. The difference is then multiplied by the gap, a random value chosen between 0 and 1, and the product is added to the attribute value of the data instance of interest.
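A minimal sketch of this attribute loop, written as described above with hypothetical names (instance and neighbour are one-dimensional NumPy arrays), is given below; it illustrates the description rather than the chapter's exact implementation.

```python
import numpy as np

def smote_attribute_loop(instance, neighbour, rng=np.random.default_rng()):
    """Generate one synthetic instance from a minority instance and one of
    its nearest neighbours, attribute by attribute."""
    synthetic = np.empty_like(instance, dtype=float)
    for attr in range(len(instance)):
        diff = neighbour[attr] - instance[attr]   # difference of attribute values
        gap = rng.uniform(0, 1)                   # random gap in [0, 1]
        synthetic[attr] = instance[attr] + gap * diff
    return synthetic
```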

Although this method works well, its objective has limitations that arise from the logic behind the mathematics of the attribute loop discussed above. This chapter argues that if the two data instances have attribute values of opposite signs, the SMOTE algorithm will not generate a new value that lies between them, which violates the objective of the SMOTE algorithm: with opposite signs, the difference in the attribute loop effectively becomes an addition of magnitudes.

Below is a demonstration of this claim using the attribute loop of SMOTE, which is where the new data instance is generated between two data points. Figure 4.1 shows the function implementing the attribute loop of SMOTE.

Fig. 4.1 SMOTE function

The following random values will be used to test the claim made in this study: instances = [[−10, −21, −4, −45, −66, −93, 1], [10, 21, 4, 45, 66, 93, 1]]. To visualize the output, x-axis values are required; the standard numbering from 1 to 7 is used for the attribute positions for demonstration purposes. Figure 4.2 is the pictorial representation of the claim.

Fig. 4.2 SMOTE demonstration

Figure 4.2 shows that the SMOTE instance is generated outside the boundaries of data instance 1 and data instance 2, which means that the claim made in this study about the SMOTE method is correct. The larger the gap between the values of data instance 1 and data instance 2, the further the generated value falls from both of them. To solve this problem, the difference in the attribute loop of the SMOTE method must be modified to handle the issue of opposite signs.

In this chapter, we propose generating each new value uniformly at random between the two corresponding values of data instance 1 and data instance 2, which ensures that the newly generated values always lie within their boundary. Figure 4.3 shows the modified function of the attribute loop of SMOTE.

Fig. 4.3 SMOTE function (modified)
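A sketch of this modified attribute loop is shown below, using the same hypothetical names as the earlier SMOTE sketch; the key change is that each attribute value is drawn uniformly between the two corresponding parent values, regardless of their signs.

```python
import numpy as np

def modified_smote_attribute_loop(instance, neighbour, rng=np.random.default_rng()):
    """Generate one synthetic instance whose attribute values always lie
    between the corresponding values of the two parent instances."""
    synthetic = np.empty_like(instance, dtype=float)
    for attr in range(len(instance)):
        low = min(instance[attr], neighbour[attr])
        high = max(instance[attr], neighbour[attr])
        synthetic[attr] = rng.uniform(low, high)   # value bounded by the two parents
    return synthetic
```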

The function uses the same values as the above SMOTE function to maintain consistency. Figure 4.4 is the pictorial representation of the claim made by this study that newly generated random values will lie within the boundary of data instance 1 and data instance 2 regardless of the signs.

Fig. 4.4 Modified SMOTE demonstration

Figure 4.4 shows that the SMOTE instance is generated within the boundaries of data instance 1 and data instance 2, which means that the claim made in this study about the modified approach is correct. Thus, below is the modified SMOTE algorithm, which ensures that newly generated data values lie within the boundaries of data instance 1 and data instance 2.

Modified SMOTE algorithm (figure b)

By generating data instances within the boundaries of data instance 1 and data instance 2, we achieve the objective of the original SMOTE method. Thus, the comparison of the SMOTE and Safe-Level SMOTE algorithms has merit. Below is the definition of the Safe-Level SMOTE algorithm.

Safe-Level SMOTE algorithm (figure c)

From the above algorithm, it can be seen that the new data instances are generated following the same logic as the SMOTE method, except that the new values are generated at the safe level. Again, to preserve the primary objective of the SMOTE algorithm, the new data values of Safe-Level SMOTE must be generated between the data instances at the safe level. Below is the modified Safe-Level SMOTE algorithm.

Modified Safe-Level SMOTE algorithm (figure d)
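As an illustration only, one possible reading of the modified safe-level generation step is sketched below: synthetic values are drawn uniformly between the two parent values, and an instance is used for generation only when its safe level is non-zero. The names, thresholding choice and neighbour selection are assumptions for the sketch, not the chapter's exact algorithm.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def modified_safe_level_smote(X, y, K=5, rng=np.random.default_rng()):
    """Sketch: for each minority instance with a non-zero safe level, draw a
    synthetic instance uniformly between it and a random minority neighbour."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    synthetic = []
    for i in np.where(y == 1)[0]:
        neighbours = nn.kneighbors(X[i].reshape(1, -1), return_distance=False)[0][1:]
        positives = [j for j in neighbours if y[j] == 1]
        if not positives:          # safe level of zero: treated as noise, skipped
            continue
        j = rng.choice(positives)
        low = np.minimum(X[i], X[j])
        high = np.maximum(X[i], X[j])
        synthetic.append(rng.uniform(low, high))
    return np.array(synthetic)
```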

The demonstration of how new data values are generated was given for the SMOTE algorithm above. In the simulation results section below, the study compares the performance of the SMOTE and Safe-Level SMOTE algorithms.

4.4 Simulation Results

This section compares the performance of the SMOTE and Safe-Level SMOTE algorithms, and of the modified SMOTE and modified Safe-Level SMOTE algorithms, by supplying machine learning algorithms with a binary classification dataset whose majority class is down-sampled by removing duplicate data samples and then randomly selecting data observations. The machine learning algorithms chosen in this study are Artificial Neural Network, Support Vector Machine, Naïve Bayesian and k-Nearest Neighbours. Table 4.1 shows the performance of the SMOTE algorithms in relation to the chosen machine learning algorithms.

Table 4.1 Performance comparison of the studied algorithms

The results demonstrate that oversampling of the minority class plays an important role, as shown by the high accuracy of all machine learning models. The minority class was oversampled by generating new instances, while the majority class was down-sampled by removing duplicate data samples and then randomly selecting data observations.

The objective of this chapter was to modify the SMOTE and Safe-Level SMOTE algorithms and to compare the performance of the original and modified algorithms. The results show that the Safe-Level SMOTE algorithm performs better than the SMOTE algorithm on credit card fraud data. Furthermore, the results show that the modified SMOTE algorithm performs better than the modified Safe-Level SMOTE algorithm.

Model Evaluation

The confusion matrix is the evaluation technique used to evaluate the performance of the classification models. It reports the counts of true-positive, false-positive, false-negative and true-negative transactions. Smaller percentages of false positives and false negatives indicate that a model is performing well. Tables 4.2, 4.3, 4.4, and 4.5 show the output of the confusion matrix evaluation for each classification model.

Table 4.2 Confusion matrix: true positives
Table 4.3 Confusion matrix: false positives
Table 4.4 Confusion matrix: false negatives
Table 4.5 Confusion matrix: true negatives
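For reference, a confusion matrix of this kind can be computed with scikit-learn as sketched below; the label arrays shown are hypothetical and stand in for a model's true and predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = fraudulent, 0 = non-fraudulent)
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```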

The false-negative and false-positive tables show very small values. These small values indicate that the classification models, combined with the oversampling techniques, are performing very well.

4.5 Conclusion

In conclusion, both SMOTE and Safe-Level SMOTE are powerful oversampling techniques when combined with majority-class down-sampling that removes duplicate data samples and then randomly selects data observations. Given that the SMOTE and Safe-Level SMOTE algorithms were modified in this study, and that the results show the modified SMOTE performing better than the modified Safe-Level SMOTE, this study concludes that an incorrect computation within an algorithm can cloud the algorithm's true computing capabilities.