
1 Introduction

Real-world data often present characteristics that affect classification: noise, missing values, inexact or incorrect values, inadequate data size, poor representation in data sampling, etc. The imbalanced dataset problem is a field of interest because it occurs when the number of instances that represent one class (rare events) [1] is much smaller than that of the other classes, a common problem in areas such as fraud detection, cancer gene expression, natural disasters, software defects, and risk management [2]. Rare events are difficult to detect because of their infrequency and apparent randomness, and misclassifying them often carries heavy costs. For example, in smart computer security threat detection [3], dangerous connection attempts may appear only a handful of times among hundreds of thousands of log records, but failing to identify a serious vulnerability breach would cause enormous losses.

In the case of binary-class datasets, a dataset can be defined as balanced if it has an approximately equal percentage of examples in each of the concepts to be classified, that is, if the distribution of examples across classes is uniform; otherwise, it is imbalanced. To measure the degree of imbalance of a problem, [4] defined the imbalance ratio (IR) as

$$\displaystyle \begin{aligned} IR = \frac{\left | C^{+} \right |}{\left | C^{-} \right |}\geq 1.5 \end{aligned} $$
(15.1)

where

  • C+ is the number of instances belonging to the majority class

  • C− is the number of instances belonging to the minority class
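As a quick illustration of Eq. (15.1), the IR check reduces to a ratio of class counts; a minimal R sketch with made-up numbers:

```r
# Hypothetical class counts for a binary dataset
n_majority <- 900   # |C+|
n_minority <- 100   # |C-|

ir <- n_majority / n_minority   # 9
ir >= 1.5                       # TRUE: the dataset counts as imbalanced
```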

Therefore, a dataset is imbalanced when it has a marked difference (IR ≥ 1.5) between the numbers of examples of the classes. This difference causes low predictive accuracy for the infrequent class, as classifiers try to reduce the global error without taking the distribution of the data into account. In imbalanced sets, minority-class examples are usually labeled as oddities or noise because learners focus exclusively on global measurements [5]. The problem with imbalance is not only the disproportion of representatives but also the high overlap between the classes. To overcome this problem, diverse strategies have been developed, which can be divided into four groups: at the data level [6, 7], at the learning-algorithm level [8], cost-sensitive learning [9], and based on multiple classifiers [10]. Techniques at the data level are the most used because their use is independent of the classifier that is selected.

One of the best-known algorithms among data-level techniques is the Synthetic Minority Oversampling Technique (SMOTE) [7, 11], which generates synthetic instances. One of SMOTE's shortcomings is that it generalizes the minority area without regard to the majority class, leading to a problem commonly known as overgeneralization; this has been addressed with the use of cleaning methods such as SMOTE-Tomek Links (TL) [6, 11], SMOTE-ENN [6, 11], Borderline-SMOTE1 [11, 12], SPIDER [13], SMOTE-RSB* [14], and ADASYN [6], among others. These algorithms have been designed to operate on both discrete and continuous feature values for two-class imbalance problems; most of them use k-nearest neighbors (KNN) to obtain the synthetic instances, and although this method offers good results, it does not take into account the dependency relationships between attributes, which can influence the correct classification of the examples of the minority class.
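For reference, the interpolation step at the core of SMOTE and its variants can be written in a few lines of R; this is a simplified sketch (the neighbor search is omitted, and all names are illustrative):

```r
# Minimal sketch of SMOTE's interpolation for continuous attributes.
# x and neighbor are numeric vectors: a minority-class example and one
# of its k nearest minority-class neighbors (neighbor search omitted).
smote_interpolate <- function(x, neighbor) {
  gap <- runif(1)             # random factor in [0, 1]
  x + gap * (neighbor - x)    # synthetic point on the segment x..neighbor
}

x         <- c(1.0, 2.0, 3.0)
neighbor  <- c(1.5, 1.8, 2.5)
synthetic <- smote_interpolate(x, neighbor)
```

Because the interpolation moves along the segment between two examples attribute by attribute, any dependency between attributes is ignored, which is precisely the gap the covariance-based approach targets.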

One way to obtain the dependency relationships among attributes is through Probabilistic Graphical Models (PGMs) [15], which represent joint probability distributions where nodes are random variables and arcs are conditional dependence relationships. Generally, PGMs have four fundamental components: semantics, structure, implementation, and parameters. Among PGMs are Gaussian networks, graphical interaction models for the multivariate normal distribution [16], some of which use the Covariance Matrix (CM) to analyze the relationships between variables.

This chapter proposes an algorithm based on SMOTE and Covariance Matrix estimation to balance datasets with continuous attributes and a binary class, exploiting the dependency relationships between attributes and obtaining AUC [17] values similar to those of state-of-the-art algorithms.

An experimental study was performed comparing two SMOTE-Cov variants, SMOTE-CovI (which generates new values within the interval of each attribute) and SMOTE-CovO (which allows some values to fall outside the interval of the attributes), against SMOTE, SMOTE-ENN, SMOTE-Tomek Links, Borderline-SMOTE, ADASYN, SMOTE-RSB*, and SPIDER, using seven datasets from the UCI repository [18] with different imbalance ratios and C4.5 as the classifier. The performance of the classifier was evaluated using AUC, and hypothesis-testing techniques were applied for the statistical analysis of the results, as proposed by [19, 20].

2 Oversampling Based on the Covariance Matrix

This section introduces oversampling based on the Covariance Matrix. First, we describe the Covariance Matrix, which allows the computation of dependencies between variables. Then, we give an overview of our proposed algorithm. Finally, we describe our experimental setup in four steps: tools, dataset selection, evaluation methodology, and the classifier used.

2.1 Covariance Matrix

In statistics and probability theory, the covariance matrix is a matrix whose entries are the covariances between the elements of a random vector, where each covariance measures the linear relationship between two variables. If the column-vector entries are

$$\displaystyle \begin{aligned} X = \begin{bmatrix} X_{1}\\ \vdots\\ X_{n} \end{bmatrix} \end{aligned} $$
(15.2)

then the covariance matrix Σ is the matrix whose (i, j) entry is the covariance

$$\displaystyle \begin{aligned} \Sigma_{ij} = \mathbf{E}\left [ (X_{i}-\mu _{i})(X_{j}-\mu _{j})\right ] \end{aligned} $$
(15.3)

where the operator E denotes the expected value (mean) of its argument

$$\displaystyle \begin{aligned} \mu _{i} = \mathbf{E}(X_{i}) \end{aligned} $$
(15.4)

The Covariance Matrix allows determining whether there is a dependency relationship between the variables, and it also provides the data necessary to estimate other parameters. In addition, it is the natural generalization to higher dimensions of the variance of a scalar random variable [20].
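In R, on which the implementation described in Sect. 3 relies, the sample covariance matrix of a numeric dataset is obtained with the built-in cov() function; a brief sketch with made-up data:

```r
# Sample covariance matrix of a small, made-up continuous dataset:
# 5 observations of 3 attributes.
X <- matrix(c(2.1, 3.5, 1.0,
              2.4, 3.9, 1.2,
              1.9, 3.1, 0.9,
              2.6, 4.2, 1.4,
              2.0, 3.3, 1.1),
            ncol = 3, byrow = TRUE)

S <- cov(X)   # 3 x 3 matrix; S[i, j] estimates Cov(X_i, X_j)
S
```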

2.2 SMOTE-Cov

Algorithm 1 shows the steps of SMOTE-Cov to balance datasets. During the loading of the dataset in the first step, the algorithm expects continuous-valued attributes and a binary class. Then, it uses Eq. (15.1) to verify whether the dataset is balanced or not. If it is imbalanced, the algorithm computes the Covariance Matrix, which allows the detection of the dependency relationships between attributes. From the estimated covariance matrix, new synthetic instances are generated to balance the minority class; this process stops when equilibrium between the two classes is reached. The algorithm checks that all the new values generated from the covariance lie within the interval of each attribute; if a value falls outside the interval, it is clipped to the attribute's minimum or maximum, performing a kind of repair of the value.

Algorithm 1: SMOTE-Cov steps
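A minimal R sketch of these steps is given below. It assumes the minority class is modeled by a multivariate normal fitted with that class's mean vector and covariance matrix and sampled with MASS::mvrnorm; the clipping branch mirrors the repair step of SMOTE-CovI, while skipping it corresponds to SMOTE-CovO. Function and variable names are illustrative, not the authors' implementation.

```r
library(MASS)  # for mvrnorm(); an assumed choice, not the authors' exact tooling

# Sketch of the SMOTE-Cov generation step for a continuous, binary-class
# dataset. minority: numeric matrix of minority-class instances (rows =
# examples). n_new: number of synthetic instances needed to reach balance.
# clip = TRUE reproduces the CovI repair (values forced into each
# attribute's observed interval); clip = FALSE corresponds to CovO.
smote_cov <- function(minority, n_new, clip = TRUE) {
  mu    <- colMeans(minority)   # per-attribute means
  sigma <- cov(minority)        # covariance matrix: attribute dependencies
  synth <- matrix(mvrnorm(n = n_new, mu = mu, Sigma = sigma),
                  ncol = length(mu))   # keep matrix shape even when n_new = 1
  if (clip) {
    lo <- apply(minority, 2, min)
    hi <- apply(minority, 2, max)
    for (j in seq_len(ncol(synth))) {
      # repair: clamp out-of-interval values to the attribute's min/max
      synth[, j] <- pmin(pmax(synth[, j], lo[j]), hi[j])
    }
  }
  synth
}
```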

3 Tools and Experimental Setup

The algorithm was developed using the R language because it is designed for statistical processing and provides the cov() function for calculating the covariance. In order to evaluate the behavior of the proposed algorithm, it was compared against state-of-the-art oversampling algorithms for data balancing; two variants are taken into account: one in which the generated attribute values must stay inside the interval of each attribute and one in which they may fall outside it. Seven datasets from the UCI repository were chosen with IR ≥ 1.5 (see Table 15.1), with continuous attributes and a binary class. The experiment uses fivefold cross-validation: the data are split into a training/calibration set (80%) and a test set (20%), and the final result is the mean over the five folds. The partitions were made using KEEL in such a way that the class distribution remained uniform across partitions. The partitioned datasets are available on the KEEL website [21].

Table 15.1 Description of the datasets used in the experiments
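The stratified partitioning behind the folds can be sketched in base R (the experiments used the ready-made KEEL partitions; this fragment only mirrors the idea, and all names are illustrative):

```r
# Illustrative stratified 5-fold assignment: each fold preserves the
# class ratio, as the KEEL partitions do.
stratified_folds <- function(labels, k = 5) {
  folds <- integer(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    # deal fold numbers 1..k round-robin within each class, in random order
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

set.seed(42)
y <- factor(c(rep("majority", 90), rep("minority", 10)))  # IR = 9
fold_id <- stratified_folds(y, k = 5)
table(y, fold_id)   # each fold keeps roughly the 9:1 ratio
```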

The training partitions are balanced by generating new synthetic instances of the minority class until they match the size of the majority class, while the test partition remains imbalanced and unmodified as a control. The new datasets are generated from the instances obtained with the SMOTE-Cov algorithm, and a classifier is then used as the means to measure performance against the other techniques.

The classifier used for the experimental study is C4.5 (implemented in the Weka package as J48) [22], which has been referred to as a statistical classifier, is ranked among the top 10 algorithms in Data Mining, and is widely used in imbalance problems [14].
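For readers who wish to reproduce this setup from R, Weka's J48 is accessible through the RWeka interface; a brief, illustrative round with a built-in dataset standing in for one of the chapter's partitions:

```r
library(RWeka)  # R interface to Weka; J48 is Weka's C4.5 implementation

# Built-in iris data stands in for a real training/test partition here.
set.seed(7)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))      # 80% train
model <- J48(Species ~ ., data = iris[idx, ])      # fit a C4.5 tree
pred  <- predict(model, newdata = iris[-idx, ])    # classify held-out 20%
table(observed = iris[-idx, "Species"], predicted = pred)
```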

The Area Under the Curve (AUC) (15.5) is used to measure the performance of classifiers over imbalanced datasets, based on the graph of the Receiver Operating Characteristic (ROC) [17]. In these graphs, the trade-off between benefit (TPrate) and cost (FPrate) can be visualized, reflecting the fact that no classifier can increase the number of true positives without also increasing the false positives. The AUC summarizes the performance of the learning algorithm in a single number.

$$\displaystyle \begin{aligned} AUC = \frac{1+TPrate-FPrate}{2} \end{aligned} $$
(15.5)

where

  • TPrate is the proportion of positive cases correctly classified as positive

  • FPrate is the proportion of negative cases misclassified as positive
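As a worked illustration of Eq. (15.5), the AUC can be computed directly from a confusion matrix; a small R sketch with made-up counts:

```r
# Hypothetical confusion matrix for the minority (positive) class
tp <- 18; fn <- 2    # positives: correctly / incorrectly classified
fp <- 10; tn <- 70   # negatives: misclassified as positive / correct

tp_rate <- tp / (tp + fn)            # 0.90
fp_rate <- fp / (fp + tn)            # 0.125
auc <- (1 + tp_rate - fp_rate) / 2   # 0.8875
```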

3.1 Experimental Study

The AUC values are studied on the already balanced datasets. Table 15.2 shows that the AUC results of the data-balancing algorithm applying the Covariance Matrix, in its CovI and CovO variants, are similar or comparable to those of the state-of-the-art oversampling algorithms, using C4.5 as the classifier.

Table 15.2 AUC of the state-of-the-art oversampling data-balancing algorithms and of the CovI and CovO variants

For the statistical analysis of the results, hypothesis-testing techniques were used [19, 20]. The Friedman and Iman-Davenport tests [23] were applied in order to detect statistically significant differences between groups of results, and Holm's test [24] was carried out with the aim of identifying significantly superior algorithms. These tests are suggested in the studies presented in [19, 20, 23], where it is stated that their use is highly recommended for the validation of results in the field of machine learning. Table 15.3 shows the ranking obtained with the Friedman test for the experiment. Although the algorithm with the best ranking was ADASYN, Holm's test, reported below, determines to what extent this algorithm is significantly superior to the one proposed in this research.

Table 15.3 Friedman’s test
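The Friedman test and the Holm correction are both available in base R, which makes this analysis easy to replicate; a sketch with placeholder values (not the chapter's results):

```r
# Hypothetical AUC matrix: rows = datasets (blocks), columns = algorithms
# (groups). Placeholder values, not the chapter's reported results.
set.seed(1)
auc <- matrix(runif(7 * 3, 0.70, 0.95), nrow = 7,
              dimnames = list(paste0("dataset", 1:7),
                              c("ADASYN", "SMOTE-CovI", "SMOTE-CovO")))

friedman.test(auc)   # Friedman rank test across algorithms over datasets

# Holm correction of pairwise p-values against a control method
raw_p <- c("SMOTE-CovI" = 0.01, "SMOTE-CovO" = 0.20)   # illustrative
p.adjust(raw_p, method = "holm")
```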

Table 15.4 summarizes the results of Holm's test, taking ADASYN as the control method. All hypotheses with p-values ≤ 0.05 are rejected, showing that ADASYN is significantly superior to the SMOTE-CovI and Borderline-SMOTE algorithms. In the case of SMOTE-CovO, SPIDER, SMOTE-ENN, SMOTE-TL, the original data, SMOTE-RSB*, and SMOTE, the null hypothesis is accepted, which means that there are no significant differences between ADASYN and them, so it can be concluded that they are equally effective.

Table 15.4 Holm's test with α = 0.05, taking ADASYN as the control method

4 Conclusions and Future Work

In this chapter, a new algorithm is proposed to generate synthetic instances of the minority class using the Covariance Matrix. The experimental study carried out shows the effectiveness of the proposed algorithm compared to eight recognized state-of-the-art algorithms. SMOTE-Cov showed similar or comparable results, taking into account the AUC values of the C4.5 classifier and using nonparametric tests to demonstrate that there are no significant differences between them, with the exception of ADASYN versus the SMOTE-CovI variant. The latter may be influenced by the fact that the generated attribute values are drawn from intervals other than those actually observed for the attributes in the dataset.

Since the results are comparable to those of the state-of-the-art, these datasets allow extending the experimentation in the future to datasets with tens, hundreds, or thousands of attributes and with strong dependency relationships. We also intend to use covariance regularization (shrinkage) to balance data in cases where the number of positive instances is smaller than the number of attributes.