
1 Introduction

In the real world, there are many problems for which the available data are class imbalanced. Class imbalance means that the numbers of instances in the different classes are not equally distributed; in such datasets, the distribution of the data in the feature space is usually skewed [1]. Imbalanced data have to be dealt with in numerous practical applications, such as text classification, medical diagnosis, financial decision making, network intrusion detection, and others.

When the available dataset has a skewed distribution, one also faces class overlap, small sample size, and small disjuncts. From the point of view of the traditional machine learning process, examples in the majority class have a greater influence on the induced classifier, biasing its predictions in favor of the majority class. Conventional machine learning algorithms, which focus on optimizing the overall classification accuracy, tend in class-imbalanced situations to achieve poor classification performance, especially for the minority class, which might be of special interest to the user [2, 3]. This is due to the fact that classifiers try, in the first place, to correctly predict the majority class labels. The minority class might thus be ignored in favor of the majority one. However, in many real-life problems the correct minority predictions are crucial, as for example in medical diagnostics or software fault prediction. In the machine learning literature, class imbalance is recognized as one of the most challenging problems in data mining [4].

Research work in the field of imbalanced data classification has resulted in the development of numerous approaches and algorithms. Among them, the most successful are the bagging ensemble methods, cost-sensitive methods, approaches based on sampling techniques [6], and methods working at the so-called data level [7]. The data level methods aim to transform the original data into a more balanced dataset, thus reducing the imbalance ratio between the majority and minority classes.

Data level techniques transform the dataset at the pre-processing stage [9]; the resulting dataset can then be further processed using standard machine learning methods. In the literature, three main types of data level approaches have been proposed so far: over-sampling, under-sampling, and a hybrid of over- and under-sampling [5].

The aim of under-sampling techniques is to balance the class distribution. In practice, under-sampling techniques simply remove instances from the majority class. The effectiveness of this approach depends, however, on the rules for instance removal that have been implemented [9]. Many methods belonging to this group are based on clustering and instance selection and are referred to as cluster-based sampling (see for example [19, 20]). These methods group instances from the majority class into clusters; next, some instances representing each cluster are selected. The main drawbacks of cluster-based sampling include the difficulty of deciding on the optimal number of clusters and the lack of rules telling which instances should be selected as cluster representatives [9].
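To make the cluster-based idea concrete, the following minimal sketch illustrates generic cluster-based under-sampling with k-means, keeping one centroid-nearest representative per cluster. It is an illustration of the general scheme only, not of any specific method from [19, 20]; the number of clusters `n_clusters` is a user-set parameter, which is precisely the first drawback noted above.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_undersample(X_majority, n_clusters):
    """Generic cluster-based under-sampling sketch: group the majority
    class with k-means and keep one representative per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_majority)
    representatives = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        # keep the member closest to the centroid as the cluster representative
        dists = np.linalg.norm(X_majority[members] - km.cluster_centers_[c], axis=1)
        representatives.append(members[dists.argmin()])
    return X_majority[representatives]
```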

An improved approach to under-sampling based on clustering has been proposed in [12]. The algorithm reduces the number of instances in the majority classes by grouping instances and then selecting only the representative ones. As a result, the majority class set of instances is reduced and the required balance between the minority and majority class instances is achieved. The clustering algorithm is based on the similarity coefficient calculations originally proposed in [13]. The instance selection is carried out using an agent-based population learning implementation. The main feature of the approach is that the number of clusters from which the instances are selected is determined, and the instance selection performed, automatically. In this approach, the earlier described drawbacks of cluster-based sampling are reduced or, at least partly, eliminated.

Over-sampling techniques deal with expanding the number of minority class instances to balance the class distribution. The approach includes techniques for the artificial synthesis of the required number of new examples in the minority class. The simplest of the discussed approaches involve a plain duplication of the minority class instances. More advanced approaches duplicate or synthesize instances from areas deemed most promising [10].

Random over-sampling algorithms duplicate minority class instances until the balance between minority and majority class instances is achieved. Their simplicity makes such an approach practicable. There is, however, a drawback: it is not clear which instances from the minority class should be duplicated. This makes looking for a more effective way of duplicating the minority class instances, while avoiding the problems of random duplication, worth a research effort.

SMOTE is the most popular over-sampling method, originally proposed to improve on random over-sampling [11]. SMOTE interpolates between existing instances to generate new ones. However, the algorithm is not free from weaknesses. One of them is the assumption that all minority class instances are of equal importance. From a practical point of view, it means that each instance of the minority class can be chosen for over-sampling with uniform probability, and the interpolation may include instances which do not provide any useful information for identifying the boundaries between classes.
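The core interpolation step of SMOTE can be sketched as follows (a minimal reading of [11], assuming numeric features; `k` is the usual number-of-neighbours parameter). Note that the base instance is drawn with uniform probability, which is exactly the equal-importance assumption criticized above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, n_new, k=5, seed=None):
    """Generate n_new synthetic instances by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours, because each instance is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neigh = nn.kneighbors(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))        # minority instance, chosen uniformly
        j = neigh[i][rng.integers(1, k + 1)]     # one of its k nearest neighbours
        gap = rng.random()                       # random point on the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)
```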

To eliminate SMOTE’s disadvantages, numerous extensions of the basic algorithm have been proposed so far. Some of these algorithms aim to emphasize certain minority class regions, others intend to reduce the within-class imbalance or attempt to avoid the generation of noise [10]. Example approaches include Borderline-SMOTE [14], Safe-Level-SMOTE [18], Cluster-SMOTE [16], CURE-SMOTE [15], k-means SMOTE [10] and others (see for example [10] and [17]).

In this paper, a hybrid algorithm for imbalanced data learning is proposed. Its main idea is to balance the minority and majority classes by over-sampling and instance selection. The instance selection is carried out in the majority classes and is based on clustering, with the similarity coefficient serving as the criterion for grouping instances. Clustering is carried out independently for each of the considered classes except the minority one, and prototypes are then selected from the induced clusters. The process of instance selection is integrated with the learning phase executed by a team of agents. In some cases, achieving a balanced distribution of instances between classes additionally requires over-sampling; however, over-sampling is run only when instance selection alone does not assure the required balance.

The paper is organized as follows. Section 2 contains the problem formulation and a detailed description of the proposed method. Section 3 provides details of the computational experiment setup and discusses the experiment results. Conclusions and suggestions for future research are included in the final section.

2 An Approach to Imbalanced Learning

In this section, the imbalanced data classification problem is formulated and the details of the proposed approach are discussed.

2.1 Problem Formulation

The aim of learning from data is to output the hypothesis \( h \in H \) optimizing performance criterion F using dataset D, where D is the multiclass dataset \( D = D_{1} \cup D_{2} \cup \ldots \cup D_{d} \) and d is the number of different classes.

In the case of an imbalanced training set, \( D_{minority} \) is the subset of D which contains the minority class instances. It is assumed that the cardinality of \( D_{minority} \) is definitely smaller than the cardinality of each of the remaining subsets of D representing the remaining classes. Among these remaining subsets, there is the majority class subset containing the majority class instances. Data level methods aim at transforming an imbalanced dataset into a better-balanced one by reducing the imbalance ratio between the majority and minority classes. The reduction can be carried out by over-sampling or under-sampling.

The data level approach involves two stages. First, the cardinalities of all classes, including the minority class, are identified. Next, an instance selection process is carried out, aiming at reducing the cardinalities of all datasets representing classes other than the minority one. Ideally, the reduction should produce datasets with cardinalities not exceeding the cardinality of the dataset representing the minority class. Formally, the number of instances in each subset \( D_{i} \), \( i \in \left\{ {1, \ldots ,d} \right\}\backslash \left\{ {minority} \right\} \), is reduced, and the resulting subsets are denoted \( S_{i} \), where \( \forall_{{i \in \left\{ {1, \ldots ,d} \right\}\backslash \left\{ {minority} \right\}}} \;S_{i} \subset D_{i} \).
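The first stage can be summarized by the following sketch (illustrative only; `undersample` stands in for the agent-based instance selection described in Sect. 2.2):

```python
def balance_by_selection(instances_by_class, undersample):
    """Stage one of the data-level scheme: identify class cardinalities
    and reduce every non-minority subset D_i towards |D_minority|."""
    sizes = {c: len(xs) for c, xs in instances_by_class.items()}
    minority = min(sizes, key=sizes.get)       # class with the fewest instances
    target = sizes[minority]
    reduced = {minority: list(instances_by_class[minority])}
    for c, xs in instances_by_class.items():
        if c != minority:
            # ideally |S_i| should not exceed |D_minority|
            reduced[c] = undersample(xs, target)
    return reduced, minority
```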

When the reduction process cannot guarantee the required balance, i.e. \( \exists_{i} \left| {S_{i} } \right| > \left| {D_{minority} } \right| \), the second stage has to be entered, with an over-sampling process run on \( D_{minority} \). The process produces a dataset, denoted \( S_{minority} \), obtained by duplication of instances from \( D_{minority} \). In the ideal case the following holds:

$$ \forall_{{i \in \left\{ {1, \ldots ,d} \right\}\backslash \left\{ {minority} \right\}}} \left| {S_{i} } \right| \cong \left| {S_{minority} } \right|\quad \text{and} $$
(1)
$$ \forall_{{i \in \left\{ {1, \ldots ,d} \right\}}} \left| {S_{i} } \right| < \left| {D_{i} } \right|\quad \text{and} $$
(2)
$$ \left| {\bigcup\nolimits_{i = 1}^{d} S_{i} } \right| < \left| D \right| $$
(3)

In the case of imbalanced data, once the over- and under-sampling processes have been carried out, the task of the learner L is to output the hypothesis \( h \in H \) optimizing performance criterion F using the datasets \( S_{1} , \ldots ,S_{d} \), which are subsets of D containing instances obtained by the over- and under-sampling processes and for which condition (1) is satisfied.

2.2 The Proposed Approach

After the instance selection has been carried out with respect to the subsets representing all classes except the minority one, we suggest the following procedure:

  • when the number of considered classes in D is equal to 2 (i.e. d = 2), then the over-sampling process is run on the minority class subset,

  • when the number of considered classes in D is greater than 2 (i.e. d > 2), then, at first, the reduced subset containing the maximum number of instances is identified, and the over-sampling procedure is run on all remaining subsets (a sketch of this two-case procedure is given below).
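The following sketch illustrates the two cases listed above, under the assumption that `oversample` is a hypothetical stand-in for the instance-generation routine described later in this section:

```python
def second_stage_oversampling(reduced_by_class, minority, oversample):
    """Sketch of the two cases above: for d = 2 only the minority subset
    is over-sampled; for d > 2 all subsets smaller than the largest
    reduced subset are over-sampled up to its size."""
    sizes = {c: len(xs) for c, xs in reduced_by_class.items()}
    target = max(sizes.values())               # size of the largest reduced subset
    if len(sizes) == 2:
        reduced_by_class[minority] = oversample(reduced_by_class[minority], target)
    else:
        for c, xs in reduced_by_class.items():
            if len(xs) < target:
                reduced_by_class[c] = oversample(xs, target)
    return reduced_by_class
```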

In all cases, under-sampling is based on the instance selection approach, where instances are selected from clusters grouping similar instances under the umbrella of the instance selection procedure. Clustering is carried out independently for each of the considered classes except the minority one. It is also assumed that only a single instance, serving as a reference instance, is selected from each cluster. Thus, the number of clusters produced at the clustering stage has a direct influence on the size of the reduced dataset. Reference instances are selected from the clusters during the learning process executed by the team of agents, as described in detail in [8], and together they form the reduced dataset.

The similarity-based clustering algorithm (SCA) produces clusters for which the similarity coefficients are calculated as shown in [8]. The SCA induces clusters with an identical similarity coefficient, and the number of clusters is determined by the value of this coefficient computed across all instances belonging to the considered class. Clusters are thus initialized automatically, without any user intervention.
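Since the SCA itself is specified in [8], the following is only a generic illustration of threshold-driven similarity clustering, with an assumed inverse-distance similarity; it shows how the number of clusters can follow from a similarity coefficient rather than from a user-set k:

```python
import numpy as np

def similarity_threshold_clustering(X, threshold):
    """Illustrative threshold-based clustering (not the SCA of [8]):
    an instance joins the first cluster whose seed it is sufficiently
    similar to; otherwise it seeds a new cluster."""
    def similarity(a, b):
        # assumed similarity measure: inverse of the Euclidean distance
        return 1.0 / (1.0 + np.linalg.norm(a - b))

    seeds, clusters = [], []
    for i, x in enumerate(X):
        for s, members in zip(seeds, clusters):
            if similarity(x, s) >= threshold:
                members.append(i)
                break
        else:                                   # no sufficiently similar seed found
            seeds.append(x)
            clusters.append([i])
    return clusters
```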

In the paper, a population-based metaheuristic known as the population learning algorithm (PLA), originally proposed in [21], has been applied for instance selection. The population learning algorithm is an implementation of a set of agents and different optimization procedures executed by the agents within an asynchronous team of agents (A-Team). These agents cooperate and exchange information. Agents working in the A-Team achieve implicit cooperation by sharing the population of current solutions to the problem being solved. The A-Team can also be defined as a set of agents and a set of memories, forming a network in which every agent remains in a closed loop [22]. The framework for agent-based instance selection has been adopted from earlier papers of the authors, including [12, 23,24,25] and [13]. In [13], cluster-based instance selection has been proposed as a tool for under-sampling.

In the case when the instance reduction in the majority class datasets does not guarantee the required balance, that is \( \exists_{i} \left| {S_{i} } \right| > \left| {D_{minority} } \right| \), the over-sampling procedure is activated on all subsets of instances representing classes other than the majority one. The over-sampling procedure starts with identifying, for each two closest clusters from the same class, their neighbors. The closeness of neighbors is measured using the Euclidean distance. The number of neighbors k is a parameter of the approach and should be set by the user.
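The precise generation procedure is given below as Algorithm 1; the following sketch reflects only one assumed reading of the description above, creating new instances on segments between Euclidean-nearest neighbour pairs drawn from two closest clusters of the same class:

```python
import numpy as np

def generate_between_clusters(cluster_a, cluster_b, k, n_new, seed=None):
    """Create n_new artificial instances between two closest clusters of
    the same class (an assumed illustration, not the paper's Algorithm 1)."""
    rng = np.random.default_rng(seed)
    new_instances = []
    for _ in range(n_new):
        a = cluster_a[rng.integers(len(cluster_a))]
        # k Euclidean-nearest neighbours of a within the other cluster
        dists = np.linalg.norm(cluster_b - a, axis=1)
        neighbours = cluster_b[np.argsort(dists)[:k]]
        b = neighbours[rng.integers(len(neighbours))]
        new_instances.append(a + rng.random() * (b - a))   # point on the segment
    return np.vstack(new_instances)
```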

The pseudo-code explaining how a new (artificial) instance for the minority class is generated is shown as Algorithm 1. Algorithm 2 shows the pseudo-code of the agent-based population learning algorithm (PLA) where individuals (solutions) represent the selected instances. The pseudo-code covering the proposed method of the imbalanced data classification is shown as Algorithm 3.

[Pseudo-code of Algorithms 1–3: figure omitted]

3 Computational Experiment

The proposed approach has been validated experimentally. The main research question was whether the proposed approach performs better than the traditional approach where machine learning algorithms are used for learning from the original imbalanced data.

The classification accuracy of the classifier obtained using the proposed approach, denoted AOUSID (Agent-based Over- and Under-Sampling for Imbalanced Data), has been compared with the accuracy of:

  • AISAID – the algorithm originally introduced in [12], assuring balance between the minority and majority classes by applying instance selection and a special merging procedure to reduce the cardinality of the majority class to a level comparable to the cardinality of the minority class.

  • ALP – the procedure originally proposed in [25] for data reduction carried out only in the majority class. The procedure produces clusters of instances in the majority classes using k-means; next, these clusters are merged to obtain a reduced number of clusters equal to the cardinality of the minority class.

  • k-means – in this case, k-means clustering has been applied to data from the majority class and, next, prototypes have been selected from the obtained clusters using the agent-based population learning algorithm, as in [8].

  • C4.5, CART, CNN, 10NN – traditional machine learning algorithms.

Datasets used in the reported experiment have been obtained from the KEEL dataset repository [25]. Details of these datasets are shown in Table 1. It has been decided to use the 10-fold cross-validation scheme, and each benchmark problem has been solved 30 times, with the reported values of the quality measure averaged over all runs. Classification accuracy has been used as the performance criterion. Within the 10-fold cross-validation scheme, the training dataset of each fold was reduced using the proposed approach. The learning tool used was the C4.5 algorithm [26]. The parameter settings, shown in Table 2, have been set arbitrarily.
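The evaluation protocol can be sketched as follows (sklearn's `DecisionTreeClassifier` stands in for C4.5, and `rebalance` for the AOUSID pre-processing; both substitutions are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5
from sklearn.metrics import accuracy_score

def evaluate(X, y, rebalance, n_repeats=30, n_folds=10):
    """Sketch of the reported protocol: 10-fold CV repeated 30 times,
    rebalancing only the training fold, accuracy averaged over all runs."""
    scores = []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=r)
        for train, test in skf.split(X, y):
            X_bal, y_bal = rebalance(X[train], y[train])
            clf = DecisionTreeClassifier().fit(X_bal, y_bal)
            scores.append(accuracy_score(y[test], clf.predict(X[test])))
    return float(np.mean(scores))
```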

Table 1. Datasets used in the reported experiment (the IR column gives the ratio of the number of majority class instances to the number of minority class instances).
Table 2. Parameter settings in the reported experiment

Based on the results shown in Table 3, it can be observed that the AOUSID approach assures competitive results in comparison with the other algorithms. In several cases the algorithm performs best, including on the multi-class imbalanced datasets (wine and balance) as well as on the abalone19 and glass2 datasets. One can also observe that AOUSID outperforms the traditional machine learning tools (C4.5, kNN, CART, and CNN) when these algorithms are used directly on the imbalanced datasets.

Table 3. Results obtained for the AOUSID algorithm and other algorithms on imbalanced datasets and their comparison based on the accuracy (in %)

4 Conclusions

In the paper, a hybrid approach to imbalanced data learning, based on over-sampling and instance selection, is proposed. Both techniques have been integrated and implemented with a view to classifying imbalanced data by reducing the imbalance ratio between the minority and majority classes. Over-sampling has been used as a tool for instance duplication in the minority class. Instance selection has been used as a procedure for reducing the number of instances in the majority class. The selection of instances starts with data clustering using the similarity coefficient technique. In the next step, instances are selected from the clusters by the team of agents. The proposed approach has been validated experimentally on two-class and multi-class imbalanced datasets. Based on the results of the computational experiment, one may conclude that the proposed approach is a promising one with respect to solving machine learning tasks in the case of imbalanced data.

Future research will focus on studying the influence of different parameters on the performance of the proposed approach, such as the number of neighbours k in GAI. It is also planned to extend the experiments to additional datasets, as well as to carry out a deeper statistical analysis of the obtained results.