1 Introduction

In practice, imbalanced data distributions are common because the instances of interest are often rare. Data collections of medical anomalies, rare diseases in a population, and unusual syndromes often contain highly imbalanced proportions of normal and abnormal instances. For example, the ratio of unsuccessful to successful cases in lung resections for treating primary lung cancer at the Medical University of Wroclaw in 2013 is 1:7, while the ratio of active to inactive bioactivity screens of chemical substances described in the PubChem BioAssay Database is 1:70. If a classification model is to be built for automatically recognizing or predicting the class to which a new sample belongs, such an imbalanced dataset would be used for supervised model training, and this leads to a performance issue: the performance of a classification model induced from the imbalanced dataset would be far from acceptable.

This phenomenon is due to a limitation of the learning mechanism underlying traditional classification algorithms. When training samples are loaded into the induction process, the learning method does not take the distribution of the target classes into account. For example, in greedy search, which is a common strategy for classifier induction, the algorithm simply takes whatever training data are available and infers mapping relations between the data samples and the target classes. Learning proceeds without regard to the target class distribution in the training data. As a result, the classification model tends to be biased towards the majority class and lacks sufficient training to recognize the minority classes. The ill-trained classifier often shows a very high pseudo-accuracy when tested with majority class samples, but its performance deteriorates badly when it is tested with unknown samples from the minority class. In this case, the under-training caused by insufficient minority class samples is reflected in a low Kappa statistic, even though the accuracy may be high.

This is generally known as the ‘imbalanced dataset problem’, which has received a lot of research attention from both the data mining and computational biology communities. To tackle this problem, various computational approaches have been studied and applied to re-balance the distribution of the training samples between the target class and the non-target classes.

2 Background and related work

Two popular types of solution have been proposed and studied: data pre-processing and modification of the algorithms at code level. The under-sampling [1] and over-sampling [2] techniques belong to the first type, data pre-processing. They work simply by changing the distribution of the imbalanced dataset to improve the performance of the subsequent classification model. The main difficulty is that such a technique not only needs to eliminate a lot of noisy information to significantly reduce the degree of imbalance, but must also ensure that the essential information in the original dataset is retained [3]. The second type of solution modifies the classification method itself, based on cost-sensitive learning [4, 5], SVM [6], Boosting [7], or classifier ensembles [8] such as SMOTEBoost [9]. One classical implementation of the first type is SMOTE (synthetic minority over-sampling technique) [11], a commonly used over-sampling technique. The basic idea is that extra samples of the minority class are generated by artificial data synthesis in order to reduce the class imbalance. The research question, however, is how much artificial minority class data should be synthesized in order to achieve the maximum possible accuracy while at the same time assuring a substantial Kappa value. There are some key parameters whose values need to be optimized in this balancing process.

Assume that the over-sampling rate is S, the number of minority class instances is M, and each minority class instance is denoted \(x_{i}\) \((i = 1, 2, 3, \ldots, M)\), which belongs to \(S_\mathrm{min}\). For every \(x_{i}\), the algorithm finds its K nearest neighbors within the minority class and randomly selects one of them, \(x_{t}\); it then synthesizes a new instance according to Eq. (1).

$$\begin{aligned} x_\mathrm{new} = x_i + \mathrm{rand}\left[ {0,1} \right] \times (x_t - x_i ) \end{aligned}$$
(1)

The function rand[0,1] produces a random number in the interval between 0 and 1. Once the artificial data are generated, they are added back to the training dataset, producing a new version of the training dataset with a modified class distribution. This synthesis process may be repeated several times until a certain improvement in classification performance is seen.
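The interpolation in Eq. (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `k_nearest` and `smote_sample`, the Euclidean distance, and the toy minority points are assumptions made only for illustration.

```python
import math
import random


def k_nearest(x, others, k):
    """Return the k points in `others` closest to x (Euclidean distance)."""
    return sorted(others, key=lambda p: math.dist(x, p))[:k]


def smote_sample(x_i, minority, k, rng=random):
    """Synthesize one point on the segment between x_i and a random neighbor."""
    # Neighbors are searched among the other minority class instances only.
    neighbors = k_nearest(x_i, [p for p in minority if p is not x_i], k)
    x_t = rng.choice(neighbors)
    gap = rng.random()  # plays the role of rand[0,1]; Python draws from [0, 1)
    # Eq. (1): x_new = x_i + gap * (x_t - x_i), applied per attribute.
    return tuple(a + gap * (b - a) for a, b in zip(x_i, x_t))


# Toy two-dimensional minority class (illustrative values, not from the paper).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
x_new = smote_sample(minority[0], minority, k=2, rng=random.Random(0))
```

Because the synthetic point always lies on the line segment between \(x_{i}\) and the chosen neighbor, it stays inside the region already occupied by the minority class.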

Figures 1 and 2 illustrate a sample SMOTE operation with K equal to 7, and show how a minority class instance is synthesized from \(x_{i}\). It is easy to see how to select a suitable value of K when the data dimension is low, e.g., 2 or 3 dimensions. However, for very high-dimensional data with many attributes, determining the ideal proportion of extra minority class data to be synthesized is a computational challenge in using SMOTE.

In short, although the concept is simple, there are some unsolved issues associated with balancing techniques like SMOTE. It is not known which parameter values should be used, nor how many times the generation process should be repeated, to yield a classifier with the best performance. Note that SMOTE runs as a one-off operation: a certain amount of synthetic data is generated each time the SMOTE function is invoked. As a research objective, we aim to find an optimal pair of values for S (the over-sampling rate) and K (the number of nearest neighbors), the key parameters that directly influence the data synthesis and the end result, namely the efficacy of the classifier. To this end, swarm optimization algorithms are proposed to find a suitable pair of S and K values for rebalancing the class distribution in the training data, so that a reliable classifier is obtained.

Fig. 1 SMOTE, the minority class data \(x_{i}\) with \(K =7\)

Fig. 2 Synthesized data of \(x_{i}\)

The main advantage of swarm optimization over computational brute force in tuning S and K lies in its speed and in its use of heuristics across iterations. Whenever new training data are loaded, the swarm continues to move and improve in the search space. The heuristic information helps the supervised training adapt to the new target class distribution whenever the underlying class distribution changes with the arrival of new data.

In the experiments reported in this paper, we use neural network and decision tree classifiers for verification, together with two different metaheuristic optimization algorithms, namely the particle swarm optimization (PSO) algorithm [11] and the Bat-inspired algorithm (Bat) [12]. Neural network and decision tree are chosen because of their popularity in data mining. Another reason is that the neural network represents a typical black-box supervised learning scheme (the non-linear relations between the attributes and the targets are encoded as numerical weights in the hidden layers), whereas the decision tree represents a non-black-box, rule-induction learning scheme: human-readable decision rules can be harvested from the resultant decision paths. Furthermore, we use two different metaheuristics, PSO and Bat, to compare a classical metaheuristic optimization algorithm with a contemporary one, as well as a simple design with a sophisticated one. Search agents in PSO move according to two velocity components, global and local; Bat agents move in a similar fashion, with the addition of echolocation-driven acceleration in their flight paths.
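The two velocity components mentioned for PSO can be sketched with the textbook PSO update rule. This is an illustrative sketch only: the inertia weight `w`, the coefficients `c1` and `c2`, and the encoding of a particle as an (S, K) pair are assumptions, since the paper does not state the settings it used.

```python
import random


def pso_step(position, velocity, personal_best, global_best,
             w=0.7, c1=1.5, c2=1.5, rng=random):
    """One canonical PSO update for a single particle (lists of floats)."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()
        # Local pull towards the particle's own best, global pull towards
        # the swarm's best; w, c1, c2 are assumed values.
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


# A particle encoding a hypothetical (S, K) pair: S in percent, K neighbors.
pos, vel = pso_step([200.0, 5.0], [0.0, 0.0], [300.0, 4.0], [500.0, 7.0],
                    rng=random.Random(1))
```

In a real run the updated position would still need to be clipped to the allowed ranges of S and K, and K rounded to an integer, before it is passed to SMOTE.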

3 Experiment and datasets

It is well known that classifying an imbalanced dataset in its original form can sometimes yield a very good accuracy. However, the other performance index, the Kappa statistic, is then very low; most of the time Kappa drops to almost zero, and at times it may become negative, depending on how imbalanced the data are. The confusion matrix defined in Table 1 shows the reason: because the negative class accounts for only a small proportion of the data, the classifier misclassifies most, if not all, of its instances. That means if we use only negative class data as a testing dataset, the accuracy of the trained classification model will be extremely low, because the classifier was under-trained with the minority class data. Thus a high accuracy obtained when classifying an imbalanced dataset is meaningless.

Table 1 Confusion matrix

From the confusion matrix, accuracy and Kappa are defined by the following equations:

$$\begin{aligned} \text {Accuracy}= & {} \frac{\hbox {TP}+\hbox {TN}}{\hbox {P}+\hbox {N}} \end{aligned}$$
(2)
$$\begin{aligned} \text {Kappa}= & {} \frac{P_o - P_c }{1- P_c} \end{aligned}$$
(3)
$$\begin{aligned} P_o= & {} \text {Accuracy}= \frac{\hbox {TP}+\hbox {TN}}{\hbox {P}+\hbox {N}} \end{aligned}$$
(4)
$$\begin{aligned} P_c= & {} \frac{\left( {\hbox {TP}+\hbox {FP}} \right) \times \left( {\hbox {TP}+\hbox {FN}} \right) +\left( {\hbox {FN}+\hbox {TN}} \right) \times \left( {\hbox {FP}+\hbox {TN}} \right) }{\left( {\hbox {P}+\hbox {N}} \right) ^{2}} \end{aligned}$$
(5)
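As a worked illustration of these definitions, the sketch below computes accuracy and Kappa from the four confusion matrix counts, using the standard form of Cohen's Kappa, \((P_o - P_c)/(1 - P_c)\). The function name and the example counts are assumptions chosen for illustration; the counts describe a classifier that assigns every sample to the majority (positive) class.

```python
def accuracy_and_kappa(tp, fn, fp, tn):
    """Return (accuracy, kappa) from binary confusion matrix counts."""
    total = tp + fn + fp + tn
    p_o = (tp + tn) / total  # observed agreement, equal to accuracy
    # Chance agreement: product of predicted and actual marginals per class.
    p_c = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    # Guard against division by zero when only one class is present.
    kappa = (p_o - p_c) / (1 - p_c) if p_c != 1 else 0.0
    return p_o, kappa


# Illustrative counts: 400 positives all correct, 70 negatives all
# misclassified as positive.
acc, kappa = accuracy_and_kappa(tp=400, fn=0, fp=70, tn=0)
```

Here the accuracy is about 0.85 while Kappa is zero, because the observed agreement is no better than the agreement expected by chance.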

Kappa is an alternative measure of classification performance that reflects the consistency of the classifier on the testing dataset. It is thus an important performance indicator that tells us whether the classification accuracy can be trusted. Kappa is generally interpreted as the reliability of the classifier model: the higher the Kappa value, the more credible the accuracy. The range of Kappa values [13] is between \(-\)1 and \(+\)1. Three levels of Kappa are used to estimate the credibility of the classification accuracy:

  1. Kappa \(\ge \) 0.75: strong consistency; the accuracy is highly credible.

  2. 0.4 \(\le \) Kappa \(<\) 0.75: moderate consistency; the accuracy is generally credible.

  3. Kappa \(<\) 0.4: the accuracy is not credible.
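The three levels can be expressed as a small lookup function. The function name and the returned labels are shorthand introduced here for illustration; only the thresholds 0.75 and 0.4 come from the text.

```python
def kappa_credibility(kappa):
    """Map a Kappa value to one of the three credibility levels."""
    if kappa >= 0.75:
        return "strong"
    if kappa >= 0.4:
        return "moderate"
    return "not credible"
```

For example, `kappa_credibility(0.0)` returns `"not credible"`, which is the situation of a classifier that only ever predicts the majority class.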

Our experiment is conducted over datasets that have binary classes. We use the PSO and Bat algorithms to optimize the two parameters for artificially rebalancing the data distributions. Neural network and decision tree are the classifiers that we use to measure fitness in every generation of the swarm algorithms. Normally, one would focus on a single performance measure, accuracy, to judge whether the two parameters are globally best. Here, however, because of the nature of imbalanced datasets, we also need to ensure that Kappa is as large as possible so that the accuracy is credible. In testing the classification model, we use tenfold cross-validation to evaluate the classifier.

Fig. 3 Flow chart of our extended rebalancing algorithm using swarm optimization

Figure 3 shows the logic of the extended SMOTE algorithm as a flow chart, in which swarm algorithms are used to optimize the values of the two key parameters. Every time the search agents (such as particles in PSO or bats in Bat) move, the neural network and decision tree classifiers are used to evaluate whether a local best of K and S has been found that yields better Kappa and accuracy. These performance indicators are then compared against the conditions, and the process iterates to find the globally best K and S so as to improve Kappa and the credibility of the accuracy. In the computation, each of the parameters S and K has its own step interval. The maximum value of S is the ratio between the majority class and the minority class. The minimum value of S is 1 %. The maximum and minimum values of K are, respectively, the total number of instances of the minority class and 2. For instance, if the target class in the dataset has two labels, the majority class has 1000 instances, and the minority class has only 10 instances, S_max = 9000 %. That means the minority class sample can grow as large as 100,000 times, and it has to increase from 20 times at least. K_max = 10, and K_min = 2.

figure a

The above is the pseudo-code of the swarm-SMOTE algorithm. The principle is as follows. In every generation of the swarm optimization algorithm we apply two control conditions. We know that when Kappa is equal to or greater than 0.4 (the minimum requirement according to the three Kappa levels), the accuracy value is meaningful. The two conditions for inferring a classifier with meaningful classification ability are therefore: first, Kappa must be equal to or larger than 0.4; second, among the candidates that qualify, we attempt to obtain the globally highest accuracy while requiring that Kappa remains at least 0.4. The value 0.4 can be regarded as a minimum threshold that is arbitrarily chosen. Should the user require a more robust classifier, a larger value for the minimum threshold can be used. Kappa and accuracy are often correlated: Kappa tends to increase as accuracy rises, and vice versa, except when very imbalanced data are used as the training dataset.
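The two control conditions can be sketched as a comparison rule between candidate parameter pairs. This is a sketch under stated assumptions rather than the authors' pseudo-code: the function names are hypothetical, the evaluation tuples are made-up values, and the tie-breaking rule for candidates that both fall below the Kappa threshold (prefer the higher Kappa) is an assumption not stated in the text.

```python
def is_better(candidate, incumbent, kappa_min=0.4):
    """Compare two (accuracy, kappa) results under the Kappa threshold rule."""
    cand_acc, cand_kappa = candidate
    inc_acc, inc_kappa = incumbent
    cand_ok = cand_kappa >= kappa_min
    inc_ok = inc_kappa >= kappa_min
    if cand_ok != inc_ok:
        return cand_ok  # a qualifying result always beats a non-qualifying one
    if cand_ok:
        return cand_acc > inc_acc  # both qualify: higher accuracy wins
    return cand_kappa > inc_kappa  # assumption: neither qualifies, higher Kappa wins


def select_best(evaluations, kappa_min=0.4):
    """Pick the best (S, K, accuracy, kappa) tuple from an iterable."""
    best = None
    for s, k, acc, kappa in evaluations:
        if best is None or is_better((acc, kappa), (best[2], best[3]), kappa_min):
            best = (s, k, acc, kappa)
    return best


# Made-up evaluations of four (S, K) pairs, for illustration only.
evaluations = [
    (100, 3, 0.95, 0.10),
    (400, 5, 0.90, 0.55),
    (800, 7, 0.92, 0.48),
    (900, 7, 0.93, 0.30),
]
best = select_best(evaluations)
```

In this example the pair with the highest raw accuracy (0.95) is rejected because its Kappa is below 0.4; the rule selects S = 800, K = 7, the most accurate candidate among those that meet the threshold.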

As a performance benchmark, a standard class balancer algorithm is compared with our proposed swarm rebalancing algorithm. The experiment is carried out on five imbalanced medical datasets. The same decision tree and neural network classifiers are used throughout the experimentation. Class balancer is a traditional algorithm that turns an imbalanced dataset into a completely balanced dataset. It works simply by reassigning majority class data that are near the boundary between the two classes to the minority class, so that the dataset becomes quantitatively balanced.

The software programs for the experiments are coded in MATLAB version 2014a. The computing environment is a PC workstation with an E5-2670 V2 CPU @ 2.50 GHz and 128 GB of RAM.

Five imbalanced medical datasets are selected from UCI [14] for our experiment, and their properties are presented in Table 2. The imbalance ratios between majority class and minority class range from 2.05:1 to 70.3:1.

Table 2 Biological datasets used in experiment

The surgery dataset concerns lung surgeries that were conducted over 5 years. The data were collected retrospectively at Wroclaw Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer [15]. Out of 470 surgeries, there were only 70 cases with 1-year survival period; the rest died. The training dataset represents a typical medical case of mild imbalance, in which the minority class accounts for 14.89 % of the instances; this is not uncommon in surgical treatment for tuberculosis and pulmonary diseases. The same dataset was used to predict the post-operative life expectancy of lung cancer patients using SVM [15]. The imbalance problem was tackled by embedding AdaBoost into SVM, and classification accuracies were reported to improve with the new method. However, Kappa was omitted from the performance consideration.

The other datasets used in our experiment come from PubChem Bioassay. These are heavily imbalanced bioassay datasets obtained from different types of virtual screening using high-throughput screening (HTS) technology [17]. The use of HTS in the biopharmaceutical industry has proliferated from its basis in hit identification through the entire drug discovery and early development process. Applications of HTS approaches span from human to animal health, from life sciences to drug discovery, and from protein docking to virtual screening simulation. These datasets have been tested in classification using different types of machine learning methods, including but not limited to decision tree by Chen et al. [17], a consensus model by Liew et al. [18] and a genetic algorithm-neural network by Tong and Mintram [19]. In particular, the authors of [19] embedded a genetic algorithm search into a neural network activation function for feature selection; weights for the selection of features are assigned in favor of the minority class to address the class imbalance problem. However, none of these previous works has investigated Kappa statistics extensively. In our design, we opt for an alternative approach that tackles the imbalanced data in the pre-processing stage, progressively improving the balance with a swarm algorithm that runs iteratively along with the classification model training process.

A total of 21 bioassay datasets generated from PubChem are available. Both primary and confirmatory bioassays (12 bioassays, 21 mixes) are available and can be used for training and testing when evaluating the classifiers for the fitness function. Out of the many sub-datasets in the bioassay datasets, which are imbalanced in nature, we randomly selected four for testing our method.

4 Result analysis

Our experiment collected the performance results in terms of accuracy, Kappa (Kappa statistics), precision, recall, F-measure, ROC area, and a new index, the imbalance ratio between minority class and majority class. These results are presented in Tables 3, 4, 5, 6 and 7 for the different classification algorithms and imbalance-processing methods. The seven performance indices are also visualized individually in the radar charts of Fig. 9a–g across the different datasets. In addition, we selected accuracy, Kappa and imbalance ratio (Min/Maj) as the key indicators and show their changes in the bar charts of Figs. 4, 5, 6, 7 and 8.

Fig. 4 Comparison of different methods in three key performances of surgery dataset

Fig. 5 Comparison of different methods in three key performances of AID362 dataset in Bioassay

Fig. 6 Comparison of different methods in three key performances of AID439 dataset in Bioassay

Fig. 7 Comparison of different methods in three key performances of AID721 dataset in Bioassay

Fig. 8 Comparison of different methods in three key performances of AID1284 dataset in Bioassay

Figure 4 and Table 3 show the classification results for the Surgery dataset. The imbalance ratio (Min/Maj) of the original dataset is low, and the two key performance indicators, accuracy and Kappa, go to two extremes: high accuracy with a Kappa value of zero. This extreme result means that the two classification algorithms, after being trained on the imbalanced data, have no useful classification power. After processing the original dataset with the class balancer, which achieves some but perhaps not optimal balance, the result is still poor regardless of which classification algorithm is used. The swarm-SMOTE approaches proposed in this paper manage to keep the classification results within a reliable range, at the cost of a slight compromise in accuracy. In fact, the slightly lower-than-maximum accuracy indicates a realistic classification scenario. Through the imbalance ratio index, we can observe the change in the degree of imbalance of the dataset, which shows that our methods can function without the dataset having to reach a completely balanced state (such as 50–50 % between the majority and minority data).

Fig. 9 a Radar chart of results in accuracy, b Radar chart of results in Kappa, c Radar chart of results in precision, d Radar chart of results in recall, e Radar chart of results in F-measure, f Radar chart of results in ROC area, g Radar chart of results in imbalance ratio (Min/Maj)

The other four datasets are taken from a widely diversified bioassay collection. The results for the first bioassay dataset, coded AID362, are very extreme for the different classifiers. It is the most highly imbalanced dataset among the five, so a high accuracy with a low Kappa in the original classification is to be expected. The result is still not very good after the original dataset is processed by the standard class balancer method. The interesting point in Fig. 5 is that the results of our swarm-SMOTE method with neural network and with decision tree are quite different. While maintaining a very high accuracy, the decision tree classifier also obtains a high Kappa that is very close to one, whereas the neural network performs poorly: its increase in Kappa is not enough to reach the credible level of 0.4 or above. For the third dataset, AID439, which yields a negative Kappa in the original classification with decision tree, our method is clearly better than the traditional class balancer method. In this group of results, our approach improves accuracy and Kappa simultaneously, and the swarm-SMOTE-DT method needs to synthesize more minority samples than the swarm-SMOTE-NN method. The experiment on the AID721 dataset shows once again, in Table 6 and Fig. 7, that the neural network classification algorithm is worse than the decision tree algorithm. The latter obtains a higher accuracy and Kappa than before, whereas our method with neural network can raise Kappa only within a limited range and with lower accuracy. The results for the last dataset, AID1284, reflect the same pattern as before: most of the time PSO is only slightly better than Bat at finding the optimal pair of parameter values for the best classification performance, and it seems to achieve this by synthesizing many more minority samples. The results in Table 7 and Fig. 8 confirm that our method handles the imbalanced dataset better in classification.

Fig. 10 Average results in different performances

Figure 9a–g shows the different performance measures across the different datasets and displays graphically what is described in the analysis above. The most significant distinction is that the accuracy is very high when the original dataset is classified with either of the two classifiers, but the Kappa value is very low, even down to or below zero. When this happens, the classification accuracy is meaningless. After using our method, the Kappa values are generally improved, but the results show that the decision tree classifier is much better than the neural network. In the results for AID 721 and AID 1284 with neural network, the Kappa value struggles to go beyond 0.4, although it has risen to a certain degree. It is significant that, using the same swarm optimization algorithm, both accuracy and Kappa are boosted. This phenomenon is most obvious in the results for the AID 362 dataset. The neural network classification algorithm with swarm optimization also performed well. As shown in the experimental results for the last two datasets, both classifiers pull the Kappa value into a reliable interval, although the decision tree is still better than the neural network at maintaining a high accuracy.

Figure 10 shows bar charts of averaged values for an overall comparison of the different methods. Although the class balancer method can bring the dataset to a full balance by data count, the classification performance obtained with it is still very poor. What gets balanced in the processed data is the raw data count; the underlying, possibly nonlinear, mapping between the data and the target classes may remain imbalanced. In general, the decision tree is much better than the neural network.

Bat synthesizes a smaller amount of minority class data, yet it achieves a performance on par with PSO. The purpose of the proposed method is to reduce the degree of imbalance of the dataset, but the additional synthetic data may damage the structure of the original dataset. Ideally, the goal is to synthesize the least amount of data, hence minimal intervention in the original dataset, while improving the classification performance to the maximum degree by training a highly reliable classifier.

5 Conclusion

Imbalanced data are a known challenge in classification tasks. Medical data are inherently imbalanced, given the small number of rare cases among many ordinary instances. In this paper a simple and effective approach, called swarm-SMOTE, is proposed that finds optimal values for controlling the inflation of minority class instances by using stochastic swarm optimization algorithms such as PSO and Bat. Rather than choosing parameters blindly, swarm metaheuristics follow the optimization objective, work within constraints, and progressively and heuristically refine the data subset towards the goal of balancing the two classes. This can be a good data pre-processing step suitable for most medical data classification cases. Since more synthetic samples have a greater influence on the spatial structure of the data, swarm optimization algorithms not only fix the imbalance problem in the dataset but also help to add just the right amount of synthetic data to the minority class for good performance. In this respect, although both swarm algorithms obtain improved results, the Bat algorithm is found to be better than PSO at preserving the original structure of the dataset as much as possible. The proposed method is helpful for biomedical research, especially in domains involving automatic data classification. In our experiments, our method is shown to outperform the traditional class balancing method that works solely on data counts. Moreover, the decision tree copes with the imbalance problem much better than the neural network on imbalanced datasets. Our results show that superior performance is obtained when swarm-SMOTE is coupled with a decision tree classifier to solve the imbalanced dataset problem in biomedical research. In future experiments, we hope to treat Kappa as an objective rather than a threshold, performing multi-objective optimization that maximizes both accuracy and Kappa.