1 Introduction

Classification is a data mining task that assigns data instances to classes based on certain criteria. In many real-life scenarios we look for exceptional cases within a whole data-set, such as fraudulent transactions among all credit card transactions, brain tumor images in a data-set of images, or web spam in a database of all e-mails [22, 38, 51, 60]. When traditional classification procedures were applied to these scenarios, they did not give accurate results: the results were biased towards the bigger class, whereas the need was to detect the smaller class. This issue is known as the class imbalance problem. It arises because existing classification algorithms, which were designed to identify classes from balanced data, are being used to detect classes from unbalanced data [22, 38, 51, 60].

Imbalanced data is a combination of classes of unequal size. In the class imbalance domain, these classes are referred to as the minority (smaller) and majority (bigger) classes, and the purpose of the proposed solutions is to identify the minority class accurately. Researchers have suggested many ways to solve this issue. Based on the existing work, the solutions can be divided into four categories: data level, algorithm level, feature based and hybrid (data + algorithm). Data level algorithms pre-process the data and convert it to a balanced data-set so that existing classification algorithms can be used. Depending upon the logic suggested by the authors, data-level algorithms are further divided into oversampling, undersampling and hybrid (oversampling + undersampling) sampling categories. In oversampling methods, the data is balanced by increasing the size of the smaller class, either by copying existing data or by using some other intelligent method. After balancing, the existing classification procedures are applied to classify the data [1, 2, 4, 7, 8, 23, 24, 28, 36, 39, 43, 45, 61]. Undersampling methods decrease the size of the majority class, either by randomly deleting data or by using some other intelligent approach to remove data from the class, so as to balance the data-set before applying traditional classification algorithms [12, 20, 25, 41, 46, 48, 59]. In addition to algorithm and data level approaches, feature selection is another important aspect that can alone alleviate the class imbalance problem. Another study observed that, rather than feature selection alone, the interaction between different features is also important: highly correlated features can result in more accurate partitions [10, 17, 37, 53, 63]. Recently, work has been reported in which the PCA technique is combined with algorithm or data level procedures [14, 35] to solve this issue.
The hybrid sampling method uses undersampling and oversampling in combination to pre-process the data before classification [3, 32, 42]. In algorithm level approaches, authors either worked upon the internal structure of traditional classification procedures to modify the sensitivity of the algorithm towards the bigger class or developed new methods to alleviate the class imbalance situation [5, 11, 13, 15, 19, 27, 29, 30, 34, 40, 47, 49, 50, 54, 55, 56, 57, 58, 62, 64]. The hybrid category combines algorithm or data level methods with ensemble approaches such as bagging, boosting and random forest [6, 9, 16, 18, 21, 26, 31, 33, 44, 52]. After analyzing the above methods from the years 1997 to 2016, we present graphically in this paper the various research trends followed to solve this issue. This will help researchers to tackle the problem and to face the upcoming challenges in this domain in a better manner and in the right direction.
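The random variants of oversampling and undersampling described above can be sketched as follows. This is only a minimal illustration, not a method taken from any of the surveyed papers; the function names `random_oversample` and `random_undersample`, the `seed` parameter and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Copy randomly chosen samples of each smaller class until
    every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = [rng.choice(pool) for _ in range(target - n)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class is
    as small as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        kept = rng.sample(pool, target)
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data (hypothetical): 10 majority samples, 2 minority samples.
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2

X_os, y_os = random_oversample(X, y)
X_us, y_us = random_undersample(X, y)
```

After either call the two classes have equal size, so a traditional classifier can be trained on the result. The "intelligent" variants mentioned in the text would replace the random choice with a more informed selection or generation step.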

2 Research trends

From the above study, we have recognized four categories, which are further divided into nine sub-categories as displayed in Fig. 1. All the techniques suggested in the past to alleviate the class imbalance problem have used 18 different approaches, as listed in Table 1. Some of the techniques have used more than one approach to tackle the problem. Based upon this analysis, we selected the following parameters to identify the research trends in the class imbalance domain.

Fig. 1 Categories of class imbalance domain

Table 1 Approaches used in proposed techniques

2.1 Publication trend category wise

Figure 2 shows the publication trend for the four categories: data level, algorithm level, feature based and hybrid level. The largest share of the reported work falls under algorithm level, followed by data level and hybrid level, which account for almost similar percentages of techniques. In the sub-category-wise analysis (Fig. 3), we observed that the maximum number of techniques (26.58%) is reported in the cost-sensitive algorithm level sub-category. In the data-level category, the maximum number of publications is reported in oversampling (18.99%), and in the hybrid approach it is boosting (13.92%). The latest category observed during the survey is the hybrid level rotation forest category (1.27%). It is noticed that in recent years hybrid ensemble approaches have become very popular [55, 56, 57, 58, 59, 60, 61, 62, 63, 64].

Fig. 2 Publication trend categorywise (color figure online)

Fig. 3 Publication trend sub-categorywise. US undersampling, OS oversampling, HS hybrid sampling, CS cost sensitive, CSE cost sensitive ensembles, RF random forest, BG bagging, BO boosting, HE hybrid ensembles (color figure online)

2.2 Use of approaches by the techniques

To address the class imbalance problem, authors have used various approaches to enhance the classifier’s performance. Figure 4 records the popularity of the approaches in terms of their usage in developing various techniques, whereas Fig. 5 records it in terms of duration, i.e., the first and the most recent year in which the approach was used in developing techniques. We observed that the most popular approach in terms of usage is nearest neighbor, with 17.86% usage. Other approaches close to it are SVM (16.43%), boosting (15.71%) and kernel function (14.29%). In terms of duration, the most popular approach is nearest neighbor, with a 19-year duration (1997–2015).

Fig. 4 Use of approaches by the techniques (color figure online)

Fig. 5 Popularity of the approaches (color figure online)

SVM and bagging share popularity with a 17-year duration (1999–2016). Some approaches are used in a single technique only, such as noise filter (2014), rough sets (2011), geometric mean (2013), rotation forest (2015) and immune network (2015).

2.3 Tools used by the techniques

Researchers require tools for quick implementation and automatic analysis of their work, and authors have used different kinds of tools to develop techniques. Based on the information available in the research papers, WEKA (Waikato Environment for Knowledge Analysis), MATLAB and KEEL are the well-known tools used by researchers for implementing and analyzing information (Fig. 6). WEKA is a popular tool for analysis. Recently, authors have used KEEL, in which WEKA is already embedded.

Fig. 6 Tools used by techniques (color figure online)

2.4 Data set used

We observed from this study that the majority of the techniques are evaluated on data-sets available at the UCI repository. Figure 7 shows that 56% of the 79 techniques have used data-sets from the UCI repository.

Fig. 7 Data-sets used in papers (color figure online)

3 Issues and challenges related to class imbalance problem

This section discusses various issues that are recognized in the class imbalance problem and can be taken up as research challenges.

“What if the imbalance ratio is changing dynamically?” The imbalance ratio (IR) is the ratio of the instance count in the bigger class to the instance count in the smaller class. The IR value may vary from just above 1 to any number. The problem becomes more severe as the value of IR increases. No technique in the literature acts dynamically by taking this factor into consideration; a technique may work efficiently only for one specific value of IR [51].
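The definition of IR given above can be computed directly from the class labels. The following is a minimal sketch; the function name `imbalance_ratio` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Instance count of the biggest class divided by the
    instance count of the smallest class."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Hypothetical data-set: 990 legitimate and 10 fraudulent transactions.
labels = ["legit"] * 990 + ["fraud"] * 10
ir = imbalance_ratio(labels)  # 990 / 10 = 99.0
```

For data that arrives over time, the same function can be re-evaluated on each new batch of labels to monitor how IR changes.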

“Where is the best re-balance option?” “Will IR = 1 achieve the best results?” The performance of techniques does not depend only upon the balancing of data; otherwise, techniques would perform best at IR = 1. So, where the best re-balance point lies and on which other factors it depends is another open question that can be looked into.

“Is class imbalance the only problem with data?” Most of the work done in this field aims to remove the class imbalance effect in data-sets, but in real situations there are other data distribution complexities that play a major role in the degraded performance of classifiers. Very little literature is available that deals with the combined effects of the class imbalance problem (CIP) and other abnormalities such as class overlapping, small disjuncts and class distribution within class.

“Is data free from noise?” Another important issue in real data-sets is noise, which is present in the data of every possible field in one form or another. In some cases, missing values act as noise. In medical data, there is the possibility of vague information due to the image acquisition process. In web data, there is a possibility of manipulated or changed information due to signal noise, impulse noise etc. Very little work is recorded in which researchers have handled noise within the techniques. The techniques are developed either by neglecting the missing values or by assuming that the data is cleaned before classification. An efficient technique that can handle such situations along with the other data distribution complexities is still to be developed.

“Which is the best performance metric to assess the techniques developed for CIP?” Many performance metrics are designed specifically to deal with skewed data sets (SDS), such as F-measure, ROC, AUC, precision, G-mean, precision-recall curve (PRC), K-S statistics, recall and specificity. These metrics were developed because the accuracy metric used with traditional classifiers gives results biased towards the majority class. However, it remains an open question which performance metric is preferable in a specific situation and which metric is more relevant in one situation than another.
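The bias of accuracy towards the majority class can be seen in a small example. The sketch below computes accuracy together with some of the metrics listed above for a trivial classifier that always predicts the majority class. The function name `binary_metrics`, the convention of returning 0.0 when a denominator is zero, and the toy data are assumptions made for illustration.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, specificity, F-measure and G-mean
    for a binary problem; the minority class is taken as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical skewed data-set: 990 majority (0) and 10 minority (1) instances.
y_true = [0] * 990 + [1] * 10
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 1000

metrics = binary_metrics(y_true, y_pred)
```

Here the accuracy is 0.99 although not a single minority instance is detected, while recall, F-measure and G-mean are all 0. This does not settle which metric is preferable in a given situation; it only shows why accuracy alone is misleading on skewed data.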

“What if the class distribution of the training set differs from the test set?” Class distribution is another important issue in developing an efficient technique: the distributions of the training and test data may differ, but the techniques are designed under the assumption that they are the same [51].

There is very little literature on the multiclass imbalance problem [38]; most research is on binary classification. Researchers have worked with multiple-class data-sets, but only by reducing the multiclass problem to a binary one by joining the majority classes and the minority classes separately. Techniques of this kind do not work well when applied to the multiclass problem.