1 Introduction

With the rapid development of the world economy, loans have become an indispensable part of modern society, but high profit is often accompanied by high risk. One of the major risks stems from the difficulty of distinguishing credit-worthy applicants from those who will probably default on their repayments [49]. In this context, credit scoring has been identified as a crucial tool for reducing these risks and supporting managerial decisions [60], and it is one of the most popular application fields for both data mining and operational research [5].

Many techniques have been proposed for credit scoring, ranging from statistical models to artificial intelligence methods [23]. Traditionally, the assessment of early loan default has relied mainly on subjective judgment, and qualitative criteria such as the classic 5C standard suffer from high subjectivity and arbitrariness [60]. With the wide application of computer and network technology in the banking industry, manual loan approval can no longer meet the needs of society, and statistical models, data mining, and other methods have been applied to credit scoring [27]. Most classical credit scoring methods are based on parametric statistical models, such as discriminant analysis [4, 17, 52, 64, 67] and logistic regression [3, 4, 17, 57, 64, 65]. Moreover, recent studies have also employed non-parametric methods and computational intelligence techniques such as decision trees [3, 4, 64, 67], neural networks [2, 3, 4, 17, 64, 67], support vector machines [4, 66], and others.

Most traditional statistical models have explicit mathematical forms and simple structures. However, it is hard to imagine that the complex real world can be described by a limited set of mathematical formulas. Intelligent methods, in contrast, are data-driven learning algorithms that do not rely on hand-crafted rules; their predictive performance is usually good, and their cross-validation results are easy for practitioners to interpret.

According to many comparative studies [4, 30, 62], no single method can be claimed superior to the competing algorithms without considering the characteristics of the data. For instance, noisy samples, missing values, and skewed class distributions may significantly affect the success of most prediction models.

This paper focuses on one data characteristic that may have the greatest influence on the performance of classification techniques: the imbalance in the class distribution [12, 25, 33]. While some complexities have been widely studied in the credit scoring literature (e.g., attribute relevance), the class imbalance problem has received relatively little attention so far. Nevertheless, an imbalanced class distribution naturally arises in credit scoring, where, in general, the number of observations in the class of defaulters is much smaller than the number of cases belonging to the class of non-defaulters [54].

In real life, credit scoring is an imbalanced classification problem because information on overdue users is relatively scarce. According to published statistics, the default ratio of China’s banking financial institutions reached 1.64% in 2014, up from 1.49% in 2013. For commercial banks, the default rate was 0.97% at the end of 2013 and increased by 0.16 percentage points to 1.13% in 2014, indicating a potential credit risk. Therefore, credit scoring is essential for classifying loan applicants into two classes, i.e., normal users (those who are likely to keep up with their repayments) and overdue users (those who are likely to default on their loans) [8]. Because of the imbalanced data distribution, it is often difficult to obtain good performance using only traditional classifiers, which assume a balanced class distribution and assign an equal misclassification cost to each class. As a result, traditional classifiers tend to be overwhelmed by the majority class and to ignore the minority class, which is not acceptable in many real applications [22].

Therefore, improving the accuracy on the minority class is an important and meaningful issue. Learning from imbalanced data is currently an important research direction in machine learning because imbalanced data arise in many real-world applications, such as fault diagnosis [44], medical diagnosis [50], intrusion detection [14, 59], text classification [42, 68], financial fraud detection [53], data stream classification [24], natural disasters [48], and so on. In these applications, there are often one or more minority classes with very few samples compared with the other classes, and most of the time the minority classes are more important than the majority classes.

Recently, a variety of methods have been proposed to solve this problem, and they can be divided into four categories: algorithm-level methods, data-level methods, cost-sensitive methods, and ensembles of classifiers [6, 7, 9, 10, 13, 15, 19, 25]. Cost-sensitive learning methods assign different costs to different types of errors, minimizing the number of high-cost errors and the overall misclassification cost [11, 18, 20, 31, 39, 47, 51, 58]. The cost matrix is usually determined by expert opinion; however, because it is very difficult to set up an appropriate cost matrix, this approach has not been widely adopted. Ensemble learning solves the same machine learning problem by combining multiple learners and, compared with a single learner, usually achieves better performance and stronger generalization ability. According to how the individual learners are generated, current ensemble methods can be roughly divided into two categories: sequentially generated methods (such as Boosting) and parallel generated methods (such as Bagging) [22]. The selection of a proper combination scheme and base learner is still a challenge. Algorithm-level methods adapt a supervised classifier to strengthen its accuracy on the minority class; that is, they create new classifiers or modify existing ones to tackle the class imbalance problem. This approach depends heavily on the nature of the classifier, most works in this direction focus on solving one specific issue, and developing new algorithms or modifying existing ones is difficult [41, 45, 46, 55]. For these reasons, data-level methods, which preprocess the imbalanced data sets before constructing the classifiers, are widely considered in the literature: the data-level approach is more flexible, and data preprocessing and classifier training can be performed independently. In addition, according to Albisua et al. [1] and Galar et al. [22], who presented comparative studies of numerous well-known methods, combinations of data preprocessing methods with ensembles of classifiers perform better than other methods; besides, the data-level perspective is easier to understand and implement.

Data preprocessing methods resample the imbalanced training data set before model training. To restore balance, the original imbalanced data set can be resampled by oversampling the minority class [10, 13, 15, 19, 21, 22, 28, 29] or undersampling the majority class [32, 34, 35, 36, 37, 38, 40]. Between these two resampling strategies, undersampling has been shown to be the better choice [22]. However, both have drawbacks; namely, oversampling increases the amount of redundant information, while undersampling discards some information. As a result, more and more researchers study combinations of the two methods. Lin et al. [43] put forward a resampling algorithm combining random undersampling (RUS) and the synthetic minority oversampling technique (SMOTE), and achieved good results in extreme-risk early warning in the financial market field. In addition, Tomek’s modification of the condensed nearest neighbor rule [61] has often been combined with SMOTE as a sampling strategy in experiments. It is worth noting that, in order to achieve a relative balance between the two classes, a large number of majority-class samples inevitably have to be deleted. However, the undersampling part of these two combined methods does not take the distribution characteristics of the data into account, which affects the results.

To overcome this shortcoming, we propose a new algorithm for dealing with credit data sets that combines oversampling and undersampling. The main contributions are two-fold.

First, we improve the algorithm by replacing the clustering-based core with a Gaussian mixture model (GMM). The aim of cluster analysis is to group similar objects (i.e., data samples) into the same cluster, so that objects in different clusters differ in their feature representations. The original data in the same group are then replaced by the cluster centers, thereby reducing the size of the majority class. The GMM provides the probability that each data point is assigned to each cluster; using such probabilities has many advantages because they carry more information than the hard assignments of a conventional clustering result.

Second, although there are many imbalanced data sets in the financial field, only some of them have been used for credit scoring. We apply the proposed approach, which combines oversampling and undersampling, to three different credit data sets containing real business data in order to test the performance of the method from three viewpoints.

The numerical results show that the proposed algorithm is more effective than the existing algorithms. In this paper, we demonstrate that this type of resampling strategy can reduce both the risk of removing useful data from the majority class and the overfitting risk of oversampling, enabling the constructed classifiers (including both single classifiers and classifier ensembles) to outperform classifiers developed using other resampling strategies.

The rest of the paper is organized as follows. A brief explanation of resampling techniques to be used in the analysis of the data sets is given in Sect. 2. The experimental data and the criteria used for comparing the classification performance are described in Sect. 3. Experimental results are presented and discussed in Sect. 4. Lastly, conclusions and recommendations for further research work are outlined in Sect. 5.

2 Resampling ensemble algorithm for classification of imbalanced data

At the data level, the most popular strategies apply different forms of resampling to change the class distribution of the data. This can be done either by oversampling the minority class or by undersampling the majority class until both classes are approximately equally represented. Both of these data-level solutions have certain drawbacks because they artificially change the original class distribution. Namely, undersampling may discard potentially useful information from the majority class, while oversampling may increase the computational burden of some learning algorithms and introduce noise that degrades performance. Hence, this study focuses on the careful use of resampling strategies to address this problem.

2.1 Gaussian mixtures model

In order to make the samples generated by a sampling algorithm more consistent with the true data distribution, the proposed sampling algorithm is based on the Gaussian mixture model (GMM) probability distribution.

The Gaussian mixture model is a linear combination of multiple Gaussian density functions; it can be considered as a mixture of L Gaussian distributions combined in certain proportions. Each Gaussian component is determined by its mean μ_l and covariance σ_l:

$$p(x) = \sum\limits_{l = 1}^{L} {p(l)\,p(x\,|\,l)} = \sum\limits_{l = 1}^{L} {\pi_{l}\, N(x\,|\,\mu_{l} ,\sigma_{l})} ,\quad \sum\limits_{l = 1}^{L} {\pi_{l}} = 1.$$
(1)

The GMM is a semiparametric density model that can approximate an arbitrary data distribution and thus provides a realistic representation of the underlying data; under the assumption that each of the two classes follows a mixture of Gaussians, the model parameters are obtained by fitting a GMM to the samples of each class. The common method for estimating the GMM parameters is the Expectation-Maximization (EM) algorithm [16].
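For illustration, the following minimal sketch fits a GMM to a toy majority-class sample with the EM algorithm; the use of scikit-learn's GaussianMixture and the toy data are assumptions made here for exposition and are not part of the original implementation.

```python
# Minimal sketch: fit a GMM by EM to toy majority-class data (assumed tooling).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Toy majority-class data drawn from two Gaussian blobs.
X_major = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
                     rng.normal(4.0, 1.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X_major)                    # parameters estimated by the EM algorithm

print(gmm.weights_)                 # mixing coefficients pi_l
print(gmm.means_)                   # component means mu_l
resp = gmm.predict_proba(X_major)   # posterior probability of each component per sample
```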

2.2 Silhouette coefficient

The Silhouette coefficient is a measure of cluster validity, i.e., a measure for evaluating the quality of a clustering; it was originally proposed by Rousseeuw [56]. The Silhouette coefficient combines two factors, cohesion and separation, and can therefore be used to compare different algorithms on the same data or the effect of different parameter settings on the clustering results. The Silhouette coefficient is defined by:

$$Sil(i) = \frac{b(i) - a(i)}{\max (b(i),a(i))}$$
(2)

where a(i) is the average dissimilarity between object i and all other objects of the cluster to which i belongs, and b(i) is the lowest average dissimilarity between i and the points of any other cluster to which i does not belong. The cluster with this lowest average dissimilarity is called the “neighbouring cluster” of i because it is the next best fit for point i. The Silhouette coefficient lies in the range [−1, +1], where higher values indicate that the object matches its own cluster well and the neighbouring cluster poorly. If most objects have a high Silhouette coefficient, the clustering configuration is appropriate; if many points have a low or negative Silhouette coefficient, the clustering configuration may have too many or too few clusters. The Silhouette coefficient can be calculated with any distance metric, such as the Euclidean or Manhattan distance [56].
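As an illustration of how the Silhouette coefficient can be used in this context, the sketch below selects the number of GMM components by maximizing the mean Silhouette value; this selection rule and the use of scikit-learn's silhouette_score are assumptions, since the paper does not specify how L is chosen.

```python
# Hedged sketch: choose the number of GMM components via the Silhouette coefficient.
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def best_n_components(X, candidates=range(2, 8)):
    """Return the candidate component count with the highest mean Silhouette value."""
    scores = {}
    for k in candidates:
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # mean Silhouette over all samples
    return max(scores, key=scores.get), scores

# e.g. L_best, all_scores = best_n_components(X_major)
```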

2.3 SMOTE algorithm

A common practice in the classification of imbalanced data is to oversample the minority class. The synthetic minority oversampling technique (SMOTE) is one of the most commonly used approaches to address the data imbalance problem. SMOTE is an oversampling approach that creates synthetic training examples by interpolating between minority-class samples.

The basic assumption of SMOTE is that a plausible positive sample exists between two real positive samples that are close to each other. Therefore, the SMOTE algorithm artificially creates a new positive sample between two real positive samples that are near each other. Suppose the number of positive samples after oversampling is (m + 1) times the original number of positive samples. For each positive sample \(x_{i}^{Pos}\;(i = 1,2, \ldots ,S_{Pos})\), the SMOTE algorithm finds its m nearest positive samples, \(x_{ik}^{Pos}\;(k = 1,2, \ldots ,m)\). Then, m new positive samples are artificially created around the original positive sample \(x_{i}^{Pos}\) according to (3). Finally, the total number of artificially created positive samples is \(m \times S_{Pos}\).

$$x_{ik}^{Pos\text{-}new} = x_{i}^{Pos} + rand(0,1) \times (x_{ik}^{Pos} - x_{i}^{Pos} ),\quad i = 1,2, \ldots ,S_{Pos},\; k = 1,2, \ldots ,m.$$
(3)

In (3), rand(0,1) is a function that returns a random value between zero and one. Both the newly created positive samples and the original positive samples are used for training. The number of artificially created positive samples varies with m, which leads to different degrees of balance between the positive and negative classes in the final training data set. Since its successful application to handwritten character recognition problems, SMOTE has received considerable interest in the pattern recognition field, and in recent years many researchers have improved it. This paper adopts the Borderline-SMOTE algorithm [26].
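The interpolation step in Eq. (3) can be sketched as follows; this is a hypothetical helper for basic SMOTE written for illustration, not the authors' implementation, which uses the Borderline-SMOTE variant [26].

```python
# Illustrative sketch of the SMOTE interpolation in Eq. (3).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_pos, m, random_state=0):
    """Create m synthetic samples around each positive sample (Eq. (3))."""
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X_pos)
    _, idx = nn.kneighbors(X_pos)              # idx[:, 0] is the sample itself
    synthetic = []
    for i, x_i in enumerate(X_pos):
        for k in idx[i, 1:]:                   # the m nearest positive neighbours
            gap = rng.uniform(0, 1)            # rand(0,1)
            synthetic.append(x_i + gap * (X_pos[k] - x_i))
    return np.vstack(synthetic)                # m * S_pos new positive samples
```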

2.4 Gaussian mixture model based combined resampling algorithm

In this work, we solve the problem of an imbalanced dataset by using the resampling ensemble method.

In the proposed framework, undersampling is applied to the majority class, and SMOTE oversampling is applied to the minority class. Moreover, several different machine learning methods are employed to construct the ensemble. Both undersampling and oversampling reduce the imbalance of the data set; in addition, undersampling of the majority class can enhance the diversity of the base learners, which is a crucial factor affecting the classification performance [63]. At the same time, we try to avoid losing too much information while keeping the classes balanced, so the determination of the final size of the processed classes is of great importance; the detailed empirical analysis is given in Sect. 4. The framework of the proposed resampling ensemble algorithm is shown in Fig. 1.

Fig. 1 The flowchart of the Gaussian mixture model based combined resampling algorithm

The flowchart of the Gaussian mixture model based combined resampling algorithm (GSRA) is presented in Fig. 1. The procedure is as follows. Given a two-class imbalanced dataset D composed of a majority class and a minority class containing M and N data samples, respectively, the first step of the GSRA is to divide the imbalanced dataset into training and testing sets based on k-fold cross-validation. The second step is to divide the training set into a majority-class subset and a minority-class subset. Next, the GMM-based undersampling method is employed to reduce the number of samples in the majority class, while the minority class is oversampled with the SMOTE algorithm. The reduced majority-class subset is then combined with the enlarged minority-class subset, resulting in a balanced training set. Finally, the classifier is trained on the balanced training set and evaluated on the testing set. A sketch of these resampling steps is given below.
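The following sketch illustrates the resampling steps under stated assumptions: the majority class is undersampled by keeping, within each fitted GMM component, the samples with the highest posterior probability, and the minority class is then oversampled with Borderline-SMOTE from the imbalanced-learn library. The exact pruning rule, the keep_ratio parameter, and the library choice are assumptions made for illustration, as the paper does not publish its code.

```python
# Hedged end-to-end sketch of the GSRA idea (GMM-based undersampling + Borderline-SMOTE).
import numpy as np
from sklearn.mixture import GaussianMixture
from imblearn.over_sampling import BorderlineSMOTE   # assumed tooling

def gmm_undersample(X_major, keep_ratio=0.5, n_components=3, random_state=0):
    """Keep the most representative majority samples per GMM component (illustrative rule)."""
    gmm = GaussianMixture(n_components=n_components, random_state=random_state)
    labels = gmm.fit_predict(X_major)
    post = gmm.predict_proba(X_major)
    kept = []
    for c in range(n_components):
        members = np.where(labels == c)[0]
        n_keep = max(1, int(keep_ratio * len(members)))
        order = members[np.argsort(post[members, c])[::-1]]   # highest posterior first
        kept.extend(order[:n_keep])
    return X_major[np.array(kept)]

def gsra_resample(X_train, y_train, keep_ratio=0.5, random_state=0):
    """Assumes labels: 0 = majority class, 1 = minority class."""
    X_maj, X_min = X_train[y_train == 0], X_train[y_train == 1]
    X_maj_red = gmm_undersample(X_maj, keep_ratio, random_state=random_state)
    X_bal = np.vstack([X_maj_red, X_min])
    y_bal = np.hstack([np.zeros(len(X_maj_red)), np.ones(len(X_min))])
    # Oversample the minority class up to the size of the reduced majority class.
    return BorderlineSMOTE(random_state=random_state).fit_resample(X_bal, y_bal)
```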


The behaviour of the GSRA can be better explained using the distributions presented in Fig. 2. The original data distribution D is shown in Fig. 2a, where triangles denote the minority class and circles denote the majority class. The data distribution after SMOTE sampling is presented in Fig. 2b, where rectangles denote the newly synthesized minority samples; as can be seen, SMOTE, which interpolates between connected samples, does not consider the distribution of the majority class and generates a lot of noisy data. Figure 2c shows the result of GMM modelling and decomposition of the majority class, in which some of the redundant majority-class samples are deleted. Finally, the result of GSRA sampling, shown in Fig. 2d, combines the advantages of the two sampling schemes.

Fig. 2 Data distribution contrast diagram after sampling (a two-dimensional data set)

3 Data set and evaluation metrics

3.1 Data sets

In this work, three experimental data sets were used (Table 1). The first two are small-scale data sets, the Australian and German credit data sets, which are publicly available at the UCI repository; their imbalance ratios are 1.24 and 2.3, respectively, and they contain 690 and 1000 samples. Both are two-class classification data sets. In addition, we adjusted the class proportions of these two data sets by reducing the number of samples and generated several new data sets, as shown in Table 2. We also divided the German data set into several versions with different noise ratios, as shown in Table 3; these are described further in Sect. 4.

Table 1 Data set information
Table 2 Different proportions of data
Table 3 Data set information (German data with different noise ratios)

The third data set consists of real data from a financial company, a consumer financial service provider in China whose main business is car loans for individuals; it receives about three thousand customer applications per month on average, and its default rate is about 1.6%. The enterprise data were obtained from this institution for the period from July 2015 to January 2016 and are summarized in Table 4. In this data set, a bad customer was defined as someone who had missed three consecutive monthly payments.

Table 4 Enterprise data set information

For classifier training and testing, each data set was split so that 80% of the data were used for training and the remaining 20% for testing, following a fivefold cross-validation approach.
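A minimal sketch of this splitting scheme is shown below; the use of stratified folds and scikit-learn's StratifiedKFold, as well as the toy data, are assumptions, since the paper only states that fivefold cross-validation was used.

```python
# Minimal sketch of fivefold cross-validation with 80/20 splits on toy data.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10))                  # toy feature matrix
y = (rng.uniform(size=1000) < 0.2).astype(int)   # imbalanced toy labels (1 = minority)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # resample (X_train, y_train) with GSRA, then train and evaluate the classifier
```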

3.2 Evaluation metrics

We used six metrics to evaluate our model: accuracy, F1-measure, precision, recall, G-mean, and AUC (area under the ROC curve), all of which are commonly used in classification problems in data mining. In this work, credit risk prediction is treated as a binary imbalanced classification problem. For such problems, accuracy (or error rate) alone is not a sufficient evaluation criterion, whereas the F-measure and G-mean are two commonly used measures for evaluating the performance of imbalanced data classification.

After all the testing instances are classified, the confusion matrix of the classification can be obtained. In the confusion matrix, TP denotes the number of true positives, i.e., the samples belonging to the positive class that are correctly classified as positive; similarly, TN, FP, and FN denote the numbers of true negatives, false positives, and false negatives, respectively. The confusion matrix is shown in Table 5. In the following, the minority class is taken as the positive class. The accuracy, recall, precision, F-measure, and G-mean are defined by (4)–(8), respectively.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
(4)
$$Recall = \frac{TP}{TP + FN}$$
(5)
$$Precision = \frac{TP}{TP + FP}$$
(6)
$$F - measure = \frac{2TP}{2TP + FN + FP}$$
(7)
$$G - mean = \sqrt {\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} .$$
(8)
Table 5 Imbalanced confusion matrix of bi-classification problems

Namely, accuracy is a common performance index characterizing the overall correctness of the classification; recall is the proportion of positive samples that are correctly identified by the classifier out of all positive samples; precision is the proportion of truly positive samples among the samples the classifier labels as positive; F-measure is the weighted harmonic mean of recall and precision; and G-mean is another important indicator for imbalanced data sets, balancing sensitivity and specificity, i.e., the accuracies on the positive and negative classes, respectively.

The other evaluation metric we used is the AUC, as applied to credit scoring by Baesens et al. [4]. The AUC is the area under the receiver operating characteristic (ROC) curve, a two-dimensional graphical illustration of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity). The ROC curve illustrates the behaviour of a classifier irrespective of the class distribution or misclassification costs; to compare the ROC curves of different classifiers, the area under the curve is computed [4].
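The six metrics can be computed from the confusion matrix and the predicted scores as sketched below; the helper name and the assumption that the minority (positive) class is labelled 1 are illustrative choices.

```python
# Hedged sketch of the evaluation metrics in Eqs. (4)-(8) plus AUC.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """y_true/y_pred: 0/1 labels (1 = minority); y_score: predicted probability of class 1."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1]).astype(float)
    tn, fp, fn, tp = cm.ravel()
    accuracy  = (tp + tn) / (tp + tn + fp + fn)               # Eq. (4)
    recall    = tp / (tp + fn)                                 # Eq. (5)
    precision = tp / (tp + fp)                                 # Eq. (6)
    f_measure = 2 * tp / (2 * tp + fn + fp)                    # Eq. (7)
    g_mean    = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))   # Eq. (8)
    auc       = roc_auc_score(y_true, y_score)                 # area under the ROC curve
    return dict(accuracy=accuracy, recall=recall, precision=precision,
                f1=f_measure, g_mean=g_mean, auc=auc)
```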

As already mentioned, these six indicators are frequently used for evaluating classification performance in the machine learning domain, especially the latter three.

4 Results and discussion

In this section, the experimental results are presented from three aspects: sample distribution, classification performance, and the influence of parameters.

4.1 Sampling distribution contrast of resampling algorithm

The working principle of GSRA sampling and its difference from SMOTE sampling are presented in Fig. 3 (the high-dimensional data are visualized by t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction).

Fig. 3 Data distribution after sampling (the German data set)

The original data distribution used in the experiments is shown in Fig. 3a, where black circles denote the minority class and white circles denote the majority class.

The data distribution after SMOTE is presented in Fig. 3b, where it can be seen that the SMOTE algorithm does not consider the distribution of the majority class. The result of GMM modelling and decomposition of the majority class is presented in Fig. 3c; some of the majority-class samples that overlap with the minority class were deleted, so the boundary between the two classes becomes more obvious, and this effect can be strengthened further by adjusting the parameters. The result of combining the steps of Fig. 3b, c is presented in Fig. 3d. A sketch of the visualization procedure is given below.
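The sketch below illustrates the kind of t-SNE projection used for Fig. 3; the plotting details (marker shapes, helper name) are assumptions, as the paper only names the dimensionality-reduction technique.

```python
# Illustrative sketch of a t-SNE projection of the (re)sampled data.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, y, title):
    """X: feature matrix; y: NumPy array of 0/1 labels (0 = majority, 1 = minority)."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], marker='o', label='majority')
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], marker='^', label='minority')
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_tsne(X_original, y_original, 'before resampling')
#      plot_tsne(X_resampled, y_resampled, 'after GSRA resampling')
```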

4.2 Performance comparison of different resampling algorithms

The approach proposed in this work was validated by comparison with the following commonly used state-of-the-art approaches: ClusterCentroids, NearMiss, NeighbourhoodCleaningRule (NCR), OneSidedSelection (OSS), TomekLinks, ADASYN, SMOTE, SMOTE + RUS, and SMOTE + Tomek. These approaches are typical representatives of undersampling, oversampling, and combined sampling. In addition, because Gaussian mixture modelling is essentially a clustering algorithm, the K-means and Affinity Propagation clustering algorithms were also compared as undersampling methods.
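For reference, the baseline resamplers could be instantiated with the imbalanced-learn library as sketched below; the library choice and the default parameters are assumptions, since the paper lists only the method names.

```python
# Hedged sketch: baseline resamplers instantiated via imbalanced-learn (assumed tooling).
from imblearn.under_sampling import (ClusterCentroids, NearMiss,
                                     NeighbourhoodCleaningRule,
                                     OneSidedSelection, TomekLinks)
from imblearn.over_sampling import ADASYN, SMOTE
from imblearn.combine import SMOTETomek

baselines = {
    'ClusterCentroids': ClusterCentroids(random_state=0),
    'NearMiss': NearMiss(),
    'NCR': NeighbourhoodCleaningRule(),
    'OSS': OneSidedSelection(random_state=0),
    'TomekLinks': TomekLinks(),
    'ADASYN': ADASYN(random_state=0),
    'SMOTE': SMOTE(random_state=0),
    'SMOTE+Tomek': SMOTETomek(random_state=0),
}
# for name, sampler in baselines.items():
#     X_res, y_res = sampler.fit_resample(X_train, y_train)
```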

All of the algorithms were implemented in Python 2.7. The computer used in the experiments had an Intel Core i5 2.47 GHz CPU and 4 GB of memory, and the operating system was Windows 7.

In the first experiment, the logistic regression (LR) classifier and the decision tree (DT) classifier were selected as the base classifiers because they are often used as baseline classifiers in related studies.

As can be seen in Tables 6 and 7, the GSRA performed well on the majority of data sets. With the LR classifier, the GSRA achieved the best results on 11 of the 12 data sets, and with the DT classifier it achieved the best results on 8 of the 12 data sets.

Table 6 LR classifier results
Table 7 DT classifier results

A detailed analysis of Tables 6 and 7 shows that the combined sampling methods generally outperform single sampling methods. Moreover, the performance of the GSRA is relatively stable, and the GSRA achieves better classification performance than the other tested algorithms.

4.3 The effect of sampling coefficient R on classification performance

The sampling coefficient R of the GSRA determines how the resampling is balanced between undersampling and oversampling and therefore affects the classification accuracy. To observe the influence of the sampling coefficient R on the classification performance, experiments were performed on the Australian and German data sets.

As Tables 8 and 9 show, the sampling coefficient R directly affects the classification performance. When R is close to 0, the sampling tends towards pure undersampling, which may cause the loss of important data; when R is close to 1, the sampling tends towards pure oversampling, so a large number of new samples are synthesized and overfitting occurs. Figure 4, which focuses on three main indicators (F1-measure, G-mean, and AUC), shows that the performance peaks for the Australian and German data sets lie at intermediate values of R between 0 and 1. Consequently, a mixed sampling mode can avoid the shortcomings of a single sampling mode.

Table 8 The results of the experiment with German data set under different sampling coefficient R
Table 9 The results of the experiment with Australian data set under different sampling coefficient R
Fig. 4 The value of AUC (a), G-mean (b), and F1-measure (c) under different sampling coefficients R

As shown in Fig. 4, as the value of R increases, undersampling first removes information that is not conducive to classification, so the performance of the classifier gradually increases until the peak value is reached. Then, with a further increase of the oversampling ratio, overfitting causes a decrease in classification performance, which confirms the above conclusion. A sketch of such a sweep over R is given below.
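A sweep over R of the kind reported in Tables 8 and 9 can be sketched as follows, reusing the gsra_resample and imbalance_metrics helpers sketched earlier; the rule mapping R to the common target class size (N_min + R·(N_maj − N_min)) is an illustrative assumption, not the paper's stated formula.

```python
# Hedged sketch: sweep the sampling coefficient R and record the evaluation metrics.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sweep_R(X_train, y_train, X_test, y_test, r_values):
    results = {}
    n_maj, n_min = np.sum(y_train == 0), np.sum(y_train == 1)
    for r in r_values:
        # Illustrative rule: R = 0 -> pure undersampling, R = 1 -> pure oversampling.
        keep_ratio = (n_min + r * (n_maj - n_min)) / float(n_maj)
        X_res, y_res = gsra_resample(X_train, y_train, keep_ratio=keep_ratio)
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        results[r] = imbalance_metrics(y_test, clf.predict(X_test),
                                       clf.predict_proba(X_test)[:, 1])
    return results

# e.g. sweep_R(X_tr, y_tr, X_te, y_te, r_values=np.linspace(0.1, 0.9, 9))
```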

4.4 Algorithm robustness to noise data

Inevitably, any data set contains some noisy data, i.e., samples with incorrect values such as erroneous class labels. In this work, we used samples with erroneous class labels as an example to study the robustness of the proposed algorithm. In order to systematically verify the robustness of the algorithm to noisy data, we manually added noisy data in the experiment: we selected the German data set and observed the robustness of the algorithms under given noise ratios, as sketched below.
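The label-noise injection can be sketched as follows; flipping the labels of randomly chosen training samples is an illustrative mechanism, as the paper does not detail how the noise was added.

```python
# Illustrative sketch of injecting class-label noise at a given ratio.
import numpy as np

def add_label_noise(y, noise_ratio, random_state=0):
    """Flip the 0/1 labels of a randomly chosen fraction of samples."""
    rng = np.random.RandomState(random_state)
    y_noisy = y.copy()
    n_flip = int(noise_ratio * len(y))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]   # flip 0 <-> 1 labels
    return y_noisy

# e.g. y_train_noisy = add_label_noise(y_train, noise_ratio=0.05)
```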

According to the results presented in Table 10, the GSRA is more robust to noisy data than the other algorithms, especially at high noise ratios. This is because the GSRA considers the distribution of the majority class and deletes data according to the degree of data aggregation, thereby reducing the influence of noisy data on sampling and classification learning.

Table 10 Algorithm robustness to noise data

5 Conclusions and recommendations for further work

Focusing on the classification of imbalanced credit data sets, this paper proposes a Gaussian mixture model based combined resampling algorithm. In the proposed algorithm, oversampling is applied to the minority class and undersampling to the majority class, and the sampling coefficient is determined according to the ratio of the number of minority-class samples to the number of majority-class samples. We compared the proposed algorithm with other commonly used resampling methods and studied their performance on various credit data sets. The classification ability of all tested algorithms was assessed using the following metrics: accuracy, F1-measure, precision, recall, G-mean, and AUC (area under the ROC curve). The numerical results show that the GSRA performs excellently on most credit data sets and is robust to noisy data. It was also found that the classification result changes with the adjustment of the sampling coefficient R; thus, the selection of an appropriate sampling coefficient is very important.

In our future work, we will apply the GSRA to more data sets, and we also intend to study the time efficiency of the GSRA in order to improve its time performance. In addition, the choice of the sampling ratio and the extension to multi-class problems are also worth studying carefully.